Predicting Columns in a Table - Deployment Optimization
This tutorial covers how to perform the end-to-end AutoML process to create an optimized and deployable AutoGluon artifact for production usage.
This tutorial assumes you have already read Predicting Columns in a Table - Quick Start and Predicting Columns in a Table - In Depth.
Fitting a TabularPredictor
As in the previous tutorials, we again use the AdultIncome dataset and train a predictor to predict whether an individual's income exceeds $50,000, which is recorded in the class column of this table.
from autogluon.tabular import TabularDataset, TabularPredictor
train_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/train.csv')
label = 'class'
subsample_size = 500 # subsample subset of data for faster demo, try setting this to much larger values
train_data = train_data.sample(n=subsample_size, random_state=0)
train_data.head()
 | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | class
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
6118 | 51 | Private | 39264 | Some-college | 10 | Married-civ-spouse | Exec-managerial | Wife | White | Female | 0 | 0 | 40 | United-States | >50K
23204 | 58 | Private | 51662 | 10th | 6 | Married-civ-spouse | Other-service | Wife | White | Female | 0 | 0 | 8 | United-States | <=50K
29590 | 40 | Private | 326310 | Some-college | 10 | Married-civ-spouse | Craft-repair | Husband | White | Male | 0 | 0 | 44 | United-States | <=50K
18116 | 37 | Private | 222450 | HS-grad | 9 | Never-married | Sales | Not-in-family | White | Male | 0 | 2339 | 40 | El-Salvador | <=50K
33964 | 62 | Private | 109190 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 15024 | 0 | 40 | United-States | >50K
save_path = 'agModels-predictClass-deployment' # specifies folder to store trained models
predictor = TabularPredictor(label=label, path=save_path).fit(train_data)
Verbosity: 2 (Standard Logging)
=================== System Info ===================
AutoGluon Version: 1.3.1b20250508
Python Version: 3.11.9
Operating System: Linux
Platform Machine: x86_64
Platform Version: #1 SMP Wed Mar 12 14:53:59 UTC 2025
CPU Count: 8
Memory Avail: 28.77 GB / 30.95 GB (93.0%)
Disk Space Avail: 211.93 GB / 255.99 GB (82.8%)
===================================================
No presets specified! To achieve strong results with AutoGluon, it is recommended to use the available presets. Defaulting to `'medium'`...
Recommended Presets (For more details refer to https://autogluon.cn/stable/tutorials/tabular/tabular-essentials.html#presets):
presets='experimental' : New in v1.2: Pre-trained foundation model + parallel fits. The absolute best accuracy without consideration for inference speed. Does not support GPU.
presets='best' : Maximize accuracy. Recommended for most users. Use in competitions and benchmarks.
presets='high' : Strong accuracy with fast inference speed.
presets='good' : Good accuracy with very fast inference speed.
presets='medium' : Fast training time, ideal for initial prototyping.
Beginning AutoGluon training ...
AutoGluon will save models to "/home/ci/autogluon/docs/tutorials/tabular/advanced/agModels-predictClass-deployment"
Train Data Rows: 500
Train Data Columns: 14
Label Column: class
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
2 unique label values: [' >50K', ' <=50K']
If 'binary' is not the correct problem_type, please manually specify the problem_type parameter during Predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression', 'quantile'])
Problem Type: binary
Preprocessing data ...
Selected class <--> label mapping: class 1 = >50K, class 0 = <=50K
Note: For your binary classification, AutoGluon arbitrarily selected which label-value represents positive ( >50K) vs negative ( <=50K) class.
To explicitly set the positive_class, either rename classes to 1 and 0, or specify positive_class in Predictor init.
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
Available Memory: 29460.86 MB
Train Data (Original) Memory Usage: 0.28 MB (0.0% of available memory)
Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
Stage 1 Generators:
Fitting AsTypeFeatureGenerator...
Note: Converting 1 features to boolean dtype as they only contain 2 unique values.
Stage 2 Generators:
Fitting FillNaFeatureGenerator...
Stage 3 Generators:
Fitting IdentityFeatureGenerator...
Fitting CategoryFeatureGenerator...
Fitting CategoryMemoryMinimizeFeatureGenerator...
Stage 4 Generators:
Fitting DropUniqueFeatureGenerator...
Stage 5 Generators:
Fitting DropDuplicatesFeatureGenerator...
Types of features in original data (raw dtype, special dtypes):
('int', []) : 6 | ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', ...]
('object', []) : 8 | ['workclass', 'education', 'marital-status', 'occupation', 'relationship', ...]
Types of features in processed data (raw dtype, special dtypes):
('category', []) : 7 | ['workclass', 'education', 'marital-status', 'occupation', 'relationship', ...]
('int', []) : 6 | ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', ...]
('int', ['bool']) : 1 | ['sex']
0.1s = Fit runtime
14 features in original data used to generate 14 features in processed data.
Train Data (Processed) Memory Usage: 0.03 MB (0.0% of available memory)
Data preprocessing and feature engineering runtime = 0.08s ...
AutoGluon will gauge predictive performance using evaluation metric: 'accuracy'
To change this, specify the eval_metric parameter of Predictor()
Automatically generating train/validation split with holdout_frac=0.2, Train Rows: 400, Val Rows: 100
User-specified model hyperparameters to be fit:
{
'NN_TORCH': [{}],
'GBM': [{'extra_trees': True, 'ag_args': {'name_suffix': 'XT'}}, {}, {'learning_rate': 0.03, 'num_leaves': 128, 'feature_fraction': 0.9, 'min_data_in_leaf': 3, 'ag_args': {'name_suffix': 'Large', 'priority': 0, 'hyperparameter_tune_kwargs': None}}],
'CAT': [{}],
'XGB': [{}],
'FASTAI': [{}],
'RF': [{'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression', 'quantile']}}],
'XT': [{'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression', 'quantile']}}],
'KNN': [{'weights': 'uniform', 'ag_args': {'name_suffix': 'Unif'}}, {'weights': 'distance', 'ag_args': {'name_suffix': 'Dist'}}],
}
Fitting 13 L1 models, fit_strategy="sequential" ...
Fitting model: KNeighborsUnif ...
0.73 = Validation score (accuracy)
0.04s = Training runtime
0.02s = Validation runtime
Fitting model: KNeighborsDist ...
0.65 = Validation score (accuracy)
0.01s = Training runtime
0.01s = Validation runtime
Fitting model: LightGBMXT ...
0.83 = Validation score (accuracy)
0.25s = Training runtime
0.0s = Validation runtime
Fitting model: LightGBM ...
0.85 = Validation score (accuracy)
0.23s = Training runtime
0.0s = Validation runtime
Fitting model: RandomForestGini ...
0.84 = Validation score (accuracy)
0.61s = Training runtime
0.06s = Validation runtime
Fitting model: RandomForestEntr ...
0.83 = Validation score (accuracy)
0.52s = Training runtime
0.05s = Validation runtime
Fitting model: CatBoost ...
0.85 = Validation score (accuracy)
0.82s = Training runtime
0.0s = Validation runtime
Fitting model: ExtraTreesGini ...
0.82 = Validation score (accuracy)
0.54s = Training runtime
0.06s = Validation runtime
Fitting model: ExtraTreesEntr ...
0.81 = Validation score (accuracy)
0.56s = Training runtime
0.06s = Validation runtime
Fitting model: NeuralNetFastAI ...
0.84 = Validation score (accuracy)
2.76s = Training runtime
0.01s = Validation runtime
Fitting model: XGBoost ...
0.85 = Validation score (accuracy)
0.37s = Training runtime
0.01s = Validation runtime
Fitting model: NeuralNetTorch ...
0.83 = Validation score (accuracy)
2.31s = Training runtime
0.01s = Validation runtime
Fitting model: LightGBMLarge ...
0.83 = Validation score (accuracy)
0.49s = Training runtime
0.0s = Validation runtime
Fitting model: WeightedEnsemble_L2 ...
Ensemble Weights: {'LightGBM': 1.0}
0.85 = Validation score (accuracy)
0.09s = Training runtime
0.0s = Validation runtime
AutoGluon training complete, total runtime = 10.13s ... Best model: WeightedEnsemble_L2 | Estimated inference throughput: 19380.4 rows/s (100 batch size)
Disabling decision threshold calibration for metric `accuracy` due to having fewer than 10000 rows of validation data for calibration, to avoid overfitting (100 rows).
`accuracy` is generally not improved through threshold calibration. Force calibration via specifying `calibrate_decision_threshold=True`.
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("/home/ci/autogluon/docs/tutorials/tabular/advanced/agModels-predictClass-deployment")
Next, load separate test data to demonstrate how to make predictions on new examples at inference time:
test_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/test.csv')
y_test = test_data[label] # values to predict
test_data.head()
Loaded data from: https://autogluon.s3.amazonaws.com/datasets/Inc/test.csv | Columns = 15 / 15 | Rows = 9769 -> 9769
 | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | class
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 31 | Private | 169085 | 11th | 7 | Married-civ-spouse | Sales | Wife | White | Female | 0 | 0 | 20 | United-States | <=50K
1 | 17 | Self-emp-not-inc | 226203 | 12th | 8 | Never-married | Sales | Own-child | White | Male | 0 | 0 | 45 | United-States | <=50K
2 | 47 | Private | 54260 | Assoc-voc | 11 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 1887 | 60 | United-States | >50K
3 | 21 | Private | 176262 | Some-college | 10 | Never-married | Exec-managerial | Own-child | White | Female | 0 | 0 | 30 | United-States | <=50K
4 | 17 | Private | 241185 | 12th | 8 | Never-married | Prof-specialty | Own-child | White | Male | 0 | 0 | 20 | United-States | <=50K
We use our trained models to make predictions on the new data:
predictor = TabularPredictor.load(save_path) # unnecessary, just demonstrates how to load previously-trained predictor from file
y_pred = predictor.predict(test_data)
y_pred
0 <=50K
1 <=50K
2 >50K
3 <=50K
4 <=50K
...
9764 <=50K
9765 <=50K
9766 <=50K
9767 <=50K
9768 <=50K
Name: class, Length: 9769, dtype: object
We can use the leaderboard to evaluate the performance of each individual trained model on our labeled test data:
predictor.leaderboard(test_data)
 | model | score_test | score_val | eval_metric | pred_time_test | pred_time_val | fit_time | pred_time_test_marginal | pred_time_val_marginal | fit_time_marginal | stack_level | can_infer | fit_order
---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | RandomForestGini | 0.842870 | 0.84 | accuracy | 0.113165 | 0.056597 | 0.611028 | 0.113165 | 0.056597 | 0.611028 | 1 | True | 5
1 | CatBoost | 0.842461 | 0.85 | accuracy | 0.008556 | 0.003967 | 0.822158 | 0.008556 | 0.003967 | 0.822158 | 1 | True | 7
2 | RandomForestEntr | 0.841130 | 0.83 | accuracy | 0.107514 | 0.046581 | 0.524085 | 0.107514 | 0.046581 | 0.524085 | 1 | True | 6
3 | XGBoost | 0.839902 | 0.85 | accuracy | 0.057453 | 0.005959 | 0.373979 | 0.057453 | 0.005959 | 0.373979 | 1 | True | 11
4 | LightGBM | 0.839799 | 0.85 | accuracy | 0.019854 | 0.004387 | 0.234844 | 0.019854 | 0.004387 | 0.234844 | 1 | True | 4
5 | WeightedEnsemble_L2 | 0.839799 | 0.85 | accuracy | 0.021583 | 0.005160 | 0.321193 | 0.001729 | 0.000773 | 0.086348 | 2 | True | 14
6 | LightGBMXT | 0.836421 | 0.83 | accuracy | 0.010530 | 0.003297 | 0.253594 | 0.010530 | 0.003297 | 0.253594 | 1 | True | 3
7 | ExtraTreesEntr | 0.833862 | 0.81 | accuracy | 0.098933 | 0.056663 | 0.556951 | 0.098933 | 0.056663 | 0.556951 | 1 | True | 9
8 | ExtraTreesGini | 0.833862 | 0.82 | accuracy | 0.106322 | 0.058232 | 0.542591 | 0.106322 | 0.058232 | 0.542591 | 1 | True | 8
9 | NeuralNetTorch | 0.833657 | 0.83 | accuracy | 0.046994 | 0.010464 | 2.313713 | 0.046994 | 0.010464 | 2.313713 | 1 | True | 12
10 | NeuralNetFastAI | 0.828949 | 0.84 | accuracy | 0.131369 | 0.010403 | 2.759325 | 0.131369 | 0.010403 | 2.759325 | 1 | True | 10
11 | LightGBMLarge | 0.817074 | 0.83 | accuracy | 0.012141 | 0.003433 | 0.487990 | 0.012141 | 0.003433 | 0.487990 | 1 | True | 13
12 | KNeighborsUnif | 0.725970 | 0.73 | accuracy | 0.036427 | 0.015048 | 0.036135 | 0.036427 | 0.015048 | 0.036135 | 1 | True | 1
13 | KNeighborsDist | 0.695158 | 0.65 | accuracy | 0.036943 | 0.013660 | 0.010330 | 0.036943 | 0.013660 | 0.010330 | 1 | True | 2
Snapshot a Predictor via .clone()
Now that we have a working predictor artifact, we may want to alter it in a variety of ways to better suit our needs. For example, we may want to delete certain models via .delete_models() to reduce disk usage, or train additional models on top of the existing ones via .fit_extra(); a minimal sketch of both calls is shown below.
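Neither call is executed in this tutorial, but here is a minimal sketch of what they look like (the hyperparameters value is purely illustrative, not a recommendation). Both calls mutate the predictor in place, which is exactly why taking a snapshot first is useful:
predictor.delete_models(models_to_keep='best', dry_run=False)  # keep only the models required by the best model, delete the rest from disk
predictor.fit_extra(hyperparameters={'GBM': [{}]})  # fit one additional LightGBM config on top of the existing models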
While you can do all of these operations on your predictor, you may want the ability to revert the predictor to a prior state in case something goes wrong. This is where predictor.clone() comes in.
predictor.clone() lets you create a snapshot of a given predictor, cloning its artifacts to a new location. You can then freely experiment with the predictor, and you can always load the earlier snapshot if you want to undo your actions.
To clone a predictor, simply specify a new directory path to clone to:
save_path_clone = save_path + '-clone'
# will return the path to the cloned predictor, identical to save_path_clone
path_clone = predictor.clone(path=save_path_clone)
Cloned TabularPredictor located in '/home/ci/autogluon/docs/tutorials/tabular/advanced/agModels-predictClass-deployment' to 'agModels-predictClass-deployment-clone'.
To load the cloned predictor: predictor_clone = TabularPredictor.load(path="agModels-predictClass-deployment-clone")
Note that this logic doubles disk usage, as it completely clones every predictor artifact on disk to create an exact replica.
Now we can load the cloned predictor:
predictor_clone = TabularPredictor.load(path=path_clone)
# You can alternatively load the cloned TabularPredictor at the time of cloning:
# predictor_clone = predictor.clone(path=save_path_clone, return_clone=True)
We can see that the cloned predictor has the same leaderboard and functionality as the original:
y_pred_clone = predictor_clone.predict(test_data)
y_pred_clone
0 <=50K
1 <=50K
2 >50K
3 <=50K
4 <=50K
...
9764 <=50K
9765 <=50K
9766 <=50K
9767 <=50K
9768 <=50K
Name: class, Length: 9769, dtype: object
y_pred.equals(y_pred_clone)
True
predictor_clone.leaderboard(test_data)
 | model | score_test | score_val | eval_metric | pred_time_test | pred_time_val | fit_time | pred_time_test_marginal | pred_time_val_marginal | fit_time_marginal | stack_level | can_infer | fit_order
---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | RandomForestGini | 0.842870 | 0.84 | accuracy | 0.116544 | 0.056597 | 0.611028 | 0.116544 | 0.056597 | 0.611028 | 1 | True | 5
1 | CatBoost | 0.842461 | 0.85 | accuracy | 0.007195 | 0.003967 | 0.822158 | 0.007195 | 0.003967 | 0.822158 | 1 | True | 7
2 | RandomForestEntr | 0.841130 | 0.83 | accuracy | 0.111256 | 0.046581 | 0.524085 | 0.111256 | 0.046581 | 0.524085 | 1 | True | 6
3 | XGBoost | 0.839902 | 0.85 | accuracy | 0.057376 | 0.005959 | 0.373979 | 0.057376 | 0.005959 | 0.373979 | 1 | True | 11
4 | LightGBM | 0.839799 | 0.85 | accuracy | 0.015373 | 0.004387 | 0.234844 | 0.015373 | 0.004387 | 0.234844 | 1 | True | 4
5 | WeightedEnsemble_L2 | 0.839799 | 0.85 | accuracy | 0.016761 | 0.005160 | 0.321193 | 0.001388 | 0.000773 | 0.086348 | 2 | True | 14
6 | LightGBMXT | 0.836421 | 0.83 | accuracy | 0.008250 | 0.003297 | 0.253594 | 0.008250 | 0.003297 | 0.253594 | 1 | True | 3
7 | ExtraTreesGini | 0.833862 | 0.82 | accuracy | 0.096716 | 0.058232 | 0.542591 | 0.096716 | 0.058232 | 0.542591 | 1 | True | 8
8 | ExtraTreesEntr | 0.833862 | 0.81 | accuracy | 0.097025 | 0.056663 | 0.556951 | 0.097025 | 0.056663 | 0.556951 | 1 | True | 9
9 | NeuralNetTorch | 0.833657 | 0.83 | accuracy | 0.057921 | 0.010464 | 2.313713 | 0.057921 | 0.010464 | 2.313713 | 1 | True | 12
10 | NeuralNetFastAI | 0.828949 | 0.84 | accuracy | 0.138506 | 0.010403 | 2.759325 | 0.138506 | 0.010403 | 2.759325 | 1 | True | 10
11 | LightGBMLarge | 0.817074 | 0.83 | accuracy | 0.011639 | 0.003433 | 0.487990 | 0.011639 | 0.003433 | 0.487990 | 1 | True | 13
12 | KNeighborsUnif | 0.725970 | 0.73 | accuracy | 0.031415 | 0.015048 | 0.036135 | 0.031415 | 0.015048 | 0.036135 | 1 | True | 1
13 | KNeighborsDist | 0.695158 | 0.65 | accuracy | 0.036131 | 0.013660 | 0.010330 | 0.036131 | 0.013660 | 0.010330 | 1 | True | 2
Now let's do some extra things with the clone, such as calling refit_full:
predictor_clone.refit_full()
predictor_clone.leaderboard(test_data)
Refitting models via `predictor.refit_full` using all of the data (combined train and validation)...
Models trained in this way will have the suffix "_FULL" and have NaN validation score.
This process is not bound by time_limit, but should take less time than the original `predictor.fit` call.
To learn more, refer to the `.refit_full` method docstring which explains how "_FULL" models differ from normal models.
Fitting 1 L1 models, fit_strategy="sequential" ...
Fitting model: KNeighborsUnif_FULL ...
0.01s = Training runtime
Fitting 1 L1 models, fit_strategy="sequential" ...
Fitting model: KNeighborsDist_FULL ...
0.01s = Training runtime
Fitting 1 L1 models, fit_strategy="sequential" ...
Fitting model: LightGBMXT_FULL ...
0.21s = Training runtime
Fitting 1 L1 models, fit_strategy="sequential" ...
Fitting model: LightGBM_FULL ...
0.19s = Training runtime
Fitting 1 L1 models, fit_strategy="sequential" ...
Fitting model: RandomForestGini_FULL ...
0.57s = Training runtime
Fitting 1 L1 models, fit_strategy="sequential" ...
Fitting model: RandomForestEntr_FULL ...
0.53s = Training runtime
Fitting 1 L1 models, fit_strategy="sequential" ...
Fitting model: CatBoost_FULL ...
0.03s = Training runtime
Fitting 1 L1 models, fit_strategy="sequential" ...
Fitting model: ExtraTreesGini_FULL ...
0.52s = Training runtime
Fitting 1 L1 models, fit_strategy="sequential" ...
Fitting model: ExtraTreesEntr_FULL ...
0.52s = Training runtime
Fitting 1 L1 models, fit_strategy="sequential" ...
Fitting model: NeuralNetFastAI_FULL ...
No improvement since epoch 0: early stopping
0.32s = Training runtime
Fitting 1 L1 models, fit_strategy="sequential" ...
Fitting model: XGBoost_FULL ...
0.05s = Training runtime
Fitting 1 L1 models, fit_strategy="sequential" ...
Fitting model: NeuralNetTorch_FULL ...
0.61s = Training runtime
Fitting 1 L1 models, fit_strategy="sequential" ...
Fitting model: LightGBMLarge_FULL ...
0.2s = Training runtime
Fitting model: WeightedEnsemble_L2_FULL | Skipping fit via cloning parent ...
Ensemble Weights: {'LightGBM': 1.0}
0.09s = Training runtime
Updated best model to "WeightedEnsemble_L2_FULL" (Previously "WeightedEnsemble_L2"). AutoGluon will default to using "WeightedEnsemble_L2_FULL" for predict() and predict_proba().
Refit complete, total runtime = 4.15s ... Best model: "WeightedEnsemble_L2_FULL"
 | model | score_test | score_val | eval_metric | pred_time_test | pred_time_val | fit_time | pred_time_test_marginal | pred_time_val_marginal | fit_time_marginal | stack_level | can_infer | fit_order
---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | CatBoost_FULL | 0.842870 | NaN | accuracy | 0.006422 | NaN | 0.027023 | 0.006422 | NaN | 0.027023 | 1 | True | 21
1 | RandomForestGini | 0.842870 | 0.84 | accuracy | 0.117268 | 0.056597 | 0.611028 | 0.117268 | 0.056597 | 0.611028 | 1 | True | 5
2 | CatBoost | 0.842461 | 0.85 | accuracy | 0.007725 | 0.003967 | 0.822158 | 0.007725 | 0.003967 | 0.822158 | 1 | True | 7
3 | RandomForestEntr | 0.841130 | 0.83 | accuracy | 0.111758 | 0.046581 | 0.524085 | 0.111758 | 0.046581 | 0.524085 | 1 | True | 6
4 | LightGBM_FULL | 0.840823 | NaN | accuracy | 0.019103 | NaN | 0.193116 | 0.019103 | NaN | 0.193116 | 1 | True | 18
5 | WeightedEnsemble_L2_FULL | 0.840823 | NaN | accuracy | 0.020436 | NaN | 0.279465 | 0.001333 | NaN | 0.086348 | 2 | True | 28
6 | XGBoost | 0.839902 | 0.85 | accuracy | 0.056961 | 0.005959 | 0.373979 | 0.056961 | 0.005959 | 0.373979 | 1 | True | 11
7 | LightGBM | 0.839799 | 0.85 | accuracy | 0.017601 | 0.004387 | 0.234844 | 0.017601 | 0.004387 | 0.234844 | 1 | True | 4
8 | WeightedEnsemble_L2 | 0.839799 | 0.85 | accuracy | 0.019044 | 0.005160 | 0.321193 | 0.001443 | 0.000773 | 0.086348 | 2 | True | 14
9 | RandomForestGini_FULL | 0.839390 | NaN | accuracy | 0.118220 | NaN | 0.574083 | 0.118220 | NaN | 0.574083 | 1 | True | 19
10 | RandomForestEntr_FULL | 0.839185 | NaN | accuracy | 0.113545 | NaN | 0.527354 | 0.113545 | NaN | 0.527354 | 1 | True | 20
11 | LightGBMXT_FULL | 0.837957 | NaN | accuracy | 0.009283 | NaN | 0.208477 | 0.009283 | NaN | 0.208477 | 1 | True | 17
12 | LightGBMXT | 0.836421 | 0.83 | accuracy | 0.011218 | 0.003297 | 0.253594 | 0.011218 | 0.003297 | 0.253594 | 1 | True | 3
13 | XGBoost_FULL | 0.836319 | NaN | accuracy | 0.059666 | NaN | 0.050707 | 0.059666 | NaN | 0.050707 | 1 | True | 25
14 | ExtraTreesEntr_FULL | 0.835705 | NaN | accuracy | 0.110837 | NaN | 0.521020 | 0.110837 | NaN | 0.521020 | 1 | True | 23
15 | NeuralNetTorch_FULL | 0.835091 | NaN | accuracy | 0.055345 | NaN | 0.610771 | 0.055345 | NaN | 0.610771 | 1 | True | 26
16 | ExtraTreesGini | 0.833862 | 0.82 | accuracy | 0.097680 | 0.058232 | 0.542591 | 0.097680 | 0.058232 | 0.542591 | 1 | True | 8
17 | ExtraTreesEntr | 0.833862 | 0.81 | accuracy | 0.107475 | 0.056663 | 0.556951 | 0.107475 | 0.056663 | 0.556951 | 1 | True | 9
18 | NeuralNetTorch | 0.833657 | 0.83 | accuracy | 0.049717 | 0.010464 | 2.313713 | 0.049717 | 0.010464 | 2.313713 | 1 | True | 12
19 | ExtraTreesGini_FULL | 0.833453 | NaN | accuracy | 0.106834 | NaN | 0.519521 | 0.106834 | NaN | 0.519521 | 1 | True | 22
20 | NeuralNetFastAI | 0.828949 | 0.84 | accuracy | 0.136169 | 0.010403 | 2.759325 | 0.136169 | 0.010403 | 2.759325 | 1 | True | 10
21 | LightGBMLarge | 0.817074 | 0.83 | accuracy | 0.012388 | 0.003433 | 0.487990 | 0.012388 | 0.003433 | 0.487990 | 1 | True | 13
22 | LightGBMLarge_FULL | 0.809704 | NaN | accuracy | 0.011929 | NaN | 0.203436 | 0.011929 | NaN | 0.203436 | 1 | True | 27
23 | NeuralNetFastAI_FULL | 0.768349 | NaN | accuracy | 0.128676 | NaN | 0.323018 | 0.128676 | NaN | 0.323018 | 1 | True | 24
24 | KNeighborsUnif | 0.725970 | 0.73 | accuracy | 0.032110 | 0.015048 | 0.036135 | 0.032110 | 0.015048 | 0.036135 | 1 | True | 1
25 | KNeighborsUnif_FULL | 0.725151 | NaN | accuracy | 0.027841 | NaN | 0.005243 | 0.027841 | NaN | 0.005243 | 1 | True | 15
26 | KNeighborsDist | 0.695158 | 0.65 | accuracy | 0.027766 | 0.013660 | 0.010330 | 0.027766 | 0.013660 | 0.010330 | 1 | True | 2
27 | KNeighborsDist_FULL | 0.685434 | NaN | accuracy | 0.026894 | NaN | 0.005183 | 0.026894 | NaN | 0.005183 | 1 | True | 16
We can see that we were able to fit additional models, but for whatever reason we may want to undo this operation.
Luckily, our original predictor is untouched!
predictor.leaderboard(test_data)
 | model | score_test | score_val | eval_metric | pred_time_test | pred_time_val | fit_time | pred_time_test_marginal | pred_time_val_marginal | fit_time_marginal | stack_level | can_infer | fit_order
---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | RandomForestGini | 0.842870 | 0.84 | accuracy | 0.116893 | 0.056597 | 0.611028 | 0.116893 | 0.056597 | 0.611028 | 1 | True | 5
1 | CatBoost | 0.842461 | 0.85 | accuracy | 0.007223 | 0.003967 | 0.822158 | 0.007223 | 0.003967 | 0.822158 | 1 | True | 7
2 | RandomForestEntr | 0.841130 | 0.83 | accuracy | 0.107640 | 0.046581 | 0.524085 | 0.107640 | 0.046581 | 0.524085 | 1 | True | 6
3 | XGBoost | 0.839902 | 0.85 | accuracy | 0.057026 | 0.005959 | 0.373979 | 0.057026 | 0.005959 | 0.373979 | 1 | True | 11
4 | LightGBM | 0.839799 | 0.85 | accuracy | 0.016095 | 0.004387 | 0.234844 | 0.016095 | 0.004387 | 0.234844 | 1 | True | 4
5 | WeightedEnsemble_L2 | 0.839799 | 0.85 | accuracy | 0.017516 | 0.005160 | 0.321193 | 0.001421 | 0.000773 | 0.086348 | 2 | True | 14
6 | LightGBMXT | 0.836421 | 0.83 | accuracy | 0.008238 | 0.003297 | 0.253594 | 0.008238 | 0.003297 | 0.253594 | 1 | True | 3
7 | ExtraTreesGini | 0.833862 | 0.82 | accuracy | 0.106841 | 0.058232 | 0.542591 | 0.106841 | 0.058232 | 0.542591 | 1 | True | 8
8 | ExtraTreesEntr | 0.833862 | 0.81 | accuracy | 0.107799 | 0.056663 | 0.556951 | 0.107799 | 0.056663 | 0.556951 | 1 | True | 9
9 | NeuralNetTorch | 0.833657 | 0.83 | accuracy | 0.048223 | 0.010464 | 2.313713 | 0.048223 | 0.010464 | 2.313713 | 1 | True | 12
10 | NeuralNetFastAI | 0.828949 | 0.84 | accuracy | 0.140928 | 0.010403 | 2.759325 | 0.140928 | 0.010403 | 2.759325 | 1 | True | 10
11 | LightGBMLarge | 0.817074 | 0.83 | accuracy | 0.011978 | 0.003433 | 0.487990 | 0.011978 | 0.003433 | 0.487990 | 1 | True | 13
12 | KNeighborsUnif | 0.725970 | 0.73 | accuracy | 0.027841 | 0.015048 | 0.036135 | 0.027841 | 0.015048 | 0.036135 | 1 | True | 1
13 | KNeighborsDist | 0.695158 | 0.65 | accuracy | 0.026983 | 0.013660 | 0.010330 | 0.026983 | 0.013660 | 0.010330 | 1 | True | 2
We simply clone a new predictor from our original, and it will be unaffected by the refit_full call made on the prior clone.
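For example, a minimal sketch (the '-clone2' directory name is illustrative):
path_clone2 = predictor.clone(path=save_path + '-clone2')  # fresh snapshot of the untouched original
predictor_clone2 = TabularPredictor.load(path=path_clone2)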
Snapshot a deployment optimized Predictor via .clone_for_deployment()
Instead of cloning an exact copy, we can clone a copy that contains only the minimal set of artifacts needed to do prediction.
Note that this optimized clone has very limited functionality beyond calling predict and predict_proba. For example, it is unable to train additional models.
save_path_clone_opt = save_path + '-clone-opt'
# will return the path to the cloned predictor, identical to save_path_clone_opt
path_clone_opt = predictor.clone_for_deployment(path=save_path_clone_opt)
Cloned TabularPredictor located in '/home/ci/autogluon/docs/tutorials/tabular/advanced/agModels-predictClass-deployment' to 'agModels-predictClass-deployment-clone-opt'.
To load the cloned predictor: predictor_clone = TabularPredictor.load(path="agModels-predictClass-deployment-clone-opt")
Clone: Keeping minimum set of models required to predict with best model 'WeightedEnsemble_L2'...
Deleting model KNeighborsUnif. All files under /home/ci/autogluon/docs/tutorials/tabular/advanced/agModels-predictClass-deployment-clone-opt/models/KNeighborsUnif will be removed.
Deleting model KNeighborsDist. All files under /home/ci/autogluon/docs/tutorials/tabular/advanced/agModels-predictClass-deployment-clone-opt/models/KNeighborsDist will be removed.
Deleting model LightGBMXT. All files under /home/ci/autogluon/docs/tutorials/tabular/advanced/agModels-predictClass-deployment-clone-opt/models/LightGBMXT will be removed.
Deleting model RandomForestGini. All files under /home/ci/autogluon/docs/tutorials/tabular/advanced/agModels-predictClass-deployment-clone-opt/models/RandomForestGini will be removed.
Deleting model RandomForestEntr. All files under /home/ci/autogluon/docs/tutorials/tabular/advanced/agModels-predictClass-deployment-clone-opt/models/RandomForestEntr will be removed.
Deleting model CatBoost. All files under /home/ci/autogluon/docs/tutorials/tabular/advanced/agModels-predictClass-deployment-clone-opt/models/CatBoost will be removed.
Deleting model ExtraTreesGini. All files under /home/ci/autogluon/docs/tutorials/tabular/advanced/agModels-predictClass-deployment-clone-opt/models/ExtraTreesGini will be removed.
Deleting model ExtraTreesEntr. All files under /home/ci/autogluon/docs/tutorials/tabular/advanced/agModels-predictClass-deployment-clone-opt/models/ExtraTreesEntr will be removed.
Deleting model NeuralNetFastAI. All files under /home/ci/autogluon/docs/tutorials/tabular/advanced/agModels-predictClass-deployment-clone-opt/models/NeuralNetFastAI will be removed.
Deleting model XGBoost. All files under /home/ci/autogluon/docs/tutorials/tabular/advanced/agModels-predictClass-deployment-clone-opt/models/XGBoost will be removed.
Deleting model NeuralNetTorch. All files under /home/ci/autogluon/docs/tutorials/tabular/advanced/agModels-predictClass-deployment-clone-opt/models/NeuralNetTorch will be removed.
Deleting model LightGBMLarge. All files under /home/ci/autogluon/docs/tutorials/tabular/advanced/agModels-predictClass-deployment-clone-opt/models/LightGBMLarge will be removed.
Clone: Removing artifacts unnecessary for prediction. NOTE: Clone can no longer fit new models, and most functionality except for predict and predict_proba will no longer work
predictor_clone_opt = TabularPredictor.load(path=path_clone_opt)
To avoid loading the models from disk on every prediction call, we can keep the models persisted in memory:
predictor_clone_opt.persist()
Persisting 2 models in memory. Models will require 0.0% of memory.
['LightGBM', 'WeightedEnsemble_L2']
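If the persisted models should later be released, for example to free memory in a long-running service, they can be unloaded again; a minimal sketch:
predictor_clone_opt.unpersist()  # release persisted models from memory; subsequent predict calls load from disk again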
We can see that the optimized clone still makes the same predictions:
y_pred_clone_opt = predictor_clone_opt.predict(test_data)
y_pred_clone_opt
0 <=50K
1 <=50K
2 >50K
3 <=50K
4 <=50K
...
9764 <=50K
9765 <=50K
9766 <=50K
9767 <=50K
9768 <=50K
Name: class, Length: 9769, dtype: object
y_pred.equals(y_pred_clone_opt)
True
predictor_clone_opt.leaderboard(test_data)
 | model | score_test | score_val | eval_metric | pred_time_test | pred_time_val | fit_time | pred_time_test_marginal | pred_time_val_marginal | fit_time_marginal | stack_level | can_infer | fit_order
---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | LightGBM | 0.839799 | 0.85 | accuracy | 0.015134 | 0.004387 | 0.234844 | 0.015134 | 0.004387 | 0.234844 | 1 | True | 1
1 | WeightedEnsemble_L2 | 0.839799 | 0.85 | accuracy | 0.015754 | 0.005160 | 0.321193 | 0.000620 | 0.000773 | 0.086348 | 2 | True | 2
We can compare the disk usage of the optimized clone with that of the original predictor:
size_original = predictor.disk_usage()
size_opt = predictor_clone_opt.disk_usage()
print(f'Size Original: {size_original} bytes')
print(f'Size Optimized: {size_opt} bytes')
print(f'Optimized predictor achieved a {round((1 - (size_opt/size_original)) * 100, 1)}% reduction in disk usage.')
Size Original: 18431844 bytes
Size Optimized: 181793 bytes
Optimized predictor achieved a 99.0% reduction in disk usage.
We can also investigate the difference in the files that exist in the original and the optimized predictor.
Original:
predictor.disk_usage_per_file()
/models/ExtraTreesGini/model.pkl 5065899
/models/ExtraTreesEntr/model.pkl 5024129
/models/RandomForestGini/model.pkl 3408874
/models/RandomForestEntr/model.pkl 3267273
/models/XGBoost/xgb.ubj 506961
/models/LightGBMLarge/model.pkl 310841
/models/NeuralNetTorch/model.pkl 253965
/models/NeuralNetFastAI/model-internals.pkl 169919
/models/LightGBM/model.pkl 147764
/models/CatBoost/model.pkl 52238
/models/LightGBMXT/model.pkl 43110
/models/KNeighborsDist/model.pkl 40167
/models/KNeighborsUnif/model.pkl 40166
/utils/data/X.pkl 27584
/models/WeightedEnsemble_L2/model.pkl 10470
/learner.pkl 10351
/metadata.json 9265
/utils/data/X_val.pkl 8350
/utils/data/y.pkl 7462
/models/XGBoost/model.pkl 6152
/models/trainer.pkl 5681
/models/NeuralNetFastAI/model.pkl 2698
/utils/data/y_val.pkl 2355
/models/WeightedEnsemble_L2/utils/model_template.pkl 1246
/predictor.pkl 982
/models/WeightedEnsemble_L2/utils/oof.pkl 765
/utils/attr/LightGBM/y_pred_proba_val.pkl 551
/utils/attr/LightGBMLarge/y_pred_proba_val.pkl 551
/utils/attr/ExtraTreesEntr/y_pred_proba_val.pkl 551
/utils/attr/NeuralNetFastAI/y_pred_proba_val.pkl 551
/utils/attr/XGBoost/y_pred_proba_val.pkl 551
/utils/attr/NeuralNetTorch/y_pred_proba_val.pkl 551
/utils/attr/KNeighborsUnif/y_pred_proba_val.pkl 551
/utils/attr/LightGBMXT/y_pred_proba_val.pkl 551
/utils/attr/KNeighborsDist/y_pred_proba_val.pkl 551
/utils/attr/RandomForestGini/y_pred_proba_val.pkl 551
/utils/attr/CatBoost/y_pred_proba_val.pkl 551
/utils/attr/RandomForestEntr/y_pred_proba_val.pkl 551
/utils/attr/ExtraTreesGini/y_pred_proba_val.pkl 551
/version.txt 14
Name: size, dtype: int64
Optimized:
predictor_clone_opt.disk_usage_per_file()
/models/LightGBM/model.pkl 147792
/models/WeightedEnsemble_L2/model.pkl 10520
/learner.pkl 10351
/metadata.json 9265
/models/trainer.pkl 2869
/predictor.pkl 982
/version.txt 14
Name: size, dtype: int64
Compile models for maximized inference speed
To further improve inference efficiency, we can call .compile() to automatically convert sklearn function calls into their ONNX equivalents. Note that this is currently an experimental feature that only improves RandomForest and TabularNeuralNetwork models. Both compilation and the inference speedup require the skl2onnx and onnxruntime packages to be installed. To automatically install supported versions of these packages, run pip install autogluon.tabular[skl2onnx] on top of an existing AutoGluon installation, or pip install autogluon.tabular[all,skl2onnx] for a fresh AutoGluon installation.
It is important that the predictor has been cloned first, because once the models are compiled, they no longer support fitting.
predictor_clone_opt.compile()
Compiling 2 Models ...
Skipping compilation for LightGBM ... (No config specified)
Skipping compilation for WeightedEnsemble_L2 ... (No config specified)
Finished compiling models, total runtime = 0s.
With the compiled predictor, the predictions may not be exactly the same as before, but they should be very close:
y_pred_compile_opt = predictor_clone_opt.predict(test_data)
y_pred_compile_opt
0 <=50K
1 <=50K
2 >50K
3 <=50K
4 <=50K
...
9764 <=50K
9765 <=50K
9766 <=50K
9767 <=50K
9768 <=50K
Name: class, Length: 9769, dtype: object
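To quantify "very close", we can measure how often the compiled predictor agrees with the predictions computed before compilation (a quick sanity check using the y_pred from earlier):
agreement = (y_pred == y_pred_compile_opt).mean()  # fraction of test rows with identical predicted labels
print(f'Agreement with pre-compilation predictions: {agreement:.4%}')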
Now all that remains is to upload the optimized predictor to a centralized storage location such as S3. To use the predictor on a new machine / system, simply download the artifact to local disk and load the predictor. Make sure that when loading the predictor you use the same Python version and AutoGluon version that were used for training, to avoid instability.
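As a rough sketch of that workflow using boto3 (the bucket name and key below are placeholders, and valid AWS credentials are assumed):
import shutil
import boto3
from autogluon.tabular import TabularPredictor

archive = shutil.make_archive('predictor-clone-opt', 'zip', root_dir=path_clone_opt)  # package the predictor directory into one file
s3 = boto3.client('s3')
s3.upload_file(archive, 'my-model-bucket', 'models/predictor-clone-opt.zip')  # upload to the centralized storage location

# On the new machine / system: download, unpack, and load.
s3.download_file('my-model-bucket', 'models/predictor-clone-opt.zip', 'predictor-clone-opt.zip')
shutil.unpack_archive('predictor-clone-opt.zip', extract_dir='predictor-clone-opt')
predictor_prod = TabularPredictor.load('predictor-clone-opt')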