Predicting Columns in a Table - Deployment Optimization

This tutorial walks through an end-to-end AutoML process to create an optimized and deployable AutoGluon artifact for production use.

This tutorial assumes you have already read Predicting Columns in a Table - Quick Start and Predicting Columns in a Table - In Depth.

Fitting a TabularPredictor

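We will again use the AdultIncome dataset from the previous tutorials and train a predictor to predict whether a person's income exceeds $50,000, which is recorded in the class column of this table.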

from autogluon.tabular import TabularDataset, TabularPredictor
train_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/train.csv')
label = 'class'
subsample_size = 500  # subsample subset of data for faster demo, try setting this to much larger values
train_data = train_data.sample(n=subsample_size, random_state=0)
train_data.head()
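age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country class
6118 51 Private 39264 Some-college 10 Married-civ-spouse Exec-managerial Wife White Female 0 0 40 United-States >50K
23204 58 Private 51662 10th 6 Married-civ-spouse Other-service Wife White Female 0 0 8 United-States <=50K
29590 40 Private 326310 Some-college 10 Married-civ-spouse Craft-repair Husband White Male 0 0 44 United-States <=50K
18116 37 Private 222450 HS-grad 9 Never-married Sales Not-in-family White Male 0 2339 40 El-Salvador <=50K
33964 62 Private 109190 Bachelors 13 Married-civ-spouse Exec-managerial Husband White Male 15024 0 40 United-States >50K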
save_path = 'agModels-predictClass-deployment'  # specifies folder to store trained models
predictor = TabularPredictor(label=label, path=save_path).fit(train_data)
Verbosity: 2 (Standard Logging)
=================== System Info ===================
AutoGluon Version:  1.3.1b20250508
Python Version:     3.11.9
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #1 SMP Wed Mar 12 14:53:59 UTC 2025
CPU Count:          8
Memory Avail:       28.77 GB / 30.95 GB (93.0%)
Disk Space Avail:   211.93 GB / 255.99 GB (82.8%)
===================================================
No presets specified! To achieve strong results with AutoGluon, it is recommended to use the available presets. Defaulting to `'medium'`...
	Recommended Presets (For more details refer to https://autogluon.cn/stable/tutorials/tabular/tabular-essentials.html#presets):
	presets='experimental' : New in v1.2: Pre-trained foundation model + parallel fits. The absolute best accuracy without consideration for inference speed. Does not support GPU.
	presets='best'         : Maximize accuracy. Recommended for most users. Use in competitions and benchmarks.
	presets='high'         : Strong accuracy with fast inference speed.
	presets='good'         : Good accuracy with very fast inference speed.
	presets='medium'       : Fast training time, ideal for initial prototyping.
Beginning AutoGluon training ...
AutoGluon will save models to "/home/ci/autogluon/docs/tutorials/tabular/advanced/agModels-predictClass-deployment"
Train Data Rows:    500
Train Data Columns: 14
Label Column:       class
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
	2 unique label values:  [' >50K', ' <=50K']
	If 'binary' is not the correct problem_type, please manually specify the problem_type parameter during Predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression', 'quantile'])
Problem Type:       binary
Preprocessing data ...
Selected class <--> label mapping:  class 1 =  >50K, class 0 =  <=50K
	Note: For your binary classification, AutoGluon arbitrarily selected which label-value represents positive ( >50K) vs negative ( <=50K) class.
	To explicitly set the positive_class, either rename classes to 1 and 0, or specify positive_class in Predictor init.
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
	Available Memory:                    29460.86 MB
	Train Data (Original)  Memory Usage: 0.28 MB (0.0% of available memory)
	Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
	Stage 1 Generators:
		Fitting AsTypeFeatureGenerator...
			Note: Converting 1 features to boolean dtype as they only contain 2 unique values.
	Stage 2 Generators:
		Fitting FillNaFeatureGenerator...
	Stage 3 Generators:
		Fitting IdentityFeatureGenerator...
		Fitting CategoryFeatureGenerator...
			Fitting CategoryMemoryMinimizeFeatureGenerator...
	Stage 4 Generators:
		Fitting DropUniqueFeatureGenerator...
	Stage 5 Generators:
		Fitting DropDuplicatesFeatureGenerator...
	Types of features in original data (raw dtype, special dtypes):
		('int', [])    : 6 | ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', ...]
		('object', []) : 8 | ['workclass', 'education', 'marital-status', 'occupation', 'relationship', ...]
	Types of features in processed data (raw dtype, special dtypes):
		('category', [])  : 7 | ['workclass', 'education', 'marital-status', 'occupation', 'relationship', ...]
		('int', [])       : 6 | ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', ...]
		('int', ['bool']) : 1 | ['sex']
	0.1s = Fit runtime
	14 features in original data used to generate 14 features in processed data.
	Train Data (Processed) Memory Usage: 0.03 MB (0.0% of available memory)
Data preprocessing and feature engineering runtime = 0.08s ...
AutoGluon will gauge predictive performance using evaluation metric: 'accuracy'
	To change this, specify the eval_metric parameter of Predictor()
Automatically generating train/validation split with holdout_frac=0.2, Train Rows: 400, Val Rows: 100
User-specified model hyperparameters to be fit:
{
	'NN_TORCH': [{}],
	'GBM': [{'extra_trees': True, 'ag_args': {'name_suffix': 'XT'}}, {}, {'learning_rate': 0.03, 'num_leaves': 128, 'feature_fraction': 0.9, 'min_data_in_leaf': 3, 'ag_args': {'name_suffix': 'Large', 'priority': 0, 'hyperparameter_tune_kwargs': None}}],
	'CAT': [{}],
	'XGB': [{}],
	'FASTAI': [{}],
	'RF': [{'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression', 'quantile']}}],
	'XT': [{'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression', 'quantile']}}],
	'KNN': [{'weights': 'uniform', 'ag_args': {'name_suffix': 'Unif'}}, {'weights': 'distance', 'ag_args': {'name_suffix': 'Dist'}}],
}
Fitting 13 L1 models, fit_strategy="sequential" ...
Fitting model: KNeighborsUnif ...
	0.73	 = Validation score   (accuracy)
	0.04s	 = Training   runtime
	0.02s	 = Validation runtime
Fitting model: KNeighborsDist ...
	0.65	 = Validation score   (accuracy)
	0.01s	 = Training   runtime
	0.01s	 = Validation runtime
Fitting model: LightGBMXT ...
	0.83	 = Validation score   (accuracy)
	0.25s	 = Training   runtime
	0.0s	 = Validation runtime
Fitting model: LightGBM ...
	0.85	 = Validation score   (accuracy)
	0.23s	 = Training   runtime
	0.0s	 = Validation runtime
Fitting model: RandomForestGini ...
	0.84	 = Validation score   (accuracy)
	0.61s	 = Training   runtime
	0.06s	 = Validation runtime
Fitting model: RandomForestEntr ...
	0.83	 = Validation score   (accuracy)
	0.52s	 = Training   runtime
	0.05s	 = Validation runtime
Fitting model: CatBoost ...
	0.85	 = Validation score   (accuracy)
	0.82s	 = Training   runtime
	0.0s	 = Validation runtime
Fitting model: ExtraTreesGini ...
	0.82	 = Validation score   (accuracy)
	0.54s	 = Training   runtime
	0.06s	 = Validation runtime
Fitting model: ExtraTreesEntr ...
	0.81	 = Validation score   (accuracy)
	0.56s	 = Training   runtime
	0.06s	 = Validation runtime
Fitting model: NeuralNetFastAI ...
	0.84	 = Validation score   (accuracy)
	2.76s	 = Training   runtime
	0.01s	 = Validation runtime
Fitting model: XGBoost ...
	0.85	 = Validation score   (accuracy)
	0.37s	 = Training   runtime
	0.01s	 = Validation runtime
Fitting model: NeuralNetTorch ...
	0.83	 = Validation score   (accuracy)
	2.31s	 = Training   runtime
	0.01s	 = Validation runtime
Fitting model: LightGBMLarge ...
	0.83	 = Validation score   (accuracy)
	0.49s	 = Training   runtime
	0.0s	 = Validation runtime
Fitting model: WeightedEnsemble_L2 ...
	Ensemble Weights: {'LightGBM': 1.0}
	0.85	 = Validation score   (accuracy)
	0.09s	 = Training   runtime
	0.0s	 = Validation runtime
AutoGluon training complete, total runtime = 10.13s ... Best model: WeightedEnsemble_L2 | Estimated inference throughput: 19380.4 rows/s (100 batch size)
Disabling decision threshold calibration for metric `accuracy` due to having fewer than 10000 rows of validation data for calibration, to avoid overfitting (100 rows).
	`accuracy` is generally not improved through threshold calibration. Force calibration via specifying `calibrate_decision_threshold=True`.
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("/home/ci/autogluon/docs/tutorials/tabular/advanced/agModels-predictClass-deployment")

Next, load separate test data to demonstrate how to make predictions on new examples at inference time:

test_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/test.csv')
y_test = test_data[label]  # values to predict
test_data.head()
Loaded data from: https://autogluon.s3.amazonaws.com/datasets/Inc/test.csv | Columns = 15 / 15 | Rows = 9769 -> 9769
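age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country class
0 31 Private 169085 11th 7 Married-civ-spouse Sales Wife White Female 0 0 20 United-States <=50K
1 17 Self-emp-not-inc 226203 12th 8 Never-married Sales Own-child White Male 0 0 45 United-States <=50K
2 47 Private 54260 Assoc-voc 11 Married-civ-spouse Exec-managerial Husband White Male 0 1887 60 United-States >50K
3 21 Private 176262 Some-college 10 Never-married Exec-managerial Own-child White Female 0 0 30 United-States <=50K
4 17 Private 241185 12th 8 Never-married Prof-specialty Own-child White Male 0 0 20 United-States <=50K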

We then use the trained models to make predictions on the new data:

predictor = TabularPredictor.load(save_path)  # unnecessary, just demonstrates how to load previously-trained predictor from file

y_pred = predictor.predict(test_data)
y_pred
0        <=50K
1        <=50K
2         >50K
3        <=50K
4        <=50K
         ...  
9764     <=50K
9765     <=50K
9766     <=50K
9767     <=50K
9768     <=50K
Name: class, Length: 9769, dtype: object

We can use the leaderboard to evaluate the performance of each individual trained model on our labeled test data:

predictor.leaderboard(test_data)
model score_test score_val eval_metric pred_time_test pred_time_val fit_time pred_time_test_marginal pred_time_val_marginal fit_time_marginal stack_level can_infer fit_order
0 RandomForestGini 0.842870 0.84 accuracy 0.113165 0.056597 0.611028 0.113165 0.056597 0.611028 1 True 5
1 CatBoost 0.842461 0.85 accuracy 0.008556 0.003967 0.822158 0.008556 0.003967 0.822158 1 True 7
2 RandomForestEntr 0.841130 0.83 accuracy 0.107514 0.046581 0.524085 0.107514 0.046581 0.524085 1 True 6
3 XGBoost 0.839902 0.85 accuracy 0.057453 0.005959 0.373979 0.057453 0.005959 0.373979 1 True 11
4 LightGBM 0.839799 0.85 accuracy 0.019854 0.004387 0.234844 0.019854 0.004387 0.234844 1 True 4
5 WeightedEnsemble_L2 0.839799 0.85 accuracy 0.021583 0.005160 0.321193 0.001729 0.000773 0.086348 2 True 14
6 LightGBMXT 0.836421 0.83 accuracy 0.010530 0.003297 0.253594 0.010530 0.003297 0.253594 1 True 3
7 ExtraTreesEntr 0.833862 0.81 accuracy 0.098933 0.056663 0.556951 0.098933 0.056663 0.556951 1 True 9
8 ExtraTreesGini 0.833862 0.82 accuracy 0.106322 0.058232 0.542591 0.106322 0.058232 0.542591 1 True 8
9 NeuralNetTorch 0.833657 0.83 accuracy 0.046994 0.010464 2.313713 0.046994 0.010464 2.313713 1 True 12
10 NeuralNetFastAI 0.828949 0.84 accuracy 0.131369 0.010403 2.759325 0.131369 0.010403 2.759325 1 True 10
11 LightGBMLarge 0.817074 0.83 accuracy 0.012141 0.003433 0.487990 0.012141 0.003433 0.487990 1 True 13
12 KNeighborsUnif 0.725970 0.73 accuracy 0.036427 0.015048 0.036135 0.036427 0.015048 0.036135 1 True 1
13 KNeighborsDist 0.695158 0.65 accuracy 0.036943 0.013660 0.010330 0.036943 0.013660 0.010330 1 True 2
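
For deployment, the leaderboard can also be read as an accuracy versus latency trade-off. The following is a minimal sketch, not part of AutoGluon: fastest_within_tolerance is a hypothetical helper, and the rows are the top seven entries copied from the output above. Timings vary from run to run, and in practice one would select on validation data rather than on the test set.

```python
# Top leaderboard rows copied from the output above:
# (model, score_test, pred_time_test in seconds).
leaderboard_rows = [
    ("RandomForestGini", 0.842870, 0.113165),
    ("CatBoost", 0.842461, 0.008556),
    ("RandomForestEntr", 0.841130, 0.107514),
    ("XGBoost", 0.839902, 0.057453),
    ("LightGBM", 0.839799, 0.019854),
    ("WeightedEnsemble_L2", 0.839799, 0.021583),
    ("LightGBMXT", 0.836421, 0.010530),
]


def fastest_within_tolerance(rows, tolerance):
    """Return the fastest model whose score is within `tolerance` of the best.

    Hypothetical helper for illustration; it is not an AutoGluon API.
    """
    best_score = max(score for _, score, _ in rows)
    candidates = [row for row in rows if row[1] >= best_score - tolerance]
    return min(candidates, key=lambda row: row[2])[0]


choice = fastest_within_tolerance(leaderboard_rows, tolerance=0.005)
print(choice)
```

With a tolerance of zero the helper simply returns the highest-scoring model; a larger tolerance trades a little accuracy for lower prediction time.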

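Snapshot a Predictor with .clone()

Now that we have a working predictor artifact, we may want to modify it in various ways to better suit our needs. For example, we may want to reduce disk usage by deleting certain models via .delete_models(), or train additional models on top of the existing ones via .fit_extra().

While you can perform all of these operations directly on the predictor, you may want to be able to revert to a prior state of the predictor if something goes wrong. This is where predictor.clone() comes in.

predictor.clone() lets you create a snapshot of a given predictor by cloning its artifacts to a new location. You can then freely experiment with the predictor and, if you want to undo your changes, load the earlier snapshot at any time.

To clone a predictor, simply specify a new directory path to clone to: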

save_path_clone = save_path + '-clone'
# will return the path to the cloned predictor, identical to save_path_clone
path_clone = predictor.clone(path=save_path_clone)
Cloned TabularPredictor located in '/home/ci/autogluon/docs/tutorials/tabular/advanced/agModels-predictClass-deployment' to 'agModels-predictClass-deployment-clone'.
	To load the cloned predictor: predictor_clone = TabularPredictor.load(path="agModels-predictClass-deployment-clone")

Note that this doubles disk usage, because it fully clones every predictor artifact on disk to create an exact replica.
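
The disk cost of an exact snapshot can be illustrated without AutoGluon. The sketch below uses only the standard library: it builds a small stand-in directory (the file names and contents are made up), copies it in full, and measures the combined size. dir_size_bytes is a hypothetical helper, and shutil.copytree stands in for the exact copy described above; it is not how predictor.clone() is necessarily implemented.

```python
import shutil
import tempfile
from pathlib import Path


def dir_size_bytes(root):
    """Sum the sizes of all regular files under `root`."""
    return sum(p.stat().st_size for p in Path(root).rglob("*") if p.is_file())


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for the predictor directory (save_path); contents are dummy bytes.
    original = Path(tmp) / "predictor"
    (original / "models").mkdir(parents=True)
    (original / "models" / "model.pkl").write_bytes(b"x" * 1000)
    (original / "version.txt").write_bytes(b"placeholder")

    # Stand-in for the snapshot directory (save_path_clone): a full copy.
    clone = Path(tmp) / "predictor-clone"
    shutil.copytree(original, clone)

    size_before = dir_size_bytes(original)
    size_after = size_before + dir_size_bytes(clone)

print(size_before, size_after)
```

Because every file is duplicated, the total footprint after the copy is exactly twice the original size.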

Now we can load the cloned predictor:

predictor_clone = TabularPredictor.load(path=path_clone)
# You can alternatively load the cloned TabularPredictor at the time of cloning:
# predictor_clone = predictor.clone(path=save_path_clone, return_clone=True)

We can see that the cloned predictor has the same leaderboard and functionality as the original:

y_pred_clone = predictor_clone.predict(test_data)  # predict with the clone, not the original predictor
y_pred_clone
0        <=50K
1        <=50K
2         >50K
3        <=50K
4        <=50K
         ...  
9764     <=50K
9765     <=50K
9766     <=50K
9767     <=50K
9768     <=50K
Name: class, Length: 9769, dtype: object
y_pred.equals(y_pred_clone)
True
predictor_clone.leaderboard(test_data)
model score_test score_val eval_metric pred_time_test pred_time_val fit_time pred_time_test_marginal pred_time_val_marginal fit_time_marginal stack_level can_infer fit_order
0 RandomForestGini 0.842870 0.84 accuracy 0.116544 0.056597 0.611028 0.116544 0.056597 0.611028 1 True 5
1 CatBoost 0.842461 0.85 accuracy 0.007195 0.003967 0.822158 0.007195 0.003967 0.822158 1 True 7
2 RandomForestEntr 0.841130 0.83 accuracy 0.111256 0.046581 0.524085 0.111256 0.046581 0.524085 1 True 6
3 XGBoost 0.839902 0.85 accuracy 0.057376 0.005959 0.373979 0.057376 0.005959 0.373979 1 True 11
4 LightGBM 0.839799 0.85 accuracy 0.015373 0.004387 0.234844 0.015373 0.004387 0.234844 1 True 4
5 WeightedEnsemble_L2 0.839799 0.85 accuracy 0.016761 0.005160 0.321193 0.001388 0.000773 0.086348 2 True 14
6 LightGBMXT 0.836421 0.83 accuracy 0.008250 0.003297 0.253594 0.008250 0.003297 0.253594 1 True 3
7 ExtraTreesGini 0.833862 0.82 accuracy 0.096716 0.058232 0.542591 0.096716 0.058232 0.542591 1 True 8
8 ExtraTreesEntr 0.833862 0.81 accuracy 0.097025 0.056663 0.556951 0.097025 0.056663 0.556951 1 True 9
9 NeuralNetTorch 0.833657 0.83 accuracy 0.057921 0.010464 2.313713 0.057921 0.010464 2.313713 1 True 12
10 NeuralNetFastAI 0.828949 0.84 accuracy 0.138506 0.010403 2.759325 0.138506 0.010403 2.759325 1 True 10
11 LightGBMLarge 0.817074 0.83 accuracy 0.011639 0.003433 0.487990 0.011639 0.003433 0.487990 1 True 13
12 KNeighborsUnif 0.725970 0.73 accuracy 0.031415 0.015048 0.036135 0.031415 0.015048 0.036135 1 True 1
13 KNeighborsDist 0.695158 0.65 accuracy 0.036131 0.013660 0.010330 0.036131 0.013660 0.010330 1 True 2

Now let's perform some extra operations on the clone, such as calling refit_full:

predictor_clone.refit_full()

predictor_clone.leaderboard(test_data)
Refitting models via `predictor.refit_full` using all of the data (combined train and validation)...
	Models trained in this way will have the suffix "_FULL" and have NaN validation score.
	This process is not bound by time_limit, but should take less time than the original `predictor.fit` call.
	To learn more, refer to the `.refit_full` method docstring which explains how "_FULL" models differ from normal models.
Fitting 1 L1 models, fit_strategy="sequential" ...
Fitting model: KNeighborsUnif_FULL ...
	0.01s	 = Training   runtime
Fitting 1 L1 models, fit_strategy="sequential" ...
Fitting model: KNeighborsDist_FULL ...
	0.01s	 = Training   runtime
Fitting 1 L1 models, fit_strategy="sequential" ...
Fitting model: LightGBMXT_FULL ...
	0.21s	 = Training   runtime
Fitting 1 L1 models, fit_strategy="sequential" ...
Fitting model: LightGBM_FULL ...
	0.19s	 = Training   runtime
Fitting 1 L1 models, fit_strategy="sequential" ...
Fitting model: RandomForestGini_FULL ...
	0.57s	 = Training   runtime
Fitting 1 L1 models, fit_strategy="sequential" ...
Fitting model: RandomForestEntr_FULL ...
	0.53s	 = Training   runtime
Fitting 1 L1 models, fit_strategy="sequential" ...
Fitting model: CatBoost_FULL ...
	0.03s	 = Training   runtime
Fitting 1 L1 models, fit_strategy="sequential" ...
Fitting model: ExtraTreesGini_FULL ...
	0.52s	 = Training   runtime
Fitting 1 L1 models, fit_strategy="sequential" ...
Fitting model: ExtraTreesEntr_FULL ...
	0.52s	 = Training   runtime
Fitting 1 L1 models, fit_strategy="sequential" ...
Fitting model: NeuralNetFastAI_FULL ...
No improvement since epoch 0: early stopping
	0.32s	 = Training   runtime
Fitting 1 L1 models, fit_strategy="sequential" ...
Fitting model: XGBoost_FULL ...
	0.05s	 = Training   runtime
Fitting 1 L1 models, fit_strategy="sequential" ...
Fitting model: NeuralNetTorch_FULL ...
	0.61s	 = Training   runtime
Fitting 1 L1 models, fit_strategy="sequential" ...
Fitting model: LightGBMLarge_FULL ...
	0.2s	 = Training   runtime
Fitting model: WeightedEnsemble_L2_FULL | Skipping fit via cloning parent ...
	Ensemble Weights: {'LightGBM': 1.0}
	0.09s	 = Training   runtime
Updated best model to "WeightedEnsemble_L2_FULL" (Previously "WeightedEnsemble_L2"). AutoGluon will default to using "WeightedEnsemble_L2_FULL" for predict() and predict_proba().
Refit complete, total runtime = 4.15s ... Best model: "WeightedEnsemble_L2_FULL"
model score_test score_val eval_metric pred_time_test pred_time_val fit_time pred_time_test_marginal pred_time_val_marginal fit_time_marginal stack_level can_infer fit_order
0 CatBoost_FULL 0.842870 NaN accuracy 0.006422 NaN 0.027023 0.006422 NaN 0.027023 1 True 21
1 RandomForestGini 0.842870 0.84 accuracy 0.117268 0.056597 0.611028 0.117268 0.056597 0.611028 1 True 5
2 CatBoost 0.842461 0.85 accuracy 0.007725 0.003967 0.822158 0.007725 0.003967 0.822158 1 True 7
3 RandomForestEntr 0.841130 0.83 accuracy 0.111758 0.046581 0.524085 0.111758 0.046581 0.524085 1 True 6
4 LightGBM_FULL 0.840823 NaN accuracy 0.019103 NaN 0.193116 0.019103 NaN 0.193116 1 True 18
5 WeightedEnsemble_L2_FULL 0.840823 NaN accuracy 0.020436 NaN 0.279465 0.001333 NaN 0.086348 2 True 28
6 XGBoost 0.839902 0.85 accuracy 0.056961 0.005959 0.373979 0.056961 0.005959 0.373979 1 True 11
7 LightGBM 0.839799 0.85 accuracy 0.017601 0.004387 0.234844 0.017601 0.004387 0.234844 1 True 4
8 WeightedEnsemble_L2 0.839799 0.85 accuracy 0.019044 0.005160 0.321193 0.001443 0.000773 0.086348 2 True 14
9 RandomForestGini_FULL 0.839390 NaN accuracy 0.118220 NaN 0.574083 0.118220 NaN 0.574083 1 True 19
10 RandomForestEntr_FULL 0.839185 NaN accuracy 0.113545 NaN 0.527354 0.113545 NaN 0.527354 1 True 20
11 LightGBMXT_FULL 0.837957 NaN accuracy 0.009283 NaN 0.208477 0.009283 NaN 0.208477 1 True 17
12 LightGBMXT 0.836421 0.83 accuracy 0.011218 0.003297 0.253594 0.011218 0.003297 0.253594 1 True 3
13 XGBoost_FULL 0.836319 NaN accuracy 0.059666 NaN 0.050707 0.059666 NaN 0.050707 1 True 25
14 ExtraTreesEntr_FULL 0.835705 NaN accuracy 0.110837 NaN 0.521020 0.110837 NaN 0.521020 1 True 23
15 NeuralNetTorch_FULL 0.835091 NaN accuracy 0.055345 NaN 0.610771 0.055345 NaN 0.610771 1 True 26
16 ExtraTreesGini 0.833862 0.82 accuracy 0.097680 0.058232 0.542591 0.097680 0.058232 0.542591 1 True 8
17 ExtraTreesEntr 0.833862 0.81 accuracy 0.107475 0.056663 0.556951 0.107475 0.056663 0.556951 1 True 9
18 NeuralNetTorch 0.833657 0.83 accuracy 0.049717 0.010464 2.313713 0.049717 0.010464 2.313713 1 True 12
19 ExtraTreesGini_FULL 0.833453 NaN accuracy 0.106834 NaN 0.519521 0.106834 NaN 0.519521 1 True 22
20 NeuralNetFastAI 0.828949 0.84 accuracy 0.136169 0.010403 2.759325 0.136169 0.010403 2.759325 1 True 10
21 LightGBMLarge 0.817074 0.83 accuracy 0.012388 0.003433 0.487990 0.012388 0.003433 0.487990 1 True 13
22 LightGBMLarge_FULL 0.809704 NaN accuracy 0.011929 NaN 0.203436 0.011929 NaN 0.203436 1 True 27
23 NeuralNetFastAI_FULL 0.768349 NaN accuracy 0.128676 NaN 0.323018 0.128676 NaN 0.323018 1 True 24
24 KNeighborsUnif 0.725970 0.73 accuracy 0.032110 0.015048 0.036135 0.032110 0.015048 0.036135 1 True 1
25 KNeighborsUnif_FULL 0.725151 NaN accuracy 0.027841 NaN 0.005243 0.027841 NaN 0.005243 1 True 15
26 KNeighborsDist 0.695158 0.65 accuracy 0.027766 0.013660 0.010330 0.027766 0.013660 0.010330 1 True 2
27 KNeighborsDist_FULL 0.685434 NaN accuracy 0.026894 NaN 0.005183 0.026894 NaN 0.005183 1 True 16

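We can see that we were able to fit additional models, but for whatever reason we may want to undo this operation.

Luckily, our original predictor is untouched!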

predictor.leaderboard(test_data)
model score_test score_val eval_metric pred_time_test pred_time_val fit_time pred_time_test_marginal pred_time_val_marginal fit_time_marginal stack_level can_infer fit_order
0 RandomForestGini 0.842870 0.84 accuracy 0.116893 0.056597 0.611028 0.116893 0.056597 0.611028 1 True 5
1 CatBoost 0.842461 0.85 accuracy 0.007223 0.003967 0.822158 0.007223 0.003967 0.822158 1 True 7
2 RandomForestEntr 0.841130 0.83 accuracy 0.107640 0.046581 0.524085 0.107640 0.046581 0.524085 1 True 6
3 XGBoost 0.839902 0.85 accuracy 0.057026 0.005959 0.373979 0.057026 0.005959 0.373979 1 True 11
4 LightGBM 0.839799 0.85 accuracy 0.016095 0.004387 0.234844 0.016095 0.004387 0.234844 1 True 4
5 WeightedEnsemble_L2 0.839799 0.85 accuracy 0.017516 0.005160 0.321193 0.001421 0.000773 0.086348 2 True 14
6 LightGBMXT 0.836421 0.83 accuracy 0.008238 0.003297 0.253594 0.008238 0.003297 0.253594 1 True 3
7 ExtraTreesGini 0.833862 0.82 accuracy 0.106841 0.058232 0.542591 0.106841 0.058232 0.542591 1 True 8
8 ExtraTreesEntr 0.833862 0.81 accuracy 0.107799 0.056663 0.556951 0.107799 0.056663 0.556951 1 True 9
9 NeuralNetTorch 0.833657 0.83 accuracy 0.048223 0.010464 2.313713 0.048223 0.010464 2.313713 1 True 12
10 NeuralNetFastAI 0.828949 0.84 accuracy 0.140928 0.010403 2.759325 0.140928 0.010403 2.759325 1 True 10
11 LightGBMLarge 0.817074 0.83 accuracy 0.011978 0.003433 0.487990 0.011978 0.003433 0.487990 1 True 13
12 KNeighborsUnif 0.725970 0.73 accuracy 0.027841 0.015048 0.036135 0.027841 0.015048 0.036135 1 True 1
13 KNeighborsDist 0.695158 0.65 accuracy 0.026983 0.013660 0.010330 0.026983 0.013660 0.010330 1 True 2

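We can simply clone a new predictor from the original, and it will not be affected by the refit_full call made on the earlier clone.

Snapshot a deployment-optimized Predictor with .clone_for_deployment()

Instead of cloning an exact copy, we can clone a copy that contains only the minimal set of artifacts needed to make predictions.

Note that this optimized clone has very limited functionality beyond calling predict and predict_proba. For example, it cannot train more models.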

save_path_clone_opt = save_path + '-clone-opt'
# will return the path to the cloned predictor, identical to save_path_clone_opt
path_clone_opt = predictor.clone_for_deployment(path=save_path_clone_opt)
Cloned TabularPredictor located in '/home/ci/autogluon/docs/tutorials/tabular/advanced/agModels-predictClass-deployment' to 'agModels-predictClass-deployment-clone-opt'.
	To load the cloned predictor: predictor_clone = TabularPredictor.load(path="agModels-predictClass-deployment-clone-opt")
Clone: Keeping minimum set of models required to predict with best model 'WeightedEnsemble_L2'...
Deleting model KNeighborsUnif. All files under /home/ci/autogluon/docs/tutorials/tabular/advanced/agModels-predictClass-deployment-clone-opt/models/KNeighborsUnif will be removed.
Deleting model KNeighborsDist. All files under /home/ci/autogluon/docs/tutorials/tabular/advanced/agModels-predictClass-deployment-clone-opt/models/KNeighborsDist will be removed.
Deleting model LightGBMXT. All files under /home/ci/autogluon/docs/tutorials/tabular/advanced/agModels-predictClass-deployment-clone-opt/models/LightGBMXT will be removed.
Deleting model RandomForestGini. All files under /home/ci/autogluon/docs/tutorials/tabular/advanced/agModels-predictClass-deployment-clone-opt/models/RandomForestGini will be removed.
Deleting model RandomForestEntr. All files under /home/ci/autogluon/docs/tutorials/tabular/advanced/agModels-predictClass-deployment-clone-opt/models/RandomForestEntr will be removed.
Deleting model CatBoost. All files under /home/ci/autogluon/docs/tutorials/tabular/advanced/agModels-predictClass-deployment-clone-opt/models/CatBoost will be removed.
Deleting model ExtraTreesGini. All files under /home/ci/autogluon/docs/tutorials/tabular/advanced/agModels-predictClass-deployment-clone-opt/models/ExtraTreesGini will be removed.
Deleting model ExtraTreesEntr. All files under /home/ci/autogluon/docs/tutorials/tabular/advanced/agModels-predictClass-deployment-clone-opt/models/ExtraTreesEntr will be removed.
Deleting model NeuralNetFastAI. All files under /home/ci/autogluon/docs/tutorials/tabular/advanced/agModels-predictClass-deployment-clone-opt/models/NeuralNetFastAI will be removed.
Deleting model XGBoost. All files under /home/ci/autogluon/docs/tutorials/tabular/advanced/agModels-predictClass-deployment-clone-opt/models/XGBoost will be removed.
Deleting model NeuralNetTorch. All files under /home/ci/autogluon/docs/tutorials/tabular/advanced/agModels-predictClass-deployment-clone-opt/models/NeuralNetTorch will be removed.
Deleting model LightGBMLarge. All files under /home/ci/autogluon/docs/tutorials/tabular/advanced/agModels-predictClass-deployment-clone-opt/models/LightGBMLarge will be removed.
Clone: Removing artifacts unnecessary for prediction. NOTE: Clone can no longer fit new models, and most functionality except for predict and predict_proba will no longer work
predictor_clone_opt = TabularPredictor.load(path=path_clone_opt)
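
The log above shows the clone keeping only the models that the best model needs. A minimal sketch of that selection follows, assuming the only dependency is the one implied by the ensemble weights ({'LightGBM': 1.0}). required_models and the dependencies mapping are hypothetical names for illustration, not AutoGluon internals.

```python
def required_models(best_model, dependencies):
    """Return the best model plus every model it transitively depends on."""
    keep, stack = set(), [best_model]
    while stack:
        name = stack.pop()
        if name not in keep:
            keep.add(name)
            stack.extend(dependencies.get(name, []))
    return keep


# Model names in fit order, taken from the training log above.
all_models = [
    "KNeighborsUnif", "KNeighborsDist", "LightGBMXT", "LightGBM",
    "RandomForestGini", "RandomForestEntr", "CatBoost", "ExtraTreesGini",
    "ExtraTreesEntr", "NeuralNetFastAI", "XGBoost", "NeuralNetTorch",
    "LightGBMLarge", "WeightedEnsemble_L2",
]
# Assumption: the ensemble depends only on LightGBM (its sole weighted member).
dependencies = {"WeightedEnsemble_L2": ["LightGBM"]}

keep = required_models("WeightedEnsemble_L2", dependencies)
to_delete = [name for name in all_models if name not in keep]
print(sorted(keep))
print(to_delete)
```

Under this assumption the sketch keeps two models and marks twelve for deletion, which lines up with the "Deleting model" lines in the log.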

To avoid loading the model on every prediction call, we can persist it in memory:

predictor_clone_opt.persist()
Persisting 2 models in memory. Models will require 0.0% of memory.
['LightGBM', 'WeightedEnsemble_L2']
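
The idea behind persisting can be shown with a toy stand-in. The sketch below is not AutoGluon code: ToyPredictor is a hypothetical class that reloads a pickled "model" from disk on every predict call unless persist() has cached it in memory first, and it counts disk loads so the difference is visible.

```python
import pickle
import tempfile
from pathlib import Path


class ToyPredictor:
    """Toy stand-in (not AutoGluon) that loads a pickled model from disk."""

    def __init__(self, model_path):
        self.model_path = Path(model_path)
        self._model = None   # in-memory copy, filled by persist()
        self.load_count = 0  # number of disk loads, for illustration only

    def _load(self):
        self.load_count += 1
        with self.model_path.open("rb") as f:
            return pickle.load(f)

    def persist(self):
        """Load the model once and keep it in memory."""
        if self._model is None:
            self._model = self._load()

    def predict(self, values):
        model = self._model if self._model is not None else self._load()
        return [value >= model["threshold"] for value in values]


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "model.pkl"
    path.write_bytes(pickle.dumps({"threshold": 10}))

    lazy = ToyPredictor(path)
    for _ in range(3):
        lazy.predict([5, 15])
    loads_without_persist = lazy.load_count

    persisted = ToyPredictor(path)
    persisted.persist()
    for _ in range(3):
        preds = persisted.predict([5, 15])
    loads_with_persist = persisted.load_count

print(loads_without_persist, loads_with_persist)
```

The persisted instance touches disk once regardless of how many predictions follow, which is the saving that matters for low-latency serving.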

We can see that the optimized clone still makes the same predictions:

y_pred_clone_opt = predictor_clone_opt.predict(test_data)
y_pred_clone_opt
0        <=50K
1        <=50K
2         >50K
3        <=50K
4        <=50K
         ...  
9764     <=50K
9765     <=50K
9766     <=50K
9767     <=50K
9768     <=50K
Name: class, Length: 9769, dtype: object
y_pred.equals(y_pred_clone_opt)
True
predictor_clone_opt.leaderboard(test_data)
model score_test score_val eval_metric pred_time_test pred_time_val fit_time pred_time_test_marginal pred_time_val_marginal fit_time_marginal stack_level can_infer fit_order
0 LightGBM 0.839799 0.85 accuracy 0.015134 0.004387 0.234844 0.015134 0.004387 0.234844 1 True 1
1 WeightedEnsemble_L2 0.839799 0.85 accuracy 0.015754 0.005160 0.321193 0.000620 0.000773 0.086348 2 True 2

We can compare the disk usage of the optimized clone with that of the original predictor:

size_original = predictor.disk_usage()
size_opt = predictor_clone_opt.disk_usage()
print(f'Size Original:  {size_original} bytes')
print(f'Size Optimized: {size_opt} bytes')
print(f'Optimized predictor achieved a {round((1 - (size_opt/size_original)) * 100, 1)}% reduction in disk usage.')
Size Original:  18431844 bytes
Size Optimized: 181793 bytes
Optimized predictor achieved a 99.0% reduction in disk usage.

We can also investigate the difference in the files that exist in the original and optimized predictors.

Original:

predictor.disk_usage_per_file()
/models/ExtraTreesGini/model.pkl                        5065899
/models/ExtraTreesEntr/model.pkl                        5024129
/models/RandomForestGini/model.pkl                      3408874
/models/RandomForestEntr/model.pkl                      3267273
/models/XGBoost/xgb.ubj                                  506961
/models/LightGBMLarge/model.pkl                          310841
/models/NeuralNetTorch/model.pkl                         253965
/models/NeuralNetFastAI/model-internals.pkl              169919
/models/LightGBM/model.pkl                               147764
/models/CatBoost/model.pkl                                52238
/models/LightGBMXT/model.pkl                              43110
/models/KNeighborsDist/model.pkl                          40167
/models/KNeighborsUnif/model.pkl                          40166
/utils/data/X.pkl                                         27584
/models/WeightedEnsemble_L2/model.pkl                     10470
/learner.pkl                                              10351
/metadata.json                                             9265
/utils/data/X_val.pkl                                      8350
/utils/data/y.pkl                                          7462
/models/XGBoost/model.pkl                                  6152
/models/trainer.pkl                                        5681
/models/NeuralNetFastAI/model.pkl                          2698
/utils/data/y_val.pkl                                      2355
/models/WeightedEnsemble_L2/utils/model_template.pkl       1246
/predictor.pkl                                              982
/models/WeightedEnsemble_L2/utils/oof.pkl                   765
/utils/attr/LightGBM/y_pred_proba_val.pkl                   551
/utils/attr/LightGBMLarge/y_pred_proba_val.pkl              551
/utils/attr/ExtraTreesEntr/y_pred_proba_val.pkl             551
/utils/attr/NeuralNetFastAI/y_pred_proba_val.pkl            551
/utils/attr/XGBoost/y_pred_proba_val.pkl                    551
/utils/attr/NeuralNetTorch/y_pred_proba_val.pkl             551
/utils/attr/KNeighborsUnif/y_pred_proba_val.pkl             551
/utils/attr/LightGBMXT/y_pred_proba_val.pkl                 551
/utils/attr/KNeighborsDist/y_pred_proba_val.pkl             551
/utils/attr/RandomForestGini/y_pred_proba_val.pkl           551
/utils/attr/CatBoost/y_pred_proba_val.pkl                   551
/utils/attr/RandomForestEntr/y_pred_proba_val.pkl           551
/utils/attr/ExtraTreesGini/y_pred_proba_val.pkl             551
/version.txt                                                 14
Name: size, dtype: int64

Optimized:

predictor_clone_opt.disk_usage_per_file()
/models/LightGBM/model.pkl               147792
/models/WeightedEnsemble_L2/model.pkl     10520
/learner.pkl                              10351
/metadata.json                             9265
/models/trainer.pkl                        2869
/predictor.pkl                              982
/version.txt                                 14
Name: size, dtype: int64
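
The two listings can be compared programmatically. The sketch below hard-codes the optimized listing and a small sample of the original listing (both copied from the outputs above) as plain dictionaries, so it runs without AutoGluon. With the real pandas Series returned by disk_usage_per_file(), the same set operations would presumably apply to the Series index.

```python
# Per-file sizes in bytes for the optimized clone, copied from the output above.
optimized_files = {
    "/models/LightGBM/model.pkl": 147792,
    "/models/WeightedEnsemble_L2/model.pkl": 10520,
    "/learner.pkl": 10351,
    "/metadata.json": 9265,
    "/models/trainer.pkl": 2869,
    "/predictor.pkl": 982,
    "/version.txt": 14,
}

# A sample of the original listing (not the full listing), also copied from above.
original_files_sample = {
    "/models/ExtraTreesGini/model.pkl": 5065899,
    "/models/LightGBM/model.pkl": 147764,
    "/utils/data/X.pkl": 27584,
    "/learner.pkl": 10351,
    "/version.txt": 14,
}

# Total size of the optimized clone.
total_optimized = sum(optimized_files.values())

# Files present in the original sample but dropped from the optimized clone.
removed = sorted(set(original_files_sample) - set(optimized_files))

# Files kept in both listings whose size changed: path -> (original, optimized).
resized = {
    path: (original_files_sample[path], optimized_files[path])
    for path in original_files_sample
    if path in optimized_files and original_files_sample[path] != optimized_files[path]
}

print(total_optimized)
print(removed)
print(resized)
```

The per-file sizes of the optimized clone add up to the 181793 bytes reported by disk_usage() earlier, and the removed entries are the unused model files and the cached training data.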

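Compile models for maximized inference speed

To further improve inference efficiency, we can call .compile() to automatically convert sklearn function calls into their ONNX equivalents. Note that this is currently an experimental feature that only improves RandomForest and TabularNeuralNetwork models. Compilation and the resulting inference speedup require the skl2onnx and onnxruntime packages. To automatically install supported versions of these packages, run pip install autogluon.tabular[skl2onnx] on top of an existing AutoGluon installation, or pip install autogluon.tabular[all,skl2onnx] for a new AutoGluon installation.

It is important to make sure the predictor has been cloned first, because once its models are compiled it no longer supports fitting.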

predictor_clone_opt.compile()
Compiling 2 Models ...
Skipping compilation for LightGBM ... (No config specified)
Skipping compilation for WeightedEnsemble_L2 ... (No config specified)
Finished compiling models, total runtime = 0s.

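With a compiled predictor, the predictions may not be exactly the same but should be very close. Note that in the run above, the log reports that compilation was skipped for both models, so nothing was actually compiled here.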

y_pred_compile_opt = predictor_clone_opt.predict(test_data)
y_pred_compile_opt
0        <=50K
1        <=50K
2         >50K
3        <=50K
4        <=50K
         ...  
9764     <=50K
9765     <=50K
9766     <=50K
9767     <=50K
9768     <=50K
Name: class, Length: 9769, dtype: object

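All that remains is to upload the optimized predictor to a centralized storage location such as S3. To use this predictor on a new machine or system, download the artifact to local disk and load the predictor. Make sure to load the predictor with the same Python version and AutoGluon version used during training to avoid instability.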
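
One way to guard against version drift is to record the training environment next to the artifact and compare it before loading. The sketch below is an assumption-laden illustration using only the standard library: the manifest file name deploy_manifest.json and the helpers write_manifest and check_manifest are hypothetical and not part of AutoGluon. The version string is the one shown in the training log above.

```python
import json
import sys
import tempfile
from pathlib import Path

MANIFEST_NAME = "deploy_manifest.json"  # hypothetical file name, not created by AutoGluon


def current_python():
    """Return the running interpreter's major.minor version, e.g. '3.11'."""
    return f"{sys.version_info.major}.{sys.version_info.minor}"


def write_manifest(artifact_dir, autogluon_version):
    """Record the training environment next to the predictor artifact."""
    manifest = {"python": current_python(), "autogluon": autogluon_version}
    (Path(artifact_dir) / MANIFEST_NAME).write_text(json.dumps(manifest))


def check_manifest(artifact_dir, autogluon_version):
    """Return a list of mismatch messages; an empty list means the versions match."""
    manifest = json.loads((Path(artifact_dir) / MANIFEST_NAME).read_text())
    problems = []
    if manifest["python"] != current_python():
        problems.append(f"Python {manifest['python']} != {current_python()}")
    if manifest["autogluon"] != autogluon_version:
        problems.append(f"AutoGluon {manifest['autogluon']} != {autogluon_version}")
    return problems


with tempfile.TemporaryDirectory() as artifact_dir:
    # At training time: record the versions alongside the artifact.
    write_manifest(artifact_dir, autogluon_version="1.3.1b20250508")
    # At load time: compare against the serving environment before loading.
    same_env = check_manifest(artifact_dir, autogluon_version="1.3.1b20250508")
    other_env = check_manifest(artifact_dir, autogluon_version="1.2")

print(same_env)
print(other_env)
```

In a real deployment the installed version would be looked up at runtime rather than hard-coded, for example with importlib.metadata.version, assuming the distribution name on the serving machine is known.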