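Adding a custom model to AutoGluon (Advanced)
Tip: If you are new to AutoGluon, review Predicting Columns in a Table - Quick Start to learn the basics of the AutoGluon API.
This tutorial covers advanced custom model options that go beyond the topics covered in Adding a custom model to AutoGluon.
It assumes you have fully read Adding a custom model to AutoGluon.
Loading the data
First we will load the data. For this tutorial, we will use the adult income dataset because it has a mix of integer, float, and categorical features.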
from autogluon.tabular import TabularDataset
train_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/train.csv') # can be local CSV file as well, returns Pandas DataFrame
test_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/test.csv') # another Pandas DataFrame
label = 'class' # specifies which column we want to predict
train_data = train_data.sample(n=1000, random_state=0) # subsample for faster demo
train_data.head(5)
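 | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | class |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
6118 | 51 | Private | 39264 | Some-college | 10 | Married-civ-spouse | Exec-managerial | Wife | White | Female | 0 | 0 | 40 | United-States | >50K |
23204 | 58 | Private | 51662 | 10th | 6 | Married-civ-spouse | Other-service | Wife | White | Female | 0 | 0 | 8 | United-States | <=50K |
29590 | 40 | Private | 326310 | Some-college | 10 | Married-civ-spouse | Craft-repair | Husband | White | Male | 0 | 0 | 44 | United-States | <=50K |
18116 | 37 | Private | 222450 | HS-grad | 9 | Never-married | Sales | Not-in-family | White | Male | 0 | 2339 | 40 | El-Salvador | <=50K |
33964 | 62 | Private | 109190 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 15024 | 0 | 40 | United-States | >50K |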
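Force features to be passed to models without preprocessing / dropping
You might want to do this if your model logic requires a particular column to always be present, regardless of its content. For example, suppose you are fine-tuning a pre-trained language model that expects a feature indicating the language of the text in a given row, which dictates how the text is preprocessed, but the training data only includes one language. Without this adjustment, the language identifier feature would be dropped before the model is fit.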
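To make the scenario concrete, here is a minimal plain-Python sketch of the "drop features with only one unique value" behavior described above. It is not AutoGluon's implementation; the function `drop_constant_columns` and its `keep` argument are hypothetical names used only for illustration.

```python
# Illustrative only: a plain-Python stand-in for dropping single-valued features.
def drop_constant_columns(columns, keep=()):
    """Return a copy of `columns` without single-valued columns, except those in `keep`."""
    kept = {}
    for name, values in columns.items():
        if len(set(values)) > 1 or name in keep:
            kept[name] = values
    return kept


data = {
    "text": ["hello", "good morning", "see you"],
    "language": ["en", "en", "en"],  # constant: the training data has one language
}

print(list(drop_constant_columns(data)))                     # ['text']
print(list(drop_constant_columns(data, keep={"language"})))  # ['text', 'language']
```

The constant `language` column disappears unless it is explicitly protected, which is the situation the rest of this section addresses.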
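Force features to not be dropped in model-specific preprocessing
To avoid dropping features in custom models due to having only 1 unique value, add the following _get_default_auxiliary_params method to your custom model class: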
from autogluon.core.models import AbstractModel

class DummyModel(AbstractModel):
    def _fit(self, X, **kwargs):
        print(f'Before {self.__class__.__name__} Preprocessing ({len(X.columns)} features):\n\t{list(X.columns)}')
        X = self.preprocess(X)
        print(f'After {self.__class__.__name__} Preprocessing ({len(X.columns)} features):\n\t{list(X.columns)}')
        print(X.head(5))

class DummyModelKeepUnique(DummyModel):
    def _get_default_auxiliary_params(self) -> dict:
        default_auxiliary_params = super()._get_default_auxiliary_params()
        extra_auxiliary_params = dict(
            drop_unique=False,  # Whether to drop features that have only 1 unique value, default is True
        )
        default_auxiliary_params.update(extra_auxiliary_params)
        return default_auxiliary_params
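The override above follows a common pattern: fetch the parent's defaults, update them, and return the merged dict. The sketch below reproduces that pattern with plain Python classes so it can be run without AutoGluon. `BaseModelSketch`, `KeepUniqueSketch`, `user_params`, and the `another_option` key are hypothetical stand-ins, not part of the AutoGluon API.

```python
class BaseModelSketch:
    """Hypothetical stand-in for a base model class; not AutoGluon's AbstractModel."""

    def __init__(self, user_params=None):
        # Start from the class defaults, then let user-supplied values win.
        params = self._get_default_params()
        params.update(user_params or {})
        self.params = params

    def _get_default_params(self) -> dict:
        return {"drop_unique": True, "another_option": "default"}


class KeepUniqueSketch(BaseModelSketch):
    def _get_default_params(self) -> dict:
        # Same shape as the tutorial's override: extend the parent's defaults.
        params = super()._get_default_params()
        params.update({"drop_unique": False})
        return params


print(BaseModelSketch().params["drop_unique"])                        # True
print(KeepUniqueSketch().params["drop_unique"])                       # False
print(BaseModelSketch({"drop_unique": False}).params["drop_unique"])  # False
```

Because the subclass only updates the keys it cares about, every other default from the parent is preserved.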
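Force features to not be dropped in global preprocessing
While the model-specific fix above works when the feature still exists after global preprocessing, it does not help if the feature was already dropped before reaching the model. To handle that case, we need to create a new feature generator class that separates the preprocessing logic for normal features from that for user override features.
Here is an example implementation: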
# WARNING: To use this in practice, you must put this code in a separate python file
# from the main process and import it, or else it will not be serializable.
from autogluon.features import BulkFeatureGenerator, AutoMLPipelineFeatureGenerator, IdentityFeatureGenerator

class CustomFeatureGeneratorWithUserOverride(BulkFeatureGenerator):
    def __init__(self, automl_generator_kwargs: dict = None, **kwargs):
        generators = self._get_default_generators(automl_generator_kwargs=automl_generator_kwargs)
        super().__init__(generators=generators, **kwargs)

    def _get_default_generators(self, automl_generator_kwargs: dict = None):
        if automl_generator_kwargs is None:
            automl_generator_kwargs = dict()
        generators = [
            [
                # Preprocessing logic that handles normal features
                AutoMLPipelineFeatureGenerator(banned_feature_special_types=['user_override'], **automl_generator_kwargs),
                # Preprocessing logic that handles special features user wishes to treat separately, here we simply skip preprocessing for these features.
                IdentityFeatureGenerator(infer_features_in_args=dict(required_special_types=['user_override'])),
            ],
        ]
        return generators
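The above code splits the preprocessing logic of a feature depending on whether it is tagged with the 'user_override' special type in feature metadata. To tag three features ['age', 'native-country', 'dummy_feature'] in this way, you can do the following: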
# add a useless dummy feature to show that it is not dropped in preprocessing
train_data['dummy_feature'] = 'dummy value'
test_data['dummy_feature'] = 'dummy value'
from autogluon.tabular import FeatureMetadata
feature_metadata = FeatureMetadata.from_df(train_data)
print('Before inserting overrides:')
print(feature_metadata)
feature_metadata = feature_metadata.add_special_types(
    {
        'age': ['user_override'],
        'native-country': ['user_override'],
        'dummy_feature': ['user_override'],
    }
)
print('After inserting overrides:')
print(feature_metadata)
Before inserting overrides:
('int', []) : 6 | ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', ...]
('object', []) : 10 | ['workclass', 'education', 'marital-status', 'occupation', 'relationship', ...]
After inserting overrides:
('int', []) : 5 | ['fnlwgt', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week']
('int', ['user_override']) : 1 | ['age']
('object', []) : 8 | ['workclass', 'education', 'marital-status', 'occupation', 'relationship', ...]
('object', ['user_override']) : 2 | ['native-country', 'dummy_feature']
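Note that this is only one example implementation of a custom feature generator with bifurcated preprocessing logic. You can make the tagging and feature generator logic as complex as your needs require. In this example, we perform standard preprocessing on all untagged features, and pass the tagged features through IdentityFeatureGenerator, which is a no-op that leaves them unchanged. You can substitute any kind of feature generator for IdentityFeatureGenerator to suit your needs.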
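The bifurcation idea can also be sketched without AutoGluon. In the plain-Python example below, features are routed to one of two transforms based on a tag. The helper names `split_by_tag` and `bifurcated_transform`, and the lowercase "normal" transform, are assumptions made for illustration; they are not how the real feature generators work internally.

```python
def split_by_tag(special_types, tag="user_override"):
    """Split feature names into (untagged, tagged) lists, preserving order."""
    tagged = [f for f, tags in special_types.items() if tag in tags]
    untagged = [f for f, tags in special_types.items() if tag not in tags]
    return untagged, tagged


def bifurcated_transform(rows, special_types, normal_fn, override_fn=lambda v: v):
    """Apply normal_fn to untagged features and override_fn (identity by default) to tagged ones."""
    untagged, tagged = split_by_tag(special_types)
    out = []
    for row in rows:
        new_row = {f: normal_fn(row[f]) for f in untagged}
        new_row.update({f: override_fn(row[f]) for f in tagged})
        out.append(new_row)
    return out


special_types = {"education": [], "age": ["user_override"], "dummy_feature": ["user_override"]}
rows = [{"education": "Bachelors", "age": 62, "dummy_feature": "dummy value"}]

result = bifurcated_transform(rows, special_types, normal_fn=lambda v: str(v).lower())
print(result)
# [{'education': 'bachelors', 'age': 62, 'dummy_feature': 'dummy value'}]
```

Swapping `override_fn` for a different callable corresponds to replacing IdentityFeatureGenerator with another generator for the tagged features.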
Putting it all together
# Separate features and labels
X = train_data.drop(columns=[label])
y = train_data[label]
X_test = test_data.drop(columns=[label])
y_test = test_data[label]
# preprocess the label column, as done in the prior custom model tutorial
from autogluon.core.data import LabelCleaner
from autogluon.core.utils import infer_problem_type
# Construct a LabelCleaner to neatly convert labels to float/integers during model training/inference, can also use to inverse_transform back to original.
problem_type = infer_problem_type(y=y) # Infer problem type (or else specify directly)
label_cleaner = LabelCleaner.construct(problem_type=problem_type, y=y)
y_preprocessed = label_cleaner.transform(y)
y_test_preprocessed = label_cleaner.transform(y_test)
# Make sure to specify your custom feature metadata to the feature generator
my_custom_feature_generator = CustomFeatureGeneratorWithUserOverride(feature_metadata_in=feature_metadata)
X_preprocessed = my_custom_feature_generator.fit_transform(X)
X_test_preprocessed = my_custom_feature_generator.transform(X_test)
Note that the user_override features were not preprocessed:
print(list(X_preprocessed.columns))
X_preprocessed.head(5)
['fnlwgt', 'education-num', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'age', 'native-country', 'dummy_feature']
 | fnlwgt | education-num | sex | capital-gain | capital-loss | hours-per-week | workclass | education | marital-status | occupation | relationship | race | age | native-country | dummy_feature |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
6118 | 39264 | 10 | 0 | 0 | 0 | 40 | 3 | 14 | 1 | 4 | 5 | 4 | 51 | United-States | dummy value |
23204 | 51662 | 6 | 0 | 0 | 0 | 8 | 3 | 0 | 1 | 8 | 5 | 4 | 58 | United-States | dummy value |
29590 | 326310 | 10 | 1 | 0 | 0 | 44 | 3 | 14 | 1 | 3 | 0 | 4 | 40 | United-States | dummy value |
18116 | 222450 | 9 | 1 | 0 | 2339 | 40 | 3 | 11 | 3 | 12 | 1 | 4 | 37 | El-Salvador | dummy value |
33964 | 109190 | 13 | 1 | 15024 | 0 | 40 | 3 | 9 | 1 | 4 | 0 | 4 | 62 | United-States | dummy value |
Now let's see what happens when we send this data to the dummy model for fitting:
dummy_model = DummyModel()
dummy_model.fit(X=X, y=y, feature_metadata=my_custom_feature_generator.feature_metadata)
Before DummyModel Preprocessing (15 features):
['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'dummy_feature']
After DummyModel Preprocessing (14 features):
['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country']
age workclass fnlwgt education education-num \
6118 51 Private 39264 Some-college 10
23204 58 Private 51662 10th 6
29590 40 Private 326310 Some-college 10
18116 37 Private 222450 HS-grad 9
33964 62 Private 109190 Bachelors 13
marital-status occupation relationship race sex \
6118 Married-civ-spouse Exec-managerial Wife White Female
23204 Married-civ-spouse Other-service Wife White Female
29590 Married-civ-spouse Craft-repair Husband White Male
18116 Never-married Sales Not-in-family White Male
33964 Married-civ-spouse Exec-managerial Husband White Male
capital-gain capital-loss hours-per-week native-country
6118 0 0 40 United-States
23204 0 0 8 United-States
29590 0 0 44 United-States
18116 0 2339 40 El-Salvador
33964 15024 0 40 United-States
<__main__.DummyModel at 0x7f712f7cc750>
Notice how the model dropped dummy_feature during the preprocess call. Now let's see what happens if we use DummyModelKeepUnique:
dummy_model_keep_unique = DummyModelKeepUnique()
dummy_model_keep_unique.fit(X=X, y=y, feature_metadata=my_custom_feature_generator.feature_metadata)
Before DummyModelKeepUnique Preprocessing (15 features):
['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'dummy_feature']
After DummyModelKeepUnique Preprocessing (15 features):
['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'dummy_feature']
age workclass fnlwgt education education-num \
6118 51 Private 39264 Some-college 10
23204 58 Private 51662 10th 6
29590 40 Private 326310 Some-college 10
18116 37 Private 222450 HS-grad 9
33964 62 Private 109190 Bachelors 13
marital-status occupation relationship race sex \
6118 Married-civ-spouse Exec-managerial Wife White Female
23204 Married-civ-spouse Other-service Wife White Female
29590 Married-civ-spouse Craft-repair Husband White Male
18116 Never-married Sales Not-in-family White Male
33964 Married-civ-spouse Exec-managerial Husband White Male
capital-gain capital-loss hours-per-week native-country \
6118 0 0 40 United-States
23204 0 0 8 United-States
29590 0 0 44 United-States
18116 0 2339 40 El-Salvador
33964 15024 0 40 United-States
dummy_feature
6118 dummy value
23204 dummy value
29590 dummy value
18116 dummy value
33964 dummy value
<__main__.DummyModelKeepUnique at 0x7f712f923f50>
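Now dummy_feature is no longer dropped!
The above code logic can be reused for testing your own complex model implementations: simply replace DummyModelKeepUnique with your custom model and check whether it keeps the features you want to use.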
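One way to automate that check is to compare the post-preprocessing column list against the features you require. The helper below is a hypothetical convenience function (`missing_features` is not part of AutoGluon); the column lists are copied from the two outputs shown above.

```python
def missing_features(required, columns_after):
    """Return the required feature names that are absent after preprocessing."""
    present = set(columns_after)
    return [name for name in required if name not in present]


# Columns printed after DummyModel preprocessing (14 features).
after_dummy_model = [
    'age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status',
    'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss',
    'hours-per-week', 'native-country',
]
# Columns printed after DummyModelKeepUnique preprocessing (15 features).
after_keep_unique = after_dummy_model + ['dummy_feature']

required = ['age', 'native-country', 'dummy_feature']

print(missing_features(required, after_dummy_model))  # ['dummy_feature']
print(missing_features(required, after_keep_unique))  # []
```

In practice you could pass `list(X.columns)` from inside your model's `_fit` after calling `self.preprocess(X)` and fail fast if the returned list is not empty.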
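Keeping features via TabularPredictor
Now let's demonstrate how to achieve this through TabularPredictor in far fewer lines of code. Note that this code will raise an exception if run in this tutorial, because the custom model and feature generator must exist in separate files in order to be serializable. Therefore, we will not run the code in the tutorial. (It would also raise an exception because DummyModel is not a real model.)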
from autogluon.tabular import TabularPredictor

feature_generator = CustomFeatureGeneratorWithUserOverride()
predictor = TabularPredictor(label=label)
predictor.fit(
    train_data=train_data,
    feature_metadata=feature_metadata,  # feature metadata with your overrides
    feature_generator=feature_generator,  # your custom feature generator that handles the overrides
    hyperparameters={
        'GBM': {},  # Can fit your custom model alongside default models
        DummyModel: {},  # Will drop dummy_feature
        DummyModelKeepUnique: {},  # Will not drop dummy_feature
        # DummyModel: {'ag_args_fit': {'drop_unique': False}},  # This is another way to get same result as using DummyModelKeepUnique
    }
)
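The requirement that custom classes live in an importable file is easier to understand once you see how Python's pickle handles classes. Assuming serialization is pickle-based, the sketch below shows that pickle records a class by its module path and name rather than by its source code. It uses the standard library's `Fraction` purely as a stand-in for a custom class.

```python
import pickle
from fractions import Fraction

# Pickle an instance of a class that lives in an importable module.
payload = pickle.dumps(Fraction(1, 3))

# The payload stores a reference to the class (module name + class name) ...
print(b"fractions" in payload)  # True
print(b"Fraction" in payload)   # True
# ... but not the class's source code.
print(b"def " in payload)       # False

# Loading works because `fractions.Fraction` can be imported again.
print(pickle.loads(payload) == Fraction(1, 3))  # True
```

A class defined directly in a notebook or script is recorded under the module name `__main__`, so a different process that loads the pickle cannot resolve that reference unless the class is defined in a file it can import.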