Adding a custom model to AutoGluon (Advanced)

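Tip: If you are new to AutoGluon, review Predicting Columns in a Table - Quick Start to learn the basics of the AutoGluon API.

In this tutorial we cover advanced custom model options that go beyond the topics covered in Adding a custom model to AutoGluon.

This tutorial assumes you have fully read through Adding a custom model to AutoGluon.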

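Loading the data

First we will load the data. For this tutorial, we will use the adult income dataset because it has a mix of integer, float, and categorical features.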

from autogluon.tabular import TabularDataset

train_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/train.csv')  # can be local CSV file as well, returns Pandas DataFrame
test_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/test.csv')  # another Pandas DataFrame
label = 'class'  # specifies which column we want to predict
train_data = train_data.sample(n=1000, random_state=0)  # subsample for faster demo

train_data.head(5)
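       age  workclass  fnlwgt  education     education-num  marital-status      occupation       relationship   race   sex     capital-gain  capital-loss  hours-per-week  native-country  class
6118   51   Private    39264   Some-college  10             Married-civ-spouse  Exec-managerial  Wife           White  Female  0             0             40              United-States   >50K
23204  58   Private    51662   10th          6              Married-civ-spouse  Other-service    Wife           White  Female  0             0             8               United-States   <=50K
29590  40   Private    326310  Some-college  10             Married-civ-spouse  Craft-repair     Husband        White  Male    0             0             44              United-States   <=50K
18116  37   Private    222450  HS-grad       9              Never-married       Sales            Not-in-family  White  Male    0             2339          40              El-Salvador     <=50K
33964  62   Private    109190  Bachelors     13             Married-civ-spouse  Exec-managerial  Husband        White  Male    15024         0             40              United-States   >50K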

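Force features to be passed to models without preprocessing / dropping

You may want to do this if your model logic requires certain columns to always be present, regardless of their content. For example, suppose you are fine-tuning a pre-trained language model that expects a feature indicating the language of the text in a given row, and that feature dictates how the text is preprocessed. If the training data contains only one language, then without this adjustment the language identifier feature would be dropped before the model is fit.

Force features to not be dropped in model-specific preprocessing

To avoid dropping features in a custom model because they have only 1 unique value, add the following _get_default_auxiliary_params method to your custom model class: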

from autogluon.core.models import AbstractModel

class DummyModel(AbstractModel):
    def _fit(self, X, **kwargs):
        print(f'Before {self.__class__.__name__} Preprocessing ({len(X.columns)} features):\n\t{list(X.columns)}')
        X = self.preprocess(X)
        print(f'After  {self.__class__.__name__} Preprocessing ({len(X.columns)} features):\n\t{list(X.columns)}')
        print(X.head(5))

class DummyModelKeepUnique(DummyModel):
    def _get_default_auxiliary_params(self) -> dict:
        default_auxiliary_params = super()._get_default_auxiliary_params()
        extra_auxiliary_params = dict(
            drop_unique=False,  # Whether to drop features that have only 1 unique value, default is True
        )
        default_auxiliary_params.update(extra_auxiliary_params)
        return default_auxiliary_params
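
To make the effect of drop_unique concrete, here is a minimal standard-library sketch of the idea of dropping features that hold a single unique value. It is not AutoGluon's implementation: the helper names (features_with_single_value, drop_constant_features) and the toy columns are hypothetical, and details such as missing-value handling are deliberately left out.

```python
from typing import Dict, List


def features_with_single_value(columns: Dict[str, List[object]]) -> List[str]:
    """Return the names of columns whose values are all identical."""
    return [name for name, values in columns.items() if len(set(values)) <= 1]


def drop_constant_features(
    columns: Dict[str, List[object]], drop_unique: bool = True
) -> Dict[str, List[object]]:
    """Hypothetical stand-in for the drop_unique idea: remove constant columns unless disabled."""
    if not drop_unique:
        return dict(columns)
    constant = set(features_with_single_value(columns))
    return {name: values for name, values in columns.items() if name not in constant}


# Toy data: 'language' has a single unique value, like the language identifier example above.
toy_columns = {
    "age": [51, 58, 40],
    "language": ["en", "en", "en"],
}

kept_default = list(drop_constant_features(toy_columns))
kept_override = list(drop_constant_features(toy_columns, drop_unique=False))

print(kept_default)   # only the non-constant column survives
print(kept_override)  # both columns survive when dropping is disabled
```

With dropping disabled, the constant column stays available to the model, which is the behavior DummyModelKeepUnique asks for through its auxiliary parameters.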

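Force features to not be dropped in global preprocessing

While the above fix for model-specific preprocessing works if the feature is still present after global preprocessing, it won't help if the feature was already dropped before reaching the model. For this, we need to create a new feature generator class that separates the preprocessing logic for normal features from the logic for user override features.

Here is an example implementation: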

# WARNING: To use this in practice, you must put this code in a separate python file
#  from the main process and import it, or else it will not be serializable.
from autogluon.features import BulkFeatureGenerator, AutoMLPipelineFeatureGenerator, IdentityFeatureGenerator


class CustomFeatureGeneratorWithUserOverride(BulkFeatureGenerator):
    def __init__(self, automl_generator_kwargs: dict = None, **kwargs):
        generators = self._get_default_generators(automl_generator_kwargs=automl_generator_kwargs)
        super().__init__(generators=generators, **kwargs)

    def _get_default_generators(self, automl_generator_kwargs: dict = None):
        if automl_generator_kwargs is None:
            automl_generator_kwargs = dict()

        generators = [
            [
                # Preprocessing logic that handles normal features
                AutoMLPipelineFeatureGenerator(banned_feature_special_types=['user_override'], **automl_generator_kwargs),

                # Preprocessing logic that handles special features user wishes to treat separately, here we simply skip preprocessing for these features.
                IdentityFeatureGenerator(infer_features_in_args=dict(required_special_types=['user_override'])),
            ],
        ]
        return generators

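The above code splits the preprocessing logic of a feature depending on whether it is tagged with the 'user_override' special type in the feature metadata. To tag the three features ['age', 'native-country', 'dummy_feature'] in this way, you can do the following: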

# add a useless dummy feature to show that it is not dropped in preprocessing
train_data['dummy_feature'] = 'dummy value'
test_data['dummy_feature'] = 'dummy value'

from autogluon.tabular import FeatureMetadata
feature_metadata = FeatureMetadata.from_df(train_data)

print('Before inserting overrides:')
print(feature_metadata)

feature_metadata = feature_metadata.add_special_types(
    {
        'age': ['user_override'],
        'native-country': ['user_override'],
        'dummy_feature': ['user_override'],
    }
)

print('After inserting overrides:')
print(feature_metadata)
Before inserting overrides:
('int', [])    :  6 | ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', ...]
('object', []) : 10 | ['workclass', 'education', 'marital-status', 'occupation', 'relationship', ...]
After inserting overrides:
('int', [])                   : 5 | ['fnlwgt', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week']
('int', ['user_override'])    : 1 | ['age']
('object', [])                : 8 | ['workclass', 'education', 'marital-status', 'occupation', 'relationship', ...]
('object', ['user_override']) : 2 | ['native-country', 'dummy_feature']

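Note that this is only one example implementation of a custom feature generator with bifurcated preprocessing logic. Users can make their tagging and feature generator logic as complex as they need. In this example, we perform the standard preprocessing on untagged features and pass tagged features through IdentityFeatureGenerator, which is a no-op that does not alter the features in any way. Instead of IdentityFeatureGenerator, you could use any kind of feature generator to suit your needs.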
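
Conceptually, the two generators partition the features by tag: one skips anything tagged 'user_override', the other accepts only tagged features. The sketch below mimics that partition with plain dictionaries. It is an illustration only, not how AutoGluon resolves banned_feature_special_types or required_special_types internally, and split_by_special_type is a hypothetical helper.

```python
from typing import Dict, List, Set, Tuple


def split_by_special_type(
    special_types: Dict[str, Set[str]], tag: str
) -> Tuple[List[str], List[str]]:
    """Partition feature names into (untagged, tagged) for the given special type."""
    untagged = [name for name, tags in special_types.items() if tag not in tags]
    tagged = [name for name, tags in special_types.items() if tag in tags]
    return untagged, tagged


# A subset of the tutorial's features with their special types (empty set means no tag).
special_types = {
    "age": {"user_override"},
    "fnlwgt": set(),
    "workclass": set(),
    "native-country": {"user_override"},
    "dummy_feature": {"user_override"},
}

standard_features, passthrough_features = split_by_special_type(special_types, "user_override")

print(standard_features)     # would go through the standard preprocessing branch
print(passthrough_features)  # would be passed through unchanged
```

Every feature lands in exactly one branch, so swapping the pass-through branch for a different generator changes only how the tagged features are handled.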

Putting it all together

# Separate features and labels
X = train_data.drop(columns=[label])
y = train_data[label]
X_test = test_data.drop(columns=[label])
y_test = test_data[label]

# preprocess the label column, as done in the prior custom model tutorial
from autogluon.core.data import LabelCleaner
from autogluon.core.utils import infer_problem_type
# Construct a LabelCleaner to neatly convert labels to float/integers during model training/inference, can also use to inverse_transform back to original.
problem_type = infer_problem_type(y=y)  # Infer problem type (or else specify directly)
label_cleaner = LabelCleaner.construct(problem_type=problem_type, y=y)
y_preprocessed = label_cleaner.transform(y)
y_test_preprocessed = label_cleaner.transform(y_test)

# Make sure to specify your custom feature metadata to the feature generator
my_custom_feature_generator = CustomFeatureGeneratorWithUserOverride(feature_metadata_in=feature_metadata)

X_preprocessed = my_custom_feature_generator.fit_transform(X)
X_test_preprocessed = my_custom_feature_generator.transform(X_test)

Notice how the user_override features are not preprocessed:

print(list(X_preprocessed.columns))
X_preprocessed.head(5)
['fnlwgt', 'education-num', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'age', 'native-country', 'dummy_feature']
       fnlwgt  education-num  sex  capital-gain  capital-loss  hours-per-week  workclass  education  marital-status  occupation  relationship  race  age  native-country  dummy_feature
6118   39264   10             0    0             0             40              3          14         1               4           5             4     51   United-States   dummy value
23204  51662   6              0    0             0             8               3          0          1               8           5             4     58   United-States   dummy value
29590  326310  10             1    0             0             44              3          14         1               3           0             4     40   United-States   dummy value
18116  222450  9              1    0             2339          40              3          11         3               12          1             4     37   El-Salvador     dummy value
33964  109190  13             1    15024         0             40              3          9          1               4           0             4     62   United-States   dummy value

Now let's see what happens when we send this data to a dummy model for fitting:

dummy_model = DummyModel()
dummy_model.fit(X=X, y=y, feature_metadata=my_custom_feature_generator.feature_metadata)
Before DummyModel Preprocessing (15 features):
	['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'dummy_feature']
After  DummyModel Preprocessing (14 features):
	['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country']
       age workclass  fnlwgt      education  education-num  \
6118    51   Private   39264   Some-college             10   
23204   58   Private   51662           10th              6   
29590   40   Private  326310   Some-college             10   
18116   37   Private  222450        HS-grad              9   
33964   62   Private  109190      Bachelors             13   

            marital-status        occupation    relationship    race      sex  \
6118    Married-civ-spouse   Exec-managerial            Wife   White   Female   
23204   Married-civ-spouse     Other-service            Wife   White   Female   
29590   Married-civ-spouse      Craft-repair         Husband   White     Male   
18116        Never-married             Sales   Not-in-family   White     Male   
33964   Married-civ-spouse   Exec-managerial         Husband   White     Male   

       capital-gain  capital-loss  hours-per-week  native-country  
6118              0             0              40   United-States  
23204             0             0               8   United-States  
29590             0             0              44   United-States  
18116             0          2339              40     El-Salvador  
33964         15024             0              40   United-States
<__main__.DummyModel at 0x7f712f7cc750>

Notice how the model dropped dummy_feature during the preprocess call. Now let's see what happens if we use DummyModelKeepUnique:

dummy_model_keep_unique = DummyModelKeepUnique()
dummy_model_keep_unique.fit(X=X, y=y, feature_metadata=my_custom_feature_generator.feature_metadata)
Before DummyModelKeepUnique Preprocessing (15 features):
	['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'dummy_feature']
After  DummyModelKeepUnique Preprocessing (15 features):
	['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'dummy_feature']
       age workclass  fnlwgt      education  education-num  \
6118    51   Private   39264   Some-college             10   
23204   58   Private   51662           10th              6   
29590   40   Private  326310   Some-college             10   
18116   37   Private  222450        HS-grad              9   
33964   62   Private  109190      Bachelors             13   

            marital-status        occupation    relationship    race      sex  \
6118    Married-civ-spouse   Exec-managerial            Wife   White   Female   
23204   Married-civ-spouse     Other-service            Wife   White   Female   
29590   Married-civ-spouse      Craft-repair         Husband   White     Male   
18116        Never-married             Sales   Not-in-family   White     Male   
33964   Married-civ-spouse   Exec-managerial         Husband   White     Male   

       capital-gain  capital-loss  hours-per-week  native-country  \
6118              0             0              40   United-States   
23204             0             0               8   United-States   
29590             0             0              44   United-States   
18116             0          2339              40     El-Salvador   
33964         15024             0              40   United-States   

      dummy_feature  
6118    dummy value  
23204   dummy value  
29590   dummy value  
18116   dummy value  
33964   dummy value
<__main__.DummyModelKeepUnique at 0x7f712f923f50>

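Now dummy_feature is no longer dropped!

The above code logic can be reused to test your own complex model implementations: simply replace DummyModelKeepUnique with your custom model and check that it keeps the features you wish to use.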
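
When checking your own model this way, comparing the column lists before and after preprocessing by eye gets tedious. A small helper along the following lines can report what was dropped. The function name dropped_features is hypothetical; the two lists are copied from the DummyModel output above.

```python
from typing import List


def dropped_features(before: List[str], after: List[str]) -> List[str]:
    """Return the features present before preprocessing but missing afterwards, in original order."""
    after_set = set(after)
    return [name for name in before if name not in after_set]


# Column lists taken from the DummyModel run shown earlier.
before = ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status',
          'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss',
          'hours-per-week', 'native-country', 'dummy_feature']
after = ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status',
         'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss',
         'hours-per-week', 'native-country']

missing = dropped_features(before, after)
print(missing)  # ['dummy_feature']
```

In practice you would pass list(X.columns) captured before and after self.preprocess(X), as the print statements in DummyModel._fit already do.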

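Keeping features via TabularPredictor

Now let's demonstrate how to do this via TabularPredictor in far fewer lines of code. Note that this code will raise an exception if run in this tutorial, because the custom model and feature generator must live in separate files to be serializable. Therefore, we will not run this code in the tutorial. (It will also raise an exception because DummyModel isn't a real model.)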

from autogluon.tabular import TabularPredictor

feature_generator = CustomFeatureGeneratorWithUserOverride()
predictor = TabularPredictor(label=label)
predictor.fit(
    train_data=train_data,
    feature_metadata=feature_metadata,  # feature metadata with your overrides
    feature_generator=feature_generator,  # your custom feature generator that handles the overrides
    hyperparameters={
        'GBM': {},  # Can fit your custom model alongside default models
        DummyModel: {},  # Will drop dummy_feature
        DummyModelKeepUnique: {},  # Will not drop dummy_feature
        # DummyModel: {'ag_args_fit': {'drop_unique': False}},  # This is another way to get same result as using DummyModelKeepUnique
    }
)
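
The requirement that custom classes live in a separate file can be illustrated with the standard library alone, under the assumption that serialization here is pickle-based. Pickle stores a class by its module and name rather than by its source, so the class must be importable wherever the object is loaded. The module name my_custom_module and the class MyCustomThing below are hypothetical placeholders for your own model or feature generator file.

```python
import importlib
import pickle
import sys
import tempfile
from pathlib import Path

# Source of a hypothetical module that would hold your custom classes.
MODULE_SOURCE = '''
class MyCustomThing:
    def __init__(self, name):
        self.name = name
'''

with tempfile.TemporaryDirectory() as tmp_dir:
    # Write the class to its own file and make that file importable.
    Path(tmp_dir, "my_custom_module.py").write_text(MODULE_SOURCE, encoding="utf-8")
    sys.path.insert(0, tmp_dir)
    try:
        importlib.invalidate_caches()
        module = importlib.import_module("my_custom_module")

        obj = module.MyCustomThing(name="keep_unique")
        payload = pickle.dumps(obj)

        # The payload references the module by name instead of embedding the class body.
        stored_by_reference = b"my_custom_module" in payload

        # Loading works because the module can be imported by that name.
        restored = pickle.loads(payload)
    finally:
        # Clean up the temporary import path and cached module.
        sys.path.remove(tmp_dir)
        sys.modules.pop("my_custom_module", None)

print(stored_by_reference, restored.name)
```

A class defined in a notebook or script belongs to __main__, which is a different module in every process, so an object pickled there may fail to load elsewhere. Importing the class from a regular module avoids that.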