AutoGluon Tabular - Feature Engineering
Introduction
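Feature engineering takes raw tabular data and does two things: it converts the data into a format that machine learning models can ingest, and it tries to enhance some of the columns (called "features" in machine learning terminology) to give the models more information, in the hope of getting more accurate results.
AutoGluon does some of this work for you. This document describes how that works and how you can extend it. It describes the default behavior, most of which is configurable, and gives pointers on how to change it.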
Column Types
AutoGluon Tabular recognizes the following feature types and processes each of them separately:
Feature Type | Example Values |
---|---|
boolean | A, B |
numerical | 1.3, 2.0, -1.6 |
categorical | Red, Blue, Yellow |
datetime | 2021/1/31, March 31 |
text | Mary had a little lamb |
In addition, other AutoGluon prediction modules recognize additional feature types. These can also be enabled in AutoGluon Tabular by using the multimodal option.
Feature Type | Example Values |
---|---|
image | path/image123.png |
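Column Type Detection
Boolean columns are any columns with only 2 unique values.
Any string columns are treated as categorical unless they are text (see below). Some models perform better if you tell them which columns are categorical and which are continuous.
Numeric columns are passed through unchanged, apart from being identified as float or int. Currently, numeric columns are not tested to determine whether they might be categorical. You can force them to be treated as categorical with the Pandas syntax .astype("category"), as shown in the example further below.
Text columns are detected by first checking whether most rows are unique. If they are, and multiple separate words are detected in most rows, the column is a text column. For details, see common/features/infer_types.py in the source code.
Datetime columns are detected by trying to convert them to Pandas datetimes. Pandas recognizes many datetime formats. If many of the values in a column convert successfully, the column is treated as datetime. Currently, datetimes that look purely numeric (e.g. 20210530) are not detected correctly. Any NaN values are set to the column mean. For details, see common/features/infer_types.py.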
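The string-column rules above can be sketched in plain Python. This is only an illustration of the described heuristics, not the code in common/features/infer_types.py: the function name and both threshold parameters are assumptions made up for this example, and "most rows" is interpreted here as "more than half".

```python
from typing import Optional, Sequence


def infer_string_column_type(
    values: Sequence[Optional[str]],
    unique_ratio_threshold: float = 0.5,     # assumed cutoff, not AutoGluon's
    multiword_ratio_threshold: float = 0.5,  # assumed cutoff, not AutoGluon's
) -> str:
    """Classify a string column as 'boolean', 'text', or 'category'."""
    present = [v for v in values if v is not None]
    if not present:
        return "category"
    unique = set(present)
    # Rule 1: exactly two distinct values means boolean.
    if len(unique) == 2:
        return "boolean"
    # Rule 2: mostly unique rows that mostly contain several words mean text.
    unique_ratio = len(unique) / len(present)
    multiword_ratio = sum(len(v.split()) > 1 for v in present) / len(present)
    if unique_ratio > unique_ratio_threshold and multiword_ratio > multiword_ratio_threshold:
        return "text"
    # Rule 3: every other string column is categorical.
    return "category"


colors = ["red", "blue", "yellow", "red", "blue", "red"]
flags = ["A", "B", "A", "A", "B"]
sentences = ["Mary had a little lamb", "its fleece was white", "as snow", "everywhere"]

print(infer_string_column_type(colors))     # category
print(infer_string_column_type(flags))      # boolean
print(infer_string_column_type(sentences))  # text
```

The order of the checks matters in this sketch: a two-valued column is reported as boolean even if its values happen to be multi-word strings.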
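Problem Type Detection
If the user does not specify whether the problem is classification or regression, the "label" column is examined to try to guess. Two signs point to regression: the values are non-integer floating point numbers, and there are a large number of unique values. Within classification, both multiclass and binary (n=2 categories) problems are detected. For details, see utils/utils.py.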
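As a rough illustration of the guess described above, the sketch below classifies a list of label values. It is not the logic in utils/utils.py: the function name and the many_unique_threshold cutoff are hypothetical, chosen only to make the two regression signals concrete.

```python
from typing import Sequence


def guess_problem_type(labels: Sequence[object], many_unique_threshold: int = 20) -> str:
    """Guess 'binary', 'multiclass', or 'regression' from label values.

    many_unique_threshold is an assumed cutoff used only for illustration.
    """
    unique = set(labels)
    # Two distinct label values: binary classification.
    if len(unique) == 2:
        return "binary"
    # Regression signals: non-integer floats and many distinct values.
    has_fractional = any(isinstance(v, float) and not v.is_integer() for v in unique)
    if has_fractional and len(unique) > many_unique_threshold:
        return "regression"
    return "multiclass"


print(guess_problem_type([0, 1, 1, 0]))                      # binary
print(guess_problem_type(["cat", "dog", "bird", "cat"]))     # multiclass
print(guess_problem_type([i * 0.37 for i in range(100)]))    # regression
```

A heuristic like this can misfire, for example on integer-valued regression targets, which is why an explicit problem type is the safer choice when the task is known.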
To override the automatic inference, explicitly pass problem_type (one of "binary", "regression", or "multiclass") to TabularPredictor(). For example:
predictor = TabularPredictor(label='class', problem_type='multiclass').fit(train_data)
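Automatic Feature Engineering
Numerical Columns
Numeric columns, both integer and floating point, currently receive no automated feature engineering.
Categorical Columns
Because many downstream models require categories to be encoded as integers, each categorical feature is mapped to monotonically increasing integers.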
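The integer mapping can be illustrated with a small plain-Python sketch. The helper names are hypothetical, and assigning codes in sorted order is an assumption of this example; it is not a statement about how CategoryFeatureGenerator orders categories internally.

```python
from typing import Dict, List, Optional, Sequence


def fit_category_codes(values: Sequence[Optional[str]]) -> Dict[str, int]:
    """Assign each distinct category an integer code 0, 1, 2, ... in sorted order."""
    categories = sorted({v for v in values if v is not None})
    return {category: code for code, category in enumerate(categories)}


def encode(values: Sequence[Optional[str]], mapping: Dict[str, int]) -> List[Optional[int]]:
    """Replace categories by their codes; missing or unseen values become None."""
    return [mapping.get(v) for v in values]


column_d = ["y", "x", "v", "y", "w"]
mapping = fit_category_codes(column_d)
print(mapping)                    # {'v': 0, 'w': 1, 'x': 2, 'y': 3}
print(encode(column_d, mapping))  # [3, 2, 0, 3, 1]
```

The mapping is learned once and then reused, so values that were not seen while fitting have no code and come back as missing.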
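Datetime Columns
Columns recognized as datetime are converted into several features:
A numerical Pandas datetime. Note that its minimum and maximum values are given by pandas.Timestamp.min and pandas.Timestamp.max respectively, which may affect dates very far in the past or future.
Several extracted columns, by default [year, month, day, dayofweek]. This is configurable via DatetimeFeatureGenerator.
Note that missing, invalid, and out-of-range features generated by the above logic are converted to the mean value across all valid rows.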
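The datetime expansion can be sketched with the standard library alone. This is an illustrative approximation rather than DatetimeFeatureGenerator itself: the function name is hypothetical, naive datetimes are assumed to be UTC, sub-second precision is ignored, at least one valid value is assumed, and deriving the calendar parts of a missing row from the mean timestamp is an assumption of this sketch.

```python
import calendar
from datetime import datetime, timezone
from typing import Dict, List, Optional

NANOS_PER_SECOND = 1_000_000_000


def datetime_features(values: List[Optional[datetime]], name: str = "C") -> Dict[str, List[int]]:
    """Expand naive datetimes (assumed UTC) into integer feature columns."""
    # Nanoseconds since the Unix epoch for each valid value.
    nanos = [
        calendar.timegm(v.timetuple()) * NANOS_PER_SECOND if v is not None else None
        for v in values
    ]
    valid = [n for n in nanos if n is not None]
    mean_nanos = sum(valid) // len(valid)  # integer mean over the valid rows
    # Missing rows take the mean timestamp of the valid rows.
    filled = [mean_nanos if n is None else n for n in nanos]
    parts = [datetime.fromtimestamp(n // NANOS_PER_SECOND, tz=timezone.utc) for n in filled]
    return {
        name: filled,
        f"{name}.year": [p.year for p in parts],
        f"{name}.month": [p.month for p in parts],
        f"{name}.day": [p.day for p in parts],
        f"{name}.dayofweek": [p.weekday() for p in parts],  # Monday=0 ... Sunday=6
    }


features = datetime_features([datetime(2000, 1, 1), None, datetime(1999, 12, 30)])
print(features["C"])            # [946684800000000000, 946598400000000000, 946512000000000000]
print(features["C.dayofweek"])  # [5, 4, 3]
```

The middle row was missing, so it receives the mean of the two valid timestamps, which falls on 1999-12-31.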
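Text Columns
If the multimodal option is enabled, text columns are processed using a full Transformer neural network model with pretrained NLP models.
Otherwise, they are processed in two simpler ways:
An n-gram feature generator extracts n-grams (short strings) from the text features, adding many additional columns, one per n-gram feature. These columns are "n-hot" encoded: the value is 1 or more if the original feature contains the n-gram one or more times, and 0 otherwise. By default, all text columns are concatenated before this stage is applied, and the n-grams are individual words rather than substrings of words. You can configure this via the TextNgramFeatureGenerator class. The n-gram generation is done in generators/text_ngram.py.
Some additional numerical features are calculated, such as word count, character count, and the proportion of uppercase characters. This is configurable via TextSpecialFeatureGenerator. This work is done in generators/text_special.py.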
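Both text treatments can be approximated in a few lines of plain Python. This is a sketch under stated assumptions, not the code in generators/text_ngram.py or generators/text_special.py: the function names are hypothetical, the vocabulary is passed in by hand, and the summary column counts how many vocabulary words are present, which is what the _total_ column appears to hold in the example output later in this document.

```python
from collections import Counter
from typing import Dict, List


def word_count_features(texts: List[str], vocabulary: List[str]) -> Dict[str, List[int]]:
    """Build one count column per vocabulary word, plus a summary column."""
    columns: Dict[str, List[int]] = {f"__nlp__.{word}": [] for word in vocabulary}
    columns["__nlp__._total_"] = []
    for text in texts:
        counts = Counter(text.lower().split())  # word-level 1-grams
        present = 0
        for word in vocabulary:
            columns[f"__nlp__.{word}"].append(counts[word])
            present += 1 if counts[word] > 0 else 0
        columns["__nlp__._total_"].append(present)
    return columns


def special_features(text: str) -> Dict[str, float]:
    """Compute a few summary statistics of a single text value."""
    return {
        "char_count": len(text),
        "word_count": len(text.split()),
        "capital_ratio": sum(ch.isupper() for ch in text) / len(text) if text else 0.0,
    }


texts = ["abc abc ef jkl", "ef ef d ghi"]
vocabulary = ["abc", "ef", "ghi", "jkl"]  # the single-letter word "d" is left out
ngrams = word_count_features(texts, vocabulary)
print(ngrams["__nlp__.abc"])      # [2, 0]
print(ngrams["__nlp__._total_"])  # [3, 2]
print(special_features("Mary had a little lamb")["word_count"])  # 5
```

The summary statistics here are raw values; in the training log shown later the corresponding columns carry a "binned" tag, so the numbers in that output are not directly comparable.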
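Additional Processing
Columns containing only 1 value are dropped before being passed to the models.
Columns that duplicate the values of other columns are removed before being passed to the models.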
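These two clean-up rules can be sketched on a plain dictionary of columns. The function name is hypothetical and the sketch assumes hashable cell values; it shows the idea only and is not the implementation of the DropUniqueFeatureGenerator or DropDuplicatesFeatureGenerator stages that appear in the logs.

```python
from typing import Dict, List


def drop_uninformative_columns(table: Dict[str, List[object]]) -> Dict[str, List[object]]:
    """Drop constant columns and columns that repeat an earlier column."""
    kept: Dict[str, List[object]] = {}
    seen = set()
    for name, column in table.items():
        # A column with a single distinct value carries no information.
        if len(set(column)) <= 1:
            continue
        # A column identical to one already kept is redundant.
        key = tuple(column)
        if key in seen:
            continue
        seen.add(key)
        kept[name] = column
    return kept


table = {
    "a": [1, 2, 3],
    "b": [7, 7, 7],        # constant
    "c": [1, 2, 3],        # duplicate of "a"
    "d": ["x", "y", "x"],
}
result = drop_uninformative_columns(table)
print(list(result))  # ['a', 'd']
```

When two columns are identical, this sketch keeps whichever appears first.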
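Feature Engineering Example
By default, a feature generator called AutoMLPipelineFeatureGenerator is used. Let's see it in action. We will create a dataframe containing a floating point column, an integer column, a datetime column, a categorical column, and a text column. First, let's look at the raw data we created.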
from autogluon.tabular import TabularDataset, TabularPredictor
import pandas as pd
import numpy as np
import random
from sklearn.datasets import make_regression
from datetime import datetime
x, y = make_regression(n_samples = 100,n_features = 5,n_targets = 1, random_state = 1)
dfx = pd.DataFrame(x, columns=['A','B','C','D','E'])
dfy = pd.DataFrame(y, columns=['label'])
# Create an integer column, a datetime column, a categorical column and a string column to demonstrate how they are processed.
dfx['B'] = (dfx['B']).astype(int)
dfx['C'] = datetime(2000,1,1) + pd.to_timedelta(dfx['C'].astype(int), unit='D')
dfx['D'] = pd.cut(dfx['D'] * 10, [-np.inf,-5,0,5,np.inf],labels=['v','w','x','y'])
dfx['E'] = pd.Series(list(' '.join(random.choice(["abc", "d", "ef", "ghi", "jkl"]) for i in range(4)) for j in range(100)))
dataset=TabularDataset(dfx)
print(dfx)
A B C D E
0 -0.545774 0 2000-01-01 y abc abc ef jkl
1 -0.468674 0 2000-01-02 x ef ef d ghi
2 1.767960 0 1999-12-31 v ghi d ef d
3 -0.118771 1 2000-01-01 y d ghi abc ghi
4 0.630196 0 1999-12-31 w ghi ghi ef ef
.. ... .. ... .. ...
95 -1.182318 -1 2000-01-01 v jkl abc jkl d
96 0.562761 0 2000-01-01 v d abc ghi ef
97 -0.797270 0 2000-01-01 w ghi d ghi d
98 0.502741 0 1999-12-31 y jkl ef ef d
99 2.056356 0 1999-12-30 w ef ghi jkl ef
[100 rows x 5 columns]
Now let's call the default feature generator, AutoMLPipelineFeatureGenerator, with no arguments and see what it does.
from autogluon.features.generators import AutoMLPipelineFeatureGenerator
auto_ml_pipeline_feature_generator = AutoMLPipelineFeatureGenerator()
auto_ml_pipeline_feature_generator.fit_transform(X=dfx)
index | A | B | D | E | C | C.year | C.month | C.day | C.dayofweek | E.char_count | E.symbol_ratio. | __nlp__.abc | __nlp__.ef | __nlp__.ghi | __nlp__.jkl | __nlp__._total_ |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | -0.545774 | 0 | 3 | NaN | 946684800000000000 | 2000 | 1 | 1 | 5 | 5 | 1 | 2 | 1 | 0 | 1 | 3 |
1 | -0.468674 | 0 | 2 | NaN | 946771200000000000 | 2000 | 1 | 2 | 6 | 2 | 4 | 0 | 2 | 1 | 0 | 2 |
2 | 1.767960 | 0 | 0 | NaN | 946598400000000000 | 1999 | 12 | 31 | 4 | 1 | 5 | 0 | 1 | 1 | 0 | 2 |
3 | -0.118771 | 1 | 3 | NaN | 946684800000000000 | 2000 | 1 | 1 | 5 | 4 | 2 | 1 | 0 | 2 | 0 | 2 |
4 | 0.630196 | 0 | 1 | NaN | 946598400000000000 | 1999 | 12 | 31 | 4 | 4 | 2 | 0 | 2 | 2 | 0 | 2 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
95 | -1.182318 | -1 | 0 | NaN | 946684800000000000 | 2000 | 1 | 1 | 5 | 4 | 2 | 1 | 0 | 0 | 2 | 2 |
96 | 0.562761 | 0 | 0 | 1 | 946684800000000000 | 2000 | 1 | 1 | 5 | 3 | 3 | 1 | 1 | 1 | 0 | 3 |
97 | -0.797270 | 0 | 1 | NaN | 946684800000000000 | 2000 | 1 | 1 | 5 | 2 | 4 | 0 | 0 | 2 | 0 | 1 |
98 | 0.502741 | 0 | 3 | NaN | 946598400000000000 | 1999 | 12 | 31 | 4 | 2 | 4 | 0 | 2 | 0 | 1 | 2 |
99 | 2.056356 | 0 | 1 | NaN | 946512000000000000 | 1999 | 12 | 30 | 3 | 4 | 2 | 0 | 2 | 1 | 1 | 3 |
100 rows × 16 columns
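We can see that:
The floating point and integer columns "A" and "B" are unchanged.
The datetime column "C" has been converted to a raw value (in nanoseconds) and also parsed into additional columns for year, month, day, and day of week.
The string categorical column "D" has been mapped one-to-one to integers, since many models accept only numerical input.
The free-form text column has been mapped to some summary features (e.g. "char_count") and to an N-hot matrix indicating whether each text contains each word.
To get more details, we should call the pipeline as part of TabularPredictor.fit(). Since fit() expects a single dataframe, we need to combine the dfx and dfy DataFrames.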
df = pd.concat([dfx, dfy], axis=1)
predictor = TabularPredictor(label='label')
predictor.fit(df, hyperparameters={'GBM' : {}}, feature_generator=auto_ml_pipeline_feature_generator)
No path specified. Models will be saved in: "AutogluonModels/ag-20250508_205927"
Verbosity: 2 (Standard Logging)
=================== System Info ===================
AutoGluon Version: 1.3.1b20250508
Python Version: 3.11.9
Operating System: Linux
Platform Machine: x86_64
Platform Version: #1 SMP Wed Mar 12 14:53:59 UTC 2025
CPU Count: 8
Memory Avail: 28.77 GB / 30.95 GB (93.0%)
Disk Space Avail: 211.75 GB / 255.99 GB (82.7%)
===================================================
No presets specified! To achieve strong results with AutoGluon, it is recommended to use the available presets. Defaulting to `'medium'`...
Recommended Presets (For more details refer to https://autogluon.cn/stable/tutorials/tabular/tabular-essentials.html#presets):
presets='experimental' : New in v1.2: Pre-trained foundation model + parallel fits. The absolute best accuracy without consideration for inference speed. Does not support GPU.
presets='best' : Maximize accuracy. Recommended for most users. Use in competitions and benchmarks.
presets='high' : Strong accuracy with fast inference speed.
presets='good' : Good accuracy with very fast inference speed.
presets='medium' : Fast training time, ideal for initial prototyping.
Beginning AutoGluon training ...
AutoGluon will save models to "/home/ci/autogluon/docs/tutorials/tabular/AutogluonModels/ag-20250508_205927"
Train Data Rows: 100
Train Data Columns: 5
Label Column: label
AutoGluon infers your prediction problem is: 'regression' (because dtype of label-column == float and many unique label-values observed).
Label info (max, min, mean, stddev): (186.98105511749836, -267.99365510467214, 9.38193, 71.29287)
If 'regression' is not the correct problem_type, please manually specify the problem_type parameter during Predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression', 'quantile'])
Problem Type: regression
Preprocessing data ...
Using Feature Generators to preprocess the data ...
AutoMLPipelineFeatureGenerator is already fit, so the training data will be processed via .transform() instead of .fit_transform().
Types of features in original data (raw dtype, special dtypes):
('category', []) : 1 | ['D']
('datetime', []) : 1 | ['C']
('float', []) : 1 | ['A']
('int', []) : 1 | ['B']
('object', ['text']) : 1 | ['E']
Types of features in processed data (raw dtype, special dtypes):
('category', []) : 1 | ['D']
('category', ['text_as_category']) : 1 | ['E']
('float', []) : 1 | ['A']
('int', []) : 1 | ['B']
('int', ['binned', 'text_special']) : 2 | ['E.char_count', 'E.symbol_ratio. ']
('int', ['datetime_as_int']) : 5 | ['C', 'C.year', 'C.month', 'C.day', 'C.dayofweek']
('int', ['text_ngram']) : 5 | ['__nlp__.abc', '__nlp__.ef', '__nlp__.ghi', '__nlp__.jkl', '__nlp__._total_']
Data preprocessing and feature engineering runtime = 0.03s ...
AutoGluon will gauge predictive performance using evaluation metric: 'root_mean_squared_error'
This metric's sign has been flipped to adhere to being higher_is_better. The metric score can be multiplied by -1 to get the metric value.
To change this, specify the eval_metric parameter of Predictor()
Automatically generating train/validation split with holdout_frac=0.2, Train Rows: 80, Val Rows: 20
User-specified model hyperparameters to be fit:
{
'GBM': [{}],
}
Fitting 1 L1 models, fit_strategy="sequential" ...
Fitting model: LightGBM ...
-60.6688 = Validation score (-root_mean_squared_error)
0.25s = Training runtime
0.0s = Validation runtime
Fitting model: WeightedEnsemble_L2 ...
Ensemble Weights: {'LightGBM': 1.0}
-60.6688 = Validation score (-root_mean_squared_error)
0.0s = Training runtime
0.0s = Validation runtime
AutoGluon training complete, total runtime = 0.32s ... Best model: WeightedEnsemble_L2 | Estimated inference throughput: 6127.1 rows/s (20 batch size)
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("/home/ci/autogluon/docs/tutorials/tabular/AutogluonModels/ag-20250508_205927")
<autogluon.tabular.predictor.predictor.TabularPredictor at 0x7feda6a86550>
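Reading the output, note that:
The string categorical column "D", despite being mapped to integers, is still recognized as categorical.
The integer column "B" has not been identified as categorical, even though it has only a few unique values.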
print(len(set(dfx['B'])))
5
To have it treated as categorical, we can explicitly cast it to the category dtype in the original dataframe:
dfx["B"] = dfx["B"].astype("category")
auto_ml_pipeline_feature_generator = AutoMLPipelineFeatureGenerator()
auto_ml_pipeline_feature_generator.fit_transform(X=dfx)
Fitting AutoMLPipelineFeatureGenerator...
Available Memory: 29440.06 MB
Train Data (Original) Memory Usage: 0.01 MB (0.0% of available memory)
Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
Stage 1 Generators:
Fitting AsTypeFeatureGenerator...
Stage 2 Generators:
Fitting FillNaFeatureGenerator...
Stage 3 Generators:
Fitting IdentityFeatureGenerator...
Fitting CategoryFeatureGenerator...
Fitting CategoryMemoryMinimizeFeatureGenerator...
Fitting DatetimeFeatureGenerator...
Fitting TextSpecialFeatureGenerator...
Fitting BinnedFeatureGenerator...
Fitting DropDuplicatesFeatureGenerator...
Fitting TextNgramFeatureGenerator...
Fitting CountVectorizer for text features: ['E']
CountVectorizer fit with vocabulary size = 4
Stage 4 Generators:
Fitting DropUniqueFeatureGenerator...
Stage 5 Generators:
Fitting DropDuplicatesFeatureGenerator...
Types of features in original data (raw dtype, special dtypes):
('category', []) : 2 | ['B', 'D']
('datetime', []) : 1 | ['C']
('float', []) : 1 | ['A']
('object', ['text']) : 1 | ['E']
Types of features in processed data (raw dtype, special dtypes):
('category', []) : 2 | ['B', 'D']
('category', ['text_as_category']) : 1 | ['E']
('float', []) : 1 | ['A']
('int', ['binned', 'text_special']) : 2 | ['E.char_count', 'E.symbol_ratio. ']
('int', ['datetime_as_int']) : 5 | ['C', 'C.year', 'C.month', 'C.day', 'C.dayofweek']
('int', ['text_ngram']) : 5 | ['__nlp__.abc', '__nlp__.ef', '__nlp__.ghi', '__nlp__.jkl', '__nlp__._total_']
0.1s = Fit runtime
5 features in original data used to generate 16 features in processed data.
Train Data (Processed) Memory Usage: 0.01 MB (0.0% of available memory)
index | A | B | D | E | C | C.year | C.month | C.day | C.dayofweek | E.char_count | E.symbol_ratio. | __nlp__.abc | __nlp__.ef | __nlp__.ghi | __nlp__.jkl | __nlp__._total_ |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | -0.545774 | 1 | 3 | NaN | 946684800000000000 | 2000 | 1 | 1 | 5 | 5 | 1 | 2 | 1 | 0 | 1 | 3 |
1 | -0.468674 | 1 | 2 | NaN | 946771200000000000 | 2000 | 1 | 2 | 6 | 2 | 4 | 0 | 2 | 1 | 0 | 2 |
2 | 1.767960 | 1 | 0 | NaN | 946598400000000000 | 1999 | 12 | 31 | 4 | 1 | 5 | 0 | 1 | 1 | 0 | 2 |
3 | -0.118771 | 2 | 3 | NaN | 946684800000000000 | 2000 | 1 | 1 | 5 | 4 | 2 | 1 | 0 | 2 | 0 | 2 |
4 | 0.630196 | 1 | 1 | NaN | 946598400000000000 | 1999 | 12 | 31 | 4 | 4 | 2 | 0 | 2 | 2 | 0 | 2 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
95 | -1.182318 | 0 | 0 | NaN | 946684800000000000 | 2000 | 1 | 1 | 5 | 4 | 2 | 1 | 0 | 0 | 2 | 2 |
96 | 0.562761 | 1 | 0 | 1 | 946684800000000000 | 2000 | 1 | 1 | 5 | 3 | 3 | 1 | 1 | 1 | 0 | 3 |
97 | -0.797270 | 1 | 1 | NaN | 946684800000000000 | 2000 | 1 | 1 | 5 | 2 | 4 | 0 | 0 | 2 | 0 | 1 |
98 | 0.502741 | 1 | 3 | NaN | 946598400000000000 | 1999 | 12 | 31 | 4 | 2 | 4 | 0 | 2 | 0 | 1 | 2 |
99 | 2.056356 | 1 | 1 | NaN | 946512000000000000 | 1999 | 12 | 30 | 3 | 4 | 2 | 0 | 2 | 1 | 1 | 3 |
100 rows × 16 columns
Missing Value Handling
To illustrate missing value handling, let's set the first row to all NaNs:
dfx.iloc[0] = np.nan
dfx.head()
index | A | B | C | D | E |
---|---|---|---|---|---|
0 | NaN | NaN | NaT | NaN | NaN |
1 | -0.468674 | 0 | 2000-01-02 | x | ef ef d ghi |
2 | 1.767960 | 0 | 1999-12-31 | v | ghi d ef d |
3 | -0.118771 | 1 | 2000-01-01 | y | d ghi abc ghi |
4 | 0.630196 | 0 | 1999-12-31 | w | ghi ghi ef ef |
Now if we reprocess:
auto_ml_pipeline_feature_generator = AutoMLPipelineFeatureGenerator()
auto_ml_pipeline_feature_generator.fit_transform(X=dfx)
Fitting AutoMLPipelineFeatureGenerator...
Available Memory: 29440.05 MB
Train Data (Original) Memory Usage: 0.01 MB (0.0% of available memory)
Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
Stage 1 Generators:
Fitting AsTypeFeatureGenerator...
Stage 2 Generators:
Fitting FillNaFeatureGenerator...
Stage 3 Generators:
Fitting IdentityFeatureGenerator...
Fitting CategoryFeatureGenerator...
Fitting CategoryMemoryMinimizeFeatureGenerator...
Fitting DatetimeFeatureGenerator...
Fitting TextSpecialFeatureGenerator...
Fitting BinnedFeatureGenerator...
Fitting DropDuplicatesFeatureGenerator...
Fitting TextNgramFeatureGenerator...
Fitting CountVectorizer for text features: ['E']
CountVectorizer fit with vocabulary size = 4
Stage 4 Generators:
Fitting DropUniqueFeatureGenerator...
Stage 5 Generators:
Fitting DropDuplicatesFeatureGenerator...
Types of features in original data (raw dtype, special dtypes):
('category', []) : 2 | ['B', 'D']
('datetime', []) : 1 | ['C']
('float', []) : 1 | ['A']
('object', ['text']) : 1 | ['E']
Types of features in processed data (raw dtype, special dtypes):
('category', []) : 2 | ['B', 'D']
('category', ['text_as_category']) : 1 | ['E']
('float', []) : 1 | ['A']
('int', ['binned', 'text_special']) : 3 | ['E.char_count', 'E.word_count', 'E.symbol_ratio. ']
('int', ['datetime_as_int']) : 5 | ['C', 'C.year', 'C.month', 'C.day', 'C.dayofweek']
('int', ['text_ngram']) : 5 | ['__nlp__.abc', '__nlp__.ef', '__nlp__.ghi', '__nlp__.jkl', '__nlp__._total_']
4.3s = Fit runtime
5 features in original data used to generate 17 features in processed data.
Train Data (Processed) Memory Usage: 0.01 MB (0.0% of available memory)
index | A | B | D | E | C | C.year | C.month | C.day | C.dayofweek | E.char_count | E.word_count | E.symbol_ratio. | __nlp__.abc | __nlp__.ef | __nlp__.ghi | __nlp__.jkl | __nlp__._total_ |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | NaN | NaN | NaN | NaN | 946687418181818240 | 2000 | 1 | 1 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | -0.468674 | 1 | 2 | NaN | 946771200000000000 | 2000 | 1 | 2 | 6 | 3 | 1 | 5 | 0 | 2 | 1 | 0 | 2 |
2 | 1.767960 | 1 | 0 | NaN | 946598400000000000 | 1999 | 12 | 31 | 4 | 2 | 1 | 6 | 0 | 1 | 1 | 0 | 2 |
3 | -0.118771 | 2 | 3 | NaN | 946684800000000000 | 2000 | 1 | 1 | 5 | 5 | 1 | 3 | 1 | 0 | 2 | 0 | 2 |
4 | 0.630196 | 1 | 1 | NaN | 946598400000000000 | 1999 | 12 | 31 | 4 | 5 | 1 | 3 | 0 | 2 | 2 | 0 | 2 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
95 | -1.182318 | 0 | 0 | NaN | 946684800000000000 | 2000 | 1 | 1 | 5 | 5 | 1 | 3 | 1 | 0 | 0 | 2 | 2 |
96 | 0.562761 | 1 | 0 | 1 | 946684800000000000 | 2000 | 1 | 1 | 5 | 4 | 1 | 4 | 1 | 1 | 1 | 0 | 3 |
97 | -0.797270 | 1 | 1 | NaN | 946684800000000000 | 2000 | 1 | 1 | 5 | 3 | 1 | 5 | 0 | 0 | 2 | 0 | 1 |
98 | 0.502741 | 1 | 3 | NaN | 946598400000000000 | 1999 | 12 | 31 | 4 | 3 | 1 | 5 | 0 | 2 | 0 | 1 | 2 |
99 | 2.056356 | 1 | 1 | NaN | 946512000000000000 | 1999 | 12 | 30 | 3 | 5 | 1 | 3 | 0 | 2 | 1 | 1 | 3 |
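100 rows × 17 columns
We see that the floating point, integer (cast to category earlier), categorical, and text fields "A", "B", "D", and "E" have retained their NaN values, but the datetime column "C" has been set to the mean of the non-NaN values.
Customization of Feature Engineering
To customize your feature generation pipeline, it is recommended to call PipelineFeatureGenerator, passing non-default parameters to the other feature generators as required. For example, if we think downstream models would benefit from removing rare categorical values and replacing them with NaN, we can supply the maximum_num_cat parameter to CategoryFeatureGenerator, as shown below: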
from autogluon.features.generators import PipelineFeatureGenerator, CategoryFeatureGenerator, IdentityFeatureGenerator
from autogluon.common.features.types import R_INT, R_FLOAT
mypipeline = PipelineFeatureGenerator(
    generators=[[
        CategoryFeatureGenerator(maximum_num_cat=10),  # Overridden from default.
        IdentityFeatureGenerator(infer_features_in_args=dict(valid_raw_types=[R_INT, R_FLOAT])),
    ]]
)
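Then, if we dump the transformed data, we can see that all columns have been converted to numeric, because that is what most models require, and that rare categorical values have been replaced with NaN.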
mypipeline.fit_transform(X=dfx)
Fitting PipelineFeatureGenerator...
Available Memory: 29413.81 MB
Train Data (Original) Memory Usage: 0.01 MB (0.0% of available memory)
Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
Stage 1 Generators:
Fitting AsTypeFeatureGenerator...
Stage 2 Generators:
Fitting FillNaFeatureGenerator...
Stage 3 Generators:
Fitting CategoryFeatureGenerator...
Fitting CategoryMemoryMinimizeFeatureGenerator...
Fitting IdentityFeatureGenerator...
Stage 4 Generators:
Fitting DropUniqueFeatureGenerator...
Stage 5 Generators:
Fitting DropDuplicatesFeatureGenerator...
Unused Original Features (Count: 1): ['C']
These features were not used to generate any of the output features. Add a feature generator compatible with these features to utilize them.
Features can also be unused if they carry very little information, such as being categorical but having almost entirely unique values or being duplicates of other features.
These features do not need to be present at inference time.
('datetime', []) : 1 | ['C']
Types of features in original data (raw dtype, special dtypes):
('category', []) : 2 | ['B', 'D']
('float', []) : 1 | ['A']
('object', ['text']) : 1 | ['E']
Types of features in processed data (raw dtype, special dtypes):
('category', []) : 2 | ['B', 'D']
('category', ['text_as_category']) : 1 | ['E']
('float', []) : 1 | ['A']
0.0s = Fit runtime
4 features in original data used to generate 4 features in processed data.
Train Data (Processed) Memory Usage: 0.00 MB (0.0% of available memory)
index | B | D | E | A |
---|---|---|---|---|
0 | NaN | NaN | NaN | NaN |
1 | 1 | 2 | NaN | -0.468674 |
2 | 1 | 0 | NaN | 1.767960 |
3 | 2 | 3 | NaN | -0.118771 |
4 | 1 | 1 | NaN | 0.630196 |
... | ... | ... | ... | ... |
95 | 0 | 0 | NaN | -1.182318 |
96 | 1 | 0 | 1 | 0.562761 |
97 | 1 | 1 | NaN | -0.797270 |
98 | 1 | 3 | NaN | 0.502741 |
99 | 1 | 1 | NaN | 2.056356 |
100 rows × 4 columns
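The idea of capping the number of categories can be sketched without AutoGluon. The helper below is hypothetical and only mirrors the described effect of maximum_num_cat, keeping the most frequent categories and turning the rest into missing values; how CategoryFeatureGenerator selects categories or breaks ties is not covered here and may differ.

```python
from collections import Counter
from typing import List, Optional


def keep_most_frequent(values: List[Optional[str]], maximum_num_cat: int) -> List[Optional[str]]:
    """Keep the most frequent categories and replace all others with None."""
    counts = Counter(v for v in values if v is not None)
    kept = {category for category, _ in counts.most_common(maximum_num_cat)}
    return [v if v in kept else None for v in values]


colors = ["red", "red", "red", "blue", "blue", "green", "teal"]
print(keep_most_frequent(colors, maximum_num_cat=2))
# ['red', 'red', 'red', 'blue', 'blue', None, None]
```

In the example data above, columns "B" and "D" each have fewer than 10 distinct values, so a cap of 10 leaves their categories intact.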
For more on custom feature engineering, see the detailed notebook examples/tabular/example_custom_feature_generator.py.