AutoMM for Entity Extraction with Text and Image - Quick Start¶

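We have already introduced how to train an entity extraction model with text data. Here we go a step further and integrate data from other modalities. In many real-world applications, text is accompanied by data of other modalities. For example, Twitter allows you to compose tweets with text, photos, videos, and GIFs. Amazon.com uses text, images, and videos to describe its products. These auxiliary modalities can serve as additional context for resolving entities. With AutoMM, you can easily exploit multimodal data to enhance entity extraction without worrying about the details.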

import os
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

Get the Twitter Dataset¶

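In the following example, we demonstrate how to build a multimodal named entity recognition model with a real-world Twitter dataset. The dataset consists of tweets scraped from 2016 to 2017, and each tweet is composed of one sentence and one image. Let's download the dataset.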

download_dir = './ag_automm_tutorial_ner'
zip_file = 'https://automl-mm-bench.s3.amazonaws.com/ner/multimodal_ner.zip'
from autogluon.core.utils.loaders import load_zip
load_zip.unzip(zip_file, unzip_dir=download_dir)

Downloading ./ag_automm_tutorial_ner/file.zip from https://automl-mm-bench.s3.amazonaws.com/ner/multimodal_ner.zip...

100%|██████████| 423M/423M [00:08<00:00, 50.2MiB/s]

Next, we load the CSV files.

dataset_path = download_dir + '/multimodal_ner'
train_data = pd.read_csv(f'{dataset_path}/twitter17_train.csv')
test_data = pd.read_csv(f'{dataset_path}/twitter17_test.csv')
label_col = 'entity_annotations'

We need to expand the image paths so that they can be loaded during training.

image_col = 'image'
train_data[image_col] = train_data[image_col].apply(lambda ele: ele.split(';')[0]) # Use the first image for a quick tutorial
test_data[image_col] = test_data[image_col].apply(lambda ele: ele.split(';')[0])

def path_expander(path, base_folder):
    # Turn each ';'-separated relative path into an absolute path under base_folder.
    path_l = path.split(';')
    return ';'.join([os.path.abspath(base_folder + p) for p in path_l])

train_data[image_col] = train_data[image_col].apply(lambda ele: path_expander(ele, base_folder=dataset_path))
test_data[image_col] = test_data[image_col].apply(lambda ele: path_expander(ele, base_folder=dataset_path))

train_data[image_col].iloc[0]

'/home/ci/autogluon/docs/tutorials/multimodal/multimodal_prediction/ag_automm_tutorial_ner/multimodal_ner/twitter2017_images/17_06_1818.jpg'
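Before training, it can help to confirm that the expanded paths point at real files. The following is a minimal sketch under stated assumptions: `find_missing_images` is a hypothetical helper, not part of AutoGluon, and it assumes each cell holds one or more ';'-separated paths, as in the CSV files above.

```python
import os


def find_missing_images(image_paths, sep=";"):
    """Return every path that does not exist on disk.

    Each entry of image_paths may hold several sep-separated paths.
    Hypothetical helper for illustration; not an AutoGluon API.
    """
    missing = []
    for entry in image_paths:
        for path in entry.split(sep):
            if not os.path.isfile(path):
                missing.append(path)
    return missing
```

If the archive was unpacked correctly, one would expect `find_missing_images(train_data[image_col])` to return an empty list.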

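Each row contains the text and image of a tweet, along with entity_annotations, which holds the named entity annotations for the text column. Let's look at an example row and display the tweet's text and image.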

example_row = train_data.iloc[0]

example_row

text_snippet           Uefa Super Cup : Real Madrid v Manchester United
image                 /home/ci/autogluon/docs/tutorials/multimodal/m...
entity_annotations    [{"entity_group": "B-MISC", "start": 0, "end":...
Name: 0, dtype: object
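The annotation column is displayed truncated above, so the sketch below shows how such a value could be decoded. It assumes the column holds a JSON list of objects with `entity_group`, `start`, and `end` keys, where `start` and `end` are character offsets into the text with an exclusive end. The annotation string is hand-written for illustration and is not the dataset's actual label for this row; `decode_annotations` is a hypothetical helper.

```python
import json


def decode_annotations(text, annotations_json):
    """Map each annotation to a (covered text, entity_group) pair."""
    annotations = json.loads(annotations_json)
    return [(text[a["start"]:a["end"]], a["entity_group"]) for a in annotations]


text = "Uefa Super Cup : Real Madrid v Manchester United"
# Hand-written illustration; the offsets are assumptions, not dataset values.
annotations_json = json.dumps([
    {"entity_group": "B-MISC", "start": 0, "end": 4},
    {"entity_group": "B-ORG", "start": 17, "end": 21},
    {"entity_group": "I-ORG", "start": 22, "end": 28},
])
decoded = decode_annotations(text, annotations_json)
print(decoded)
# [('Uefa', 'B-MISC'), ('Real', 'B-ORG'), ('Madrid', 'I-ORG')]
```

On the real data, the same function could be applied to `example_row["text_snippet"]` and `example_row[label_col]`.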

Below is the image of this tweet.

example_image = example_row[image_col]

from IPython.display import Image, display
pil_img = Image(filename=example_image, width=300)
display(pil_img)

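As you can see, the photo contains the logos of Real Madrid football club, Manchester United football club, and the UEFA Super Cup. Clearly, the key information in the tweet's sentence is also encoded here, in a different modality.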

Training¶

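Now let's fit the predictor on the training data. First, we need to specify the problem_type as **ner**. Because our annotations apply to a text column, we use the **column_types** parameter to set the corresponding column type to text_ner, which ensures the model locates the correct text column for entity extraction when there are multiple text columns. Here we set a tight time budget for a quick demonstration.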

from autogluon.multimodal import MultiModalPredictor
import uuid

label_col = "entity_annotations"
model_path = f"./tmp/{uuid.uuid4().hex}-automm_multimodal_ner"
predictor = MultiModalPredictor(problem_type="ner", label=label_col, path=model_path)
predictor.fit(
    train_data=train_data,
    column_types={"text_snippet": "text_ner"},
    time_limit=300,  # seconds
)

=================== System Info ===================
AutoGluon Version:  1.3.1b20250508
Python Version:     3.11.9
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #1 SMP Wed Mar 12 14:53:59 UTC 2025
CPU Count:          8
Pytorch Version:    2.6.0+cu124
CUDA Version:       12.4
Memory Avail:       28.40 GB / 30.95 GB (91.8%)
Disk Space Avail:   180.79 GB / 255.99 GB (70.6%)
===================================================

AutoMM starts to create your model. ✨✨✨

To track the learning progress, you can open a terminal and launch Tensorboard:
    ```shell
    # Assume you have installed tensorboard
    tensorboard --logdir /home/ci/autogluon/docs/tutorials/multimodal/multimodal_prediction/tmp/2bedc65a8cb2477ea10e14a9c46427eb-automm_multimodal_ner
    ```
INFO: Seed set to 0

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[8], line 7
----> 7 predictor.fit(

File ~/autogluon/multimodal/src/autogluon/multimodal/predictor.py:540, in MultiModalPredictor.fit(self, train_data, presets, tuning_data, max_num_tuning_data, id_mappings, time_limit, save_path, hyperparameters, column_types, holdout_frac, teacher_predictor, seed, standalone, hyperparameter_tune_kwargs, clean_ckpts, predictions, labels, predictors)
--> 540 self._learner.fit(

File ~/autogluon/multimodal/src/autogluon/multimodal/learners/base.py:665, in BaseLearner.fit(self, train_data, presets, tuning_data, time_limit, save_path, hyperparameters, column_types, holdout_frac, teacher_learner, seed, standalone, hyperparameter_tune_kwargs, clean_ckpts, **kwargs)
--> 665 fit_returns = self.execute_fit()

File ~/autogluon/multimodal/src/autogluon/multimodal/learners/base.py:577, in BaseLearner.execute_fit(self)
--> 577     attributes = self.fit_per_run(**self._fit_args)

File ~/autogluon/multimodal/src/autogluon/multimodal/learners/ner.py:203, in NERLearner.fit_per_run(self, max_time, save_path, ckpt_path, resume, enable_progress_bar, seed, hyperparameters, advanced_hyperparameters, config, df_preprocessor, data_processors, model, standalone, clean_ckpts)
--> 203 model = self.get_model_per_run(

File ~/autogluon/multimodal/src/autogluon/multimodal/learners/ner.py:105, in NERLearner.get_model_per_run(self, model, config, df_preprocessor, output_shape)
--> 105         model = create_fusion_model(

File ~/autogluon/multimodal/src/autogluon/multimodal/models/utils.py:1649, in create_fusion_model(config, num_classes, classes, num_numerical_columns, num_categories, numerical_fill_values, pretrained)
-> 1649     model = fusion_model(models=single_models)

File ~/autogluon/multimodal/src/autogluon/multimodal/models/fusion/fusion_ner.py:67, in MultimodalFusionNER.__init__(self, prefix, models, hidden_features, num_classes, adapt_in_features, activation, dropout_prob, normalization, loss_weight)
---> 67     super().__init__(

TypeError: AbstractMultimodalFusionModel.__init__() got an unexpected keyword argument 'loss_weight'

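Under the hood, AutoMM automatically detects the data modalities, selects the relevant models from its multimodal model pool, and trains the selected models. If there are multiple backbones, AutoMM appends a late-fusion model on top of them.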
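To make the idea of late fusion concrete, here is a minimal pure-Python sketch of one common form: the image feature vector is concatenated onto every token's text feature vector, and a single linear layer scores each token. This is an illustration under assumptions and not AutoMM's actual architecture; the function name, feature sizes, and weights are all hypothetical.

```python
def late_fusion_logits(token_feats, image_feat, weights, bias):
    """Score each token from its text features fused with the image features.

    token_feats: one feature list per token (from a text backbone)
    image_feat:  one feature list for the whole image (from an image backbone)
    weights:     one row per class, each as long as a fused feature vector
    bias:        one value per class
    """
    logits = []
    for tok in token_feats:
        fused = tok + image_feat  # list concatenation = feature concatenation
        logits.append([
            sum(w * x for w, x in zip(row, fused)) + b
            for row, b in zip(weights, bias)
        ])
    return logits


# Toy values chosen only for illustration.
token_feats = [[1.0, 0.0], [0.0, 1.0]]  # two tokens, two text features each
image_feat = [2.0]                      # one image feature
weights = [[1.0, 0.0, 0.5], [0.0, 1.0, -0.5]]
bias = [0.0, 0.1]
scores = late_fusion_logits(token_feats, image_feat, weights, bias)
```

Each backbone is run independently and the modalities only meet at the final layer, which is what distinguishes late fusion from approaches that mix modalities inside the backbone.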

Evaluation¶

predictor.evaluate(test_data, metrics=['overall_recall', 'overall_precision', 'overall_f1'])
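The source does not state how these three metrics are computed internally. As a rough guide to what they mean, the sketch below scores predictions at the entity level with exact span matching, in the style commonly used for NER. The helper `span_prf` and the example spans are hypothetical.

```python
def span_prf(gold_spans, pred_spans):
    """Entity-level precision, recall, and F1 with exact span matching.

    Each span is a (start, end, label) tuple; a prediction counts as
    correct only if all three fields match a gold span.
    """
    gold, pred = set(gold_spans), set(pred_spans)
    true_pos = len(gold & pred)
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Made-up spans for illustration only.
gold = [(0, 14, "MISC"), (17, 28, "ORG"), (31, 48, "ORG")]
pred = [(0, 14, "MISC"), (17, 28, "ORG"), (31, 41, "ORG"), (42, 48, "PER")]
precision, recall, f1 = span_prf(gold, pred)
```

Here two of the four predicted spans match, so precision is 2/4 and recall is 2/3. Splitting "Manchester United" into two wrong spans costs both metrics, which is why exact-match scoring is strict about entity boundaries.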

Prediction¶

You can easily obtain predictions by calling predictor.predict().

prediction_input = test_data.drop(columns=label_col).head(1)
predictions = predictor.predict(prediction_input)
print('Tweet:', prediction_input.text_snippet[0])
print('Image path:', prediction_input.image[0])
print('Predicted entities:', predictions[0])

for entity in predictions[0]:
	print(f"Word '{prediction_input.text_snippet[0][entity['start']:entity['end']]}' belongs to group: {entity['entity_group']}")
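The loop above prints one line per predicted span. If the predictions carry B-/I- prefixes like the training annotations (an assumption; the source does not show the prediction output), consecutive pieces can be merged into whole entity mentions. `merge_bio_entities` is a hypothetical helper, and the example predictions are hand-written.

```python
def merge_bio_entities(text, entities):
    """Merge consecutive B-X / I-X predictions into whole entity mentions.

    An I-X span extends the previous mention when that mention has the
    same label X; anything else starts a new mention. This simple rule
    does not check that the two spans are adjacent in the text.
    """
    merged = []
    for ent in sorted(entities, key=lambda e: e["start"]):
        prefix, _, label = ent["entity_group"].partition("-")
        if prefix == "I" and merged and merged[-1]["label"] == label:
            merged[-1]["end"] = ent["end"]  # extend the current mention
        else:
            merged.append({"label": label, "start": ent["start"], "end": ent["end"]})
    for mention in merged:
        mention["text"] = text[mention["start"]:mention["end"]]
    return merged


text = "Uefa Super Cup : Real Madrid v Manchester United"
# Hand-written example predictions, not actual model output.
example_predictions = [
    {"entity_group": "B-ORG", "start": 17, "end": 21},
    {"entity_group": "I-ORG", "start": 22, "end": 28},
    {"entity_group": "B-ORG", "start": 31, "end": 41},
    {"entity_group": "I-ORG", "start": 42, "end": 48},
]
mentions = merge_bio_entities(text, example_predictions)
print([(m["text"], m["label"]) for m in mentions])
# [('Real Madrid', 'ORG'), ('Manchester United', 'ORG')]
```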

Reloading and Continuous Training¶

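The trained predictor is saved automatically, and you can easily reload it using its path. If you are not satisfied with the current model's performance, you can continue training the loaded model with new data.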

new_predictor = MultiModalPredictor.load(model_path)
new_model_path = f"./tmp/{uuid.uuid4().hex}-automm_multimodal_ner_continue_train"
new_predictor.fit(train_data, time_limit=60, save_path=new_model_path)
test_score = new_predictor.evaluate(test_data, metrics=['overall_f1'])
print(test_score)

Other Examples¶

You can visit AutoMM Examples to explore other examples of AutoMM.

Customization¶

To learn how to customize AutoMM, please refer to Customize AutoMM.