AutoMM for Entity Extraction with Text and Image - Quick Start

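We have introduced how to train an entity extraction model with text data. Here, we go a step further and integrate data from other modalities. In many real-world applications, text comes with data of other modalities. For example, Twitter allows you to compose tweets with text, photos, videos, and GIFs. Amazon.com uses text, images, and videos to describe its products. These auxiliary modalities can serve as additional context for resolving entities. With AutoMM, you can easily use multimodal data to enhance entity extraction without worrying about the details.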

import os
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

Get the Twitter Dataset

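In the following example, we demonstrate how to build a multimodal named entity recognition model with a real-world Twitter dataset. The dataset consists of tweets scraped from 2016 to 2017, and each tweet is composed of one sentence and one image. Let's download the dataset.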

download_dir = './ag_automm_tutorial_ner'
zip_file = 'https://automl-mm-bench.s3.amazonaws.com/ner/multimodal_ner.zip'
from autogluon.core.utils.loaders import load_zip
load_zip.unzip(zip_file, unzip_dir=download_dir)
Downloading ./ag_automm_tutorial_ner/file.zip from https://automl-mm-bench.s3.amazonaws.com/ner/multimodal_ner.zip...
100%|██████████| 423M/423M [00:08<00:00, 50.2MiB/s]

Next, we will load the CSV files.

dataset_path = download_dir + '/multimodal_ner'
train_data = pd.read_csv(f'{dataset_path}/twitter17_train.csv')
test_data = pd.read_csv(f'{dataset_path}/twitter17_test.csv')
label_col = 'entity_annotations'

We need to expand the image paths so that they can be loaded during training.

image_col = 'image'
train_data[image_col] = train_data[image_col].apply(lambda ele: ele.split(';')[0]) # Use the first image for a quick tutorial
test_data[image_col] = test_data[image_col].apply(lambda ele: ele.split(';')[0])

def path_expander(path, base_folder):
	path_l = path.split(';')
	p = ';'.join([os.path.abspath(base_folder+path) for path in path_l])
	return p

train_data[image_col] = train_data[image_col].apply(lambda ele: path_expander(ele, base_folder=dataset_path))
test_data[image_col] = test_data[image_col].apply(lambda ele: path_expander(ele, base_folder=dataset_path))

train_data[image_col].iloc[0]
'/home/ci/autogluon/docs/tutorials/multimodal/multimodal_prediction/ag_automm_tutorial_ner/multimodal_ner/twitter2017_images/17_06_1818.jpg'
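
Before training, it can be worth confirming that every expanded path points at a real file, since a missing image would otherwise only surface during data loading. The following is a minimal sketch of such a check. `find_missing_files` is a hypothetical helper, not part of AutoGluon, and the temporary files stand in for the tutorial's images.

```python
import os
import tempfile


def find_missing_files(paths):
    """Return the paths that do not exist on disk, preserving order.

    Hypothetical helper for illustration. Each entry may hold several
    ';'-separated paths, mirroring the format handled by path_expander.
    """
    missing = []
    for entry in paths:
        for single_path in entry.split(';'):
            if not os.path.isfile(single_path):
                missing.append(single_path)
    return missing


# Demo with throwaway files instead of the real dataset.
with tempfile.TemporaryDirectory() as tmp_dir:
    existing = os.path.join(tmp_dir, 'present.jpg')
    with open(existing, 'wb') as handle:
        handle.write(b'placeholder')
    absent = os.path.join(tmp_dir, 'absent.jpg')

    demo_missing = find_missing_files([existing, absent])
    print(len(demo_missing))
```

On the tutorial data, the same helper could be called as `find_missing_files(train_data[image_col])`.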

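Each row contains the text and image of a tweet, along with entity_annotations, which holds the named entity annotations for the text column. Let's look at an example row and display the tweet's text and image.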

example_row = train_data.iloc[0]

example_row
text_snippet           Uefa Super Cup : Real Madrid v Manchester United
image                 /home/ci/autogluon/docs/tutorials/multimodal/m...
entity_annotations    [{"entity_group": "B-MISC", "start": 0, "end":...
Name: 0, dtype: object
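
To read an annotation, it helps to map each entry back to the characters it covers. The sketch below assumes the label column stores a JSON string of dictionaries with `entity_group`, `start`, and `end` keys, and that `end` is exclusive, as the slicing in the prediction code of this tutorial suggests. The sentence and labels are made up for illustration, and `decode_annotations` is a hypothetical helper.

```python
import json


def decode_annotations(text, annotations_json):
    """Pair each annotated character span with its entity label."""
    annotations = json.loads(annotations_json)
    return [
        (text[item["start"]:item["end"]], item["entity_group"])
        for item in annotations
    ]


# Made-up sentence and labels, for illustration only.
demo_text = "Alice visited Paris"
demo_labels = json.dumps([
    {"entity_group": "B-PER", "start": 0, "end": 5},
    {"entity_group": "B-LOC", "start": 14, "end": 19},
])

decoded = decode_annotations(demo_text, demo_labels)
print(decoded)
```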

Below is the image of this tweet.

example_image = example_row[image_col]

from IPython.display import Image, display
pil_img = Image(filename=example_image, width=300)
display(pil_img)

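As you can see, this photo contains the logos of Real Madrid football club, Manchester United football club, and the UEFA Super Cup. Clearly, the key information in the tweet sentence is also encoded here in a different modality.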

Training

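Now let's fit the predictor with the training data. First, we need to specify the problem_type as **ner**. Because our annotations apply to one text column and the data may contain several, we use the **column_types** parameter to set that column's type to text_ner, so the model picks the correct text column for entity extraction. Here we set a tight time budget for a quick demo.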

from autogluon.multimodal import MultiModalPredictor
import uuid

label_col = "entity_annotations"
model_path = f"./tmp/{uuid.uuid4().hex}-automm_multimodal_ner"
predictor = MultiModalPredictor(problem_type="ner", label=label_col, path=model_path)
predictor.fit(
	train_data=train_data,
	column_types={"text_snippet":"text_ner"},
	time_limit=300,  # seconds
)
=================== System Info ===================
AutoGluon Version:  1.3.1b20250508
Python Version:     3.11.9
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #1 SMP Wed Mar 12 14:53:59 UTC 2025
CPU Count:          8
Pytorch Version:    2.6.0+cu124
CUDA Version:       12.4
Memory Avail:       28.40 GB / 30.95 GB (91.8%)
Disk Space Avail:   180.79 GB / 255.99 GB (70.6%)
===================================================

AutoMM starts to create your model. ✨✨✨

To track the learning progress, you can open a terminal and launch Tensorboard:
    ```shell
    # Assume you have installed tensorboard
    tensorboard --logdir /home/ci/autogluon/docs/tutorials/multimodal/multimodal_prediction/tmp/2bedc65a8cb2477ea10e14a9c46427eb-automm_multimodal_ner
    ```
INFO: Seed set to 0
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[8], line 7
      5 model_path = f"./tmp/{uuid.uuid4().hex}-automm_multimodal_ner"
      6 predictor = MultiModalPredictor(problem_type="ner", label=label_col, path=model_path)
----> 7 predictor.fit(
      8 	train_data=train_data,
      9 	column_types={"text_snippet":"text_ner"},
     10 	time_limit=300, #second
     11 )

File ~/autogluon/multimodal/src/autogluon/multimodal/predictor.py:540, in MultiModalPredictor.fit(self, train_data, presets, tuning_data, max_num_tuning_data, id_mappings, time_limit, save_path, hyperparameters, column_types, holdout_frac, teacher_predictor, seed, standalone, hyperparameter_tune_kwargs, clean_ckpts, predictions, labels, predictors)
    537     assert isinstance(predictors, list)
    538     learners = [ele if isinstance(ele, str) else ele._learner for ele in predictors]
--> 540 self._learner.fit(
    541     train_data=train_data,
    542     presets=presets,
    543     tuning_data=tuning_data,
    544     max_num_tuning_data=max_num_tuning_data,
    545     time_limit=time_limit,
    546     save_path=save_path,
    547     hyperparameters=hyperparameters,
    548     column_types=column_types,
    549     holdout_frac=holdout_frac,
    550     teacher_learner=teacher_learner,
    551     seed=seed,
    552     standalone=standalone,
    553     hyperparameter_tune_kwargs=hyperparameter_tune_kwargs,
    554     clean_ckpts=clean_ckpts,
    555     id_mappings=id_mappings,
    556     predictions=predictions,
    557     labels=labels,
    558     learners=learners,
    559 )
    561 return self

File ~/autogluon/multimodal/src/autogluon/multimodal/learners/base.py:665, in BaseLearner.fit(self, train_data, presets, tuning_data, time_limit, save_path, hyperparameters, column_types, holdout_frac, teacher_learner, seed, standalone, hyperparameter_tune_kwargs, clean_ckpts, **kwargs)
    658 self.fit_sanity_check()
    659 self.prepare_fit_args(
    660     time_limit=time_limit,
    661     seed=seed,
    662     standalone=standalone,
    663     clean_ckpts=clean_ckpts,
    664 )
--> 665 fit_returns = self.execute_fit()
    666 self.on_fit_end(
    667     training_start=training_start,
    668     strategy=fit_returns.get("strategy", None),
   (...)
    671     clean_ckpts=clean_ckpts,
    672 )
    674 return self

File ~/autogluon/multimodal/src/autogluon/multimodal/learners/base.py:577, in BaseLearner.execute_fit(self)
    575     return dict()
    576 else:
--> 577     attributes = self.fit_per_run(**self._fit_args)
    578     self.update_attributes(**attributes)  # only update attributes for non-HPO mode
    579     return attributes

File ~/autogluon/multimodal/src/autogluon/multimodal/learners/ner.py:203, in NERLearner.fit_per_run(self, max_time, save_path, ckpt_path, resume, enable_progress_bar, seed, hyperparameters, advanced_hyperparameters, config, df_preprocessor, data_processors, model, standalone, clean_ckpts)
    201 config = self.update_config_by_data_per_run(config=config, df_preprocessor=df_preprocessor)
    202 output_shape = self.get_output_shape_per_run(df_preprocessor=df_preprocessor)
--> 203 model = self.get_model_per_run(
    204     model=model,
    205     config=config,
    206     df_preprocessor=df_preprocessor,
    207     output_shape=output_shape,
    208 )
    209 model = self.compile_model_per_run(config=config, model=model)
    210 peft_param_names = self.get_peft_param_names_per_run(model=model, config=config)

File ~/autogluon/multimodal/src/autogluon/multimodal/learners/ner.py:105, in NERLearner.get_model_per_run(self, model, config, df_preprocessor, output_shape)
     97 def get_model_per_run(
     98     self,
     99     model: nn.Module,
   (...)
    102     output_shape: int,
    103 ):
    104     if model is None:
--> 105         model = create_fusion_model(
    106             config=config,
    107             num_classes=output_shape,
    108             num_numerical_columns=len(df_preprocessor.numerical_feature_names),
    109             num_categories=df_preprocessor.categorical_num_categories,
    110         )
    111     return model

File ~/autogluon/multimodal/src/autogluon/multimodal/models/utils.py:1649, in create_fusion_model(config, num_classes, classes, num_numerical_columns, num_categories, numerical_fill_values, pretrained)
   1645         single_models.append(model)
   1647 if len(single_models) > 1:
   1648     # must have one fusion model if there are multiple independent models
-> 1649     model = fusion_model(models=single_models)
   1650 elif len(single_models) == 1:
   1651     model = single_models[0]

File ~/autogluon/multimodal/src/autogluon/multimodal/models/fusion/fusion_ner.py:67, in MultimodalFusionNER.__init__(self, prefix, models, hidden_features, num_classes, adapt_in_features, activation, dropout_prob, normalization, loss_weight)
     23 def __init__(
     24     self,
     25     prefix: str,
   (...)
     33     loss_weight: Optional[float] = None,
     34 ):
     35     """
     36     Parameters
     37     ----------
   (...)
     65         The weight of individual models.
     66     """
---> 67     super().__init__(
     68         prefix=prefix,
     69         models=models,
     70         loss_weight=loss_weight,
     71     )
     72     logger.debug("initializing MultimodalFusionNER")
     74     if loss_weight is not None:

TypeError: AbstractMultimodalFusionModel.__init__() got an unexpected keyword argument 'loss_weight'

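Under the hood, AutoMM automatically detects the data modalities, selects the related models from the multimodal model pool, and trains the selected models. If there are multiple backbones, AutoMM appends a late-fusion model on top of them.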
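
As a rough intuition for late fusion, each backbone first produces its own features, and a small head then combines them. The toy sketch below shows one generic form of this idea: it concatenates a single image feature vector onto every token feature vector and applies a linear layer per token. It is not a description of AutoMM's internal fusion model; the function name, shapes, and weights are all assumptions chosen for illustration.

```python
def late_fusion_logits(token_features, image_feature, weights, bias):
    """Concatenate the image feature onto each token feature, then apply a linear layer.

    token_features: one feature list per token (text backbone output)
    image_feature:  one feature list for the whole image (image backbone output)
    weights, bias:  parameters of the linear head, one weight row per class
    """
    logits = []
    for token_vector in token_features:
        fused = token_vector + image_feature  # list concatenation
        row = [
            sum(w * x for w, x in zip(weight_row, fused)) + b
            for weight_row, b in zip(weights, bias)
        ]
        logits.append(row)
    return logits


# Two tokens with 2-d text features, a 1-d image feature, and two classes.
demo_logits = late_fusion_logits(
    token_features=[[1.0, 0.0], [0.0, 1.0]],
    image_feature=[2.0],
    weights=[[1.0, 0.0, 1.0], [0.0, 1.0, -1.0]],
    bias=[0.0, 0.5],
)
print(demo_logits)
```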

Evaluation

predictor.evaluate(test_data, metrics=['overall_recall', 'overall_precision', 'overall_f1'])
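
The metric names above suggest entity-level precision, recall, and F1. As a simplified illustration of what such scores measure, the sketch below counts a predicted span as correct only when its start, end, and label all match a gold span exactly. This is an assumption about the general idea, and AutoMM's own computation may differ in detail. `span_prf` and the spans are hypothetical.

```python
def span_prf(gold_spans, predicted_spans):
    """Exact-match precision, recall, and F1 over (start, end, label) spans."""
    gold = set(gold_spans)
    predicted = set(predicted_spans)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Made-up spans: one prediction matches a gold span, one does not.
gold = [(0, 5, "PER"), (14, 19, "LOC")]
predicted = [(0, 5, "PER"), (6, 13, "LOC")]
scores = span_prf(gold, predicted)
print(scores)
```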

Prediction

You can easily obtain the predictions by calling predictor.predict().

prediction_input = test_data.drop(columns=label_col).head(1)
predictions = predictor.predict(prediction_input)
print('Tweet:', prediction_input.text_snippet[0])
print('Image path:', prediction_input.image[0])
print('Predicted entities:', predictions[0])

for entity in predictions[0]:
	print(f"Word '{prediction_input.text_snippet[0][entity['start']:entity['end']]}' belongs to group: {entity['entity_group']}")
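
If the predictions carry the same B-/I- prefixed labels as the annotations (an assumption, since no prediction output is shown here), a multi-word entity arrives as several consecutive spans. The sketch below merges a `B-X` span with the `I-X` spans that follow it into one entity. `merge_bio_spans` is a hypothetical helper, and the sample predictions are made up.

```python
def merge_bio_spans(entities):
    """Merge each B-X span with the I-X spans that directly follow it."""
    merged = []
    for entity in entities:
        prefix, separator, label = entity["entity_group"].partition("-")
        if not separator:
            # No B-/I- prefix: treat the whole tag as the label of a new entity.
            prefix, label = "B", entity["entity_group"]
        continues_previous = (
            prefix == "I"
            and bool(merged)
            and merged[-1]["entity_group"] == label
        )
        if continues_previous:
            merged[-1]["end"] = entity["end"]
        else:
            merged.append(
                {"entity_group": label, "start": entity["start"], "end": entity["end"]}
            )
    return merged


# Made-up word-level predictions for the sentence below.
demo_sentence = "New York is big"
demo_predictions = [
    {"entity_group": "B-LOC", "start": 0, "end": 3},
    {"entity_group": "I-LOC", "start": 4, "end": 8},
]

merged_entities = merge_bio_spans(demo_predictions)
for item in merged_entities:
    print(demo_sentence[item["start"]:item["end"]], item["entity_group"])
```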

Reloading and Continuous Training

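The trained predictor is saved automatically, and you can easily reload it using its path. If you are not satisfied with the current model performance, you can continue training the loaded model with new data.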

new_predictor = MultiModalPredictor.load(model_path)
new_model_path = f"./tmp/{uuid.uuid4().hex}-automm_multimodal_ner_continue_train"
new_predictor.fit(train_data, time_limit=60, save_path=new_model_path)
test_score = new_predictor.evaluate(test_data, metrics=['overall_f1'])
print(test_score)

Other Examples

You may go to AutoMM Examples to explore other examples about AutoMM.

Customization

To learn how to customize AutoMM, please refer to Customize AutoMM.