AutoMM for Named Entity Recognition - Quick Start

Named entity recognition (NER) refers to identifying and categorizing key information (entities) in unstructured text. An entity can be a word or a series of words corresponding to categories such as cities, time expressions, monetary values, facilities, people, organizations, and so on. An NER model usually takes an unannotated block of text as input and produces an annotated block of text in which the named entities are highlighted with predefined categories. For example, given the sentence below,

  • Albert Einstein was born in Germany and is widely acknowledged to be one of the greatest physicists.

the model should tell you that "Albert Einstein" is a PERSON and "Germany" is a LOCATION. In the following, we will introduce how to use AutoMM for NER tasks, including how to prepare the data, how to train a model, and what the model outputs look like.

Prepare Your Data

As with other tasks in AutoMM, you simply need to prepare your data as data tables (i.e., dataframes) containing one text column and one annotation column. The text column stores the raw text data with the entities you want to recognize, and the annotation column stores the label information of those entities (e.g., the character-level start/end offsets of each category). AutoMM requires the annotation column to follow the JSON format below (note: don't forget to call json.dumps() to convert Python objects into a JSON string before creating the dataframe).

import json
json.dumps([
    {"entity_group": "PERSON", "start": 0, "end": 15},
    {"entity_group": "LOCATION", "start": 28, "end": 35}
])
'[{"entity_group": "PERSON", "start": 0, "end": 15}, {"entity_group": "LOCATION", "start": 28, "end": 35}]'

Here, entity_group is the category of the entity, start is the character-level position where the entity begins, and end marks the position where the entity ends. To make sure AutoMM recognizes your JSON annotations, you must use exactly the same keys/properties (entity_group, start, end) as specified above when constructing your data. You can annotate "Albert Einstein" as one entity group, or assign a label to each individual word.
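
For illustration, here is a minimal sketch of how such a training dataframe can be built (the column names text_snippet and entity_annotations simply match the dataset used later in this tutorial):

import json
import pandas as pd

sentence = "Albert Einstein was born in Germany and is widely acknowledged to be one of the greatest physicists."
# Serialize the annotation list into a JSON string, as AutoMM requires.
annotation = json.dumps([
    {"entity_group": "PERSON", "start": 0, "end": 15},
    {"entity_group": "LOCATION", "start": 28, "end": 35},
])

# One row per sentence: a raw-text column plus a JSON-string annotation column.
df = pd.DataFrame({"text_snippet": [sentence], "entity_annotations": [annotation]})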

Below is an example of visualizing the annotations with the visualize_ner utility.

from autogluon.multimodal.utils import visualize_ner

sentence = "Albert Einstein was born in Germany and is widely acknowledged to be one of the greatest physicists."
annotation = [{"entity_group": "PERSON", "start": 0, "end": 15},
              {"entity_group": "LOCATION", "start": 28, "end": 35}]

visualize_ner(sentence, annotation)
Albert Einstein PERSON was born in Germany LOCATION and is widely acknowledged to be one of the greatest physicists.

If you are already familiar with NER tasks, you have probably heard of the BIO (Beginning-Inside-Outside) format. You may adopt this format (it is not mandatory) by adding an I- prefix or a B- prefix to each label to indicate whether the label is the beginning or the inside of an annotated chunk. For example, you can annotate "Albert" as "B-PERSON", since it is the beginning of a name, and "Einstein" as "I-PERSON", since it is inside the PERSON chunk. You don't need to worry about the O tag, which indicates that a word belongs to no chunk, as AutoMM handles it automatically.
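
If your labels already come in word-level BIO form, a small helper along the lines of the hypothetical sketch below (bio_to_annotations is our own name, not an AutoGluon API) can convert them into the character-offset format shown above, assuming words are separated by single spaces:

import json

def bio_to_annotations(words, tags):
    # Convert word-level BIO tags into AutoMM's character-offset annotations.
    annotations, offset, current = [], 0, None
    for word, tag in zip(words, tags):
        start, end = offset, offset + len(word)
        if tag.startswith("B-"):
            current = {"entity_group": tag[2:], "start": start, "end": end}
            annotations.append(current)
        elif tag.startswith("I-") and current is not None:
            current["end"] = end  # extend the chunk that is currently open
        else:  # an O tag closes any open chunk
            current = None
        offset = end + 1  # skip the single space between words
    return json.dumps(annotations)

words = ["Albert", "Einstein", "was", "born", "in", "Germany"]
tags = ["B-PERSON", "I-PERSON", "O", "O", "O", "B-LOCATION"]
print(bio_to_annotations(words, tags))
# '[{"entity_group": "PERSON", "start": 0, "end": 15}, {"entity_group": "LOCATION", "start": 28, "end": 35}]'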

Now, let's look at an example dataset. The dataset was converted from the MIT movies corpus, which provides annotations for entity groups such as actor, character, director, genre, song, title, trailer, and year.

from autogluon.core.utils.loaders import load_pd
train_data = load_pd.load('https://automl-mm-bench.s3.amazonaws.com/ner/mit-movies/train_v2.csv')
test_data = load_pd.load('https://automl-mm-bench.s3.amazonaws.com/ner/mit-movies/test_v2.csv')
train_data.head(5)
  text_snippet entity_annotations
0 what movies star bruce willis [{"entity_group": "ACTOR", "start": 17, "end":...
1 show me films with drew barrymore from the 1980s [{"entity_group": "ACTOR", "start": 19, "end":...
2 what movies starred both al pacino and robert... [{"entity_group": "ACTOR", "start": 25, "end":...
3 find me all of the movies that starred harold... [{"entity_group": "ACTOR", "start": 39, "end":...
4 find me a movie with a quote about baseball in it []

Let's print one row.

print(f"text_snippet: {train_data['text_snippet'][1]}")
print(f"entity_annotations: {train_data['entity_annotations'][1]}")
visualize_ner(train_data['text_snippet'][1], train_data['entity_annotations'][1])
text_snippet: show me films with drew barrymore from the 1980s
entity_annotations: [{"entity_group": "ACTOR", "start": 19, "end": 33}, {"entity_group": "YEAR", "start": 43, "end": 48}]
show me films with drew barrymore ACTOR from the 1980s YEAR
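
Before training, it is worth sanity-checking that the character offsets actually slice out the intended entities. A quick sketch (assuming, as in this dataset, that the annotation column holds JSON strings):

import json

row = train_data.iloc[1]
for ent in json.loads(row["entity_annotations"]):
    # Slice the raw text with the character-level offsets to verify the annotation.
    print(ent["entity_group"], "->", row["text_snippet"][ent["start"]:ent["end"]])
# ACTOR -> drew barrymore
# YEAR -> 1980s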

Training

Now, let's create a predictor for named entity recognition by setting problem_type to ner and specifying the label column. Afterwards, we call predictor.fit() to train the model for five minutes. To achieve reasonable performance in your applications, you are recommended to set a long enough time_limit (e.g., 30/60 minutes). You can also specify your backbone model and other hyperparameters with the hyperparameters argument. Here, we save the model under the "automm_ner" directory. For demonstration purposes, we use the lightweight 'google/electra-small-discriminator' backbone.

from autogluon.multimodal import MultiModalPredictor
import uuid

label_col = "entity_annotations"
model_path = f"./tmp/{uuid.uuid4().hex}-automm_ner"  # You can rename it to the model path you like
predictor = MultiModalPredictor(problem_type="ner", label=label_col, path=model_path)
predictor.fit(
    train_data=train_data,
    hyperparameters={'model.ner_text.checkpoint_name':'google/electra-small-discriminator'},
    time_limit=300,  # seconds
)
=================== System Info ===================
AutoGluon Version:  1.3.1b20250508
Python Version:     3.11.9
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #1 SMP Wed Mar 12 14:53:59 UTC 2025
CPU Count:          8
Pytorch Version:    2.6.0+cu124
CUDA Version:       12.4
Memory Avail:       28.39 GB / 30.95 GB (91.7%)
Disk Space Avail:   175.89 GB / 255.99 GB (68.7%)
===================================================

AutoMM starts to create your model. ✨✨✨

To track the learning progress, you can open a terminal and launch Tensorboard:
    ```shell
    # Assume you have installed tensorboard
    tensorboard --logdir /home/ci/autogluon/docs/tutorials/multimodal/text_prediction/tmp/d238f7752c7740f98b954242aa9ffb85-automm_ner
    ```
Seed set to 0
/home/ci/autogluon/multimodal/src/autogluon/multimodal/models/utils.py:1104: UserWarning: provided text_segment_num: 1 is smaller than google/electra-small-discriminator's default: 2
  warnings.warn(
GPU Count: 1
GPU Count to be Used: 1
Using 16bit Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name              | Type              | Params | Mode 
----------------------------------------------------------------
0 | model             | HFAutoModelForNER | 13.5 M | train
1 | validation_metric | MulticlassF1Score | 0      | train
2 | loss_func         | CrossEntropyLoss  | 0      | train
----------------------------------------------------------------
13.5 M    Trainable params
0         Non-trainable params
13.5 M    Total params
53.959    Total estimated model params size (MB)
230       Modules in train mode
0         Modules in eval mode
Epoch 0, global step 34: 'val_ner_token_f1' reached 0.00000 (best 0.00000), saving model to '/home/ci/autogluon/docs/tutorials/multimodal/text_prediction/tmp/d238f7752c7740f98b954242aa9ffb85-automm_ner/epoch=0-step=34.ckpt' as top 3
Epoch 0, global step 69: 'val_ner_token_f1' reached 0.50879 (best 0.50879), saving model to '/home/ci/autogluon/docs/tutorials/multimodal/text_prediction/tmp/d238f7752c7740f98b954242aa9ffb85-automm_ner/epoch=0-step=69.ckpt' as top 3
Epoch 1, global step 103: 'val_ner_token_f1' reached 0.78140 (best 0.78140), saving model to '/home/ci/autogluon/docs/tutorials/multimodal/text_prediction/tmp/d238f7752c7740f98b954242aa9ffb85-automm_ner/epoch=1-step=103.ckpt' as top 3
Epoch 1, global step 138: 'val_ner_token_f1' reached 0.82854 (best 0.82854), saving model to '/home/ci/autogluon/docs/tutorials/multimodal/text_prediction/tmp/d238f7752c7740f98b954242aa9ffb85-automm_ner/epoch=1-step=138.ckpt' as top 3
Epoch 2, global step 172: 'val_ner_token_f1' reached 0.85274 (best 0.85274), saving model to '/home/ci/autogluon/docs/tutorials/multimodal/text_prediction/tmp/d238f7752c7740f98b954242aa9ffb85-automm_ner/epoch=2-step=172.ckpt' as top 3
Epoch 2, global step 207: 'val_ner_token_f1' reached 0.86115 (best 0.86115), saving model to '/home/ci/autogluon/docs/tutorials/multimodal/text_prediction/tmp/d238f7752c7740f98b954242aa9ffb85-automm_ner/epoch=2-step=207.ckpt' as top 3
Time limit reached. Elapsed time is 0:05:00. Signaling Trainer to stop.
Epoch 3, global step 222: 'val_ner_token_f1' reached 0.86268 (best 0.86268), saving model to '/home/ci/autogluon/docs/tutorials/multimodal/text_prediction/tmp/d238f7752c7740f98b954242aa9ffb85-automm_ner/epoch=3-step=222.ckpt' as top 3
Start to fuse 3 checkpoints via the greedy soup algorithm.
Using default `ModelCheckpoint`. Consider installing `litmodels` package to enable `LitModelCheckpoint` for automatic upload to the Lightning model registry.
AutoMM has created your model. 🎉🎉🎉

To load the model, use the code below:
    ```python
    from autogluon.multimodal import MultiModalPredictor
    predictor = MultiModalPredictor.load("/home/ci/autogluon/docs/tutorials/multimodal/text_prediction/tmp/d238f7752c7740f98b954242aa9ffb85-automm_ner")
    ```

If you are not satisfied with the model, try to increase the training time, 
adjust the hyperparameters (https://autogluon.cn/stable/tutorials/multimodal/advanced_topics/customization.html),
or post issues on GitHub (https://github.com/autogluon/autogluon/issues).
<autogluon.multimodal.predictor.MultiModalPredictor at 0x7fbbb5c5c610>

Evaluation

Evaluation is also straightforward. We use seqeval for NER evaluation, and the supported metrics are overall_recall, overall_precision, overall_f1, and overall_accuracy. If you are interested in the performance of a specific entity group, you can use the entity group name as an evaluation metric to obtain that group's performance (precision, recall, F1 score).

predictor.evaluate(test_data,  metrics=['overall_recall', "overall_precision", "overall_f1", "actor"])
{'overall_recall': np.float64(0.84941000187301),
 'overall_precision': np.float64(0.8322628005138557),
 'overall_f1': np.float64(0.8407489803485353),
 'actor': {'precision': np.float64(0.8325942350332595),
  'recall': np.float64(0.9248768472906403),
  'f1': np.float64(0.8763127187864643),
  'number': np.int64(812)}}
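
The returned object is a plain dict, with per-entity-group results nested as shown above, so individual scores can be pulled out directly, e.g.:

scores = predictor.evaluate(test_data, metrics=["overall_f1", "actor"])
print(f"overall F1: {scores['overall_f1']:.4f}")
print(f"ACTOR F1: {scores['actor']['f1']:.4f} (over {scores['actor']['number']} entities)")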

Prediction + Visualization

You can easily obtain predictions for given input sentences by calling predictor.predict(). If you are running the code in a Jupyter notebook, you can also use the visualize_ner function to conveniently visualize the predictions, which highlights the named entities and their labels in the text.

from autogluon.multimodal.utils import visualize_ner

sentence = "Game of Thrones is an American fantasy drama television series created by David Benioff"
predictions = predictor.predict({'text_snippet': [sentence]})
print('Predicted entities:', predictions[0])

# Visualize
visualize_ner(sentence, predictions[0])
Predicted entities: [{'entity_group': 'TITLE', 'start': np.int32(0), 'end': np.int32(15)}, {'entity_group': 'GENRE', 'start': np.int32(22), 'end': np.int32(30)}, {'entity_group': 'GENRE', 'start': np.int32(31), 'end': np.int32(44)}, {'entity_group': 'DIRECTOR', 'start': np.int32(74), 'end': np.int32(87)}]
Game of Thrones TITLE is an American GENRE fantasy drama GENRE television series created by David Benioff DIRECTOR
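
predictor.predict() also accepts a whole dataframe, so you can annotate many sentences at once. For example (reusing test_data from above, and assuming one annotation list is returned per row, as in the single-sentence call):

# Batch prediction over the first few test rows.
batch_predictions = predictor.predict(test_data.head(3))
for text, ents in zip(test_data["text_snippet"].head(3), batch_predictions):
    print(text, "->", ents)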

Prediction Probabilities

You can also output the probabilities for a deeper analysis of the predictions.

predictions = predictor.predict_proba({'text_snippet': [sentence]})
print(predictions[0][0]['probability'])
{'O': np.float16(0.1353), 'B-RATINGS_AVERAGE': np.float16(0.005863), 'I-SONG': np.float16(0.0007167), 'B-REVIEW': np.float16(0.003164), 'B-GENRE': np.float16(0.003277), 'B-TITLE': np.float16(0.547), 'I-TRAILER': np.float16(0.000465), 'B-RATING': np.float16(0.0021), 'I-ACTOR': np.float16(0.001183), 'I-CHARACTER': np.float16(0.001431), 'B-DIRECTOR': np.float16(0.003176), 'B-SONG': np.float16(0.01203), 'I-GENRE': np.float16(0.0003023), 'B-PLOT': np.float16(0.1835), 'I-YEAR': np.float16(0.001536), 'I-RATINGS_AVERAGE': np.float16(0.0006504), 'I-DIRECTOR': np.float16(0.001287), 'B-YEAR': np.float16(0.0006814), 'I-RATING': np.float16(0.000454), 'I-TITLE': np.float16(0.01945), 'B-CHARACTER': np.float16(0.01915), 'B-ACTOR': np.float16(0.003555), 'I-REVIEW': np.float16(0.000772), 'I-PLOT': np.float16(0.05206), 'B-TRAILER': np.float16(0.0007)}
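
Each token entry maps BIO-style labels to their probabilities, as shown above, so the most likely label can be recovered with a simple argmax, e.g.:

proba = predictor.predict_proba({'text_snippet': [sentence]})
token_probs = proba[0][0]['probability']  # label -> probability for the first token
best_label = max(token_probs, key=token_probs.get)
print(best_label, token_probs[best_label])  # B-TITLE for this sentence's first token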

Reloading and Continuous Training

The trained predictor is automatically saved, and you can easily reload it from its path. If you are not satisfied with the current model performance, you can continue training the loaded model with new data.

new_predictor = MultiModalPredictor.load(model_path)
new_model_path = f"./tmp/{uuid.uuid4().hex}-automm_ner_continue_train"
new_predictor.fit(train_data, time_limit=60, save_path=new_model_path)
test_score = new_predictor.evaluate(test_data, metrics=['overall_f1', 'ACTOR'])
print(test_score)
Load pretrained checkpoint: /home/ci/autogluon/docs/tutorials/multimodal/text_prediction/tmp/d238f7752c7740f98b954242aa9ffb85-automm_ner/model.ckpt
=================== System Info ===================
AutoGluon Version:  1.3.1b20250508
Python Version:     3.11.9
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #1 SMP Wed Mar 12 14:53:59 UTC 2025
CPU Count:          8
Pytorch Version:    2.6.0+cu124
CUDA Version:       12.4
Memory Avail:       26.76 GB / 30.95 GB (86.5%)
Disk Space Avail:   175.79 GB / 255.99 GB (68.7%)
===================================================

AutoMM starts to create your model. ✨✨✨

To track the learning progress, you can open a terminal and launch Tensorboard:
    ```shell
    # Assume you have installed tensorboard
    tensorboard --logdir /home/ci/autogluon/docs/tutorials/multimodal/text_prediction/tmp/91743838931b4ef5a95a63c4beedc33a-automm_ner_continue_train
    ```
Seed set to 0
GPU Count: 1
GPU Count to be Used: 1
Using 16bit Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name              | Type              | Params | Mode 
----------------------------------------------------------------
0 | model             | HFAutoModelForNER | 13.5 M | train
1 | validation_metric | MulticlassF1Score | 0      | train
2 | loss_func         | CrossEntropyLoss  | 0      | train
----------------------------------------------------------------
13.5 M    Trainable params
0         Non-trainable params
13.5 M    Total params
53.959    Total estimated model params size (MB)
230       Modules in train mode
0         Modules in eval mode
Epoch 0, global step 34: 'val_ner_token_f1' reached 0.86089 (best 0.86089), saving model to '/home/ci/autogluon/docs/tutorials/multimodal/text_prediction/tmp/91743838931b4ef5a95a63c4beedc33a-automm_ner_continue_train/epoch=0-step=34.ckpt' as top 3
Time limit reached. Elapsed time is 0:01:00. Signaling Trainer to stop.
Epoch 0, global step 45: 'val_ner_token_f1' reached 0.86089 (best 0.86089), saving model to '/home/ci/autogluon/docs/tutorials/multimodal/text_prediction/tmp/91743838931b4ef5a95a63c4beedc33a-automm_ner_continue_train/epoch=0-step=45.ckpt' as top 3
Start to fuse 2 checkpoints via the greedy soup algorithm.
Using default `ModelCheckpoint`. Consider installing `litmodels` package to enable `LitModelCheckpoint` for automatic upload to the Lightning model registry.
AutoMM has created your model. 🎉🎉🎉

To load the model, use the code below:
    ```python
    from autogluon.multimodal import MultiModalPredictor
    predictor = MultiModalPredictor.load("/home/ci/autogluon/docs/tutorials/multimodal/text_prediction/tmp/91743838931b4ef5a95a63c4beedc33a-automm_ner_continue_train")
    ```

If you are not satisfied with the model, try to increase the training time, 
adjust the hyperparameters (https://autogluon.cn/stable/tutorials/multimodal/advanced_topics/customization.html),
or post issues on GitHub (https://github.com/autogluon/autogluon/issues).
{'overall_f1': np.float64(0.8388051419587534), 'ACTOR': {'precision': np.float64(0.8280898876404494), 'recall': np.float64(0.9076354679802956), 'f1': np.float64(0.8660399529964747), 'number': np.int64(812)}}

Other Examples

You can visit the AutoMM Examples page to explore other examples about AutoMM.

Customization

To learn how to customize AutoMM, refer to Customize AutoMM.