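AutoMM for Named Entity Recognition - Quick Start¶
Named entity recognition (NER) refers to identifying and categorizing key information (entities) in unstructured text. An entity can be a single word or a sequence of words that corresponds to a category such as cities, time expressions, monetary values, facilities, persons, or organizations. An NER model usually takes an unannotated block of text as input and outputs an annotated block of text in which the named entities are highlighted with predefined categories. For example, given the following sentence,
Albert Einstein was born in Germany and is widely acknowledged to be one of the greatest physicists.
the model will tell you that "Albert Einstein" is a PERSON and "Germany" is a LOCATION. In the following, we will introduce how to use AutoMM for the NER task, including how to prepare your data, how to train your model, and what the model outputs look like.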
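Prepare Your Data¶
As with other tasks in AutoMM, all you need to do is prepare your data as data tables (i.e., dataframes) that contain a text column and an annotation column. The text column stores the raw text that contains the entities you want to identify. The annotation column stores the label information for those entities: the entity category and the character-level start/end offsets. AutoMM requires the annotation column to follow the JSON format below. (Note: do not forget to call json.dumps() to convert Python objects into a JSON string before creating your dataframe.)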
import json
json.dumps([
{"entity_group": "PERSON", "start": 0, "end": 15},
{"entity_group": "LOCATION", "start": 28, "end": 35}
])
'[{"entity_group": "PERSON", "start": 0, "end": 15}, {"entity_group": "LOCATION", "start": 28, "end": 35}]'
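Here, entity_group is the category of the entity, start is the character-level position where the entity begins, and end is the position where the entity ends. In the example above, end is exclusive: "Albert Einstein" is 15 characters long and spans offsets 0 to 15. To make sure that AutoMM can recognize your JSON annotations, you must use exactly the keys/properties specified above (entity_group, start, end) when constructing your data. You can annotate "Albert Einstein" as a single entity group, or you can assign a label to each word.
The following is an example of visualizing the annotations with the visualize_ner utility.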
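Counting character offsets by hand is error-prone, so here is a minimal sketch of one way to derive them. It assumes each entity string occurs exactly once in the text (only the first occurrence is used); `make_annotation` is a hypothetical helper written for illustration and is not part of AutoMM.

```python
import json


def make_annotation(text, entities):
    """Build an annotation JSON string from (substring, label) pairs.

    Hypothetical helper: only the first occurrence of each substring is used.
    """
    spans = []
    for substring, label in entities:
        start = text.find(substring)
        if start == -1:
            raise ValueError(f"{substring!r} not found in text")
        # `end` is exclusive, so text[start:end] == substring.
        spans.append({"entity_group": label, "start": start, "end": start + len(substring)})
    return json.dumps(spans)


sentence = "Albert Einstein was born in Germany and is widely acknowledged to be one of the greatest physicists."
annotation = make_annotation(sentence, [("Albert Einstein", "PERSON"), ("Germany", "LOCATION")])
print(annotation)
```

The printed string should match the JSON shown above, and it can be stored directly in the annotation column.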
from autogluon.multimodal.utils import visualize_ner
sentence = "Albert Einstein was born in Germany and is widely acknowledged to be one of the greatest physicists."
annotation = [{"entity_group": "PERSON", "start": 0, "end": 15},
{"entity_group": "LOCATION", "start": 28, "end": 35}]
visualize_ner(sentence, annotation)
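If you are already familiar with the NER task, you have probably heard of the BIO (Beginning-Inside-Outside) format. You can adopt this format (it is not compulsory) by adding an I- or B- prefix to each tag to indicate whether the tag is at the beginning of or inside an annotated chunk. For example, you can annotate "Albert" as "B-PERSON" because it is the beginning of the name, and "Einstein" as "I-PERSON" because it is inside the PERSON chunk. You do not need to worry about the O tag, which indicates that a word belongs to no chunk, because AutoMM handles it automatically.
Now, let's look at an example dataset. This dataset is converted from the MIT movies corpus, which provides annotations for entity groups such as actor, character, director, genre, song, title, trailer, and year.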
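As a rough illustration of the BIO convention, the sketch below splits each entity span into one entry per word and prefixes the label with B- or I-. It assumes words are separated by whitespace; `to_bio_annotations` is a hypothetical helper, not an AutoMM function. Words outside any span get no entry, which corresponds to the implicit O tag.

```python
import re


def to_bio_annotations(text, spans):
    """Split each entity span into per-word entries labeled B-<group> or I-<group>."""
    bio = []
    for span in spans:
        chunk = text[span["start"]:span["end"]]
        for i, match in enumerate(re.finditer(r"\S+", chunk)):
            # The first word of a chunk gets B-, every later word gets I-.
            prefix = "B-" if i == 0 else "I-"
            bio.append({
                "entity_group": prefix + span["entity_group"],
                "start": span["start"] + match.start(),
                "end": span["start"] + match.end(),
            })
    return bio


sentence = "Albert Einstein was born in Germany and is widely acknowledged to be one of the greatest physicists."
spans = [
    {"entity_group": "PERSON", "start": 0, "end": 15},
    {"entity_group": "LOCATION", "start": 28, "end": 35},
]
bio_spans = to_bio_annotations(sentence, spans)
for entry in bio_spans:
    print(entry["entity_group"], sentence[entry["start"]:entry["end"]])
```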
from autogluon.core.utils.loaders import load_pd
train_data = load_pd.load('https://automl-mm-bench.s3.amazonaws.com/ner/mit-movies/train_v2.csv')
test_data = load_pd.load('https://automl-mm-bench.s3.amazonaws.com/ner/mit-movies/test_v2.csv')
train_data.head(5)
 | text_snippet | entity_annotations |
---|---|---|
0 | what movies star bruce willis | [{"entity_group": "ACTOR", "start": 17, "end":... |
1 | show me films with drew barrymore from the 1980s | [{"entity_group": "ACTOR", "start": 19, "end":... |
2 | what movies starred both al pacino and robert ... | [{"entity_group": "ACTOR", "start": 25, "end":... |
3 | find me all of the movies that starred harold ... | [{"entity_group": "ACTOR", "start": 39, "end":... |
4 | find me a movie with a quote about baseball in it | [] |
Let's print a row.
print(f"text_snippet: {train_data['text_snippet'][1]}")
print(f"entity_annotations: {train_data['entity_annotations'][1]}")
visualize_ner(train_data['text_snippet'][1], train_data['entity_annotations'][1])
text_snippet: show me films with drew barrymore from the 1980s
entity_annotations: [{"entity_group": "ACTOR", "start": 19, "end": 33}, {"entity_group": "YEAR", "start": 43, "end": 48}]
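To check that the offsets line up with the text, the annotation string can be decoded with json.loads() and each span sliced out of the snippet. This is a small sketch that uses the row printed above as literal values rather than reading them from the dataframe.

```python
import json

text_snippet = "show me films with drew barrymore from the 1980s"
entity_annotations = (
    '[{"entity_group": "ACTOR", "start": 19, "end": 33}, '
    '{"entity_group": "YEAR", "start": 43, "end": 48}]'
)

# Decode the JSON string, then slice the text with each (start, end) pair.
entities = [
    (span["entity_group"], text_snippet[span["start"]:span["end"]])
    for span in json.loads(entity_annotations)
]
print(entities)
```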
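Training¶
Now, let's create a predictor for named entity recognition by setting problem_type to ner and specifying the label column. Then we call predictor.fit() to train the model for five minutes. To achieve reasonable performance in your own applications, we recommend setting a sufficiently long time_limit (e.g., 30 to 60 minutes). You can also specify your backbone model and other hyperparameters through the hyperparameters argument. Here, we save the model to a uniquely named directory whose name ends in "automm_ner". For demonstration purposes, we use the lightweight 'google/electra-small-discriminator' backbone.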
from autogluon.multimodal import MultiModalPredictor
import uuid
label_col = "entity_annotations"
model_path = f"./tmp/{uuid.uuid4().hex}-automm_ner" # You can rename it to the model path you like
predictor = MultiModalPredictor(problem_type="ner", label=label_col, path=model_path)
predictor.fit(
    train_data=train_data,
    hyperparameters={'model.ner_text.checkpoint_name': 'google/electra-small-discriminator'},
    time_limit=300,  # seconds
)
=================== System Info ===================
AutoGluon Version: 1.3.1b20250508
Python Version: 3.11.9
Operating System: Linux
Platform Machine: x86_64
Platform Version: #1 SMP Wed Mar 12 14:53:59 UTC 2025
CPU Count: 8
Pytorch Version: 2.6.0+cu124
CUDA Version: 12.4
Memory Avail: 28.39 GB / 30.95 GB (91.7%)
Disk Space Avail: 175.89 GB / 255.99 GB (68.7%)
===================================================
AutoMM starts to create your model. ✨✨✨
To track the learning progress, you can open a terminal and launch Tensorboard:
```shell
# Assume you have installed tensorboard
tensorboard --logdir /home/ci/autogluon/docs/tutorials/multimodal/text_prediction/tmp/d238f7752c7740f98b954242aa9ffb85-automm_ner
```
Seed set to 0
/home/ci/autogluon/multimodal/src/autogluon/multimodal/models/utils.py:1104: UserWarning: provided text_segment_num: 1 is smaller than google/electra-small-discriminator's default: 2
warnings.warn(
GPU Count: 1
GPU Count to be Used: 1
Using 16bit Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
| Name | Type | Params | Mode
----------------------------------------------------------------
0 | model | HFAutoModelForNER | 13.5 M | train
1 | validation_metric | MulticlassF1Score | 0 | train
2 | loss_func | CrossEntropyLoss | 0 | train
----------------------------------------------------------------
13.5 M Trainable params
0 Non-trainable params
13.5 M Total params
53.959 Total estimated model params size (MB)
230 Modules in train mode
0 Modules in eval mode
Epoch 0, global step 34: 'val_ner_token_f1' reached 0.00000 (best 0.00000), saving model to '/home/ci/autogluon/docs/tutorials/multimodal/text_prediction/tmp/d238f7752c7740f98b954242aa9ffb85-automm_ner/epoch=0-step=34.ckpt' as top 3
Epoch 0, global step 69: 'val_ner_token_f1' reached 0.50879 (best 0.50879), saving model to '/home/ci/autogluon/docs/tutorials/multimodal/text_prediction/tmp/d238f7752c7740f98b954242aa9ffb85-automm_ner/epoch=0-step=69.ckpt' as top 3
Epoch 1, global step 103: 'val_ner_token_f1' reached 0.78140 (best 0.78140), saving model to '/home/ci/autogluon/docs/tutorials/multimodal/text_prediction/tmp/d238f7752c7740f98b954242aa9ffb85-automm_ner/epoch=1-step=103.ckpt' as top 3
Epoch 1, global step 138: 'val_ner_token_f1' reached 0.82854 (best 0.82854), saving model to '/home/ci/autogluon/docs/tutorials/multimodal/text_prediction/tmp/d238f7752c7740f98b954242aa9ffb85-automm_ner/epoch=1-step=138.ckpt' as top 3
Epoch 2, global step 172: 'val_ner_token_f1' reached 0.85274 (best 0.85274), saving model to '/home/ci/autogluon/docs/tutorials/multimodal/text_prediction/tmp/d238f7752c7740f98b954242aa9ffb85-automm_ner/epoch=2-step=172.ckpt' as top 3
Epoch 2, global step 207: 'val_ner_token_f1' reached 0.86115 (best 0.86115), saving model to '/home/ci/autogluon/docs/tutorials/multimodal/text_prediction/tmp/d238f7752c7740f98b954242aa9ffb85-automm_ner/epoch=2-step=207.ckpt' as top 3
Time limit reached. Elapsed time is 0:05:00. Signaling Trainer to stop.
Epoch 3, global step 222: 'val_ner_token_f1' reached 0.86268 (best 0.86268), saving model to '/home/ci/autogluon/docs/tutorials/multimodal/text_prediction/tmp/d238f7752c7740f98b954242aa9ffb85-automm_ner/epoch=3-step=222.ckpt' as top 3
Start to fuse 3 checkpoints via the greedy soup algorithm.
Using default `ModelCheckpoint`. Consider installing `litmodels` package to enable `LitModelCheckpoint` for automatic upload to the Lightning model registry.
Using default `ModelCheckpoint`. Consider installing `litmodels` package to enable `LitModelCheckpoint` for automatic upload to the Lightning model registry.
Using default `ModelCheckpoint`. Consider installing `litmodels` package to enable `LitModelCheckpoint` for automatic upload to the Lightning model registry.
AutoMM has created your model. 🎉🎉🎉
To load the model, use the code below:
```python
from autogluon.multimodal import MultiModalPredictor
predictor = MultiModalPredictor.load("/home/ci/autogluon/docs/tutorials/multimodal/text_prediction/tmp/d238f7752c7740f98b954242aa9ffb85-automm_ner")
```
If you are not satisfied with the model, try to increase the training time,
adjust the hyperparameters (https://autogluon.cn/stable/tutorials/multimodal/advanced_topics/customization.html),
or post issues on GitHub (https://github.com/autogluon/autogluon/issues).
<autogluon.multimodal.predictor.MultiModalPredictor at 0x7fbbb5c5c610>
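Evaluation¶
Evaluation is also straightforward. We use seqeval for NER evaluation, and the supported metrics are overall_recall, overall_precision, overall_f1, and overall_accuracy. If you are interested in the performance on a specific entity group, you can use the entity group name as the evaluation metric, which returns the precision, recall, and F1 score for that group.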
predictor.evaluate(test_data, metrics=['overall_recall', "overall_precision", "overall_f1", "actor"])
Using default `ModelCheckpoint`. Consider installing `litmodels` package to enable `LitModelCheckpoint` for automatic upload to the Lightning model registry.
{'overall_recall': np.float64(0.84941000187301),
'overall_precision': np.float64(0.8322628005138557),
'overall_f1': np.float64(0.8407489803485353),
'actor': {'precision': np.float64(0.8325942350332595),
'recall': np.float64(0.9248768472906403),
'f1': np.float64(0.8763127187864643),
'number': np.int64(812)}}
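To make these metrics concrete, the sketch below computes entity-level precision, recall, and F1 by exact match on (entity_group, start, end) triples. It is an illustrative simplification under that assumption, not seqeval's implementation, and `entity_prf` is a hypothetical helper name.

```python
def entity_prf(gold, predicted):
    """Entity-level precision, recall and F1 using exact (group, start, end) matches."""
    gold_set = {(s["entity_group"], s["start"], s["end"]) for s in gold}
    pred_set = {(s["entity_group"], s["start"], s["end"]) for s in predicted}
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


gold = [
    {"entity_group": "ACTOR", "start": 19, "end": 33},
    {"entity_group": "YEAR", "start": 43, "end": 48},
]
predicted = [
    {"entity_group": "ACTOR", "start": 19, "end": 33},  # exact match
    {"entity_group": "YEAR", "start": 39, "end": 48},   # wrong boundary, counted as an error
    {"entity_group": "TITLE", "start": 8, "end": 13},   # spurious entity
]
scores = entity_prf(gold, predicted)
print(scores)
```

Under exact matching, a prediction with the right label but a wrong boundary counts against both precision and recall, which is why entity-level scores are usually lower than token-level ones.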
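Prediction + Visualization¶
You can easily obtain the predictions for an input sentence by calling predictor.predict(). If you are running the code in a Jupyter notebook, you can also visualize the predictions with the visualize_ner function, which highlights the named entities and their labels in the text.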
from autogluon.multimodal.utils import visualize_ner
sentence = "Game of Thrones is an American fantasy drama television series created by David Benioff"
predictions = predictor.predict({'text_snippet': [sentence]})
print('Predicted entities:', predictions[0])
# Visualize
visualize_ner(sentence, predictions[0])
Predicted entities: [{'entity_group': 'TITLE', 'start': np.int32(0), 'end': np.int32(15)}, {'entity_group': 'GENRE', 'start': np.int32(22), 'end': np.int32(30)}, {'entity_group': 'GENRE', 'start': np.int32(31), 'end': np.int32(44)}, {'entity_group': 'DIRECTOR', 'start': np.int32(74), 'end': np.int32(87)}]
Using default `ModelCheckpoint`. Consider installing `litmodels` package to enable `LitModelCheckpoint` for automatic upload to the Lightning model registry.
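Outside a notebook, the predicted spans can be turned back into text by slicing the sentence. The sketch below groups the entity strings by entity group; the spans are copied from the printed prediction above and written as plain ints so that the example runs without a trained predictor.

```python
import json
from collections import defaultdict

sentence = "Game of Thrones is an American fantasy drama television series created by David Benioff"
# Spans copied from the printed prediction, written as plain Python ints.
predicted_spans = [
    {"entity_group": "TITLE", "start": 0, "end": 15},
    {"entity_group": "GENRE", "start": 22, "end": 30},
    {"entity_group": "GENRE", "start": 31, "end": 44},
    {"entity_group": "DIRECTOR", "start": 74, "end": 87},
]

by_group = defaultdict(list)
for span in predicted_spans:
    # int() also accepts NumPy integer scalars such as np.int32.
    start, end = int(span["start"]), int(span["end"])
    by_group[span["entity_group"]].append(sentence[start:end])

print(json.dumps(by_group, indent=2))
```

Converting the offsets with int() matters if you want to serialize real predictions, because json.dumps() cannot encode NumPy integer scalars directly.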
Prediction Probabilities¶
You can also output the probabilities for a deeper analysis of the predictions.
predictions = predictor.predict_proba({'text_snippet': [sentence]})
print(predictions[0][0]['probability'])
{'O': np.float16(0.1353), 'B-RATINGS_AVERAGE': np.float16(0.005863), 'I-SONG': np.float16(0.0007167), 'B-REVIEW': np.float16(0.003164), 'B-GENRE': np.float16(0.003277), 'B-TITLE': np.float16(0.547), 'I-TRAILER': np.float16(0.000465), 'B-RATING': np.float16(0.0021), 'I-ACTOR': np.float16(0.001183), 'I-CHARACTER': np.float16(0.001431), 'B-DIRECTOR': np.float16(0.003176), 'B-SONG': np.float16(0.01203), 'I-GENRE': np.float16(0.0003023), 'B-PLOT': np.float16(0.1835), 'I-YEAR': np.float16(0.001536), 'I-RATINGS_AVERAGE': np.float16(0.0006504), 'I-DIRECTOR': np.float16(0.001287), 'B-YEAR': np.float16(0.0006814), 'I-RATING': np.float16(0.000454), 'I-TITLE': np.float16(0.01945), 'B-CHARACTER': np.float16(0.01915), 'B-ACTOR': np.float16(0.003555), 'I-REVIEW': np.float16(0.000772), 'I-PLOT': np.float16(0.05206), 'B-TRAILER': np.float16(0.0007)}
Using default `ModelCheckpoint`. Consider installing `litmodels` package to enable `LitModelCheckpoint` for automatic upload to the Lightning model registry.
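The printed dictionary maps each BIO label to a probability. As a small sketch, the most likely label and the top three candidates can be read off as follows; only a few entries are copied from the output above, as plain floats, and the assumption is that the dictionary describes a single predicted unit of the first sentence.

```python
# A few entries copied from the printed distribution, as plain floats.
probability = {
    "O": 0.1353,
    "B-TITLE": 0.547,
    "B-PLOT": 0.1835,
    "I-PLOT": 0.05206,
    "I-TITLE": 0.01945,
    "B-CHARACTER": 0.01915,
}

# Label with the highest probability.
top_label = max(probability, key=probability.get)
# Three most likely labels, highest probability first.
top3 = sorted(probability.items(), key=lambda item: item[1], reverse=True)[:3]
print(top_label)
print(top3)
```

Looking at the runner-up labels shows where the model is uncertain, for example between B-TITLE and B-PLOT here.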
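Reloading and Continuous Training¶
The trained predictor is saved automatically, and you can easily reload it from its path. If you are not satisfied with the current model performance, you can continue training the loaded model with new data.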
new_predictor = MultiModalPredictor.load(model_path)
new_model_path = f"./tmp/{uuid.uuid4().hex}-automm_ner_continue_train"
new_predictor.fit(train_data, time_limit=60, save_path=new_model_path)
test_score = new_predictor.evaluate(test_data, metrics=['overall_f1', 'ACTOR'])
print(test_score)
Load pretrained checkpoint: /home/ci/autogluon/docs/tutorials/multimodal/text_prediction/tmp/d238f7752c7740f98b954242aa9ffb85-automm_ner/model.ckpt
=================== System Info ===================
AutoGluon Version: 1.3.1b20250508
Python Version: 3.11.9
Operating System: Linux
Platform Machine: x86_64
Platform Version: #1 SMP Wed Mar 12 14:53:59 UTC 2025
CPU Count: 8
Pytorch Version: 2.6.0+cu124
CUDA Version: 12.4
Memory Avail: 26.76 GB / 30.95 GB (86.5%)
Disk Space Avail: 175.79 GB / 255.99 GB (68.7%)
===================================================
AutoMM starts to create your model. ✨✨✨
To track the learning progress, you can open a terminal and launch Tensorboard:
```shell
# Assume you have installed tensorboard
tensorboard --logdir /home/ci/autogluon/docs/tutorials/multimodal/text_prediction/tmp/91743838931b4ef5a95a63c4beedc33a-automm_ner_continue_train
```
Seed set to 0
GPU Count: 1
GPU Count to be Used: 1
Using 16bit Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
| Name | Type | Params | Mode
----------------------------------------------------------------
0 | model | HFAutoModelForNER | 13.5 M | train
1 | validation_metric | MulticlassF1Score | 0 | train
2 | loss_func | CrossEntropyLoss | 0 | train
----------------------------------------------------------------
13.5 M Trainable params
0 Non-trainable params
13.5 M Total params
53.959 Total estimated model params size (MB)
230 Modules in train mode
0 Modules in eval mode
Epoch 0, global step 34: 'val_ner_token_f1' reached 0.86089 (best 0.86089), saving model to '/home/ci/autogluon/docs/tutorials/multimodal/text_prediction/tmp/91743838931b4ef5a95a63c4beedc33a-automm_ner_continue_train/epoch=0-step=34.ckpt' as top 3
Time limit reached. Elapsed time is 0:01:00. Signaling Trainer to stop.
Epoch 0, global step 45: 'val_ner_token_f1' reached 0.86089 (best 0.86089), saving model to '/home/ci/autogluon/docs/tutorials/multimodal/text_prediction/tmp/91743838931b4ef5a95a63c4beedc33a-automm_ner_continue_train/epoch=0-step=45.ckpt' as top 3
Start to fuse 2 checkpoints via the greedy soup algorithm.
Using default `ModelCheckpoint`. Consider installing `litmodels` package to enable `LitModelCheckpoint` for automatic upload to the Lightning model registry.
Using default `ModelCheckpoint`. Consider installing `litmodels` package to enable `LitModelCheckpoint` for automatic upload to the Lightning model registry.
AutoMM has created your model. 🎉🎉🎉
To load the model, use the code below:
```python
from autogluon.multimodal import MultiModalPredictor
predictor = MultiModalPredictor.load("/home/ci/autogluon/docs/tutorials/multimodal/text_prediction/tmp/91743838931b4ef5a95a63c4beedc33a-automm_ner_continue_train")
```
If you are not satisfied with the model, try to increase the training time,
adjust the hyperparameters (https://autogluon.cn/stable/tutorials/multimodal/advanced_topics/customization.html),
or post issues on GitHub (https://github.com/autogluon/autogluon/issues).
Using default `ModelCheckpoint`. Consider installing `litmodels` package to enable `LitModelCheckpoint` for automatic upload to the Lightning model registry.
{'overall_f1': np.float64(0.8388051419587534), 'ACTOR': {'precision': np.float64(0.8280898876404494), 'recall': np.float64(0.9076354679802956), 'f1': np.float64(0.8660399529964747), 'number': np.int64(812)}}
Other Examples¶
You can visit the AutoMM Examples page to explore other examples of AutoMM.
Customization¶
To learn how to customize AutoMM, please refer to Customize AutoMM.