AutoMM for Text - Multilingual Problems


People around the world speak many different languages. According to SIL International's Ethnologue: Languages of the World, there are more than 7,100 spoken and signed languages in use today. In fact, web data nowadays is highly multilingual, and many real-world problems involve text written in languages other than English.

In this tutorial, we introduce how MultiModalPredictor can help you build multilingual models. For the purpose of demonstration, we use the Cross-Lingual Amazon Product Review Sentiment dataset, which contains around 800,000 Amazon product reviews written in four languages: English, German, French, and Japanese. We will demonstrate how to use AutoGluon Text to build sentiment classification models on the German fold of this dataset in two ways:

  • Finetune the German BERT model

  • Cross-lingual transfer from English to German

Note: You are also recommended to check Single GPU Billion-scale Model Training via Parameter-Efficient Finetuning to learn how to achieve better performance with parameter-efficient finetuning.
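For orientation, parameter-efficient finetuning is switched on through a single extra hyperparameter in fit(). The sketch below is a hedged assumption, not something run in this tutorial: the 'optim.peft' key and its 'lora' option are taken from recent AutoGluon versions, so treat the linked tutorial as authoritative; it also reuses the train_de_df loaded in the next section.

from autogluon.multimodal import MultiModalPredictor

# A minimal sketch, assuming your AutoGluon version supports the
# 'optim.peft' hyperparameter (e.g., with the 'lora' option) -- see the
# parameter-efficient finetuning tutorial for the authoritative usage.
peft_predictor = MultiModalPredictor(label='label')
peft_predictor.fit(train_de_df,  # loaded in the "Load Dataset" section below
                   hyperparameters={
                       'optim.peft': 'lora',   # assumed key: train only low-rank adapters
                       'optim.max_epochs': 2,
                   })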

Load Dataset

The Cross-Lingual Amazon Product Review Sentiment dataset contains Amazon product reviews in four languages. Here, we load the English and German folds of the dataset. In the label column, 0 means negative sentiment and 1 means positive sentiment.

!wget --quiet https://automl-mm-bench.s3.amazonaws.com/multilingual-datasets/amazon_review_sentiment_cross_lingual.zip -O amazon_review_sentiment_cross_lingual.zip
!unzip -q -o amazon_review_sentiment_cross_lingual.zip -d .
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

train_de_df = pd.read_csv('amazon_review_sentiment_cross_lingual/de_train.tsv',
                          sep='\t', header=None, names=['label', 'text']) \
                .sample(1000, random_state=123)
train_de_df.reset_index(inplace=True, drop=True)

test_de_df = pd.read_csv('amazon_review_sentiment_cross_lingual/de_test.tsv',
                          sep='\t', header=None, names=['label', 'text']) \
               .sample(200, random_state=123)
test_de_df.reset_index(inplace=True, drop=True)
print(train_de_df)
     label                                               text
0        0  Dieser Film, nur so triefend von Kitsch, ist h...
1        0  Wie so oft: Das Buch begeistert, der Film entt...
2        1  Schon immer versuchten Männer ihre Gefühle geg...
3        1  Wenn man sich durch 10 Minuten Disney-Trailer ...
4        1  Eine echt geile nummer zum Abtanzen und feiern...
..     ...                                                ...
995      0  Ich dachte dies wäre ein richtig spannendes Bu...
996      0  Wer sich den Schrott wirklich noch ansehen möc...
997      0  Sicher, der Film greift ein aktuelles und hoch...
998      1  Dieser Bildband lässt das Herz von Sarah Kay-F...
999      1  ...so das war nun mein drittes Buch von Jenny-...

[1000 rows x 2 columns]
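Before training, it is worth a quick sanity check that the sampled labels are roughly balanced (the same value_counts inspection is used for the Japanese fold later in this tutorial):

# Sanity-check the label balance of the sampled German training data
print(train_de_df['label'].value_counts())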
train_en_df = pd.read_csv('amazon_review_sentiment_cross_lingual/en_train.tsv',
                          sep='\t',
                          header=None,
                          names=['label', 'text']) \
                .sample(1000, random_state=123)
train_en_df.reset_index(inplace=True, drop=True)

test_en_df = pd.read_csv('amazon_review_sentiment_cross_lingual/en_test.tsv',
                          sep='\t',
                          header=None,
                          names=['label', 'text']) \
               .sample(200, random_state=123)
test_en_df.reset_index(inplace=True, drop=True)
print(train_en_df)
     label                                               text
0        0  This is a film that literally sees little wron...
1        0  This music is pretty intelligent, but not very...
2        0  One of the best pieces of rock ever recorded, ...
3        0  Reading the posted reviews here, is like revis...
4        1  I've just finished page 341, the last page. It...
..     ...                                                ...
995      1  This album deserves to be (at least) as popula...
996      1  This book, one of the few that takes a more ac...
997      1  I loved it because it really did show Sagan th...
998      1  Stuart Gordons "DAGON" is a unique horror gem ...
999      0  I've heard Al Lee speak before and thought tha...

[1000 rows x 2 columns]

Finetune the German BERT Model

Our first approach is to finetune the German BERT model pretrained by deepset. Since MultiModalPredictor integrates with Huggingface/Transformers (as explained in Customize AutoMM), we can directly load the German BERT model available in Huggingface/Transformers with the key bert-base-german-cased. To simplify the experiment, we also only finetune for 2 epochs.

from autogluon.multimodal import MultiModalPredictor

predictor = MultiModalPredictor(label='label')
predictor.fit(train_de_df,
              hyperparameters={
                  'model.hf_text.checkpoint_name': 'bert-base-german-cased',
                  'optim.max_epochs': 2
              })
No path specified. Models will be saved in: "AutogluonModels/ag-20250508_210755"
=================== System Info ===================
AutoGluon Version:  1.3.1b20250508
Python Version:     3.11.9
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #1 SMP Wed Mar 12 14:53:59 UTC 2025
CPU Count:          8
Pytorch Version:    2.6.0+cu124
CUDA Version:       12.4
Memory Avail:       28.39 GB / 30.95 GB (91.7%)
Disk Space Avail:   182.33 GB / 255.99 GB (71.2%)
===================================================
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
	2 unique label values:  [np.int64(0), np.int64(1)]
	If 'binary' is not the correct problem_type, please manually specify the problem_type parameter during Predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression', 'quantile'])

AutoMM starts to create your model. ✨✨✨

To track the learning progress, you can open a terminal and launch Tensorboard:
    ```shell
    # Assume you have installed tensorboard
    tensorboard --logdir /home/ci/autogluon/docs/tutorials/multimodal/text_prediction/AutogluonModels/ag-20250508_210755
    ```
Seed set to 0
GPU Count: 1
GPU Count to be Used: 1
Using 16bit Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name              | Type                         | Params | Mode 
---------------------------------------------------------------------------
0 | model             | HFAutoModelForTextPrediction | 109 M  | train
1 | validation_metric | BinaryAUROC                  | 0      | train
2 | loss_func         | CrossEntropyLoss             | 0      | train
---------------------------------------------------------------------------
109 M     Trainable params
0         Non-trainable params
109 M     Total params
436.332   Total estimated model params size (MB)
232       Modules in train mode
0         Modules in eval mode
Epoch 0, global step 3: 'val_roc_auc' reached 0.75566 (best 0.75566), saving model to '/home/ci/autogluon/docs/tutorials/multimodal/text_prediction/AutogluonModels/ag-20250508_210755/epoch=0-step=3.ckpt' as top 3
Epoch 0, global step 7: 'val_roc_auc' reached 0.81676 (best 0.81676), saving model to '/home/ci/autogluon/docs/tutorials/multimodal/text_prediction/AutogluonModels/ag-20250508_210755/epoch=0-step=7.ckpt' as top 3
Epoch 1, global step 10: 'val_roc_auc' reached 0.85797 (best 0.85797), saving model to '/home/ci/autogluon/docs/tutorials/multimodal/text_prediction/AutogluonModels/ag-20250508_210755/epoch=1-step=10.ckpt' as top 3
Epoch 1, global step 14: 'val_roc_auc' reached 0.86193 (best 0.86193), saving model to '/home/ci/autogluon/docs/tutorials/multimodal/text_prediction/AutogluonModels/ag-20250508_210755/epoch=1-step=14.ckpt' as top 3
`Trainer.fit` stopped: `max_epochs=2` reached.
Start to fuse 3 checkpoints via the greedy soup algorithm.
Using default `ModelCheckpoint`. Consider installing `litmodels` package to enable `LitModelCheckpoint` for automatic upload to the Lightning model registry.
AutoMM has created your model. 🎉🎉🎉

To load the model, use the code below:
    ```python
    from autogluon.multimodal import MultiModalPredictor
    predictor = MultiModalPredictor.load("/home/ci/autogluon/docs/tutorials/multimodal/text_prediction/AutogluonModels/ag-20250508_210755")
    ```

If you are not satisfied with the model, try to increase the training time, 
adjust the hyperparameters (https://autogluon.cn/stable/tutorials/multimodal/advanced_topics/customization.html),
or post issues on GitHub (https://github.com/autogluon/autogluon/issues).
<autogluon.multimodal.predictor.MultiModalPredictor at 0x7f31a2ac79d0>
score = predictor.evaluate(test_de_df)
print('Score on the German Testset:')
print(score)
Score on the German Testset:
{'roc_auc': np.float64(0.8506610576923077)}
score = predictor.evaluate(test_en_df)
print('Score on the English Testset:')
print(score)
Score on the English Testset:
{'roc_auc': np.float64(0.594111144608582)}

We can find that the model achieves good performance on the German dataset but performs poorly on the English dataset. Next, we will show how to enable cross-lingual transfer so you can get a model that magically works for both German and English.

Cross-lingual Transfer

In the real world, it is pretty common that you have trained a model for English and would like to extend it to support other languages like German. This setting is also known as cross-lingual transfer. One way to solve the problem is to apply a machine translation model to translate sentences from the other language (e.g., German) into English and then apply the English model. However, as shown in "Unsupervised Cross-lingual Representation Learning at Scale", there is a better and cheaper way for cross-lingual transfer based on large-scale multilingual pretraining. The authors showed that, via large-scale pretraining, the backbone (called XLM-R) is able to conduct zero-shot cross-lingual transfer, meaning that you can directly apply a model trained on the English dataset to datasets in other languages. It even outperforms the baseline "TRANSLATE-TEST", i.e., translating the data from other languages into English and applying the English model.

In AutoGluon, you can just turn on presets="multilingual" in MultiModalPredictor to load a backbone that is suitable for zero-shot transfer. Internally, we will automatically use state-of-the-art models like DeBERTa-V3.

from autogluon.multimodal import MultiModalPredictor

predictor = MultiModalPredictor(label='label')
predictor.fit(train_en_df,
              presets='multilingual',
              hyperparameters={
                  'optim.max_epochs': 2
              })
No path specified. Models will be saved in: "AutogluonModels/ag-20250508_210916"
=================== System Info ===================
AutoGluon Version:  1.3.1b20250508
Python Version:     3.11.9
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #1 SMP Wed Mar 12 14:53:59 UTC 2025
CPU Count:          8
Pytorch Version:    2.6.0+cu124
CUDA Version:       12.4
Memory Avail:       24.69 GB / 30.95 GB (79.8%)
Disk Space Avail:   181.92 GB / 255.99 GB (71.1%)
===================================================
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
	2 unique label values:  [np.int64(0), np.int64(1)]
	If 'binary' is not the correct problem_type, please manually specify the problem_type parameter during Predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression', 'quantile'])

AutoMM starts to create your model. ✨✨✨

To track the learning progress, you can open a terminal and launch Tensorboard:
    ```shell
    # Assume you have installed tensorboard
    tensorboard --logdir /home/ci/autogluon/docs/tutorials/multimodal/text_prediction/AutogluonModels/ag-20250508_210916
    ```
Seed set to 0
GPU Count: 1
GPU Count to be Used: 1
Using bfloat16 Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name              | Type                         | Params | Mode 
---------------------------------------------------------------------------
0 | model             | HFAutoModelForTextPrediction | 278 M  | train
1 | validation_metric | BinaryAUROC                  | 0      | train
2 | loss_func         | CrossEntropyLoss             | 0      | train
---------------------------------------------------------------------------
278 M     Trainable params
0         Non-trainable params
278 M     Total params
1,112.881 Total estimated model params size (MB)
241       Modules in train mode
0         Modules in eval mode
Epoch 0, global step 3: 'val_roc_auc' reached 0.63075 (best 0.63075), saving model to '/home/ci/autogluon/docs/tutorials/multimodal/text_prediction/AutogluonModels/ag-20250508_210916/epoch=0-step=3.ckpt' as top 1
Epoch 0, global step 7: 'val_roc_auc' reached 0.74915 (best 0.74915), saving model to '/home/ci/autogluon/docs/tutorials/multimodal/text_prediction/AutogluonModels/ag-20250508_210916/epoch=0-step=7.ckpt' as top 1
Epoch 1, global step 10: 'val_roc_auc' reached 0.80492 (best 0.80492), saving model to '/home/ci/autogluon/docs/tutorials/multimodal/text_prediction/AutogluonModels/ag-20250508_210916/epoch=1-step=10.ckpt' as top 1
Epoch 1, global step 14: 'val_roc_auc' was not in top 1
`Trainer.fit` stopped: `max_epochs=2` reached.
AutoMM has created your model. 🎉🎉🎉

To load the model, use the code below:
    ```python
    from autogluon.multimodal import MultiModalPredictor
    predictor = MultiModalPredictor.load("/home/ci/autogluon/docs/tutorials/multimodal/text_prediction/AutogluonModels/ag-20250508_210916")
    ```

If you are not satisfied with the model, try to increase the training time, 
adjust the hyperparameters (https://autogluon.cn/stable/tutorials/multimodal/advanced_topics/customization.html),
or post issues on GitHub (https://github.com/autogluon/autogluon/issues).
<autogluon.multimodal.predictor.MultiModalPredictor at 0x7f32bee7f9d0>
score_in_en = predictor.evaluate(test_en_df)
print('Score in the English Testset:')
print(score_in_en)
Score in the English Testset:
{'roc_auc': np.float64(0.8436338056476737)}
score_in_de = predictor.evaluate(test_de_df)
print('Score in the German Testset:')
print(score_in_de)
Score in the German Testset:
{'roc_auc': np.float64(0.8411959134615385)}

We can see that the model works for both German and English!
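Beyond aggregate scores, you can also call the predictor directly on raw text via predict() and predict_proba(). The two review sentences below are made-up examples for illustration, not taken from the dataset:

# Predict labels and class probabilities for a few raw sentences.
# The example reviews below are invented for illustration purposes.
mixed_df = pd.DataFrame({
    'text': [
        'This product exceeded my expectations.',     # English, positive
        'Leider war die Qualität sehr enttäuschend.'  # German, negative
    ]
})
print(predictor.predict(mixed_df))
print(predictor.predict_proba(mixed_df))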

Let's also inspect the model's performance on Japanese:

test_jp_df = pd.read_csv('amazon_review_sentiment_cross_lingual/jp_test.tsv',
                          sep='\t', header=None, names=['label', 'text']) \
               .sample(200, random_state=123)
test_jp_df.reset_index(inplace=True, drop=True)
print(test_jp_df)
     label                                               text
0        1  原作はビクトル・ユーゴの長編小説だが、私が子供の頃読んだのは短縮版の「ああ無情」。それでもこ...
1        1  ほかの作品のレビューにみんな書いているのに、何故この作品について書いている人が一人しかいない...
2        0  一番の問題点は青島が出ていない事でしょう。  TV番組では『芸人が出ていればバラエティだから...
3        0  昔、 りんたろう監督によるアニメ「カムイの剣」があった。  「カムイの剣」…を観た人なら本作...
4        1  以前のアルバムを聴いていないのでなんとも言えないが、クラシックなメタルを聞いてきた耳には、と...
..     ...                                                ...
195      0  原作が面白く、このDVDも期待して観ただけに非常にがっかりしました。  脚本としては単に格闘...
196      0                              フェードインやフェードアウトが多すぎます。
197      0  流通形態云々については特に革命と言う気はしない。  これからもCDは普通に発売されるだろうし...
198      1  もうTVとか、最近の映画とか、観なくていいよ。  脳に楽なエンターテイメントだから。  脳を...
199      0  みんなさんは、手塚治虫先生の「1985への出発」という漫画を読んだことがありますでしょうか?...

[200 rows x 2 columns]
print('Negative label ratio of the Japanese Testset=', test_jp_df['label'].value_counts()[0] / len(test_jp_df))
score_in_jp = predictor.evaluate(test_jp_df)
print('Score in the Japanese Testset:')
print(score_in_jp)
Negative label ratio of the Japanese Testset= 0.575
Score in the Japanese Testset:
{'roc_auc': np.float64(0.6880818414322251)}

Amazingly, the model also works for Japanese!
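The dataset described above also covers French. Assuming the extracted archive includes a fr_test.tsv fold in the same (label, text) TSV format as the others (a hypothetical file name following the de_/en_/jp_ pattern; verify it exists in the archive), evaluating on it follows the exact same recipe:

# A hedged sketch: assumes the archive also ships 'fr_test.tsv'
# in the same (label, text) TSV format as the other language folds.
test_fr_df = pd.read_csv('amazon_review_sentiment_cross_lingual/fr_test.tsv',
                         sep='\t', header=None, names=['label', 'text']) \
               .sample(200, random_state=123)
test_fr_df.reset_index(inplace=True, drop=True)

score_in_fr = predictor.evaluate(test_fr_df)
print('Score in the French Testset:')
print(score_in_fr)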

Other Examples

You may go to AutoMM Examples to explore other examples about AutoMM.

Customization

To learn how to customize AutoMM, please refer to Customize AutoMM.