Classifying PDF Documents with AutoMM¶
PDF, short for Portable Document Format, is one of the most popular document formats. PDFs are everywhere, from personal resumes to business contracts, and from commercial brochures to government documents. The list can go on and on. PDFs are prized for their portability: no matter what operating system or device the recipient uses, you don't have to worry that they cannot open the document or will see an imperfect rendering of it.
With AutoMM, you can handle and build machine learning models on PDF documents just like other modalities such as text and images, without worrying about the details of PDF processing. In this tutorial, we will introduce how to classify PDF documents automatically with AutoMM using document foundation models. Let's get started!
For document processing, AutoGluon requires poppler to be installed. See https://poppler.freedesktop.org for the source code.
For the Windows release, see https://github.com/oschwartz10612/poppler-windows (make sure to add the bin/ folder to the PATH environment variable after installing).
For Mac: brew install poppler
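If you are unsure whether poppler is already available in your environment, a minimal check like the sketch below can help. It assumes that the PDF-to-image conversion relies on poppler's command-line tools (such as pdftoppm and pdfinfo) being on your PATH, which is the typical setup.
```python
import shutil

# Check whether poppler's command-line tools are discoverable on PATH.
for tool in ("pdftoppm", "pdfinfo"):
    location = shutil.which(tool)
    print(f"{tool}: {location if location else 'NOT found -- please install poppler'}")
```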
Get the PDF Document Dataset¶
We have put together a simple PDF dataset via manual crawling for demonstration purposes. It contains two categories of documents: resumes and historical documents (downloaded from Milestone Documents). We picked 20 PDF documents for each category.
Now, let's download the dataset and split it into training and test sets.
import warnings
warnings.filterwarnings('ignore')
import os
import pandas as pd
from autogluon.core.utils.loaders import load_zip

# Download and unzip the demo PDF dataset
download_dir = './ag_automm_tutorial_pdf_classifier'
zip_file = "https://automl-mm-bench.s3.amazonaws.com/doc_classification/pdf_docs_small.zip"
load_zip.unzip(zip_file, unzip_dir=download_dir)
dataset_path = os.path.join(download_dir, "pdf_docs_small")

# Read the metadata CSV and split it into training (80%) and test (20%) sets
pdf_docs = pd.read_csv(f"{dataset_path}/data.csv")
train_data = pdf_docs.sample(frac=0.8, random_state=200)
test_data = pdf_docs.drop(train_data.index)
Downloading ./ag_automm_tutorial_pdf_classifier/file.zip from https://automl-mm-bench.s3.amazonaws.com/doc_classification/pdf_docs_small.zip...
100%|██████████| 12.7M/12.7M [00:00<00:00, 116MiB/s]
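Before moving on, it is worth a quick sanity check that the split looks reasonable. The snippet below only uses columns already present in data.csv and simply prints the size and label distribution of each split.
```python
# Sanity check: sizes and label distribution of the train/test split
print(f"Train rows: {len(train_data)}, Test rows: {len(test_data)}")
print(train_data["label"].value_counts())
print(test_data["label"].value_counts())
```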
Now, let's visualize one of the PDF documents. Here, we use the S3 URL of the PDF document and IFrame to display it in the tutorial.
from IPython.display import IFrame
IFrame("https://automl-mm-bench.s3.amazonaws.com/doc_classification/historical_1.pdf", width=400, height=500)
As you can see, this document is a United States historical document in PDF format. To make sure MultiModalPredictor can locate the documents correctly, we need to overwrite the document paths.
from autogluon.multimodal.utils.misc import path_expander

DOC_PATH_COL = "doc_path"
# Convert the relative document paths in the CSV into absolute paths under the download directory
train_data[DOC_PATH_COL] = train_data[DOC_PATH_COL].apply(lambda ele: path_expander(ele, base_folder=download_dir))
test_data[DOC_PATH_COL] = test_data[DOC_PATH_COL].apply(lambda ele: path_expander(ele, base_folder=download_dir))
print(test_data.head())
doc_path label
4 /home/ci/autogluon/docs/tutorials/multimodal/d... resume
12 /home/ci/autogluon/docs/tutorials/multimodal/d... resume
14 /home/ci/autogluon/docs/tutorials/multimodal/d... resume
15 /home/ci/autogluon/docs/tutorials/multimodal/d... resume
16 /home/ci/autogluon/docs/tutorials/multimodal/d... resume
Create a PDF Document Classifier¶
You can create a PDF classifier easily with MultiModalPredictor. All you need to do is create a predictor and fit it with the training dataset above. AutoMM will take care of all the details, e.g., (1) detecting whether the dataset is in PDF format; (2) processing the PDFs, for example converting them into a format the model can recognize; (3) detecting and recognizing the text in the PDF documents; and so on, without requiring your attention.
Here, label is the name of the column that contains the target variable to predict; in our example it is "label". We set the training time limit to 120 seconds for demonstration purposes.
from autogluon.multimodal import MultiModalPredictor

predictor = MultiModalPredictor(label="label")
predictor.fit(
    train_data=train_data,
    hyperparameters={
        # Use the LayoutLM document foundation model as the backbone
        "model.document_transformer.checkpoint_name": "microsoft/layoutlm-base-uncased",
        # Use the single best checkpoint rather than averaging the top-k checkpoints
        "optim.top_k_average_method": "best",
    },
    time_limit=120,
)
No path specified. Models will be saved in: "AutogluonModels/ag-20250508_210009"
=================== System Info ===================
AutoGluon Version: 1.3.1b20250508
Python Version: 3.11.9
Operating System: Linux
Platform Machine: x86_64
Platform Version: #1 SMP Wed Mar 12 14:53:59 UTC 2025
CPU Count: 8
Pytorch Version: 2.6.0+cu124
CUDA Version: 12.4
Memory Avail: 28.41 GB / 30.95 GB (91.8%)
Disk Space Avail: 184.85 GB / 255.99 GB (72.2%)
===================================================
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
2 unique label values: ['historical', 'resume']
If 'binary' is not the correct problem_type, please manually specify the problem_type parameter during Predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression', 'quantile'])
AutoMM starts to create your model. ✨✨✨
To track the learning progress, you can open a terminal and launch Tensorboard:
```shell
# Assume you have installed tensorboard
tensorboard --logdir /home/ci/autogluon/docs/tutorials/multimodal/document_prediction/AutogluonModels/ag-20250508_210009
```
INFO: Seed set to 0
The model does not support using an image size that is different from the default size. Provided image size=224. Default size=None. Detailed model configuration=LayoutLMConfig {
"_name_or_path": "microsoft/layoutlm-base-uncased",
"attention_probs_dropout_prob": 0.1,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"initializer_range": 0.02,
"intermediate_size": 3072,
"layer_norm_eps": 1e-12,
"max_2d_position_embeddings": 1024,
"max_position_embeddings": 512,
"model_type": "layoutlm",
"num_attention_heads": 12,
"num_hidden_layers": 12,
"output_past": true,
"pad_token_id": 0,
"position_embedding_type": "absolute",
"transformers_version": "4.49.0",
"type_vocab_size": 2,
"use_cache": true,
"vocab_size": 30522
}
. We have ignored the provided image size.
GPU Count: 1
GPU Count to be Used: 1
INFO: Using 16bit Automatic Mixed Precision (AMP)
INFO: GPU available: True (cuda), used: True
INFO: TPU available: False, using: 0 TPU cores
INFO: HPU available: False, using: 0 HPUs
INFO: LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:
| Name | Type | Params | Mode
------------------------------------------------------------------
0 | model | DocumentTransformer | 112 M | train
1 | validation_metric | BinaryAUROC | 0 | train
2 | loss_func | CrossEntropyLoss | 0 | train
------------------------------------------------------------------
112 M Trainable params
0 Non-trainable params
112 M Total params
450.518 Total estimated model params size (MB)
236 Modules in train mode
0 Modules in eval mode
INFO: Epoch 0, global step 1: 'val_roc_auc' reached 0.50000 (best 0.50000), saving model to '/home/ci/autogluon/docs/tutorials/multimodal/document_prediction/AutogluonModels/ag-20250508_210009/epoch=0-step=1.ckpt' as top 3
INFO: Epoch 1, global step 2: 'val_roc_auc' reached 0.50000 (best 0.50000), saving model to '/home/ci/autogluon/docs/tutorials/multimodal/document_prediction/AutogluonModels/ag-20250508_210009/epoch=1-step=2.ckpt' as top 3
INFO: Epoch 2, global step 3: 'val_roc_auc' reached 0.50000 (best 0.50000), saving model to '/home/ci/autogluon/docs/tutorials/multimodal/document_prediction/AutogluonModels/ag-20250508_210009/epoch=2-step=3.ckpt' as top 3
INFO: Epoch 3, global step 4: 'val_roc_auc' was not in top 3
INFO: Epoch 4, global step 5: 'val_roc_auc' was not in top 3
AutoMM has created your model. 🎉🎉🎉
To load the model, use the code below:
```python
from autogluon.multimodal import MultiModalPredictor
predictor = MultiModalPredictor.load("/home/ci/autogluon/docs/tutorials/multimodal/document_prediction/AutogluonModels/ag-20250508_210009")
```
If you are not satisfied with the model, try to increase the training time,
adjust the hyperparameters (https://autogluon.cn/stable/tutorials/multimodal/advanced_topics/customization.html),
or post issues on GitHub (https://github.com/autogluon/autogluon/issues).
<autogluon.multimodal.predictor.MultiModalPredictor at 0x7f1c15efa7d0>
Evaluate on the Test Dataset¶
You can evaluate the classifier on the test dataset to see how it performs:
scores = predictor.evaluate(test_data, metrics=["accuracy"])
print('The test acc: %.3f' % scores["accuracy"])
The test acc: 0.375
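evaluate also accepts a list of several metric names if you want to look at more than accuracy. As a sketch (the exact set of supported metric names can depend on the problem type):
```python
# Evaluate with additional metrics; "f1" and "roc_auc" are common choices
# for binary classification, but support may vary by problem type.
scores = predictor.evaluate(test_data, metrics=["accuracy", "f1", "roc_auc"])
print(scores)
```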
Predict on a New PDF Document¶
Given an example PDF document, we can easily use the final model to predict the label:
predictions = predictor.predict({DOC_PATH_COL: [test_data.iloc[0][DOC_PATH_COL]]})
print(f"Ground-truth label: {test_data.iloc[0]['label']}, Prediction: {predictions}")
Ground-truth label: resume, Prediction: ['historical']
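predict also works on a whole DataFrame, so you can score the entire test set in one call:
```python
# Predict labels for all test documents at once
all_predictions = predictor.predict(test_data)
print(all_predictions[:5])
```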
If probabilities of all classes are needed, you can call predict_proba:
proba = predictor.predict_proba({DOC_PATH_COL: [test_data.iloc[0][DOC_PATH_COL]]})
print(proba)
[[0.9863369 0.01366313]]
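The columns of the probability array follow the predictor's class order. To make the mapping explicit, you can ask for the result as a pandas DataFrame; this sketch assumes predict_proba supports the as_pandas flag, as elsewhere in AutoGluon:
```python
# Sketch: probabilities as a DataFrame whose columns are the class names
# (assumes predict_proba accepts as_pandas, as in other AutoGluon predictors).
proba_df = predictor.predict_proba(
    {DOC_PATH_COL: [test_data.iloc[0][DOC_PATH_COL]]},
    as_pandas=True,
)
print(proba_df)
```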
Extract Embeddings¶
Extracting the representation (embedding) of a whole document learned by the model is also very useful. We provide the extract_embedding function so that the predictor returns an N-dimensional document feature, where N depends on the model.
feature = predictor.extract_embedding({DOC_PATH_COL: [test_data.iloc[0][DOC_PATH_COL]]})
print(feature[0].shape)
(768,)
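These embeddings can be fed into downstream tasks such as retrieval or clustering. For example, assuming extract_embedding returns one feature vector per input document (consistent with the call above), a cosine similarity between two test documents can be computed with plain NumPy:
```python
import numpy as np

# Extract embeddings for the first two test documents and compare them.
emb = predictor.extract_embedding({DOC_PATH_COL: test_data.iloc[:2][DOC_PATH_COL].tolist()})
v1, v2 = np.asarray(emb[0]), np.asarray(emb[1])
cos_sim = float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))
print(f"Cosine similarity between the first two test documents: {cos_sim:.3f}")
```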
Other Examples¶
You can go to AutoMM Examples to explore other examples of AutoMM.
Customization¶
To learn how to customize AutoMM, please refer to Customize AutoMM.