Faster Prediction with TensorRT


TensorRT, built on the NVIDIA CUDA® parallel programming model, enables us to optimize inference by leveraging the libraries, development tools, and technologies of NVIDIA AI, autonomous machines, high-performance computing, and graphics. AutoGluon-MultiModal now integrates with TensorRT through the predictor.optimize_for_inference() interface. This tutorial demonstrates how to use TensorRT to boost inference speed, which helps improve efficiency in deployment environments.

import os
import numpy as np
import time
import warnings
from IPython.display import clear_output
warnings.filterwarnings('ignore')
np.random.seed(123)

Install required packages

Since the tensorrt/onnx/onnxruntime-gpu packages are currently optional dependencies of autogluon.multimodal, we need to make sure they are installed correctly.

try:
    import tensorrt, onnx, onnxruntime
    print(f"tensorrt=={tensorrt.__version__}, onnx=={onnx.__version__}, onnxruntime=={onnxruntime.__version__}")
except ImportError:
    !pip install autogluon.multimodal[tests]
    !pip install -U "tensorrt>=10.0.0b0,<11.0"
    clear_output()

Dataset

For demonstration, we use a simplified and subsampled version of the PetFinder dataset. The task is to predict animals' adoption rates based on their adoption profiles. In this simplified version, the adoption speed is grouped into two categories: 0 (slow) and 1 (fast).

To get started, let's download and prepare the dataset.

download_dir = './ag_automm_tutorial'
zip_file = 'https://automl-mm-bench.s3.amazonaws.com/petfinder_for_tutorial.zip'
from autogluon.core.utils.loaders import load_zip
load_zip.unzip(zip_file, unzip_dir=download_dir)
Downloading ./ag_automm_tutorial/file.zip from https://automl-mm-bench.s3.amazonaws.com/petfinder_for_tutorial.zip...

Next, we will load the CSV files.

import pandas as pd
dataset_path = download_dir + '/petfinder_for_tutorial'
train_data = pd.read_csv(f'{dataset_path}/train.csv', index_col=0)
test_data = pd.read_csv(f'{dataset_path}/test.csv', index_col=0)
label_col = 'AdoptionSpeed'

We need to expand the image paths so they can be loaded during training.

image_col = 'Images'
train_data[image_col] = train_data[image_col].apply(lambda ele: ele.split(';')[0]) # Use the first image for a quick tutorial
test_data[image_col] = test_data[image_col].apply(lambda ele: ele.split(';')[0])

def path_expander(path, base_folder):
    path_l = path.split(';')
    return ';'.join([os.path.abspath(os.path.join(base_folder, path)) for path in path_l])

train_data[image_col] = train_data[image_col].apply(lambda ele: path_expander(ele, base_folder=dataset_path))
test_data[image_col] = test_data[image_col].apply(lambda ele: path_expander(ele, base_folder=dataset_path))

Each animal's adoption profile includes images, a text description, and various tabular features such as age, breed, name, and color.
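To see these modalities side by side, here is a minimal peek at the dataframe (Description and Age are assumed PetFinder column names; adjust to the columns actually present in the CSV):

# Peek at image, text, tabular, and label columns in one view.
# 'Description' and 'Age' are assumed PetFinder column names.
train_data[[image_col, 'Description', 'Age', label_col]].head(3)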

Training

Now let's fit the predictor on the training data. Here we set a tight time budget for a quick demo.

from autogluon.multimodal import MultiModalPredictor
hyperparameters = {
    "optim.max_epochs": 2,
    "model.names": ["numerical_mlp", "categorical_mlp", "timm_image", "hf_text", "fusion_mlp"],
    "model.timm_image.checkpoint_name": "mobilenetv3_small_100",
    "model.hf_text.checkpoint_name": "google/electra-small-discriminator",
    
}
predictor = MultiModalPredictor(label=label_col).fit(
    train_data=train_data,
    hyperparameters=hyperparameters,
    time_limit=120, # seconds
)

clear_output()

Under the hood, AutoMM automatically infers the problem type (classification or regression), detects the data modalities, selects related models from the multimodal model pool, and trains the selected models. If multiple backbones are available, AutoMM appends a late-fusion model (MLP or transformer) on top of them.
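As a quick sanity check on what AutoMM inferred, here is a minimal sketch (problem_type and label are public attributes of MultiModalPredictor; the printed values depend on your data):

# Inspect the inferred task and target column.
print(predictor.problem_type)  # expected "binary" for this two-class task
print(predictor.label)         # "AdoptionSpeed"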

Prediction with the default PyTorch module

Given a multimodal dataframe without the label column, we can predict the labels.

Note that we will use only a small sample of the test data here for benchmarking. Later, we will evaluate on the whole test dataset to assess the accuracy loss.

batch_size = 2
n_trials = 10
sample = test_data.head(batch_size)

# Use first prediction for initialization (e.g., allocating memory)
y_pred = predictor.predict_proba(sample)

pred_time = []
for _ in range(n_trials):
    tic = time.time()
    y_pred = predictor.predict_proba(sample)
    elapsed = time.time()-tic
    pred_time.append(elapsed)
    print(f"elapsed (pytorch): {elapsed*1000:.1f} ms (batch_size={batch_size})")
elapsed (pytorch): 387.8 ms (batch_size=2)
elapsed (pytorch): 395.8 ms (batch_size=2)
elapsed (pytorch): 395.9 ms (batch_size=2)
elapsed (pytorch): 395.5 ms (batch_size=2)
elapsed (pytorch): 391.1 ms (batch_size=2)
elapsed (pytorch): 394.5 ms (batch_size=2)
elapsed (pytorch): 393.3 ms (batch_size=2)
elapsed (pytorch): 395.3 ms (batch_size=2)
elapsed (pytorch): 388.7 ms (batch_size=2)
elapsed (pytorch): 401.1 ms (batch_size=2)

Prediction with the TensorRT module

First, let's load a new predictor and optimize it for inference.

model_path = predictor.path
trt_predictor = MultiModalPredictor.load(path=model_path)
trt_predictor.optimize_for_inference()

# Again, use first prediction for initialization (e.g., allocating memory)
y_pred_trt = trt_predictor.predict_proba(sample)

clear_output()

Under the hood, the optimize_for_inference() function generates an onnxruntime-based module that is a drop-in replacement for torch.nn.Module. It replaces the internal torch-based module predictor._model to provide optimized inference.
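For example, you can confirm the swap by inspecting the module type (a minimal sketch; _model is an internal attribute and the exact class name may vary across versions):

# After optimization, the internal module is onnxruntime-based rather
# than a plain torch.nn.Module (internal attribute; subject to change).
print(type(trt_predictor._model).__name__)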

Warning

The optimize_for_inference() function modifies the internal model definition for inference only. Calling predictor.fit() afterwards will raise an error. To retrain the model, reload it with MultiModalPredictor.load.
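If you plan to train again, here is a minimal sketch of the recommended reload (reusing model_path from above):

# Reload a fresh, unmodified predictor from disk before refitting.
predictor_for_refit = MultiModalPredictor.load(path=model_path)
# predictor_for_refit.fit(train_data=train_data, ...)  # now safe to fit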

Then, we can perform prediction or extract embeddings as usual. For a fair comparison of inference speed, we run prediction multiple times here as well.

pred_time_trt = []
for _ in range(n_trials):
    tic = time.time()
    y_pred_trt = trt_predictor.predict_proba(sample)
    elapsed = time.time()-tic
    pred_time_trt.append(elapsed)
    print(f"elapsed (tensorrt): {elapsed*1000:.1f} ms (batch_size={batch_size})")

To verify the correctness of the predictions, we can compare the results side by side.

Let's take a glance at the expected results and the TensorRT results.

y_pred, y_pred_trt

Since mixed precision (FP16) is used by default, there may be some loss of accuracy. We can see that the probabilities are very close, and in most cases it is safe to assume the results are reasonably close. For more details, see the Reduced Precision section of the TensorRT Developer Guide.

np.testing.assert_allclose(y_pred, y_pred_trt, atol=0.01)

Visualize inference speed

We can compute the inference speed by dividing the batch size by the prediction time.

infer_speed = batch_size/np.mean(pred_time)
infer_speed_trt = batch_size/np.mean(pred_time_trt)

Then, we can visualize the speedup.

import matplotlib.pyplot as plt
fig, ax = plt.subplots()
fig.set_figheight(1.5)
ax.barh(["PyTorch", "TensorRT"], [infer_speed, infer_speed_trt])
ax.annotate(f"{infer_speed:.1f} rows/s", xy=(infer_speed, 0))
ax.annotate(f"{infer_speed_trt:.1f} rows/s", xy=(infer_speed_trt, 1))
_ = plt.xlabel('Inference Speed (rows per second)')

Compare evaluation metrics

Now that optimize_for_inference() gives us better inference speed, does it come at the cost of accuracy?

Let's start with an evaluation over the whole test dataset.

metric = predictor.evaluate(test_data)
metric_trt = trt_predictor.evaluate(test_data)
clear_output()
metric_df = pd.DataFrame.from_dict({"PyTorch": metric, "TensorRT": metric_trt})
metric_df

The evaluation results are expected to be very close.
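To quantify the gap, here is a one-line sketch over the dataframe built above (it assumes both evaluations returned the same metric keys):

# Absolute per-metric difference between the PyTorch and TensorRT runs.
(metric_df["PyTorch"] - metric_df["TensorRT"]).abs()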

If there is a significant gap between the evaluation results, try disabling mixed precision by using the CUDA execution provider:

predictor.optimize_for_inference(providers=["CUDAExecutionProvider"])

For a full list of providers, see Execution Providers.
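As an illustration, providers can also be passed in explicit priority order with a CPU fallback (a sketch; the provider names follow onnxruntime's conventions):

# Try TensorRT first, then CUDA, then fall back to CPU.
predictor.optimize_for_inference(
    providers=["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"],
)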

Other Examples

You may go to AutoMM Examples to explore other examples about AutoMM.

Customization

To learn how to customize AutoMM, refer to Customize AutoMM.