# ART: Fine-Tuning Multimodal Large Language Models with Art-based Reinforcement Training

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-06-10 17:30
- AIHOT score: 67
- AIHOT link: https://aihot.virxact.com/items/cmq9dxsgb0b12slld6h2y86k3
- Original link: https://arxiv.org/abs/2606.11854

## AI Summary

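ART (Art-based Reinforcement Training) is a parameter-efficient fine-tuning method that injects information into a frozen multimodal large language model by optimizing only its raw visual input (a pixel array). Because it does not modify the precompiled computational graph, it can run as a form of soft prompting on high-performance inference engines such as vLLM. ART supports arbitrary fine-tuning objectives, and the optimized visual input can be stylized as computational artwork. On open Qwen-architecture models of different sizes, ART reaches accuracy comparable to LoRA on mathematics and structured-tool-use benchmarks.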

## Main Text

There are two main Parameter-Efficient Fine-Tuning (PEFT) techniques for Large Language Models (LLMs). Low-Rank Adaptation (LoRA) introduces additional weights between the LLM layers, while Soft Prompting adds fine-tuning-specific raw tokens to the LLM input. However, both require modifying the computational graphs of precompiled, preoptimized LLMs. As a result, neither is fully supported in high-throughput engines like vLLM. We propose fine-tuning with ART (Art-based Reinforcement Training). The method injects information into a frozen Multimodal Large Language Model (MLLM) by optimizing only its raw visual input, which enables the soft-token approach on precompiled computational graphs. It relies on backpropagating gradients into a plain pixel array and therefore supports any fine-tuning objective. Moreover, the optimized visual input can be stylized as task-relevant computational artworks. The approach's effectiveness is confirmed for different sizes of a popular open Qwen architecture and for several textual benchmarks. Specifically, ART reaches accuracy competitive with LoRA across mathematics and structured-tool-use benchmarks.
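The core idea of optimizing only the input pixels while the model stays frozen can be illustrated with a toy sketch. This is not the authors' implementation: the "model" here is a hypothetical fixed linear layer with a softmax standing in for a frozen MLLM, the objective is an assumed cross-entropy loss on one target class, and names such as `frozen_model` and `optimize_pixels` are made up for illustration. The gradient is derived by hand so that the example needs only the Python standard library.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def frozen_model(weights, pixels):
    """Toy stand-in for a frozen MLLM: a fixed linear map from pixels to logits."""
    return [sum(w * x for w, x in zip(row, pixels)) for row in weights]


def optimize_pixels(weights, target, num_pixels, steps=500, lr=0.04):
    """Gradient descent on the pixel array only; the weights are never updated."""
    pixels = [0.5] * num_pixels  # start from a uniform gray "image"
    for _ in range(steps):
        probs = softmax(frozen_model(weights, pixels))
        # Cross-entropy gradient w.r.t. the logits: probs - onehot(target).
        dlogits = [p - (1.0 if k == target else 0.0) for k, p in enumerate(probs)]
        # Chain rule back to the input: grad[j] = sum_k dlogits[k] * W[k][j].
        grad = [
            sum(dlogits[k] * weights[k][j] for k in range(len(weights)))
            for j in range(num_pixels)
        ]
        # Take a step, then clamp so the result is still a valid image in [0, 1].
        pixels = [min(1.0, max(0.0, x - lr * g)) for x, g in zip(pixels, grad)]
    return pixels


random.seed(0)
NUM_CLASSES, NUM_PIXELS, TARGET = 3, 16, 2  # hypothetical toy sizes
WEIGHTS = [
    [random.uniform(-1.0, 1.0) for _ in range(NUM_PIXELS)]
    for _ in range(NUM_CLASSES)
]

initial_prob = softmax(frozen_model(WEIGHTS, [0.5] * NUM_PIXELS))[TARGET]
tuned_pixels = optimize_pixels(WEIGHTS, TARGET, NUM_PIXELS)
tuned_prob = softmax(frozen_model(WEIGHTS, tuned_pixels))[TARGET]
print(f"target probability: {initial_prob:.3f} -> {tuned_prob:.3f}")
```

In a real setting the linear layer would be replaced by the full frozen MLLM and the hand-written gradient by automatic differentiation, but the structure would stay the same: the only trainable tensor is the image, so the tuned result can be served as an ordinary image input without touching the model's computational graph.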