ProCUA-SFT Technical Report

2026-06-15 08:00

AI Summary

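ProCUA-SFT is a dataset of 3.1M step-level SFT samples distilled from 93K synthetic trajectories, covering 2,484 application combinations. The data is generated and verified automatically by a single VLM (Kimi-K2.5) in live desktop environments seeded with real content (912 spreadsheets, about 10K presentations, and more). Fine-tuning UI-TARS 7B on the dataset for one epoch reaches a 45.0% success rate on OSWorld, 18.7 percentage points above the base model and over 35% above AgentNet-trained models. A subset has been incorporated into the training data of the Nemotron 3 Nano Omni model.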

Original text · untranslated

Training computer-use agents (CUAs) -- models that interact with graphical desktops through screenshots and keyboard/mouse actions -- requires large-scale, diverse trajectory data collected in full desktop environments. The largest public resource, AgentNet (22.5K human trajectories), leads to negative transfer when used for supervised fine-tuning (SFT): continuing training UI-TARS 7B on AgentNet causes OSWorld success rate to fall from 26.3% to 8-10%. We present ProCUA-SFT, a dataset of 3.1M step-level SFT samples distilled from 93K synthetic trajectories across 2,484 application combinations. The dataset is produced by a fully automated pipeline that (i) synthesizes grounded tasks on live desktops seeded with real-world content -- 912 spreadsheets from SpreadsheetBench, approximately 10K permissively-licensed presentations from Zenodo10K, and multi-application OSWorld configs -- and (ii) verifies each task's feasibility through binary precondition checking before rollout. A single VLM (Kimi-K2.5) serves as goal generator, precondition judge, and trajectory executor, eliminating planner-actor capability gaps. Each trajectory is expanded into step-prefix samples that exactly reproduce the context layout seen at inference time. Fine-tuning UI-TARS 7B on ProCUA-SFT for one epoch yields 45.0% on OSWorld -- an 18.7 percentage-point improvement over the base model and over 35% above AgentNet-trained counterparts. A subset of ProCUA was incorporated into the training data for the Nemotron 3 Nano Omni model, contributing to its computer-use capabilities.
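The abstract says each trajectory is expanded into step-prefix samples but does not give the sample schema. The following is a minimal sketch of one plausible reading, in which step i becomes a training sample whose context is the task plus all earlier steps and whose target is the action at step i. The names `Step`, `Sample`, `expand_to_step_prefixes`, and the `max_history` parameter are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Step:
    """One executed step of a trajectory (hypothetical schema)."""
    screenshot: str  # identifier or path of the screenshot observed at this step
    action: str      # keyboard/mouse action emitted at this step


@dataclass
class Sample:
    """One step-level SFT sample (hypothetical schema)."""
    task: str
    history: List[Tuple[str, str]]  # earlier (screenshot, action) pairs
    screenshot: str                 # current observation
    target_action: str              # action the model is trained to predict


def expand_to_step_prefixes(
    task: str, steps: List[Step], max_history: Optional[int] = None
) -> List[Sample]:
    """Turn an N-step trajectory into N samples, one per step prefix.

    max_history is an assumed knob that truncates the history to the most
    recent steps, mirroring a fixed context window at inference time.
    """
    samples = []
    for i, step in enumerate(steps):
        prior = steps[:i]
        if max_history is not None:
            # Guard against prior[-0:], which would return the whole list.
            prior = prior[-max_history:] if max_history > 0 else []
        samples.append(
            Sample(
                task=task,
                history=[(s.screenshot, s.action) for s in prior],
                screenshot=step.screenshot,
                target_action=step.action,
            )
        )
    return samples
```

Under this reading a trajectory of N steps yields N samples. The reported totals (3.1M samples from 93K trajectories) work out to roughly 33 samples per trajectory on average, which fits a one-sample-per-step expansion, though the paper's exact layout may differ.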

HuggingFace Daily Papers (community trending papers)

Read the original · arxiv.org
