Xiaomi releases Xiaomi-GUI-0, a multimodal GUI agent
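Source: arxiv.org
Xiaomi proposes Xiaomi-GUI-0, a native multimodal GUI agent for real mobile environments. The model is trained and evaluated within a real-device closed loop, using a hybrid infrastructure in which physical devices are the primary environment and sandboxes provide auxiliary support. The training data covers high-frequency head tasks, generalization to long-tail intents, and reflection- and memory-enhancement samples, and an error-driven data flywheel turns failure trajectories into corrected actions, reflective explanations, and recovery demonstrations. Training follows a progressive three-stage pipeline of supervised fine-tuning, step-level reinforcement learning, and agentic reinforcement learning. The model reaches a 72.0% success rate on the in-house benchmark RealMobile and 78.9% on AndroidWorld, while substantially improving execution stability and abnormal-state recognition in real tasks.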
Graphical user interface (GUI) agents build on vision-language models to complete user tasks end-to-end in real applications through interface actions such as tapping, swiping, text entry, and navigation. However, existing GUI agents are trained and evaluated largely on offline trajectories, simulated environments, and standardized benchmarks. These differ substantially from real applications in interface layout, interaction logic, and abnormal-state distribution, and cannot faithfully characterize execution stability in real-world use, where account states, permission dialogs, payment authentication, and risk control continually reshape the state distribution and open a persistent gap between benchmark scores and real usability. To close this gap, we propose Xiaomi-GUI-0, a native multimodal GUI agent for real mobile environments, trained and evaluated within a real-device closed loop. At its core is a real-device-dominant hybrid infrastructure, where physical devices are the primary execution environment and sandboxes provide auxiliary support, so that data collection, training, rollout, and evaluation share an execution distribution close to real deployment. We construct multi-source training data spanning high-frequency head tasks, high-generalization data for long-tail intents, and capability-enhancement data for reflection and memory, and introduce an error-driven data flywheel that turns failure trajectories into corrected actions, reflective explanations, and recovery demonstrations. The model is trained through a progressive three-stage pipeline of supervised fine-tuning, step-level reinforcement learning, and agentic reinforcement learning. Evaluated on public benchmarks and our in-house RealMobile, Xiaomi-GUI-0 achieves 72.0% success on RealMobile and 78.9% on AndroidWorld, while substantially improving execution stability and abnormal-state recognition in real-world tasks.
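The abstract describes the error-driven data flywheel only at a high level, so the following is a minimal sketch of the idea rather than the authors' implementation. It assumes a failed trajectory has already been annotated with the index of the first wrong step, a corrected action, a reflection, and a recovery sequence. Every name here (`Step`, `Trajectory`, `Sample`, `flywheel_samples`) and the prompt format are hypothetical illustrations that do not come from the paper.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Step:
    observation: str  # stand-in for a screenshot or UI-state description
    action: str       # e.g. "tap(confirm_button)"


@dataclass
class Trajectory:
    task: str
    steps: List[Step]
    success: bool
    failed_step: Optional[int] = None  # index of the first wrong step, if annotated


@dataclass
class Sample:
    kind: str    # "corrected_action", "reflection", or "recovery"
    prompt: str
    target: str


def flywheel_samples(
    traj: Trajectory,
    corrected_action: str,
    reflection: str,
    recovery_actions: List[str],
) -> List[Sample]:
    """Turn one annotated failure into three training samples (hypothetical format)."""
    # Successful or unannotated trajectories contribute nothing to the flywheel.
    if traj.success or traj.failed_step is None:
        return []

    bad = traj.steps[traj.failed_step]
    context = f"Task: {traj.task}\nScreen: {bad.observation}"
    after_error = f"{context}\nAttempted: {bad.action}"

    return [
        # 1. What should have been done at the failing step.
        Sample("corrected_action", context, corrected_action),
        # 2. Why the attempted action was wrong.
        Sample("reflection", after_error, reflection),
        # 3. How to get back on track after the wrong action was taken.
        Sample("recovery", f"{after_error}\nRecover and continue.",
               " ; ".join(recovery_actions)),
    ]


# Toy example with made-up UI states and actions.
failed = Trajectory(
    task="Turn on Wi-Fi",
    steps=[
        Step("home screen", "open(Settings)"),
        Step("settings list", "tap(Bluetooth)"),  # wrong entry tapped
    ],
    success=False,
    failed_step=1,
)
samples = flywheel_samples(
    failed,
    corrected_action="tap(Wi-Fi)",
    reflection="The task targets Wi-Fi, but the Bluetooth entry was tapped.",
    recovery_actions=["press(back)", "tap(Wi-Fi)", "toggle(Wi-Fi switch)"],
)
```

In this sketch each failure yields one sample per supervision type, so the resulting data could be mixed into a supervised fine-tuning set. How the actual system annotates failures, formats prompts, or weights these samples is not stated in the abstract.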