# Robust-U1: Enabling MLLMs to Self-Recover Corrupted Visual Content for Robust Understanding

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-06-06 08:00
- AIHOT score: 50
- AIHOT link: https://aihot.virxact.com/items/cmqaef40q0kqmslldyy5uor7u
- Original link: https://arxiv.org/abs/2606.08063

## AI Summary

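Robust-U1 proposes an explicit visual self-recovery framework that enables multimodal large language models to repair input images corrupted by real-world noise. The method has three stages: supervised fine-tuning for initial reconstruction; reinforcement learning with dual rewards (pixel-level SSIM and semantic-level CLIP similarity) to align the output with high visual quality; and multimodal reasoning that fuses the corrupted image with the recovered image. It achieves state-of-the-art robustness on a real-world corruption benchmark and maintains superior performance under adversarial corruptions on general VQA benchmarks. Experiments show that high-quality visual recovery directly improves reasoning ability, making self-recovery a key mechanism for robust understanding.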

## Main Text

Multimodal Large Language Models (MLLMs) have demonstrated remarkable success in visual understanding, yet their performance degrades significantly under real-world visual corruptions. Existing robustness-enhancement approaches are limited: black-box feature alignment lacks interpretability, and white-box text-based reasoning cannot restore lost pixel-level details. This work investigates a fundamental research question: can MLLMs recover corrupted visual content by themselves? To address it, we propose Robust-U1, a novel framework that equips MLLMs with an explicit visual self-recovery capability for robust understanding. The approach comprises three core stages: supervised fine-tuning for initial reconstruction, reinforcement learning with dual rewards (pixel-level SSIM and semantic-level CLIP similarity) to align the recovered output with high visual quality, and multimodal reasoning that jointly considers both the corrupted input and the recovered image. Extensive experiments demonstrate that Robust-U1 achieves state-of-the-art robustness on the real-world corruption benchmark and maintains superior performance under adversarial corruptions on general VQA benchmarks. Analysis confirms that high-quality visual recovery directly enhances reasoning performance, establishing self-recovery as a critical mechanism for robust visual understanding. The source code is available at https://github.com/jqtangust/Robust-U1.
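
To make the dual-reward idea concrete, the following is a minimal illustrative sketch, not the authors' implementation. It combines a pixel-level score and a semantic-level score into one scalar reward. Several things here are assumptions not stated in the abstract: the weighted-sum combination and the weights `w_pixel` and `w_semantic`, the use of a single-window (global) SSIM instead of the usual locally windowed average, and the function names `global_ssim`, `cosine_similarity`, and `dual_reward`. The semantic term takes precomputed embedding vectors as a stand-in for CLIP image features; no CLIP model is loaded.

```python
import numpy as np


def global_ssim(x, y, data_range=1.0):
    """Single-window SSIM between two grayscale images of equal shape.

    Assumption: statistics are computed over the whole image, which is a
    simplification of the standard metric that averages over local windows.
    """
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    c1 = (0.01 * data_range) ** 2  # stabilizer for the luminance term
    c2 = (0.03 * data_range) ** 2  # stabilizer for the contrast/structure term
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return float(numerator / denominator)


def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (0.0 if either is zero)."""
    a = np.asarray(a, dtype=np.float64).ravel()
    b = np.asarray(b, dtype=np.float64).ravel()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


def dual_reward(recovered, reference, emb_recovered, emb_reference,
                w_pixel=0.5, w_semantic=0.5):
    """Hypothetical dual reward: weighted sum of pixel and semantic scores.

    emb_recovered / emb_reference stand in for CLIP image embeddings; the
    equal weights are an assumption chosen only for illustration.
    """
    pixel_score = global_ssim(recovered, reference)
    semantic_score = cosine_similarity(emb_recovered, emb_reference)
    return w_pixel * pixel_score + w_semantic * semantic_score


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clean = rng.random((16, 16))
    degraded = np.clip(clean + rng.normal(0.0, 0.3, clean.shape), 0.0, 1.0)
    emb = rng.normal(size=8)  # placeholder embedding, not a real CLIP feature
    print("perfect recovery:", dual_reward(clean, clean, emb, emb))
    print("degraded recovery:", dual_reward(degraded, clean, emb, emb))
```

In a real pipeline the embeddings would come from a CLIP image encoder and the pixel term would more likely use a windowed SSIM such as `skimage.metrics.structural_similarity`. The sketch only shows how a recovery that matches the reference both in pixels and in embedding space scores higher than a degraded one.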