# UniAR: A Shared-Context Visual Tokenizer Is the Key to Unification

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-06-16 08:00
- AIHOT score: 50
- AIHOT link: https://aihot.virxact.com/items/cmqhgiko30479sle119lioyxt
- Original link: https://arxiv.org/abs/2606.18249

## AI Summary

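UniAR proposes a unified multimodal autoregressive framework that uses a single discrete visual tokenizer as the shared bridge between understanding and generation, so the model can directly interpret the visual tokens it generates. The framework combines a pretrained vision encoder, multi-level feature fusion, and lookup-free bitwise quantization, preserving both high-level semantics and low-level details. Parallel bitwise prediction jointly outputs spatially grouped, multi-level visual codes, which shortens the visual sequence and speeds up generation; a diffusion decoder reconstructs high-fidelity images from the discrete tokens. After pre-training, supervised fine-tuning, and reinforcement learning, UniAR achieves state-of-the-art results on image generation and editing and remains competitive on multimodal understanding benchmarks.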

## Main Text

Unified Multimodal Modeling aims to integrate visual understanding and generation within a single system. However, existing approaches typically rely on two disparate visual tokenizers, which splits the representation space and hinders truly unified modeling. We propose UniAR, a unified autoregressive framework where a single discrete visual tokenizer serves as the key bridge between understanding and generation, enabling a shared context in which the model can directly interpret its own generated visual tokens without additional re-encoding. UniAR adapts a pretrained vision encoder with multi-level feature fusion and a lookup-free bitwise quantization scheme, preserving both high-level semantics and low-level details while scaling the effective visual vocabulary at minimal cost. Building on this, the unified autoregressive model adopts parallel-bitwise-prediction to jointly predict spatially grouped, multi-level visual codes, substantially reducing visual sequence length and accelerating generation. Finally, a diffusion-based visual decoder operates on discrete visual tokens to decode high-fidelity images. Through large-scale pre-training, followed by supervised fine-tuning and reinforcement learning, UniAR achieves state-of-the-art performance on image generation and image editing while remaining competitive on multimodal understanding benchmarks. The project page is available at https://sharelab-sii.github.io/uniar-web.
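The abstract does not give implementation details, so the following is only a minimal sketch of the two ideas it names, under stated assumptions. It assumes that lookup-free bitwise quantization means binarizing each latent channel by its sign, so that a `num_bits`-channel latent addresses a vocabulary of `2 ** num_bits` codes without a learned codebook, and that spatial grouping merges `group x group` neighboring positions into one autoregressive step. All function names and the example sizes are hypothetical and are not taken from UniAR.

```python
from typing import List


def quantize_bits(latent: List[float]) -> List[int]:
    """Binarize each channel by sign (assumed rule): 1 if value > 0, else 0."""
    return [1 if value > 0 else 0 for value in latent]


def bits_to_index(bits: List[int]) -> int:
    """Pack bits into an integer token id, with bit i weighted by 2**i."""
    index = 0
    for position, bit in enumerate(bits):
        index |= bit << position
    return index


def index_to_code(index: int, num_bits: int) -> List[float]:
    """Recover the +1/-1 code vector from a token id without any table lookup."""
    return [1.0 if (index >> position) & 1 else -1.0 for position in range(num_bits)]


def vocabulary_size(num_bits: int) -> int:
    """Effective vocabulary grows exponentially with the number of bits."""
    return 2 ** num_bits


def grouped_sequence_length(height: int, width: int, group: int) -> int:
    """Number of autoregressive steps when group x group positions share a step."""
    if height % group or width % group:
        raise ValueError("grid size must be divisible by the group size")
    return (height // group) * (width // group)


def bits_per_step(group: int, levels: int, bits_per_code: int) -> int:
    """Bits predicted in parallel at one step: positions x levels x bits per code."""
    return group * group * levels * bits_per_code
```

For instance, under these assumptions a hypothetical 32 x 32 token grid with `group=2` needs 256 autoregressive steps instead of 1024, while each step predicts `bits_per_step(2, 2, 16)` bits at once when there are two feature levels of 16 bits each. This illustrates the trade the abstract describes: a shorter visual sequence in exchange for a wider per-step prediction.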
