# Parallel Rollout Approximation (PRA) for Pixel-Space Autoregressive Image Generation

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-06-26 08:00
- AIHOT score: 48
- AIHOT link: https://aihot.virxact.com/items/cmqzb5rvx004usltjy11fyb25
- Original paper: https://arxiv.org/abs/2606.27978

## AI Summary

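Pixel-space continuous-token autoregressive image generation suffers from large single-step errors on high-dimensional patches and from a train-inference gap that lets those errors accumulate. Existing methods only partially mitigate these problems. This paper proposes Parallel Rollout Approximation (PRA), which generates low-dimensional intermediate states and maps them back to pixel tokens with a pixel decoder, and which uses the same path during training to construct inference-like pixel inputs while keeping parallel teacher-forced training. On class-conditional ImageNet-1K generation at 256×256, the 135M-parameter PRA-S reaches an FID of 2.58 and the 511M-parameter PRA-L lowers it to 1.94, a new state of the art among pixel-space AR models, with classification probing accuracy higher than that of other baselines.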

## Main Text

Pixel-space continuous-token autoregressive (AR) generation directly models images as sequences of raw pixel patches, avoiding discrete tokenization or a separately pretrained tokenizer. However, it faces coupled challenges: high-dimensional patch generation causes large single-step errors, and teacher-forced training creates a train-inference gap that makes these errors accumulate across AR steps. Existing fixes such as x-prediction and input noise injection only partially mitigate these issues. Exact rollout training better matches inference-time conditions, but is impractical due to prohibitively slow sequential sampling. We propose Parallel Rollout Approximation (PRA), a scalable framework that addresses both challenges jointly. PRA generates low-dimensional intermediate states instead of high-dimensional pixel patches, then maps them back to pixel-space tokens with a pixel decoder, preserving a pixel-in, pixel-out AR interface. It also constructs inference-like pixel inputs through the same intermediate-state-to-pixel path used at inference, independently across positions, approximating the pixel-feedback interface encountered during inference-time rollout while retaining parallel teacher-forced training. On class-conditional ImageNet-1K generation at 256×256 resolution, PRA-S with 135M parameters achieves an FID of 2.58, surpassing the previous billion-scale pixel-space AR result of 3.60. Scaling to PRA-L with 511M parameters further improves FID to 1.94, establishing a new state of the art among pixel-space AR models. Beyond generation, PRA achieves higher ImageNet classification probing accuracy than other AR and diffusion baselines, suggesting its potential for unified pixel-space image generation and understanding.
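
The contrast between exact rollout and the parallel approximation can be illustrated with a toy sketch. This is not the paper's implementation: `predict_state`, `pixel_decoder`, the dimensions `D_PIX` and `D_STATE`, and the choice to compute each position's intermediate state from the ground-truth prefix are all hypothetical stand-ins, since the abstract does not specify these details. The sketch only shows why per-position construction needs no sequential feedback, while exact rollout does.

```python
from typing import List

Vec = List[float]

D_PIX = 4    # toy pixel-token dimension (assumption; real patches are far larger)
D_STATE = 2  # toy low-dimensional intermediate-state dimension (assumption)


def predict_state(prefix: List[Vec]) -> Vec:
    """Hypothetical AR backbone: pixel-token prefix -> low-dimensional state."""
    n = len(prefix)
    mean = [sum(tok[d] for tok in prefix) / n for d in range(D_PIX)]
    # Fold the D_PIX-dim mean down to D_STATE dims by averaging each half.
    return [(mean[0] + mean[1]) / 2, (mean[2] + mean[3]) / 2]


def pixel_decoder(state: Vec) -> Vec:
    """Hypothetical pixel decoder: low-dimensional state -> pixel token."""
    return [state[0], state[0], state[1], state[1]]


def exact_rollout(start: Vec, steps: int) -> List[Vec]:
    """Sequential rollout: every step consumes the model's own earlier outputs."""
    seq = [start]
    for _ in range(steps):
        seq.append(pixel_decoder(predict_state(seq)))
    return seq[1:]


def parallel_inference_like_inputs(gt: List[Vec]) -> List[Vec]:
    """Build an inference-like pixel token for each position i >= 1.

    Each token goes through the same state-to-pixel path as inference, but its
    state depends only on the ground-truth prefix gt[:i] (an assumption of this
    sketch), so all positions could be computed in parallel.
    """
    return [pixel_decoder(predict_state(gt[:i])) for i in range(1, len(gt))]


if __name__ == "__main__":
    gt = [[1.0, 3.0, 5.0, 7.0], [2.0, 4.0, 6.0, 8.0], [0.0, 0.0, 0.0, 0.0]]
    print("parallel:", parallel_inference_like_inputs(gt))
    print("rollout: ", exact_rollout(gt[0], 2))
```

In this toy setting the two constructions agree on the first generated token and diverge afterwards, because exact rollout conditions on its own previous output while the parallel version conditions on ground truth. That divergence is the approximation being traded for parallel training.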
