# 1D Ordered Tokens Enable Efficient Test-Time Search

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-04-16 08:00
- AIHOT link: https://aihot.virxact.com/items/cmo6z5esj00mnsli5u2bj3lh2
- Original link: https://arxiv.org/abs/2604.15453

## AI Summary

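This paper examines how token structure affects the test-time search capability of autoregressive models. It finds that 1D ordered tokens with a coarse-to-fine structure have intermediate states that carry verifiable semantic meaning, which lets a verifier steer generation effectively and significantly outperforms the conventional 2D grid structure. In experiments, models trained on such tokens show better test-time scaling behavior. The paper also achieves text-to-image generation through pure test-time search, with no AR model training, and systematically analyzes how classical algorithms such as best-of-N and beam search interact with different token structures, offering practical guidance for inference-time scaling of autoregressive models.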

## Main Text

Tokenization is a key component of autoregressive (AR) generative models, converting raw data into more manageable units for modeling. Commonly, tokens describe local information, such as regions of pixels in images or word pieces in text, and AR generation predicts these tokens in a fixed order. A worthwhile question is whether token structures affect the ability to steer the generation through test-time search, where multiple candidate generations are explored and evaluated by a verifier. Using image generation as our testbed, we hypothesize that recent 1D ordered tokenizers with coarse-to-fine structure can be more amenable to search than classical 2D grid structures. This is rooted in the fact that the intermediate states in coarse-to-fine sequences carry semantic meaning that verifiers can reliably evaluate, enabling effective steering during generation. Through controlled experiments, we find that AR models trained on coarse-to-fine ordered tokens exhibit improved test-time scaling behavior compared to grid-based counterparts. Moreover, we demonstrate that, thanks to the ordered structure, pure test-time search over token sequences (i.e., without training an AR model) can perform training-free text-to-image generation when guided by an image-text verifier. Beyond this, we systematically study how classical search algorithms (best-of-N, beam search, lookahead search) interact with different token structures, as well as the role of different verifiers and AR priors. Our results highlight the impact of token structure on inference-time scalability and provide practical guidance for test-time scaling in AR models.
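To make the contrast between best-of-N and beam search concrete, here is a minimal toy sketch in Python. It is not the paper's implementation. Every name in it (`propose`, `verifier`, `TARGET`, `VOCAB`) is a hypothetical stand-in: `propose` plays the role of an AR prior offering candidate next tokens, and `verifier` plays the role of an image-text verifier, here reduced to counting how many positions of a prefix match a fixed target. The point it illustrates is that beam search depends on the verifier being able to score partial sequences, which is the property the abstract attributes to coarse-to-fine ordered tokens.

```python
import random
from typing import Callable, List, Sequence

# Hypothetical toy setup: a 4-token vocabulary and a fixed "ideal" sequence.
VOCAB = [0, 1, 2, 3]
TARGET = [2, 0, 3, 1, 1, 2]


def propose(prefix: Sequence[int]) -> List[int]:
    """Stand-in for an AR prior: candidate next tokens for a prefix (here, all of them)."""
    return list(VOCAB)


def verifier(tokens: Sequence[int]) -> float:
    """Stand-in verifier: number of positions where the (partial) sequence matches TARGET."""
    return float(sum(1 for t, g in zip(tokens, TARGET) if t == g))


def best_of_n(propose: Callable, verifier: Callable, length: int, n: int,
              rng: random.Random) -> List[int]:
    """Sample n complete sequences independently, then keep the highest-scoring one."""
    candidates = []
    for _ in range(n):
        seq: List[int] = []
        for _ in range(length):
            seq.append(rng.choice(propose(seq)))
        candidates.append(seq)
    return max(candidates, key=verifier)


def beam_search(propose: Callable, verifier: Callable, length: int,
                beam_width: int) -> List[int]:
    """Grow sequences token by token, keeping the beam_width best prefixes at each step.

    This only helps if the verifier's score on a prefix is informative.
    """
    beams: List[List[int]] = [[]]
    for _ in range(length):
        pool = [prefix + [tok] for prefix in beams for tok in propose(prefix)]
        pool.sort(key=verifier, reverse=True)  # score intermediate states
        beams = pool[:beam_width]
    return beams[0]


if __name__ == "__main__":
    rng = random.Random(0)
    print("best-of-N  :", best_of_n(propose, verifier, len(TARGET), n=8, rng=rng))
    print("beam search:", beam_search(propose, verifier, len(TARGET), beam_width=2))
```

In this toy the verifier is perfectly informative on prefixes, so beam search recovers `TARGET` exactly, while best-of-N only scores finished samples and depends on luck. With a real verifier and a 2D grid token order, prefix scores may be much less reliable, which is the regime the abstract contrasts against.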