PARCEL: Pool-Anchored Resampling with Conditioned Elastic Queries for Efficient Vision-Language Understanding
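Source: arxiv.org
Large vision-language models face a quadratic computational bottleneck at inference because they map visual inputs into dense token sequences. Existing visual-token compression methods lose spatial fidelity under aggressive compression. This paper proposes PARCEL, a new visual tokenization architecture. It establishes spatial pool tokens as low-frequency layout anchors and resamples elastic query tokens conditioned on those anchors, dynamically dividing the work of feature extraction. Across 27 benchmarks, PARCEL outperforms existing baselines across visual-token budgets and improves the performance-efficiency Pareto frontier.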
Large Vision-Language Models (LVLMs) map visual inputs into dense token sequences, imposing a quadratic computational bottleneck for inference. Elastic visual-token compression addresses this by training a single model that can run at multiple visual-token budgets. However, existing approaches struggle under aggressive compression. Spatial-only compression, as in nested pooling, behaves as an imperfect low-pass filter and induces spectral aliasing that obscures fine-grained detail. Query-only compression, as in nested query resampling, replaces explicit grid-aligned tokens with non-local summaries and substantially degrades spatial grounding. To resolve this representational conflict, we introduce PARCEL (Pool-Anchored Resampling with Conditioned Elastic Queries for Efficient Vision-Language Understanding), a visual tokenization architecture that dynamically partitions the labor of feature extraction. PARCEL establishes spatial pool tokens as low-frequency layout anchors and conditions elastic query tokens on these anchors through Pool-Conditioned Query Resampling. This encourages query tokens to focus on complementary visual features rather than redundant spatial mapping. Extensive evaluations across 27 benchmarks show that PARCEL improves the performance-efficiency Pareto frontier, consistently outperforming existing matryoshka baselines across visual-token budgets while preserving the "train once, deploy anywhere" paradigm.
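The abstract gives no implementation details, so the following is only a minimal NumPy sketch of how pool anchors and conditioned elastic queries might fit together. Everything in it is an assumption made for illustration: the function names (`pool_tokens`, `cross_attend`, `pool_conditioned_resample`), the projection-free single-head attention, the residual updates, the choice of a nested prefix of a shared query bank to realize elastic budgets, and the concatenation of anchors and queries as the final visual sequence. It should not be read as the authors' method.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, keys_values):
    # Single-head scaled dot-product cross-attention.
    # Assumption: no learned projections, to keep the sketch short.
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ keys_values

def pool_tokens(patch_grid, pool):
    # Average-pool an (H, W, D) patch grid with a pool x pool window.
    # Returns (H // pool) * (W // pool) grid-aligned tokens of width D.
    H, W, D = patch_grid.shape
    assert H % pool == 0 and W % pool == 0, "grid must divide evenly"
    blocks = patch_grid.reshape(H // pool, pool, W // pool, pool, D)
    return blocks.mean(axis=(1, 3)).reshape(-1, D)

def pool_conditioned_resample(patch_grid, query_bank, pool, num_queries):
    # Hypothetical pool-conditioned query resampling:
    #   1. spatial pool tokens act as low-frequency layout anchors;
    #   2. a nested prefix of the query bank is selected (elastic budget);
    #   3. queries first read the anchors, then resample the full patch set.
    H, W, D = patch_grid.shape
    patches = patch_grid.reshape(H * W, D)
    anchors = pool_tokens(patch_grid, pool)
    queries = query_bank[:num_queries]
    queries = queries + cross_attend(queries, anchors)   # condition on anchors
    queries = queries + cross_attend(queries, patches)   # gather finer detail
    return np.concatenate([anchors, queries], axis=0)

# Usage: one set of "weights" (the query bank) serves several token budgets.
rng = np.random.default_rng(0)
grid = rng.standard_normal((24, 24, 64))   # 576 patch features of width 64
bank = rng.standard_normal((64, 64))       # up to 64 elastic queries
for pool, k in [(12, 4), (6, 16), (4, 64)]:
    tokens = pool_conditioned_resample(grid, bank, pool, k)
    print(f"pool={pool:2d} queries={k:2d} -> {tokens.shape[0]} visual tokens")
```

In this sketch the three settings reduce 576 patch tokens to 8, 32, and 100 visual tokens respectively. The anchor rows stay grid-aligned, which is the property the abstract says query-only compression loses, while the query rows are free to summarize non-local content.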