# HiLo-Token: An Input-Adaptive High/Low-Frequency Token Compression Framework for Efficient Image Editing

- Source: HuggingFace Daily Papers (trending community papers)
- Published: 2026-06-11 08:00
- AIHOT score: 48
- AIHOT link: https://aihot.virxact.com/items/cmqjuogx800awslhijx350mwg
- Original paper: https://arxiv.org/abs/2606.13898

## AI Summary

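HiLo-Token proposes an input-adaptive high/low-frequency token compression framework that targets the latency bottleneck of Diffusion Transformers (DiTs) in image editing: even after distillation from 50 steps to 8, the DiT still accounts for 73% of latency. The method keeps every token inside the user-masked editing region to preserve local relevance. Outside that region, it selects high-frequency tokens by spatial frequency to capture detail and uses low-frequency tokens from a 16x downsampled image to preserve global structure. On production-level evaluation data, for small, medium, and large mask editing tasks with average mask ratios of 6.38%, 15.92%, and 35.36%, it achieves DiT speedups of 3.13x, 2.59x, and 1.67x respectively on A100-80GB, with no regression in generation quality.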
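The reported speedups apply to the DiT module only, so the end-to-end gain is smaller. A rough way to relate the two is Amdahl's law, sketched below. This is a back-of-the-envelope estimate and not a result from the paper: it assumes the 73% average DiT share holds within each mask category and that the compression adds no cost outside the DiT, neither of which the source states. The function and variable names are illustrative.

```python
def end_to_end_speedup(dit_fraction: float, dit_speedup: float) -> float:
    """Amdahl's law: only the DiT share of total latency is accelerated."""
    if not 0.0 <= dit_fraction <= 1.0:
        raise ValueError("dit_fraction must be in [0, 1]")
    if dit_speedup <= 0.0:
        raise ValueError("dit_speedup must be positive")
    return 1.0 / ((1.0 - dit_fraction) + dit_fraction / dit_speedup)


# Assumption: the 73% average DiT latency share applies to every category.
DIT_FRACTION = 0.73

# DiT-only speedups reported in the summary, keyed by mask-size category.
REPORTED_DIT_SPEEDUPS = {"small": 3.13, "medium": 2.59, "large": 1.67}

# Estimated whole-pipeline speedups under the assumptions above.
estimates = {
    name: end_to_end_speedup(DIT_FRACTION, s)
    for name, s in REPORTED_DIT_SPEEDUPS.items()
}
```

Under these assumptions the whole-pipeline estimate is always lower than the DiT-only figure, because the remaining 27% of latency is untouched.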

## Main Text

Creative image editing tools, such as Photoshop's Remove or Generative Fill buttons, are central to everyday customer use and account for a major share of traffic in Photoshop and Lightroom. However, current generative AI models face significant latency challenges, which become even more pronounced when transitioning from convolution-based U-Nets to Diffusion Transformers (DiTs). In our evaluation on hundreds of representative image editing samples spanning a wide range of mask ratios, the DiT module alone accounts for an average of 73% of the total model latency, even after being distilled from 50 timesteps down to 8 timesteps. To tackle this challenge, we propose HiLo-Token, an input-adaptive token compression framework that allocates more token budget to high-frequency, rich-context regions while assigning fewer tokens to low-frequency areas. Specifically, for the editing region specified by the user mask, we retain all tokens within a dilated mask to preserve strong locality and contextual relevance. Outside the editing region, we introduce a simple yet effective high-frequency token selection strategy based on spatial frequency to capture important local details, while using tokens from a 16x downsampled image to represent low-frequency components and preserve the blurry but global structure. Extensive experiments on production-level evaluation data validate the effectiveness of the proposed method, achieving 3.13x, 2.59x, and 1.67x DiT speedups on A100-80GB for image editing tasks across small, medium, and large mask ratio categories with average ratios of 6.38%, 15.92%, and 35.36%, respectively, without any regression in generation quality.
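The three-part token budget described above (all tokens in the dilated mask, high-frequency tokens outside it, and tokens from a 16x downsampled image) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not specify the frequency measure, patch size, dilation radius, or selection ratio, so the mean gradient magnitude per patch, `patch=8`, `dilation=1`, and `hf_keep_ratio=0.25` are all placeholders, and the sketch works on a grayscale pixel grid and ignores the latent space a real DiT would operate in.

```python
import numpy as np


def dilate(mask: np.ndarray, radius: int) -> np.ndarray:
    """Binary dilation with a (2*radius+1)^2 square window via padded shifts."""
    h, w = mask.shape
    padded = np.pad(mask, radius, mode="constant", constant_values=False)
    out = np.zeros((h, w), dtype=bool)
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            out |= padded[dy:dy + h, dx:dx + w]
    return out


def patch_reduce(arr: np.ndarray, patch: int, fn) -> np.ndarray:
    """Crop to a multiple of `patch` and reduce each patch with `fn`."""
    gh, gw = arr.shape[0] // patch, arr.shape[1] // patch
    blocks = arr[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch)
    return fn(blocks, axis=(1, 3))


def select_tokens(image, mask, patch=8, dilation=1, hf_keep_ratio=0.25, down=16):
    """Return (edit_mask, hf_mask, n_low) on the token grid.

    image: (H, W) float grayscale array; mask: (H, W) bool user mask.
    All defaults are illustrative assumptions, not values from the paper.
    """
    # 1) Editing region: a token is kept if its patch touches the mask,
    #    then the token-level mask is dilated to add surrounding context.
    edit_mask = dilate(patch_reduce(mask.astype(bool), patch, np.any), dilation)

    # 2) High-frequency proxy (assumption): mean gradient magnitude per patch.
    gy, gx = np.gradient(image.astype(np.float64))
    score = patch_reduce(np.abs(gx) + np.abs(gy), patch, np.mean)

    # Keep the top-k highest-scoring tokens that lie outside the editing region.
    outside = ~edit_mask
    k = int(round(hf_keep_ratio * int(outside.sum())))
    hf_flat = np.zeros(score.size, dtype=bool)
    if k > 0:
        masked_scores = np.where(outside, score, -np.inf).ravel()
        hf_flat[np.argpartition(masked_scores, -k)[-k:]] = True
    hf_mask = hf_flat.reshape(score.shape)

    # 3) Low-frequency tokens: patchify a `down`x average-pooled copy of the image.
    low_res = patch_reduce(image.astype(np.float64), down, np.mean)
    n_low = (low_res.shape[0] // patch) * (low_res.shape[1] // patch)

    return edit_mask, hf_mask, n_low
```

For a 512x512 input with these placeholder settings, the full token grid is 64x64 and the downsampled image contributes a 4x4 grid, so the sequence the DiT sees is `edit_mask.sum() + hf_mask.sum() + n_low` tokens instead of 4096. How the kept tokens are positioned, attended to, and merged back into a full-resolution output is not covered by the abstract and is left out here.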