# ViT-Up: High-Fidelity Feature Upsampling for Vision Transformers

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-06-12 08:00
- AIHOT score: 44
- AIHOT link: https://aihot.virxact.com/items/cmqjjxug90416slmh0hks33kh
- Original link: https://arxiv.org/abs/2606.14024

## AI Summary

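ViT-Up proposes an implicit feature upsampling framework that builds layer-wise queries from intermediate ViT hidden states instead of relying on external image guidance, so it can predict features at arbitrary continuous coordinates while staying aligned with the backbone feature space. On dense prediction and semantic correspondence tasks, ViT-Up consistently outperforms existing image-guided upsamplers: with the DINOv3-S+ backbone it gains up to +2.07 mIoU on Cityscapes and +4.17 PCK@0.10 on SPair-71k; with the DINOv3-B backbone the gains reach +3.36 mIoU and +8.09 PCK@0.10, indicating that ViT-Up scales favorably with backbone capacity.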
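For readers unfamiliar with the two metrics quoted above, the following is a minimal sketch of how mIoU and PCK@α are commonly computed. It is an illustration only, not the paper's evaluation code: the function names `mean_iou` and `pck` are hypothetical, the PCK threshold assumes the common bounding-box convention (α times the longer side of the object box), and benchmark implementations usually accumulate statistics over a whole dataset rather than over one array.

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Mean intersection-over-union over the classes present in pred or gt."""
    ious = []
    for c in range(num_classes):
        inter = np.sum((pred == c) & (gt == c))
        union = np.sum((pred == c) | (gt == c))
        if union > 0:  # skip classes that appear in neither map
            ious.append(inter / union)
    return float(np.mean(ious))

def pck(pred_kps, gt_kps, bbox_wh, alpha=0.10):
    """Fraction of keypoints within alpha * max(bbox width, height) of the target.

    Assumption: bounding-box-normalized threshold; other variants normalize
    by image size instead.
    """
    thresh = alpha * max(bbox_wh)
    dist = np.linalg.norm(pred_kps - gt_kps, axis=1)
    return float(np.mean(dist <= thresh))
```

With `alpha=0.10` and a 50x40 object box, a predicted keypoint counts as correct when it lies within 5 pixels of the ground truth.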

## Main Text

Vision Transformers (ViTs) have become a dominant architecture for visual representation learning, providing exceptionally strong and broadly reusable backbone features. However, ViTs are commonly operated on relatively small patch-token grids due to the quadratic cost of global self-attention, which creates a persistent bottleneck for dense prediction tasks such as semantic segmentation and depth estimation. This has motivated the development of task-agnostic feature upsamplers. While recent state-of-the-art methods produce visually sharp dense representations, their reliance on shallow image encoders for guided upsampling can introduce feature leakage, fragmentation, and blur. We introduce ViT-Up, an implicit feature upsampling framework that replaces external image guidance with layer-wise query construction from intermediate ViT hidden states. This enables feature prediction at arbitrary continuous image coordinates while preserving alignment with the backbone feature space. Experiments demonstrate that ViT-Up consistently outperforms state-of-the-art image-guided upsamplers across dense prediction and semantic correspondence. On DINOv3-S+, ViT-Up improves over prior methods by up to +2.07 mIoU on Cityscapes and +4.17 PCK@0.10 on SPair-71k. With the larger DINOv3-B backbone, these gains increase to +3.36 mIoU and +8.09 PCK@0.10, demonstrating that ViT-Up scales favorably with backbone capacity.
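To make "feature prediction at arbitrary continuous image coordinates" concrete, here is a minimal sketch of the simplest possible version of the idea: bilinearly sampling several layers' patch-token grids at a continuous coordinate and concatenating the results into one per-point vector. This is not ViT-Up's architecture, which the abstract does not specify. The function names `bilinear_sample` and `layerwise_query`, the normalized `[0, 1]` coordinate convention, and the use of plain interpolation in place of a learned implicit decoder are all assumptions made for illustration.

```python
import numpy as np

def bilinear_sample(grid, xy):
    """Sample an (H, W, C) token grid at continuous coords xy of shape (N, 2).

    Coordinates are (x, y) in [0, 1]; token (row i, col j) is assumed to be
    centered at ((j + 0.5) / W, (i + 0.5) / H).
    """
    h, w, _ = grid.shape
    # Map normalized coords to continuous token-index space, clamped to the grid.
    x = np.clip(xy[:, 0] * w - 0.5, 0, w - 1)
    y = np.clip(xy[:, 1] * h - 0.5, 0, h - 1)
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    wx = (x - x0)[:, None]
    wy = (y - y0)[:, None]
    # Interpolate along x on the two neighboring rows, then along y.
    top = grid[y0, x0] * (1 - wx) + grid[y0, x1] * wx
    bottom = grid[y1, x0] * (1 - wx) + grid[y1, x1] * wx
    return top * (1 - wy) + bottom * wy

def layerwise_query(hidden_states, xy):
    """Concatenate per-layer samples into one vector per query point.

    hidden_states: list of (H, W, C_l) arrays, one per intermediate layer
    (hypothetical stand-ins for reshaped ViT hidden states).
    """
    return np.concatenate([bilinear_sample(g, xy) for g in hidden_states], axis=-1)
```

Because the query points are continuous, the same function can produce a feature map at any output resolution by passing a denser coordinate grid. A learned upsampler would replace the fixed interpolation weights with a trained decoder.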
