# Phase Marginalization: Addressing Patch-Grid Phase Instability in Vision Transformers

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-06-06 20:15
- AIHOT score: 53
- AIHOT link: https://aihot.virxact.com/items/cmq6n82w909yusl5ih51dmldb
- Original link: https://arxiv.org/abs/2606.08132

## AI Summary

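Vision Transformers that operate on a fixed patch grid exhibit phase-dependent instability: changing the patch partition changes the token evidence available to a pixel, especially near boundaries. The authors formalize patch-grid phase as a nuisance variable and propose Phase Marginalization, a post-hoc method that evaluates structured patch-grid phases, inverse-aligns the dense outputs, and aggregates them in the original image coordinate system. The central variant, Uniform Phase Marginalization with K = 4, requires no training and outperforms the canonical K = 1 baseline on the measured segmentation, depth, and local matching tasks. In a controlled Cityscapes experiment, it gains +0.31 mIoU over the strongest compute-matched, generic shift-based four-forward test-time augmentation (TTA) configuration tested. A scaling study shows that K = 4 is a practical trade-off: K = 8 is essentially unchanged, and K = 16 adds very little accuracy at much higher latency. The paper concludes that patch-grid phase is a measurable nuisance variable and that Phase Marginalization is a simple diagnostic and post-hoc baseline for dense ViT prediction.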

## Main Text

Vision Transformers operate on fixed patch grids, which can introduce phase-dependent instability for dense prediction: changing the patch partition can change the token evidence available to a pixel, especially near boundaries. We formalize patch-grid phase as a nuisance variable and propose Phase Marginalization, a post-hoc marginalization method that evaluates structured patch-grid phases, inverse-aligns dense outputs, and aggregates them in the original image coordinate system. The central variant, Uniform Phase Marginalization with K = 4, is training-free and improves over the canonical K = 1 baseline across measured segmentation, depth, and local matching settings. In a controlled Cityscapes experiment, Uniform Phase Marginalization provides a modest compute-matched advantage over generic shift-based four-forward test-time augmentation (TTA) (+0.31 mean Intersection-over-Union over the strongest tested generic row). A scaling study further shows that K = 4 is a practical cost-accuracy trade-off: K = 8 is essentially unchanged and K = 16 adds little accuracy at much higher latency. These results position patch-grid phase as a measurable nuisance variable and Phase Marginalization as a simple diagnostic and post-hoc marginalization baseline for dense ViT prediction.
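The evaluate, inverse-align, and aggregate loop described above can be illustrated with a small NumPy sketch. This is a toy under stated assumptions, not the paper's implementation: `patch_mean_model` is a hypothetical stand-in for a dense ViT, the K phases are assumed to be evenly spaced offsets within one patch period, the shift is realized by edge padding, and aggregation is a plain mean. None of these choices are confirmed by the abstract.

```python
import numpy as np

def patch_mean_model(image, patch=4):
    """Hypothetical stand-in for a dense ViT: each pixel gets its patch's mean."""
    h, w = image.shape
    assert h % patch == 0 and w % patch == 0
    blocks = image.reshape(h // patch, patch, w // patch, patch)
    means = blocks.mean(axis=(1, 3), keepdims=True)
    return np.broadcast_to(means, blocks.shape).reshape(h, w)

def uniform_phases(patch, k=4):
    """Assumed layout: k evenly spaced (dy, dx) offsets within one patch period."""
    side = int(round(np.sqrt(k)))
    assert side * side == k, "this sketch only handles square phase grids"
    step = patch // side
    return [(i * step, j * step) for i in range(side) for j in range(side)]

def phase_marginalize(model, image, patch=4, k=4):
    """Run the model at k grid phases, realign each output, and average.

    `model` must use the same patch size as `patch`.
    """
    h, w = image.shape
    outputs = []
    for dy, dx in uniform_phases(patch, k):
        # Shift the image relative to the patch grid by padding (assumption:
        # edge padding; total padding is one patch so the size stays divisible).
        padded = np.pad(image, ((dy, patch - dy), (dx, patch - dx)), mode="edge")
        pred = model(padded)
        # Inverse-align: crop back to the original image coordinates.
        outputs.append(pred[dy:dy + h, dx:dx + w])
    return np.mean(outputs, axis=0)

# Toy input: a vertical step edge that does not fall on a patch boundary.
image = np.zeros((16, 16))
image[:, 6:] = 1.0

baseline = patch_mean_model(image)                       # K = 1
marginalized = phase_marginalize(patch_mean_model, image, patch=4, k=4)

err_base = np.abs(baseline - image).mean()
err_pm = np.abs(marginalized - image).mean()
print(f"K=1 error: {err_base:.4f}, K=4 error: {err_pm:.4f}")
```

In this toy case the edge is blurred by the K = 1 grid but resolved exactly by the shifted phases, so the averaged output is closer to the input near the boundary. The example is chosen to make the effect visible and says nothing about the size of the gains on real models. Note that K forward passes cost roughly K times the latency, which is the trade-off the scaling study in the abstract refers to.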
