# Rank-Aware Hyperbolic Alignment (RAHA) for Vision-Language Dataset Distillation

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-06-28 08:00
- AIHOT score: 39
- AIHOT link: https://aihot.virxact.com/items/cmr3nyzag00ecsl7law6hhxuj
- Original link: https://arxiv.org/abs/2606.29464

## AI Summary

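RAHA (Rank-Aware Hyperbolic Alignment) lifts multimodal representations into hyperbolic space and optimizes distilled pairs with asymmetric objectives, enforcing geodesic alignment within a shared low-rank range while regularizing the residual subspace to preserve modality-private diversity and improve transfer robustness. The method addresses the overly strict full-dimensional Euclidean alignment used in existing vision-language dataset distillation, and under fixed budgets it achieves competitive cross-modal retrieval and better transfer metrics.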

## Main Text

Vision-language dataset distillation (VLDD) compresses a large dataset of image-text pairs into a small set of synthetic pairs that can efficiently train contrastive vision-language models under strict data and compute budgets. Most existing methods match expert trajectories or cross-modal statistics, yet still enforce full-dimensional alignment in a Euclidean embedding space. This is often overly restrictive because image-text correlation is rank-deficient: shared semantics are concentrated in a low-dimensional range, and the remaining variation is spread across a weakly correlated residual subspace. LoRS relaxes alignment at the similarity level through low-rank factorization, but it does not explicitly control the dominant alignment capacity or the structure of the representation space. We therefore propose rank-aware hyperbolic alignment (RAHA), which combines hierarchical geometry with explicit control of alignment capacity. RAHA lifts multimodal representations to hyperbolic space and optimizes distilled pairs with asymmetric objectives that enforce geodesic alignment in the shared range, while regularizing the residual subspace to preserve modality-private diversity and improve transfer robustness. Experiments on benchmarks show that RAHA achieves competitive cross-modal retrieval and improved transfer indicators under fixed budgets.
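The abstract does not give RAHA's formulas, so the following is only a minimal sketch of the general idea and not the authors' implementation. It rests on several assumptions. The shared low-rank range is approximated by the top singular directions of the image-text cross-covariance. Hyperbolic space is modeled as a Poincaré ball of curvature `-c`. The alignment term is a plain symmetric geodesic distance. All function names (`expmap0`, `poincare_dist`, `split_shared_residual`, `shared_alignment_loss`) and the `rank` and `c` parameters are hypothetical. The asymmetric objectives and the residual regularizer mentioned in the abstract are not reproduced here.

```python
import numpy as np


def expmap0(v, c=1.0, eps=1e-9):
    """Exponential map at the origin of a Poincare ball with curvature -c."""
    norm = np.maximum(np.linalg.norm(v, axis=-1, keepdims=True), eps)
    scale = np.tanh(np.sqrt(c) * norm) / (np.sqrt(c) * norm)
    return scale * v


def poincare_dist(x, y, c=1.0, eps=1e-9):
    """Geodesic distance between points of the Poincare ball (curvature -c)."""
    sq = np.sum((x - y) ** 2, axis=-1)
    dx = 1.0 - c * np.sum(x * x, axis=-1)
    dy = 1.0 - c * np.sum(y * y, axis=-1)
    arg = 1.0 + 2.0 * c * sq / np.maximum(dx * dy, eps)
    # Clamp to the domain of arccosh to absorb rounding error.
    return np.arccosh(np.maximum(arg, 1.0)) / np.sqrt(c)


def split_shared_residual(img, txt, rank):
    """Split centered embeddings into a rank-`rank` shared part and a residual.

    Assumption: the shared range is spanned by the leading singular vectors
    of the image-text cross-covariance matrix.
    """
    img_c = img - img.mean(axis=0)
    txt_c = txt - txt.mean(axis=0)
    cross = img_c.T @ txt_c / img.shape[0]
    u, _, vt = np.linalg.svd(cross, full_matrices=False)
    u_r, v_r = u[:, :rank], vt[:rank].T
    img_shared, txt_shared = img_c @ u_r, txt_c @ v_r
    # Residuals: what is left after removing the shared directions.
    img_res = img_c - img_shared @ u_r.T
    txt_res = txt_c - txt_shared @ v_r.T
    return img_shared, txt_shared, img_res, txt_res


def shared_alignment_loss(img, txt, rank, c=1.0):
    """Mean geodesic distance between paired shared-range coordinates."""
    img_s, txt_s, _, _ = split_shared_residual(img, txt, rank)
    dist = poincare_dist(expmap0(img_s, c), expmap0(txt_s, c), c)
    return float(np.mean(dist))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    images = rng.normal(size=(32, 6))  # stand-in image embeddings
    texts = rng.normal(size=(32, 5))   # stand-in text embeddings
    print(shared_alignment_loss(images, texts, rank=2))
```

In this sketch only the `rank` shared coordinates enter the alignment term, so the residual components returned by `split_shared_residual` stay free for whatever modality-private regularizer one chooses. That separation is the point the abstract makes against full-dimensional alignment.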
