# CIPER: A Unified Framework for Cross-View Image Retrieval and Pose Estimation

- Source: HuggingFace Daily Papers (trending community papers)
- Published: 2026-06-03 08:00
- AIHOT score: 38
- AIHOT link: https://aihot.virxact.com/items/cmq6ixr0408ussl5izni61wxb
- Original paper: https://arxiv.org/abs/2606.05011

## AI Summary

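Existing cross-view geo-localization methods treat city-scale retrieval and precise pose estimation as separate stages, which leads to cascading errors and inconsistent features. CIPER proposes a single architecture that performs both tasks at once. A shared Transformer encoder with task-specific tokens separates global retrieval features from spatial localization cues, and a two-way Transformer pose decoder uses ground features as spatial queries for bidirectional cross-attention, bridging the domain gap between ground and aerial views. A set prediction strategy enables stable 3-DoF regression. On the VIGOR, KITTI, and Ford Multi-AV datasets, CIPER achieves competitive performance, especially under limited field-of-view and arbitrary orientation conditions. The code is open source.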

## Main Text

Cross-view geo-localization estimates the geographic location of a ground image by matching it against an aerial image database. Existing methods tackle this through either large-scale retrieval or precise pose estimation, but not both: retrieval-based methods enable wide-area search at the cost of localization accuracy, while pose estimation methods achieve high precision within only a narrow search space. Naively cascading these pipelines introduces error propagation and inconsistent feature representations. We formulate cross-view geo-localization as a unified problem requiring simultaneous city-scale retrieval and precise 3-DoF pose estimation. We propose CIPER (Cross-view Image-retrieval and Pose-estimation transformER), a single architecture that jointly performs both tasks through mutually beneficial feature learning. CIPER uses a shared transformer encoder with task-specific tokens to disentangle global retrieval features from spatial localization cues. To bridge the large domain gap between ground and aerial views, we introduce a two-way transformer pose decoder that uses ground features as spatial queries for bidirectional cross-attention. A set prediction strategy further enables stable 3-DoF regression under a unified multi-task objective. Experiments on VIGOR, KITTI, and Ford Multi-AV demonstrate competitive performance, especially under limited field-of-view and arbitrary orientation conditions. Code is available at https://github.com/yurimjeon1892/CIPER.
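To make the idea of bidirectional cross-attention concrete, here is a minimal sketch of one plausible reading of a "two-way" block: ground tokens first query the aerial tokens, then the aerial tokens attend back to the updated ground tokens. This is an illustration only, not CIPER's implementation. The function names (`cross_attention`, `two_way_block`), the single-head design, the absence of learned projections, and the toy token values are all assumptions made for the example; the actual architecture is in the linked repository.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys_values):
    """Single-head scaled dot-product cross-attention with a residual add.

    Assumption: no learned Q/K/V projections, so the same vectors serve
    as keys and values. Returns the updated queries and attention weights.
    """
    d = len(queries[0])
    scale = 1.0 / math.sqrt(d)
    updated, weights = [], []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k))
                  for k in keys_values]
        w = softmax(scores)
        # Weighted sum of the value vectors, one channel at a time.
        context = [sum(w[j] * keys_values[j][c]
                       for j in range(len(keys_values)))
                   for c in range(d)]
        updated.append([qi + ci for qi, ci in zip(q, context)])
        weights.append(w)
    return updated, weights


def two_way_block(ground_tokens, aerial_tokens):
    """Hypothetical two-way block: ground -> aerial, then aerial -> ground."""
    # Step 1: ground tokens act as spatial queries over aerial tokens.
    ground_out, g2a = cross_attention(ground_tokens, aerial_tokens)
    # Step 2: aerial tokens attend back to the updated ground tokens.
    aerial_out, a2g = cross_attention(aerial_tokens, ground_out)
    return ground_out, aerial_out, g2a, a2g


# Toy tokens (made up for illustration): 2 ground tokens, 3 aerial tokens.
ground = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
aerial = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
ground_out, aerial_out, g2a, a2g = two_way_block(ground, aerial)
```

In this sketch, `g2a` holds one attention distribution per ground token over the aerial tokens, and `a2g` holds the reverse. A pose head would then read position and orientation from the updated tokens; that part, along with the set prediction step, is omitted here because the abstract does not specify it.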