CIPER: A Unified Framework for Cross-View Image Retrieval and Pose Estimation
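Source: arxiv.org
Existing cross-view geo-localization methods treat city-scale retrieval and precise pose estimation as separate stages, which leads to cascaded errors and inconsistent features. CIPER performs both tasks in a single architecture: a shared Transformer encoder with task-specific tokens separates global retrieval features from spatial localization cues, and a two-way Transformer pose decoder uses ground features as spatial queries for bidirectional cross-attention, bridging the domain gap between ground and aerial views. A set prediction strategy provides stable 3-DoF regression. On the VIGOR, KITTI, and Ford Multi-AV datasets, CIPER performs strongly under limited field-of-view and arbitrary-orientation conditions. The code is open source.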
Cross-view geo-localization estimates the geographic location of a ground image by matching it against an aerial image database. Existing methods tackle this through either large-scale retrieval or precise pose estimation, but not both: retrieval-based methods enable wide-area search at the cost of localization accuracy, while pose estimation methods achieve high precision within only a narrow search space. Naively cascading these pipelines introduces error propagation and inconsistent feature representations. We formulate cross-view geo-localization as a unified problem requiring simultaneous city-scale retrieval and precise 3-DoF pose estimation. We propose CIPER (Cross-view Image-retrieval and Pose-estimation transformER), a single architecture that jointly performs both tasks through mutually beneficial feature learning. CIPER uses a shared transformer encoder with task-specific tokens to disentangle global retrieval features from spatial localization cues. To bridge the large domain gap between ground and aerial views, we introduce a two-way transformer pose decoder that uses ground features as spatial queries for bidirectional cross-attention. A set prediction strategy further enables stable 3-DoF regression under a unified multi-task objective. Experiments on VIGOR, KITTI, and Ford Multi-AV demonstrate competitive performance, especially under limited field-of-view and arbitrary orientation conditions. Code is available at https://github.com/yurimjeon1892/CIPER.
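The two sub-problems named above can be made concrete with a small sketch. It is not taken from the CIPER code base. It assumes that retrieval ranks aerial descriptors by cosine similarity, that a 3-DoF pose is a planar position plus a yaw angle `(x, y, yaw_deg)`, and that set prediction with a single ground-truth pose reduces to choosing the lowest-cost candidate. The function names and the `yaw_weight` parameter are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length descriptor vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query, database):
    """Index of the aerial descriptor most similar to the ground query."""
    return max(range(len(database)),
               key=lambda i: cosine_similarity(query, database[i]))


def wrap_angle(deg):
    """Wrap an angle in degrees into the interval [-180, 180)."""
    return (deg + 180.0) % 360.0 - 180.0


def pose_error(pred, gt):
    """Translation error and absolute yaw error between two (x, y, yaw_deg) poses."""
    translation = math.hypot(pred[0] - gt[0], pred[1] - gt[1])
    yaw = abs(wrap_angle(pred[2] - gt[2]))
    return translation, yaw


def match_candidate(candidates, gt, yaw_weight=0.1):
    """Assumed matching rule: pick the candidate pose with the lowest combined cost.

    With one ground-truth pose, bipartite matching over a predicted set
    collapses to an argmin over per-candidate costs.
    """
    def cost(pose):
        translation, yaw = pose_error(pose, gt)
        return translation + yaw_weight * yaw

    return min(range(len(candidates)), key=lambda i: cost(candidates[i]))


if __name__ == "__main__":
    aerial_db = [[0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
    print(retrieve([1.0, 0.0], aerial_db))
    print(pose_error((3.0, 4.0, 350.0), (0.0, 0.0, 10.0)))
    poses = [(10.0, 10.0, 0.0), (1.0, 0.0, 5.0), (0.0, 0.0, 180.0)]
    print(match_candidate(poses, (0.0, 0.0, 0.0)))
```

Wrapping the yaw difference matters under arbitrary orientation: a prediction of 350 degrees against a ground truth of 10 degrees is 20 degrees off, not 340. The relative weighting of translation and yaw in the matching cost is a design choice that the abstract does not specify.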