# Toward Consistent Video Geometry Estimation

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-05-28 08:00
- AIHOT score: 69
- AIHOT link: https://aihot.virxact.com/items/cmpqs64ts07rjslno9tbemz9q
- Original link: https://arxiv.org/abs/2605.30060

## AI Summary

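ViGeo is a feed-forward foundation model for recovering spatially dense and temporally consistent geometry from video sequences. It is built on a Transformer architecture and supports streaming, full-sequence, and long-video inference. Its core design is a dynamic chunking attention mechanism that combines bidirectional and causal temporal context during training and adapts at test time. The work also introduces a completion-based data refinement framework, which trains a video depth completion teacher model to produce dense, temporally coherent, and reliable training targets. The model predicts depth, point maps, and surface normals within a single framework and, trained only on public datasets, reaches state-of-the-art performance on several related video geometry estimation tasks.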

## Main Text

This work presents ViGeo, a feed-forward foundation model for recovering spatially dense and temporally consistent geometry from video sequences. Built upon a plain transformer architecture without task-specific architectural modifications, ViGeo supports streaming, full-sequence, and long-video inference within a unified model. The key design is dynamic chunking attention, which exposes the model to both bidirectional and causal temporal contexts during training and allows it to adapt its attention pattern at test time without retraining. To improve supervision quality, we further introduce a completion-based data refinement framework. This framework trains a video depth completion teacher that conditions on sparse and noisy annotations and exploits video/multi-view context to produce dense, temporally coherent, and geometrically reliable training targets. Beyond depth and point maps, ViGeo also predicts surface normals within the same framework. Trained solely on public datasets, ViGeo achieves state-of-the-art performance across online, offline, and long-video depth estimation, surface normal estimation, and video point map estimation.
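The abstract does not spell out how dynamic chunking attention is implemented, so the following is only a minimal sketch of one plausible reading: frames are grouped into chunks, attention is bidirectional inside a chunk, and causal across chunks. The function name `chunk_attention_mask` and the `chunk_size` parameter are hypothetical and do not come from the paper.

```python
def chunk_attention_mask(num_frames, chunk_size):
    """Build a frame-level attention mask (hypothetical helper, not ViGeo's code).

    mask[i][j] is True when frame i may attend to frame j. Frames in the same
    chunk see each other (bidirectional); a frame also sees every earlier
    chunk but no later chunk (causal across chunks).
    """
    if num_frames <= 0 or chunk_size <= 0:
        raise ValueError("num_frames and chunk_size must be positive")
    # Chunk index of each frame, e.g. chunk_size=2 -> [0, 0, 1, 1, ...]
    chunk_ids = [t // chunk_size for t in range(num_frames)]
    return [
        [chunk_ids[j] <= chunk_ids[i] for j in range(num_frames)]
        for i in range(num_frames)
    ]


def render(mask):
    """Render the mask as text: 'x' where attention is allowed, '.' otherwise."""
    return "\n".join("".join("x" if v else "." for v in row) for row in mask)


if __name__ == "__main__":
    # Assumed correspondence with the three inference modes named above.
    for size in (1, 2, 4):
        print(f"chunk_size={size}")
        print(render(chunk_attention_mask(4, size)))
```

Under this assumption, `chunk_size=1` reduces to a fully causal mask (a streaming-style setting), while `chunk_size=num_frames` yields full bidirectional attention (a full-sequence setting). Varying the chunk size during training would expose one model to both regimes, which is consistent with the abstract's claim that the attention pattern can be changed at test time without retraining.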