HorizonStream:面向流式三维重建的长时域注意力
阅读原文· arxiv.orgHorizonStream 将几何传播形式化为证据影响核,并将其分解为长时域和短时域因子。长时域因子采用几何线性注意力学习通道级衰减率,实现几何证据的有界、多时间尺度传播。短时域因子结合几何局部注意力与时空旋转位置编码,执行可靠三维匹配并抑制注意力尖峰。最终,通过度量读出 token 从持久几何状态中恢复稳定尺度与刚性位姿。该模型仅用 48 帧片段训练,即可在恒定内存与线性时间下,稳定泛化至超过 10,000 帧的序列,达到了流式三维重建的先进性能。
Online 3D reconstruction requires estimating camera pose and scene geometry under strict causal and bounded-memory constraints. Existing methods often suffer from drift, jitter, or collapse on long sequences. We trace these failures to a fundamental mismatch. Streaming geometry is inherently temporally heterogeneous, with evidence ranging from short-lived correspondences to persistent global scale. However, current architectures impose uniform and pathological influence patterns. For example, sliding windows enforce hard cutoffs, while ungated recurrence and causal attention cause cache saturation and spike-like attention sinks. To resolve this, we formalize geometric propagation as an evidence influence kernel and propose HorizonStream, a long-horizon Transformer that explicitly factorizes this kernel. For the long-range temporal factor, Geometric Linear Attention learns channel-wise decay rates to enable bounded, multi-timescale propagation of geometric evidence. For the short-range spatial factor, Geometric Local Attention with Spatiotemporal RoPE performs reliable 3D matching while suppressing attention sinks. Finally, Metric Readout Tokens recover stable scale and rigid pose directly from the persistent geometric state. Extensive experiments show that HorizonStream, trained on only 48-frame clips, generalizes stably to sequences exceeding 10,000\ frames with constant memory and linear time, achieving state-of-the-art streaming 3D reconstruction performance. Project Page: https://3dagentworld.github.io/horizonstream/