# RayDer：基于真实世界视频的可扩展自监督新视角合成

- 来源：HuggingFace Daily Papers（社区热门论文）
- 发布时间：2026-05-29 08:00
- AIHOT 分数：56
- AIHOT 链接：https://aihot.virxact.com/items/cmpvfo6ys06g5sl0ztu16ykq4
- 原文链接：https://arxiv.org/abs/2605.31535

## AI 摘要

RayDer是一个统一的Transformer前馈模型，将相机估计、场景重建和渲染整合到单一主干网络中。它通过一个被视为干扰因子的最小动态状态来吸收时变内容，从而能够在无约束的真实世界视频上进行稳定训练。该模型以静态场景新视角合成作为目标任务，仅将动态内容用作可扩展的监督信号。实验表明，RayDer在数据量和计算量上展现出清晰的幂律扩展规律，并在大量基准测试中取得了与有监督最先进方法相当的零样本开集性能。

## 正文

Self-supervised novel view synthesis (NVS) remains challenging to scale, despite the abundance of video data, largely due to the brittleness of training on realistic videos and the hard-to-predict scaling behavior of multi-network system designs. We introduce RayDer, a unified, feed-forward transformer that consolidates camera estimation, scene reconstruction, and rendering into a single backbone, turning self-supervised NVS into a well-posed single-model scaling problem. A minimal dynamic state, treated as a nuisance factor, absorbs time-varying content and enables stable training on unconstrained real-world video. Importantly, RayDer keeps static-scene NVS as its target task: dynamic content is leveraged purely as scalable supervision, not reconstructed as in dynamic-scene (4D) NVS. Across multiple model sizes and orders of magnitude in data, RayDer exhibits clean power-law scaling with data and compute, and outperforms static-scene data mixtures. On a large number of benchmarks, RayDer achieves strong zero-shot open-set performance competitive with state-of-the-art supervised approaches. Project Page: https://compvis.github.io/rayder
