# One Scene, Two Depths: Probing Geometric Ambiguity in Monocular Depth Foundation Models

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-06-28 08:00
- AIHOT score: 52
- AIHOT link: https://aihot.virxact.com/items/cmr0okx0603p2slolyhr1csw3
- Original paper: https://arxiv.org/abs/2606.29600

## AI Summary

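Monocular depth estimation typically reduces each pixel to a single scalar depth, ignoring that one camera ray may contain multiple geometrically valid surfaces. This paper introduces MultiDepth-3k (MD-3k), a sparse two-layer ordinal benchmark for measuring depth-layer preference and multi-layer spatial relationship accuracy (ML-SRA). On MD-3k, leading depth foundation models show diverse layer preferences under standard RGB input. Laplacian Visual Prompting (LVP), a training-free spectral input transformation, can substantially change the layer reported by certain frozen models. The strongest RGB/LVP pair, DAv2-L, reaches 75.5% ML-SRA. The results suggest that depth foundation models may express complementary geometric hypotheses, and that depth supervision and evaluation should be revisited through an ambiguity-aware lens.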

## Main Text

A faithful 3D world representation should account for layered geometry, where a single camera ray may contain multiple visible and geometrically valid surfaces. Monocular depth estimation, however, reduces this structure to one scalar depth per pixel. Transparent scenes make this ambiguity measurable: the same ray can pass through foreground glass and observe the background, turning the supervised target into a convention of annotation, data, and training rather than a scene-intrinsic truth. A learned predictor exposes this convention as its depth-layer preference. We introduce MultiDepth-3k (MD-3k), a sparse two-layer ordinal benchmark for measuring depth-layer preference and multi-layer spatial relationship accuracy (ML-SRA). On MD-3k, leading depth foundation models exhibit diverse layer preferences under standard RGB input, showing that the same layered geometry can be resolved differently across models. We further find that Laplacian Visual Prompting (LVP), a training-free spectral input transformation, can substantially change the reported layer for certain frozen models. The strongest RGB/LVP pair, DAv2-L, reaches 75.5% ML-SRA. These results suggest that depth foundation models may express complementary geometric hypotheses that standard RGB inference leaves unexpressed. We invite the community to rethink depth supervision and evaluation through an ambiguity-aware lens, where multiple valid 3D interpretations are treated as geometric structure to be measured, preserved, and expressed.
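To make the two measured quantities more concrete, the following is a minimal Python sketch, not the authors' implementation. The abstract does not specify the LVP kernel or the exact ML-SRA formula, so the sketch assumes that (a) LVP applies a 4-neighbour discrete Laplacian to each RGB channel before the image is passed to a frozen depth model, and (b) each MD-3k annotation is a pair of pixels with one ordinal label per layer, with ML-SRA counting a pair as correct only when one prediction matches the first-layer label and the other matches the second-layer label. All function names (`laplacian_prompt`, `sra`, `ml_sra`) and the label convention are hypothetical.

```python
import numpy as np


def laplacian_prompt(rgb):
    """Assumed form of LVP: 4-neighbour discrete Laplacian per channel.

    rgb: array of shape (H, W, 3). Returns a float array of the same shape.
    Any rescaling needed by a specific depth model is left out here.
    """
    img = np.asarray(rgb, dtype=np.float64)
    # Replicate border pixels so the output keeps the input resolution.
    p = np.pad(img, ((1, 1), (1, 1), (0, 0)), mode="edge")
    return (
        p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:]
        - 4.0 * p[1:-1, 1:-1]
    )


def ordinal_relation(depth, p, q):
    """+1 if pixel p is farther than pixel q, -1 if closer, 0 if equal.

    Assumes larger values in `depth` mean farther from the camera.
    """
    return int(np.sign(depth[p] - depth[q]))


def sra(depth, pairs, labels):
    """Fraction of annotated pairs whose predicted order matches one layer."""
    hits = [ordinal_relation(depth, p, q) == lab
            for (p, q), lab in zip(pairs, labels)]
    return float(np.mean(hits))


def ml_sra(depth_a, depth_b, pairs, labels_l1, labels_l2):
    """Assumed ML-SRA: a pair counts only if depth_a matches the layer-1
    label AND depth_b matches the layer-2 label."""
    hits = [
        ordinal_relation(depth_a, p, q) == l1
        and ordinal_relation(depth_b, p, q) == l2
        for (p, q), l1, l2 in zip(pairs, labels_l1, labels_l2)
    ]
    return float(np.mean(hits))
```

In this sketch, `depth_a` would be a frozen model's prediction on the RGB image and `depth_b` the same model's prediction on `laplacian_prompt(rgb)`. Comparing `sra` against each layer's labels gives a rough picture of a model's depth-layer preference under either input.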