Robusto-2: A Human and VLM Benchmark for Autonomous Driving Scenarios in Lima and New York City

2026-06-18 08:00 · 15 days ago

AI Summary

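The study compares vision-language models (VLMs) with human drivers from Lima and New York City on dashcam footage recorded in both cities. Using a Visual Question Answering (VQA) paradigm, it asks four categories of questions (factual, ratings, counterfactual, and reasoning) to test generalization. Humans and VLMs differ in their responses, but no strong geography-dependent difference was found in the answers of either group. The dataset is publicly available.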

Original text · Untranslated

As self-driving cars continue to expand internationally and use multi-modal systems such as VLMs as a cognitive backbone for their action models, how well will these systems generalize to new settings, in particular to out-of-distribution (OOD) edge-case scenarios in new geographies? In this paper, we study this open question through a full factorial analysis with human drivers from Lima, human drivers from New York City, and VLMs. We show them dashcam footage collected in Lima and New York City and prompt them with a variety of questions under a Visual Question Answering (VQA) paradigm. We pick these two cities because they are highly challenging driving locations where no self-driving car company currently operates, and we ask questions that span 4 categories: Factual, Ratings, Counterfactual, and Reasoning. We find that humans and VLMs diverge in their responses, though this is modulated by the type of question asked, and that humans answer similarly regardless of where they are from (Lima/NYC). To our surprise, we did not find a strong difference in answers (from humans or VLMs) that was modulated by geography, likely due to their highly out-of-distribution nature. Our dataset is available at: https://huggingface.co/datasets/Artificio/robusto-2
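The factorial design described in the abstract can be pictured by listing its cells. The sketch below is only an illustration built from the factor levels the abstract names. The variable and function names are hypothetical, and treating question category as a third crossed factor is an assumption, not something the abstract states.

```python
from itertools import product

# Factor levels as named in the abstract; identifiers here are illustrative only.
RESPONDENTS = ["Human (Lima)", "Human (NYC)", "VLM"]
FOOTAGE_CITIES = ["Lima", "NYC"]
# Assumption: question category is crossed with the other two factors.
QUESTION_CATEGORIES = ["Factual", "Ratings", "Counterfactual", "Reasoning"]


def factorial_cells():
    """Return every (respondent, footage city, question category) combination."""
    return list(product(RESPONDENTS, FOOTAGE_CITIES, QUESTION_CATEGORIES))


cells = factorial_cells()
print(len(cells))  # 3 respondent groups * 2 footage cities * 4 categories = 24
```

Under these assumptions, a geography effect would show up as a difference between cells that share a respondent group and question category but differ in footage city.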

HuggingFace Daily Papers (community trending papers)

Read the original · arxiv.org
