DiningBench: A Hierarchical Multi-View Benchmark for Perception and Reasoning in the Dietary Domain

2026-04-12 08:00

AI Summary

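The research team introduces DiningBench, a benchmark for vision-language models (VLMs) in the food domain. It contains 3,021 dishes with an average of 5.27 images per dish and spans three cognitive levels: fine-grained classification, nutrition estimation, and visual question answering. The dataset introduces "hard" negative samples drawn from the same menus, along with rigorously verified nutritional data. Experiments evaluate 29 open-source and proprietary models and show that current VLMs, while strong at general reasoning, fall significantly short in fine-grained visual discrimination and precise nutritional reasoning. The study also systematically analyzes the effect of multi-view inputs and chain-of-thought reasoning and identifies five primary failure modes. The code is open source.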

Original text · Untranslated

Recent advancements in Vision-Language Models (VLMs) have revolutionized general visual understanding. However, their application in the food domain remains constrained by benchmarks that rely on coarse-grained categories, single-view imagery, and inaccurate metadata. To bridge this gap, we introduce DiningBench, a hierarchical, multi-view benchmark designed to evaluate VLMs across three levels of cognitive complexity: Fine-Grained Classification, Nutrition Estimation, and Visual Question Answering. Unlike previous datasets, DiningBench comprises 3,021 distinct dishes with an average of 5.27 images per entry, incorporating fine-grained "hard" negatives from identical menus and rigorous, verification-based nutritional data. We conduct an extensive evaluation of 29 state-of-the-art open-source and proprietary models. Our experiments reveal that while current VLMs excel at general reasoning, they struggle significantly with fine-grained visual discrimination and precise nutritional reasoning. Furthermore, we systematically investigate the impact of multi-view inputs and Chain-of-Thought reasoning, identifying five primary failure modes. DiningBench serves as a challenging testbed to drive the next generation of food-centric VLM research. All code is released at https://github.com/meituan/DiningBench.

HuggingFace Daily Papers (community trending papers)

Read the original · arxiv.org
