Do Thought Streams Matter? Evaluating Reasoning in Gemini Vision-Language Models for Video Scene Understanding
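Source: arxiv.org. Researchers benchmark the internal reasoning traces (thought streams) of Google Gemini 2.5 Flash and Flash Lite in video scene understanding and, based on 100 hours of video, propose three metrics: contentfulness, thought-final coverage, and dominant entity analysis. The experiments find that quality gains from longer thinking level off quickly after the first few hundred tokens, and that Flash Lite strikes the best balance between quality and token consumption. The study also shows that when the reasoning budget is constrained, the model adds content to the final output that it never reasoned about, producing "compression-step hallucination"; Flash tends to discuss its reasoning process, while Flash Lite focuses more on describing the scene.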
We benchmark how internal reasoning traces, which we call thought streams, affect video scene understanding in vision-language models. Using four configurations of Google's Gemini 2.5 Flash and Flash Lite across scenes extracted from 100 hours of video, we ask three questions: does more thinking lead to better outputs, where do the gains stop, and what do these models actually think about? We introduce three evaluation metrics. Contentfulness measures how much of the thought stream is useful scene content versus meta-commentary. Thought-Final Coverage measures how faithfully the thought stream translates into the final output. Dominant Entity Analysis identifies which subjects, actions, and settings the model focuses on. GPT-5 serves as an independent judge. We find that quality gains from additional thinking plateau quickly, with most improvement occurring in the first few hundred tokens. Flash Lite offers the best balance between quality and token usage. Tight reasoning budgets cause the model to add content in the final output that it never reasoned about, a form of compression-step hallucination. Despite being different model tiers, Flash and Flash Lite produce similar thought streams, though they differ in style: Flash discusses its reasoning process, while Lite focuses on describing the scene.
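To make the first two metrics concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's implementation: the abstract says GPT-5 acts as the judge, whereas this sketch assumes the judging has already produced per-sentence labels and lists of extracted scene items. The function names, the `"scene"`/`"meta"` labels, and the exact-match set overlap are all hypothetical simplifications.

```python
def contentfulness(labels):
    """Share of thought-stream sentences labeled as scene content.

    `labels` is a list of "scene" / "meta" strings, one per sentence.
    The labels are assumed to come from an external judge (hypothetical).
    """
    if not labels:
        return 0.0
    return sum(1 for label in labels if label == "scene") / len(labels)


def thought_final_coverage(thought_items, final_items):
    """Compare scene items found in the thought stream and the final output.

    Uses exact matching on normalized strings, a crude stand-in for a
    judge-based comparison (assumption, not the paper's method).
    """
    thought = {item.strip().lower() for item in thought_items}
    final = {item.strip().lower() for item in final_items}
    shared = thought & final
    return {
        # Fraction of reasoned-about items that survive into the final output.
        "coverage": len(shared) / len(thought) if thought else 0.0,
        # Items in the final output that never appeared in the thought stream,
        # a rough proxy for compression-step hallucination.
        "unreasoned_in_final": sorted(final - thought),
    }


if __name__ == "__main__":
    print(contentfulness(["scene", "meta", "scene", "scene"]))
    print(thought_final_coverage(["a man", "Red car", "street"],
                                 ["red car", "street", "rain"]))
```

In this toy setup, a non-empty `unreasoned_in_final` list flags final-output content the model never reasoned about, which is the pattern the abstract associates with tight reasoning budgets.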