Rohan Paul @rohanpaul_ai

2026-06-01 09:47

AI summary

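The paper evaluates commercial AI chatbots as news intermediaries. It finds that, when questions are posed in multiple-choice form, the best systems already exceed 90% accuracy on news from only hours earlier, which suggests that retrieval-augmented generation is moving from static knowledge bases toward real-time information processing. That high accuracy is not stable, however. Performance drops markedly when the system must generate a free-form answer, when the news is in Hindi, or when the user's question contains a false presupposition. More than 70% of errors stem from retrieval failures or source divergence: the system retrieves information that is close but not exact, then generates its answer from the wrong source, language, or timestamp. The paper is titled "Evaluating Commercial AI Chatbots as News Intermediaries" (arxiv.org/abs/2605.22785).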

AI chatbots can answer questions about fresh news well, but their worst failures hide inside their confidence.

The best systems are surprisingly good at recent news when the question is clean and multiple choice.

But the paper also shows that this success is fragile, because the same systems get worse when they must answer freely, when the news is in Hindi, or when the user's question contains a false assumption.

The best systems crossed 90% accuracy on multiple-choice questions about events reported only hours earlier, which means retrieval-augmented AI has moved from stale encyclopedia mode toward live information work.

That accuracy is not the same thing as reliability, because the systems were far worse when answers had to be produced freely.

These models usually do not fail because they cannot "think," but because they land on the wrong evidence.

More than 70% of errors came from retrieval failures or source divergence, where the system found something nearby but not exact, then answered faithfully from the wrong article, wrong language, wrong scope, or wrong timestamp.

----

Paper Link - arxiv.org/abs/2605.22785

Paper Title: "Evaluating Commercial AI Chatbots as News Intermediaries"
