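The paper evaluates commercial AI chatbots as news intermediaries. It finds that, when questions are posed in multiple-choice form, the best systems exceed 90% accuracy on news published only hours earlier, which suggests that retrieval-augmented generation is moving from static knowledge bases toward real-time information processing. That high accuracy is not stable, however. Performance drops markedly when systems must generate free-form answers, when the news is in Hindi, or when the user's question contains a false presupposition. More than 70% of errors stem from retrieval failures or source mismatch: the system retrieves information that is close but not exact, then generates an answer based on the wrong source, language, or timestamp. The paper is titled "Evaluating Commercial AI Chatbots as News Intermediaries" (arxiv.org/abs/2605.22785).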
AI chatbots can answer fresh news well, but their worst failures hide behind confident answers.
The best systems are surprisingly good at recent news when the question is clean and multiple choice.
But the paper also shows that this success is fragile: the same systems get worse when they must answer freely, when the news is in Hindi, or when the user's question contains a false assumption.
The best systems crossed 90% accuracy on multiple-choice questions about events reported only hours earlier, which means retrieval-augmented AI has moved from stale encyclopedia mode toward live information work.