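A new Microsoft paper introduces the DELEGATE-52 benchmark, which simulates long document-editing workflows across 52 professional domains. It tests 19 models, including frontier models such as Gemini 3.1 Pro, Claude 4.6 Opus, and GPT-5.4, and finds that they corrupt an average of 25% of document content by the end of long workflows. Agentic tool use did not improve performance. The paper also offers other related insights.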
NEW paper from Microsoft.
This is an important read. (bookmark it)
The work introduces DELEGATE-52, a benchmark simulating long document-editing workflows across 52 professional domains like coding, crystallography, and music notation.
Across 19 tested models, even frontier ones (Gemini 3.1 Pro, Claude 4.6 Opus, GPT-5.4) corrupted an average of 25% of document content by the end of long workflows. Agentic tool use didn't help.
Lots of other insights in this one.
Check it out below…
Paper: https://arxiv.org/abs/2604.15597