elvis@omarsar0

2026-04-26 00:56

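AI summary

A new Microsoft paper introduces the DELEGATE-52 benchmark, which simulates long document-editing workflows across 52 professional domains. It tests 19 models, including frontier models such as Gemini 3.1 Pro, Claude 4.6 Opus, and GPT-5.4, and finds that they corrupt an average of 25% of document content by the end of long workflows. Agentic tool use did not improve performance. The paper also offers other related insights.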

NEW paper from Microsoft.

This is an important read. (bookmark it)

The work introduces DELEGATE-52, a benchmark simulating long document-editing workflows across 52 professional domains like coding, crystallography, and music notation.

Across 19 tested models, even frontier ones (Gemini 3.1 Pro, Claude 4.6 Opus, GPT-5.4) corrupted an average of 25% of document content by the end of long workflows. Agentic tool use didn't help.

Lots of other insights in this one.

Check it out below…

Paper: https://arxiv.org/abs/2604.15597

Learn to build effective AI agents in our academy: https://academy.dair.ai/
