Rohan Paul@rohanpaul_ai

2026-07-03 08:15

AI Summary

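ByteDance Seed has released EdgeBench, a benchmark designed to test how well AI agents learn over long-horizon tasks lasting 12 to 72 hours. It contains 134 real-world tasks across six categories (science, professional knowledge, software engineering, optimization, formal mathematics, and games), which take humans 57.2 hours on average. Agents iterate quickly by trial and error in a local workspace and receive feedback from a hidden judge. Across roughly 38,000 hours of agent runs, performance as a function of interaction time closely fits a log-sigmoid curve, and the top models' learning speed doubles every 3 months. The first batch of 51 tasks and the full evaluation framework have been open-sourced.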

ByteDance Seed delivered again.

They released EdgeBench to test whether AI agents can improve through experience, using 134 real-world tasks that run for at least 12 hours.

The big deal is that it shifts AI evaluation from "what does the model already know?" to "can the model learn while doing real work?"

Huge, because future AI agents will not just answer questions from training data. They will enter messy environments, use tools, make attempts, read feedback, fix mistakes, and slowly build better solutions.

Most current benchmarks are too short for that, so they mostly test memory, coding skill, or one-shot reasoning.

EdgeBench instead gives agents real-world tasks that run for at least 12 hours, with feedback loops, so it can measure whether the agent improves through experience.

Each task has a local workspace for fast trial and error, plus a hidden judge that gives stronger feedback on submitted work, a setup meant to feel closer to real expert work.

The authors then ran frontier agents for about 38,000 total hours and tracked how their best score changed as they kept interacting with the task environment.
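For a concrete feel of what "tracking the best score" could mean, here is a minimal sketch. It assumes each submission to the judge yields one numeric score and that the tracked quantity is a running maximum; the function name and the sample scores are hypothetical, not taken from EdgeBench.

```python
from itertools import accumulate


def best_so_far(scores):
    """Return the running maximum of a sequence of judged scores.

    Assumption: higher is better and each entry is one judged submission.
    """
    return list(accumulate(scores, max))


# Hypothetical scores from five successive submissions.
submissions = [0.1, 0.3, 0.2, 0.5, 0.4]
print(best_so_far(submissions))
```

A running maximum never decreases, so a failed attempt leaves the curve flat instead of pulling it down.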

The big result is that when scores are averaged across many tasks, learning follows a very clean log-sigmoid curve, meaning progress is slow, then faster, then starts to level off.
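One way to picture that shape is a sigmoid applied to the logarithm of interaction time. The sketch below is only one reading of "log-sigmoid": the post does not give the paper's parameterization, so the function name and the `ceiling`, `midpoint_hours`, and `steepness` parameters are illustrative assumptions.

```python
import math


def log_sigmoid_score(t_hours, ceiling=1.0, midpoint_hours=4.0, steepness=1.5):
    """Sigmoid in log-time: slow start, faster middle, then a plateau.

    All parameter values are made up for illustration, not fitted to data.
    """
    if t_hours <= 0:
        return 0.0
    z = steepness * (math.log(t_hours) - math.log(midpoint_hours))
    return ceiling / (1.0 + math.exp(-z))


# Evaluate on a log-spaced time grid to see the S-shape.
for t in (0.25, 1, 4, 16, 64):
    print(f"{t:>6} h -> {log_sigmoid_score(t):.3f}")
```

Under these assumptions the score reaches half of `ceiling` at `midpoint_hours`, and each further multiplication of the time budget buys less once the curve is past that point.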

They also found that newer agents seem to learn from environments much faster, with the top models roughly doubling their 2-hour learning speed every 3 months.
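To put the doubling claim in numbers, here is a small arithmetic sketch. It assumes clean exponential growth with a 3-month doubling period, which is a simplification of "roughly doubling"; the function name is hypothetical, and nothing here says the trend will continue.

```python
def relative_learning_speed(months_elapsed, doubling_period_months=3.0):
    """Speed multiplier relative to month 0, assuming steady doubling."""
    return 2.0 ** (months_elapsed / doubling_period_months)


for months in (0, 3, 6, 12):
    print(f"{months:>2} months -> {relative_learning_speed(months):.0f}x")
```

Under that assumption, a year of progress compounds to four doublings, or 16x the starting speed.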

Deyao Zhu

