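ByteDance Seed has released EdgeBench, a benchmark built to test how well AI agents learn over long-running tasks of 12 to 72 hours. It contains 134 real-world tasks across six categories (science, professional knowledge, software engineering, optimization, formal mathematics, and games), which take humans 57.2 hours on average. Agents iterate quickly by trial and error in a local workspace and receive feedback from a hidden judge. Across roughly 38,000 hours of agent runs, performance as a function of interaction time precisely fits a log-sigmoid curve, and the learning speed of top models doubles every 3 months. The first 51 tasks and the full evaluation framework are now open source.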
ByteDance Seed delivered again.
They released EdgeBench, a benchmark that tests whether AI agents can improve through experience, using 134 real-world tasks that each run for at least 12 hours.
The big deal is that it shifts AI evaluation from "what does the model already know?" to "can the model learn while doing real work?"
Huge, because future AI agents will not just answer questions from training data. They will enter messy environments, use tools, make attempts, read feedback, fix mistakes, and slowly build better solutions.
Most current benchmarks are too short for that, so they mostly test memory, coding skill, or one-shot reasoning.
EdgeBench instead gives agents real-world tasks that last 12 hours or more and include feedback loops, so it can measure whether the agent improves through experience.
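
To make the reported scaling claims concrete, here is a minimal Python sketch. It rests on assumptions: "log-sigmoid" is read here as a sigmoid applied to the logarithm of interaction time, and "learning speed doubles every 3 months" is read as a plain exponential multiplier. The summary does not give the exact parametrization, so the function names and every parameter value below are hypothetical and are not taken from EdgeBench.

```python
import math


def log_sigmoid_score(hours, ceiling, midpoint_hours, slope):
    """Score as a sigmoid of log(time).

    Assumed form (not confirmed by the source):
        score = ceiling * sigmoid(slope * (ln(hours) - ln(midpoint_hours)))
    """
    if hours <= 0:
        raise ValueError("hours must be positive")
    z = slope * (math.log(hours) - math.log(midpoint_hours))
    return ceiling / (1.0 + math.exp(-z))


def speed_multiplier(months, doubling_period_months=3.0):
    """Relative learning speed after `months`, assuming a fixed doubling period."""
    return 2.0 ** (months / doubling_period_months)


if __name__ == "__main__":
    # Illustrative parameters only; these are not EdgeBench measurements.
    for h in (1, 6, 12, 24, 48, 72):
        score = log_sigmoid_score(h, ceiling=1.0, midpoint_hours=12.0, slope=1.5)
        print(f"{h:>3} h -> score {score:.3f}")
    print("speed multiplier after 12 months:", speed_multiplier(12))
```

Under this reading, the score rises slowly at first, fastest around `midpoint_hours`, and then flattens toward `ceiling`. A doubling every 3 months compounds to a 16x multiplier over a year.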