# ByteDance Seed Releases the EdgeBench Benchmark

- Source: Rohan Paul (@rohanpaul_ai)
- Published: 2026-07-03 08:15
- AIHOT score: 51
- AIHOT link: https://aihot.virxact.com/items/cmr47gz8u019usl3gk5yymk6x
- Original link: https://x.com/rohanpaul_ai/status/2072836536069202137

## AI Summary

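ByteDance Seed has released EdgeBench, a benchmark that tests how well AI agents learn during long-running tasks of 12 to 72 hours. It contains 134 real-world tasks across six categories (science, professional knowledge, software engineering, optimization, formal mathematics, and games), on which humans take 57.2 hours on average. Agents iterate quickly through trial and error in a local workspace and receive feedback from a hidden judge. Across roughly 38,000 hours of agent runs, performance as a function of interaction time closely fits a log-sigmoid curve, and the learning speed of top models doubles every 3 months. The first 51 tasks and the full evaluation framework are now open source.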

## Full Text

ByteDance Seed delivered again.

They released EdgeBench to test whether AI agents can improve through experience, using 134 real-world tasks that run for at least 12 hours.

The big deal is that it shifts AI evaluation from "what does the model already know?" to "can the model learn while doing real work?"

That is huge, because future AI agents will not just answer questions from training data. They will enter messy environments, use tools, make attempts, read feedback, fix mistakes, and slowly build better solutions.

Most current benchmarks are too short for that, so they mostly test memory, coding skill, or one-shot reasoning.

EdgeBench instead gives agents real-world tasks that run for at least 12 hours and include feedback loops, so it can measure whether an agent improves through experience.

Each task has a local workspace for fast trial and error, plus a hidden judge that gives stronger feedback on submitted work, which is meant to feel closer to real expert work.

The authors then ran frontier agents for about 38,000 total hours and tracked how their best score changed as they kept interacting with the task environment.
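
To make "best score" concrete, here is a minimal sketch of tracking a best-so-far score across submissions. It assumes the metric is a running maximum over judge scores, which the thread does not spell out; the function name `best_so_far` and the sample scores are hypothetical.

```python
from itertools import accumulate


def best_so_far(scores):
    """Return the running maximum of a sequence of judge scores.

    Assumption: "best score" means the highest score seen up to each
    submission. The source does not define the metric precisely.
    """
    return list(accumulate(scores, max))


# Made-up scores from successive submissions, for illustration only.
submission_scores = [0.10, 0.30, 0.20, 0.50, 0.45]
print(best_so_far(submission_scores))
```

A running maximum never decreases, so a failed attempt leaves the curve flat instead of pulling it down.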

The big result is that when scores are averaged across many tasks, learning follows a very clean log-sigmoid curve, meaning progress is slow at first, then faster, then starts to level off.
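
The thread does not give the exact formula. One plausible reading of "log-sigmoid" is a logistic (sigmoid) function of the logarithm of interaction time, sketched below. The parameter names `midpoint_hours`, `steepness`, and `ceiling`, and their default values, are illustrative assumptions and not fitted values from EdgeBench.

```python
import math


def log_sigmoid_score(hours, midpoint_hours=4.0, steepness=1.5, ceiling=1.0):
    """Hypothetical S-curve: a logistic function applied to log(time).

    Assumption: this parameterization is one reading of "log-sigmoid";
    the source does not state the actual functional form.
    """
    if hours <= 0:
        raise ValueError("hours must be positive")
    z = steepness * (math.log(hours) - math.log(midpoint_hours))
    return ceiling / (1.0 + math.exp(-z))


# Print the assumed curve at a few interaction times (in hours).
for h in (0.5, 2, 4, 12, 72):
    print(f"{h:>5} h -> {log_sigmoid_score(h):.3f}")
```

Under this assumed form, the score reaches half of `ceiling` at `midpoint_hours`, rising slowly before that point and flattening out after it.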

They also found that newer agents seem to learn from environments much faster, with the top models roughly doubling their 2-hour learning speed every 3 months.
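
For a sense of scale, here is the arithmetic a 3-month doubling period implies if the trend were to hold exactly. That is an assumption: the thread reports an observed trend, not a forecast. The function name `projected_speed_multiplier` is hypothetical.

```python
def projected_speed_multiplier(months, doubling_period_months=3.0):
    """Multiplier on learning speed after `months`, assuming steady doubling.

    Assumption: pure exponential growth with a fixed doubling period.
    This is an extrapolation for illustration, not a result from the source.
    """
    return 2.0 ** (months / doubling_period_months)


# Implied multipliers after 3, 6, and 12 months under this assumption.
for m in (3, 6, 12):
    print(f"{m:>2} months -> {projected_speed_multiplier(m):.0f}x")
```

Under that assumption, four doublings fit into a year, which works out to a 16x speedup.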

### Quoted Tweet

> Deyao Zhu: Introducing EdgeBench, a benchmark designed to study how agents learn from environments over at least 12~72-hour runs. We find that performance follows a log-si...
