Saining Xie @sainingxie · Jun 17
So this is not a benchmark for software engineering agents. It’s meant to test core reasoning and intelligence through coding, backed by 71 pages of deep analysis from some of the best competitive programmers out there.
This effort was carried out by students across multiple institutions (I’m mostly just a cheerleader here!) It was led by @ZihanZheng71803 (an undergrad who represented NYU in the ICPC World Finals), @wenhaocha1, and many of their Olympiad medalist friends. They built the live benchmark and offered expert analysis of how elite human coders compare to top LLMs. The results are now public: on the hard problems, LLMs essentially score 0%. They're good at implementation-heavy tasks that rely on memorization, but still struggle badly with observation-heavy or logic-heavy problems—those where the implementation is easy once you’ve had the critical "aha" insight. They also struggle with detail-oriented tasks—often getting the basics right but failing to account for edge cases.
Some more thoughts on why this benchmark matters: I’ve always been surrounded by top competitive programmers. My undergrad program at SJTU is renowned for ICPC success and primarily admits students with a strong high school competitive programming background. While I’ve never won an Olympiad medal myself, I deeply admire my peers who did: friends who trained for years as teens and competed at the highest international levels. One of them is my classmate and key collaborator on this project, Prof @shangjingbo, who earned ICPC World Finals gold for SJTU. For us, competitive programming was the ultimate badge of intelligence for CS students. Competitive programming emphasizes reasoning and problem solving under pressure, which differs from standard software engineering, but the skills carry over surprisingly well. That’s why so many startups love to show off their IOI gold medalists!
Beating this benchmark would be like AlphaGo beating Lee Sedol. We're not at that level yet—not even for problems with clearly verifiable outcomes. And if you care about fundamental intelligence and reasoning, this result might be worth a close look.