So this is not a benchmark for software engineering agents. It’s meant to test core reasoning and intelligence through coding, backed by 71 pages of deep analysis from some of the best competitive programmers out there.

This effort was carried out by students across multiple institutions (I’m mostly just a cheerleader here!). It was led by @ZihanZheng71803 (an undergrad who represented NYU in the ICPC World Finals), @wenhaocha1, and many of their Olympiad medalist friends. They built the live benchmark and offered expert analysis of how elite human coders compare to top LLMs.

The results are now public: on the hard problems, LLMs essentially score 0%. They’re good at implementation-heavy tasks that rely on memorization, but still struggle badly with observation-heavy or logic-heavy problems, those where the implementation is easy once you’ve had the critical "aha" insight. They also struggle with detail-oriented tasks, often getting the basics right but failing to account for edge cases.

Some more thoughts on why this benchmark matters: I’ve always been surrounded by top competitive programmers. My undergrad program at SJTU is renowned for ICPC success and primarily admits students with a strong high school competitive programming background. While I’ve never won an olympiad medal myself, I deeply admire my peers who did: friends who trained for years as teens and competed at the highest international levels. One of them is my classmate and key collaborator on this project, Prof @shangjingbo, who earned ICPC World Finals gold for SJTU. For us, competitive programming was the ultimate badge of intelligence for CS students.

Competitive programming emphasizes reasoning and problem solving under pressure, which differs from standard software engineering, but the skills carry over surprisingly well. That’s why so many startups love to show off their IOI gold medalists!

Beating this benchmark would be like AlphaGo beating Lee Sedol. We’re not at that level yet, not even for problems with clearly verifiable outcomes. And if you care about fundamental intelligence and reasoning, this result might be worth a close look.


DeepSeek @deepseek_ai · May 29

🚀 DeepSeek-R1-0528 is here!
🔹 Improved benchmark performance
🔹 Enhanced front-end capabilities
🔹 Reduced hallucinations
🔹 Supports JSON output & function calling
✅ Try it now: https://chat.deepseek.com/
🔌 No change to API usage; docs here: https://api-docs.deepseek.com/guides/reasoning_model
🔗 Open-source weights: https://huggingface.co/deepseek-ai/DeepSeek-R1-0528


Lilian Weng @lilianweng · May 17

Giving your models more time to think before prediction, for example via smart decoding, chain-of-thought reasoning, or latent thoughts, turns out to be quite effective for unblocking the next level of intelligence. New post is here :) “Why we think”: https://lilianweng.github.io/posts/2025-05-01-thinking/


DeepSeek @deepseek_ai · Feb 24

🚀 Day 1 of #OpenSourceWeek: FlashMLA
Honored to share FlashMLA - our efficient MLA decoding kernel for Hopper GPUs, optimized for variable-length sequences and now in production.
✅ BF16 support
✅ Paged KV cache (block size 64)
⚡ 3000 GB/s memory-bound & 580 TFLOPS compute-bound on H800
🔗 Explore on GitHub: https://github.com/deepseek-ai/FlashMLA
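
To put the two throughput figures in the post side by side, here is a rough roofline-style calculation. It is only a sketch: treating the quoted 3000 GB/s and 580 TFLOPS as the two ceilings of a single roofline is an assumption made for illustration, and the function names are hypothetical, not part of FlashMLA.

```python
# Figures quoted in the post for H800. Using them as the two ceilings of one
# roofline model is an illustrative assumption, not something the post claims.
MEMORY_BOUND_BYTES_PER_S = 3000e9   # 3000 GB/s
COMPUTE_BOUND_FLOP_PER_S = 580e12   # 580 TFLOPS


def ridge_intensity() -> float:
    """Arithmetic intensity (FLOP per byte) where the two ceilings meet."""
    return COMPUTE_BOUND_FLOP_PER_S / MEMORY_BOUND_BYTES_PER_S


def attainable_flop_per_s(intensity: float) -> float:
    """Roofline estimate: the lower of the bandwidth and compute ceilings."""
    return min(COMPUTE_BOUND_FLOP_PER_S, MEMORY_BOUND_BYTES_PER_S * intensity)


if __name__ == "__main__":
    print(f"ridge point: {ridge_intensity():.1f} FLOP/byte")
    for intensity in (10.0, 100.0, 1000.0):
        tflops = attainable_flop_per_s(intensity) / 1e12
        print(f"intensity {intensity:>6.1f} FLOP/byte -> {tflops:.0f} TFLOPS")
```

Under this assumption, a workload below roughly 193 FLOP per byte would be limited by memory bandwidth, and one above it by compute.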


DeepSeek @deepseek_ai · Feb 18

🚀 Introducing NSA: A Hardware-Aligned and Natively Trainable Sparse Attention mechanism for ultra-fast long-context training & inference!
Core components of NSA:
• Dynamic hierarchical sparse strategy
• Coarse-grained token compression
• Fine-grained token selection
💡 With optimized design for modern hardware, NSA speeds up inference while reducing pre-training costs, without compromising performance. It matches or outperforms Full Attention models on general benchmarks, long-context tasks, and instruction-based reasoning.
📖 For more details, check out our paper here: https://arxiv.org/abs/2502.11089


ZhipuAI by BigModel @ZhipuAI · Jan 5

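Buy $ZHIPU: built on the fully self-developed GLM model architecture and trained on bilingual Chinese-English data at the scale of tens of trillions. ZHIPU's overall capabilities approach GPT-4-Turbo, with inference speeds of up to 80 token/s and a reading speed 25 times the human limit (300 words/minute), setting a new benchmark for AI performance! https://bigmodel.cn/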


Lilian Weng @lilianweng · Sep 13

🍓 Finally o1 is out - our first model with general reasoning capabilities. Not only does it achieve impressive results on hard, scientific tasks, but it is also significantly improved on safety and robustness. https://openai.com/index/learning-to-reason-with-llms/ We found that reasoning in context about safety rules is a super efficient way to teach models human values and principles. Truly, capability and safety are not two conflicting goals. 🤝
