QuanBench+: A Unified Multi-Framework Benchmark for LLM-Based Quantum Code Generation. Paper: https://huggingface.co/papers/2604.08570

Chubby♨️@kimmonismus · Apr 14

Complaints about Anthropic’s $200 Max plan are escalating as independent tests (e.g. Bridgebench) claim Claude Opus 4.6 dropped sharply in hallucination performance. Maybe they quantized it after release, once people had adopted it in their workflows? Anyway, kudos to Grok for staying in first place.

Chubby♨️@kimmonismus · Apr 14

Holy, Anthropic did not exaggerate. Claude Mythos is built different.

Quoted post from @AISecurityInst: We ran a cybersecurity evaluation of Claude Mythos Preview and found it to be the first model to complete the AISI cyber range end to end. 🧵

DogeDesigner@cb_doge · Apr 14

Grok 4.20 is crushing BridgeBench. 🔥 #1 in speed, #1 in reasoning, #1 in hallucination control. Beating GPT-5.4, Claude Opus 4.6, Gemini, Qwen, and more.

DogeDesigner@cb_doge · Apr 14

Grok 4.20 Reasoning just took the #1 spot on the BridgeBench reasoning benchmark. 🔥 Beating GPT-5.4, Claude Opus 4.6, Google Gemini and others. Week after week, Grok keeps climbing across benchmarks. 🚀

AK@_akhaliq · Apr 14

FORGE: Fine-grained Multimodal Evaluation for Manufacturing Scenarios. Paper: https://huggingface.co/papers/2604.07413

Ethan Mollick@emollick · Apr 13

Currently, ChatGPT has the best way of viewing thinking traces: a short summary of steps in the main window, and a detailed audit in the sidebar if you want it. Claude does almost as well, but its traces are more summarized and it is harder to see calculations and code. It's a big weak spot for Gemini.

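Summary: ChatGPT currently offers the best view of thinking traces, with a step summary in the main window and a detailed audit in the sidebar. Claude comes close but summarizes more heavily, which makes calculations and code harder to inspect. Gemini has a clear weak spot here.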

DogeDesigner@cb_doge · Apr 12

Anthropic’s Claude Opus is FALLING. Latest benchmarks show its accuracy dropped from 83.3% → 68.3% in just days. That’s a major spike in hallucinations during coding. Grok 4.20 still holds the #1 spot. Undefeated.

Deedy@deedydas · Apr 12

The coolest thing Meta AI's Muse Spark can do by far is counting objects! As you can tell, it's far from perfect. They call it "visual grounding" and it can count objects and do bounding boxes. I've been playing with the new model and here's what I think so far:
Good stuff:
– Incredible at vision. Its ability to read text in images is the best I've seen.
– Really high quality at web design. It's the only model I've seen that uses Unsplash, OpenLibrary and other images by default.
– It's free! You don't pay to use Muse Spark Thinking.
Bad stuff:
– Meta's classic playbook of growth tactics is dodgy. They're sending Instagram notifs to people's friends without their consent. Their app ranking jump is not organic.
– Reasoning itself is pretty solid but not best in class. It can do pretty advanced math and science problems.
The long term threat here is Meta has distribution and has the ability to give their model away for free, which makes them a formidable threat to the big AI labs, particularly in consumer.

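Summary: Meta's free Muse Spark model is strong at visual grounding (counting objects and drawing bounding boxes, though far from perfectly), reading text in images, and web design. The author calls Meta's growth tactics dodgy, citing Instagram notifications sent to people's friends without their consent. Reasoning is solid but not best in class. With its distribution and a free model, Meta poses a long-term threat to the big AI labs, particularly in consumer.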

Rohan Paul@rohanpaul_ai · Apr 11

People using AI for Premier League bets are losing badly. A new betting benchmark suggests today’s best AI models still unravel when prediction has to survive a whole season. In KellyBench, every tested model lost money, and some went completely bust. KellyBench forced agents through a changing 100-150 matchday season where they had to predict outcomes, size bets, and protect a £100,000 bankroll. That setup tests something normal benchmarks miss: whether an LLM can stay coherent, adapt to new data, and manage risk over time. Claude Opus 4.6 was best at -11% ROI, GPT-5.4 came next at -13.6%, and several models hit -100%.

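Summary: KellyBench tests how well leading LLMs handle long-horizon prediction and risk management while betting through a Premier League season. Every tested model lost money, and some lost their whole bankroll. Claude Opus 4.6 did best at -11% ROI, followed by GPT-5.4 at -13.6%. The 100-150 matchday simulation exposes weaknesses in coherence, adaptation to new data, and risk control over sustained decision-making.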
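
The post does not say how any model sized its bets. As a rough illustration of the "size bets and protect a bankroll" part of the task, here is a minimal sketch of fractional Kelly staking, on the assumption that the benchmark's name refers to the Kelly criterion. The function names, the `scale` parameter, and the example numbers are hypothetical and do not come from KellyBench.

```python
def kelly_fraction(p_win: float, decimal_odds: float) -> float:
    """Kelly fraction of bankroll for one bet; 0.0 when there is no positive edge."""
    if not 0.0 <= p_win <= 1.0:
        raise ValueError("p_win must be a probability in [0, 1]")
    if decimal_odds <= 1.0:
        raise ValueError("decimal_odds must be greater than 1")
    b = decimal_odds - 1.0  # net profit per unit staked if the bet wins
    f = (b * p_win - (1.0 - p_win)) / b
    return max(f, 0.0)  # never stake on a negative-edge bet


def stake(bankroll: float, p_win: float, decimal_odds: float, scale: float = 0.5) -> float:
    """Amount to stake using a scaled-down ("fractional") Kelly bet."""
    return bankroll * scale * kelly_fraction(p_win, decimal_odds)


if __name__ == "__main__":
    # Hypothetical example: a £100,000 bankroll, a 60% win estimate, even-money odds.
    print(round(stake(100_000, 0.60, 2.0), 2))
```

Staking only a fraction of the full Kelly amount is a common way to reduce the risk of ruin when the probability estimate itself is unreliable, which is the failure mode the post describes.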

Noam Brown@polynoamial · Apr 11

What we really need is a benchmark where AI models make AI models that play poker.

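Context: In a GTOWizard test, GPT-5.4, Claude Opus 4.6, Gemini 3.1 Pro, Grok 4, and other leading models all lost to a specialized poker AI over 5,000 hands of heads-up no-limit Texas Hold'em. The author jokes that, since the models cannot play poker directly, we should instead test their ability to build AIs that can.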

SemiAnalysis@SemiAnalysis_ · Apr 11

InferenceX is the industry standard research platform for benchmarking AI chip performance across the world's most popular open-source LLM inference frameworks, updated continuously as the landscape evolves. We are proud to be supported by some of the leading figures across AI research, chip design, and the broader inference community.

Epoch AI@EpochAIResearch · Apr 10

What are the largest software engineering tasks AI can perform? In our new benchmark, MirrorCode, Claude Opus 4.6 reimplemented a 16,000-line bioinformatics toolkit — a task we believe would take a human engineer weeks. Co-developed with @METR_Evals. Details in thread.

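Summary: On the new MirrorCode benchmark, Claude Opus 4.6 reimplemented a 16,000-line bioinformatics toolkit, a task estimated to take a human engineer weeks. The benchmark was co-developed with @METR_Evals.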

Ethan Mollick@emollick · Apr 9

So what's the deal with Amazon Nova? They released Nova 2 in December, and even then, the top flight Nova 2 model trailed Sonnet 4.5. And it still hasn't left preview.

Summary: Amazon released Nova 2 in December. Even then its top model trailed Sonnet 4.5, and it still has not left preview.

Peter Steinberger 🦞@steipete · Apr 9

I'm working on character evals and noticed that Claude would constantly pick itself as #1, so I removed the model names from the judge and changed things.

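Summary: While building character evals, the author found that Claude kept ranking itself first, so he removed the model names from what the judge sees and adjusted the setup to keep self-preference from skewing the results.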

Epoch AI@EpochAIResearch · Apr 9

We had pre-release access to Meta’s new Muse Spark model and evaluated it on FrontierMath. It scored 39% on Tiers 1-3 and 15% on Tier 4. This is competitive with several recent frontier models, though behind GPT-5.4.

Summary: Meta's Muse Spark scored 39% on FrontierMath Tiers 1-3 and 15% on Tier 4, competitive with several recent frontier models but behind GPT-5.4.

AK@_akhaliq · Apr 9

Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding. Paper: https://huggingface.co/papers/2604.05015

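Summary: The Video-MME benchmark releases a v2 that aims to move comprehensive video understanding evaluation to its next stage. The paper is available on Hugging Face.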

Artificial Analysis@ArtificialAnlys · Apr 8

Announcing APEX-Agents-AA, our latest leaderboard on Artificial Analysis, evaluating AI agents on long-horizon professional services tasks with realistic application dependencies.
This is our implementation of the APEX-Agents benchmark - an agentic work task evaluation open-sourced by @mercor_ai. It tests AI agent ability to execute realistic tasks created by investment banking analysts, management consultants, and corporate lawyers.
Mercor released extensive data to enable model evaluation and training across the community, comprising 480 tasks including tool implementations, rubrics, and grading workflows. We exclude tasks with external service dependencies and run the remaining 452 tasks for APEX-Agents-AA. Models complete tasks using Stirrup, our open-source agent harness as used in GDPval-AA, and a customized tool set based on the original benchmark implementation.
Results overview:
🏅 OpenAI, Anthropic and Google are in close competition at the top of the leaderboard, with 33.3% for GPT-5.4, 33.0% for Claude Opus 4.6, and 32% for Gemini 3.1 Pro Preview
📈 The overall scores on Artificial Analysis today are similar to Mercor’s testing, but some models such as GPT-5.4 nano show improvements in score using our Stirrup test harness
↻ We’ll be updating this leaderboard with key releases for agentic work use as a metric for agent capability on well-defined, long horizon work tasks
APEX-Agents overview:
➤ Tasks span 3 professional domains: investment banking, management consulting, and corporate law
➤ The tasks are designed to require long-horizon work with a large number of tools, which are provided through MCP servers as would be used in many real-world deployments (including calendar, chat, spreadsheet and presentation operations, etc.)
➤ Required outputs include direct message responses (87%) and creating or modifying spreadsheets (6.6%), documents (4.8%), and presentations (1.3%)
➤ Model outputs are parsed and graded against binary rubrics using an LLM judge. Each task is run 3 times and scored pass@1 - a pass requires every rubric test to pass
➤ In our APEX-Agents-AA implementation, 452 tasks run in our open-source Stirrup harness with tool management and usage from @mercor_ai's original MCP implementation. This provides a consistent, reproducible baseline for comparing raw model capability that aligns with realistic agent deployments

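Summary: Artificial Analysis launched the APEX-Agents-AA leaderboard, built on Mercor's APEX-Agents benchmark, to evaluate AI agents on long-horizon professional tasks in investment banking, management consulting, and corporate law. It runs 452 tasks through the Stirrup harness with MCP tools, covering message responses, document work, and more. GPT-5.4 leads at 33.3%, with Claude Opus 4.6 (33.0%) and Gemini 3.1 Pro Preview (32%) close behind. Outputs are graded by an LLM judge and scored pass@1.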
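
The scoring rule in the post (binary rubrics, three runs per task, a pass only when every rubric check passes) can be sketched as below. Averaging the three runs to estimate pass@1 and then averaging across tasks is an assumption about how the numbers are aggregated, not something the post spells out. The function names and the example data are hypothetical.

```python
def run_passes(rubric_results: list[bool]) -> bool:
    """A run passes only if it has rubric checks and every one of them passed."""
    return len(rubric_results) > 0 and all(rubric_results)


def task_pass_at_1(runs: list[list[bool]]) -> float:
    """Estimate pass@1 for one task as the fraction of its runs that passed."""
    if not runs:
        raise ValueError("a task needs at least one run")
    return sum(run_passes(run) for run in runs) / len(runs)


def benchmark_score(tasks: dict[str, list[list[bool]]]) -> float:
    """Mean of per-task pass@1 estimates across all tasks."""
    if not tasks:
        raise ValueError("no tasks to score")
    return sum(task_pass_at_1(runs) for runs in tasks.values()) / len(tasks)


if __name__ == "__main__":
    # Hypothetical results: each inner list holds the rubric outcomes of one run.
    results = {
        "task-1": [[True, True], [True, False], [True, True]],
        "task-2": [[False, True], [False, False], [True, False]],
    }
    print(f"{benchmark_score(results):.3f}")
```

Because one failed rubric check fails the whole run, scores under this rule sit well below the share of individual checks passed, which helps explain why the top of the leaderboard is near 33%.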

Haider.@haider1 · Apr 8

I still can't get over this. Look at those benchmark results:
> swe-bench verified: mythos 93.9% vs opus 4.6 80.8%
> swe-bench pro: mythos 77.8% vs opus 4.6 53.4%
> swe-bench multilingual: mythos 87.3% vs opus 4.6 77.8%
> swe-bench multimodal: mythos 59.0% vs opus 4.6 27.1%
> terminal-bench 2.0: mythos 82.0% vs opus 4.6 65.4%

Artificial Analysis@ArtificialAnlys · Apr 8

We’ve launched agent landscape overviews across 7 key categories relevant to the real-world tasks agents are used for today! 💼 Categories so far include: General Work, Coding, Chatbots, Presentations, OCR, Data Analysis, and Customer Support. We report on key capabilities relevant to each agent category such as filetype handling, integrations, browser automation, bring-your-own-model support, open source status, and more. This is just the start of our benchmarking of agents. We’ll continue to dive deeper over time with more quantitative analyses.

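Summary: Artificial Analysis published agent landscape overviews across seven categories: General Work, Coding, Chatbots, Presentations, OCR, Data Analysis, and Customer Support. The overviews compare agents on filetype handling, integrations, browser automation, bring-your-own-model support, open source status, and other capabilities. This is only the start of its agent benchmarking, with more quantitative analyses to follow.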

Artificial Analysis@ArtificialAnlys · Apr 8

We’ve added a new pseudonymous video model to our Text to Video and Image to Video Arenas. ‘HappyHorse-1.0’ is currently landing in the #1 spot for Text and Image to Video (No Audio) and the #2 spot for Text and Image to Video (With Audio). Further details coming soon. Example generations below from HappyHorse-1.0 in the Artificial Analysis Video Arena 🚀

Summary: Artificial Analysis added the pseudonymous video model HappyHorse-1.0 to its Text to Video and Image to Video Arenas. It ranks #1 without audio and #2 with audio; further details are coming soon.

Anthropic@AnthropicAI · Apr 4

New Anthropic Fellows Research: a new method for surfacing behavioral differences between AI models. We apply the “diff” principle from software development to compare open-weight AI models and identify features unique to each. Read more: https://www.anthropic.com/research/diff-tool

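Summary: New Anthropic Fellows research applies the "diff" principle from software development to compare open-weight AI models and identify the behavioral features unique to each.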

karminski-牙医@karminski3 · Mar 30

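No, that's not it. The point is not to have the LLM simulate a database, but to have it write code from scratch that implements a high-performance vector database. This mainly tests the model's programming abilities across system architecture, databases, index performance tuning, agents, and so on. I'm still editing the video and will post the detailed review shortly. You can look at the evaluation framework repo, which is open source: https://github.com/KCORES/vector-db-bench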

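Summary: The developer clarifies that the test asks models to implement a high-performance vector database from scratch, not to simulate one. The vector-db-bench evaluation framework is open source, and a detailed review video is coming.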

Epoch AI@EpochAIResearch · Mar 28

We have removed a problem from FrontierMath: Open Problems. The problem was solved by AI, but upon review we determined that the problem didn’t meet our minimum bar for mathematical notability. This is a different problem from the one whose solution we announced on Monday.

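Summary: FrontierMath: Open Problems removed a problem that AI had solved. On review, the problem did not meet the minimum bar for mathematical notability. The team stresses that it is a different problem from the one whose solution was announced on Monday.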

Artificial Analysis@ArtificialAnlys · Mar 28

Introducing AA-AgentPerf - the hardware benchmark for the agent era. Key details:
➤ Real agent workloads, not synthetic queries: we’ve captured real coding agent trajectories where our agents used up to 200 turns and worked with sequence lengths >100K tokens
➤ Production optimizations allowed: KV cache reuse, disaggregated prefill/decode, speculative decoding - we’re allowing the optimizations that labs and inference providers are serving in production so that we can capture what real deployments should look like
➤ Measures what developers need to know: Max concurrent users at each target output speed, expressed per accelerator, per kW TDP, per $/hr, and per rack
➤ Built for every kind of scale: designed to measure systems from a single accelerator up to a full rack, and to fairly evaluate every architecture from DRAM-only designs to SRAM-only designs and everything in between
➤ Live now: we’re announcing AA-AgentPerf today and opening submissions of configurations for benchmarking effective immediately. The models supported at launch are gpt-oss-120b and DeepSeek V3.2. We’ll be publishing results on a rolling basis.
AA-AgentPerf is a benchmark for real-world performance of AI accelerator hardware. We’re benchmarking inference of particular models on a specific system with a specific config (i.e., inference stack, parallelism config and more). AA-AgentPerf has been shaped by our work with inference providers and engagement with AI accelerator companies, developers, and enterprise buyers over the past year. Our goal is for anyone deploying models - whether buying or leasing accelerators - to be able to use AA-AgentPerf as the definitive resource for understanding real-world hardware performance.

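Summary: AA-AgentPerf is an AI hardware benchmark for the agent era. It uses real agent workloads (up to 200 turns and sequences over 100K tokens) instead of synthetic queries, and it allows production optimizations such as KV cache reuse and disaggregated prefill/decode. It reports max concurrent users per accelerator, per kW TDP, per $/hr, and per rack, for systems from a single accelerator to a full rack. Launch models are gpt-oss-120b and DeepSeek V3.2.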
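
The reporting scheme described above (one headline number, max concurrent users at a target output speed, divided by four different denominators) can be sketched as follows. The class, its field names, and the example figures are hypothetical and are not AA-AgentPerf results or its actual schema.

```python
from dataclasses import dataclass


@dataclass
class SystemResult:
    """One benchmarked configuration at a single target output speed."""
    max_concurrent_users: int   # users served while meeting the target speed
    accelerators: int           # number of accelerators in the system
    tdp_kw: float               # total thermal design power, in kW
    cost_per_hour_usd: float    # purchase-amortized or lease cost, in $/hr
    racks: float                # rack footprint (may be a fraction of a rack)


def normalize(result: SystemResult) -> dict[str, float]:
    """Express max concurrent users per accelerator, per kW, per $/hr, and per rack."""
    users = result.max_concurrent_users
    return {
        "users_per_accelerator": users / result.accelerators,
        "users_per_kw": users / result.tdp_kw,
        "users_per_usd_hour": users / result.cost_per_hour_usd,
        "users_per_rack": users / result.racks,
    }


if __name__ == "__main__":
    # Hypothetical 8-accelerator node occupying a quarter of a rack.
    node = SystemResult(64, accelerators=8, tdp_kw=10.0, cost_per_hour_usd=32.0, racks=0.25)
    for metric, value in normalize(node).items():
        print(f"{metric}: {value:g}")
```

Dividing the same measurement by different denominators lets a single result answer different buyer questions: density, power efficiency, or cost efficiency.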

Artificial Analysis@ArtificialAnlys · Mar 25

Inworld, ElevenLabs, and MiniMax continue to lead our Text to Speech leaderboard for most preferred models Recent checkpoints from each of the labs continue to push the frontier of TTS quality, with 4 out of the top 5 models being released this year. Leading TTS models are increasingly realistic, particularly on relatively straightforward text, with preference differences increasingly coming down to affinity for different voices. Latest results also reflect stronger bot vote filtering, confirmed via triangulation against third-party evaluators. We've also added rank ranges based on each model's 95% confidence interval, showing where a model could land based on its Elo score range. Key results: ➤ Most preferred: Current top 5 per our TTS leaderboard: 1. Inworld TTS 1.5 Max (Elo of 1,238); 2. ElevenLabs Eleven v3 (1,197); 3. Inworld TTS 1 Max (1,183); 4. Inworld TTS 1.5 Mini (1,182); 5. MiniMax Speech 2.8 HD (1,175) ➤ Price: Kokoro 82M v1.0 (Replicate) leads at $0.65 per 1M characters, followed by Inworld TTS 1 and 1.5 Mini at $5, and AsyncFlow V2 at $8.33 ➤ Speed: WaveNet leads for batch generation at 419 characters processed per second, followed by Kokoro 82M v1.0 (Replicate) at 235, and Inworld TTS 1.5 Mini at 214 See below for further detail ⬇️

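Summary: Inworld, ElevenLabs, and MiniMax continue to lead the TTS leaderboard, and four of the top five models were released this year. Leading models are increasingly realistic on straightforward text, so preference differences mostly come down to voice affinity. The methodology now has stronger bot-vote filtering and adds rank ranges based on 95% confidence intervals. Inworld TTS 1.5 Max leads with an Elo of 1,238, Kokoro 82M v1.0 is cheapest at $0.65 per 1M characters, and WaveNet is fastest for batch generation at 419 characters per second.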
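
The post says rank ranges come from each model's 95% confidence interval but does not give the exact method. One plausible reading, sketched here as an assumption, is that a model's best possible rank counts only the models that are certainly above it, and its worst possible rank counts every model that could be above it. The function name and the interval values are hypothetical, not leaderboard data.

```python
def rank_ranges(intervals: dict[str, tuple[float, float]]) -> dict[str, tuple[int, int]]:
    """Map each model to (best_rank, worst_rank) given (lower, upper) Elo bounds."""
    ranges: dict[str, tuple[int, int]] = {}
    for name, (lower, upper) in intervals.items():
        others = [bounds for other, bounds in intervals.items() if other != name]
        # Best case: only models whose whole interval sits above ours outrank us.
        best = 1 + sum(1 for other_lower, _ in others if other_lower > upper)
        # Worst case: any model whose interval reaches above our lower bound may outrank us.
        worst = 1 + sum(1 for _, other_upper in others if other_upper > lower)
        ranges[name] = (best, worst)
    return ranges


if __name__ == "__main__":
    # Hypothetical 95% confidence intervals on Elo scores.
    example = {
        "model-a": (1220.0, 1250.0),
        "model-b": (1180.0, 1215.0),
        "model-c": (1170.0, 1190.0),
    }
    for model, (best, worst) in rank_ranges(example).items():
        print(f"{model}: rank {best}-{worst}")
```

With overlapping intervals, as for the second and third models here, both get the same rank range, which is the kind of tie a single ordered list hides.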

Epoch AI@EpochAIResearch · Mar 24

AI has solved one of the problems in FrontierMath: Open Problems, our benchmark of real research problems that mathematicians have tried and failed to solve. See thread for more.

Summary: AI has solved one problem in FrontierMath: Open Problems, a benchmark of real research problems that professional mathematicians have tried and failed to solve.

Epoch AI@EpochAIResearch · Oct 11

We manually evaluated three compute-intensive model settings on our extremely hard math benchmark. FrontierMath Tier 4: Battle Royale! GPT-5 Pro set a new record (13%), edging out Gemini 2.5 Deep Think by a single problem (not statistically significant). Grok 4 Heavy lags. 🧵

Summary: On the extremely hard FrontierMath Tier 4 benchmark, GPT-5 Pro set a new record at 13%, beating Gemini 2.5 Deep Think by a single problem (not statistically significant), while Grok 4 Heavy lagged behind.

Noam Brown@polynoamial · Sep 18

12/12 problems solved, which would be equivalent to a 1st place performance. GPT-5's solutions were responsible for solving 11/12 of them.

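Context: OpenAI's reasoning system solved 12 of 12 problems at the 2025 ICPC World Finals, a result equivalent to first place among human contestants. GPT-5 solved 11 of the problems.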

OpenAI@OpenAI · Sep 18

Our general-purpose reasoning models solved all 12 problems at the 2025 International Collegiate Programming Contest (ICPC) World Finals, the world’s top university programming competition, which was enough for a 1st-place human ranking.

Summary: OpenAI's reasoning system solved all 12 problems at the 2025 ICPC World Finals, a score that would have ranked first among the human teams.

Google DeepMind@GoogleDeepMind · Sep 18

An advanced version of Gemini 2.5 Deep Think has achieved gold-medal level performance at the ICPC 2025 - one of the world’s most prestigious programming contests. 🏅 Building on the model's success in math at the IMO, this marks another historic milestone for advanced AI. 🧵

Summary: An advanced version of Gemini 2.5 Deep Think reached gold-medal level at ICPC 2025, following the model's earlier success in math at the IMO.

Jim Fan@DrJimFan · Sep 13

There was something deeply satisfying about ImageNet. It had a well curated training set. A clearly defined testing protocol. A competition that rallied the best researchers. And a leaderboard that spawned ResNets and ViTs, and ultimately changed the field for good. Then NLP followed. No matter how much OpenAI, Anthropic, and xAI disagree, they at least agree on one thing: benchmarking. MMLU, HLE, SWEBench - you can’t make progress until you are able to measure it. Robotics still doesn’t have such a rallying call. No one agrees on anything: hardware, task, scoring, simulation engine, or real world environment. Everyone is SOTA, by definition, on the benchmark they define on the fly for each paper. From the maker of ImageNet - BEHAVIOR takes a stab at the daunting challenge of unifying robotics benchmarking on a reproducible physics engine (Isaac Sim). The project started before I graduated from Stanford Vision Lab, and took so many years of dedication and PhD careers to build. I hope BEHAVIOR is either the hill-climbing signal we need, or the spark that finally gets us talking about how to measure real progress as a field.

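Summary: The post argues that computer vision (ImageNet) and NLP (MMLU, HLE, SWEBench) have established standard benchmarks, while robotics still has no shared standard and no agreement on hardware, task definitions, or scoring. BEHAVIOR, from the maker of ImageNet and built on the Isaac Sim physics engine, aims to provide a reproducible, unified robotics benchmark. The project has launched its first NeurIPS 2025 challenge and hopes to become the signal that drives progress in the field.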

Hao AI Lab@haoailab · Aug 22

[Lmgame Bench] 🤔 Ever wondered how to evaluate different games in Lmgame-Bench or even add your own, but don’t know where to start? We’ve made it super easy to run evaluations and integrate new games. Our latest blog walks you through a few key features from Lmgame Bench including: - Agent & environment setup. - One-command single & multi-agent evals. - Model & gaming harness support. You can find out more from our Blog 👉https://lmgame.org/#/blog/lmgame_use

Hao AI Lab@haoailab · Aug 13

[Lmgame Bench] 🔥 We tested OpenAI’s GPT-5-thinking-high and two recent open-source models in our Lmgame Bench! Across 26 models and 6 games (Sokoban, Tetris, 2048, Candy Crush, Mario, Ace Attorney), here’s where they landed: GPT-5-thinking-high → #2, Qwen3‑235B‑A22B‑Thinking‑2507 → #10, glm4.5 → #18.

Hao AI Lab@haoailab · Aug 8

[Lmgame Bench] 🏆 Congratulations to o3 for decisively winning the first-ever AI Chess Tournament, and to grok-4 and gemini-2.5-pro for taking second and third place! This result closely aligns with our Lmgame-Bench leaderboard. It shows that games aren't just for fun: they're reliable and consistent signals of LLMs’ intelligence, and our benchmark is an effective predictor of LLMs’ gaming capability! https://huggingface.co/spaces/lmgame/lmgame_bench

Hao AI Lab@haoailab · Jul 25

[Lmgame Bench] 🧐 Kimi-k2-0711-preview shows stellar performance on math, coding and tool-using agentic benchmarks. But we found that gaming environments still pose a challenge for non-reasoning models like Kimi-k2: on Lmgame Bench, it ranks only #18 out of the 19 models we evaluated on our leaderboard.

Noam Brown@polynoamial · Jul 19

I think it's safe to say this @OpenAI IMO gold result came as a bit of a surprise to folks

Summary: OpenAI's gold-medal result at the IMO came as a surprise to many, as the post notes in a light tone.

Saining Xie@sainingxie · Jun 17

So this is not a benchmark for software engineering agents. It’s meant to test core reasoning and intelligence through coding—backed by 71 pages of deep analysis from some of the best competitive programmers out there. This effort was carried out by students across multiple institutions (I’m mostly just a cheerleader here!) It was led by @ZihanZheng71803 (an undergrad who represented NYU in the ICPC World Finals), @wenhaocha1, and many of their Olympiad medalist friends. They built the live benchmark and offered expert analysis of how elite human coders compare to top LLMs. The results are now public: on the hard problems, LLMs essentially score 0%. They're good at implementation-heavy tasks that rely on memorization, but still struggle badly with observation-heavy or logic-heavy problems—those where the implementation is easy once you’ve had the critical "aha" insight. They also struggle with detail-oriented tasks—often getting the basics right but failing to account for edge cases. Some more thoughts on why this benchmark matters: I’ve always been surrounded by top competitive programmers. My undergrad program at SJTU is renowned for ICPC success and primarily admits students with a strong high school competitive programming background. While I’ve never won an olympiad medal myself, I deeply admire my peers who did—friends who trained for years as teens and competed at the highest international levels. One of them is my classmate and key collaborator on this project, Prof @shangjingbo, who earned ICPC world final gold for SJTU. For us, competitive programming was the ultimate badge of intelligence for CS students. Competitive programming emphasizes reasoning and problem solving under pressure, which differs from standard software engineering—but the skills carry over surprisingly well. That’s why so many startups love to show off their IOI gold medalists! Beating this benchmark would be like AlphaGo beating Lee Sedol. 
We're not at that level yet—not even for problems with clearly verifiable outcomes. And if you care about fundamental intelligence and reasoning, this result might be worth a close look.
