Benchmarks in Leipzig
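Source: arxiv.org. An academic paper titled "Benchmarks in Leipzig" was posted to arXiv on June 6, 2026, and received 101 points on Hacker News. The paper concerns benchmark research associated with Leipzig, but its specific methods, datasets, and results are not detailed in this summary. This entry comes from buzzing.cc's Chinese translation of popular Hacker News posts, which provides links to the original (arXiv) and to the HN discussion page.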
Mathematics > History and Overview
Title: Benchmarks in Leipzig
Abstract: Between April 1 and May 15, 2026, a group of 49 mathematicians compiled a dataset of research-level mathematics questions with known answers. Most of the work was done during the 3-day workshop *Benchmarks in Leipzig* with 35 participants at the Max Planck Institute for Mathematics in the Sciences in Leipzig, Germany. We present the resulting collection of 100 questions. We evaluated these questions in three stages: a single attempt by five state-of-the-art LLMs, followed by a 20-runs-per-model evaluation with three of these models, and finally a 3-run attempt with two heavy-thinking models. After Stage 1, 41 questions remained completely unsolved; after Stage 2, this count dropped to 16; and we concluded Stage 3 with only 2 unsolved questions. This demonstrates that the mathematical reasoning capabilities of LLMs are becoming impressive.
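The stage counts in the abstract can be turned into cumulative solve rates with a little arithmetic. The sketch below is only an illustration, not anything from the paper: the function name `summarize` and the variable names are hypothetical, and it assumes the unsolved counts are cumulative (a question solved in an earlier stage stays solved), which is how "dropped to" reads.

```python
# Counts taken from the abstract. Treating them as cumulative is an assumption.
TOTAL_QUESTIONS = 100
unsolved_after = {"Stage 1": 41, "Stage 2": 16, "Stage 3": 2}


def summarize(total, unsolved_by_stage):
    """Return (stage, solved_so_far, newly_solved, solved_fraction) per stage."""
    rows = []
    previous_unsolved = total
    for stage, unsolved in unsolved_by_stage.items():
        solved_so_far = total - unsolved
        newly_solved = previous_unsolved - unsolved
        rows.append((stage, solved_so_far, newly_solved, solved_so_far / total))
        previous_unsolved = unsolved
    return rows


summary = summarize(TOTAL_QUESTIONS, unsolved_after)
for stage, solved, new, fraction in summary:
    print(f"{stage}: {solved} solved so far ({fraction:.0%}), {new} newly solved")
```

Under that assumption, 59 questions were solved by at least one model in Stage 1, a further 25 in Stage 2, and a further 14 in Stage 3, leaving 2 of the 100 unsolved.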