ComBench: A Reasoning and Construction Benchmark for Olympiad-Level Combinatorics

2026-06-09 08:00

AI Summary

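ComBench is a benchmark for Olympiad-level combinatorics containing 100 human-annotated competition-level problems, split into analysis-centric problems (which emphasize rigorous mathematical arguments) and construction-centric problems (which require explicit constructions together with correctness justifications). Evaluation combines rubric-guided proof grading with deterministic construction verification, exposing cases where proof quality and construction validity diverge. Frontier models are far from saturating the benchmark: the strongest model reaches 65.4% overall Avg. and 75.3% Best@4. Kimi-K2.6 trails GPT-5.5 on analysis-centric proof grading but overtakes it on construction-centric Best@4; Existence and Construction problems remain consistently the hardest for all representative models.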
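To illustrate what deterministic construction verification can look like, here is a minimal sketch of a checker for a toy problem. The problem (exhibit a set of integers in [1, N] whose pairwise sums are all distinct, i.e. a Sidon set) and the function name `is_sidon_set` are assumptions chosen for illustration; they are not taken from ComBench, whose actual verifiers are not described in this summary.

```python
from itertools import combinations_with_replacement


def is_sidon_set(candidate, upper_bound):
    """Deterministically check a submitted construction (hypothetical toy task).

    Returns True only if `candidate` is a collection of distinct integers in
    [1, upper_bound] whose pairwise sums a + b (with a <= b) are all distinct.
    """
    values = list(candidate)
    # Reject repeated elements: a set must not contain duplicates.
    if len(set(values)) != len(values):
        return False
    # Reject non-integers and out-of-range values.
    if any(not isinstance(v, int) or v < 1 or v > upper_bound for v in values):
        return False
    seen_sums = set()
    # Enumerate every unordered pair, including a paired with itself.
    for a, b in combinations_with_replacement(sorted(values), 2):
        pair_sum = a + b
        if pair_sum in seen_sums:
            return False  # two different pairs share a sum
        seen_sums.add(pair_sum)
    return True


valid = is_sidon_set([1, 2, 5, 11], upper_bound=11)
invalid = is_sidon_set([1, 2, 3, 4], upper_bound=11)  # 1 + 4 == 2 + 3
```

A checker of this kind accepts or rejects the construction regardless of how convincing the accompanying proof is, which is one way proof quality and construction validity could be scored separately.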

Original text (untranslated)

Combinatorics is central to Olympiad-level mathematical problem solving, requiring deep discrete reasoning, creative constructions, and rigorous structural insight. Recent evidence suggests that even today's strongest frontier models remain uneven on Olympiad combinatorics, revealing a gap in creative mathematical reasoning. We introduce ComBench, an Olympiad-level combinatorics benchmark for evaluating and diagnosing the combinatorial reasoning capabilities of large language models. ComBench contains 100 human-annotated competition-level problems organized around two complementary settings: analysis-centric problems, which primarily require rigorous mathematical arguments, and construction-centric problems, which require explicit constructions in addition to correctness justifications. The evaluation protocol combines rubric-guided proof grading with deterministic construction verification, exposing cases where proof quality and construction validity diverge. Experiments on frontier open- and closed-source models show that ComBench is far from saturated: the strongest model reaches 65.4% overall Avg. and 75.3% overall Best@4. We further find that Rigorous Proof Reasoning and Constructive Realization are distinct capabilities: Kimi-K2.6 trails GPT-5.5 on analysis-centric proof grading but surpasses it on construction-centric Best@4, while Existence and Construction problems remain consistently hardest across representative frontier models.
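The two headline numbers, Avg. and Best@4, can be illustrated with a short sketch. It assumes that each problem is attempted four times, that Avg. is the mean score over all attempts, and that Best@4 is the per-problem maximum averaged over problems; the abstract does not spell out these definitions, so treat them as assumptions. The function name and the toy scores are hypothetical.

```python
from statistics import mean


def avg_and_best_at_k(scores, k=4):
    """Aggregate per-attempt scores into (Avg., Best@k).

    `scores` maps a problem id to a list of k scores in [0, 1].
    Assumed definitions (not confirmed by the source):
      Avg.   = mean over problems of the mean attempt score
      Best@k = mean over problems of the best attempt score
    """
    for problem_id, attempts in scores.items():
        if len(attempts) != k:
            raise ValueError(f"{problem_id}: expected {k} attempts, got {len(attempts)}")
    avg = mean(mean(attempts) for attempts in scores.values())
    best_at_k = mean(max(attempts) for attempts in scores.values())
    return avg, best_at_k


# Made-up scores for two problems, four attempts each.
toy_scores = {
    "P1": [1.0, 0.5, 0.0, 1.0],
    "P2": [0.0, 0.0, 0.5, 0.0],
}
avg, best_at_4 = avg_and_best_at_k(toy_scores)
print(f"Avg. = {avg:.3f}, Best@4 = {best_at_4:.3f}")
```

Under these assumptions Best@4 can never fall below Avg., which is consistent with the reported gap between 65.4% and 75.3%.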

HuggingFace Daily Papers (trending community papers)

Read the original · arxiv.org
