# MathNet: A Global Multimodal Benchmark for Mathematical Reasoning and Retrieval

- Source: HuggingFace Daily Papers (trending community papers)
- Published: 2026-04-20 08:00
- AIHOT link: https://aihot.virxact.com/items/cmo84qcqu03weslmllevdbgza
- Original link: https://arxiv.org/abs/2604.18584

## AI Summary

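This paper introduces MathNet, a large-scale multilingual Olympiad mathematics benchmark containing 30,676 expert-authored problems in 17 languages from 47 countries, spanning two decades of competitions. The benchmark supports three tasks: problem solving, math-aware retrieval, and retrieval-augmented problem solving. Experiments show that even state-of-the-art reasoning models (Gemini-3.1-Pro at 78.4%, GPT-5 at 69.3%) remain challenged, while embedding models perform poorly at retrieving mathematically equivalent problems. The study also reports that DeepSeek-V3.2-Speciale gains up to 12% from retrieval augmentation, reaching the highest score on the benchmark.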

## Main Text

Mathematical problem solving remains a challenging test of reasoning for large language and multimodal models, yet existing benchmarks are limited in size, language coverage, and task diversity. We introduce MathNet, a high-quality, large-scale, multimodal, and multilingual dataset of Olympiad-level math problems together with a benchmark for evaluating mathematical reasoning in generative models and mathematical retrieval in embedding-based systems. MathNet spans 47 countries, 17 languages, and two decades of competitions, comprising 30,676 expert-authored problems with solutions across diverse domains. In addition to the core dataset, we construct a retrieval benchmark consisting of mathematically equivalent and structurally similar problem pairs curated by human experts. MathNet supports three tasks: (i) Problem Solving, (ii) Math-Aware Retrieval, and (iii) Retrieval-Augmented Problem Solving. Experimental results show that even state-of-the-art reasoning models (78.4% for Gemini-3.1-Pro and 69.3% for GPT-5) remain challenged, while embedding models struggle to retrieve equivalent problems. We further show that retrieval-augmented generation performance is highly sensitive to retrieval quality; for example, DeepSeek-V3.2-Speciale achieves gains of up to 12%, obtaining the highest scores on the benchmark. MathNet provides the largest high-quality Olympiad dataset together with the first benchmark for evaluating mathematical problem retrieval, and we publicly release both the dataset and benchmark at https://mathnet.mit.edu.
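To make the Math-Aware Retrieval task more concrete, here is a minimal sketch of how retrieving an equivalent problem with an embedding model could be scored. It is illustrative only: the abstract does not state which metric MathNet uses, so Recall@k is an assumption, and the three-dimensional vectors, the problem ids, and the helper names (`cosine`, `top_k`, `recall_at_k`) are all hypothetical stand-ins for real embeddings and data.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k(query_vec, corpus, k):
    """Return the ids of the k corpus problems most similar to the query."""
    ranked = sorted(
        corpus,
        key=lambda pid: cosine(query_vec, corpus[pid]),
        reverse=True,
    )
    return ranked[:k]


def recall_at_k(queries, corpus, gold, k):
    """Fraction of queries whose gold equivalent problem is in the top k."""
    hits = 0
    for qid, qvec in queries.items():
        if gold[qid] in top_k(qvec, corpus, k):
            hits += 1
    return hits / len(queries)


# Toy embeddings (hypothetical): each corpus problem sits on its own axis.
corpus = {
    "p1": [1.0, 0.0, 0.0],
    "p2": [0.0, 1.0, 0.0],
    "p3": [0.0, 0.0, 1.0],
}
queries = {
    "q1": [0.9, 0.1, 0.0],  # nearest neighbour is p1, which is also its gold match
    "q2": [0.6, 0.5, 0.0],  # nearest neighbour is p1, but its gold match is p2
}
gold = {"q1": "p1", "q2": "p2"}

print(recall_at_k(queries, corpus, gold, k=1))  # 0.5
print(recall_at_k(queries, corpus, gold, k=2))  # 1.0
```

The second query shows the failure mode the abstract describes: a problem that is close in embedding space is ranked above the truly equivalent one, so the equivalent problem is only recovered once `k` is widened. In a retrieval-augmented setup, whichever problems `top_k` returns would be passed to the solver as context, which is one way to see why solving accuracy depends on retrieval quality.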
