K-BrowseComp: A Web-Browsing Agent Benchmark Grounded in Korean Contexts

2026-06-01 08:00 · 32 days ago

AI Summary

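K-BrowseComp is a web-browsing agent benchmark grounded in Korean contexts. It contains 400 problems, 300 of which form a manually constructed and validated subset. On that subset, frontier LLMs such as GPT-5.5, DeepSeek-V4-Pro, and GLM-5.1 reach only 30.00%–45.67% accuracy, while Korean domestic LLMs score only 0.00%–10.33%. On a separately constructed 100-problem synthetic adversarial split, the strongest model scores only 26.00%. The dataset and code are publicly released.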

Original abstract · untranslated

Frontier model evaluations are shifting from foundational capabilities (e.g., instruction following and reasoning) toward compositional, agentic ones, but Korean agentic benchmarks remain scarce. We introduce K-BrowseComp, a web-browsing agent benchmark grounded in Korean contexts, consisting of 400 problems. The 300-problem K-BrowseComp-Verified subset is manually constructed and validated by native Korean speakers. On this subset, frontier LLMs, including GPT-5.5, DeepSeek-V4-Pro, and GLM-5.1, reach only 30.00–45.67%, a substantial drop from BrowseComp, while Korean LLMs released through Korea's Proprietary AI Foundation Model program obtain only 0.00–10.33%. We further construct a 100-problem synthetic split using hard few-shot exemplars and failure-mode-targeted generation to exploit the asymmetry between solving and creating web browsing problems. On the adversarially filtered synthetic diagnostic split, the strongest model reaches only 26.00%, and we report this split separately as a targeted stress test. We publicly release our data and code.
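
To make the reported percentages concrete, the Python sketch below converts an accuracy figure into the number of correctly answered problems it would imply. It assumes accuracy is simply correct answers divided by split size on a single run, which the abstract does not state; the function names `implied_correct` and `accuracy_pct` are illustrative and not taken from the released code.

```python
def implied_correct(accuracy_percent: float, n_problems: int) -> int:
    """Number of correct answers implied by an accuracy percentage.

    Assumption: accuracy = correct / n_problems on one evaluation run.
    """
    return round(accuracy_percent / 100 * n_problems)


def accuracy_pct(correct: int, n_problems: int) -> float:
    """Accuracy as a percentage, rounded to two decimals as in the abstract."""
    return round(100 * correct / n_problems, 2)


# Split sizes stated in the abstract.
VERIFIED_SIZE = 300   # K-BrowseComp-Verified
SYNTHETIC_SIZE = 100  # synthetic diagnostic split

# Upper end of the frontier-model range on the verified subset.
best_verified = implied_correct(45.67, VERIFIED_SIZE)
# Strongest model on the synthetic split.
best_synthetic = implied_correct(26.00, SYNTHETIC_SIZE)

print(best_verified, accuracy_pct(best_verified, VERIFIED_SIZE))
print(best_synthetic, accuracy_pct(best_synthetic, SYNTHETIC_SIZE))
```

Under that assumption, the two-decimal figures are consistent with whole-number counts out of 300 and 100; if the authors average over several runs, the implied counts would not apply directly.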

HuggingFace Daily Papers (community trending papers)

Read the original · arxiv.org
