# K-BrowseComp: A Web-Browsing Agent Benchmark Grounded in Korean Contexts

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-06-01 08:00
- AIHOT score: 69
- AIHOT link: https://aihot.virxact.com/items/cmpw3ayof03zcslukh1xajebb
- Original paper: https://arxiv.org/abs/2606.02404

## AI Summary

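K-BrowseComp is a web-browsing agent benchmark grounded in Korean contexts. It contains 400 problems, 300 of which form a manually constructed and validated subset. On this subset, frontier large language models such as GPT-5.5, DeepSeek-V4-Pro, and GLM-5.1 reach only 30.00%–45.67% accuracy, while Korean domestic LLMs score just 0.00%–10.33%. On a separately constructed 100-problem synthetic adversarial test set, the strongest model scores only 26.00%. The dataset and code are publicly available.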

## Main Text

Frontier model evaluations are shifting from foundational capabilities (e.g., instruction following and reasoning) toward compositional, agentic ones, but Korean agentic benchmarks remain scarce. We introduce K-BrowseComp, a web-browsing agent benchmark grounded in Korean contexts, consisting of 400 problems. The 300-problem K-BrowseComp-Verified subset is manually constructed and validated by native Korean speakers. On this subset, frontier LLMs, including GPT-5.5, DeepSeek-V4-Pro, and GLM-5.1, reach only 30.00–45.67%, a substantial drop from BrowseComp, while Korean LLMs released through Korea's Proprietary AI Foundation Model program obtain only 0.00–10.33%. We further construct a 100-problem synthetic split using hard few-shot exemplars and failure-mode-targeted generation to exploit the asymmetry between solving and creating web browsing problems. On the adversarially filtered synthetic diagnostic split, the strongest model reaches only 26.00%, and we report this split separately as a targeted stress test. We publicly release our data and code.
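
To make the reported percentages concrete, the sketch below converts them into implied counts of correctly answered problems. It assumes accuracy is a plain ratio of correct answers to split size, which the abstract does not state explicitly. The function names and the `reported` dictionary are illustrative, not taken from the released code, and the entries are keyed by range endpoint because the abstract does not say which model obtained which score.

```python
# Assumption: accuracy = correct / total, evaluated once per problem.
# Split sizes come from the abstract; all names here are hypothetical.

VERIFIED_SIZE = 300   # K-BrowseComp-Verified
SYNTHETIC_SIZE = 100  # adversarially filtered synthetic split


def implied_correct(accuracy_pct: float, n_problems: int) -> int:
    """Whole number of correct answers closest to the reported accuracy."""
    return round(accuracy_pct / 100 * n_problems)


def accuracy_pct(correct: int, n_problems: int) -> float:
    """Accuracy in percent, rounded to two decimals as in the abstract."""
    return round(100 * correct / n_problems, 2)


# Range endpoints quoted in the abstract: (accuracy in percent, split size).
reported = {
    "frontier_low": (30.00, VERIFIED_SIZE),
    "frontier_high": (45.67, VERIFIED_SIZE),
    "korean_low": (0.00, VERIFIED_SIZE),
    "korean_high": (10.33, VERIFIED_SIZE),
    "synthetic_best": (26.00, SYNTHETIC_SIZE),
}

counts = {name: implied_correct(acc, n) for name, (acc, n) in reported.items()}

if __name__ == "__main__":
    for name, (acc, n) in reported.items():
        print(f"{name}: {acc:.2f}% of {n} is about {counts[name]} problems")
```

Under that assumption, the upper frontier figure corresponds to 137 of 300 problems and the upper Korean-LLM figure to 31 of 300, a gap of more than 100 problems on the same subset.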
