Ko-WideSearch:韩语广度搜索基准
阅读原文· arxiv.org现有网页智能体基准主要测深度搜索,缺乏广度枚举能力评估。Ko-WideSearch 是韩语广度搜索基准,通过自动化合成-验证流程构建。任务要求从集合父实体(如电视剧季、王朝)中完整列举成员并填充属性表,采用 Item-F1、Column-F1、Row-F1 评分。基准含 228 张表格,覆盖 190 个实体、16 个类别,设三个难度层级,通过表宽和二维复合键控制成员覆盖率。对 20 个智能体的测试显示,智能体能恢复集合但无法填充行(Item-F1 92.8,Row-F1 53.7),难度提升准确率下降,增加搜索或花费无法缩小差距;难点在找到正确值而非格式化,自由文本单元格失败率最高。
Web-agent benchmarks overwhelmingly measure depth -- pinning one obscure answer behind a chain of constraints -- while breadth, exhaustively enumerating a closed set and filling each item's attributes, is barely evaluated, especially outside English. Breadth is also hard to build: certifying that a gold set is complete and every cell correct is far costlier than checking a single answer. I introduce Ko-WideSearch, a Korean breadth-search benchmark built by an automated synthesize-and-verify pipeline. Each task names a set-parent entity -- a TV season, a dynasty, a league, an administrative region, an election -- and asks for its full membership plus a per-item attribute table, graded by Item-, Column-, and Row-F1. It spans 228 tables over 190 entities and sixteen categories across three difficulty tiers, set by two structural knobs I dial independently -- table width and a 2-D composite key -- so cross-product membership climbs from 0\% to 100\% across the tiers. A single normalization-aware comparator is shared between gold construction and grading, so stable date and count columns are not over-dropped on formatting alone. Across twenty web agents, the failure is consistent: agents recover the set but not the rows (e.g.\ Item-F1 92.8 against Row-F1 53.7), accuracy falls steadily as the knobs harden, and neither more search nor more spend closes the gap. Broken down by cell, the hard part is finding the right value, not formatting it: open-ended free-text cells fail most, while cells with a standard answer such as a date or a name usually come out right.