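The paper proposes the HLL benchmark, which tests the ability of AI agents to solve 10 types of CAPTCHA tasks. The tasks require an agent to view the page, click or drag correctly, track state changes, and submit the answer, while also finding the interactive elements on a cluttered page, understanding the instructions, recovering from errors, and leaving a consistent action trail. The experiments show that even the strongest current agents perform well on static tasks but still fail when the page is cluttered, the task difficulty increases, or the system verifies that their actions were valid.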
Today's AI agents still struggle to pass real human-verification checks (CAPTCHAs) on websites.
The paper proposes HLL, a benchmark where agents must solve 10 types of CAPTCHA tasks by seeing the page, clicking or dragging correctly, tracking state, and submitting the answer.
A useful agent must find the right box on a messy page, understand the instruction, click or drag in the right place, track what changed, recover from mistakes, and leave an interaction trail that looks consistent with the task.
The paper shows that even strong agents can look smart on static tasks, then fail when the page is cluttered, the task is harder, or the system checks whether their actions were actually valid.
----
Link - arxiv.org/abs/2606.02449