# WebCompass: A Multimodal Web Coding Evaluation Benchmark for Code Language Models

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-04-20 08:00
- AIHOT link: https://aihot.virxact.com/items/cmo84qcqu03wcslmlv9nfydlj
- Original link: https://arxiv.org/abs/2604.18224

## AI Summary

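The research team has released WebCompass, a benchmark that, for the first time, evaluates the multimodal web development capabilities of code language models across the full lifecycle. The benchmark covers three input modalities (text, image, video) and defines seven tasks across three task types (generation, editing, repair), spanning 15 generation domains, 16 editing operations, and 11 defect types at three difficulty levels. Evaluation combines LLM-as-a-Judge with Agent-as-a-Judge (automated testing in a real browser via MCP). Results show that closed-source models hold a substantial lead in overall capability; aesthetics is the biggest bottleneck for open-source models; and Vue is the most difficult framework, while React and Vanilla/HTML perform more consistently.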

## Main Text

Large language models are rapidly evolving into interactive coding agents capable of end-to-end web coding, yet existing benchmarks evaluate only narrow slices of this capability, typically text-conditioned generation with static-correctness metrics, leaving visual fidelity, interaction quality, and codebase-level reasoning largely unmeasured. We introduce WebCompass, a multimodal benchmark that provides unified lifecycle evaluation of web engineering capability. Recognizing that real-world web coding is an iterative cycle of generation, editing, and repair, WebCompass spans three input modalities (text, image, video) and three task types (generation, editing, repair), yielding seven task categories that mirror professional workflows. Through a multi-stage, human-in-the-loop pipeline, we curate instances covering 15 generation domains, 16 editing operation types, and 11 repair defect types, each annotated at Easy/Medium/Hard levels. For evaluation, we adopt a checklist-guided LLM-as-a-Judge protocol for editing and repair, and propose a novel Agent-as-a-Judge paradigm for generation that autonomously executes generated websites in a real browser, explores interactive behaviors via the Model Context Protocol (MCP), and iteratively synthesizes targeted test cases, closely approximating human acceptance testing. We evaluate representative closed-source and open-source models and observe that: (1) closed-source models remain substantially stronger and more balanced; (2) editing and repair exhibit distinct difficulty profiles, with repair preserving interactivity better but remaining execution-challenging; (3) aesthetics is the most persistent bottleneck, especially for open-source models; and (4) framework choice materially affects outcomes, with Vue consistently challenging while React and Vanilla/HTML perform more strongly depending on task type.
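To make the taxonomy concrete, the following is a minimal sketch of how one benchmark instance and a checklist-based score might be represented. It is illustrative only: the `Instance` record, its field names, and the `checklist_score` helper are hypothetical and do not come from the paper, and scoring as the fraction of satisfied checklist items is an assumed aggregation rather than the authors' documented metric.

```python
from dataclasses import dataclass
from enum import Enum


class Modality(Enum):
    TEXT = "text"
    IMAGE = "image"
    VIDEO = "video"


class TaskType(Enum):
    GENERATION = "generation"
    EDITING = "editing"
    REPAIR = "repair"


class Difficulty(Enum):
    EASY = "Easy"
    MEDIUM = "Medium"
    HARD = "Hard"


@dataclass(frozen=True)
class Instance:
    """Hypothetical record for one benchmark instance (not the paper's schema)."""
    instance_id: str
    modality: Modality
    task_type: TaskType
    difficulty: Difficulty
    category: str  # a generation domain, editing operation, or defect type


def checklist_score(verdicts: list[bool]) -> float:
    """Fraction of checklist items a judge marked as satisfied (assumed aggregation)."""
    if not verdicts:
        raise ValueError("checklist must contain at least one item")
    return sum(verdicts) / len(verdicts)


# Usage: an invented editing instance whose judge passed 3 of 4 checklist items.
example = Instance(
    instance_id="demo-001",
    modality=Modality.IMAGE,
    task_type=TaskType.EDITING,
    difficulty=Difficulty.MEDIUM,
    category="hypothetical-operation",
)
score = checklist_score([True, True, False, True])
```

The abstract reports seven task categories but does not say which modality and task-type pairs they comprise, so the sketch keeps the two axes as independent enums instead of enumerating specific combinations.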
