# WebRISE: A Requirement-Induced State Evaluation Benchmark for MLLM-Generated Web Artifacts

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-06-02 14:29
- AIHOT score: 64
- AIHOT link: https://aihot.virxact.com/items/cmpyy97hn04h7sli3wkrsix2d
- Original link: https://arxiv.org/abs/2606.03220

## AI Summary

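WebRISE compiles task requirements into Interaction Contract Graphs (ICGs) covering observable states, user-intent transitions, and DOM/visual assertions, which enables implementation-agnostic evaluation through browser execution. The benchmark comprises 442 tasks across five input modalities (text, Markdown, sketch, image, video), with 5,495 transitions and 5,271 requirement checks that distinguish explicit functions from implicit product constraints. An evaluation of 14 MLLMs shows that the strongest model reaches only 65.6% transition validity and 66.3% requirement coverage, and that visual quality does not reflect behavior (on Markdown, Qwen3.6-35B-A3B scores 80.8 for visual quality but only 15.5 for transitions). Video provides the strongest interaction signal (implicit coverage is 10.6 percentage points higher than with text), and defect injection shows that ICG-based scoring detects state errors at 2-16x the rate of checkpoint-style evaluation.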

## Main Text

Existing benchmarks for MLLM-generated web artifacts assess interaction through local evidence and miss the requirement-induced states and transitions that determine whether a page works. We introduce WebRISE, which compiles task requirements into Interaction Contract Graphs (ICGs) of observable states, user-intent transitions, and DOM/visual assertions for implementation-agnostic browser execution. WebRISE spans 442 tasks across five input modalities (Text, Markdown, Sketch, Image, Video), with 5,495 transitions and 5,271 requirement checks that separate user-stated functions from implicit product-level constraints. Across 14 MLLMs, even the strongest model reaches only 65.6% transition validity and 66.3% requirement coverage, and visual quality is no proxy for behavior (Qwen3.6-35B-A3B on Markdown: V=80.8 yet T=15.5). Video gives the strongest interaction signal (+10.6 pp implicit coverage over Text), while implicit constraints persist; defect injection shows ICG-based scoring detects state errors at 2-16x the rate of checkpoint-style evaluation.
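
To make the idea of an Interaction Contract Graph more concrete, the following is a minimal sketch in Python, not the authors' implementation. It assumes an ICG can be modelled as named states carrying assertions, plus transitions labelled with a user intent, and it assumes transition validity is the fraction of transitions whose target-state assertions hold after the intent is performed. All names (`State`, `Transition`, `ICG`, `transition_validity`, `reach`, `act`) are hypothetical, and a plain dictionary stands in for the DOM/visual observations that a real browser run would produce.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical stand-in for what a browser would expose (DOM/visual observations).
Snapshot = Dict[str, int]
Assertion = Callable[[Snapshot], bool]


@dataclass
class State:
    """An observable state, defined only by assertions on the page snapshot."""
    name: str
    assertions: List[Assertion] = field(default_factory=list)

    def holds(self, snap: Snapshot) -> bool:
        return all(check(snap) for check in self.assertions)


@dataclass
class Transition:
    """A user-intent edge: performing `intent` in `source` should lead to `target`."""
    source: str
    intent: str
    target: str


@dataclass
class ICG:
    states: Dict[str, State]
    transitions: List[Transition]


def transition_validity(
    icg: ICG,
    reach: Callable[[str], Snapshot],
    act: Callable[[Snapshot, str], Snapshot],
) -> float:
    """Fraction of transitions whose target assertions hold after the intent.

    Assumed definition for illustration; the paper may define the metric differently.
    """
    if not icg.transitions:
        return 0.0
    valid = 0
    for t in icg.transitions:
        before = reach(t.source)        # put the page into the source state
        after = act(before, t.intent)   # perform the user intent
        if icg.states[t.target].holds(after):
            valid += 1
    return valid / len(icg.transitions)


# Toy "generated page": a cart whose clear button is deliberately broken.
def reach(state_name: str) -> Snapshot:
    return dict({"empty": {"count": 0}, "non_empty": {"count": 1}}[state_name])


def act(page: Snapshot, intent: str) -> Snapshot:
    snap = dict(page)
    if intent == "add_item":
        snap["count"] += 1
    # "clear" is intentionally a no-op here to simulate a state bug.
    return snap


icg = ICG(
    states={
        "empty": State("empty", [lambda s: s["count"] == 0]),
        "non_empty": State("non_empty", [lambda s: s["count"] > 0]),
    },
    transitions=[
        Transition("empty", "add_item", "non_empty"),
        Transition("non_empty", "clear", "empty"),
    ],
)

score = transition_validity(icg, reach, act)
print(f"transition validity: {score:.2f}")
```

Because the states are described by assertions on what is observable and not by how the page is built, the same graph could in principle score any implementation of the task. In this toy run the broken clear action fails its transition while the add action passes, so half of the transitions are valid. Requirement coverage and the explicit/implicit split described in the abstract are not modelled in this sketch.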