# CoVEBench: Can Video Editing Models Handle Complex Instructions?

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-06-07 08:00
- AIHOT score: 59
- AIHOT link: https://aihot.virxact.com/items/cmq61s9bh04cwsl5i1som74cl
- Original link: https://arxiv.org/abs/2606.08415

## AI Summary

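CoVEBench is a compositional video editing benchmark comprising 416 source videos, 626 multi-point editing instructions, and 9,990 fine-grained checklist items, covering multiple editing dimensions. It uses an MLLM to judge instruction compliance and video fidelity, and combines those judgments with automated metrics for video quality. Experiments show that current models still frequently omit edits, violate preservation constraints, or introduce artifacts when handling multiple operations at once, so compositional editing remains a major challenge.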

## Main Text

While recent text-guided video editing models excel at elementary tasks (e.g., style transfer, object insertion), real-world user requests are highly compositional. A single prompt often demands multiple coupled edits, such as modifying subjects, actions, and camera views, while strictly preserving unrelated spatiotemporal content. Existing benchmarks, heavily constrained by isolated edits and coarse global metrics, fail to diagnose how models handle such complex workflows. To address this gap, we introduce CoVEBench, a compositional video editing benchmark comprising 416 curated source videos, 626 multi-point editing instructions, and 9,990 fine-grained checklist items. Covering diverse editing dimensions, CoVEBench evaluates models via MLLM-judged instruction compliance and video fidelity, alongside automated metrics for video quality. Extensive experiments reveal that compositional editing remains a profound challenge: current models frequently omit edits, violate preservation constraints, or introduce artifacts when handling multiple operations simultaneously. CoVEBench provides a challenging, diagnostic testbed to advance video editing toward realistic user workflows.
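
The reported counts imply roughly 16 checklist items per instruction (9,990 / 626) and about 1.5 instructions per source video (626 / 416). As a rough illustration of how fine-grained checklist verdicts could be rolled up into per-instruction scores, here is a minimal sketch. It assumes each checklist item carries a binary pass/fail verdict from the judge and a label saying whether it checks a requested edit or a preservation constraint. The `ChecklistItem` type, the `"edit"`/`"preserve"` labels, and the `pass_rates` helper are hypothetical names for illustration only; the abstract does not specify CoVEBench's actual data format or scoring formula.

```python
from dataclasses import dataclass

# Averages implied by the counts reported in the abstract.
ITEMS_PER_INSTRUCTION = 9990 / 626   # roughly 16 checklist items per instruction
INSTRUCTIONS_PER_VIDEO = 626 / 416   # roughly 1.5 instructions per source video


@dataclass
class ChecklistItem:
    """One judged checklist item (hypothetical schema, not the benchmark's)."""
    instruction_id: str
    kind: str      # assumed label: "edit" or "preserve"
    passed: bool   # assumed binary verdict from the MLLM judge


def pass_rates(items):
    """Return {instruction_id: {kind: fraction of items passed}}."""
    counts = {}
    for item in items:
        per_kind = counts.setdefault(item.instruction_id, {})
        passed, total = per_kind.get(item.kind, (0, 0))
        per_kind[item.kind] = (passed + int(item.passed), total + 1)
    return {
        iid: {kind: p / t for kind, (p, t) in per_kind.items()}
        for iid, per_kind in counts.items()
    }


# Toy verdicts for a single instruction: one of two requested edits was
# omitted, while the single preservation constraint was respected.
items = [
    ChecklistItem("v001", "edit", True),
    ChecklistItem("v001", "edit", False),
    ChecklistItem("v001", "preserve", True),
]
rates = pass_rates(items)
```

Keeping edit and preservation pass rates separate, rather than collapsing them into one number, would make the failure modes named in the abstract (omitted edits versus violated preservation constraints) visible per instruction.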
