EvalVerse: A Pipeline-Aware and Expert-Calibrated Benchmark for Professional Cinematic Video Generation
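Source: arxiv.org
Generative video models are advancing toward professional cinematic synthesis, but existing evaluation mainly asks "whether it is right" and overlooks the cinematic quality captured by "whether it is good". To address this, the paper proposes EvalVerse, a comprehensive, pipeline-aware, and expert-calibrated evaluation framework. It first aligns the evaluation taxonomy with the professional filmmaking workflow (pre-production, production, post-production); it then distills expert judgments using a large-scale human-annotated dataset; finally, it injects this knowledge into a Vision-Language Model (VLM) through expert-calibrated fine-tuning, so that the model can perform explicit Chain-of-Thought (CoT) reasoning. While remaining compatible with basic "rightness" metrics, the framework significantly extends evaluation to "goodness" and covers complex tasks such as multi-shot sequences and audio-visual integration, providing a foundation for future research such as reward models.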
The rapid evolution of generative video foundation models has propelled the field toward professional-grade cinematic synthesis. To reach this level of quality, the community is transitioning toward Reinforcement Learning (RL) and agentic workflows. However, reliable evaluation has emerged as a critical bottleneck. Existing benchmarks predominantly evaluate "whether it is right" (basic prompt-following) while neglecting "whether it is good" (cinematic quality, acting, and aesthetics). Furthermore, current automated metrics lack the domain-specific rigor required to provide trustworthy signals, which creates a severe credibility gap between human aesthetic perception and machine scoring. To bridge this gap, we introduce EvalVerse, a comprehensive, pipeline-aware, and expert-calibrated evaluation framework. We treat video generation assessment not merely as an engineering task but as a core scientific problem: the systematic digitization of subjective cinematic expertise. First, we organize domain knowledge into an evaluation taxonomy aligned with the professional filmmaking workflow (pre-production, production, and post-production). Second, we distill human expert judgments into a curated dataset with large-scale human annotations. Third, we inject this knowledge into Vision-Language Models (VLMs) through an expert-calibrated fine-tuning strategy, enabling the VLM to perform explicit Chain-of-Thought (CoT) reasoning. Compared with previous work, EvalVerse not only retains compatibility with foundational "rightness" metrics, but also significantly expands the criteria to "goodness" and broadens the task coverage to complex multi-shot sequencing and audio-visual integration. Consequently, by providing granular diagnostic signals, EvalVerse goes beyond a static leaderboard and establishes a foundation for future work, such as reward models and evaluator agents.
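To make the idea of pipeline-aware, granular diagnostic signals more concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes each evaluator judgment is tagged with a filmmaking stage, a criterion, and whether it measures "rightness" or "goodness". The `Judgment` record, the criterion names, the 1-5 score scale, and the `diagnostic_report` helper are all hypothetical and do not come from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean

# Stages named in the abstract; everything else below is an assumption.
STAGES = ("pre-production", "production", "post-production")
KINDS = ("rightness", "goodness")


@dataclass(frozen=True)
class Judgment:
    """One hypothetical evaluator judgment for a generated video."""
    stage: str      # one of STAGES
    criterion: str  # illustrative criterion name, not from the paper
    kind: str       # "rightness" or "goodness"
    score: float    # assumed 1-5 scale
    rationale: str  # free-text reasoning, e.g. a CoT summary


def diagnostic_report(judgments):
    """Average the scores for each (stage, kind) pair that has judgments."""
    buckets = defaultdict(list)
    for j in judgments:
        if j.stage not in STAGES:
            raise ValueError(f"unknown stage: {j.stage!r}")
        if j.kind not in KINDS:
            raise ValueError(f"unknown kind: {j.kind!r}")
        buckets[(j.stage, j.kind)].append(j.score)
    return {key: mean(scores) for key, scores in buckets.items()}


# Made-up example data, used only to show the shape of the report.
EXAMPLE_JUDGMENTS = [
    Judgment("pre-production", "prompt adherence", "rightness", 5.0,
             "All requested story beats appear."),
    Judgment("production", "acting", "goodness", 4.0,
             "Expressions match the scene's mood."),
    Judgment("production", "lighting", "goodness", 2.0,
             "Flat lighting weakens the atmosphere."),
    Judgment("post-production", "shot continuity", "goodness", 3.0,
             "One cut breaks screen direction."),
]

if __name__ == "__main__":
    for (stage, kind), avg in diagnostic_report(EXAMPLE_JUDGMENTS).items():
        print(f"{stage:15s} {kind:10s} {avg:.2f}")
```

Reporting per-stage, per-kind averages rather than a single scalar is one simple way such a framework could expose where a model falls short, for instance when it follows the prompt well but is weak on production-stage aesthetics.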