# OmniScript: Towards Audio-Visual Script Generation for Long-Form Cinematic Video

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-04-13 08:00
- AIHOT link: https://aihot.virxact.com/items/cmo84qcqu03w8slmlhxqnpsut
- Original link: https://arxiv.org/abs/2604.11102

## AI Summary

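The research team introduces OmniScript, an 8-billion-parameter audio-visual language model focused on long-form cinematic video understanding and the newly proposed video-to-script (V2S) task. Trained with chain-of-thought supervised fine-tuning followed by reinforcement learning with temporally segmented rewards, the model generates temporally grounded, hierarchical scripts covering character actions, dialogue, and audio cues. Experiments show that, despite its smaller parameter count, OmniScript outperforms larger open-source models in temporal localization and semantic accuracy and reaches a level comparable to Gemini 3-Pro.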

## Main Text

Current multimodal large language models (MLLMs) have demonstrated remarkable capabilities in short-form video understanding, yet translating long-form cinematic videos into detailed, temporally grounded scripts remains a significant challenge. This paper introduces the novel video-to-script (V2S) task, aiming to generate hierarchical, scene-by-scene scripts encompassing character actions, dialogues, expressions, and audio cues. To facilitate this, we construct a first-of-its-kind human-annotated benchmark and propose a temporally-aware hierarchical evaluation framework. Furthermore, we present OmniScript, an 8B-parameter omni-modal (audio-visual) language model tailored for long-form narrative comprehension. OmniScript is trained via a progressive pipeline that leverages chain-of-thought supervised fine-tuning for plot and character reasoning, followed by reinforcement learning using temporally segmented rewards. Extensive experiments demonstrate that despite its parameter efficiency, OmniScript significantly outperforms larger open-source models and achieves performance comparable to state-of-the-art proprietary models, including Gemini 3-Pro, in both temporal localization and multi-field semantic accuracy.
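To make the idea of a hierarchical, temporally grounded script more concrete, here is a minimal Python sketch. It is not the paper's schema or reward function: the `Scene` and `Event` classes, their field names, the `segment_len` window, and the IoU-based scoring are all assumptions chosen for illustration. The sketch shows one plausible way a scene-by-scene script could be represented and how a reward could be averaged per time segment rather than over the whole video.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Event:
    """One timed script entry. Field names are illustrative assumptions."""
    start: float    # seconds from the start of the video
    end: float      # seconds from the start of the video
    kind: str       # e.g. "action", "dialogue", "expression", "audio_cue"
    character: str
    text: str


@dataclass
class Scene:
    """A scene groups timed events; this gives the script its hierarchy."""
    start: float
    end: float
    summary: str
    events: list[Event] = field(default_factory=list)


def temporal_iou(a_start: float, a_end: float,
                 b_start: float, b_end: float) -> float:
    """Intersection-over-union of two time intervals."""
    inter = max(0.0, min(a_end, b_end) - max(a_start, b_start))
    union = (a_end - a_start) + (b_end - b_start) - inter
    return inter / union if union > 0 else 0.0


def segmented_reward(predicted: list[Event], reference: list[Event],
                     segment_len: float = 60.0) -> float:
    """Hypothetical segment-level reward (an assumption, not the paper's).

    Each reference event is scored by its best temporal IoU against
    predicted events of the same kind. Scores are averaged inside each
    fixed-length time window, then averaged across windows, so that a
    long video is not dominated by its densest stretch.
    """
    if not reference:
        return 0.0
    buckets: dict[int, list[float]] = {}
    for ref in reference:
        best = max(
            (temporal_iou(ref.start, ref.end, p.start, p.end)
             for p in predicted if p.kind == ref.kind),
            default=0.0,
        )
        buckets.setdefault(int(ref.start // segment_len), []).append(best)
    per_segment = [sum(scores) / len(scores) for scores in buckets.values()]
    return sum(per_segment) / len(per_segment)
```

A real evaluation would also need to compare the text of matched events (the "multi-field semantic accuracy" mentioned above). This sketch covers only the temporal side.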