OmniShow: Unifying Multimodal Conditions for Human-Object Interaction Video Generation

2026-04-13 08:00

AI Summary

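OmniShow is an end-to-end framework for Human-Object Interaction Video Generation (HOIVG) that accepts text, reference images, audio, and pose as conditioning inputs. To balance controllability against generation quality, it introduces Unified Channel-wise Conditioning for image and pose injection and Gated Local-Context Attention for audio-visual synchronization. To address data scarcity, it uses a Decoupled-Then-Joint Training strategy that combines multi-stage training with model merging over heterogeneous sub-task datasets. The authors also establish HOIVG-Bench, a dedicated benchmark for the task. Experiments show that OmniShow achieves overall state-of-the-art performance across a range of multimodal conditioning settings.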

Original Abstract (untranslated)

In this work, we study Human-Object Interaction Video Generation (HOIVG), which aims to synthesize high-quality human-object interaction videos conditioned on text, reference images, audio, and pose. This task holds significant practical value for automating content creation in real-world applications, such as e-commerce demonstrations, short video production, and interactive entertainment. However, existing approaches fail to accommodate all these requisite conditions. We present OmniShow, an end-to-end framework tailored for this practical yet challenging task, capable of harmonizing multimodal conditions and delivering industry-grade performance. To overcome the trade-off between controllability and quality, we introduce Unified Channel-wise Conditioning for efficient image and pose injection, and Gated Local-Context Attention to ensure precise audio-visual synchronization. To effectively address data scarcity, we develop a Decoupled-Then-Joint Training strategy that leverages a multi-stage training process with model merging to efficiently harness heterogeneous sub-task datasets. Furthermore, to fill the evaluation gap in this field, we establish HOIVG-Bench, a dedicated and comprehensive benchmark for HOIVG. Extensive experiments demonstrate that OmniShow achieves overall state-of-the-art performance across various multimodal conditioning settings, setting a solid standard for the emerging HOIVG task.

HuggingFace Daily Papers (community trending papers)

Read the original: arxiv.org
