CoInteract: Physically-Consistent Human-Object Interaction Video Synthesis via Spatially-Structured Co-Generation
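The CoInteract framework is built on a Diffusion Transformer architecture and generates video conditioned on a person reference image, a product image, text, and speech. It introduces a Human-Aware Mixture-of-Experts module that assigns tokens to region-specific experts through spatially supervised routing, improving the structural stability of hands and faces at minimal parameter cost. It also adopts Spatially-Structured Co-Generation, a dual-stream training paradigm that jointly models an RGB appearance stream and a human-object interaction (HOI) structure stream to inject interaction-geometry priors and avoid hand-object interpenetration. During training, the structure stream regularizes the shared weights; at inference, that branch is removed, so it adds no overhead. Experiments show that the method significantly outperforms existing approaches in structural fidelity, logical consistency, and physical plausibility.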
Synthesizing human-object interaction (HOI) videos has broad practical value in e-commerce, digital advertising, and virtual marketing. However, current diffusion models, despite their photorealistic rendering capability, still frequently fail on (i) the structural stability of sensitive regions such as hands and faces and (ii) physically plausible contact (e.g., avoiding hand-object interpenetration). We present CoInteract, an end-to-end framework for HOI video synthesis conditioned on a person reference image, a product reference image, text prompts, and speech audio. CoInteract introduces two complementary designs embedded into a Diffusion Transformer (DiT) backbone. First, we propose a Human-Aware Mixture-of-Experts (MoE) that routes tokens to lightweight, region-specialized experts via spatially supervised routing, improving fine-grained structural fidelity with minimal parameter overhead. Second, we propose Spatially-Structured Co-Generation, a dual-stream training paradigm that jointly models an RGB appearance stream and an auxiliary HOI structure stream to inject interaction geometry priors. During training, the HOI stream attends to RGB tokens and its supervision regularizes shared backbone weights; at inference, the HOI branch is removed for zero-overhead RGB generation. Experimental results demonstrate that CoInteract significantly outperforms existing methods in structural stability, logical consistency, and interaction realism.
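To make the idea of spatially supervised routing concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each token carries a region label derived from a spatial mask, that the router is trained with a cross-entropy loss against those labels, and that tokens are dispatched to a single expert (top-1). The abstract specifies none of these details; the region names, `routing_loss`, and `dispatch` are hypothetical.

```python
import math

# Hypothetical region set: the abstract only names hands and faces as sensitive regions.
REGIONS = ("background", "hand", "face")


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def routing_loss(router_logits, region_labels):
    """Mean cross-entropy between router outputs and mask-derived region labels.

    Assumption: supervision is a per-token classification loss; the paper may
    use a different objective.
    """
    losses = []
    for logits, label in zip(router_logits, region_labels):
        probs = softmax(logits)
        losses.append(-math.log(probs[label]))
    return sum(losses) / len(losses)


def dispatch(router_logits):
    """Top-1 routing (assumed): send each token to the expert with the largest logit."""
    return [max(range(len(logits)), key=logits.__getitem__) for logits in router_logits]


# Three toy tokens, one per region, with router logits over the three experts.
router_logits = [[2.0, 0.1, -1.0], [0.0, 3.0, 0.5], [-0.5, 0.2, 2.5]]
region_labels = [0, 1, 2]  # indices into REGIONS, e.g. read off a hand/face mask

assignments = dispatch(router_logits)
loss = routing_loss(router_logits, region_labels)
print([REGIONS[i] for i in assignments], round(loss, 4))
```

In this toy setup the router already agrees with the mask labels, so the loss is small; a router that sent a hand token to the background expert would incur a larger penalty, which is how the spatial supervision would push routing toward region-specialized experts.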