# Wan-Streamer v0.1： 端到端实时交互基础模型

- 来源：HuggingFace Daily Papers（社区热门论文）
- 发布时间：2026-06-23 08:00
- AIHOT 分数：68
- AIHOT 链接：https://aihot.virxact.com/items/cmqsvasrj06m2slfuoyvyjt8a
- 原文链接：https://arxiv.org/abs/2606.25041

## AI 摘要

Wan-Streamer v0.1 是原生流式、端到端的交互基础模型，在单一 Transformer 中统一建模语言、音频和视频的输入与输出，序列表示为交错视觉、音频、文本 token，通过块因果注意力实现增量流式。无需外部 VAD、ASR、TTS、视频生成等模块，感知、推理、生成、响应时序等由单一模型联合学习。整套栈围绕流式化重新设计，支持 25 fps 下 160 ms 的流式单元。模型侧响应延迟约 200 ms，结合 350 ms 双向网络延迟后总交互延迟约 550 ms，实现亚秒级全双工音视频通信。

## 正文

We present Wan-Streamer, a native-streaming, end-to-end interactive foundation model designed from the ground up for real-time, low-latency, full-duplex audio-visual interaction. Wan-Streamer seamlessly models language, audio, and video as both input and output within a single Transformer, where the sequence is represented as interleaved visual, audio, and text input tokens together with visual, audio, and text output tokens, coordinated by block-causal attention for incremental streaming. Unlike cascaded interactive systems that rely on separate VAD, ASR, language, TTS, audio-driven animation, or video-generation modules, Wan-Streamer does not rely on external language, speech, avatar, or video-generation modules: perception, reasoning, generation, response timing, turn management, and cross-modal synchronization are learned jointly within one unified model, reducing pipeline latency and error accumulation. To support natural audio-visual responsiveness, we redesign the entire stack around streamability, including causal encoders, causal decoders, block-causal attention, and low-latency multimodal token scheduling, enabling streaming units as short as 160 ms at 25 fps. Wan-Streamer achieves approximately 200 ms model-side response latency and approximately 550 ms total interaction latency when combined with 350 ms bidirectional network latency, supporting sub-second duplex audio-visual communication. These results position Wan-Streamer as a unified, end-to-end, multimodal interactive foundation model for low-latency streaming interaction.
