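Ethan He, former world-model lead at xAI, shared his views on Grok Imagine and the future of video generation on a podcast. He argued that video models get most of their intelligence from LLMs rather than from simply scaling video data, which is why he is moving from video generation to LLMs. In his view, the next frontier for video generation is training video agent models that orchestrate video models. He expects AI video to follow a path similar to coding agents, with today's text-to-video being only the "autocomplete" stage. In the future, world models will become real-time and interactive, and language models may become the control layer for video.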
This pod was an incredible gift to the community:
not only was it our first pod about @xAI, but Ethan really indulged all our questions on how to train a SOTA Videogen world model, including specific areas (consistent extending/editing, voice) where Grok @Imagine is *still* SOTA.
On top of the factual overviews, he ALSO came loaded with opinions/predictions:
- why he's quitting Videogen for LLMs: video models get most of their intelligence from LLMs, not from scaling video data
- why the next frontier for videogen also happens to be video agent models - agentic models trained to orchestrate video models
- why deterministic compression (like MP4) is a useless target compared with VAE compression