(1/N) We're launching Dreamverse. Most AI video models take minutes to generate a 5 s 1080p clip. In 4.5 seconds, we can generate 30 s 1080p clips on a single GPU. Our videos generate faster than you can watch them: stop waiting on prompts and start directing scenes live. 🕹️Demo: http://dreamverse.fastvideo.org 📑 Blog: https://haoailab.com/blogs/dreamverse Welcome to the era of vibe-directing 👇

Hao AI Lab @haoailab · Mar 18

http://x.com/i/article/2034009793598464000

# Into the DreamVerse

TL;DR: Our new real-time inference stack in FastVideo enables Dreamverse, a prototype for a new interface where users can vibe direct their own “multiverse” of videos.

AI video generation is already good enough to make a convincing clip. But real creative work is not about getting a clip in one shot. It’s about iteration. An idea appears, you test it: keep the subject, change the camera angle, continue the scene, and try again. The problem is that ideas move faster than generations. If every attempt takes minutes, the creative loop breaks; your imagination moves on before the video does.

We think there is a better interface for AI video generation, which is why we created Dreamverse, an interface that enables a new workflow called vibe directing.

Vibe directing is to video what vibe coding is to software. Instead of rewriting giant prompts from scratch, you talk to the system in natural language and steer the video through fast revision. Keep the subject, change the background, slow the camera, or anything else! Rather than jamming everything into a single prompt, iterate with multiple simple prompts.

This kind of workflow is only possible when video generation is done in real-time. Current video generation models like Sora take 1-2 minutes to generate a 5s 1080p clip. We can do it in ~4.55 seconds on a single GPU. In other words, our inference stack in FastVideo can generate a clip faster than you can watch it. This capability completely changes the feel of video generation inference; it stops feeling like a passive experience and starts feeling like directing your own scenes.

This allows us to create a longer 30-second scene that unfolds as a chain of these 5-second clips, while keeping a chat window open so you can keep directing in real time.

This matters because serious video creation is almost never perfect on the first try. A shot may look off. Motion may break halfway through. Characters may drift between frames. In addition, creators may have multiple versions of a scene and want to play them out to determine which is better. In practice, creators are constantly making small adjustments and trying again. When revisions are slow, it’s much more difficult to explore many ideas. However, when the next result comes back almost immediately, it becomes possible to quickly try many ideas rather than just one. Better creative work comes from a faster feedback loop, not just a better model.

We think this is where video generation is going: a way to direct the video as it unfolds. The best systems will not just generate impressive clips. They will let people explore ideas at the speed of their imagination. That is what vibe directing is all about. Step into the Dreamverse today with our demo.

## The Team

Core contributors: Will Lin*, Matthew Noto*, Junda Su*, Yechen Xu*, Peiyuan Zhang* (* equal contribution)
Contributors: Shao Duan, Minshen Zhang, Loay Rashid, Kevin Lin
UI: Tina Mai
Tech leads: Will Lin, Hao Zhang
Advisors: Hao Zhang (corresponding), Danyang Zhuo, Eric Xing, Zhengzhong Liu

## Learn More

- FastVideo Documentation
- FastVideo Roadmap for 26Q1

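Summary: The FastVideo team has released Dreamverse, a prototype interface that introduces a "vibe directing" workflow. It lets users steer video generation iteratively and in real time through natural language, for example by changing the background or adjusting camera movement, without writing long, complex prompts. At its core is a new real-time inference stack that generates a 5-second 1080p video in about 4.55 seconds on a single GPU, faster than the clip takes to watch, which turns generation from passive waiting into live directing. The team argues that the future of video generation lies in letting creation keep pace with imagination, and that a fast feedback loop does more for good work than model quality alone.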
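To make the "faster than you can watch it" claim concrete, here is a rough back-of-the-envelope sketch. It assumes clips are generated strictly one after another on one GPU and played in order. The function names are made up for illustration, and the 90 s figure is simply the midpoint of the "1-2 minutes" quoted in the article, not a measurement.

```python
def real_time_factor(clip_seconds: float, gen_seconds: float) -> float:
    """Seconds of video produced per second of compute (>1 means faster than playback)."""
    return clip_seconds / gen_seconds


def playback_schedule(num_clips: int, clip_seconds: float, gen_seconds: float):
    """Simulate back-to-back generation while clips are played in order.

    Clip i can start playing once it is ready and clip i-1 has finished.
    Returns (start_times, total_stall), where total_stall is the time the
    viewer spends waiting between clips after the first one has started.
    """
    starts = []
    stall = 0.0
    prev_end = None
    for i in range(num_clips):
        ready = (i + 1) * gen_seconds  # assumption: sequential generation
        if prev_end is None:
            start = ready
        else:
            start = max(ready, prev_end)
            stall += start - prev_end
        starts.append(start)
        prev_end = start + clip_seconds
    return starts, stall


# Six 5-second clips make up the 30-second scene described in the article.
fast_starts, fast_stall = playback_schedule(6, 5.0, 4.55)
slow_starts, slow_stall = playback_schedule(6, 5.0, 90.0)
print(round(real_time_factor(5.0, 4.55), 3), fast_stall, slow_stall)
```

Under these assumptions the viewer waits about 4.55 s for the first clip and never stalls afterwards, while at 90 s per clip the same scene accumulates several minutes of waiting between clips.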

Hao AI Lab @haoailab · Mar 14

(1/N) Content creators have been stuck with costly and slow video generation APIs for far too long. We couldn’t take it anymore.😅😭 FastVideo’s new real-time inference stack has the fastest 1080p TI2AV pipeline ever.😍🚀🚀 Our optimized LTX-2.3 pipeline creates 5-second 1080p videos with audio in 4.55 s, on a single GPU! 3.9x faster than the next fastest option. 🕹️Live demo: https://1080p.fastvideo.org/ 📜Blog: https://haoailab.com/blogs/fastvideo_realtime_1080p/

Saining Xie @sainingxie · Feb 27

world modeling is never about rendering pixels. rendering is local. world state is global. as soon as more than one agent exists, the only thing that truly matters is the shared representation beneath individual views. that shared representation is what scales into collective capability. this is why I'm super excited to share project Solaris -- our new work focused on building a multiplayer video world model in minecraft. This release includes three main pieces. 1⃣Solaris Engine, a fully featured multiplayer data collection system with built in visuals. the team put a huge amount of work into this since nothing like it really exists yet. https://github.com/solaris-wm/solaris-engine 2⃣Solaris Model, a multiplayer DiT with a new memory efficient self forcing design, trained on 12.6M frames of coordinated Minecraft gameplay. https://github.com/solaris-wm/solaris 3⃣Solaris Eval, which uses a VLM as a judge to evaluate different multiplayer capabilities. read the full technical breakdown by @ojmichel4, and start building with Solaris. https://solaris-wm.github.io/

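Summary: Project Solaris argues that world modeling is fundamentally about globally shared state rather than local pixel rendering, and introduces a multiplayer video world model built in Minecraft. The system moves beyond a single-agent viewpoint, supports any number of agents joining the interaction at any time, and evolves a persistent world state. It has three core components: Solaris Engine, a multiplayer data collection system; Solaris Model, a DiT-based model with a new memory-efficient self-forcing design, trained on 12.6M frames of coordinated gameplay; and Solaris Eval, an evaluation suite that uses a VLM as judge. This shift lays the groundwork for building neural MMORPG servers.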
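A tiny toy may help picture "rendering is local, world state is global." The sketch below is not Solaris code. `World` and `render_view` are hypothetical names, and the "world" is just a set of grid cells. It only shows two agents rendering different local windows of one shared state.

```python
from dataclasses import dataclass, field


@dataclass
class World:
    """Global state shared by every agent: a set of occupied (x, y) cells."""
    blocks: set = field(default_factory=set)

    def place(self, x: int, y: int) -> None:
        self.blocks.add((x, y))


def render_view(world: World, cx: int, cy: int, radius: int = 1) -> list:
    """Local rendering: a (2r+1) x (2r+1) text window centred on one agent."""
    rows = []
    for y in range(cy - radius, cy + radius + 1):
        row = "".join(
            "#" if (x, y) in world.blocks else "."
            for x in range(cx - radius, cx + radius + 1)
        )
        rows.append(row)
    return rows


world = World()
alice, bob = (0, 0), (2, 0)
world.place(1, 0)  # one agent places a block between the two of them
view_a = render_view(world, *alice)  # block shows up on Alice's right edge
view_b = render_view(world, *bob)    # the same block is on Bob's left edge
print(view_a, view_b)
```

The two views differ, but both are derived from the same `World`, so a change made by one agent is visible to the other without any view-to-view synchronization.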

Saining Xie @sainingxie · Jan 30

if you are building video diffusion / world simulators, try this new sampler. temporal consistency pins videos to a low-dimensional manifold in the total pixel space. self-refinement sampling keeps them there.

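[Quoting @jangsangwon7]: What if your video generator could refine itself at inference time? ❌ No new model. ❌ No retraining. ❌ No external verifier. 💡 Introducing self-refining video sampling. By reinterpreting pretrained generators (Wan2.2, Cosmos) as denoising autoencoders, we enable iterative self-refinement at inference time ➡️ markedly better physical realism and over 70% human preference! 🧵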
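The posts above do not spell out the sampler, so the following is only a toy picture of the general idea of perturbing a sample slightly and passing it back through a denoiser. Everything here is an assumption for illustration: the "video" is a list of scalar frames, the "manifold" is the set of temporally constant sequences, and `toy_denoiser` is a hand-written stand-in, not Wan2.2, Cosmos, or the authors' method.

```python
import math
import random


def temporal_mean(frames):
    return sum(frames) / len(frames)


def off_manifold_distance(frames):
    """Distance from the toy 'temporally consistent' set where all frames are equal."""
    m = temporal_mean(frames)
    return math.sqrt(sum((f - m) ** 2 for f in frames))


def toy_denoiser(frames, strength=0.5):
    """Hypothetical stand-in for a pretrained generator used as a denoiser:
    pulls each frame part of the way toward the temporal mean."""
    m = temporal_mean(frames)
    return [m + (1.0 - strength) * (f - m) for f in frames]


def self_refine(frames, steps=10, sigma=0.05, seed=0):
    """Repeat: add a little Gaussian noise, then denoise."""
    rng = random.Random(seed)
    x = list(frames)
    for _ in range(steps):
        noisy = [f + sigma * rng.gauss(0.0, 1.0) for f in x]
        x = toy_denoiser(noisy)
    return x


# A flickering 16-frame "video": brightness alternates between 1.5 and 0.5.
video = [1.0 + 0.5 * (-1) ** t for t in range(16)]
refined = self_refine(video)
print(off_manifold_distance(video), off_manifold_distance(refined))
```

In this toy the flicker shrinks toward the consistent set and then hovers at a small noise floor set by `sigma`. How the real sampler behaves on actual video models is a question for the quoted thread.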

OpenAI @OpenAI · Oct 4

The Quack: Part 1 by Sora 2.

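Summary: Part 1 of "The Quack," a video created with Sora 2, has been posted with a link to watch it. Only the first part has been released so far; the AI-generated video can be viewed through the link in the original post.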

OpenAI @OpenAI · Oct 2

Bloopers by Sora 2.

Summary: OpenAI posted bloopers generated by Sora 2, with sound, showing amusing failure moments in AI video generation.

Hao AI Lab @haoailab · Sep 22

🚀 Thrilled to share that our lab has THREE papers accepted at #NeurIPS2025 on AI efficiency from reasoning to video generation. Come hang out with us, it's going to be a lot of fun this year here local to UCSD! 😎 📊 Efficiently Scaling LLM Reasoning with Certaindex Introduces Certaindex, an algorithm-agnostic metric measuring evolving stability that signals when further computation won't change results, plus Dynasor serving system achieving up to 50% compute savings and 3.3x higher efficiency 📎 https://arxiv.org/abs/2412.20993 @FuYichao123 @Junda_Chen_ ⚡ Scaling Speculative Decoding with Lookahead Reasoning Exploits step-level parallelism to overcome token-level speculative decoding limitations, boosting speedup from 1.4x to 2.1x on GSM8K 📎 https://arxiv.org/abs/2506.19830 @FuYichao123 🎥 VSA: Faster Video Diffusion with Trainable Sparse Attention is a hardware-efficient sparse attention for video DiTs that cuts training FLOPS by 2.53× with zero loss in diffusion quality 📎 https://arxiv.org/abs/2505.13389 @PY_Z001 @BrianChen112900 Congrats to all collaborators! 🎉
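For intuition about stopping once "further computation won't change results," here is a deliberately simple stability check. It is not Certaindex or Dynasor. The function name, the `patience` rule, and the sample answers are all invented for illustration.

```python
from collections import Counter


def early_exit(samples, patience=3):
    """Consume candidate answers one at a time and stop once the majority
    answer has stayed the same for `patience` consecutive samples.

    Returns (answer, samples_used).
    """
    counts = Counter()
    leader, streak = None, 0
    used = 0
    for used, ans in enumerate(samples, start=1):
        counts[ans] += 1
        top = counts.most_common(1)[0][0]
        if top == leader:
            streak += 1
        else:
            leader, streak = top, 1
        if streak >= patience:
            break  # the leading answer looks stable; skip the remaining samples
    return leader, used


# Eight hypothetical sampled answers to the same question.
answers = ["42", "42", "41", "42", "42", "42", "40", "42"]
print(early_exit(answers, patience=4))
```

Here the loop stops after four of the eight samples because the leading answer never changed. A real serving system would weigh this kind of signal against accuracy, which this toy does not attempt.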

Google DeepMind @GoogleDeepMind · Sep 9

Three big AI updates for developers: 1️⃣ Veo 3 and Veo 3 Fast are now ready for scaled production use in the Gemini API 2️⃣ Make 16:9 videos in 1080p HD for even higher quality 3️⃣ Start generating 9:16 vertical clips Find out more → https://goo.gle/4niwJOZ

Summary: Veo 3 and Veo 3 Fast are now available for scaled production use in the Gemini API, with new support for 16:9 1080p HD video and 9:16 vertical video.

Google DeepMind @GoogleDeepMind · Aug 14

On the road to AGI, @DemisHassabis believes breakthroughs like Genie 3, which can generate playable worlds, could help us better understand reality itself. 🌐↓

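Summary: Demis Hassabis says that, on the road to AGI, breakthroughs like Genie 3, which can generate playable worlds, will help us understand the nature of reality more deeply.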

Hao AI Lab @haoailab · Aug 6

Nice -- try FastWan at @FAL !

Jim Fan @DrJimFan · Aug 5

World modeling for robotics is incredibly hard because (1) control of humanoid robots & 5-finger hands is wayyy harder than ⬆️⬅️⬇️➡️ in games (Genie 3); and (2) object interaction is much more diverse than FSD, which needs to *avoid* coming into contact. Our GR00T Dreams work was a first attempt at building high-fidelity world simulator for humanoid robots. It's not only for evaluation but also for large-scale synthetic data generation. Time to move away from the "fossil fuel" of robotics (human teleoperation) and embrace clean energy (nuclear "diffusion")! GR00T Dreams kind of flew under the radar, so bringing it back to life on a cheerful day ;)

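Summary: NVIDIA released the DreamGen engine (GR00T Dreams), which treats video generation models in the vein of Sora and Veo as neural physics engines and generates large-scale synthetic training data for robots through a four-step recipe: fine-tune the video model, simulate parallel worlds, recover pseudo-actions, and train a foundation model. Starting from a single pick-and-place task, a humanoid robot learns 22 new behaviors such as pouring and folding, and generalizes zero-shot to new verbs and unseen environments (43% and 28% success, respectively). Compared with a traditional graphics engine, the approach handles complex interactions such as deformable objects and fluids at constant compute cost. The team plans to fully open-source it within weeks.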

Jim Fan @DrJimFan · Aug 5

This is game engine 2.0. Some day, all the complexity of UE5 will be absorbed by a data-driven blob of attention weights. Those weights take as input game controller commands and directly animate a spacetime chunk of pixels. Agrim and I were close friends and coauthors back at Stanford Vision Lab. So great to see him at the frontier of such cool research! Congrats!

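Summary: Google DeepMind released the world model Genie 3, which generates interactive worlds from text and supports real-time interaction at 720p and 24 fps with consistency over several minutes. The author sees this as "game engine 2.0": the layered complexity of engines like UE5 will eventually be replaced by data-driven attention weights that generate spacetime chunks of pixels directly from controller input.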

Hao AI Lab @haoailab · Aug 5

Try FastWan at https://fastwan.fastvideo.org/!

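Summary: The FastVideo team introduced the FastWan family of fast video generation models. The models use a new training recipe called "sparse distillation" that speeds up video denoising by 70x. On a single H200 GPU, a 5-second video takes just 5 seconds to generate. The team provides a live demo and has fully open-sourced the models, code, and data under the Apache-2.0 license.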

Hao AI Lab @haoailab · Aug 5

(1/n) 🚀 With FastVideo, you can now generate a 5-second video in 5 seconds on a single H200 GPU! Introducing FastWan series, a family of fast video generation models trained via a new recipe we term as “sparse distillation”, to speed up video denoising time by 70X! 🖥️ Live demo: https://fastwan.fastvideo.org/ (Thanks to @gmicloud for the support!) 🔗 Blog: https://hao-ai-lab.github.io/blogs/fastvideo_post_training/ 🔓 We fully open-source our models, code, and data with Apache-2.0 licenses

Jim Fan @DrJimFan · May 20

What if robots could dream inside a video generative model? Introducing DreamGen, a new engine that scales up robot learning not with fleets of human operators, but with digital dreams in pixels. DreamGen produces massive volumes of neural trajectories - photorealistic robot videos paired with motor action labels - and unlocks strong generalization to new nouns, verbs, and environments. Whether you’re a humanoid (GR1), an industrial arm (Franka), or a cute little robot (HuggingFace SO-100), DreamGen enables you to dream.

Video generation models like Sora & Veo are neural physics engines. By compressing billions of internet videos, they learn a multiverse of plausible futures, i.e. superpositions of how the world could unfold from any initial image frame. DreamGen taps into this power with a simple 4-step recipe:

1. Fine-tune a SOTA video model on your target robot;
2. Prompt the model with diverse language prompts to simulate parallel worlds: how your robot would have acted in new scenarios. Filter out the bad dreams (ha!) that don’t follow instructions;
3. Recover pseudo-actions using inverse dynamics or latent action models;
4. Train robot foundation models on the massively augmented dataset of neural trajectories.

That’s it. Just more data, and plain old supervised learning. Simple, right?

What’s remarkable is how far this goes. Starting with just a single-task dataset of pick-and-place, our humanoid robot learns 22 new behaviors, such as pouring, folding, scooping, ironing, and hammering, despite never seeing those verbs before. Better yet, we can take the robot out of the lab and drop it into the NVIDIA HQ Cafe, and let DreamGen work its magic. We show true zero-to-one generalization: from 0% success to over 43% for novel verbs, and 0 -> 28% in unseen environments.

Compared to a traditional graphics engine, DreamGen doesn’t care if the scene involves deformable objects, fluids, translucent materials, contact-rich interactions, or crazy lighting. Good luck engineering those by hand. For DreamGen, every world is just a forward pass through a diffusion neural net. No matter how complex the dream is, it takes constant compute time to roll out.

Read our blog and paper today! We plan to fully open-source the entire pipeline in the next few weeks. Links in thread:

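Summary: DreamGen lets robots "dream" inside a video generation model to synthesize training data. By fine-tuning a state-of-the-art video model to generate massive volumes of neural trajectories (photorealistic video plus action labels), the robot generalizes from a single pick-and-place task to 22 new behaviors such as pouring and folding. In tests at the NVIDIA HQ cafe, the humanoid's success rate on novel verbs rose from 0% to 43%, and reached 28% in unseen environments. Unlike a traditional graphics engine, it handles complex scenes such as fluids and deformable objects without hand modeling. The full pipeline is to be open-sourced soon.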
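The middle of the four-step recipe (generate dreams, filter bad ones, recover pseudo-actions) can be pictured with a toy like the one below. It is not DreamGen code. A "video" here is just a list of 1-D gripper positions, all function names are hypothetical, and the fine-tuning and policy-training steps (1 and 4) are left out entirely.

```python
def dream_rollout(start, target, steps=5, drift=0.0):
    """Stand-in for step 2: a 'dreamed video' of positions moving from start
    toward target. A nonzero drift makes the dream miss its instruction."""
    return [
        start + (target - start) * i / steps + drift * i / steps
        for i in range(steps + 1)
    ]


def follows_instruction(frames, target, tol=0.1):
    """Filter for step 2: keep dreams whose last frame reaches the target."""
    return abs(frames[-1] - target) <= tol


def inverse_dynamics(frames):
    """Step 3 (toy version): pseudo-actions are frame-to-frame displacements."""
    return [b - a for a, b in zip(frames, frames[1:])]


def build_dataset(prompts):
    """Collect (observation, pseudo-action) pairs that step 4 would train on."""
    dataset = []
    for start, target, drift in prompts:
        frames = dream_rollout(start, target, drift=drift)
        if not follows_instruction(frames, target):
            continue  # a "bad dream": drop it
        dataset.extend(zip(frames[:-1], inverse_dynamics(frames)))
    return dataset


# Hypothetical prompts: (start, target, drift). The third one drifts off target.
prompts = [(0.0, 1.0, 0.0), (0.0, 2.0, 0.0), (0.0, 1.0, 0.5)]
dataset = build_dataset(prompts)
print(len(dataset), dataset[:2])
```

Two of the three dreams survive the filter, giving ten labeled pairs. In the recipe described above, the labels would come from a learned inverse dynamics or latent action model rather than a simple difference.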