When to create music with Lyria 3 vs. Lyria 3 Pro in Gemini: Lyria 3 is your playground for fun, spontaneous tracks & quick sharing. If you're looking for more musical flow & customization, Lyria 3 Pro (available to Google AI Plus, Pro & Ultra users) can help elevate your work.

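Summary: Lyria 3 suits fun, spontaneous tracks and quick sharing, while Lyria 3 Pro offers more musical flow and customization. The latter is available only to Google AI Plus, Pro, and Ultra subscribers and targets creators who need more advanced features.

Google Gemini @GeminiApp · Mar 26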

Longer tracks are here with Lyria 3 Pro in Gemini! From experimenting with different styles to generating tracks with complex transitions, Lyria 3 Pro makes it easier to bring your full vision to life. Rolling out today to Google AI Plus, Pro, and Ultra users. Learn more 🧵

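Summary: Lyria 3 Pro is now available in Gemini, supporting longer tracks and complex style transitions. It is rolling out starting today to Google AI Plus, Pro, and Ultra subscribers.

Google DeepMind @GoogleDeepMind · Mar 26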

You can now create longer tracks with Lyria 3 Pro. 🎶 Map out intros, verses, choruses, and bridges to build high-fidelity compositions up to 3 minutes long. 🎹

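Summary: The upgraded Lyria 3 Pro can generate high-fidelity tracks up to 3 minutes long and supports arranging full song structures such as intros, verses, choruses, and bridges, enabling longer-form music creation.

Demis Hassabis @demishassabis · Mar 25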

Excited to partner with Agile Robots! Looking forward to seeing our models being deployed through Agile Robots' incredible platform to help solve some of the most complex industrial challenges.

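Summary: Google DeepMind announced a research partnership with Agile Robots to integrate Gemini foundation models into Agile Robots' hardware platform, deploy them in industrial settings to tackle complex challenges, and build a more practical next generation of robots.

Saining Xie @sainingxie · Mar 24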

best read paired with the LeWorldModel paper. don’t ask me why 🙂

Summary: Best read alongside the LeWorldModel paper. Don't ask me why 🙂

Google Gemini @GeminiApp · Mar 20

Loving these creations. Try it out and share yours in the replies 👇

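Summary: A shared Nano Banana prompt generates a 2×2 grid of 3D typographic sculptures that render four important historical years and their signature inventions in a retro-tech or steampunk style. The prompt includes detailed parameters for anchor definition, form construction, material physics, and lighting and rendering, and can be copied and used as-is. Readers are invited to try it and post their results in the replies.

Artificial Analysis @ArtificialAnlys · Mar 20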

Mistral has released Mistral Small 4, an open weights model with hybrid reasoning and image input, scoring 27 on the Artificial Analysis Intelligence Index
@MistralAI's Small 4 is a 119B mixture-of-experts model with 6.5B active parameters per token, supporting both reasoning and non-reasoning modes. In reasoning mode, Mistral Small 4 scores 27 on the Artificial Analysis Intelligence Index, a 12-point improvement from Small 3.2 (15) and now among the most intelligent models Mistral has released, surpassing Mistral Large 3 (23) and matching the proprietary Magistral Medium 1.2 (27). However, it lags open weights peers with similar total parameter counts such as gpt-oss-120B (high, 33), NVIDIA Nemotron 3 Super 120B A12B (Reasoning, 36), and Qwen3.5 122B A10B (Reasoning, 42).
Key takeaways:
➤ Reasoning and non-reasoning modes in a single model: Mistral Small 4 supports configurable hybrid reasoning with reasoning and non-reasoning modes, rather than the separate reasoning variants Mistral has released previously with their Magistral models. In reasoning mode, the model scores 27 on the Artificial Analysis Intelligence Index. In non-reasoning mode, the model scores 19, a 4-point improvement from its predecessor Mistral Small 3.2 (15)
➤ More token efficient than peers of similar size: At ~52M output tokens, Mistral Small 4 (Reasoning) uses fewer tokens to run the Artificial Analysis Intelligence Index compared to reasoning models such as gpt-oss-120B (high, ~78M), NVIDIA Nemotron 3 Super 120B A12B (Reasoning, ~110M), and Qwen3.5 122B A10B (Reasoning, ~91M). In non-reasoning mode, the model uses ~4M output tokens
➤ Native support for image input: Mistral Small 4 is a multimodal model, accepting image input as well as text. On our multimodal evaluation, MMMU-Pro, Mistral Small 4 (Reasoning) scores 57%, ahead of Mistral Large 3 (56%) but behind Qwen3.5 122B A10B (Reasoning, 75%). Neither gpt-oss-120B nor NVIDIA Nemotron 3 Super 120B A12B support image input. All models support text output only
➤ Improvement in real-world agentic tasks: Mistral Small 4 scores an Elo of 871 on GDPval-AA, our evaluation based on OpenAI's GDPval dataset that tests models on real-world tasks across 44 occupations and 9 major industries, with models producing deliverables such as documents, spreadsheets, and diagrams in an agentic loop. This is more than double the Elo of Small 3.2 (339) and close to Mistral Large 3 (880), but behind gpt-oss-120B (high, 962), NVIDIA Nemotron 3 Super 120B A12B (Reasoning, 1021), and Qwen3.5 122B A10B (Reasoning, 1130)
➤ Lower hallucination rate than peer models of similar size: Mistral Small 4 scores -30 on AA-Omniscience, our evaluation of knowledge reliability and hallucination, where scores range from -100 to 100 (higher is better) and a negative score indicates more incorrect than correct answers. Mistral Small 4 scores ahead of gpt-oss-120B (high, -50), Qwen3.5 122B A10B (Reasoning, -40), and NVIDIA Nemotron 3 Super 120B A12B (Reasoning, -42)
Key model details:
➤ Context window: 256K tokens (up from 128K on Small 3.2)
➤ Pricing: $0.15/$0.6 per 1M input/output tokens
➤ Availability: Mistral first-party API only. At native FP8 precision, Mistral Small 4's 119B parameters require ~119GB to self-host the weights (more than the 80GB of HBM3 memory on a single NVIDIA H100)
➤ Modality: Image and text input with text output only
➤ Licensing: Apache 2.0 license

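Summary: Mistral released Mistral Small 4, an open-weights model with a 119B-parameter MoE architecture (6.5B active parameters per token) that supports switchable reasoning and non-reasoning modes as well as image input. In reasoning mode it scores 27 on the Artificial Analysis Intelligence Index, surpassing Mistral Large 3 but trailing peers such as gpt-oss-120B. The model is more token-efficient than comparable models, has a lower hallucination rate (-30 on AA-Omniscience), supports a 256K context window, and is released under the Apache 2.0 license.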
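The self-hosting figure quoted in the Artificial Analysis post is simple arithmetic: at FP8 each parameter occupies one byte, so 119B parameters need roughly 119 GB for the weights alone, which exceeds a single 80 GB H100. A minimal sketch of that calculation follows, assuming 1 GB = 10^9 bytes and ignoring KV cache, activations, and runtime overhead. The function names are hypothetical and exist only for illustration.

```python
import math


def weight_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    """Memory for the weights alone, in GB (assumes 1 GB = 1e9 bytes)."""
    return params_billion * bytes_per_param


def min_gpus_for_weights(weights_gb: float, gpu_memory_gb: float) -> int:
    """Lower bound on GPU count: only the weights are counted, nothing else."""
    return math.ceil(weights_gb / gpu_memory_gb)


# Figures taken from the post: 119B parameters, FP8 (1 byte each), 80 GB H100.
fp8_weights_gb = weight_memory_gb(119, 1)
gpus_needed = min_gpus_for_weights(fp8_weights_gb, 80)

print(f"FP8 weights: ~{fp8_weights_gb} GB, at least {gpus_needed} x 80 GB GPUs")
```

This is only a lower bound; a real deployment would also need room for the KV cache and activations.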

Demis Hassabis @demishassabis · Mar 19

You can vibe design some incredible interfaces with @stitchbygoogle

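Summary: Google launched Stitch, a vibe design platform that generates high-fidelity interfaces and interactive prototypes directly from natural-language descriptions and lets users adjust layouts in real time by voice. It is currently limited to users aged 18 and over in English-speaking countries where Gemini is supported.

Satya Nadella @satyanadella · Mar 15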

We’ve trained a multimodal AI model to turn routine pathology slides into spatial proteomics, with the potential to reduce time and cost while expanding access to cancer care.

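Summary: A newly trained multimodal AI model converts routine pathology slides into spatial proteomics data, with the potential to shorten testing time and lower costs while improving access to cancer care.

Sundar Pichai @sundarpichai · Mar 13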

😎

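Summary: A demo video of an Android XR prototype shown at MWC shows Gemini smoothly handling vague, complex queries and the glasses working seamlessly with phone apps. Details were posted to Reddit.

Sundar Pichai @sundarpichai · Mar 13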

We trained a new flood forecasting model designed to predict flash floods in urban areas up to 24 hours in advance. To help address a flash floods data gap, we created Groundsource: a new AI methodology using Gemini to identify 2.6M+ historical events across 150+ countries. We’re open-sourcing this dataset to advance global research, and urban flash flood forecasts are live now in Flood Hub to help communities stay safe.

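Summary: A new urban flash flood forecasting model provides warnings up to 24 hours in advance. The Groundsource dataset, built using Gemini to identify 2.6M+ historical flash flood events across 150+ countries, has been open-sourced. Forecasts are now live in Flood Hub.

Claude @claudeai · Mar 12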

Claude can now build interactive charts and diagrams, directly in the chat. Available today in beta on all plans, including free. Try it out: http://claude.ai

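Summary: Claude adds interactive chart and diagram generation, creating visualizations directly in the chat. The feature is available in beta starting today on all plans, including free.

Saining Xie @sainingxie · Mar 5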

another scientific exploration from @TongPetersb, @DavidJFan, and @__JohnNguyen__ that might teach you something new, even if you’re in a frontier lab
lots of interesting observations here, but I’ll highlight just one:
- it’s kind of an open industry secret that trying to scale DiTs with MoE has mostly been fruitless.
- the unexpected, yet intuitive, synergy between RAE and MoE might actually change that.

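Summary: Another scientific exploration from @TongPetersb, @DavidJFan, and @__JohnNguyen__ that might teach you something new, even if you work at a frontier lab. There are many interesting observations, but one stands out: attempts to scale DiTs with MoE have mostly been fruitless, which is something of an open secret in the industry, yet the unexpected but intuitive synergy between RAE and MoE might actually change that. [Quoting @TongPetersb]: Training beyond language. We are betting on the visual world as the critical next step alongside and beyond language modeling. So we studied building foundation models from scratch with vision. We share our exploration: visual representations, data, world modeling, architecture, and scaling behavior! [1/9]

Saining Xie @sainingxie · Feb 7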

self-driving <as a 2D robot with a low-dim action space that focused mostly on avoidance rather than interaction> will reach real-world impact faster than anything else. the really cool part is that the world model isn’t just about videos; it’s about modeling continuous, high-dimension, and noisy signals of all kinds. that’s what "multimodal" actually means. congrats to @maxjiang93, xander, bo, and the whole waymo team 👏

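Summary: The tweet argues that self-driving, viewed as a 2D robot with a low-dimensional action space focused on avoidance, will reach real-world impact faster than anything else. The core of the Waymo world model goes beyond video generation to modeling continuous, high-dimensional, multimodal, noisy signals. The model is built on Google DeepMind's Genie 3 and can create large-scale, hyper-realistic driving simulations. By simulating extremely rare scenarios, such as tornadoes or a plane landing on a highway, the Waymo Driver can train for them before encountering them in the real world, which significantly improves the system's ability to handle complex situations and speeds the safe deployment and maturation of self-driving technology.

Jim Fan @DrJimFan · Feb 4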

http://x.com/i/article/2018744045779238912
# The Second Pre-training Paradigm
Next word prediction was the first pre-training paradigm. Now we are living through the second paradigm shift: world modeling, or “next physical state prediction”. Very few understand how far-reaching this shift is, because unfortunately, the most hyped use case of world models right now is AI video slop (and coming up, game slop). I bet with full confidence that 2026 will mark the first year that Large World Models lay real foundations for robotics, and for multimodal AI more broadly.
In this context, I define world modeling as predicting the next plausible world state (or a longer duration of states) conditioned on an action. Video generative models are one instantiation of it, where “next states” is a sequence of RGB frames (mostly 8-10 seconds, up to a few minutes) and “action” is a textual description of what to do. Training involves modeling the future changes in billions of hours of video pixels. At the core, video WMs are learnable physics simulators and rendering engines. They capture the counterfactuals, a fancier word for reasoning about how the future would have unfolded differently given an alternative action. WMs fundamentally put vision first.
VLMs, in contrast, are fundamentally language-first. From the earliest prototypes (e.g. LLaVA, Liu et al. 2023), the story has mostly been the same: vision enters at the encoder, then gets routed into a language backbone. Over time, encoders improve, architectures get cleaner, vision tries to grow more “native” (as in omni models). Yet it remains a second-class citizen, dwarfed by the muscles the field has spent years building for LLMs. This path is convenient. We know LLMs scale. Our architectural instincts, data recipe design, and benchmark guidance (VQAs) are all highly optimized for language.
For physical AI, 2025 was dominated by VLAs: graft a robot motor action decoder on top of a pre-trained VLM checkpoint. It’s really “LVAs”: language > vision > action, in decreasing order of citizenship. Again, this path is convenient, because we are fluent in VLM recipes. Yet most parameters in VLMs are allocated to knowledge (e.g. “this blob of pixels is a Coca Cola brand”), not to physics (“if you tip the coke bottle, it spreads into a brown puddle, stains the white tablecloth, and ruins the electric motor”). VLAs are quite good in knowledge retrieval by design, but head-heavy in the wrong places. The multi-stage grafting design also runs counter to my taste for simplicity and elegance.
Biologically, vision dominates our cortical computation. Roughly a third of our cortex is devoted to processing pixels over occipital, temporal, and parietal regions. In contrast, language relies on a relatively compact area. Vision is by far the highest-bandwidth channel linking our brain, our motors, and the physical world. It closes the “sensorimotor loop” — the most important loop to solve for robotics, and requires zero language in the middle.
Nature gives us an existential proof of a highly dexterous physical intelligence with minimal language capability. The ape. I’ve seen apes drive golf carts and change brake pads with screwdrivers like human mechanics. Their language understanding is no more than BERT or GPT-1, yet their physical skills are far beyond anything our SOTA robots can do. Apes may not have good LMs, but they surely have a robust mental picture of "what if"s: how the physical world works and reacts to their intervention.
The era of world modeling is here. It is bitter lesson-pilled. As Jitendra likes to remind us, the scaling addicts, “Supervision is the opium of the AI researcher.” The whole of YouTube and the rise of smart glasses will capture raw visual streams of our world at a scale far beyond all the texts we ever train on.
We shall see a new type of pretraining: next world states could include more than RGBs - 3D spatial motions, proprioception, and tactile sensing are just getting started.
We shall see a new type of reasoning: chain of thought in visual space rather than language space. You can solve a physical puzzle by simulating geometry and contact, imagining how pieces move and collide, without ever translating into strings. Language is a bottleneck, a scaffold, not a foundation.
We shall face a new Pandora’s box of open questions: even with perfect future simulation, how should motor actions be decoded? Is pixel reconstruction really the best objective, or shall we go into alternative latent spaces? How much robot data do we need, and is scaling teleoperation still the answer? And after all these exercises, are we finally inching towards the GPT-3 moment for robotics?
Ilya is right after all. AGI has not converged. We are back to the age of research, and nothing is more thrilling than challenging first principles.

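Summary: The author argues that AI pre-training is undergoing a fundamental paradigm shift from "next word prediction" to "world modeling." At its core, a world model predicts the sequence of next physical states given an action; it is essentially a learnable physics simulator and puts vision first. Today's mainstream vision-language models, by contrast, are language-first, with vision as a secondary input. In biological intelligence, visual processing dominates cortical computation and is the high-bandwidth channel linking the brain, motor action, and the physical world. Using apes as an example, the author argues that strong physical intelligence can exist independently of advanced language. He predicts that in 2026 Large World Models will lay real foundations for robotics and multimodal AI, and that the massive visual data on platforms such as YouTube will far exceed the scale of text and drive this new paradigm.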
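The article's working definition, predicting the next world state conditioned on an action, and its notion of counterfactuals can be made concrete with a toy sketch. Everything below is an illustrative assumption and not from the article: the `State` class, the hand-written `point_mass_step` dynamics standing in for a learned model, and the `rollout` helper are all hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class State:
    position: float
    velocity: float


# In this sketch a "world model" is any function (state, action) -> next state.
WorldModel = Callable[[State, float], State]


def point_mass_step(state: State, action: float, dt: float = 1.0) -> State:
    """Hand-written stand-in for a learned model; the action is an acceleration."""
    velocity = state.velocity + action * dt
    position = state.position + velocity * dt
    return State(position, velocity)


def rollout(model: WorldModel, start: State, actions: Sequence[float]) -> List[State]:
    """Predict a sequence of states by feeding each prediction back into the model."""
    states = [start]
    for action in actions:
        states.append(model(states[-1], action))
    return states


# Counterfactual comparison: same starting state, two alternative action plans.
start = State(position=0.0, velocity=0.0)
pushed = rollout(point_mass_step, start, [1.0, 1.0, 1.0])
braked = rollout(point_mass_step, start, [1.0, 0.0, -1.0])

print(pushed[-1])  # State(position=6.0, velocity=3.0)
print(braked[-1])  # State(position=2.0, velocity=0.0)
```

A video world model follows the same pattern, except that the state is a sequence of frames, the action is a text description, and the step function is learned from data instead of written by hand.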

Saining Xie @sainingxie · Jan 24

love this teaser lol (and it is real)
academia boxed us in sooo tightly that we nearly broke, but we clawed our way out and found a whole new universe on the other side😅
thank you to Google for supporting the gpu-poor rebels and pulling us into this ride, helping us build what I believe is one of the best tpu/gcp infrastructure teams outside of google

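Summary: Love this teaser lol (and it is real). Academia boxed us in so tightly that we nearly broke, but we clawed our way out and found a whole new universe on the other side 😅 Thank you to Google for supporting us GPU-poor rebels, pulling us into this ride, and helping us build what I believe is one of the best TPU/GCP infrastructure teams outside of Google. [Quoting @TongPetersb]: We have been training on TPUs in academia for two years (huge thanks to Google TRC!). Work like Cambrian-1, Cambrian-S, RAE, and Scale-RAE would not have been possible without TPUs. We wrote a blog post sharing our experience, optimizations, and lessons: https://cambrian-mllm.github.io/blog/tpu-training-experiments.html We hope this helps more people use TPUs more smoothly. They are very powerful!

Jim Fan @DrJimFan · Dec 2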

Going to NeurIPS in San Diego! Available for coffee starting tomorrow afternoon. We are recruiting heavily for talents across robotics, VLM, world models, and software infra! DM me or email (on my very outdated home page).

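Summary: Heading to NeurIPS in San Diego! Available for coffee starting tomorrow afternoon. We are recruiting heavily for talent across robotics, VLM, world models, and software infrastructure! DM me or email (listed on my very outdated home page).

Saining Xie @sainingxie · Nov 27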

after V*, many projects tried to get MLLMs to 'think with images', but a regular 2d image limits you to mostly basic tools like zooming or cropping. to expand the action space, we need something more embodied. that is where H* from @YimingLi9702 and his team comes in. It takes a panoramic image as the environment. instead of staring at one image, the model can look around and think in 360. it is basically giving the model a neck! with that freedom, it can choose from many more actions and think inside real spaces like nyc train stations or shopping malls!

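Summary: The H* project moves past the limits of MLLMs that process a single 2D image by using a panoramic image as the environment, letting the model actively look around and reason in a real 360-degree space. Compared with the local visual tools of projects such as V*, H* takes an "embodied" approach that gives the model neck-like freedom of viewpoint, significantly expanding the action space and supporting visual search and spatial reasoning in complex scenes such as subway stations and shopping malls, marking a paradigm shift from passive reception to active exploration.

Google DeepMind @GoogleDeepMind · Oct 9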

We’re proud to announce that Genie 3 has been named one of @TIME’s Best Inventions of 2025. Genie 3 is our groundbreaking world model capable of generating interactive, playable environments from text or image prompts. Find out more → https://goo.gle/3KGqiYa

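Summary: Genie 3 was named one of TIME's Best Inventions of 2025. The world model generates interactive, playable environments from text or image prompts and supports real-time interactive exploration.

Google DeepMind @GoogleDeepMind · Oct 2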

How can AI enhance the creative process of a world-renowned industrial designer? 🎨 We partnered with the visionary @RossLovegroveX and @modem_works to build a tool using Gemini and our image generation technology to translate his signature aesthetic into a new concept. 🪑

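Summary: Google partnered with industrial designer Ross Lovegrove and modem_works to build a tool using Gemini and image generation technology that translates his signature aesthetic into a new furniture design concept.

Saining Xie @sainingxie · Aug 11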

this isn’t just a modeling problem. it’s also a benchmarking problem. spurious correlations are always a pain, but in multimodal llms they become a particularly tough battle. On one hand, you want to leverage the language prior to enable better generalization; on the other, that same language prior can turn into a shortcut that makes the model effectively blind. the irony is that humans do the same thing. We still gravitate toward language-first tasks, and the “multimodal results” in major model releases like gpt-5 reflect exactly that bias. I mean, economically this makes most sense for LLM companies: you can claim wins in “multimodal reasoning” without investing heavily in real multimodal research. that shortcut will come due tho. when you try to put these systems into glasses, robots, or anything else that touches the real world, the cracks will show. and they’ll be costly.

Summary: This is not just a modeling problem. It is also a benchmarking problem.

Saining Xie @sainingxie · Jul 31

TheRightWay™ is my favorite brand now.

Summary: TheRightWay™ is now my favorite brand.

Saining Xie @sainingxie · May 29

Indeed. For text-to-image, @xichen_pan had a great summary supporting this decoupled design philosophy: "Render unto diffusion what is generative, and unto LLMs what is understanding." We've repeatedly observed that diffusion gradients can negatively impact the backbone repr. This effect shows up in simpler settings—for example, we explored this issue to some extent in REPA-E (https://end2end-diffusion.github.io/). I believe the same principle applies to VLA. Fundamentally, the problem seems to be that diffusion gradients care too much about high-frequency details—whether in pixels or action policies—which tends to conflict with representation learning and understanding. btw, @ylecun has always been right about this -- long before any of these empirical findings.

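Summary: Indeed. For text-to-image, @xichen_pan had a great summary supporting this decoupled design philosophy: "Render unto diffusion what is generative, and unto LLMs what is understanding."

DeepSeek @deepseek_ai · Dec 13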

🎉 DeepSeek-VL2 is here! Our next-gen vision-language model enters the MoE era. 🤖 DeepSeek-MoE arch + dynamic image tiling ⚡ 3B/16B/27B sizes for flexible use 🏆 Outstanding performance across all benchmarks 🧵 1/n

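Summary: 🎉 DeepSeek-VL2 is here! Our next-generation vision-language model enters the MoE era. 🤖 DeepSeek-MoE architecture + dynamic image tiling ⚡ 3B/16B/27B sizes for flexible use 🏆 Outstanding performance across all benchmarks 🧵 1/n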