AIHOT
All updates · X · 904 items
Tag: Multimodal
Google Gemini @GeminiApp · March 26

When to create music with Lyria 3 vs. Lyria 3 Pro in Gemini: Lyria 3 is your playground for fun, spontaneous tracks & quick sharing. If you're looking for more musical flow & customization, Lyria 3 Pro (available to Google AI Plus, Pro & Ultra users) can help elevate your work.

Summary: Lyria 3 suits fun, spontaneous tracks and quick sharing, while Lyria 3 Pro offers more musical flow and customization. Lyria 3 Pro is available only to Google AI Plus, Pro, and Ultra subscribers and is aimed at creators who need more advanced features.

Google Gemini @GeminiApp · March 26

Longer tracks are here with Lyria 3 Pro in Gemini! From experimenting with different styles to generating tracks with complex transitions, Lyria 3 Pro makes it easier to bring your full vision to life. Rolling out today to Google AI Plus, Pro, and Ultra users. Learn more 🧵

Summary: Lyria 3 Pro is now available in Gemini, supporting longer tracks and complex style transitions. It rolls out today to Google AI Plus, Pro, and Ultra subscribers.

Google DeepMind @GoogleDeepMind · March 26

You can now create longer tracks with Lyria 3 Pro. 🎶 Map out intros, verses, choruses, and bridges to build high-fidelity compositions up to 3 minutes long. 🎹

Summary: The upgraded Lyria 3 Pro can generate high-fidelity tracks up to 3 minutes long and supports arranging full song structures such as intros, verses, choruses, and bridges.

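Demis Hassabis @demishassabis · March 25

Excited to partner with Agile Robots! Looking forward to seeing our models being deployed through Agile Robots' incredible platform to help solve some of the most complex industrial challenges.

Summary: Google DeepMind announced a research partnership with Agile Robots that will integrate the Gemini foundation models into Agile Robots' hardware platform and deploy them in industrial settings to tackle complex challenges and build a next generation of more useful robots.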

Saining Xie @sainingxie · March 24

best read paired with the LeWorldModel paper. don’t ask me why 🙂

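Google Gemini @GeminiApp · March 20

Loving these creations. Try it out and share yours in the replies 👇

Summary: Shares a Nano Banana prompt that generates a 2×2 grid of 3D typographic sculptures, rendering four important historical years and their signature inventions in a retro-tech or steampunk style. The prompt includes detailed parameters for anchor definition, form construction, material physics, and lighting and rendering, and can be copied and used directly. Readers are invited to try it and share their results in the replies.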

Artificial Analysis @ArtificialAnlys · March 20

Mistral has released Mistral Small 4, an open weights model with hybrid reasoning and image input, scoring 27 on the Artificial Analysis Intelligence Index

@MistralAI's Small 4 is a 119B mixture-of-experts model with 6.5B active parameters per token, supporting both reasoning and non-reasoning modes. In reasoning mode, Mistral Small 4 scores 27 on the Artificial Analysis Intelligence Index, a 12-point improvement from Small 3.2 (15) and now among the most intelligent models Mistral has released, surpassing Mistral Large 3 (23) and matching the proprietary Magistral Medium 1.2 (27). However, it lags open weights peers with similar total parameter counts such as gpt-oss-120B (high, 33), NVIDIA Nemotron 3 Super 120B A12B (Reasoning, 36), and Qwen3.5 122B A10B (Reasoning, 42).

Key takeaways:

➤ Reasoning and non-reasoning modes in a single model: Mistral Small 4 supports configurable hybrid reasoning with reasoning and non-reasoning modes, rather than the separate reasoning variants Mistral has released previously with their Magistral models. In reasoning mode, the model scores 27 on the Artificial Analysis Intelligence Index. In non-reasoning mode, the model scores 19, a 4-point improvement from its predecessor Mistral Small 3.2 (15)
➤ More token efficient than peers of similar size: At ~52M output tokens, Mistral Small 4 (Reasoning) uses fewer tokens to run the Artificial Analysis Intelligence Index compared to reasoning models such as gpt-oss-120B (high, ~78M), NVIDIA Nemotron 3 Super 120B A12B (Reasoning, ~110M), and Qwen3.5 122B A10B (Reasoning, ~91M). In non-reasoning mode, the model uses ~4M output tokens
➤ Native support for image input: Mistral Small 4 is a multimodal model, accepting image input as well as text. On our multimodal evaluation, MMMU-Pro, Mistral Small 4 (Reasoning) scores 57%, ahead of Mistral Large 3 (56%) but behind Qwen3.5 122B A10B (Reasoning, 75%). Neither gpt-oss-120B nor NVIDIA Nemotron 3 Super 120B A12B support image input. All models support text output only
➤ Improvement in real-world agentic tasks: Mistral Small 4 scores an Elo of 871 on GDPval-AA, our evaluation based on OpenAI's GDPval dataset that tests models on real-world tasks across 44 occupations and 9 major industries, with models producing deliverables such as documents, spreadsheets, and diagrams in an agentic loop. This is more than double the Elo of Small 3.2 (339) and close to Mistral Large 3 (880), but behind gpt-oss-120B (high, 962), NVIDIA Nemotron 3 Super 120B A12B (Reasoning, 1021), and Qwen3.5 122B A10B (Reasoning, 1130)
➤ Lower hallucination rate than peer models of similar size: Mistral Small 4 scores -30 on AA-Omniscience, our evaluation of knowledge reliability and hallucination, where scores range from -100 to 100 (higher is better) and a negative score indicates more incorrect than correct answers. Mistral Small 4 scores ahead of gpt-oss-120B (high, -50), Qwen3.5 122B A10B (Reasoning, -40), and NVIDIA Nemotron 3 Super 120B A12B (Reasoning, -42)

Key model details:

➤ Context window: 256K tokens (up from 128K on Small 3.2)
➤ Pricing: $0.15/$0.6 per 1M input/output tokens
➤ Availability: Mistral first-party API only. At native FP8 precision, Mistral Small 4's 119B parameters require ~119GB to self-host the weights (more than the 80GB of HBM3 memory on a single NVIDIA H100)
➤ Modality: Image and text input with text output only
➤ Licensing: Apache 2.0 license

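Summary: Mistral released Mistral Small 4, an open-weights model with a 119B-parameter MoE architecture (6.5B active parameters per token) that supports switchable reasoning and non-reasoning modes plus image input. In reasoning mode it scores 27 on the Artificial Analysis Intelligence Index, ahead of Mistral Large 3 but behind peers such as gpt-oss-120B. The model is more token-efficient than comparable models, has a lower hallucination rate (AA-Omniscience score of -30), supports a 256K context window, and is released under the Apache 2.0 license.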
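
The self-hosting figure in the post above follows from simple arithmetic: at FP8 each parameter takes one byte, so 119B parameters need roughly 119 GB for the weights alone. A minimal Python sketch of that estimate, assuming 1 GB = 10^9 bytes and ignoring KV cache and activation memory (the function names are illustrative and do not come from the post):

```python
def weight_memory_gb(params_billions: float, bytes_per_param: float) -> float:
    """Approximate memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    return params_billions * 1e9 * bytes_per_param / 1e9


def fits_on_one_gpu(params_billions: float, bytes_per_param: float,
                    gpu_memory_gb: float) -> bool:
    """True if the weights alone fit in a single GPU's memory."""
    return weight_memory_gb(params_billions, bytes_per_param) <= gpu_memory_gb


if __name__ == "__main__":
    # Figures from the post: 119B total parameters, FP8 (1 byte per parameter),
    # and 80 GB of memory on a single H100.
    print(weight_memory_gb(119, 1.0))
    print(fits_on_one_gpu(119, 1.0, 80))
```

Real deployments need additional headroom beyond the weights, so this is a lower bound rather than a sizing guide.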

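Demis Hassabis @demishassabis · March 19

You can vibe design some incredible interfaces with @stitchbygoogle

Summary: Google released Stitch, its vibe design platform, which turns natural-language descriptions directly into high-fidelity interfaces and interactive prototypes and lets users adjust layouts in real time by voice. It is currently limited to users aged 18 and over in English-speaking countries where Gemini is supported.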

Satya Nadella @satyanadella · March 15

We’ve trained a multimodal AI model to turn routine pathology slides into spatial proteomics, with the potential to reduce time and cost while expanding access to cancer care.

Summary: A newly trained multimodal AI model turns routine pathology slides into spatial proteomics data, with the potential to cut testing time and cost while expanding access to cancer care.

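Sundar Pichai @sundarpichai · March 13

😎

Summary: A demo video of the Android XR prototype shown at MWC shows Gemini handling vague, complex queries smoothly and the glasses working seamlessly with phone apps. Details have been posted on Reddit.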

Sundar Pichai @sundarpichai · March 13

We trained a new flood forecasting model designed to predict flash floods in urban areas up to 24 hours in advance. To help address a flash floods data gap, we created Groundsource: a new AI methodology using Gemini to identify 2.6M+ historical events across 150+ countries. We’re open-sourcing this dataset to advance global research, and urban flash flood forecasts are live now in Flood Hub to help communities stay safe.

Summary: Introduces an urban flash flood forecasting model that predicts floods up to 24 hours in advance, along with Groundsource, a dataset built by using Gemini to identify more than 2.6 million historical flash flood events across more than 150 countries. The dataset is open-sourced, and the forecasts are now live in Flood Hub.

Claude @claudeai · March 12

Claude can now build interactive charts and diagrams, directly in the chat. Available today in beta on all plans, including free. Try it out: http://claude.ai

Summary: Claude can now generate interactive charts and diagrams directly in the chat. The feature is available today in beta on all plans, including the free plan.

Saining Xie @sainingxie · March 5

another scientific exploration from @TongPetersb, @DavidJFan, and @__JohnNguyen__ that might teach you something new, even if you’re in a frontier lab

lots of interesting observations here, but I’ll highlight just one:
- it’s kind of an open industry secret that trying to scale DiTs with MoE has mostly been fruitless.
- the unexpected, yet intuitive, synergy between RAE and MoE might actually change that.

Quoted post from @TongPetersb (translated): Train beyond language. We bet on the visual world as the critical next step alongside and beyond language modeling. So we studied building foundation models from scratch with vision. We share our exploration: visual representations, data, world modeling, architecture, and scaling behavior! [1/9]

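Saining Xie @sainingxie · February 7 · 49

self-driving <as a 2D robot with a low-dim action space that focused mostly on avoidance rather than interaction> will reach real-world impact faster than anything else. the really cool part is that the world model isn’t just about videos; it’s about modeling continuous, high-dimension, and noisy signals of all kinds. that’s what "multimodal" actually means. congrats to @maxjiang93, xander, bo, and the whole waymo team 👏

Summary: The post argues that self-driving, viewed as a 2D robot with a low-dimensional action space focused mainly on avoidance, will reach real-world impact faster than anything else. The core of the Waymo World Model is not just video generation but the modeling of continuous, high-dimensional, noisy multimodal signals. The model is built on Google DeepMind's Genie 3 and can create large-scale, hyper-realistic driving simulations. By simulating extremely rare scenarios such as tornadoes or a plane landing on a highway, the Waymo Driver can be trained on them before encountering them in the real world, improving its handling of complex situations and speeding the safe deployment and maturation of autonomous driving.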

Jim Fan @DrJimFan · February 4 · 72

http://x.com/i/article/2018744045779238912

# The Second Pre-training Paradigm

Next word prediction was the first pre-training paradigm. Now we are living through the second paradigm shift: world modeling, or “next physical state prediction”. Very few understand how far-reaching this shift is, because unfortunately, the most hyped use case of world models right now is AI video slop (and coming up, game slop). I bet with full confidence that 2026 will mark the first year that Large World Models lay real foundations for robotics, and for multimodal AI more broadly.

In this context, I define world modeling as predicting the next plausible world state (or a longer duration of states) conditioned on an action. Video generative models are one instantiation of it, where “next states” is a sequence of RGB frames (mostly 8-10 seconds, up to a few minutes) and “action” is a textual description of what to do. Training involves modeling the future changes in billions of hours of video pixels. At the core, video WMs are learnable physics simulators and rendering engines. They capture the counterfactuals, a fancier word for reasoning about how the future would have unfolded differently given an alternative action. WMs fundamentally put vision first.

VLMs, in contrast, are fundamentally language-first. From the earliest prototypes (e.g. LLaVA, Liu et al. 2023), the story has mostly been the same: vision enters at the encoder, then gets routed into a language backbone. Over time, encoders improve, architectures get cleaner, vision tries to grow more “native” (as in omni models). Yet it remains a second-class citizen, dwarfed by the muscles the field has spent years building for LLMs. This path is convenient. We know LLMs scale. Our architectural instincts, data recipe design, and benchmark guidance (VQAs) are all highly optimized for language.

For physical AI, 2025 was dominated by VLAs: graft a robot motor action decoder on top of a pre-trained VLM checkpoint. It’s really “LVAs”: language > vision > action, in decreasing order of citizenship. Again, this path is convenient, because we are fluent in VLM recipes. Yet most parameters in VLMs are allocated to knowledge (e.g. “this blob of pixels is a Coca Cola brand”), not to physics (“if you tip the coke bottle, it spreads into a brown puddle, stains the white tablecloth, and ruins the electric motor”). VLAs are quite good in knowledge retrieval by design, but head-heavy in the wrong places. The multi-stage grafting design also runs counter to my taste for simplicity and elegance.

Biologically, vision dominates our cortical computation. Roughly a third of our cortex is devoted to processing pixels over occipital, temporal, and parietal regions. In contrast, language relies on a relatively compact area. Vision is by far the highest-bandwidth channel linking our brain, our motors, and the physical world. It closes the “sensorimotor loop” — the most important loop to solve for robotics, and requires zero language in the middle.

Nature gives us an existential proof of a highly dexterous physical intelligence with minimal language capability. The ape. I’ve seen apes drive golf carts and change brake pads with screwdrivers like human mechanics. Their language understanding is no more than BERT or GPT-1, yet their physical skills are far beyond anything our SOTA robots can do. Apes may not have good LMs, but they surely have a robust mental picture of "what if"s: how the physical world works and reacts to their intervention.

The era of world modeling is here. It is bitter lesson-pilled. As Jitendra likes to remind us, the scaling addicts, “Supervision is the opium of the AI researcher.” The whole of YouTube and the rise of smart glasses will capture raw visual streams of our world at a scale far beyond all the texts we ever train on.

We shall see a new type of pretraining: next world states could include more than RGBs - 3D spatial motions, proprioception, and tactile sensing are just getting started.

We shall see a new type of reasoning: chain of thought in visual space rather than language space. You can solve a physical puzzle by simulating geometry and contact, imagining how pieces move and collide, without ever translating into strings. Language is a bottleneck, a scaffold, not a foundation.

We shall face a new Pandora’s box of open questions: even with perfect future simulation, how should motor actions be decoded? Is pixel reconstruction really the best objective, or shall we go into alternative latent spaces? How much robot data do we need, and is scaling teleoperation still the answer? And after all these exercises, are we finally inching towards the GPT-3 moment for robotics?

Ilya is right after all. AGI has not converged. We are back to the age of research, and nothing is more thrilling than challenging first principles.

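Summary: The author argues that AI pre-training is undergoing a fundamental paradigm shift from next-word prediction to world modeling. A world model predicts the next physical state, or sequence of states, conditioned on an action; it is in essence a learnable physics simulator and puts vision first. Today's mainstream vision-language models, by contrast, are language-first, with vision as a secondary input. In biological intelligence, visual processing dominates cortical computation and is the high-bandwidth channel linking the brain, motor control, and the physical world. The author uses apes as evidence that strong physical intelligence can exist independently of advanced language. He predicts that in 2026 Large World Models will lay real foundations for robotics and multimodal AI, and that visual data from platforms such as YouTube will far exceed text in scale, driving the new paradigm.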

Saining Xie @sainingxie · January 24

love this teaser lol (and it is real) academia boxed us in sooo tightly that we nearly broke, but we clawed our way out and found a whole new universe on the other side😅 thank you to Google for supporting the gpu-poor rebels and pulling us into this ride, helping us build what I believe is one of the best tpu/gcp infrastructure teams outside of google

Quoted post from @TongPetersb (translated): We have been training with TPUs in academia for two years now (huge thanks to Google TRC!). Works like Cambrian-1, Cambrian-S, RAE, and Scale-RAE would not have been possible without TPUs. We wrote a blog post sharing our experience, optimizations, and lessons: https://cambrian-mllm.github.io/blog/tpu-training-experiments.html We hope this helps more people use TPUs more smoothly; they are very powerful!

Jim Fan @DrJimFan · December 2

Going to NeurIPS in San Diego! Available for coffee starting tomorrow afternoon. We are recruiting heavily for talents across robotics, VLM, world models, and software infra! DM me or email (on my very outdated home page).

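Saining Xie @sainingxie · November 27

after V*, many projects tried to get MLLMs to 'think with images', but a regular 2d image limits you to mostly basic tools like zooming or cropping. to expand the action space, we need something more embodied. that is where H* from @YimingLi9702 and his team comes in. It takes a panoramic image as the environment. instead of staring at one image, the model can look around and think in 360. it is basically giving the model a neck! with that freedom, it can choose from many more actions and think inside real spaces like nyc train stations or shopping malls!

Summary: The H* project moves beyond the single 2D image that limits conventional MLLMs by using a panoramic image as the environment, letting the model actively look around and reason in a 360-degree real space. Compared with the localized visual tools of projects such as V*, H* takes an embodied approach that gives the model a neck-like freedom of viewpoint, substantially expanding its action space and supporting visual search and spatial reasoning in complex scenes such as train stations and shopping malls, a shift from passive perception to active exploration.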

Google DeepMind @GoogleDeepMind · October 9

We’re proud to announce that Genie 3 has been named one of @TIME’s Best Inventions of 2025. Genie 3 is our groundbreaking world model capable of generating interactive, playable environments from text or image prompts. Find out more → https://goo.gle/3KGqiYa

Summary: Genie 3 was named one of TIME's Best Inventions of 2025. The world model generates interactive, playable environments from text or image prompts and supports real-time interactive exploration.

Google DeepMind @GoogleDeepMind · October 2

How can AI enhance the creative process of a world-renowned industrial designer? 🎨 We partnered with the visionary @RossLovegroveX and @modem_works to build a tool using Gemini and our image generation technology to translate his signature aesthetic into a new concept. 🪑

Summary: Google partnered with industrial designer Ross Lovegrove and modem_works to build a tool using Gemini and image generation technology that translates his signature aesthetic into a new furniture design concept.

Saining Xie @sainingxie · August 11

this isn’t just a modeling problem. it’s also a benchmarking problem. spurious correlations are always a pain, but in multimodal llms they become a particularly tough battle. On one hand, you want to leverage the language prior to enable better generalization; on the other, that same language prior can turn into a shortcut that makes the model effectively blind. the irony is that humans do the same thing. We still gravitate toward language-first tasks, and the “multimodal results” in major model releases like gpt-5 reflect exactly that bias. I mean, economically this makes most sense for LLM companies: you can claim wins in “multimodal reasoning” without investing heavily in real multimodal research. that shortcut will come due tho. when you try to put these systems into glasses, robots, or anything else that touches the real world, the cracks will show. and they’ll be costly.

Summary: This is not just a modeling problem; it is also a benchmarking problem.

Saining Xie @sainingxie · July 31

TheRightWay™ is my favorite brand now.

Saining Xie @sainingxie · May 29

Indeed. For text-to-image, @xichen_pan had a great summary supporting this decoupled design philosophy: "Render unto diffusion what is generative, and unto LLMs what is understanding." We've repeatedly observed that diffusion gradients can negatively impact the backbone repr. This effect shows up in simpler settings—for example, we explored this issue to some extent in REPA-E (https://end2end-diffusion.github.io/). I believe the same principle applies to VLA. Fundamentally, the problem seems to be that diffusion gradients care too much about high-frequency details—whether in pixels or action policies—which tends to conflict with representation learning and understanding. btw, @ylecun has always been right about this -- long before any of these empirical findings.

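Summary: For text-to-image, @xichen_pan has a great summary supporting the decoupled design philosophy: "Render unto diffusion what is generative, and unto LLMs what is understanding."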

DeepSeek @deepseek_ai · December 13

🎉 DeepSeek-VL2 is here! Our next-gen vision-language model enters the MoE era. 🤖 DeepSeek-MoE arch + dynamic image tiling ⚡ 3B/16B/27B sizes for flexible use 🏆 Outstanding performance across all benchmarks 🧵 1/n

All AI Updates
Full feed of AI-related news
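March 26
01:32
Google Gemini @GeminiApp
Lyria 3 suits fun, spontaneous tracks and quick sharing, while Lyria 3 Pro offers more musical flow and customization. Lyria 3 Pro is available only to Google AI Plus, Pro, and Ultra subscribers and is aimed at creators who need more advanced features.
Tags: Google, Product update, Multimodal
00:02
Google Gemini @GeminiApp
Lyria 3 Pro is now available in Gemini, supporting longer tracks and complex style transitions. It rolls out today to Google AI Plus, Pro, and Ultra subscribers.
Tags: Google, Product update, Multimodal, Voice
00:02
Google DeepMind @GoogleDeepMind
The upgraded Lyria 3 Pro can generate high-fidelity tracks up to 3 minutes long and supports arranging full song structures such as intros, verses, choruses, and bridges.
Tags: DeepMind, Product update, Multimodal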
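March 25
16:46
Demis Hassabis @demishassabis
Google DeepMind announced a research partnership with Agile Robots that will integrate the Gemini foundation models into Agile Robots' hardware platform and deploy them in industrial settings to tackle complex challenges and build a next generation of more useful robots.

Google DeepMind: Google DeepMind 🤝 Agile Robots Our new research partnership will integrate the Gemini foundation models with their hard...

Tags: DeepMind, Google, Embodied AI, Multimodal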
March 24
03:28
Saining Xie @sainingxie
Best read alongside the LeWorldModel paper. Don't ask me why 🙂

Hang Zhao: Our recent findings on World Action Models (WAMs): the core advantage of WAMs is not test-time "imagination" of futures,...

Tags: Embodied AI, Multimodal, Paper/Research
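March 20
22:54
Google Gemini @GeminiApp
Shares a Nano Banana prompt that generates a 2×2 grid of 3D typographic sculptures, rendering four important historical years and their signature inventions in a retro-tech or steampunk style. The prompt includes detailed parameters for anchor definition, form construction, material physics, and lighting and rendering, and can be copied and used directly. Readers are invited to try it and share their results in the replies.

Gadgetify: I asked Nano Banana to draw me 4 important years in history with their inventions. Interesting output Prompt: 2x2 grid, ...

Tags: Google, Image generation, Multimodal, Tutorial/Practice
19:48
Artificial Analysis @ArtificialAnlys
Featured
Mistral releases open-weights model Small 4 with hybrid reasoning and image understanding

Mistral released Mistral Small 4, an open-weights model with a 119B-parameter MoE architecture (6.5B active parameters per token) that supports switchable reasoning and non-reasoning modes plus image input. In reasoning mode it scores 27 on the Artificial Analysis Intelligence Index, ahead of Mistral Large 3 but behind peers such as gpt-oss-120B. The model is more token-efficient than comparable models, has a lower hallucination rate (AA-Omniscience score of -30), supports a 256K context window, and is released under the Apache 2.0 license.

Tags: Multimodal, Open-source ecosystem, Reasoning, Model release

Why it's recommended: Mistral open-sources Small 4 with hybrid reasoning and multimodal support; performance on agent tasks improves substantially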
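March 19
11:12
Demis Hassabis @demishassabis
Featured
Google released Stitch, its vibe design platform, which turns natural-language descriptions directly into high-fidelity interfaces and interactive prototypes and lets users adjust layouts in real time by voice. It is currently limited to users aged 18 and over in English-speaking countries where Gemini is supported.

Google Labs: Introducing the new @stitchbygoogle, Google's vibe design platform that transforms natural language into high-fidelity d...

Tags: Agents, Google, Product update, Multimodal

Why it's recommended: Google launches the AI design tool Stitch, which generates interfaces from natural language and supports voice collaboration, in line with the vibe design trend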
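March 15
22:25
Satya Nadella @satyanadella
A newly trained multimodal AI model turns routine pathology slides into spatial proteomics data, with the potential to cut testing time and cost while expanding access to cancer care.
Tags: Microsoft, Multimodal, Paper/Research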
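March 13
12:14
Sundar Pichai @sundarpichai
A demo video of the Android XR prototype shown at MWC shows Gemini handling vague, complex queries smoothly and the glasses working seamlessly with phone apps. Details have been posted on Reddit.

Dieter Bohn: Here's the video of the Android XR demo we showed last week at MWC :) Couple things that stand out to me: how well Gemin...

Tags: Agents, Google, Product update, Multimodal
00:51
Sundar Pichai @sundarpichai
Introduces an urban flash flood forecasting model that predicts floods up to 24 hours in advance, along with Groundsource, a dataset built by using Gemini to identify more than 2.6 million historical flash flood events across more than 150 countries. The dataset is open-sourced, and the forecasts are now live in Flood Hub.
Tags: Google, Product update, Multimodal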
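March 12
23:59
Claude @claudeai
Featured
Claude can now generate interactive charts and diagrams directly in the chat. The feature is available today in beta on all plans, including the free plan.
Tags: Anthropic, Product update, Multimodal

Why it's recommended: Claude adds in-chat interactive charts, now available to free-plan users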
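March 5
07:55
Saining Xie @sainingxie
Another scientific exploration from @TongPetersb, @DavidJFan, and @__JohnNguyen__ that might teach you something new, even if you are in a frontier lab. There are lots of interesting observations here, but I'll highlight just one: it is kind of an open industry secret that trying to scale DiTs with MoE has mostly been fruitless, and the unexpected, yet intuitive, synergy between RAE and MoE might actually change that. Quoted post from @TongPetersb (translated): Train beyond language. We bet on the visual world as the critical next step alongside and beyond language modeling. So we studied building foundation models from scratch with vision. We share our exploration: visual representations, data, world modeling, architecture, and scaling behavior! [1/9]

Peter Tong: Train Beyond Language. We bet on the visual world as the critical next step alongside and beyond language modeling. So, ...

Tags: Multimodal, Data/Training, Paper/Research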
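February 7
02:33
Saining Xie @sainingxie
49
The post argues that self-driving, viewed as a 2D robot with a low-dimensional action space focused mainly on avoidance, will reach real-world impact faster than anything else. The core of the Waymo World Model is not just video generation but the modeling of continuous, high-dimensional, noisy multimodal signals. The model is built on Google DeepMind's Genie 3 and can create large-scale, hyper-realistic driving simulations. By simulating extremely rare scenarios such as tornadoes or a plane landing on a highway, the Waymo Driver can be trained on them before encountering them in the real world, improving its handling of complex situations and speeding the safe deployment and maturation of autonomous driving.

Waymo: We're excited to introduce the Waymo World Model-a frontier generative mode for large-scale, hyper-realistic autonomous ...

Tags: Embodied AI, Multimodal, Expert opinion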
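February 4
02:31
Jim Fan @DrJimFan
Featured · 72
From next-word prediction to world modeling: the second paradigm of AI pre-training

The author argues that AI pre-training is undergoing a fundamental paradigm shift from next-word prediction to world modeling. A world model predicts the next physical state, or sequence of states, conditioned on an action; it is in essence a learnable physics simulator and puts vision first. Today's mainstream vision-language models, by contrast, are language-first, with vision as a secondary input. In biological intelligence, visual processing dominates cortical computation and is the high-bandwidth channel linking the brain, motor control, and the physical world. The author uses apes as evidence that strong physical intelligence can exist independently of advanced language. He predicts that in 2026 Large World Models will lay real foundations for robotics and multimodal AI, and that visual data from platforms such as YouTube will far exceed text in scale, driving the new paradigm.

Tags: Embodied AI, Multimodal, Expert opinion

Why it's recommended: Jim Fan frames world models as the second pre-training paradigm shift, with the core argument being vision-first rather than language-first. For people working on robotics and multimodal AI, this framework is a genuine call on direction, not another filler piece.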
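January 24
06:53
Saining Xie @sainingxie
Love this teaser lol (and it is real). Academia boxed us in so tightly that we nearly broke, but we clawed our way out and found a whole new universe on the other side 😅 Thank you to Google for supporting the GPU-poor rebels and pulling us into this ride, helping us build what I believe is one of the best TPU/GCP infrastructure teams outside of Google. Quoted post from @TongPetersb (translated): We have been training with TPUs in academia for two years now (huge thanks to Google TRC!). Works like Cambrian-1, Cambrian-S, RAE, and Scale-RAE would not have been possible without TPUs. We wrote a blog post sharing our experience, optimizations, and lessons: https://cambrian-mllm.github.io/blog/tpu-training-experiments.html We hope this helps more people use TPUs more smoothly; they are very powerful!

Peter Tong: We have been training with TPUs in academia for two years now (huge thanks to Google TRC!). Works like Cambrian-1, Cambr...

Tags: Google, Multimodal, Tutorial/Practice, Data/Training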
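December 2
01:35
Jim Fan @DrJimFan
Going to NeurIPS in San Diego! Available for coffee starting tomorrow afternoon. We are recruiting heavily for talent across robotics, VLM, world models, and software infrastructure! DM me or email (the address is on my very outdated home page).
Tags: Embodied AI, Multimodal, Industry news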
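November 27
04:19
Saining Xie @sainingxie
Beyond 2D limits: H* lets AI think in 360-degree real environments

The H* project moves beyond the single 2D image that limits conventional MLLMs by using a panoramic image as the environment, letting the model actively look around and reason in a 360-degree real space. Compared with the localized visual tools of projects such as V*, H* takes an embodied approach that gives the model a neck-like freedom of viewpoint, substantially expanding its action space and supporting visual search and spatial reasoning in complex scenes such as train stations and shopping malls, a shift from passive perception to active exploration.

Yiming Li: 🤔Visual-spatial reasoning requires a shift from a disembodied, passive paradigm to an embodied, active one: 🤖Grounding...

Tags: Embodied AI, Multimodal, Paper/Research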
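October 9
23:40
Google DeepMind @GoogleDeepMind
Genie 3 was named one of TIME's Best Inventions of 2025. The world model generates interactive, playable environments from text or image prompts and supports real-time interactive exploration.
Tags: Agents, Google, Multimodal, Phenomenon/Trend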
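October 2
00:13
Google DeepMind @GoogleDeepMind
Google partnered with industrial designer Ross Lovegrove and modem_works to build a tool using Gemini and image generation technology that translates his signature aesthetic into a new furniture design concept.
Tags: DeepMind, Google, Image generation, Multimodal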
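August 11
05:50
Saining Xie @sainingxie
Featured
This is not just a modeling problem; it is also a benchmarking problem.

Tairan He: I couldn't believe GPT-5 could make this mistake until @ziqiao_ma pointed it out to me. Highly recommend this paper (htt...

Tags: OpenAI, Multimodal, Expert opinion

Why it's recommended: Today's multimodal models "cheat" through language shortcuts, and real-world deployment will expose serious hidden risks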
July 31
06:42
Saining Xie @sainingxie
TheRightWay™ is my favorite brand now.

Lucas Beyer (bl16): Ok this makes me super happy. The "NoFilter" work, paper, and advocacy that @angelinepouget and I argued so hard for is ...

Tags: Meta, Multimodal, Expert opinion, Data/Training
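May 29
05:34
Saining Xie @sainingxie
For text-to-image, @xichen_pan has a great summary supporting the decoupled design philosophy: "Render unto diffusion what is generative, and unto LLMs what is understanding."

You Jiacheng: as expected, this matches findings in unified multimodal understanding and generation models by @sainingxie: frozen VLM ...

Tags: Image generation, Multimodal, Expert opinion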
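December 13
20:22
DeepSeek @deepseek_ai
Featured
🎉 DeepSeek-VL2 is here! Our next-gen vision-language model enters the MoE era. 🤖 DeepSeek-MoE architecture + dynamic image tiling ⚡ 3B/16B/27B sizes for flexible use 🏆 Outstanding performance across all benchmarks 🧵 1/n
Tags: DeepSeek, Multimodal, Model release, On-device

Why it's recommended: DeepSeek open-sources the VL2 vision model; the lightweight 3B version can be deployed on-device, and the MoE architecture supports dynamic image tiling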