AIHOT
Tag: Embodied AI
Jim Fan @DrJimFan · Feb 26

We trained a humanoid with 22-DoF dexterous hands to assemble model cars, operate syringes, sort poker cards, fold/roll shirts, all learned primarily from 20,000+ hours of egocentric human video with no robot in the loop. Humans are the most scalable embodiment on the planet.
We discovered a near-perfect log-linear scaling law (R² = 0.998) between human video volume and action prediction loss, and this loss directly predicts real-robot success rate.
Humanoid robots will be the end game, because they are the practical form factor with minimal embodiment gap from humans. Call it the Bitter Lesson of robot hardware: the kinematic similarity lets us simply retarget human finger motion onto dexterous robot hand joints. No learned embeddings, no fancy transfer algorithms needed. Relative wrist motion + retargeted 22-DoF finger actions serve as a unified action space that carries through from pre-training to robot execution.
Our recipe is called "EgoScale":
- Pre-train GR00T N1.5 on 20K hours of human video, mid-train with only 4 hours (!) of robot play data with Sharpa hands. 54% gains over training from scratch across 5 highly dexterous tasks.
- Most surprising result: a *single* teleop demo is sufficient to learn a never-before-seen task. Our recipe enables extreme data efficiency.
- Although we pre-train in 22-DoF hand joint space, the policy transfers to a Unitree G1 with 7-DoF tri-finger hands. 30%+ gains over training on G1 data alone.
The scalable path to robot dexterity was never more robots. It was always us. Deep dives in thread:

Summary: The team presents EgoScale: GR00T N1.5 is pre-trained on 20,000 hours of egocentric human video, and with only 4 hours of robot data it masters highly dexterous tasks such as assembling model cars and operating syringes, a 54% gain over training from scratch. Human video volume and action prediction loss follow a log-linear scaling relationship (R² = 0.998). The method exploits the kinematic similarity between the 22-DoF robot hands and human hands, so motions can be retargeted without complex transfer algorithms. The policy transfers across hardware to a Unitree G1 (7-DoF hands) with gains of over 30%, and a single demonstration is enough to learn a new task.
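
The log-linear scaling relationship described in this post can be illustrated with a small fit. The sketch below is a minimal example under stated assumptions: the data points are synthetic placeholders (not EgoScale measurements), and `fit_log_linear` is a hypothetical helper that fits loss = a + b * ln(hours) by ordinary least squares and reports R².

```python
import math

def fit_log_linear(hours, losses):
    """Ordinary least-squares fit of loss = a + b * ln(hours).

    Returns (a, b, r_squared). Hypothetical helper for illustration only.
    """
    xs = [math.log(h) for h in hours]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(losses) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, losses))
    b = sxy / sxx
    a = mean_y - b * mean_x
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, losses))
    ss_tot = sum((y - mean_y) ** 2 for y in losses)
    return a, b, 1.0 - ss_res / ss_tot

# Synthetic placeholder data: NOT measurements from the EgoScale work.
hours = [1_000, 2_000, 5_000, 10_000, 20_000]
noise = [0.002, -0.001, 0.001, -0.002, 0.0005]
losses = [1.0 - 0.05 * math.log(h) + e for h, e in zip(hours, noise)]

a, b, r2 = fit_log_linear(hours, losses)
print(f"loss ~ {a:.3f} + ({b:.4f}) * ln(hours), R^2 = {r2:.4f}")
```

A negative slope `b` means each doubling of video hours lowers the loss by a fixed amount, which is what "log-linear" refers to here; an R² close to 1 means the straight line in log space explains nearly all of the variation.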

Jim Fan @DrJimFan · Feb 25

What can half of GPT-1 do? We trained a 42M transformer called SONIC to control the body of a humanoid robot. It takes a remarkable amount of subconscious processing for us humans to squat, turn, crawl, sprint. SONIC captures this "System 1" - the fast, reactive whole-body intelligence - in a single model that translates any motion command into stable, natural motor signals. And it's all open-source!!
The key insight: motion tracking is the one, true scalable task for whole body control. Instead of hand-engineering rewards for every new skill, we use dense, frame-by-frame supervision from human mocap data. The data itself encodes the reward function: "configure your limbs in any human-like position while maintaining balance".
We scaled humanoid motion RL to an unprecedented scale: 100M+ mocap frames and 500,000+ parallel robots across 128 GPUs. NVIDIA Isaac Lab allows us to accelerate physics at 10,000x faster tick, giving robots many years of virtual experience in only hours of wall clock time. After 3 days of training, the neural net transfers zero-shot to the real G1 robot with no finetuning. 100% success rate across 50 diverse real-world motion sequences.
One SONIC policy supports all of the following:
- VR whole-body teleoperation
- Human video. Just point a webcam to live stream motions.
- Text prompts. "Walk sideways", "dance like a monkey", "kick your left foot", etc.
- Music audio. The robot dances to the beat, adapting to tempo and rhythm.
- VLA foundation models. We plugged in GR00T N1.5 and achieved 95% success on mobile tasks.
We open-source the code and model checkpoints!! Deep dive in thread:

Summary: SONIC is a 42M-parameter Transformer (about half the size of GPT-1), trained in NVIDIA Isaac Lab on 100M+ mocap frames with 500,000+ parallel robots, using dense frame-level supervision in place of hand-engineered reward functions. After 3 days of training it transfers zero-shot to a real G1 robot and reaches a 100% success rate on 50 motion sequences. A single policy supports VR teleoperation, motion capture from video, text commands, music response, and control by VLA models. The project is fully open source.
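
The idea that mocap data "encodes the reward function" can be sketched as a dense tracking reward. This is a minimal illustration, not SONIC's actual reward: the exponential-of-squared-error form is a common choice in motion-imitation RL and is an assumption here, as are the names `tracking_reward`, `episode_return`, and the `sharpness` parameter.

```python
import math

def tracking_reward(robot_pose, reference_pose, sharpness=2.0):
    """Per-frame reward in (0, 1]; equals 1.0 when the robot matches the mocap frame.

    Assumed form: exp(-sharpness * mean squared joint error). Illustrative only.
    """
    if len(robot_pose) != len(reference_pose):
        raise ValueError("pose vectors must have the same number of joints")
    mean_sq_err = sum(
        (q - q_ref) ** 2 for q, q_ref in zip(robot_pose, reference_pose)
    ) / len(robot_pose)
    return math.exp(-sharpness * mean_sq_err)

def episode_return(robot_frames, reference_frames):
    """Sum of per-frame rewards: every frame of the clip supervises the policy."""
    return sum(
        tracking_reward(r, ref) for r, ref in zip(robot_frames, reference_frames)
    )

# Made-up joint angles (radians) for a 3-joint toy robot over two frames.
reference = [[0.0, 0.5, -0.5], [0.1, 0.6, -0.4]]
good = [[0.0, 0.5, -0.5], [0.1, 0.6, -0.4]]
bad = [[0.4, 0.0, 0.0], [0.5, 0.0, 0.0]]
print(episode_return(good, reference), episode_return(bad, reference))
```

Because the reward is computed against every reference frame, no skill-specific reward has to be written by hand: swapping in a different mocap clip changes the task.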

Saining Xie @sainingxie · Feb 7 · 49

self-driving <as a 2D robot with a low-dim action space that focused mostly on avoidance rather than interaction> will reach real-world impact faster than anything else. the really cool part is that the world model isn’t just about videos; it’s about modeling continuous, high-dimension, and noisy signals of all kinds. that’s what "multimodal" actually means. congrats to @maxjiang93, xander, bo, and the whole waymo team 👏

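Summary: The tweet argues that self-driving, viewed as a 2D robot with a low-dimensional action space focused on avoidance, will reach real-world impact sooner than anything else. The core of the Waymo World Model goes beyond video generation to modeling continuous, high-dimensional, noisy multimodal signals. The model is built on Google DeepMind's Genie 3 and creates large-scale, hyper-realistic driving simulations. By simulating extremely rare scenarios such as tornadoes or a plane landing on a highway, the Waymo Driver can be trained for them before meeting them in reality, which improves its handling of complex situations and speeds up the safe deployment and maturation of autonomous driving.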

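Jim Fan @DrJimFan · Feb 5

Greatness arises when non-consensus peaks

Summary: Greatness arises when non-consensus peaks. [Quoting @saranormous]: the divergence of opinion in how robotics plays out is one of the biggest money-making (and career-making) opportunities in AI.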

Jim Fan @DrJimFan · Feb 5

New milestone: we trained a robot foundation model on a world model backbone, and enabled zero-shot, open-world prompting capability for new verbs, nouns, and environments. If the world model can "dream" the right future in pixels, then the robot can execute well in motors. We call it "DreamZero", our first World Action Model (WAM).
Our team had tons of fun at the lab typing anything we like into an open text prompt, and watch the robot perform tasks it was never trained on. An emergent capability we didn't quite expect. Obviously not GPT-3 reliable yet, but we are marching into the GPT-2 era.
Discoveries:
- Model and data recipe co-evolve. Compared to VLAs, WAMs learn best from diverse data, breaking away from the conventional wisdom that lots of repeated demos per task are the bread and butter. Diversity >> repetitions.
- X-embodiment is extremely hard. Pixels are the answer. Different robot morphologies traditionally have a hard time sharing knowledge well. But if we put video first, pixels become the universal bridge connecting different hardware - even videos of human first-person view. DreamZero shows significant robot2robot and human2robot transfer. With only 55 trajectories on a *new*, unseen hardware (~30 min of teleop), it adapts so quickly and retains zero-shot prompting ability.
Yesterday I posted about the "Second Pre-training Paradigm": world models are the next-gen foundation of Physical AI, not language backbones. Today, we are proving it works. And 2026 has just begun. Paper: World Action Models are Zero-Shot Policies. Read it now: (thread)

Summary: The team releases DreamZero, its first World Action Model (WAM), built on a world model backbone. It breaks from the conventional Vision-Language-Action paradigm: a pixel-level world model gives it zero-shot, open-world prompting, so it can attempt tasks it was never trained on. The work finds that WAMs learn best from diverse data rather than repeated demonstrations, and that pixels serve as a universal bridge across embodiments, enabling robot2robot and human2robot knowledge transfer. Only 55 trajectories (about 30 minutes of teleoperation) are needed to adapt to entirely new hardware, which supports the case for world models as the next-generation foundation of Physical AI.

Jim Fan @DrJimFan · Feb 4 · 72

http://x.com/i/article/2018744045779238912
# The Second Pre-training Paradigm
Next word prediction was the first pre-training paradigm. Now we are living through the second paradigm shift: world modeling, or “next physical state prediction”. Very few understand how far-reaching this shift is, because unfortunately, the most hyped use case of world models right now is AI video slop (and coming up, game slop). I bet with full confidence that 2026 will mark the first year that Large World Models lay real foundations for robotics, and for multimodal AI more broadly.
In this context, I define world modeling as predicting the next plausible world state (or a longer duration of states) conditioned on an action. Video generative models are one instantiation of it, where “next states” is a sequence of RGB frames (mostly 8-10 seconds, up to a few minutes) and “action” is a textual description of what to do. Training involves modeling the future changes in billions of hours of video pixels. At the core, video WMs are learnable physics simulators and rendering engines. They capture the counterfactuals, a fancier word for reasoning about how the future would have unfolded differently given an alternative action. WMs fundamentally put vision first.
VLMs, in contrast, are fundamentally language-first. From the earliest prototypes (e.g. LLaVA, Liu et al. 2023), the story has mostly been the same: vision enters at the encoder, then gets routed into a language backbone. Over time, encoders improve, architectures get cleaner, vision tries to grow more “native” (as in omni models). Yet it remains a second-class citizen, dwarfed by the muscles the field has spent years building for LLMs.
This path is convenient. We know LLMs scale. Our architectural instincts, data recipe design, and benchmark guidance (VQAs) are all highly optimized for language.
For physical AI, 2025 was dominated by VLAs: graft a robot motor action decoder on top of a pre-trained VLM checkpoint.
It’s really “LVAs”: language > vision > action, in decreasing order of citizenship. Again, this path is convenient, because we are fluent in VLM recipes. Yet most parameters in VLMs are allocated to knowledge (e.g. “this blob of pixels is a Coca Cola brand”), not to physics (“if you tip the coke bottle, it spreads into a brown puddle, stains the white tablecloth, and ruins the electric motor”). VLAs are quite good in knowledge retrieval by design, but head-heavy in the wrong places. The multi-stage grafting design also runs counter to my taste for simplicity and elegance. Biologically, vision dominates our cortical computation. Roughly a third of our cortex is devoted to processing pixels over occipital, temporal, and parietal regions. In contrast, language relies on a relatively compact area. Vision is by far the highest-bandwidth channel linking our brain, our motors, and the physical world. It closes the “sensorimotor loop” — the most important loop to solve for robotics, and requires zero language in the middle. Nature gives us an existential proof of a highly dexterous physical intelligence with minimal language capability. The ape. I’ve seen apes drive golf carts and change brake pads with screwdrivers like human mechanics. Their language understanding is no more than BERT or GPT-1, yet their physical skills are far beyond anything our SOTA robots can do. Apes may not have good LMs, but they surely have a robust mental picture of "what if"s: how the physical world works and reacts to their intervention. The era of world modeling is here. It is bitter lesson-pilled. As Jitendra likes to remind us, the scaling addicts, “Supervision is the opium of the AI researcher.” The whole of YouTube and the rise of smart glasses will capture raw visual streams of our world at a scale far beyond all the texts we ever train on. 
We shall see a new type of pretraining: next world states could include more than RGBs - 3D spatial motions, proprioception, and tactile sensing are just getting started. We shall see a new type of reasoning: chain of thought in visual space rather than language space. You can solve a physical puzzle by simulating geometry and contact, imagining how pieces move and collide, without ever translating into strings. Language is a bottleneck, a scaffold, not a foundation. We shall face a new Pandora’s box of open questions: even with perfect future simulation, how should motor actions be decoded? Is pixel reconstruction really the best objective, or shall we go into alternative latent spaces? How much robot data do we need, and is scaling teleoperation still the answer? And after all these exercises, are we finally inching towards the GPT-3 moment for robotics? Ilya is right after all. AGI has not converged. We are back to the age of research, and nothing is more thrilling than challenging first principles.

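Summary: The author argues that AI pre-training is undergoing a fundamental paradigm shift from "next word prediction" to "world modeling". A world model predicts the next sequence of physical states given an action; it is in essence a learnable physics simulator, and it puts vision first. Today's mainstream vision-language models, by contrast, are language-first, with vision as a secondary input. In biological intelligence, visual processing dominates cortical computation and is the high-bandwidth channel linking the brain, motor action, and the physical world. Using apes as an example, the author argues that strong physical intelligence can exist without advanced language. He predicts that in 2026 Large World Models will lay real foundations for robotics and multimodal AI, and that visual data from YouTube and similar sources will far exceed text in scale, driving the new paradigm.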
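
The article's definition of world modeling (predict the next state conditioned on an action, and compare futures under alternative actions) can be made concrete with a toy example. This is only a sketch under strong assumptions: the "world" is a hand-written 1D point mass rather than a learned video model, and `toy_world_model` and `rollout` are hypothetical names.

```python
def toy_world_model(state, action, dt=0.1):
    """Predict the next (position, velocity) of a unit point mass given a force.

    Stand-in for a learned model: the interface is state + action -> next state.
    """
    position, velocity = state
    velocity = velocity + action * dt
    position = position + velocity * dt
    return (position, velocity)

def rollout(model, state, actions):
    """Apply the model step by step and return every visited state."""
    states = [state]
    for action in actions:
        state = model(state, action)
        states.append(state)
    return states

# Counterfactual reasoning: same start state, two alternative action sequences.
start = (0.0, 0.0)
pushed = rollout(toy_world_model, start, [1.0] * 10)
pulled = rollout(toy_world_model, start, [-1.0] * 10)
print(pushed[-1], pulled[-1])
```

A video world model follows the same pattern with RGB frames as the state and a text description as the action; the two rollouts from one start state are what the article calls counterfactuals.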

Jim Fan @DrJimFan · Dec 29

Everyone's freaking out about vibe coding. In the holiday spirit, allow me to share my anxiety on the wild west of robotics. 3 lessons I learned in 2025.
1. Hardware is ahead of software, but hardware reliability severely limits software iteration speed. We've seen exquisite engineering arts like Optimus, e-Atlas, Figure, Neo, G1, etc. Our best AI has not squeezed all the juice out of these frontier hardware. The body is more capable than what the brain can command. Yet babysitting these robots demands an entire operation team. Unlike humans, robots don't heal from bruises. Overheating, broken motors, bizarre firmware issues haunt us daily. Mistakes are irreversible and unforgiving. My patience was the only thing that scaled.
2. Benchmarking is still an epic disaster in robotics. LLM normies thought MMLU & SWE-Bench are common sense. Hold your 🍺 for robotics. No one agrees on anything: hardware platform, task definition, scoring rubrics, simulator, or real world setups. Everyone is SOTA, by definition, on the benchmark they define on the fly for each news announcement. Everyone cherry-picks the nicest looking demo out of 100 retries. We gotta do better as a field in 2026 and stop treating reproducibility and scientific discipline as second-class citizens.
3. VLM-based VLA feels wrong. VLA stands for "vision-language-action" model and has been the dominant approach for robot brains. Recipe is simple: take a pretrained VLM checkpoint and graft an action module on top. But if you think about it, VLMs are hyper-optimized to hill-climb benchmarks like visual question answering. This implies two problems: (1) most parameters in VLMs are for language & knowledge, not for physics; (2) visual encoders are actively tuned to *discard* low-level details, because Q&A only requires high-level understanding. But minute details matter a lot for dexterity. There's no reason for VLA's performance to scale as VLM parameters scale. Pretraining is misaligned. Video world model seems to be a much better pretraining objective for robot policy. I'm betting big on it.

Summary: On hardware, Optimus and its peers are exquisitely engineered, but poor reliability severely limits software iteration and maintenance is costly. Benchmarking remains chaotic: there is no agreed hardware platform, task definition, or scoring rubric, cherry-picking is common, and reproducibility is in doubt. VLM-based VLA (Vision-Language-Action) has a fundamental flaw: VLMs are optimized for visual question answering, their parameters favor language and knowledge over physics, and their visual encoders discard low-level detail, which hurts dexterous manipulation. The author sees video world models as a better pre-training objective.

Jim Fan @DrJimFan · Dec 24

I was very late to own a Tesla but among the earliest to try out FSD v14. It's perhaps the first time I experience an AI that passes the Physical Turing Test: after a long day at work, you press a button, lay back, and couldn't tell if a neural net or a human drove you home. Despite knowing exactly how robot learning works, I still find it magical watching the steering wheel turn by itself. First it feels surreal, next it becomes routine. Then, like the smartphone, taking it away actively hurts. This is how humanity gets rewired and glued to god-like technologies.

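Summary: Though late to buy a Tesla, the author was among the earliest to try FSD v14 and considers it the first AI to pass the "Physical Turing Test": after a tiring day at work you press a button and lie back, and cannot tell whether a neural net or a human is driving. Despite knowing exactly how robot learning works, he still finds it striking to watch the steering wheel turn by itself. The technology goes from surreal to routine and finally, like the smartphone, becomes indispensable. This deep dependence on "god-like technologies" is rewiring human behavior.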

Jim Fan @DrJimFan · Dec 2

Going to NeurIPS in San Diego! Available for coffee starting tomorrow afternoon. We are recruiting heavily for talents across robotics, VLM, world models, and software infra! DM me or email (on my very outdated home page).

Summary: Heading to NeurIPS in San Diego and available for coffee from tomorrow afternoon. The team is recruiting heavily across robotics, VLM, world models, and software infrastructure; DM or email (the address is on his very outdated home page).

Saining Xie @sainingxie · Nov 27

after V*, many projects tried to get MLLMs to `think with images', but a regular 2d image limits you to mostly basic tools like zooming or cropping. to expand the action space, we need something more embodied. that is where H* from @YimingLi9702 and his team comes in. It takes a panoramic image as the environment. instead of staring at one image, the model can look around and think in 360. it is basically giving the model a neck! with that freedom, it can choose from many more actions and think inside real spaces like nyc train stations or shopping malls!

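Summary: H* moves beyond the limits of conventional MLLMs that process a single 2D image: it uses a panoramic image as the environment, so the model can actively look around and reason in a 360-degree real space. Compared with the local visual tools of V*-style projects, H*'s embodied approach gives the model a neck-like freedom of viewpoint, which greatly expands the action space and supports visual search and spatial reasoning in complex scenes such as train stations and shopping malls. It marks a shift from passive perception to active exploration.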

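Jim Fan @DrJimFan · Sep 26

Go check out @yukez’s talk at CoRL! Project GR00T is cooking 🍳

Summary: @yukez presents the latest Project GR00T research at CoRL 2025, announces updates to the NVIDIA Isaac GR00T platform, and discusses the technical challenges and new opportunities of humanoid robot foundation models.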

Jeff Dean @JeffDean · Sep 21

Very nice analysis by neurosurgeon @slotkinjr of how @Waymo's autonomous vehicles have much better safety properties than human drivers, and what would happen if everyone in the U.S. drove as safely as a Waymo. "The national math: If every US vehicle performed like Waymo, we’d prevent 33,000-39,000 deaths annually and save $0.9-1.25 trillion in societal costs. Even partial adoption at 27% would save ~10,000 lives per year. In terms of magnitude, this would be the equivalent of eliminating every pedestrian death nationally in a year."

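Summary: A neurosurgeon's in-depth analysis of 96 million miles of Waymo driving data shows that its autonomous vehicles have a 91% lower rate of serious crashes than human drivers and 95% fewer injury crashes at intersections. In 47% of collisions the speed difference was under 1 mph: the system turns unavoidable crashes into minor contact. If every US vehicle reached this safety level, 33,000-39,000 deaths would be prevented each year and $0.9-1.25 trillion in societal costs saved; even 27% adoption would save about 10,000 lives a year, a fundamental shift from injury mitigation to injury prevention.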
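
The partial-adoption figure can be sanity-checked with simple arithmetic. The sketch below assumes that lives saved scale linearly with the adoption fraction; that linearity is an assumption made here for illustration, not something the quoted analysis states, and `lives_saved_at` is a hypothetical helper.

```python
def lives_saved_at(adoption, full_low=33_000, full_high=39_000):
    """Scale the full-adoption range by the adoption fraction.

    Linear scaling is an assumption; the defaults are the quoted 33,000-39,000.
    """
    return adoption * full_low, adoption * full_high

low, high = lives_saved_at(0.27)
print(f"27% adoption: about {low:,.0f} to {high:,.0f} lives per year")
```

Under that assumption the range comes out at roughly 8,900 to 10,500 lives per year, which is consistent with the quoted "~10,000".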

Jim Fan @DrJimFan · Sep 13

There was something deeply satisfying about ImageNet. It had a well curated training set. A clearly defined testing protocol. A competition that rallied the best researchers. And a leaderboard that spawned ResNets and ViTs, and ultimately changed the field for good. Then NLP followed. No matter how much OpenAI, Anthropic, and xAI disagree, they at least agree on one thing: benchmarking. MMLU, HLE, SWEBench - you can’t make progress until you are able to measure it. Robotics still doesn’t have such a rallying call. No one agrees on anything: hardware, task, scoring, simulation engine, or real world environment. Everyone is SOTA, by definition, on the benchmark they define on the fly for each paper. From the maker of ImageNet - BEHAVIOR takes a stab at the daunting challenge of unifying robotics benchmarking on a reproducible physics engine (Isaac Sim). The project started before I graduated from Stanford Vision Lab, and took so many years of dedication and PhD careers to build. I hope BEHAVIOR is either the hill-climbing signal we need, or the spark that finally gets us talking about how to measure real progress as a field.

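Summary: The tweet notes that computer vision (ImageNet) and NLP (MMLU, HLE, SWEBench) have established standardized benchmarks, while robotics still lacks a unified evaluation standard, with no agreement on hardware, task definition, or scoring. BEHAVIOR, from the creators of ImageNet, is built on the Isaac Sim physics engine and aims to provide a reproducible, unified robotics benchmark. The project has launched its first challenge at NeurIPS 2025 and hopes to become a rallying signal for progress in the field.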

Google DeepMind @GoogleDeepMind · Aug 22

Why make explorable, AI generated worlds? @ShlomiFruchter and @JParkerHolder explain how creating diverse and challenging virtual environments can help test and train AI agents safely. 🦺 Watch their conversation about Genie 3 with podcast host, @fryrsquared ↓
Timecodes:
00:00 Coming up
00:54 Intro
02:00 What is Genie 3?
03:00 Demos
06:50 Physics
11:00 Possible scenarios
15:07 Emergent properties
20:35 Veo and Genie
25:00 Hunter's lodge
28:02 Memory
34:52 SIMA and Genie 3
39:13 Searching for interestingness
52:00 AGI and the future
59:23 Hannah's thoughts

Summary: Google DeepMind researchers Shlomi Fruchter and Jack Parker-Holder discuss Genie 3 on the podcast: generating diverse, explorable virtual worlds gives AI agents a safe environment for testing and training. The conversation covers physics simulation, emergent properties, the combination with SIMA, and the outlook for AGI.

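Jim Fan @DrJimFan · Aug 7

Would love to see the FSD Scaling Law, as it’s the only physical data flywheel at planetary scale. What’s the “emergent ability threshold” for model/data size?

Summary: On the FSD Scaling Law and the emergent ability threshold: FSD is the only physical data flywheel at planetary scale. Tesla is training a new FSD model with roughly 10x the parameters and a much-improved video compression loss, due for release at the end of next month if all goes well.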

Jim Fan @DrJimFan · Aug 5

World modeling for robotics is incredibly hard because (1) control of humanoid robots & 5-finger hands is wayyy harder than ⬆️⬅️⬇️➡️ in games (Genie 3); and (2) object interaction is much more diverse than FSD, which needs to *avoid* coming into contact. Our GR00T Dreams work was a first attempt at building high-fidelity world simulator for humanoid robots. It's not only for evaluation but also for large-scale synthetic data generation. Time to move away from the "fossil fuel" of robotics (human teleoperation) and embrace clean energy (nuclear "diffusion")! GR00T Dreams kind of flew under the radar, so bringing it back to life on a cheerful day ;)

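Summary: NVIDIA releases the DreamGen engine (GR00T Dreams), which uses video generation models such as Sora and Veo as neural physics engines. A four-step recipe (fine-tune the model, simulate parallel worlds, recover pseudo-actions, train a foundation model) generates large-scale synthetic training data for robots. Starting from a single pick-and-place task, a humanoid robot learns 22 new behaviors such as pouring and folding, and generalizes zero-shot to new verbs and unseen environments (success rates of 43% and 28% respectively). Unlike a traditional graphics engine, it handles deformable objects, fluids, and other complex interactions at constant compute cost. The team plans to fully open-source it within weeks.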

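Jim Fan @DrJimFan · Aug 5

Evaluation is the hardest problem for physical AI systems: do you crash test cars every time you debug a new FSD build? Traditional game engine (sim 1.0) is an alternative, but it's not possible to hard-code all edge cases. A neural net-based sim 2.0 is purely programmed by data, grows more capable with data, and scales as the fleet data flywheel scales.

Summary: Physical AI cannot be evaluated by crash-testing real cars, and a traditional game engine (sim 1.0) cannot cover every edge case. A neural-net-based sim 2.0 is driven by data and scales with the fleet. Tesla has used this for years to generate training data for rare, dangerous scenarios such as near head-on collisions, supplementing the extreme cases that are hard to collect even from 8 million real vehicles.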

Jim Fan @DrJimFan · Jul 26

I'm observing a mini Moravec's paradox within robotics: gymnastics that are difficult for humans are much easier for robots than "unsexy" tasks like cooking, cleaning, and assembling. It leads to a cognitive dissonance for people outside the field, "so, robots can parkour & breakdance, but why can't they take care of my dog?" Trust me, I got asked by my parents about this more than you think ... The "Robot Moravec's paradox" also creates the illusion that physical AI capabilities are way more advanced than they truly are. I'm not singling out Unitree, as it applies widely to all recent acrobatic demos in the industry. Here's a simple test: if you set up a wall in front of the side-flipping robot, it will slam into it at full force and make a spectacle. Because it's just overfitting that single reference motion, without any awareness of the surroundings. Here's why the paradox exists: it's much easier to train a "blind gymnast" than a robot that sees and manipulates. The former can be solved entirely in simulation and transferred zero-shot to the real world, while the latter demands extremely realistic rendering, contact physics, and messy real-world object dynamics - none of which can be simulated well. Imagine you can train LLMs not from the internet, but from a purely hand-crafted text console game. Roboticists got lucky. We happen to live in a world where accelerated physics engines are so good that we can get away with impressive acrobatics using literally zero real data. But we haven't yet discovered the same cheat code for general dexterity. Till then, we'll still get questioned by our confused parents.

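Summary: Robotics has its own "Moravec's paradox": acrobatic flips are easier to achieve than cooking or cleaning. The former can be trained in simulation and transferred zero-shot with no perception of the surroundings; the latter needs realistic vision, contact physics, and object dynamics, which are hard to simulate. Hence the public's confusion: robots can show off but cannot do housework, because general dexterity is still unsolved.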

Jim Fan @DrJimFan · Jul 19

My bar for AGI is far simpler: an AI cooking a nice dinner at anyone’s house for any cuisine. The Physical Turing Test is very likely harder than the Nobel Prize. Moravec’s paradox will continue to haunt us, looming larger and darker, for the decade to come.

Summary: The bar for AGI is not winning a Nobel Prize but cooking a nice dinner at anyone's house in any cuisine. The Physical Turing Test is very likely harder than the Nobel Prize, and Moravec's paradox will keep haunting AI for the coming decade.

Jim Fan @DrJimFan · Jul 14

I've been a bit quiet on X recently. The past year has been a transformational experience. Grok-4 and Kimi K2 are awesome, but the world of robotics is a wondrous wild west. It feels like NLP in 2018 when GPT-1 was published, along with BERT and a thousand other flowers that bloomed. No one knew which one would eventually become ChatGPT. Debates were heated. Entropy was sky high. Ideas were insanely fun. I believe the GPT-1 of robotics is already somewhere on Arxiv, but we don't know exactly which one. Could be world models, RL, learning from human video, sim2real, real2sim, etc. etc, or any combo of them. Debates are heated. Entropy is sky high. Ideas are insanely fun, instead of squeezing the last few % on AIME & GPQA. The nature of robotics also greatly complicates the design space. Unlike the clean world of bits for LLMs (text strings), we roboticists have to deal with the messy world of atoms. After all, there's a lump of software-defined metal in the loop. LLM normies may find it hard to believe, but so far roboticists still can't agree on a benchmark! Different robots have different capability envelopes - some are better at acrobatics while others at object manipulation. Some are meant for industrial use while others are for household tasks. Cross-embodiment isn't just a research novelty, but an essential feature for a universal robot brain. I've talked to dozens of C-suite leads from various robot companies, old and new. Some sell the whole body. Some sell body parts such as dexterous hands. Many more others sell the shovels to manufacture new bodies, create simulations, or collect massive troves of data. The business idea space is as wild as research itself. It's a new gold rush, the likes of which we haven't seen since the 2022 ChatGPT wave. The best time to enter is when non-consensus peaks. We're still at the start of a loss curve - there're strong signs of life, but far, far away from convergence. Every gradient step takes us into the unknown. 
But one thing I do know for sure - there's no AGI without touching, feeling, and being embodied in the messy world. On a more personal note - running a research lab comes with a whole new level of responsibility. Giving updates directly to the CEO of a $4T company is, to put it mildly, both thrilling and all-consuming of my attention weights. Gone are the days when I could stay on top of and dive deep into every AI news. I’ll try to carve out time to share more of my journey.

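Summary: Robotics is in a chaotic phase like NLP in 2018: the technical path is unsettled (world models, RL, sim2real, and so on), business models are blooming in every direction, and it is a good time to enter. Running a lab and reporting directly to the CEO of a $4T company has consumed all his attention, hence fewer posts on X. He is convinced there is no AGI without embodiment.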

Saining Xie @sainingxie · Jun 21

guys, real geospatial data is a total goldmine for digital agents. step away from the web browser and get real. (we explored a bit in http://virl-platform.github.io, but building a simulation-ready pipeline like this could take things way further)

Summary: Real geospatial data is a total goldmine for digital agents. Step away from the web browser and get real.

Jim Fan @DrJimFan · May 20

What if robots could dream inside a video generative model? Introducing DreamGen, a new engine that scales up robot learning not with fleets of human operators, but with digital dreams in pixels. DreamGen produces massive volumes of neural trajectories - photorealistic robot videos paired with motor action labels - and unlocks strong generalization to new nouns, verbs, and environments. Whether you’re a humanoid (GR1), an industrial arm (Franka), or a cute little robot (HuggingFace SO-100), DreamGen enables you to dream.
Video generation models like Sora & Veo are neural physics engines. By compressing billions of internet videos, they learn a multiverse of plausible futures, i.e. superpositions of how the world could unfold from any initial image frame. DreamGen taps into this power with a simple 4-step recipe:
1. Fine-tune a SOTA video model on your target robot;
2. Prompt the model with diverse language prompts to simulate parallel worlds: how your robot would have acted in new scenarios. Filter out the bad dreams (ha!) that don’t follow instructions;
3. Recover pseudo-actions using inverse dynamics or latent action models;
4. Train robot foundation models on the massively augmented dataset of neural trajectories.
That’s it. Just more data, and plain old supervised learning. Simple, right?
What’s remarkable is how far this goes. Starting with just a single-task dataset of pick-and-place, our humanoid robot learns 22 new behaviors, such as pouring, folding, scooping, ironing, and hammering, despite never seeing those verbs before. Better yet, we can take the robot out of the lab and drop it into the NVIDIA HQ Cafe, and let DreamGen work its magic. We show true zero-to-one generalization: from 0% success to over 43% for novel verbs, and 0 -> 28% in unseen environments.
Compared to a traditional graphics engine, DreamGen doesn’t care if the scene involves deformable objects, fluids, translucent materials, contact-rich interactions, or crazy lighting. Good luck engineering those by hand. For DreamGen, every world is just a forward pass through a diffusion neural net. No matter how complex the dream is, it takes constant compute time to roll out.
Read our blog and paper today! We plan to fully open-source the entire pipeline in the next few weeks. Links in thread:

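Summary: DreamGen lets robots "dream" inside a video generative model to synthesize training data. A fine-tuned video generation model produces massive volumes of neural trajectories (photorealistic video plus action labels), and the robot generalizes from a single pick-and-place task to 22 new behaviors such as pouring and folding. The robot was also taken out of the lab into the NVIDIA HQ cafe; success rose from 0% to over 43% on novel verbs and from 0% to 28% in unseen environments. Unlike a traditional graphics engine, it handles fluids, deformable objects, and other complex scenes without hand modeling. The whole pipeline is to be fully open-sourced soon.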
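
The 4-step recipe described in the post can be sketched as a pipeline skeleton. This is not NVIDIA's code: every function name and parameter below is hypothetical, and the five callables are placeholders for real components (a video model fine-tuner, a generator, an instruction-following filter, an inverse dynamics or latent action model, and a policy trainer). The toy lambdas only show how data flows between the steps.

```python
def dreamgen_style_pipeline(video_model, robot_videos, prompts,
                            fine_tune, generate, follows_instruction,
                            recover_actions, train_policy):
    """Hypothetical skeleton of the 4-step recipe; all callables are placeholders."""
    # Step 1: fine-tune the video model on the target robot's footage.
    tuned = fine_tune(video_model, robot_videos)
    # Step 2: prompt for diverse "dreams", then drop those that ignore the prompt.
    dreams = [(prompt, generate(tuned, prompt)) for prompt in prompts]
    kept = [video for prompt, video in dreams if follows_instruction(prompt, video)]
    # Step 3: label each kept video with recovered pseudo-actions.
    trajectories = [(video, recover_actions(video)) for video in kept]
    # Step 4: plain supervised training on the augmented dataset.
    return train_policy(trajectories)

# Toy stand-ins so the skeleton runs end to end; none of these are real APIs.
policy = dreamgen_style_pipeline(
    video_model="base-video-model",
    robot_videos=["pick_and_place.mp4"],
    prompts=["pour water", "fold towel", "???"],
    fine_tune=lambda model, videos: f"{model}+tuned",
    generate=lambda model, prompt: f"video({prompt})",
    follows_instruction=lambda prompt, video: prompt != "???",
    recover_actions=lambda video: f"actions({video})",
    train_policy=lambda trajectories: {
        "num_trajectories": len(trajectories),
        "data": trajectories,
    },
)
print(policy["num_trajectories"])
```

Passing the components in as arguments keeps the sketch independent of any particular video model or action-recovery method, which matches the post's point that the recipe itself is just more data plus supervised learning.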

Jim Fan @DrJimFan · May 15

Going to ICRA next week in Atlanta!! We are on a mission to build the most cracked team on humanoid robotics. Hiring the best talents on the research frontier of VLA, world models, RL, and simulation! DM or email me for meetup! linxif@nvidia.com

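Summary: The NVIDIA research team heads to ICRA in Atlanta next week and is recruiting core members for humanoid robotics, with a focus on VLA, world models, RL, and simulation. Top research talent can DM or email to arrange a meetup.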

Jim Fan @DrJimFan · May 8

The Physical Turing Test: your house is a complete mess after a Sunday hackathon. On Monday night, you come home to an immaculate living room and a candlelight dinner. And you couldn't tell whether a human or a machine had been there. Deceptively simple, insanely hard. It is the next North Star of AI. The dream that keeps me awake 12 am at the lab. The vision for the next computing platform that automates chunks of atoms instead of chunks of bits. Thanks Sequoia for hosting me at AI Ascent! Below is my full talk on the first principles to solve general-purpose robotics: how we think about the data strategy and scaling laws. I assure you it will be 17 minutes you don't regret!

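Summary: Proposes the "Physical Turing Test" as the North Star for general-purpose robotics: can a machine complete physical tasks (tidying a room, preparing dinner) so well that you cannot tell whether a human or a machine did them? It is the vision for the next computing platform, one that automates atoms instead of bits. The 17-minute talk at Sequoia AI Ascent covers first principles, data strategy, and scaling laws.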

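Jim Fan @DrJimFan · Apr 22

Some day in the next decade, we will have robots in every home, every hospital and factory, doing every dull and dangerous jobs with superhuman dexterity. That day will be known as “Thursday”. Not even Turing would dare to dream up our lifetime in his wildest dreams.

Summary: Within the next decade robots will be in homes, hospitals, and factories, doing dull and dangerous jobs with superhuman dexterity, and that day will simply be called "Thursday". Nobody cheered when the Turing Test was passed; a milestone once seen as the ultimate challenge is now "just another damn Tuesday". Technology moves so fast that miracles become mundane.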

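Jim Fan @DrJimFan · Mar 21

We got lots of great community feedback on our open-source GR00T N1! Check out our Github, star, fork, contribute back! Let's solve generally intelligent robots together, one commit at a time. https://github.com/NVIDIA/Isaac-GR00T/

Summary: NVIDIA releases GR00T N1, described as the world's first open-source humanoid robot foundation model. With only 2B parameters, it uses a VLM plus Diffusion Transformer architecture for end-to-end control. It is trained on real teleoperation data, 300,000+ simulation trajectories, and synthetic neural trajectories; it improves task performance by 30% on robots such as GR1 and 1X Neo, and can be deployed across embodiments down to very low-cost open-source robot arms.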

All AI Updates
Full feed of AI-related news
Feb 26
01:22
Jim Fan @DrJimFan
Featured
Humanoid with 22-DoF dexterous hands learns fine manipulation from 20,000 hours of human video

Tags: Embodied AI · Data/Training · Paper/Research
Related discussion: 1 post · X: Jim Fan (@DrJimFan)
Recommended because: Learning from human video follows a near-perfect scaling law, and a robot needs only a single demo to pick up a new skill; embodied AI is getting its data revolution.
Feb 25
01:34
Jim Fan @DrJimFan
Featured
SONIC: a whole-body robot control model half the size of GPT-1

Tags: Agents · Embodied AI · Open-source ecosystem · Model release

Recommended because: A small 42M model achieves whole-body humanoid control, transfers zero-shot to real hardware, and is fully open source, so developers can reproduce it.
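Feb 7
02:33
Saining Xie @sainingxie
49
The tweet argues that self-driving, viewed as a 2D robot with a low-dimensional action space focused on avoidance, will reach real-world impact sooner than anything else. The core of the Waymo World Model goes beyond video generation to modeling continuous, high-dimensional, noisy multimodal signals. The model is built on Google DeepMind's Genie 3 and creates large-scale, hyper-realistic driving simulations. By simulating extremely rare scenarios such as tornadoes or a plane landing on a highway, the Waymo Driver can be trained for them before meeting them in reality, which improves its handling of complex situations and speeds up the safe deployment and maturation of autonomous driving.

Waymo: We're excited to introduce the Waymo World Model - a frontier generative model for large-scale, hyper-realistic autonomous ...

Tags: Embodied AI · Multimodal · Expert opinion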
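Feb 5
14:54
Jim Fan @DrJimFan
Greatness arises when non-consensus peaks. [Quoting @saranormous]: the divergence of opinion in how robotics plays out is one of the biggest money-making (and career-making) opportunities in AI.

sarah guo: the divergence of opinion in how robotics plays out is one of the biggest money making (and career making) opportunities...

Tags: Embodied AI · Expert opinion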
02:15
Jim Fan @DrJimFan
Featured
New milestone: DreamZero, built on a world model backbone, achieves zero-shot open-world robot control

Tags: Agents · arXiv · Embodied AI · Paper/Research

Recommended because: World models become the new foundation of Physical AI, and robots' zero-shot generalization is approaching a GPT-2 moment.
Feb 4
02:31
Jim Fan @DrJimFan
Featured · 72
From "next word prediction" to "world modeling": the second paradigm of AI pre-training

Tags: Embodied AI · Multimodal · Expert opinion

Recommended because: Jim Fan frames world models as the second pre-training paradigm shift, with vision-first rather than language-first as the core claim. For people working on robotics and multimodal AI, this framework is a real call on direction, not another filler post.
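December 29
02:11
Jim Fan@DrJimFan
Featured
Three predicaments in robotics: hardware reliability, benchmarking, and the limits of VLA

On hardware: robots such as Optimus are superbly engineered, but insufficient reliability severely limits software iteration, and maintenance is expensive. Benchmarking remains chaotic: there is no shared hardware platform, task definition, or scoring standard, cherry-picking is common, and reproducibility is poor. VLA (Vision-Language-Action) approaches built on VLMs have an inherent flaw: VLMs are optimized for visual question answering, their parameters are devoted largely to language knowledge rather than physical understanding, and their vision encoders discard low-level detail, which hurts fine manipulation. The author considers video world models a better pre-training objective.

Agents · Embodied AI · Expert Opinion

Why recommended: An NVIDIA scientist lays out three pain points in robotics: hardware holds back iteration, benchmarks are chaotic, and the VLA approach has a fundamental flaw.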
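December 24
06:28
Jim Fan@DrJimFan
Featured
Late to buy a Tesla, early to try FSD v14: experiencing the first AI to pass the Physical Turing Test

Although he bought a Tesla late, the author was among the first to try FSD v14 and considers it the first AI system to pass the "Physical Turing Test": after a tiring day at work you press a button and relax, and can no longer tell whether a neural network or a human is driving. Even knowing how robot learning works, he still finds it striking to watch the steering wheel turn smoothly on its own. The technology is shifting from a surreal experience to a daily habit and will eventually become as indispensable as the smartphone. This deep reliance on "god-like technology" is fundamentally reshaping human behavior.

Phil Duan: Along for the ride in unsupervised FSD testing

Embodied AI · Expert Opinion
1 related discussion · X: Jim Fan (@DrJimFan)
Why recommended: When you cannot tell whether an AI or a human is driving, the trust assumptions underlying how we travel will be rewritten.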
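December 2
01:35
Jim Fan@DrJimFan
Heading to San Diego for NeurIPS! Available for coffee chats starting tomorrow afternoon. We are hiring aggressively in robotics, VLMs, world models, and software infrastructure! DM me or email me (the address is on my very outdated homepage).
Embodied AI · Multimodal · Industry News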
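November 27
04:19
Saining Xie@sainingxie
Beyond 2D: H* lets AI reason in real 360-degree environments

The H* project moves past the limitation of conventional MLLMs, which process a single 2D image, by using panoramic images as the environment, so the model can actively observe and reason in real 360-degree space. Whereas projects such as V* provide local visual tools, H* takes an "embodied" approach that gives the model a human-neck-like freedom of viewpoint. This substantially enlarges the action space and supports visual search and spatial reasoning in complex scenes such as subway stations and shopping malls, a shift from passive reception to active exploration.

Yiming Li: 🤔Visual-spatial reasoning requires a shift from a disembodied, passive paradigm to an embodied, active one: 🤖Grounding...

Embodied AI · Multimodal · Papers/Research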
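September 26
08:25
Jim Fan@DrJimFan
@yukez presents the latest Project GR00T research at CoRL 2025, announces updates to the NVIDIA Isaac GR00T platform, and discusses the technical challenges and new opportunities for humanoid robot foundation models.

NVIDIA Robotics: The rise of humanoid platforms presents new opportunities and unique challenges. 🤖 Join @yukez at #CoRL2025 as he share...

Product Update · Embodied AI · Model Release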
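September 21
01:27
Jeff Dean@JeffDean
Waymo self-driving safety analysis: tens of thousands of lives and about a trillion dollars could be saved each year

A neurosurgeon's in-depth analysis of Waymo data covering 96 million miles of driving finds that its autonomous vehicles have a 91% lower rate of serious crashes than human drivers and 95% fewer injury crashes at intersections. In 47% of collisions the speed difference was under 1 mph, suggesting that the system turns otherwise unavoidable crashes into minor contact. If every vehicle in the United States reached this level of safety, 33,000 to 39,000 deaths could be prevented each year and $0.9 to $1.25 trillion in societal costs saved. Even at 27% adoption roughly 10,000 lives would be saved, a fundamental shift from injury mitigation to injury prevention.

Dr. Jon Slotkin: As a neurosurgeon I care a lot about road safety. By now you've probably seen @Waymo's stunning safety results (like 91%...

Google · Embodied AI · Trends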
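As a quick sanity check on the figures in this item, the "roughly 10,000 lives at 27% adoption" number is what one would get by scaling the full-adoption range proportionally. The sketch below assumes that lives saved scale linearly with adoption; that is an illustrative assumption, not necessarily the method used in the original analysis, and the function name is hypothetical.

```python
# Figures quoted in the summary: deaths prevented per year at full adoption.
DEATHS_PREVENTED_FULL = (33_000, 39_000)
ADOPTION = 0.27

def lives_saved(adoption, full_range):
    """Scale the full-adoption range linearly by the adoption share (assumption)."""
    low, high = full_range
    return round(adoption * low), round(adoption * high)

low, high = lives_saved(ADOPTION, DEATHS_PREVENTED_FULL)
print(low, high)  # 8910 10530
```

Under this assumption the range is about 8,900 to 10,500 lives per year, which is consistent with the "roughly 10,000" quoted above.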
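September 13
22:51
Jim Fan@DrJimFan
BEHAVIOR Challenge launches: robotics gets its ImageNet moment

The post notes that computer vision (ImageNet) and natural language processing (MMLU, HLE, SWEBench) have established standardized benchmarks, while robotics still lacks a unified evaluation standard, with inconsistent hardware, task definitions, and scoring. BEHAVIOR, developed by the creator of ImageNet and built on the Isaac Sim physics engine, aims to provide a reproducible, unified benchmark for robotics. The project has launched its first challenge at NeurIPS 2025 and hopes to become a landmark signal for progress in the field.

Fei-Fei Li: (1/N) How close are we to enabling robots to solve the long-horizon, complex tasks that matter in everyday life? 🚨 We a...

Embodied AI · Evaluation/Benchmarks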
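August 22
01:26
Google DeepMind@GoogleDeepMind
Why build explorable AI-generated worlds?

In a podcast, Google DeepMind researchers Shlomi Fruchter and Jack Parker-Holder explain Genie 3: by generating diverse, explorable virtual worlds, it gives AI agents a safe environment for testing and training. The conversation covers physics simulation, emergent properties, integration with SIMA, and the outlook for AGI.

Agents · DeepMind · Product Update · Embodied AI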
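August 7
01:36
Jim Fan@DrJimFan
Watch the FSD scaling law and the thresholds for emergent capabilities; this is the only physical data flywheel in the world. Tesla is training a new FSD model with roughly 10x the parameters and a big improvement in video compression loss, to be released at the end of next month if all goes well.

Elon Musk: Tesla is training a new FSD model with ~10X params and a big improvement to video compression loss. Probably ready for p...

Embodied AI · Data/Training · Model Release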
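August 5
23:57
Jim Fan@DrJimFan
Featured
NVIDIA introduces the DreamGen engine: letting robots learn by "dreaming" inside video generation models

NVIDIA releases the DreamGen engine (GR00T Dreams), which uses video generation models such as Sora and Veo as neural physics engines. A four-step pipeline (fine-tune the video model, simulate parallel worlds, recover pseudo-actions, train the foundation model) generates large-scale synthetic training data for robots. Starting from a single pick-and-place task, a humanoid robot learns 22 new behaviors such as pouring and folding, and generalizes zero-shot to new verbs and unfamiliar environments (success rates of 43% and 28%, respectively). Compared with traditional graphics engines, the method handles deformable objects, fluids, and other complex interactions at constant compute cost. The team plans to fully open-source it within weeks.

Jim Fan: What if robots could dream inside a video generative model? Introducing DreamGen, a new engine that scales up robot lear...

Embodied AI · Video · Papers/Research

Why recommended: NVIDIA proposes using video generation models to "dream up" synthetic training data for robots, achieving zero-shot skill generalization.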
23:38
Jim Fan@DrJimFan
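Featured
Physical AI cannot be evaluated through real-vehicle crash testing, and traditional game engines (sim 1.0) struggle to cover every edge case. Neural-network-based sim 2.0 is data-driven and scales with fleet size. Tesla has used it for years to generate training data for rare, dangerous scenarios such as near head-on collisions, supplementing extreme cases that are hard to collect even with 8 million real vehicles.

Elon Musk: @DrJimFan Tesla has had this for a few years. Used for creating unusual training examples (eg near head-on collisions), ...

Embodied AI · Expert Opinion · Data/Training

Why recommended: Jim Fan points out the evaluation problem for physical AI and proposes a neural-network-driven Sim 2.0 data flywheel as the answer.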
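July 26
00:58
Jim Fan@DrJimFan
Featured
A mini Moravec's paradox in robotics: gymnastics that is hard for humans turns out to be the easier part

Robotics has its own "Moravec's paradox": acrobatics such as backflips are easier to achieve than cooking or cleaning. The former can be trained in simulation and transferred zero-shot, with no need to perceive the environment. The latter requires realistic vision, contact physics, and object dynamics, all of which are hard to simulate. This confuses outside observers, who see robots pull off stunts yet fail at housework; the reason is that general dexterity remains an unsolved problem.

Embodied AI · Expert Opinion

Why recommended: Jim Fan explains why, in this robotics version of Moravec's paradox, stunts are easy and housework is hard.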
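July 19
23:30
Jim Fan@DrJimFan
Featured
The bar for AGI is not winning a Nobel Prize but being able to walk into anyone's home and cook any cuisine. The Physical Turing Test is far harder than academic theory, and Moravec's paradox will keep troubling AI for the next decade.

Thomas Wolf: My bar for AGI is an AI winning a Nobel Prize for a new theory it originated.

Embodied AI · Expert Opinion · Reasoning

Why recommended: Jim Fan proposes a Physical Turing Test bar for AGI: cooking any dinner is harder than winning a Nobel Prize.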
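July 14
01:06
Jim Fan@DrJimFan
Featured
I have been fairly quiet on X lately. The past year has been a transformative journey. Grok-4 and…

Robotics is in a chaotic period similar to NLP in 2018: the technical route is undecided (world models, RL, sim2real, and so on) and business models are proliferating, which makes this a good time to enter the field. Running a lab and reporting directly to the CEO of a $4 trillion company has consumed all of his energy, hence fewer posts on X. He firmly believes there is no AGI without embodied intelligence.

Agents · Embodied AI · Expert Opinion

Why recommended: Jim Fan says robotics is at its GPT-1 moment and that embodied intelligence is a necessary condition for AGI.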
June 21
03:38
Saining Xie@sainingxie
Folks, real geospatial data is an absolute gold mine for digital agents. Get out of the web browser and get real.

Chuang Gan: Virtual Community provides an online pipeline that automatically generates 3D scenes from real geospatial data, performi...

Agents · Embodied AI · Expert Opinion
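May 20
21:29
Jim Fan@DrJimFan
Featured
DreamGen: robots "dream" inside video generation models to synthesize training data

By fine-tuning models such as Sora, DreamGen generates large numbers of neural trajectories (realistic video plus action labels), and the robot generalizes from a single pick-and-place task to 22 new behaviors such as pouring and folding. In tests at the NVIDIA headquarters cafe, the humanoid's zero-shot success rate on new verbs rose from 0% to 43%, and it reached 28% in new environments. Compared with traditional graphics engines, the approach handles fluids, deformable objects, and other complex scenes without manual modeling. The entire pipeline is to be fully open-sourced soon.

Embodied AI · Video · Papers/Research

Why recommended: NVIDIA proposes DreamGen, which lets robots "dream" inside video generation models to synthesize training data, achieves strong zero-shot generalization, and will be open-sourced.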
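May 15
02:39
Jim Fan@DrJimFan
The NVIDIA research team is heading to ICRA in Atlanta next week and is recruiting core members for its humanoid robotics effort on site. The focus is on frontier areas including VLA, world models, RL, and simulation. The team is looking for top research talent; DM or email to arrange a meeting.
Embodied AI · Industry News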
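May 8
23:41
Jim Fan@DrJimFan
Featured
The Physical Turing Test: after a Sunday hackathon your home is a mess, yet on Monday night you come back to a tidy living room and a candlelit dinner, and you cannot tell whether a human or a machine did it

The post proposes the "Physical Turing Test" as the North Star for general-purpose robotics: can a machine carry out physical tasks (tidying a room, preparing dinner) the way a person would, without anyone noticing the difference? This is the next computing platform, moving from automating bits to automating atoms. The idea was presented in a 17-minute talk at Sequoia AI Ascent covering first principles, data strategy, and scaling laws.

Embodied AI · Expert Opinion

Why recommended: NVIDIA's Jim Fan proposes the "Physical Turing Test" as the ultimate standard for general-purpose robots.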
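April 22
22:03
Jim Fan@DrJimFan
Over the next decade robots will spread into homes, hospitals, and factories, doing dull and dangerous work with superhuman dexterity, and that day will simply be called "Thursday". We crossed the Turing Test and nobody cheered; a milestone once seen as the ultimate challenge is now just "another damn Tuesday". Technology iterates so fast that miracles become commonplace.

signüll: we crossed the turing test & no one gave a shit. no parades. no front page headlines. just... a casual shrug. like "oh y...

Agents · Embodied AI · Expert Opinion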
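March 21
01:01
Jim Fan@DrJimFan
Featured
NVIDIA releases GR00T N1, described as the world's first open-source foundation model for humanoid robots. It has only 2B parameters and uses a VLM plus Diffusion Transformer architecture for end-to-end control. The model is trained on real teleoperation data, 300,000+ simulation trajectories, and synthetic neural trajectories. It improves task performance by 30% on robots such as GR1 and 1X Neo, and can be deployed across embodiments down to hundred-dollar-class open-source robot arms.

Jim Fan: Excited to announce GR00T N1, the world's first open foundation model for humanoid robots! We are on a mission to democr...

Embodied AI · Open-Source Ecosystem · Model Release

Why recommended: NVIDIA open-sources GR00T N1, the first general-purpose foundation model for humanoid robots; at 2B parameters it can be deployed on hundred-dollar-class robot arms.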