Several Unitree G1 humanoid robots mirrored a lead dancer's choreography in real time via motion capture at a Shanghai event. The demonstration was part of a record 100-person simultaneous motion tracking challenge.

Jim Fan@DrJimFan · 6月6日71

NitroGen just won CVPR Best Paper Honorable Mention!! We are making strides towards general-purpose embodied agents that master not only the real world physics, but also all possible physics across a multiverse of simulations. It’s been 4 years since MineDojo, our first embodied agent in Minecraft, won NeurIPS Best Paper. Congrats to everyone on the team!!

AYi@AYi_AInotes · 6月5日54

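Masayoshi Son isn't just talking off the cuff. For one thing, AI investments have just made him Asia's richest person again, so he has genuinely tasted the payoff of AI. For another, he recently committed 7.5 billion euros to building AI data centers in France, which is effectively SoftBank's all-in direction for the next 10 years. That is why he says the AI revolution will be 50 times the scale of the dot-com bubble era and the greatest technological revolution humanity has experienced 😄 So what exactly is Physical AI? Physical AI is an AI brain plus a physical body: an intelligent entity that can see, think, use its hands, walk, and interact with the real world. It is the robotic arm working around the clock in a factory, the humanoid robot moving goods in a warehouse, and the future home assistant that cooks, cleans, and looks after the elderly. AI will eventually go from being a worker in the virtual world to a worker in the physical world; that is where things are heading. Tesla, Figure, and China's Unitree and AgiBot, among others, will become the leading players and giants of the next stage. Let's wait and see, and look back in five years.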

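Summary: In a CNBC interview in Paris on June 1, Masayoshi Son predicted that Physical AI and robotics are the next trillion-dollar opportunity and that the AI revolution will be 50 times the scale of the dot-com bubble era. He recently invested 7.5 billion euros in building AI data centers in France. The post defines Physical AI as an "AI brain plus a physical body" that can see, think, act, and interact with the real world, with applications including factory robotic arms, warehouse humanoid robots, and future home assistants. The author expects Tesla, Figure, Unitree, AgiBot, and others to become the leading players of the next stage.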

AYi@AYi_AInotes · 6月5日60

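I've figured one thing out: where the next big AI opportunity lies. Masayoshi Son has essentially set the tone for AI's next decade. He just said in Paris that the next trillion-dollar opportunity is Physical AI and robotics. Not chat, not writing code, and certainly not making videos; the key is giving AI a body so it can stand up, walk out, and do work with its hands. The humanoid robot market today is roughly $2-3 billion. Institutional forecasts put it at $200 billion by 2035, and optimists say it will pass a trillion within 10 years. Those numbers may not mean much on their own, so put it another way: the phones we use took about ten years to go from something only a few could afford to something everyone owns. Robots are moving down the same cost curve, and with AI behind them it may go faster: China has already pushed the cost of a single unit down to $50,000. What does that mean? Physical AI is no longer a future thing; it has already started and you just haven't noticed yet. But what I most want to talk about isn't investment. There is a more painful call to make: the dividend window for software AI is moving from explosive growth to maturity. If all our attention is still on prompt techniques and the pure-software agent layer, we may well end up like the people who only built mobile apps in the 2010s: highly skilled, but with little stake in the next wave. That is not to say software AI doesn't matter. My point is that the next generation of AI should be AI that understands the physical world. An LLM can't write the force feedback of picking up a cup, and an agent doesn't know to slow down before turning a corner while carrying a box. This physical common sense is the hardest bone for AI to chew, and it is also the advantage available to whoever cracks it first. So my own current view is simple: think of AI in three layers. Layer one, software intelligence: what you use every day now, such as chat, coding, and image generation. Layer two, embodied intelligence: AI with a body that can perceive, decide, and act. Layer three, superintelligence: too far off, so set it aside for now. Most people are still only on layer one. What we should do now is stop worrying about whether robots will replace us and first build layer two into our own mental model. Concretely, each week: spend an hour or two following the real-world progress of one or two embodied-intelligence projects. Not demo videos; look at mass-production timelines, cost curves, and actual deployment scenarios, and treat it as a channel you have to keep up with. There is a harsh rule here: in every shift of underlying technology, the first to be left behind are never the people who don't understand it, but the people who think they do and never updated. AI will certainly not stay inside the screen. It will walk out and blend into our lives, so our mental models have to evolve and upgrade along with it.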

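Summary: In a CNBC interview in Paris on June 1, Masayoshi Son said the next trillion-dollar opportunity is Physical AI and robotics, and that the AI revolution could be 50 times the scale of the dot-com bubble. The humanoid robot market is currently about $2-3 billion; institutional forecasts put it at $200 billion by 2035, with optimistic estimates passing a trillion within 10 years. China has pushed single-unit cost down to $50,000. The author divides AI into three layers (software intelligence, embodied intelligence, and superintelligence), argues that the dividend window for pure-software agents is maturing, and recommends tracking mass-production timelines, cost curves, and real deployment scenarios for embodied-intelligence projects.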
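To put the quoted forecast in perspective, the growth rate it implies can be worked out with the standard compound annual growth rate formula. The sketch below is only arithmetic on the figures in the post ($2-3 billion today, $200 billion by 2035); the 9-year horizon is an assumption, since the post does not state its base year, and the function name is illustrative.

```python
def implied_cagr(start: float, end: float, years: float) -> float:
    """Compound annual growth rate needed to go from start to end in `years`."""
    if start <= 0 or end <= 0 or years <= 0:
        raise ValueError("start, end and years must be positive")
    return (end / start) ** (1.0 / years) - 1.0


# Assumption: 9 years between "today" and 2035 (base year not given in the post).
YEARS_TO_2035 = 9
TARGET_BN = 200.0  # forecast quoted in the post, in billions of USD

if __name__ == "__main__":
    for start_bn in (2.0, 3.0):  # low and high end of the quoted current market
        rate = implied_cagr(start_bn, TARGET_BN, YEARS_TO_2035)
        print(f"${start_bn:.0f}B -> ${TARGET_BN:.0f}B in {YEARS_TO_2035} years: "
              f"{rate:.0%} per year")
```

Under that assumed horizon the forecast requires growth of roughly 60 percent or more every year, which is a useful sanity check when comparing it with actual shipment and cost data.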

Rohan Paul@rohanpaul_ai · 6月5日23

Robot unboxing scenes will become common in many homes everywhere. Sooner than we think.

Rohan Paul@rohanpaul_ai · 6月5日51

These robotic hands will cause some layoffs in massage parlors 😅 Coordinated finger movements. Fist clenching, pointing and spreading. Complete hand closures. Palm opening, precise pinching actions, and digit control. Xynova at ICRA 2026 in Vienna.

AYi@AYi_AInotes · 6月5日59

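I couldn't sleep after watching this latest interview with Masayoshi Son, newly Asia's richest person. In a CNBC interview in Paris on June 1 he gave away a lot of clues about where future wealth lies, stating clearly that the next trillion-dollar opportunity is Physical AI and robotics, and that this wave of the AI revolution will most likely be 50 times the scale of the dot-com bubble era, the largest revolution in technology and its implementation that humanity has experienced. I looked around at the reaction in Chinese-language circles, and most people scrolled past it as ordinary news. For the past three years we have been busy teaching AI to write code, draw, and chat, but in the next decade AI will very likely walk out of the screen, stand up, take steps, and do things with its hands. In other words, all the prompt techniques, agent orchestration, content generation, and so on that we practice today are still essentially at the bodiless-AI layer. What will really shape the next generation's productivity landscape is the layer with a body. The points below are the map I put together after thinking this through completely, one that ordinary people can use to advance their understanding and wealth 👇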

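Summary: In a June 1 CNBC interview, Masayoshi Son said the next trillion-dollar opportunity is Physical AI and robotics and that the AI revolution will be 50 times the scale of the dot-com bubble era, the largest technological shift humanity has experienced. The author expects AI to move from the screen into the physical world over the next decade, gaining a body and doing hands-on work. Today's AI practice (prompting, agent orchestration, content generation) remains at the bodiless layer, while the layer with a body is what will shape productivity. The post also introduces a map for ordinary people to advance their understanding and wealth.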

Rohan Paul@rohanpaul_ai · 6月4日58

Great piece from Dr. Fei-Fei Li (@drfeifei) “The world is not made of words.... A model that masters simulation can project its understanding into pixels for human consumption, and into action predictions for embodied agents." LLMs learn patterns in text, so they can explain a room, but they do not naturally know how the room changes when a chair moves, glass breaks, sunlight shifts, or a robot pushes a cup. A world model tries to learn the hidden structure behind what we see, meaning it can predict views the camera never captured, model object behavior, and support agents that act inside real or virtual environments. To see a world from a new angle, to predict what happens when something is pushed, and to decide what to do next all require a common internal model of space, causality, and consequence.

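Summary: Citing Fei-Fei Li's essay, the post notes that large language models (LLMs) learn patterns in text, so they can describe a room but do not naturally know how it changes when a chair moves, glass breaks, sunlight shifts, or a robot pushes a cup. A world model instead tries to learn the hidden structure behind what we see: it can predict views the camera never captured, model object behavior, and support agents acting in real or virtual environments. Seeing from a new angle, predicting the result of a push, and deciding what to do next all require a common internal model of space, causality, and consequence.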

Rohan Paul@rohanpaul_ai · 6月4日39

Robotic fingers are progressing faster than we think. Here, motors embedded in the fingers, with onboard actuators inside each finger segment, give this Wuji Tech robot hand its smooth multi-joint movement.

Fei-Fei Li@drfeifei · 6月4日78

http://x.com/i/article/2062244283940544512

# A Functional Taxonomy of World Models

> “The world is everything that is the case.”
> Ludwig Wittgenstein, Tractatus Logico-Philosophicus, 1921

## The world is not made of words.

In an earlier essay, we argued that spatial intelligence is AI’s next frontier and that world models are the path to it. Here, the World Labs team and I want to go one level deeper: of the many things now being built and called ‘world models,’ which functional pieces actually compose that capacity, and what is each one for?

Language models have given machines an extraordinary command of concepts, vocabulary, and reasoning, but the physical world, virtual or real, runs on a different substrate. Where language models learn the statistical structure of text, world models learn the statistical structure of space and time: how light falls on a surface, how a garden looks from an angle no camera has captured, how objects respond to force and follow the laws of physics.

That makes “world model” one of the most important and most overloaded terms in AI today. Computer vision, robotics, reinforcement learning, and generative AI each claim to be building world models, and each means something quite different. A video model that produces gorgeous but physically impossible flames, a language model improvising a playable game, and a physics engine that faithfully simulates combustion all go by the same name.

The ancient Greeks could never agree on what the world was made of, whether fire, water, or indivisible atoms, because “world” was never a single thing. It was always a stand-in for whatever totality a given thinker needed to reason about. AI has inherited the same problem, at exactly the moment when the field needs precision.

## The loop beneath the taxonomy

Cutting through that confusion starts with a diagram older than any of the technology in question.
Reinforcement learning textbooks, including the canonical Sutton and Barto, have used a version of the same picture for decades to describe how an agent interacts with a world. The formal name for this picture is the partially observable Markov decision process, or POMDP, and the original definition of the term “world model” belongs to that tradition.

An agent, which can be a person, a robot, or a software system, takes actions. Those actions affect the state of the world. The agent never sees the state directly. What reaches the agent are observations: the photons that fall on a retina, the readings from a sensor, and the pixels in a video frame. New observations inform new actions, and the loop continues.

The word “state” needs unpacking, because the meaning shifts from field to field. This is not the chemist’s state, the difference between solid, liquid, and gas. This is the physicist’s and roboticist’s state: a complete description of what is happening in the world at a given moment, including every object, every position, every velocity, every property. State is the underlying reality of the world; complete in principle, but never directly visible to any agent inside it. Observations are an agent’s partial view of that reality. Actions are what the agent does in response.

This loop (agent to action to state to observation and back) is the structure that gave the modern term “world model” its technical meaning. The phrase itself is older, traced to Kenneth Craik’s 1943 proposal that minds reason by running “small-scale models” of reality, and carried into neural networks by the late 1980s and early 1990s. And the loop also explains what people mean by the term today. The different things now being called world models are in fact different projections of this same loop. Each one outputs a different piece of it.

## Three functions of a world model

The first kind of world model is a renderer.
A renderer outputs observations in the form of pixels meant for human eyes, and the quality that matters most is visual fidelity. A video model that turns a text prompt into a cinematic drone shot is a renderer. So is an interactive system like Google’s Genie 3, or World Labs’ own RTFM, where the model generates frames in real time conditioned on user input. The model carries no explicit understanding of three-dimensional structure. It produces what a viewer would see, not what is. The buildings in the drone shot may look flawless from above, but try to drive through the city below and they fall apart. The second kind is a simulator. A simulator outputs state: a geometrically, physically or dynamically faithful representation of the world that humans and computer programs can both compute on and interact with. Where the renderer’s contract is purely visual, the simulator’s contract is structural, demanding geometry that holds up under inspection, physics that respects Newton’s laws, and dynamics that behave the way the world needs to behave given the laws of physics. A simulator serves two consumers at once. Human professionals such as architects, designers, filmmakers, and game developers need accuracy beyond visual plausibility. Computer programs such as reinforcement learning agents, robot controllers, and autonomous vehicles use simulators as training grounds where they can interact with the world at scale, testing scenarios that would be dangerous, expensive, or impossible to run in reality. The third kind is a planner. A planner outputs actions. Given an observation and a goal, a planner answers the question of what the agent should do next. This is, in many ways, the inverse of the renderer. Where a renderer takes actions as input and produces observations, a planner takes observations as input and produces actions, closing the perception-action loop. 
Vision-Language-Action models, model-based systems, and the new wave of World Action Models are all attempts at planners: systems that can decide what a robot should do in an unstructured world.

These three categories describe most of what is actually shipping today, and the distinction between them is useful in practice. The categories are not, however, fundamentally separate. The same underlying knowledge of how the world works (geometry, physics, dynamics) sits beneath all of them. A model that can render a cup from any angle ought, in principle, to be able to simulate what happens when the cup is pushed and plan a hand to pick the cup up. Increasingly, the most interesting research deliberately blurs the boundaries between the three.

## Why simulation is the linchpin

Of the three categories, the simulator gets the least public attention, and is the most consequential of the three. This essay addresses this asymmetry.

The renderer is by far the most commercially mature. A number of image- or text-to-video products are expanding in the consumer or enterprise markets rapidly. Google’s Nano Banana model has put renderer-quality image generation in the hands of potentially hundreds of millions of users. The technology is real, and the markets are real. Yet renderers optimize for visual plausibility rather than physical accuracy, and that ceiling matters. Their outputs are beautiful, but they cannot be trusted to design a building or train a robot.

The planner is the most intriguing and the most nascent, closely connected to the rapidly evolving field of robotic learning. The field has produced robotic demos in the last two years that look impressive in videos, but candor is required about what those demos actually show. Almost all have been confined to heavily constrained laboratory setups, with narrow object sets and short task horizons. None have been validated at the complexity, variability, or duration that real-world deployment demands.
The gap between a compelling demo reel and a robot that reliably works in a kitchen, a warehouse, or an operating room remains vast. The commercial bets are nonetheless substantial. A wave of well-funded entrants is racing to ship general-purpose planning systems, while the largest infrastructure players are positioning planning atop broader simulation stacks. A robot that can plan is a robot that can work, and the entire industry is racing to be the one that gets there first. Simulation is the bridge between the two. If language is an abstraction of the world and pixels are a projection of it, then geometry, physics, and dynamics are the world itself. A simulator must work at that level: the structural backbone from which both visual appearance (for renderers) and action consequences (for planners) can be derived. A model that masters simulation can project its understanding into pixels for human consumption, and into action predictions for embodied agents. A model that masters only rendering, or only planning, cannot do either. The commercial surface area is enormous. NVIDIA’s Omniverse alone targets what the company estimates as more than a trillion dollars of addressable market in factories, warehouses, supply chains, and digital twins. Robotics training, autonomous vehicle testing, architectural visualization, engineering, and drug discovery all depend on something simulation-shaped. The hardest open problems in the field live there too. Three-dimensional data with explicit geometry, material properties, and physical annotations is orders of magnitude scarcer than the internet video that renderers train on. The sim-to-real gap, which is the difference between how things behave in simulation and how they behave in reality, persists. Generative simulators introduce a new risk on top of that: AI-generated geometry can look correct while containing self-intersections or wrong scale that produce nonsensical physics. 
Multi-physics simulation at scale, where rigid bodies, deformable objects, fluids, and cloth all interact, remains orders of magnitude more expensive than single-domain simulation.

At World Labs, Marble is our first move into this territory. It takes multimodal prompts (text, image, video, or spatial sketch) and generates explorable 3D environments, outputting Gaussian splats for visual exploration alongside collision meshes a physics engine can operate on. But Marble is only the first chapter of a much longer arc being written across the field as the lines between rendering, simulation, and planning begin to collapse.

## Where the boundaries are collapsing and what comes next

But more is to come. The most important pattern in the field right now is that the three categories are starting to blend into one another. The shared insight is that the knowledge required to render a world, simulate it, and act in it is largely the same. Continuing the earlier example, a model that truly understands how a cup sits on a table (its geometry, material properties, response to force, etc.) should be able to render that cup from any angle, simulate what happens when the cup is pushed, and plan for a hand to pick the cup up. The three categories are three projections of a single underlying understanding.

For example: a small but growing body of recent work from various robotics labs has demonstrated that, at least conceptually, a pretrained video renderer can be used as the backbone for joint world-and-action prediction, suggesting a bridge between the renderer and the planner by letting one model imagine what will happen and what to do. World Labs’ Marble already outputs Gaussian splats and collision meshes from a single model, dissolving the boundary between the renderer and the simulator.
Every level is moving from passive output to interactive system, with renderers becoming action-conditioned, simulators generating worlds that are more controllable and editable, and planners deliberating rather than just reacting. The logical endpoint is a unified world model: one foundation model that can render photorealistic views, produce physically accurate structure, and plan action sequences, switching between output modalities depending on what the downstream consumer needs. We will still face a number of daunting challenges. The data picture is uneven, with renderers awash in internet video while simulators and planners face acute shortages of 3D assets and robot demonstrations. Optimizing for visual beauty can sacrifice the precision a robot or a high-fidelity simulation needs. Reconciling these tensions inside a single architecture is the defining open problem in world model research today, and this is what World Labs sets out to do as we continue to evolve Marble. The direction, however, is clear. The same bet the field has been making since the late 1980s — that a sufficiently rich model of the world is all that any agent needs to see worlds, build them, and act in them — is the bet now driving an entire generation of research. What gives that “big bet” weight is the convergence already underway: three threads, each already driving and shaping multi-billion-dollar industries on its own, that began as separate research programs are starting to behave like one. Taken together, as the boundaries between them collapse, they will reshape something larger: the relationship between machine intelligence and the physical world it inhabits - the long arc of spatial intelligence. Language gave machines a way to talk about that world. World models are how machines will finally come to understand, imagine, reason and interact with it.
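The agent, action, state, observation loop that the essay builds on, and its three projections (renderer, simulator, planner), can be sketched in a few lines. This is a toy illustration under stated assumptions, not anything from World Labs: the one-dimensional "cup on a table" world, the friction term, the proportional controller, and every function name are hypothetical choices made only to show which piece of the loop each role outputs.

```python
import random
from dataclasses import dataclass


@dataclass
class State:
    """Complete description of the toy world: an object's position and velocity."""
    position: float
    velocity: float


def simulate(state: State, action: float, dt: float = 0.1,
             friction: float = 0.2) -> State:
    """Simulator role: given a state and an action (a push), output the next state."""
    velocity = (state.velocity + action * dt) * (1.0 - friction)
    return State(position=state.position + velocity * dt, velocity=velocity)


def render(state: State, noise: float, rng: random.Random) -> float:
    """Renderer role: output an observation, here a noisy position reading.

    The agent only ever receives this partial view, never the state itself.
    """
    return state.position + rng.uniform(-noise, noise)


def plan(observation: float, goal: float, gain: float = 2.0) -> float:
    """Planner role: given an observation and a goal, output an action."""
    return gain * (goal - observation)


def run_loop(steps: int = 200, goal: float = 1.0, noise: float = 0.01,
             seed: int = 0) -> State:
    """Run observation -> action -> state repeatedly and return the final state."""
    rng = random.Random(seed)
    state = State(position=0.0, velocity=0.0)
    for _ in range(steps):
        observation = render(state, noise, rng)   # state  -> observation
        action = plan(observation, goal)          # observation -> action
        state = simulate(state, action)           # action -> next state
    return state


if __name__ == "__main__":
    final = run_loop()
    print(f"final position: {final.position:.3f}")
```

Each function returns a different piece of the loop, which mirrors the essay's point that the three kinds of world model are projections of one structure; in a learned system all three would draw on a shared model of geometry, physics, and dynamics rather than hand-written rules.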

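Summary: The World Labs team and Fei-Fei Li published an essay untangling the overloaded term "world model." Where language models learn the statistical structure of text, world models learn the statistical structure of space and time (for example, lighting and physical laws). Under the partially observable Markov decision process (POMDP) framing, an agent's actions affect the state of the world, and observations are only a partial view of that state. The different systems now called "world models" are projections of this same loop: a renderer outputs pixels for human eyes, with visual fidelity as the key quality; a simulator outputs state; and a planner outputs actions. The essay focuses on conceptual taxonomy; it names systems such as Genie 3, RTFM, and Marble but gives no parameter counts or benchmark scores.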

fofr@fofrAI · 6月3日31

I need to see a video of two of these playing each other in real life.

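Summary: A developer trained an AI agent with reinforcement learning in simulation and then deployed it on a real robotic air hockey table. The robot tracks the puck with millimeter-level precision and reacts in about 20 milliseconds, enough to challenge skilled human players. It marks a shift from pre-programmed rules to behavior learned in simulation and executed in the physical world. The author of the main post hopes to see two of these robots play each other in real life.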

Rohan Paul@rohanpaul_ai · 6月3日36

Boston Dynamics’ Spot is patrolling World Cup venues in Dallas, using 360° cameras, thermal imaging, and chemical sensors to detect suspicious packages, scan surroundings, and support security teams live. No facial recognition capability.

X.PIN@thexpin · 6月2日72

http://x.com/i/article/2061763779088797696

# Everyone in Robotics Is Burning Cash. Unitree Turned a Profit in China.

Late 2017. The World Internet Conference is underway in Wuzhen, a canal town in Zhejiang province. Wang Xingxing, founder of a small Hangzhou robotics outfit called Unitree, doesn’t have a badge to get in. So he sets up outside the doors and demos his company’s first product: an early Laikago, a quadruped robot named for the Soviet space dog that flew on Sputnik 2. His audience: Lei Jun, CEO of Xiaomi, and Wang Xing, CEO of Meituan. Two of the most powerful tech founders in China.

Then the robot crashed. Wang had to reboot it right there on the doorstep. By all accounts, it was a deeply awkward few minutes. He was convinced it would work anyway.

Nine years later, on the day I was finishing this piece, Unitree’s IPO cleared the Shanghai Stock Exchange’s listing committee, targeting a raise of about 4.2 billion yuan, roughly $610 million. It’s set to become the first dedicated humanoid robotics company to list on China’s A-share market. Around the same time, Nvidia CEO Jensen Huang announced that the company’s Isaac GR00T reference design would integrate Unitree’s H2 Plus humanoid, paired with Nvidia’s Jetson Thor and the GR00T workflow. The H2 Plus is expected to ship by year’s end.

If you’ve read about Unitree in the English-language press, you’ve probably gotten the broad-strokes version. How did the company actually go global? What is Wang Xingxing like? And how, in an industry where everyone is hemorrhaging cash, did Unitree start making money? I’ve been lucky enough to interview Wang in person more than once. What follows draws on his IPO prospectus, the company’s reply letter to the exchange, and several off-the-record conversations, in an attempt at some real answers.

## A $5,600 Robot With a 40% Margin

For the rest of the robotics industry, Unitree’s prospectus is a problem.
The field has made enormous technical strides in the last few years, but most companies run on venture money. Losing money is the baseline. Unitree posted a net profit of 77.5 million yuan (about $11 million) in 2024, and by 2025 that had climbed to roughly 600 million yuan ($84 million)—a net margin around 35 percent. That isn’t supposed to be possible right now. Humanoids still aren’t shipping in real volume. Most makers count it a win just to keep build quality consistent. Training data is scarce, so the robots can’t do much that’s useful in the real world. And security is an afterthought—even basic backdoor protection is spotty. Wang isn’t chasing any of those frontiers. Spend time with him and you realize he’s fixated on one question: how do you ship a product that works, at a cost you can actually control? His robots may not be the most advanced on the market. But they’re reliable enough—and once you factor in the price, “reliable enough” starts to look like a steal. He’s been obsessed with cost since long before Unitree existed. As a student, he tried to build a bipedal robot for 200 yuan—about 28 bucks. He tinkered constantly; one experiment, electrolyzing tap water, accidentally released chlorine gas. In 2015, finishing his master’s at Shanghai University, he built a quadruped called XDog out of hobby-grade motors meant for model airplanes. All in, it cost under 20,000 yuan—about $2,800. Boston Dynamics’ Spot, for comparison, rented for more than $70,000. Where Boston Dynamics used hydraulic joints, Wang went electric—and not with industrial motors, but cheap brushless ones. His robot dogs used as few parts as he could get away with. He’s said he started the company with just 2 million yuan—around $280,000—and every yuan had to pull its weight. That same discipline shows up in the humanoids. This March, the Chinese brokerage China Post Securities took apart a base-model G1 (after-tax price: 85,000 yuan, about $12,000) to estimate what it cost to build. 
The motors, driver boards, and gearboxes (a humanoid’s most critical components) came out with no manufacturer logos at all, which usually means one of two things: Unitree makes them itself, or the supplier is staying very quiet. The memory and storage came from Biwin and Longsys, both Chinese. The main processor was a Rockchip RK3588 (there’s also a Qualcomm-based version, the G1Q). The default lidar came from DJI, with RoboSense or Hesai as options. Mixing in-house parts with cheap commodity components, the teardown pegged the base G1’s bill of materials at around 40,000 yuan (roughly $5,600), a gross margin north of 40 percent. Upgrade the unit, and that margin sails past 60.

This is the engine behind Unitree’s climbing margins: most humanoid buyers are universities and labs, and they tend to splurge on the pricier, modifiable EDU version. The more they buy, the better the math gets.

Back in 2024, I interviewed a Unitree salesperson at a trade show. He told me, flatly, that the humanoid business could realistically clear a billion yuan (about $140 million) a year. He wasn’t wrong: 2025 revenue came in around 1.71 billion yuan, roughly $240 million. (He later blocked me. Unitree, I gather, keeps its people on a short leash when it comes to reporters.)

## So Why Did the Money Show Up All at Once?

The real puzzle in the prospectus isn’t the early losses. It’s how fast the profits arrived in 2025. Humanoid revenue jumped from 107 million yuan (about $15 million) in 2024 to 869 million ($122 million) in 2025, outearning, for the first time, the robot dogs that built the company.

The Western press tends to credit one moment: the dancing robots on China’s CCTV Spring Festival Gala in early 2025, which kicked off a national humanoid craze. That’s not wrong, but it’s not the whole story. Having covered this beat from 2023 to 2025, I can tell you the fascination was building in China well before that broadcast.
Unitree’s early H1, back when it could only shuffle, was already pulling millions of views on Douyin. Once a later H1 could fold itself up and walk like a person, Chinese social media lost it. Every product teaser Unitree dropped, ordinary users would re-cut into clips that racked up millions of views overnight—I was one of them, for a while. Other startups noticed and tried to copy the formula. None of it landed the way Unitree’s did. At the 2025 World Robot Conference in Beijing, I asked Wang whether he’d set out to build humanoids on purpose. His answer caught me off guard: “For a long time I was actually against making humanoids. I’d built a bipedal one back in 2009, and the business case was brutal. But by 2022, customers were placing orders—some were paying deposits before we even had a product. So we built one.” That’s it. No vision, no AGI, no sweeping story about automation. Customers wanted one, so he made one. The humanoid frenzy has, in a strange way, almost nothing to do with him—he’s watching it from the sidelines. My honest guess is that the 2025 revenue spike is just the 2023 and 2024 orders finally being fulfilled. This is what separates Wang from most humanoid founders: he’s more conservative. Zhang Peng, founder of the tech-media brand GeekPark and an early Unitree backer, has described him as the rare founder who’ll tell you plainly which problems are hard and how long each will really take. Worth remembering: when Wang was saying these things, he’d just left a three-month stint at DJI. Because he never learned to sell a vision, his path to profit was almost comically simple: build the thing, and the labs will buy it. So labs and universities became his market. Unitree’s gear performed roughly on par with Boston Dynamics’ at about 30 percent of the cost, sometimes less. The electric drivetrain was easy to hack on—grad students could tinker, publish papers, and spread the word at conferences. 
Marketing, in the usual sense, was a line item he could mostly skip. The Unitree social accounts everyone knows now? They didn’t roll out until 2021. The in-house video team didn’t exist until 2022. Wang barely posts. The prospectus puts Unitree’s 2025 ad spend at 60.53 million yuan (about $8.5 million), not much for a brand this recognizable.
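The margin claims in the article can be checked with simple arithmetic on the figures it quotes. The sketch below is a rough cross-check only: it treats the quoted after-tax G1 price as revenue and the teardown's bill of materials as the only cost, which is an assumption (the brokerage's own calculation may net out tax, assembly, or other costs), and the variable and function names are illustrative.

```python
def margin(revenue: float, cost: float) -> float:
    """Return (revenue - cost) / revenue as a fraction of revenue."""
    if revenue <= 0:
        raise ValueError("revenue must be positive")
    return (revenue - cost) / revenue


# Figures quoted in the article, all in yuan.
G1_PRICE = 85_000            # base-model G1, after-tax price
G1_BOM = 40_000              # teardown estimate of the bill of materials
NET_PROFIT_2025 = 600e6      # "roughly 600 million yuan"
REVENUE_2025 = 1.71e9        # "around 1.71 billion yuan"
HUMANOID_REVENUE_2024 = 107e6
HUMANOID_REVENUE_2025 = 869e6

# Assumption: price treated as revenue, BOM as the only cost of goods.
gross_margin_g1 = margin(G1_PRICE, G1_BOM)
net_margin_2025 = NET_PROFIT_2025 / REVENUE_2025
humanoid_growth = HUMANOID_REVENUE_2025 / HUMANOID_REVENUE_2024

if __name__ == "__main__":
    print(f"G1 gross margin on BOM alone: {gross_margin_g1:.1%}")
    print(f"2025 net margin:              {net_margin_2025:.1%}")
    print(f"Humanoid revenue multiple:    {humanoid_growth:.1f}x")
```

On these inputs the BOM-only gross margin lands above the article's "north of 40 percent," the net margin matches the stated figure of around 35 percent, and humanoid revenue grew roughly eightfold year over year.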

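Summary: Chinese humanoid robot maker Unitree posted a net profit of 77.5 million yuan in 2024, rising to roughly 600 million yuan in 2025, a net margin of about 35 percent, in an industry where losses are the norm. The company has cleared the Shanghai Stock Exchange's listing committee and plans to raise about 4.2 billion yuan, aiming to become the first dedicated humanoid robotics company listed on the A-share market. Its H2 Plus humanoid, expected to ship by year's end, will be integrated into Nvidia's Isaac GR00T reference design alongside Jetson Thor. Unitree has commercialized through low-cost, reliable products; the base-model G1 sells for 85,000 yuan (about $12,000).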

Rohan Paul@rohanpaul_ai · 6月1日42

A humanoid robot is useful only when teams can test motion, perception and interaction on real hardware, because simulation often misses friction, balance errors, sensor noise, and messy human environments. That’s exactly why robots need developer ecosystems. LUMOS Robotics just launched Project EDGE, a program giving 100 free LUMOS NIX humanoids to selected builders, universities, robotics labs, and creative robotics teams.

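Summary: LUMOS Robotics launched Project EDGE to build a developer ecosystem. Because simulation often misses friction, balance errors, sensor noise, and messy human environments, a humanoid's motion, perception, and interaction need to be tested on real hardware. The program will give 100 free LUMOS NIX humanoids to developers, universities, robotics labs, and creative teams worldwide. Selected partners receive the robot, open SDK access, and direct technical support to explore applications ranging from dynamic motion control to embodied AI. Applications are now open.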

Luma@LumaLabsAI · 6月1日62

To improve human life, AI systems must be able to help us improve the physical world. What stands between us and that prosperous future is the problem of generalization in physical AI. To solve this problem, we are establishing a new open science physical AI lab at Luma. Read more → https://lumalabs.ai/news/luma-open-physical-ai-lab

Chubby♨️@kimmonismus · 6月1日83

1/ NVIDIA just open-sourced Cosmos 3 at GTC Taipei! It's the first fully open "omnimodel" for physical AI: one model that understands the real world, predicts what happens next, and generates the actions a robot should take. Weights, code, datasets. All open. And this is really big. Let's dig into everything: 🧵

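Summary: At GTC Taipei, NVIDIA announced that it is fully open-sourcing Cosmos 3, described as the first "omnimodel" for physical AI. It has native visual reasoning and can understand the real world, predict what happens next, and generate the actions a robot should take. The release includes two variants, Super (32B) and Nano (8B). Model weights, code, and datasets are all open.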

X.PIN@thexpin · 6月1日52

BREAKING: Unitree just passed its IPO review on the SSE STAR Market! The Listing Review Committee confirmed today (June 1) that Unitree fully meets all issuance, listing, and disclosure requirements. Big milestone for the company!

X.PIN@thexpin · 6月1日63

Big robotics moment at Computex 🤖 In his keynote, Nvidia CEO Jensen Huang unveiled the Isaac GR00T Reference Humanoid Robot — Nvidia's first robotics system sold to researchers. • Body: Unitree H2 Plus (180cm, 70kg — Unitree's first near-human-sized robot) • Hands: Sharpa Wave 5-finger tactile hands (Singapore-based Sharpa) • Brain: Nvidia Jetson Thor (Blackwell GPU) + Isaac GR00T software stack Already going to Stanford, ETH Zurich, UC San Diego and Ai2. Unitree H2 ships late 2026 — perfect timing. Unitree CEO Wang Xingxing said in January: whoever cracks the robot LLM is the world's top AI + robotics company — "fully Nobel-worthy." So… where's Boston Dynamics?

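Summary: At Computex, Nvidia unveiled the Isaac GR00T Reference Humanoid Robot, its first robotics system sold to researchers. The hardware is the Unitree H2 Plus (180 cm, 70 kg) with Sharpa Wave five-finger tactile hands; the brain is Nvidia Jetson Thor (Blackwell GPU) running the Isaac GR00T software stack. The system is already going to Stanford, ETH Zurich, UC San Diego, and Ai2. The Unitree H2 is scheduled to ship in late 2026. Unitree CEO Wang Xingxing has said that whoever cracks the "robot LLM" will be the world's top AI and robotics company, an achievement he called "fully Nobel-worthy."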

Berryxia.AI@berryxia · 6月1日40

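This is not a video generation model; it is a persistent, multi-user collaborative world model. The core breakthrough is fully decoupling "world state" from "visual rendering": the world is no longer a sequence of frames but a structured environment that runs continuously, can be modified by users, and can be observed stably from any viewpoint. This may be the closest attempt so far at an "interactive persistent world."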

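Summary: The post describes a new "persistent, multi-user collaborative world model" and stresses that it is not a conventional video generation model. Its core breakthrough is fully decoupling world state from visual rendering, so the world is a structured environment that runs continuously, lets users modify it, and can be observed stably from any viewpoint, rather than a series of frames. The author considers it possibly the closest attempt so far at an interactive persistent world.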

Runway@runwayml · 6月1日69

Introducing the Cosmos Coalition: a new global initiative with NVIDIA and leading AI labs to build and open-source frontier world models for physical AI. Runway joins as a founding member, working alongside NVIDIA and a set of leading AI labs to build, share and accelerate world model research and development through a common open ecosystem.

Greg Brockman@gdb · 6月1日57

OpenAI Robotics is making rapid progress towards building AI that can help people in the physical world. Apply now to join the team:

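Summary: OpenAI announced that its world simulation research program has evolved into the OpenAI Robotics team, which is making rapid progress on a foundation of co-design between robotics hardware and machine learning. The team's short-term goal is robots that support skilled workers building future infrastructure; its long-term vision is a personal robot for everyone. It is hiring full-stack hardware, ops, systems, and ML engineers to program and manufacture robots that are useful for society.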

Emad@EMostaque · 6月1日78

Sora team became robotics team

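Summary: OpenAI's Sora world simulation research team has become the OpenAI Robotics team. Led by Aditya Ramesh, it works from the premise that AI should be able to help people in the physical world. The short-term focus is robots that support skilled workers building future infrastructure; the long-term vision is a personal robot for everyone. Progress is rapid and rests on co-design between robotics hardware and ML research, and the team is hiring full-stack hardware, systems, and ML engineers.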

Sam Altman@sama · 6月1日83

OpenAI Robotics is hiring, looking for exceptional full-stack hardware, ops, systems, and ML engineers to help us program and manufacture robots that are useful for society. AI should be able to help people in the physical world. In the short term, we are focused on robots to support skilled workers to build our future infrastructure; in the long term, we imagine everyone having a personal robot doing anything they need. Our world simulation research program, led by Aditya Ramesh (@model_mechanic), has evolved over the past year into OpenAI Robotics. Progress is rapid, and based on a foundation of co-design between robotics hardware and ML research. If you love working hands-on across the robotics stack and want to build the future, please consider joining us. Send an email with your background and evidence of exceptional accomplishment to: robotics-recruiting@openai.com

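Summary: OpenAI announced the OpenAI Robotics team and is hiring full-stack hardware, systems, and ML engineers to program and manufacture robots that serve society. The effort is led by Aditya Ramesh, whose world simulation research program has evolved into robotics research with an emphasis on co-design between hardware and ML research. The short-term goal is to support skilled workers in building future infrastructure; the long-term vision is a personal robot for everyone.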

AK@_akhaliq · May 30 · 55

DynaFLIP: Rethinking Robotics Perception via Tri-Modal-Dynamics Guided Representation


AK@_akhaliq · May 30 · 62

Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments


swyx@swyx · May 29 · 45

hear me out: 2016, but nobody pays anything because data

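Summary: shift, an AI service provider, is launching a free cleaning service in New York. After a user books, a vetted shift operator arrives wearing recording equipment and cleans at no charge. In exchange, the cleaning session is recorded, and the resulting data on how people perform everyday tasks is used to train robots; the value of that data funds the free service. Personal information in the recordings is anonymized. The model is meant to make the AI transition tangible, with plans to expand to plumbing, repairs, and errands worldwide.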

AK@_akhaliq · May 29 · 58

GEM: Generative Supervision Helps Embodied Intelligence


AK@_akhaliq · May 28 · 64

PhysX-Omni: Unified Simulation-Ready Physical 3D Generation for Rigid, Deformable, and Articulated Objects


Rohan Paul@rohanpaul_ai · May 28 · 59

China has started giving every humanoid robot a 29-character ID code, with 28,000+ assigned so far. The system works like a robot passport: each unit gets a code that records its country link, maker, model type, and individual serial identity. The goal is not only paperwork, because humanoids are moving from demos into public roads, factories, homes, and service jobs where failures need clear responsibility. A robot that falls, damages property, leaks data, or gets modified after sale becomes easier to trace when regulators can link the machine to its maker, seller, user, service history, and recycler. --- scmp.com/tech/policy/article/3354747/china-give-every-humanoid-robot-digital-id-push-boost-industry-standards

Summary: China has begun assigning 29-character identity codes to humanoid robots, with more than 28,000 issued so far. The system works like a robot passport, recording the country, maker, model, and a unique serial number. Its main purpose is clear accountability as humanoids move from demos into factories, homes, and other real settings. When a robot fails, damages property, leaks data, or is modified, regulators can trace it to its maker, seller, user, service history, and recycler.

Rohan Paul@rohanpaul_ai · May 28 · 57

China’s humanoid robot race has moved from lab demos to real shipments. Global humanoid shipments grew nearly 800% in 2025, China now has 140 humanoid robot makers, and 330 new models launched in just 12 months (per IDC Global Humanoid Robot Market Analysis). AGIBOT ranked #1 globally by humanoid robot shipments in 2025, with about 5,200 units and roughly 39% global share.


Rohan Paul@rohanpaul_ai · May 28 · 45

Brett Adcock, CEO of Figure AI: "we're working until midnight every night... we are here every weekend. By end of 2026, we'll be able to put a robot into home and be able to do fairly long horizon work."


SenseTime@SenseTime_AI · May 27 · 36

Our SenseSmart Go AI shop assistant is redefining convenience store service in Shanghai. Come, enjoy, and have a great day!🥰


Berryxia.AI@berryxia · May 27 · 9

Wow, this robot is great. I want one!

SenseTime@SenseTime_AI · May 27 · 59

Our SenseSmart Go AI shop assistant is redefining convenience store service in Shanghai. Come, enjoy, and have a great day!🥰


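Rohan Paul@rohanpaul_ai · May 26 · 48

Dexterity demonstrations showing a range of finger movements by robotic hands. Humanoid usefulness depends less on walking than on hand manipulation. Useful work begins where fingers meet the world: grip, slip, pressure, cable routing, recovery from mistakes.

Summary: The post argues that a humanoid robot's usefulness depends more on hand manipulation than on walking, and that useful work starts where fingers meet the world (grip, slip, pressure control, and so on). The quoted post uses SharpaWave as an example: it performs rapid hand cycles at more than 4 per second, showing an engineering balance between strength and speed. Its Dynamic Tactile Array uses visuo-tactile sensing, with a camera and more than 1,000 tactile pixels integrated into the fingertip.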

Rohan Paul@rohanpaul_ai · May 26 · 59

One engineering challenge in dexterous robot hands is balancing strength and speed. Here, a SharpaWave performs rapid hand cycles at over 4 per second. The Dynamic Tactile Array uses visuo-tactile sensing: the fingertip integrates a camera and 1,000+ tactile pixels.


Rohan Paul@rohanpaul_ai · May 26 · 22

This autonomous weeding robot uses AI vision to detect weeds among young crops and eliminates them instantly with targeted high-precision laser pulses. Onboard GPUs map every plant position in real time and direct the lasers precisely at weeds. @carbon_robotics


Rohan Paul@rohanpaul_ai · May 26 · 35

This autonomous weeding robot uses AI vision to detect weeds among young crops and eliminates them instantly with targeted high-precision laser pulses. Onboard GPUs map every plant position in real time and direct the lasers precisely at weeds.


Rohan Paul@rohanpaul_ai · May 25 · 55

Home robots are leaving stage demos and entering the only test that really matters: ordinary family life. X Square Robot is starting to move its next-gen home robot into real households. It runs on WALL-B, a world model designed to connect vision, language, touch, action, and physical prediction, which is exactly what a home robot needs when the real world refuses to stay neat. A kitchen is not the controlled environment of a factory floor; it is a moving negotiation between habits, clutter, pets, children, half-finished chores, and objects that never return to the same place twice. That is where Moravec’s paradox shows up: tasks that feel effortless to humans, like picking up clutter, avoiding pets, or judging what belongs where, are often brutally hard for robots. Would you bring a robot home for daily chores?

Summary: X Square Robot is moving its next-generation home robot into real households for testing. The robot runs on WALL-B, a world model designed to connect vision, language, touch, action, and physical prediction so that it can cope with uncontrolled, messy home settings. The aim is to overcome Moravec's paradox: household tasks that are effortless for humans are exceptionally hard for robots. According to the official statement, the robots are gradually entering homes following the launch event; they are still learning and may be slow or clumsy, but each household will help them understand the world better.

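Berryxia.AI@berryxia · May 25 · 60

Companies like this will only become more common in China in the AI era! Unitree is just the vanguard. While global giants compete over humanoid robots that show off their muscles, one Chinese robotics company skipped the most eye-catching route and quietly built a machine that can actually do work. They dropped Unitree WVLA 2.0 into a real meeting room with a table in complete disarray (water bottles, paper, clutter, coffee cups) and had it clean up in a single take, fully autonomously, across multiple tasks, without faltering under strong external disturbances. Once the video came out, the global robotics community blew up. This is Unitree WVLA 2.0's meeting-room clutter-cleanup test. The story is that simple, yet absurdly hardcore. For the past few years, the robotics world's favorite footage has been fancy performances in "perfect lab environments": dancing, backflips, carrying plates. But put those robots in a real office, with chairs out of place, people walking around, and a table that is never tidy, and 99% of demo robots freeze on the spot. This time Unitree went the opposite way. They put WVLA 2.0 in a real meeting room that had not been staged at all, with the table a complete mess and with people moving, objects shaking, and unexpected interference around it. The result: the robot made its own decisions throughout, identifying trash, sorting it, wiping the table, and arranging items in one continuous run, with no editing, no teleoperation, and no "lab magic." The most striking detail is the single take: it means the whole process had no resets, no failed attempts redone, and no post-production fixes. It finished the job in the real physical world in one go. This is not another demo video that merely "looks impressive"; it is hard evidence of robots moving from the lab into the real world. Unitree started out with high-performance consumer quadruped robots (its G1 and H1 series have long been known worldwide), and WVLA 2.0 is clearly a key step toward "robots for practical scenarios." Wheel-leg hybrid? Robotic arm plus mobile platform? Whatever the architecture, the point is that it can really handle "cleaning up a mess," the most annoying of everyday tasks. Behind this is one more demonstration that Chinese robotics companies, facing chokepoints and technology blockades, are pushing back with real engineering capability. You can see the gap for yourself tonight. Go to Unitree's official account and watch the full video (single take, no edits), and you will find that it handled everything: wobbling water bottles, piled-up clutter, human interference. This is not a lab toy; it is a real prototype of future office and home cleaning robots. While Big Tech and Western giants are still competing over the "most human-like" robot showcase, Unitree is winning practical scenarios one step at a time with the robot that is "best at getting work done." And now you know.

Summary: Unitree released a video of its WVLA 2.0 model performing a clutter-cleanup test in a real meeting room. The test was filmed in a single take with no edits. With tables and chairs in disarray, objects scattered around, and people walking by as strong external disturbances, the robot autonomously completed a multi-task sequence of identifying, sorting, cleaning, and arranging. The test is meant to show robots moving from the lab into the messy real world, in contrast to the "perfect lab environment" demos common in the industry.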