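Fei-Fei Li argues that large language models (LLMs) learn only patterns in text: they can describe a room but cannot grasp physical changes such as a chair moving, glass breaking, sunlight shifting, or a robot pushing a cup. World models, by contrast, try to learn the hidden structure behind what we see, so they can predict views the camera never captured, model object behavior, and support agents acting in real or virtual environments. Understanding a new viewpoint, predicting the result of a push, and deciding the next action all need one shared internal model covering space, causality, and consequence.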
Great piece from Dr. Fei-Fei Li (@drfeifei)
"The world is not made of words….
A model that masters simulation can project its understanding into pixels for human consumption, and into action predictions for embodied agents."
LLMs learn patterns in text, so they can explain a room, but they do not naturally know how the room changes when a chair moves, glass breaks, sunlight shifts, or a robot pushes a cup.
A world model tries to learn the hidden structure behind what we see, meaning it can predict views the camera never captured, model object behavior, and support agents that act inside real or virtual environments.
To see a world from a new angle, to predict what happens when something is pushed, and to decide what to do next all require a common internal model of space, causality, and consequence.
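A toy sketch can make the "common internal model" idea concrete. It is not from Dr. Li's piece: the 1-D table, the `State` fields, and the `step`, `cup_on_table`, and `plan` helpers are all hypothetical names invented for illustration. The point is only that one transition function serves both prediction and action selection.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class State:
    cup_x: int      # cup position on a 1-D table, in cells (assumed toy world)
    table_end: int  # last cell that is still on the table


# Hypothetical action set: each push shifts the cup by one cell.
ACTIONS = {"push_left": -1, "stay": 0, "push_right": 1}


def step(state: State, action: str) -> State:
    """Toy transition model: predict the next state for a given action."""
    return State(state.cup_x + ACTIONS[action], state.table_end)


def cup_on_table(state: State) -> bool:
    """Consequence check: is the cup still on the table?"""
    return 0 <= state.cup_x <= state.table_end


def plan(state: State, goal_x: int) -> str:
    """Choose the action whose simulated outcome is safe and nearest the goal.

    Assumes the cup starts on the table, so "stay" is always a safe option.
    """
    safe = [a for a in ACTIONS if cup_on_table(step(state, a))]
    return min(safe, key=lambda a: abs(step(state, a).cup_x - goal_x))


start = State(cup_x=3, table_end=4)
predicted = step(start, "push_right")         # prediction: what happens if pushed
print(predicted.cup_x, cup_on_table(predicted))
print(plan(start, goal_x=4))                  # decision: reuse the same model
```

Here `plan` never touches the real cup: it imagines each push with `step`, rejects the ones that would knock the cup off the table, and picks the best of the rest. A learned world model would replace the hand-written `step` with a model trained from observation, but the division of labor is the same.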