Recursive self-improvement post by Anthropic: “Each time we release a model, we give it code that trains a small AI model, ask the new model to speed it up. In May 2024, Claude Opus 4 averaged a ~3x speedup. This April, Mythos Preview achieved ~52x.” RSI is happening, and I can't wait to see Mythos.


Ethan Mollick@emollick · June 5 · 47

A real problem with feeling the acceleration viscerally is that current models are really good and it is hard to feel the vibe difference on most individual tasks with new models, even as AIs continue to increase in ability by large amounts (which they actually are doing).

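Nathan Lambert@natolambert · June 5 · 31

I feel like this also goes for a lot of people without Mythos as they learn to use agents too tbf

Summary: Anthropic says per-person code output with Mythos is 3.2x what it was six months ago with Opus 4.5. Nathan Lambert comments that people without Mythos feel something similar as they learn to use agents.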

Claude@claudeai · June 4 · 51

Anton Osika (@antonosika) is the co-founder and CEO of @lovable, where anyone can build software through conversation. His working thesis: the most underrated moat in AI is trust, and earning it takes craft, care, and obsession.

elvis@omarsar0 · June 4 · 48

I am hooked on Dynamic Workflows! The idea of generating harnesses on the fly is so compelling that I reverse-engineered it for my agent orchestrator. And then I built a monitoring dashboard (as an HTML artifact) to track tasks, metrics, and reports. I can now use and monitor dynamic workflows in my agent orchestrator with coding agents like Claude Code, Codex, Pi, and even my own custom-built @dair_ai agent. This is clearly the future of working with agents to accomplish complex, long-running tasks. Some use cases I'm having success with: - Branching deep research tasks (with verification) - Parallel deep research tasks - Session mining of all my agent sessions - Bug hunting - Triaging - Fact-checking - LLM councils - AI simulations - Data synthesis - Evals generation ... and many others Dynamic workflows, like agent skills, feel like an important primitive to not only get the most out of agents but also incorporate dynamic behaviors and important components like cooperation and verification. There is so much exploration ground here. The exciting part is that this is not limited to coding tasks; it extends to business use cases and many other technical domains like science and research.

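Summary: Elvis Saravia reverse-engineered Dynamic Workflows and integrated them into his own agent orchestrator, and built an HTML monitoring dashboard to track tasks, metrics, and reports. The workflows run on coding agents such as Claude Code, Codex, and Pi, as well as his own @dair_ai agent. Successful use cases include branching deep research, parallel deep research, session mining, bug hunting, triage, fact-checking, LLM councils, AI simulations, data synthesis, and evals generation. He sees dynamic workflows, like agent skills, as a key primitive for complex, long-running tasks, extending beyond coding to business, science, and other domains.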
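
One of the use cases listed above, parallel deep research with verification, can be sketched roughly as a fan-out followed by a check on each result. This is only an illustration of the pattern, not the orchestrator described in the post: `run_agent`, `verify`, and `parallel_research` are hypothetical names, and the agent call is a stub standing in for a real coding agent.

```python
import asyncio


async def run_agent(task: str) -> str:
    """Hypothetical stand-in for a real agent call (Claude Code, Codex, ...)."""
    await asyncio.sleep(0)  # placeholder for real agent latency
    return f"findings for {task}"


async def verify(task: str, result: str) -> bool:
    """Hypothetical verifier; a real one might ask a second agent to fact-check."""
    return task in result


async def parallel_research(question: str, branches: list[str]) -> dict[str, dict]:
    """Fan out one sub-task per branch, run them concurrently, then verify each."""
    tasks = [f"{question}: {branch}" for branch in branches]
    results = await asyncio.gather(*(run_agent(t) for t in tasks))
    checks = await asyncio.gather(*(verify(t, r) for t, r in zip(tasks, results)))
    return {
        t: {"result": r, "verified": ok}
        for t, r, ok in zip(tasks, results, checks)
    }


if __name__ == "__main__":
    report = asyncio.run(parallel_research("world models", ["renderers", "simulators"]))
    for task, entry in report.items():
        print(task, "verified" if entry["verified"] else "unverified")
```

The returned dictionary is the kind of structure a monitoring dashboard could read; generating the workflow itself on the fly is out of scope for this sketch.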

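ginobefun@hongming731 · June 4 · 61

Vibe coding: "borrowed leverage" vs. "grown capability"

Summary: @pengzheng_ points out that vibe coding makes people feel smarter and dumber at the same time: they can ship a product but cannot explain how it works. If you cannot reproduce the success without AI, it is borrowed leverage, not grown capability. The goal is not to get from prompt to product but to understand the path and build confidence. When you understand why something works, AI extends your ability; when you do not, AI replaces your learning. Unlimited prompting will eventually ship software; what matters is whether each success turns into experience. Otherwise it is output, not growth in capability.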

Rohan Paul@rohanpaul_ai · June 4 · 58

Great piece from Dr. Fei-Fei Li (@drfeifei) “The world is not made of words.... A model that masters simulation can project its understanding into pixels for human consumption, and into action predictions for embodied agents." LLMs learn patterns in text, so they can explain a room, but they do not naturally know how the room changes when a chair moves, glass breaks, sunlight shifts, or a robot pushes a cup. A world model tries to learn the hidden structure behind what we see, meaning it can predict views the camera never captured, model object behavior, and support agents that act inside real or virtual environments. To see a world from a new angle, to predict what happens when something is pushed, and to decide what to do next all require a common internal model of space, causality, and consequence.

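Summary: Fei-Fei Li argues that LLMs learn only patterns in text: they can describe a room but do not naturally know how it changes when a chair moves, glass breaks, sunlight shifts, or a robot pushes a cup. A world model tries to learn the hidden structure behind what we see, so it can predict views the camera never captured, model object behavior, and support agents acting in real or virtual environments. Seeing from a new angle, predicting the result of a push, and deciding what to do next all require a common internal model of space, causality, and consequence.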

Chubby♨️@kimmonismus · June 4 · 84

OpenAI just wrote: "We also see early signs of recursive self-improvement (RSI) in today’s systems: where AI development is itself accelerated by AI. We expect this to increase competitive pressures among developers and nations, and create governance challenges that existing institutions are not equipped to address. As RSI emerges, societies will need ways to shape the trajectory of AI development and ensure that it serves human interests." The vibe has changed, something is happening.

Ethan Mollick@emollick · June 4 · 55

The capabilities of Claude Code and Codex have expanded a lot in recent months; they added many ways to approach work (subagents, skills, goal, workflows, plugins, etc). Given that the AI labs can use their own AI to help with documentation, a surprising amount is effectively undocumented.

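AYi@AYi_AInotes · June 4 · 64

Some thoughts on using Codex. A few days ago I shared my core principle for using AI, "let the Way govern the technique"; seen from that angle, saving quota is technique, and seeing clearly who controls your productivity is the Way. If you also use Codex and habitually ration your quota, I suggest you read this before deciding whether to keep rationing: yesterday's reset may have just overwritten the buffer you saved up. Tibo (@thsottiaux), who leads OpenAI Codex, posted that there had been 3 small reliability incidents in the past 24 hours, so quotas were reset across all paid plans, with the line "May the tokens flow again." The replies were full of thanks, "Saint Tibo" and "he did it again." After reading through them, I want to say something nobody may want to hear: the quota you carefully saved over the past few days was most likely saved for nothing. First, how Codex quota is counted: not by token, but by reasoning time. There is one 5-hour window, shared by local and cloud tasks. According to community tests, on the Plus plan GPT-5.4 burns the 5-hour quota to 100% in about 40 minutes of reasoning, and GPT-5.3 in about 60 minutes. So if you start a /goal and let it run plan→act→test→iterate nonstop, the quota drops far faster than you expect; you only see a percentage, not how much it burns per minute. Now add the reset. According to community discussion, the reset often does not grant extra quota out of thin air; it moves the start of your next billing cycle forward. So among those who started running right after the reset, some ran 11+ hours of reasoning in one go, while the buffer you painstakingly saved for a big weekend project was overwritten to zero by one reset. Savers lose, spenders win. From April and May to now, Tibo has done several rounds of resets; this is not a one-off, it is the norm. Under the current rules, careful rationing is the suboptimal strategy. That does not mean wasting quota; the system rewards immediate consumption, and you have to play by its rules. What really matters to me, though, is not how to save quota but what this means. Remove the words Codex, quota, and reset, and this is the same story for everyone who works with cloud AI: your productivity is not in your hands; it sits in a system you cannot see into and whose compensation is opaque. Today Tibo is in a good mood and gives you a reset; what if he changes roles tomorrow? Trust that survives on one kind lead's goodwill feels warm while he is there, and once he leaves, the bill comes due all at once. So the real fix is not to wait for the next reset but to avoid betting all your productivity on a pool you do not control. Use local models as a fallback and the cloud for peaks, and keep your own burn-rate record (40 minutes ≈ 100%, so 4 minutes ≈ 10%) to take the rhythm back into your own hands. I think the next dividing line for AI tools is no longer how strong the model is; it is who controls our productivity.

Summary: Tibo, who leads OpenAI Codex, reset quotas on all paid plans after 3 reliability incidents in 24 hours. Codex quota is counted by reasoning time: on Plus, GPT-5.4 exhausts the 5-hour window in about 40 minutes and GPT-5.3 in about 60 minutes. Resets often move the next billing cycle forward, so carefully saved quota gets overwritten while those who spend immediately get more reasoning time. The author argues the system rewards immediate consumption and recommends local models as a fallback and the cloud for peaks, to take back control of productivity.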
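
The burn-rate bookkeeping suggested in the post (40 minutes ≈ 100%, so 4 minutes ≈ 10%) is simple proportional arithmetic. A minimal sketch, assuming quota drains linearly with reasoning time; the function names are hypothetical, and the per-model figures are the community-reported numbers quoted in the post, not official ones.

```python
# Community-reported minutes of reasoning that exhaust one 5-hour window on Plus
# (figures from the post; not official numbers).
MINUTES_TO_EXHAUST = {"GPT-5.4": 40.0, "GPT-5.3": 60.0}


def quota_used_pct(reasoning_minutes: float, minutes_to_exhaust: float) -> float:
    """Estimated percent of the window consumed, assuming linear burn, capped at 100."""
    return min(100.0, 100.0 * reasoning_minutes / minutes_to_exhaust)


def minutes_left(used_pct: float, minutes_to_exhaust: float) -> float:
    """Estimated reasoning minutes remaining at the current usage percentage."""
    return max(0.0, (100.0 - used_pct) * minutes_to_exhaust / 100.0)


if __name__ == "__main__":
    # 4 minutes of GPT-5.4 reasoning is about a tenth of the window.
    print(quota_used_pct(4, MINUTES_TO_EXHAUST["GPT-5.4"]))
    # At 25% used, how much GPT-5.3 reasoning time remains?
    print(minutes_left(25, MINUTES_TO_EXHAUST["GPT-5.3"]))
```

Logging these two numbers per session is one way to see the per-minute burn that the percentage display hides.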

Tibo@thsottiaux · June 4 · 23

Lots of little vectors at OpenAI all pointing in the same direction. Excited to see it all add up and come together over the coming weeks.

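宝玉@dotey · June 4 · 57

Lately Codex GPT-5.5 feels worse at getting work done than Claude Opus 4.8 to me, though that may be because I am building a Mac app, which Opus is better at.

Summary: 宝玉 (@dotey) says Codex GPT-5.5 is worse at getting work done than Claude Opus 4.8, especially for Mac app development, where Opus is stronger. @jesselaunz also reports that Codex suddenly "got dumber": a goal expected to take 2 days was delivered in just 20 minutes, and the user gave it 5/10, the lowest score since they started rating.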

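Ethan Mollick@emollick · June 4 · 50

Leaving aside the question of consciousness, the Ted Chiang piece has a reasonable point about moral atrophy if you let AI make choices. But it is also interesting in light of the fact that repeated randomized trials find AI is apparently a good ethicist. https://x.com/emollick/status/1717198389006176519?s=20

Summary: Ethan Mollick cites a paper in which four pastors, a rabbi, thirteen scholars, and 50 MBAs were asked to compare ethical advice from the New York Times ethics columnist with advice from GPT-4; the result was essentially a tie. The main tweet notes that although Ted Chiang has a reasonable point about moral atrophy when AI makes choices, repeated randomized trials find AI is apparently a good ethicist.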

SemiAnalysis@SemiAnalysis_ · June 4 · 38

Vertical power delivery, flexible moving-pin interposers, and direct-impingement water cooling. Cerebras had to rewrite the mechanical engineering playbook just to keep a single wafer from cracking itself apart.

Ethan Mollick@emollick · June 4 · 56

Deciding that under no circumstances could AI ever be conscious removes a whole bunch of thorny problems that might impact the AI industry if some form of AI consciousness might be possible at some point.

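swyx@swyx · June 4 · 44

you guys know where this is going right

Summary: The quoted tweet praises the Reve 2.0 launch copy. Reve 2.0's core idea: the key to controllable image generation and editing is not denser prompts but a highly detailed, actionable intermediate representation expressed as code. It argues that current image generation models penalize iteration through progressive degradation, while creativity is inherently not a one-shot workflow. Drawing an analogy to Alan Kay's view that people who make software should build their own hardware, Reve holds that a truly serious creative-tools company should train its own models.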

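宝玉@dotey · June 4 · 61

AI agents versus PC and mobile is not a matter of full replacement. Phones did not fully replace PCs, but they let you handle many things anytime, anywhere; AI agents will not fully replace phones and PCs either, but often you will no longer need to open many apps, since giving the agent an instruction will be enough.

Summary: AI agents will not fully replace phones and PCs, but users will not need to open multiple apps; instructing an agent directly is enough. General-purpose agents will become the future operating system, and apps face three fates: dying out, becoming a CLI/MCP, or surviving as a GUI plugin. SaaS products should ship a CLI and Skill soon to adapt to agents.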

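歸藏(guizang.ai)@op7418 · June 4 · 67

http://x.com/i/article/2062359856376610816

# 即览: why is reading Markdown and HTML on a phone so hard?

The "Markdown / HTML reader for phones" I teased earlier is done. It is called 即览. It solves a small problem that has become increasingly annoying lately: someone sends you an AI report, a web-page slide deck, or a Markdown document via WeChat, the Files app, or a group chat, and on your phone it opens as a blank page or raw source, the styling is broken, or you have no idea what to open it with.

.md, .markdown, .html, .htm, .txt, and zipped web pages can all be opened directly with 即览 on iPhone and iPad. Rendering is local, storage is local, nothing is uploaded, and no account is needed. There is a TestFlight link at the end; if you want to try it, just apply. I opened 8,000 slots.

But I did not build 即览 only because a reader was missing. The more direct reason is that I increasingly feel that, with AI involved in content production, the formats in which we exchange content are changing. A lot of text content is landing in Markdown, and a lot of presentation content is landing in HTML. 即览 is just a small tool that fell out when this shift reached the phone.

## Markdown is not just a text format; it is becoming AI's data layer

A few days ago I saw a line from the creator of Obsidian that I found spot on: .md is becoming a Schelling point for AI file interaction. A Schelling point is a choice that nobody mandates but that everyone naturally converges on.

Markdown is a bit like that now. Nobody decreed that AI should use Markdown, and no standards committee announced anything. But in real use, whether a person is writing for AI or AI is writing for a person, things often end up in a .md file.

The reasons are plain. It is plain text, so it is light for models to read and write. It has enough structure: headings, lists, tables, code blocks, and links can all be expressed. It is not wrapped in a layer of complex formatting like .docx. People can open it directly, AI can process it directly, and version control and diffs stay clean.

More importantly, I think Markdown can no longer be understood only as "text in an editor." It is more like the underlying data of an AI workflow. That is how I use it in CodePilot. It has no particularly complex memory mechanism; much of its memory is simply a set of Markdown files. AI writes to them, AI reads from them, and I can open and edit them myself. Going further, widgets in CodePilot can use these local Markdown files and memory as data sources. When a file changes, the component display changes with it.

At that point Markdown is no longer just "an article to read." It becomes a very light local data layer: people can view it, AI can read it, and tools can generate new interfaces and interactions on top of it. This is why I think the many people still competing on Markdown editors may be aiming too narrowly. The interesting thing is not another prettier editing box but treating Markdown as data and building new ways of reading, managing, and interacting.

## HTML is becoming the presentation layer for AI content

At the other end is HTML, and this trend has also become increasingly clear. Last month I open-sourced a PPT Skill that generates slide decks as web pages. It reached 10,000 stars in 25 days, and I later kept seeing decks made with it at in-person defenses, trade shows, and talks. That confirmed one thing for me: in many settings what people want is not a standard .pptx file but something they can present, that people can understand, and that is quick to share.

The Claude Code team has recently been making the same point. They have an article on why more and more output is using HTML rather than Markdown. The reasons are direct: HTML has higher information density, makes visual hierarchy easier, is better for charts, layout, and interaction, and is easier for others to open and read.

This matches my own experience. Markdown is good for accumulating content, but it gets hard to read once it is long. With a report of thousands or tens of thousands of characters piled into one .md file, people struggle to actually read it even when the structure is right. HTML is the reverse. It can use typography, space, color, charts, and interaction to organize information into something that "can be consumed." It is not better at storing facts; it is better at helping people understand them.

So I increasingly see these as two separate things: Markdown is the data layer, and HTML is the presentation layer. Keep the underlying content in Markdown: clean, readable, and version-controllable. When it needs to be shown, presented, or shared externally, render it to HTML. This is not some grand new standard; it is more a division of labor that grew naturally out of AI workflows.

## But the chain breaks on the phone

The content exists and the file has been sent; the problem is the last step: people often open it on a phone. The desktop is fine. You have a browser, an editor, and VS Code if all else fails. The phone is different. Especially when you receive an AI-generated report, a web-page deck, or a Markdown document in WeChat, the usual experience is that it will not open, shows raw source, has broken styling, or makes you jump back and forth between several apps. It is a small thing, but very annoying.

An IM app like WeChat is not, at heart, a file reader. Its priorities are chat, preview, and forwarding, not seriously opening a Markdown or HTML file. Browsers were not designed for this either. A browser's default job is "give me a link and I will open the page," but what people send you is often a local file, not a link. You can of course take a roundabout route to hand the HTML to a browser, but the whole chain is long and awkward. Many Markdown tools lean toward editing and note-taking and are not necessarily suited to quickly opening a file someone sent. Some even require importing, syncing, creating a vault, or registering an account. HTML adds a security concern: an unfamiliar file may contain scripts, and you may not want them to run by default.

So I have long felt that something simple was missing: a way to open, safely and conveniently on a phone, the files common in AI workflows. That is 即览.

## 即览 is deliberately narrow: open, read, keep

即览 is not an editor and is not connected to AI. (As an aside, I have to brag about the app icon CodeX drew; it is adorable.) I was clear from the start that it does only three things: open, read, keep.

When you receive a file, choose 即览 from WeChat, the Files app, or the system share sheet to open it. It supports .md, .markdown, .html, .htm, and .txt, as well as web assets packaged as a .zip. All files are processed locally, with no upload and no account.

For reading Markdown, I tuned mainly for long-form reading. Font size, line spacing, and background are adjustable; long tables scroll horizontally; documents with heading structure get a table of contents for jumping around. Common Obsidian syntax, such as task lists, callouts, footnotes, frontmatter, and tags, is supported as far as possible. Night mode and color themes are supported too.

For reading HTML, I care more about control. It renders locally in the system WebView, supports zoom and portrait/landscape switching, and can switch between mobile and desktop modes. Dynamic scripts are off by default. You usually do not know whether an unfamiliar HTML file contains scripts, so 即览 does not assume script execution; for pages that really need JS, you turn it on manually.

ZIP support is also built for real scenarios. Many AI-exported web pages are not a single HTML file but an index.html plus an assets folder. 即览 unzips the archive and finds the entry point automatically, and local images and CSS load properly, so styles and images are not lost.

Files you have opened stay in local history automatically, so you can find them in the app next time. Importing the same file again does not create a second copy, and important files can be favorited.

That is its current boundary. No cloud sync, no accounts, no editing, no AI. Not because these are unimportant, but because a viewer should first do "open it and read it through" cleanly.

## 即览 follows from the previous two projects

Looking back, 即览 is not an isolated little tool. I built the PPT Skill last month because I believe HTML will be a natural form for AI-generated presentations. It will not necessarily replace PowerPoint, but for "quickly generating something you can present," HTML is light enough, open enough, and well suited to direct generation by models. I built CodePilot because I believe Markdown will be a natural carrier of data and memory in AI collaboration. It is not the prettiest format, but it is the easiest for people, models, and tools to use together.

即览 is the third step: these formats cannot stop at "generated"; people have to be able to open, read, and keep them. The first two projects lean toward production; 即览 leans toward consumption. AI can already generate Markdown and HTML, but if those files break as soon as they reach a phone, then however smooth generation was, the result never really reaches people. 即览 covers that last mile.

## But this is far from finished

即览 currently covers only the shallowest layer: you receive a file and open it. Beyond that, several problems remain unsolved.

Management, for example. Many people already have large numbers of Markdown and HTML files scattered across phones, cloud drives, chat histories, and app caches. They are not worthless; they are just too scattered to find or manage.

Sharing, for example. 即览 solves "someone sent me this; how do I view it?" The reverse, "I made an HTML file; how do I let others open it easily?", is still a hassle. Send a file and the recipient may not be able to open it; send a link and you have to find somewhere to deploy it.

Cross-device use, for example. Reading halfway on the phone and continuing on the computer, or generating a report on the computer and pushing it to the phone, both feel natural. But once you do sync, you run into accounts, the cloud, privacy, and complexity.

即览 is still small, small enough that I do not want to dress it up as a big product. But it sits exactly in a gap I hit every day: AI has generated the content, and I just want to take a proper look at it on my phone. If Markdown, HTML, and web-page decks keep tripping you up too, give it a try.

> TestFlight: https://testflight.apple.com/join/sv7KTqn9

I would also love to hear how you see this: with AI involved, what will documents, presentation, and reading become?

Summary: 即览 is an iPhone/iPad app that solves the problem of AI-generated .md, .html, and similar files not opening properly on phones. It renders locally, needs no upload or account, and has 8,000 TestFlight slots. The author cites an Obsidian view that .md is becoming a "Schelling point" for AI file interaction, and notes that the Claude Code team considers HTML better suited as a presentation layer. 即览 is narrowly scoped: open, read, and keep only, with no editing, cloud sync, or AI. It supports .md/.html/.txt and zipped web pages, with dynamic scripts off by default for safety.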
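
The ZIP handling described in the article (an export made of index.html plus an assets folder, with the entry point found automatically) could look roughly like the following. This is a hedged sketch, not 即览's actual implementation: `find_entry_html` and its ranking rule (prefer index.html, then the shallowest path) are assumptions made for illustration.

```python
import io
import zipfile
from typing import Optional


def find_entry_html(archive: zipfile.ZipFile) -> Optional[str]:
    """Pick the most likely entry page inside a zipped web export (assumed heuristic)."""
    html_names = [
        name for name in archive.namelist()
        if name.lower().endswith((".html", ".htm"))
    ]
    if not html_names:
        return None

    def rank(name: str):
        base = name.rsplit("/", 1)[-1].lower()
        # index.html first, then fewer directory levels, then alphabetical order.
        return (base not in ("index.html", "index.htm"), name.count("/"), name)

    return min(html_names, key=rank)


if __name__ == "__main__":
    # Build a small in-memory archive shaped like a typical AI-exported page.
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as z:
        z.writestr("report/assets/style.css", "body{}")
        z.writestr("report/notes.html", "<p>notes</p>")
        z.writestr("report/index.html", "<h1>Report</h1>")
    with zipfile.ZipFile(buf) as z:
        print(find_entry_html(z))
```

A viewer would then extract the archive to a private directory and load the chosen entry with scripts disabled, so relative links to images and CSS resolve locally.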

DogeDesigner@cb_doge · June 4 · 39

Grok Imagine 1.5 video quality is seriously impressive. 🔥

Ethan Mollick@emollick · June 4 · 62

I actually read this & it is super weird. It appears to be an argument that prior machine learning systems (not generative AI) did not generate savings due to data issues, so that will lead to a lack of investment into current AI systems. Also, it cites the mostly fake “MIT study”.

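向阳乔木@vista8 · June 4 · 58

Chatting with a friend, he mentioned last year's Top 10 prompts of the year as tallied by Qwen (千问). The topics were: 1. Stocks 2. Bazi (birth-chart fortune-telling) 3. Relationship advice 4. WeChat Moments captions 5. Sightseeing recommendations 6. Double Color Ball lottery numbers 7. Insomnia 8. "Solve this problem" 9. Division of property in divorce 10. The meaning of life. In short, it feels like there are few viable paths for consumer (2C) AI. 1. Make money / save money / get smarter: direct economic return or skill improvement. Stock trading, discount shopping, side-hustle pipelines. Tokens are spent to get an output, and the output brings in money. 2. For the lazy / saving time: grabbing hospital appointment slots, buying train tickets, automatically adding the 10 things your wife listed in a WeChat group to your calendar and shopping cart. People will pay little, because personal time is not worth much. 3. Emotional / companionship, emotional value: digital avatars, pets, mysticism, and so on. Tokens are spent for emotional satisfaction.

Summary: Qwen's Top 10 prompts of the year: stocks, Bazi, relationship advice, WeChat Moments captions, sightseeing recommendations, Double Color Ball lottery numbers, insomnia, "solve this problem," division of property in divorce, and the meaning of life. The author sees limited paths for consumer (2C) AI, in three categories: making money directly, saving time for the lazy (low willingness to pay), and emotional value; overall the space is narrow.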

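Berryxia.AI@berryxia · June 4 · 45

Just saw Fei-Fei Li's latest article. The language barrier is gone, but I still like to translate it myself and read the full content. 👇🏻 "The world is not made of words": Fei-Fei Li on the three forms of world models and spatial intelligence (translated) > “The world is everything that is the case.” -- Ludwig Wittgenstein. 1. The world is not made of words. Language models excel at text, concepts, and reasoning, but the physical world runs on space, time, physics, and geometry. Fei-Fei Li and her World Labs team argue that spatial intelligence is AI's next frontier and that world models are the key path to it. Yet the term "world model" is now badly overloaded: the computer vision, robotics, reinforcement learning, and generative AI communities understand it very differently. Building on the classic POMDP (partially observable Markov decision process) agent-environment loop, Li offers a clear taxonomy. Three core functions of a world model: 1. Renderer: outputs observations, mainly pixels, for human eyes. It optimizes for visual fidelity and plausibility. Typical examples: text-to-video models, Google's Genie, World Labs' RTFM. Limitation: the picture can be flawless yet fall apart under physical interaction or close inspection, "good-looking but not sturdy." 2. Simulator: outputs state, a geometrically and physically accurate representation of the world. It must strictly respect physics, collisions, dynamics, and material properties. It serves both humans (design, architecture, film) and machines (training RL agents, robots, autonomous driving). Li sees this as the most critical piece: the structural backbone from which both rendering and planning can be derived. The biggest current challenges: extreme scarcity of 3D/physics data, the sim-to-real gap, and the difficulty of scaling multi-physics simulation. 3. Planner: outputs actions given observations and goals. It closes the perception-action loop and includes Vision-Language-Action models and the emerging "World Action Models." Most are still confined to constrained lab environments. Core view: the simulator is the most important and the least hyped of the three. Renderers are already commercially mature (the video generation market). Planners are attracting heavy attention and funding (robotics companies). The simulator connects the two and is the key to reliable real-world applications. The most exciting progress is the blurring of boundaries: the same underlying knowledge (geometry + physics + dynamics) should support rendering, simulation, and planning at once. World Labs' Marble is a typical example: it generates explorable 3D environments from multimodal prompts, outputting both Gaussian splats (for visuals) and collision meshes (for physics). The long-term vision is a unified world model, one foundation model that switches fluidly among photorealistic rendering, accurate physical simulation, and action planning. Closing: language taught machines to "talk about" the world; world models are how machines will truly understand, imagine, reason about, and act in it. This is an extremely information-dense article, with technical and philosophical depth, that also clearly signals World Labs' strategic direction.

Summary: Building on the POMDP framework, Fei-Fei Li divides world models into three functions: Renderer (outputs pixels), Simulator (outputs geometric/physical state), and Planner (outputs actions). Renderers are commercially mature (e.g. text-to-video), planners are attracting capital, and simulators are the most critical but short on data. World Labs' Marble generates explorable 3D environments from multimodal prompts, outputting both Gaussian splats and collision meshes. The long-term goal is a unified model that switches fluidly among rendering, simulation, and planning.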

Josh Woodward@joshwoodward · June 4 · 25

These are so fun!

Summary: The quoted post says: our current favorite Gemini Omni trend is using real-world footage to create unexpected twists. Try making one yourself! 🧵

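meng shao@shao__meng · June 4 · 63

Are engineering, product, and design merging into a single "Builder" role? Don't listen to investors and course-selling influencers casually coining concepts and selling anxiety! In real-world engineering, that is far from the case. @leerob of the Cursor team lays it out objectively. "Role merging" is described too simply: even if a company has a thousand people titled Member of Technical Staff (MTS), the organization still needs someone who treats product or design as their Main Thing™, with depth, priority, and accountability concentrated on one thing. MTS itself is not necessarily wrong, but in his view it is often used to package a diluted "everyone is a builder" narrative: titles get blurrier, but responsibility does not disappear. AI lowered the barrier to writing code, not system complexity: easier code generation does not mean software can be delivered safely and sustainably. If non-engineers produce large amounts of low-quality code (AI slop) and there are no strong engineers to constrain architecture, debt, and boundaries, the pain is deferred: maintenance, incidents, and collaboration costs will blow up. The implicit judgment: the Builder narrative tends to underestimate "taming complexity," which remains one of engineering's core values. The Silicon Valley narrative "overfits the whole industry to startups": he grants that startups are sometimes a leading indicator of industry change, but generalizing "one person doing many jobs in a small team" to every organization distorts the picture. He uses JPMorgan as a counter-question: in a large, heavily regulated, process-heavy company, can a PM really also do engineering and design? His expectation: extremely hard, even unrealistic, not because people are not smart enough but because role structure, compliance, risk, division of labor, and political costs differ. What is truly hard to disrupt is "the human side," not the tool side: role boundaries exist not only because of tech stacks but because organizational memory, power, and incentives have hardened them, for example internal politics, systems left undocumented for 15 years and held together by individuals, and knowledge monopolies tied to job security. AI can hardly flatten these overnight. Specialization will not disappear, and AI's impact on knowledge work will be slow: he explicitly rejects the idea that experts and specialist teams will become obsolete. In collaboration, having a person or team that truly understands a domain is still efficient and reassuring. He judges that AI disruption of knowledge work will take on the order of a decade, because the bottleneck is mainly sociological and organizational (trust, division of labor, power, process, accountability), not just intelligence or skill.

Summary: meng shao (@shao__meng), citing a tweet by leerob, pushes back on the idea that engineering, product, and design are merging into a "Builder" role. Even with many MTS titles on a team, someone still has to own product or design as their main job; responsibility does not vanish because titles blur. AI lowered the barrier to generating code but not system complexity: non-engineers shipping low-quality code (AI slop) without strong engineers constraining the architecture will cause maintenance costs to blow up later. The startup pattern of one person in many roles does not fit large regulated organizations such as JPMorgan. What is truly hard to disrupt is "the human side": internal politics, critical systems undocumented for 15 years, and knowledge monopolies. Specialization will not disappear, and working with real experts remains efficient. AI disruption of knowledge work will take on the order of a decade, with the bottleneck in sociology and organization.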

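Berryxia.AI@berryxia · June 4 · 58

I came across official OpenAI news today that flips the mainstream belief that "a general-purpose model can handle everything." They have officially upgraded GPT-Rosalind. This is not a simple iteration: it is an enterprise-grade model series built specifically for life sciences research. Underneath, it fuses GPT-5.5's strongest agentic coding and tool-calling capabilities with deep life sciences intelligence. Previously, pharma companies doing drug discovery, molecular analysis, experiment design, and wet-lab workflows often got stuck at the gap where "AI can only offer ideas, and people still have to validate the real experiments step by step." Now Rosalind embeds agentic capability directly into these workflows: it can autonomously generate hypotheses, call tools to run simulations, design experimental protocols, and even track the reproducibility of the whole workflow. What is more, it is purpose-built: not a life sciences prompt on top of a general model, but specifically strengthened from the ground up for real scenarios such as drug discovery, protein design, and experiment optimization. Enterprise scale means it can handle massive experimental data, cross-team collaboration, and compliance audits, complex pipelines that previously only top labs could afford. This punctures what may be the AI industry's biggest collective illusion: while everyone is still competing on the parameters and benchmark scores of a single general model, OpenAI is showing through action that what really changes industries is pushing agentic intelligence down into vertical domains, turning AI from a "chat assistant" into "research infrastructure." The name Rosalind is also meaningful: a tribute to Rosalind Franklin, the underappreciated scientist whose work laid the foundation for the structure of DNA. AI is finally starting to play a partner role in life sciences that can actually be deployed, rather than staying at paper demos.

Summary: OpenAI added new capabilities to GPT-Rosalind, its enterprise-grade model series built for life sciences research, fusing in GPT-5.5's agentic coding and tool-calling abilities. Rosalind can autonomously generate hypotheses, call tools to run simulations, design experimental protocols, and track workflow reproducibility, targeting drug discovery, molecular analysis, experiment design, and wet-lab workflows. The model is not a general model with a life sciences prompt; it is specifically strengthened from the ground up for scenarios such as drug discovery and protein design, and supports enterprise-scale data processing, cross-team collaboration, and compliance audits. The name honors Rosalind Franklin, the scientist behind the structure of DNA.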

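ginobefun@hongming731 · June 4 · 48

#BestBlogs morning brief 06-04, three highlights: ① Microsoft CEO Nadella spoke at length at Build, explaining the "Frontier Intelligence Platform" strategy and private eval sets as core enterprise AI IP very thoroughly; worth a read. ② Moonshot AI's Kimi Work Beta is live, with 92% of its code written by AI; the desktop Working Agent has officially arrived. ③ Tencent Research Institute's 30,000-character report breaks down how super-individuals come together. Core formula: organizational competitiveness = talent density × AI leverage / organizational friction.

Summary: Microsoft CEO Nadella laid out the Frontier Intelligence Platform strategy at Build, stressing private eval sets as core enterprise AI IP. Moonshot AI's Kimi Work Beta launched, with 92% of its code generated by AI, and the desktop Working Agent was officially released. Tencent Research Institute published a 30,000-character report proposing a formula for organizational competitiveness: talent density × AI leverage / organizational friction.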
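
To make the shape of the quoted formula concrete, here is a toy calculation. The digest gives only the formula; the function name, the unitless scales, and the example numbers below are all assumptions for illustration and do not come from the Tencent report.

```python
def org_competitiveness(talent_density: float, ai_leverage: float, friction: float) -> float:
    """Toy version of: competitiveness = talent density x AI leverage / organizational friction."""
    if friction <= 0:
        raise ValueError("friction must be positive")
    return talent_density * ai_leverage / friction


if __name__ == "__main__":
    # Made-up, unitless inputs: halving friction doubles the score,
    # the same effect as doubling AI leverage.
    print(org_competitiveness(2.0, 3.0, 2.0))
    print(org_competitiveness(2.0, 3.0, 1.0))
    print(org_competitiveness(2.0, 6.0, 2.0))
```

Because friction sits in the denominator, reducing it is as effective in this model as adding leverage, which is presumably the point of writing the formula this way.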

Berryxia.AI@berryxia · June 4 · 37

Holy crap! Codex is really about to take off now...

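宝玉@dotey · June 4 · 26

Question: Claude Code (Desktop) keeps popping up dialogs asking to confirm permissions. Is there any way to avoid having to click Allow every time? It is annoying, and I have already enabled Bypass Permissions.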

Orange AI@oran_ge · June 4 · 25

Silicon Valley's English-language AI Twitter is 100 times more anxious than Chinese-language AI Twitter.

Chubby♨️@kimmonismus · June 4 · 60

Microsoft Build. My personal review. For me, this was the first time I had the chance to attend Microsoft Build, at Microsoft's invitation. To be honest, I didn't really know what to expect, but I was especially looking forward to the keynote. And it wasn't just the keynote: I also visited GitHub HQ, saw the event hall, sat in on numerous sessions, and even met Satya Nadella in person. Holy moly. It truly exceeded all my expectations. 2026 is turning out to be a crazy year for me. It started with NVIDIA GTC in San Jose in March, followed shortly after by a trip to China - Guangzhou and Beijing - then Google I/O in California, and now Microsoft Build, also in California. What a wild ride! I met incredible people and had fascinating conversations late into the evening about LLMs, chips, energy, geopolitical challenges, financial markets, and so much more. What impressed me most was the pioneering spirit, the optimistic atmosphere, the enthusiasm for being at the forefront of this tech-revolution. Optimism mixed with passion and a love of building, that's what I take away from all these trips. Microsoft was no exception. I got a behind-the-scenes look, heard exclusive GitHub sessions, experienced a personal demo of the flagship Surface Laptop Ultra, met researchers, and much more. My honest take on Microsoft Build: Microsoft is taking feedback seriously and is trying to set things in motion and drive change on every front. Seven new AI models - clearly not aiming for the absolute top end, but positioned in the mid-range, roughly at Sonnet level, and affordable; a new laptop with a new chip meant to rival the MacBook Pros, which, frankly, at first glance even seems capable of pulling it off; bold experiments like Project Solaris and the agentic handheld (yes, I've read all the Rabbit comparisons :D); a revamped Copilot app; the rollout of agentic features into enterprise editions with a new quantum chip; and plenty more. It certainly wasn't boring. 
Time will tell what succeeds, but I'd argue Microsoft is on the right track.

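Summary: Kim attended Microsoft Build for the first time, at Microsoft's invitation, visited GitHub HQ, sat in on multiple sessions, and met Satya Nadella, and says it far exceeded expectations. Microsoft announced 7 new AI models (mid-range, roughly Sonnet level, affordable), a new Surface Laptop Ultra with a new chip positioned against the MacBook Pro, experimental projects such as Project Solaris and an agentic handheld, a revamped Copilot app, and agentic features in enterprise editions along with a new quantum chip. The author believes Microsoft is taking feedback seriously and pushing change on every front.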

Chubby♨️@kimmonismus · June 4 · 14

I'm confused. And excited at the same time. I got the feeling OpenAI is preparing for some big releases. Superapp? 5.6? Let it come!

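Ethan Mollick@emollick · June 4 · 39

This story was so implausible that the only way it even (kind of) made sense is if it is some sort of internal accounting placeholder at a cloud provider using their own compute. And even then it seems unbelievable for a wide number of reasons.

Summary: @binarybits does not believe any company accidentally spent $500 million on Claude in a month; the figure is implausibly large. The main tweet says the story is hard to believe and that the only possible explanation is an internal accounting placeholder at a cloud provider, and even then many doubts remain.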

Fei-Fei Li@drfeifei · June 4 · 78

http://x.com/i/article/2062244283940544512 # A Functional Taxonomy of World Models > “The world is everything that is the case.” — Ludwig Wittgenstein, Tractatus Logico-Philosophicus, 1921 ## The world is not made of words. In an earlier essay, we argued that spatial intelligence is AI’s next frontier and that world models are the path to it. Here, the World Labs team and I want to go one level deeper: of the many things now being built and called ‘world models,’ which functional pieces actually compose that capacity — and what is each one for? Language models have given machines an extraordinary command of concepts, vocabulary, and reasoning, but the physical world, virtual or real, runs on a different substrate. Where language models learn the statistical structure of text, world models learn the statistical structure of space and time: how light falls on a surface, how a garden looks from an angle no camera has captured, how objects respond to force and follow the laws of physics. That makes “world model” one of the most important and most overloaded terms in AI today. Computer vision, robotics, reinforcement learning, and generative AI each claim to be building world models, and each means something quite different. A video model that produces gorgeous but physically impossible flames, a language model improvising a playable game, and a physics engine that faithfully simulates combustion all go by the same name. The ancient Greeks could never agree on what the world was made of, whether fire, water, or indivisible atoms, because “world” was never a single thing. It was always a stand-in for whatever totality a given thinker needed to reason about. AI has inherited the same problem, at exactly the moment when the field needs precision. ## The loop beneath the taxonomy Cutting through that confusion starts with a diagram older than any of the technology in question. 
Reinforcement learning textbooks, including the canonical Sutton and Barto, have used a version of the same picture for decades to describe how an agent interacts with a world. The formal name for this picture is the partially observable Markov decision process, or POMDP, and the original definition of the term “world model” belongs to that tradition. An agent, which can be a person, a robot, or a software system, takes actions. Those actions affect the state of the world. The agent never sees the state directly. What reaches the agent are observations: the photons that fall on a retina, the readings from a sensor, and the pixels in a video frame. New observations inform new actions, and the loop continues. The word “state” needs unpacking, because the meaning shifts from field to field. This is not the chemist’s state, the difference between solid, liquid, and gas. This is the physicist’s and roboticist’s state: a complete description of what is happening in the world at a given moment, including every object, every position, every velocity, every property. State is the underlying reality of the world; complete in principle, but never directly visible to any agent inside it. Observations are an agent’s partial view of that reality. Actions are what the agent does in response. This loop — agent to action to state to observation and back — is the structure that gave the modern term “world model” its technical meaning. The phrase itself is older, traced to Kenneth Craik’s 1943 proposal that minds reason by running “small-scale models” of reality, and carried into neural networks by the late 1980s and early 1990s. And the loop also explains what people mean by the term today. The different things now being called world models are in fact different projections of this same loop. Each one outputs a different piece of it. ## Three functions of a world model The first kind of world model is a renderer. 
A renderer outputs observations in the form of pixels meant for human eyes, and the quality that matters most is visual fidelity. A video model that turns a text prompt into a cinematic drone shot is a renderer. So is an interactive system like Google’s Genie 3, or World Labs’ own RTFM, where the model generates frames in real time conditioned on user input. The model carries no explicit understanding of three-dimensional structure. It produces what a viewer would see, not what is. The buildings in the drone shot may look flawless from above, but try to drive through the city below and they fall apart. The second kind is a simulator. A simulator outputs state: a geometrically, physically or dynamically faithful representation of the world that humans and computer programs can both compute on and interact with. Where the renderer’s contract is purely visual, the simulator’s contract is structural, demanding geometry that holds up under inspection, physics that respects Newton’s laws, and dynamics that behave the way the world needs to behave given the laws of physics. A simulator serves two consumers at once. Human professionals such as architects, designers, filmmakers, and game developers need accuracy beyond visual plausibility. Computer programs such as reinforcement learning agents, robot controllers, and autonomous vehicles use simulators as training grounds where they can interact with the world at scale, testing scenarios that would be dangerous, expensive, or impossible to run in reality. The third kind is a planner. A planner outputs actions. Given an observation and a goal, a planner answers the question of what the agent should do next. This is, in many ways, the inverse of the renderer. Where a renderer takes actions as input and produces observations, a planner takes observations as input and produces actions, closing the perception-action loop. 
Vision-Language-Action models, model-based systems, and the new wave of World Action Models are all attempts at planners: systems that can decide what a robot should do in an unstructured world. These three categories describe most of what is actually shipping today, and the distinction between them is useful in practice. The categories are not, however, fundamentally separate. The same underlying knowledge of how the world works—geometry, physics, dynamics—sits beneath all of them. A model that can render a cup from any angle ought, in principle, to be able to simulate what happens when the cup is pushed and plan a hand to pick the cup up. Increasingly, the most interesting research deliberately blurs the boundaries between the three. ## Why simulation is the linchpin Of the three categories, the simulator gets the least public attention, and is the most consequential of the three. This essay addresses this asymmetry. The renderer is by far the most commercially mature. A number of image- or text-to-video products are expanding in the consumer or enterprise markets rapidly. Google’s Nano Banana model has put renderer-quality image generation in the hands of potentially hundreds of millions of users. The technology is real, and the markets are real. Yet renderers optimize for visual plausibility rather than physical accuracy, and that ceiling matters. Their outputs are beautiful, but they cannot be trusted to design a building or train a robot. The planner is the most intriguing and the most nascent, closely connected to the rapidly evolving field of robotic learning. The field has produced robotic demos in the last two years that look impressive in videos, but candor is required about what those demos actually show. Almost all have been confined to heavily constrained laboratory setups, with narrow object sets and short task horizons. None have been validated at the complexity, variability, or duration that real-world deployment demands. 
The gap between a compelling demo reel and a robot that reliably works in a kitchen, a warehouse, or an operating room remains vast. The commercial bets are nonetheless substantial. A wave of well-funded entrants is racing to ship general-purpose planning systems, while the largest infrastructure players are positioning planning atop broader simulation stacks. A robot that can plan is a robot that can work, and the entire industry is racing to be the one that gets there first. Simulation is the bridge between the two. If language is an abstraction of the world and pixels are a projection of it, then geometry, physics, and dynamics are the world itself. A simulator must work at that level: the structural backbone from which both visual appearance (for renderers) and action consequences (for planners) can be derived. A model that masters simulation can project its understanding into pixels for human consumption, and into action predictions for embodied agents. A model that masters only rendering, or only planning, cannot do either. The commercial surface area is enormous. NVIDIA’s Omniverse alone targets what the company estimates as more than a trillion dollars of addressable market in factories, warehouses, supply chains, and digital twins. Robotics training, autonomous vehicle testing, architectural visualization, engineering, and drug discovery all depend on something simulation-shaped. The hardest open problems in the field live there too. Three-dimensional data with explicit geometry, material properties, and physical annotations is orders of magnitude scarcer than the internet video that renderers train on. The sim-to-real gap, which is the difference between how things behave in simulation and how they behave in reality, persists. Generative simulators introduce a new risk on top of that: AI-generated geometry can look correct while containing self-intersections or wrong scale that produce nonsensical physics. 
Multi-physics simulation at scale, where rigid bodies, deformable objects, fluids, and cloth all interact, remains orders of magnitude more expensive than single-domain simulation. At World Labs, Marble is our first move into this territory. It takes multimodal prompts (text, image, video, or spatial sketch) and generates explorable 3D environments, outputting Gaussian splats for visual exploration alongside collision meshes a physics engine can operate on. But Marble is only the first chapter of a much longer arc being written across the field as the lines between rendering, simulation, and planning begin to collapse. ## Where the boundaries are collapsing and what comes next But more is to come. The most important pattern in the field right now is that the three categories are starting to blend into one another. The shared insight is that the knowledge required to render a world, simulate it, and act in it is largely the same. Continuing the earlier example, a model that truly understands how a cup sits on a table (its geometry, material properties, response to force, etc.) should be able to render that cup from any angle, simulate what happens when the cup is pushed, and plan for a hand to pick the cup up. The three categories are three projections of a single underlying understanding. For example: a small but growing number of recent work from various robotics labs have demonstrated that—at least conceptually—a pretrained video renderer can be used as the backbone for joint world-and-action prediction, suggesting a bridge between the renderer and the planner by letting one model imagine what will happen and what to do. World Labs’ Marble already outputs Gaussian splats and collision meshes from a single model, dissolving the boundary between the renderer and the simulator. 
Every level is moving from passive output to interactive system, with renderers becoming action-conditioned, simulators generating worlds that are more controllable and editable, and planners deliberating rather than just reacting. The logical endpoint is a unified world model: one foundation model that can render photorealistic views, produce physically accurate structure, and plan action sequences, switching between output modalities depending on what the downstream consumer needs.

We will still face a number of daunting challenges. The data picture is uneven, with renderers awash in internet video while simulators and planners face acute shortages of 3D assets and robot demonstrations. Optimizing for visual beauty can sacrifice the precision a robot or a high-fidelity simulation needs. Reconciling these tensions inside a single architecture is the defining open problem in world model research today, and this is what World Labs sets out to do as we continue to evolve Marble.

The direction, however, is clear. The bet the field has been making since the late 1980s, that a sufficiently rich model of the world is all that any agent needs to see worlds, build them, and act in them, is now driving an entire generation of research. What gives that “big bet” weight is the convergence already underway: three threads that began as separate research programs, each already driving and shaping multi-billion-dollar industries on its own, are starting to behave like one. Taken together, as the boundaries between them collapse, they will reshape something larger: the relationship between machine intelligence and the physical world it inhabits, the long arc of spatial intelligence. Language gave machines a way to talk about that world. World models are how machines will finally come to understand it, imagine it, reason about it, and interact with it.

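Summary: The World Labs team and Fei-Fei Li published an essay untangling the overused term "world model". Where language models learn the statistics of text, world models learn the statistics of space and time (such as lighting and physical laws). In the partially observable Markov decision process (POMDP) framing, an agent's actions influence the world state, and observations are only partial views of it. The different systems currently called "world models" are essentially different projections of the same loop: the first category is the renderer, which outputs pixels for human eyes and centers on visual fidelity. The essay focuses on conceptual layering and gives no specific model names, parameter counts, or benchmark scores.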
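
The POMDP framing mentioned in the summary above (actions change a world state that the agent only partially observes) can be illustrated with a toy loop. This is a minimal sketch, not anything taken from the World Labs essay: the names `State`, `step`, `observe`, and `rollout`, and the one-dimensional push dynamics, are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class State:
    """Full world state; the agent never sees this directly."""
    position: float
    velocity: float


def step(state: State, action: float, dt: float = 1.0) -> State:
    """Transition: the action (a push) changes velocity, which then moves the object."""
    velocity = state.velocity + action * dt
    return State(state.position + velocity * dt, velocity)


def observe(state: State) -> float:
    """Partial observation: position is visible, velocity stays hidden."""
    return state.position


def rollout(state: State, actions: list[float]) -> list[float]:
    """Apply actions in order and collect what the agent would observe."""
    observations = []
    for action in actions:
        state = step(state, action)
        observations.append(observe(state))
    return observations


# Two states that look identical to the agent but evolve differently.
at_rest = State(position=0.0, velocity=0.0)
moving = State(position=0.0, velocity=1.0)
same_view = observe(at_rest) == observe(moving)
diverged = rollout(at_rest, [0.0]) != rollout(moving, [0.0])
```

Here `same_view` and `diverged` are both true: two states that produce the same observation can still evolve differently, which is one way to see why an agent needs an internal model of the world rather than the raw observation alone.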

AYi@AYi_AInotes · June 4 · 65

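A 150M job, done with 35M. Google's new Gemma 4 12B takes the heaviest component in multimodal models, the vision encoder, and shrinks it from 150M-550M parameters straight down to 35M. Until now the multimodal recipe was fixed: hand the image to a dedicated vision encoder that translates it into a language the model understands, then pass it to the large model to interpret, like hiring an interpreter. That interpreter, a traditional ViT encoder, costs 150M to 550M parameters. Gemma 4 12B fires the interpreter and keeps only a lightweight 35M embedder: images are cut into 48×48 patches and fed in directly as tokens, leaving the Transformer to learn to see the world on its own. Audio gets the same treatment: the raw 16kHz waveform is cut into 40ms frames and fed straight into the same model. In other words, images, sound, and text are being treated as the same kind of thing for the first time. Why dare to do this? Because it is betting on one thing: once the base model passes a certain size threshold, those specialized submodules stop being necessary. You may have seen this script before. When ViT displaced CNNs it was the same play: at sufficient scale, rather than hand-designing a pile of specialized structures, you hand the work to one unified large model and let it learn. That logic is now spreading from vision alone to the entire multimodal architecture. And the 12B size was not chosen at random: just big enough to drop the encoder, just small enough to fit in a 16GB laptop. According to aaryan_kakad's tests on an M4 Max, image-understanding latency is 1.2 to 1.5 seconds under 4-bit quantization. Officially 16GB is enough; the community's take is more down to earth: it runs, but multiple high-resolution images push right up against the limit. But what is really worth chewing on here is not that it runs on your laptop, it is what that means. Building a multimodal app used to mean assembling Whisper for transcription, LLaVa for image understanding, and then an LLM on top, like building a PC from parts, where you had to wire up, align, and debug every component's interface yourself. If the encoder-free route pans out, a single fine-tuned unified model may swallow that entire pipeline. What loses value at that moment is not some tool but all the craft you built up assembling that machine and stitching that pipeline together. The model is not saving you a component; it is quietly rewriting which kinds of craft are still worth anything.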

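Summary: Google has released Gemma 4 12B (Apache 2.0), which uses a unified multimodal architecture with no separate vision encoder. A lightweight embedder of only 35M parameters cuts images into 48×48 patches and audio (raw 16kHz waveform) into 40ms frames, which go into the Transformer directly as tokens. On an M4 Max, 4-bit quantized image-understanding latency is 1.2-1.5 seconds; officially 16GB of memory is enough, but the community notes that multiple high-resolution images push the limit. The design implies that once the base model is large enough, specialized submodules are no longer necessary, and a fine-tuned unified model may eventually replace the traditional multimodal pipeline assembled from Whisper, LLaVa, and the like.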
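
To make the patch and frame arithmetic in the post concrete, here is a minimal sketch of how an image and a waveform could be turned into one token sequence without a separate encoder. Only the 48×48 patch size, the 16kHz sample rate, and the 40ms frame length come from the post. Everything else is an assumption for illustration: the non-overlapping patches with cropped edges, the single linear projection per modality, the toy width `D_MODEL`, and the helper names `patchify`, `frame_audio`, and `embed`. It is not Gemma's actual embedder.

```python
import numpy as np

PATCH = 48             # patch side in pixels (from the post)
SAMPLE_RATE = 16_000   # audio sample rate in Hz (from the post)
FRAME_MS = 40          # audio frame length in milliseconds (from the post)
D_MODEL = 64           # hypothetical toy embedding width, not Gemma's real value


def patchify(image: np.ndarray, patch: int = PATCH) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping patches.

    Assumption: edge pixels that do not fill a whole patch are cropped.
    """
    h, w, c = image.shape
    rows, cols = h // patch, w // patch
    cropped = image[: rows * patch, : cols * patch]
    # (rows, patch, cols, patch, c) -> (rows, cols, patch, patch, c)
    grid = cropped.reshape(rows, patch, cols, patch, c).transpose(0, 2, 1, 3, 4)
    return grid.reshape(rows * cols, patch * patch * c)


def frame_audio(wave: np.ndarray, sample_rate: int = SAMPLE_RATE,
                frame_ms: int = FRAME_MS) -> np.ndarray:
    """Cut a 1-D waveform into fixed-length frames, dropping the remainder."""
    frame_len = sample_rate * frame_ms // 1000  # 640 samples at 16kHz / 40ms
    n_frames = len(wave) // frame_len
    return wave[: n_frames * frame_len].reshape(n_frames, frame_len)


def embed(vectors: np.ndarray, weight: np.ndarray) -> np.ndarray:
    """Toy embedder: a single matrix multiply, no encoder stack."""
    return vectors @ weight


rng = np.random.default_rng(0)
image = rng.random((480, 960, 3))          # hypothetical 480x960 RGB image
wave = rng.standard_normal(SAMPLE_RATE)    # one second of synthetic audio

patches = patchify(image)    # shape (200, 6912): 10 x 20 patches of 48*48*3 values
frames = frame_audio(wave)   # shape (25, 640): 25 frames of 640 samples

w_img = rng.standard_normal((patches.shape[1], D_MODEL))
w_aud = rng.standard_normal((frames.shape[1], D_MODEL))
tokens = np.concatenate([embed(patches, w_img), embed(frames, w_aud)])  # (225, 64)
```

Under these assumptions both modalities end up as rows of the same `(n_tokens, D_MODEL)` array, which is the sense in which images and sound are "treated as the same kind of thing" before the Transformer sees them.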

Ethan Mollick@emollick · June 4 · 68

In early May, the best superforecasters predicted that, by the end of the year, the longest METR 80% task horizons would reach 3-4 hours. In late May, Claude Mythos achieved that number.

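Summary: In early May, top superforecasters expected the longest METR 80% task horizon to reach 3-4 hours by the end of 2026. Yet in late May, Anthropic's Claude Mythos model reached 3 hours 6 minutes at an 80% success rate in a METR benchmark preview, landing squarely within the 3-4 hour range of the experts' and superforecasters' median predictions for the end of 2026. The previous baseline was 1.5 hours. The jump suggests AI capabilities are advancing far faster than expected.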

Ethan Mollick@emollick · June 4 · 60

Most people, including really accomplished people, don't have an accurate mental model of how LLMs operate (and why would they?) You see this in wide beliefs that AI is just copying from known sources, or that it only produces average answers, or that it can't generate new ideas


elvis@omarsar0 · June 4 · 66

This SkillOpt paper from Microsoft is a must-read! (bookmark it) I was a bit skeptical of the results reported in the paper when I shared it a few days ago. However, I managed to integrate it into my agent orchestrator and ran a few experiments. The results are mindblowing. Essentially, all my agent skills now have a proper testing framework and a way to self-evolve. I have started to improve all my agent skills with this. One exciting result was when I applied it to my paper-figure-extraction skill, which requires an agent to do multimodal analysis. In particular, it improved quality by +20 points (0.73 → 0.93). I went to see the extracted tables and figures, and I was absolutely stunned by how much better my skill got at the task. Self-improving AI is in the early days, but I think this work is a clear example of the current ability of agents to self-improve. In this case, it was skills, but it's not hard to imagine how this scales to optimizing agent patterns, tool use, context engineering efforts, agentic search, workflows, evals, and even the harness itself. I already started with a few of these ideas inspired by SkillOpt. Stay tuned!

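Summary: After Elvis Saravia of DAIR.AI integrated Microsoft's SkillOpt paper into his agent orchestrator, all of his agent skills gained a testing framework and a self-evolution mechanism. Applied to a multimodal paper-figure-extraction skill, it raised the quality score from 0.73 to 0.93 (+20 points), with visibly better extraction results. Saravia sees this as an early example of self-improving AI and says the approach could extend to optimizing agent patterns, tool use, context engineering, agentic search, workflows, and evals. He has already started several follow-up experiments based on SkillOpt.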

Lee Robinson@leerob · June 3 · 61

"Engineering, product, and design are all merging into a 'builder' role" Yeah... I'm not so sure. This feels like an oversimplification and podcast talking point. Reality is a lot more complex. Even with 1000 "Member of Technical Staff" titles, someone still has to wake up and care 100x more about Product or Design than anyone else. It is their Main Thing™ That's not to say MTS titles are universally bad, but I think they're an example of this 'builder' talking point that's become bastardized. AI and coding agents have made generating code easy and yet... you're in for a world of pain if non-engineers ship a bunch of slop and don't have great engineers to tame the complexity. The SF hivemind has a tendency to overfit what works at startups for every company. And to be fair, sometimes this is true! Startups can be a leading indicator for how the industry is changing and often cause disruption. However, it is going to be incredibly hard to disrupt the extremely human parts of corporate jobs. You really think there's going to be a PM who also does some engineering and design on the side at JPMorgan Chase? This is true for the simple parts of most jobs, like people wanting to have ownership over something and do good work, move up a career ladder, support their family, get paid well, make an honest living... And also the hard parts: internal politics, some critical business system that has a bus factor of 1 which has been running for 15 years and isn't documented anywhere because it's that guy's job security. The real world has a lot of this stuff. It's easy to pontificate about all roles collapsing but it's actually really nice to have a specific person or team who is an expert in one thing that you can work with. I don't expect that to change. Further, I think AI disruption to knowledge work will take decades to play out because it is more fundamental to the human condition (e.g. sociological/organizational) than pure intelligence.

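Summary: Lee Robinson considers the claim an oversimplified podcast talking point. Reality is more complex: even with plenty of "Member of Technical Staff" titles, someone still has to focus fully on product or design, and although AI has made generating code easy, a lack of great engineers leads to disaster. Silicon Valley tends to apply startup experience to large companies, yet the deeply human parts, such as internal politics and legacy systems, are hard to disrupt. He expects AI's disruption of knowledge work to take decades, because at root it is a sociological and organizational problem rather than one of pure intelligence.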

Nathan Lambert@natolambert · June 3 · 40

A key lesson of the last year of building open models, once it became so obvious the US is behind, is that talk is cheap. Many people say they're helping / want to help but actually don't do anything. Finding the few people who genuinely push open forward is crucial.


AYi@AYi_AInotes · June 3 · 46

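Turns out Jensen is basically a walking stock-pumping machine. At COMPUTEX 2026 in Taipei, Jensen Huang, whose Nvidia is worth more than $5 trillion, got tired of walking the show floor, headed straight to the GIGABYTE booth, sat down on the floor, and started drinking with GIGABYTE chief Etay Lee (李宜泰). A crowd gathered around; he did not care at all and stayed on the floor for nearly 10 minutes. GIGABYTE's stock got a bump on the spot. Plenty of people are probably wondering: just how tight are Jensen and GIGABYTE that he shows this much support? Two COMPUTEXes ago he publicly shouted "GIGABYTE NO.1", and this time he sat down on their turf to drink beer. He really does treat partners like brothers. There is also a very reliable pattern: whenever Jensen shows up during COMPUTEX, related supply-chain stocks often surge. GIGABYTE has already risen five sessions in a row for a gain of more than 20% around the show, and once this video came out it got another intraday lift. So how should the signal be read? The first layer is the stock-price signal: wherever he sits down, the market's money follows. The second layer runs deeper: he skipped the showpiece booths and instead sat down to chat on a long-term partner's turf, which suggests GIGABYTE's position in Nvidia's supply-chain logic is deepening rather than being a mere badge on the box. For anyone watching the supply chain, Jensen's itinerary is worth more than a research report.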

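Summary: At COMPUTEX 2026, Jensen Huang wandered over to the GIGABYTE booth, sat on the floor, and drank beer with GIGABYTE's chief for nearly 10 minutes, drawing a crowd. GIGABYTE's stock rose on the spot, having already gained more than 20% over five straight sessions during the show. The deeper signal is that GIGABYTE's position in Nvidia's supply-chain logic is strengthening. The quoted post looks back: in 2009 Nvidia's market cap was only $4 billion (Intel's was $100 billion) when Huang bet on CUDA and heterogeneous computing; 17 years later Nvidia is worth $5 trillion and Intel roughly $500 billion, turning a 25x deficit into a lead of nearly 10x, a testament to his foresight and moat.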