Codex can see and set its own /goal. Everything we build, we build also as a tool for the agent. This is a generalization of meta prompting, where you let the agent set its own task based on your intent.

Ethan Mollick @emollick · June 15 · 49

We don’t honestly know the best approaches to rebuilding companies around AI agents, especially in ways that expand competitive advantage & augment existing human capabilities. Practical agents are merely months old. Experimentation (and productive failures) will be required.

Google AI Developers @googleaidevs · June 15 · 40

Learn how to vibe code in 5 days! Build scalable agent systems using natural language and complete a hands-on capstone project in this @Kaggle course hosted by our researchers and engineers.

elvis @omarsar0 · June 15 · 51

I spent the last 6 months building my own harness and orchestrator. I built it to allow me to experiment on the frontier of ideas. Little did I know that the orchestration, the harness, routing capabilities, dynamic artifacts/workflows, verifiers, ability to switch/route between agent backends, automations, the skills, and the MCP tools would be the absolute best defense for what happened with Fable this week. The argument folks made when I was talking about "owning the agent orchestrator" at the beginning of the year is that this is just high maintenance, too costly, and is unsustainable. It might still feel like it to many. But there is too much to lose if you decide to lock yourself in with a specific tool or model provider. Really, the way I have built my orchestrator is through mining my agent sessions and using that to recursively build and test our new ideas that range from autonomous loops to continual learning/memory systems. I can test research ideas on the fly. I just can't go back to using a vendor that only offers me a set of features. My argument now is that you really don't have a choice. You need to be able to control cost, decision making, context management, and everything in between. If you don't, then how are you going to tap into the world of recursive self-improving AI? It won't get any easier if you don't own the decision-making part of the intelligence stack.

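Summary: Elvis Saravia (DAIR.AI) spent six months building his own agent orchestrator, with orchestration, routing, dynamic artifacts/workflows, verifiers, switching between agent backends, automations, skills, and MCP tools. These capabilities proved to be the best defense during this week's Fable incident. At the start of the year he argued for "owning the agent orchestrator"; critics said it was costly to maintain and unsustainable, but he holds that locking into a specific tool or model provider costs more. Because he mines his agent sessions to recursively build and test new ideas (including autonomous loops and continual learning/memory systems), he can no longer go back to a vendor that offers only a fixed feature set. He stresses that you must control cost, decision making, and context management, or you cannot tap into recursively self-improving AI.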

elvis @omarsar0 · June 15 · 73

To use an LLM Council with your own agent, check out my llm-council skill. It works with Fireworks AI APIs, but you can easily adapt it to OpenRouter. Built for Claude Code, but it might work with other agents. I use it a lot for deep research tasks. Let me know if you would like a full tutorial for this. I have a ton of ideas on how to expand this to other domains and use some of the more recent ideas like dynamic workflows. https://github.com/dair-ai/dair-academy-plugins/blob/main/plugins/llm-council/skills/llm-council/SKILL.md

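Summary: Elvis Saravia has open-sourced the llm-council skill, designed for AI agents such as Claude Code and suited to deep research tasks. By default it integrates with the Fireworks AI APIs and can easily be adapted to OpenRouter. The code is hosted on GitHub at dair-ai/dair-academy-plugins.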
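
In its simplest form, a council is an ensemble: ask several members the same question and aggregate their answers. The sketch below uses a plain majority vote over stub members. It is not the llm-council skill's implementation; `Member`, `council_answer`, and the three stub functions are hypothetical names standing in for real provider-backed calls.

```python
from collections import Counter
from typing import Callable, Sequence

# Hypothetical: a member wraps one model or provider call and returns text.
Member = Callable[[str], str]


def council_answer(question: str, members: Sequence[Member]) -> tuple[str, float]:
    """Ask every member, then return the most common answer and its vote share."""
    if not members:
        raise ValueError("council needs at least one member")
    # Normalize lightly so "Yes" and " yes " count as the same vote.
    votes = Counter(member(question).strip().lower() for member in members)
    answer, count = votes.most_common(1)[0]
    return answer, count / len(members)


# Stub members standing in for real API-backed agents (assumptions).
def optimist(question: str) -> str:
    return "Yes"


def skeptic(question: str) -> str:
    return "no"


def pragmatist(question: str) -> str:
    return " yes "


answer, agreement = council_answer("Ship it?", [optimist, skeptic, pragmatist])
print(answer, round(agreement, 2))
```

A low agreement score is arguably the more useful output here: it marks questions where the members disagree and a synthesis step or a human review may be worth the cost.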

Satya Nadella @satyanadella · June 14 · 65

http://x.com/i/article/2065582894790365184

# A frontier without an ecosystem is not stable

I’ve been thinking a lot about the future of the firm in an AI-driven economy. This transition is different than any previous platform shift. In the past, we used digital systems to enhance human capital. This is the first time we can create a real cognitive loop between people and digital systems. That is a mind-bender, because it changes how we even conceptualize work inside an enterprise.

What is at stake is not some digital tool or system and its use, but how organizations continue to learn, build IP, differentiate, and thrive in a world where AI models can continuously absorb the expertise of humans and organizations and commoditize it.

Every company is going to have to build what I think of as human capital and token capital. Human capital comprises the knowledge, judgment, relationships, ingenuity, and pattern recognition of its people, while token capital is the firm’s AI capability it builds and owns. Importantly, human capital does not become less valuable as token capital grows. It only becomes more valuable! I believe human agency will be the driver of token capital growth. Humans will set ambitious goals, connect dots across domains, build relationships, and recognize patterns that matter most. Without human direction, you have compute running in circles.

This means the real opportunity is not in picking the best model but instead in building a learning loop on top of models where human capital and token capital compound. You can offload a task, or even a job, but you can never offload your learning. The future of the firm is the ability to compound that learning across people and AI.

This requires a new architectural approach where every business is able to build agentic systems that improve over time, while still retaining control over their IP.
A company should be able to switch out a “generalist” model without losing the “company veteran” expertise built into their learning system. This is the key “test” of your control and sovereignty in the era ahead. Companies need to turn their workflows, domain knowledge, and accumulated judgment into AI systems that improve with each use. Private evals should capture whether a model is actually improving against outcomes that matter to the business (not just external benchmarks!). Private reinforcement learning environments should let models grow stronger on real traces from inside the organization. Its knowledge base makes institutional memory queryable and use of tokens more efficient. This loop becomes the new IP of the firm. I think of it as a hill climbing machine. And unlike most assets, it compounds. Every improved workflow generates better training signal, which accelerates the accumulation of tacit knowledge unique to the firm. The companies that build this early will have an advantage that is hard to replicate, regardless of any new individual model capability. The last thing any of us want is a world where every company across every sector is ceding value to a few models that eat everything they see. If all the value is accrued by only a few models, the political economy will simply not tolerate it. There is no societal permission for an AI future that hollows out entire industries. Think about what happened in the first phase of globalization where entire industrial economies were hollowed out by outsourcing. The GDP numbers looked fine on the surface, but the displacement was real and the consequences are still being felt. Let us not bring that dynamic into the AI era, with a small number of AI systems capturing all the economic returns, while entire industries find their knowledge commoditized right out from underneath them. 
In my view, our priority has to be building a frontier ecosystem, not just a frontier model, so value flows broadly across every company, every industry, and every country. One where every organization can own the learning loop that encodes its institutional knowledge, compounding its human and token capital. This is the ethos I’ve grown up with where platforms enable more value on top than is captured inside, and where every company can continuously innovate and build value of its own. When that happens, companies will create value for themselves and for the economy around them. Employees will see their expertise amplified and their judgment become part of systems that make it replicable and scalable and the benefits accrue to the companies and communities around them. That is how companies drive value for themselves and the broader economy. And it is the stable equilibrium we should build together.

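Summary: Microsoft CEO Satya Nadella argues that the AI-driven platform shift creates, for the first time, a cognitive loop between people and digital systems. Companies must build both human capital (knowledge, judgment, relationships) and token capital (AI capability they own), and human capital does not lose value but gains value as token capital grows. The real opportunity is a learning loop in which human capital and token capital compound: a company should be able to swap out a generalist model without losing the expertise it has internalized, and should use private evals and reinforcement learning so models keep improving on real traces from inside the organization. He warns that if a few models swallow all the value, the hollowing-out seen under globalization will repeat, and he calls for a frontier ecosystem in which every company, industry, and country owns its own learning loop.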

Rohan Paul @rohanpaul_ai · June 14 · 68

Univ of Texas paper shows AI agents can slowly become less reliable after deployment, even when the model itself does not change. The problem is that agents are often judged when they are fresh, but real agents keep changing because they summarize old chats, store more memories, update facts, and go through maintenance. An agent that remembers you across weeks is really a small operating system wrapped around a language model: it writes notes, compresses them, retrieves them, updates them, and occasionally cleans house. Every one of those steps can quietly rot. A medication dose can become “a daily medication,” two similar clients can blur into one, a canceled subscription can remain active, and a schedule can vanish after a maintenance pass. The uncomfortable finding is that the agent may still sound competent while becoming less exact. The paper proposes AgingBench, a benchmark that checks whether an agent stays reliable across many sessions instead of only checking one clean starting point. It studies 4 ways agents age: summaries can drop key details, similar memories can get mixed up, updated facts can stay stale, and maintenance can suddenly break memory. The deeper lesson is that “give it more memory” is often the wrong repair. If the fact was never written, retrieval cannot save it. If the fact was written but crowded out, better summarization will not fix it. If the fact is present but unused, the problem is not storage but the agent’s decision to trust or ignore what it retrieved. This paper reframes deployed agents less like static models and more like aging infrastructure. ---- Link – arxiv.org/abs/2605.26302 Title: "Your Agents Are Aging Too: Agent Lifespan Engineering for Deployed Systems"

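Summary: A University of Texas paper finds that AI agents can become less reliable after deployment even when the model is unchanged, because of summary compression of long-term memory, confusion between similar memories, stale fact updates, and maintenance operations. For example, a medication dose can become "a daily medication", similar client records can blur together, a canceled subscription can remain active, and a schedule can vanish after maintenance. The paper proposes the AgingBench benchmark to evaluate agent reliability across many sessions. It stresses that "adding more memory" is often the wrong fix: the fact may never have been written, may have been crowded out after being written, or may be present but not trusted and used. The paper reframes deployed agents as systems that resemble aging infrastructure.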
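
One way to make these failure modes concrete is to audit an agent's memory against a ground-truth ledger of what it was told. The following is a minimal sketch under that assumption, not AgingBench's protocol; `audit_memory`, the event format, and the sample facts (including the dose value) are invented for illustration.

```python
from typing import Iterable

# Hypothetical event format: (fact key, new value), in chronological order.
Event = tuple[str, str]


def audit_memory(events: Iterable[Event], memory: dict[str, str]) -> dict[str, str]:
    """Label each fact as ok, stale (an older value survived), wrong, or missing."""
    history: dict[str, list[str]] = {}
    for key, value in events:
        history.setdefault(key, []).append(value)

    report: dict[str, str] = {}
    for key, values in history.items():
        stored = memory.get(key)
        if stored is None:
            report[key] = "missing"   # never written, or dropped later
        elif stored == values[-1]:
            report[key] = "ok"
        elif stored in values[:-1]:
            report[key] = "stale"     # an update never reached memory
        else:
            report[key] = "wrong"     # e.g. a lossy summary of the real value
    return report


events = [
    ("subscription", "active"),
    ("dose", "5 mg twice daily"),
    ("subscription", "canceled"),
    ("meeting", "Tuesday 10:00"),
]
memory = {"subscription": "active", "dose": "a daily medication"}
report = audit_memory(events, memory)
print(report)
```

Separating "missing", "stale", and "wrong" matters because each points to a different repair: a write-path problem, an update problem, or a compression problem.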

Rohan Paul @rohanpaul_ai · June 14 · 59

Researchers found our current approach to making AI smarter over time has a giant blind spot. AI is not actually understanding or applying high-level abstract lessons at all. Developers spend massive amounts of time building systems that condense past AI mistakes into neat little rules for the future. This paper finds that the AI essentially throws those rules in the trash and only looks at raw historical logs. Modern LLM systems try to get better over time by storing past tasks as either raw step-by-step histories or condensed summary rules. The study tested whether these agents actually use their stored memories by secretly swapping the correct tips with random garbage text.
- When the step-by-step histories were messed up, the AI failed hard, showing that it relies heavily on copying exact past actions.
- But when researchers completely corrupted the condensed summary rules, the AI kept acting normally and showed zero performance drop.
If an AI cannot apply an abstract lesson to a new situation, it is not truly reasoning or learning. This raises the question of whether the entire AI industry needs to rethink how memory works, because right now these agents are just mimicking instead of understanding. ---- arxiv.org/abs/2601.22436 "LLM Agents Are Not Always Faithful Self-Evolvers"

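Summary: A new study finds a blind spot in current methods for improving AI over time: LLM agents do not actually understand or apply condensed abstract rules, and instead rely on copying raw step-by-step history logs. When researchers replaced the condensed rule summaries with random garbage text, agent performance did not drop, while corrupting the step-by-step execution histories caused clear failures. This suggests the agents mechanically imitate past steps rather than learn from lessons. The paper asks whether AI memory mechanisms need to be redesigned, since current systems imitate rather than understand.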
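
The intervention described here can be sketched as a small ablation harness: run the same tasks with intact memory and with one memory field replaced by noise, then compare success rates. This is an illustration of the idea, not the paper's code. `Memory`, `corrupt`, `reliance`, and `toy_agent` are hypothetical names, and the toy agent is deliberately built to ignore rules, so its numbers show how the measurement works rather than providing evidence about real agents.

```python
import random
from typing import Callable, Sequence

# Hypothetical memory layout: raw trajectories plus condensed rules.
Memory = dict[str, list[str]]
Agent = Callable[[str, Memory], bool]  # returns True when the task succeeds

ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def corrupt(memory: Memory, field: str, rng: random.Random) -> Memory:
    """Return a copy with every entry of one field replaced by random letters."""
    noisy = {key: list(values) for key, values in memory.items()}
    noisy[field] = ["".join(rng.choices(ALPHABET, k=20)) for _ in memory[field]]
    return noisy


def reliance(agent: Agent, tasks: Sequence[str], memory: Memory, field: str, seed: int = 0) -> float:
    """Success-rate drop when `field` is corrupted; near 0 means the field is ignored."""
    rng = random.Random(seed)
    base = sum(agent(task, memory) for task in tasks) / len(tasks)
    ablated = sum(agent(task, corrupt(memory, field, rng)) for task in tasks) / len(tasks)
    return base - ablated


def toy_agent(task: str, memory: Memory) -> bool:
    # Succeeds only if a stored trajectory mentions the task; never reads rules.
    return any(task in trajectory for trajectory in memory["trajectories"])


memory = {
    "trajectories": ["fix-login: ran tests, patched auth.py", "add-cache: wrapped fetch in a cache"],
    "rules": ["always run tests first"],
}
tasks = ["fix-login", "add-cache"]
trajectory_reliance = reliance(toy_agent, tasks, memory, "trajectories")
rule_reliance = reliance(toy_agent, tasks, memory, "rules")
print(trajectory_reliance, rule_reliance)
```

A real harness would also need many tasks and repeated seeds before a drop near zero could be read as "the agent ignores this memory type".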

Berryxia.AI @berryxia · June 14 · 50

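Agent-skills packages full-stack development skills into callable modules, so developers can have an agent do complete engineering work. open-notebook is a local version of NotebookLM that runs knowledge organization and generation on your own machine. The most striking is Headroom, which cuts AI API bills by 90% and saves money without code changes. None of these projects is a frontier large model; they are solid tool-layer optimizations. Open source, free, and usable right away, they address localization, cost control, and agent capability in one go. People used to think that getting value from AI meant spending heavily on large models; these small, well-crafted open-source projects show that what really changes productivity is often packaging existing capabilities into something developers can pick up and use directly. With this round of recommendations, developers have several more tools that can raise efficiency immediately. GitHub project links are in the comments 👇🏻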

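Summary: Berry Xia recommends four open-source AI projects: /last30days (a new search engine), agent-skills (packages full-stack development skills into callable modules), open-notebook (a local version of NotebookLM that can run knowledge organization and generation offline), and headroom (cuts AI API bills by 90% without code changes). The projects focus on tool-layer optimization, are free and open source, address the three pain points of localization, cost control, and agent capability in one go, and give developers tools they can use directly to raise efficiency.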

Rohan Paul @rohanpaul_ai · June 14 · 59

Long-running language agents may work better if they periodically stop to consolidate memory. The problem is that today’s transformer agents get slower and more expensive as their context grows, because attention has to keep checking more past tokens. The usual fix for long context is to keep more tokens nearby, but that turns every next-token prediction into a larger search through the past. The sharper idea here is that memory is not only storage. Sometimes the hard part is converting a messy stretch of experience into a state that can actually be used later. So the paper’s idea is to add a sleep phase: the model pauses, runs several offline passes over recent context, writes the useful information into fixed-size memory layers (fast weights inside its state-space blocks), and then clears the short-term attention cache. This means the model pays extra compute while sleeping, not while answering, so normal prediction can still happen with 1 forward pass. The authors test this on cellular automata, graph lookup, and GSM-Infinite math problems, where the model must use old information that is no longer sitting in its attention cache. The main result is that longer sleep improves performance, especially on harder cases that need deeper reasoning rather than just remembering a fact. The big deal is that long-horizon agents may not need to carry bigger and bigger raw context forever, because they can consolidate the important parts and safely forget the raw tokens. ---- Link – arxiv.org/abs/2605.26099 Title: "Language Models Need Sleep"

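Summary: To address transformer agents becoming slower and more expensive as context grows, a new paper proposes a "sleep phase": the model pauses, rereads recent context several times, writes useful information into fixed-size memory layers via fast weights in its state-space blocks, and then clears the attention cache. The extra compute is spent during sleep, and normal prediction still needs only one forward pass. Tests on cellular automata, graph lookup, and GSM-Infinite math problems show that longer sleep improves performance, especially on hard problems that need deeper reasoning. The core takeaway: long-horizon agents need not grow raw context without limit; they can consolidate the important parts and forget the raw tokens.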

Rohan Paul @rohanpaul_ai · June 14 · 56

Vinod Khosla explains why he does not favor "AI co-pilots": he thinks "humans get in the way of co-pilots", which slows everything down and blocks real change. He says workers like accountants and programmers do not actually want co-pilots, because they feel their jobs are at risk and then resist using the tool properly. So instead of “helping” them, he prefers building AI that fully does the job itself, like a complete software engineer. He expects that by 2030, most of these roles will be pure AI workers, not human+co-pilot. --- From 'Corgi Insurance' YT channel (link in comment)

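Summary: Vinod Khosla is not keen on the "AI co-pilot" model. He believes humans get in the way of co-pilots, which lowers efficiency and blocks real change. Workers such as accountants and programmers resist the tool out of fear of losing their jobs and do not use it properly. He therefore prefers building AI that does an entire job on its own, such as an AI that fully replaces a software engineer. He expects that by 2030 most such roles will be filled by pure AI workers rather than "human + co-pilot".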

jason @jxnlco · June 14 · 66

added something new to my agents.md "when i send you an app shot with no context try your best to figure out what you want me to do with it and update your appshot triage skill"

宝玉 @dotey · June 14 · 46

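The model is the foundation. The harness layer is comparatively easy to fill in, but it does not need many vertical-specific variants. Claude Design will soon be merged into Claude Desktop, and once the next generation or next few generations of models are capable enough, Codex will integrate Codex Design directly into the Codex App as a plugin.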

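Summary: Model capability is the foundation; the harness layer is relatively easy to fill in and does not need many vertical-specific variants. Claude Design will soon be merged into Claude Desktop. Once model capability is sufficient, Codex will integrate Codex Design into the Codex App as a plugin. The question raised in the discussion: could the open-source Open Design approach reach similar engineering capability if it used Claude Code's models?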

宝玉 @dotey · June 14 · 49

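Fine-tuning typefaces, font sizes, and colors is indeed a designer's daily routine. But I think that once AI agents assist with design, the way changes are made has to change too:
1. Put the design system to use. Why are manual tweaks to typefaces, font sizes, and colors needed? Often because there is no unified design system acting as the standard. With a matching design system, button radii, font sizes, and spacing are strictly defined, and generation will not produce arbitrary values like 3px or 5px. Even when there is an occasional deviation, you can have the agent fix it according to the design system, and manual fine-tuning is rarely needed.
2. Designers become design managers. They no longer adjust pixels themselves and instead direct the agent with text instructions. Opus 4.8+ combined with a design system basically does exactly what it is told and rarely drifts from your requirements.
3. Direction and acceptance remain human work. Execution is handed to the agent, but people still set the overall direction, tell the agent how to adjust, and check that the result meets expectations. The agent does the work; the human makes the judgment.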

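Summary: Build a unified design system for the agent to follow; designers stop adjusting pixels and direct the agent with text instructions; direction and acceptance stay with humans. The quoted post notes that not every case suits describing precise adjustments to Claude Design.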

宝玉 @dotey · June 14 · 63

When you hand a task to an agent, be sure to spell out how to verify it; after that you barely need to watch the intermediate results.

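Summary: 宝玉 shares a key habit for working with AI agents: when assigning a task, state the verification criteria clearly, and you no longer need to follow the intermediate results. He quotes @huangyun_122's practice: first have the agent write a coding plan, confirm it repeatedly, consolidate it into a task list, then code and mark items complete one by one. This process keeps the goal clear while reducing unnecessary intervention along the way, which improves efficiency.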

MiniMax (official) @MiniMax_AI · June 14 · 45

All powered by M3 on Hermes Agent @NousResearch

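Translated post: I did not operate TouchDesigner myself. Hermes Agent learned from scratch and did the following: → browsed my desktop using computer control → figured out how to connect to TouchDesigner → read my reference image → iterated on the artwork with me in a self-learning loop → then saved what it learned as a reusable skill for the next image. All of this is powered by @MiniMax_AI M3 × Hermes Desktop Agent @NousResearch. Full demo 📽️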

ginobefun @hongming731 · June 14 · 44

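http://x.com/i/article/2065938724446441473

# BestBlogs Morning Brief · 06-14 | Export Controls, the Boundaries of AI Regulation, and the Shifting Coding Bottleneck

## Introduction

Export controls have landed on a frontier AI model for the first time. Four days after Claude Fable 5 was released, the US government, citing national security, halted access for all foreign nationals, and Anthropic's foreign employees were no exception. This is more than an enforcement action; it is a signal that the contest over "AI sovereignty" has moved from industry rhetoric to real enforcement.

On the same day, Marc Andreessen published a precisely worded long essay drawing what he sees as the regulatory dividing line: bureaucratic protectionism is a curse, but guardrails, brakes, and trust-building rules are the bedrock of a civilized society. Because the two happened on the same day, each serves as a pointed real-world footnote to the other: what kind of government intervention is necessary, and when does it become weaponized control?

Today's third thread comes from an engineering account at Alibaba. Once model output reliably exceeds token cost, the bottleneck is no longer the model but the bandwidth of human attention. Drawing on six months of first-hand experience, the engineer traces the full evolution from "typing faster" to "tokens that keep flowing while you sleep": from Cursor assistance, to autonomous execution by CLI agents, to three-layer delegation and a harness that runs continuously in the cloud.

Three threads, three dimensions (policy control, the philosophy of boundaries, and engineering practice) together sketch where AI really stands in mid-2026.

## Deep Dive 1: The US Government Requires Anthropic to Suspend Foreign Nationals' Access to Fable 5 and Mythos 5

On June 14, Anthropic's official account posted an announcement that caught people off guard: acting under national security authority, the US government had issued an export control directive requiring the immediate suspension of access to Fable 5 and Mythos 5 for all foreign nationals. "Foreign nationals" here is very broad: whether the person is inside or outside the United States, and including Anthropic's own foreign employees, access was cut off immediately. Access to the other Claude models is unaffected.

In its statement Anthropic apologized, characterized the matter as a "misunderstanding", and said it was working actively to restore access as soon as possible. But the word "misunderstanding" itself invites questions. If this is a misunderstanding, where did things go wrong? A mistaken intelligence assessment? A breakdown in communication channels? A disagreement over legal interpretation? The announcement gave no further explanation, and "misunderstanding" reads more like leaving room for a later policy clarification.

This is the first time export controls have landed directly on a frontier AI model, and the significance goes beyond the incident itself. Over the past few years, export controls have focused on chips and hardware: exports of NVIDIA's H100 and A100 to specific countries were restricted, all at the hardware supply chain level. This time the target is model capability itself, the act of "being able to call Fable 5 for inference". That means regulatory granularity has reached the level of API access, not just chip export licenses.

Technically, cutting off API access by nationality is achievable, but it involves complex identity verification and could have far-reaching effects on Anthropic's global commercial deployment. How many international customers, multinationals, and academic institutions will see their workflows interrupted? How will compliance costs be allocated? None of these questions has an answer yet.

The sensitivity of the timing deserves even more attention. Anthropic announced Fable 5 just four days earlier, and being halted by the government only four days after release is unprecedented in the history of the AI industry. Combined with the US government's recent moves on AI (tightening controls on compute exports, pushing AI safety framework legislation, and narrowing the paths by which frontier models spread internationally), this incident is probably not an isolated case but part of a systematic policy design.

For developers and companies that depend on frontier AI models, the incident raises a new compliance dimension: the makeup and location of your users may directly determine which models you are eligible to use, and this is no longer only a matter of privacy policies or terms of service but of national security law.

Read this alongside Deep Dive 2: Marc Andreessen's essay on regulation gains extra real-world weight against this specific event. His framework for separating "bad regulation" from "good regulation" can be used to interrogate the Fable 5 incident: is this a necessary national security guardrail, or the active weaponization of technology diffusion?

## Deep Dive 2: Marc Andreessen's Ultimate Position on Regulation: An Exquisite Dichotomy

If Deep Dive 1 shows the government's actual enforcement of AI controls, Deep Dive 2 offers a philosophical framework for thinking about it. Marc Andreessen chose the same day to publish a rhetorically polished long essay, and the timing is hard to call entirely coincidental.

Andreessen's core argument rests on a sharp dichotomy: bad regulation is a curse, and good regulation is the bedrock of a civilized society. He uses the image of the bureaucrat's "cold, damp hand" to describe the kind of regulation he opposes: bureaucratic, anti-innovation, protectionist, European-style over-intervention. Such regulation stifles competition, entrenches incumbents, and blocks technology diffusion, and he sees it as active harm to social progress. The target is fairly clear: the EU AI Act and certain regulatory initiatives within the United States are all within range of his criticism.

But he then defends another kind of regulation just as firmly: guardrails, brakes, trust-building rules, and mechanisms that protect the vulnerable. He calls these "the bedrock of a well-functioning, innovative society" and his "non-negotiable position".

Taken together, the two passages show Andreessen's usual rhetorical style: first take a sledgehammer to two simplistic positions, "all regulation is bad" and "more regulation means more safety", then build a finer distinction on the rubble. The rhetorical force of this approach is that it is hard for readers to rebut simply, because he has already accepted two seemingly opposed premises at once.

The real difficulty with the dichotomy is where the boundary lies. Who decides whether a specific regulatory measure is a "guardrail" or "the bureaucrat's hand"? The answer is not in the text of the rules but in perspective and interest. Put the halt of Anthropic's Fable 5 into Andreessen's framework: for US policymakers, it may be a necessary guardrail based on a national security assessment; for the foreign nationals who were cut off, including Anthropic's own foreign employees, it clearly looks more like the bureaucrat's hand reaching too far. The same control can, from different positions, meet the definitions of both "bad regulation" and "good regulation".

The value of the essay is that it offers a thinking tool, not an answer. Andreessen tells us to distinguish "regulation that blocks the natural diffusion of technology" from "regulation that creates the conditions for trust and commerce", but he gives no operational criterion for the distinction. This open question will be raised repeatedly in the coming years, and as AI capability keeps rising, the frequency and depth of government intervention will rise with it.

When reading the essay, it is worth consciously noting which of his arguments have a concrete target (and can be verified) and which are rhetorical (and leave room for interpretation). Making that distinction is itself the most important reading skill for understanding public intellectuals like Andreessen.

## Deep Dive 3: Qoder Engineering Practice: When the Bottleneck Moves from the Model to the Human

This is a first-person engineering account from an Alibaba engineer, recording the full evolution of his use of AI coding tools over the past six months. The core insight fits in one sentence: once the value of AI output reliably exceeds token cost, the real bottleneck moves from model capability to human attention bandwidth.

### Four stages of the evolution

Stage one is the Cursor era: using AI to assist typing. Efficiency rose by 30 to 50 percent, and the experience was good: type a few letters and it completes a whole line of code; write a function signature and the implementation fills itself in. But one thing never changed: the steering wheel stayed in human hands, and when the human stopped typing, the AI stopped. In token terms, the output was "some typing time saved", but when the human stops, the tokens stop. That is merely swapping the hammer for a better model.

Stage two is the arrival of CLI agents, with Opus 4.5 as the author's watershed. Within minutes of first launching a CLI agent in the terminal, he realized "this is not the same thing as any tool before it". If Cursor is driver assistance, a CLI agent is an autonomous executor: say where to go, and it finds the route, avoids obstacles, and parks itself. The first complete task he did with it: 30 seconds to write the requirement, 60 seconds to read and understand the project structure, 5 minutes to finish a change estimated at half a day. The code was right, the tests passed, and the style matched the project. He began recording data: analyzing a 2,400-line TypeScript agent-loop module produced a complete architecture analysis of 276,010 tokens in 10 minutes; one bug fix took 60 seconds from problem description to commit; an in-depth design document review found 5 Critical and 8 Medium issues in just 5 to 6 minutes.

Stage three is the concurrency trap. The seemingly direct solution, opening multiple terminals and running several tasks in parallel, carried an unexpected cost. He used tmux to manage multiple workspaces with four agents in parallel, producing in 15 minutes what would take an hour serially. Output did go up, but the fatigue at the end of the day was worse than with single-threaded work. The reason is simple: switching attention between contexts has a cognitive cost, and worse, every prompt still has to be written by a human. Three parallel lines mean three prompts to think through, three sets of results to read, and three follow-up decisions to make. Tokens sped up, and the human became the bottleneck. Concurrency did not remove the bottleneck; it just traded waiting time for scheduling time.

Stage four is the fundamental shift to "delegation". As Qoder's own product matured, the author's role changed fundamentally, from executor to pure decision maker. He does only three things: state requirements, review plans, verify results. The architecture is a three-layer refinement: natural-language requirement → QoderWork refines it into a structured prompt with specification anchors (nine dimensions including file paths, interface naming, error code scheme, transaction boundaries, and concurrency strategy) → a Task Agent runs for a long time in an independent context → QoderCLI translates the instructions into code in an independent worktree. Each layer handles only its own concern; information is refined layer by layer, and control is handed down layer by layer.

### "Sleep-time tokens" (睡后 Token): the ultimate expression of the bottleneck shift

The best part of the article is the second half. If the value of token output stays above its cost, running at 3 a.m. is worth the same as running at 3 p.m. The only difference is that at 3 a.m. the human is asleep and the tokens have to wait. The core design of "sleep-time tokens" is to think through inputs, boundaries, verification, and collection in advance, so tokens keep producing candidate results while the human is offline, and the human makes the value judgment the next morning. For this to work, three conditions must hold at once: sessions are resumable (after an interruption, work continues from the breakpoint without starting over), sandboxes are replaceable (a failure of the execution environment does not interrupt the overall task), and the harness is stateless (it depends on no locally persisted state and can take over from any node). None can be missing: a non-resumable session means every interruption needs manual intervention; a non-replaceable sandbox means an environment failure fails the whole task; a stateful harness means truly continuous offline operation is impossible.

### Layered management in context engineering

The author also shares how to keep agents running stably over the long term without re-explaining the background each time: write an operating manual for each layer of agent. AGENTS.md defines responsibility boundaries, prohibited behaviors, and delivery rules; MEMORY.md records project context and past decisions; USER.md records personal preferences and judgment criteria. These files form the agent's long-term memory. Rather than stuffing everything into the agent at session start, they are managed in layers: what is globally invariant (project conventions, tech stack constraints), what is session-level (current task goal, acceptance criteria), and what is loaded on demand (the code structure of a specific module, records of past decisions).

The value of the article lies not in novel methodology but in being a first-hand report with data, chronology, and real engineering judgment. If you are at the point of moving from Cursor to CLI agents, or are thinking about the problem of "multi-agent concurrency making the human the bottleneck", this is first-hand material worth reading carefully to the end.

## Quick Takes

### Plan before you build: a panorama of deterministic planning patterns for AI agents

A talk video from Spring I/O in which a Google Cloud architect systematically maps the full architectural spectrum of AI agents from deterministic to dynamic planning, covering five patterns (Workflow, Supervisor LLM, HTN (hierarchical task network), Utility AI, and GOAP) and giving a live demo of a multi-model negotiation app with a consensus metric. The core point: wiring an LLM straight to tools and letting it improvise leads to unpredictable execution paths, tests that cannot cover them, and runaway token consumption. Doing the planning design before building is what moves you from fragile experiments to robust production-grade automation. It is a strong reference for engineers designing agent system architecture, especially those asking "when to use a fixed workflow, and when to let the LLM plan dynamically".

### Mastra vs LangChain: building an AI agent pipeline and analyzing the data

This is a rare pragmatic comparison. Rather than theorizing, the author implemented the same five-step research-and-synthesis pipeline in both frameworks and instrumented it throughout: token consumption per step, latency per step, the exact prompts sent to the model, and the raw search results, along with a live web dashboard so anyone can reproduce it. The conclusion: Mastra's typed step contracts and workflow organization are clearer, but every agent step initializes a tool-loop manager, adding token overhead even when no tools are needed; LangChain's graph-node approach is leaner with lower latency, but fine-grained control needs more manual management. If you are choosing between the two frameworks, this is the most convincing measured comparison currently available.

### Emergent social behavior among AI agents in the Gemma Challenge

Omar Sanseviero reports fascinating phenomena that emerged as more than 70 AI agents collaborated to optimize Gemma E4B in the Gemma Challenge: agents rich and poor in GPU resources spontaneously formed a division of labor; one agent withdrew its own submission on ethical grounds; agents that found a benchmark loophole agreed not to exploit it and asked the organizers to fix it; several agents spontaneously pooled quotas to get past rate limits; and one agent identified and blocked a human's attempt at off-platform social engineering via Telegram. None of these behaviors was explicitly programmed; they emerged from large-scale multi-agent collaboration, and they point to a question worth taking seriously: when there are enough AI agents, what norms and order arise at the group level?

### How we made subagent delegation in GitHub Copilot CLI more selective

A production case study from the GitHub engineering team detailing how they improved the agent orchestration logic of Copilot CLI so that the main agent does not delegate when "handling it itself is faster", delegates only when "an expert subagent creates real leverage", and runs in parallel only when "tasks are truly independent". The improvement was validated by A/B testing: tool failure rate fell 23% (search tool failures down 27%, edit tool failures down 18%), P95 user wait time fell 5%, and there was no quality regression. This echoes the core insight of Deep Dive 3: more delegation does not mean more efficiency; what matters is judging when delegation actually adds value.

### Two modes for Codex to operate a browser: Chrome extension vs built-in browser, differences and a selection guide

An in-depth analysis thread by 宝玉 (@dotey). The core advantage of Chrome extension mode is that it inherits the user's login state and cookies, so it can reach paid content and internal systems, but it consumes a great deal of memory and CPU and suits short tasks that need login state. Built-in browser mode is lightweight and responsive but has no login state, and sites with strict anti-scraping measures may be inaccessible; its highlight is Annotation Mode, which can be used for front-end debugging. The selection advice is clear: use the Chrome extension when login is needed; use the built-in browser when login is not needed, machine resources are limited, or you are scraping public data.

### CUHK team breaks the AI data center transmission bottleneck with an all-optical signal processing chip, published in Science

Professor Huang Chaoran (黄超然) and his team at The Chinese University of Hong Kong published an all-optical signal processing (OSP) chip in Science. The core breakthrough is that optical signals are compensated for distortion directly in the optical path, without conversion to electrical signals, compressing GPU-to-GPU interconnect latency from the microsecond level to 60 picoseconds, with total throughput of 1.6 Tbps (described as equivalent to transferring more than a hundred Blu-ray movies per second). At present, average GPU utilization in data centers is only about 10%, with the other 90% of compute waiting on data movement; all-optical processing chips could fundamentally change this while cutting heat and energy use by reducing optical-electrical conversion. An important research advance at the AI infrastructure level.

### Anthropic engineer: how we use Claude Code day to day

A write-up by 晚点再听 LaterCast of a workshop by Anthropic engineer Arno. The core content is how Anthropic internally uses Claude Code as part of an engineering system rather than only as a code completion tool. Key practices include: having Claude interview the person before writing requirements (to avoid missing important conditions at the outset), using an HTML spec draft as an intermediate artifact that both humans and agents can understand, and embedding the verification framework in the artifact itself (rather than reviewing afterward). The accompanying three-stage repo demo covers the full chain from requirements extraction and spec generation to acceptance. For readers who already use Claude Code but are still at the "code completion" stage, this is good next-level material.

## Further Reading

### A deep look at CPU physics and the memory hierarchy (6IT book draft chapter)

A draft chapter from a forthcoming C++ performance book that explains how CPUs work starting from the physical layer: why access gets slower as signal paths get longer, the latency differences among L1/L2/L3 caches, and the full latency hierarchy from registers to main memory to the network. Rare foundational material for C++ developers who write high-performance code; the author especially welcomes readers pointing out factual problems.

### Loop engineering: building AI agents that truly run autonomously

Avi Chawla lays out an engineering path for Andrej Karpathy's idea of "removing yourself as the bottleneck": the core structure is a scheduler that decides what to run, a "maker" loop that produces work, an independent "checker" agent that scores the output, and files on disk that hold shared state. He emphasizes using an independent checker to avoid "self-rationalization", setting hard exit conditions to prevent runaway cost, and storing state on disk so it persists across context resets. A useful reference for engineers building long-running agent systems.

### The WebMCP standard proposal lands in Chrome (Origin Trials), enabling agentic web operations

Google announced that WebMCP has entered Origin Trials in Chrome 149, allowing websites to expose typed, named JavaScript functions and HTML forms directly to in-browser AI agents, so agents can reliably simulate user actions without relying on DOM scraping or screen recognition. An important infrastructure advance for developers who need to integrate agent capabilities into web pages.

### Achieving evolutionary database development: database branching on Lakebase, conclusion

The closing piece of a Databricks series, summarizing how copy-on-write database branching in Lakebase supports team-level evolutionary database development, including how to divide long-lived tier branches from temporary feature branches, a new definition of the DBA role, and a structured development framework for AI agents. Useful for teams on the Databricks stack that need to redesign database change management for the AI era.

### arXiv bans researchers over AI-hallucinated citations

arXiv has introduced a new policy banning researchers whose papers contain AI-hallucinated citations, triggering a strong reaction in academia. The core tension the policy exposes: AI writing assistance is already widely used in academia, yet responsibility for verifying citation accuracy still rests with the individual author. Who should answer for AI hallucinations, and how to define "reasonable bounds for using AI" in academic norms, are the real questions the episode leaves behind.

## Today's Reading Path

If you have only 20 minutes today, read in this order:

1. Deep Dive 3, Qoder engineering practice. The highest practical density. The realization that "the bottleneck moves from the model to the human" will change how you think about using AI coding tools. Every stage comes with concrete numbers and first-hand detail, and it is worth reading in full.
2. Deep Dive 1, the Fable 5 export controls. Take 5 minutes to learn the basic facts and potential impact. Export controls landing on a frontier model for the first time is a milestone worth remembering.
3. Deep Dive 2, Marc Andreessen's regulatory dichotomy. After the factual background of Deep Dive 1, read Andreessen's framework; the tension between the two makes the essay's value as a thinking tool clearer.

If you still have time, the GitHub Copilot CLI subagent delegation improvements and the write-up of the Claude Code engineer's workshop are good follow-on reading for Deep Dive 3; together the three give a full picture of "how to work better in the era of AI coding".

BestBlogs is an AI-driven personal reading assistant that helps you build a stable, trustworthy, personalized stream of high-quality information. It helps you judge what is worth reading, helps you understand it, and gradually learns what you care about. Try it: https://www.bestblogs.dev/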

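Summary: Citing national security, the US government required Anthropic to suspend access to Fable 5 and Mythos 5 for all foreign nationals, including foreign employees; this is the first time export controls have landed directly at the API access layer. On the same day, Marc Andreessen published a piece distinguishing "bad regulation" (bureaucracy) from "good regulation" (guardrails, brakes). An Alibaba engineer shared a six-month evolution: from Cursor assistance to autonomous CLI agent execution, then to three-layer delegation and continuous "sleep-time tokens" (睡后 Token) operation, concluding that the bottleneck has moved from model capability to human attention bandwidth.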

Rohan Paul @rohanpaul_ai · June 14 · 62

Vinod Khosla’s warning for India's BPO industry in the age of AI: the traditional IT services and BPO business “will be gone”, but India can still win if it shifts to deploying AI. ---- From the "SparX by Mukesh Bansal" YouTube channel (link in comment)

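Summary: Vinod Khosla says the traditional IT services and BPO business "will be gone", but India can still win if it shifts to deploying AI. The TCS chairman says the number of AI agents may eventually match headcount; the company has cut 12,000 jobs, has annualized AI revenue of $2.3 billion, and has a data center agreement with OpenAI. India's $315 billion IT services industry relies on low-cost labor; AI can run in clouds in Europe and the US under local rules, which erodes the location advantage. TCS expects hiring to decline, and the old outsourcing model may collapse in favor of software automation.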

Rohan Paul @rohanpaul_ai · June 14 · 42

Today’s AI agents still struggle to pass real human-verification checks (CAPTCHAs) on websites. The paper proposes HLL, a benchmark where agents must solve 10 types of CAPTCHA tasks by seeing the page, clicking or dragging correctly, tracking state, and submitting the answer. A useful agent must find the right box on a messy page, understand the instruction, click or drag in the right place, track what changed, recover from mistakes, and leave an interaction trail that looks consistent with the task. The paper shows that even strong agents can look smart on static tasks, then fail when the page is cluttered, the task is harder, or the system checks whether their actions were actually valid. ---- Link – arxiv.org/abs/2606.02449 Title: "HLL: Can Agents Cross Humanity's Last Line of Verification?"

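Summary: The paper proposes the HLL benchmark, which tests AI agents on 10 types of CAPTCHA tasks. Agents must view the page, click or drag correctly, track state changes, and submit the answer, while also finding the interactive element on a cluttered page, understanding the instruction, recovering from mistakes, and leaving a consistent action trail. Experiments show that even the strongest current agents do well on static tasks but still fail when the page is cluttered, the task is harder, or the system checks whether their actions were valid.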

elvis @omarsar0 · June 14 · 47

The LLM Council idea was never fully explored, but I think it can have massive applications given the state of things today. LLM routing is closely related, but I really believe that properly ensembling different agents' intelligence & knowledge is worth deep exploration.

elvis @omarsar0 · June 14 · 71

http://x.com/i/article/2065876120965111808

# Autonomous Long-Running Coding Agents

Autonomous coding is moving from better prompting to better control systems. The important shift is that engineers are learning how to wrap agents in goals, evaluators, loops, and artifacts that let them keep working after the human stops typing.

This matters because most serious engineering work spans long horizons: ambiguous requirements, hidden constraints, partial failures, changing context, and repeated verification. The new frontier is designing the system around the agent so it can plan, execute, check its work, recover from mistakes, and keep making progress without constant human steering.

This piece is based on a DAIR.AI Academy session on autonomous long-running coding agents, where I walked through Claude Code's /goal mode, the newer /loop command, verifiers, artifacts, and orchestration patterns in practice. Written in collaboration with Codex and Claude Code.

## From Prompting to Goal Design

The core idea behind features like Claude Code's /goal is simple. A coding agent remains the executor, but the human no longer interacts with it turn by turn. Instead, the human specifies the desired end state, the evidence required to prove success, the constraints that must not be violated, and, where possible, the number of turns and budget.

That goal works more like a contract than a longer prompt. A weak goal gives the model room to stop early, take shortcuts, or redefine success in a way that looks plausible in the transcript but fails in the real system. A strong goal gives the agent a target it can repeatedly measure itself against.

Engineering judgment still matters here. The best goals encode domain knowledge that the model would otherwise guess. For a research experiment, that might mean a target benchmark score, a held-out evaluation, a required loss curve, and a rule that the result must beat an initial baseline.
For a UI task, it might mean a screenshot reference, concrete layout constraints, and a browser verification step. The model can execute, but the human still defines what "done" actually means.

## The Evaluator Becomes a First-Class Component

Long-running agents need a second role besides the goal. That evaluator can be another coding agent, an LLM-as-judge, a script, a test suite, a benchmark harness, or a mix of all of them. The key design choice is matching the evaluator to the task.

When success is crisp, deterministic checks are better. Type checks, unit tests, lint rules, integration tests, and benchmark scripts should be used whenever they can express the condition clearly.

When success is fuzzy, an agent evaluator becomes useful. A script can tell you whether tests pass, but it cannot easily decide whether a generated research report is coherent, whether an implementation faithfully follows a paper, or whether a UI matches a design intent. This is where the evaluator benefits from language, judgment, and sometimes vision.

The practical pattern uses deterministic checks as the floor and agent evaluation as the higher-level review. That combination reduces hallucinated success while still allowing autonomy on tasks that do not fit cleanly into a test assertion.

## Verifiers Define the Boundary of Trust

The deeper point is that autonomy only works when the system has a reliable verifier. A coding agent can generate a plan, implement a feature, and explain why it believes the work is complete, but that explanation should not be treated as evidence. Evidence comes from an external check that the agent cannot easily talk its way around.

For code, the verifier might be a test suite, type checker, benchmark, browser run, screenshot comparison, or reproducible script. For research work, it might be a held-out evaluation, a reproduced table, a loss curve, or a benchmark score that improves over the baseline.
For design work, it might be a reference screenshot plus a visual review step. The verifier is what turns a long-running agent from a confident text generator into a system that can be trusted with more time.

Most shortcuts appear at this boundary. If the verifier is vague, the model will often satisfy the easiest interpretation of the task. If the verifier is too narrow, the model may overfit to it and miss the broader intent. A good autonomous workflow, therefore, needs layered verification, with cheap deterministic checks catching basic failures and higher-level review catching judgment-heavy failures.

A few of the frontier models can already achieve some level of verification, but based on my research, there is still an evident OOD problem, where if the verification task you assign to the agent falls outside the training distribution, models struggle significantly. Verifiers are still an open area of research, but I anticipate more companies will start to make huge investments in this area. The concept of fine-tuned verifiers is also in high demand in the enterprise.

## Loops Make Autonomy Durable

A goal gives the agent direction, but a loop keeps the work alive. This distinction is important because models often stop before the real task is finished. They may hit a turn limit, lose confidence, exhaust context, or decide that a partial solution is enough.

The loop is the outer control system. It wakes up, inspects progress, runs checks, compares the result against the goal, and sends the agent back in with the next instruction when the goal has not been met. In its simplest form, this is the Ralph loop pattern with a coding agent and a deterministic condition. In a more flexible form, the loop includes an evaluator agent that can reason about progress and decide what should happen next.

Long-running autonomy works as repeated effort under supervision from a control layer, not as one continuous act of intelligence.
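
The outer control loop described in this section can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Claude Code's /loop or /goal implementation: `run_loop`, `LoopResult`, `step`, and `verify` are hypothetical names, and the toy agent below only increments a counter so the loop has something to check.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class LoopResult:
    done: bool
    turns_used: int
    last_feedback: str


def run_loop(
    step: Callable[[str], None],              # hypothetical: runs one agent turn
    verify: Callable[[], tuple[bool, str]],   # external check: (passed, feedback)
    goal: str,
    max_turns: int,
) -> LoopResult:
    """Re-send the agent toward the goal until the verifier passes or the budget ends."""
    instruction = goal
    feedback = ""
    for turn in range(1, max_turns + 1):
        step(instruction)
        passed, feedback = verify()
        if passed:
            return LoopResult(True, turn, feedback)
        # The agent's own claims are ignored; only verifier output is fed back.
        instruction = f"{goal}\nVerifier feedback from turn {turn}: {feedback}"
    return LoopResult(False, max_turns, feedback)


# Toy stand-ins: each "turn" makes one more test pass (assumption for the demo).
state = {"passing": 0}


def toy_step(instruction: str) -> None:
    state["passing"] += 1


def toy_verify() -> tuple[bool, str]:
    return state["passing"] >= 3, f"{state['passing']}/3 tests passing"


result = run_loop(toy_step, toy_verify, "Make all 3 tests pass", max_turns=5)
print(result)
```

The turn budget is the hard exit condition: without it, a verifier that can never pass would keep the loop, and the spend, running indefinitely.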
The agent can still fail, but the loop gives the system a way to notice the failure and continue instead of silently declaring victory. ## Planning Is Where Expertise Enters One of the strongest themes from the session was that planning remains critical. You can ask a frontier model to generate a plan, but you still need to inspect it, challenge assumptions, and make the success criteria sharper before handing the task to an autonomous loop. This leads to a useful division of labor. A stronger planning model can help define the goal, identify missing constraints, and structure the evaluation. A different execution model can then run the implementation once the plan is clear. In practice, this means engineers should stop thinking of "the model" as a single choice. Model choice becomes an architecture decision. Some models are better planners. Some are better executors. Some are cheaper evaluators. Some are better at vision-based review. A good orchestrator lets you swap these roles instead of waiting for one vendor to provide the perfect coding agent interface. ## Visual Artifacts Become Control Surfaces Terminal transcripts do not scale when many agents are running. Once you have several sessions working in parallel, raw text becomes a poor interface for understanding progress. Live artifacts matter because a dashboard with loss curves, benchmark scores, task states, screenshots, cost estimates, and recent decisions gives the human a much better way to supervise autonomy. The artifact becomes the control surface for deciding when to intervene, rather than a report generated after the fact. The most useful pattern is to separate storage from presentation. Markdown or a vault can store durable evidence, logs, notes, plans, and results. HTML artifacts can render that state into something visual and interactive. The agent can search the Markdown, while the human can monitor the artifact. For UI and product work, visual cues are especially powerful. 
A screenshot reference can communicate design intent more precisely than prose, and a vision-capable evaluator can compare the implementation against that reference. This reduces the common failure mode where the agent technically implements the requested component but misses spacing, hierarchy, alignment, or product feel. ## Session Mining Turns Usage Into Memory Another important insight is that past agent sessions are a rich source of workflow data. If an agent repeatedly fails in the same way, forgets to run the same check, uses the wrong path, or retries the same broken command, that pattern should not stay buried in logs. Session mining turns those transcripts into operating rules. An agent can scan the last thirty days of work, find recurring failure modes, and propose updates to project instructions, vault learnings, or agent rules. This is how a team can gradually improve its harness without manually remembering every mistake. The goal is to make the local environment smarter without training a model from scratch. A small rule in an agent instruction file can prevent repeated failures across future sessions, especially when the rule is specific to the project. ## A Practical Operating Model For AI engineers, the emerging workflow looks like this. - Start with a small, cheap subset before launching the full autonomous run. - Write a goal with measurable success criteria, explicit constraints, and a turn or time budget (where possible). - Separate the executor from the evaluator so implementation and judgment are not collapsed into one role. - Define external verifiers before the long-running loop starts. - Use deterministic checks wherever possible, then add agent review for fuzzy criteria. - Require proof artifacts such as logs, screenshots, benchmark curves, or changed files. - Mine past sessions and promote repeated lessons into project instructions. That is the difference between using a coding agent and engineering an autonomous coding system. 
One gives you a conversation. The other gives you a harness. ## What Still Breaks None of this removes the hard problems. Agents still take shortcuts. They still stop early. They still overestimate completion. They still produce confident but weak plans, especially on recent papers, unfamiliar benchmarks, or systems outside their training distribution. Trusting them more will not solve this. Better control systems will. Goals, loops, evaluators, deterministic checks, visual artifacts, and session memory are all ways of making autonomy observable and correctable. The direction is clear. The future of coding agents depends on better orchestration around more capable models, where engineers design the conditions under which agents can safely run for hours or days and still produce work that can be verified.
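
The goal, loop, and layered-verification ideas from the article can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `Goal`, `CheckResult`, and `run_loop` are hypothetical names, and the executor and evaluator are plain callables standing in for whatever coding agent and reviewer you actually use. It does not reproduce any vendor's /goal implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Goal:
    description: str   # the end state the human wants
    max_turns: int     # turn budget for the outer loop


@dataclass
class CheckResult:
    name: str
    passed: bool
    detail: str = ""


# Hypothetical interfaces: an executor turns an instruction into work output,
# a check or evaluator inspects that output and returns a CheckResult.
Executor = Callable[[str], str]
Check = Callable[[str], CheckResult]


def run_loop(goal: Goal, executor: Executor,
             deterministic_checks: List[Check],
             evaluator: Check) -> Tuple[bool, List[Tuple[int, str]]]:
    """Outer control loop: execute, verify, and re-instruct until done or out of budget."""
    instruction = goal.description
    history: List[Tuple[int, str]] = []
    for turn in range(1, goal.max_turns + 1):
        work = executor(instruction)

        # Floor: cheap deterministic checks run first.
        failed = [r for r in (check(work) for check in deterministic_checks)
                  if not r.passed]
        if failed:
            instruction = "Fix failing checks: " + "; ".join(
                f"{r.name}: {r.detail}" for r in failed)
            history.append((turn, "deterministic_fail"))
            continue

        # Higher-level review only runs once the floor is satisfied.
        review = evaluator(work)
        if review.passed:
            history.append((turn, "done"))
            return True, history
        instruction = "Reviewer feedback: " + review.detail
        history.append((turn, "review_fail"))

    # Budget exhausted: report failure instead of declaring victory.
    return False, history
```

The agent's own claim of completion never appears in the loop; only the external checks and the separate evaluator decide whether the run ends.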

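译The core of long-running coding agents shifts from prompts to control systems. In a DAIR.AI Academy session, Elvis Saravia walks through Claude Code's /goal mode: the human specifies the end state, the evidence of success, the constraints, and the budget, so the goal acts as a "contract" rather than a long prompt. The evaluator becomes a first-class component: crisp tasks get deterministic checks (tests, lint, benchmarks), fuzzy tasks get an agent evaluator (judging reports or UI design), and combining the two reduces hallucinated success. Verifiers define the boundary of trust: external checks (test suites, type checks, browser runs, screenshot comparisons) provide evidence the agent cannot talk its way around.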

elvis@omarsar0 · 6月14日53

Notes on the recent session we had related to autonomous long-running coding agents. (bookmark it) Topics: /goal, loop engineering, verifiers, dynamic workflows, and much more. So much to unpack, so I tried to quickly summarize the most relevant parts using my writer agent.

译关于我们最近一次关于自主长期运行编码智能体的会议的笔记。（收藏它）主题：/goal、循环工程、验证器、动态工作流等等。内容太多，所以我尝试用我的写作智能体快速总结最相关的部分。

宝玉@dotey · 6月14日51

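Why hasn't Codex shipped something like a Codex Design?

Anthropic recently launched Claude Design. Apart from coding, it is the agent I use most, and I have recommended it many times. The results are genuinely good: describe the app you want in one sentence and it generates an interactive prototype where everything you click responds. Unless you look closely, you would think you were using a real app. Someone asked: why hasn't Codex shipped a product like Codex Design? In short, GPT-5.5's model capability is not yet up to the job. Explaining why requires one key distinction first.

【1】The two layers of an agent: model and harness
Many people lump Codex and Claude Design together with GPT-5.5 and Claude Opus 4.8, but they are two entirely different layers. Claude Design and Codex are the "product layer", which the industry calls the harness: prompts, toolchain, UI interaction flow, and other engineering-level pieces. Claude Opus 4.8 and GPT-5.5 are the "model layer", the brain that does the actual work. An analogy: the harness is the kitchen, with its pots and pans (tools) and recipes (Skills), and the model is the chef. Same kitchen, different chef, completely different dishes. With that distinction in place, the rest is easy to explain.

【2】The harness is not the barrier
The harness layer of Claude Design is not technically complicated. With a bit of reverse engineering, you can recover nearly all of the prompts and tool code. I have already done this; the result is baoyu-design (https://github.com/JimLiu/baoyu-design), which uses a Skill to run Claude Design on other models. There are no engineering secrets. What really separates the products is the model behind them.

【3】High-fidelity interactive prototypes are hard because of the model
The name Claude Design is easy to misread as delivering static mockups like Figma or Photoshop. It actually goes a step beyond Figma: it delivers a high-fidelity interactive prototype that merges the design and the prototype, so you can not only see the design but also operate it directly. That places heavy demands on the model.

Take an example. Say I want to build a client like X or Weibo. Plenty of models can draw a good-looking static interface. Making that interface interactive is much harder: switching between timelines, showing different tweet types (text, image, video), turning the like button into a red heart, removing a deleted tweet from the list, and opening a detail view from the list and returning with state preserved. To do all that, the model must think through the entire data structure and state management before it draws any UI: what a tweet looks like, which kinds of timeline exist, what state each button is in, and how states affect one another. That is system architecture work, not UI drawing work.

Claude Design requires a model that has both excellent UI/UX design ability and system architecture ability; lacking either one sharply degrades the result. This is also why I previously argued against producing design mockups as plain HTML only: that is static UI design with no UX interaction folded in.

If you can, test it yourself. For example, use this prompt: Design a X Client for Mac, similar to Tweetbot for Mac from Tapbots

Give the same prompt to Codex and it will produce something that looks fine and supports simple interaction. Compare the two and the gap is obvious: the list scrolls, but the sidebar cannot be clicked, and the like button does nothing. It takes several rounds of iteration to reach a barely passable level. What Claude Design produces is completely different. Switching from the timeline to notifications, or opening a detail view from the list and going back, is smooth throughout, with state preserved. Unless you look closely, you would think you were using a highly finished app, even though all the data is mocked. Claude Opus 4.8 has evidently had extensive training and optimization for design and architecture scenarios like this.

【4】The output is code
Look at what Claude Design produces, and note the data.jsx file. It defines the data structure of the whole design clearly, mocks a complete dataset on top of that structure, and then builds the UI over that data with React. The design artifact itself is code (React, CSS, JSON), not Figma or PSD, so any developer can read the button radius, primary color, and spacing directly and implement them in their own stack. Later design changes? A git diff shows exactly what changed. Communication loss between design and development drops to a minimum. To be more precise, communication loss between the design agent and the development agent becomes very low, since these days a human directs one agent to design and another agent to write code.

【5】How to get the most out of Claude Design
Many people are unsure how to use Claude Design well. It is a bit like vibe coding: start with a rough idea, have it build a first version, then direct the agent through chat to make changes. After a few revisions your own thinking becomes clear. The adjustment process feels almost magical, as if your words take effect the moment you say them; whatever change you ask for, it manages to implement. That is why I am hooked on Claude Design: the feedback loop is fast and deeply satisfying. One more tip: do not state overly specific requirements. Describe the goal you are after and let it improvise. That often gives better results, since it has been trained on nearly all public UI design.

Back to the original question. Codex has not shipped a comparable design product because GPT-5.5 cannot yet carry this workload. Many models can draw a good-looking interface; the hard part is thinking through the data structure, state management, and interaction logic before starting, then delivering a complete interactive prototype in one pass. So far only Claude's models have done it. How long that lead lasts depends on how fast OpenAI's or other vendors' models evolve.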

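译Anthropic launched Claude Design, which generates high-fidelity interactive prototypes from a single sentence. Users asked why OpenAI's Codex has no comparable product. The key is the gap at the model layer. An agent consists of a harness (product layer) and a model layer. The harness is not the barrier (the open-source baoyu-design already reproduces it); the real moat is that Claude Opus 4.8 combines UI/UX design ability with system architecture ability, defining data structures, state management, and interaction logic before delivering a complete prototype. GPT-5.5, by contrast, produces weak interactivity. The output is React/CSS/JSON code.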

宝玉@dotey · 6月14日71

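Codex has two modes for driving a browser: a Chrome extension and a built-in browser. After using both for a while, here is my summary of how they differ and which scenarios each suits.

【1】First, an underrated use: Codex as a crawler
Traditional crawlers fetch pages with requests or headless Playwright. Anti-bot controls keep getting stricter: fingerprinting, behavioral analysis, and CAPTCHAs take turns, and many sites block you the moment they detect programmatic requests. Codex's browser is different. It drives a real browser with a full rendering engine, a real user agent, and a normal JavaScript execution environment, so to the site it looks like an ordinary user browsing pages. Combined with /goal mode, you set a goal (for example, "scrape the name, price, and rating of every product on this site and save them as CSV"), and Codex plans the steps, paginates, and handles exceptions on its own, with no step-by-step direction from you. That is far less work than writing your own crawler script. But Codex's two browser modes behave very differently, and picking the right one saves a lot of effort.

【2】Chrome extension mode: powerful, but resource-hungry
The Chrome extension mode, invoked with @Chrome, has one core advantage: shared login state. It runs directly inside your own Chrome, inheriting all your cookies, logged-in sessions, and installed extensions. Content that requires login, such as paywalled articles, internal admin consoles, customer data in a CRM, or social platforms that require sign-in, is directly accessible, because to the site it is you operating the browser. When working in Chrome, Codex puts its tasks in a separate tab group so it does not interrupt the page you are viewing. It also supports the DevTools protocol, so it can capture performance data, inspect network requests, and debug Console errors.

The cost is just as clear: resource consumption is considerable. Chrome is already a memory hog, with every tab its own process. The Codex extension adds a control layer on top, with screenshots, DOM parsing, and command exchange all running, so memory and CPU usage get very high. On an underpowered machine (say, a laptop with 8 GB of RAM) you will notice lag, and batch crawling is even more painful. Long runs are also prone to screenshot latency and state drift. In addition, the Chrome extension currently supports only macOS and Windows; Linux users cannot use it for now. It does not support headless mode either, so the Chrome window must stay open.

Best for: short tasks that need login state, such as logging in to a platform to pull a batch of data, doing bulk operations in an internal tool, or exporting information from a CRM.

【3】Built-in browser mode: light and fast, but limited
The built-in browser, invoked with @Browser, is Codex's own sandboxed browser environment. Its biggest advantage is being lightweight. It does not need to launch all of Chrome, uses far fewer resources, and responds quickly, which suits scenarios that drive the browser frequently. But it has one fundamental limitation: it does not have your login state. It inherits no cookies, no browser extensions, and no saved sessions. To open a page that requires login, you have to log in again inside the built-in browser. Some sites with strict anti-bot controls are also more sensitive to this kind of non-standard browser environment. I tried logging in to X in the built-in browser and failed repeatedly, most likely because X's risk controls flagged an unusual browser fingerprint.

Where the built-in browser really shines is front-end development and debugging. It has an Annotation Mode: you select an element or draw a box around a region directly on the rendered page and write notes such as "move this button up", "make the font bold", or "this spacing is too large", and Codex treats those annotations as executable instructions. That is far more intuitive than describing in text "reduce the margin-top of the second button in the third row by 8px". Together with Developer Mode, the built-in browser can also run performance profiling, capture network requests, and read Console output, which makes debugging a local dev server very pleasant.

Best for: scraping public pages, local development and debugging, and web tasks that do not need login state.

【4】How to choose
In short: use the Chrome extension when login is required, and the built-in browser when it is not. If your machine is limited and you need to scrape a lot of public data, the built-in browser is the better choice. If the target site only shows content after login, or its anti-bot controls are strict enough to require a real browser fingerprint, the Chrome extension is the only option, but be prepared for the resource cost. Codex also decides for itself which browser a task needs. Its priority is: use a dedicated plugin if one exists (for example, the Jira or GitHub integrations), use Chrome if login state is needed, and use the built-in browser otherwise.

Browsers are useful for far more than crawling, of course. I find the built-in browser a better front-end debugging experience than many dedicated tools; Annotation Mode combined with Codex's understanding is close to "point at it and it changes". The Chrome extension is also practical for automating internal enterprise tools, such as periodically exporting data from an admin console or updating records in bulk. There is plenty left to explore in these scenarios, so try them against your own needs.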
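
The selection priority described in the post (dedicated plugin first, then Chrome when login state is needed, otherwise the built-in browser) can be written down as a small decision function. This is a sketch of the heuristic as the post states it, not Codex's actual logic; `Task`, `Tool`, and `choose_tool` are hypothetical names, and falling back to the built-in browser on unsupported platforms is an assumption made here.

```python
from dataclasses import dataclass
from enum import Enum


class Tool(Enum):
    PLUGIN = "dedicated plugin"
    CHROME = "@Chrome extension"
    BUILTIN = "@Browser built-in browser"


# Per the post, the Chrome extension currently supports only these platforms.
CHROME_PLATFORMS = {"macos", "windows"}


@dataclass
class Task:
    has_dedicated_plugin: bool = False  # e.g. a Jira or GitHub integration exists
    needs_login: bool = False           # content is only visible when signed in
    platform: str = "macos"             # hypothetical platform label


def choose_tool(task: Task) -> Tool:
    """Apply the priority order: plugin, then Chrome for login state, else built-in."""
    if task.has_dedicated_plugin:
        return Tool.PLUGIN
    if task.needs_login and task.platform in CHROME_PLATFORMS:
        return Tool.CHROME
    # Assumption: where the extension is unavailable, fall back to the built-in
    # browser, which means signing in again inside the sandbox.
    return Tool.BUILTIN
```

For example, `choose_tool(Task(needs_login=True, platform="windows"))` selects the Chrome extension, while the same task on Linux falls back to the built-in browser.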

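译Codex can drive a browser in two modes: a Chrome extension and a built-in browser. The Chrome extension inherits your login state, can reach paywalled content, internal admin tools, and other pages that require sign-in, and supports DevTools, but it is resource-heavy (a laptop with 8 GB of RAM will lag), supports only macOS and Windows, and needs the window kept open. The built-in browser is light and fast, runs in its own sandbox, and offers an Annotation Mode for visually marking up UI changes, which suits front-end debugging and scraping public pages, but it has no login state and sign-in may fail on sites with strict anti-bot controls. Recommendation: use the Chrome extension when login is required, otherwise the built-in browser.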

elvis@omarsar0 · 6月14日65

Own the harness, own the agent orchestrators. Great to see open-source work starting to enable it. Being able to compose and combine multiple agents is clearly the future to avoid model lock-in. Curious how routing works, as that remains unsolved.

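译Elvis Saravia notes that owning the harness means owning the agent orchestrator, that open source is driving this trend, and that composing multiple agents avoids model lock-in, though routing remains unsolved. @matei_zaharia has open-sourced Omnigent, a meta-platform that sits on top of Claude Code, Codex, Pi, and various agent SDKs, supports building multi-agent coding setups and custom agents, and provides real-time collaboration and rich control policies.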

Chubby♨️@kimmonismus · 6月14日45

Having access to different AI tools isn't the bottleneck anymore, it is the cognitive load of orchestrating them. LobeHub is tackling this systemic challenge with a new operational paradigm called the Chief Agent Operator (CAO). Instead of requiring users to micromanage individual tasks, the CAO serves as an autonomous management layer handling cross-tool orchestration behind the scenes.

译拥有不同AI工具不再是瓶颈，协调它们的认知负担才是。LobeHub正用一种名为"首席智能体操作员（CAO）"的新操作范式应对这一系统性挑战。 CAO不再要求用户微观管理单个任务，而是作为一个自主管理层，在后台处理跨工具编排。

🚨 AI News | TestingCatalog@testingcatalog · 6月14日35

Google is working on the Skills Marketplace for Gemini Business and Enterprise. We need this on the consumer too 👀

译Google 正在为 Gemini 商业版和企业版开发技能市场。消费者也需要这个 👀

Rohan Paul@rohanpaul_ai · 6月14日65

Adaline just launched a self-improvement layer for AI agents that turns messy production traces into fresh evals, synthetic edge cases, and better agent candidates for humans to approve. I expected it to be a regular trace viewer, but it is reading my production traffic and building evals I would never have considered. It reads production traffic and user feedback, then clusters the mess into recognizable agent behaviours rather than asking a human to manually inspect every strange conversation.

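译Adaline 2.0 introduces a self-improvement layer for AI agents that automatically turns production traffic and user-feedback traces into behavior clusters, then generates evals and synthetic edge-case data, and uses them to produce new agent candidates. Developers only need to review the winning version before shipping. The tool removes the need to inspect odd conversations one by one and can surface eval cases that humans would be unlikely to think of.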

Yuchen Jin@Yuchenj_UW · 6月14日62

This is super exciting - I’ve been using Omnigent at Databricks for a while, and today we open-sourced it. Omnigent is a meta-agent for orchestrating a swarm of agents. Why do we need this? The best results no longer come from a single model running in a single harness. I used to run the same task with Codex and Claude Code, then pick the better one. But the obvious thing is to let them collaborate, debate, and converge on something better. Omnigent makes this smooth. The other feature I love is real-time collaboration. You can invite people into an Omnigent session to watch, steer, and send commands. Multi-agent, multi-human collaboration is the future. Omnigent was built by @matei_zaharia and a very lean team in just 6 weeks, working every day out of a Databricks war room, truly amazing. Databricks AI really feels like a startup.

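译Databricks has open-sourced Omnigent, a meta-agent orchestration framework that sits on top of agent tools and SDKs such as Claude Code, Codex, and Pi. It lets multiple AI agents collaborate, debate, and converge on better results, and supports real-time human collaboration: you can invite others into a session to watch, steer, and send commands. Omnigent was built in 6 weeks by a small team led by Matei Zaharia and is now open source.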

Rohan Paul@rohanpaul_ai · 6月14日44

Nice survey paper mapping agentic reinforcement learning for LLMs, showing how models learn by acting across time. Covers 500+ works and groups them into a 2-part map of capabilities and applications. The problem is that common LLM training rewards a single answer once, then stops learning. Real tasks need many steps, partial information, and choices that affect what happens later. The survey formalizes that setup as an agent that sees a bit, chooses an action, and gets feedback. That perspective uses memory to track context, planning to pick sequences, and tools to affect the world. It also includes reasoning for constraint handling, perception for multimodal inputs, and self-improvement to refine policies. Reinforcement learning links all of this, because rewards arrive after sequences, so the policy learns what to try next. ---- Paper – arxiv.org/abs/2509.02547 Paper Title: "The Landscape of Agentic Reinforcement Learning for LLMs: A Survey"

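译This survey maps agentic reinforcement learning for large language models, covering 500+ works grouped along two dimensions: capabilities and applications. It points out that conventional LLM training gives a single reward for a single answer and cannot handle the multi-step decisions, partial information, and delayed feedback of real tasks. The agent learning framework includes memory to track context, planning to choose action sequences, and tools to affect the environment, and it integrates reasoning for constraint handling, perception for multimodal inputs, and self-improvement to refine policies. Reinforcement learning ties all of this together: rewards arrive at the end of a sequence, and the policy uses them to learn what to do next.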

Chubby♨️@kimmonismus · 6月14日45

Went in expecting another trace dashboard. Instead Adaline reads the production traffic nobody on the team has time to read, clusters it into real behaviors, and writes hundreds of fresh evals against them every day. Then it assembles the stronger agent candidates and hands them to you. And the smart part: nothing ships until you sign off. That's the first "self-improving" pitch that actually holds up.

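译Adaline 2.0 is a self-improvement layer for agents. It clusters production traffic traces into real behaviors, automatically surfaces problems, and generates evals and data, writing hundreds of new evals every day. It then builds and tests stronger agent candidates, which ship only after the user reviews and approves them. Unlike an ordinary dashboard, it delivers genuine automated iteration while keeping final approval with a human.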

向阳乔木@vista8 · 6月13日13

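The World Cup is here! Use the Goal Skill to generate a match-schedule subscription site from a single sentence. I'm having Codex build a 2026 World Cup schedule site, handy for me and easy for friends to subscribe to. Starting the run now; let's see how long it takes to finish.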

meng shao@shao__meng · 6月13日65

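I used Saturday to ship an update to my "infocard-skills". It mainly improves how sensibly layouts adapt across aspect ratios, avoiding large blank areas, crowding, and truncation, while keeping the original Swiss International Style. Take a look at how it renders across the eight styles; I'm fairly happy with it myself. If you're interested, see: https://github.com/shaom/infocard-skills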

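译meng shao (@shao__meng) updated the open-source project infocard-skills, improving layout across aspect ratios to avoid blank areas and truncation while keeping the Swiss International Style. It supports common info-card ratios such as 16/9, 4/3, and 1/1 as well as cover ratios, with 4/3 as the default. The user supplies content and a ratio, and an AI agent uses the Skill to generate HTML and screenshot it to a PNG. The project is open source on GitHub.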

Peter Steinberger 🦞@steipete · 6月13日48

I can barely keep up with implementing/testing/landing all the Issues/PRs folks submit to https://github.com/openclaw/crabbox#providers Codex runs INSIDE crabbox while it is building crabbox. This is becoming essential infra for my work. Codex has been looping nonstop for the last 4 days in multiple trees. Since all of it is e2e verifiable it basically builds itself. Codex even signs up for the services automatically via browser/computer use. My main job is adding credit card details and closing things that I don't see as a fit.

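译Peter Steinberger shares his experience using Codex on his project crabbox. Codex runs inside crabbox while building crabbox itself. It has been looping nonstop for 4 days across multiple trees. Because everything is end-to-end verifiable, the project can almost build itself. Codex can also sign up for the services it needs automatically via browser/computer use. The author's main remaining job is adding credit card details and closing things that are not a fit.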

Rohan Paul@rohanpaul_ai · 6月13日44

Kai-Fu Lee (founder of Sinovation Ventures) explains how the future is all about multi-agent systems. 1 agent today is like a pre-internet PC, useful but isolated. Connect agents, and they share context, split tasks, and coordinate instantly.

译李开复（创新工场创始人）解释了未来全是关于多智能体系统。今天的一个智能体就像一台前互联网时代的PC，有用但孤立。连接智能体，它们就能共享上下文、拆分任务并即时协调。

小互@xiaohu · 6月13日72

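Whoa, Telegram just shipped a major update:
- Bots can now send rich text
- AI can help you manage group chats
- Telegram is finally available on watches

Won't WeChat need a decade to catch up on this?? Bots used to be limited to plain-text replies. AI bots now support:
- Tables, checklists, nested quote blocks
- Inline images, image carousels, collages
- Collapsible sections, footnotes, heading anchors
- Math formulas, superscripts and subscripts

A single message can hold up to 32768 characters, and anything over 8000 characters is automatically folded behind a "Show more" button. That suits AI bots producing long answers or content cards.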

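译Telegram shipped a major update: bots now support rich-text messages, including tables, checklists, nested quote blocks, inline images, image carousels, collapsible sections, footnotes, heading anchors, math formulas, and superscripts and subscripts. A single message can contain up to 32768 characters, and anything over 8000 characters is automatically folded behind a "Show more" button. The update also introduces AI-assisted group chat management, and it suits AI bots that produce long answers and content cards.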

OpenRouter@OpenRouter · 6月13日62

New server tool: Subagent 🤖 Your model can now delegate focused sub-tasks to a smaller, cheaper, faster model mid-generation. The big model orchestrates, the subagent executes. The subagent can use any model on OpenRouter.

译新的服务器工具：Subagent 🤖 你的模型现在可以在生成过程中，将聚焦的子任务委派给一个更小、更便宜、更快的模型。大模型负责统筹，子智能体负责执行。子智能体可以使用 OpenRouter 上的任何模型。

Berryxia.AI@berryxia · 6月13日72

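Folks, the Chinese models have been shipping updates like crazy these past few days! Kimi has cured the most annoying habit of coding models, "overthinking": version 2.7 burns 30% fewer tokens than the previous generation, yet its success rate on long agent tasks has gone up substantially. Kimi-K2.7-Code was officially open-sourced today, with Kimi Code Bench v2 up 21.8%, Program Bench up 11%, and MLS Bench Lite up a full 31.5%; instruction following and end-to-end completion rates are both clearly better. The biggest headache with long-horizon coding agents used to be that the model kept thinking more, burning more tokens, and then giving up halfway. Kimi has broken through that bottleneck with more efficient reasoning, and put the weights and code on Hugging Face while at it. On top of that, they teased an upcoming 6x High-Speed Mode that should push coding efficiency further. The open API and Kimi Code are available today, and a Beta program lets developers try new features early. This update shows that real progress in coding agents comes not from piling on parameters but from pushing "think less, do better" to the limit. With the model open source, the community can modify, combine, and deploy this capability directly. Coding models used to swing between "smart but inefficient" and "efficient but dumb"; Kimi has offered a third path.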

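译Kimi released and open-sourced its latest coding model, Kimi-K2.7-Code. Compared with K2.6, it improves by 21.8% on Kimi Code Bench v2, 11% on Program Bench, and 31.5% on MLS Bench Lite. The core improvement addresses "overthinking" in coding models: reasoning token usage drops by 30%, while instruction following and end-to-end success rates on long-horizon coding tasks rise markedly. Weights and code are on Hugging Face, and the model is available through the Kimi API and Kimi Code, with a Beta program also open. The team teased an upcoming 6x High-Speed Mode to further improve coding efficiency.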

Berryxia.AI@berryxia · 6月13日59

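Whoa! We've been using the Fable 5 model wrong from the start! The original post is worth a few minutes; it has real value and ideas. Most people use Claude Fable 5 as a Sonnet 4.6 with a bigger context window: ask a question, use it for 5 minutes, close the tab. 90% of users have never run an agent system that truly compounds and keeps growing, where each run makes the next one smarter, state files keep accumulating, and skills keep getting refined. Fable 5 is a model designed to run continuously for days, and you used it for a few minutes. (I mean, the damn quota isn't enough anyway!) 😆 The author builds a self-improving system in 14 steps that can make your Fable 5 take off.

I. What Fable 5 actually unlocks
1. Mythos-class model - Released June 9, 2026, the first public Mythos-class model (one tier above Opus). Core capabilities:
• Multi-day autonomous sessions
• Built-in self-verification
• The most complex coding work
• Multi-stage knowledge work
2. Self-improvement ≠ self-learning - The model weights do not change, but the system environment gets smarter: each session writes down lessons learned, skills are refined on edge cases, and state files accumulate verified facts.
3. The compounding stack: a four-layer architecture
• Layer 1: primitives (Fable 5 itself, sub-agents, tools)
• Layer 2: orchestration (goal loops, dynamic workflows, routines)
• Layer 3: memory (state files, skill library, knowledge base)
• Layer 4: self-improvement (visual self-check, eval loops, rule distillation)
4. Which model to use when - Route by task complexity:
• Fable 5: heavy orchestration roles
• Opus 4.8: complex but bounded subtasks
• Sonnet 4.6: high-frequency worker tasks
• Haiku 4.5: grading sub-agents

II. Three key pattern designs
5. /goal vs Outcomes + verifier sub-agent - An independent verifier beats self-critique.
6. A model evaluating its own output is biased toward the conclusions it has already written.
7. Dynamic workflows - Three key patterns: fan-out and synthesize, adversarial verification, loop until done
8. Worktrees for parallel safety - Avoid file conflicts when multiple agents work in parallel
9. Routines for long-running orchestration - Close the laptop and Fable 5 keeps working

III. The self-improvement layer
10. Five-stage memory evolution: failure → investigation → verification → distillation → consultation
• Sonnet 4.6 stops at stage 1
• Opus 4.7 stops at stage 3
• Fable 5 can complete the whole flow
11. State files - Where memory actually lives, with 5 sections matching the 5 stages
12. Skill compounding - Write lessons learned into the skill itself, not just the chat log
13. Visual self-verification - Fable 5 uses vision to check whether UI output matches the goal
14. Mythos safety boundary - In cybersecurity, biology, chemistry, and model distillation it automatically downgrades to Opus 4.

Applying the model's capabilities where they are truly needed, on projects that suit you, and tuning the setup to its best state is the best way to squeeze out every last token 😄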
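
The "route by task complexity" idea in item 4 can be illustrated with a small lookup. This is a minimal sketch, assuming four hypothetical task kinds; the model names are copied from the post as plain labels and are not verified API model identifiers.

```python
# Task kinds are hypothetical labels; model names are taken from the post as-is.
ROUTES = {
    "orchestration": "Fable 5",      # heavy orchestration roles
    "bounded_subtask": "Opus 4.8",   # complex but bounded subtasks
    "worker": "Sonnet 4.6",          # high-frequency worker tasks
    "grading": "Haiku 4.5",          # grading sub-agents
}


def route(task_kind: str) -> str:
    """Return the model label assigned to a task kind, or fail loudly."""
    try:
        return ROUTES[task_kind]
    except KeyError:
        raise ValueError(f"unknown task kind: {task_kind!r}") from None


def plan_assignments(task_kinds: list[str]) -> dict[str, int]:
    """Count how many tasks in a plan would go to each model."""
    counts: dict[str, int] = {}
    for kind in task_kinds:
        model = route(kind)
        counts[model] = counts.get(model, 0) + 1
    return counts
```

Keeping the table in one place makes the routing decision explicit and easy to change when costs or model strengths shift.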

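译Most users treat Claude Fable 5 (the first public Mythos-class model, released June 9, 2026) as a Sonnet 4.6 with a larger context window for one-off questions, but Fable 5 is designed for agent systems that run continuously for days and supports self-improvement: each run makes the next one smarter, state files accumulate, and skills keep getting refined. The article lays out 14 steps for building a self-improving system, covering a four-layer architecture (primitives, orchestration, memory, self-improvement), task routing (Fable 5 for heavy orchestration, Opus 4.8 for complex subtasks, Sonnet 4.6 for high-frequency worker tasks, Haiku 4.5 for grading), dynamic workflow patterns, and five-stage memory evolution (failure → investigation → verification → distillation → consultation). In cybersecurity, biology, chemistry, and model distillation it automatically downgrades to Opus 4.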

meng shao@shao__meng · 6月13日46

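To train the Composer model at scale, the Cursor team built an always-running agent fleet system. At its core it is a loop that lets thousands of agents work together and manage themselves.

# System architecture and how it works

Main agent (Fleet Manager):
· Runs on a large remote machine, equipped with the usual local tools plus a file on disk that serves as the "inbox" (the fleet's shared inbox)
· Connects over SSH to hundreds of sub-agent machines, collects their status, and writes it to the inbox
· Checks fleet health on every loop iteration:
  · Keeps healthy tasks running in the background
  · Pushes failures and anomalies to Slack or PagerDuty
· Can actively control the fleet: kill and restart processes and handle transient failures

Sub-agents: hundreds of research-task agents running in parallel, each focused on a specific experiment.

Foundation: building on Cursor's previously published long-running agent research, the main agent is given a number of Skills that encode the tacit knowledge of running ML experiments, reviewing monitoring results, and so on.

Key design: it uses Cursor's own product, with the inbox file plus good skills providing state sharing and coordination.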
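
One iteration of a fleet-manager loop like the one described above might look like the following. This is a rough sketch, not Cursor's implementation: `fleet_tick`, the status strings, and the JSON-lines inbox format are all assumptions, and the SSH polling, restart, and Slack/PagerDuty calls are replaced by injected callables.

```python
import json
from pathlib import Path
from typing import Callable, Dict, List, Tuple


def fleet_tick(workers: Dict[str, Callable[[], str]],
               inbox: Path,
               restart: Callable[[str], None],
               alert: Callable[[str, str], None]) -> List[Tuple[str, str]]:
    """Run one health check over the fleet and record every status in the inbox.

    workers maps a worker name to a poll function returning a status string
    (hypothetical values: "healthy", "transient_failure", or anything else).
    """
    actions: List[Tuple[str, str]] = []
    with inbox.open("a", encoding="utf-8") as f:
        for name, poll in workers.items():
            status = poll()
            # The shared inbox is an append-only log of observed state.
            f.write(json.dumps({"worker": name, "status": status}) + "\n")

            if status == "healthy":
                continue  # leave healthy tasks running in the background
            if status == "transient_failure":
                restart(name)  # handle transient failures directly
                actions.append((name, "restarted"))
            else:
                alert(name, status)  # escalate everything else to a human
                actions.append((name, "alerted"))
    return actions
```

Calling `fleet_tick` on a timer gives the always-running loop; because every observation lands in the inbox file, other agents or humans can read fleet state without talking to the manager directly.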

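译To train the Composer model, the Cursor team built an always-running agent fleet system. The main agent (Fleet Manager) runs on a remote machine, connects over SSH to hundreds of sub-agent machines, and uses local tools and an on-disk "inbox" file for state sharing and coordination. Each loop iteration checks fleet health, pushes failures to Slack/PagerDuty, and actively kills or restarts processes. Sub-agents run research experiments in parallel. The system builds on earlier long-running agent research, and the main agent has Skills that encode the tacit knowledge of ML experimentation. At its core, it uses Cursor's own product, with the inbox file and Skills enabling large-scale agent coordination and self-management.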