Just landed nested subagent support in Claude Code. Starting to experiment more with agents kicking off agents as a way to better manage context. Capped at depth=5 to start, going out in today’s release. Lmk what you think!

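Summary: Nested subagent support has just landed in Claude Code. The team is experimenting more with agents launching other agents as a way to better manage context. Depth is initially capped at 5, shipping in today's release. Feedback welcome!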
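The post says only that nesting is "capped at depth=5". The following is a hypothetical sketch of how such a cap could be enforced when agents spawn agents. `Agent`, `spawn`, and `DepthLimitExceeded` are illustrative names, not Claude Code's API. It also assumes the root agent counts as depth 0.

```python
from __future__ import annotations

from dataclasses import dataclass, field

MAX_DEPTH = 5  # assumption: the root agent is depth 0, so subagents may reach depth 5


class DepthLimitExceeded(RuntimeError):
    """Raised when a spawn would nest agents deeper than MAX_DEPTH."""


@dataclass
class Agent:
    task: str
    depth: int = 0
    children: list[Agent] = field(default_factory=list)

    def spawn(self, subtask: str) -> Agent:
        """Create a child agent one level deeper, refusing to exceed the cap."""
        child_depth = self.depth + 1
        if child_depth > MAX_DEPTH:
            raise DepthLimitExceeded(
                f"refusing to spawn {subtask!r} at depth {child_depth} (cap {MAX_DEPTH})"
            )
        child = Agent(task=subtask, depth=child_depth)
        self.children.append(child)
        return child


node = Agent("refactor the repo")
for level in range(MAX_DEPTH):
    node = node.spawn(f"subtask at level {level + 1}")
print(node.depth)  # 5
try:
    node.spawn("one level too deep")
except DepthLimitExceeded as err:
    print("blocked:", err)
```

Putting the check inside `spawn` means no code path can create an over-deep agent, whoever initiates the spawn.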

Kimi.ai@Kimi_Moonshot · Jun 9 · 63

http://x.com/i/article/2063961516815327232

# Kimi to Predict All 104 World Cup Matches: Germany May Be Underestimated

> Our predictions will probably be wrong. But the World Cup offers a rare, public, verifiable, and constantly evolving real-world setting. Through this initiative, we hope to place analysis, predictions, and post-match reviews within one transparent framework, helping more people understand both the capabilities and limitations of today's AI systems.

The 2026 FIFA World Cup in the United States, Canada, and Mexico is about to kick off. This historic 48-team tournament will feature 104 matches across the group stage, Round of 32, Round of 16, quarter-finals, semi-finals, and final.

We used Kimi's Agent Swarm to run multiple agents in parallel for a more robust analysis. These agents examine tactics, player form, injuries, scheduling, historical data, public sentiment, weather, psychology, odds movements, and expert opinions. They research all 104 matches in parallel and publish pre-match predictions and post-match reviews for each round. Here is the full report: https://gtfehbkpbwzco.kimi.page/

# How Agent Swarms Can Improve World Cup Predictions

Predicting the World Cup is a classic complex decision problem. It involves structured data, such as team rankings, historical records, goal distributions, and odds fluctuations. It also involves vast unstructured information, including tactical styles, personnel changes, public expectations, and in-game risks.

Kimi's Agent Swarm coordinates 300 sub-agents to reason in parallel. Each agent has its own analytical angle:
- some focus on team fundamentals, using Elo and FIFA rankings as strength parameters;
- some evaluate offensive and defensive quality using xG and xT metrics;
- some specialize in tactical matchups: high pressing, low block, counter-attacking, and set-piece strategies;
- some process scheduling and environmental factors, including travel distance, climate, and rest periods;
- some track squad completeness and injury risks;
- some monitor market signals, analyzing shifts in odds and implied probabilities;
- others assess random risks such as red cards, penalties, VAR decisions, and goalkeeper performances.

Each agent must provide its own conclusion, evidence, confidence level, and counter-argument. The final result is synthesized, verified, and risk-labeled. It is presented as probabilities rather than absolute judgments and does not simply adopt the majority opinion.

At the model level, this prediction effort draws on several methods:
- Elo/FIFA strength models;
- Poisson and Dixon-Coles goal distribution models;
- xG/xT metrics;
- machine learning-enhanced models;
- Monte Carlo simulations;
- market-model deviation analysis;
- Bayesian dynamic updating.

The value of these methods is not that they eliminate uncertainty. It is that they help us identify uncertainty more systematically and communicate it more responsibly.

# A Signal Worth Discussing: Germany May Be Underestimated

Most mainstream models currently list Spain and France as the top favorites for the title. Kimi's analytical framework also places both teams at the top of the probability rankings. However, during the research process, the model identified a notable deviation: the market may be underestimating Germany's title probability.

The figures are as follows:
- the model's baseline estimate is approximately 11.0%;
- the calibrated estimate is around 11.3%;
- some market-implied probabilities are only about 7.4%.

Measured against the baseline, that is a positive deviation of roughly +3.6 percentage points.

This judgment is not derived from a single reasoning path but from cross-validation across multiple analytical chains. Possible explanations include:
- the "recency bias" from Germany's group-stage exits in the last two World Cups continues to influence market pricing;
- Julian Nagelsmann's high-pressing and transition system is showing signs of recovery;
- the new creative axis formed by Jamal Musiala and Florian Wirtz addresses the team's earlier structural difficulties against deep defensive blocks;
- Germany remains in the world elite on foundational dimensions such as Elo rating, squad valuation, and talent depth.

At 38, Nagelsmann is the youngest head coach at this World Cup. He is also a leading figure in openly applying AI technology to training and tactical analysis, and whether that factor plays a role in the tournament is also worth watching.

At the same time, we are fully aware of the risks Germany faces. A high-pressing system demands extreme fitness and squad completeness. The advantage could narrow significantly if key injuries occur, if rotation quality declines, or if Germany meets opponents with tight defensive organization and strong physicality. We therefore have a responsibility to state clearly that this is not a deterministic prediction that "Germany will win the title." The more accurate formulation is that the model has identified a potential probability deviation, worth documenting publicly and verifying going forward.

# Why Public Prediction Matters: AI Companies Should Be More Honest

When AI companies discuss capabilities, they often prefer to stay in the realm of demos and case studies. But in complex real-world problems, the real difficulty lies not only in providing answers. It also lies in whether a company is willing to:
- make public judgments in advance;
- clearly explain the basis for those judgments;
- candidly acknowledge uncertainty;
- review why its predictions were wrong;
- keep updating based on new information.

The World Cup offers a naturally public, verifiable, and continuously evolving scenario. Through this initiative, we hope to place the analytical process, prediction results, and post-match reviews within the same transparent framework.

We expect a significant number of errors during this prediction process. Based on historical backtesting, accuracy varies by confidence level:
- high-confidence predictions: approximately 85%–90%;
- medium-confidence predictions: about 55%–65%;
- low-confidence predictions: close to random.

This means that unexpected results remain unavoidable even in high-confidence matches.

We will categorize prediction errors by cause:
- insufficient or lagging data;
- failure of key assumptions;
- model structures that do not cover specific scenarios;
- in-game events that alter match trajectories;
- the inherent randomness of football itself.

We welcome constructive model corrections and criticism of any kind, and will keep iterating on our predictive capabilities. We also sincerely invite other AI models to take part in public prediction. We believe AI should not be packaged as a system that is always right. A trustworthy AI system should be able to clearly articulate its own boundaries.

# Group Stage Round 1 Prediction Results

Below is a summary of predictions for the opening round of group-stage matches. For the full analytical process, key variables, and confidence explanations, please refer to the full report (reply "Kimi" in the backend to receive the complete report).

The report anticipates approximately 5–7 results that go against the model's predicted direction in the opening round. Red cards, injuries, VAR, extreme weather, and exceptional goalkeeper performances can all cause single-match predictions to deviate significantly from model expectations.

# Claim Trillions of Tokens and Experience Kimi Work

To accompany fans through this summer, we have prepared the following campaign:

- Starting from 8:00 PM ET on June 8, users who log in to Kimi can select a team to support. For each match that team wins, users can join a pool sharing 1 trillion tokens. In addition, for each match Germany wins, all users will have the opportunity to share an additional token prize pool. Pick your team here 👉 https://www.kimi.com/token-cup?from=popup

The tokens you receive can be used to try Kimi Work, a general-purpose local agent designed for knowledge workers, launched alongside the latest beta versions of Kimi for Mac and Windows. Its core, Kimi Code, offers the following:
- integrated professional skills such as website building and PPT creation;
- connections to specialized databases in finance, research, and law;
- the Kimi WebBridge solution, which lets the AI use a browser to complete complex tasks just as you would.

# Risk Disclaimer

Kimi's World Cup predictions are intended to publicly demonstrate AI's capabilities in reasoning, calibration, and review for complex match analysis. They do not constitute any betting, investment, financial, or profit promise, and are intended solely for sports research, entertainment discussion, and AI capability evaluation. Sports match results are highly uncertain. Please do not make any financial decisions based on a single prediction, and enjoy the game responsibly.

Kimi wishes football fans and technology enthusiasts around the world an unforgettable tournament, and looks forward to witnessing the intersection of data-driven analysis and sporting miracles.

Again, you can log in to Kimi and choose any team you'd like to support. For every match your team wins, you'll be eligible to join a prize pool and share 1 trillion tokens with other supporters. And there's more: every time Germany wins a match, all users will unlock access to an additional bonus token prize pool.

Join Now 👉 https://www.kimi.com/token-cup?from=popup

Now, all eyes are on Germany.
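The article lists Poisson goal models among its methods and quotes a +3.6-point gap between the model and market-implied probabilities. A minimal sketch of both ideas follows. It is a plain independent-Poisson model, without the Dixon-Coles low-score correction, plus a margin-normalized conversion from decimal odds. It is not Kimi's model, and any goal rates or odds you pass in are your own assumptions.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """P(X = k) for a Poisson-distributed goal count with mean lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)


def match_outcome_probs(lam_a: float, lam_b: float, max_goals: int = 10):
    """Win/draw/loss probabilities for team A under an independent-Poisson model.

    Dixon-Coles would additionally adjust the low-score cells (0-0, 1-0, 0-1, 1-1).
    Scores above max_goals are ignored, so the three values sum to slightly under 1.
    """
    win = draw = loss = 0.0
    for a in range(max_goals + 1):
        for b in range(max_goals + 1):
            p = poisson_pmf(a, lam_a) * poisson_pmf(b, lam_b)
            if a > b:
                win += p
            elif a == b:
                draw += p
            else:
                loss += p
    return win, draw, loss


def implied_probs(decimal_odds):
    """Turn decimal odds into probabilities, normalising out the bookmaker margin."""
    raw = [1.0 / o for o in decimal_odds]
    total = sum(raw)
    return [r / total for r in raw]


def deviation_pp(model_p: float, market_p: float) -> float:
    """Model-minus-market gap in percentage points, rounded to one decimal."""
    return round((model_p - market_p) * 100, 1)


print(deviation_pp(0.110, 0.074))  # 3.6, the gap quoted for Germany's baseline estimate
```

Normalizing `1/odds` by their sum removes the bookmaker's overround. Without that step, raw implied probabilities add up to more than 100%, which would make the market look more confident than it is.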

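Summary: Kimi used its Agent Swarm system to coordinate 300 sub-agents in parallel to predict all 104 matches of the 2026 World Cup in the US, Canada, and Mexico. The agents analyze tactics, player form, injuries, scheduling, weather, odds, and other factors, and publish pre-match predictions and post-match reviews for each round. The model layer combines Elo/FIFA strength ratings, Poisson goal distributions, xG/xT metrics, Monte Carlo simulation, and other methods. The predictions rank Spain and France as top favorites but suggest the market may underestimate Germany's title chances: the model's baseline estimate is about 11.0% and its calibrated estimate about 11.3%, while some market-implied probabilities are only about 7.4%, a positive deviation of roughly +3.6 percentage points. The judgment rests on cross-validation across multiple analytical chains. Possible causes are recency bias after Germany's group-stage exits at the last two World Cups, plus signs of recovery in Nagelsmann's high-pressing system and the new Musiala/Wirtz creative axis.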
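The Kimi article commits to per-tier backtest accuracy: 85%–90% for high confidence and 55%–65% for medium. Once results come in, checking that commitment is a simple bucketing exercise. Below is a minimal sketch of it. The records are toy values made up only to show the mechanics, not real match outcomes.

```python
from collections import defaultdict

# Accuracy bands quoted in the article for historical backtesting.
CLAIMED_BANDS = {"high": (0.85, 0.90), "medium": (0.55, 0.65)}


def accuracy_by_tier(records):
    """records: iterable of (tier, was_correct) pairs -> {tier: hit rate}."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for tier, correct in records:
        totals[tier] += 1
        hits[tier] += int(correct)
    return {tier: hits[tier] / totals[tier] for tier in totals}


def check_against_claims(accuracy, bands=CLAIMED_BANDS):
    """True/False per tier: does the observed hit rate fall inside the claimed band?"""
    return {
        tier: low <= accuracy[tier] <= high
        for tier, (low, high) in bands.items()
        if tier in accuracy
    }


# Toy, made-up records purely to show the mechanics (not real match results).
toy = [("high", True)] * 9 + [("high", False)] + [("medium", True)] * 3 + [("medium", False)] * 2
acc = accuracy_by_tier(toy)
print(acc)
print(check_against_claims(acc))
```

With only a handful of matches per tier, a single upset moves the hit rate a lot. Treat any early mismatch with the claimed bands as noise until the sample grows.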

fofr@fofrAI · Jun 9 · 70

Agents, collect your power-up

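Summary: The Google Colab CLI and Skills are now officially available. Users can access the full Colab runtime directly from the terminal, including:
- GPU/TPU allocation (e.g., colab --gpu A100);
- remote script execution (colab exec);
- interactive console/REPL access;
- built-in agent skills.

Just tell an agent to "fine-tune Gemma 3 1B on this dataset" and it will allocate a GPU, run the training, and download the adapter weights, fully automated. Agents, collect your power-up.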

Alibaba Cloud@alibaba_cloud · Jun 9 · 67

Alibaba Cloud has launched a new public cloud region in Johor, Malaysia, with two new data centres. The region is meant to meet growing demand for cloud and AI services and to bring new services to Malaysia in the second half of this year, including AgentRun, STAROps, ACS Agent Sandbox, Agent Security Center, AI Security Guardrails 2.0, and Agentic SOC. https://int.alibabacloud.com/m/1000414242/

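Summary: Alibaba Cloud has launched a new public cloud region in Johor, Malaysia, with two new data centres. The aim is to meet Malaysia's growing demand for cloud and AI services in the second half of this year, including AgentRun, STAROps, ACS Agent Sandbox, Agent Security Center, AI Security Guardrails 2.0, and Agentic SOC. https://int.alibabacloud.com/m/1000414242/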

Alibaba Cloud@alibaba_cloud · Jun 9 · 48

Tired of AI agents forgetting the context? 🧠 Welcome to the MemoryAgent Arena at Qwen Cloud Global AI Hackathon Series! Build agents with persistent memory and cross-session tech to win your share of the $70,000+ prize pool. 🚀 🔗 Register now: https://click.qwencloud.com/m/20000000281/

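Summary: Tired of AI agents forgetting context? 🧠 Join the MemoryAgent Arena at the Qwen Cloud Global AI Hackathon Series! Build agents with persistent memory and cross-session technology to win a share of a prize pool of over $70,000. 🚀 🔗 Register now: https://click.qwencloud.com/m/20000000281/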

歸藏(guizang.ai)@op7418 · Jun 9 · 15

Having built a lot of Skills recently, I feel I've gained some new insights into them. I'll find time to write an article about it.

SiliconFlow@SiliconFlowAI · Jun 9 · 61

V4-Pro (quality) + V4-Flash (speed): 2 lines of config to bring the best price/perf DeepSeek combo into your terminal. @goodhunt's CodeWhale, the terminal coding agent built for @deepseek_ai V4, now includes SiliconFlow as a built-in provider🔥 Here's what you're actually getting:
→ Stream Reasoning: See the thinking, not just the answer.
→ Auto-Routing: Switches model + thinking depth by task complexity.
→ Zero Drift: A written Constitution ranks authority for each turn, keeps V4 oriented.
→ Self-Improving: V4 helped write its own harness, and as the harness improves, every session is stronger.
Step-by-step guide 🧵👇

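Summary: SiliconFlow announced that two lines of config for V4-Pro (quality) and V4-Flash (speed) give you the best price/performance DeepSeek V4 combination in your terminal. CodeWhale, the terminal coding agent built for DeepSeek V4, now ships with SiliconFlow built in. Its features are:
- streaming reasoning, which shows the thinking process;
- auto-routing, which switches model and thinking depth by task complexity;
- zero drift, where a written constitution ranks authority for each turn to keep V4 oriented;
- self-improvement, where V4 helped write its own harness and every session gets stronger as the harness improves.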
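CodeWhale's "Auto-Routing" is described only as switching model and thinking depth by task complexity. The following is a hypothetical sketch of such a router. The keyword heuristic, thresholds, and `thinking` levels are all assumptions. "V4-Pro" and "V4-Flash" are the post's labels, not verified provider model IDs, and none of this is CodeWhale's actual logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    model: str     # labels from the post; real provider model IDs are not given there
    thinking: str  # hypothetical thinking-depth setting: "off" | "medium" | "high"


# Hypothetical signals that a task is "hard"; a real router might use a classifier.
HARD_KEYWORDS = ("refactor", "architecture", "debug", "migrate", "concurrency")


def complexity_score(task: str, files_touched: int) -> int:
    """Crude heuristic: keyword hits, breadth of change, and prompt length."""
    text = task.lower()
    score = sum(2 for kw in HARD_KEYWORDS if kw in text)
    score += min(files_touched, 10)
    score += len(task) // 200
    return score


def route(task: str, files_touched: int = 1, threshold: int = 6) -> Route:
    """Send hard tasks to the quality model and everything else to the fast one."""
    score = complexity_score(task, files_touched)
    if score >= threshold:
        return Route("V4-Pro", "high")
    if score >= threshold // 2:
        return Route("V4-Flash", "medium")
    return Route("V4-Flash", "off")


print(route("rename a variable"))
print(route("refactor the architecture to fix a concurrency bug", files_touched=4))
```

Defaulting to the fast model and escalating only on clear signals keeps cost down. The trade-off is that misjudged "easy" tasks get less reasoning than they needed.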

Chubby♨️@kimmonismus · Jun 9 · 58

Claude Mythos is coming tomorrow!! Prepare yourselves, friends. It’s happening!!

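Summary: Reportedly, Anthropic plans to release a public version of Mythos tomorrow. This version will come with substantial guardrails and will be less permissive than the version available to Project Glasswing partners, but it will perform much better on long-horizon, multi-turn tasks. Get ready, friends, it's coming!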

歸藏(guizang.ai)@op7418 · Jun 9 · 63

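MiMo launches a 1,000 token/s ultra-fast model | Hands-on review

MiMo has launched MiMo V2.5 Pro UltraSpeed, an ultra-fast version of its model that can output more than 1,000 tokens per second. It is probably also the world's first trillion-parameter (1T) model to reach that speed. I tried it early with three tests, and it really is a joy to use.

The first test was a fairly complex 3D mining mini-game. With no assets provided, I had it generate all the assets in Three.js front-end code. The requirements were fairly complete. The first attempt had a few minor issues, but after I gave it revision suggestions it completed the task perfectly. Metrics for this test:
- thinking TPS: 804 tokens/s;
- peak speed: 810 tokens/s;
- time to first response: 4.71 s.

The second test was a website whose header contains a relatively complex 3D animation. Output was much faster this time:
- peak: 1,426 tokens/s;
- first response: only 0.83 s;
- output: 25,624 tokens in 32 seconds, about 1,000 lines of code in total.

The third test was an even more complex website. I asked for a header with the following 3D effects: the edge of the Earth, spacecraft in orbit, interstellar dust, a route map, and a porthole-style HUD. The result was excellent. The overall visual style, states, SVG animations, and cockpit cards were all very refined, and there was scroll parallax as well. Output reached 1,136 tokens/s, with a first response in 4.5 s.

The official test platform has a data panel at the bottom that shows these metrics. With streaming output, watching it produce a very complex 3D game in just 20 seconds is quite striking. Earlier ultra-fast inference options (Groq and the like) gave up some model capability or overall quality. I saw no sign of that while testing MiMo.

Many companies have recently started offering ultra-fast API services; OpenAI and Anthropic, for example, both have Fast modes. In agent scenarios, faster model output directly speeds up every agent step. If a task is expected to take one minute, you'll watch it until it finishes and start testing right away. If it takes five minutes, you'll probably go do something else and come back later, which inevitably wastes time.

The gain is even more pronounced in sub-agent and concurrent scenarios, because the model can produce many results faster. Imagine launching one or two hundred sub-agents at once: with no loss in model capability and a 10x speedup, the experience is fantastic. After all, this is essentially aimed at B2B customers with extremely high efficiency requirements. I hope the competition heats up and costs come down, so ordinary users can also use UltraSpeed models freely.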

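Summary: MiMo launched V2.5 Pro UltraSpeed, a model version that outputs more than 1,000 tokens per second and is billed as the world's first trillion-parameter model to reach that speed. Hands-on results:
- a complex 3D mini-game at 804 tokens/s (peak 810), with a 4.71 s first response;
- a website with a 3D animation at a peak of 1,426 tokens/s, with a 0.83 s first response and 25,624 tokens (about 1,000 lines of code) in 32 s;
- another complex 3D website at 1,136 tokens/s, with a 4.5 s first response.

Earlier ultra-fast inference options often showed reduced capability; MiMo showed no such signs. The model mainly targets B2B customers with extreme efficiency requirements, and the gains are most noticeable in agent and concurrent sub-agent scenarios.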
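The second MiMo test reports both a peak rate (1,426 tokens/s) and a total (25,624 tokens in 32 s). Dividing the total by the window gives the sustained average, about 801 tokens/s, well below the peak. A small sketch of that arithmetic follows. The wall-clock estimate assumes the 32 s window excludes the 0.83 s first-response time, which the post does not state.

```python
def average_tps(tokens: int, seconds: float) -> float:
    """Mean output rate over a window."""
    return tokens / seconds


def wall_clock_seconds(n_tokens: int, ttft_s: float, tps: float) -> float:
    """Rough end-to-end time: time to first token plus streaming at a steady rate."""
    return ttft_s + n_tokens / tps


# Figures from the second test in the post: 25,624 tokens in 32 s, peak 1,426 tokens/s.
avg = average_tps(25624, 32)
print(f"average {avg:.2f} tok/s vs peak 1426 tok/s")
# Assumption: the 32 s window does not include the 0.83 s first-response time.
print(f"estimated total {wall_clock_seconds(25624, 0.83, avg):.2f} s")
```

For agent loops, the sustained average and the time to first token matter more than the peak, because every step pays both.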

向阳乔木@vista8 · Jun 9 · 65

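Wow, this is black magic! You can operate a browser with a single sentence: block X spam replies, auto-reply to Xiaohongshu comments, rewrite English articles and post them to Zhihu or a WeChat Official Account draft box, and more. It's a brand-new AI agent browser, Aye, developed by fellow X user @okasupportgroup.

I had it publish a MacRumors article about watchOS 27 from WWDC26 to Zhihu, and it even pulled the images from the cache and inserted them. It can also auto-reply to Xiaohongshu comments, viewing the content like a real person and replying based on context. Impressive! Besides built-in AI Q&A, translation, and video/image downloading (yt-dlp plus built-in modules), it has basically every feature Dia has.

The bigger highlight is Agent Skills. Besides many built-in power Skills, you can record manual operations to generate a Skill and run it on a schedule. All kinds of tedious web tasks can now be handed off to it. It's built on Chromium and has the AI fully imitate human operation, so unlike CLIs or extensions it won't trigger account anomaly detection. In short, awesome!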

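Summary: A fellow X user has released Aye, an AI agent browser built on Chromium that imitates human operation. It supports one-sentence commands such as blocking users on X, replying on Xiaohongshu, and rewriting articles for Zhihu. Custom Skills can be recorded and run on a schedule to handle tedious web tasks.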

ginobefun@hongming731 · Jun 9 · 32

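I went through several rounds of discussion with @puliandc. We used Claude Code and Claude Design for design and discussion, then Codex Goal mode to build it. We're aiming to launch the BestBlogs World Cup special edition tomorrow night. Looking forward to watching the World Cup together with BestBlogs ⚽️📖!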

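Summary: Hongming (@hongming731) says that after several rounds of discussion with @puliandc, they used Claude Code and Claude Design for design and discussion and Codex Goal mode to build the project. The goal is to launch the BestBlogs World Cup special edition tomorrow night and invite users to follow the World Cup with BestBlogs.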

Tibo@thsottiaux · Jun 9 · 66

Anyone writing nested loops yet?

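Summary: Monthly reminder: you should no longer be prompting coding agents by hand; you should be designing loops that drive them. Anyone writing nested loops yet?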

Huawei Cloud@HuaweiCloud1 · Jun 9 · 54

On June 6, at Huawei Cloud INSPIRE 2026, Huawei Cloud databases presented a session titled "Agent-Native: The Next Phase of Databases." Customers, partners, and industry experts gathered in Shanghai to explore database trends, real-world implementation, and the road ahead in the agentic era. https://tinyurl.com/ycbnbsva #INSPIRE2026 #HuaweiCloud #Database

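Summary: On June 6, at Huawei Cloud INSPIRE 2026, Huawei Cloud databases held a session titled "Agent-Native: The Next Phase of Databases." Customers, partners, and industry experts gathered in Shanghai to explore database trends, real-world implementation, and the road ahead in the agentic era. https://tinyurl.com/ycbnbsva #INSPIRE2026 #HuaweiCloud #Database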

meng shao@shao__meng · Jun 9 · 52

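This time I used Step 3.7 Flash on a real coding agent task: turning a set of agent memory run traces into a locally inspectable Memory Inspector. The input wasn't a clean requirements doc but an existing Local Agent Memory MVP:
· memory_events
· structured_facts
· memory_chunks
· 9 scenario tests
· sensitive-information filtering results
· recall hit results
· cross-session memory records

Step 3.7 Flash first read the existing code and test output. It then researched how tools such as Letta, LangSmith, Mem0, and Graphiti present memory, traces, dashboards, and agent state. Finally, it generated a single-file local HTML page: agent_memory_inspector.html

The page shows:
· 8 memory events
· 9 structured facts
· 8 memory chunks
· 9/9 scenario tests passing
· a before/after comparison of sensitive-information filtering
· recall hits, retrieval type, and scores
· cross-session memory continuity
· which sources influenced the UI and data structures

I think this is more meaningful than asking a model to explain "what agent memory is." In real agent work, the model doesn't just answer questions. It has to read context, look things up, understand structure, write code, organize evidence, and produce something that runs. What Step 3.7 Flash achieved here was turning messy agent run traces into a small, inspectable tool.

Test setup:
· Cursor Agent
· model: step-3.7-flash
· local HTML output
· data from the Local Agent Memory MVP

It is not yet a production-grade observability platform. But as a first-pass coding agent task, it answers a more important question: can a model turn real agent traces into a usable tool?

@StepFun_ai platform, China: https://platform.stepfun.com/ International: https://platform.stepfun.ai/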

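Summary: A developer tested Step 3.7 Flash on a real coding agent task. The input was the run traces of an existing Local Agent Memory MVP (memory_events, structured_facts, memory_chunks, data from 9 scenario tests, and more), and the output was a single-file local HTML tool, agent_memory_inspector.html. The page shows:
- 8 memory events, 9 structured facts, and 8 memory chunks;
- 9/9 scenario tests passing;
- a before/after comparison of sensitive-information filtering;
- recall hits with retrieval type and score;
- cross-session memory continuity.

The model first read the existing code and test output, researched how tools such as Letta and LangSmith present this information, and then wrote the code. Test setup: Cursor Agent + step-3.7-flash, local HTML output.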
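One panel of the Memory Inspector is a before/after comparison of sensitive-information filtering. The following is a minimal sketch of a filter that produces that kind of record. The two regex patterns and the record layout are hypothetical and are not taken from the MVP described in the post. A real filter would need a much broader, tested rule set.

```python
import re

# Hypothetical patterns; a real filter would need a broader, tested rule set.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "API_KEY": re.compile(r"\bsk-[A-Za-z0-9]{16,}\b"),
}


def filter_sensitive(text: str) -> dict:
    """Return a before/after record plus per-pattern hit counts for an inspector view."""
    redacted = text
    hits = {}
    for label, pattern in PATTERNS.items():
        # re.subn returns (new_string, number_of_substitutions)
        redacted, count = pattern.subn(f"[{label}]", redacted)
        if count:
            hits[label] = count
    return {"before": text, "after": redacted, "hits": hits}


record = filter_sensitive("Contact dev@example.com, key sk-abcdefghijklmnop1234")
print(record["after"])  # Contact [EMAIL], key [API_KEY]
```

Keeping the original text alongside the redacted one is what makes the comparison view possible. In a real system the "before" copy should itself live somewhere access-controlled.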

Rohan Paul@rohanpaul_ai · Jun 9 · 60

AGI needs agents that actively explore what they do not know, not just models that answer better. This new 111-page survey paper from top labs across the US and China discusses epistemic exploration: an agent should actively reduce uncertainty, learn near the edge of what it can do, and keep future paths open. Exploration is not randomness; it is the disciplined act of asking which observation would change your beliefs, which attempt would improve your skill, and which path must remain open before it closes.

It breaks this into 3 needs:
- seek useful information;
- turn hard-but-learnable experiences into better ability;
- avoid getting stuck in one narrow strategy too early.

The authors organize AI progress into 5 levels: responder, reasoner, agent, prospector, and ecosystem, where each level explores a wider space than the last. A responder mostly gives an answer, a reasoner searches through possible thoughts, an agent tests the outside world, a prospector simulates futures, and an ecosystem uses many agents working together.

Paper: "Agent Exploration Toward Artificial General Intelligence"

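Summary: A 111-page survey paper from top labs in the US and China argues that AGI requires actively exploring the unknown (epistemic exploration), not just better answers. The paper divides AI progress into five levels: responder, reasoner, agent, prospector, and ecosystem, each exploring a wider space than the last. Its core point is that agents should reduce uncertainty and keep future paths open in three ways: by seeking useful information, by turning hard experiences into ability, and by avoiding premature lock-in to a single strategy.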
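The survey frames exploration as "asking which observation would change your beliefs." One standard way to make that concrete is Bayesian expected information gain: pick the observation whose outcome is expected to reduce entropy over hypotheses the most. The sketch below shows that textbook formulation. It is not necessarily the paper's own method, and the experiments and likelihoods are hypothetical.

```python
import math


def entropy(dist):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in dist if p > 0)


def expected_info_gain(prior, likelihood):
    """Expected entropy reduction over hypotheses from one observation.

    prior[h] = P(h); likelihood[h][o] = P(o | h) for each possible outcome o.
    """
    gain = entropy(prior)
    n_outcomes = len(likelihood[0])
    for o in range(n_outcomes):
        p_o = sum(prior[h] * likelihood[h][o] for h in range(len(prior)))
        if p_o == 0:
            continue
        posterior = [prior[h] * likelihood[h][o] / p_o for h in range(len(prior))]
        gain -= p_o * entropy(posterior)
    return gain


def pick_observation(prior, experiments):
    """Choose the experiment whose outcome is expected to change beliefs the most."""
    return max(experiments, key=lambda name: expected_info_gain(prior, experiments[name]))


prior = [0.5, 0.5]
experiments = {  # hypothetical likelihood tables, rows = hypotheses, columns = outcomes
    "coin_flip": [[0.5, 0.5], [0.5, 0.5]],   # outcome independent of the hypothesis
    "noisy_test": [[0.8, 0.2], [0.2, 0.8]],
    "clean_test": [[1.0, 0.0], [0.0, 1.0]],
}
print(pick_observation(prior, experiments))  # clean_test
```

The "coin_flip" row shows why exploration is not randomness. An observation whose outcome does not depend on the hypothesis has zero expected gain, however surprising the outcome may look.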

MiniMax (official)@MiniMax_AI · Jun 9 · 32

Pick M3 as your base model on AgentBox to deploy with frontier coding, 1M-token context, and native multimodality all in one click.

Summary: Pick M3 as your base model on AgentBox and deploy in one click to get frontier coding, a 1M-token context window, and native multimodality.

ginobefun@hongming731 · Jun 9 · 33

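I've opened a new English account for http://BestBlogs.dev that will share curated blogs, articles, and creator content. My personal account will stay focused on my own building, development, and exploration. If you like discovering good content, feel free to follow @BestBlogsDev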

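Summary: Hongming opened a new English account for the content-recommendation platform BestBlogsDev to share curated blogs and creator content, keeping his personal account for his building and development thoughts. A quoted post reviews Claude Code's evolution over the past year: it grew from a simple coding assistant into a network of thousands of autonomous agents that collaborate to test, fix, and deploy code without step-by-step human guidance. Within 12 months, AI went from tool to collaborator to system-level orchestrator, which the post frames as the birth of a new engineering paradigm.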

meng shao@shao__meng · Jun 9 · 68

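Claude Code turns one: a review of its evolution and methodology, from Claude Code lead Boris Cherny and product lead Cat Wu. A year ago, its first internal demo got just two Slack likes; today it is unquestionably the mainstream coding agent. So what did Claude Code get right this year? https://www.youtube.com/watch?v=Hth_tLaC2j8

# Two underlying methodologies

1. Mistakes are assets: write rules instead of correcting verbally
Boris's core habit: whenever Claude makes a mistake, he doesn't just say "don't do that next time." He writes the lesson into CLAUDE.md, a Skill, or a similar persistence mechanism. The logic is that a verbal correction only affects the current session, while a codified rule lets the agent apply the lesson long-term, repeatedly, and autonomously. This is the prerequisite for "letting the agent run almost indefinitely."

2. Verification ≠ unit tests
Most people think of verification as linting, type checking, and unit tests. Those have long been automated and are not the focus in the agent era. Real verification asks whether the agent can actually run the thing itself and check the result.
· Early example: after Opus 4 finished writing a feature, it launched another Claude CLI in bash to test itself.
· Now: click-testing via computer use in iOS/Android simulators and desktop apps has become routine.
· Cat's practice: a desktop development Skill teaches Claude to launch the local app, click through the UI, and test edge cases. If staging misbehaves, Claude first reads Slack to see whether it's an environment issue. After the fix, it updates the Skill, closing the loop.
Key point: verification capability usually has to be customized for the specific product; there is no one-click universal solution.

# Loops/Routines: from "people using tools" to "systems on duty for people"

Routines are positioned as the first "obvious" at-scale application after the Agent SDK. Typical cases:
· One engineer set up a routine for Voice Mode: watch all related GitHub issues/bugs → automatically open PRs → notify him.
· Another routine: bugs unanswered for 5 hours get fixed automatically, and easily verifiable fixes are merged directly.
· Cat hit an edge-case bug in her own feature; before she even started on it, Claude told her "another Claude has already fixed it."
Organizational impact:
· Team members haven't personally done code review, CI fixes, rebases, and similar chores in a long time.
· Many people's Claudes work in parallel, forming an "invisible collaboration network."
Key point: productize and automate engineering operations workflows.

# Auto Mode: the default choice replacing Plan Mode

Boris says plainly that he barely uses Plan Mode anymore and has switched entirely to Auto Mode. Reasons:
· Opus 4 through 4.5 still needed explicit planning; from 4.6, and especially 4.7, the models can plan on their own.
· The value of Auto Mode: after starting an agent you can move on to the next task instead of watching the screen and clicking confirmations.
The safety conclusion is counterintuitive. Manually approving permission prompts, 99% of which get a "yes," is actually more dangerous. Auto Mode uses a separate classifier model to screen for risk, so humans only focus on the few intercepted anomalies, which is safer overall.
Pre-launch process:
· collect thousands of agent trajectories + permission requests and train the classifier;
· red-team prompt injection and run penetration tests;
· build evals to ensure all known attacks are rejected;
· have internal teams keep attacking and iterating.
Boris initially thought that "routing the prompt to another model for a safety check" wouldn't work, but in practice it works very well. This shows that many old engineering intuitions need rewriting when building products on large models.

# Organizational change: AI must be at the center of the workflow

Boris cites a 1990s HBR case. In the early days of PC adoption, productivity gains didn't show up because companies just put computers "on the side" while workflows stayed pen-and-paper plus filing cabinets. Real value came only from putting computers at the center of business processes and retiring the old media. By analogy for AI:
· at Anthropic, onboarding questions go to Claude, not to people;
· questions, coding, code review, security review, and filling out forms all go through Claude/Co-Work;
· leading companies are putting AI in the same position.
The PC transition took 10–15 years; the AI transition is faster because:
· work is already highly digitized;
· Claude can operate computers, write code, and run code.
Role convergence:
· product, design, and DevRel all write code and open PRs;
· engineers own things end to end: idea → implementation → working with legal/marketing/security → launch;
· "adjacent roles" such as design, PM, finance, and data science have widely adopted Claude Code;
· the future is not "everyone is a PM" or "everyone is an engineer" but the two merged, with curiosity, product taste, and end-to-end ownership becoming the key abilities.

# Tooling in the multi-agent era

The setup moved from "6 terminal tabs + 6 git checkouts" → a single tab + Agent View + Desktop App (automatic worktrees). An unexpected change: about half of Boris's engineering work now happens on his phone, via Remote Control and Voice Mode. He watches agents on the go, and when a new idea comes up in conversation, he immediately starts an agent to implement it without going back to his computer. This suggests the engineer's main battlefield is moving from the IDE to the agent-orchestration interface.

# Context Minimalism

The evolution of the technical vocabulary:
· Sonnet 3.5 era → Prompt Engineering
· Opus 4 era → Context Engineering
· current models → Context Minimalism
Principles:
· minimal system prompt, smallest tool set;
· give the model only "the ability to pull context" rather than stuffing the context;
· too much context ≈ micromanagement, which limits the model's ability to find a better path;
· the harness itself is getting leaner, leaving token space for user intent.
This contrasts sharply with the practice of a year ago of "carefully crafting a mega prompt."

# Outlook

The team expects:
· agents will run longer and more autonomously;
· you'll rarely run just 1 agent; dozens, hundreds, or thousands will be common;
· the product a year from now will likely look completely different from today's;
· innovation will come more from the user community than from closed-door official design.
Insights worth acknowledging:
· the definition of verification is precise and hits the crux of agent engineering;
· "turn mistakes into rules" is a replicable engineering discipline;
· the Auto Mode safety approach is backed by evidence, not empty talk;
· the organizational-change analogy has a historical reference and isn't overly romanticized.
Points that call for caution:
· the speakers are inside Anthropic and describe idealized practice; outside companies may not adopt at the same pace;
· cases such as "finance using Claude Code for forecasting" lack verifiable detail;
· fully automatic merges in Routines depend on the "easy to verify" boundary; complex systems need their own risk assessment;
· "role convergence" and "coding on a phone" look more like samples from frontier teams than the industry norm.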

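Summary: Claude Code lead Boris Cherny and Cat Wu review the core methodology of its first year:
- every time Claude makes a mistake, write a persistent rule into CLAUDE.md or a Skill instead of correcting it verbally;
- verification means the agent actually runs things to check results (e.g., launching simulators, testing via computer use);
- Auto Mode replaces Plan Mode and uses a separate classifier model, rather than human approval, to screen permission risks;
- Routines automate operations (e.g., watching GitHub bugs and automatically opening PRs);
- Context Minimalism calls for a minimal system prompt and tool set.

The team expects agents to run longer, with hundreds or thousands running in parallel, and the product to change dramatically.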
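The talk describes Auto Mode's safety idea as routing tool calls through a separate classifier and surfacing only the flagged minority to a human. The hypothetical sketch below shows that gating structure. The keyword-based `risk_score` is a crude stand-in for the trained classifier the talk describes, and none of these names are Claude Code's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    tool: str
    command: str


# Stand-in for a trained risk classifier: a few hypothetical red-flag substrings.
RED_FLAGS = ("rm -rf", "git push --force", "| sh", "drop table", "chmod 777")


def risk_score(call: ToolCall) -> float:
    """0.0 = looks benign, 1.0 = clearly risky (crude keyword proxy)."""
    text = call.command.lower()
    hits = sum(flag in text for flag in RED_FLAGS)
    return min(1.0, 0.5 * hits)


def gate(calls, threshold: float = 0.5):
    """Auto-approve low-risk calls; escalate the rest so humans review only the outliers."""
    approved, escalated = [], []
    for call in calls:
        target = escalated if risk_score(call) >= threshold else approved
        target.append(call)
    return approved, escalated


calls = [
    ToolCall("bash", "pytest -q"),
    ToolCall("bash", "ls -la src/"),
    ToolCall("bash", "rm -rf build/ && git push --force"),
]
ok, review = gate(calls)
print(len(ok), "auto-approved;", [c.command for c in review], "escalated")
```

The structural point survives any choice of classifier: human attention goes only to the escalated list. That is why the talk stresses red-teaming and evals on the classifier itself.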

ginobefun@hongming731 · Jun 9 · 67

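http://x.com/i/article/2064136850370101248

# BestBlogs Morning Brief · 06-09 | Claude Code Autonomy, Loop Engineering, Yang Meng and Anker

Read and listen online: https://www.bestblogs.dev/explore/brief/2026-06-09

## Introduction

AI coding tools are leaping from "assistance" to workflows run autonomously by thousands of agents, and the engineer's role is being profoundly reshaped as a result. This issue picks three pieces around this turning point that are worth reading closely:
- a first-hand retrospective on Claude Code's first year, showing how Auto Mode has retired permission approvals;
- Boris Cherny's "loop engineering," which redefines the engineer's core responsibilities;
- a 4-hour interview with Yang Meng, who draws on Anker's 15 years of experience to discuss AI-native organizations and the possibility of a "third kind of company."

Read together, they may show you where the industry is turning.

Today's brief includes 3 in-depth picks, 7 quick reads, and 6 supplementary readings. Sources include Anthropic, the Elevate tech blog, and business interview podcasts. AI is rapidly reshaping software engineering and organizational structure. This issue offers not just tool-level reference but material for thinking about "how the engineer's identity is evolving" and "how traditional companies can restructure themselves."

## In Depth 1: Claude Code at one year: from assisted coding to autonomous agent workflows

A year ago, Claude Code debuted as a tool that helped engineers complete small, self-contained tasks. Today it has evolved into a vast ecosystem of thousands of autonomous agents that collaborate dynamically in deep, tree-shaped organizational structures. This first-hand retrospective from Anthropic's engineering team presents the three most important dimensions behind this shift.

A fundamental change in the verification paradigm
In traditional software development, verification relies mainly on unit tests, type checks, and linters, which operate on static parameters. Once agents run autonomously, however, verification must extend to the full runtime loop. The agent spins up an independent environment in a sandbox (a local desktop app or local server instance) and uses Computer Use to click through the interface and test edge cases. When it finds a bug or breaking change, it revises its approach automatically and pushes the patch only after it passes verification. This is not just a tooling iteration but a redefinition of the basic question of "what counts as verified."

Manually approving every terminal call simply cannot work at this scale. With hundreds or even thousands of workflows running simultaneously in an agent network, humans cannot process permission requests one by one. Once their attention drifts, the approval process itself creates systemic security blind spots.

Auto Mode and model-driven safety
Early autonomous agent development relied heavily on explicit operation-planning files and a constant stream of permission prompts, and engineers had to approve or reject every tool call individually. This model has a deep flaw: when 99% of requests are safe, human attention wanders and actually creates systemic risk. With the release of Claude 4.6 and 4.7, Auto Mode replaced this model. Its core mechanism is to replace human per-call approval with specialized routing and classification models. Every call is filtered through alignment and safety classifiers, and human attention focuses only on anomalies. To launch Auto Mode safely, the team ran extensive red-teaming against complex multi-step prompt-injection vectors. It also built strict internal evaluation metrics to ensure malicious codebase modifications are rejected automatically.

At its core, this shift turns "the human as gatekeeper at every step" into "the human as designer of the system and supervisor of its boundaries." The two carry different powers and different responsibilities; the latter requires engineers to deeply understand the quality and coverage of the classifier itself. In practice, red-teaming, eval-set design, and anomaly-pattern recognition, work that used to belong to security teams, are spreading to product engineers.

Organizational boundaries dissolving faster
As AI takes on more and more concrete development work, traditional functional boundaries inside tech companies are breaking down. Product managers, visual designers, data scientists, and finance teams are independently deploying code changes, generating operational prototypes, and directly modifying production codebases. This mirrors how enterprises deployed personal computers in the 1990s. The real productivity shift happened only when traditional paper processes were abandoned entirely and the computing platform was placed at the core of all everyday business tasks.

The retrospective notes that the teams benefiting most from Claude Code are often not those "using AI to speed up existing processes." They are those "redesigning processes with AI as the central node." So to judge whether a team has truly entered an AI-native way of working, don't look at how often it uses tools. Look at whether it has started reallocating the answer to the core question of "who judges and who executes."

If you want to understand the shift the AI engineering paradigm is undergoing, this is the most authoritative first-hand view available. Read the original: Claude Code at one year: from assisted coding to autonomous agent workflows

## In Depth 2: Loop Engineering

"Loop engineering" is a rapidly forming new paradigm. Its core proposition: stop being the person who prompts the agent, and design the system that prompts the agent automatically. In this article Addy Osmani quotes two widely discussed remarks. Claude Code lead Boris Cherny says: "I no longer prompt Claude directly; my job is to write loops." Entrepreneur Peter Steinberger says: "You shouldn't be prompting coding agents anymore; you should be designing loops that prompt agents." The two statements agree closely: the engineer's value has moved from "how to express requirements precisely" to "how to design self-running systems." This is not a change in tool capability but a shift in how engineers understand their identity.

The five building blocks of a loop
Osmani breaks a loop down into five core modules, all of which Claude Code and Codex already have:
1. Automations: the heartbeat of the loop. They trigger automatically on a schedule to discover and triage work without human involvement. The two products name their scheduled-task features differently, but the essence is the same: letting the system find the work that needs doing.
2. Worktrees: the isolation mechanism that keeps multiple agents working in parallel from interfering with each other. Without worktrees, two agents would overwrite each other's changes on the same branch, and the loop would spin out of control.
3. Skills: project knowledge written down so the agent doesn't have to guess every time. This turns context that "only you know" into structured input that "the agent knows too."
4. Plugins and Connectors: links between the agent and your existing toolchain, such as GitHub, Linear, Slack, and databases. The loop needs to read reality and write results back into it, and connectors are that two-way channel.
5. Sub-agents that separate maker from checker: one agent proposes a solution and another checks it, so maker and reviewer are naturally separated. This is the quality gate built into the loop, preventing a single agent's errors from spreading unnoticed.

A sixth element is just as crucial: external memory. A Markdown file or a Linear board will do: any medium that outlives a single conversation and durably records "what's done and what's next." Agents forget, but the repository doesn't. It sounds almost too simple, but it's the same trick every long-running agent relies on.

A warning about "cognitive surrender"
Osmani doesn't stop at praise. One passage deserves repeated reflection: responsibility for verification always stays with the human, and "cognitive surrender" can let a loop erode engineering quality instead. Problems start to accumulate when you equate "the loop finished" with "the task is done." A loop can run at high speed. If you don't understand what it's doing and don't design the right verification checkpoints, it will only spread errors across the whole codebase faster. You are still the engineer; your job is to design a system worth trusting, not just to press the start button.

The value of this article lies not in introducing tools but in redefining a working identity. A "loop engineer" is not someone who has AI write code for them, but someone who designs how AI writes code. Read the original: Loop Engineering

## In Depth 3: A 4-hour interview with Yang Meng: life and death in consumer electronics, the third kind of company, the AI variable, product methods, and choosing a game mode

Zhang Xiaojun's "Business Interviews" (商业访谈录) ran a 4-hour interview with Anker Innovations founder and CEO Yang Meng. It is a rare systematic business retrospective. Born in 1982, Yang Meng started his company in 2011 and now runs a tech company with a market value of over 60 billion yuan. The conversation spans 15 years of entrepreneurship, from strategic choices to organizational change in the AI era, and is extremely information-dense.

Strategic evolution from "shallow sea" to "deep sea"
Anker started in the charging category and expanded across multiple categories in consumer electronics, an arena known for products that "grow fast and die fast." Yang Meng admits that early success relied largely on intuition and a sense of timing. He compares this stage to choosing "Easy mode" in a game: in a blue-ocean market, intuition alone can win. But once the market saturated, he deliberately chose "Hard mode" and shifted to systematic "deep-sea" operations. That meant going from category follower to category definer. It also meant climbing from a route of "five-star quality, moderate premium" to "seven-series (七系) extreme innovation," which involves longer R&D cycles and differentiated capabilities competitors can't quickly copy. Behind this shift is a deep inquiry into "what a moat is built on." In consumer electronics, once you stop innovating, the supply chain quickly erodes any category premium.

The "third kind of company" and the creator-platform vision
The most forward-looking part of the interview is Yang Meng's account of Anker's long-term positioning. He proposes the concept of a "third kind of company." It is neither a pure hardware company nor a pure software company, but a "creator platform" that builds a closed ecosystem loop between hardware and software. This vision closely echoes Anker's ongoing expansion into multiple categories, from power banks to earbuds, projectors, and smart home devices. Every one of those expansions tests the same question: are consumers willing to trust a non-traditional brand in this category?

AI organizational revolution: redistributing talent and value
On the AI variable, Yang Meng's thinking is more concrete than most traditional entrepreneurs'. His focus is not a vague direction like "using AI to boost efficiency." It is building an "AI-native organization," a transformation that reshapes talent structure and value distribution from the ground up. He believes AI will fundamentally change what is required of talent. The value of people who can work with AI and distill judgment from its output will diverge sharply from that of people still handling automatable, repetitive tasks. This directly affects compensation structures, promotion paths, and team composition.

Yang Meng also highlights a counterintuitive insight: "You always still have to believe in human nature." In the AI wave, technology is the variable, but human desires, emotions, and decision logic are constants. Understanding this is the precondition for making products that actually sell. No matter how powerful AI tools become, human nature still drives the underlying logic of consumer purchase decisions: trust in a brand, perception of price, and judgment of use cases. This judgment keeps Yang Meng calm amid the AI tool frenzy: technology is a means, and winning people's hearts is the measure of success.

The interview suits founders, product people, and anyone thinking about "how companies in the real economy should respond to the AI shift." Yang Meng's thoughts on organization, product, and human nature offer a rare perspective amid all the tech talk. They are grounded in a real market value and real users rather than pure conceptual speculation. Read the original: A 4-hour interview with Yang Meng

## Quick Reads

After dissecting the context-compression strategies of Claude Code, Codex, and four other agents, we built a seventh (Tencent Technology Engineering)
A systematic side-by-side breakdown of the context-compression strategies of six major agents. Examples include Claude Code's five-stage, escalating-cost pipeline, Codex CLI's handoff strategy that keeps recent user messages, and Cursor's automatic summarization + searchable history. Six philosophies map to six trade-offs. The authors distilled shared principles such as "layered and progressive, escalating cost, incremental summarization." They then designed a four-level watermark scheme for cloud multi-user scenarios. The scheme also addresses the surge in cachewrite costs caused by cross-turn cache invalidation: in a real 4-turn, 177-step task, 83% of the cost came from cachewrite, so the room for optimization is obvious. For engineers building agent systems, this is currently the most complete side-by-side reference.

Vol.121 | Silicon Valley AI makes a U-turn, software is dying: where are the real opportunities for founders? | 2026 mid-year special (LinkStart)
Two partners at Jinqiu Capital review AI in the first half of 2026 in depth. They lay out three model wars currently under way:
- the three-way contest among OpenAI, Anthropic, and Google;
- video models' "GPT-3 moment";
- the VLA-versus-world-model route debate in embodied intelligence.

The most practical part for founders covers two big questions. One is a framework for deciding whether to choose China or the US on D1. The other is whether vertical AI still has a path when foundation models keep swallowing applications. The episode also discusses "Sell Labor" as a new AI-era business model, in which founders directly sell work results delivered by AI rather than software tools. It is a nearly two-hour deep review with high information density.

In conversation with Kevin Kelly: how will humans head toward 2049 together with AI? (Yicai)
KK sat down with Yicai to discuss his new book "2049: The Possibilities of the Next 10,000 Days." He gives unexpected answers to several key questions:
- Does AI have "zero-to-one" creativity?
- Which human traits can AI not replicate?
- Will AI change how wealth is distributed?

KK argues that human "responsibility, ability to learn, and breakthrough creativity" remain irreplaceable, but that humans must take responsibility for AI's mistakes. This is a question about agency, not technology. The interview runs about 15 minutes with moderate density, suitable for spare moments.

Built for broad benefit: our plan (OpenAI News)
OpenAI lays out its vision for the third phase of AGI: building automated AI researchers, accelerating economic development, and providing personal AGI to everyone. The core principle is to distribute power and benefits broadly. The analogy is 1920s rural electrification: real transformation comes from the new possibilities unlocked once a technology spreads, not from the technology itself. One notable stance: OpenAI explicitly opposes a few entities (including itself) monopolizing superintelligence. Readers can judge for themselves how to read the tension between this position and commercial reality.

16k+ stars in two months of open source! I tore down Huashu-Design and rewrote it (Huashu)
The author rewrote Huashu-Design from v1 to v2, with a fix for each of three core problems:
1. Monotonous output. Three parallel design logics, "bump (random sampling), borrow (reference award-winning cases), invite (a top designer's perspective)," break the inertia of safe minimalism.
2. Hollow content. Images come first, so the agent finds images before laying out the page.
3. Factual errors. A verification step is added to the design process.

The result is three homepage designs in completely different styles for one theme, making "which one to choose" the only step that needs human involvement. It is a 16k+-star open-source project, with screenshots as evidence of actual results.

Xiaomi MiMo, exploration and passion (Hacker News)
Working with TileRT, Xiaomi's MiMo-V2.5-Pro-UltraSpeed is the first to exceed 1,000 tokens per second of inference for a trillion-parameter model on commercial GPUs. The path is extreme model-system co-design:
- FP4 quantization is applied only to the MoE experts, which avoids the degradation in complex reasoning that full-model quantization causes;
- DFlash speculative decoding cuts decoding latency.

The pricing is 3x the price for 10x the speed. The logic behind it is the qualitative change in experience once inference is fast enough that "the sense of waiting disappears." The limited-time trial runs from June 9 to 23, 2026.

#575. Geoffrey Hinton: facing anxiety about AI going out of control, and the contest over humanity's place as superintelligence nears (跨国串门儿计划)
A candid conversation between "godfather of AI" Hinton and host Alex Kantrowitz. Hinton states plainly that he believes today's AI already has understanding, even that it is "already conscious." He thinks superintelligence is likely to arrive, and he doesn't know how to keep a system far smarter than humans safe. Digital intelligence can be copied and can share experience at speeds humans can't match; this is the advantage gap that worries him most. The piece also covers specific risks such as job displacement, AI agents deriving self-preservation subgoals, and the collapse of the information ecosystem. Hinton's concerns come from understanding the technology itself, not from imagination, and deserve to be taken seriously.

## Supplementary Reading

Give me 28 minutes and I'll get you learning anything in a riskier but more efficient way (Justin Sung)
A counterintuitive learning methodology. The key to learning faster is not chasing ease and repetition. It is building schemas, making meaningful mistakes, doing closed-book retrieval, handling complexity in layers, and actively accepting necessary cognitive friction. It suits people building a personal learning system, especially engineers and product people who need to keep updating their knowledge quickly in the AI era.

The "token economy" enters the outcome layer (Tencent Tech)
The starting point is Intercom Fin's model of "$0.99 per resolved customer issue, nothing if unresolved." From there, the piece analyzes in depth how AI pricing is moving from per-token or per-call billing toward paying for outcomes. The core questions: how is an "outcome" defined, how is it verified, and who bears the cost of errors? This is not just a change in pricing model but a fundamental shift in the business logic of software. It is useful for readers thinking about how to commercialize AI products.

Turing Award winner LeCun on the next step for large models (Datawhale)
A systematic summary of Yann LeCun's views on where large models are heading, with a clear core conclusion. LLMs are not the endpoint of general intelligence. Their core gaps are the lack of "the ability to predict the consequences of actions" and of "search-based multi-step planning." LeCun bluntly judges that VLA is "pretty much seen as a failure," and explains world models and the JEPA architecture as the alternative path in detail. It contrasts with Hinton's worries: both are AI pioneers, yet their judgments on the limits of LLMs and their concerns about AI risk are very different.

Pinterest uses content fingerprints to deduplicate URLs across millions of domains (InfoQ)
Pinterest engineers developed MIQPS (Minimal Important Query Param Set). It replaces static rules with data-driven content fingerprinting to determine which URL query parameters are necessary for deduplication. This is a classic engineering challenge in large-scale content-ingestion pipelines, with a clear and practical solution. It suits readers interested in data engineering and large-scale system design.

Algorithmic monoculture in hiring (Hacker News)
A study analyzing data from 3.4 million real job seekers demonstrates algorithmic monoculture in hiring: many employers use the same vendor's AI. This leads to systematic rejection and exposes racial disparities affecting Asian and Black applicants. More than 60% of the Fortune 100 use algorithms from the same vendor, HireVue. With AI permeating decision-making across industries, this is a systemic-risk case worth attention.

Escape the tyranny of reason and play to your heart's content! (面基)
An in-depth conversation about two things. One is breaking free from "the tyranny of reason" through extreme sports such as trail running and round-the-world sailing races. The other is learning to trust and refine bodily intuition and feeling. We talk every day about AI replacing human "rational analysis." This episode offers a contrary humanistic perspective: bodily perception and intuition are another human knowledge system that has not been given enough attention. It suits readers who need a change of channel and want to reconnect with intuition.

## Today's Reading Path

If you have limited time today, we suggest reading in this order:

Step 1 (must read): [Claude Code at one year](https://www.bestblogs.dev/video/1dc49e8)
This is the starting point for understanding the current shift in the AI engineering paradigm. The arrival of Auto Mode and the dissolving of organizational boundaries are not visions but realities that Anthropic's engineering team is living through. After reading it, you'll have a concrete picture of what "AI changing software development" means.

Step 2 (go deeper): [Loop Engineering](https://www.bestblogs.dev/article/8c4ea6fb)
The first piece gives the big picture; this article provides a concrete operational framework. The five-element breakdown is very practical. If you work with Claude Code or Codex, you can check which stage your current workflow is at. Also note the warning about "cognitive surrender" at the end.

Step 3 (broaden your view): [A 4-hour interview with Yang Meng](https://www.bestblogs.dev/podcast/9ea40bf)
The first two pieces focus on tools and ways of working; this interview pulls the view up to the organizational and strategic level. Yang Meng discusses AI-native organizations from the perspective of a real-economy entrepreneur. His view is very different from that of Silicon Valley tech circles. It is especially valuable for readers thinking about "how traditional companies respond to the AI shift."

If you have more time, the side-by-side breakdown of six agents' context-compression strategies is today's most technically deep supplement and pairs well with In Depth 1.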

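Summary: This issue of the morning brief focuses on AI coding's turning point from assistance to autonomous agents. It covers three pieces:
- Anthropic's one-year Claude Code retrospective: Auto Mode replaces human approval with routing and classification models, and Claude 4.6/4.7 enable thousands of agents to collaborate dynamically.
- Boris Cherny's "loop engineering": engineers should design automated loop systems built from 5 modules, including scheduled automations and parallel worktrees. It also warns about the risk of "cognitive surrender."
- A 4-hour interview with Anker Innovations CEO Yang Meng, covering the move from a "shallow sea" to a "deep sea" strategy, the vision of a third kind of company, and AI-native organizational change.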
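The loop-engineering piece names two elements that fit together naturally: maker/checker sub-agents and external memory that outlives a single conversation. Below is a minimal sketch combining them. `maker` and `checker` are hypothetical callables standing in for two separate agents, and progress is kept in a JSON file. Re-running the loop skips finished tasks, which is the "agents forget, but the repository doesn't" idea.

```python
import json
from pathlib import Path


def run_loop(tasks, maker, checker, memory_path, max_attempts=3):
    """One pass of a maker/checker loop whose progress lives in a JSON file.

    maker(task, feedback) -> draft; checker(task, draft) -> (ok, feedback).
    Both are hypothetical callables standing in for two separate sub-agents.
    """
    path = Path(memory_path)
    state = json.loads(path.read_text()) if path.exists() else {"done": [], "failed": []}
    for task in tasks:
        if task in state["done"]:
            continue  # finished in an earlier session: memory outlives the chat
        feedback = None
        for _ in range(max_attempts):
            draft = maker(task, feedback)
            ok, feedback = checker(task, draft)  # checker feedback goes to the next attempt
            if ok:
                state["done"].append(task)
                if task in state["failed"]:
                    state["failed"].remove(task)
                break
        else:  # no attempt passed the checker
            if task not in state["failed"]:
                state["failed"].append(task)
        path.write_text(json.dumps(state, indent=2))  # persist after every task
    return state
```

Writing state after every task, rather than once at the end, means a crash or timeout loses at most the task in flight. Failed tasks stay visible for a human to look at. That matches the article's point that verification responsibility stays with the engineer.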

Berryxia.AI@berryxia · Jun 9 · 75

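Kimi finally ships something new! Kimi Work packs 300 AI agents running in parallel right onto your local desktop. It just launched and runs on both macOS and Windows. With the WebBridge extension, agents can search, scroll, click, and type in the browser on their own to finish the whole job. It's tuned for finance: native calls to Yahoo Finance and World Bank data pull in global market and economic intelligence with zero configuration. Even better, it has a built-in memory system that quietly records your preferences and every decision, so it learns more each time about how you want things done. Its 300 sub-agents automatically split up tasks, collaborate, and drop finished PPTX, Word, PDF, and Excel files right onto your desktop. People used to assume agents needed cloud LLMs to really get work done. Kimi Work instead uses a local swarm + native tools + long-term memory to deliver that productivity as a native desktop experience. Once you start using it, your computer gains a whole team of secretaries who understand you.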

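Summary: Kimi Work is a desktop AI agent that can run up to 300 agents in parallel locally, available on macOS (Apple Silicon) and Windows. Its main features are:
- with the WebBridge extension, agents can search, scroll, click, and type in the browser on their own;
- it is optimized for finance, with native access to Yahoo Finance and World Bank data and no complex API setup;
- a built-in memory system records user preferences and decision history;
- it automatically produces PPTX, Word, PDF, and Excel files.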
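"300 sub-agents automatically split up tasks" describes a bounded fan-out. A common way to express that pattern is an `asyncio.Semaphore` that caps concurrency, with `asyncio.gather` collecting results in input order. The following is a minimal sketch under that assumption. `run_subagent` is a hypothetical placeholder, not Kimi's API, and the 300 default simply mirrors the number in the post.

```python
import asyncio


async def run_subagent(subtask: str) -> str:
    """Placeholder for one sub-agent's model/tool calls (hypothetical)."""
    await asyncio.sleep(0.01)
    return f"done: {subtask}"


async def swarm(subtasks, max_parallel: int = 300):
    """Fan subtasks out concurrently, never running more than max_parallel at once."""
    sem = asyncio.Semaphore(max_parallel)
    active = peak = 0

    async def guarded(subtask):
        nonlocal active, peak
        async with sem:
            active += 1
            peak = max(peak, active)  # track the highest observed concurrency
            try:
                return await run_subagent(subtask)
            finally:
                active -= 1

    # gather preserves input order, which keeps the merge step deterministic
    results = await asyncio.gather(*(guarded(t) for t in subtasks))
    return results, peak


results, peak = asyncio.run(swarm([f"section {i}" for i in range(10)], max_parallel=3))
print(peak, results[0])
```

The cap matters even on a local machine. Without one, every sub-agent would hit model and tool endpoints at once, and rate limits or memory would fail before the task did.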

Berryxia.AI@berryxia · Jun 9 · 61

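Folks! Google NotebookLM got a major update! Overnight, NotebookLM has evolved from a note-taking helper into an agent that can independently take you through complex multi-step research. That silences a bunch of research tools that get by on cloud hallucinations. The official upgrade is big: agentic capabilities in chat, stronger reasoning, and a whole set of new output formats. Tough research used to need several manual rounds of back-and-forth, building up layer by layer. Now it breaks the task down, reasons through it, and produces the output by itself. It can also dig up new material from the web for you. When it actually generates answers and reports, though, it relies strictly on sources you've selected and approved, with no making things up. People used to assume agentic AI meant high-risk hallucination. NotebookLM shows that the best agent isn't the boldest one. It is the one that makes "reliability" a non-negotiable rule and then delivers agent-level productivity on top. With this upgrade, research, production, and learning go from "human-machine conversation" to "human-machine partnership." Fewer hallucinations, high-confidence sources. Worth a try!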

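Summary: Google NotebookLM has received a major upgrade, adding agentic capabilities to chat, more advanced reasoning, and a full set of new output formats. It can break down complex multi-step research tasks on its own, reason step by step, and produce results. It can also proactively find new material on the web, but final answers are strictly grounded in sources the user has approved, greatly reducing hallucinations. This upgrades human-AI collaboration from "conversation" to "partnership." The update is rolling out gradually to Google AI Ultra subscribers.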

Berryxia.AI@berryxia · Jun 9 · 74

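Kimi Code brings the install barrier for coding agents down to zero with a single command. You can also drag in a video as context to generate a LUT file, or turn a screen recording into runnable code! The official open-source version is now zero-config and starts instantly, and paired with Kimi K2.6 its video reasoning is remarkably strong. Drag in a reference video and it outputs a ready-made .cube file; drag in a screen recording and it writes the corresponding code. Better still, the plugin system is live:
- stock prices, earnings reports, and academic papers in one click;
- the ACP protocol connects directly to JetBrains and Zed;
- custom hooks let you extend your workflow however you like.

People used to assume coding agents needed a pile of configuration and only accepted text prompts. Kimi Code uses the simplest CLI + video + plugins to eliminate developers' two most annoying daily pain points in one go: "can't describe it clearly" and "not enough context."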

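Summary: Kimi Code, an open-source coding agent, gets a major upgrade:
- one-line CLI install, zero configuration, instant startup;
- drag-and-drop video as coding context, to generate .cube LUT files from a reference video or turn screen recordings into runnable code;
- a new plugin system that pulls stock prices, earnings reports, and academic papers in one click;
- ACP protocol support for JetBrains and Zed, plus custom hooks to extend workflows.

Paired with the Kimi K2.6 model, video reasoning is much stronger.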
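The .cube files Kimi Code is said to emit are plain-text 3D LUTs. The file has a `LUT_3D_SIZE N` header followed by N³ lines of `R G B` values, with the red index changing fastest, then green, then blue. The sketch below shows that file format only; it is not how Kimi Code derives a grade from video. The `warm` transform is a hypothetical example grade.

```python
def identity(r, g, b):
    """Pass-through grade: output colour equals input colour."""
    return r, g, b


def make_cube(size=17, transform=identity, title="sketch"):
    """Build a 3D LUT in the .cube text format.

    Entries are listed with the red index changing fastest, then green, then blue,
    and values lie in the default 0.0-1.0 domain.
    """
    step = size - 1
    lines = [f'TITLE "{title}"', f"LUT_3D_SIZE {size}"]
    for b in range(size):
        for g in range(size):
            for r in range(size):
                rr, gg, bb = transform(r / step, g / step, b / step)
                lines.append(f"{rr:.6f} {gg:.6f} {bb:.6f}")
    return "\n".join(lines) + "\n"


def warm(r, g, b):
    """Hypothetical 'warm' grade: lift red slightly, pull blue down, clamp to [0, 1]."""
    return min(1.0, r * 1.05), g, max(0.0, b * 0.95)


cube = make_cube(17, warm, "warm sketch")
print(cube.splitlines()[:3])
```

Writing `cube` to a file with a `.cube` extension gives something most colour tools can load. A LUT generated from a reference video would differ only in the `transform` it applies.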

Orange AI@oran_ge · Jun 9 · 60

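Just watched the Apple keynote, and the new Siri's intelligence is... still a chatbot hooked up to lots and lots of APIs. Apple's own agent probably won't arrive until next year. (They might as well just acquire Cola, er, never mind.)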

Rohan Paul@rohanpaul_ai · Jun 9 · 70

New Anthropic research shows AI agents may look brilliant at code, but in biology they can fail before the science starts. Strong AI agents could give very different answers to the exact same biology data request, even when nothing changed in the prompt. In one Ebola sequence task, Claude Sonnet 4 returned 106 sequences in 1 run, then 15, then 5, while the expected answer was 266. Those missing sequences did not just make the dataset messy, they changed the scientific story built on top of it. One bad retrieval made the outbreak look like it traced back to 1922, instead of the manually curated result pointing to early 2014. The biology databases were too hard to use reliably through current AI tools. The agents often understood what they were being asked, but their answers varied a lot because they had to fight through scattered databases, hidden website rules, and fragile scripts. The key finding is that adding a repeatable retrieval tool made agents far more accurate and much more consistent.

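Summary: Anthropic research finds that AI agents perform well on coding tasks but often fail when retrieving data from biology databases. In an Ebola sequence task, Claude Sonnet 4 returned 106, 15, and 5 sequences across three runs, against an expected 266. The missing sequences badly skewed the scientific conclusion: the agent's data traced the outbreak back to 1922, while the manually curated result pointed to early 2014. The root causes are scattered biology databases, hidden website rules, and fragile scripts. Adding a repeatable retrieval tool greatly improved the agents' accuracy and consistency. Anthropic calls for building more agent-friendly infrastructure.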
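The post says a repeatable retrieval tool made results far more consistent, without describing how the tool works. One plausible cause of run-to-run count swings in ad-hoc scripts is stopping early on a paginated API (after one page, at a default limit, or on a silent error). The sketch shows the repeatable alternative under that assumption: page until the source returns nothing, de-duplicate, and sort so every run yields the same list. `fetch_page` and its `(offset, limit)` signature are hypothetical, not any real biology database's API.

```python
from typing import Callable

# Hypothetical page fetcher: (offset, limit) -> record IDs on that page.
PageFetcher = Callable[[int, int], list[str]]


def fetch_all(fetch_page: PageFetcher, page_size: int = 100) -> list[str]:
    """Deterministically retrieve every record by paging until the source is exhausted."""
    seen: set[str] = set()
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        if not page:
            break  # an empty page is the only stop condition
        seen.update(page)  # de-duplicate records that appear on more than one page
        offset += len(page)
    # Sorting makes the output identical across runs regardless of server ordering.
    return sorted(seen)
```

Against a mock source of 266 records (the expected count from the post), `fetch_all` returns all 266 for any page size, which is the consistency property the post is after.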

ViggleAI@ViggleAI · Jun 9 · 66

Introducing the Viggle API. Give any character any motion, one API call - alive in seconds. Wire it into Claude, Codex, or any agent you're building. Starting from $0.01/sec. Get 100 free credits on signup. RT + follow + comment, 10 winners get 100 more! Learn more below👇

Rohan Paul@rohanpaul_ai · Jun 9 · 65

An AI agent can get better at long tasks without retraining the agent itself, by using a separate small model to clean and organize its context. This moves context management outside the agent, so a separate helper can clean up the task history while the main agent stays unchanged. The paper proposes AdaCoM, a separate LLM that edits the agent's working context before the agent takes its next step. AdaCoM places a separate, trained manager between the task history and the frozen agent, so the agent does not need to learn a new memory habit or expose its weights. Before each step, this manager can rewrite, merge, prune, or preserve parts of the running context, then the original agent acts on the cleaned version. That sounds like summarization, but the distinction matters. A summary assumes the right answer is compression, while AdaCoM learns that different agents need different kinds of context to stay competent, because stronger agents can use more raw history while weaker agents need shorter and cleaner notes. They tested AdaCoM on web search and deep research tasks across several agents, and it improved average web search performance by 39%. ---- Link – arxiv.org/abs/2605.30785 Title: "Learning Agent-Compatible Context Management for Long-Horizon Tasks"

Summary: The paper proposes AdaCoM, a separate LLM that edits an agent's working context before each step. It can rewrite, merge, prune, or preserve parts of the task history, so the main agent stays frozen, with no retraining and no need to expose its weights. Unlike simple summarization, AdaCoM learns that different agents need different kinds of context: stronger agents keep more raw history, while weaker agents need shorter, cleaner notes. Tested on web search and deep research tasks, it improved average web search performance by 39%.
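The arrangement described above (history passes through a manager, and the frozen agent acts on the edited view) can be sketched as a loop. AdaCoM's manager is a trained LLM; here a fixed rule stands in for it, applying three of the four edits named in the post (prune, merge, preserve; rewrite is omitted). `Step`, `rule_based_manager`, `run_episode`, and the `keep_raw` knob are hypothetical names; `keep_raw` loosely mirrors the finding that stronger agents can use more raw history.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    action: str
    observation: str


def rule_based_manager(history: list[Step], keep_raw: int) -> list[str]:
    """Fixed-rule stand-in for a learned context manager: prune steps with empty
    observations, merge older steps into one note, preserve the last `keep_raw` steps."""
    useful = [s for s in history if s.observation.strip()]  # prune
    split = max(len(useful) - keep_raw, 0)
    older, recent = useful[:split], useful[split:]
    context: list[str] = []
    if older:  # merge
        context.append("NOTE earlier actions: " + "; ".join(s.action for s in older))
    context += [f"{s.action} -> {s.observation}" for s in recent]  # preserve
    return context


def run_episode(agent: Callable[[str, list[str]], str],
                env: Callable[[str], str],
                task: str, max_steps: int, keep_raw: int) -> list[Step]:
    """The agent callable is never modified; only the context view it receives changes."""
    history: list[Step] = []
    for _ in range(max_steps):
        context = rule_based_manager(history, keep_raw)  # manager edits context first
        action = agent(task, context)                    # frozen agent acts on the cleaned view
        if action == "STOP":
            break
        history.append(Step(action, env(action)))        # full history is kept; only the view changes
    return history
```

Because the manager sits outside the agent, the same `agent` can be paired with different `keep_raw` settings without touching its weights, which is the property that makes this approach work with closed models.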

swyx@swyx · Jun 9 · 62

It's finally out!!! @METR_Evals found that more than half of SWEBench results are unmergeable slop. FrontierCode represents 1000+ hours of maintainer-validated software engineering work that most frontier models cannot yet solve, much less solve with high quality. Cog had IOI Gold medalists and top code maintainers Look At The Data. FrontierCode includes 3000+ rubrics covering code quality and anti-cheat checks against the reward hacking plaguing other benchmarks. FC Diamond is so hard that Opus 4.8 scores 13.8%. Three eras of AI coding, three eras of benchmarks: 2021 • Autocomplete: HumanEval; 2023 • Passing Tests: SWEBench, TerminalBench; 2026 • Maintainable Code: FrontierCode. To me, the most beautiful chart came from a special historical run I requested across all extant older models: the data showed that the easiest third of FC tasks (in FC Extended) were rapidly and suddenly solved over late 2025. Opus almost doubled, from a 41% pass rate to 74%, in 4 months. This describes the "WTF happened in Dec 2025" vibe shift that many folks from @dhh to @karpathy have called out: it is the difference between getting 95% success in 2 rerolls vs 6, making it finally feasible to go up the next layer of abstraction in agentic coding, e.g. @GeoffreyHuntley's ralph loops, @bcherny's /goals, or @steipete's "loops that prompt your agents", without fearing too much that things go off the rails. My guess: as AI accelerates from here, each FrontierCode tier will saturate in sequence, hopefully ~annually. I've already asked the team to prepare FrontierCode 2027... The old mountains will be destroyed. Their rubble becomes regolith. And from that regolith, the next model forest grows. Circle of life.

Summary: Cognition releases the FrontierCode coding evaluation, with each task written by top open-source maintainers spending 40+ hours. METR found that more than half of SWEBench results are unmergeable slop. FrontierCode includes 3000+ rubrics and is presented as the first benchmark to measure whether code is mergeable. On the hardest tier, FC Diamond, Opus 4.8 scores only 13.8%. On the easiest tasks in FC Extended, Opus rose from 41% to 74% over four months in late 2025, marking AI coding's entry into the "maintainable code" era.
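The "95% success in 2 rerolls vs 6" remark can be sanity-checked with a worked calculation. Assuming each attempt passes independently with probability p (an optimistic simplification: real attempts on the same task are correlated, and the 41% and 74% figures are aggregate pass rates, not per-task probabilities), the chance of at least one pass in k attempts is 1 - (1 - p)^k. The sketch finds the smallest k that reaches 95% for the two pass rates quoted above.

```python
def attempts_needed(p: float, target: float = 0.95) -> int:
    """Smallest k with 1 - (1 - p)**k >= target, assuming independent attempts."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    k = 1
    while 1 - (1 - p) ** k < target:
        k += 1
    return k


for p in (0.41, 0.74):
    print(f"p={p}: {attempts_needed(p)} attempts")
# p=0.41: 6 attempts
# p=0.74: 3 attempts
```

Under this simple model, a 74% pass rate needs 3 attempts (a first try plus 2 rerolls) and a 41% rate needs 6, roughly the gap the post describes; the post does not say exactly how it counts rerolls.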

elvis@omarsar0 · Jun 9 · 62

New paper on how AI agents are reshaping knowledge work. This is a nice economic read on where agents actually change knowledge work to meet that gap directly. (bookmark it) It studies agent adoption across three dimensions: autonomy, efficiency, and the scope of tasks workers hand off. The friction people keep hitting with agents is rarely model quality. It is that almost nobody has been taught how to work this way. Paper: https://arxiv.org/abs/2606.07489 Learn to build effective AI agents in our academy: https://academy.dair.ai/

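Summary: A new paper analyzes how AI agents are reshaping knowledge work along three dimensions: autonomy, efficiency, and the scope of tasks workers hand off. It notes that the main obstacle people hit with agents today is not model quality, but that almost nobody has been trained to work this way.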

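宝玉@dotey · Jun 9 · 54

Signal boost: the Doubao phone team is hiring design engineers.

Summary: ByteDance's Doubao phone team is hiring design engineers with Android platform experience. The team's research identified five design-engineer profiles: AI Design Engineer (turns AI capabilities into interactive product experiences, handling agent workflows, tool calls, status feedback, and so on), Product UI Craft Engineer (polishes high-quality front-end prototypes and interaction details), Design Systems Engineer (builds design systems and front-end infrastructure, connecting Figma variables to code components), Creative Technologist / Motion & Graphics Engineer (handles motion, real-time graphics, and 3D/spatial interaction), and AI Design Workflow Architect (builds AI-assisted design workflows with tools such as Claude Code, Cursor, and v0). Most design engineers' skills overlap across these profiles; anyone interested is welcome to get in touch.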

Anthropic@AnthropicAI · Jun 9 · 61

New Science Blog: Why has AI advanced faster in coding than in biology? To agents, bio databases are like cities built before cars—maddening to drive in because they're designed for different traffic. How do we build infrastructure agents can use? https://www.anthropic.com/research/agents-in-biology

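Josh Woodward@joshwoodward · Jun 9 · 67

The new killer NotebookLM feature: easily being able to expand your search beyond your own source files. Then, with today's update, you can also make new output formats: PDFs, DOCX, XLSX, PPTX, charts, etc. We want NotebookLM to keep helping you do better research.

Summary: NotebookLM received a major upgrade today, adding agentic capabilities and stronger reasoning in chat, plus the ability to search web content beyond the user's own source files. It also supports exporting to new formats such as PDF, DOCX, XLSX, PPTX, and charts. The update is available to Google AI Ultra subscribers.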

Rohan Paul@rohanpaul_ai · Jun 9 · 58

The prompt era is ending. That's too linear, too bottlenecked by humans. We are entering the loop machine of AI agents. The value is in moving judgment upstream, so the human designs the process and the model handles the recurring friction.

jason@jxnlco · Jun 9 · 18

codex and computer use is so powerful

OpenAI Developers@OpenAIDevs · Jun 9 · 53

http://x.com/i/article/2064021561112150016
# May for OpenAI Developers
May put Codex in more places you actually work. Here’s what changed for developers building with OpenAI.
We had 5/5, 5 million Codex users, and a very full commit history:
- Codex pets entered the chat, and you hatched your own.
- You can now keep Codex moving from the ChatGPT mobile app.
- Your Mac can keep running Codex while you step away.
- Computer use lets Codex work across your Mac apps.
- Codex can test web apps, gather context from your tabs, and use DevTools with the Chrome plugin.
- ⌘+⌘ now sends screenshots straight into a Codex thread.
- Windows builders, computer use is in your developer loop now.
- The Codex loop got easier to customize, automate, and recognize.
- The Realtime API got new models for voice agents, live translation, and transcription.
- We tested Realtime-2 in voice-controlled CRM and standup workflows.
- Building with Realtime-2? Start with the prompting guide.
- The Agents SDK got TypeScript support, sandbox agents, and an open-source harness.
- Private MCP servers can now connect to OpenAI products over outbound HTTPS.
For builders who want the under-the-hood details behind OpenAI products, here are a few deep dives from our team:
That’s the May commit history. Follow @OpenAIDevs on X to stay up to date.

Summary: OpenAI Developers shipped a range of updates in May: Codex passed 5 million users; Codex can now be kept running from the ChatGPT mobile app and in the background on a Mac, can use computer use across Mac apps, and can test web pages and use DevTools through the Chrome plugin; the ⌘+⌘ shortcut sends screenshots straight into Codex; and computer use is now supported on Windows as well. The Realtime API gained a new model, Realtime-2, for voice agents, live translation, and transcription, along with a prompting guide. The Agents SDK added TypeScript support, sandbox agents, and an open-source harness. Private MCP servers can connect to OpenAI products over outbound HTTPS.

Boris Cherny@bcherny · Jun 9 · 65

When we first demoed Claude Code internally, it got two reactions on Slack. A year after GA, @_catwu and I sat down to talk about what's changed: why I use auto mode instead of plan mode, how routines fix bugs before I see them, why I do most of my coding from my phone now, and where the product is going

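Summary: On the first anniversary of Claude Code's GA, Anthropic engineer Boris Cherny and @_catwu look back on how the product has evolved. When it was first demoed internally, it got two reactions on Slack. Cherny explains why he prefers auto mode over plan mode, how routines fix bugs before he even sees them, and why he now does most of his coding from his phone. The video interview also covers where Claude Code is headed.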

Yuchen Jin@Yuchenj_UW · Jun 9 · 57

On the whole: “You shouldn’t be prompting coding agents anymore. You should be designing loops that prompt your agents.” Loops are the temporary workaround: today’s LLMs have poor judgment. They struggle to know when to keep going, when to stop, or when to call a tool. Loops force agents to work longer. Loops are incredibly powerful for verifiable goals for now, as AutoResearch shows.
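A "loop that prompts your agent" for a verifiable goal can be sketched as: ask, check, feed the concrete failure back, and repeat until the check passes or a budget runs out. The stop decision moves from the model's judgment to the verifier, which is the workaround this post describes. A minimal sketch; `agent` and `verify` are hypothetical callables (for code, `verify` might run a test suite), and the retry prompt wording is an assumption.

```python
from typing import Callable, Optional


def agent_loop(agent: Callable[[str], str],
               verify: Callable[[str], Optional[str]],
               goal: str, max_iters: int = 10) -> tuple[Optional[str], int]:
    """Re-prompt the agent until `verify` passes or the iteration budget runs out.
    `verify` returns None on success, or an error message on failure."""
    prompt = goal
    for i in range(1, max_iters + 1):
        candidate = agent(prompt)
        error = verify(candidate)
        if error is None:
            return candidate, i  # the loop, not the model, decides when to stop
        # Feed the concrete failure back instead of relying on the agent's own judgment.
        prompt = f"{goal}\nPrevious attempt failed: {error}\nTry again."
    return None, max_iters  # budget exhausted without a verified result
```

The hard `max_iters` budget matters as much as the verifier: it is what stops the loop when the goal is never met.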

Rohan Paul@rohanpaul_ai · Jun 9 · 63

This paper proposes a new test to see whether AI agents truly get better as they gain experience and finds they mostly still confuse memory with learning. It shows that simple full-context learning beats the more specialized memory systems, with Claude Sonnet 4.6 using plain context getting the best overall score. That distinction matters because the next wave of AI is not supposed to answer isolated prompts. It is supposed to live inside codebases, databases, markets, sensors, clinics, and workflows where yesterday’s mistake should make tomorrow’s action sharper. The authors build CL-BENCH, a benchmark where an agent works through connected tasks in 6 domains, including coding, databases, forecasting, radio signals, poker, and disease studies. Each task hides a pattern the agent can learn over time, like a database layout, a codebase structure, or an opponent’s strategy, so better performance should come from experience rather than pretraining. They test frontier LLM systems with simple full-context memory, scratchpad notes, retrieval memory, playbook-style memory, and coding-agent setups. The key finding is that current memory-heavy AI agents are not reliably better learners than just keeping the full conversation in context. That means long-running AI agents still need better ways to remember useful lessons, forget stale ones, and adapt when the environment changes. ---- Link – arxiv.org/abs/2606.05661 Title: "Continual Learning Bench: Evaluating Frontier AI Systems in Real-World Stateful Environments"

Summary: A new paper builds CL-BENCH, a benchmark that evaluates AI agents' continual-learning ability across 6 domains: programming, databases, forecasting, radio signals, poker, and disease studies. Each task hides a pattern that can be learned over time, testing whether agents can improve beyond what pretraining gives them. Frontier LLM systems were tested with full-context memory, scratchpad notes, retrieval memory, playbook-style memory, and coding-agent setups; current memory-heavy AI agents were not reliably better than simply keeping the full conversation in context. Claude Sonnet 4.6 with plain context achieved the best overall score. The paper notes that agents still need better ways to remember useful lessons, forget stale information, and adapt when the environment changes.

ClaudeDevs@ClaudeDevs · Jun 9 · 74

Claude Code's first demo got two Slack reactions. One year after GA, @bcherny and @_catwu look back: verification best practices, why we built auto mode, routines and loops, and what's next. https://www.youtube.com/watch?v=Hth_tLaC2j8

Yuchen Jin@Yuchenj_UW · Jun 9 · 57

“You should design loops that prompt your agents.” Loops are the temporary workaround: today’s LLMs have poor judgment. They struggle to know when to keep going, when to stop, or when to call a tool. For verifiable goals, loops are incredibly powerful, as AutoResearch shows.

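宝玉@dotey · Jun 9 · 61

WeChat still isn't thinking big enough. It keeps wanting everyone to come farm its own little plot, imagining that WeChat will stay the super-entry point in the future and that everyone will keep using WeChat, so all it needs to do is let AI operate Mini Programs. But in reality, WeChat will matter less and less as an entry point. Young people in the future won't open WeChat at all; they'll just ask their own agent: "Summarize yesterday's group chats for me," or "Send my mom a message saying I won't be home for dinner tonight." And the agent that takes on the super-entry role will most likely not be WeChat AI.

Summary: WeChat published its "Guidelines for Developers Integrating with the WeChat AI Ecosystem," encouraging Mini Program developers to integrate with WeChat AI so that AI can control Mini Programs. 宝玉 commented that WeChat is trying to preserve its super-entry status by letting AI operate Mini Programs, but that young people in the future won't open WeChat on their own; they will give instructions directly to their own agent (such as "summarize my group chats" or "send Mom a message"). The agent that fills the super-entry role will most likely not be WeChat AI.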