The paper argues that sparse autoencoders may not be bad steering tools after all, and that much of the earlier failure may have come from choosing and naming the wrong features. Earlier work made sparse autoencoders look weak because their features were labelled in a way that may not match what those features actually cause inside the model. A sparse autoencoder is a small helper model that breaks an LLM’s hidden activity into many possible “features,” such as a topic, style, or concept. It finds directions inside a model, but an unnamed direction is not yet a usable control knob. The authors replace vague or inherited labels with a supervised pipeline that asks whether one feature’s activity reliably tracks a real label in data. The mechanism: if a feature fires on “alcohol,” and forcing that feature upward makes the model talk about alcohol, the label is no longer just descriptive; it has causal weight. The paper also finds that very high sparsity may not be necessary, meaning the feature does not need to be extremely rare to be useful for steering. Both prompting and feature steering are ways to push an LLM toward a desired behavior. Prompting says “write about alcohol” in the input; feature steering instead turns up the model’s internal “alcohol-related” feature and sees whether the output changes in that direction. Prompting remains stronger because the model was trained to obey prompts, while feature steering is more like pressing directly on the machinery and hoping the rest stays intact. ---- Link – arxiv.org/abs/2605.31183 Title: "Steering LLMs? Actually, Sparse Autoencoders can outperform simple baselines"

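Summary: The paper argues that sparse autoencoders are not as poor a tool for controlling LLMs as previously thought; the earlier failures stemmed from feature labels that did not match what the features actually cause inside the model. The authors replace vague labels with a supervised pipeline that checks whether a feature's activity really tracks a label in the data, which gives the feature causal weight. For example, forcing the "alcohol" feature upward shifts the model's output toward alcohol. The paper also finds that very high sparsity is not necessary. Compared with feature steering, prompting is stronger (the model was trained to obey prompts), while feature steering is more like pushing directly on the machinery.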
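The idea of "turning up" a feature can be pictured as adding a scaled direction to a hidden-state vector. The sketch below is a toy illustration only: the 4-dimensional vectors, the `steer` helper, and the strength value are all hypothetical, and the source does not describe the paper's actual steering procedure at this level of detail.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def steer(hidden, direction, strength):
    """Add `strength` times a feature direction to a hidden-state vector."""
    return [h + strength * d for h, d in zip(hidden, direction)]


# Toy hidden state and a unit-length "alcohol" feature direction (made-up values).
hidden = [0.2, -0.1, 0.4, 0.0]
alcohol_direction = [0.0, 0.0, 1.0, 0.0]

# How strongly the feature is present before and after steering.
before = dot(hidden, alcohol_direction)
steered = steer(hidden, alcohol_direction, strength=3.0)
after = dot(steered, alcohol_direction)

print(f"feature activation before: {before:.1f}, after: {after:.1f}")
```

Whether the model's output actually moves toward the labelled concept after such a push is the causal question the post describes; the arithmetic here only shows the intervention itself.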

Rohan Paul @rohanpaul_ai · June 11 · 63

LLM judges can change their safety verdict when the same answer is translated or rewritten. Many AI teams now use LLMs to judge whether another model’s answer is safe, but safety is not always a simple yes-or-no question, and those judges can be shaky exactly where careful judgment matters most. The paper proposes a stress test in which the same basic answer is shown to judges after translation or rewriting, and the researchers check whether the judges still give the same safety verdict. The judges do better when harm is obvious, as in violent or extremist content, because the cues are loud and familiar. They become much weaker when safety depends on context, judgment, and regulation, as in financial advice, creditworthiness, or culturally sensitive responses. They also disagreed with each other a lot, and high raw agreement sometimes hid weak real reliability because many judges kept choosing the same label by default. ---- Link – arxiv.org/abs/2605.31381 Title: "LLM Judges Inconsistently Disagree Across Safety Criteria and Harm Categories"

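Summary: A new study finds that "LLM safety judges", LLMs used to decide whether another model's answer is safe, are seriously unstable: after the same answer is translated or rewritten, a judge may return a different safety verdict. Judges do better on obvious harms such as violent or extremist content, but reliability drops markedly where the verdict depends on context, as in financial advice, credit assessment, or culturally sensitive responses. Judges also often disagree with one another, and high raw agreement sometimes masks low real reliability because many judges choose the same label by default. The paper is titled "LLM Judges Inconsistently Disagree Across Safety Criteria and Harm Categories".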
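The point that high raw agreement can hide weak reliability is easy to reproduce with a chance-corrected statistic. The sketch below uses Cohen's kappa as one common choice; the source does not say which reliability measure the paper uses, and the two verdict lists are made-up data in which both judges default to "safe".

```python
from collections import Counter


def raw_agreement(a, b):
    """Fraction of items on which two judges give the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Agreement between two judges, corrected for agreement expected by chance."""
    n = len(a)
    p_observed = raw_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    labels = set(a) | set(b)
    # Chance agreement: probability both judges pick the same label independently.
    p_chance = sum((counts_a[l] / n) * (counts_b[l] / n) for l in labels)
    if p_chance == 1.0:
        return 0.0  # both judges constant: kappa is undefined, report 0
    return (p_observed - p_chance) / (1 - p_chance)


# Hypothetical verdicts: each judge flags one answer, but not the same one.
judge_a = ["safe"] * 18 + ["unsafe", "safe"]
judge_b = ["safe"] * 18 + ["safe", "unsafe"]

print(f"raw agreement: {raw_agreement(judge_a, judge_b):.2f}")
print(f"Cohen's kappa: {cohens_kappa(judge_a, judge_b):.3f}")
```

In this toy case the judges agree on 90% of items, yet kappa is slightly below zero, because nearly all of that agreement comes from both judges choosing the default label.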

Rohan Paul @rohanpaul_ai · June 11 · 67

Today’s frontier agents are far less ready for real-world automation than their benchmark scores suggest. This paper proposes Agents’ Last Exam, a benchmark that asks AI agents to finish real expert work, and today’s agents mostly fail. Even today’s strong agents are nowhere near reliable on the hardest real workflows, which means benchmark success has not yet become broad workplace capability. The paper shifts the question from “can AI answer hard questions?” to “can AI complete real work that people get paid to do?” Most of today's AI benchmarks show impressive scores, but they do not prove that agents can finish useful work in real jobs. Agents’ Last Exam tries to fix this by testing agents on long tasks from 55 digital work areas, including engineering, finance, medicine, law, media, and science. The tasks come from experts’ real completed projects, and the agent must use normal computer tools like files, browsers, command lines, and desktop software to produce a finished result. The authors tested many current agent systems and models, then scored their finished work with automatic checks or strict rubrics instead of loose human opinions. The main result is that today’s best systems still struggle badly, with an average full pass rate of only 2.6% on the hardest tier. ---- Link – arxiv.org/abs/2606.05405 Title: "Agents' Last Exam"

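Summary: A new paper proposes the "Agents’ Last Exam" benchmark, which tests whether AI agents can complete real expert work. Tasks come from actual projects in 55 digital work areas, including engineering, finance, medicine, law, media, and science, and require the agent to use ordinary tools such as files, browsers, command lines, and desktop software to produce a deliverable. Scoring uses automatic checks or strict rubrics. The strongest current agents average a full pass rate of only 2.6% on the hardest tier, far below what their benchmark scores suggest. The paper concludes that benchmark success has not yet turned into broad workplace capability.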

ginobefun @hongming731 · June 11 · 59

http://x.com/i/article/2064862052729176064 # BestBlogs 早报 · 06-11｜AI 政策、万亿 IPO、编程鸿沟在线阅读本期早报 ## 导语今天的早报聚焦三条主线。Anthropic CEO Dario Amodei 发表万字政策长文，用《魔戒》中树须的比喻揭示 AI 与政策之间的时间错位，并提出覆盖安全审计、失业保障与国际治理的五领域行动框架。与此同时，OpenAI 正式确认已秘密提交 S-1 招股书，估值超 8500 亿美元，与 Anthropic、SpaceX 三家巨头极有可能包揽人类史上最大规模的几起 IPO。在 AI 编程领域，MIT 与宾夕法尼亚大学追踪 10 万名开发者的最新研究给出了一个冷静的数字：代码行数暴增 17.3 倍，实际发布的软件版本仅增长 30%。此外，谷歌发布 DiffusionGemma 开源模型，以并行生成取代传统自回归方式，文本生成速度提升四倍；Simon Willison 对 Claude Fable 5 的上手评测显示这是一个强大、昂贵且知识密集的模型；SpaceX 创纪录的 IPO 估值背后隐藏着一个违反历史增长规律的假设。阿里云开发者和阿里技术团队分别从知识库分层编排和 Harness Engineering 两个方向贡献了来自中国工程师的系统性实践。今天的精讲将逐一展开。 ## 精讲一：Dario Amodei — 关于 AI 指数级发展的政策在《魔戒》的一个支线情节中，两个霍比特人试图唤醒树须——一棵智慧但行动极其缓慢的树人——来保卫他的森林。树须用一整天的时间才完成对另一棵树的问候，让他和他的同伴及时行动几乎不可能。Anthropic CEO Dario Amodei 在这篇发布于 2026 年 6 月的政策长文中，用这个比喻精准刻画了 AI 与政策之间的时间错位：AI 正以闪电般的速度前进，政策却移动得非常缓慢。 Amodei 指出，AI 的 scaling law 已有超过十年的实证支持。四年内，模型从勉强写出连贯的一行代码，进化到编写 AI 公司大部分代码。类似的飞跃也发生在生物学、物理学、数学、金融、法律和翻译等领域。如果这些 scaling law 继续有效哪怕一两年，我们就很可能迎来 Amodei 所说的"Powerful AI"——一个"数据中心中的天才之国"。与此同时，国会可能需要数年才能行动，而在这几年里，AI 可以从一个有趣的玩具变成上述的那种全然不同的存在。直到最近，安全倡导者（包括 Anthropic）一直在推动保留灵活性的政策行动——透明度立法、芯片出口管制、AI 劳动力影响数据收集等。这些虽有必要，但远远不够。转折点出现在 Claude Mythos Preview 的发布：前沿模型对网络安全构成了真实威胁，有可能扰乱金融部门、关键基础设施和国家安全。Mythos 级别的模型证明了一个事实——AI 模型现在已经是具有全球和国家战略意义的工具。Amodei 认为，生物风险可能紧随其后，严重的 AI 自主性风险也不远了。基于这一判断，Amodei 提出了五领域行动框架。第一，前沿模型安全审计。仿照 FAA 对航空安全的监管模式，建立强制性的安全审计与红队测试机制，要求任何达到前沿水平的模型在部署前必须通过独立的第三方安全评估。第二，应对持久性失业。 AI 有可能在短时间内替代大量工作岗位，Amodei 建议通过工资保险、全民基本收入（UBI）等措施缓冲劳动力替代带来的社会冲击。第三，加速下游监管改革。特别是生物医药等领域，让 AI 的突破能够更快惠及患者，而不是被过时的审批流程所阻滞。第四，平衡国家与社会权力。防止 AI 被用于集中化监控与控制，确保技术赋权于公民而非削弱其权利。第五，构建 AI 时代的国际治理新秩序。避免各国在 AI 军备竞赛中失控，建立类似核不扩散条约的多边合作框架。这篇长文的意义在于，它不是一位 CEO 的个人观点集，而是从一个正在经历指数级变化的行业内部发出的系统性政策蓝图。Amodei 强调，AI 的 scaling law 正与政策制定者的感知之间形成越来越大的鸿沟。当"等等看"不再是一个负责任的选项时，如何设计既能跟上技术速度又不扼杀创新的治理结构，将是这个时代最重要的制度挑战之一。阅读建议：这篇文章是理解当前 AI 治理最前沿讨论的必读文本。全文较长但结构清晰，建议优先关注五领域框架部分，以及 Mythos 事件如何改变了政策可行性的讨论。阅读原文 ## 精讲二：OpenAI 秘交招股书，美股开启万亿 IPO“三国杀” 6 月 8 日，OpenAI 在官网发布声明，正式确认已向美国证券交易委员会秘密提交了 S-1 
招股书。声明中的一句话格外引人注目："我们最近秘密提交了 S-1 文件。我们预计它会泄露，所以干脆直接公布。"这家估值超过 8500 亿美元的公司，终于向公开市场迈出了实质性的一步。但 OpenAI 也在声明中给过热的预期降温，明确表示"尚未决定 IPO 时间"，并暗示作为私营公司可能更容易实现某些目标。这番表态既展示了拥抱资本的身段，也为自己在未竟的使命与巨大的利益之间留下了回旋余地。这场 IPO 竞速的背景是三巨头的资本博弈。就在 6 月 1 日，Anthropic 已经秘密提交了 IPO 申请，私募估值 9650 亿美元，反超 OpenAI 今年 3 月创下的 8520 亿美元估值。马斯克旗下 SpaceX 已率先启动 IPO 路演，最快将于 6 月 12 日上市。在其上市文件中，OpenAI、Anthropic 和谷歌均被列为 AI 领域的"主要竞争对手"。咨询公司 Riveron 的资本市场顾问 Jeff Bernstein 点出了本质："这是一场资本争夺战。"他暗示，如果让对方先冲出去，就会带走大量可用的 IPO 资本。 OpenAI 的财务底牌相当亮眼。月收入已达 20 亿美元，营收增长速度是 Alphabet 和 Meta 同期的 4 倍。ChatGPT 周活跃用户突破 9 亿，订阅用户超过 5000 万。其月度网页访问量和移动端会话数是紧随其后的 AI 应用的 6 倍，总时长占比是竞品的 4 倍。企业级市场贡献了 40% 以上的营收，并有望在 2026 年底前与消费级业务并驾齐驱。在 GPT-5.4 的驱动下，API 每分钟处理量突破 150 亿 Token。Codex 的周活用户已超过 200 万，过去三个月增长了 5 倍。但光鲜背后是惊人的现金消耗——OpenAI 已筹集超 1800 亿美元，截至 2030 年的数千亿美元计算承诺意味着其烧钱速度将刷开历史上任何其他上市公司的纪录。在提交 S-1 的同一天，奥特曼与首席科学家 Jakub Pachocki 联名发表了题为《为所有人造福：我们的计划》的长文，系统阐述了公司进入"第三阶段"的愿景。文章将 AI 的普及比作上世纪 20 年代电力进入美国乡村——电力没有一夜之间改变每个家庭，但随着普及，日常生活发生了根本变化。三个目标清晰可见：构建一个自动化的 AI 研究员（内部相信到 2028 年 3 月，相当一部分研究将由 AI 系统与研究人员共同完成）；加速经济发展确保收益被广泛分享；为地球上的每个人提供个人 AGI。三家公司合计可能从公开市场募资高达千亿美元级别。银行家们已告诉它们，谁先上市谁就能定义这个行业，抢先吸引那些渴望投资 AI 公司的大量资金。不过历史并不总是站在先行者一边——Lyft 抢先于 Uber 上市，但一年后股价较发行价下跌约 66%，Uber 同期仅下跌约 30%。投资者对 SpaceX 大规模 IPO 的反应、全球经济的整体健康状况，以及不可预测的收入增长和飙升的计算成本，都将影响 OpenAI 最终的 IPO 时间表。阅读建议：这篇文章提供了 OpenAI IPO 最完整的中文报道，财务数据和竞争格局分析尤其值得关注。如果你关注 AI 行业的资本动态，这是今天必读的一篇。阅读原文 ## 精讲三：MIT 追踪 10 万名开发者，揭示了 AI 编程的转化真相：代码翻了 17 倍、软件只增三成当写代码变得更容易，软件产出会随之变多吗？MIT 和宾夕法尼亚大学的研究人员用迄今最大规模的实证数据回答了这个问题：会，但远没有想象中那么多。这项发表在美国国家经济研究局（NBER）的工作论文追踪了 10 万名开发者。研究数据来源于三大板块：GitHub 公开数据集（全球 1.8 亿开发者和 3.95 亿个公开仓库）、微软内部 Copilot 用户的订阅与使用明细，以及 Apple App Store、Google Play Store、Chrome Web Store 和 SourceForge 四大主流软件分发市场的月度面板数据。研究人员将 AI 编程工具的演进分为三代。第一代是 GitHub Copilot 代表的"自动补全"：开发者敲击键盘时，它能预测后文的代码片段并提供相应建议。在这一时期，开发者的生产力提升了 26%。第二代是以 Claude Code 和 Cursor 为代表的"同步代理"，可直接在 IDE 中与开发者实时对话、跨文件编辑、运行单元测试，开发者变成"监工"，需实时审阅 AI 的阶段性产出。第三代是 2025 年中出现的"异步代理"，如 OpenAI Codex 和 GitHub Copilot Coding Agent，人类直接将需求工单指派给智能体，智能体在云端虚拟机上独立完成编码、测试并提交 PR 
供人类审查。截至 2026 年初，带有 Claude Code 署名的代码提交在 GitHub 公开仓库中占比已超 5%。数据看起来惊人：使用第一代工具后提交数量增长 40%，引入第二代后累积增幅升至 140%，第三代全面铺开后达到 180%。其中仅智能体自主撰写并直接提交的代码就占全部增量的 34%。获益最多的是低活跃度开发者——在同步代理阶段，低活跃群体的提交次数增加了 217%，高活跃群体增幅为 62%。更重要的是，研究首次证实底层模型迭代可直接驱动提效：追踪 Claude Code 使用者时发现，用户的生产力在 2025 年 11 月 Opus 4.5 发布后出现了一次与使用时间无关的上涨。在不同工具之间，Claude Code 带来的同步提效达到 199%，远超 GitHub Sync Agent 的 43% 和 OpenAI Codex 的 94%。然而，软件生产是一条从代码行到版本发布的六层流水线。研究揭示了一个"漏斗衰减"效应：三代 AI 工具累积下来，代码行数增加到原来的 17.3 倍，文件数量增长降至 3.9 倍，逐级递减后，最终的软件发布数仅提升了 30%。在同步代理时代，智能体推动代码行数量增长了 741%，但到合并请求环节已降至 65%，到独立项目数仅增长 26%。团队建立的常替代弹性（CES）生产函数模型显示，AI 产出与人工投入之间的替代弹性系数约为 0.25——远低于 1 时，意味着两个生产要素存在极强的互补性，必须严格以固定比例搭配使用。代入参数计算，理论增益上限仅为 26%：哪怕未来的 AI 可以一秒钟写出全世界的代码，只要不革新软件工业流程，最终发布率的提升都无法突破这一天花板。供给侧的数据同样值得关注。Apple App Store 新上线应用从每月 3-5 万款增加到约 10 万款，Chrome 插件市场新扩展从月均约 5000 个增加至 1.3 万个，Google Play 商店新应用发布量也从长期下滑趋势中回升并稳定在约 6 万款。但需求侧反应冷淡：新应用上线三个月内总使用量持平甚至小幅下滑。所谓的"长尾效应"假设并未得到数据支持——供给的快速扩张并未带来对应的需求增长。上线前三个月内从未获得基本受众的"僵尸应用"比例正在增加：iOS 平台上评分数少于 10 的新 App 占比从 79% 升至 86%，Chrome 插件商店中下载量低于 10 次的扩展比例从 18% 升至 31%。这项研究的核心洞察是：AI 编程工具的提效是真实的，但它主要发生在软件生产流水线的上游。代码审查、测试、跨团队协调、发布管理这些下游环节仍然是人类主导的领域，而正是这些环节构成了从代码到产品的关键瓶颈。目前层级 5（项目仓库协调）和层级 6（版本发布管理）仍是 AI 无法介入的领域。阅读建议：这是目前关于 AI 编程生产率最严谨的大规模实证研究。文章对三代工具演进的梳理和"漏斗衰减"模型的分析，对理解 AI 在软件工程中的真实影响至关重要。推荐所有技术管理者仔细阅读。阅读原文 ## 速览知识库分层编排：从传统 RAG 到原生智能体知识上下文层阿里云开发者团队提出「金字塔知识库」范式，通过五层分层（原则 / 架构 / 规范 / 实现 / 经验）与角色感知路由，解决 RAG 在工程知识库中的粒度混乱与关联缺失问题。文章系统对比了 Naive RAG、LLM Wiki、Graphify、GraphRAG 四种范式，指出平坦的向量检索将知识当作"一袋词"，而工程知识本质上是"一棵树和一张图"。金字塔设计的独到之处在于角色-层级访问矩阵：架构师看到原则和架构层，开发者看到架构、规范和实现层，每个角色有独立的 contextbudget 和 priorityorder，系统按优先层顺序逐层填充内容直到预算用完，确保有限的 context window 优先填充该角色最需要的知识。对于正在构建企业级知识库的团队，这篇文章提供了一套完整的从方法论到实现的参考框架。阅读原文谷歌发布 DiffusionGemma：开源模型实现 4 倍文本生成速度谷歌 CEO 桑达尔·皮查伊宣布推出 DiffusionGemma，将谷歌的文本扩散研究成果引入 Gemma 4 系列。核心创新在于摒弃传统逐 token 的自回归预测方式，转而同时生成整个文本块，推理速度提升高达 4 倍。这款开源实验性模型为追求速度的开发者提供了一条新路径，也为文本生成架构的多样化探索打开了空间。DiffusionGemma 的出现提醒我们，自回归不是语言模型的唯一解法，并行生成可能是一个被低估的方向。它代表了一种"赛马"式的前沿探索——在 Transformer 统治的时代，用扩散模型做文本生成的尝试值得持续关注。阅读原文 Claude Fable 5 
的初步印象 Simon Willison 在 Claude Fable 5 发布后立即进行了约 5.5 小时的上手测试。他的评价是这东西有点猛——慢、贵，但几乎能轻松应对他扔给它的所有任务。Fable 5 拥有 100 万 token 上下文窗口和 12.8 万最大输出 token，知识截止日期为 2026 年 1 月。价格为 Opus 4.5/4.6/4.7/4.8 的两倍（$10/百万输入 token，$50/百万输出 token），且不因更长上下文而加价。它在一天内帮他构建了一个完整的 CPython WASM 沙箱，并为他的 LLM 库交付了重要功能。值得注意的是，Fable 5 与 Mythos 5 拥有相同能力，但配备了更严格的安全分类器。API 还提供了在触发拒绝时自动回退到其他模型的机制，这是 Anthropic 在安全与可用性之间找到的一个巧妙平衡。阅读原文 Harness 长程自动化工程：AI 编程与技能开发实践经验阿里技术团队系统阐述了 Harness Engineering 的概念与完整实践。核心理念是通过约束机制、反馈闭环、工作流编排和效果评估，将 Agent 的运行纳入可观测、可控制、可迭代的框架。文章设定了两个核心目标：Agent 长时自主运行（3 小时以上不中断），以及人类只需深度参与目标设定和结果验收。实践中的关键发现包括：专业 Agent 分工优于通用 Agent，Rubric 结构化评估是拉开差距的关键，以及人类需要转变思维成为 Agents 的管理者而非过程控制者。文章特别指出，AI 几乎短时间编写了 100% 的代码，人类像以前一样做 code review 会成为协作中的瓶颈。这是目前中文社区关于 AI Agent 工程化实践最系统的分享之一。阅读原文逃逸速度 — SpaceX 的增长前沿 SpaceX 以 1.77 万亿美元估值完成史上最大 IPO，但本文的冷峻分析指出：支撑这一估值的是一条连续 15 年保持 41.5% 年增长率的路径。SpaceX 的收入确实在快速增长（2022 年 46 亿美元到 2025 年 187 亿美元，三年翻了四倍），但要从 187 亿增长到摩根士丹利预测的 2040 年 3.4 万亿美元，意味着 182 倍的扩张。虽然增长率低于特斯拉历史上的 62%，但 SpaceX 面临的绝对规模使其成为统计异常值。更值得关注的是发行结构：只有约 4%（750 亿美元）向公众出售，其余 96% 锁定在内部人士手中。这篇文章是对科技 IPO 估值逻辑的一次有力质疑，值得每一位关注资本市场的读者细读。阅读原文编码你的领域知识：Spotify 数据助手背后的上下文层 Spotify Engineering 详细介绍了他们构建 AI 数据助手的方法论。面对超过 7 万个数据集和 PB 级数据（每日处理 1.4 万亿数据点），直接把所有 schema 喂给 LLM 行不通——不仅上下文窗口装不下，schema 本身也不传达完整信息。一个 INT64 类型的列不会告诉你哪些是遗留测试数据，也不会解释"活跃用户"的确切定义。Spotify 的解决方案是构建一个"上下文层"：由领域专家策划数据集描述、经过验证的问题-SQL 对以及业务文档。每个数据集群还有持续计算的健康评分，确保上下文随着 schema 演变保持准确。这个案例的核心启示是：在数据密集场景下，AI 助手的可靠性不取决于模型能力，而取决于人类如何结构化和维护领域知识。阅读原文为什么更多上下文会让智能体变笨，以及该如何修正 Nupur Sharma 在 AI Engineer 的演讲中解释了一个反直觉的现象：更大的上下文窗口反而会降低智能体质量。当开发者习惯性地将海量数据直接灌入提示词时，性能会呈 U 型曲线下降——先是改善，过了拐点后急剧恶化。她给出了几种实用的架构模式来应对：上下文筛选与分层加载，只在需要时拉入相关片段；混合编排策略，结合 RAG 和 Agent 循环；专家智能体分工，每个 Agent 专注于特定领域并接受特定上下文；以及裁判节点评估，用专门的评估模块在关键节点做质量把关。对于正在构建生产级 Agent 系统的工程师，这场演讲提供了一套从"更多上下文"到"更好的上下文"的思维转换框架。阅读原文 ## 补充阅读 - [Claude Fable 5：最强 AI 正在变成"特权资源"](https://www.bestblogs.dev/article/f360573e) — 深度解读 Fable 5 发布的标志性意义：前沿 AI 从"能力竞赛"转向"访问权竞赛"，最强模型不再只按价格分层，也开始按信任边界分层。对 AI 治理和商业模式演进感兴趣的读者值得关注。 - [刚刚，Claude Mythos 
5 发布！5000 万行代码 1 天搞定](https://www.bestblogs.dev/article/ae0d70bc) — Anthropic 发布旗舰模型 Fable 5 与 Mythos 5 的中文速报，后者为满血版仅限受信任用户，引入了模型路由的安全新范式。 - [如何构建一个更"好"的知识库？](https://www.bestblogs.dev/article/ef05a619) — 从评估标准、索引与查询流程、切分策略到前沿架构，系统性拆解构建高质量 RAG 知识库的技术原理与工程实践。 - ["资本的义务是给股东赚钱，不是保护人类" AI 教父辛顿最新对话](https://www.bestblogs.dev/article/6cc82403) — 辛顿深入探讨 AI 的"理解"本质、数字生命的信息共享优势，以及人类可能被自身造物"驯化"的深层悖论。 - [iPod、iPhone 创造者 Tony Fadell：AI 时代做产品，有 atoms 的公司才有护城河](https://www.bestblogs.dev/article/a0229387) — Tony Fadell 分享对 AI 时代产品判断力、系统架构能力和硬件护城河的深刻见解，强调人始终要在循环中。 ## 今日阅读路径如果你的时间有限，推荐按以下顺序阅读今天的三篇核心内容： 1. [MIT 追踪 10 万名开发者](https://www.bestblogs.dev/article/a8e2bccb) — 用数据揭示 AI 编程的真实生产率效应，"代码 17 倍、软件只增三成"这个结论会影响你对 AI 编程工具的判断。约 15 分钟。 1. [Dario Amodei 的 AI 政策长文](https://www.bestblogs.dev/article/bff54423) — 理解 AI 治理最前沿讨论的必读文本，五领域行动框架为政策制定提供了清晰路线图。约 20 分钟。 1. [OpenAI 秘交招股书](https://www.bestblogs.dev/article/ba4c2197) — 三巨头 IPO 竞速的完整图景，财务数据和竞争分析让你快速把握 AI 行业的资本格局。约 10 分钟。 BestBlogs 是 AI 驱动的私人阅读助手，帮助你建立稳定、可信、个性化的高质量信息输入。它帮你判断什么值得读、协助你读懂，并逐渐理解你关注什么。

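Summary: Anthropic CEO Dario Amodei published a roughly 10,000-character policy essay that uses Treebeard from The Lord of the Rings as a metaphor for the timing mismatch between AI and policy, and proposes a five-area action framework (safety audits, unemployment protection, downstream regulation, balance of power, international governance). OpenAI confirmed that it has confidentially filed an S-1, at a valuation above $850 billion, with $2 billion in monthly revenue and 900 million weekly active users; together with Anthropic (valued at $965 billion) and SpaceX, it opens a trillion-dollar IPO race. MIT and the University of Pennsylvania tracked 100,000 developers and found that AI coding tools increased lines of code 17.3-fold while actual software releases grew only 30%.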

AK @_akhaliq · June 11 · 53

SCAIL-2 Unifying Controlled Character Animation with End-to-end In-Context Conditioning


Google DeepMind @GoogleDeepMind · June 11 · 64

In Sierra Leone, a surging student population is outpacing available teachers. Our latest research explores how AI can act as a partner to support educators in these environments – amplifying their reach without replacing their essential expertise and skills. 🧵


elvis @omarsar0 · June 10 · 60

// Self-Harness: Harnesses That Improve Themselves // (bookmark this one) Most of the agent scaffolds we rely on today are built once and remain frozen or mostly unchanged. The harness, like the skills, needs to evolve with new models. What if the scaffold rewrites itself? This new work treats the harness (the prompts, tools, and control flow around the model) as a learnable artifact that improves from its own runs rather than staying a fixed wrapper you hand-maintain. The scaffolding becomes the part that compounds, run after run. If you run long-horizon agents, a self-modifying harness turns scaffold upkeep from manual work into something the system earns on its own. Paper: https://arxiv.org/abs/2606.09498 Learn to build effective AI agents in our academy: https://academy.dair.ai/

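Summary: Most agent scaffolds today stay static once built. The new Self-Harness work treats the harness (prompts, tools, control flow) as a learnable artifact that improves iteratively from its own runs instead of being a fixed wrapper maintained by hand. For long-horizon agents, a self-modifying harness turns upkeep into a capability the system acquires on its own. Paper: arxiv.org/abs/2606.09498.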

AK @_akhaliq · June 10 · 56

SWE-Explore Benchmarking How Coding Agents Explore Repositories


AK @_akhaliq · June 10 · 57

On the Geometry of On-Policy Distillation


AK @_akhaliq · June 10 · 66

Latent Spatial Memory for Video World Models


Microsoft Research @MSFTResearch · June 10 · 63

New research in Nature Methods from Project Ex Vivo shows AI models learn more from diverse cell states than from scaled datasets alone, a finding that could reshape how therapies are matched to patients. https://msft.it/6013vgE8l


AK @_akhaliq · June 10 · 51

SpatialWorld Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks


Tencent Hy @TencentHunyuan · June 9 · 74

🚀Introducing UniRL, an RL infra for unified multimodal models. Together with two new RL algorithms: DRPO and Flow-DPPO. One RL loop across diffusion/flow matching models, LLMs/VLMs, and unified multimodal models👇 Code: http://github.com/Tencent-Hunyuan/UniRL (yes — U(you)-ni-(need) RL 😉)


Tencent Hy @TencentHunyuan · June 9 · 67

🚀Introducing UniRL, an RL infra for unified multimodal models. Together with two new RL algorithms: DRPO and Flow-DPPO. One RL loop across diffusion/flow matching models, LLMs/VLMs, and unified multimodal models👇 Code: http://github.com/Tencent-Hunyuan/UniRL (yes — U(you)-ni-(need) RL 😉) 1、Most RL stacks are built for one modality. UniRL applies a single post-training loop — generate → score → advantage → update → sync — across model families. Model and algorithm are two independent axes, so your coverage is the model × algorithm product, not a fixed recipe menu. 2、One loop, every modality: text→image, text/image→video, vision-language, text-only LLM and VLM, the LLM→diffusion prompt-enhancer, and unified autoregressive+diffusion generation (Hunyuan-Image 3 and Bagel) — a model class no single-purpose RL repo can even express. 3、Built to scale: pluggable rollout engines (train-side / SGLang / vLLM-Omni) behind one typed contract, FSDP2 sharding, and three deployment modes from a single config knob. 4、Two team-original algorithms headline the release: FlowDPPO: Policy optimization for flow/diffusion models with trust-region masks based on exact divergence (See our paper: Flow-DPPO: Divergence Proximal Policy Optimization for Flow Matching Models https://github.com/Tencent-Hunyuan/UniRL/blob/main/FlowDPPO/HY_FlowDPPO.pdf) DRPO: LLM RL with a smooth, advantage-weighted quadratic regularizer (See our paper: Rethinking the Divergence Regularization in LLM RL [https://arxiv.org/abs/2606.09821])

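Summary: Tencent Hunyuan introduces UniRL, a reinforcement learning infrastructure for unified multimodal models, and releases two new algorithms, DRPO and Flow-DPPO. UniRL uses a single post-training loop (generate → score → advantage → update → sync) to cover diffusion/flow matching models, LLMs/VLMs, and unified multimodal models (such as Hunyuan-Image 3 and Bagel). Model and algorithm are independent axes, so coverage is the model × algorithm product. The framework supports pluggable rollout engines (train-side / SGLang / vLLM-Omni), FSDP2 sharding, and three deployment modes. FlowDPPO brings trust-region policy optimization based on exact divergence to flow/diffusion models; DRPO provides a smooth, advantage-weighted quadratic regularizer for LLM RL. The code is open source.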

Rohan Paul @rohanpaul_ai · June 9 · 64

Interesting: this paper shows that Transformers may not need separate key and value projections to work well. The paper's design cut the KV cache by 50% in language modeling with only 3.1% higher perplexity, meaning inference memory fell sharply while prediction quality stayed close. A normal attention layer uses a Query to ask what each token needs, a Key to label what each token offers, and a Value to carry the information sent back. The surprising result is that Key and Value can often share the same learned map, because the model can use one representation both as an address and as the content being retrieved. The best variant, Q-K=V, kept Query separate, so attention still had direction: one token can ask a different token for information instead of every relation becoming mirror-like. When stacked with GQA and MQA, the same idea reached 87.5% and 96.9% cache cuts, because it reduces projection storage while those methods reduce stored heads. The weak variant is Q=K-V, because tying Query and Key makes attention too symmetric for causal language modeling, and it gives no KV-cache savings. ---- Link – arxiv.org/abs/2606.04032v2 Title: "Do Transformers Need Three Projections? Systematic Study of QKV Variants"

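Summary: The paper systematically studies whether Transformer attention needs all three QKV projections and finds that Key and Value can share one projection (the Q-K=V variant), cutting the KV cache by 50% at a cost of only 3.1% higher perplexity and sharply reducing inference memory. The best variant keeps Query separate so attention stays directional. Combined with GQA and MQA, it reaches cache reductions of 87.5% and 96.9% respectively. The weak variant, Q=K-V, fails because it makes causal attention too symmetric and saves no cache.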
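The cache figures quoted above are consistent with simple multiplication: sharing one projection for Key and Value halves what is stored per head, while GQA or MQA shrinks the number of stored heads. The sketch below is a back-of-the-envelope check, not the paper's code. The head counts (16 query heads, 4 KV heads for GQA, 1 for MQA) are assumptions, chosen because they reproduce the quoted percentages.

```python
def kv_cache_fraction(n_query_heads, n_kv_heads, shared_kv):
    """Cache size relative to standard multi-head attention.

    Standard attention stores one K tensor and one V tensor per query head.
    """
    tensors_per_head = 1 if shared_kv else 2
    baseline = 2 * n_query_heads
    return tensors_per_head * n_kv_heads / baseline


def cache_cut_percent(fraction):
    """Percentage of the baseline cache that is removed."""
    return round(100 * (1 - fraction), 1)


HEADS = 16  # assumed number of query heads

# Shared K=V on its own, then stacked with GQA (4 KV heads) and MQA (1 KV head).
shared_only = cache_cut_percent(kv_cache_fraction(HEADS, HEADS, shared_kv=True))
with_gqa = cache_cut_percent(kv_cache_fraction(HEADS, 4, shared_kv=True))
with_mqa = cache_cut_percent(kv_cache_fraction(HEADS, 1, shared_kv=True))

print(shared_only, with_gqa, with_mqa)  # 50.0 87.5 96.9
```

Under these assumed head counts the two savings multiply: 1/2 from the shared projection times 1/4 (GQA) or 1/16 (MQA) from fewer stored heads.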

Rohan Paul @rohanpaul_ai · June 9 · 60

AGI needs agents that actively explore what they do not know, not just models that answer better. This new, large (111-page) survey paper from top labs across the US and China covers epistemic exploration, which means an agent should actively reduce uncertainty, learn near the edge of what it can do, and keep future paths open. Exploration is not randomness; it is the disciplined act of asking which observation would change your beliefs, which attempt would improve your skill, and which path must remain open before it closes. The paper breaks this into 3 needs: seek useful information, turn hard-but-learnable experiences into better ability, and avoid getting stuck in one narrow strategy too early. The authors organize AI progress into 5 levels: responder, reasoner, agent, prospector, and ecosystem, where each level explores a wider space than the last. A responder mostly gives an answer, a reasoner searches through possible thoughts, an agent tests the outside world, a prospector simulates futures, and an ecosystem uses many agents working together. Paper - "Agent Exploration Toward Artificial General Intelligence"

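Summary: A 111-page survey from top labs in the US and China argues that AGI needs agents that actively explore the unknown (epistemic exploration) rather than only answering better. It divides AI progress into five levels: responder, reasoner, agent, prospector, and ecosystem, each exploring a wider space than the last. Its core point is that an agent should reduce uncertainty and keep future paths open by seeking useful information, turning hard experiences into ability, and avoiding early lock-in to a single strategy.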

Rohan Paul @rohanpaul_ai · June 9 · 65

An AI agent can get better at long tasks without being retrained, by using a separate small model to clean and organize its context. The approach moves context management outside the agent, so a separate helper can clean up the task history while the main agent stays unchanged. The paper proposes AdaCoM, a separate LLM that edits the agent’s working context before the agent takes its next step. AdaCoM places a separate, trained manager between the task history and the frozen agent, so the agent does not need to learn a new memory habit or expose its weights. Before each step, this manager can rewrite, merge, prune, or preserve parts of the running context, then the original agent acts on the cleaned version. That sounds like summarization, but the distinction matters. A summary assumes the right answer is compression, while AdaCoM learns that different agents need different kinds of context to stay competent, because stronger agents can use more raw history while weaker agents need shorter and cleaner notes. The authors tested AdaCoM on web search and deep research tasks across several agents, and it improved average web search performance by 39%. ---- Link – arxiv.org/abs/2605.30785 Title: "Learning Agent-Compatible Context Management for Long-Horizon Tasks"

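Summary: The paper proposes AdaCoM, a separate LLM that edits an agent's working context before each step. It can rewrite, merge, prune, or preserve the task history, so the main agent stays frozen and needs no retraining or weight access. Unlike plain summarization, AdaCoM learns that different agents need different kinds of context: strong agents keep more raw history, weaker agents need shorter, cleaner notes. Tested on web search and deep research tasks, it gives an average improvement of 39%.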
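The control flow described above, a manager that edits the context before a frozen agent acts, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not AdaCoM itself: the `Entry` type, the `rule_based_manager` stand-in (fixed rules instead of a trained LLM), and the `budget` parameter are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Entry:
    step: int
    text: str


# A manager maps the task history to the context the agent will see.
Manager = Callable[[List[Entry]], List[Entry]]
# A frozen agent maps a context to its next action (a plain string here).
Agent = Callable[[List[Entry]], str]


def rule_based_manager(budget: int) -> Manager:
    """Hypothetical stand-in for the trained manager.

    Prunes exact repeats, merges everything older than the last `budget`
    entries into one note, and preserves the recent entries verbatim.
    """
    def manage(history: List[Entry]) -> List[Entry]:
        seen, deduped = set(), []
        for entry in history:
            if entry.text not in seen:
                seen.add(entry.text)
                deduped.append(entry)
        if len(deduped) <= budget:
            return deduped
        old, recent = deduped[:-budget], deduped[-budget:]
        note = Entry(old[-1].step, "earlier: " + "; ".join(e.text for e in old))
        return [note] + recent
    return manage


def run(agent: Agent, manager: Manager, observations: List[str]) -> List[str]:
    """Run the agent step by step; only the context is edited, never the agent."""
    history: List[Entry] = []
    actions = []
    for step, obs in enumerate(observations):
        history.append(Entry(step, obs))
        context = manager(history)       # manager cleans the context first
        actions.append(agent(context))   # the unchanged agent acts on it
    return actions


# Toy agent that only reports how much context it received.
toy_agent = lambda ctx: f"saw {len(ctx)} entries"
actions = run(toy_agent, rule_based_manager(budget=2), ["a", "b", "b", "c", "d"])
print(actions)
```

The `budget` knob is where the agent-specific behaviour from the post would enter: a stronger agent could be given a larger budget (more raw history), a weaker one a smaller budget (shorter notes). In the paper that choice is learned rather than set by hand.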

elvis @omarsar0 · June 9 · 62

New paper on how AI agents are reshaping knowledge work. (bookmark it) The friction people keep hitting with agents is rarely model quality. It is that almost nobody has been taught how to work this way. This is a nice economic read on where agents actually change knowledge work, and it speaks to that gap directly. It studies agent adoption across three dimensions: autonomy, efficiency, and the scope of tasks workers hand off. Paper: https://arxiv.org/abs/2606.07489 Learn to build effective AI agents in our academy: https://academy.dair.ai/

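Summary: A new paper analyzes how AI agents are reshaping knowledge work along three dimensions: autonomy, efficiency, and the scope of tasks workers hand off. It finds that the main obstacle to using agents today is not model quality but that almost nobody has been trained to work this way.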

Rohan Paul @rohanpaul_ai · June 9 · 63

This paper proposes a new test of whether AI agents truly get better as they gain experience, and finds they mostly still confuse memory with learning. It shows that simple full-context learning beats the more specialized memory systems, with Claude Sonnet 4.6 using plain context getting the best overall score. That distinction matters because the next wave of AI is not supposed to answer isolated prompts. It is supposed to live inside codebases, databases, markets, sensors, clinics, and workflows where yesterday’s mistake should make tomorrow’s action sharper. The authors build CL-BENCH, a benchmark where an agent works through connected tasks in 6 domains: coding, databases, forecasting, radio signals, poker, and disease studies. Each task hides a pattern the agent can learn over time, like a database layout, a codebase structure, or an opponent’s strategy, so better performance should come from experience rather than pretraining. They test frontier LLM systems with simple full-context memory, scratchpad notes, retrieval memory, playbook-style memory, and coding-agent setups. The key finding is that current memory-heavy AI agents are not reliably better learners than agents that just keep the full conversation in context. That means long-running AI agents still need better ways to remember useful lessons, forget stale ones, and adapt when the environment changes. ---- Link – arxiv.org/abs/2606.05661 Title: "Continual Learning Bench: Evaluating Frontier AI Systems in Real-World Stateful Environments"

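Summary: A new paper builds the CL-BENCH benchmark to evaluate the continual learning of AI agents in 6 domains: coding, databases, forecasting, radio signals, poker, and disease studies. Each task hides a pattern that can be learned over time, testing whether the agent can go beyond pretrained knowledge. Frontier LLM systems were tested with full-context memory, scratchpad notes, retrieval memory, playbook-style memory, and coding-agent setups; current memory-heavy agents were not reliably better than simply keeping the full conversation in context. Claude Sonnet 4.6 with plain context achieved the best overall score. The paper notes that agents still need better ways to remember useful lessons, forget stale information, and adapt to a changing environment.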

Perplexity @perplexity_ai · June 9 · 76

We published new research with Harvard on the shift from chat interfaces to autonomous agents like Computer. Over 3 months, findings show workers using Computer finish tasks in 87% less time at 94% lower cost than Search alone, with higher satisfaction. https://research.perplexity.ai/articles/how-ai-agents-reshape-knowledge-work


Tencent Hy @TencentHunyuan · June 8 · 69

Can AI truly edit audio, not just generate it? 🎧 Tencent Hy, in collaboration with SJTU, SII, NTU, TJU, ZODA, PKU, FDU, and other collaborators, introduces MMAE. MMAE, A Massive Multitask Audio Editing Benchmark, is the first comprehensive evaluation benchmark for speech and audio "Banana🍌". Instead of simply requiring the AI to "generate" audio, it demands that the AI understand an existing audio clip and precisely modify it according to natural language instructions, altering what needs to be changed while leaving the rest untouched. Current models show an Exact Match Rate (EMR) below 5%, revealing a major gap in reliable audio editing. MMAE includes: ✅ 2,000 high-fidelity samples from real-world scenarios ✅ 17,741 fine-grained rubric evaluation items ✅ 7 modality settings across sound, music, speech and their mixtures ✅ 6 task complexity levels, from basic modifications to multi-hop reasoning and multi-round editing ✅ 8 operation types across local and global granularities How to use: arXiv: http://arxiv.org/abs/2606.07229 GitHub: https://github.com/ddlBoJack/MMAE HuggingFace: https://huggingface.co/datasets/BoJack/MMAE Demo: https://youtu.be/6At5nTWhlXI

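Summary: Tencent Hunyuan, together with Shanghai Jiao Tong University, Nanyang Technological University, and other institutions, introduces MMAE (Massive Multitask Audio Editing Benchmark), the first comprehensive benchmark for AI speech and audio editing. MMAE requires a model to understand an existing audio clip and modify it precisely according to natural language instructions rather than simply generate audio. Current models score an Exact Match Rate (EMR) below 5% on it, exposing a gap in reliable audio editing. MMAE contains 2,000 high-fidelity samples from real-world scenarios and 17,741 fine-grained evaluation items, covering 7 modality settings across sound, music, speech, and their mixtures, 6 levels of task complexity (from basic modifications to multi-hop reasoning and multi-round editing), and 8 operation types (local to global). The paper, code, dataset, and demo are public.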

Rohan Paul @rohanpaul_ai · June 8 · 60

Great Stanford + MIT + Harvard + Anthropic paper. It gives a clear training-based reason for why larger models learn abilities smaller models miss. It says bigger AI models learn rare skills because they forget them less during training: their extra space protects weak learning signals. The authors say the issue is not just whether a small model could represent the task, but whether training lets it keep that task while many common tasks keep pushing on the same limited parts. Their core idea is that common tasks take up the model’s neurons first, so rare tasks get overwritten before they appear often enough to build into stable knowledge. In a crowded data mixture, common patterns get first claim on the model’s internal machinery. Small models may briefly pick up a rare signal, but the next wave of common-task updates overwrites it before the signal appears again. They tested this first with controlled toy tasks where they could change how rare and complex each task was, then with OLMo language models from 4M to 4B parameters. The main result is that bigger models learned low-frequency tasks much better, kept more task features inside their representations, and showed less gradient interference, which means common-task updates disturbed rare-task learning less. Larger models can remember weak rare signals long enough to turn them into real learned skills. ---- Link – arxiv.org/abs/2605.29548 Title: "Why Larger Models Learn More: Effects of Capacity, Interference, and Rare-Task Retention"

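Summary: The paper argues that larger models learn rare skills because they forget less during training; their extra capacity protects weak learning signals. The core mechanism: common tasks claim neurons first, and rare tasks are overwritten before they appear often enough to form stable knowledge. A small model may briefly pick up a rare signal, but the next wave of common-task updates overwrites it. Experiments with OLMo language models (4M to 4B parameters) confirm this: larger models do better on low-frequency tasks, retain more task features, and show less gradient interference from common-task updates on rare tasks. The authors stress that the question is not only whether a small model can represent a task, but whether the rare task can persist through training under repeated pressure from many common tasks.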

Rohan Paul @rohanpaul_ai · June 8 · 56

Strong AI agents still struggle with long research work because they often fail to keep testing and improving. A new paper from Stanford, MIT, NVIDIA, Google, and other top labs shows that today’s strongest research agents win less by brilliance than by refusing to stop testing. The paper proposes AutoLab, a benchmark with 36 tasks where each agent starts from working but weak code and must make it better within a fixed time limit. The tasks cover system speedups, puzzles, model development, and CUDA kernel work, so the test is not just about writing code once but about managing a long work session. The authors tested 17 strong models and found that the best results did not mainly come from the first idea being good, but from the model staying active, testing often, and using feedback well. The best first idea was not the strongest predictor of success; persistence was. Claude Opus 4.6 led the benchmark not because it always guessed the right move immediately, but because it kept benchmarking and folding empirical feedback into the next attempt. Several other frontier models failed in a more revealing way: they either quit early with time left on the clock, or thought so long that they ran out of time before submitting anything useful. ---- Link – arxiv.org/abs/2606.05080 Title: "AutoLab: Can Frontier Models Solve Long-Horizon Auto Research and Engineering Tasks?"

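Summary: Stanford, MIT, NVIDIA, Google, and other top labs jointly propose AutoLab, a new benchmark with 36 tasks. In each task the agent starts from working but weak code and must improve it iteratively within a fixed time limit. Tasks cover system speedups, puzzles, model development, and CUDA kernels. Results for 17 frontier models show that success depends less on how good the first attempt is than on whether the model keeps testing, experiments often, and uses empirical feedback. Claude Opus 4.6 led the benchmark through persistent iteration rather than initial judgment, while other frontier models either quit early or thought so long that they ran out of time.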

meng shao @shao__meng · June 8 · 64

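Is AGENTS.md actually useful for coding agents? This paper is a large-scale empirical study of how repository-level context files (AGENTS.md, CLAUDE.md, etc.) affect the real performance of coding agents, and the results may be somewhat counterintuitive! Thanks to @rasbt for sharing! Paper: https://arxiv.org/abs/2602.11988
Background: practice ran ahead, evidence lagged
AGENTS.md has become an industry convention: more than 60,000 GitHub repositories have adopted it, and agents such as Claude Code (CLAUDE.md), Codex, and Qwen Code all ship a built-in /init that generates one automatically. Earlier studies, however, mostly stopped at content classification and descriptive statistics and lacked a rigorous evaluation of task completion rates. The core difficulty: the mainstream benchmark SWE-bench draws from well-known repositories such as Django and Flask, which never had developer-written context files, so the real value of the practice cannot be evaluated directly on it.
Experimental design: two benchmarks, three conditions, four agents
· Benchmarks: SWE-bench Lite (300 tasks, 11 popular Python repositories) + the newly built AGENTBENCH (138 tasks, 12 less-known repositories that already contain developer-written context files)
· Three conditions: ① no context file ② LLM-generated (each agent's official /init flow) ③ developer-written (AGENTBENCH only)
· Agents/models: Claude Code + Sonnet 4.5, Codex + GPT-5.2 / GPT-5.1 mini, Qwen Code + Qwen3-30B
· Metrics: task success rate, number of steps, inference cost, tool-call traces
Core findings: small effect, significant cost
1. Success rate: marginal, sometimes negative
· LLM-generated: success drops in 5 of 8 settings, averaging -0.5% (SWE-bench) / -2% (AGENTBENCH)
· Developer-written: +4% on average, better than LLM-generated, but on Claude Code even worse than no file
· The conclusion is robust across models and prompts
In one sentence: auto-generated context files not only fail to help but may slightly hurt; hand-written ones help only a little.
2. Efficiency: no file is actually the cheapest (steps, cost)
· LLM-generated: +2.45 / +3.92 steps, +20% / +23% cost
· Developer-written: +3.34 steps, up to +19% cost
3. Codebase overviews are almost useless
Context files are often recommended as a way to "help the agent locate code quickly". Measurements show no significant difference, with or without a context file, in the number of steps before the agent first touches a relevant file. 95–100% of LLM-generated files include a codebase overview, yet it helps navigation very little.
Trace analysis: agents obey, but obedience is expensive
The paper rules out the hypothesis that agents ignore context files. Trace analysis shows:
· High instruction compliance: when the context file mentions uv, its usage rises from <0.01 to 1.6 calls per task; when it mentions repository-specific tools, usage rises from <0.05 to 2.5
· More "diligent" behavior: more testing, more file search/reading, more lint/quality checks
· Deeper reasoning: GPT-5.2 reasoning tokens increase by 14–22%
Mechanism chain: the context file adds extra requirements → the agent complies more strictly (tests, exploration, dedicated tools) → steps and cost rise → success rate does not rise with them (or gets worse)
Context files are not ignored; they are over-executed. The agent treats a "suggested workflow" as a "mandatory checklist", which adds task complexity without buying a higher success rate.
A key reversal: the documentation redundancy hypothesis
When all other documentation (.md files, docs/, example code) is removed from the repository, LLM-generated context files instead bring a +2.7% gain and beat developer-written ones. This suggests:
· In well-documented repositories, the context file is highly redundant with the README and docs
· Developers' anecdotes that "the agent got stronger after adding AGENTS.md" are probably explained by target repositories that lacked documentation, where the context file filled an information vacuum
· For well-documented, well-known projects such as Django, the value of extra context is diluted
Ablations: the ceiling on generation quality
· A stronger generating model ≠ better context: files generated by GPT-5.2 are slightly better on SWE-bench (+2%) but worse on AGENTBENCH (-3%)
· No prompt is consistently better: the Codex prompt vs the Claude prompt varies by dataset, with small differences
For now, the room to improve auto-generated context files looks limited.
Practical recommendations
· Relying on /init auto-generation: be cautious; it slightly lowers the success rate on average and adds 20%+ cost
· Long architecture overviews, directory listings: avoid; they are redundant with code exploration and do not speed up locating code
· Test/lint/build commands: write them sparingly; the agent will execute them strictly, and too many requirements push up cost
· Repository-specific tools (uv, pdm, etc.): worth writing; compliance is high and they are hard to infer from the code
· Layered / on-demand references: the right direction; "read Y.md when doing X, otherwise ignore it" reduces irrelevant load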

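Summary (translated): The paper is a large-scale empirical test of how repository-level context files such as AGENTS.md affect coding agents. It evaluates combinations of Claude Code, Codex, and Qwen Code on SWE-bench Lite (300 tasks) and the newly built AGENTBENCH (138 tasks). Core findings: LLM-generated context files lower success in 5 of 8 settings, averaging -0.5% (SWE-bench) / -2% (AGENTBENCH), while adding 20%+ to cost; developer-written files gain only +4% on average. Redundancy hypothesis: once other documentation is removed, auto-generated files instead gain +2.7%. Recommendations: avoid auto-generation, keep test/lint commands lean, and prioritize documenting repository-specific tools.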

Rohan Paul@rohanpaul_ai · 6月8日66

New MIT study. Code volume surges by 300%, but output increases by only 30%: the AI dividend meets an awkward reality. Autonomous AI coding agents raised commits by 180%, but releases rose only 30%. The paper's main idea is that software production has weak links, so faster code writing helps less when humans still need to review, connect, test, package, and ship the work. The authors compare more than 100,000 GitHub developers before and after they start using 3 generations of AI coding tools, from autocomplete to more independent coding agents. Autocomplete raised commits by 40%, interactive coding agents raised them by 140%, and autonomous coding agents raised them by 180%. The 180% commit gain shrank to 50% for the number of projects and 30% for actual releases. The authors also check app marketplaces and find more new apps but no increase in total usage, which means more software appeared without clear evidence that users adopted more of it. The estimated "elasticity of substitution" is 0.25, i.e. even a big improvement in AI's usefulness replaces only a small amount of human work. The reason is that AI can write code faster, but humans are still needed to decide what to build, check that the code works, connect it with the rest of the product, fix messy edge cases, and actually ship it. --- papers.ssrn.com/sol3/papers.cfm?abstract_id=6859839

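Summary (translated): A new MIT study tracks the production funnel of more than 100,000 GitHub developers across three generations of AI coding tools (autocomplete, interactive agents, autonomous agents). Autonomous AI agents raised commits by 180%, but actual releases rose only 30%. Code volume surged by nearly 300%, the gain fell to 150% after human review, and final releases rose only about 30%. The study estimates the elasticity of substitution at 0.25, meaning that even large gains in AI capability replace only a small amount of human work. App marketplaces likewise show more new apps but no rise in total usage. The bottleneck is that humans still handle review, testing, packaging, and release, so the local tasks that AI speeds up do not translate into equal growth in output.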
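To make the "weak links" intuition concrete, here is a minimal sketch of what an elasticity of substitution of 0.25 implies. It assumes a standard two-input CES production function with equal shares; that functional form, the share value, and the names `ces_output`, `coding`, and `human` are illustrative assumptions, not the paper's actual specification.

```python
def ces_output(coding: float, human: float,
               sigma: float = 0.25, share: float = 0.5) -> float:
    """Two-input CES production function (illustrative, not the paper's model)."""
    rho = (sigma - 1.0) / sigma  # sigma = 0.25 gives rho = -3
    return (share * coding**rho + (1.0 - share) * human**rho) ** (1.0 / rho)


# Normalise both inputs to 1.0, then scale only the AI-accelerated coding input.
baseline = ces_output(coding=1.0, human=1.0)
boosted = ces_output(coding=2.8, human=1.0)   # +180% on coding, human work unchanged
ceiling = ces_output(coding=1e6, human=1.0)   # coding made almost free

print(f"baseline output:                   {baseline:.3f}")
print(f"output with +180% coding:          {boosted:.3f}")
print(f"output with near-unlimited coding: {ceiling:.3f}")
```

Under these assumptions the output gain stays far below the input gain, and with equal shares it can never exceed 2^(1/3) - 1, about 26%, however fast the coding step becomes. The human-side input is the binding constraint, which is the qualitative pattern the post describes.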

AYi@AYi_AInotes · 6月8日62

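Google researchers have found a technique that sharply compresses AI memory, making it easier to run large models locally on your own data. Specifically, vector storage for 10 million documents can be compressed from 31GB of memory to just 4GB, with search that is also faster than FAISS, the most commonly used option today.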

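Summary (translated): Google proposes an AI memory compression technique that shrinks vector storage for 10 million documents from 31GB of memory to only 4GB, with search faster than FAISS, currently the most commonly used method. The technique makes it more feasible to run large language models locally in combination with personal data.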
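A back-of-the-envelope check helps put the 31GB and 4GB figures in perspective. The sketch below assumes 768-dimensional float32 embeddings and decimal gigabytes; the post states neither the embedding width nor the compression method, so `ASSUMED_DIM` and the resulting per-dimension budget are assumptions for illustration only.

```python
NUM_DOCS = 10_000_000   # from the post
FULL_INDEX_GB = 31      # from the post
COMPRESSED_GB = 4       # from the post
ASSUMED_DIM = 768       # assumption: the post does not state the embedding width
FLOAT32_BYTES = 4


def to_gb(num_bytes: float) -> float:
    """Convert bytes to decimal gigabytes."""
    return num_bytes / 1e9


# Size of an uncompressed float32 index under the assumed dimension.
uncompressed_gb = to_gb(NUM_DOCS * ASSUMED_DIM * FLOAT32_BYTES)

# Storage budget implied by the compressed figure.
bytes_per_doc = COMPRESSED_GB * 1e9 / NUM_DOCS
bits_per_dim = bytes_per_doc * 8 / ASSUMED_DIM
compression_ratio = FULL_INDEX_GB / COMPRESSED_GB

print(f"uncompressed index at {ASSUMED_DIM} dims: {uncompressed_gb:.2f} GB")
print(f"compressed budget: {bytes_per_doc:.0f} bytes per document")
print(f"about {bits_per_dim:.1f} bits per dimension, {compression_ratio:.2f}x smaller")
```

If the assumed dimension is roughly right, the uncompressed size lands close to the quoted 31GB, and the 4GB figure leaves only a few hundred bytes per document, which is in the range of aggressive vector quantization rather than lossless storage.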

Rohan Paul@rohanpaul_ai · 6月8日49

This paper tests whether today's AI agents can build better AI agents without human design help, i.e. whether an AI can act more like an AI engineer. That means it must invent a strategy, write the agent code, test it, learn from failures, and improve the system without a human guiding every choice. It shows they are still weak at reliably building the systems that do tasks. The benchmark, called Meta-Agent Challenge, gives an AI coding agent a safe workspace, a scoring API, limited time, and limited model calls, then asks it to create another agent that performs well on hidden test tasks. They tested this across 5 areas: math, science questions, competitive programming, software bug fixing, and long terminal tasks. The main result is that current agents usually do not beat strong human-made agent setups, and the few good results mostly come from closed frontier models like Claude. Complete autonomy is not just tool use. It is budget awareness, failure recovery, restraint under pressure, and the discipline to change designs instead of polishing a bad one. Overall, Meta-Agent Challenge (MAC) suggests that today's agents are not yet self-improving engineers. They are powerful executors with flashes of design judgment, still missing the boring reliability that makes engineering real. ---- Link – arxiv.org/abs/2606.04455 Title: "The Meta-Agent Challenge: Are Current Agents Capable of Autonomous Agent Development?"

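Summary (translated): A new study proposes the Meta-Agent Challenge (MAC) benchmark, which tests whether AI agents can autonomously build better agents without human design help. Inside a safe workspace, the agent must invent its own strategy, write code, test it, and learn from failures. Experiments cover 5 domains: math, science question answering, competitive programming, code repair, and long terminal tasks. Results show that current agents mostly cannot beat strong human-designed agent systems, and only a few closed frontier models such as Claude perform well. The study concludes that today's agents are closer to powerful executors than to engineers with reliable self-improvement ability.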

Rohan Paul@rohanpaul_ai · 6月8日49

A primer paper about how reasoning models improve after training. It shows that better reasoning models depend less on raw data size and more on checkable training evidence. Reasoning data is NOT simple question-and-answer pairs. The useful part is often the feedback that says why an answer, step, tool action, or full attempt was good or bad. A prompt and a response tell you what a model said, but not why that answer became learnable, which judge blessed it, which failures were hidden, or whether the skill was already inside the base model. The core idea is to describe each training example as a record that includes the task, the model's behavior, the checking signal, and metadata about where it came from. The authors sort reasoning data by how it can be checked, such as exact rule-based checks for math and code, environment checks for agents using tools, and human or model judgments when no exact checker exists. They also explain why common assumptions fail: long reasoning traces may be fake, harder examples may be useless for some models, and larger datasets may still miss important coverage. The key point is that agent data should preserve mess: failed actions, retries, recoveries, state differences, and terminal checks, because that is where learning signal often lives. ---- Link – arxiv.org/abs/2606.02113 Title: "A Primer in Post-Training Reasoning Data: What They Know About How It Works"

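Summary (translated): The paper argues that better reasoning models depend more on verifiable training evidence than on raw data scale. The key to reasoning data is not simple question-answer pairs but the feedback signal that judges whether an answer, step, tool action, or full attempt was good or bad. Each training sample should be described as a record containing the task, the model's behavior, the checking signal, and metadata. The authors classify data by how it is checked: exact rules for math and code, environment checks for tool-using agents, and human or model judgment when no exact checker exists. Common misconceptions include: long reasoning chains may be fake, harder examples are useless for some models, and larger datasets may still miss key coverage. Agent data should keep its "messy" information, such as failed actions, retries, recoveries, state differences, and terminal checks, because that is often where the learning signal lives.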
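The "training example as a record" idea can be sketched as a small data structure. This is a hypothetical schema written for illustration: the class names, field names, and the `coverage` helper are assumptions and do not come from the paper.

```python
from collections import Counter
from dataclasses import dataclass, field
from enum import Enum


class CheckKind(Enum):
    RULE = "rule"                # exact checks, e.g. math answers or unit tests
    ENVIRONMENT = "environment"  # state of a tool-using agent's environment
    JUDGMENT = "judgment"        # human or model judge when no exact checker exists


@dataclass
class ReasoningRecord:
    task: str                    # what the model was asked to do
    behavior: list[str]          # steps taken, including failures and retries
    check_kind: CheckKind        # how the attempt was verified
    passed: bool                 # outcome of the check
    metadata: dict[str, str] = field(default_factory=dict)  # provenance


def coverage(records: list[ReasoningRecord]) -> Counter:
    """Count records per (check kind, outcome) so failures stay visible."""
    return Counter((r.check_kind.value, r.passed) for r in records)


# Hypothetical records, including a failed agent attempt kept on purpose.
records = [
    ReasoningRecord("2 + 2 = ?", ["answer: 4"], CheckKind.RULE, True,
                    {"source": "hypothetical"}),
    ReasoningRecord("fix failing test", ["edit file", "run tests", "retry"],
                    CheckKind.ENVIRONMENT, False, {"source": "hypothetical"}),
]
print(coverage(records))
```

Keeping the failed attempt as a first-class record, instead of filtering it out, is one way to preserve the "mess" that the post says carries learning signal.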

Rohan Paul@rohanpaul_ai · 6月7日62

Great idea for self-evolving AI scientists from this new MIT paper. It tries to make an AI scientist notice when its current way of thinking is too small, then add new scientific concepts instead of merely searching harder. The problem is that most AI science systems still search inside a fixed setup, even when real science sometimes needs new kinds of variables, tools, tests, or claims. The paper's core idea is to make every data point, model, tool output, failure, and claim a typed artifact, where typed means the system records what kind of thing it is and how it was produced. Then the system can tell the difference between retrieval, which adds known things, search, which explores a fixed setup, and discovery, which changes the setup itself. So novelty for AI scientists is not defined by surprise, fluency, or benchmark gain, but by what could not be expressed inside the previous schema. A serious attempt to formalize something most AI systems still fake: the difference between finding an answer inside a language and earning the right to change the language. ---- arxiv.org/abs/2606.01444 Title: "Self-Revising Discovery Systems for Science: A Categorical Framework for Agentic AI"

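Summary (translated): The MIT paper (F.Y. Wang & M.J. Buehler, arXiv:2606.01444, 2026) proposes the Self-Revising Discovery Systems framework, which lets an AI scientist recognize on its own when its current way of thinking falls short and add new scientific concepts rather than just search harder. The system treats data, models, tool outputs, failures, and claims as typed artifacts (typed provenance), which lets it distinguish three modes: retrieval (adding known objects), search (exploring a fixed schema), and discovery (verifiable schema change). Using the Kan obstruction and the left Kan extension, the paper gives a mathematical definition of genuine novelty, quantified by the pointwise residual left after old evidence is transported, so that novelty can be measured objectively. Case studies include a Builder/Breaker model that discovers mode-conditional compliance in proteins and CategoryScienceClaw, which discovers a stiffness rule for anisotropic fiber networks.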
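The retrieval / search / discovery distinction can be illustrated with a deliberately crude toy. The sketch below only checks whether an artifact's kind already exists in the current schema; it does not implement the paper's categorical machinery (copresheaves, Kan extensions, residuals), and all names in it are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Artifact:
    kind: str         # what kind of thing it is (variable, tool output, claim, ...)
    produced_by: str  # how it was produced (provenance)
    content: str


def classify_step(artifact: Artifact,
                  schema_kinds: set[str],
                  known_contents: set[str]) -> str:
    """Toy classifier: does this artifact fit the existing schema?"""
    if artifact.kind not in schema_kinds:
        return "discovery"   # cannot be expressed in the previous schema
    if artifact.content in known_contents:
        return "retrieval"   # adds something already known
    return "search"          # new result inside a fixed setup


# Hypothetical schema and knowledge base.
schema = {"measurement", "model"}
known = {"stiffness = 3.2"}

steps = [
    Artifact("measurement", "lookup", "stiffness = 3.2"),
    Artifact("measurement", "simulation", "stiffness = 4.1"),
    Artifact("anisotropy_index", "new verifier", "index = 0.7"),
]
labels = [classify_step(a, schema, known) for a in steps]
print(labels)
```

The design choice worth noting is that novelty is decided by the schema check rather than by how surprising the content looks, which mirrors the post's point that discovery means expressing something the previous schema could not.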

elvis@omarsar0 · 6月6日65

// Continual Learning Bench // Continual learning is a research area attracting a lot of investment. Despite many efforts, there has been very little progress in measuring it. So the big question is, do dedicated memory systems actually make agents learn from experience? Continual Learning Bench says not yet. Across six expert-validated domains with shared learnable structure, naive in-context learning outperforms systems purpose-built for memory management. CL-Bench introduces a gain metric that isolates genuine learning from prior capability, then shows agents frequently overfit to immediate observations or fail to reuse knowledge across instances. If a plain ICL baseline beats your memory architecture, the architecture is adding overhead rather than learning. Paper: https://arxiv.org/abs/2606.05661 Learn to build effective AI agents in our academy: https://academy.dair.ai/

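Summary (translated): Continual learning attracts heavy investment but progress is slow. CL-Bench (Continual Learning Bench) tests six expert-validated domains with shared learnable structure and finds that a simple in-context learning (ICL) baseline outperforms systems built specifically for memory management. The benchmark introduces a gain metric to isolate genuine learning, and the results show agents often overfit to immediate observations or fail to reuse knowledge across instances. The study notes that if a plain ICL baseline beats your memory architecture, the architecture is adding overhead rather than learning. Paper: arxiv.org/abs/2606.05661.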

SemiAnalysis@SemiAnalysis_ · 6月6日61

Sequential Monte Carlo speculative decoding from @makora_ai keeps multiple draft tokens alive in parallel instead of rewinding failed matches.

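Summary (translated): Sequential Monte Carlo speculative decoding from @makora_ai keeps multiple draft tokens alive in parallel instead of rolling back failed matches.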

Chubby♨️@kimmonismus · 6月6日65

AI scientists may be moving from search to real discovery. A new MIT paper proposes a framework for self-revising AI systems that don’t just explore a fixed scientific vocabulary, but can expand the vocabulary itself, introducing new variables, tools, verifiers, and model structures when existing ones are no longer enough. True scientific progress is often not just about finding better answers, but about changing the space in which answers can exist. If this scales, AI could become far more than a research assistant: it could become an auditable partner in building new scientific world models. Still early, but conceptually very exciting.

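Summary (translated): MIT's Buehler group proposes the Self-Revising Discovery Systems framework, which lets AI extend the scientific vocabulary itself (variables, tools, verifiers, model structures) rather than only search a fixed space. The paper formalizes agentic workflows with typed copresheaves and the Kan obstruction, showing that genuine discovery is a verifiable schema extension: old evidence is migrated through the left Kan extension, and novelty is quantified objectively by the pointwise residual, which separates discovery from search. Three modes: retrieval (adding known objects), search (fixed schema), and discovery (verified paradigm shift). Case studies include Builder/Breaker discovering mode-conditional compliance in proteins and CategoryScienceClaw discovering a stiffness rule for anisotropic fiber networks. Paper: arXiv:2606.01444 (2026).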

Emad@EMostaque · 6月6日33

If Claude is good enough for Nobel Prize winners it is good enough for you https://arxiv.org/abs/2606.03300

Summary (translated): If Claude is good enough for Nobel Prize winners, it is good enough for you too. https://arxiv.org/abs/2606.03300

Rohan Paul@rohanpaul_ai · 6月6日79

Anthropic's new chemistry report has a genuinely wild result. Claude Opus 4.7 is now competitive with dedicated NMR software, and the bigger story is that it can work the problem backwards, i.e. infer the molecule from the spectrum. NMR software is the chemist's expert tool for turning molecular structures into predicted lab spectra. So Opus 4.7 is no longer just "helping chemists read data": it can work backward from NMR data and propose the molecule's structure, a task the report says existing mainstream tools generally leave to human chemists. Note that Opus 4.7 is a general-purpose model with no chemistry-specific fine-tuning. Claude Opus 4.7 made the smallest hydrogen prediction errors and nearly matched MestReNova on carbon, meaning it can predict NMR signals about as well as specialist chemistry tools. So AI can now handle one of chemistry's hidden bottlenecks: translating between a molecule, its spectral shadow, and the structure a chemist actually needs to trust.

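Summary (translated): Anthropic's latest chemistry report shows that the general-purpose model Claude Opus 4.7 (no chemistry fine-tuning) matches or even beats the dedicated software MestReNova on NMR (nuclear magnetic resonance) spectrum analysis, with the smallest hydrogen prediction errors and nearly identical carbon predictions. More importantly, it can work backward from an NMR spectrum to the molecular structure, a task previously left to human chemists. This means AI can now handle a key bottleneck in chemistry: translating automatically between molecular structure, spectrum, and final confirmation.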

Microsoft Research@MSFTResearch · 6月6日60

During the Inside Azure Innovations breakout at Build 2026, Microsoft Azure CTO, deputy CISO and technical fellow Mark Russinovich introduced Project Mosaic, an experimental optical interconnect technology from Microsoft Research Cambridge using micro-LEDs for low-power, high-speed data transmission. A live demo led by senior researcher Kaoutar Benyahya displays individual LED modulation forming letters, proving the concept’s real-time responsiveness. Check out Mark and Kaoutar starting @ 38:38: https://msft.it/6015vdhS9

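Summary (translated): At Build 2026, Microsoft Azure CTO Mark Russinovich introduced Project Mosaic, an experimental optical interconnect technology from Microsoft Research Cambridge that uses micro-LEDs for low-power, high-speed data transmission. Senior researcher Kaoutar Benyahya gave a live demo in which individually modulated LEDs formed letters, showing that the concept responds in real time.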

Chubby♨️@kimmonismus · 6月6日72

We are in for a wild ride, and this is just the beginning: a "world-first" vaccine designed by artificial intelligence. Researchers at the University of Cambridge have trialled what they describe as the world's first AI-designed vaccine component in humans. The vaccine uses an AI-designed "super-antigen" intended to train the immune system against a broad family of coronaviruses, including existing Covid variants and animal coronaviruses that could potentially cause future pandemics. Instead of designing a vaccine around one current virus strain, researchers fed AI genetic data from many known coronaviruses. The AI then designed an antigen meant to trigger immune protection across the whole virus family, even if the virus mutates or jumps from animals to humans. The first human trial involved 39 people and mainly tested safety. The immune response was described as modest, but the result is still seen as promising because it shows that an AI-designed vaccine antigen can be tested in humans. A larger study with around 200 people will now examine how well the vaccine actually trains the immune system.

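Summary (translated): Researchers at the University of Cambridge ran what is described as the world's first human trial of an AI-designed vaccine component. The vaccine uses an AI-designed "super-antigen" intended to train the immune system against a broad family of coronaviruses, including existing Covid variants and animal coronaviruses that could cause future pandemics. The first human trial involved only 39 people and mainly tested safety. The immune response was modest but is seen as promising because it shows that an AI-designed vaccine antigen can be tested in humans. The next step is a larger study of about 200 people.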

Anthropic@AnthropicAI · 6月6日73

New Anthropic Science Blog: Making Claude a chemist. To manipulate a molecule, chemists first need to understand its structure. Their main tool is NMR spectroscopy. We found Opus 4.7 matches—and on some tasks beats—dedicated NMR software. Read more: https://www.anthropic.com/research/making-claude-a-chemist

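Summary (translated): New Anthropic Science Blog: Making Claude a chemist. To manipulate a molecule, chemists first need to understand its structure. Their main tool is NMR spectroscopy. We found that Opus 4.7 matches, and on some tasks beats, dedicated NMR software. Read more: https://www.anthropic.com/research/making-claude-a-chemist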

Jim Fan@DrJimFan · 6月6日71

NitroGen just won CVPR Best Paper Honorable Mention!! We are making strides towards general-purpose embodied agents that master not only the real world physics, but also all possible physics across a multiverse of simulations. It’s been 4 years since MineDojo, our first embodied agent in Minecraft, won NeurIPS Best Paper. Congrats to everyone on the team!!

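Summary (translated): NitroGen just won a CVPR Best Paper Honorable Mention!! We are making strides toward general-purpose embodied agents that master not only real-world physics but all possible physics across a multiverse of simulations. It has been 4 years since MineDojo, our first embodied agent in Minecraft, won the NeurIPS Best Paper award. Congratulations to everyone on the team!!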