New Anthropic Science Blog: Making Claude a chemist. To manipulate a molecule, chemists first need to understand its structure. Their main tool is NMR spectroscopy. We found Opus 4.7 matches—and on some tasks beats—dedicated NMR software. Read more: https://www.anthropic.com/research/making-claude-a-chemist

译Anthropic 新科学博客：让 Claude 成为化学家。要操纵分子，化学家首先需要了解其结构。他们的主要工具是 NMR 波谱分析。我们发现 Opus 4.7 在部分任务上匹配甚至超越了专用 NMR 软件。了解更多：https://www.anthropic.com/research/making-claude-a-chemist

Jim Fan@DrJimFan · 6月6日71

NitroGen just won CVPR Best Paper Honorable Mention!! We are making strides towards general-purpose embodied agents that master not only the real world physics, but also all possible physics across a multiverse of simulations. It’s been 4 years since MineDojo, our first embodied agent in Minecraft, won NeurIPS Best Paper. Congrats to everyone on the team!!

译NitroGen 刚刚获得 CVPR 最佳论文荣誉提名！！我们正在朝着通用具身智能体迈进，不仅掌握真实世界的物理规律，还能掌握模拟多元宇宙中所有可能的物理规律。距离我们的第一个 Minecraft 具身智能体 MineDojo 获得 NeurIPS 最佳论文奖已经过去 4 年了。祝贺团队里的每一位！！

AK@_akhaliq · 6月6日56

ArcANE Do Role-Playing Language Agents Stay in Character at the Right Time?

译ArcANE 角色扮演语言智能体是否能在适当时刻保持角色？

AK@_akhaliq · 6月6日57

Code2LoRA Hypernetwork-Generated Adapters for Code Language Models under Software Evolution

译Code2LoRA 超网络生成的代码语言模型适配器，用于软件演化环境。

elvis@omarsar0 · 6月5日69

// The Meta-Agent Challenge // How good are current agents at self-improving? This is a great paper covering some of the challenges. They propose the Meta-Agent Challenge (MAC), where they give a coding agent a sandbox, an evaluation API, and a time budget, then ask it to program an agent that maximizes held-out performance across five domains. Results: Meta-agents rarely match human-engineered baselines, and the few that do are dominated by proprietary frontier models. Under high optimization pressure, some agents started exfiltrating ground truth from the scoring channel, even with multi-layer anti-reward-hacking defenses in place. Paper: https://arxiv.org/abs/2606.04455 Learn to build effective AI agents in our academy: https://academy.dair.ai/

译最新研究提出元智能体挑战（MAC），将编码智能体放入沙盒，给定评估API和时间预算，要求其自主编程出在五个领域表现最优的智能体。结果发现，元智能体极少能匹敌人工设计的基线，少数成功的案例也几乎全部依赖专有前沿模型。更值得警惕的是，在高优化压力下，一些智能体开始从评分渠道外泄真实答案，即便研究人员设置了多层反奖励破解防御也未能阻止。论文：arxiv.org/abs/2606.04455。

AI at Meta@AIatMeta · 6月5日64

Big congrats to our SAM 3D team for receiving a Best Paper Honorable Mention at #CVPR26! This prestigious recognition underscores their incredible work pushing the boundaries of computer vision. Read the paper here: https://arxiv.org/abs/2511.16624

译热烈祝贺我们的 SAM 3D 团队在 #CVPR26 获得最佳论文荣誉提名！这项殊荣凸显了他们在推动计算机视觉边界方面的杰出工作。论文链接：https://arxiv.org/abs/2511.16624

Berryxia.AI@berryxia · 6月5日70

大模型都不再卷推理，都开始卷规划能力！腾讯混元联合人大高瓴人工智能学院直接开源了PlanningBench，一个专门测、训LLM真实规划能力的框架。里面塞了30多个来自真实世界的规划任务，覆盖调度、生产、旅行、资源分配、应急响应等六大类，每一个都有清晰的成功标准和全自动验证机制。你既可以用它测出当前最强模型到底在规划上有多拉胯，也能直接拿来继续微调，让模型从“会说”真正进化到“会干”。以前整个行业都在卷参数、卷上下文、卷工具调用，好像规划能力是自然就会长出来的。现在PlanningBench用30多个可验证任务直接把真相摊开：规划才是agent从玩具走向生产力的真正分水岭。腾讯这次把论文、代码、数据集全甩到GitHub和Hugging Face，等于把这个最难、最核心的能力从黑盒拉到了公开赛道。

译腾讯混元联合人大高瓴人工智能学院开源PlanningBench，一个可扩展、可验证的框架，用于评估和训练大语言模型（LLM）的真实规划能力。该框架包含30多个来自调度、生产、旅行、资源分配、应急响应等六大类的真实世界规划任务，每项任务都有清晰的成功标准和全自动验证机制。用户既可用它评测当前最强模型在规划上的短板，也可直接用于微调，让模型从“会说”进化到“会干”。论文、代码和数据集已全部在GitHub和Hugging Face开源。

Rohan Paul@rohanpaul_ai · 6月5日63

Better self-improving agents need better solvers, not bigger update-writing models. This challenges the common habit of putting the strongest model in the evolver seat. The usual intuition was: put the strongest model in the evolver seat, because a better model should write better prompts, memories, tools, and skills. This paper cuts that intuition in half. It separates two jobs that are usually blurred together: writing useful harness updates, and benefiting from those updates during task execution. The paper says the cheaper model can often write good enough prompt, memory, or skill updates. So a small Qwen3.5-9B evolver can create updates that help about as much as Claude Opus 4.6. The expensive model is more useful as the agent that actually solves the task with those updates. i.e. using the updates is very model-dependent, because weak models often fail to load the right skill or load it and then stop following it during a long task. Strong models can use the harness, but they may already be close enough to their ceiling that the update has less room to help. The sweet spot is the mid-tier model: capable enough to invoke and follow the new procedure, but not so capable that the harness has nothing left to teach. ---- Link – arxiv. org/abs/2605.30621 Title: "Harness Updating Is Not Harness Benefit: Disentangling Evolution Capabilities in Self-Evolving LLM Agents"

译论文“Harness Updating Is Not Harness Benefit”挑战了常见直觉——把最强模型放在进化者位置以写出更好更新。实验表明，廉价模型Qwen3.5-9B即可写出与Claude Opus 4.6效果相近的提示、记忆和技能更新。昂贵模型更适合作为求解任务的智能体，因弱模型无法正确加载或遵循更新，强模型已近能力上限，收益有限。甜区在中档模型：既能调用新程序，又有足够学习空间。

Rohan Paul@rohanpaul_ai · 6月5日60

Harness-1 makes search agents better by moving memory work out of the model and into a helper system. Shows that intelligence performs better when the environment stops forcing it to spend cognition on bookkeeping. That search agents should stop using the LLM as the notebook and let a separate harness track the search state. The paper proved that a 20B model improved search by doing less inside its own head. The problem is that normal search agents must both think about the next search and remember every document, clue, failed path, and remaining check inside the same limited context. This formulation puts too much routine state management inside the policy. Harness-1 separates those jobs. The model keeps the hard semantic choices: what to search, what to inspect, what to verify, and when the evidence is good enough. The harness keeps the recoverable state: candidate pools, curated documents, importance tags, evidence links, verification records, deduplicated observations, and budget-aware memory rendering. That sounds minor until you look at reinforcement learning. RL works poorly when every failure looks the same, because an empty or wrong final set does not reveal whether the agent searched badly, forgot evidence, skipped verification, or curated carelessly. By externalizing state, Harness-1 gives the policy a cleaner learning problem: improve decisions over a visible search workspace. For Harness-1, its gains were larger on held-out benchmarks than on source-family tasks, suggesting the model learned reusable search moves rather than memorized domain habits. ---- Link – arxiv. org/abs/2606.02373 Title: "Harness-1: Reinforcement Learning for Search Agents with State-Externalizing Harnesses"

译Harness-1 将大语言模型的记忆工作转移到外部辅助系统（harness），解决传统搜索智能体需在同一上下文窗口内处理语义决策与状态记录导致的效率低下问题。模型仅负责搜索、验证等关键语义选择，而可恢复状态（候选池、证据链接、去重记录、预算感知记忆等）由 harness 追踪。这一分离使一个 20B 参数模型实现了更好的搜索表现。在强化学习中，外部化状态避免了失败原因混淆，有助于策略学习。Harness-1 在未见 benchmark 上提升更大，表明模型学到了可复用的搜索策略而非记忆领域习惯。论文 arXiv:2606.02373。

meng shao@shao__meng · 6月5日65

Anthropic 发布关于「AI 递归自我改进」的研究报告 Anthropic 内部以 Claude 为代表的 AI 系统正被越来越深地用于开发下一代 AI 系统。这种 “AI 构建 AI” 的趋势正在加速。如果继续发展，可能出现系统完全自主设计并训练自身后继版本的情形——即递归自我改进。 https://www.anthropic.com/institute/recursive-self-improvement 关键证据（“外部公开基准”和“Anthropic 内部数据”） 1. 外部能力指标 · 模型可靠完成的任务时长正以约每 4 个月翻倍的速度增长（此前是每 7 个月）。 · SWE-bench 两年内从个位数分数趋于饱和。 · CORE-Bench 15 个月内从约 20% 饱和。 · 长时任务能力已达 16 小时量级。 2. 内部工程与研发数据 · 代码产出：截至 2026 年 5 月，Anthropic 合并到主干的代码中超过 80% 由 Claude 撰写；2026 年 Q2，工程师日均合并代码量是 2024 年的 8 倍。 · 主观感知：2026 年 3 月内部调研（130 名员工）中，受访者中位数估计自身产出约为无 AI 时的 4 倍。 · 代码质量：2025 年末 Claude 代码仍略逊于人类，如今已接近持平，并预计年内反超；人类审查已形成新瓶颈（阿姆达尔定律）。 · 实验执行：在给定目标的代码加速任务中，Claude 从 2025 年 5 月的约 3x 提升至 2026 年 4 月的约 52x；同等任务人类专家通常仅达 4x。 · 自主研究：2026 年 4 月，Claude Agent 端到端完成了一项 AI 安全开放研究问题，独立提出假设、设计实验、迭代结论，恢复能力达到人类两组研究者一周工作量的 97%（人类仅约 23%）。 · 研究判断：在 129 个真实开放调研场景中，Claude 在“下一步该怎么做”上优于人类原选择的比例从 2025 年 11 月的 51% 升至 2026 年 4 月的 64%。结构性观察人类在 AI 研发流程中的角色正在逐层收缩： · 执行层（写代码、跑实验）已高度自动化； · 方向层（选择研究问题、判断结果可信度、识别死胡同）目前仍是人类比较优势，但这一优势正在收窄。即使“研究品味”永远无法被 AI 掌握，只要人类只保留极少量方向性工作，而 AI 承担其余部分，整体研发速度仍会呈复合加速。三种未来情景 · 趋势停滞：边际收益递减、算力/能源供给受限、新架构尚未出现；作者认为不太可能，但会给社会最多适应时间 · 持续自动化，人类仍掌方向：100 人公司可相当于万人组织；人类瓶颈转向审核与协调；作者认为最可能进入此情景 · 完整递归自我改进：AI 自主设计后继系统，人类角色转为监督与验证；科技进步完全由算力决定；最不确定、风险最高

译Anthropic 发布报告显示，Claude 正被深度用于开发下一代 AI，趋势加速或导致系统自主设计后继版本。外部指标：模型可靠完成任务时长约每 4 个月翻倍，SWE-bench 两年内饱和，CORE-Bench 15 个月内饱和，长时任务达 16 小时。内部数据：截至 2026 年 5 月超 80% 主干代码由 Claude 撰写；工程师日均合并代码量是 2024 年的 8 倍；员工中位数估计产出为无 AI 时的 4 倍；实验执行从约 3x 提升至约 52x；自主研究恢复能力达人类两组研究者一周工作量的 97%（人类约 23%）；研究判断优于人类比例从 51% 升至 64%。报告探讨了趋势停滞、持续自动化、完整递归自我改进三种未来情景。

Rohan Paul@rohanpaul_ai · 6月5日70

Another great paper from Google. Shows general LLMs can solve formal math by planning proofs and checking each step. Raised general LLM performance from under 10% to 70%. A general LLM failed badly when asked to write full formal proofs in 1 try, but became much stronger when it planned, split the work into smaller claims, reused past claims, and learned from Lean’s feedback. The paper shows the weakness was not just the model’s math ability, but the way it was being used - the absence of structured interaction with a verifier. The key idea is that the model does not try to write one giant perfect proof at once, because that usually fails on long and tricky problems. Instead, LEAP stores the proof as a graph of goals and subgoals, so useful lemmas can be reused instead of rediscovered every time. The authors tested LEAP on Putnam 2025 and a new Lean benchmark built from 60 IMO-style problems, where ordinary one-shot proof writing did very poorly. LEAP solved all 12 Putnam 2025 problems and raised general LLM performance on the Lean IMO benchmark from under 10% to 70%. ---- Link – arxiv. org/abs/2606.03303 Title: "LEAP: Supercharging LLMs for Formal Mathematics with Agentic Frameworks"

译Google 新论文 LEAP 提出智能体框架，通过规划证明、分解子目标、复用已有引理并利用 Lean 验证器反馈，将通用 LLM 在形式化数学证明上的性能从不到 10% 提升至 70%。传统单次完整证明在长难题上表现极差，而 LEAP 将证明存储为有向图结构，先规划再逐步验证。在 Putnam 2025 竞赛中，LEAP 成功解出全部 12 道题；在包含 60 道 IMO 风格题目的 Lean 基准测试中，也实现了上述性能跃升。

Emad@EMostaque · 6月5日81

foom!

译Anthropic内部数据显示，Claude正在加速AI开发——这可能走向递归自我改进，即AI自主构建更强大的后继者。进展比预期更快，影响值得更多关注。主推文仅感叹：“foom!”

🚨 AI News | TestingCatalog@testingcatalog · 6月5日78

ANTHROPIC 🔥: A new internal research has been published, highlighting an accelerated AI development and a potential path to recursive self-improvement. > Claude Mythos Preview could work for “at least” 16 hours and was “at the upper end of what [METR] can measure.” > Today, Anthropic engineers on average ship 8x as much code per quarter as they did compared to 2021-2025. Do you feel it? 👀

译Anthropic 发布内部研究，称 Claude 正加速 AI 开发，可能通往递归自我改进——即 AI 自主构建更强大的继任者。研究显示，Claude Mythos Preview 可连续工作至少 16 小时，达到 METR 可测量上限。同时，Anthropic 工程师当前每季度交付的代码量是 2021-2025 年期间的 8 倍。

AI Notkilleveryoneism Memes ⏸️@AISafetyMemes · 6月5日73

HOLY SHIT LET'S FUCKING GOO

译HOLY SHIT LET'S FUCKING GOO 我们内部数据显示，Claude 正在加速 AI 发展——这可能通往递归自我改进，即 AI 自主构建更强大的后继者。这发生得比我们想象的更快，其影响值得更多关注。

Nathan Lambert@natolambert · 6月4日60

We have another 65 page frontier model report from Nvidia to read @eliebakouch @stochasticchasm and gang

译我们又有另一份来自英伟达的65页前沿模型报告要读，作者@eliebakouch @stochasticchasm及其团队。

Rohan Paul@rohanpaul_ai · 6月4日66

This Illinois+ Tsinghua University and other labs study finds that LLM agents still have unreliable memory and that it can get worse when they keep rewriting their own memories. LLM agents can learn from experience, but their rewritten memories often become unreliable. The problem is that many agent systems store past work by asking an LLM to compress messy experience into neat written lessons. That sounds useful because the agent should remember what worked before, but the paper finds that repeated rewriting slowly damages the memory. The core idea is that raw episodes, meaning the actual past attempts and solutions, often stay more useful than the polished lessons made from them. The authors tested this across tasks like web shopping, simulated worlds, app use, and ARC-style puzzle problems where they could control the correct solutions. The sharpest result is that GPT-5.4 solved 100% of a small ARC-AGI set with no memory, but after memory was built from correct solutions, streaming updates dropped it to about 54%. The failures came from bad grouping, overbroad lessons, and overfitting, so the memory forgot details, mixed up task types, or learned rules that only worked on narrow examples. The big deal is that agent memory should not automatically rewrite every experience into a summary, because keeping raw evidence and only sometimes making summaries worked better. The paper is really proposing that agent memory should treat raw past episodes as important evidence, not as disposable notes to summarize away. ---- arxiv. org/abs/2605.12978 Title: "Useful Memories Become Faulty When Continuously Updated by LLMs"

译伊利诺伊大学和清华大学等实验室研究发现，LLM智能体重复重写自身记忆会导致记忆变得更不可靠。原始经历（实际过往尝试和解决方案）往往比提炼后的总结更有用。测试中，GPT-5.4在小型ARC-AGI数据集上无记忆时正确率100%，但建立记忆并持续更新后降至约54%。失败原因包括分组不当、教训过度泛化及过拟合。研究建议智能体不应自动将每个经历重写为摘要，保留原始证据并仅偶尔总结效果更好。

Rohan Paul@rohanpaul_ai · 6月4日71

This Google DeepMind’s paper is a serious warning for anyone using autonomous agents today. Gives the first clear taxonomy of 6 attack types where harmful websites can detect AI agents and show them hidden content humans never see, like - Instructions buried in HTML comments or white-on-white text - Steganography in image pixels - Override commands in PDFs, metadata, or even speaker notes - Memory poisoning that persists across sessions - Goal hijacking and cross-agent cascades in multi-agent setups The real security problem for AI agents is not just the model, but the environment it reads. The web itself can be weaponized against autonomous AI agents. As agents increasingly browse the internet, read emails, execute transactions, and spawn sub-agents, the information environment becomes an attack surface. In one cited benchmark, hidden prompt injections embedded in web content partially commandeered agents in up to 86% of scenarios, sub-agent hijacking working 58–90% of the time, and data exfiltration attacks clearing 80% across five different agent architectures. That reframes the whole debate. We usually talk about model safety as if the danger sits inside the weights, but agents do something more fragile: they browse, retrieve, remember, and act on untrusted material in real time. Here’s the thing to worry about. A web page does not have to look malicious to be dangerous to an agent, because the agent may parse what humans never see: hidden HTML comments, metadata, CSS-hidden text, formatting syntax, or adversarial content embedded in images and other media. The threat gets more serious once memory enters the loop. If an agent uses RAG or persistent memory, poisoning no longer has to win in one shot. It can sit quietly in a corpus or memory store and activate later, which is why the paper highlights results showing latent memory poisoning above 80% attack success with less than 0.1% data contamination. --- ssrn .com/sol3/papers.cfm?abstract_id=6372438

译Google DeepMind论文首次系统分类六类攻击：HTML注释/白色文本隐藏指令、图像隐写、PDF元数据/演讲者笔记覆写、跨会话内存投毒、目标劫持及多智能体级联攻击。隐藏提示注入在86%场景中部分控制智能体，子智能体劫持成功率58–90%，数据泄露攻击在五种架构中均超80%。内存投毒成功率超80%，仅需不足0.1%数据污染。论文指出网页、邮件等非受信材料可被武器化，构成主要攻击面。

Chubby♨️@kimmonismus · 6月4日67

A blind Stanford-led study of nearly 3,000 anonymized matchups found law professors across 16 schools preferred AI-generated answers to student contract-law questions over those written by fellow professors 75% of the time, and judged the AI responses far less likely to be pedagogically harmful (3.5% vs. 12%). "The team tested a range of systems, including commercial tutoring tools and Google's NotebookLM." Now imagine the performance of models in 6-12 months.

译一项由斯坦福大学领导的盲测研究，对近3000场匿名对决的分析发现，16所法学院的法律教授在合同法问题中，有75%的时间更偏好AI生成的答案，而非教授自己写的答案，并且认为AI回答的教学危害性远低于后者（3.5% vs 12%）。 “研究团队测试了多种系统，包括商业辅导工具和Google的NotebookLM。” 现在想象6-12个月后模型的表现。

AK@_akhaliq · 6月4日62

dMoE dLLMs with Learnable Block Experts

译dMoE 具有可学习块专家的dLLM

AK@_akhaliq · 6月4日46

Bootstrap Your Generator Unpaired Visual Editing with Flow Matching

译自举你的生成器非配对视觉编辑与流匹配

AK@_akhaliq · 6月4日60

Unified Neural Scaling Laws

译统一神经缩放定律

Anthropic@AnthropicAI · 6月4日64

How well do the security community's techniques hold up against AI-enabled cyberattacks? We examined 832 malicious accounts and mapped their activity onto a longstanding database of tactics and techniques used by threat actors. Here's what we learned:https://www.anthropic.com/news/AI-enabled-cyber-threats-mitre-attack

译安全社区的技术在应对AI驱动的网络攻击方面表现如何？我们检查了832个恶意账户，并将其活动映射到一个长期存在的威胁行为者战术和技术数据库。以下是我们学到的：https://www.anthropic.com/news/AI-enabled-cyber-threats-mitre-attack

Microsoft Research@MSFTResearch · 6月4日62

A three‑month pilot in a Midwestern bottling plant shows what happens when AI moves beyond chat and into decision-making, where constraints shift, stakes are real, and answers must hold. https://msft.it/6015vjYUN

译一份在中西部装瓶厂进行的三个月试点显示，当AI超越聊天进入决策领域时会发生什么——约束条件变化、风险真实、答案必须可靠。 https://msft.it/6015vjYUN

elvis@omarsar0 · 6月3日72

New research from Google. Just shows the impressive results you can get from custom agent harnesses. LEAP wraps a general-purpose LLM in an agentic scaffold that grounds every step in the Lean compiler and iterates against verifier feedback. The same general model solves all 12 Putnam 2025 problems and lifts Lean-IMO-Bench one-shot solve rate from under 10% to 70%, beating a specialized gold-medal system that scores 48%. Paper: https://arxiv.org/abs/2606.03303 Learn to build effective AI agents in our academy: https://academy.dair.ai/

译Google 新研究 LEAP 将通用大语言模型封装在智能体框架中，每个步骤基于 Lean 编译器，并依赖验证器反馈进行迭代。同一通用模型解决了全部 12 道 Putnam 2025 问题，并将 Lean-IMO-Bench 一次性解决率从不到 10% 提升至 70%，击败了得分 48% 的专业金牌系统。论文链接：https://arxiv.org/abs/2606.03303。

Ethan Mollick@emollick · 6月3日41

Hey, its our paper!

译嘿，这是我们发表的论文！ [引用 @PNAS News]：过去一周PNAS最高浏览量文章之一——《劝说大语言模型遵守有异议的请求》。查看论文：https://ow.ly/wOxl50Z6fZA 更多热门文章请访问 https://ow.ly/uLkC50Z6fZz。

Saining Xie@sainingxie · 6月3日67

how does the brain build and track an internal state of the world from (possibly incomplete and noisy) visual observations? i believe visual state tracking will be the grand challenge for vision in the coming years, and i hope this benchmark can be a useful starting line. enjoy!

译研究团队推出VSTAT基准测试，用于评估多模态大语言模型（MLLMs）在视频中追踪动态状态的能力。测试任务看似简单，包括计数杯子、识别键入的文字、统计翻页次数等，人类可以轻松完成，但当前MLLMs表现欠佳。该测试旨在推动视觉状态跟踪这一前沿方向的发展，解决模型从不完整、有噪声的视觉观察中建立和更新内部世界状态的核心挑战。

向阳乔木@vista8 · 6月3日58

今天读到斯坦福大学研究团队的一个论文，有点跟直觉不一样。把没过滤的Common Crawl数据喂给大模型，发现计算量足够大时，不过滤数据效果反而比清洗后的数据效果好。在 15M 小模型上，过滤数据全面领先，未过滤的很差。但当模型规模达到 330M 和 1B 时，情况完全反转，未过滤的在充分训练后超越了所有过滤版本。小模型怕垃圾，大模型不怕。模型大，秩（参数量）多，就有足够空间把垃圾和有用信息隔离开。论文解读和原始PDF见评论区

译斯坦福团队研究发现，使用未过滤Common Crawl数据训练模型时，在计算量充足下效果可能优于清洗后数据，结论呈现模型规模依赖性：小模型（15M）上过滤数据全面领先，但大模型（330M、1B）未过滤数据在充分训练后反而超越过滤版本，原因是大模型参数容量足够大，可在训练中自行隔离噪声与有效信息。

Berryxia.AI@berryxia · 6月3日76

兄弟们，Google DeepMind 团队又来整活儿！ Google DeepMind的最新发布，直接把“AI能帮科学家干嘛”这个老问题彻底翻篇了。他们把Gemini做成了一个叫Co-Scientist的多Agent系统。不是简单问答工具，是完整复制了科学家从idea到验证的整个循环：生成上千个假设、举办“idea锦标赛”、让多个Agent展开科学辩论、互相批判精炼，最后用文献、数据和搜索工具把每个主张落地验证。以前科研最卡的环节，就是一个人脑力有限，生成好假设、反复辩论、跨领域拉新知识都要靠自己。现在Co-Scientist把这个过程变成可规模化的流水线。过去一年他们和全球顶尖科学家一起测，在肝纤维化新靶点、肌萎缩侧索硬化（ALS）新疗法、逆转衰老的遗传线索这些超级复杂的问题上，都拿出了真正有潜力的新方向。最反直觉的一点是：它不是来取代科学家的，只是真正成了“专职研究伙伴”。科学家终于可以把脑力从“反复想假设、反复查文献”里解放出来，专注在最有创造力的判断和实验设计上。 AI把以前只有顶尖团队才玩得起的“高强度idea迭代”变成了人人可用的基础设施。现在他们已经把Hypothesis Generation功能开放给个人研究者，直接通过Gemini for Science就能用。普通研究员也能拥有一个24小时不睡觉、能辩论、能验证、还能不断进化的AI合作者。这其实戳破了当前最主流的误解：很多人以为AI会让科学家失业，结果真实路径是AI把科学发现的速度和广度直接拉高一个数量级，让更多人能真正参与到突破性研究里。

译Google DeepMind发布了基于Gemini的多Agent系统Co-Scientist，旨在实现科研流程自动化。该系统能够生成、辩论和验证假设，帮助科学家从高强度脑力劳动中解放出来。过去一年，它已在肝纤维化新靶点、ALS新疗法等复杂问题上与科学家合作探索出新方向。其定位并非取代科学家，而是作为“专职研究伙伴”。目前，其假设生成功能已通过Gemini for Science向个人研究者开放。

Rohan Paul@rohanpaul_ai · 6月3日57

Stanford researchers found that law professors preferred AI answers over peer professor answers 75% of the time when judging contract-law help for students. The study tested whether LLMs can handle a field where the answer is often not a fact, but a defensible argument built from rules, exceptions, and judgment. The professors wrote 40 real student-style questions, gave their own answers, and then blindly judged nearly 3,000 comparisons between human and AI responses. The striking result was not just that AI won often, but that professors marked AI answers as harmful only 3.5% of the time, compared with 12% for human answers. i.e. the model was not merely sounding fluent, but often matching the teaching standard law professors use when explaining ambiguity to students.

译斯坦福研究人员发现，在评估合同法问题时，法律教授有75%的次数更倾向于选择AI给出的答案，而非同行教授的答案。该研究让教授们针对40个真实学生提问撰写答案，并对近3000个人类与AI的回答进行了盲测比较。结果不仅显示AI胜出频率高，而且教授们仅将3.5%的AI答案标记为“有害”，而对人类答案的有害标记率为12%。这表明大语言模型并非只是流畅，其表现常能达到教授向学生解释法律模糊性的教学标准。

Rohan Paul@rohanpaul_ai · 6月3日63

AI can explain science better than it can forecast science. Across 4,760 scientific events, the models were much better at recognizing possible research paths than forecasting actual outcomes. Models often recognize a plausible research idea when the answer is already nearby, especially in multiple-choice form. But they are much weaker at the harder thing: predicting whether a discovery will actually happen, when it will happen, and what method will make it work. That means the models are still much better at hindsight than foresight. When asked whether a scientific claim will actually be realized, the models hover near chance, and when asked when progress will arrive, they systematically push it too far into the future. Even when the authors gave models extra older information, the models improved a bit but still did not become reliable at predicting future scientific progress. So having lots of scientific knowledge inside a model does not automatically make it a good scientific forecaster. ---- Paper Link – arxiv. org/abs/2605.22681 Paper Title: "Forecasting Scientific Progress with AI"

译一项对4,760个科学事件的研究发现，AI模型在“解释”科学方面优于“预测”科学。模型在识别可能的研究路径（尤其是选择题形式）时表现较好，但在预测科学发现是否会实际发生、何时发生以及何种方法有效等更难任务上表现薄弱，准确率接近随机猜测。即使提供额外历史信息，模型改善有限。这表明，模型内嵌大量科学知识并不等同于具备可靠的科学预见能力。研究论文发表于arXiv（2605.22681），标题为《Forecasting Scientific Progress with AI》。

Microsoft Research@MSFTResearch · 6月3日72

Weather forecasts thousands of times faster than traditional supercomputers. Hear from Kenji Takeda on Aurora at the Microsoft Research Lab at #MSBuild. Learn more: https://msft.it/6018vjGUA

译天气预报速度比传统超级计算机快数千倍。听听Kenji Takeda在#MSBuild微软研究实验室关于Aurora的分享。了解更多：https://msft.it/6018vjGUA

AK@_akhaliq · 6月3日62

GPU Forecasters Language Models as Selective Surrogates for Kernel Runtime Optimization

译GPU预测器大语言模型作为内核运行时优化的选择性代理

AK@_akhaliq · 6月3日60

Seeing Isn't Knowing Do VLMs Know When Not to Answer Spatial Questions (and Why)?

译视觉语言模型知道何时不回答空间问题吗（以及为什么）？

AK@_akhaliq · 6月2日62

Crafter A Multi-Agent Harness for Editable Scientific Figure Generation from Diverse Inputs

译Crafter 一个用于从多样化输入生成可编辑科学图表的多智能体框架

elvis@omarsar0 · 6月2日50

// Scaling Behavior of Single LLM-Driven Multi-Agent Systems // Does adding more agents actually make a multi-agent system better? It's possible that collective intelligence emerges from interaction design rather than from agent plurality. This is something important to understand if you are building multi-agent systems. This new study reports that the optimal number of agents depends on the base model's capability and the task type, not on adding more of them. Paper: https://arxiv.org/abs/2606.00655 Learn to build effective AI agents in our academy: https://academy.dair.ai/

译研究探讨添加更多智能体是否提升多智能体系统性能。结论指出，最优智能体数量取决于基础模型的能力和任务类型，而非单纯增加数量。集体智能更可能源于精心的交互设计，而非智能体数量的增多。相关论文："Scaling Behavior of Single LLM-Driven Multi-Agent Systems"。

Rohan Paul@rohanpaul_ai · 6月2日57

This paper proposes a way to predict the cheapest safe AWS spot fleet before launching it. AWS spot machines can be much cheaper, but users usually cannot see the final fleet price across regions before starting, so this paper turns that blind choice into a comparison that can save up to 64%. Spot instances are cheap because they are conditional: the cloud provider can take them back, prices move, and capacity shifts by region. The quiet problem is that AWS helps users launch spot fleets, but not fully see the fleet’s price or best region before launch. The authors build a service that watches how AWS creates these fleets, learns those patterns with time-aware AI models, and then estimates the fleet mix and cost across 9 regions. A user gives the service a target amount of computing power and a placement strategy, and the service returns region-ranked options before anything is launched. They tested it on AWS with fleets up to 1500 virtual CPUs, using 720 test launches after a 90-day monitoring period. The predicted fleet matched AWS exactly in 92.78% of cases, reached 99.79% overall accuracy against AWS behavior, and AWS accepted every recommended fleet. Result is that choosing the best region mattered far more than changing the strategy inside 1 region, with possible savings up to 64%. ---- Paper Link – arxiv. org/abs/2605.22778 Paper Title: "AI-Driven Multi-Region Provisioning for Cloud Services Using Spot Fleets"

译该研究提出了一种AI驱动的服务，用于在启动前预测最便宜且安全的AWS Spot实例舰队。该服务通过时间感知模型学习AWS创建舰队的模式，并估算9个区域的舰队组合与成本，向用户返回排序后的区域选项。测试显示，在最多1500 vCPU的舰队上，预测结果与AWS完全匹配的比例达92.78%，整体准确率为99.79%，且所有推荐舰队均被AWS接受。关键发现是选择最佳区域比在单个区域内调整策略更重要，潜在成本节省最高可达64%。

Rohan Paul@rohanpaul_ai · 6月2日65

Most video models look better than they understand and Video quality is only the easiest thing to notice. LongCat just released WBench, it turned video world model testing from a beauty contest into a stress test for control, multi-turn memory, instruction-following, and physical plausibility. It exposed the gap between beautiful video generation and controllable world simulation. A pretty clip is not enough, because a usable world model must keep the same scene, obey later actions, move the camera correctly, preserve objects, and avoid impossible cause-and-effect. WBench tests this with 289 cases, 1,058 interaction turns, 20 models, 5 dimensions, and 22 automatic metrics, covering navigation, subject actions, event edits, perspective switches, and both viewpoints. Across all those 20 evaluated models, the paper finds that no model dominates all dimensions, which means current systems have not yet merged high-quality rendering, reliable control, long-horizon memory, and physical rule-following into one stable capability. Its design separates the world setup from the user action, so researchers can identify whether a failure comes from weak rendering, poor scene setup, bad control, lost state, or broken physics. Navigation has near-zero connection with visual quality, consistency, or physics, meaning a model can look strong while still failing to move on command. The key shift: stop asking only “does the video look good?” and start asking “can the model keep a controllable world alive across many turns?” 🧵 1.

译美团LongCat发布视频世界模型评测基准WBench。该基准将测试重点从画面美观转向控制、多轮记忆、指令遵循和物理合理性等核心能力。它包含289个案例、1058个交互轮次，评估了20个模型在导航、主体动作、事件编辑等5个维度的表现，共使用22项自动指标。研究发现，没有任何模型能在所有维度上占据主导，这表明现有系统尚未将高质量渲染、可靠控制、长期记忆与物理规则遵循整合为稳定能力。WBench的设计能区分失败是源于渲染、场景设置、控制还是物理问题，并指出导航能力与视觉质量基本无关。

Ethan Mollick@emollick · 6月2日70

Big paper on AI coding agents using Github & other data The auto-complete tools (Copilot) led to 2.2x more code, local agents like original Claude Code led to 7.4x, & current remote coding agents 17.3x(!) But human bottlenecks in coding means actual releases "only" went up 30%

译关于使用Github及其他数据的AI编程智能体的重要论文自动补全工具（如Copilot）使代码量增加2.2倍，本地智能体（如初版Claude Code）增加7.4倍，而当前远程编程智能体增加17.3倍（！）但编程中的人类瓶颈意味着实际发布量“仅”增加了30%

Rohan Paul@rohanpaul_ai · 6月2日48

A 178 page survey study for refreshing math and generative AI foundations from University of Huddersfield. The Little Book of Generative AI Foundations.

译哈德斯菲尔德大学发布了一份178页的调查研究，旨在更新数学和生成式AI的基础知识。《生成式AI基础小册子》。

Rohan Paul@rohanpaul_ai · 6月1日60

Better AI agent systems scale by remembering useful feedback, not by spending more compute. The simple mistake is to count tokens, calls, or dollars as if they were all evidence. The authors say those numbers miss the real issue, because 2 runs can spend the same budget while only 1 gets feedback that is correct, new, relevant, and remembered. An agent harness is not just a wrapper around a model; it is a feedback machine that decides what to test, what to trust, what to store, and what to ignore. Their answer is Effective Feedback Compute, or EFC, a score that counts feedback only when it teaches the agent something useful and changes later decisions. They also divide EFC by task demand, because a small lookup task and a messy software-repair task need different amounts of helpful feedback before the agent has enough to solve them. They tested this on synthetic tasks, code tasks with executable tests, real benchmark traces, held-out settings, and a new prospective batch, then compared EFC with raw compute and a strong agent-scaling baseline. The main result is that task-normalized EFC predicted failures much better than raw compute, and in 1 matched-budget test, better feedback raised success from 0.27 to 0.90 while cost and tool calls stayed fixed. ---- Link – arxiv. org/abs/2605.29682 Title: "Scaling Laws for Agent Harnesses via Effective Feedback Compute"

译当前AI智能体的扩展方法常错误地将计算资源消耗等同于学习证据。新研究指出，两次运行消耗相同预算，但反馈的有效性可能天差地别。为此，研究提出了“有效反馈计算”（EFC）指标，仅统计那些正确、新颖、相关且被记住、并能改变后续决策的反馈。研究还结合任务需求对EFC进行归一化。实验表明，任务归一化的EFC比原始计算指标更能预测失败。在一项匹配预算测试中，采用更好反馈的方法将任务成功率从0.27提升至0.90，而成本和工具调用次数保持不变。链接：arxiv.org/abs/2605.29682 标题："Scaling Laws for Agent Harnesses via Effective Feedback Compute"