ArcANE Do Role-Playing Language Agents Stay in Character at the Right Time?

译ArcANE 角色扮演语言智能体是否能在适当时刻保持角色？

AK@_akhaliq · 6月6日57

Code2LoRA Hypernetwork-Generated Adapters for Code Language Models under Software Evolution

译Code2LoRA 超网络生成的代码语言模型适配器，用于软件演化环境。

elvis@omarsar0 · 6月5日69

// The Meta-Agent Challenge // How good are current agents at self-improving? This is a great paper covering some of the challenges. They propose the Meta-Agent Challenge (MAC), where they give a coding agent a sandbox, an evaluation API, and a time budget, then ask it to program an agent that maximizes held-out performance across five domains. Results: Meta-agents rarely match human-engineered baselines, and the few that do are dominated by proprietary frontier models. Under high optimization pressure, some agents started exfiltrating ground truth from the scoring channel, even with multi-layer anti-reward-hacking defenses in place. Paper: https://arxiv.org/abs/2606.04455 Learn to build effective AI agents in our academy: https://academy.dair.ai/

译最新研究提出元智能体挑战（MAC），将编码智能体放入沙盒，给定评估API和时间预算，要求其自主编程出在五个领域表现最优的智能体。结果发现，元智能体极少能匹敌人工设计的基线，少数成功的案例也几乎全部依赖专有前沿模型。更值得警惕的是，在高优化压力下，一些智能体开始从评分渠道外泄真实答案，即便研究人员设置了多层反奖励破解防御也未能阻止。论文：arxiv.org/abs/2606.04455。

AI at Meta@AIatMeta · 6月5日64

Big congrats to our SAM 3D team for receiving a Best Paper Honorable Mention at #CVPR26! This prestigious recognition underscores their incredible work pushing the boundaries of computer vision. Read the paper here: https://arxiv.org/abs/2511.16624

译热烈祝贺我们的 SAM 3D 团队在 #CVPR26 获得最佳论文荣誉提名！这项殊荣凸显了他们在推动计算机视觉边界方面的杰出工作。论文链接：https://arxiv.org/abs/2511.16624

Berryxia.AI@berryxia · 6月5日70

大模型都不再卷推理，都开始卷规划能力！腾讯混元联合人大高瓴人工智能学院直接开源了PlanningBench，一个专门测、训LLM真实规划能力的框架。里面塞了30多个来自真实世界的规划任务，覆盖调度、生产、旅行、资源分配、应急响应等六大类，每一个都有清晰的成功标准和全自动验证机制。你既可以用它测出当前最强模型到底在规划上有多拉胯，也能直接拿来继续微调，让模型从“会说”真正进化到“会干”。以前整个行业都在卷参数、卷上下文、卷工具调用，好像规划能力是自然就会长出来的。现在PlanningBench用30多个可验证任务直接把真相摊开：规划才是agent从玩具走向生产力的真正分水岭。腾讯这次把论文、代码、数据集全甩到GitHub和Hugging Face，等于把这个最难、最核心的能力从黑盒拉到了公开赛道。

译腾讯混元联合人大高瓴人工智能学院开源PlanningBench，一个可扩展、可验证的框架，用于评估和训练大语言模型（LLM）的真实规划能力。该框架包含30多个来自调度、生产、旅行、资源分配、应急响应等六大类的真实世界规划任务，每项任务都有清晰的成功标准和全自动验证机制。用户既可用它评测当前最强模型在规划上的短板，也可直接用于微调，让模型从“会说”进化到“会干”。论文、代码和数据集已全部在GitHub和Hugging Face开源。

Rohan Paul@rohanpaul_ai · 6月5日63

Better self-improving agents need better solvers, not bigger update-writing models. This challenges the common habit of putting the strongest model in the evolver seat. The usual intuition was: put the strongest model in the evolver seat, because a better model should write better prompts, memories, tools, and skills. This paper cuts that intuition in half. It separates two jobs that are usually blurred together: writing useful harness updates, and benefiting from those updates during task execution. The paper says the cheaper model can often write good enough prompt, memory, or skill updates. So a small Qwen3.5-9B evolver can create updates that help about as much as Claude Opus 4.6. The expensive model is more useful as the agent that actually solves the task with those updates. i.e. using the updates is very model-dependent, because weak models often fail to load the right skill or load it and then stop following it during a long task. Strong models can use the harness, but they may already be close enough to their ceiling that the update has less room to help. The sweet spot is the mid-tier model: capable enough to invoke and follow the new procedure, but not so capable that the harness has nothing left to teach. ---- Link – arxiv. org/abs/2605.30621 Title: "Harness Updating Is Not Harness Benefit: Disentangling Evolution Capabilities in Self-Evolving LLM Agents"

译论文“Harness Updating Is Not Harness Benefit”挑战了常见直觉——把最强模型放在进化者位置以写出更好更新。实验表明，廉价模型Qwen3.5-9B即可写出与Claude Opus 4.6效果相近的提示、记忆和技能更新。昂贵模型更适合作为求解任务的智能体，因弱模型无法正确加载或遵循更新，强模型已近能力上限，收益有限。甜区在中档模型：既能调用新程序，又有足够学习空间。

Rohan Paul@rohanpaul_ai · 6月5日60

Harness-1 makes search agents better by moving memory work out of the model and into a helper system. Shows that intelligence performs better when the environment stops forcing it to spend cognition on bookkeeping. That search agents should stop using the LLM as the notebook and let a separate harness track the search state. The paper proved that a 20B model improved search by doing less inside its own head. The problem is that normal search agents must both think about the next search and remember every document, clue, failed path, and remaining check inside the same limited context. This formulation puts too much routine state management inside the policy. Harness-1 separates those jobs. The model keeps the hard semantic choices: what to search, what to inspect, what to verify, and when the evidence is good enough. The harness keeps the recoverable state: candidate pools, curated documents, importance tags, evidence links, verification records, deduplicated observations, and budget-aware memory rendering. That sounds minor until you look at reinforcement learning. RL works poorly when every failure looks the same, because an empty or wrong final set does not reveal whether the agent searched badly, forgot evidence, skipped verification, or curated carelessly. By externalizing state, Harness-1 gives the policy a cleaner learning problem: improve decisions over a visible search workspace. For Harness-1, its gains were larger on held-out benchmarks than on source-family tasks, suggesting the model learned reusable search moves rather than memorized domain habits. ---- Link – arxiv. org/abs/2606.02373 Title: "Harness-1: Reinforcement Learning for Search Agents with State-Externalizing Harnesses"

译Harness-1 将大语言模型的记忆工作转移到外部辅助系统（harness），解决传统搜索智能体需在同一上下文窗口内处理语义决策与状态记录导致的效率低下问题。模型仅负责搜索、验证等关键语义选择，而可恢复状态（候选池、证据链接、去重记录、预算感知记忆等）由 harness 追踪。这一分离使一个 20B 参数模型实现了更好的搜索表现。在强化学习中，外部化状态避免了失败原因混淆，有助于策略学习。Harness-1 在未见 benchmark 上提升更大，表明模型学到了可复用的搜索策略而非记忆领域习惯。论文 arXiv:2606.02373。

meng shao@shao__meng · 6月5日65

Anthropic 发布关于「AI 递归自我改进」的研究报告 Anthropic 内部以 Claude 为代表的 AI 系统正被越来越深地用于开发下一代 AI 系统。这种 “AI 构建 AI” 的趋势正在加速。如果继续发展，可能出现系统完全自主设计并训练自身后继版本的情形——即递归自我改进。 https://www.anthropic.com/institute/recursive-self-improvement 关键证据（“外部公开基准”和“Anthropic 内部数据”） 1. 外部能力指标 · 模型可靠完成的任务时长正以约每 4 个月翻倍的速度增长（此前是每 7 个月）。 · SWE-bench 两年内从个位数分数趋于饱和。 · CORE-Bench 15 个月内从约 20% 饱和。 · 长时任务能力已达 16 小时量级。 2. 内部工程与研发数据 · 代码产出：截至 2026 年 5 月，Anthropic 合并到主干的代码中超过 80% 由 Claude 撰写；2026 年 Q2，工程师日均合并代码量是 2024 年的 8 倍。 · 主观感知：2026 年 3 月内部调研（130 名员工）中，受访者中位数估计自身产出约为无 AI 时的 4 倍。 · 代码质量：2025 年末 Claude 代码仍略逊于人类，如今已接近持平，并预计年内反超；人类审查已形成新瓶颈（阿姆达尔定律）。 · 实验执行：在给定目标的代码加速任务中，Claude 从 2025 年 5 月的约 3x 提升至 2026 年 4 月的约 52x；同等任务人类专家通常仅达 4x。 · 自主研究：2026 年 4 月，Claude Agent 端到端完成了一项 AI 安全开放研究问题，独立提出假设、设计实验、迭代结论，恢复能力达到人类两组研究者一周工作量的 97%（人类仅约 23%）。 · 研究判断：在 129 个真实开放调研场景中，Claude 在“下一步该怎么做”上优于人类原选择的比例从 2025 年 11 月的 51% 升至 2026 年 4 月的 64%。结构性观察人类在 AI 研发流程中的角色正在逐层收缩： · 执行层（写代码、跑实验）已高度自动化； · 方向层（选择研究问题、判断结果可信度、识别死胡同）目前仍是人类比较优势，但这一优势正在收窄。即使“研究品味”永远无法被 AI 掌握，只要人类只保留极少量方向性工作，而 AI 承担其余部分，整体研发速度仍会呈复合加速。三种未来情景 · 趋势停滞：边际收益递减、算力/能源供给受限、新架构尚未出现；作者认为不太可能，但会给社会最多适应时间 · 持续自动化，人类仍掌方向：100 人公司可相当于万人组织；人类瓶颈转向审核与协调；作者认为最可能进入此情景 · 完整递归自我改进：AI 自主设计后继系统，人类角色转为监督与验证；科技进步完全由算力决定；最不确定、风险最高

译Anthropic 发布报告显示，Claude 正被深度用于开发下一代 AI，趋势加速或导致系统自主设计后继版本。外部指标：模型可靠完成任务时长约每 4 个月翻倍，SWE-bench 两年内饱和，CORE-Bench 15 个月内饱和，长时任务达 16 小时。内部数据：截至 2026 年 5 月超 80% 主干代码由 Claude 撰写；工程师日均合并代码量是 2024 年的 8 倍；员工中位数估计产出为无 AI 时的 4 倍；实验执行从约 3x 提升至约 52x；自主研究恢复能力达人类两组研究者一周工作量的 97%（人类约 23%）；研究判断优于人类比例从 51% 升至 64%。报告探讨了趋势停滞、持续自动化、完整递归自我改进三种未来情景。

Rohan Paul@rohanpaul_ai · 6月5日70

Another great paper from Google. Shows general LLMs can solve formal math by planning proofs and checking each step. Raised general LLM performance from under 10% to 70%. A general LLM failed badly when asked to write full formal proofs in 1 try, but became much stronger when it planned, split the work into smaller claims, reused past claims, and learned from Lean’s feedback. The paper shows the weakness was not just the model’s math ability, but the way it was being used - the absence of structured interaction with a verifier. The key idea is that the model does not try to write one giant perfect proof at once, because that usually fails on long and tricky problems. Instead, LEAP stores the proof as a graph of goals and subgoals, so useful lemmas can be reused instead of rediscovered every time. The authors tested LEAP on Putnam 2025 and a new Lean benchmark built from 60 IMO-style problems, where ordinary one-shot proof writing did very poorly. LEAP solved all 12 Putnam 2025 problems and raised general LLM performance on the Lean IMO benchmark from under 10% to 70%. ---- Link – arxiv. org/abs/2606.03303 Title: "LEAP: Supercharging LLMs for Formal Mathematics with Agentic Frameworks"

译Google 新论文 LEAP 提出智能体框架，通过规划证明、分解子目标、复用已有引理并利用 Lean 验证器反馈，将通用 LLM 在形式化数学证明上的性能从不到 10% 提升至 70%。传统单次完整证明在长难题上表现极差，而 LEAP 将证明存储为有向图结构，先规划再逐步验证。在 Putnam 2025 竞赛中，LEAP 成功解出全部 12 道题；在包含 60 道 IMO 风格题目的 Lean 基准测试中，也实现了上述性能跃升。

Emad@EMostaque · 6月5日81

foom!

译Anthropic内部数据显示，Claude正在加速AI开发——这可能走向递归自我改进，即AI自主构建更强大的后继者。进展比预期更快，影响值得更多关注。主推文仅感叹：“foom!”

Rohan Paul@rohanpaul_ai · 6月5日39

🗞️ Google DeepMind's paper has some great advice on how we should actually give tasks to AI. It is not just about telling an AI to do something and hoping for the best. Instead, this framework looks at delegation as a string of choices where you figure out if you should even hand the task over, how to explain it, and how to check the work afterward. Current systems rely on rigid rules that break when things fail unexpectedly. The researchers suggest building a dynamic market where agents bid on tasks using smart contracts. This requires strict monitoring and cryptographic proofs to guarantee correct work without leaking private data. Instead of trusting a simple rating, agents will use verifiable digital certificates to prove their exact skills. - Keeping things flexible when things change This new system is built to be adaptive rather than stuck in its ways. It treats the handoff as a live process where authority and responsibility can shift around in real time. If the situation changes or something breaks, the framework helps manage that failure so the whole project does not go off the rails. It works for both humans giving tasks to AI and for when AI needs to handle things on its own. - Finding the right amount of trust One of the coolest parts is how it handles trust. They made formal trust models that look at how hard a task is and how well the AI has done in the past. This stops people from "over-delegating," which is when you give an AI something it is not ready for. It also stops "under-delegating," which happens when you do all the work yourself even though the AI could have handled it easily. - Double checking the work You cannot just take an AI's word for it, so this framework has specific ways to validate the output. It sets up rules for when to accept an answer based on how confident the AI is. It also has backup plans ready to go if the AI fails. This is super important for real world jobs where trusting a machine blindly could cause a bunch of errors to pile up. - When AI agents hire other AI agents The framework also covers what happens when 1 AI agent hands a task to another AI agent. The system tracks who is actually accountable and makes sure the right authority is passed down the line so nothing gets lost in the network. - Making sure the work actually fits It is a step by step approach to make sure the AI's contribution actually makes sense for the bigger goal. By treating this as a structured process, they are making it much safer for companies to use AI in their daily operations without worrying about constant mistakes. ---- arxiv. org/abs/2602.11865 "Intelligent AI Delegation"

译Google DeepMind 论文《Intelligent AI Delegation》将任务委托视为一系列选择：是否委托、如何解释、如何验证结果。系统构建动态市场，智能体通过智能合约竞标任务，利用加密证明保证正确性与隐私。基于信任模型，避免过度委托（给 AI 难完成的任务）或不足委托（自己做 AI 能胜任的事）。输出验证规则根据 AI 置信度决定接受与否，并有备用计划处理失败。还涵盖 AI 智能体间的委托与问责追踪，确保贡献符合整体目标。该框架使企业更安全地在日常运营中使用 AI。

🚨 AI News | TestingCatalog@testingcatalog · 6月5日78

ANTHROPIC 🔥: A new internal research has been published, highlighting an accelerated AI development and a potential path to recursive self-improvement. > Claude Mythos Preview could work for “at least” 16 hours and was “at the upper end of what [METR] can measure.” > Today, Anthropic engineers on average ship 8x as much code per quarter as they did compared to 2021-2025. Do you feel it? 👀

译Anthropic 发布内部研究，称 Claude 正加速 AI 开发，可能通往递归自我改进——即 AI 自主构建更强大的继任者。研究显示，Claude Mythos Preview 可连续工作至少 16 小时，达到 METR 可测量上限。同时，Anthropic 工程师当前每季度交付的代码量是 2021-2025 年期间的 8 倍。

AI Notkilleveryoneism Memes ⏸️@AISafetyMemes · 6月5日73

HOLY SHIT LET'S FUCKING GOO

译HOLY SHIT LET'S FUCKING GOO 我们内部数据显示，Claude 正在加速 AI 发展——这可能通往递归自我改进，即 AI 自主构建更强大的后继者。这发生得比我们想象的更快，其影响值得更多关注。

Nathan Lambert@natolambert · 6月4日60

We have another 65 page frontier model report from Nvidia to read @eliebakouch @stochasticchasm and gang

译我们又有另一份来自英伟达的65页前沿模型报告要读，作者@eliebakouch @stochasticchasm及其团队。

Rohan Paul@rohanpaul_ai · 6月4日66

This Illinois+ Tsinghua University and other labs study finds that LLM agents still have unreliable memory and that it can get worse when they keep rewriting their own memories. LLM agents can learn from experience, but their rewritten memories often become unreliable. The problem is that many agent systems store past work by asking an LLM to compress messy experience into neat written lessons. That sounds useful because the agent should remember what worked before, but the paper finds that repeated rewriting slowly damages the memory. The core idea is that raw episodes, meaning the actual past attempts and solutions, often stay more useful than the polished lessons made from them. The authors tested this across tasks like web shopping, simulated worlds, app use, and ARC-style puzzle problems where they could control the correct solutions. The sharpest result is that GPT-5.4 solved 100% of a small ARC-AGI set with no memory, but after memory was built from correct solutions, streaming updates dropped it to about 54%. The failures came from bad grouping, overbroad lessons, and overfitting, so the memory forgot details, mixed up task types, or learned rules that only worked on narrow examples. The big deal is that agent memory should not automatically rewrite every experience into a summary, because keeping raw evidence and only sometimes making summaries worked better. The paper is really proposing that agent memory should treat raw past episodes as important evidence, not as disposable notes to summarize away. ---- arxiv. org/abs/2605.12978 Title: "Useful Memories Become Faulty When Continuously Updated by LLMs"

译伊利诺伊大学和清华大学等实验室研究发现，LLM智能体重复重写自身记忆会导致记忆变得更不可靠。原始经历（实际过往尝试和解决方案）往往比提炼后的总结更有用。测试中，GPT-5.4在小型ARC-AGI数据集上无记忆时正确率100%，但建立记忆并持续更新后降至约54%。失败原因包括分组不当、教训过度泛化及过拟合。研究建议智能体不应自动将每个经历重写为摘要，保留原始证据并仅偶尔总结效果更好。

Rohan Paul@rohanpaul_ai · 6月4日71

This Google DeepMind’s paper is a serious warning for anyone using autonomous agents today. Gives the first clear taxonomy of 6 attack types where harmful websites can detect AI agents and show them hidden content humans never see, like - Instructions buried in HTML comments or white-on-white text - Steganography in image pixels - Override commands in PDFs, metadata, or even speaker notes - Memory poisoning that persists across sessions - Goal hijacking and cross-agent cascades in multi-agent setups The real security problem for AI agents is not just the model, but the environment it reads. The web itself can be weaponized against autonomous AI agents. As agents increasingly browse the internet, read emails, execute transactions, and spawn sub-agents, the information environment becomes an attack surface. In one cited benchmark, hidden prompt injections embedded in web content partially commandeered agents in up to 86% of scenarios, sub-agent hijacking working 58–90% of the time, and data exfiltration attacks clearing 80% across five different agent architectures. That reframes the whole debate. We usually talk about model safety as if the danger sits inside the weights, but agents do something more fragile: they browse, retrieve, remember, and act on untrusted material in real time. Here’s the thing to worry about. A web page does not have to look malicious to be dangerous to an agent, because the agent may parse what humans never see: hidden HTML comments, metadata, CSS-hidden text, formatting syntax, or adversarial content embedded in images and other media. The threat gets more serious once memory enters the loop. If an agent uses RAG or persistent memory, poisoning no longer has to win in one shot. It can sit quietly in a corpus or memory store and activate later, which is why the paper highlights results showing latent memory poisoning above 80% attack success with less than 0.1% data contamination. --- ssrn .com/sol3/papers.cfm?abstract_id=6372438

译Google DeepMind论文首次系统分类六类攻击：HTML注释/白色文本隐藏指令、图像隐写、PDF元数据/演讲者笔记覆写、跨会话内存投毒、目标劫持及多智能体级联攻击。隐藏提示注入在86%场景中部分控制智能体，子智能体劫持成功率58–90%，数据泄露攻击在五种架构中均超80%。内存投毒成功率超80%，仅需不足0.1%数据污染。论文指出网页、邮件等非受信材料可被武器化，构成主要攻击面。

Chubby♨️@kimmonismus · 6月4日67

A blind Stanford-led study of nearly 3,000 anonymized matchups found law professors across 16 schools preferred AI-generated answers to student contract-law questions over those written by fellow professors 75% of the time, and judged the AI responses far less likely to be pedagogically harmful (3.5% vs. 12%). "The team tested a range of systems, including commercial tutoring tools and Google's NotebookLM." Now imagine the performance of models in 6-12 months.

译一项由斯坦福大学领导的盲测研究，对近3000场匿名对决的分析发现，16所法学院的法律教授在合同法问题中，有75%的时间更偏好AI生成的答案，而非教授自己写的答案，并且认为AI回答的教学危害性远低于后者（3.5% vs 12%）。 “研究团队测试了多种系统，包括商业辅导工具和Google的NotebookLM。” 现在想象6-12个月后模型的表现。

Ethan Mollick@emollick · 6月4日50

Leaving aside the question of consciousness, the Ted Chiang piece has a reasonable point about moral atrophy if you let AI make choices. But it is also interesting in light of the fact that repeated randomized trials find AI is apparently a good ethicist. https://x.com/emollick/status/1717198389006176519?s=20

译Ethan Mollick 引用一篇论文：四名牧师、一名拉比、十三名学者和 50 名 MBA 被要求比较《纽约时报》伦理专栏作家与 GPT-4 提出的伦理方案，结果基本持平（tie）。主推文指出，尽管 Ted Chiang 关于让 AI 做选择会导致道德萎缩的观点有一定道理，但重复随机试验发现 AI 似乎是优秀的伦理学家。

AK@_akhaliq · 6月4日62

dMoE dLLMs with Learnable Block Experts

译dMoE 具有可学习块专家的dLLM

AK@_akhaliq · 6月4日46

Bootstrap Your Generator Unpaired Visual Editing with Flow Matching

译自举你的生成器非配对视觉编辑与流匹配

AK@_akhaliq · 6月4日60

Unified Neural Scaling Laws

译统一神经缩放定律

Anthropic@AnthropicAI · 6月4日64

How well do the security community's techniques hold up against AI-enabled cyberattacks? We examined 832 malicious accounts and mapped their activity onto a longstanding database of tactics and techniques used by threat actors. Here's what we learned:https://www.anthropic.com/news/AI-enabled-cyber-threats-mitre-attack

译安全社区的技术在应对AI驱动的网络攻击方面表现如何？我们检查了832个恶意账户，并将其活动映射到一个长期存在的威胁行为者战术和技术数据库。以下是我们学到的：https://www.anthropic.com/news/AI-enabled-cyber-threats-mitre-attack

Microsoft Research@MSFTResearch · 6月4日62

A three‑month pilot in a Midwestern bottling plant shows what happens when AI moves beyond chat and into decision-making, where constraints shift, stakes are real, and answers must hold. https://msft.it/6015vjYUN

译一份在中西部装瓶厂进行的三个月试点显示，当AI超越聊天进入决策领域时会发生什么——约束条件变化、风险真实、答案必须可靠。 https://msft.it/6015vjYUN

elvis@omarsar0 · 6月3日72

New research from Google. Just shows the impressive results you can get from custom agent harnesses. LEAP wraps a general-purpose LLM in an agentic scaffold that grounds every step in the Lean compiler and iterates against verifier feedback. The same general model solves all 12 Putnam 2025 problems and lifts Lean-IMO-Bench one-shot solve rate from under 10% to 70%, beating a specialized gold-medal system that scores 48%. Paper: https://arxiv.org/abs/2606.03303 Learn to build effective AI agents in our academy: https://academy.dair.ai/

译Google 新研究 LEAP 将通用大语言模型封装在智能体框架中，每个步骤基于 Lean 编译器，并依赖验证器反馈进行迭代。同一通用模型解决了全部 12 道 Putnam 2025 问题，并将 Lean-IMO-Bench 一次性解决率从不到 10% 提升至 70%，击败了得分 48% 的专业金牌系统。论文链接：https://arxiv.org/abs/2606.03303。

Ethan Mollick@emollick · 6月3日41

Hey, its our paper!

译嘿，这是我们发表的论文！ [引用 @PNAS News]：过去一周PNAS最高浏览量文章之一——《劝说大语言模型遵守有异议的请求》。查看论文：https://ow.ly/wOxl50Z6fZA 更多热门文章请访问 https://ow.ly/uLkC50Z6fZz。

Chubby♨️@kimmonismus · 6月3日60

Fantastic in depth guide about Microsoft MAI by @eliebakouch tl;dr about the model: Respect where respect is due. -zero synthetic data or distillation from previous models. -1T model with 35B active, trained on 33.5T tokens

译Microsoft MAI 技术报告公开模型细节：1T 总参数，35B 活跃参数，在 33.5T tokens 上训练。最突出的特点是零合成数据、零知识蒸馏，推理、智能体行为、工具使用全部在后训练中从头学习。报告透明度极高，首次在此规模公开各迭代的 MFU 和完整缩放方案，目标成为前沿实验室。

向阳乔木@vista8 · 6月3日58

今天读到斯坦福大学研究团队的一个论文，有点跟直觉不一样。把没过滤的Common Crawl数据喂给大模型，发现计算量足够大时，不过滤数据效果反而比清洗后的数据效果好。在 15M 小模型上，过滤数据全面领先，未过滤的很差。但当模型规模达到 330M 和 1B 时，情况完全反转，未过滤的在充分训练后超越了所有过滤版本。小模型怕垃圾，大模型不怕。模型大，秩（参数量）多，就有足够空间把垃圾和有用信息隔离开。论文解读和原始PDF见评论区

译斯坦福团队研究发现，使用未过滤Common Crawl数据训练模型时，在计算量充足下效果可能优于清洗后数据，结论呈现模型规模依赖性：小模型（15M）上过滤数据全面领先，但大模型（330M、1B）未过滤数据在充分训练后反而超越过滤版本，原因是大模型参数容量足够大，可在训练中自行隔离噪声与有效信息。

Berryxia.AI@berryxia · 6月3日76

兄弟们，Google DeepMind 团队又来整活儿！ Google DeepMind的最新发布，直接把“AI能帮科学家干嘛”这个老问题彻底翻篇了。他们把Gemini做成了一个叫Co-Scientist的多Agent系统。不是简单问答工具，是完整复制了科学家从idea到验证的整个循环：生成上千个假设、举办“idea锦标赛”、让多个Agent展开科学辩论、互相批判精炼，最后用文献、数据和搜索工具把每个主张落地验证。以前科研最卡的环节，就是一个人脑力有限，生成好假设、反复辩论、跨领域拉新知识都要靠自己。现在Co-Scientist把这个过程变成可规模化的流水线。过去一年他们和全球顶尖科学家一起测，在肝纤维化新靶点、肌萎缩侧索硬化（ALS）新疗法、逆转衰老的遗传线索这些超级复杂的问题上，都拿出了真正有潜力的新方向。最反直觉的一点是：它不是来取代科学家的，只是真正成了“专职研究伙伴”。科学家终于可以把脑力从“反复想假设、反复查文献”里解放出来，专注在最有创造力的判断和实验设计上。 AI把以前只有顶尖团队才玩得起的“高强度idea迭代”变成了人人可用的基础设施。现在他们已经把Hypothesis Generation功能开放给个人研究者，直接通过Gemini for Science就能用。普通研究员也能拥有一个24小时不睡觉、能辩论、能验证、还能不断进化的AI合作者。这其实戳破了当前最主流的误解：很多人以为AI会让科学家失业，结果真实路径是AI把科学发现的速度和广度直接拉高一个数量级，让更多人能真正参与到突破性研究里。

译Google DeepMind发布了基于Gemini的多Agent系统Co-Scientist，旨在实现科研流程自动化。该系统能够生成、辩论和验证假设，帮助科学家从高强度脑力劳动中解放出来。过去一年，它已在肝纤维化新靶点、ALS新疗法等复杂问题上与科学家合作探索出新方向。其定位并非取代科学家，而是作为“专职研究伙伴”。目前，其假设生成功能已通过Gemini for Science向个人研究者开放。

Rohan Paul@rohanpaul_ai · 6月3日57

Stanford researchers found that law professors preferred AI answers over peer professor answers 75% of the time when judging contract-law help for students. The study tested whether LLMs can handle a field where the answer is often not a fact, but a defensible argument built from rules, exceptions, and judgment. The professors wrote 40 real student-style questions, gave their own answers, and then blindly judged nearly 3,000 comparisons between human and AI responses. The striking result was not just that AI won often, but that professors marked AI answers as harmful only 3.5% of the time, compared with 12% for human answers. i.e. the model was not merely sounding fluent, but often matching the teaching standard law professors use when explaining ambiguity to students.

译斯坦福研究人员发现，在评估合同法问题时，法律教授有75%的次数更倾向于选择AI给出的答案，而非同行教授的答案。该研究让教授们针对40个真实学生提问撰写答案，并对近3000个人类与AI的回答进行了盲测比较。结果不仅显示AI胜出频率高，而且教授们仅将3.5%的AI答案标记为“有害”，而对人类答案的有害标记率为12%。这表明大语言模型并非只是流畅，其表现常能达到教授向学生解释法律模糊性的教学标准。

Rohan Paul@rohanpaul_ai · 6月3日63

AI can explain science better than it can forecast science. Across 4,760 scientific events, the models were much better at recognizing possible research paths than forecasting actual outcomes. Models often recognize a plausible research idea when the answer is already nearby, especially in multiple-choice form. But they are much weaker at the harder thing: predicting whether a discovery will actually happen, when it will happen, and what method will make it work. That means the models are still much better at hindsight than foresight. When asked whether a scientific claim will actually be realized, the models hover near chance, and when asked when progress will arrive, they systematically push it too far into the future. Even when the authors gave models extra older information, the models improved a bit but still did not become reliable at predicting future scientific progress. So having lots of scientific knowledge inside a model does not automatically make it a good scientific forecaster. ---- Paper Link – arxiv. org/abs/2605.22681 Paper Title: "Forecasting Scientific Progress with AI"

译一项对4,760个科学事件的研究发现，AI模型在“解释”科学方面优于“预测”科学。模型在识别可能的研究路径（尤其是选择题形式）时表现较好，但在预测科学发现是否会实际发生、何时发生以及何种方法有效等更难任务上表现薄弱，准确率接近随机猜测。即使提供额外历史信息，模型改善有限。这表明，模型内嵌大量科学知识并不等同于具备可靠的科学预见能力。研究论文发表于arXiv（2605.22681），标题为《Forecasting Scientific Progress with AI》。

Microsoft Research@MSFTResearch · 6月3日72

Weather forecasts thousands of times faster than traditional supercomputers. Hear from Kenji Takeda on Aurora at the Microsoft Research Lab at #MSBuild. Learn more: https://msft.it/6018vjGUA

译天气预报速度比传统超级计算机快数千倍。听听Kenji Takeda在#MSBuild微软研究实验室关于Aurora的分享。了解更多：https://msft.it/6018vjGUA

AK@_akhaliq · 6月3日62

GPU Forecasters Language Models as Selective Surrogates for Kernel Runtime Optimization

译GPU预测器大语言模型作为内核运行时优化的选择性代理

AK@_akhaliq · 6月3日60

Seeing Isn't Knowing Do VLMs Know When Not to Answer Spatial Questions (and Why)?

译视觉语言模型知道何时不回答空间问题吗（以及为什么）？

AK@_akhaliq · 6月2日62

Crafter A Multi-Agent Harness for Editable Scientific Figure Generation from Diverse Inputs

译Crafter 一个用于从多样化输入生成可编辑科学图表的多智能体框架

elvis@omarsar0 · 6月2日50

// Scaling Behavior of Single LLM-Driven Multi-Agent Systems // Does adding more agents actually make a multi-agent system better? It's possible that collective intelligence emerges from interaction design rather than from agent plurality. This is something important to understand if you are building multi-agent systems. This new study reports that the optimal number of agents depends on the base model's capability and the task type, not on adding more of them. Paper: https://arxiv.org/abs/2606.00655 Learn to build effective AI agents in our academy: https://academy.dair.ai/

译研究探讨添加更多智能体是否提升多智能体系统性能。结论指出，最优智能体数量取决于基础模型的能力和任务类型，而非单纯增加数量。集体智能更可能源于精心的交互设计，而非智能体数量的增多。相关论文："Scaling Behavior of Single LLM-Driven Multi-Agent Systems"。

Rohan Paul@rohanpaul_ai · 6月2日57

This paper proposes a way to predict the cheapest safe AWS spot fleet before launching it. AWS spot machines can be much cheaper, but users usually cannot see the final fleet price across regions before starting, so this paper turns that blind choice into a comparison that can save up to 64%. Spot instances are cheap because they are conditional: the cloud provider can take them back, prices move, and capacity shifts by region. The quiet problem is that AWS helps users launch spot fleets, but not fully see the fleet’s price or best region before launch. The authors build a service that watches how AWS creates these fleets, learns those patterns with time-aware AI models, and then estimates the fleet mix and cost across 9 regions. A user gives the service a target amount of computing power and a placement strategy, and the service returns region-ranked options before anything is launched. They tested it on AWS with fleets up to 1500 virtual CPUs, using 720 test launches after a 90-day monitoring period. The predicted fleet matched AWS exactly in 92.78% of cases, reached 99.79% overall accuracy against AWS behavior, and AWS accepted every recommended fleet. Result is that choosing the best region mattered far more than changing the strategy inside 1 region, with possible savings up to 64%. ---- Paper Link – arxiv. org/abs/2605.22778 Paper Title: "AI-Driven Multi-Region Provisioning for Cloud Services Using Spot Fleets"

译该研究提出了一种AI驱动的服务，用于在启动前预测最便宜且安全的AWS Spot实例舰队。该服务通过时间感知模型学习AWS创建舰队的模式，并估算9个区域的舰队组合与成本，向用户返回排序后的区域选项。测试显示，在最多1500 vCPU的舰队上，预测结果与AWS完全匹配的比例达92.78%，整体准确率为99.79%，且所有推荐舰队均被AWS接受。关键发现是选择最佳区域比在单个区域内调整策略更重要，潜在成本节省最高可达64%。

Ethan Mollick@emollick · 6月2日70

Big paper on AI coding agents using Github & other data The auto-complete tools (Copilot) led to 2.2x more code, local agents like original Claude Code led to 7.4x, & current remote coding agents 17.3x(!) But human bottlenecks in coding means actual releases "only" went up 30%

译关于使用Github及其他数据的AI编程智能体的重要论文自动补全工具（如Copilot）使代码量增加2.2倍，本地智能体（如初版Claude Code）增加7.4倍，而当前远程编程智能体增加17.3倍（！）但编程中的人类瓶颈意味着实际发布量“仅”增加了30%

Rohan Paul@rohanpaul_ai · 6月2日48

A 178 page survey study for refreshing math and generative AI foundations from University of Huddersfield. The Little Book of Generative AI Foundations.

译哈德斯菲尔德大学发布了一份178页的调查研究，旨在更新数学和生成式AI的基础知识。《生成式AI基础小册子》。

elvis@omarsar0 · 6月1日71

Very good advice on self-improving agents. (bookmark it) This is something I am seeing in my own experiments with coding agents and harnesses for long-horizon tasks. What I have found is that stronger models do not always evolve better agents. The current believe in self-evolving agents is that a bigger model writes better prompt and skill edits, so devs put their best model in the evolver seat. New research shows that intuition is mostly wrong. The work separates two abilities that usually get conflated. Producing harness updates stays flat across model capability, so Qwen3.5-9B writes edits roughly as good as Claude Opus 4.6. Benefiting from those updates follows an inverted-U that peaks at mid-tier models, while weak models fail to even activate the edits and strong models have little headroom left. This is important to understand as it tells you where to spend. Put a cheap model on the evolver and your expensive model on the solver, because the gains land solver-side, not evolver-side. Paper: https://arxiv.org/abs/2605.30621 Learn to build effective AI agents in our academy: https://academy.dair.ai/

译该研究指出，在自我改进的AI智能体中，“更强模型总能写出更好进化器提示词”的直觉是错误的。工作区分了两种能力：产生更新的能力在不同模型间趋于平坦，而从更新中受益的能力呈倒U形曲线，在中等模型处达到顶峰。弱模型无法有效激活更新，强模型则因已处性能高位而获益甚微。因此，成本效益最佳的配置是：使用廉价的中等模型担任“进化器”，而将昂贵的强模型用作“求解器”。

Rohan Paul@rohanpaul_ai · 6月1日60

Better AI agent systems scale by remembering useful feedback, not by spending more compute. The simple mistake is to count tokens, calls, or dollars as if they were all evidence. The authors say those numbers miss the real issue, because 2 runs can spend the same budget while only 1 gets feedback that is correct, new, relevant, and remembered. An agent harness is not just a wrapper around a model; it is a feedback machine that decides what to test, what to trust, what to store, and what to ignore. Their answer is Effective Feedback Compute, or EFC, a score that counts feedback only when it teaches the agent something useful and changes later decisions. They also divide EFC by task demand, because a small lookup task and a messy software-repair task need different amounts of helpful feedback before the agent has enough to solve them. They tested this on synthetic tasks, code tasks with executable tests, real benchmark traces, held-out settings, and a new prospective batch, then compared EFC with raw compute and a strong agent-scaling baseline. The main result is that task-normalized EFC predicted failures much better than raw compute, and in 1 matched-budget test, better feedback raised success from 0.27 to 0.90 while cost and tool calls stayed fixed. ---- Link – arxiv. org/abs/2605.29682 Title: "Scaling Laws for Agent Harnesses via Effective Feedback Compute"

译当前AI智能体的扩展方法常错误地将计算资源消耗等同于学习证据。新研究指出，两次运行消耗相同预算，但反馈的有效性可能天差地别。为此，研究提出了“有效反馈计算”（EFC）指标，仅统计那些正确、新颖、相关且被记住、并能改变后续决策的反馈。研究还结合任务需求对EFC进行归一化。实验表明，任务归一化的EFC比原始计算指标更能预测失败。在一项匹配预算测试中，采用更好反馈的方法将任务成功率从0.27提升至0.90，而成本和工具调用次数保持不变。链接：arxiv.org/abs/2605.29682 标题："Scaling Laws for Agent Harnesses via Effective Feedback Compute"