Toward Generalist Autonomous Research via Hypothesis-Tree Refinement

AK@_akhaliq · June 12 · 58

TRL-Bench: Standardizing Cross-Paradigm Representation-Level Evaluation of Tabular Encoders

AK@_akhaliq · June 12 · 61

Redesign Mixture-of-Experts Routers with Manifold Power Iteration

Rohan Paul@rohanpaul_ai · June 11 · 55

The paper argues that sparse autoencoders may not be bad steering tools after all, and that much of the earlier failure may have come from choosing and naming the wrong features. Earlier work made sparse autoencoders look weak because their features were labelled in a way that may not match what those features actually cause inside the model. A sparse autoencoder is a small helper model that breaks an LLM’s hidden activity into many possible “features,” such as a topic, style, or concept. So a sparse autoencoder finds directions inside a model, but an unnamed direction is not yet a usable control knob. The authors replace vague or inherited labels with a supervised pipeline that asks whether one feature’s activity reliably tracks a real label in data. As for the mechanism: if a feature fires on “alcohol,” and forcing that feature upward makes the model talk about alcohol, the label is no longer just descriptive; it has causal weight. The paper also finds that very high sparsity may not be necessary, meaning the feature does not need to be extremely rare to be useful for steering. Also worth noting: both prompting and feature steering are ways to push an LLM toward a desired behavior. Prompting remains stronger because the model was trained to obey prompts, while feature steering is more like pressing directly on the machinery and hoping the rest stays intact. Prompting says “write about alcohol” in the input; feature steering instead turns up the model’s internal “alcohol-related” feature and sees whether the output changes in that direction. ---- Link – arxiv.org/abs/2605.31183 Title: "Steering LLMs? Actually, Sparse Autoencoders can outperform simple baselines"

Summary: The paper argues that sparse autoencoders are not as poor an LLM control tool as previously thought; the failures stemmed from feature labels that did not match what the features actually cause inside the model. The authors propose a supervised pipeline to replace vague labels, verifying whether a feature’s activity truly tracks a data label, so that the feature carries causal weight. For example, forcing the “alcohol” feature upward shifts the model’s output toward alcohol topics. The paper also finds that very high sparsity is not necessary. Compared with feature steering, prompting is stronger (the model is trained to obey prompts), while feature steering is more like pushing directly on the machinery.
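The "turn a feature up" step described above can be sketched as adding a scaled feature direction to a hidden state. This is a minimal illustration, not the paper's pipeline: the vectors, the `alcohol_dir` name, and the choice to steer by a fixed `strength` along a unit-normalized direction are all assumptions made for the example.

```python
import math

def feature_activation(hidden, direction):
    """Projection of a hidden state onto the unit-normalized feature direction."""
    norm = math.sqrt(sum(d * d for d in direction))
    return sum(h * d for h, d in zip(hidden, direction)) / norm

def steer(hidden, direction, strength):
    """Push the hidden state `strength` units along the feature direction."""
    norm = math.sqrt(sum(d * d for d in direction))
    return [h + strength * d / norm for h, d in zip(hidden, direction)]

# Made-up numbers: a 3-dimensional "hidden state" and one feature direction.
hidden = [0.5, -1.0, 2.0]
alcohol_dir = [3.0, 0.0, 4.0]   # hypothetical decoder direction for one feature

steered = steer(hidden, alcohol_dir, strength=2.0)
before = feature_activation(hidden, alcohol_dir)
after = feature_activation(steered, alcohol_dir)
print(f"activation before={before:.2f}, after={after:.2f}")
```

Whether the label is causal is then an empirical question: the steered model's output has to move toward the labelled topic, which this toy cannot show.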

Rohan Paul@rohanpaul_ai · June 11 · 63

LLM judges can change their safety verdict when the same answer is translated or rewritten. The problem is that many AI teams now use LLMs to judge whether another model’s answer is safe, but safety is not always a simple yes or no question. Those judges can be shaky exactly where careful judgment matters most. The paper proposes a stress test where the same basic answer is shown to judges after translation or rewriting, then the researchers check whether the judges still give the same safety verdict. The judges do better when harm is obvious, as in violent or extremist content, because the cues are loud and familiar. They become much weaker when safety depends on context, judgment, and regulation, as in financial advice, creditworthiness, or culturally sensitive responses. They also disagreed with each other a lot, and high raw agreement sometimes hid weak real reliability because many judges kept choosing the same label by default. ---- Link – arxiv.org/abs/2605.31381 Title: "LLM Judges Inconsistently Disagree Across Safety Criteria and Harm Categories"

Summary: A new study finds that "LLM safety judges", LLMs used to judge whether another model’s answer is safe, are seriously unstable: after the same answer is translated or rewritten, a judge may return a different safety verdict. Judges do better on obvious harms such as violent or extremist content, but reliability drops markedly where the verdict depends on context, as in financial advice, credit assessment, or culturally sensitive responses. Different judges also often disagree, and high raw agreement sometimes masks low real reliability because many judges default to the same label. The paper is titled "LLM Judges Inconsistently Disagree Across Safety Criteria and Harm Categories".
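The point about raw agreement hiding weak reliability can be made concrete with a chance-corrected statistic. The sketch below uses Cohen's kappa as one standard choice; the post does not say which statistic the paper reports, and the verdict lists are invented for illustration.

```python
from collections import Counter

def raw_agreement(a, b):
    """Fraction of items on which two judges give the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Agreement corrected for what two judges would match on by chance."""
    n = len(a)
    observed = raw_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[l] * count_b[l] for l in set(a) | set(b)) / (n * n)
    if expected == 1.0:          # both judges always emit one identical label
        return 1.0
    return (observed - expected) / (1.0 - expected)

# Invented verdicts: both judges say "safe" almost by default.
judge_a = ["safe"] * 8 + ["unsafe", "safe"]
judge_b = ["safe"] * 8 + ["safe", "unsafe"]

print(raw_agreement(judge_a, judge_b))   # high raw agreement
print(cohens_kappa(judge_a, judge_b))    # at or below zero once chance is removed
```

Here the judges agree on 8 of 10 items, yet they never agree on which answer is unsafe, so the chance-corrected score is slightly negative.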

Rohan Paul@rohanpaul_ai · June 11 · 67

Today’s frontier agents are far less ready for real-world automation than their benchmark scores suggest. This paper proposes Agents’ Last Exam, a benchmark that asks AI agents to finish real expert work, and today’s agents mostly fail. Even today’s strong agents are nowhere near reliable on the hardest real workflows, which means benchmark success has not yet become broad workplace capability. So this paper shifts the question from “can AI answer hard questions?” to “can AI complete real work that people get paid to do?” Most of today’s AI benchmarks show impressive scores, but they do not prove that agents can finish useful work in real jobs. Agents’ Last Exam tries to fix this by testing agents on long tasks from 55 digital work areas, including engineering, finance, medicine, law, media, and science. The tasks come from experts’ real completed projects, and the agent must use normal computer tools like files, browsers, command lines, and desktop software to produce a finished result. The authors tested many current agent systems and models, then scored their finished work with automatic checks or strict rubrics instead of loose human opinions. The main result is that today’s best systems still struggle badly, with an average full pass rate of only 2.6% on the hardest tier. ---- Link – arxiv.org/abs/2606.05405 Title: "Agents' Last Exam"

Summary: A new paper proposes the “Agents’ Last Exam” benchmark, which tests AI agents’ ability to complete real expert work. Tasks come from real projects in 55 digital work areas including engineering, finance, medicine, law, media, and science, and require agents to use ordinary tools such as files, browsers, command lines, and desktop software to produce deliverable results. Evaluation uses automatic checks or strict rubrics. The strongest current agents reach an average full pass rate of only 2.6% on the hardest tier, far below what their benchmark scores suggest. The paper notes that benchmark success has not yet translated into broad workplace capability.

AK@_akhaliq · June 11 · 53

SCAIL-2: Unifying Controlled Character Animation with End-to-end In-Context Conditioning

Google DeepMind@GoogleDeepMind · June 11 · 64

In Sierra Leone, a surging student population is outpacing available teachers. Our latest research explores how AI can act as a partner to support educators in these environments – amplifying their reach without replacing their essential expertise and skills. 🧵

elvis@omarsar0 · June 10 · 60

// Self-Harness: Harnesses That Improve Themselves // (bookmark this one) Most of the agent scaffolds we rely on today are built once and remain frozen or mostly unchanged. The harness, like the skills, needs to evolve with new models. What if the scaffold rewrites itself? This new work treats the harness (the prompts, tools, and control flow around the model) as a learnable artifact that improves from its own runs rather than staying a fixed wrapper you hand-maintain. The scaffolding becomes the part that compounds, run after run. If you run long-horizon agents, a self-modifying harness turns scaffold upkeep from manual work into something the system earns on its own. Paper: https://arxiv.org/abs/2606.09498 Learn to build effective AI agents in our academy: https://academy.dair.ai/

Summary: Most current agent scaffolds stay static once built. The new Self-Harness work treats the harness (prompts, tools, control flow) as a learnable artifact that improves iteratively from its own runs, rather than a fixed wrapper maintained by hand. For long-horizon agents, a self-modifying harness turns maintenance work into a capability the system acquires on its own. Paper: arxiv.org/abs/2606.09498.

Satya Nadella@satyanadella · June 10 · 62

Today in @naturemethods, we shared research on how AI can help us better understand cell behavior, offering new insights into why cancer medicines do not work the same for everyone. By learning more about cell state (how individual cancer cells respond to their surroundings) we have the potential to match therapies more precisely to each patient and improve outcomes. https://news.microsoft.com/signal/articles/why-dont-cancer-medicines-work-the-same-for-everyone-ex-vivo/

AK@_akhaliq · June 10 · 56

SWE-Explore: Benchmarking How Coding Agents Explore Repositories

AK@_akhaliq · June 10 · 57

On the Geometry of On-Policy Distillation

AK@_akhaliq · June 10 · 66

Latent Spatial Memory for Video World Models

Microsoft Research@MSFTResearch · June 10 · 63

New research in Nature Methods from Project Ex Vivo shows AI models learn more from diverse cell states than from scaled datasets alone, a finding that could reshape how therapies are matched to patients. https://msft.it/6013vgE8l

AK@_akhaliq · June 10 · 51

SpatialWorld: Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks

Tencent Hy@TencentHunyuan · June 9 · 74

🚀Introducing UniRL, an RL infra for unified multimodal models, together with two new RL algorithms: DRPO and Flow-DPPO. One RL loop across diffusion/flow matching models, LLMs/VLMs, and unified multimodal models👇 Code: http://github.com/Tencent-Hunyuan/UniRL (yes, U(you)-ni-(need) RL 😉)

Rohan Paul@rohanpaul_ai · June 9 · 64

Interesting, this paper shows that Transformers may not need separate key and value projections to work well. The paper's design cut the KV cache by 50% in language modeling with only 3.1% higher perplexity, meaning inference memory fell sharply while prediction quality stayed close. A normal attention layer uses a Query to ask what each token needs, a Key to label what each token offers, and a Value to carry the information sent back. Here, the surprising result is that Key and Value can often share the same learned map, because the model can use one representation both as an address and as the content being retrieved. The best variant, Q-K=V, kept Query separate, so attention still had direction: one token can ask a different token for information instead of every relation becoming mirror-like. When stacked with GQA and MQA, the same idea reached 87.5% and 96.9% cache cuts, because it reduces projection storage while those methods reduce stored heads. The weak variant is Q=K-V, because tying Query and Key makes attention too symmetric for causal language, and it gives no KV-cache savings. ---- Link – arxiv.org/abs/2606.04032v2 Title: "Do Transformers Need Three Projections? Systematic Study of QKV Variants"

Summary: The paper systematically studies whether all three QKV projections in Transformer attention are needed and finds that Key and Value can share one projection (the Q-K=V variant), cutting the KV cache by 50% at only 3.1% higher perplexity and sharply lowering inference memory. The best variant keeps Query separate so attention stays directional. Combined with GQA and MQA, it reaches cache reductions of 87.5% and 96.9% respectively. The weak variant, Q=K-V, is ineffective because it makes causal attention too symmetric and saves no cache.
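The three cache figures multiply out cleanly if sharing K and V halves what is stored per head and GQA/MQA reduce the number of stored heads. The head counts below (16 query heads, 4 KV heads for GQA, 1 for MQA) are assumptions picked because they reproduce the quoted percentages; the post does not state the actual configuration.

```python
def kv_cache_fraction(num_query_heads, num_kv_heads, share_kv):
    """Fraction of the standard multi-head KV cache that still has to be stored."""
    head_fraction = num_kv_heads / num_query_heads   # GQA/MQA keep fewer heads
    tensor_fraction = 0.5 if share_kv else 1.0       # K=V stores one tensor, not two
    return head_fraction * tensor_fraction

# Assumed configuration: 16 query heads.
for name, kv_heads in [("MHA + K=V", 16), ("GQA + K=V", 4), ("MQA + K=V", 1)]:
    cut = 1.0 - kv_cache_fraction(16, kv_heads, share_kv=True)
    print(f"{name}: {cut:.1%} smaller cache")
```

Under these assumptions the cuts come out to 50%, 87.5%, and 96.875%, which rounds to the 96.9% quoted above.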

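meng shao@shao__meng · June 9 · 72

Cognition has launched "FrontierCode", raising the bar for coding evaluation from "works" to high-quality and mergeable! Top 2 in the results: Claude Opus 4.8 and GPT-5.5 https://cognition.ai/blog/frontier-code
FrontierCode scale and structure:
· 150 tasks from 36 flagship open-source repositories
· 20+ maintainers involved, 40+ hours invested per task
· Three nested difficulty tiers: Extended (150) → Main (the 100 hardest) → Diamond (the 50 hardest)
Two core metrics:
· Pass rate: passes all blocker criteria (a hard stop in the maintainers' eyes)
· Score: rubric-weighted score; if any blocker fails, score = 0
Evaluation system: more than unit tests. FrontierCode evaluates mergeability along six dimensions:
· Behavioral correctness: does it solve the problem
· Regression safety: does it break existing functionality
· Mechanical cleanliness: do build / lint / style pass
· Test quality: do the agent-written tests actually exercise the behavior
· Scope discipline: does it change only what should change
· Code quality: style, design patterns, readability, repository conventions
Three newer grading methods:
· Reverse-classical: run the agent-written tests on the unfixed base commit, where they must fail, proving the tests are meaningful
· Scope: file boundaries, diff size, semantic locality (e.g. whether only one function was changed)
· Adaptive classical grading (mutagent): use an LLM to lightly adjust the tests or application code to align with the agent's implementation details, allowing multiple legitimate solutions while staying deterministic
Criteria are split into blockers (no merge if failed) and non-blockers (affect the score but are not a veto).
Results: frontier models are still far from saturating the benchmark
· Diamond subset: Claude Opus 4.8: 13.4% score; GPT-5.5: 6.3%; Gemini 3.1 Pro: 4.7%
· Main subset: Opus 4.8: 34.3%
· Extended subset: Opus 4.8: 51.8%
A few points worth noting:
· Diamond is nowhere near maxed out: the strongest model reaches only 13.4%, so the hardest subset still has plenty of headroom
· Large gap between closed and open models: the best open model, Kimi K2.6, reaches only 3.8% on Diamond
· Cost vs capability: GPT-5.5 scores lower than Opus but uses about 1/4 of the tokens, making it more cost-effective

Summary: Cognition released FrontierCode, with 150 tasks (from 36 open-source repositories, 40+ hours per task) in three difficulty tiers: Extended/Main/Diamond. It measures mergeability along six dimensions such as behavioral correctness and regression safety, with Pass rate and Score as metrics. Top scores on the Diamond subset: Claude Opus 4.8 at 13.4%, GPT-5.5 at 6.3%, Gemini 3.1 Pro at 4.7%; on the Main subset Opus 4.8 scores 34.3%. The best open model, Kimi K2.6, reaches only 3.8%. GPT-5.5 uses about a quarter of Opus's tokens, making it more cost-effective.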
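The two metrics described above (a pass requires every blocker; any failed blocker zeroes the score) can be sketched as follows. The criterion names, the weights, and the normalization by total weight are assumptions for illustration, not FrontierCode's published formula.

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    name: str
    weight: float    # hypothetical rubric weight
    blocker: bool    # True: failing this criterion vetoes the merge
    passed: bool

def passes(criteria):
    """Pass only if every blocker criterion passed."""
    return all(c.passed for c in criteria if c.blocker)

def score(criteria):
    """Weighted share of passed criteria, forced to 0 when any blocker fails."""
    if not passes(criteria):
        return 0.0
    total = sum(c.weight for c in criteria)
    return sum(c.weight for c in criteria if c.passed) / total

submission = [
    Criterion("fixes the reported bug", 3.0, blocker=True, passed=True),
    Criterion("no regressions", 2.0, blocker=True, passed=True),
    Criterion("follows repo style", 1.0, blocker=False, passed=False),
]
print(passes(submission), round(score(submission), 3))
```

A failed non-blocker lowers the score but still counts as a pass, while a single failed blocker gives a score of zero regardless of the other criteria.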

Rohan Paul@rohanpaul_ai · June 9 · 65

An AI agent can get better at long tasks without retraining the agent itself, by using a separate small model to clean and organize its context. The approach moves context management outside the agent, so a separate helper can clean up the task history while the main agent stays unchanged. The paper proposes AdaCoM, which is a separate LLM that edits the agent’s working context before the agent takes its next step. AdaCoM places a separate, trained manager between the task history and the frozen agent, so the agent does not need to learn a new memory habit or expose its weights. Before each step, this manager can rewrite, merge, prune, or preserve parts of the running context, then the original agent acts on the cleaned version. That sounds like summarization, but the distinction matters. A summary assumes the right answer is compression, while AdaCoM learns that different agents need different kinds of context to stay competent, because stronger agents can use more raw history while weaker agents need shorter and cleaner notes. They tested AdaCoM on web search and deep research tasks across several agents, and it improved average web search performance by 39%. ---- Link – arxiv.org/abs/2605.30785 Title: "Learning Agent-Compatible Context Management for Long-Horizon Tasks"

Summary: The paper proposes AdaCoM, a separate LLM that edits the agent's working context before each step. It can rewrite, merge, prune, or preserve task history, so the main agent stays frozen with no need to retrain it or expose its weights. Unlike simple summarization, AdaCoM learns that different agents need different kinds of context: strong agents keep more raw history, while weak agents need shorter, cleaner notes. It was tested on web search and deep research tasks and improved average web search performance by 39%.
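The control flow described above, where a manager edits the context before every step of a frozen agent, can be sketched as a small loop. Everything here is a stand-in: `keep_recent` is a hand-written prune rule, not the trained AdaCoM manager, and `agent_step` is a toy callable rather than an LLM.

```python
from typing import Callable, List, Tuple

Context = List[str]

def keep_recent(limit: int) -> Callable[[Context], Context]:
    """Hypothetical manager: preserve the task line, prune all but the newest entries."""
    def manage(context: Context) -> Context:
        if len(context) <= limit:
            return list(context)                      # preserve as-is
        return [context[0]] + context[-(limit - 1):]  # prune the middle
    return manage

def run_agent(task: str, agent_step: Callable[[Context], str],
              manage: Callable[[Context], Context],
              max_steps: int) -> Tuple[Context, List[int]]:
    """The agent never sees the raw history, only the managed context."""
    context, seen_sizes = [task], []
    for _ in range(max_steps):
        context = manage(context)          # manager edits context first
        seen_sizes.append(len(context))
        action = agent_step(context)       # frozen agent acts on the cleaned view
        context.append(action)
        if action == "DONE":
            break
    return context, seen_sizes

final_context, sizes = run_agent(
    "find the release date", lambda ctx: "search", keep_recent(3), max_steps=6)
print(sizes)
```

Swapping in a different `manage` function changes what the agent sees without touching the agent, which is the separation the post describes.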

elvis@omarsar0 · June 9 · 62

New paper on how AI agents are reshaping knowledge work. This is a nice economic read on where agents actually change knowledge work. (bookmark it) It studies agent adoption across three dimensions: autonomy, efficiency, and the scope of tasks workers hand off. The friction people keep hitting with agents is rarely model quality. It is that almost nobody has been taught how to work this way. Paper: https://arxiv.org/abs/2606.07489 Learn to build effective AI agents in our academy: https://academy.dair.ai/

Summary: A new paper analyzes how AI agents are reshaping knowledge work along three dimensions: autonomy, efficiency, and the scope of tasks workers hand off. It finds that the main obstacle to using agents today is not model quality but that almost nobody has been trained to work this way.

Rohan Paul@rohanpaul_ai · June 9 · 63

This paper proposes a new test to see whether AI agents truly get better as they gain experience, and finds they mostly still confuse memory with learning. It shows that simple full-context learning beats the more specialized memory systems, with Claude Sonnet 4.6 using plain context getting the best overall score. That distinction matters because the next wave of AI is not supposed to answer isolated prompts. It is supposed to live inside codebases, databases, markets, sensors, clinics, and workflows where yesterday’s mistake should make tomorrow’s action sharper. The authors build CL-BENCH, a benchmark where an agent works through connected tasks in 6 domains: coding, databases, forecasting, radio signals, poker, and disease studies. Each task hides a pattern the agent can learn over time, like a database layout, a codebase structure, or an opponent’s strategy, so better performance should come from experience rather than pretraining. They test frontier LLM systems with simple full-context memory, scratchpad notes, retrieval memory, playbook-style memory, and coding-agent setups. The key finding is that current memory-heavy AI agents are not reliably better learners than just keeping the full conversation in context. That means long-running AI agents still need better ways to remember useful lessons, forget stale ones, and adapt when the environment changes. ---- Link – arxiv.org/abs/2606.05661 Title: "Continual Learning Bench: Evaluating Frontier AI Systems in Real-World Stateful Environments"

Summary: The paper builds the CL-BENCH benchmark to evaluate AI agents' continual learning across 6 domains: coding, databases, forecasting, radio signals, poker, and disease studies. Each task hides a pattern that can be learned over time, testing whether the agent can go beyond its pretrained knowledge. Frontier LLM systems are tested with full-context memory, scratchpad notes, retrieval memory, playbook-style memory, and coding-agent setups; the finding is that current memory-heavy AI agents are not reliably better than simply keeping the full conversation in context. Claude Sonnet 4.6 with plain context gets the best overall score. The paper notes that agents still need better ways to remember useful experience, forget stale information, and adapt to environment changes.

Perplexity@perplexity_ai · June 9 · 76

We published new research with Harvard on the shift from chat interfaces to autonomous agents like Computer. Over 3 months, findings show workers using Computer finish tasks in 87% less time at 94% lower cost than Search alone, with higher satisfaction. https://research.perplexity.ai/articles/how-ai-agents-reshape-knowledge-work

Tencent Hy@TencentHunyuan · June 8 · 69

Can AI truly edit audio, not just generate it? 🎧 Tencent Hy, in collaboration with SJTU, SII, NTU, TJU, ZODA, PKU, FDU, and other collaborators, introduces MMAE. MMAE (A Massive Multitask Audio Editing Benchmark) is the first comprehensive evaluation benchmark for speech and audio "Banana🍌". Instead of simply requiring the AI to "generate" audio, it demands that the AI understand an existing audio clip and precisely modify it according to natural language instructions, altering what needs to be changed while leaving the rest untouched. Current models show an Exact Match Rate (EMR) below 5%, revealing a major gap in reliable audio editing.
MMAE includes:
✅ 2,000 high-fidelity samples from real-world scenarios
✅ 17,741 fine-grained rubric evaluation items
✅ 7 modality settings across sound, music, speech and their mixtures
✅ 6 task complexity levels, from basic modifications to multi-hop reasoning and multi-round editing
✅ 8 operation types across local and global granularities
How to use:
arXiv: http://arxiv.org/abs/2606.07229
GitHub: https://github.com/ddlBoJack/MMAE
HuggingFace: https://huggingface.co/datasets/BoJack/MMAE
Demo: https://youtu.be/6At5nTWhlXI

Summary: Tencent Hunyuan, together with Shanghai Jiao Tong University, Nanyang Technological University, and other institutions, introduces MMAE (Massive Multitask Audio Editing Benchmark), the first benchmark to comprehensively evaluate AI speech/audio editing. MMAE requires a model to understand existing audio and modify it precisely according to natural language instructions, rather than simply generate. Current models reach an Exact Match Rate (EMR) below 5% on the benchmark, exposing a weakness in reliable audio editing. MMAE contains 2,000 high-fidelity real-world samples and 17,741 fine-grained evaluation items, covering 7 modality settings (sound, music, speech, and mixtures), 6 task complexity levels (basic modification through multi-hop reasoning and multi-round editing), and 8 operation types (local to global). The paper, code, dataset, and demo are public.

Rohan Paul@rohanpaul_ai · June 8 · 60

Great Stanford + MIT + Harvard + Anthropic paper. It gives a clear training-based reason for why larger models learn abilities smaller models miss. It says bigger AI models learn rare skills because they forget them less during training: their extra space protects weak learning signals. The authors say the issue is not just whether a small model could represent the task, but whether training lets it keep that task while many common tasks keep pushing on the same limited parts. Their core idea is that common tasks take up the model’s neurons first, so rare tasks get overwritten before they appear often enough to build into stable knowledge. In a crowded data mixture, common patterns get first claim on the model’s internal machinery. Small models may briefly pick up a rare signal, but the next wave of common-task updates overwrites it before the signal appears again. They tested this first with controlled toy tasks where they could change how rare and complex each task was, then with OLMo language models from 4M to 4B parameters. The main result is that bigger models learned low-frequency tasks much better, kept more task features inside their representations, and showed less gradient interference, which means common-task updates disturbed rare-task learning less. Larger models can remember weak rare signals long enough to turn them into real learned skills. ---- Link – arxiv.org/abs/2605.29548 Title: "Why Larger Models Learn More: Effects of Capacity, Interference, and Rare-Task Retention"

Summary: The paper argues that larger models learn rare skills because they forget less during training; their extra capacity protects weak learning signals. Core mechanism: common tasks claim neurons first, and rare tasks are overwritten before they appear often enough to form stable knowledge. A small model may briefly capture a rare signal, only for the next wave of common-task updates to overwrite it. Experiments with OLMo language models (4M–4B parameters) confirm this: larger models do better on low-frequency tasks, retain more task features, and show less gradient interference from common-task updates on rare tasks. The authors stress that the question is not only whether a small model can represent a task, but whether a rare task can persist under repeated pressure from many common tasks during training.
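"Gradient interference" can be made concrete as the cosine similarity between a common-task gradient and a rare-task gradient: a negative value means a step on the common task undoes progress on the rare one. The post does not say how the paper measures interference, so this is one common proxy, and the gradient vectors are invented.

```python
import math

def gradient_cosine(g_common, g_rare):
    """Cosine similarity between two gradient vectors of equal length."""
    dot = sum(a * b for a, b in zip(g_common, g_rare))
    norm_c = math.sqrt(sum(a * a for a in g_common))
    norm_r = math.sqrt(sum(b * b for b in g_rare))
    return dot / (norm_c * norm_r)

# Toy picture: with one shared parameter the two tasks must fight over it;
# with two parameters each task can claim its own direction.
small_model = gradient_cosine([1.0], [-1.0])            # fully opposed
larger_model = gradient_cosine([1.0, 0.0], [0.0, 1.0])  # orthogonal, no conflict
print(small_model, larger_model)
```

The toy mirrors the capacity argument above only loosely: extra dimensions give rare-task updates somewhere to live that common-task updates do not overwrite.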

Rohan Paul@rohanpaul_ai · June 8 · 56

Strong AI agents still struggle with long research work because they often fail to keep testing and improving. A new paper from Stanford, MIT, NVIDIA, Google, and other top labs shows that today’s strongest research agents win less by brilliance than by refusing to stop testing. The paper proposes AutoLab, a benchmark with 36 tasks where each agent starts from working but weak code and must make it better within a fixed time limit. The tasks cover system speedups, puzzles, model development, and CUDA kernel work, so the test is not just about writing code once but about managing a long work session. The authors tested 17 strong models and found that the best results did not mainly come from the first idea being good, but from the model staying active, testing often, and using feedback well. The best first idea was not the strongest predictor of success; persistence was. Claude Opus 4.6 led the benchmark not because it always guessed the right move immediately, but because it kept benchmarking and folding empirical feedback into the next attempt. Several other frontier models failed in a more revealing way: they either quit early with time left on the clock, or thought so long that they ran out of time before submitting anything useful. ---- Link – arxiv.org/abs/2606.05080 Title: "AutoLab: Can Frontier Models Solve Long-Horizon Auto Research and Engineering Tasks?"

Summary: Stanford, MIT, NVIDIA, Google, and other top labs jointly propose AutoLab, a new benchmark with 36 tasks. In each task the agent starts from working but weak code and must improve it iteratively within a fixed time limit. Tasks cover system speedups, puzzles, model development, and CUDA kernels. Results across 17 frontier models show that success depends less on how good the first solution is than on continuing to test, experimenting often, and using empirical feedback. Claude Opus 4.6 leads the benchmark through persistent iteration rather than initial judgment, while other frontier models either quit early or think so long that they run out of time.

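meng shao@shao__meng · June 8 · 64

Is AGENTS.md actually useful for coding agents? This paper is a large-scale empirical study of how repository-level context files (AGENTS.md, CLAUDE.md, etc.) affect coding agents in practice, and the results may be somewhat counterintuitive! Thanks to @rasbt for sharing! Paper: https://arxiv.org/abs/2602.11988
Background: practice first, evidence lagging
AGENTS.md has become an industry convention: 60,000+ repositories on GitHub have adopted it, and agents such as Claude Code (CLAUDE.md), Codex, and Qwen Code ship a built-in /init that generates one automatically. But earlier research mostly stopped at content classification and descriptive statistics, without rigorously evaluating task completion rates. The core difficulty: the mainstream benchmark SWE-bench draws on well-known repositories such as Django and Flask, which have no developer-written context file to begin with, so the practice's real value cannot be evaluated directly.
Experimental design: two benchmarks, three conditions, four agents
· Benchmarks: SWE-bench Lite (300 tasks, 11 popular Python repositories) + the newly built AGENTBENCH (138 tasks, 12 lesser-known repositories that already contain a developer context file)
· Three conditions: (1) no context file (2) LLM-generated (each agent's official /init flow) (3) developer-written (AGENTBENCH only)
· Agents/models: Claude Code + Sonnet 4.5, Codex + GPT-5.2 / GPT-5.1 mini, Qwen Code + Qwen3-30B
· Metrics: task success rate, number of steps, inference cost, tool-call trajectories
Core findings: weak effect, significant cost
1. Success rate: marginal effect, even negative
· LLM-generated: success dropped in 5 of 8 settings, averaging -0.5% (SWE-bench) / -2% (AGENTBENCH)
· Developer-written: +4% on average, better than LLM-generated, but on Claude Code even worse than no file
· Conclusions are robust across models and prompts
In one sentence: auto-generated context files are not only unhelpful but may be slightly harmful; the gain from hand-written ones is also limited.
2. Efficiency: no file is actually the cheapest (steps, cost)
· LLM-generated: +2.45 / +3.92 steps, +20% / +23%
· Developer-written: +3.34 steps, up to +19%
3. Codebase overviews are almost useless
Context files are often recommended to "help the agent locate code quickly". In practice, with or without a context file, the number of steps before the agent first touches a relevant file shows no significant difference. 95–100% of LLM-generated files include a codebase overview, but it helps little with navigation.
Trajectory analysis: agents obey, but obedience is expensive
The paper rules out the hypothesis that "agents ignore the context file". Trajectory analysis shows:
· High instruction compliance: when the context file mentions uv, usage rises from <0.01 per task to 1.6; when it mentions repository-specific tools, from <0.05 to 2.5
· More "diligent" behavior: more testing, more file search/reading, more lint/quality checks
· Deeper reasoning: GPT-5.2 reasoning tokens increase by 14–22%
Mechanism chain: context file adds extra requirements → agent complies more strictly (testing, exploration, specific tools) → steps and cost rise → success rate does not rise in step (or even worsens)
Context files are not ignored but over-executed: a "suggested workflow" is treated as a "mandatory checklist", which adds task complexity without buying a higher success rate.
A key twist: the documentation redundancy hypothesis
When all other documentation in the repository (.md, docs/, example code) is removed, LLM-generated context files instead bring a +2.7% improvement and outperform developer-written ones. This suggests:
· In well-documented repositories, the context file is highly redundant with the README and docs
· Developers' anecdotal "the agent got stronger after adding AGENTS.md" is likely because the target repository itself had sparse documentation, and the context file filled an information vacuum
· For well-documented, well-known projects such as Django, the value of extra context is diluted
Ablations: the ceiling on generation quality
· A stronger generating model ≠ better context: files generated by GPT-5.2 are slightly better on SWE-bench (+2%) but worse on AGENTBENCH (-3%)
· No prompt has a consistent advantage: Codex prompt vs Claude prompt effects vary by dataset, with small differences
The room for improving auto-generated context files looks limited for now.
Practical advice
· Relying on /init auto-generation: be cautious; it slightly lowers success on average and adds 20%+ cost
· Long architecture overviews, directory listings: avoid; they are redundant with code exploration and do not speed up navigation
· Test/lint/build commands: keep them lean; the agent will follow them strictly, but too many requirements drive up cost
· Repository-specific tools (uv, pdm, etc.): worth writing down; compliance is high and they are hard to infer from code
· Layered/on-demand references: the right direction; "read Y.md when doing X, otherwise ignore it" reduces irrelevant load

Summary: The paper is a large-scale empirical test of how repository-level context files such as AGENTS.md affect coding agents. It tests combinations of Claude Code, Codex, Qwen Code, and others on SWE-bench Lite (300 tasks) and the newly built AGENTBENCH (138 tasks). Core findings: LLM-generated context files lowered success in 5 of 8 settings, averaging -0.5% (SWE-bench) / -2% (AGENTBENCH), while adding 20%+ cost; developer-written files gained only +4% on average. Redundancy hypothesis: with other documentation removed, auto-generated files instead gain +2.7%. Advice: avoid auto-generation, keep test/lint commands lean, and prioritize writing down repository-specific tools.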

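AYi@AYi_AInotes · June 8 · 62

Google researchers have found a technique that compresses AI memory dramatically, making it easier to run large models locally on your own data. In other words, the vector store for 10 million documents can be compressed from 31GB of memory down to just 4GB, and search is also faster than FAISS, the most commonly used option today.

Summary: Google proposes an AI memory compression technique that shrinks the vector store for 10 million documents from 31GB of memory to just 4GB, with search faster than the widely used FAISS. The technique makes it more practical to run large language models locally on personal data.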
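A quick back-of-the-envelope check shows what kind of compression would produce numbers in this range. The 768-dimensional float32 embeddings and the 4-bit-per-dimension codes below are assumptions chosen because they land near 31GB and 4GB; the post gives neither the dimension nor the method, so this is not a description of Google's technique.

```python
def index_size_gb(num_vectors, dims, bits_per_dim):
    """Raw storage for the vectors alone, in decimal gigabytes (no index overhead)."""
    return num_vectors * dims * bits_per_dim / 8 / 1e9

NUM_DOCS = 10_000_000
DIMS = 768                      # assumed embedding width

full = index_size_gb(NUM_DOCS, DIMS, bits_per_dim=32)   # float32
small = index_size_gb(NUM_DOCS, DIMS, bits_per_dim=4)   # assumed 4-bit codes
print(f"float32: {full:.2f} GB, 4-bit: {small:.2f} GB, ratio {full / small:.0f}x")
```

Under these assumptions the sizes are about 30.7GB and 3.8GB, an 8x reduction, which is the same order as the figures quoted in the post.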

Rohan Paul@rohanpaul_ai · June 8 · 49

A primer paper about how reasoning models improve after training. It shows that better reasoning models depend less on raw data size and more on checkable training evidence. Reasoning data is NOT simple question-and-answer pairs. The useful part is often the feedback that says why an answer, step, tool action, or full attempt was good or bad. A prompt and a response tell you what a model said, but not why that answer became learnable, which judge blessed it, which failures were hidden, or whether the skill was already inside the base model. The core idea is to describe each training example as a record that includes the task, the model’s behavior, the checking signal, and metadata about where it came from. The authors sort reasoning data by how it can be checked, such as exact rule-based checks for math and code, environment checks for agents using tools, and human or model judgments when no exact checker exists. They also explain why common assumptions fail, because long reasoning traces may be fake, harder examples may be useless for some models, and larger datasets may still miss important coverage. The key point is that agent data should preserve mess: failed actions, retries, recoveries, state differences, and terminal checks, because that is where learning signal often lives. ---- Link – arxiv.org/abs/2606.02113 Title: "A Primer in Post-Training Reasoning Data: What They Know About How It Works"

Summary: The paper argues that better reasoning models depend more on verifiable training evidence than on raw data scale. The key to reasoning data is not simple question-answer pairs but the feedback signal that judges whether an answer, step, tool action, or full attempt was good or bad. Each training example should be described as a record containing the task, the model's behavior, the checking signal, and metadata. The authors classify data by how it is checked: exact rules for math and code, environment checks for tool-using agents, and human or model judgment when no exact checker exists. Common pitfalls: long reasoning traces may be fake, harder examples may be useless for some models, and larger datasets may still miss key coverage. Agent data should keep the "mess" of failed actions, retries, recoveries, state differences, and terminal checks, because the learning signal often lives there.
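The record described above (task, behavior, checking signal, metadata) maps naturally onto a small data structure. The field names and the three `CheckKind` values below follow the post's wording but are otherwise assumptions; the paper may define its schema differently.

```python
from dataclasses import dataclass, field
from enum import Enum

class CheckKind(Enum):
    RULE = "rule"                # exact checker, e.g. math answer or unit test
    ENVIRONMENT = "environment"  # outcome observed in a tool or agent environment
    JUDGE = "judge"              # human or model judgment

@dataclass
class ReasoningRecord:
    task: str
    behavior: str                # what the model actually produced
    check_kind: CheckKind
    check_passed: bool
    metadata: dict = field(default_factory=dict)   # provenance, retries, etc.

def exact_match_record(task, answer, expected, source):
    """Build a record whose checking signal is a rule-based exact match."""
    passed = answer.strip() == expected.strip()
    return ReasoningRecord(task, answer, CheckKind.RULE, passed, {"source": source})

record = exact_match_record("What is 2 + 2?", " 4 ", "4", source="toy-example")
print(record.check_kind.value, record.check_passed, record.metadata)
```

Keeping failed attempts is then just a matter of storing records with `check_passed=False` instead of filtering them out.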

Rohan Paul@rohanpaul_ai · June 7 · 62

Great idea for self-evolving AI scientists from this new MIT paper. It tries to make an AI scientist notice when its current way of thinking is too small, then add new scientific concepts instead of merely searching harder. The problem is that most AI science systems still search inside a fixed setup, even when real science sometimes needs new kinds of variables, tools, tests, or claims. The paper’s core idea is to make every data point, model, tool output, failure, and claim a typed artifact, where typed means the system records what kind of thing it is and how it was produced. Then the system can tell the difference between retrieval, which adds known things, search, which explores a fixed setup, and discovery, which changes the setup itself. So novelty for AI scientists is not defined by surprise, fluency, or benchmark gain, but by what could not be expressed inside the previous schema. A serious attempt to formalize something most AI systems still fake: the difference between finding an answer inside a language and earning the right to change the language. ---- Link – arxiv.org/abs/2606.01444 Title: "Self-Revising Discovery Systems for Science: A Categorical Framework for Agentic AI"

Summary: The MIT paper (F.Y. Wang & M.J. Buehler, arXiv:2606.01444, 2026) proposes the Self-Revising Discovery Systems framework, which lets an AI scientist recognize on its own when its current way of thinking falls short and add new scientific concepts rather than just searching harder. The system treats data, models, tool outputs, failures, and claims as typed artifacts (typed provenance), which separates three modes: retrieval (adding known objects), search (exploring a fixed schema), and discovery (verifiable schema change). The paper defines genuine novelty mathematically via the Kan obstruction and the left Kan extension: it is quantified by the pointwise residual left after old evidence is transported, making novelty objectively measurable. Case studies include a Builder/Breaker model discovering mode-conditioned compliance in proteins, and CategoryScienceClaw discovering a stiffness rule for anisotropic fiber networks.
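The retrieval / search / discovery distinction can be illustrated with a deliberately crude toy: an artifact records its kind and provenance, and a step counts as "discovery" only when its kind cannot be expressed in the current schema. This is an assumption-laden simplification; the paper's actual definition is categorical (Kan extensions and residuals) and is not reproduced here.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Artifact:
    kind: str          # what kind of thing it is, e.g. "measurement"
    value: str
    produced_by: str   # provenance: which tool or agent produced it

def classify(schema_kinds, known, artifact):
    """Toy rule: outside the schema = discovery, already known = retrieval, else search."""
    if artifact.kind not in schema_kinds:
        return "discovery"   # the schema itself would have to change
    if artifact in known:
        return "retrieval"   # adds a known thing
    return "search"          # new result inside the fixed setup

schema = {"measurement", "model"}
known = {Artifact("measurement", "stiffness=3", "lab")}

for art in [Artifact("measurement", "stiffness=3", "lab"),
            Artifact("model", "fit-v2", "agent"),
            Artifact("verifier", "new-symmetry-test", "agent")]:
    print(art.kind, "->", classify(schema, known, art))
```

The toy skips the part that makes the paper's notion non-trivial, namely verifying that the schema change is warranted by the evidence.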

elvis@omarsar0 · June 6 · 65

// Continual Learning Bench // Continual learning is one of the research areas attracting a lot of investment. While there are many efforts, there is very little progress in measuring it. So the big question is, do dedicated memory systems actually make agents learn from experience? Continual Learning Bench says not yet. Across six expert-validated domains with shared learnable structure, naive in-context learning outperforms systems purpose-built for memory management. CL-Bench introduces a gain metric that isolates genuine learning from prior capability, then shows agents frequently overfit to immediate observations or fail to reuse knowledge across instances. If a plain ICL baseline beats your memory architecture, the architecture is adding overhead rather than learning. Paper: https://arxiv.org/abs/2606.05661 Learn to build effective AI agents in our academy: https://academy.dair.ai/

Summary: Continual learning attracts heavy investment but progress is slow. CL-Bench (Continual Learning Bench) tests six expert-validated domains with shared learnable structure and finds that a simple in-context learning (ICL) baseline beats systems purpose-built for memory management. The benchmark introduces a gain metric to isolate genuine learning, and the results show agents often overfit to immediate observations or fail to reuse knowledge across instances. The takeaway: if a plain ICL baseline beats your memory architecture, the architecture is adding overhead rather than learning. Paper: arxiv.org/abs/2606.05661.

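meng shao@shao__meng · June 6 · 59

Zero Trust Security for AI Agents: An Enterprise Framework for Deploying Autonomous AI Agents
An official Anthropic white paper released in May: when enterprises deploy autonomous AI agents, traditional perimeter security is not enough, and zero-trust principles must be extended into the agent architecture itself.
The report opens by pointing to a dual acceleration:
· Infrastructure level: frontier AI models compress the "vulnerability discovery → exploitation" cycle from months to hours, and attacks cost very little.
· Agent level: agents can autonomously interpret goals, choose tools, and execute multi-step operations. Traditional access control cannot stop "doing harm within legitimate permissions", and monitoring must also face a new kind of attack that relies on persistent manipulation rather than vulnerabilities.
The report therefore argues that future advantage depends not on who uses the most advanced AI but on whose security foundations are solid enough, with agents designed from day one as if "already compromised".
Three zero-trust principles (and one design test)
Three principles:
· Never trust, always verify: internal and external requests are treated alike, and every access requires authentication and authorization
· Assume breach: the focus is not "keeping intruders out" but limiting the blast radius after a single point fails
· Least privilege: grant only the minimum access needed to complete the task
One design test: does this control make the attack impossible, or merely more troublesome?
The five parts of the report:
Part I: Why are agents a new security object?
Part II: The current threat landscape (OWASP perspective)
Part III: A three-tier capability maturity model (the core of the report)
Part IV: An eight-stage implementation workflow
Part V: Defensive operations must keep pace with autonomous threats
White paper: https://cdn.prod.website-files.com/6889473510b50328dbb70ae6/6a1611a04085d7cd3dadc924_Claude-eBook-Zero-Trust-for-AI-Agents-05182026.pdf
Video version 🔽🔽🔽

Summary: Anthropic released a white paper in May arguing that enterprises deploying autonomous AI agents must extend zero-trust principles into the agent architecture. The report points to a dual acceleration: frontier models compress the cycle from vulnerability discovery to exploitation down to hours; and agents can autonomously interpret goals, choose tools, and execute multi-step operations, so traditional access control cannot stop "doing harm within legitimate permissions". Core principles: never trust and always verify, assume breach, least privilege; plus a design test: does the control make the attack impossible, or merely more troublesome? The report has five parts: why agents are a new security object, the threat landscape, a three-tier capability maturity model, an eight-stage implementation workflow, and defensive operations that keep pace with autonomous threats.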
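"Never trust, always verify" and "least privilege" translate into a check that runs on every tool call and denies by default. The `Grant` structure, the agent and tool names, and the prefix-based resource scoping below are all hypothetical; the white paper's own controls are not described in the post.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Grant:
    agent_id: str
    tool: str
    resource_prefix: str   # narrowest scope the task needs

def authorize(grants, agent_id, tool, resource):
    """Deny by default: allow only when an explicit grant covers this exact call."""
    return any(
        g.agent_id == agent_id
        and g.tool == tool
        and resource.startswith(g.resource_prefix)
        for g in grants
    )

# Hypothetical policy: the report-writing agent may only read one directory.
grants = [Grant("report-agent", "read_file", "/data/reports/")]

print(authorize(grants, "report-agent", "read_file", "/data/reports/q2.csv"))   # allowed
print(authorize(grants, "report-agent", "write_file", "/data/reports/q2.csv"))  # tool not granted
print(authorize(grants, "report-agent", "read_file", "/etc/passwd"))            # out of scope
```

Applied to the design test above: a missing grant makes the call impossible, rather than merely harder, as long as every tool call passes through this check.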

SemiAnalysis@SemiAnalysis_ · June 6 · 61

Sequential Monte Carlo speculative decoding from @makora_ai keeps multiple draft tokens alive in parallel instead of rewinding failed matches.
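For readers unfamiliar with Sequential Monte Carlo, one generic SMC step looks like this: every particle (a partial sequence) is extended with a draft-model token, reweighted by the target/draft probability ratio, and the population is resampled instead of any particle being rewound. This is a textbook-style sketch with context-free toy distributions, not @makora_ai's implementation, whose details the post does not give.

```python
import random

def smc_step(particles, weights, draft, target, rng):
    """Extend each particle with a draft token, reweight by target/draft, resample."""
    tokens = list(draft)
    proposals, new_weights = [], []
    for seq, w in zip(particles, weights):
        tok = rng.choices(tokens, weights=[draft[t] for t in tokens])[0]
        proposals.append(seq + [tok])
        new_weights.append(w * target[tok] / draft[tok])   # importance weight
    total = sum(new_weights)
    probs = [w / total for w in new_weights]
    resampled = rng.choices(proposals, weights=probs, k=len(proposals))
    uniform = [1.0 / len(resampled)] * len(resampled)
    return [list(s) for s in resampled], uniform

rng = random.Random(0)
draft = {"a": 0.5, "b": 0.5}     # toy draft-model distribution
target = {"a": 0.9, "b": 0.1}    # toy target-model distribution
particles = [[] for _ in range(8)]
weights = [1.0 / 8] * 8
for _ in range(3):
    particles, weights = smc_step(particles, weights, draft, target, rng)
print(particles)
```

Low-weight continuations tend to die out at resampling while the particle count stays fixed, which is the "keep several drafts alive" behavior the post contrasts with rewinding.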

Rohan Paul@rohanpaul_ai · 6月6日76

Arena just released a real-world agent leaderboard that ranks AI models by how well they complete actual user jobs, not isolated benchmark questions. The system tracks agents using web search, files, and terminal tools while people ask them to write code, build apps, research topics, create documents, and analyze files. The problem with almost all traditional AI benchmarks is that they test clean tasks, while agents now handle messy work like coding, research, documents, web browsing, files, and terminal commands. Agent Arena tries to measure agents inside real work sessions, where users correct them, approve results, complain, download files, and expose tool failures as the task unfolds. Its core idea is to treat each model choice like a test condition, then estimate how much that model improves task outcomes compared with a baseline. The leaderboard combines 5 signals: confirmed task success, praise versus complaint, ability to follow corrections, recovery from terminal errors, and whether the agent invents tools that do not exist. The data is large enough to show real behavior patterns, with 300K+ tasks, 2M+ tool calls, and 40M lines of code produced by agents. The score combines task success, steerability, bash recovery, praise vs. complaint, and tool hallucination, which means the model is judged by whether it finishes, recovers, accepts correction, and avoids fake tool calls. GPT-5.5 High leads with +10.7% net improvement, followed by Claude Opus 4.7 Thinking at +9.5% and GPT-5.4 High at +8.9%. The most useful detail is that agents fail like workers under pressure: they can leave one part incomplete, claim the job is done, or sound confident while backing down after correction. Arena’s strongest contribution is treating agents as working systems, where model choice, tool use, recovery behavior, and user satisfaction all count together.

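Translation: Arena has launched an agent leaderboard based on real user tasks, evaluating models on work such as writing code, building apps, and analyzing documents rather than on isolated benchmarks. The leaderboard draws on 300K+ tasks, 2M+ tool calls, and 40M lines of code, and combines signals including task success, compliance with corrections, error recovery, user praise versus complaint, and tool hallucination. Top three: GPT-5.5 High (+10.7%), Claude Opus 4.7 Thinking (+9.5%), GPT-5.4 High (+8.9%).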
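
The "model choice as a test condition" idea in the post above can be illustrated with a deliberately naive sketch. It assumes a single signal (confirmed task success) and a plain difference in success rates against a baseline. The post does not say how Arena actually computes net improvement or combines its 5 signals, so the `Session` type, the function names, and the sample data here are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Session:
    """One user work session (hypothetical record layout)."""
    model: str     # which model served the session, i.e. the test condition
    success: bool  # whether the user confirmed the task as completed


def success_rate(sessions, model):
    """Fraction of a model's sessions that ended in confirmed success."""
    outcomes = [s.success for s in sessions if s.model == model]
    if not outcomes:
        raise ValueError(f"no sessions recorded for {model!r}")
    return sum(outcomes) / len(outcomes)


def net_improvement(sessions, model, baseline):
    """Success-rate gap versus the baseline, in percentage points.

    Assumption: a simple difference in means. A real leaderboard would
    likely adjust for task mix and combine several signals.
    """
    return 100.0 * (success_rate(sessions, model) - success_rate(sessions, baseline))


# Made-up sessions, purely for illustration.
sessions = (
    [Session("model-a", True)] * 7 + [Session("model-a", False)] * 3
    + [Session("baseline", True)] * 6 + [Session("baseline", False)] * 4
)

print(f"{net_improvement(sessions, 'model-a', 'baseline'):+.1f} points")
```

With this toy data, the model succeeds in 7 of 10 sessions and the baseline in 6 of 10, so the gap is 10 percentage points.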

Chubby♨️@kimmonismus · Jun 6 · 65

AI scientists may be moving from search to real discovery. A new MIT paper proposes a framework for self-revising AI systems that don’t just explore a fixed scientific vocabulary, but can expand the vocabulary itself, introducing new variables, tools, verifiers, and model structures when existing ones are no longer enough. True scientific progress is often not just about finding better answers, but about changing the space in which answers can exist. If this scales, AI could become far more than a research assistant: it could become an auditable partner in building new scientific world models. Still early, but conceptually very exciting.

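Translation: MIT’s Buehler group proposes the Self-Revising Discovery Systems framework, which lets AI expand the scientific vocabulary (variables, tools, verifiers, model structures) on its own rather than only searching a fixed space. The paper formalizes agent workflows using typed copresheaves and Kan obstructions, and argues that genuine discovery is a verifiable schema extension: old evidence is carried over via left Kan extension, and novelty is quantified objectively by pointwise residuals, which separates discovery from search. It distinguishes three modes: retrieval (adding known objects), search (fixed schema), and discovery (verified paradigm shift). Case studies include Builder/Breaker discovering mode-conditional compliance in proteins and CategoryScienceClaw discovering a stiffness rule for anisotropic fiber networks. Paper: arXiv:2606.01444 (2026).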

Emad@EMostaque · Jun 6 · 33

If Claude is good enough for Nobel Prize winners, it is good enough for you. https://arxiv.org/abs/2606.03300

Translation: If Claude is good enough for Nobel Prize winners, it is good enough for you too. https://arxiv.org/abs/2606.03300

Rohan Paul@rohanpaul_ai · Jun 6 · 79

Anthropic’s new chemistry report has a genuinely wild result. Claude Opus 4.7 is now competitive with dedicated NMR software, and the bigger story is that it can work the problem backwards, i.e. infer the molecule from the spectrum. NMR software is the chemist’s expert tool for turning molecular structures into predicted lab spectra. So Opus 4.7 is no longer just “helping chemists read data”: it can work backward from NMR data and propose the molecule’s structure, a task the report says existing mainstream tools generally leave to human chemists. Note that Opus 4.7 is a general-purpose model with no chemistry-specific fine-tuning. It made the smallest hydrogen prediction errors and nearly matched MestReNova on carbon, meaning it can predict NMR signals about as well as specialist chemistry tools. So AI can now handle one of chemistry’s hidden bottlenecks: translating between a molecule, its spectral shadow, and the structure a chemist actually needs to trust.

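Translation: Anthropic’s latest chemistry report shows that Claude Opus 4.7, a general-purpose LLM with no chemistry fine-tuning, matches or even beats the dedicated software MestReNova on NMR spectral analysis, with the smallest hydrogen prediction errors and nearly identical carbon predictions. More importantly, it can work backward from an NMR spectrum to the molecular structure, a task that previously only human chemists could do. This means AI can now handle a key bottleneck in chemistry: translating automatically between molecular structure, spectrum, and final confirmation.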
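
As a rough illustration of what a "prediction error" on NMR signals could mean, the sketch below computes the mean absolute error between predicted and measured chemical shifts in ppm. This is an assumption: the post does not state which error metric the report uses, and the function name and shift values are hypothetical and not taken from the report.

```python
def mean_absolute_error(predicted_ppm, measured_ppm):
    """Mean absolute difference between paired chemical shifts, in ppm.

    Assumes the two lists are already matched peak-for-peak, which in
    practice requires assigning each predicted peak to a measured one.
    """
    if len(predicted_ppm) != len(measured_ppm):
        raise ValueError("predicted and measured shifts must be paired one-to-one")
    if not predicted_ppm:
        raise ValueError("at least one shift is required")
    total = sum(abs(p - m) for p, m in zip(predicted_ppm, measured_ppm))
    return total / len(predicted_ppm)


# Made-up 1H shifts (ppm), purely for illustration.
predicted = [7.30, 3.70, 1.20]
measured = [7.26, 3.65, 1.25]

print(f"1H MAE: {mean_absolute_error(predicted, measured):.3f} ppm")
```

A lower value means the predicted spectrum sits closer to the measured one. Comparing tools this way only makes sense when they are scored on the same molecules and peak assignments.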

Microsoft Research@MSFTResearch · Jun 6 · 60

During the Inside Azure Innovations breakout at Build 2026, Microsoft Azure CTO, deputy CISO and technical fellow Mark Russinovich introduced Project Mosaic, an experimental optical interconnect technology from Microsoft Research Cambridge using micro-LEDs for low-power, high-speed data transmission. A live demo led by senior researcher Kaoutar Benyahya displays individual LED modulation forming letters, proving the concept’s real-time responsiveness. Check out Mark and Kaoutar starting @ 38:38: https://msft.it/6015vdhS9

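Translation: At Build 2026, Microsoft Azure CTO Mark Russinovich introduced Project Mosaic, an experimental optical interconnect technology from Microsoft Research Cambridge that uses micro-LEDs for low-power, high-speed data transmission. Senior researcher Kaoutar Benyahya gave a live demo in which individually modulated LEDs formed letters, showing that the concept responds in real time.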

Chubby♨️@kimmonismus · Jun 6 · 72

We are in for a wild ride, and this is just the beginning: “World-first” vaccine designed by artificial intelligence. Researchers at the University of Cambridge have trialled what they describe as the world’s first AI-designed vaccine component in humans. The vaccine uses an AI-designed “super-antigen” intended to train the immune system against a broad family of coronaviruses, including existing Covid variants and animal coronaviruses that could potentially cause future pandemics. Instead of designing a vaccine around one current virus strain, researchers fed the AI genetic data from many known coronaviruses. The AI then designed an antigen meant to trigger immune protection across the whole virus family, even if the virus mutates or jumps from animals to humans. The first human trial involved 39 people and mainly tested safety. The immune response was described as modest, but the result is still seen as promising because it shows that an AI-designed vaccine antigen can be tested in humans. A larger study with around 200 people will now examine how well the vaccine actually trains the immune system.

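Translation: Researchers at the University of Cambridge have run what is described as the world’s first human trial of an AI-designed vaccine component. The vaccine uses an AI-designed “super-antigen” intended to train the immune system against a broad family of coronaviruses, including existing Covid variants and animal coronaviruses that could cause future pandemics. The first human trial involved only 39 people and mainly tested safety. The immune response was modest, but the result is seen as promising because it shows that an AI-designed vaccine antigen can be tested in humans. The next step is a larger study of around 200 people.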

Anthropic@AnthropicAI · Jun 6 · 73

New Anthropic Science Blog: Making Claude a chemist. To manipulate a molecule, chemists first need to understand its structure. Their main tool is NMR spectroscopy. We found Opus 4.7 matches—and on some tasks beats—dedicated NMR software. Read more: https://www.anthropic.com/research/making-claude-a-chemist

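Translation: New Anthropic Science Blog: Making Claude a chemist. To manipulate a molecule, chemists first need to understand its structure, and their main tool is NMR spectroscopy. We found that Opus 4.7 matches, and on some tasks beats, dedicated NMR software. Read more: https://www.anthropic.com/research/making-claude-a-chemist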