CausalMix: Data Mixture as Causal Inference for Language Model Training

AK @_akhaliq · 53 minutes ago · 29

PerceptionRubrics: Calibrating Multimodal Evaluation to Human Perception

Rohan Paul @rohanpaul_ai · 1 hour ago · 51

ByteDance Seed delivered again. They released EdgeBench, to test whether AI agents can improve through experience, using 134 real-world tasks that run for at least 12 hours. The big deal is that it shifts AI evaluation from “what does the model already know?” to “can the model learn while doing real work?” Huge, because future AI agents will not just answer questions from training data. They will enter messy environments, use tools, make attempts, read feedback, fix mistakes, and slowly build better solutions. Most current benchmarks are too short for that, so they mostly test memory, coding skill, or one-shot reasoning. EdgeBench instead gives agents 12-hour real-world tasks with feedback loops, so it can measure whether the agent improves through experience. Each task has a local workspace for fast trial and error, plus a hidden judge that gives stronger feedback on submitted work, which is meant to feel closer to real expert work. The authors then ran frontier agents for about 38,000 total hours and tracked how their best score changed as they kept interacting with the task environment. The big result is that when scores are averaged across many tasks, learning follows a very clean log-sigmoid curve, meaning progress is slow, then faster, then starts to level off. They also found that newer agents seem to learn from environments much faster, with the top models roughly doubling their 2-hour learning speed every 3 months.

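Summary: ByteDance Seed has released EdgeBench, a benchmark that tests how well AI agents learn during long tasks lasting 12 to 72 hours. It contains 134 real-world tasks across six categories (science, professional knowledge, software engineering, optimization, formal mathematics, and games), which take humans 57.2 hours on average. Agents iterate quickly in a local workspace and receive feedback from a hidden judge. Across roughly 38,000 hours of agent runs, performance as a function of interaction time closely fits a log-sigmoid curve, and top models double their learning speed every 3 months. The first 51 tasks and the full evaluation framework have been open-sourced.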
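To make the "log-sigmoid" shape and the "doubling every 3 months" claim concrete, here is a minimal sketch. The function names, parameters, and sample values are illustrative assumptions; none of them come from EdgeBench, and the curve is only the generic form (a sigmoid applied to the logarithm of interaction time).

```python
import math


def log_sigmoid_score(hours, s_max=1.0, t_mid=4.0, steepness=1.5):
    """Score as a sigmoid of log(interaction time).

    All parameters are hypothetical: s_max is the plateau, t_mid is the
    time at which half the plateau is reached, and steepness is the slope
    in log-time.
    """
    if hours <= 0:
        raise ValueError("hours must be positive")
    z = steepness * (math.log(hours) - math.log(t_mid))
    return s_max / (1.0 + math.exp(-z))


def speedup_after(months, doubling_months=3.0):
    """Growth factor of a rate that doubles every `doubling_months` months."""
    return 2.0 ** (months / doubling_months)


if __name__ == "__main__":
    # Slow start, faster middle, then levelling off toward s_max.
    for h in (0.5, 2, 4, 12, 72):
        print(f"{h:>5} h -> {log_sigmoid_score(h):.3f}")
    # Doubling every 3 months compounds to 2**4 = 16x over 12 months.
    print(speedup_after(12))
```

If the reported doubling rate held for a full year, it would compound to a 16x faster learning speed; the post itself only reports the 3-month figure.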

Ethan Mollick @emollick · 4 hours ago · 77

The talk about Mythos and cybersecurity was not, in fact, hype. (As anyone using Fable to do autonomous work has probably recognized)

Krea @krea_ai · 5 hours ago · 33

thanks to the Thinking Machines team, we used Tinker to prototype our reward models and train the prompt expander via RL. for more information, read the full technical report on the data, architecture, and training behind Krea 2 👇

elvis @omarsar0 · 9 hours ago · 67

// AutoMem // I quite like this idea of metamemory. (bookmark it) This new research from Stanford treats agent's memory management as a trainable skill instead of a fixed module. The model decides what to encode, when to retrieve, and how to organize its own notes, with file-system operations promoted to first-class actions right alongside task actions. AutoMem automates this on two loops. A strong LLM reviews full trajectories and rewrites the memory structure (prompts, schemas, action vocabulary). Then the agent's own good memory decisions across episodes become training signal to sharpen its proficiency. Optimizing memory alone, without touching task-action behavior, lifts the base agent 2x to 4x on Crafter, MiniHack, and NetHack. That is enough to make a 32B open model competitive with Claude Opus 4.5 and Gemini 3.1 Pro Thinking. For long-horizon agents, memory is a high-leverage objective you can train for on its own. Paper: https://arxiv.org/abs/2607.01224 Learn to build effective AI agents in our academy: https://academy.dair.ai/

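Summary: Stanford proposes AutoMem, which turns agent memory management from a fixed module into a trainable skill. The model decides what to encode, when to retrieve, and how to organize its notes, and file-system operations are promoted to first-class actions. AutoMem uses two loops: a strong LLM reviews full trajectories and rewrites the memory structure (prompts, schemas, action vocabulary), while the agent's own good memory decisions serve as training signal. Optimizing memory alone, without changing task actions, yields a 2x to 4x improvement on Crafter, MiniHack, and NetHack, making a 32B open model competitive with Claude Opus 4.5 and Gemini 3.1 Pro Thinking. Paper: arxiv.org/abs/2607.01224.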

Epoch AI @EpochAIResearch · 9 hours ago · 54

Introducing EBR-bench, our new benchmark to measure on-the-fly learning. AI repeatedly plays a challenging board game called Earthborne Rangers and tries to learn from its mistakes. So far: no signs of improvement.

Berryxia.AI @berryxia · 10 hours ago · 48

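Whoa, you can do 3D modeling on a phone now! GenRecon proposes a new method that combines generative 3D priors with multi-view reconstruction. Instead of relying solely on traditional SfM/MVS or NeRF-style optimization, it cuts the scene into overlapping chunks, reconstructs each chunk by conditional generation with a strong generative model (such as Trellis.2), and then stitches the chunks together. The core innovation is a projection-based conditioning mechanism that lifts multi-view image features directly into a 3D space aligned with the generative model. The final output is a high-quality, editable PBR mesh, reportedly 16% better than the current SOTA in fidelity and completeness on indoor scene reconstruction. This reflects a current trend in 3D reconstruction: rather than relying only on geometric constraints, methods increasingly borrow priors from generative models to fill in missing information and improve detail.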

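Summary: GenRecon combines generative 3D priors with multi-view reconstruction: it splits the scene into overlapping chunks, reconstructs each chunk conditionally with a generative model such as Trellis.2, and stitches them together. The core innovation is projection-based conditioning, which lifts multi-view image features into 3D space. It outputs an editable PBR mesh, with indoor reconstruction fidelity and completeness 16% higher than SOTA.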

Rohan Paul @rohanpaul_ai · 23 hours ago · 69

Very timely paper. MCP servers need clear design patterns because LLMs get confused when too many tools or vague tools are shown. This paper explains how MCP servers should be structured so LLM tools stay useful, safe, and manageable. MCP server design is not just normal API design, because the client is an LLM that chooses tools by reading plain-language descriptions. It groups real MCP servers into 5 useful patterns, such as servers that expose data, run workflows, keep session state, combine many servers, or translate messy domain APIs. The authors also warn about 4 common mistakes, especially giant all-purpose tools, vague tool descriptions, unsafe outside content, and slow tools that should return a job ID instead. They tested the pattern labels on 54 extra servers, measured transport delay, and studied how tool accuracy changes as more tools are shown. The key result is that too many visible tools hurt accuracy, with weaker models dropping below 90% between 10 and 15 tools. Good MCP design is mostly about making the tool list small, clear, safe, and stable enough for LLMs to choose the right action. ---- Link – arxiv.org/abs/2606.30317 Title: "MCP Server Architecture Patterns for LLM-Integrated Applications"

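Summary: The paper argues that MCP server design differs from ordinary API design because LLMs choose tools from plain-language descriptions, so too many or vague tools cause confusion. The authors identify 5 patterns seen in practice (exposing data, running workflows, keeping session state, combining servers, and translating messy domain APIs) and warn about 4 common mistakes (giant all-purpose tools, vague descriptions, unsafe external content, and slow tools that should return a job ID). They validated the pattern labels on 54 additional servers; separately, weaker models fell below 90% accuracy once 10 to 15 tools were visible. The core of good MCP design is keeping the tool list small, clear, safe, and stable.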
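The "slow tools should return a job ID" advice can be sketched as follows. This is a plain-Python illustration under stated assumptions, not code from the paper and not the API of any MCP SDK: `JobStore`, `submit`, `status`, and `slow_report` are all hypothetical names. The idea is that the tool call returns immediately with an identifier, and a second, cheap tool lets the model poll for the result.

```python
import threading
import time
import uuid


class JobStore:
    """In-memory job registry; a stand-in for whatever a real server would use."""

    def __init__(self):
        self._jobs = {}
        self._lock = threading.Lock()

    def submit(self, fn, *args):
        """Start fn(*args) in the background and return a job ID right away."""
        job_id = uuid.uuid4().hex
        with self._lock:
            self._jobs[job_id] = {"status": "running", "result": None}

        def run():
            try:
                update = {"status": "done", "result": fn(*args)}
            except Exception as exc:  # report failures instead of hanging
                update = {"status": "failed", "result": str(exc)}
            with self._lock:
                self._jobs[job_id] = update

        threading.Thread(target=run, daemon=True).start()
        return job_id

    def status(self, job_id):
        """Cheap polling call: returns a copy of the job record."""
        with self._lock:
            return dict(self._jobs.get(job_id, {"status": "unknown", "result": None}))


def slow_report(n):
    """Hypothetical slow tool body."""
    time.sleep(0.05)
    return sum(range(n))


if __name__ == "__main__":
    store = JobStore()
    job_id = store.submit(slow_report, 10)
    print("started", job_id)
    while store.status(job_id)["status"] == "running":
        time.sleep(0.01)
    print(store.status(job_id))
```

Splitting the work into a "start" tool and a "status" tool keeps each call short, so the model is not left waiting on one long request.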

elvis @omarsar0 · 1 day ago · 46

Great paper on managing agent skills. Skill libraries keep growing, and picking the right skills has become a bottleneck for coding agents. The defaults are to expose the agent to the whole skill collection, or retrieve skills with embeddings and rerankers. Both treat the choice as independent picks. SkillComposer treats composition as one joint decision over which skills, how many, and in what order. A constrained autoregressive decoder over skill identifiers produces the full plan in a single pass, so dependencies between successive skills fall out naturally. On SkillsBench with GPT-5.2-Codex and Gemini-3-Pro-Preview, it lifts pass rate by +23.1 and +18.2pp over no-skill, beats top-3 retrieval, and matches the gold-skill upper bound at lower prompt-token cost. Paper: https://arxiv.org/abs/2606.32025 Learn to build effective AI agents in our academy: https://academy.dair.ai/

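Summary: The paper proposes SkillComposer, which treats skill selection and composition for coding agents as one joint decision: a constrained autoregressive decoder generates the full skill plan (which skills, how many, and in what order) in a single pass, naturally handling dependencies between skills. On SkillsBench with GPT-5.2-Codex and Gemini-3-Pro-Preview, pass rate rises by +23.1 and +18.2 percentage points respectively, beating top-3 retrieval and matching the gold-skill upper bound at lower prompt-token cost.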
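A toy sketch of constrained autoregressive decoding over skill identifiers may help. It is not SkillComposer's implementation: the learned decoder is replaced by a hand-written score table, decoding is greedy, and every name here (`decode_skill_plan`, `toy_score`, the skill IDs) is hypothetical. What it does show is the constraint: at each step only an unused skill ID or a stop token may be emitted, and each choice is conditioned on the plan so far.

```python
STOP = "<stop>"


def decode_skill_plan(score_fn, skill_ids, max_skills=5):
    """Greedy constrained decoding.

    Each step may emit only an unused skill identifier or STOP, so the
    output is always a valid, duplicate-free, ordered plan.
    """
    plan = []
    while len(plan) < max_skills:
        allowed = [s for s in skill_ids if s not in plan] + [STOP]
        best = max(allowed, key=lambda tok: score_fn(plan, tok))
        if best == STOP:
            break
        plan.append(best)
    return plan


def toy_score(prefix, token):
    """Hypothetical scores conditioned on the previously chosen skill."""
    prev = prefix[-1] if prefix else None
    table = {
        (None, "read_repo"): 3.0,
        ("read_repo", "write_tests"): 2.5,
        ("write_tests", "run_ci"): 2.0,
        ("run_ci", STOP): 1.5,
    }
    return table.get((prev, token), 0.0)


if __name__ == "__main__":
    skills = ["run_ci", "write_tests", "read_repo", "format_code"]
    print(decode_skill_plan(toy_score, skills))
```

Because each step sees the prefix, the number of skills and their order come out of the same pass, which is the contrast with independent top-k retrieval drawn in the post.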

Rohan Paul @rohanpaul_ai · 1 day ago · 42

Paper from Meta shows Quantized reasoning models often lose because they keep doubting a correct answer instead of finishing. Many of them reason well enough, but compression makes them hesitate at the wrong time. The problem is that post-training quantization, a way to shrink models after training, can make reasoning models cheaper to run but worse at finishing cleanly. The authors found that strong quantization does not only make models less capable, since in many failures the model already reached the right answer but then second-guessed itself. Their core idea is that quantization adds noise at uncertain word choices, so the model becomes more likely to pick words like “wait,” “but,” or “alternatively” that reopen the problem. They tested this across math, coding, and science tasks using 5 reasoning models, several quantization methods, and model sizes from 1.5B to 32B. The main result is that aggressive quantization raised overthinking failures up to 52%, while a small penalty on 50 hesitation words cut reasoning length by 12% to 23% and often kept or improved accuracy. Given compressed models are widely used to save memory and cost, very important to know that a very small decoding fix can stop many of them from wasting tokens and losing answers they already had. ---- Link – arxiv. org/abs/2606.00206 Title: "Quantized Reasoning Models Think They Need to Think Longer, but They Do Not"

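Summary: A new Meta paper finds that post-training quantization shrinks reasoning models and lowers deployment cost, but makes models repeatedly second-guess themselves after reaching a correct answer, wasting tokens. Quantization injects noise at uncertain word choices, making the model more likely to emit words such as "wait," "but," or "alternatively" that reopen the reasoning. Across math, coding, and science tasks on 5 reasoning models (1.5B to 32B), aggressive quantization drove the overthinking failure rate as high as 52%. A small penalty on 50 hesitation words trims reasoning length by 12% to 23% while keeping or even improving accuracy.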
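One simple way to apply "a small penalty on hesitation words" is to subtract a constant from the logits of those tokens before sampling. The sketch below assumes that form; the post does not say how the paper implements its penalty, and the vocabulary, logits, and penalty value are made up for illustration.

```python
import math


def penalize_hesitation(logits, hesitation_ids, penalty=1.0):
    """Return a copy of logits with `penalty` subtracted at hesitation token IDs."""
    return [x - penalty if i in hesitation_ids else x for i, x in enumerate(logits)]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])


if __name__ == "__main__":
    vocab = ["so", "wait", "therefore", "but"]   # toy vocabulary
    logits = [1.0, 1.2, 0.9, 0.5]                # toy next-token logits
    hesitation_ids = {1, 3}                      # "wait", "but"

    before = vocab[argmax(logits)]
    after = vocab[argmax(penalize_hesitation(logits, hesitation_ids))]
    print(before, "->", after)
    print([round(p, 3) for p in softmax(penalize_hesitation(logits, hesitation_ids))])
```

The penalty only tips close calls: a hesitation token with a large logit margin would still be chosen, which fits the description of a small decoding-time fix rather than a ban on those words.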

AK @_akhaliq · 1 day ago · 49

LiteResearcher: A Scalable Agentic RL Training Framework for Deep Research Agent

Epoch AI @EpochAIResearch · 1 day ago · 28

We recently began tracking 13 new evals on our benchmarking hub. 7 of these have been incorporated into the Epoch Capabilities Index (ECI).

Jim Fan @DrJimFan · 1 day ago · 71

ENPIRE -> ASPIRE, our 2nd work in the series for Physical AutoResearch. We are building the components for robot self-improvement, one /skill at a time.

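Summary: Following ENPIRE, Jim Fan's team released ASPIRE, which builds a self-evolving, indefinitely compounding skill library for robots. Coding agents observe multimodal sensory traces from simulation and real robots, run an evolutionary search over control programs, and distill the best strategies into an ever-expanding library. ASPIRE needs no gradient descent or end-to-end policy; by transferring "skill know-how" it sidesteps the sim2real and cross-embodiment transfer problems, cutting transfer-learning tokens by about 10x compared with training from scratch. It has been validated on 150+ tasks and 90+ skills, and the team plans to open-source the full stack.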

AK @_akhaliq · 1 day ago · 18

Orca: The World is in Your Mind

Rohan Paul @rohanpaul_ai · 1 day ago · 68

U.S. chip restrictions helped push China to build and spread open AI models. The authors tested this by looking at policy documents, open model releases, GitHub activity, research papers, company-linked papers, and U.S. patents. They found that after major U.S. export controls, Chinese developers increased activity around open LLM projects much more than U.S. developers did. ---- Link – arxiv. org/abs/2606.15999 Title: "U.S.Policies Unintentionally Accelerated China's Open AI Ecosystems"

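Summary: A study analyzing policy documents, open model releases, GitHub activity, papers, and U.S. patents finds that after the U.S. tightened export controls, Chinese developers' activity on open LLM projects grew far more than that of U.S. developers; rather than containing China's AI development, U.S. policy accelerated its open-source ecosystem. Perplexity CEO Aravind Srinivas added that China builds data centers faster, and that power, permitting, manpower, labor, and expertise are not obstacles there.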

Greg Brockman @gdb · 1 day ago · 56

Introducing GeneBench-Pro — testing whether models can handle the kind of judgment-heavy analysis that real-world computational biology requires. Problems would take a human expert around 20-40 hours to complete. GPT-5.6 Sol is a big step forward.

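Summary: OpenAI has launched GeneBench-Pro, a research-level benchmark that tests how well AI agents handle complex, judgment-heavy analysis in real computational biology. Each problem takes a human expert about 20 to 40 hours. Greg Brockman says GPT-5.6 Sol is a big step forward on the benchmark.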

AI Notkilleveryoneism Memes ⏸️ @AISafetyMemes · 2 days ago · 76

AI just solved not one, but ***9*** unsolved math problems. Once again, instead of this being a global news story, not one journalist on Earth thought this was worth mentioning.

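Summary: The AI Safety Memes post says AI has just solved 9 unsolved math problems, yet no journalist anywhere covered it. It quotes a post by @WeinsteinOmri saying that a "prover-verifier" LLM loop solved 9 major open problems in theoretical computer science, including one that had stumped him for 2 years. The work was done with collaborators at Columbia University, and the plan is to extend the approach to all fields of science.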

elvis @omarsar0 · 2 days ago · 46

If you build with MCPs, this one is worth reading. (bookmark it) The paper covers five recurring MCP server patterns across fifteen independently developed servers. That taxonomy is useful because I see many AI teams rebuilding the same shapes without shared names. If you are building MCP servers, this is a practical reference for deciding whether your server is exposing resources, orchestrating tools, managing sessions, aggregating proxies, or adapting a domain workflow. Paper: https://arxiv.org/abs/2606.30317 Learn to build effective AI agents in our academy: https://academy.dair.ai/

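Summary: Elvis Saravia (DAIR.AI) recommends a paper on MCP server architecture patterns. Based on 15 independently developed MCP servers, it identifies 5 recurring patterns: exposing resources, orchestrating tools, managing sessions, aggregating proxies, and adapting domain workflows. The taxonomy helps developers decide on a server design and avoid reinventing the wheel. Paper: https://arxiv.org/abs/2606.30317.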

Chubby♨️ @kimmonismus · 2 days ago · 24

Forget GLP-1 drugs like Ozempic and Wegovy: Wistar researchers say a single DNA-based injection produced weight loss and blood glucose control in mouse models for up to 10x longer. The method uses plasmid DNA plus electroporation to give cells instructions to make long-acting GLP-1/GIP-like proteins. In mice, one dose of the pLincretins construct produced detectable incretins for up to 70 days. In a head-to-head comparison, Wistar says mice given the DNA construct maintained metabolic improvements after observation ended, while semaglutide-treated mice began regaining weight after dosing stopped. They also used AI-assisted structural modeling to design pSynCretin, a molecule aimed at engaging GLP-1 and GIP receptors at once. The whole game is to be changed.

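Summary: The Wistar Institute developed a single-injection method based on plasmid DNA plus electroporation that, in mouse models, produced weight loss and blood glucose control lasting up to 10x longer than conventional GLP-1 drugs such as Ozempic and Wegovy. One injection of the pLincretins construct kept incretins detectable for up to 70 days. In a head-to-head comparison with semaglutide, the DNA-construct group maintained metabolic improvements after observation ended, while the semaglutide group regained weight after dosing stopped. The researchers also used AI-assisted structural modeling to design pSynCretin, a molecule intended to activate the GLP-1 and GIP receptors at once.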

OpenAI @OpenAI · 2 days ago · 58

We’re introducing GeneBench-Pro, a research-level benchmark for a harder kind of AI progress: how well agents can navigate messy biological data, choose the right analysis path, and make judgment calls that real computational research depends on. https://openai.com/index/introducing-genebench-pro/

Jim Fan @DrJimFan · 2 days ago · 53

Today, we give robots a /skills library that self-evolves and compounds indefinitely! Introducing ASPIRE: a robot solving its 100th task is no longer as clueless as solving its first. Coding agents observe multimodal sensory traces from simulation and real robots, launch an evolutionary search over control programs, and distill the best know-how into an ever-expanding library. ASPIRE is a new type of continual learning: "training" is skill refinement instead of gradient descent. "Trained model" is a repo of sensorimotor skills instead of floating weights. “Distributed training” is a panel of agents each practicing a different skill instead of sharded minibatches. Here's the beauty: ASPIRE gives the tired terms "sim2real transfer" and "cross-embodiment transfer" a whole new meaning. Bridging the sim-to-real gap is notoriously brutal. An end-to-end policy has to swallow both the visual shift (sim looks toyish next to a real camera) and the subtle contact physics it never quite gets right. ASPIRE sidesteps the mess, because it doesn't ship pixels or weights across the gap, but ships the know-how. The robot still has to practice in the real world, not zero-shot, but it gets there way faster because it isn't rediscovering the strategy from scratch. Same for going single-arm to bimanual hardware, which usually requires new data and retraining from zero. ASPIRE achieves up to ~10x cut in "transfer learning” tokens (yes, tokens are the new unit of *training* compute ;) Check out our gallery of 150+ tasks and 90+ skills the robots taught themselves, all on the website! Kind of wild that we can ship the "learned weights" as an HTML page rather than a GGUF. We'll open-source the full stack so your own robot library starts compounding from ours! Deep dive in thread:

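Summary: Jim Fan's team introduces ASPIRE, a continual-learning system in which robots automatically grow a skill library through evolutionary search. Coding agents observe multimodal sensory traces from simulation and real robots, run an evolutionary search over control programs, and distill the best know-how into an ever-expanding skill library, so a robot solving its 100th task no longer starts from scratch as it did on its first. ASPIRE cuts "transfer learning" tokens by up to about 10x and supports sim2real transfer as well as cross-embodiment transfer from single-arm to bimanual hardware. The project showcases 150+ tasks and 90+ skills and will open-source the full stack.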

Microsoft Research @MSFTResearch · 2 days ago · 39

AI agents often fail because their instructions, or skills, are manually modified with no guarantee of improvement. Learn how SkillOpt turns skill editing into a training process, making agent behavior more reliable without changing model weights: https://msft.it/6012vsvEs

AK @_akhaliq · 2 days ago · 31

OSWorld2.0: Benchmarking Computer Use Agents on Long-Horizon Real-World Tasks

SemiAnalysis @SemiAnalysis_ · 2 days ago · 63

Parallel draft tree, tree-causal verification. Looking forward to its deeper integration with inference engines vLLM/SGLang! Great work @Lanxiang_Hu!

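Summary: JetSpec is a speculative decoding method that jointly optimizes draft cost and quality through causal parallel tree drafting, using a parallel draft tree and tree-causal verification. It reaches a 9.64x end-to-end speedup on MATH-500 and 4.58x on open-ended chat while remaining lossless. Combined with CUDA graphs and kernel optimizations, a single B200 reaches about 1000 TPS. SemiAnalysis looks forward to its deeper integration with the inference engines vLLM/SGLang.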

elvis @omarsar0 · 3 days ago · 73

Qwen publishes new work on RL coding agents. (bookmark it) The idea is to continually build a verification system that co-evolves with AI agents. LLMs suffer from all sorts of reward hacking issues. This work studies coding-agent reward signals, test pass rates, LLM judges, and execution traces, and shows each one has a horizon beyond which it stops tracking real correctness and starts getting hacked. They report that reward design for long-horizon coding is really a horizon problem. The metric you pick matters less than how long it keeps tracking correctness, and the paper finds where each signal crosses that line. Paper: https://arxiv.org/abs/2606.26300 Learn to build effective AI agents in our academy: https://academy.dair.ai/

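Summary: Qwen has published new work on RL coding agents that addresses reward hacking in LLMs. It systematically studies reward signals for coding agents (test pass rates, LLM judges, and execution traces) and finds that each has a "horizon" beyond which it stops tracking real correctness and gets exploited. The paper argues that reward design for long-horizon coding is fundamentally a horizon problem: which metric you pick matters less than how long it keeps tracking correctness.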

小互 @xiaohu · 3 days ago · 75

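Meta releases Brain2Qwerty v2. It converts what you are thinking into text in real time with no implant: wearing a MEG (magnetoencephalography) helmet is enough to decode the magnetic signals your brain produces into coherent sentences in real time, and no surgery is needed at any point. Word accuracy reaches 61%, about 7.6 times that of other non-invasive brain-computer interface methods (8%); the best participant reached 78%, and more than half of the sentences were off by at most one word. This is currently the highest-performing non-invasive brain-computer interface system....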

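Summary: Meta has released Brain2Qwerty v2, which needs no surgical implant: wearing a MEG (magnetoencephalography) helmet is enough to decode the brain's magnetic signals into coherent sentences in real time. Word accuracy reaches 61%, about 7.6 times that of other non-invasive brain-computer interface methods (8%); the best participant reached 78%, and more than half of the sentences were off by at most one word. Meta says it is currently the highest-performing non-invasive brain-computer interface system.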

Rohan Paul @rohanpaul_ai · 3 days ago · 65

Big new paper release of Google for external agentic verification for science. Science now needs AI review agents because AI is making papers faster than humans can check them. The problem is that AI can help produce more research, but the slow part is still checking whether the work is actually correct. The paper frames this as verification debt, where every faster research workflow creates more claims, proofs, experiments, and comparisons that someone still has to inspect. Its main proposal is agentic verification, where AI agents help review papers by splitting them into parts, checking difficult sections deeply, and combining the findings into a review. Google’s Paper Assistant Tool is the example system, and it focuses on objective checks like proof errors, experimental gaps, missing comparisons, and unclear claims rather than final accept or reject decisions. The authors tested it on known math and computer science paper errors and in author-facing pilots at STOC and ICML, where authors used it before submission. The striking result is that Paper Assistant Tool found far more known proof errors than a single model call, and many authors said it led them to fix serious theory gaps or run new experiments. The big deal is that scientific review may need its own AI stack, with review agents, clear roles, and human oversight, because paper generation is becoming partly automated too. ---- Link – arxiv. org/abs/2606.28277 Title: "Towards Automating Scientific Review with Google's Paper Assistant Tool"

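Summary: A new Google paper introduces the notion of "verification debt": AI speeds up paper production, but human checking becomes the bottleneck. In response it proposes agentic verification and presents the Paper Assistant Tool as an example system. The tool splits a paper into parts, checks the difficult sections in depth, and combines the findings into a review, focusing on objective issues such as proof errors, experimental gaps, and missing comparisons rather than accept/reject decisions. On known errors in math and computer science papers it found more proof errors than a single model call; in author-facing pilots at STOC and ICML, many authors used it to fix serious theory gaps or add experiments. The paper argues that scientific review may need its own AI stack to keep up with increasingly automated paper generation.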

Microsoft Research @MSFTResearch · 3 days ago · 46

AI agents can't remember past conversations. They must constantly reload or retrieve context, which grows less efficient as tasks get longer and more complex. Memora solves this with a scalable memory system separating what’s stored from how it's retrieved: https://msft.it/6018vs3gC

宝玉 @dotey · 3 days ago · 79

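Meta made two big announcements today: the Brain2Qwerty v1 paper formally appeared in Nature Neuroscience, and v2 was released the same day. When v1 was made public as a preprint last year, it could reconstruct typed text letter by letter from brain signals, with a character error rate of 32%. v2 skips the letter level and goes straight to real-time sentence-level decoding, with an average word accuracy of 61%; the best participant reached 78%, and more than half of the sentences were decoded to within one word. For reference, previous non-invasive methods had a word accuracy of only 8%. "Non-invasive" here means no craniotomy and no electrodes implanted in the brain. Participants wear a MEG (magnetoencephalography) device, which captures the weak magnetic fields produced by brain activity through sensors outside the scalp. By comparison, invasive brain-computer interfaces of the Neuralink kind can exceed 90% accuracy, but at the cost of open-skull surgery. The training data for v2 came from 9 volunteers, each typing for 10 hours while wearing the MEG device, for a total of about 22,000 recorded sentences. The system processes raw brain signals directly with end-to-end deep learning, then fine-tunes a large language model to exploit semantic context, "translating" the noisy neural data into coherent language. Meta also mentions that it used AI agents to explore optimizations of the decoding pipeline, with the final training configuration selected manually by engineers. One interesting finding: decoding accuracy improves log-linearly with the amount of data. In other words, simply adding training data may keep narrowing the gap with invasive methods. Meta open-sourced all training code for v1 and v2, and its partner BCBL (the Basque Center on Cognition, Brain and Language) released the v1 dataset. How far is this from practical use? MEG devices are bulky, cost millions of dollars, and need a magnetically shielded room, so for now they only run in a lab. The participants this time were also all healthy people, and it has not yet been verified whether the results can be reproduced in the brain-lesion patients who actually need the help. Portable MEG alternatives (based on optically pumped magnetometers) are under development, but they are still a long way from a consumer product. Still, taking non-invasive sentence decoding from "barely usable" to "roughly good enough to communicate" matters in itself: it shows that an approach without surgery may come close to what surgery achieves, and that what remains is an engineering problem rather than a problem of principle. For the millions of people worldwide who have lost the ability to communicate because of brain lesions, a path that needs no surgery, however distant, is well worth looking forward to. Official introduction: https://ai.meta.com/blog/brain2qwerty-brain-ai-human-communication/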

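Summary: Meta published the Brain2Qwerty v1 paper in Nature Neuroscience and released v2 the same day. v1 decoded brain signals letter by letter with a 32% character error rate. v2 performs real-time sentence-level decoding with 61% average word accuracy, 78% for the best participant, and more than half of the sentences within one word of the target. Earlier non-invasive methods reached only 8%. v2 uses a MEG device to record about 10 hours of typing from each of 9 volunteers (about 22,000 sentences), combining end-to-end deep learning with a fine-tuned large language model. Accuracy improves log-linearly with data volume. Meta open-sourced all training code for v1 and v2. MEG devices remain bulky and expensive, but the result offers patients with brain lesions a viable path that requires no craniotomy.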

AYi @AYi_AInotes · 3 days ago · 71

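Zuckerberg is cooking up something big: non-invasive brain decoding has already reached word-level real-time output, backed by a Nature publication. This step came faster than anyone expected.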

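Summary: Meta (Zuckerberg's team) has made a major advance in non-invasive brain-computer interface research with Brain2Qwerty v2. Building on v1, published the same day in Nature, the model is the highest-performing end-to-end pipeline to date, decoding sentences from raw brain signals in real time and moving from character-level decoding to words and semantics, which markedly improves overall communication accuracy. The progress came faster than expected and could help millions of patients who cannot communicate because of brain lesions or disorders.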

AK @_akhaliq · 3 days ago · 36

PhysisForcing: Physics Reinforced World Simulator for Robotic Manipulation

elvis @omarsar0 · 3 days ago · 77

Highly recommended reading. What an impressive use of LLMs and deep learning. Achieves "real-time sentence decoding from non-invasive brain recordings, approaching levels of accuracy previously exclusive to techniques that require brain surgery."

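Summary: Meta AI has released Brain2Qwerty v2, the latest milestone for its non-invasive brain-signal decoder; the v1 paper was published in Nature the same day. The model decodes complete sentences from raw brain signals in real time, with accuracy approaching that of invasive techniques that require brain surgery. It moves from v1's character-level decoding to word- and semantics-level decoding, markedly improving communication accuracy, and could help millions of patients who cannot communicate because of brain lesions or disorders.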

AI at Meta @AIatMeta · 3 days ago · 79

We’re sharing the next major milestone in our non-invasive brain-to-text decoder research: Brain2Qwerty v2. Building on v1, which was published today in @Nature, Brain2Qwerty v2 is the highest-performing end-to-end pipeline capable of real-time sentence decoding from raw brain signals. It advances beyond character-level performance to decoding words and semantics, enabling accuracy for overall communication. We believe this research has the potential to make a real difference for the millions of people who suffer from brain lesions or disorders that prevent them from communicating. 🧵👇

Rohan Paul @rohanpaul_ai · 3 days ago · 56

New paper from Cambridge Univ+NVIDIA and other top labs teaches AI agents and AI judges to improve together, so neither side gets stuck. Moves self-improving AI away from fixed benchmarks and toward a loop where the thing doing the judging can also get better. The problem is that most self-improving agents train against a fixed benchmark or fixed evaluator, so the score can become stale, too easy, or easy to game. The paper’s idea is to let the evaluator improve too, but only at safe handoff points, so each training stretch still has a stable judge. During each stretch, agents are tested by the current frozen evaluator, while possible better evaluators are tested separately against held-out human or objective answers. The authors try this on coding, paper writing, paper reviewing, proof writing, and proof grading, where some tasks have clear answers and others need learned judgment. On coding, the system beats the earlier best self-improving coding agent while using 1.35× to 1.72× fewer tokens, because a cheap code reviewer adds useful feedback. On paper writing, the co-evolved writer gets about 1.86X higher average acceptance from a reviewer panel than the fixed-evaluator baseline. The big point is that stronger AI systems may need stronger judges growing with them, because fixed tests can stop giving useful pressure. ---- Link – arxiv. org/abs/2606.26294 Title: "The Red Queen Gödel Machine: Co-Evolving Agents and Their Evaluators"

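Summary: Cambridge University, NVIDIA, and other institutions published "The Red Queen Gödel Machine", which co-evolves AI agents and their evaluators so that a fixed benchmark does not go stale or become easy to game. Within each training stretch the evaluator is frozen, while candidate stronger evaluators are tested separately against held-out human or objective answers and swapped in at safe handoff points. On coding, the system beats the previous best self-improving coding agent while using 1.35× to 1.72× fewer tokens; on paper writing, the co-evolved writer gets about 1.86× higher average acceptance from a reviewer panel. The paper stresses that stronger AI needs stronger evaluators that grow with it.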

AK @_akhaliq · 4 days ago · 28

DiffusionBench: On Holistic Evaluation of Diffusion Transformers

Rohan Paul @rohanpaul_ai · 4 days ago · 44

This paper asks whether AI agents have a real memory system yet, and finds the answer is mostly no. The problem is that AI agents now need memory that can store, search, update, and clean up information across long tasks. The authors say current tests mostly check final answers, so they miss whether the memory system itself is fast, reliable, or good at handling changed facts. They split agent memory into 4 parts: how memories are stored, how facts are extracted, how useful memories are found, and how old or conflicting memories are maintained. They tested 12 memory systems across 5 workloads and 11 datasets, including long conversations, multi-session recall, database tasks, and update-heavy settings. The main result is that no memory design wins everywhere, because graph memories help with linked facts, hybrid systems help with filtered search, and raw traces help when exact action history matters. ---- Link – arxiv. org/abs/2606.24775 Title: "Are They Ready For An Agent-Native Memory System?"

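Summary: A new paper argues that AI agents still lack a real memory system. Existing tests check only final answers and ignore the performance of the memory system itself. The paper splits agent memory into four parts (storage, fact extraction, retrieval of useful memories, and maintenance of old or conflicting memories) and evaluates 12 memory systems across 5 workloads and 11 datasets. The key finding is that no memory design wins everywhere: graph memories are good at linked facts, hybrid systems at filtered search, and raw traces when exact action history matters.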

Rohan Paul @rohanpaul_ai · 4 days ago · 65

This paper shows that LLM agents still struggle to plan through big, messy tool libraries. The paper builds a retail benchmark, PlanBench-XL, to test whether LLM agents can solve long tool-use tasks when tools are hard to find. It has 327 tasks and 1,665 tools, and agents must uncover hidden intermediate facts before they can answer. Even strong models struggle, with GPT-5.4 getting 51.90% accuracy normally and dropping to 11.36% in the hardest blocked setting. The problem is that real agents often face huge tool libraries, so they cannot see every tool at once and must search for useful ones while solving the task. The core idea is to make agents plan both forward from what they know and backward from what they need, instead of giving them a clear tool path. The authors also add broken or misleading tools, so agents must notice when a promising path fails and then find another path. ---- Link – arxiv.org/abs/2606.22388 Title: "PlanBench-XL: Evaluating Long-Horizon Planning of LLM Tool-Use Agents in Large-Scale Tool Ecosystems"

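Summary: The paper introduces the PlanBench-XL benchmark, with 327 tasks and 1,665 tools, to test whether LLM agents can complete long-horizon tool-use tasks when tools are hard to find. GPT-5.4 reaches 51.90% accuracy in the normal setting and drops to 11.36% in the hardest blocked setting. The core idea is to make agents reason both forward from what they know and backward from what they need rather than rely on an explicit tool path. The paper also adds broken or misleading tools to test whether agents can switch strategy when a path fails.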

Rohan Paul @rohanpaul_ai · 4 days ago · 44

This paper says the web needs new rules because AI agents now read websites for people. The problem is that today’s web still assumes a human is looking at each page, seeing ads, clicking links, and reading visual layouts. AI agents break that setup because they can collect and summarize content without sending people back to the original sites, which hurts publishers and makes websites block them. The authors propose treating a helpful AI agent like a human’s proxy, so it should get similar access as that person, but with clear identity, purpose, limits, and payment rules. They propose adding a new “agent metadata” layer to normal web requests, where an AI agent tells a website who it is, which human it represents, and why it wants the content. The website then uses a new policy file called agents.txt to decide what to do: allow it, rate-limit it, charge tokens, inherit the user’s subscription, serve agent-friendly content, or block bad behavior. They also want content to carry provenance tags, so agents can tell whether something was made by a human, AI, or both. Without a new setup, the web may become harder for agents to access, worse for publishers to fund, and less reliable as AI-made content feeds more AI-made content. ---- Link – arxiv. org/abs/2606.19116 Title: "Towards an Agent-First Web: Redesigning the Web for AI Agents"

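Summary: A new paper points out that today's web assumes humans browse pages, view ads, and click links, whereas AI agents can collect and summarize content without sending visitors back to the original site, which hurts publishers and leads sites to block them. The authors propose treating an AI agent as a human's proxy and adding "agent metadata" to web requests that states its identity, the human it represents, its purpose, limits, and payment rules. Websites use a new policy file, `agents.txt`, to decide whether to allow, rate-limit, charge, inherit the user's subscription, serve agent-friendly content, or block. Content should also carry provenance tags so agents can tell whether it was made by a human, an AI, or both. Without a new mechanism, the web will become harder to access, publishers will find it harder to earn revenue, and loops of AI-made content will reduce reliability.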
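How a site might turn agent metadata plus an `agents.txt` policy into a decision can be sketched as below. The post does not give the `agents.txt` syntax or the metadata field names, so everything here is an assumption made for illustration: the `AgentRequest` fields, the `POLICY` keys, the order of the checks, and the choice to block agents that declare no human principal.

```python
from dataclasses import dataclass


@dataclass
class AgentRequest:
    agent_id: str                  # who the agent is
    on_behalf_of: str              # which human it represents ("" if undeclared)
    purpose: str                   # why it wants the content
    requests_last_minute: int = 0  # hypothetical counter kept by the site


# Hypothetical, already-parsed policy; the real agents.txt format is not
# described in the post.
POLICY = {
    "blocked_agents": {"scraper-x"},
    "rate_limit_per_minute": 30,
    "paid_purposes": {"bulk-summarize"},
    "subscribers": {"alice@example.com"},
}


def decide(req, policy=POLICY):
    """Map one request to an action named in the post's list of outcomes."""
    # Assumption: an agent with no declared human principal is blocked.
    if req.agent_id in policy["blocked_agents"] or not req.on_behalf_of:
        return "block"
    if req.requests_last_minute >= policy["rate_limit_per_minute"]:
        return "rate-limit"
    if req.on_behalf_of in policy["subscribers"]:
        return "inherit-subscription"
    if req.purpose in policy["paid_purposes"]:
        return "charge"
    return "allow"


if __name__ == "__main__":
    print(decide(AgentRequest("helper-1", "alice@example.com", "read-article")))
    print(decide(AgentRequest("helper-2", "bob@example.com", "bulk-summarize")))
    print(decide(AgentRequest("scraper-x", "carol@example.com", "read-article")))
```

Checking blocks and rate limits before subscriptions is one possible ordering; a real policy language would need to say which rule wins when several apply.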

elvis @omarsar0 · 4 days ago · 44

Fascinating paper on self-improving agents. (bookmark it) If you are working on agentic loops, you will quickly realize that they are only as good as the effectiveness of the evaluator. Self-improvement loops tend to stall the moment the judge stops getting harder. The agent learns to satisfy a fixed evaluator rather than getting genuinely better. The Red Queen Gödel Machine, from Cambridge, co-evolves the agent and its evaluator together, so the bar keeps rising as the agent climbs. The name borrows the evolutionary arms race. Both sides have to keep running to stay in place. A frozen evaluator is where reward hacking creeps into self-improvement. Co-evolving the judge is a structural answer to that, and it keeps the loop honest over many rounds. Paper: https://arxiv.org/abs/2606.26294 Learn to build effective AI agents in our academy: https://academy.dair.ai/

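Summary: A paper on self-improving agents observes that self-improvement loops tend to stall once the evaluator is fixed: the agent learns to satisfy the fixed evaluator rather than genuinely improve. The "Red Queen Gödel Machine" from Cambridge co-evolves the agent and its evaluator so the bar keeps rising as the agent improves, a structural guard against reward hacking. The name borrows the metaphor of an evolutionary arms race: both sides must keep running to stay in place. The paper is on arXiv.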