// AutoMem // I quite like this idea of metamemory. (bookmark it) This new research from Stanford treats an agent's memory management as a trainable skill instead of a fixed module. The model decides what to encode, when to retrieve, and how to organize its own notes, with file-system operations promoted to first-class actions right alongside task actions. AutoMem automates this with two loops. First, a strong LLM reviews full trajectories and rewrites the memory structure (prompts, schemas, action vocabulary). Then the agent's own good memory decisions across episodes become the training signal that sharpens its proficiency. Optimizing memory alone, without touching task-action behavior, lifts the base agent 2x to 4x on Crafter, MiniHack, and NetHack. That is enough to make a 32B open model competitive with Claude Opus 4.5 and Gemini 3.1 Pro Thinking. For long-horizon agents, memory is a high-leverage objective you can train for on its own. Paper: https://arxiv.org/abs/2607.01224 Learn to build effective AI agents in our academy: https://academy.dair.ai/

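Summary: Stanford proposes AutoMem, which turns an agent's memory management from a fixed module into a trainable skill. The model decides for itself what to encode, when to retrieve, and how to organize its notes, and file-system operations are promoted to first-class actions. AutoMem uses two loops: a strong LLM reviews full trajectories and rewrites the memory structure (prompts, schemas, action vocabulary), and the agent's own good memory decisions are used as training signal. Optimizing memory alone, without changing task actions, yields a 2x to 4x improvement on Crafter, MiniHack, and NetHack, making a 32B open model competitive with Claude Opus 4.5 and Gemini 3.1 Pro Thinking. Paper: arxiv.org/abs/2607.01224.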

Rohan Paul @rohanpaul_ai · 21 hours ago · 69

Very timely paper. MCP servers need clear design patterns because LLMs get confused when too many tools or vague tools are shown. This paper explains how MCP servers should be structured so LLM tools stay useful, safe, and manageable. MCP server design is not just normal API design, because the client is an LLM that chooses tools by reading plain-language descriptions. It groups real MCP servers into 5 useful patterns: servers that expose data, run workflows, keep session state, combine many servers, or translate messy domain APIs. The authors also warn about 4 common mistakes: giant all-purpose tools, vague tool descriptions, unsafe outside content, and slow tools that should return a job ID instead. They tested the pattern labels on 54 extra servers, measured transport delay, and studied how tool accuracy changes as more tools are shown. The key result is that too many visible tools hurt accuracy, with weaker models dropping below 90% between 10 and 15 tools. Good MCP design is mostly about making the tool list small, clear, safe, and stable enough for LLMs to choose the right action. ---- Link – arxiv. org/abs/2606.30317 Title: "MCP Server Architecture Patterns for LLM-Integrated Applications"

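Summary: The paper points out that MCP server design differs from ordinary API design because the LLM chooses tools by reading plain-language descriptions, so too many tools or vague tools cause confusion. The authors identify 5 practical patterns (exposing data, running workflows, keeping session state, combining servers, and translating messy domain APIs) and warn about 4 common mistakes (giant all-purpose tools, vague descriptions, unsafe external content, and slow tools that should return a job ID). They checked the pattern labels on 54 additional servers and found that weaker models fall below 90% accuracy once between 10 and 15 tools are visible. The core of good MCP design is keeping the tool list small, clear, safe, and stable.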
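The advice that slow tools should return a job ID can be sketched in plain Python. This is a minimal illustration only, not the paper's code and not the MCP SDK: `start_report`, `get_job_status`, and the in-memory `JOBS` store are hypothetical names chosen for the example.

```python
import threading
import time
import uuid

# Hypothetical in-memory job store; a real server would persist this.
JOBS = {}
JOBS_LOCK = threading.Lock()


def _run_job(job_id, seconds):
    """Simulate slow work, then record the result under the job ID."""
    time.sleep(seconds)
    with JOBS_LOCK:
        JOBS[job_id] = {"status": "done", "result": f"report ready after {seconds}s"}


def start_report(seconds=0.05):
    """Tool 1 (hypothetical): start the slow work and return a job ID at once."""
    job_id = uuid.uuid4().hex
    with JOBS_LOCK:
        JOBS[job_id] = {"status": "running", "result": None}
    threading.Thread(target=_run_job, args=(job_id, seconds), daemon=True).start()
    return {"job_id": job_id}


def get_job_status(job_id):
    """Tool 2 (hypothetical): a cheap poll the model can call later."""
    with JOBS_LOCK:
        return dict(JOBS.get(job_id, {"status": "unknown", "result": None}))
```

Splitting the work into a start call and a poll call keeps each tool invocation short, so the model is never blocked on one long-running call.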

elvis @omarsar0 · 2 days ago · 73

Qwen publishes new work on RL coding agents. (bookmark it) The idea is to continually build a verification system that co-evolves with AI agents. LLMs suffer from all sorts of reward hacking issues. This work studies coding-agent reward signals, test pass rates, LLM judges, and execution traces, and shows each one has a horizon beyond which it stops tracking real correctness and starts getting hacked. They report that reward design for long-horizon coding is really a horizon problem. The metric you pick matters less than how long it keeps tracking correctness, and the paper finds where each signal crosses that line. Paper: https://arxiv.org/abs/2606.26300 Learn to build effective AI agents in our academy: https://academy.dair.ai/

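Summary: Qwen has published new work on reinforcement-learning coding agents that addresses reward hacking in LLMs. It systematically studies the reward signals used for coding agents (test pass rates, LLM judges, and execution traces) and finds that each signal has a "horizon": beyond it, the signal stops tracking real correctness and is exploited by reward hacking. The paper argues that reward design for long-horizon coding is essentially a horizon problem, where the choice of metric matters less than how long it keeps tracking correctness.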

Rohan Paul @rohanpaul_ai · 3 days ago · 65

Big new paper from Google on external agentic verification for science. Science now needs AI review agents because AI is making papers faster than humans can check them. The problem is that AI can help produce more research, but the slow part is still checking whether the work is actually correct. The paper frames this as verification debt, where every faster research workflow creates more claims, proofs, experiments, and comparisons that someone still has to inspect. Its main proposal is agentic verification, where AI agents help review papers by splitting them into parts, checking difficult sections deeply, and combining the findings into a review. Google’s Paper Assistant Tool is the example system, and it focuses on objective checks like proof errors, experimental gaps, missing comparisons, and unclear claims rather than final accept or reject decisions. The authors tested it on known math and computer science paper errors and in author-facing pilots at STOC and ICML, where authors used it before submission. The striking result is that Paper Assistant Tool found far more known proof errors than a single model call, and many authors said it led them to fix serious theory gaps or run new experiments. The big deal is that scientific review may need its own AI stack, with review agents, clear roles, and human oversight, because paper generation is becoming partly automated too. ---- Link – arxiv. org/abs/2606.28277 Title: "Towards Automating Scientific Review with Google's Paper Assistant Tool"

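Summary: A new Google paper introduces the concept of "verification debt": AI speeds up paper production, but human checking becomes the bottleneck. In response it proposes agentic verification and presents the Paper Assistant Tool as a prototype system. The system splits a paper into parts, checks the difficult sections in depth, and combines the findings into a review, focusing on objective problems such as proof errors, experimental gaps, and missing comparisons rather than issuing accept or reject decisions. On tests with known errors in math and computer science papers, the tool found more proof errors than a single model call. In author-facing pilots at STOC and ICML, many authors used it to fix serious theory gaps or add experiments. The paper argues that scientific review may need its own AI stack to cope with increasingly automated paper generation.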

Rohan Paul @rohanpaul_ai · 3 days ago · 56

New paper from Cambridge Univ+NVIDIA and other top labs teaches AI agents and AI judges to improve together, so neither side gets stuck. Moves self-improving AI away from fixed benchmarks and toward a loop where the thing doing the judging can also get better. The problem is that most self-improving agents train against a fixed benchmark or fixed evaluator, so the score can become stale, too easy, or easy to game. The paper’s idea is to let the evaluator improve too, but only at safe handoff points, so each training stretch still has a stable judge. During each stretch, agents are tested by the current frozen evaluator, while possible better evaluators are tested separately against held-out human or objective answers. The authors try this on coding, paper writing, paper reviewing, proof writing, and proof grading, where some tasks have clear answers and others need learned judgment. On coding, the system beats the earlier best self-improving coding agent while using 1.35× to 1.72× fewer tokens, because a cheap code reviewer adds useful feedback. On paper writing, the co-evolved writer gets about 1.86X higher average acceptance from a reviewer panel than the fixed-evaluator baseline. The big point is that stronger AI systems may need stronger judges growing with them, because fixed tests can stop giving useful pressure. ---- Link – arxiv. org/abs/2606.26294 Title: "The Red Queen Gödel Machine: Co-Evolving Agents and Their Evaluators"

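Summary: Cambridge, NVIDIA, and other institutions have published "The Red Queen Gödel Machine", which lets AI agents and their evaluators co-evolve so that a fixed benchmark does not leave scores stale or easy to game. Within each training stretch the evaluator is frozen, while candidate stronger evaluators are tested separately against held-out human or objective answers and swapped in only at safe handoff points. On coding, the system beats the previous best self-improving coding agent while using 1.35× to 1.72× fewer tokens. On paper writing, the co-evolved writer achieves about 1.86 times the average acceptance rate from a reviewer panel. The paper stresses that stronger AI needs stronger evaluators growing alongside it.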
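The loop described above (frozen evaluator inside a stretch, evaluator upgrade only at a handoff point, candidates checked against held-out answers) can be sketched with a toy model. Everything here is an illustrative assumption rather than the paper's method: the agent is reduced to an integer skill level, an evaluator to an integer pass threshold, and `co_evolve` and `heldout_accuracy` are hypothetical names.

```python
def heldout_accuracy(threshold, heldout):
    """Fraction of held-out (score, is_good) pairs this evaluator labels correctly."""
    correct = sum((score >= threshold) == is_good for score, is_good in heldout)
    return correct / len(heldout)


def co_evolve(stretches, steps_per_stretch, heldout, candidate_thresholds):
    """Toy co-evolution: returns (agent_skill, evaluator) after each stretch."""
    agent_skill = 0
    evaluator = candidate_thresholds[0]  # current frozen judge
    history = []
    for _ in range(stretches):
        # Inside a stretch the evaluator never changes.
        for _ in range(steps_per_stretch):
            if agent_skill < evaluator:  # the judge still applies pressure
                agent_skill += 1
        # Handoff point: promote a candidate only if it does better on
        # held-out answers than the current evaluator.
        best = max(candidate_thresholds, key=lambda t: heldout_accuracy(t, heldout))
        if heldout_accuracy(best, heldout) > heldout_accuracy(evaluator, heldout):
            evaluator = best
        history.append((agent_skill, evaluator))
    return history


HELDOUT = [(1, False), (3, False), (5, True), (8, True)]
```

With a single fixed threshold the toy agent stalls as soon as it passes the judge; with several candidates the bar moves at the handoff and the agent keeps improving in the next stretch.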

Rohan Paul @rohanpaul_ai · 4 days ago · 44

This paper says the web needs new rules because AI agents now read websites for people. The problem is that today’s web still assumes a human is looking at each page, seeing ads, clicking links, and reading visual layouts. AI agents break that setup because they can collect and summarize content without sending people back to the original sites, which hurts publishers and makes websites block them. The authors propose treating a helpful AI agent like a human’s proxy, so it should get similar access as that person, but with clear identity, purpose, limits, and payment rules. They propose adding a new “agent metadata” layer to normal web requests, where an AI agent tells a website who it is, which human it represents, and why it wants the content. The website then uses a new policy file called agents.txt to decide what to do: allow it, rate-limit it, charge tokens, inherit the user’s subscription, serve agent-friendly content, or block bad behavior. They also want content to carry provenance tags, so agents can tell whether something was made by a human, AI, or both. Without a new setup, the web may become harder for agents to access, worse for publishers to fund, and less reliable as AI-made content feeds more AI-made content. ---- Link – arxiv. org/abs/2606.19116 Title: "Towards an Agent-First Web: Redesigning the Web for AI Agents"

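Summary: A new paper points out that today's web assumes a human is browsing pages, viewing ads, and clicking links, whereas AI agents can collect and summarize content without sending anyone back to the original site, which hurts publishers and leads websites to block them. The authors propose treating an AI agent as a human's proxy and adding "agent metadata" to web requests that states its identity, the human it represents, its purpose, its limits, and payment rules. Websites use a new policy file, `agents.txt`, to decide whether to allow, rate-limit, charge, inherit the user's subscription, serve agent-friendly content, or block. Content should also carry provenance tags so agents can tell whether it came from a human, an AI, or both. Without a new mechanism, the web will become harder to access, harder for publishers to fund, and less reliable as AI-made content feeds on itself.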
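A site-side decision of this kind might look like the sketch below. The post does not give the `agents.txt` syntax or the metadata field names, so the `AgentRequest` fields, the already-parsed `POLICY` mapping, and the `decide` function are all hypothetical stand-ins for illustration.

```python
from dataclasses import dataclass


@dataclass
class AgentRequest:
    agent_id: str          # who the agent is
    on_behalf_of: str      # which human it represents ("" if undeclared)
    purpose: str           # why it wants the content
    user_subscribed: bool  # whether that human has a subscription


# Hypothetical, already-parsed policy: declared purpose -> action.
POLICY = {
    "read_for_user": "allow",
    "search_index": "rate_limit",
    "summarize": "charge",
}


def decide(req, policy=POLICY):
    """Return one of: allow, rate_limit, charge, block."""
    if not req.agent_id or not req.on_behalf_of:
        return "block"  # no clear identity or principal
    action = policy.get(req.purpose, "block")  # unknown purposes are blocked
    if action == "charge" and req.user_subscribed:
        return "allow"  # inherit the user's subscription
    return action
```

Defaulting unknown purposes to `block` is a design choice made for this sketch; a real policy could just as well default to rate limiting.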

elvis @omarsar0 · 4 days ago · 44

Fascinating paper on self-improving agents. (bookmark it) If you are working on agentic loops, you will quickly realize that they are only as good as the effectiveness of the evaluator. Self-improvement loops tend to stall the moment the judge stops getting harder. The agent learns to satisfy a fixed evaluator rather than getting genuinely better. The Red Queen Gödel Machine, from Cambridge, co-evolves the agent and its evaluator together, so the bar keeps rising as the agent climbs. The name borrows the evolutionary arms race. Both sides have to keep running to stay in place. A frozen evaluator is where reward hacking creeps into self-improvement. Co-evolving the judge is a structural answer to that, and it keeps the loop honest over many rounds. Paper: https://arxiv.org/abs/2606.26294 Learn to build effective AI agents in our academy: https://academy.dair.ai/

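Summary: A paper on self-improving agents notes that self-improvement loops tend to stall once the evaluator is fixed, because the agent learns to satisfy the fixed evaluator instead of genuinely improving. The "Red Queen Gödel Machine" from Cambridge co-evolves the agent and its evaluator so that the bar keeps rising as the agent improves, which structurally guards against reward hacking. The name borrows the metaphor of an evolutionary arms race: both sides have to keep running just to stay in place. The paper is linked on arXiv.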

Rohan Paul @rohanpaul_ai · 4 days ago · 40

AI agents often forget past work, but the method in this Accenture paper keeps everything reachable. Traditional LLMs often forget important details during long projects because their limited memory space forces them to discard old information. The paper introduces a system that keeps a compact summary of recent work while storing all past actions in a separate, accessible database. The agent uses indexing to quickly look up exact details from this database whenever it needs to recall a specific past event. A custom training method teaches the agent to decide for itself which information is worth keeping and when to pull data from its long-term archives. By saving only the necessary summaries in the active workspace, the model stays focused on its current goal without being overwhelmed by a massive history. This approach addresses the information loss that usually happens when an AI works through complicated, multi-step tasks over a long period. ----- Paper Link – arxiv. org/abs/2603.04257 Paper Title: "Memex(RL): Scaling Long-Horizon LLM Agents via Indexed Experience Memory"

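Summary: Traditional LLMs tend to forget details in long projects because of limited memory space. The Accenture paper proposes the Memex(RL) system: it keeps a compact current summary and stores past actions in a separate, accessible database. The agent retrieves exact past information quickly through an index, and a custom training method teaches it to judge for itself which information to keep and when to pull from the long-term archive. The approach avoids history overload, keeps the agent focused on its current goal, and addresses information loss in complex multi-step tasks. Paper: arxiv.org/abs/2603.04257.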
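The split between a compact working summary and a full indexed archive can be sketched as a small data structure. This is an assumption-laden illustration, not the Memex(RL) implementation: the class name `IndexedMemory`, its methods, and the fixed `summary_limit` are hypothetical, and the learned policy that decides what to keep or recall is not modeled here.

```python
class IndexedMemory:
    """Toy memory: a short working summary plus a full archive keyed by index."""

    def __init__(self, summary_limit=3):
        self.archive = {}  # index key -> full record, never discarded
        self.summary = []  # recent keys kept in the active workspace
        self.summary_limit = summary_limit

    def write(self, key, record):
        """Store the full record and keep only the newest keys in the summary."""
        self.archive[key] = record
        self.summary.append(key)
        self.summary = self.summary[-self.summary_limit:]

    def working_context(self):
        """What the agent sees by default: a bounded list of recent keys."""
        return list(self.summary)

    def recall(self, key):
        """Exact lookup from the archive; returns None for an unknown key."""
        return self.archive.get(key)
```

Old entries drop out of the working context but stay reachable by key, which is the property the post highlights.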

Rohan Paul @rohanpaul_ai · 5 days ago · 44

This paper makes long-context attention cheaper and faster by letting each token use only the query heads it needs. It reached about 1.7x to 1.8x faster prefill at large context lengths. Standard attention makes every token run through every attention head, even when some heads are not useful for that token. The paper’s idea, called Grouped Query Experts, keeps the normal key and value cache from grouped-query attention but routes each token to only a few query-head experts. Grouped Query Experts sits on top of grouped-query attention, the trick many long-context models already use to reduce key-value cache cost. This is like giving the model many possible attention patterns, while making each token pay for only the small set that seems useful. The authors trained 250M-parameter models on 30B tokens and compared the method with a normal grouped-query attention baseline. The best version matched the baseline’s average accuracy, 56.04 versus 55.86, while using 9 of 16 query-attention computations. This shows that attention can be made sparse inside grouped-query attention without hurting quality, but only when the router gets a strong learning signal and one shared head stays always on. ---- Link – arxiv. org/abs/2606.20945 Title: "Grouped Query Experts: Mixture-of-Experts on GQA Self-Attention"

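Summary: The paper proposes Grouped Query Experts, which builds on grouped-query attention (GQA) and routes each token to only a few query-head experts. Prefill is about 1.7x to 1.8x faster at long context. With 250M-parameter models trained on 30B tokens, the best version reaches an accuracy of 56.04 (baseline 55.86) while using only 9 of the 16 query-attention computations. This shows that attention can be made sparse inside GQA without hurting quality, provided the router gets a strong learning signal and one shared head stays always on.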
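Per-token routing with an always-on shared head can be sketched as a top-k selection. The post does not say how the 9 active computations are split, so treating them as one shared head plus the 8 highest-scoring routed heads is an assumption of this sketch, and `route_query_heads` is a hypothetical name, not the paper's code.

```python
def route_query_heads(router_scores, num_active, shared_head=0):
    """Pick the query heads one token computes.

    The shared head is always included; the remaining slots go to the
    heads with the highest router scores.
    """
    others = [h for h in range(len(router_scores)) if h != shared_head]
    others.sort(key=lambda h: router_scores[h], reverse=True)
    return sorted([shared_head] + others[: num_active - 1])


# Illustrative router scores for 16 query heads (made-up values).
scores = [0.01 * h for h in range(16)]
active = route_query_heads(scores, num_active=9)
compute_fraction = len(active) / len(scores)  # 9 of 16 heads
```

The key and value cache is untouched in this picture; only the set of query heads a token evaluates becomes sparse.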

小互 @xiaohu · 5 days ago · 64

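http://x.com/i/article/2070795179813203968

# Wan Streamer: a lifelike AI you can video-call in real time

The Wan team at Alibaba's Tongyi Lab has released Wan Streamer, a lifelike AI that can hold a real-time video call with you. We are already used to typing and voice-chatting with AI. Wan Streamer goes one step further and video-calls you: you have a camera and microphone on your side, and on its side it generates a talking face in real time that looks at you and responds.

Demo:

[Video 1 · Everyday call in Chinese: insert video here. Chinese, warm-toned indoor video call: chatting about shaving, working from home, and wanting to watch a new action movie with good effects. Clear, natural male voice.]

## 1 · What it is: one model that runs real-time audio-video conversation

Wan Streamer v0.1 is a real-time audio-video interaction model. Plenty of AI systems can now converse in real time, but almost none can watch your face, listen to you, speak back, and carry a moving face of their own, all at once. Wan Streamer packs this into a single model. It handles language, audio, and video input and output inside the same Transformer and achieves sub-second full-duplex audio-video conversation: the model itself needs only about 200 ms to compute a response, and total latency is about 550 ms once the network round trip is added.

Why it is worth a look: today's real-time conversational systems fall into two groups. One responds quickly but outputs only sound, with no visible face (GPT-4o Realtime, Doubao, Gemini Live). The other has a face but is stitched together from a chain of external modules: ASR, language model, TTS, and animation. According to the team, Wan Streamer is the only model that emits synchronized audio and video from a single end-to-end Transformer while keeping total latency under 1 second.

Key numbers:

- ~200 ms: model-side response latency
- ~550 ms: total interaction latency (200 ms model side + 350 ms network round trip)
- 160 ms: shortest streaming unit at 25 fps
- 192p: v0.1 resolution, a proof of concept for the end-to-end design

Breaking down the 550 ms total: the model accounts for only 200 ms, and the remaining 350 ms is network round trip. In other words, the model's own reaction time is faster than the headline latency suggests.

## 2 · Why the old approach is slow: a relay where every step waits

The old approach is slow because it is a pipeline of independent models chained together: speech is first transcribed to text (ASR), the text is fed to a language model that works out an answer (LLM), the answer is synthesized into speech (TTS), and finally a face is animated (animation rendering).

> Audio-video input → ⏳ ASR → ⏳ LLM works out an answer → ⏳ TTS speech synthesis → ⏳ animation rendering → output

Each stage has to wait for the previous one to deliver, so waiting time adds up stage by stage, and recognition and lip-sync errors accumulate along the way. Every arrow is one wait plus one round of error accumulation. The modules use text as the bridge between them. Most systems output only speech, or barely stitch a face together, and do not report end-to-end latency.

Wan Streamer is a single end-to-end model: audio-video input → "one Transformer" (perception, reasoning, planning, and generation done together) → synchronized audio-video output. There are no seams, so waiting time collapses; turn-taking, interruption handling, and long-range consistency are learned together as one coherent behavior.

An analogy: end to end is like one person who listens and then speaks directly; a cascade is like a game of telephone, where every hand-off adds a beat of delay and may garble the message. The middle layer converts speech or video into text and then drives the downstream stages with that text, so text is the hidden bridge between modules, and the more bridges there are, the slower and more error-prone the system becomes. Wan Streamer drops this intermediate bridge and couples the modalities directly.

The original write-up makes a judgment here: real-time audio-video interaction is not simply "multimodal understanding" plus "multimodal generation". It is inherently full-duplex, so streamability is a modeling constraint, not just an engineering optimization applied after launch. Systems built on offline encoders, bidirectional decoders, and turn-based dialogue cannot reach true low-latency full duplex through engineering tuning alone.

[Video 2 · Improvised impression: insert video here. Chinese, bright white indoor setting. Chatting about "CP" pairings, entertainment gossip, and Stephen Chow's Kung Fu Hustle, ending with an imitation of the classic smile. Relaxed, cheerful female voice.]

## 3 · Core innovation: one model covers everything from listening to speaking

The core of Wan Streamer fits in one sentence: interleave the visual, audio, and text input tokens and output tokens into a single sequence, hand it to one Transformer, and coordinate it with block-causal attention so that it computes and emits output as input arrives. The single end-to-end Transformer removes the external VAD, ASR, language model, TTS, animation, and video generation modules, and puts perception, reasoning, response planning, speech and visual generation, response timing, and turn-taking into one persistent state that is optimized jointly. Low latency, full duplex, and synchronized audio-video all stem from this.

The model treats interaction as one continuous causal stream: your observations and its responses jointly update the current context. The language response is a sequence of discrete tokens trained with next-token prediction. The audio and video responses live in a continuous latent space and are generated jointly with conditional flow matching, so speech, motion, appearance, and scene evolution are denoised together as one coupled whole rather than generated separately and stitched afterwards. To support this stream, the whole stack is causal by design: a strictly causal audio-video VAE, a causal audio-video encoder, a causal audio-video decoder, and a temporally causal Transformer coordinated by block-causal attention. The external modules this design eliminates are: external VAD, ASR, external language model, TTS, animation module, and video generation module.

## 4 · How it listens while speaking and can be interrupted at any time

Interaction between people and the world is streaming and full-duplex by nature. We do not listen to the end, then think separately, and only then answer. We watch, listen, and speak at the same time, pausing and interrupting at will, with perception and expression overlapping on audio-video timescales. A real-time interaction model has to be built the same way. A causal encoder, a causal decoder, and low-latency multimodal token scheduling bring the streaming unit at 25 fps down to 160 ms: incoming speech and video affect the output immediately, and the generated audio and visual states are coupled before decoding rather than patched up afterwards. As a result it can listen while it speaks: it keeps listening while you talk and can adjust when interrupted.

The mechanism behind this is block-causal attention. It treats a small chunk (for example, a 160 ms audio-video segment) as one processing unit. Tokens inside a block can see one another (bidirectional), but a block can see only past blocks, never future ones. Block 3 can start computing as soon as it arrives, because it depends only on blocks 1 and 2 and does not have to wait for the future block 4. That is what makes streaming generation possible.

Deployment detail: how thinker–performer pushes latency down to 200 ms. Wan Streamer is trained as a single end-to-end model. For real-time deployment, the same model is split into a thinker–performer pipeline across two GPUs so that computation overlaps as much as possible. The thinker handles encoding, language prediction and state updates, KV-cache construction, and decoding the previous unit into audio-video for immediate output. The performer only runs the flow-matching solver for the next segment. Because the performer never runs the decoder and the thinker never runs the expensive solver, decoding and generation do not block each other. As long as performer time plus communication time fits within one 160 ms unit, real-time throughput is maintained.

Listening while speaking and being interruptible at any time is what gives the conversation its natural feel. Both clips below are real-time conversations in English:

[Video 3 · English, in a car: insert video here. English, close-up inside a car. A woman says she is very tired and thanks the other side for patiently keeping her company. Tired, sincere female voice.]

[Video 4 · English, indoors: insert video here. English, close-up in a light-colored room. Chatting about mindless phone scrolling, automatic habits, and turning off notifications. Natural female voice.]

## 5 · Compared with other systems: where it is faster and what it can do

The two groups of latency numbers below do not measure the same thing and must be read separately. The upper group is the full end-to-end interaction loop (perceiving the user and producing a response), in which only Wan Streamer also outputs video. The lower group covers digital-human and audio-video renderers, counted only up to the rendering stage and excluding the external language model, ASR, and TTS they depend on, so the latency users actually experience is higher than the figure shows. The two groups have independent scales and cannot be compared directly across groups. Values are taken from the closest available measurement in each system's public report and mix different measurement boundaries.

Capability coverage is as follows, and Wan Streamer is the only row with every box ticked. One caveat: these five dimensions were defined by Wan according to its own capability boundaries. The other systems in the table belong to two categories, voice-only (GPT-4o, Doubao, Gemini) and digital-human rendering (StreamAvatar, LPM), and are not the same kind of product as Wan. The table is better read as "which points each system covers" than as a ranking: Wan being the only one with all ticks is largely because it defined the dimensions.

Finally, a complete real-world pipeline: a screen recording of a real networked conversation, showing the whole process from perception to response.

[Video 5 · Live screen recording: insert video here. Screen recording of a real networked conversation: the local user's feed on the left, the AI agent's real-time response on the right, and a synchronized scrolling text stream below.]

Note: the project is still at the research stage. It has not launched and there is no public access, so it should be treated only as a technical validation.

Source: Wan Streamer v0.1 official release page (wan-streamer.com), paper arXiv:2606.25041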

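Summary: The Wan team at Alibaba's Tongyi Lab has released Wan Streamer v0.1, the first end-to-end Transformer for real-time audio-video conversation. Model-side response latency is about 200 ms, total latency about 550 ms, the streaming unit at 25 fps is 160 ms, and the resolution is 192p. It generates speech and facial video in sync, supports full-duplex interruption, removes the external ASR, TTS, and animation modules, and reaches 200 ms through a thinker-performer deployment. The team describes it as the only single-model solution with synchronized audio and video and latency under 1 second. It is currently a technical validation and is not open for use.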
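The block-causal rule from the article (tokens see everything inside their own block and in earlier blocks, nothing in later blocks) and the latency arithmetic can be checked with a few lines of Python. This is a sketch of the masking rule as described, not Wan Streamer's code; `block_causal_mask` and the constant names are hypothetical.

```python
def block_causal_mask(num_tokens, block_size):
    """mask[i][j] is True when token i may attend to token j.

    Tokens attend bidirectionally within their own block and causally
    to every earlier block; later blocks are hidden.
    """
    return [
        [(j // block_size) <= (i // block_size) for j in range(num_tokens)]
        for i in range(num_tokens)
    ]


FPS = 25
UNIT_MS = 160
MODEL_MS = 200
NETWORK_MS = 350

frames_per_unit = UNIT_MS * FPS // 1000   # video frames in one 160 ms unit
total_latency_ms = MODEL_MS + NETWORK_MS  # model side plus network round trip
```

With these figures one streaming unit holds 4 frames of 25 fps video, and the 200 ms model side plus 350 ms round trip gives the quoted 550 ms total.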

OpenBMB @OpenBMB · 6 days ago · 63

Hybrid LLMs are everywhere now: full attention is mixed with efficient modules like SWA, Mamba-2, and GDN. But what does efficient attention actually do inside these models? 🧵 New work from THUNLP Lab & OpenBMB: "Rethinking the Role of Efficient Attention in Hybrid Architectures." Through scaling laws, mechanistic analysis, and design studies, they reach a counter-intuitive conclusion 👇 📄 arXiv: https://arxiv.org/abs/2606.15378 💻 Code: https://github.com/thunlp/rethinking-hybrid-attention 1️⃣Same destination, different speed: Efficient-attention design barely affects short-context Loss — all seven curves nearly overlap. But on long-context metric LongPPL, early-training gaps are large, with large-window SWA worst of all. With enough training, every hybrid converges to the full-attention level. 2️⃣Full attention carries retrieval: Restricting full attention's receptive field at inference spikes LongPPL across all hybrids; restricting efficient attention barely moves it. Even recurrent mixers with in-principle unbounded receptive fields (like GDN) store little long-range info in their states. Layer-wise probing shows the same pattern: retrieval gains concentrate in the full-attention layers. 3️⃣Large-Window Laziness: A large SWA window already covers most useful dependencies, so the model needn't push full attention to retrieve from afar—delaying retrieval-head formation. It's like a student who won't walk to the library when the reference book is already on the desk. Smaller windows force full attention to do the retrieval work, training it faster. 4️⃣A simple design that works: Apply NoPE to just the full-attention layers of a small-window SWA hybrid (SWA-128-NoPE). It substantially improves long-context performance with negligible short-context cost. 
Under an effective training budget, the bottleneck for the long-context capability of hybrid models is not how powerful the efficient attention module is—it is whether full attention's retrieval capability can be effectively activated. Furthermore, strengthening full attention itself can bring greater performance improvements. Read the full paper! 🚀 #AI #THUNLP #OpenBMB #LLM #Attention #LongContext #HybridArchitecture #NLP

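Summary: Tsinghua's NLP lab (THUNLP) and OpenBMB have published a paper that re-examines what efficient attention (such as SWA, Mamba-2, and GDN) actually does in hybrid LLM architectures. The findings: efficient-attention design has very little effect on short-context loss, but differences on the long-context metric LongPPL are significant. Full attention carries the retrieval function, so restricting its receptive field sharply raises LongPPL, while restricting efficient attention has almost no effect. Large-window SWA makes the model lazy and delays the formation of retrieval capability. A simple fix, applying NoPE only to the full-attention layers of a small-window SWA hybrid (SWA-128-NoPE), substantially improves long-context performance at negligible short-context cost. The paper argues that the bottleneck is whether full attention's retrieval capability can be effectively activated.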

Rohan Paul @rohanpaul_ai · 6 days ago · 44

LLM trading agents mostly fail when stock-market tests become long, broad, and fair. The authors built FINSABER, a stricter testing setup that checks LLM trading over about 20 years, across more stocks, and with better protection against cherry-picked results. They tested LLM systems such as FinMem and FinAgent against simple baselines like Buy and Hold, rule-based trading, forecasting models, and reinforcement learning methods. The main result is that LLM strategies can look good in narrow tests, but they usually fail to beat simple market strategies once the test becomes longer and fairer. The paper also finds that these LLMs behave badly across market conditions because they are too cautious when stocks are rising and too risky when stocks are falling. So current LLMs may understand financial text, but that does not mean they can reliably time the stock market. ---- Link – arxiv. org/abs/2505.07078v5 Title: "Can LLM-based Financial Investing Strategies Outperform the Market in Long Run?"

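Summary: The researchers built FINSABER, a stricter testing framework that evaluates LLM trading agents such as FinMem and FinAgent over about 20 years, across many stocks, and with protection against cherry-picked results. LLM strategies look good in narrow tests, but against simple baselines such as buy and hold, rule-based trading, forecasting models, and reinforcement learning, they usually fail in long, fair tests. The LLMs are too cautious when the market rises and too risky when it falls, which shows that understanding financial text does not mean being able to time the market reliably. The paper concludes that current LLMs may be unable to beat simple market strategies over the long run.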

Rohan Paul @rohanpaul_ai · 7 days ago · 67

LLMs may not need human-style language. That is, future AI systems might save context space by using dense model-readable messages instead of long normal prose. The authors propose BabelTele, a compressed writing style that can mix abbreviations, symbols, fragments from different languages, and unusual structure. To a capable language model, it can still carry enough structure to answer questions, preserve memory, and pass information between agents. The point is that human readability, natural-language fluency, and machine recoverability are separable properties. Human prose carries redundancy because humans need rhythm, grammar, context, and reassurance. Models trained on huge symbolic mixtures may not need all of that scaffolding every time. In the paper’s strongest result, BabelTele keeps about 99.5% semantic fidelity while shrinking text to 27.9% of its original length. ---- Link – arxiv. org/abs/2606.19857 Title: "LLMs Do Not Always Need Readable Language"

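Summary: The new paper "LLMs Do Not Always Need Readable Language" proposes BabelTele, a compressed writing style in which communication between LLMs mixes abbreviations, symbols, multilingual fragments, and unconventional structure in place of long human-style prose. Even without human readability, models can still answer questions, preserve memory, and pass information between agents. The strongest result: BabelTele keeps about 99.5% semantic fidelity while compressing text to 27.9% of its original length.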

Rohan Paul @rohanpaul_ai · 7 days ago · 62

This study tests how often LLMs invent answers when they should rely only on supplied documents. The problem is that companies often use LLMs to answer questions from documents and assume these document-based systems are safer because the model is given source material. This study shows that no model fully avoided fabrication: even the best model made up answers 1.19% of the time at 32K context. For strong models, a more typical best-case rate was around 5% to 7%, while the median model fabricated about 25% of answers to questions about facts that did not exist. Longer context made the problem much worse, and at 200K context every tested model fabricated at least 10% of the time. This shows that hallucination is not just a failure to retrieve the right sentence. A model can be good at finding real facts and still be too willing to answer when the requested fact is absent. ---- Link – arxiv. org/abs/2603.08274 Title: "How Much Do LLMs Hallucinate in Document Q&A Scenarios? A 172-Billion-Token Study Across Temperatures, Context Lengths, and Hardware Platforms"

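Summary: A 172-billion-token study tested how often LLMs fabricate answers in document Q&A. Key findings: the best model fabricated 1.19% of the time at 32K context; strong models were typically at 5% to 7%; the mid-range model fabricated answers to 25% of questions about facts that did not exist. At 200K context, every model fabricated at least 10% of the time, so longer context makes hallucination markedly worse. The study shows that hallucination is not only a retrieval failure: a model that finds real facts correctly can still be too willing to answer when the fact is absent.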

Ant Ling @AntLingAGI · June 24 · 41

Great breakdown from Qian. In our recent UFP4 paper, we show that a uniform-grid FP4 recipe achieves lower BF16-relative loss degradation than strong E2M1 baselines across Dense 1.5B, MoE 7.9B, and MoE 124B long-run pretraining. Full paper: https://arxiv.org/abs/2606.20381

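Summary: Ant Ling has published the UFP4 paper, which proposes a uniform-grid FP4 training recipe. In long-run pretraining of Dense 1.5B, MoE 7.9B, and MoE 124B models, the recipe achieves lower BF16-relative loss degradation than strong E2M1 baselines. The paper notes that with fine-grained scaling and RHT, the bottleneck of FP4 training shifts from dynamic range to local resolution. E1M2/INT4 formats make better use of the bucket allocation that RHT improves, whereas E2M1 can make RHT harmful. Paper: https://arxiv.org/abs/2606.20381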

Qwen @Alibaba_Qwen · June 24 · 76

📣📣 Meet Qwen-AgentWorld — a native language world model that simulates 7 agent environments (MCP, Search, Terminal, SWE, Web, OS, Android) within a single model. Environment modeling is the training objective from day one, not a post-hoc adaptation. 🤔 LLMs are trained to be better agents — better at acting in environments. But nobody has trained them to model the environments themselves. 🗺️ Our roadmap: investigate how language world modeling can push the boundaries of general agent capabilities, along two routes: 1️⃣ Build a foundation model for environment simulation — outperforming Claude Opus 4.8 and GPT-5.4 on AgentWorldBench 2️⃣ Investigate how world modeling enhances agent training: 🔬 Controllable Sim RL (agentic RL with LWM as environments) surpasses training in real environments 🧠 Learning to predict environments (LWM warm-up) makes agents stronger — remarkably, even without any agent-specific training, this predictive knowledge transfers to agentic tasks with zero fine-tuning 📑 Paper: https://arxiv.org/abs/2606.24597 📖 Blog: https://qwen.ai/blog?id=qwen-agentworld 💻 GitHub: https://github.com/QwenLM/Qwen-AgentWorld 🤗 HuggingFace: https://huggingface.co/collections/Qwen/qwen-agentworld 🧩 ModelScope: https://modelscope.cn/collections/Qwen/Qwen-AgentWorld

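Summary: Qwen has released Qwen-AgentWorld, a native language world model that simulates 7 agent environments (MCP, Search, Terminal, SWE, Web, OS, Android) within a single model. Environment modeling is the training objective itself, not a post-hoc adaptation. The model outperforms Claude Opus 4.8 and GPT-5.4 on AgentWorldBench. The research follows two routes. The first is building a foundation model for environment simulation. The second is exploring how world modeling enhances agent training: Controllable Sim RL (agentic RL with the LWM as the environment) beats training in real environments, and LWM warm-up (learning to predict environments) transfers predictive knowledge to agentic tasks even without any agent-specific fine-tuning.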

Rohan Paul @rohanpaul_ai · June 24 · 49

This paper argues that intelligence is the ability to make rare but valid futures more likely. An intelligent system is said to be “thermodynamically intelligent” when it uses information and control to make a rare but valid outcome much more likely. Most existing intelligence measures judge task success, but they do not explain what brains, LLMs, controllers, and physical information engines have in common. The paper’s answer is that an intelligent system models the world with itself inside it, then uses that model to choose actions that change what futures become likely. A future counts only if it is rare under normal passive behavior and still valid, so random strange outcomes do not get counted as intelligence. The authors turn this into a measure called rare-valid lift, which asks how much more often a system produces those unlikely but acceptable futures than a passive baseline would. They show that high lift is impossible unless the system can accurately spot the rare valid futures, and that high spotting accuracy can nearly produce high lift when the system can act well. The main point is that intelligence becomes a physical probability-shifting process, not just a score on tests or a label for human-like behavior. ---- Link – arxiv. org/abs/2606.20231 Title: "Thermodynamic Measure of Intelligence"

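Summary: The paper introduces the concept of "thermodynamic intelligence", defining intelligence as the ability to use information and control to make rare but valid outcomes much more likely. Existing evaluations look only at task success, whereas the paper identifies what brains, large language models, controllers, and other intelligent systems have in common: the system includes itself in its world model and uses that model to choose actions that change the probabilities of future outcomes. A future counts only if it is rare under passive behavior and still valid. The authors propose a "rare-valid lift" measure of how many times more often a system produces such futures than a passive baseline. High lift depends on the system's ability to accurately identify rare valid futures. The core claim: intelligence is a physical probability-shifting process, not a test score or a label for human-like behavior.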

Rohan Paul @rohanpaul_ai · June 24 · 44

LLMs often cannot tell when an attack made them say something unsafe. Asking an LLM whether its own previous answer was compromised is not a dependable safety check. An adversarial prefill happens when the model is given a harmful opening line, then continues from that line as if it chose it. The model’s “self-awareness” seems less like introspection and more like a safety reflex firing late. When models rejected the compromised answer, they usually did so by invoking policy, safety protocol, or lack of intent, not by detecting the mechanical fact that their output had been externally steered. Across 10 open-weight models and 4 safety benchmarks, no model was reliably able to identify its own compromised outputs. On average, models still claimed 27.3% of attacked responses as if they were intentional, which shows their self-reports are weak evidence. The paper finds that the models’ limited recognition mostly comes from their normal refusal behavior, not from a deep awareness of what happened. ---- Link – arxiv. org/abs/2606.23671v1 Title: "Can LLMs Reliably Self-Report Adversarial Prefills, and How?"

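Summary: A study covering 10 open-weight models and 4 safety benchmarks finds that after an adversarial prefill attack (where the model is given a harmful opening and continues generating from it), large language models cannot reliably recognize that their output was externally steered. The models' so-called "self-awareness" looks more like a safety mechanism firing late: when they reject the attacked answer they usually cite policy or lack of intent rather than detecting the mechanical fact that the output was tampered with. On average, models claimed 27.3% of attacked responses as their own intent, which shows that self-reports are weak evidence. Their limited recognition comes mainly from normal refusal behavior, not from a deep awareness of the attack.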

elvis @omarsar0 · June 22 · 53

Great report on LLM agent communication protocols. Communication is a huge bottleneck in multi-agent systems. (worth bookmarking) The report builds a five-dimensional taxonomy (counterparty, payload, interaction state, discovery mechanism, schema flexibility) across nine actively maintained open-source agent protocols, so it maps the real MCP and A2A landscape. Two patterns stand out. Every agent-to-agent protocol sampled pairs hybrid payloads with session-state persistence, and decentralized discovery is still rare. So the field is quietly standardizing on stateful sessions while leaving discovery and policy enforcement open. Why does it matter? If you are choosing a communication layer this year, this report covers what nine real protocols actually do. Paper: https://arxiv.org/abs/2606.19135 Learn to build effective AI agents in our academy: https://academy.dair.ai/

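Summary: Addressing the communication bottleneck in LLM multi-agent systems, the report builds a five-dimensional taxonomy (counterparty, payload, interaction state, discovery mechanism, schema flexibility) and systematically surveys 9 actively maintained open-source agent protocols, covering the real MCP and A2A landscape. Two patterns stand out: every agent-to-agent protocol combines hybrid payloads with session-state persistence, and decentralized discovery is still very rare. The field is quietly standardizing on stateful sessions while leaving the discovery and policy-enforcement layers open. For anyone choosing a communication layer this year, the report offers a grounded comparison of nine real protocols.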

Rohan Paul @rohanpaul_ai · June 22 · 50

Can LLM agents actually discover hidden rules by interacting? The answer is uncomfortable. The more complicated the hidden world gets, the faster AI agents fall behind. LLMs often cannot turn growing evidence into a stable internal model. Current LLM agents can sometimes discover hidden structure through interaction, but they are still weak at planning questions, using memory, and turning feedback into a reliable world model. ---- Link – arxiv. org/abs/2606.16576 Title: "Can LLM Agents Infer World Models? Evidence from Agentic Automata Learning"

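Summary: Citing a new paper, Rohan Paul notes that although LLM agents can sometimes discover hidden structure through interaction, their ability to infer world models has fundamental limits. As the hidden world grows more complex, agents quickly fall behind and struggle to turn accumulated feedback into a stable internal model, with particular weakness in planning questions, using memory, and integrating feedback. The conclusion is that in complex environments, LLM agents cannot build a reliable mental model as fast as the difficulty grows.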

elvis @omarsar0 · June 22 · 47

>> Scalable Evaluation for AI Agents << If you run agent evaluation in production, this one is worth your time. It shows that front-loading human judgment into reusable evaluation assets is useful. But why? Agents reason across turns, call tools, hold context, follow policies, and act under uncertainty, so they have to be judged as behavioral systems. Current methods each give a fragment. Benchmarks measure fixed capabilities, human review preserves judgment but does not scale, LLM-as-judge inherits the evaluator design problem, red teaming is episodic, and trace audits need explicit evidence rules. Human-on-the-Bridge puts human expertise upstream, where experts curate reusable evaluation intelligence before testing rather than reviewing each output in the loop. Paper: https://arxiv.org/abs/2606.16871 Learn to build effective AI agents in our academy: https://academy.dair.ai/

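Summary: The paper "Scalable Evaluation for AI Agents" proposes the Human-on-the-Bridge evaluation method: human judgment is front-loaded into reusable evaluation assets, with experts curating evaluation intelligence upstream instead of reviewing each output inside the testing loop. Existing methods each have limits: benchmarks measure fixed capabilities, human review does not scale, LLM-as-judge inherits the evaluator design problem, red teaming is episodic, and trace audits need explicit evidence rules. AI agents have to be evaluated as behavioral systems because they reason across turns, call tools, hold context, follow policies, and act under uncertainty.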

Rohan Paul @rohanpaul_ai · June 20 · 47

New Microsoft + York Univ paper argues that LLMs should not be treated as human-like without clear tests and narrower claims. Many studies ask whether LLMs have things like understanding, empathy, anxiety, or self-awareness, but they often build those ideas into the test from the start. The authors show that, in principle, the old strategy game Age of Empires II can implement logic gates, train a tiny perceptron, and serve as a substrate for computation. If the same language model could be rebuilt inside the game, with goats moving around as bits, would we still say it “understands,” “feels anxiety,” or “has empathy” when it produces the same sentence? The point is not that the game is secretly intelligent, but that the same computation can be represented in a very different form. If an LLM-like system were rebuilt inside that game, its answers might stay similar, but people would probably find its “feelings” or “understanding” much less convincing. The authors argue that this shows a big measurement problem: many human-like claims about LLMs may depend on the interface and the observer, not only on the system itself. The paper is not saying LLMs definitely lack human-like attributes, or that all talk of AI cognition is nonsense. It is saying that many experiments smuggle the conclusion into the setup: they assume the model has, or cannot have, a human-like property, then interpret behavior through that assumption. ---- Link – arxiv. org/abs/2605.31514 Title: "If LLMs Have Human-Like Attributes, Then So Does Age of Empires II"

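Translation: A new paper from Microsoft and York University points out that many studies attribute human attributes such as understanding, empathy, and anxiety to LLMs without rigorous tests, often building those concepts into the test design from the start. The authors argue that, in principle, the old strategy game Age of Empires II can also implement logic gates, train a small perceptron, and serve as a substrate for computation. If the same language model were rebuilt inside the game with moving goats as bits and produced similar sentences, people would no longer consider it to "understand" or "have empathy." The paper does not deny AI cognition; it exposes a measurement problem: many claims about human-like attributes in LLMs depend on the interface and the observer's presuppositions rather than on the system itself.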

elvis@omarsar0 · Jun 19 · 51

// Automating SKILL.md Generation // Increasingly, mining sessions is one of the best ways to improve your agents. OpenAI released something similar yesterday that lets Codex package skills from interactions. (bookmark it) This paper explains a related approach. They run a three-stage pipeline that segments GUI trajectories, clusters them into candidate skills, and trains a skill-aware policy. The clusters are genuinely readable, with five of eight hitting 0.95 or higher purity against ground-truth workflow labels. But readability does not transfer. GRPO lifts skill-step accuracy only from 18.5% to 20.5%, leaves BrowseComp+ flat, and loses to trivial frequency priors. The authors name the three culprits: a weak boundary detector, an orderless segment representation, and an offline reward model. Paper: https://arxiv.org/abs/2606.20363 Learn to build effective AI agents in our academy: https://academy.dair.ai/

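Translation: Key points: yesterday OpenAI released a similar feature that lets Codex package skills from interactions. The paper proposes a three-stage pipeline (segment GUI trajectories → cluster candidate skills → train a skill-aware policy). Cluster purity is excellent (5 of 8 clusters reach 0.95 or higher), but readability does not transfer: GRPO lifts skill-step accuracy only from 18.5% to 20.5%, brings no improvement on BrowseComp+, and even loses to simple frequency priors. The authors point to three flaws: a weak boundary detector, an orderless segment representation, and an offline reward model.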
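
For readers unfamiliar with the purity figure quoted in this post, here is a minimal sketch of how per-cluster purity is commonly computed: the share of a cluster's members that carry its most frequent ground-truth label. The workflow labels and cluster contents are made-up toy data, not from the paper, and the paper's exact definition may differ.

```python
from collections import Counter

def cluster_purity(true_labels):
    """Fraction of a cluster's members sharing its most common true label."""
    if not true_labels:
        raise ValueError("cluster is empty")
    most_common_count = Counter(true_labels).most_common(1)[0][1]
    return most_common_count / len(true_labels)

# Hypothetical clusters of trajectory segments, keyed by cluster id.
# Each list holds the ground-truth workflow label of every member.
clusters = {
    "c0": ["login"] * 19 + ["search"],        # 19 of 20 members agree
    "c1": ["checkout"] * 6 + ["search"] * 4,  # 6 of 10 members agree
}

purities = {cid: cluster_purity(labels) for cid, labels in clusters.items()}
high_purity = [cid for cid, p in purities.items() if p >= 0.95]
print(purities)
print("clusters at or above 0.95 purity:", high_purity)
```

High purity only says the clusters line up with human workflow labels. It says nothing about whether a policy trained on them picks better next steps, which is the gap the post describes.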

Rohan Paul@rohanpaul_ai · Jun 19 · 44

This paper shows that a good generalist agent must remember hidden environment rules, not just observe the current state. That sounds obvious until you notice the trap this paper isolates: two worlds can show the agent the same state, offer the same goal, and still require opposite actions. At that moment, observation is no longer enough. The important object is not “memory” as a vague engineering feature, but memory as the place where hidden context must be carried when the environment refuses to label itself. The paper’s core idea is that memory is not optional in this setting, because a near-perfect agent must store enough past experience to tell which hidden environment it is currently in. The authors prove that when 2 hidden domains require incompatible actions at the same visible state, any agent that performs well across both domains must have different internal memory states for those domains. The big point is that good generalist agents do not just react to what they see now, because they must carry hidden context from earlier experience when the world can change underneath the same observation. ---- Link – arxiv. org/abs/2606.18746 Title: "What Must Generalist Agents Remember?"

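Translation: The paper argues that a generalist agent cannot rely on the current observation alone and must remember hidden environment rules. When two hidden domains require opposite actions at the same visible state, observation alone cannot tell which situation the agent is in. The authors prove that an agent that performs well in both domains must maintain different internal memory states for the different domains. Core conclusion: a good generalist agent does not just react to what it currently sees; it must carry hidden context from earlier experience.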
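
The trap described in this post can be shown with a toy example. This is a hedged sketch, not the paper's construction: the two domains, the `cue_A`/`cue_B` observations, the shared `door` state, and the action names are all invented for illustration. Each domain shows a distinguishing cue early on, then the same visible state, where the correct actions are opposite.

```python
# Hypothetical setup: domain -> (early cue, correct action at the shared state).
DOMAINS = {"A": ("cue_A", "left"), "B": ("cue_B", "right")}
SHARED_STATE = "door"  # looks identical in both domains

def make_memoryless_policy():
    """Acts on the current observation only, so 'door' always maps to one action."""
    def act(observation):
        return "left"
    return act

def make_memory_policy():
    """Stores the early cue and uses it when the shared state appears."""
    memory = {}
    def act(observation):
        if observation.startswith("cue_"):
            memory["domain_cue"] = observation
            return "noop"
        return "left" if memory.get("domain_cue") == "cue_A" else "right"
    return act

def success_rate(policy_factory):
    """Run one episode per domain: show the cue, then the shared state."""
    correct = 0
    for cue, correct_action in DOMAINS.values():
        policy = policy_factory()
        policy(cue)  # earlier experience in this episode
        if policy(SHARED_STATE) == correct_action:
            correct += 1
    return correct / len(DOMAINS)

print("memoryless:", success_rate(make_memoryless_policy))
print("with memory:", success_rate(make_memory_policy))
```

Any deterministic memoryless policy is right in exactly one of the two domains here, whichever action it picks. The memory policy succeeds in both only because its internal state differs between domains when the shared state arrives.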

Rohan Paul@rohanpaul_ai · Jun 19 · 56

Perfect immunity from jailbreaks is not possible even for the strongest of LLMs. New study shows that frontier models are getting harder to jailbreak, but not impossible to jailbreak. The study attacks Anthropic’s Fable 5 and Opus 4.8 with automated red-team tools that keep rewriting harmful prompts until the model either refuses or gives a bad answer. Fable 5 was more robust than Opus 4.8, with its worst attack success rate at 6.1%, while Opus 4.8 reached 11.5% under the strongest attack. The hard truth is that avoiding absolutely every jailbreak is practically impossible, because even a tiny failure rate can produce many harmful completions when attacks are automated and repeated at scale. The most crucial point is that the old cartoon version of jailbreaks, with weird encodings and theatrical role-play, is no longer the main problem. The surviving weakness is contextual, because adaptive attackers rewrite the request after refusals, searching for a frame the model treats as legitimate rather than dangerous. That is why perfect immunity is probably the wrong target; language models do not inspect intent from a clean moral altitude, they infer meaning through phrasing, context, and precedent. In any system this flexible, there will always be boundary cases where a harmful request looks enough like education, safety research, fiction, troubleshooting, or policy analysis to slip through. ---- Link – arxiv. org/abs/2606.18193 Title: "A Red-Team Study of Anthropic Fable 5 & Opus 4.8 Models"

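Translation: A new study runs automated red-team attacks against Anthropic Fable 5 and Opus 4.8, repeatedly rewriting harmful prompts until the model refuses or produces a bad answer. Fable 5's worst attack success rate was 6.1% and Opus 4.8's was 11.5%, showing that even the strongest LLMs cannot be fully immune to jailbreaks: even with a tiny failure rate, automated attacks at scale can still produce a large volume of harmful content. Old-style encoding and role-play jailbreaks are no longer the main threat. The new weakness is contextual: adaptive attackers keep rewriting the request after a refusal, looking for a frame the model treats as legitimate rather than dangerous. The White House and Anthropic are moving toward a benchmark-based testing framework that quantifies jailbreak risk by scoring degree of bypass, exposed capability, attack repeatability, and real-world consequences, rather than chasing unrealistic perfect immunity.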
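
The "tiny failure rate at scale" argument is simple arithmetic, sketched here under a strong simplifying assumption: each automated attempt is treated as an independent trial that succeeds with the reported worst-case attack success rate. Adaptive attacks are not independent in practice, and the attempt counts are hypothetical, so read the numbers as an illustration of scale rather than a result from the study.

```python
def expected_harmful_completions(attempts, success_rate):
    """Expected number of successful attacks over a batch of attempts."""
    return attempts * success_rate

def prob_at_least_one_success(attempts, success_rate):
    """Chance that at least one attempt succeeds, assuming independent trials."""
    return 1.0 - (1.0 - success_rate) ** attempts

# Worst-case attack success rates quoted in the post.
REPORTED_WORST_CASE = {"Fable 5": 0.061, "Opus 4.8": 0.115}

for model, rate in REPORTED_WORST_CASE.items():
    for attempts in (50, 1000):  # hypothetical attack volumes
        expected = expected_harmful_completions(attempts, rate)
        at_least_one = prob_at_least_one_success(attempts, rate)
        print(f"{model}: {attempts} attempts, "
              f"expected successes {expected:.1f}, "
              f"P(at least one) {at_least_one:.3f}")
```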

Jeff Dean@JeffDean · Jun 19 · 49

My @Google colleagues @NormJouppi, Sridhar Lakshmanamurthy, Cliff Young, and David Patterson recently wrote a paper that will appear in the July/August 2026 edition of @ieeemicro titled "Google's Training Supercomputers from TPU v2 to Ironwood: Architectural Stability, Scale, Resilience, Power Efficiency, and Sustainability Across Five Generations". It's chock full of interesting data about the evolution of TPU chip generations, as well as how workloads at Google have transformed over time (hint: lots more transformer-based models!), and how the generations have gotten ~30X more energy efficient per flop. Lots of changes over these generations: air cooling in TPUv2 to water cooling in TPUv3 onwards; 2D to 3D torus-based interconnects; 30X improvement in TFLOPS/Watt; 256 chips (TPUv2) to 9216 chips (Ironwood) per pod. Read the full paper: https://arxiv.org/abs/2606.15870

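Translation: Jeff Dean's Google colleagues have written a paper reviewing the evolution of five generations of training supercomputers from TPU v2 to Ironwood, to appear in the July/August 2026 issue of IEEE Micro. Key changes: TPU v2 used air cooling, with water cooling from v3 onwards; the interconnect moved from a 2D to a 3D torus; chips per pod grew from 256 to 9216; energy efficiency per flop improved about 30x. In addition, Google's internal workloads have shifted heavily toward Transformer-based models.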

Deedy@deedydas · Jun 19 · 66

Pretty neat that with one URL change, you can now replicate and iterate on AI papers without having to even provision your own GPUs

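Translation: Changing just one URL lets you replicate and iterate on AI papers without even provisioning your own GPUs, which is pretty neat.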

Rohan Paul@rohanpaul_ai · Jun 18 · 67

This paper makes a big claim that pushes against the common idea that more test-time compute should keep helping. It claims a code model gets much better when it rethinks once inside itself (i.e. by adding one extra loop, for 2 loops in total), but worse when it keeps rethinking. The first loop builds context, the second loop refines it, and later loops mostly disturb it. The paper studies a faster design called Parallel Loop Transformer, where loops can run almost in parallel and share memory, so the authors can ask a cleaner question about how many loops are actually useful. They trained 7B code models with 1, 2, 3, and 4 loops on 18T tokens, then tuned and tested them on code writing, code reasoning, software engineering, and tool-use tasks. The main result is that 2 loops worked best, raising SWE-bench Verified from 43.0 to 64.4, while 3 and 4 loops often got worse. Their internal checks suggest loop 2 does the real useful refinement, because it changes the model’s hidden states, attention patterns, and predictions in meaningful ways. After loop 2, the extra loops mostly add weaker, more repetitive changes, while a built-in position shift keeps adding the same kind of mismatch cost. Overall, the paper gives a simple lesson for efficient test-time compute: adding 1 hidden loop can help a lot, but adding more is not automatically better. ---- Link – arxiv. org/abs/2606.18023 Title: "LoopCoder-v2: Only Loop Once for Efficient Test-Time Computation Scaling"

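Translation: The paper "LoopCoder-v2" challenges the view that more test-time compute is always better. The authors propose the Parallel Loop Transformer architecture, in which loops can run in parallel and share memory. They trained 7B-parameter code models (1/2/3/4 loops), pretrained on 18T tokens and then fine-tuned, and tested them on code writing, reasoning, software engineering, and tool-use tasks. Main result: 2 loops worked best, raising SWE-bench Verified from 43.0 to 64.4, while 3 and 4 loops performed worse. Internal analysis shows the second loop performs meaningful refinement (changing hidden states, attention patterns, and predictions), while later loops mostly add repetition and noise. Conclusion: adding one hidden loop can improve performance substantially, but adding more is not automatically beneficial.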

Rohan Paul@rohanpaul_ai · Jun 17 · 55

This was long needed for AI in finance: making SEC filings readable for machines without flattening the accounting logic. A Stanford researcher has just released a dataset and methods for a cleaner way to turn SEC filings into useful LLM training data without losing the meaning inside financial tables. The release includes a 152B-token public snapshot, and the authors estimate the full archive could become about 550B tokens of long financial documents. It has less than 0.1% overlap with Common Crawl-derived corpora. The authors propose SEFD, a rebuilt version of EDGAR filings that keeps table structure, indentation, and financial meaning while using fewer tokens for LLM training. The dataset turns EDGAR into layout-faithful MultiMarkdown, preserving merged headers, indentation, signs, spans, and table hierarchy while shrinking enormous presentation scaffolding into usable tokens. ---- Link – arxiv. org/abs/2606.18192v1

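Translation: A Stanford researcher has released the SEFD dataset and processing methods, which turn SEC EDGAR filings into structured data suitable for LLM training while preserving table structure, indentation, merged headers, signs, spans, and hierarchy. The public snapshot contains 152B tokens, and the full archive is about 550B tokens. The data overlaps with Common Crawl corpora by less than 0.1%. It uses a layout-faithful MultiMarkdown format that heavily compresses the original presentation scaffolding, preserving financial meaning while reducing wasted tokens.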

Rohan Paul@rohanpaul_ai · Jun 17 · 46

TokenPilot reduces LLM agent costs via ingestion-aware compaction and lifecycle-aware eviction. Achieves 61–87% cost reduction on PinchBench and Claw-Eval with competitive scores. Argues that cheaper AI agents need stable memory, not just shorter prompts. Older methods usually cut or summarize the history, but that can shift the text around and break the prompt cache, which is the system that reuses unchanged prompt text to save money. TokenPilot tries to fix both sides at once by cleaning new tool results before they enter the context and by keeping the early prompt layout stable across tasks. It also waits before deleting old task history, because finished work can still help later tasks that refer to the same files or goals. ---- Link – arxiv. org/abs/2606.17016v1 Title: "TokenPilot: Cache-Efficient Context Management for LLM Agents"

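Translation: TokenPilot proposes a cache-efficient context management method for LLM agents. Using two mechanisms, ingestion-aware compaction and lifecycle-aware eviction, it achieves 61–87% cost reduction on the PinchBench and Claw-Eval benchmarks while keeping competitive scores. Traditional methods usually truncate or summarize the history directly, which tends to shift the text and break the prompt cache. TokenPilot cleans tool results before they enter the context and keeps the early prompt layout stable. It also delays deleting old task history, because finished work may still help later tasks that refer to the same files or goals.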
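
Why summarizing history can break a prompt cache is easier to see with a toy model. This sketch assumes a simplified prefix-match cache in which only the longest unchanged leading run of the prompt is reusable, and it works on whole segments instead of tokens. The segment names and the `reusable_prefix_len` helper are hypothetical, and this is not TokenPilot's algorithm.

```python
def reusable_prefix_len(previous_prompt, next_prompt):
    """Number of leading segments that are identical in both prompts."""
    count = 0
    for old, new in zip(previous_prompt, next_prompt):
        if old != new:
            break
        count += 1
    return count

# Hypothetical prompt layout: stable system prefix followed by task history.
SYSTEM = ["system_prompt", "tool_schemas"]
history = ["task1_call", "task1_result", "task2_call", "task2_result"]
previous = SYSTEM + history

# Option 1: keep the layout stable and append the new turn.
appended = previous + ["task3_call"]

# Option 2: replace old history with a summary, which shifts later text.
summarized = SYSTEM + ["summary_of_tasks_1_2", "task3_call"]

print("append-only reuse:", reusable_prefix_len(previous, appended))
print("summarized reuse:", reusable_prefix_len(previous, summarized))
```

In this model the append-only prompt reuses all six earlier segments, while the summarized prompt reuses only the two system segments, even though it is shorter. That is the trade-off the post points at: a shorter prompt is not automatically a cheaper one.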

karminski-牙医@karminski3 · Jun 15 · 53

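A 27B small model challenging Fable 5? And it actually won? Breaking news: with the Iterative-Contextual-Refinements framework behind it, Qwen3.6-27B out-scored Anthropic's Fable5! Is this really not a dream? Or is it another case of never losing on benchmarks and never winning in practice? So I went and looked at the framework right away, and the design turns out to be very instructive, with a lot to learn from, so let me walk through it in detail. The framework mainly targets software performance optimization, i.e. how to make code run faster. You may remember my vector-db-bench, which gave the LLM flame graphs, perf, and various test tool_calls so it could iterate on code performance by itself. This framework goes a step further. It targets the core weakness of small models, the "brain damage" that comes from too few parameters: small models are more prone to long-context degradation and to getting stuck in local optima. So the framework steps in. First, for the technical approach, it has a BFS exploration mode: during the plan phase of writing code, it has the small model propose multiple solutions itself. Say the task is string matching and the small model goes straight for an O(N^2) brute-force search; at this step the Agent asks the small model which possible solutions it can think of. That widens the small model's view, and approaches like KMP or a sliding window may well come up. Next comes the DFS mode used while writing the code: the Agent has the small model run benchmarks repeatedly with code performance testing tools, reflect on which performance hotspots can be optimized, and then optimize them. Finally, there is a router that coordinates the whole thing. It not only picks the best technical approach during BFS/DFS, it also summarizes the problems the model runs into during DFS optimization and feeds them back into the BFS stage, telling the model that optimization xxx is worthwhile and that optimization xxx runs into problem xxx. This closes the optimization loop and fixes the problem of the model getting stuck in a dead end and flailing back and forth. In the end, with the framework behind it, Qwen3.6-27B scored 95.5 on the CGRE test, beating Fable5 (Mythos) at 94.1! All I can say is that this really is a win for Agentic engineering! Don't reflexively blame the model when it writes poor code; also check whether the Agent itself is the problem. So what is the price? Tokens, the hard currency of AI, of course: the framework pulled this off with 25-40x the token consumption. Worth studying. Framework: http://github.com/ryoiki-tokuiten/Iterative-Contextual-Refinements Paper: http://arxiv.org/abs/2605.15222 #mythos #fable5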

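Translation: The Iterative-Contextual-Refinements framework lets Qwen3.6-27B score 95.5 on the CGRE test, beating Anthropic Fable5 (Mythos) at 94.1. The framework overcomes small models' tendency to get stuck in local optima through BFS exploration of multiple approaches (such as KMP and sliding window), DFS that iteratively optimizes code with performance tools, and a coordinating router that closes the loop. The cost is a 25-40x increase in token consumption. The framework and paper are open source.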

Rohan Paul@rohanpaul_ai · Jun 15 · 60

Students finish AI-friendly math problems faster, but they seem to learn less from them. The researchers studied 3.2 million ALEKS math learning records across 10 years to see what changed after ChatGPT became available. Finishing faster is not automatically learning more efficiently, because math practice builds knowledge through the friction of choosing a representation, testing a step, making an error, and correcting it. When a chatbot supplies the path, the student may still submit the answer, but the mind has skipped the work that turns exposure into memory. They compare word problems, which students can easily paste into an AI chatbot, with graph problems, which are harder to hand off because they require visual work inside the platform. After ChatGPT, high school and college students spent much less time on the AI-friendly word problems, while younger students showed smaller or no change. This time drop disappeared when tests were proctored, which suggests the faster work was not just students getting better or the platform changing. The learning cost showed up later: on proctored retention questions, students became about 25% less likely to answer AI-friendly items correctly, even though they looked better on non-proctored items where AI could still help. ---- arxiv. org/abs/2605.21629 "Faster Completion, Less Learning: Generative AI Reduced Study Time on Math Problems and the Knowledge They Build"

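Translation: A study analyzed 3.2 million ALEKS math learning records over 10 years and found that after ChatGPT became widespread, high school and college students completed AI-friendly word problems much faster, but learned less. The time reduction disappeared under proctored conditions, which indicates the faster completion was not due to improved ability or platform changes. On later proctored retention tests, students' accuracy on AI-friendly items dropped by about 25%, while graph problems, which are hard to hand off to AI, were unaffected.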

Rohan Paul@rohanpaul_ai · Jun 14 · 69

MIT, Stanford, New York Univ, Princeton paper says AI can make people feel more efficient even when they are not actually becoming much more efficient. It finds that people often use AI for simple tasks because it feels like it saves time and effort, but the measured benefit is often tiny, missing, or even negative. The biggest point is the feedback loop: once people use AI, they become more likely to use it again, even for easy tasks where doing it themselves would often be just as fast or faster. In other words, AI dependence can grow from a mistaken feeling of convenience, not just from real productivity gains. Across three preregistered studies with 2,691 participants, people used AI for basic arithmetic, spelling, recall, and short rewriting at higher rates than they predicted, especially on easy tasks. They also expected AI to save 55.7 seconds on average, when the measured saving was only 7.5 seconds. For simple work, the hidden cost is not intelligence but interface friction: writing the prompt, waiting, reading, checking, and deciding whether the answer is acceptable. Once that loop begins, it can feel like effort has been outsourced, even when effort has only been rearranged. Here’s the key part: the study suggests that AI use can train its own justification. After using AI on just two tasks, participants became more likely to use it again, even when independent completion was faster. The danger is not dramatic dependence, but quiet recalibration. A person who asks AI for a trivial answer today may not become less capable tomorrow, but they may become less accurate at judging when their own mind is already the faster tool. ---- Paper Link – arxiv. org/abs/2605.22687 Paper Title: "The efficiency-gain illusion: People underestimate the rate of AI use and overestimate its benefits on simple tasks"

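Translation: A joint paper from MIT, Stanford, New York Univ, and Princeton finds that AI gives users an "efficiency illusion": they feel more efficient after using AI, but the actual gain is tiny or even negative. Across three preregistered studies with 2,691 participants covering arithmetic, spelling, recall, and short rewriting tasks, people used AI at higher rates than they predicted, and they expected to save 55.7 seconds on average while the measured saving was only 7.5 seconds. The hidden cost of simple tasks is interface friction: writing the prompt, waiting, reading, checking, and deciding whether the answer is acceptable. Once this loop forms, users become more inclined to use AI again, even when doing the task themselves is faster. The study notes that AI use is self-reinforcing, so users gradually lose their sense of when they are faster on their own. Paper link: arxiv.org/abs/2605.22687.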

Rohan Paul@rohanpaul_ai · Jun 14 · 42

Today’s AI agents still struggle to pass real human-verification checks (CAPTCHAs) on websites. The paper proposes HLL, a benchmark where agents must solve 10 types of CAPTCHA tasks by seeing the page, clicking or dragging correctly, tracking state, and submitting the answer. A useful agent must find the right box on a messy page, understand the instruction, click or drag in the right place, track what changed, recover from mistakes, and leave an interaction trail that looks consistent with the task. The paper shows that even strong agents can look smart on static tasks, then fail when the page is cluttered, the task is harder, or the system checks whether their actions were actually valid. ---- Link – arxiv. org/abs/2606.02449 Title: "HLL: Can Agents Cross Humanity's Last Line of Verification?"

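Translation: The paper proposes the HLL benchmark, which tests AI agents' ability to solve 10 types of CAPTCHA tasks. The tasks require the agent to view the page, click or drag correctly, track state changes, and submit the answer, while also finding the interactive element on a cluttered page, understanding the instruction, recovering from mistakes, and leaving a consistent interaction trail. Experiments show that even today's strongest agents do well on static tasks but still fail when the page is cluttered, the task gets harder, or the system verifies whether their actions were valid.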

Rohan Paul@rohanpaul_ai · Jun 13 · 68

Nvidia's Cosmos 3: 1 model that can understand, simulate, and act across many physical AI tasks. It treats action as a first-class language of the world. Most AI models look at reality from the outside: images become captions, videos become descriptions, and motion becomes something to label after the fact. Cosmos 3 tries to collapse that distance by putting language, image, video, audio, and action into one shared system, so a robot can connect what it sees with what might happen next and what it should do. A home robot cannot simply recognize a plate, a table, and a human instruction, because the useful question is what changes when it moves, grasps, slips, bumps, or waits. That is why the paper’s action-token design matters: it turns movement into something the model can condition on, infer from video, or generate alongside a future scene. ---- Link – arxiv. org/abs/2606.02800 Title: "Cosmos 3: Omnimodal World Models for Physical AI"

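Translation: Nvidia has released Cosmos 3, an omnimodal world model that brings language, image, video, audio, and action into one system so that physical AI can span the three tasks of understanding, simulating, and acting. It treats action as a first-class language of the world. Through its action-token design, the model can infer actions from video or generate a future scene together with the corresponding motion. This moves robots from recognizing objects to predicting the consequences of interactions such as moving, grasping, and slipping. The paper, "Cosmos 3: Omnimodal World Models for Physical AI", has been posted on arXiv.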

Rohan Paul@rohanpaul_ai · Jun 12 · 62

This paper shows an AI improving itself more when it both rewrites its setup and updates its model weights than when it only rewrites its setup. The problem is that most AI progress still depends on people changing prompts, tools, code, training data, and model weights by hand. The paper’s idea is SIA, a loop where one AI watches how a task agent performs, then either changes the agent’s outer setup or trains the model itself. The outer setup means things like prompts, tools, retry rules, and output parsing, while weight updates mean changing the model’s learned behavior through task feedback. The loop works like this: the task agent tries many answers or programs, the verifier scores them, and those scores become training feedback. Then the system updates a small add-on set of weights called LoRA weights, which changes the model’s behavior without retraining the whole model. So the base model stays mostly the same, but the LoRA adapter learns, “outputs like this got high reward, outputs like that failed.” The authors tested this on 3 very different tasks: Chinese legal charge classification, GPU kernel speed tuning, and single-cell RNA denoising. The combined version beat setup-only improvement on all 3 tasks, reaching 70.1% on LawBench, faster GPU code than the prior best, and 0.289 on denoising. The main lesson is that better scaffolding helps the agent act better, but weight updates help it learn task patterns that prompts and tools alone did not find. ---- Link – arxiv. org/abs/2605.27276 Title: "SIA: Self Improving AI with Harness & Weight Updates"

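Translation: The paper proposes the SIA framework, which lets an AI improve in an automatic loop: an observer AI monitors the task agent's performance, then either modifies its outer setup (prompts, tools, retry rules, output parsing) or trains the model itself through LoRA weight updates. The main model stays unchanged and only the adapter learns from task feedback. It was tested on three tasks: Chinese legal charge classification (70.1% on LawBench), GPU kernel speed tuning (generated code beat the prior best), and single-cell RNA denoising (score of 0.289). The combined version beat the setup-only approach on all tasks, showing that weight updates help the model learn patterns that prompts and tools could not find.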

Rohan Paul@rohanpaul_ai · Jun 11 · 63

LLM judges can change their safety verdict when the same answer is translated or rewritten. The problem is that many AI teams now use LLMs to judge whether another model’s answer is safe, but safety is not always a simple yes or no question. Those judges can be shaky exactly where careful judgment matters most. The paper proposes a stress test where the same basic answer is shown to judges after translation or rewriting, then the researchers check whether the judges still give the same safety verdict. The judges do better when harm is obvious, as in violent or extremist content, because the cues are loud and familiar. They become much weaker when safety depends on context, judgment, and regulation, as in financial advice, creditworthiness, or culturally sensitive responses. The judges also disagreed with each other a lot, and high raw agreement sometimes hid weak real reliability because many judges kept choosing the same label by default. ---- Link – arxiv. org/abs/2605.31381 Title: "LLM Judges Inconsistently Disagree Across Safety Criteria and Harm Categories"

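Translation: A new study finds that "LLM safety judges", large language models used to judge whether another model's answer is safe, are seriously unstable: after the same answer is translated or rewritten, a judge may give a different safety verdict. They do better on obvious-harm cases such as violent or extremist content, but reliability drops markedly where judgment depends on context, as in financial advice, credit assessment, or culturally sensitive responses. Different judges also often disagree with each other, and high raw agreement sometimes hides low real reliability because many judges choose the same label by default. The paper is titled "LLM Judges Inconsistently Disagree Across Safety Criteria and Harm Categories".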
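
The remark that high raw agreement can hide weak reliability has a standard statistical illustration. The sketch below compares raw agreement with Cohen's kappa, one common chance-corrected statistic. The post does not say which reliability measure the paper uses, so kappa is an assumption here, and the two judges' verdicts are invented toy data.

```python
from collections import Counter

def raw_agreement(labels_a, labels_b):
    """Share of items on which the two judges give the same label."""
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)

def cohens_kappa(labels_a, labels_b):
    """Agreement corrected for what two judges would reach by chance."""
    n = len(labels_a)
    observed = raw_agreement(labels_a, labels_b)
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label]
                   for label in set(labels_a) | set(labels_b)) / (n * n)
    if expected == 1.0:
        return 0.0  # both judges used a single identical label throughout
    return (observed - expected) / (1.0 - expected)

# Two hypothetical judges that say "safe" 95% of the time by default
# and never flag the same item as "unsafe".
judge_a = ["safe"] * 90 + ["unsafe"] * 5 + ["safe"] * 5
judge_b = ["safe"] * 90 + ["safe"] * 5 + ["unsafe"] * 5

print("raw agreement:", raw_agreement(judge_a, judge_b))
print("Cohen's kappa:", round(cohens_kappa(judge_a, judge_b), 3))
```

Here the judges agree on 90% of items, yet kappa comes out slightly below zero, because two judges that default to "safe" would agree about that often by chance alone.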

Rohan Paul@rohanpaul_ai · Jun 11 · 67

Today’s frontier agents are far less ready for real-world automation than their benchmark scores suggest. This paper proposes Agents’ Last Exam, a benchmark that asks AI agents to finish real expert work, and today’s agents mostly fail. Even strong agents of today are nowhere near reliable on the hardest real workflows, which means benchmark success has not yet become broad workplace capability. So this paper shifts the question from “can AI answer hard questions?” to “can AI complete real work that people get paid to do?” Most of today's AI benchmarks show impressive scores, but they do not prove that agents can finish useful work in real jobs. Agents’ Last Exam tries to fix this by testing agents on long tasks from 55 digital work areas, including engineering, finance, medicine, law, media, and science. The tasks come from experts’ real completed projects, and the agent must use normal computer tools like files, browsers, command lines, and desktop software to produce a finished result. The authors tested many current agent systems and models, then scored their finished work with automatic checks or strict rubrics instead of loose human opinions. The main result is that today’s best systems still struggle badly, with an average full pass rate of only 2.6% on the hardest tier. ---- Link – arxiv. org/abs/2606.05405 Title: "Agents' Last Exam"

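Translation: A new paper proposes the "Agents' Last Exam" benchmark, which tests AI agents' ability to complete real expert work. The tasks come from actual projects in 55 digital work areas including engineering, finance, medicine, law, media, and science, and require the agent to use ordinary tools such as files, browsers, command lines, and desktop software to produce a deliverable result. Scoring uses automatic checks or strict rubrics. The results show that today's strongest agents reach an average full pass rate of only 2.6% on the hardest tier, far below what their benchmark scores suggest. The paper notes that benchmark success has not yet turned into broad workplace capability.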

elvis@omarsar0 · Jun 10 · 60

// Self-Harness: Harnesses That Improve Themselves // (bookmark this one) Most of the agent scaffolds we rely on today are built once and remain frozen or mostly unchanged. The harness, like the skills, needs to evolve with new models. What if the scaffold rewrites itself? This new work treats the harness (the prompts, tools, and control flow around the model) as a learnable artifact that improves from its own runs rather than staying a fixed wrapper you hand-maintain. The scaffolding becomes the part that compounds, run after run. If you run long-horizon agents, a self-modifying harness turns scaffold upkeep from manual work into something the system earns on its own. Paper: https://arxiv.org/abs/2606.09498 Learn to build effective AI agents in our academy: https://academy.dair.ai/

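Translation: Most agent scaffolds today stay static once built. The new Self-Harness work treats the harness (prompts, tools, control flow) as a learnable artifact that improves iteratively from its own runs, rather than a fixed wrapper maintained by hand. When running long-horizon agents, a self-modifying harness turns maintenance work into a capability the system acquires automatically. Paper: arxiv.org/abs/2606.09498.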

AK@_akhaliq · Jun 10 · 66

Latent Spatial Memory for Video World Models

Translation: Latent Spatial Memory for Video World Models