All AI Updates · AI HOT

All updates · X · 608 posts

Tag: Papers/Research

Rohan Paul @rohanpaul_ai · April 28 · 47

AI agents fail not at calling tools, but at coordinating many tools reliably over time. This paper is a comprehensive review of recent progress in multi-tool LLM agents.

The main proposal is to treat multi-tool orchestration as its own problem, meaning the agent must choose, order, monitor, and sometimes redo many tool actions. The authors review the field across 6 linked areas: planning at run time, training data and tuning, safety, efficiency, missing-tool handling, and benchmarks that test harder interactive tasks.

Their main finding is that progress now depends less on single-call accuracy and more on graph-style planning, memory, verification, rollback, and better ways to evaluate long-running tool use. That matters because an agent can look smart on a small demo yet still fail badly in software work, enterprise systems, phones, or web tasks if it cannot keep state straight and recover safely. Current benchmarks and research are also shifting away from simple single-call tests toward harder real-world tests where agents must stay reliable over long tool chains.

----
Paper Link: arxiv.org/abs/2603.22862v2
Paper Title: "The Evolution of Tool Use in LLM Agents: From Single-Tool Call to Multi-Tool Orchestration"

Summary: This paper surveys progress in multi-tool LLM agents and argues that their core failure mode is unreliable coordination of many tools over long horizons, not the individual tool call. It treats multi-tool orchestration as a problem in its own right, requiring the agent to handle tool selection, ordering, monitoring, and retries. The authors review six areas: run-time planning, training data and tuning, safety, efficiency, missing-tool handling, and benchmarks for more complex interactive tasks. The key finding is that progress depends more on graph-style planning, memory, verification, rollback, and better evaluation of long-horizon tool use than on single-call accuracy. Research and benchmarks are moving from simple single-call tests toward harder, more realistic tasks that require agents to stay reliable across long tool chains.
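The "choose, order, monitor, redo" loop described above can be pictured with a small sketch. This is not taken from the survey: the `Step` type, the `orchestrate` function, and the retry policy are all hypothetical names and assumptions, meant only to show verification and rollback around each tool action.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

State = Dict[str, int]

@dataclass
class Step:
    name: str
    run: Callable[[State], State]     # hypothetical tool action: takes a state copy, returns a new state
    verify: Callable[[State], bool]   # hypothetical check applied to the state after the action

def orchestrate(steps: List[Step], state: State, max_retries: int = 1) -> State:
    """Run steps in order; after a failed check, retry from the checkpoint, then abort."""
    for step in steps:
        checkpoint = dict(state)                      # snapshot used for rollback
        for _ in range(max_retries + 1):
            candidate = step.run(dict(checkpoint))    # each attempt starts from the snapshot
            if step.verify(candidate):
                state = candidate                     # commit only verified results
                break
        else:
            # every attempt failed: the committed state is still the checkpoint
            raise RuntimeError(f"step {step.name!r} failed after retries")
    return state
```

Because each attempt works on a copy of the checkpoint, a failed tool call never leaks partial changes into the committed state, which is the property the post calls keeping state straight and recovering safely.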

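meng shao @shao__meng · April 28 · 71

VLAA-GUI: teaching GUI agents to "stop, recover, search"

The bottleneck for GUI agents is not that the model is too weak, but that there is no enforced mechanism for deciding when to stop, when to change course, and when to look things up.

Common problems with today's GUI agents:
· False success: on OSWorld, 86%+ of failures are cases where the agent believes it has finished.
· Dead loops: the agent cycles through the same action or screen and burns its step budget.

The VLAA-GUI method has three modules:
· STOP (Completeness Verifier): rewrites the task as visually checkable success criteria; an independent model re-reviews done() and rejects it when the evidence is insufficient.
· RECOVER (Loop Breaker): three escalating tiers: switch interaction modality → switch overall strategy → an external judge bans the repeated action.
· SEARCH (Search Agent): sends the "How to..." question straight to a search-capable LLM and injects the returned plain-text tutorial into context (bypassing the browser's visual chain).

Key numbers

OSWorld-Verified (human: 72.4%)
· VLAA-GUI + Opus 4.6 → 77.5% (first to exceed human level, new SOTA)
· Opus 4.5 and Gemini 3.1 Pro also cross the human line under the same framework
· Sonnet 4.6 reaches 64.1% in only 15 steps, beating the previous best 50-step system

WindowsAgentArena
· Gemini 3 Flash + VLAA-GUI → 61.0% (SOTA, about 4% above GPT-5-family systems)

Ablation (WAA, full system 60.4)
· without Verifier → 51.3 / without Loop Breaker → 52.6 / without Search → 49.4 (all three are needed)

Project page: https://ucsc-vlaa.github.io/VLAA-GUI/

Summary: The research argues that the core bottleneck for current GUI agents is system design, not model capability, and that it shows up as false success and dead loops. The VLAA-GUI framework responds with three modules: a STOP verifier that confirms the task is truly complete, a RECOVER loop breaker that interrupts repeated actions, and a SEARCH agent that pulls in external knowledge directly. On OSWorld, the framework lifts Opus 4.6 to a 77.5% success rate, the first result above human level (72.4%); on WindowsAgentArena, paired with Gemini 3 Flash, it sets a new record at 61.0%. The takeaway is that careful system design matters as much as a strong model.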
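As a rough illustration of the RECOVER idea, the sketch below counts repeated (action, screen) pairs in recent history and maps the repeat count to one of the three tiers named in the post. The function name, window size, and thresholds are assumptions made for this example; the project's actual detection logic is not described in the post.

```python
from collections import Counter
from typing import List, Tuple

# Tier names follow the post; how the real system triggers them is an assumption here.
ESCALATION = ["switch_modality", "switch_strategy", "ban_repeated_action"]

def loop_breaker(history: List[Tuple[str, str]], window: int = 6, threshold: int = 3) -> str:
    """Return 'continue' or an escalation tier, based on repeats in the recent history.

    history holds (action, screen_id) pairs; window and threshold are illustrative values.
    """
    recent = history[-window:]
    if not recent:
        return "continue"
    _, repeats = Counter(recent).most_common(1)[0]   # count of the most repeated pair
    if repeats < threshold:
        return "continue"
    tier = min(repeats - threshold, len(ESCALATION) - 1)
    return ESCALATION[tier]
```

The point of the tiers is that each extra repeat of the same pair triggers a stronger intervention, so the agent cannot spend its whole step budget on one stuck action.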

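Ethan Mollick @emollick · April 28 · 60

This is an incredibly cool experiment. It is also fascinating that the model knows information up to 1931, but, at least in some science topics, seems very stuck in the early 1900s. For example, it defends the luminiferous aether hypothesis & has a distrust of special relativity.

Summary: Researchers released Talkie, a 13B model trained only on text from before 1931, to probe how language models generalize. The experiment found that although the model has information up to 1931, on some science topics it is clearly stuck in an early-1900s frame of reference. For example, it still defends the luminiferous aether hypothesis and distrusts special relativity. This highlights how strongly the time span of the training data fixes a model's knowledge and worldview.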

Rohan Paul @rohanpaul_ai · April 28 · 56

Optimizing RAG for precision can quietly hurt retrieval accuracy by 40%, putting agentic pipelines at risk. Redis says in new research that enterprise teams fine-tuning RAG embedding models for improved precision may be unknowingly reducing the retrieval quality those pipelines need. Training embeddings to notice meaning-level edits can damage the retrieval they were built for. This paper says 1 embedding cannot do broad search and exact meaning checks at the same time.

The reason is simple. A dense retriever squeezes an entire sentence into one vector, then asks cosine similarity to decide both topical relevance and exact meaning. That works well when the job is broad recall. It works much less well when the difference is structural, like "the dog bit the man" versus "the man bit the dog," or a negation that reverses the claim.

Here's the deeper point. When you force one embedding to separate those near-misses, you spend representational space that was previously helping the model group related material across domains. The paper shows that this extra sensitivity is uneven. Negation and spatial flips improve, but binding errors remain stubborn, which is precisely the kind of mistake that matters in contracts, compliance, and other role-sensitive work.

So the fix is not to keep squeezing harder on the same vector. The better design is two-stage retrieval: use embeddings for fast recall, then verify the shortlisted results with token-level comparisons that can actually see structure. That is also why MaxSim helps relevance but still misses identity-level errors, while a small Transformer over token similarity maps does better at rejecting near-misses.

The real lesson is not that RAG fails. It is that "almost the same sentence" is not the same thing as "the same meaning," and systems that blur those two will fail most confidently where precision matters most.

----
Paper Link: arxiv.org/abs/2604.16351
Paper Title: "Training for Compositional Sensitivity Reduces Dense Retrieval Generalization"

Summary: New research finds that when enterprises fine-tune RAG embedding models for precision, retrieval quality can drop by as much as 40%. The core conflict is that a single dense embedding vector is asked to do two jobs at once: broad topical recall and exact semantic discrimination. Forcing the model to separate fine structural differences (such as negation or swapped word order) damages its ability to group related material across domains. The proposed fix is two-stage retrieval: fast recall with the embedding model, followed by structure-aware token-level comparison to verify the candidates. "Almost the same sentence" and "the same meaning" are different things, and systems that confuse them will fail in high-precision domains such as contracts and compliance.
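A toy version of the two-stage design makes the argument concrete. The sketch below is built on loud assumptions: a bag-of-words count vector stands in for a dense embedding (it ignores word order by construction), and an exact token-sequence match stands in for the paper's token-level verifier, which the post describes as a small Transformer over token similarity maps. All function names are hypothetical.

```python
import math
from collections import Counter
from typing import List

def embed(text: str) -> Counter:
    """Stand-in 'embedding': bag-of-words counts, which ignore word order."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def recall(query: str, corpus: List[str], k: int = 2) -> List[str]:
    """Stage 1: cheap, broad recall by vector similarity."""
    q = embed(query)
    return sorted(corpus, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def verify(query: str, candidate: str) -> bool:
    """Stage 2: a token-level check that can see order (here, exact sequence match)."""
    return query.lower().split() == candidate.lower().split()

corpus = ["the man bit the dog", "the dog bit the man", "a cat sat on the mat"]
query = "the dog bit the man"
shortlist = recall(query, corpus)                        # both "bit" sentences score the same
confirmed = [d for d in shortlist if verify(query, d)]   # only the order-preserving match survives
```

In this toy setup the first stage cannot tell the two "bit" sentences apart because their vectors are identical, which is the single-vector limitation the post describes; the second stage is what rejects the role-swapped near-miss.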

AK @_akhaliq · April 28 · 49

Building a Precise Video Language with Human-AI Oversight
paper: https://huggingface.co/papers/2604.21718

AK @_akhaliq · April 28 · 53

Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond
paper: https://huggingface.co/papers/2604.22748

AK @_akhaliq · April 28 · 48

Video Analysis and Generation via a Semantic Progress Function
paper: https://huggingface.co/papers/2604.22554

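elvis @omarsar0 · April 28 · 69

How do AI agents spend your money:

Summary: A systematic study of token costs for AI agents on coding tasks finds that consumption can reach roughly 1000x that of chat or code reasoning, and that the same task can vary by up to 30x from run to run. Higher token spend does not translate directly into higher accuracy; performance peaks at moderate cost and then saturates. Models also struggle to predict their own token usage, with self-prediction correlation topping out at 0.39. On the same task, one model may consume 1.5 million more tokens than another with no gain in quality. Agent run-time cost is therefore high-variance, weakly tied to quality, and unpredictable even to the model itself, which affects how teams budget, route between models, and decide when to terminate a run.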

elvis @omarsar0 · April 27 · 63

// Agentic World Modeling //

Massive 40-author survey just dropped. Cleanest taxonomy of world models in agent research I've seen. (bookmark it)

The paper proposes a "levels × laws" framework. Three capability levels:
> L1 Predictors do one-step transitions
> L2 Simulators do multi-step action-conditioned rollouts
> L3 Evolvers self-revise as the world changes

It discusses four law regimes: physical, digital, social, and scientific. They synthesize 400+ works and 100+ representative systems spanning model-based RL, video generation, web/GUI agents, multi-agent simulation, and scientific discovery. The framework also identifies failure modes and proposes evaluation principles for each level.

Why it matters: as agents shift from chatbots to goal-accomplishers, the bottleneck moves from language to environment. This is the first paper that gives builders a shared vocabulary for designing and evaluating world models across communities that have been working in isolation.

Paper: https://arxiv.org/abs/2604.22748
Learn to build effective AI agents in our academy: https://academy.dair.ai/

Summary: A survey by 40 authors proposes a "capability levels × law regimes" framework for classifying world models in agent research. The three capability levels are L1 Predictors, which make one-step predictions; L2 Simulators, which run multi-step action-conditioned rollouts; and L3 Evolvers, which revise themselves as the world changes. The law regimes cover four domains: physical, digital, social, and scientific. The framework synthesizes more than 400 works and more than 100 representative systems across model-based RL, video generation, web/GUI agents, multi-agent simulation, and scientific discovery, and identifies failure modes and evaluation principles for each level. Its core value is that, as agents shift from chatbots to goal-accomplishers and the bottleneck moves from language to environment, it gives researchers in different fields a common language for designing and evaluating world models.
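The difference between the first two levels can be shown in a few lines. This sketch is an illustration under assumptions, not the survey's formalism: states and actions are plain integers, `toy_model` is a made-up one-dimensional world, and L3 (self-revision) is left out because the post gives no mechanism for it.

```python
from typing import Callable, List

State = int
Action = int
Predictor = Callable[[State, Action], State]

def l1_predict(model: Predictor, state: State, action: Action) -> State:
    """L1: a single one-step transition."""
    return model(state, action)

def l2_rollout(model: Predictor, state: State, actions: List[Action]) -> List[State]:
    """L2: a multi-step, action-conditioned rollout built by chaining one-step predictions."""
    trajectory = [state]
    for action in actions:
        trajectory.append(model(trajectory[-1], action))
    return trajectory

def toy_model(state: State, action: Action) -> State:
    # hypothetical 1-D world: the action is a signed displacement
    return state + action
```

Calling `l2_rollout(toy_model, 0, [1, 2, -1])` walks the state through each action in turn. With a learned model, errors made at each step would feed into the next one, which is one reason rollouts are harder to get right than single-step prediction.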

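elvis @omarsar0 · April 27 · 62

I consider this one of the most interesting research themes happening in AI today. Worth taking a look. As I automate more with agents, I feel like there are all kinds of incredible opportunities to optimize multi-agent systems to do things like automated knowledge discovery or tuning advanced AI systems that gauge other AI agents at software engineering or AI engineering tasks. All kinds of new agent architectures, algorithms, prompting techniques, and data processing and synthesis techniques are just waiting to be discovered.

Summary: The author argues that optimizing multi-agent systems for automated knowledge discovery or for tuning advanced AI systems is one of the most promising directions in AI today. The research cited in the post uses reinforcement learning to train a "conductor" model that manages other models automatically: for simple questions it queries a single model directly, and for complex coding tasks it assembles a full pipeline of planner, coder, and verifier on its own. This marks a move from single-agent "chain of thought" to multi-agent "chain of command". The technique is already used in new systems such as Sakana Fugu, and the space for AI-managing-AI designs remains wide open.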

elvis @omarsar0 · April 27 · 64

NEW paper from Alibaba.

A 30B MoE with only 3B active params matches Qwen3-235B on real tool-use workloads.

AgenticQwen-30B-A3B: 50.2 average on TAU-2 + BFCL-V4 Multi-Turn. AgenticQwen-8B: 47.4. Both more than double their vanilla Qwen baselines and close most of the gap to a 235B model.

How: two RL flywheels run in parallel.
- The reasoning loop mines the model's own errors into harder problems each round.
- The agentic loop grows simple linear tool-use trajectories into multi-branch behavior trees.
- Simulated users actively try to mislead the agent.
The training distribution gets harder on its own.

Why it matters for agent devs: you can stop paying frontier prices for routine tool-use workloads. And the flywheel recipe is reusable. Generate your hard examples from your own agent's failures, not from static synthetic data.

Paper: https://arxiv.org/abs/2604.21590
Learn to build effective AI agents in our academy: https://academy.dair.ai/

Summary: Alibaba proposes training agents with two reinforcement-learning flywheels and releases AgenticQwen-30B-A3B on that basis. The model has 30B total parameters but activates only 3B per inference, and scores a 50.2 average on the TAU-2 and BFCL-V4 multi-turn tool-use benchmarks, on par with the 235B-parameter Qwen3-235B. The two flywheels run in parallel: the reasoning loop turns the model's own errors into harder training problems, and the agentic loop expands simple tool-use trajectories into multi-branch behavior trees, with simulated users who try to mislead the agent raising the difficulty further. The implication is that developers need not pay frontier-model prices for routine tool tasks, and that the flywheel recipe is reusable: hard examples can be generated from an agent's own failures.

elvis @omarsar0 · April 27 · 54

Here is a very common problem when building complex agents.

Long-horizon agents (in particular) fail in two ways: the decision-maker can't decompose well, or the skill library goes stale. This new research tackles both at once.

The paper introduces a co-evolution framework where an LLM decision agent and a dynamic skill bank improve each other through iterative refinement. The decision agent picks and chains skills. Performance feedback updates both the policy and the skills. New skills emerge by generalizing successful sequences instead of being hand-coded upfront.

Why does it matter? Most long-horizon agent stacks treat skills and decision-making as separate optimization problems, which is why they plateau. Co-evolution gives you adaptive planning and a growing library of reusable behaviors from a single loop, which is what you actually want when task structure isn't predetermined: robotics, game agents, complex planning.

Paper: https://arxiv.org/abs/2604.20987
Learn to build effective AI agents in our academy: https://academy.dair.ai/

Summary: When building complex agents, long-horizon agents often fail because the decision-maker decomposes poorly or the skill library goes stale. New research proposes a co-evolution framework in which an LLM decision agent and a dynamic skill bank improve each other through iterative refinement. The decision agent selects and chains skills, and performance feedback updates both its policy and the skill bank itself. New skills are generated by generalizing successful sequences instead of being hand-coded in advance. Conventional approaches optimize skills and decision-making as separate problems and tend to plateau. Co-evolution delivers adaptive planning and a growing library of reusable behaviors in a single loop, which matters most where task structure is not fixed in advance (robotics, game agents, complex planning).

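meng shao @shao__meng · April 26 · 77

[Paper share] A close reading of the leaked Claude Code source, combined with Anthropic's official documentation and community analysis, reconstructs a full architectural map of a production-grade coding agent, with the independent open-source system OpenClaw as the comparison case. Paper: https://arxiv.org/pdf/2604.14228

# The single most important number: 1.6% vs 98.4%

Community estimate: only about 1.6% of the Claude Code codebase is "AI decision logic" (prompts, model calls, the loop); the other 98.4% is deterministic runtime (permission, context, tool routing, recovery). That lopsided ratio means:
· The model has almost complete decision autonomy (where to reason, which tools to call)
· But the model never touches the file system, shell, or network directly
· The engineering complexity exists not to constrain the model, but to let it work freely in a safe, well-provisioned environment

This is the opposite route from LangGraph (constraining control flow with a state graph) and Devin (an explicit planner): minimal scaffolding plus a maximal operational harness.

# Five human values drive the team's design trade-offs and the whole architecture
· Human decision authority: the user has final control, formalized through a principal hierarchy (Anthropic→operators→users)
· Safety/privacy: even when the user is not paying attention, the system must protect code, data, and infrastructure
· Reliable execution: correct within a single turn, and consistent across context windows, sessions, and subagents
· Capability amplification: lets users do things they would never have attempted before (Anthropic internal data: ~27% of tasks are ones that "would not have been done without the tool")
· Contextual adaptation: the system adapts to the user's project, habits, and skills, and the relationship evolves over time

A sixth item is an evaluation lens and not a design value: long-term preservation of human capability. This is the paper's most important critical observation and is expanded below.

# Thirteen design principles and the architectural skeleton
· Deny-first with human escalation (deny by default; escalate anything unrecognized to a human)
· Graduated trust spectrum
· Defense in depth (multiple independent safety layers)
· Externalized programmable policy (policy lives outside the code and is configurable)
· Context as scarce resource
· Append-only durable state
· Minimal scaffolding, maximal harness
· Values over rules (favor value judgments over hard rules)
· Composable multi-mechanism extensibility
· Reversibility-weighted risk
· Transparent file-based config/memory (plain files, not an opaque database)
· Isolated subagent boundaries
· Graceful recovery and resilience

The overall architecture can be read in two views:
· Seven-component view (high level): user → interface → Agent Loop → permission system → tools → state/persistence → execution environment
· Five-layer view (detailed): Surface layer (CLI/SDK/IDE) → Core layer (loop + compaction) → Safety/Action layer (permissions, hooks, tools, sandbox, subagent) → State layer (context assembly, session, CLAUDE.md) → Backend layer (shell, MCP, remote execution)

# The agent main loop: a plain while-true

queryLoop() is an async generator, and every turn runs a fixed 9 steps: settings resolution → state initialization → context assembly → five pre-model shapers → model call → tool_use dispatch → permission gate → tool execution → stop check. What it does not do: no explicit planner, no state graph, no tree search. It is the simplest implementation of ReAct.

Tool execution uses StreamingToolExecutor: while the model streams tool_use output, read-only tools run in parallel and write operations run serially. Results are filled back in the order the requests were received, so the model sees tool results in the same order in which it issued them.

There are five recovery mechanisms (output token escalation, reactive compact, prompt-too-long handling, streaming fallback, fallback model), and all of them "try to self-heal silently first and tell the human only if that fails."

# The "seven-layer defense" for safety

Every tool call passes through these seven layers, and any layer can veto it:
1. Tool pre-filtering (globally denied tools never even appear in the model's view)
2. Deny-first rules (deny always overrides allow, even when the allow is more specific)
3. Permission Mode constraints (seven modes: plan/default/acceptEdits/auto/dontAsk/bypassPermissions/bubble)
4. Auto-mode ML classifier (yoloClassifier.ts, a separate LLM call that judges safety)
5. Shell sandbox (file-system/network isolation independent of the permission system)
6. Resume does not restore session-level permissions (forces re-authorization)
7. Hook interception (PreToolUse can block, rewrite, or approve asynchronously)

The key design philosophy: Anthropic's own research found that users approve permission prompts 93% of the time, which means interactive confirmation is behaviorally unreliable. So the architecture chooses not to depend on a human watching, and uses the sandbox plus the classifier to cut the number of decisions that need a human by 84%.

# Context management: five-layer progressive compaction

The model's context window is the bottleneck resource for the whole system. Before every model call, 5 shapers run in order:
· Budget reduction (always on): a single tool result that exceeds the size limit is replaced with a reference
· Snip: deletes old history segments
· Microcompact: cache-friendly fine-grained compaction, which waits for the API response and then uses the real cache_deleted_input_tokens
· Context collapse: read-time projection; storage is untouched and the model sees a projected view (a particularly elegant design in the paper)
· Auto-compact: the fallback, a full model-generated summary

Why 5 layers and not 1: each layer has a different cost, so the cheap, light compaction runs first and the system escalates only when that is not enough. This is the lazy-degradation idea. The cost is that users find system behavior hard to predict, because some layers (context collapse in particular) are invisible to them.

The four-level CLAUDE.md hierarchy (managed→user→project→local) is file-based memory. It deliberately rejects a vector database, on the grounds that "users must be able to read it, edit it, and git commit it." The cost is that retrieval granularity stops at the file level (an LLM scans file headers and picks at most 5), which is coarser than vector retrieval.

Important insight: CLAUDE.md is injected as a "user message" and not as the system prompt, so its constraint on the model is probabilistic. Real enforcement comes from the deny-first permission rules. This is a deliberate separation of a guidance layer (probabilistic) from an enforcement layer (deterministic).

# Extension mechanisms: four, not one

The paper answers a common question: why does Claude Code have MCP and also plugins, skills, and hooks? The answer is that the four carry different context costs:
· MCP servers: external service integration, high context overhead
· Plugins: multi-component packaging and distribution, medium context overhead
· Skills: domain instructions + meta-tools, low context overhead
· Hooks: lifecycle interception, zero context overhead by default

Graduated context cost means cheap extensions (hooks) can be deployed widely, while expensive ones (MCP) are reserved for cases that truly need new tools. The cost is that developers have to learn 4 sets of APIs.

The hook system is extremely detailed: the source defines 27 event types, of which 5 take part in permission decisions and 22 are used for lifecycle/orchestration.

# Subagents: isolation, not sharing

Subagents are dispatched through AgentTool (Task is its legacy alias). There are three isolation modes:
· Worktree: a temporary git worktree, file-system isolation
· Remote (internal only): runs on a remote Claude Code
· In-process (default): shared FS, isolated context

Key constraint: a subagent returns only its final summary text to the parent, and the full transcript goes to a sidechain stored in a separate .jsonl file. This keeps auditability without polluting the parent context. The cost: almost every invocation needs a self-contained prompt (fork-subagent excepted). Anthropic itself discloses that agent teams mode costs about 7× the tokens of a normal session, which is why returning summaries is so important. Multi-agent coordination uses file locks and not a message broker: zero dependencies and easy to debug, at the expense of throughput.

# Persistence: append-only JSONL

A session is stored as almost purely append-only JSONL (a few cleanup rewrites excepted). There are three independent persistence channels:
1. Session transcript (project level, one file per session)
2. Global prompt history (user input only, supports Up and Ctrl+R)
3. Subagent sidechain (separate .jsonl + .meta.json)

--resume replays the transcript to rebuild the session but deliberately does not restore session-level permissions. This treats "trust" as a session-scoped security invariant: the user re-authorizes every time, so authorization decisions made in an old context are not carried into a new one. The compact_boundary marker embeds headUuid/anchorUuid/tailUuid so that the loader can patch the message chain together at read time, which compacts the context while keeping the full history reconstructible.

# Comparison with OpenClaw: same problems, different answers

Dimension: Claude Code vs. OpenClaw
· System form: ephemeral CLI process vs. persistent gateway daemon
· Trust model: per-action deny-first evaluation + 7 modes vs. gateway-perimeter authentication (DM pairing, allowlist, optional sandbox)
· Agent runtime: queryLoop() is the center of the system vs. Pi-agent embedded in gateway RPC with per-session queues
· Extension architecture: 4 mechanisms graded by context cost vs. manifest-first plugins, 12 capability types, central registry
· Memory: 4-level CLAUDE.md + 5-layer compaction vs. workspace bootstrap files + "dreaming" long-term memory promotion
· Multi-agent: parent-child task delegation vs. routing (multiple agents serving different channels) and delegation as two separate layers

The most interesting finding is that the two compose: OpenClaw can host Claude Code as an external coding harness through ACP. This suggests the agent design space is not a flat taxonomy but a layered one, where the gateway layer and the task layer can stack. Core insight: "Claude Code puts the trust boundary between the model and the execution environment; OpenClaw puts it at the gateway perimeter."

# Five value tensions (the most thought-provoking chapter)
· Authority × Safety: the 93% approval rate shows human oversight is unreliable, so safety has to be backed by the classifier and sandbox
· Safety × Capability: bash commands with >50 subcommands skip per-subcommand checks (slow parsing freezes the UI); the defense-in-depth layers share a performance bottleneck
· Adaptability × Safety: several CVEs exploit the hook/MCP initialization window that opens before the trust dialog appears
· Capability × Adaptability: proactive prompts raise task completion by 12-18%, but user preference drops sharply when they fire too often
· Capability × Reliability: bounded context + subagent isolation → locally good decisions ≠ globally good outcomes

# Sixth lens: long-term preservation of human capability

The paper does not list this as a value but uses it as an evaluation lens, and summarizes outside empirical evidence:
· Becker et al. 2025 (RCT with 16 experienced developers): AI tools made developers 19% slower, while they felt 20% faster
· Shen & Tamkin 2026: the AI-assisted group scored 17% lower on comprehension tests
· He et al. 2025 (causal analysis of Cursor across 807 repositories): code complexity +40.7%, and the initial speed gain dissipated within three months
· Liu et al. 2026: an audit of 304,000 AI commits; about 1/4 of the issues introduced persist to the latest version, and security issues persist at a higher rate
· Kosmyna et al. 2025 (54-person EEG study): LLM users showed weaker neural connectivity, which persisted after the AI was removed
· Rak 2025: entry-level tech hiring fell 25% from 2023 to 2024

The paper's verdict: Claude Code markedly amplifies short-term capability, but offers very limited mechanisms to support long-term human growth, deep understanding, and codebase coherence. It closes by naming "future systems should treat the sustainability gap as a first-class design problem" as the most important open challenge.

# Six open directions (for future agent systems)
1. Observability-evaluation gap: 78% of AI failures are silent; 89% of teams have observability but only 52% do offline evaluation. Scaffolding that separates generator from evaluator is needed.
2. Cross-session persistence: the "middle layer" between CLAUDE.md (static) and the transcript (single session) is empty
3. Harness boundary evolution: expansion along the where/when/what/with whom axes (physical VLA actions in particular change the cost asymmetry of reversibility-weighted risk)
4. Horizon scaling: reliability from a single session up to multi-cycle scientific research
5. Governance and regulation: the EU AI Act (fully applicable in August 2026) and the GPAI Code of Practice impose external constraints on logging, transparency, and human oversight
6. Long-term human capability as a first-class design goal: both the measurement layer and the design layer are empty

# Judgments worth remembering

"Where the model reasons and where the harness executes is the root question of agent system design."

"At 95% single-step accuracy, a 100-step task succeeds only 0.6% of the time." This is why every step has to be verified.

"Frontier models are converging in coding capability, and the quality of the operational harness is becoming the main differentiator."

"Agent design choices are not a flat taxonomy but a hierarchy: a task-level harness can be hosted by a gateway-level control plane."

"Engineering complexity exists not to limit the model's decisions, but to let the model decide better."

# Implications for engineering practice

For those of us building agent systems:
· Investing in deterministic infrastructure (context management, layered safety, recovery) pays off more than wrapping ever-stronger models in planning scaffolds
· deny-first + multiple independent checks is more robust in production than a single sandbox, but watch for shared performance bottlenecks that degrade all layers at once
· Multi-layer progressive context compaction is more reliable than one-shot truncation or single-step summarization, but users need observability into it
· append-only persistence + not restoring permissions across sessions is a cheap way to get auditability and a security invariant together
· Layer extension mechanisms by context cost: keep the "expensive" ones (MCP) for cases that truly need new tools, and deploy the "cheap" ones (hooks) widely
· Have subagents return summaries and do not share transcripts; otherwise token cost balloons (Claude Code figure: 7×)
· Write long-term preservation of user capability into the design goals, and do not only measure it with a metric after the fact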
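The deny-first rule from the seven-layer list (deny always overrides allow, even when the allow is more specific, and anything unrecognized is escalated to a human) can be sketched as follows. The rule format, the glob syntax, and the `decide` function are hypothetical illustrations for this post; they are not Claude Code's actual rule grammar or API.

```python
from fnmatch import fnmatchcase
from typing import List, Tuple

Rule = Tuple[str, str]  # (effect, glob pattern); effect is "allow" or "deny"

def decide(tool_call: str, rules: List[Rule]) -> str:
    """Deny-first evaluation: any matching deny wins; unmatched calls go to the user."""
    matched = [effect for effect, pattern in rules if fnmatchcase(tool_call, pattern)]
    if "deny" in matched:
        return "deny"        # deny overrides allow regardless of specificity
    if "allow" in matched:
        return "allow"
    return "ask_user"        # nothing matched: escalate to a human

# Hypothetical rule set: the second rule is more specific than the first, yet still loses.
rules = [
    ("deny", "Bash(rm *)"),
    ("allow", "Bash(rm ./build/*)"),
    ("allow", "Read(*)"),
]
```

With these rules, `decide("Bash(rm ./build/tmp.o)", rules)` returns "deny" even though a narrower allow matches. The result does not depend on rule order or specificity, which makes the outcome easy to audit.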

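Summary: By analyzing the leaked Claude Code source, the paper shows that the core of this production-grade coding agent is a "minimal AI decision logic + maximal deterministic environment" design. Only about 1.6% of the code is AI logic; the other 98.4% builds a safe, reliable operating framework. The architecture is driven by five values, including human decision authority and safety. It uses a seven-layer independent defense to secure tool calls and a five-layer progressive compaction strategy to manage the context window. Extension mechanisms are graded by context cost and subagents are isolated. The design as a whole stresses transparency and user control, in sharp contrast to mainstream approaches that rely on state graphs or explicit planning.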
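The figure quoted in the post (95% single-step accuracy, 100 steps, about 0.6% success) follows from multiplying per-step success rates. The short check below assumes steps succeed independently, which is a simplification; the function name is made up for the example.

```python
def chain_success(per_step: float, steps: int) -> float:
    """Probability that every step succeeds, assuming the steps are independent."""
    return per_step ** steps

rate = chain_success(0.95, 100)
print(f"{rate:.2%}")  # a little under 0.6%, matching the figure quoted in the post
```

The same function shows how steep the curve is: at 99% per step, a 100-step task still succeeds only a bit more than a third of the time.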

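elvis @omarsar0 · April 26 · 53

Great paper on improving proactive agents.

Summary: The research proposes the PARE framework, which models applications as finite state machines with stateful navigation and state-dependent actions, enabling more realistic evaluation of proactive AI agents. The PARE-Bench benchmark built on it contains 143 tasks across domains such as communication and productivity, and tests an agent's ability to observe context, infer goals, time its interventions, and coordinate across multiple apps. The work addresses a gap in mainstream benchmarks, which treat apps as flat APIs and ignore the stateful, sequential nature of real interaction, and offers a principled way to measure whether an agent can infer a user's unstated goals and act at the right moment.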

elvis @omarsar0 · April 26 · 63

NEW paper from Microsoft. This is an important read. (bookmark it)

The work introduces DELEGATE-52, a benchmark simulating long document-editing workflows across 52 professional domains like coding, crystallography, and music notation. Across 19 tested models, even frontier ones (Gemini 3.1 Pro, Claude 4.6 Opus, GPT-5.4) corrupted an average of 25% of document content by the end of long workflows. Agentic tool use didn't help. Lots of other insights in this one. Check it out below...

Paper: https://arxiv.org/abs/2604.15597
Learn to build effective AI agents in our academy: https://academy.dair.ai/

Summary: A new Microsoft paper introduces the DELEGATE-52 benchmark, which simulates long document-editing workflows across 52 professional domains. Across the 19 models tested, including frontier models such as Gemini 3.1 Pro, Claude 4.6 Opus, and GPT-5.4, an average of 25% of document content was corrupted by the end of long workflows. Agentic tool use did not improve results. The paper also offers other related findings.

AK @_akhaliq · April 25 · 39

Context Unrolling in Omni Models
paper: https://huggingface.co/papers/2604.21921

AK @_akhaliq · April 24 · 44

Seeing Fast and Slow: Learning the Flow of Time in Videos
paper: https://huggingface.co/papers/2604.21931

AK @_akhaliq · April 24 · 39

Near-Future Policy Optimization
paper: https://huggingface.co/papers/2604.20733

Saining Xie @sainingxie · April 24 · 72

vision🍌 is here https://vision-banana.github.io/

if you got into computer vision the way I did, starting with pixel-level labeling tasks like segmentation, edges, depth, or surface normals, you'll probably feel the same seeing these results -- something big has quietly shifted, and it's going to change how we approach these problems for good 🧵

Rohan Paul @rohanpaul_ai · April 22

This paper asks whether phone-use agents protect your data during ordinary tasks, and finds that they often do not. The best model completed 82.8% of tasks, but the best privacy-qualified score was only 47.6%.

That gap matters because privacy failure here is not sabotage. It is ordinary over-helpfulness. A phone agent can finish your food order, book your appointment, or fill your travel form while still asking for a phone number it did not need, re-entering it into a coupon box, or stuffing optional fields with personal details just because the boxes were there.

To measure that behavior, the authors built MyPhoneBench, which logs exactly what agents type, where they type it, and whether any of it was necessary. The benchmark splits privacy into three checks: asking for protected data it did not need, re-disclosing data to plausible but irrelevant widgets, and filling optional personal fields just because they were there.

Here's the part most people miss. The hardest problem was not detecting obvious permission boundaries, but resisting the urge to complete forms too thoroughly. That sounds minor until you look at the mechanism. Once a model is optimized to finish the task, every visible blank starts to look like progress, even when leaving it empty is the safer choice.

The rankings changed depending on what you measured: Claude led raw task success and later memory use, Kimi led average privacy, and Qwen narrowly led the combined score that required both completion and acceptable privacy.

So the real lesson is not that phone agents are useless. It is that success-only benchmarks confuse capability with judgment, and on a device as intimate as a phone, that gap is the whole story.

----
Paper Link: arxiv.org/abs/2604.00986
Paper Title: "Do Phone-Use Agents Respect Your Privacy?"

Summary: The study finds serious privacy risks when phone agents carry out everyday tasks. On MyPhoneBench, the best model completed 82.8% of tasks, but the best privacy-qualified score was only 47.6%. The risk comes from "over-helpfulness": to finish the task, models ask for personal data they do not need, re-disclose data to irrelevant widgets, or over-fill optional fields. Claude led on task success, Kimi on privacy, and Qwen on the combined score. The study shows that success-only benchmarks confuse capability with judgment, which is a serious hazard on a device as private as a phone.

Rohan Paul @rohanpaul_ai · April 22

New University of Luxembourg+LIH paper reveals critical gaps in LLMs' ability to handle structured reasoning under constraints. It checks if LLMs can solve Optimal Power Flow problems end to end, and finds that they mostly cannot do so in a physically coherent way. Across models and sizes, constraint satisfaction stayed stuck at about 55 to 60 percent.

The interesting result here is not that LLMs miss a hard engineering problem. It is that they miss it in a very specific way. Optimal Power Flow is a brutal test of real reasoning because it is not just about getting numbers close to a target, but about satisfying a web of physical constraints at the same time, from generator limits to bus voltages to the power-flow equations themselves.

That sounds minor until you look at the mechanism. A model can produce an answer that looks clean, uses the right JSON, and even lands near the right values on mean squared error, while still violating the equations that make the grid physically coherent. This paper shows exactly that failure mode. Across several model families and sizes, constraint satisfaction sits in a stubborn band around 55 to 60 percent, and the main bottleneck is the power-flow constraints, while generator and voltage limits are often satisfied far more easily, as the table on page 12 makes plain.

Here's the part most people miss. That pattern is not a small bug in prompting. It suggests the models are learning the shape of a solution without actually carrying out the constrained search that the problem demands.

The ablations make the point sharper. Supervised fine-tuning improves formatting and often lowers MSE, but it does not materially improve physical feasibility, and even a more elaborate system prompt barely moves the numbers, which is about as clean a rejection of "prompting will fix it" as you can ask for. Reinforcement learning with a reward for valid structure and satisfied constraints helps a bit, especially on the 30-bus case, but even there the gains are modest rather than transformative, as the study overview on page 2 and results plots on pages 7 and 8 show.

So the real lesson is not that LLMs cannot reason. It is that fluent approximation is not the same thing as optimization under law, and until models can reliably honor the constraints that define a system, "looks plausible" remains a very dangerous standard.

----
Paper Link: arxiv.org/abs/2603.23004v1
Paper Title: "Can LLMs Reason and Optimize Under Constraints?"

Summary: Research from the University of Luxembourg and LIH reveals a key weakness in LLMs' structured reasoning under constraints. Tests on the Optimal Power Flow problem show constraint satisfaction stuck at 55%-60% across models, with the main bottleneck being failure to satisfy the physical constraint equations of the power system. The study suggests models learn only the "shape of a solution" without actually performing a constrained search, so outputs look plausible (correct format, small error) but are physically infeasible. Supervised fine-tuning improves surface metrics but not physical feasibility, and reinforcement learning has limited effect as well. The warning: fluent approximation is not constrained optimization, and "looks plausible" is a dangerous standard.
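The failure mode described above, an answer that is close on mean squared error yet physically infeasible, can be reproduced with a toy example. The sketch uses a lossless power-balance check (total generation equals total load) as a much simpler stand-in for the full AC power-flow equations used in the paper; the numbers, limits, and function names are all invented for illustration.

```python
from typing import List, Tuple

def mse(pred: List[float], ref: List[float]) -> float:
    """Mean squared error between a predicted and a reference dispatch."""
    return sum((p - r) ** 2 for p, r in zip(pred, ref)) / len(ref)

def balanced(generation: List[float], load: float, tol: float = 1e-3) -> bool:
    """Simplified feasibility check: total generation must equal load within tol."""
    return abs(sum(generation) - load) <= tol

def within_limits(generation: List[float], limits: List[Tuple[float, float]]) -> bool:
    """Each generator must stay inside its (min, max) output range."""
    return all(lo <= g <= hi for g, (lo, hi) in zip(generation, limits))

load_mw = 150.0
limits = [(0.0, 100.0), (0.0, 100.0)]   # hypothetical generator limits in MW
reference = [90.0, 60.0]                # a feasible dispatch
candidate = [89.0, 59.5]                # numerically close, inside limits, but short of the load
```

Here `mse(candidate, reference)` is small and `within_limits` passes, yet `balanced(candidate, load_mw)` is false because the dispatch falls 1.5 MW short. That is the pattern the post reports: limits are easy to satisfy, while the equality constraints are where plausible-looking answers break.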

OpenAI @OpenAI · April 22

What makes ChatGPT Images 2.0 a state-of-the-art image generation model? Researchers behind the model explain. A thread:

Thinking & Intelligence in ChatGPT Images 2.0, demonstrated by @ayaanzhaque

Ethan Mollick @emollick · April 22

LLMs are still not consistent judges of qualitative work, and small changes to how that work is presented affect outcomes. Better harnessing and methods (multiple judging runs with randomized orders, etc.) would certainly help, but the jagged frontier is very much still real.

AK @_akhaliq · April 22 · 44

OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation
paper: https://huggingface.co/papers/2604.18486

AK @_akhaliq · April 22 · 47

Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence
paper: https://huggingface.co/papers/2604.18292

AK @_akhaliq · April 22

Extending One-Step Image Generation from Class Labels to Text via Discriminative Text Representation
paper: https://huggingface.co/papers/2604.18168

AK @_akhaliq · April 22 · 39

MathNet: a Global Multimodal Benchmark for Mathematical Reasoning and Retrieval
paper: https://huggingface.co/papers/2604.18584

AK @_akhaliq · April 21 · 39

OpenGame: Open Agentic Coding for Games
paper: https://huggingface.co/papers/2604.18394

Ethan Mollick @emollick · April 21

Classic study gave 146 economist teams the same dataset & got wildly different answers. New paper reruns it with agentic AI. Claude Code & Codex land near the human median, but with far tighter dispersion & no extremes. Suggests that AI is now useful for doing scalable research.

AK @_akhaliq · April 21 · 48

PersonaVLM: Long-Term Personalized Multimodal LLMs
paper: https://huggingface.co/papers/2604.13074

AK @_akhaliq · April 21 · 37

Elucidating the SNR-t Bias of Diffusion Probabilistic Models
paper: https://huggingface.co/papers/2604.16044

AK @_akhaliq · April 21

Maximal Brain Damage Without Data or Optimization: Disrupting Neural Networks via Sign-Bit Flips
paper: https://huggingface.co/papers/2502.07408

Rohan Paul @rohanpaul_ai · April 19

Big claim in this paper. "Prefill-as-a-Service"

Prefill, the heaviest part of inference, may finally be portable. Long-context AI is no longer trapped inside a single datacenter. The paper shows how to run LLM prefill on remote clusters by sending much smaller saved prompt state. So long-prompt work can be done on remote machines, which send back only the smaller saved state needed to answer. The breakthrough is not sending everything farther, but sending the right requests farther.

---

When you ask a model a long question, it first has to read and digest the whole prompt before it starts answering. That first step is called prefill, and it is brutally compute-heavy. The second step is decode, where the model generates tokens one by one, and that part is more about memory bandwidth than raw compute. But moving the saved prompt state between those phases is usually so data-heavy that both parts must stay in the same tightly connected cluster.

Until now, those two steps usually had to stay close together inside the same fast network, because prefill creates a huge blob of temporary memory called KVCache that had to be moved quickly to the decode machine. That is the bottleneck.

What changed is model design. Newer hybrid-attention models produce much smaller KVCache than older dense-attention models, so shipping that state across ordinary datacenter links starts to become practical instead of absurd.

The paper's idea is a Prefill-as-a-Service setup that sends only long, uncached prompts to a remote prefill cluster, then ships back the saved prompt state, called KV cache, over normal Ethernet while short requests stay local. This works mainly because newer hybrid-attention models create far less KV cache than older dense models, and the system adds smart routing, bandwidth-aware scheduling, and cache-aware placement so the network does not clog up.

The authors test this with an internal 1T-parameter hybrid model on a mixed setup that uses H200 GPUs for remote prefill and H20 GPUs for local decode. With a routing threshold near 19.4K tokens, about 50% of requests go remote, average cross-cluster traffic is only 13Gbps on a 100Gbps link, and throughput rises 54% over a local-only baseline and 32% over a naive heterogeneous setup.

The real point is that smaller KV cache alone was not enough, but paired with selective offloading and scheduling it makes cross-datacenter LLM serving workable, more flexible, and easier to scale across different hardware.

----
Paper Link: arxiv.org/abs/2604.15039v1
Paper Title: "Prefill-as-a-Service: KVCache of Next-Generation Models Could Go Cross-Datacenter"

Summary: Next-generation hybrid-attention models shrink the KV cache enough to make a Prefill-as-a-Service architecture possible. The scheme offloads the compute-heavy prefill phase to a remote cluster and ships only the lightweight KV cache back for local decoding, while short requests are handled locally. With smart routing and bandwidth-aware scheduling, the transfer works efficiently over ordinary Ethernet. In tests on a 1T-parameter model, with 50% of requests handled remotely, cross-cluster traffic was only 13Gbps and throughput rose 54%, breaking the constraint that long-context AI must stay inside a single datacenter.
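The selective-offload rule (only long, uncached prompts go to the remote prefill cluster) reduces to a threshold test. The sketch below uses the 19.4K-token threshold reported in the post; the `Request` type, field names, and `route` function are hypothetical, and the bandwidth-aware scheduling and cache-aware placement the paper adds are deliberately left out.

```python
from dataclasses import dataclass

ROUTE_THRESHOLD_TOKENS = 19_400  # the post reports a routing threshold near 19.4K tokens

@dataclass
class Request:
    prompt_tokens: int
    cached_prefix_tokens: int = 0   # assumed: tokens already covered by a local KV cache hit

def route(req: Request, threshold: int = ROUTE_THRESHOLD_TOKENS) -> str:
    """Send only long, uncached prefill work to the remote cluster; keep the rest local."""
    uncached = req.prompt_tokens - req.cached_prefix_tokens
    return "remote_prefill" if uncached >= threshold else "local"
```

For example, `route(Request(50_000))` goes remote, while `route(Request(50_000, cached_prefix_tokens=45_000))` stays local because only 5,000 tokens still need prefill. Subtracting the cached prefix is an assumption of this sketch about how "uncached" might be measured.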

Rohan Paul @rohanpaul_ai · April 19

Anonymous usernames are no longer much protection when LLMs can piece together a person's public trail. LLMs can identify supposedly anonymous people online by turning messy posts into personal clues. The best setup finds 68% of true matches at 90% precision, meaning 9 out of 10 guesses are right, while older methods stay near 0%.

The problem is that pseudonyms often seemed safe only because linking a person across sites used to take lots of careful manual work. This paper cuts that work by making an LLM do 3 jobs: pull identity hints from raw text, search a huge pool of possible matches, and compare the best candidates to reject weak fits.

The authors tested this on 3 cases: matching Hacker News users to LinkedIn profiles, matching Reddit movie users across communities, and matching the same Reddit users across different time periods.

The main result is that the reasoning step beats simple matching by a wide margin and stays useful even as the candidate pool grows, which matters because it shows that public writing alone can now be enough to join accounts or name a person at scale.

----
Paper Link: arxiv.org/abs/2602.16800
Paper Title: "Large-scale online deanonymization with LLMs"

Summary: LLMs can deanonymize people at scale by analyzing public writing. The study has the model do three jobs: extract identity clues, search a pool of possible matches, and compare candidates to verify them. In tests matching Hacker News to LinkedIn, Reddit users across communities, and Reddit users across time periods, it reaches 68% recall at 90% precision, far ahead of older methods. The key advance is that the reasoning step copes with large candidate pools, showing that scattered public text is now enough to link accounts and identify individuals, and that traditional pseudonymity no longer protects.
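To make "68% of true matches at 90% precision" concrete, the sketch below computes both metrics from counts. The counts are hypothetical, chosen only so that the arithmetic lands on the reported operating point; they are not from the paper.

```python
from typing import Tuple

def precision_recall(true_positives: int, false_positives: int, false_negatives: int) -> Tuple[float, float]:
    """Precision: share of guesses that are right. Recall: share of true matches found."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return precision, recall

# Hypothetical counts: 900 true account pairs exist, the system makes 680 guesses, 612 are correct.
p, r = precision_recall(true_positives=612, false_positives=68, false_negatives=288)
```

With these counts, 612 of 680 guesses are right (precision 0.90) and 612 of 900 true pairs are found (recall 0.68). The two numbers answer different questions: how much an attacker can trust a match, and how many people are exposed.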

Rohan Paul @rohanpaul_ai · April 18

Interesting paper title😀 "What the F*ck Is Artificial General Intelligence?" It defines intelligence as adaptability under limits of compute, memory, and energy. So AGI is a system that adapts at least as generally as a human scientist. That means it should be able to plan experiments, learn cause and effect, balance exploration and action, and operate with autonomy. The paper calls this type of AGI an artificial scientist, because it is judged by its ability to discover and adapt across many tasks, not just by passing human-like tests. So AGI is not just “human-level AI” but a whole system that can adapt broadly, efficiently, and scientifically, at least as well as a human scientist. ---- arxiv.org/abs/2503.23923

Summary: One paper argues that intelligence is, at its core, adaptability under limits of compute, memory, and energy. On that basis, AGI is defined as a system that adapts at least as generally as a human scientist, which requires the ability to plan experiments, learn cause and effect, balance exploration and action, and operate autonomously. The paper calls this kind of AGI an artificial scientist, stressing that it is judged by its ability to discover and adapt across tasks, not by passing human-like tests. The authors say AGI is not simply "human-level AI" but a whole system that can adapt broadly, efficiently, and scientifically.

Epoch AI @EpochAIResearch · April 18

Have AI capabilities accelerated? On 3 out of the 4 AI capability metrics we investigated, we found strong evidence of acceleration, around when reasoning models emerged.

AK @_akhaliq · April 18 · 39

UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards. Paper: https://huggingface.co/papers/2604.14967

AK @_akhaliq · April 18 · 46

RAD-2: Scaling Reinforcement Learning in a Generator-Discriminator Framework. Paper: https://huggingface.co/papers/2604.15308

AK @_akhaliq · April 18 · 55

DR3-Eval: Towards Realistic and Reproducible Deep Research Evaluation. Paper: https://huggingface.co/papers/2604.14683

AK @_akhaliq · April 17 · 46

HY-World 2.0: A Multi-Modal World Model for Reconstructing, Generating, and Simulating 3D Worlds. Paper: https://huggingface.co/papers/2604.14268

All AI Updates

The full feed of AI-related news

April 28

20:06

Rohan Paul@rohanpaul_ai

47

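Where AI agents really fail: not single tool calls, but coordinating many tools over time

This paper surveys progress in multi-tool LLM agents and argues that they fail mainly because they cannot coordinate many tools reliably over long horizons, not because individual tool calls go wrong. It treats multi-tool orchestration as a problem in its own right: the agent must select, order, monitor, and retry tool actions. The authors review six areas: run-time planning, training data and tuning, safety, efficiency, handling of missing tools, and benchmarks built on harder interactive tasks. The key finding is that progress now depends less on single-call accuracy and more on graph-style planning, memory, verification, rollback, and better evaluation of long-running tool use. Research and benchmarks are moving from simple single-call tests toward harder, more realistic tasks that require agents to stay reliable across long tool chains.

Agents · MCP/Tools · Papers/Research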

09:55

meng shao@shao__meng

71

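VLAA-GUI: teaching GUI agents to stop, recover, and search

The study argues that the main bottleneck for current GUI agents is system design, not model capability, and that it shows up as false success and infinite loops. The VLAA-GUI framework addresses this with three modules: a STOP verifier that checks the task is truly complete, a RECOVER loop breaker that interrupts repeated actions, and a SEARCH agent that fetches outside knowledge directly. On the OSWorld benchmark, the framework takes Opus 4.6 to a 77.5% success rate, the first result above the human level (72.4%); on WindowsAgentArena, paired with Gemini 3 Flash, it sets a new record at 61.0%. The takeaway is that careful system design matters as much as a strong model.

Cihang Xie: 🚀 GUI agents are advancing fast - yet they still stumble on surprisingly simple things: • declare success too early • g...

Agents · Open Source/Repos · Papers/Research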

08:31

Ethan Mollick@emollick

60

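Researchers released Talkie, a 13B model trained only on text from before 1931, to probe how language models generalize. The model knows information up to 1931, but on some scientific topics it stays within an early-20th-century frame: it still defends the "luminiferous ether" hypothesis and distrusts special relativity. This shows how strongly the time span of the training data fixes a model's knowledge and worldview.

Nick Levine: New work with @AlecRad and @DavidDuvenaud: Have you ever dreamed of talking to someone from the past? Introducing talkie...

Data/Training · Phenomena/Trends · Papers/Research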

04:30

Rohan Paul@rohanpaul_ai

56

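Tuning RAG for precision can quietly hurt retrieval and put agent pipelines at risk

New research finds that when enterprises fine-tune RAG embedding models for precision, retrieval quality can drop by as much as 40%. The underlying tension is that a single dense embedding vector is asked to do two jobs at once: broad topical recall and precise semantic discrimination. Forcing the model to separate fine structural differences (such as negation or reversed word order) weakens its ability to gather related material across domains. The proposed fix is two-stage retrieval: recall quickly with the embedding model, then verify the candidates with a structure-aware, token-level comparison. The point is that "nearly identical sentences" and "identical meaning" are different things, and confusing them in high-precision domains such as contracts and compliance causes critical system failures.

Retrieval Augmentation · Papers/Research · Deployment/Engineering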
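
The two-stage idea in the summary above can be illustrated with a toy sketch. Everything here is an assumption made for illustration: a bag-of-words cosine stands in for the dense embedding model, and the "structure-aware" check is a simple rule (same negation words, similar token order via Python's `difflib.SequenceMatcher`), not the verifier used in the research.

```python
import math
from collections import Counter
from difflib import SequenceMatcher

# Illustrative negation list; a real verifier would be far richer.
NEGATIONS = {"not", "no", "never", "without"}


def tokens(text: str) -> list[str]:
    return text.lower().split()


def cosine(a: str, b: str) -> float:
    """Order-insensitive similarity, standing in for a dense embedding."""
    ca, cb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


def recall_candidates(query: str, corpus: list[str], k: int = 3) -> list[str]:
    """Stage 1: broad topical recall by similarity score."""
    return sorted(corpus, key=lambda doc: cosine(query, doc), reverse=True)[:k]


def structurally_consistent(query: str, doc: str, min_order_ratio: float = 0.8) -> bool:
    """Stage 2: token-level check for matching negation and word order."""
    q, d = tokens(query), tokens(doc)
    same_negation = (NEGATIONS & set(q)) == (NEGATIONS & set(d))
    order_ratio = SequenceMatcher(None, q, d).ratio()
    return same_negation and order_ratio >= min_order_ratio


def retrieve(query: str, corpus: list[str], k: int = 3) -> list[str]:
    """Recall with stage 1, then keep only candidates that pass stage 2."""
    return [doc for doc in recall_candidates(query, corpus, k) if structurally_consistent(query, doc)]
```

For a query such as "the supplier shall indemnify the buyer", stage 1 also recalls the negated clause and the clause with the parties swapped, since all three share almost the same words; stage 2 is what rejects them.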

00:49

AK@_akhaliq

49

Building a Precise Video Language with Human-AI Oversight. Paper: https://huggingface.co/papers/2604.21718

Multimodal · Video · Papers/Research

00:46

AK@_akhaliq

53

Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond. Paper: https://huggingface.co/papers/2604.22748

Agents · Embodied AI · Papers/Research

00:34

AK@_akhaliq

48

Video Analysis and Generation via a Semantic Progress Function. Paper: https://huggingface.co/papers/2604.22554

Multimodal · Video · Papers/Research

00:33

elvis@omarsar0

69

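A systematic study of token costs for AI agents on coding tasks finds that agents can consume roughly 1000 times as many tokens as chat or code reasoning, and that the same task can vary by up to 30x between runs. Spending more tokens does not directly buy accuracy: performance peaks at moderate cost and then saturates. Models also predict their own token usage poorly, with self-prediction correlation topping out at 0.39. On the same task, one model may burn 1.5 million more tokens than another with no gain in quality. Agent run-time cost is therefore high-variance, weakly tied to quality, and hard even for the model itself to predict, which affects how teams budget, route between models, and decide when to stop a run.

DAIR.AI: How do AI Agents spend your money? Most teams treat agent token costs as a rounding error even though the data says they...

Agents · Papers/Research · Deployment/Engineering

April 27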

23:28

elvis@omarsar0

63

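40 researchers propose a "capability levels × law regimes" framework for agent world models

A survey by 40 authors proposes a "capability levels × law regimes" taxonomy of world models for agent research. The three capability levels are the L1 Predictor, which makes one-step predictions; the L2 Simulator, which runs multi-step, action-conditioned rollouts; and the L3 Evolver, which revises itself as the world changes. The law regimes cover four domains: physical, digital, social, and scientific. The framework synthesizes more than 400 papers and more than 100 representative systems across model-based reinforcement learning, video generation, web/GUI agents, multi-agent simulation, and scientific discovery, and it identifies failure modes and evaluation principles for each level. Its core value is that as agents move from chatbots to goal achievers, the bottleneck moves from language to the environment, and the framework gives researchers in different fields a common language for designing and evaluating world models.

Agents · Phenomena/Trends · Papers/Research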

23:28

elvis@omarsar0

62

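Automated management of multi-agent systems emerges as a frontier AI research direction

The post argues that optimizing multi-agent systems to automate knowledge discovery or to tune advanced AI systems is one of the most promising directions in AI right now. The cited work uses reinforcement learning to train a "conductor" model that manages other models automatically: it sends simple questions straight to a single model and, for complex coding tasks, assembles a full pipeline of planner, coder, and verifier on its own. This marks a move from single-agent "chain of thought" to a multi-agent "chain of command". The technique is already used in new systems such as Sakana Fugu and shows how much room there is to explore in AI managing AI.

hardmaru: For the past few years, humans have been doing "prompt engineering" to coax the best performance out of different LLMs. ...

Agents · Data/Training · Papers/Research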

04:59

elvis@omarsar0

64

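Alibaba releases a new agent training method: dual reinforcement learning flywheels yield an efficient tool-use model

Alibaba proposes training agents with dual reinforcement learning flywheels and, based on this method, releases AgenticQwen-30B-A3B. The model has 30 billion total parameters but activates only 3 billion per inference. It averages 50.2 on the TAU-2 and BFCL-V4 multi-turn tool-use benchmarks, on par with the 235-billion-parameter Qwen3-235B. The core idea is to run two flywheels in parallel: a reasoning loop turns the model's own errors into harder training problems, and an agentic loop expands simple tool-use trajectories into multi-branch behavior trees and raises the difficulty by simulating users who mislead the agent. For developers this means routine tool tasks need not incur frontier-model costs, and the flywheel recipe is reusable because it generates hard examples from the agent's own failures.

Agents · Reasoning/Inference · Papers/Research · Deployment/Engineering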

00:54

elvis@omarsar0

54

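New research proposes a co-evolution framework for agents to address long-horizon planning and stale skill libraries

When building complex agents, long-horizon agents often fail because the decision maker decomposes tasks poorly or because the skill library is out of date. The new work proposes a co-evolution framework in which an LLM decision agent and a dynamic skill library improve together through iterative optimization. The decision agent selects and chains skills, and performance feedback updates both its policy and the skill library itself. New skills are generated automatically by generalizing from successful sequences, not hand-coded in advance. Conventional approaches optimize skills and decisions as separate problems and tend to hit a bottleneck. Co-evolution achieves adaptive planning in a single loop while growing a library of reusable behaviors, which matters in domains with uncertain task structure such as robotics, game agents, and complex planning.

Agents · Embodied AI · Papers/Research

April 26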

23:20

meng shao@shao__meng

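Featured · 77

[Paper share] A deep dive into Claude Code's architecture: the design philosophy and implementation of a production-grade coding agent

By analyzing leaked Claude Code source, the paper shows that the core of its production-grade coding agent architecture is a design of "minimal AI decision-making plus a maximally deterministic environment". Only about 1.6% of the code is AI logic; the remaining 98.4% builds a safe, reliable operating harness. The architecture is driven by five values, including human decision authority and safety. It protects tool calls with seven independent layers of defense and manages the context window with a five-layer progressive compaction strategy. Extension mechanisms are tiered by context cost, sub-agents are isolated, and the design as a whole stresses transparency and user control, in sharp contrast to mainstream approaches that rely on state graphs or explicit planning.

BURKOV: A must read for anyone interested in building practical AI systems in 2026: Dive into Claude Code: The Design Space of T...

Agents · Anthropic · Coding · Papers/Research

Why it's recommended: This paper reverse-engineers Claude Code's full architecture. The most valuable part is not the 13 design principles but the 1.6% vs. 98.4% figure, which directly answers the question of where an agent system's engineering effort should go. Anyone building a coding agent should read it as a design reference.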

04:52

elvis@omarsar0

53

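The study proposes PARE, a framework that models applications as finite state machines with stateful navigation and state-dependent actions, allowing more realistic evaluation of proactive AI agents. PARE-Bench, built on it, contains 143 tasks across communication, productivity, and other domains, and tests an agent's situational observation, goal inference, intervention timing, and multi-app coordination. The work fills a gap in mainstream benchmarks, which treat apps as flat APIs and ignore the stateful, sequential nature of real interaction, and it offers a principled way to measure whether an agent can infer a user's unstated goal and act at the right moment.

DAIR.AI: Great paper on improving proactive agents. (bookmark it) Proactive agents act before you do. But how do you evaluate som...

Agents · Papers/Research · Evaluation/Benchmarks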

01:02

elvis@omarsar0

63

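Microsoft paper shows AI long-document editing workflows routinely corrupt content

A new Microsoft paper introduces DELEGATE-52, a benchmark that simulates long-document editing workflows across 52 professional domains. Testing 19 models, including frontier models such as Gemini 3.1 Pro, Claude 4.6 Opus, and GPT-5.4, it finds that on average 25% of document content is corrupted by the end of a long workflow. Agentic tool use did not improve results. The paper reports further findings as well.

Papers/Research · Evaluation/Benchmarks · Deployment/Engineering

April 25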

00:20

AK@_akhaliq

39

Context Unrolling in Omni Models. Paper: https://huggingface.co/papers/2604.21921

Hugging Face · Multimodal · Papers/Research

April 24

11:19

AK@_akhaliq

44

Seeing Fast and Slow: Learning the Flow of Time in Videos. Paper: https://huggingface.co/papers/2604.21931

Multimodal · Video · Papers/Research

00:48

AK@_akhaliq

39

Near-Future Policy Optimization. Paper: https://huggingface.co/papers/2604.20733

Reasoning/Inference · Data/Training · Papers/Research

00:07

Saining Xie@sainingxie

72

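vision🍌 is out now: https://vision-banana.github.io/ If you came into computer vision the way I did, starting from pixel-level labeling tasks such as segmentation, edges, depth, or surface normals, these results may give you the same feeling: a major shift has quietly taken place, and it will permanently change how we approach these problems 🧵

Image Generation · Multimodal · Papers/Research

April 22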

15:14

Rohan Paul@rohanpaul_ai

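Do phone agents respect your privacy?

The study finds that phone agents pose serious privacy risks when carrying out everyday tasks. On the MyPhoneBench evaluation, the best model completes 82.8% of tasks but scores only 47.6% on privacy compliance. The risk comes from being "over-helpful": to finish a task, models ask for personal information they do not need, disclose data repeatedly to unrelated components, or overfill optional fields. Claude leads on task success, Kimi protects privacy best, and Qwen has the highest overall score. The study shows that benchmarks based on success rate alone conflate capability with judgment, a serious safety concern on a device as personal as a phone.

Agents · Anthropic · Safety/Alignment · Papers/Research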

14:44

Rohan Paul@rohanpaul_ai

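University of Luxembourg and LIH study exposes a key flaw in LLM constrained reasoning

A study from the University of Luxembourg and LIH finds a key flaw in how LLMs handle structured constrained reasoning. Tests on the optimal power flow problem show constraint satisfaction rates stalling at 55% to 60% across model types, with the main bottleneck being failure to satisfy the physical constraint equations of the power system. The models learn the "shape of a solution" without actually performing a constrained search, so outputs look plausible (correct format, small error) yet are physically infeasible. Supervised fine-tuning improves surface metrics but not physical feasibility, and reinforcement learning helps only a little. The warning: fluent approximation is not constrained optimization, and "looks reasonable" is a dangerous standard.

arXiv · Reasoning/Inference · Data/Training · Papers/Research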

05:07

OpenAI@OpenAI

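What makes ChatGPT Images 2.0 the most advanced image generation model? The researchers behind the model explain. Thread: thinking and intelligence in ChatGPT Images 2.0, demonstrated by @ayaanzhaque

OpenAI · Image Generation · Reasoning/Inference · Papers/Research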

03:37

Ethan Mollick@emollick

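LLMs are still inconsistent when judging qualitative work, and small changes in how the work is presented affect the result. Better usage and methods (judging multiple times, randomizing order, and so on) certainly help, but the jagged frontier is still real.

Lech Mazur: Does an LLM keep the same judgment when you swap the answer order? New LLM Position Bias Benchmark! Judge models compare...

OpenAI · Reasoning/Inference · Papers/Research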

01:44

AK@_akhaliq

44

OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation. Paper: https://huggingface.co/papers/2604.18486

Multimodal · Reasoning/Inference · Papers/Research

01:14

AK@_akhaliq

47

Agent World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence. Paper: https://huggingface.co/papers/2604.18292

Agents · Embodied AI · Papers/Research

00:14

AK@_akhaliq

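Extending One-Step Image Generation from Class Labels to Text via Discriminative Text Representation. Paper: https://huggingface.co/papers/2604.18168

Hugging Face · Image Generation · Papers/Research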

00:14

AK@_akhaliq

39

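MathNet: A Global Multimodal Benchmark for Mathematical Reasoning and Retrieval. Paper: https://huggingface.co/papers/2604.18584

Reasoning/Inference · Papers/Research · Evaluation/Benchmarks

April 21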

23:42

AK@_akhaliq

39

OpenGame: Open Agentic Coding for Games. Paper: https://huggingface.co/papers/2604.18394

Agents · Coding · Papers/Research

07:06

Ethan Mollick@emollick

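A classic study gave the same dataset to 146 teams of economists and got wildly different results. A new paper reruns it with agentic AI: Claude Code and Codex land close to the human median, with a narrower spread and no extreme values. This suggests AI can now be used for scalable research.

Agents · Anthropic · OpenAI · Coding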

06:05

AK@_akhaliq

48

PersonaVLM: Long-Term Personalized Multimodal LLMs. Paper: https://huggingface.co/papers/2604.13074

Agents · Multimodal · Papers/Research

02:04

AK@_akhaliq

37

Elucidating the SNR-t Bias of Diffusion Probabilistic Models. Paper: https://huggingface.co/papers/2604.16044

Image Generation · Papers/Research

02:04

AK@_akhaliq

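Maximal Brain Damage Without Data or Optimization: Disrupting Neural Networks via Sign-Bit Flips. Paper: https://huggingface.co/papers/2502.07408

Hugging Face · Safety/Alignment · Papers/Research

April 19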

17:44

Rohan Paul@rohanpaul_ai

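Prefill-as-a-Service: the KV cache of next-generation models can go cross-datacenter

Next-generation hybrid-attention models compress the KV cache enough to make a Prefill-as-a-Service architecture feasible. The scheme offloads the compute-heavy prefill stage to a remote cluster and sends back only the lightweight KV cache for local decoding, while short requests are handled locally. Combined with smart routing and bandwidth-aware scheduling, the transfer runs efficiently over ordinary Ethernet. In tests with a 1T-parameter model, with 50% of requests processed remotely, cross-cluster traffic was only 13 Gbps and throughput rose 54%, removing the bottleneck that kept long-context AI confined to a single datacenter.

arXiv · Reasoning/Inference · Papers/Research · Deployment/Engineering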

15:44

Rohan Paul@rohanpaul_ai

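LLMs break online anonymity: public text can be linked to real identities with high precision

LLMs can deanonymize people at scale by analyzing their public writing. The study has the model perform three tasks: extract identity clues, search a pool of possible matches, and compare candidates to verify them. In tests matching Hacker News users to LinkedIn profiles, Reddit users across communities, and Reddit users across time periods, it reaches 68% recall at 90% precision, far ahead of older methods. The key advance is that the reasoning step copes with large candidate pools, showing that scattered public text is now enough to link accounts and identify individuals, so traditional pseudonymity no longer protects.

arXiv · Safety/Alignment · Reasoning/Inference · Papers/Research

April 18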

05:44

Rohan Paul@rohanpaul_ai

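A new definition of AGI: not just human-level AI, but an artificial scientist

One paper argues that intelligence is, at its core, adaptability under limits of compute, memory, and energy. On that basis, AGI is defined as a system that adapts at least as generally as a human scientist, which requires the ability to plan experiments, learn cause and effect, balance exploration and action, and operate autonomously. The paper calls this kind of AGI an artificial scientist, stressing that it is judged by its ability to discover and adapt across tasks, not by passing human-like tests. The authors say AGI is not simply "human-level AI" but a whole system that can adapt broadly, efficiently, and scientifically.

arXiv · Reasoning/Inference · Papers/Research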

03:44

Epoch AI@EpochAIResearch

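Have AI capabilities accelerated? On 3 out of the 4 AI capability metrics we investigated, we found strong evidence of acceleration, around when reasoning models emerged.

Reasoning/Inference · Data/Training · Papers/Research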

00:58

AK@_akhaliq

39

UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards. Paper: https://huggingface.co/papers/2604.14967

Retrieval Augmentation · Multimodal · Papers/Research

00:28

AK@_akhaliq

46

RAD-2: Scaling Reinforcement Learning in a Generator-Discriminator Framework. Paper: https://huggingface.co/papers/2604.15308

Data/Training · Papers/Research

00:28

AK@_akhaliq

55

DR3-Eval: Towards Realistic and Reproducible Deep Research Evaluation. Paper: https://huggingface.co/papers/2604.14683

Agents · Papers/Research · Evaluation/Benchmarks

April 17

23:58

AK@_akhaliq

46

HY-World 2.0: A Multi-Modal World Model for Reconstructing, Generating, and Simulating 3D Worlds. Paper: https://huggingface.co/papers/2604.14268

Embodied AI · Multimodal · Papers/Research
