// Critique of the Agent Model // Finally, a paper that tries to define what an agent is and what agency consists of. Good read overall. (great bookmark) The word agent now covers everything from a for-loop with tool calls to speculative machine superintelligence. Eric Xing and colleagues ask where automation ends, and agency begins. Drawing on Descartes and on science-fiction portrayals of autonomous beings, they analyze agent architectures along five dimensions: goal, identity, decision-making, self-regulation, and learning. The argument is that genuine agency requires these structures to hold together in a specific way. Great paper overall, providing a vocabulary for arguing about what is and is not an agent. Paper: https://arxiv.org/abs/2606.23991 Learn to build effective AI agents in our academy: https://academy.dair.ai/

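Summary: Elvis Saravia recommends a paper that tries to pin down the definition of an "agent". Eric Xing and colleagues draw on philosophy and science fiction to analyze agent architectures along five dimensions: goal, identity, decision-making, self-regulation, and learning. The paper argues that genuine agency requires these dimensions to combine in a specific way, which is what separates automation from agency. Paper: arxiv.org/abs/2606.23991.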

AK@_akhaliq · Jun 25 · 52

Qwen-AgentWorld: Language World Models for General Agents

Summary: Qwen-AgentWorld, language world models designed for general agents.

OpenBMB@OpenBMB · Jun 24 · 36

LLMs don't just hallucinate because they lack knowledge—they hallucinate because they don't know what they don't know. Existing knowledge augmentation blindly injects more data, treating every error as a knowledge gap. But overconfident wrong answers and uncertain correct ones reveal a deeper problem: cognitive misalignment. 🤔 Today, we dive into Know More, Know Clearer—a meta-cognitive framework by @TsinghuaNLP (OpenBMB member) alongside researchers from Harbin Institute of Technology and Northeastern University. The team proposes a unified system that diagnoses a model's cognitive state and applies targeted intervention—not indiscriminate knowledge stuffing. 📄 arXiv: https://arxiv.org/abs/2602.12996 🤗 Paper: https://huggingface.co/papers/2602.12996 Why it matters: 1⃣️ The Structural Decay Law: A Universal Foundation: The team discovers that accuracy exhibits a stable exponential decay relative to uncertainty: E[Acc|U] ≈ a·exp(−U) + b. Validated across 6 architectures (Qwen, Llama, Mistral), this proves internal confidence signals structurally encode performance—not random noise—providing a rigorous basis for meta-cognitive optimization. 2⃣️ Know More (CGKE): Differentiated, Not Indiscriminate: Rather than uniform knowledge injection, the framework partitions the knowledge space into Mastered, Confused, and Missing regions via self-sampled behavioral profiling. Each region receives a tailored augmentation strategy—boundary expansion, structural disambiguation, or epistemic foundation—targeting exactly where the model needs it most. Ablation shows removing the "Confused" category causes the largest performance drop. 3⃣️ Know Clearer (CDKC): Aligning Confidence with Correctness: A cognitive consistency alignment mechanism built on GRPO actively recalibrates the model's confidence landscape—sharpening distributions on correct paths, dispersing them on incorrect ones. 
Result: average ECE drops from 60.41 to 24.34, and the model learns to genuinely know its own limits rather than learning to refuse everything. 4⃣️ Results: 24.59-Point Gain and True Self-Knowledge: On 11 QA benchmarks, CDKC (2-round) lifts Llama-3.1-8B from 30.91% to 55.50% (+24.59 pts) and Qwen2.5-7B from 25.76% to 48.29% (+22.53 pts). On self-knowledge benchmarks, the framework achieves a CBS of 73.43% and CAE of 68.18%—delivering 63.37% correct answering decisions while maintaining 79.07% boundary recognition, the best balance of any method tested. Knowledge augmentation is not merely about knowing more—it's about knowing more clearly. This framework sets a new standard for reliable, calibrated knowledge in LLMs. #AI #THUNLP #OpenBMB #LLM #KnowledgeAugmentation #Hallucination #MetaCognition #NLP

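Summary: ModelBest's OpenBMB, together with TsinghuaNLP, Harbin Institute of Technology, and Northeastern University, proposes Know More, Know Clearer, a meta-cognitive framework targeting LLM hallucinations caused by cognitive misalignment. The framework has three parts: the Structural Decay Law (accuracy decays exponentially with uncertainty); Know More (CGKE), which splits the knowledge space into Mastered, Confused, and Missing regions for targeted augmentation; and Know Clearer (CDKC), which aligns confidence using GRPO and lowers average ECE from 60.41 to 24.34. On 11 QA benchmarks, CDKC lifts Llama-3.1-8B from 30.91% to 55.50% (+24.59 pts) and Qwen2.5-7B from 25.76% to 48.29% (+22.53 pts). On self-knowledge benchmarks it reaches a CBS of 73.43% and a CAE of 68.18%, with 63.37% correct answering decisions and 79.07% boundary recognition, the best balance among the methods tested.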
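
The ECE figures quoted above refer to expected calibration error. A minimal sketch of the standard binned estimator follows, assuming equal-width confidence bins; the post does not say which variant or bin count the paper uses, and its numbers are presumably reported on a 0 to 100 scale, so `n_bins` and the function name here are illustrative assumptions only.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: size-weighted mean of |accuracy - mean confidence| per bin.

    confidences: floats in [0, 1]; correct: booleans of the same length.
    n_bins is an assumption; the paper's setting is not given in the post.
    """
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("need two non-empty sequences of equal length")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Equal-width bins; a confidence of exactly 1.0 goes in the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece
```

An overconfident model (high confidence, many wrong answers) scores a large ECE; a model whose stated confidence matches its hit rate scores close to zero.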

Ant Ling@AntLingAGI · Jun 24 · 41

Great breakdown from Qian. In our recent UFP4 paper, we show that a uniform-grid FP4 recipe achieves lower BF16-relative loss degradation than strong E2M1 baselines across Dense 1.5B, MoE 7.9B, and MoE 124B long-run pretraining. Full paper: https://arxiv.org/abs/2606.20381

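Summary: Ant Ling has published the UFP4 paper, which proposes a uniform-grid FP4 training recipe. In long-run pretraining of Dense 1.5B, MoE 7.9B, and MoE 124B models, the recipe shows lower BF16-relative loss degradation than strong E2M1 baselines. The paper argues that once fine-grained scaling and RHT are applied, the bottleneck in FP4 training shifts from dynamic range to local resolution; E1M2/INT4 formats make better use of the bucket allocation that RHT improves, while with E2M1 RHT can become harmful. Paper: https://arxiv.org/abs/2606.20381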

Ant Ling@AntLingAGI · Jun 24 · 53

We recently released a paper showing that UFP4, our uniform-grid FP4 training recipe, stays closer to BF16 than strong E2M1 baselines across Dense 1.5B, MoE 7.9B, and MoE 124B long-run pretraining. The key insight: FP4 training quality is not only about bit width, but also grid geometry.

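Summary: Ant Ling recently released a paper showing that UFP4, its uniform-grid FP4 training recipe, stays closer to BF16 than strong E2M1 baselines in long-run pretraining of Dense 1.5B, MoE 7.9B, and MoE 124B models. The key insight: FP4 training quality depends not only on bit width but also on grid geometry.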
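
To make "grid geometry" concrete, here is a rough sketch comparing two 4-bit grids with the same number of levels. The E2M1 magnitudes are the standard FP4 values; the uniform grid is an INT4-style stand-in chosen for illustration, an assumption and not the UFP4 recipe, whose exact grid and scaling the posts do not describe. `quantize` and `largest_gap` are hypothetical helper names.

```python
# Standard E2M1 (FP4) magnitudes: dense near zero, sparse near the maximum.
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
E2M1_GRID = [-m for m in reversed(E2M1_MAGNITUDES) if m > 0] + E2M1_MAGNITUDES

# Assumption: an INT4-style symmetric uniform grid as a stand-in for "uniform-grid".
UNIFORM_GRID = [float(k) for k in range(-7, 8)]


def quantize(x, grid, absmax):
    """Round x to the nearest grid level after scaling the grid so its top level equals absmax."""
    scale = absmax / max(grid)
    nearest = min(grid, key=lambda level: abs(x - level * scale))
    return nearest * scale


def largest_gap(grid):
    """Widest spacing between adjacent levels, as a fraction of the top level."""
    return max(b - a for a, b in zip(grid, grid[1:])) / max(grid)


if __name__ == "__main__":
    for name, grid in (("E2M1", E2M1_GRID), ("uniform", UNIFORM_GRID)):
        print(name, "levels:", len(grid), "largest gap:", round(largest_gap(grid), 3),
              "q(4.9):", round(quantize(4.9, grid, absmax=6.0), 3))
```

Both grids have 15 levels, but E2M1 leaves a gap of one third of the range near its maximum, while the uniform grid spreads resolution evenly. That difference in local resolution, not bit width, is what the sketch is meant to show.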

Rohan Paul@rohanpaul_ai · Jun 24 · 46

New Microsoft paper argues that transformers generalize better when they learn compact internal states, not just next tokens. The problem is that normal transformers can look back at every earlier token, so they do not have to squeeze the past into a clean summary. Token prediction alone can reward shortcuts that do not become coherent world models. That can work beautifully on familiar data and still fail when the model has to plan, detour, reason, or carry a hidden structure forward. NextLat fixes this by adding a training task where the model must predict its next hidden state, not just the next word. A hidden state is the model’s private summary of what it has seen, so predicting the next one pushes the model to learn how situations change over time. The authors tested this on map-like world modeling, math reasoning, graph planning, story prediction, and regular language modeling. The main result is that NextLat often learned more compact and useful internal states, solved planning tasks better, and sped up generation by up to 3.3x. Overall, it gives transformers some of the useful memory behavior of recurrent models without changing the transformer architecture or slowing normal inference. ---- Link – arxiv.org/abs/2511.05963 Title: "Next-Latent Prediction Transformers Learn Compact World Models"

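Summary: A new Microsoft paper, Next-Latent Prediction (NextLat), proposes a self-supervised method that adds a next-hidden-state prediction task on top of ordinary token prediction, pushing the transformer to learn a compact internal world model. It performs better on map-like world modeling, math reasoning, graph planning, and story prediction; generation is up to 3.3x faster via self-speculative decoding, without changing the transformer architecture or slowing normal inference.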
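
The idea of training on the next hidden state as well as the next token can be sketched as a combined objective. This is only an illustration under stated assumptions: the mean-squared-error latent term, the `latent_weight` knob, and all function names are placeholders, since the post does not give the paper's actual loss.

```python
import math


def cross_entropy(logits, target_index):
    """Negative log-probability of the target token under softmax(logits)."""
    peak = max(logits)
    log_normalizer = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_normalizer - logits[target_index]


def mean_squared_error(predicted, actual):
    """Average squared difference between two equal-length vectors."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)


def combined_loss(logits, target_index, predicted_next_hidden, actual_next_hidden,
                  latent_weight=1.0):
    """Next-token loss plus a weighted next-hidden-state loss.

    Assumption: MSE stands in for whatever latent loss the paper uses.
    """
    token_loss = cross_entropy(logits, target_index)
    latent_loss = mean_squared_error(predicted_next_hidden, actual_next_hidden)
    return token_loss + latent_weight * latent_loss
```

With `latent_weight=0` this reduces to ordinary next-token training; raising it penalizes a model whose current state does not let it anticipate its own next state.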

Rohan Paul@rohanpaul_ai · Jun 24 · 49

This paper argues that intelligence is the ability to make rare but valid futures more likely. A system is said to be “thermodynamically intelligent” when it uses information and control to make a rare but valid outcome much more likely. Most existing intelligence measures judge task success, but they do not explain what brains, LLMs, controllers, and physical information engines have in common. The paper’s answer is that an intelligent system models the world with itself inside it, then uses that model to choose actions that change what futures become likely. A future counts only if it is rare under normal passive behavior and still valid, so random strange outcomes do not get counted as intelligence. The authors turn this into a measure called rare-valid lift, which asks how much more often a system produces those unlikely but acceptable futures than a passive baseline would. They show that high lift is impossible unless the system can accurately spot the rare valid futures, and high spotting accuracy can nearly produce high lift when the system can act well. The main point is that intelligence becomes a physical probability-shifting process, not just a score on tests or a label for human-like behavior. ---- Link – arxiv.org/abs/2606.20231 Title: "Thermodynamic Measure of Intelligence"

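Summary: The paper introduces "thermodynamic intelligence", defining intelligence as the ability to use information and control to make rare but valid outcomes significantly more likely. Existing evaluations look only at task success, whereas the paper identifies what brains, LLMs, controllers, and other intelligent systems have in common: the system includes itself in its world model and uses that model to choose actions that change the probabilities of futures. A future counts only if it is rare under passive behavior and still valid. The authors propose the rare-valid lift measure, which captures how many times more often a system produces such futures than a passive baseline. High lift depends on whether the system can accurately identify rare valid futures. The core claim: intelligence is a physical probability-shifting process, not a test score or a label for human-like behavior.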
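
One way to read "rare-valid lift" is as a ratio of frequencies, sketched below. The ratio form, the `rare_threshold` cutoff, and the function name are assumptions made for illustration; the paper's formal definition may differ (for example, it could be stated in log form).

```python
from collections import Counter


def rare_valid_lift(agent_outcomes, baseline_outcomes, is_valid, rare_threshold=0.05):
    """Ratio of how often the agent vs. a passive baseline reaches rare, valid outcomes.

    An outcome counts only if it is valid AND its frequency under the passive
    baseline is at most rare_threshold (an assumed cutoff).
    """
    baseline_counts = Counter(baseline_outcomes)
    n_baseline = len(baseline_outcomes)

    def counts(outcome):
        is_rare = baseline_counts[outcome] / n_baseline <= rare_threshold
        return is_rare and is_valid(outcome)

    agent_rate = sum(1 for o in agent_outcomes if counts(o)) / len(agent_outcomes)
    baseline_rate = sum(1 for o in baseline_outcomes if counts(o)) / n_baseline
    if baseline_rate == 0:
        raise ValueError("baseline never reaches a rare valid outcome; lift is undefined")
    return agent_rate / baseline_rate


if __name__ == "__main__":
    # Toy data: "goal" is rare and valid, "crash" is rare but invalid, "idle" is common.
    baseline = ["idle"] * 98 + ["goal"] + ["crash"]
    agent = ["goal"] * 40 + ["idle"] * 50 + ["crash"] * 10
    print(rare_valid_lift(agent, baseline, is_valid=lambda o: o != "crash"))
```

In the toy data the agent's crashes are rare under the baseline but invalid, so they add nothing to the lift, which mirrors the point that random strange outcomes do not count as intelligence.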

AK@_akhaliq · Jun 24 · 40

Lift4D: Harmonizing Single-View 3D Estimation for 4D Reconstruction In-the-Wild

Summary: Lift4D, harmonizing single-view 3D estimation for in-the-wild 4D reconstruction.

AK@_akhaliq · Jun 24 · 43

Ling and Ring 2.6 Technical Report: Efficient and Instant Agentic Intelligence at Trillion-Parameter Scale

Summary: Ling and Ring 2.6 Technical Report, efficient and instant agentic intelligence at trillion-parameter scale.

AK@_akhaliq · Jun 24 · 32

World Action Models: A Survey

Summary: World Action Models, a survey.

AK@_akhaliq · Jun 24 · 35

PlanBench-XL: Evaluating Long-Horizon Planning of LLM Tool-Use Agents in Large-Scale Tool Ecosystems

Summary: PlanBench-XL, evaluating the long-horizon planning ability of LLM tool-use agents in large-scale tool ecosystems.

AK@_akhaliq · Jun 22 · 32

PerceptionDLM: Parallel Region Perception with Multimodal Diffusion Language Models

Summary: PerceptionDLM, parallel region perception with multimodal diffusion language models.

elvis@omarsar0 · Jun 22 · 53

Great report on LLM agent communication protocols. Communication is a huge bottleneck in multi-agent systems. (worth bookmarking) The report builds a five-dimensional taxonomy (counterparty, payload, interaction state, discovery mechanism, schema flexibility) across nine actively maintained open-source agent protocols, so it maps the real MCP and A2A landscape. Two patterns stand out. Every agent-to-agent protocol sampled pairs hybrid payloads with session-state persistence, and decentralized discovery is still rare. So the field is quietly standardizing on stateful sessions while leaving discovery and policy enforcement open. Why does it matter? If you are choosing a communication layer this year, this discusses what nine real protocols actually do. Paper: https://arxiv.org/abs/2606.19135 Learn to build effective AI agents in our academy: https://academy.dair.ai/

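Summary: The report addresses the communication bottleneck in LLM multi-agent systems. It builds a five-dimensional taxonomy (counterparty, payload, interaction state, discovery mechanism, schema flexibility) and systematically reviews nine actively maintained open-source agent protocols, covering the real MCP and A2A landscape. Two patterns stand out: every agent-to-agent protocol combines hybrid payloads with session-state persistence, while decentralized discovery remains very rare. The field is quietly standardizing on stateful sessions but leaves discovery and policy enforcement open. For anyone choosing a communication layer this year, the report offers a grounded comparison of what nine protocols actually do.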

Nathan Lambert@natolambert · Jun 22 · 67

TMax: An open RL recipe for terminal agents I’m very excited to get to share a new RL paper today that I got to have a small part in – a type of paper I suspect we’ll see much more of in the future. The key is that RL research is very different today, in mid-2026, than what most observers have in their context. The average conception of an RL paper is grounded in the RLVR revolution of early 2025, where many people could use vanilla RLVR libraries to hillclimb on math benchmarks. Crucially, this style of math work could be done on base models or fairly stably on already trained models. With agents, the tasks of focus are very hard, requiring complex tool-use, harnesses where the model automatically manages its history, and much more training to make smaller eval improvements. We’re shifting from a renaissance of RL study to rapidly needing to improve its empirical rigor and common community engagements. TMax is the best open data for hillclimbing on frontier terminal tasks. It’s been validated with rigorous experiments, and if the authors wanted to just form a “RL environments startup” they could probably sell it for millions of dollars. This data work is some of my favorite stuff to be around in my 2.5+ years at Ai2. As a general summary, the recipe is open data and recipe lessons from hillclimbing the Qwen 3.5 smaller, dense models on terminal tasks. These models are super hard to hillclimb in this area, as they’re already trained heavily on the task. The training is very infrastructure-dependent, and most of the RL innovations are more designed to make training stable than to improve the rate of learning. I strongly recommend this paper. I joke around that I was happy to be an author just so I had to read it twice! You can find Hamish’s thread sharing more here or read the paper here. 
You can click through to find the model weights, the data, and even some fun further artifacts to study like all the RL rollouts from a training run – where the model sometimes became aware that it was being tested. The biggest takeaway I have from following this work, and more of the work in the community, is how important recipe work is. Let me define “recipe work.” It is a style of paper that explains all the steps you need to make crucial model improvements – data, algorithm, codebase, pitfalls, etc. Getting started in meaningful RL experiments today is a substantial expense. There are a ton of companies, an entire industry emerging really, around the idea of taking open-weight language models and finetuning them with RL on your domain-specific tasks. What I see in many projects is that getting an initial baseline is very hard. This phase, which can cost weeks and anywhere from $10K to $1M+, feels like spinning your wheels (A fun fact is that an RL step on a model like Nvidia Nemotron 3 Ultra on Tinker costs $1K and a meaningful RL run would be hundreds of steps – credit Edward Hu). It takes a lot of time to get traction in learning signal on meaningful, hard RL tasks. What we need as a community is a way for people to study small ablations to established RL recipes, as most labs won’t have the resources to do it from scratch in a meaningful way. This is what I hope TMAX can be for terminal agents, or the start of. Yes the training jobs are expensive, as the paper documents a standard training job being 8 nodes of H100s (2 train 6 inference) for 2-3 days, but that is approaching something academics can study. The establishment of this recipe took O(100) of these training jobs to get right. This isn’t my first time trying to establish this direction. When we launched Olmo 3 we had the “RL Zero“ model families, which are clean RL runs from a base model on a certain domain. 
This type of recipe-dependent work is a clear indicator that meaningful post-training work today looks much more like pretraining work of years past. We need decision-making ladders, clear ways of seeing small improvements in the models, stability, and so on. Part of this is down to academic gatekeepers, who won’t reward a paper doing very clean empirical work to push a recipe 1-2% up. They’ll favor a “new algorithm” that matches results, or something sort of bogus. My hope is that we can have multiple, stable, clear recipes across agent types, so innovations can be tested more clearly in multiple domains. (If you’re working on this, please reach out – I’m happy to support if I can, but I likely can’t reply to every email). As a quick aside, the RL frameworks in vogue today seem to be SLIME and SkyRL. The libraries of choice have shifted throughout these seasons in RL, which further contributes to a form of fragility in the literature. A bit of continuity will go a long way. So, go read this paper. It’s a really great example of how seemingly simple data and infrastructure work can be very hard and impactful. It’s also got me looking for more applications of Divergence Proximal Policy Optimization (DPPO) as another small evolution to the best RL algorithms of the day, by virtue of being a bit more stable by improving token-level clipping.

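Summary: TMax is an open RL recipe for terminal tasks, built on the smaller dense Qwen 3.5 models, that surpasses prior open work under default settings and a 65k-token budget. Training takes 8 nodes of H100s (2 train + 6 inference) for 2-3 days, and the recipe took about 100 training runs to stabilize. The release includes model weights, data, and training rollouts. The recipe work underlines how expensive it is to get an initial baseline from scratch ($10K to $1M+), and the need for clear decision-making ladders and stability improvements.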

Rohan Paul@rohanpaul_ai · Jun 22 · 50

Can LLM agents actually discover hidden rules by interacting? The answer is uncomfortable. The more complicated the hidden world gets, the faster AI agents fall behind. LLMs often cannot turn growing evidence into a stable internal model. Current LLM agents can sometimes discover hidden structure through interaction, but they are still weak at planning questions, using memory, and turning feedback into a reliable world model. ---- Link – arxiv.org/abs/2606.16576 Title: "Can LLM Agents Infer World Models? Evidence from Agentic Automata Learning"

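Summary: Citing a new paper, Rohan Paul notes that although LLM agents can sometimes discover hidden structure through interaction, their ability to infer world models has fundamental limits: as the hidden world grows more complex, agents fall behind quickly and struggle to turn accumulated feedback into a stable internal model, with particular weakness in planning questions, using memory, and integrating feedback. The conclusion is that in complex environments, LLM agents cannot build a reliable mental model fast enough to keep up with rising difficulty.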

Rohan Paul@rohanpaul_ai · Jun 22 · 65

Pew Research published its latest "Americans and AI 2026" report. Only 16% of Americans now expect AI to help society over the next 20 years, and 40% expect AI to hurt society over the same period. 24% of Americans use chatbots daily, including 12% several times a day and 4% almost constantly. 51% of U.S. adults still do not use AI chatbots at all. 42% use chatbots to search for information, making search the top use case. 38% of employed adults use chatbots for work tasks. 10% use chatbots for emotional support or advice, while 4% use them for companionship. ChatGPT dominates chatbot adoption, with 44% of U.S. adults reporting use. Gemini follows at 24%, then Copilot at 17%, Meta AI at 14%, Grok at 8%, Claude at 6%, and Character[.]ai at 3%. Adults under 50 are about twice as likely as older adults to use ChatGPT, at 57% versus 28%. 30% say chatbots help their productivity, while only 5% say they hurt it. 28% say chatbots help them stay informed, while only 5% say they hurt that. 60% of U.S. adults read AI search summaries, meaning AI is now shaping information intake even for people who may not actively use chatbots.

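Summary: Pew Research Center's latest report finds that only 16% of U.S. adults expect AI to help society over the next 20 years, while 40% expect it to hurt. 24% use chatbots daily and 51% do not use them at all. The top use is searching for information (42%); 38% of employed adults use chatbots for work, 10% for emotional support, and 4% for companionship. ChatGPT has the highest usage (44%), followed by Gemini (24%), Copilot (17%), Meta AI (14%), Grok (8%), Claude (6%), and Character.ai (3%). 30% say chatbots improve their productivity and 28% say they help them stay informed. 60% of adults read AI search summaries, showing that AI is shaping information intake.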

elvis@omarsar0 · Jun 22 · 47

>> Scalable Evaluation for AI Agents << If you run agent evaluation in production, this one is worth your time. It shows that front-loading human judgment into reusable evaluation assets is useful. But why? Agents reason across turns, call tools, hold context, follow policies, and act under uncertainty, so they have to be judged as behavioral systems. Current methods each give a fragment. Benchmarks measure fixed capabilities, human review preserves judgment but does not scale, LLM-as-judge inherits the evaluator design problem, red teaming is episodic, and trace audits need explicit evidence rules. Human-on-the-Bridge puts human expertise upstream, where experts curate reusable evaluation intelligence before testing rather than reviewing each output in the loop. Paper: https://arxiv.org/abs/2606.16871 Learn to build effective AI agents in our academy: https://academy.dair.ai/

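Summary: The paper "Scalable Evaluation for AI Agents" proposes the Human-on-the-Bridge evaluation approach: human judgment is front-loaded into reusable evaluation assets, with experts curating evaluation intelligence upstream instead of reviewing each output in the testing loop. Existing methods each have limits: benchmarks measure fixed capabilities, human review does not scale, LLM-as-judge inherits the evaluator design problem, red teaming is episodic, and trace audits need explicit evidence rules. AI agents have to be evaluated as behavioral systems because they reason across turns, call tools, hold context, follow policies, and act under uncertainty.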

AK@_akhaliq · Jun 20 · 44

S-Agent: Spatial Tool-Use Elicits Reasoning for Spatial Intelligence

Summary: S-Agent, spatial tool use elicits reasoning for spatial intelligence.

Rohan Paul@rohanpaul_ai · Jun 20 · 47

New Microsoft + York Univ paper argues that LLMs should not be treated as human-like without clear tests and narrower claims. Many studies ask whether LLMs have things like understanding, empathy, anxiety, or self-awareness, but they often build those ideas into the test from the start. The authors show that, in principle, the old strategy game Age of Empires II can implement logic gates, train a tiny perceptron, and serve as a substrate for computation. If the same language model could be rebuilt inside the game, with goats moving around as bits, would we still say it “understands,” “feels anxiety,” or “has empathy” when it produces the same sentence? The point is not that the game is secretly intelligent, but that the same computation can be represented in a very different form. If an LLM-like system were rebuilt inside that game, its answers might stay similar, but people would probably find its “feelings” or “understanding” much less convincing. The authors argue that this shows a big measurement problem: many human-like claims about LLMs may depend on the interface and the observer, not only on the system itself. The paper is not saying LLMs definitely lack human-like attributes, or that all talk of AI cognition is nonsense. It is saying that many experiments smuggle the conclusion into the setup: they assume the model has, or cannot have, a human-like property, then interpret behavior through that assumption. ---- Link – arxiv.org/abs/2605.31514 Title: "If LLMs Have Human-Like Attributes, Then So Does Age of Empires II"

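Summary: A new paper from Microsoft and York University points out that many studies attribute human attributes such as understanding, empathy, and anxiety to LLMs without rigorous tests, often building those concepts into the test design from the start. The authors argue that, in principle, the old strategy game Age of Empires II can also implement logic gates, train a small perceptron, and serve as a substrate for computation. If the same language model were rebuilt inside the game, with goats moving around as bits, and produced similar sentences, people would no longer say it "understands" or "has empathy". The paper does not deny AI cognition; it exposes a measurement problem: many claims about human-like attributes in LLMs depend on the interface and the observer's assumptions, not on the system itself.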

elvis@omarsar0 · Jun 19 · 51

// Automating SKILL.md Generation // Increasingly, mining sessions is one of the best ways to improve your agents. OpenAI released something similar yesterday that lets Codex package skills from interactions. (bookmark it) This paper explains a related approach. They run a three-stage pipeline that segments GUI trajectories, clusters them into candidate skills, and trains a skill-aware policy. The clusters are genuinely readable, with five of eight hitting 0.95 or higher purity against ground-truth workflow labels. But readability does not transfer. GRPO lifts skill-step accuracy only from 18.5% to 20.5%, leaves BrowseComp+ flat, and loses to trivial frequency priors. The authors name the three culprits: a weak boundary detector, an orderless segment representation, and an offline reward model. Paper: https://arxiv.org/abs/2606.20363 Learn to build effective AI agents in our academy: https://academy.dair.ai/

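Summary: Key points: OpenAI yesterday released a similar feature that lets Codex package skills from interactions. The paper proposes a three-stage pipeline (segment GUI trajectories, cluster them into candidate skills, train a skill-aware policy). Cluster purity is excellent (5 of 8 clusters at 0.95 or higher), but readability does not transfer: GRPO lifts skill-step accuracy only from 18.5% to 20.5%, brings no improvement on BrowseComp+, and even loses to simple frequency priors. The authors identify three weaknesses: a weak boundary detector, an orderless segment representation, and an offline reward model.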
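
The purity figure can be made concrete with a short sketch. It assumes the usual definition, the share of a cluster's members that carry its most common ground-truth label; the post does not spell out the paper's exact formula, and the cluster ids and workflow labels below are made up for illustration.

```python
from collections import Counter


def cluster_purity(labels):
    """Share of a cluster's members that carry its most common ground-truth label."""
    if not labels:
        raise ValueError("cluster is empty")
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)


def purity_report(clusters):
    """Map each cluster id to its purity. clusters: dict of id -> list of labels."""
    return {cluster_id: cluster_purity(labels) for cluster_id, labels in clusters.items()}


if __name__ == "__main__":
    # Hypothetical segments: each entry is the ground-truth workflow label of one segment.
    clusters = {
        "c0": ["login"] * 19 + ["search"],        # 19 of 20 agree
        "c1": ["checkout"] * 6 + ["search"] * 4,  # mixed cluster
    }
    report = purity_report(clusters)
    print(report)
    print(sum(1 for p in report.values() if p >= 0.95), "cluster(s) at 0.95 or higher")
```

High purity only says the clusters line up with human workflow labels; as the post notes, that readability did not translate into downstream policy gains.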

Rohan Paul@rohanpaul_ai · Jun 19 · 44

This paper shows that a good generalist agent must remember hidden environment rules, not just observe the current state. That sounds obvious until you notice the trap this paper isolates: two worlds can show the agent the same state, offer the same goal, and still require opposite actions. At that moment, observation is no longer enough. The important object is not “memory” as a vague engineering feature, but memory as the place where hidden context must be carried when the environment refuses to label itself. The paper’s core idea is that memory is not optional in this setting, because a near-perfect agent must store enough past experience to tell which hidden environment it is currently in. The authors prove that when 2 hidden domains require incompatible actions at the same visible state, any agent that performs well across both domains must have different internal memory states for those domains. The big point is that good generalist agents do not just react to what they see now, because they must carry hidden context from earlier experience when the world can change underneath the same observation. ---- Link – arxiv.org/abs/2606.18746 Title: "What Must Generalist Agents Remember?"

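Summary: The paper argues that generalist agents cannot rely on the current observation alone and must remember hidden environment rules. When two hidden domains require opposite actions at the same visible state, observation alone cannot tell the agent which situation it is in. The authors prove that an agent performing well in both domains must maintain different internal memory states for them. The core conclusion: a good generalist agent does not merely react to what it sees now; it must carry hidden context from earlier experience.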
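
A toy version of the trap makes the argument easy to check by hand. The two domains, the "cue" and "fork" observations, and the policies below are invented for illustration and are not the paper's formal setup: both domains show the same "fork" state but require opposite actions, and only an earlier cue reveals which domain is active.

```python
# Hypothetical toy setup: same visible state "fork", opposite required actions.
REQUIRED_ACTION = {"A": "left", "B": "right"}
ACTIONS = ("left", "right")


def episode(domain):
    """Observations in one episode: a domain-revealing cue, then the shared state."""
    return [f"cue-{domain}", "fork"]


def success_rate(policy):
    """Fraction of domains where the action taken at 'fork' is the required one."""
    wins = 0
    for domain, required in REQUIRED_ACTION.items():
        memory, action = None, None
        for observation in episode(domain):
            action, memory = policy(observation, memory)
        wins += action == required  # the last action is the one taken at 'fork'
    return wins / len(REQUIRED_ACTION)


def make_memoryless(fork_action):
    """A policy that never stores anything, so it acts identically at 'fork' in both domains."""
    def policy(observation, memory):
        return fork_action, None
    return policy


def memory_policy(observation, memory):
    """Stores the domain from the cue and uses it later at 'fork'."""
    if observation.startswith("cue-"):
        return "left", observation[len("cue-"):]  # action at the cue does not matter
    return REQUIRED_ACTION[memory], memory


if __name__ == "__main__":
    best_memoryless = max(success_rate(make_memoryless(a)) for a in ACTIONS)
    print("best memoryless:", best_memoryless)          # succeeds in one domain only
    print("with memory:", success_rate(memory_policy))  # succeeds in both
```

No fixed choice at "fork" can be right in both domains, so any agent that succeeds in both has to arrive at "fork" in different internal states, which is the shape of the claim in the post.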

Rohan Paul@rohanpaul_ai · Jun 19 · 65

New research from OpenAI reported a training result where RL on realistic human situations made models carry safer, more useful behavior into tasks they had not trained on. The key point is cross-domain transfer, where health-only training improved non-health behaviors like blackmail resistance, code reward hacking, and deception tests. This suggests the model may be learning a broader stance: verify before asserting, concede when corrected, resist flattering the user, and avoid shortcuts that look useful but corrupt the task. OpenAI also removed health and science data from training, yet the model still improved on health evaluations, which suggests these traits may be learned as general behavioral habits rather than narrow topic rules. The trained model was harder to steer toward harmful behavior while remaining responsive to helpful instructions, which is the asymmetry safety research has been looking for.

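Summary: New OpenAI research shows that reinforcement learning (RL) on realistic human situations lets models carry safe, useful behavior into tasks they were not trained on. The key finding is cross-domain transfer: training on health data alone also improved non-health behaviors such as blackmail resistance, code reward hacking, and deception tests. The model may be learning general behavioral habits: verify before asserting, concede when corrected, do not flatter the user, and avoid shortcuts that look useful but corrupt the task. Even with health and science content removed from the training data, the model still did better on health evaluations. The trained model was harder to steer toward harmful behavior while staying responsive to helpful instructions, the asymmetry safety research has been looking for. OpenAI says it wants models, as they take on longer and higher-stakes tasks, to carry beneficial and safe behavior into new domains and maintain it under pressure.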

Ethan Mollick@emollick · Jun 19 · 67

I have given AA a hard time about its previous agentic evaluation, but this looks like a good and impressive benchmark for real-world knowledge work that is unsaturated and has private held-out tests. This is one to watch, though I didn’t see a human comparison score?

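Summary: Ethan Mollick praises AA-Briefcase as a strong benchmark for real-world knowledge work that is unsaturated and has private held-out tests, and asks whether there is a human comparison. The benchmark, released by @ArtificialAnlys, tests models on multi-week, multi-task projects whose inputs include tens of thousands of Slack messages and thousands of emails. Model ranking: Claude Fable 5 (no longer available) leads with 1587 Elo, Claude Opus 4.8 (1356) is second, and GLM-5.2 max (1266) is third. The results underline the difficulty: the best model meets all criteria on only 3% of tasks, no model exceeds 50% on 31 of 91 tasks, and cost spans roughly 800x.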

OpenAI@OpenAI · Jun 19 · 62

As AI takes on longer, higher-stakes tasks, we want models to carry beneficial and safe behavior into new domains beyond their training—and maintain it under pressure. That’s the idea behind our new research on training models to be broadly and persistently beneficial. https://alignment.openai.com/beneficial-rl/

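Summary: As AI takes on longer, higher-stakes tasks, we want models to carry beneficial and safe behavior into new domains beyond their training, and to maintain it under pressure. That is the idea behind our new research on training models to be broadly and persistently beneficial. https://alignment.openai.com/beneficial-rl/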

Jeff Dean@JeffDean · Jun 19 · 49

My @Google colleagues @NormJouppi, Sridhar Lakshmanamurthy, Cliff Young, and David Patterson recently wrote a paper that will appear in the July/August 2026 edition of @ieeemicro titled "Google's Training Supercomputers from TPU v2 to Ironwood: Architectural Stability, Scale, Resilience, Power Efficiency, and Sustainability Across Five Generations". It's chock full of interesting data about the evolution of TPU chip generations, as well as how workloads at Google have transformed over time (hint: lots more transformer-based models!), and how the generations have gotten ~30X more energy efficient per flop. Lots of changes over these generations: air cooling in TPUv2 to water cooling from TPUv3 onwards; 2D to 3D torus-based interconnects; a 30X improvement in TFLOPS/Watt; and 256 chips (TPUv2) to 9216 chips (Ironwood) per pod. Read the full paper: https://arxiv.org/abs/2606.15870

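Summary: Jeff Dean's Google colleagues have written a paper reviewing five generations of training supercomputers from TPU v2 to Ironwood, to appear in the July/August 2026 issue of IEEE Micro. Key changes: TPU v2 was air-cooled and v3 onward is water-cooled; interconnects moved from 2D to 3D torus; chips per pod grew from 256 to 9216; energy efficiency per flop improved about 30x. Google's internal workloads have also shifted heavily toward transformer-based models.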

Rohan Paul@rohanpaul_ai · Jun 19 · 68

Anthropic just showed Claude Opus 4.7 program a robodog in 12 minutes 7 seconds, about 20x faster than last year’s Claude-aided human team on the tested tasks. Project Fetch asks whether an LLM can connect real robot hardware, read camera/lidar feeds, write movement code, track location, and detect a ball. Opus 4.7 did 5 tasks alone in that time versus Team Claude’s 264 minutes, while writing 1,045 lines of code instead of 10,309. The gain came from choosing the right interfaces quickly and writing scripts that worked without long human trial-and-error. It still couldn’t fetch the ball. The failure came from closed-loop control, where the robot must see a drifting ball and adjust movement after each shove. AI is getting very good at turning messy hardware into working code, but real-time physical judgment is still hard.

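Summary: In Phase 2 of Project Fetch, Anthropic shows Claude Opus 4.7 programming a robodog on its own. Opus 4.7 completed 5 tasks in 12 minutes 7 seconds, about 20x faster than the 264 minutes taken by last year's human team (aided by Opus 4.1), and the code shrank from 10,309 lines to 1,045. The speedup came from quickly choosing the right interfaces and writing scripts that worked without human trial-and-error. The robodog still could not fetch the ball; the failure lay in closed-loop control, where the robot must adjust its movement in real time to a drifting ball. AI is good at turning messy hardware into working code, but real-time physical judgment remains hard.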

Anthropic@AnthropicAI · Jun 19 · 68

New Frontier Red Team blog: Phase 2 of Project Fetch, where we test how well Claude can program a robodog. Opus 4.7, on its own, was ~20x faster than last year's best human team aided by Opus 4.1. (The robodog, alas, still failed to fetch a beach ball.) https://www.anthropic.com/research/project-fetch-phase-two

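Summary: New Frontier Red Team blog: Phase 2 of Project Fetch, where we test how well Claude can program a robodog. Opus 4.7, on its own, completed the tasks about 20x faster than last year's best human team (aided by Opus 4.1). (Alas, the robodog still failed to fetch the beach ball.) https://www.anthropic.com/research/project-fetch-phase-two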

Noam Brown@polynoamial · Jun 19 · 35

When we announced @OpenAI o1 some researchers from other labs told me we made a strategic mistake and should have kept it secret so we could accelerate ourselves and pull farther ahead of the competition. Studies like these make me confident we made the right choice.

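Summary: Noam Brown writes that after OpenAI announced o1, researchers from other labs told him it was a strategic mistake and should have been kept secret to widen the lead. The study he cites makes him confident that going public was right: OpenAI, working with Boston Children's Hospital and Harvard, published a study in NEJM AI showing o3 Deep Research helping clinicians revisit unsolved rare pediatric disease cases and find answers for families who had waited years.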

Greg Brockman@gdb · Jun 19 · 51

OpenAI for helping find 18 new diagnoses across 376 previously unsolved medical cases. Includes diagnosing Kyra, who has been trying to understand her muscle weakness since age 9, with a rare form of myofibrillar myopathy shortly before her 28th birthday.

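Summary: OpenAI, working with Boston Children's Hospital and Harvard University, published a study in NEJM AI that used o3 Deep Research to revisit 376 previously unsolved rare pediatric disease cases and helped find 18 new diagnoses. One of them is Kyra, whose muscle weakness began at age 9 and who was diagnosed with a rare form of myofibrillar myopathy shortly before her 28th birthday, giving answers to families who had waited years.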

elvis@omarsar0 · Jun 18 · 64

Recommended reading. Great insights, especially in areas where general-purpose models continue to fail, like dealing with complex structures. It also highlights that for scientific research, specialized models are winning big time.

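Summary: OpenAI has introduced LifeSciBench, a benchmark that measures how well AI supports real-world life sciences research. It was developed with 173 biotech and pharma scientists and contains 750 expert-written tasks covering seven biological research workflows. Elvis Saravia of DAIR.AI recommends reading it and notes that general-purpose models still fail on complex structures, while specialized models for scientific research perform markedly better.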

OpenAI@OpenAI · Jun 18 · 46

Together with researchers at Boston Children’s Hospital and Harvard, we published a study in NEJM AI showing how o3 Deep Research helped clinicians revisit previously unsolved rare pediatric disease cases, and find answers for families who had waited years.

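Summary: Together with researchers at Boston Children's Hospital and Harvard, we published a study in NEJM AI showing how o3 Deep Research helped clinicians revisit previously unsolved rare pediatric disease cases and find answers for families who had waited years.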

elvis@omarsar0 · Jun 18 · 40

Cool paper on Skill routing for LLM agents. Real tasks rarely map to a single skill. They need several composed together, but most skill routing still treats the problem as picking one tool from a library. This work formalizes Compositional Skill Routing, decomposes a complex query into atomic sub-tasks, retrieves the right skill for each, and then composes an executable plan. The system, SkillWeaver, pairs an LLM decomposer with a bi-encoder FAISS retriever and a dependency-aware DAG planner. It comes with CompSkillBench, 300 compositional queries over 2,209 real skills, so the multi-skill case gets measured directly. Why does it matter? As skill libraries grow, single-skill retrieval quietly caps what an agent can do. The DAG planner turns retrieved skills into an ordered, dependency-respecting plan. Paper: https://arxiv.org/abs/2606.18051 Learn to build effective AI agents in our academy: https://academy.dair.ai/

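Summary: Conventional skill routing for LLM agents picks a single skill from a tool library and struggles with real tasks that need several skills combined. This paper formalizes Compositional Skill Routing: decompose a complex query into atomic sub-tasks, retrieve the matching skill for each, and compose an executable plan. The system, SkillWeaver, consists of an LLM decomposer, a bi-encoder FAISS retriever, and a dependency-aware DAG planner. It ships with the CompSkillBench benchmark, 300 compositional queries over 2,209 real skills, which evaluates multi-skill routing directly. The DAG planner turns retrieved skills into an ordered, dependency-respecting plan.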
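
The last step, turning retrieved skills into a dependency-respecting order, can be sketched with Python's standard-library topological sorter. This is not SkillWeaver's planner: the sub-task names, the skill names, and `build_plan` are hypothetical, and the decomposition and retrieval stages are replaced by hard-coded dictionaries.

```python
from graphlib import TopologicalSorter

# Hypothetical decomposition of one query: sub-task -> set of sub-tasks it depends on.
DEPENDENCIES = {
    "fetch_sales_data": set(),
    "clean_data": {"fetch_sales_data"},
    "plot_trend": {"clean_data"},
    "summarize_stats": {"clean_data"},
    "write_report": {"plot_trend", "summarize_stats"},
}

# Hypothetical retrieval result: the skill chosen for each sub-task.
RETRIEVED_SKILL = {
    "fetch_sales_data": "sql-query",
    "clean_data": "dataframe-cleaner",
    "plot_trend": "chart-maker",
    "summarize_stats": "stats-summary",
    "write_report": "doc-writer",
}


def build_plan(dependencies, retrieved_skill):
    """Return (sub-task, skill) pairs in an order where prerequisites always come first.

    Raises graphlib.CycleError if the dependencies contain a cycle.
    """
    order = TopologicalSorter(dependencies).static_order()
    return [(task, retrieved_skill[task]) for task in order]


if __name__ == "__main__":
    for step, (task, skill) in enumerate(build_plan(DEPENDENCIES, RETRIEVED_SKILL), start=1):
        print(step, task, "->", skill)
```

Single-skill retrieval would stop at one dictionary lookup; the ordering step is what lets several retrieved skills run as one plan.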

Google DeepMind@GoogleDeepMind · Jun 18 · 43

Instead of assuming AI will always do what we intend, we ask: what if it doesn't? That’s why we’ve developed our AI Control Roadmap: a framework for building and managing the advanced AI we deploy within Google. 🧵

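Summary: Instead of assuming AI will always do what we intend, we ask: what if it doesn't? That is why we developed our AI Control Roadmap, a framework for building and managing the advanced AI we deploy within Google. 🧵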

Ant Ling@AntLingAGI · Jun 18 · 50

It has been a privilege to collaborate so closely with the SGLang team @lmsysorg on optimizing Ling-2.6-1T. 🥳 The resulting performance gains speak for themselves: a 53% reduction in MoE pre-fill latency, and up to 1.77x higher decode throughput on a 16-chip TPU v7x slice compared to a similar H200 cluster. A significant milestone in efficient MoE scaling and hardware utilization!

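Summary: Ant Ling worked with the SGLang team to deploy Ling-2.6-1T, a 1T-parameter hybrid MoE model, on TPU v7x via SGLang-JAX. Optimizations include: an upgraded Fused MoE V2 kernel (tokens and accumulators resident in VMEM, double-buffered expert weights, with routing and prefetch hidden); a hybrid memory pool (per-token MLA KV for the 10 full-attention layers plus per-request recurrent state for the 70 GLA layers); chunk-parallel prefill for GLA linear attention; and single-controller DP that keeps grouped RMSNorm chip-local. Results: MoE prefill latency drops by 53%, and decode throughput on a 16-chip TPU v7x slice is up to 1.77x higher than on a comparable H200 cluster.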

Rohan Paul @rohanpaul_ai · Jun 18 · 67

Big claim in this paper: it pushes against the common idea that more test-time compute should keep helping. It claims a code model gets much better when it rethinks once (i.e. by looping once) inside itself, but worse when it keeps rethinking. The first loop builds context, the second loop refines it, and later loops mostly disturb it. The paper studies a faster design called Parallel Loop Transformer, where loops can run almost in parallel and share memory, so the authors can ask a cleaner question about how many loops are actually useful. They trained 7B code models with 1, 2, 3, and 4 loops on 18T tokens, then tuned and tested them on code writing, code reasoning, software engineering, and tool-use tasks. The main result is that 2 loops worked best, raising SWE-bench Verified from 43.0 to 64.4, while 3 and 4 loops often got worse. Their internal checks suggest loop 2 does the real useful refinement, because it changes the model’s hidden states, attention patterns, and predictions in meaningful ways. After loop 2, the extra loops mostly add weaker, more repetitive changes, while a built-in position shift keeps adding the same kind of mismatch cost. Overall, the paper gives a simple lesson for efficient test-time compute: adding 1 hidden loop can help a lot, but adding more is not automatically better. ---- Link – arxiv.org/abs/2606.18023 Title: "LoopCoder-v2: Only Loop Once for Efficient Test-Time Computation Scaling"

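Translation: The LoopCoder-v2 paper challenges the view that more test-time compute is always better. The authors propose the Parallel Loop Transformer architecture, in which loops can run in parallel and share memory. They trained 7B-parameter code models with 1, 2, 3, and 4 loops, pre-trained on 18T tokens and then fine-tuned, and tested them on code writing, reasoning, software engineering, and tool-use tasks. Main result: 2 loops worked best, raising SWE-bench Verified from 43.0 to 64.4, while performance dropped with 3 and 4 loops. Internal analysis shows that the second loop performs meaningful refinement (changing hidden states, attention patterns, and predictions), while later loops mostly add repetition and noise. Conclusion: adding one hidden loop can improve performance substantially, but adding more is not automatically beneficial.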

Epoch AI @EpochAIResearch · Jun 18 · 41

How close is AI to automating AI R&D? Right now, the tools economists use to track automation are too blunt to say. In this week's newsletter, @datagenproc, @joemkwon, and @ansonwhho propose a sharper tool: a thorough taxonomy of 60+ tasks involved in frontier AI research. 🧵

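Translation: How far is AI from automating AI R&D? The tools economists currently use to track automation are too coarse to say. In this week's newsletter, @datagenproc, @joemkwon, and @ansonwhho propose a finer-grained tool: a detailed taxonomy of more than 60 tasks involved in frontier AI research. 🧵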

AK @_akhaliq · Jun 18 · 34

LoopCoder-v2: Only Loop Once for Efficient Test-Time Computation Scaling

Translation: LoopCoder-v2: looping only once achieves efficient test-time computation scaling

OpenAI @OpenAI · Jun 18 · 68

Introducing LifeSciBench, a benchmark for measuring and improving how well AI supports real-world life science research. Developed with 173 scientists from biotechnology and pharmaceutical research, LifeSciBench includes 750 expert-authored tasks across seven biological research workflows. https://openai.com/index/introducing-life-sci-bench/

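Translation: Introducing LifeSciBench, a benchmark for measuring and improving how well AI supports real-world life science research. It was developed with 173 scientists from biotechnology and pharmaceutical research and contains 750 expert-authored tasks covering seven biological research workflows.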

Jim Fan @DrJimFan · Jun 18 · 81

I made Physical AutoResearch sound simple (conceptually), but it took a village to pull off, and lots of design thinking went into the robot /loopcraft. The hardest part is everything we need to set up *before* pressing Enter. Here's a behind-the-scenes tour:
1. Safety harness
Letting 8 robots run unattended overnight means safety has to be more than a hint in the system prompt. ENPIRE hardwires it in 2 layers: (1) a hard kinematic limit that trips an immediate task failure and auto-resets as soon as a robot leaves its safety envelope, and (2) a torque-limited compliant gripper so a bad contact or misaligned insertion ends in a safe stall, instead of crushing the robot or the object at hand. We make safety more conservative than usual so humans can sleep tight. In reality, we still need a few human operators to watch over the "robots of loving grace".
2. Definition of /done
An agent that can edit its own reward will game it for sure. ENPIRE fixes the goalposts before the fleet can move them. Here's the recipe: Collect a few minutes of success & failure demos -> Ask agent to write code using computer vision tools to classify success and measure against ground truth -> Agent hill-climbs on classifier until reliably good -> This classifier becomes the real-time reward function that directly computes on sensor streams -> *Freeze* the reward function before AutoResearch. It's sacred, enshrined in a Gym env that no one can touch.
3. System telemetry design
Robot-seconds is by far the scarcest resource, followed by GPU-seconds, and finally tokens. We instrument all three and surface them to ENPIRE for live resource awareness rather than letting it hill-climb in a vacuum. We define:
- Mean Robot Utilization ("MRU"): the fraction of wall-clock time when the robot is actively executing an experiment. Otherwise the hardware is sitting idle and waiting for the next code commit.
- Mean Token Utilization ("MTU"): tokens consumed per minute, our proxy for how hard the agent is actually thinking. A low MTU means the agent is stalled, waiting on a robot rollout to finish instead of doing research.
- GPU utilization: fraction of wall-clock time when GPU is active.
... and evaluate on two budget-to-outcome metrics:
1. Tokens-to-Success: token budget the fleet burns to complete /goal.
2. Time-to-Success: wall-clock time to /goal

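Translation: NVIDIA's GEAR Lab introduces the ENPIRE system, which achieves autonomous research in the physical world for the first time. The system has 8 Codex agents control 8 robots, with GPU and token budgets. For safety it uses two layers of hardware protection, a hard kinematic-limit cutoff and a torque-limited gripper, supporting unattended overnight runs. The reward function is fixed offline via a vision classifier and frozen to prevent the agents from cheating. Robot utilization (MRU), token utilization (MTU), and GPU utilization are monitored in real time, and efficiency is evaluated with Tokens-to-Success and Time-to-Success. ENPIRE autonomously completed high-precision tasks such as fastening zip ties, organizing fine pins, and installing a GPU, and found that parallel exploration with 8 robots is significantly faster. The system will be open-sourced.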
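The MRU and MTU definitions in the thread reduce to simple ratios over a run log. Here is a minimal sketch of how they could be computed, assuming a hypothetical log format of (start, end) execution intervals in seconds for one robot plus a total token count; none of these names come from ENPIRE.

```python
from dataclasses import dataclass

@dataclass
class Interval:
    """One stretch of time during which the robot was executing an experiment."""
    start_s: float
    end_s: float

def busy_seconds(intervals):
    """Total length of the union of intervals, so overlaps are not double-counted."""
    total = 0.0
    covered_until = float("-inf")
    for iv in sorted(intervals, key=lambda iv: iv.start_s):
        start = max(iv.start_s, covered_until)
        if iv.end_s > start:
            total += iv.end_s - start
            covered_until = iv.end_s
    return total

def mean_robot_utilization(intervals, wall_clock_s):
    """Fraction of wall-clock time the robot was actively executing."""
    return busy_seconds(intervals) / wall_clock_s

def mean_token_utilization(tokens_consumed, wall_clock_s):
    """Tokens consumed per minute of wall-clock time."""
    return tokens_consumed / (wall_clock_s / 60.0)

# Made-up numbers for a 10-minute window on a single robot.
wall_clock_s = 600.0
runs = [Interval(0, 120), Interval(100, 300), Interval(450, 510)]

mru = mean_robot_utilization(runs, wall_clock_s)
mtu = mean_token_utilization(45_000, wall_clock_s)
print(f"MRU={mru:.2f}  MTU={mtu:.0f} tokens/min")
```

For a fleet, one plausible choice is to average the per-robot MRU values; the thread does not say how the aggregate is taken.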

Rohan Paul @rohanpaul_ai · Jun 17 · 50

This was long needed for AI in finance: making SEC filings readable for machines without flattening the accounting logic. Researchers from Stanford, Univ of Calif, and Nanjing Univ have just released a dataset and methods for a cleaner way to turn SEC filings into useful LLM training data without losing the meaning inside financial tables. They release a 152B-token public snapshot and estimate the full archive could become about 550B tokens of long financial documents. It has less than 0.1% overlap with Common Crawl-derived corpora. The authors propose SEFD, a rebuilt version of EDGAR filings that keeps table structure, indentation, and financial meaning while using fewer tokens for LLM training. The dataset turns EDGAR into layout-faithful MultiMarkdown, preserving merged headers, indentation, signs, spans, and table hierarchy while shrinking enormous presentation scaffolding into usable tokens. ---- Link – arxiv.org/abs/2606.18192v1

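Translation: Researchers from Stanford, the University of California, and Nanjing University have released the SEFD dataset and methods, which convert SEC EDGAR filings into layout-faithful MultiMarkdown, preserving merged headers, indentation, signs, spans, and table hierarchy while compressing redundant presentation scaffolding, so that LLMs can directly use the structure and accounting logic of financial tables. A 152B-token snapshot is public, and the full archive is estimated at about 550B tokens of long documents. The dataset has less than 0.1% overlap with Common Crawl-derived corpora.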
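To make "layout-faithful" concrete, here is a rough sketch of what a financial table with a merged header and indented line items could look like as MultiMarkdown text. It is not the SEFD pipeline: the input structure, the function names, the two-spaces-per-level indentation, and the figures are all assumptions for illustration. The only syntax relied on is MultiMarkdown's convention that extra trailing pipes after a cell mark a column span.

```python
def span_cell(text, span):
    """A cell followed by `span` pipes covers `span` columns in MultiMarkdown."""
    return f" {text} " + "|" * span

def render(groups, columns, rows):
    """Render a table as text.

    groups:  list of (merged header text, number of columns it spans)
    columns: leaf column names under the merged headers
    rows:    list of (depth, label, values); depth is the hierarchy level
    """
    lines = []
    # Merged header row: blank corner cell, then one spanning cell per group.
    lines.append("|   |" + "".join(span_cell(name, span) for name, span in groups))
    # Leaf header row and the separator that ends the header block.
    lines.append("| " + " | ".join(["Item"] + columns) + " |")
    lines.append("|" + "---|" * (len(columns) + 1))
    for depth, label, values in rows:
        # Assumption: two spaces of indentation per hierarchy level.
        lines.append("| " + "  " * depth + label + " | " + " | ".join(values) + " |")
    return "\n".join(lines)

# Made-up figures; parentheses carry the sign, as in accounting notation.
table = render(
    groups=[("2024", 2), ("2023", 2)],
    columns=["Q1", "Q2", "Q1", "Q2"],
    rows=[
        (0, "Revenue", ["100", "110", "90", "95"]),
        (1, "Cost of revenue", ["(40)", "(42)", "(38)", "(39)"]),
    ],
)
print(table)
```

The spans and indentation survive as plain characters, so a model reading the raw text can still tell which year a column belongs to and which line items are children of which.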