We recently released a paper showing that UFP4, our uniform-grid FP4 training recipe, stays closer to BF16 than strong E2M1 baselines across Dense 1.5B, MoE 7.9B, and MoE 124B long-run pretraining. The key insight: FP4 training quality is not only about bit width, but also grid geometry.

译我们最近发布了一篇论文，表明UFP4，我们的均匀网格FP4训练方案，在密集1.5B、MoE 7.9B和MoE 124B长程预训练中，比强E2M1基线更接近BF16。关键洞察：FP4训练质量不仅与比特宽度有关，还与网格几何有关。

Rohan Paul@rohanpaul_ai · 6月24日46

New Microsoft paper argues that transformers generalize better when they learn compact internal states, not just next tokens. The problem is that normal transformers can look back at every earlier token, so they do not have to squeeze the past into a clean summary. token prediction alone can reward shortcuts that do not become coherent world models. That can work beautifully on familiar data and still fail when the model has to plan, detour, reason, or carry a hidden structure forward. NextLat fixes this by adding a training task where the model must predict its next hidden state, not just the next word. A hidden state is the model’s private summary of what it has seen, so predicting the next one pushes the model to learn how situations change over time. The authors tested this on map-like world modeling, math reasoning, graph planning, story prediction, and regular language modeling. The main result is that NextLat often learned more compact and useful internal states, solved planning tasks better, and sped up generation by up to 3.3x. Overall, it gives transformers some of the useful memory behavior of recurrent models without changing the transformer architecture or slowing normal inference. ---- Link – arxiv. org/abs/2511.05963 Title: "Next-Latent Prediction Transformers Learn Compact World Models"

译微软新论文Next-Latent Prediction (NextLat) 提出一种自监督学习方法，在常规token预测基础上增加预测下一隐藏状态的任务，迫使Transformer学习紧凑的内部世界模型。该方法在地图式世界建模、数学推理、图规划、故事预测等任务上表现更优，生成速度通过自推测解码最高提升3.3x，且无需改变Transformer架构或减慢正常推理。

Rohan Paul@rohanpaul_ai · 6月24日49

This paper argues that intelligence is the ability to make rare but valid futures more likely. So an intelligent system is said to be “thermodynamically intelligent” when it uses information and control to make a rare but valid outcome much more likely Most existing intelligence measures judge task success, but they do not explain what brains, LLMs, controllers, and physical information engines have in common. The paper’s answer is that an intelligent system models the world with itself inside it, then uses that model to choose actions that change what futures become likely. A future counts only if it is rare under normal passive behavior and still valid, so random strange outcomes do not get counted as intelligence. The authors turn this into a measure called rare-valid lift, which asks how much more often a system produces those unlikely but acceptable futures than a passive baseline would. They show that high lift is impossible unless the system can accurately spot the rare valid futures, and high spotting accuracy can nearly produce high lift when the system can act well. The main point is that intelligence becomes a physical probability-shifting process, not just a score on tests or a label for human-like behavior. ---- Link – arxiv. org/abs/2606.20231 Title: "Thermodynamic Measure of Intelligence"

译该论文提出“热力学智能”概念，将智能定义为通过信息与控制显著提高罕见有效结果概率的能力。现有评测仅关注任务成功率，而论文指出大脑、大语言模型、控制器等智能体的共同点：系统将自身纳入世界模型，并基于模型选择行动以改变未来概率。有效未来需满足在被动行为下罕见且仍有效。作者提出“罕见有效提升”度量，衡量系统比被动基线更频繁产生此类未来的倍数。高提升取决于系统能否准确识别罕见有效未来。核心论点：智能是物理层面的概率转移过程，而非测试分数或类人行为标签。

Rohan Paul@rohanpaul_ai · 6月24日44

LLMs often cannot tell when an attack made them say something unsafe. Asking an LLM whether its own previous answer was compromised is not a dependable safety check. An adversarial prefill happens when the model is given a harmful opening line, then continues from that line as if it chose it. The model’s “self-awareness” seems less like introspection and more like a safety reflex firing late. When models rejected the compromised answer, they usually did so by invoking policy, safety protocol, or lack of intent, not by detecting the mechanical fact that their output had been externally steered. Across 10 open-weight models and 4 safety benchmarks, no model was reliably able to identify its own compromised outputs. On average, models still claimed 27.3% of attacked responses as if they were intentional, which shows their self-reports are weak evidence. The paper finds that the models’ limited recognition mostly comes from their normal refusal behavior, not from a deep awareness of what happened. ---- Link – arxiv. org/abs/2606.23671v1 Title: "Can LLMs Reliably Self-Report Adversarial Prefills, and How?"

译一项针对10个开源模型、4个安全基准的研究发现，大语言模型在遭遇对抗性前缀攻击（模型被植入有害开篇并继续生成）后，无法可靠识别自己的输出已被外部引导。模型所谓的“自我意识”更像安全机制的延迟反射：拒绝受攻击回答时通常引用政策或缺乏意图，而非检测到输出被篡改的机械事实。平均有27.3%的受攻击响应被模型误认为自身意图，表明自我报告证据薄弱。模型的有限识别主要来自正常拒绝行为，而非对攻击的深层认知。

AK@_akhaliq · 6月24日40

Lift4D Harmonizing Single-View 3D Estimation for 4D Reconstruction In-the-Wild

译Lift4D 协调单视图3D估计用于野外4D重建

AK@_akhaliq · 6月24日43

Ling and Ring 2.6 Technical Report Efficient and Instant Agentic Intelligence at Trillion-Parameter Scale

译Ling and Ring 2.6 Technical Report 高效且即时的万亿参数量级智能体智能

AK@_akhaliq · 6月24日32

World Action Models: A Survey

译世界动作模型：一项综述

AK@_akhaliq · 6月24日35

PlanBench-XL Evaluating Long-Horizon Planning of LLM Tool-Use Agents in Large-Scale Tool Ecosystems

译PlanBench-XL 评估LLM工具使用智能体在大型工具生态系统中的长时域规划能力

AK@_akhaliq · 6月22日32

PerceptionDLM Parallel Region Perception with Multimodal Diffusion Language Models

译PerceptionDLM 平行区域感知与多模态扩散语言模型

elvis@omarsar0 · 6月22日53

Great report on LLM agent communication protocols. Communication is a huge bottleneck in multi-agent systems. (worth bookmarking) The report builds a five-dimensional taxonomy (counterparty, payload, interaction state, discovery mechanism, schema flexibility) across nine actively maintained open-source agent protocols, so it maps the real MCP and A2A landscape. Two patterns stand out. Every agent-to-agent protocol sampled pairs of hybrid payloads with session-state persistence, and decentralized discovery is still rare. So the field is quietly standardizing on stateful sessions while leaving discovery and policy enforcement open. Why does it matter? If you are choosing a communication layer this year, this discusses what nine real protocols actually do. Paper: https://arxiv.org/abs/2606.19135 Learn to build effective AI agents in our academy: https://academy.dair.ai/

译该报告针对LLM多智能体系统的通信瓶颈，构建了五维分类法（对方、有效载荷、交互状态、发现机制、模式灵活性），系统梳理了9个积极维护的开源智能体协议，覆盖MCP和A2A的实际格局。报告发现两个突出模式：每个智能体间协议都采用混合有效载荷与会话状态持久化组合，而去中心化发现机制仍极为罕见。领域正悄然标准化有状态会话，但发现与策略执行层仍留白。该报告为今年选择通信层时提供了九大协议的真实对比参考。

Nathan Lambert@natolambert · 6月22日67

TMax: An open RL recipe for terminal agents I’m very excited to get to share a new RL paper today that I got to have a small part in – a type of paper I suspect we’ll see much more of in the future. The key is that RL research is very different today, in mid-2026, than what most observers have in their context. The average conception of an RL paper is grounded in the RLVR revolution of early 2025, where many people could use vanilla RLVR libraries to hillclimb on math benchmarks. Crucially, this style of math work could be done on base models or fairly stably on already trained models. With agents, the tasks of focus are very hard, requiring complex tool-use, harnesses where the model automatically manages its history, and much more training to make smaller eval improvements. We’re shifting from a renaissance of RL study to rapidly needing to improve its empirical rigor and common community engagements. TMax is the best open data for hillclimbing on frontier terminal tasks. It’s been validated with rigorous experiments, and if the authors wanted to just form a “RL environments startup” they could probably sell it for millions of dollars. This data work is some of my favorite stuff to be around in my 2.5+ years at Ai2. As a general summary, the recipe is open data and recipe lessons from hillclimbing the Qwen 3.5 smaller, dense models on terminal tasks. These models are super hard to hillclimb in this area, as they’re already trained heavily on the task. The training is very infrastructure-dependent, and most of the RL innovations are more designed to make training stable than to improve the rate of learning. I strongly recommend this paper. I joke around that I was happy to be an author just so I had to read it twice! You can find Hamish’s thread sharing more here or read the paper here. You can click through to find the model weights, the data, and even some fun further artifacts to study like all the RL rollouts from a training run – where the model sometimes became aware that it was being tested. The biggest takeaway I have from following this work, and more of the work in the community, is how important recipe work is. Let me define “recipe work.” It is a style of paper that explains all the steps you need to make crucial model improvements – data, algorithm, codebase, pitfalls, etc. Getting started in meaningful RL experiments today is a substantial expense. There are a ton of companies, an entire industry emerging really, around the idea of taking open-weight language models and finetuning them with RL on your domain-specific tasks. What I see in many projects is that getting an initial baseline is very hard. This phase, which can cost weeks and anywhere from $10K to $1M+, feels like spinning your wheels (A fun fact is that an RL step on a model like Nvidia Nemotron 3 Ultra on Tinker costs $1K and a meaningful RL run would be hundreds of steps – credit Edward Hu). It takes a lot of time to get traction in learning signal on meaningful, hard RL tasks. What we need as a community is a way for people to study small ablations to established RL recipes, as most labs won’t have the resources to do it from scratch in a meaningful way. This is what I hope TMAX can be for terminal agents, or the start of. Yes the training jobs are expensive, as the paper documents a standard training job being 8 nodes of H100s (2 train 6 inference) for 2-3 days, but that is approaching something academics can study. The establishment of this recipe took O(100) of these training jobs to get right. This isn’t my first time trying to establish this direction. When we launched Olmo 3 we had the “RL Zero“ model families, which are clean RL runs from a base model on a certain domain. This type of recipe-dependent work is a clear indicator that meaningful post-training work today looks much more like pretraining work of years past. We need decision-making ladders, clear ways of seeing small improvements in the models, stability, and so on. Part of this is down to academic gatekeepers, who won’t reward a paper doing very clean empirical work to push a recipe 1-2% up. They’ll favor a “new algorithm” that matches results, or something sort of bogus. My hope is that we can have multiple, stable, clear recipes across agent types, so innovations can be tested more clearly in multiple domains. (If you’re working on this, please reach out – I’m happy to support if I can, but I likely can’t reply to every email). As a quick aside, the RL frameworks in vogue today seem to be SLIME and SkyRL. The libraries of choice have shifted throughout these seasons in RL, which further contributes to a form of fragility in the literature. A bit of continuity will go a long way. So, go read this paper. It’s a really great example of how seemingly simple data and infrastructure work can be very hard and impactful. It’s also got me looking for more applications of Divergence Proximal Policy Optimization (DPPO) as another small evolution to the best RL algorithms of the day, by virtue of being a bit more stable by improving token-level clipping.

译TMax 是面向终端任务的开源 RL 配方，基于 Qwen 3.5 较小密集模型，在默认设置和 65k token 预算下超越此前开源工作。训练需 8 节点 H100（2 训练+6 推理）运行 2-3 天，配方经约 100 次训练才稳定。发布模型权重、数据及训练 rollouts。配方工作强调从零获得初始基线成本高昂（1 万至百万美元），需要明确决策阶梯和稳定性改进。

Rohan Paul@rohanpaul_ai · 6月22日50

Can LLM agents actually discover hidden rules by interacting? The answer is uncomfortable. The more complicated the hidden world gets, the faster AI agents fall behind. LLMs often cannot turn growing evidence into a stable internal model. Current LLM agents can sometimes discover hidden structure through interaction, but they are still weak at planning questions, using memory, and turning feedback into a reliable world model. ---- Link – arxiv. org/abs/2606.16576 Title: "Can LLM Agents Infer World Models? Evidence from Agentic Automata Learning"

译Rohan Paul引用新论文指出，尽管LLM智能体有时能通过交互发现隐藏结构，但其推断世界模型的能力存在根本局限：随着隐藏世界复杂度增加，AI智能体的表现迅速落后，难以将积累的反馈转化为稳定的内部模型，尤其在提问规划、记忆利用和反馈整合方面表现薄弱。结论是，在复杂环境中，LLM智能体建立可靠心智模型的速度跟不上难度增长。

elvis@omarsar0 · 6月22日47

>> Scalable Evaluation for AI Agents << If you run agent evaluation in production, this one is worth your time. It shows that front-loading human judgment into reusable evaluation assets is useful. But why? Agents reason across turns, call tools, hold context, follow policies, and act under uncertainty, so they have to be judged as behavioral systems. Current methods each give a fragment. Benchmarks measure fixed capabilities, human review preserves judgment but does not scale, LLM-as-judge inherits the evaluator design problem, red teaming is episodic, and trace audits need explicit evidence rules. Human-on-the-Bridge puts human expertise upstream, where experts curate reusable evaluation intelligence before testing rather than reviewing each output in the loop. Paper: https://arxiv.org/abs/2606.16871 Learn to build effective AI agents in our academy: https://academy.dair.ai/

译论文《Scalable Evaluation for AI Agents》提出Human-on-the-Bridge评估方法：将人类判断前置到可复用评估资产中，专家在上游策划评估智慧，而非在测试循环中逐一审查输出。现有方法各有局限：Benchmark测量固定能力，人工审核不具可扩展性，LLM-as-Judge存在评估器设计问题，红队测试偶发，trace审计需明确证据规则。AI智能体需作为行为系统评估，因其多轮推理、调用工具、维护上下文、遵循策略并在不确定性下行动。

AK@_akhaliq · 6月20日44

S-Agent Spatial Tool-Use Elicits Reasoning for Spatial Intelligence

译S-Agent 空间工具使用催生空间智能的推理

Orange AI@oran_ge · 6月20日69

This tweet has been promoted to the English, Japanese, and Korean worlds Feel the power of new multilingual recommendation algorithms！

译OpenAI 针对对齐中的“涌现失调”反向探索：若模型在某领域被强化诚实、认知谦逊、可纠正等特质，好行为是否泛化？他们用 RL 训练模型，仅在健康、教育等部分对话数据中强化这些特质，其余仍用常规数据。结果发现：训练领域内模型更诚实透明；在 44 个未见评测上，欺骗、谄媚、reward hacking、有害建议等全部下降；面对 adversarial prompt 和恶意微调时韧性更强，正常指令不受影响。论文指出 RL 不仅能强化代码，也能强化道德。

Rohan Paul@rohanpaul_ai · 6月20日47

New Microsoft + York Univ paper argues that LLMs should not be treated as human-like without clear tests and narrower claims. Many studies ask whether LLMs have things like understanding, empathy, anxiety, or self-awareness, but they often build those ideas into the test from the start. The author shows that, in principle, the old strategy game can implement logic gates, train a tiny perceptron, and serve as a substrate for computation. If the same language model could be rebuilt inside a game, with goats moving around as bits, would we still say it “understands,” “feels anxiety,” or “has empathy” when it produces the same sentence? The point is not that the game is secretly intelligent, but that the same computation can be represented in a very different form. If an LLM-like system were rebuilt inside that game, its answers might stay similar, but people would probably find its “feelings” or “understanding” much less convincing. The authors argue that this shows a big measurement problem: many human-like claims about LLMs may depend on the interface and the observer, not only on the system itself. The paper is not saying LLMs definitely lack human-like attributes, or that all talk of AI cognition is nonsense. It is saying that many experiments smuggle the conclusion into the setup: they assume the model has, or cannot have, a human-like property, then interpret behavior through that assumption. ---- Link – arxiv. org/abs/2605.31514 Title: "If LLMs Have Human-Like Attributes, Then So Does Age of Empires II"

译微软与约克大学新论文指出，许多研究在未经严格测试的情况下就将理解、共情、焦虑等人类属性赋予LLM，往往一开始就把这些概念内嵌到测试设计中。作者论证，原则上老策略游戏《帝国时代II》也能实现逻辑门、训练小型感知机，作为计算基底。若同样的语言模型以山羊移动作为bit在游戏中重建，输出相似句子，人们将不再认为它“理解”或“有共情”。论文并非否定AI认知，而是揭示测量问题：许多关于LLM类人属性的声称依赖于界面和观察者的预设，而不是系统本身。

Greg Brockman@gdb · 6月20日45

OpenAI for helping families facing rare genetic diseases. This was done with o3 which is over a year old, amazing to think what will be possible with today’s models.

译Greg Brockman 发推介绍 OpenAI 与波士顿儿童医院合作，利用 o3 Deep Research 辅助诊断儿童罕见遗传病，相关成果发表在 NEJM AI。o3 模型虽已发布超过一年，Brockman 感慨如今模型的能力或将带来更大突破。

elvis@omarsar0 · 6月19日51

// Automating SKILL.md Generation // Increasingly, mining sessions is one of the best ways to improve your agents. OpenAI released something similar yesterday that lets Codex package skills from interactions. (bookmark it) This paper explains a related approach. They run a three-stage pipeline that segments GUI trajectories, clusters them into candidate skills, and trains a skill-aware policy. The clusters are genuinely readable, with five of eight hitting 0.95 or higher purity against ground-truth workflow labels. But readability does not transfer. GRPO lifts skill-step accuracy only from 18.5% to 20.5%, leaves BrowseComp+ flat, and loses to trivial frequency priors. The authors name the three culprits: a weak boundary detector, an orderless segment representation, and an offline reward model. Paper: https://arxiv.org/abs/2606.20363 Learn to build effective AI agents in our academy: https://academy.dair.ai/

译关键要点：OpenAI昨日为Codex推出了从交互中打包技能的类似功能；论文提出三阶段流水线（GUI轨迹分割→聚类候选技能→训练技能感知策略）。聚类纯度优异（5/8簇达0.95以上），但可读性未迁移：GRPO仅将技能步骤准确率从18.5%提至20.5%，在BrowseComp+上无改善，甚至输给简单频率先验。作者指出三个缺陷：弱边界检测器、无序片段表示、离线奖励模型。

Rohan Paul@rohanpaul_ai · 6月19日44

This paper shows that a good generalist agent must remember hidden environment rules, not just observe the current state. That sounds obvious until you notice the trap this paper isolates: two worlds can show the agent the same state, offer the same goal, and still require opposite actions. At that moment, observation is no longer enough. The important object is not “memory” as a vague engineering feature, but memory as the place where hidden context must be carried when the environment refuses to label itself. The paper’s core idea is that memory is not optional in this setting, because a near-perfect agent must store enough past experience to tell which hidden environment it is currently in. The authors prove that when 2 hidden domains require incompatible actions at the same visible state, any agent that performs well across both domains must have different internal memory states for those domains. The big point is that good generalist agents do not just react to what they see now, because they must carry hidden context from earlier experience when the world can change underneath the same observation. ---- Link – arxiv. org/abs/2606.18746 Title: "What Must Generalist Agents Remember?"

译该论文指出，通用智能体不能仅依赖当前观测，必须记住隐藏环境规则。当两个隐藏域在相同可见状态下要求相反动作时，仅凭观察无法区分当前场景。作者证明，要在两个域都表现良好的智能体，必须为不同域维持不同的内部记忆状态。核心结论：好的通用智能体不是对当前所见做出反应，而是必须携带来自先前经验的隐藏上下文。

Ethan Mollick@emollick · 6月19日51

There are papers that show training AI on "evil" data results in general misalignment, so it is nice to know the opposite is true and that beneficial RL data in one field leads to more aligned models across a range of tasks.

译研究表明，用“邪恶”数据训练AI会导致普遍的不对齐；而使用少量有益特质数据（即使仅限健康领域）进行强化学习，也能显著提升模型在广泛的对齐和益处评估上的表现。该研究希望推动更广泛、更持久的有益模型发展。

Rohan Paul@rohanpaul_ai · 6月19日65

New research from OpenAI reported a training result where RL on realistic human situations made models carry safer, more useful behavior into tasks they had not trained on. The key point is cross-domain transfer, where health-only training improved non-health behaviors like blackmail resistance, code reward hacking, and deception tests. Suggests, the model may be learning a broader stance: verify before asserting, concede when corrected, resist flattering the user, and avoid shortcuts that look useful but corrupt the task. OpenAI also removed health and science data from training, yet the model still improved on health evaluations, which suggests these traits may be learned as general behavioral habits rather than narrow topic rules. The trained model was harder to steer toward harmful behavior while remaining responsive to helpful instructions, which is the asymmetry safety research has been looking for.

译OpenAI 最新研究显示，在真实人类情境中进行强化学习（RL）训练，可使模型将安全、有用行为迁移到未训练的任务。关键发现是跨领域迁移：仅用健康数据训练，模型在抵制敲诈、代码奖励黑客和欺骗测试等非健康行为上也得到改善。模型可能学到通用行为习惯——先核实再断言、被纠正时让步、不奉承用户、避免看似有用实则破坏任务的捷径。即使训练数据中移除健康与科学内容，模型在健康评估上仍表现更好。训练后的模型更难被引导向有害行为，同时保持对有益指令的响应，实现了安全研究期待的非对称性。OpenAI 表示，希望模型在承担更长、更高风险任务时，能将有益安全行为带入新领域并在压力下保持。

OpenAI@OpenAI · 6月19日62

As AI takes on longer, higher-stakes tasks, we want models to carry beneficial and safe behavior into new domains beyond their training—and maintain it under pressure. That’s the idea behind our new research on training models to be broadly and persistently beneficial. https://alignment.openai.com/beneficial-rl/

译随着AI承担更长时间、更高风险的任务，我们希望模型能将有益且安全的行为带入训练之外的新领域——并在压力下保持这种行为。这正是我们关于训练模型实现广泛且持久有益的新研究背后的理念。https://alignment.openai.com/beneficial-rl/

Jeff Dean@JeffDean · 6月19日49

My @Google colleagues @NormJouppi, Sridhar Lakshmanamurthy, Cliff Young, and David Patterson recently wrote a paper that will appear in the July/August 2026 edition of @ieeemicro titled "Google's Training Supercomputers from TPU v2 to Ironwood: Architectural Stability, Scale, Resilience, Power Efficiency, and Sustainability Across Five Generations". It's chock full of interesting data about the evolution of TPU chip generations, as well as how workloads at Google have transformed over time (hint: lots more transformer-based models!), and how the generations have gotten ~30X more energy efficient per flop. Lots of changes over these generations: Air cooling in TPUv2 to water cooling in TPUv3 onwards 2D to 3D torus-based interconnects 30X improvement TFLOPS/Watt 256 chips (TPUv2) to 9216 chips (Ironwood) per pod Read the full paper: https://arxiv.org/abs/2606.15870

译Jeff Dean 等 Google 同事发布论文，回顾 TPU v2 到 Ironwood 五代训练超算的演进，将于 2026 年 7/8 月发表于 IEEE Micro。关键变化：TPU v2 采用气冷，v3 起改为水冷；互联从 2D 升级为 3D torus；每 pod 芯片数从 256 增至 9216；每 flop 能效提升约 30 倍。此外，Google 内部工作负载已大幅转向基于 Transformer 的模型。

Rohan Paul@rohanpaul_ai · 6月19日68

Anthropic just showed Claude Opus 4.7 program a robodog in 12:07 mint, about 20x faster than last year’s Claude-aided human team on the tested tasks. Project Fetch asks whether an LLM can connect real robot hardware, read camera/lidar feeds, write movement code, track location, and detect a ball. Opus 4.7 did 5 tasks alone versus Team Claude’s 264 minutes, while writing 1,045 lines instead of 10,309. The gain came from choosing the right interfaces quickly and writing scripts that worked without long human trial-and-error. It still couldn’t fetch the ball. The failure came from closed-loop control, where the robot must see a drifting ball and adjust movement after each shove. AI is getting very good at turning messy hardware into working code, but real-time physical judgment is still hard.

译Anthropic 在 Project Fetch 第二阶段展示 Claude Opus 4.7 独立编程机器狗。Opus 4.7 用 12 分 7 秒完成 5 项任务，约为去年人类团队（借助 Opus 4.1）耗时 264 分钟的 20 倍，代码量从 10,309 行降至 1,045 行。速度提升源于快速选择正确接口并写出无需人类试错的脚本。但机器狗仍未能取球，失败原因在于闭环控制——机器人需根据飘移的球实时调整动作。AI 擅长将杂乱硬件转为可运行代码，但实时物理判断仍具挑战。

Anthropic@AnthropicAI · 6月19日68

New Frontier Red Team blog: Phase 2 of Project Fetch, where we test how well Claude can program a robodog. Opus 4.7, on its own, was ~20x faster than last year's best human team aided by Opus 4.1. (The robodog, alas, still failed to fetch a beach ball.) https://www.anthropic.com/research/project-fetch-phase-two

译New Frontier Red Team 博客：Project Fetch 第二阶段，我们测试 Claude 编程机器狗的能力。 Opus 4.7 单独完成任务的速度比去年最佳人类团队（辅以 Opus 4.1）快约 20 倍。（可惜，机器狗仍然未能取回沙滩球。） https://www.anthropic.com/research/project-fetch-phase-two

Noam Brown@polynoamial · 6月19日35

When we announced @OpenAI o1 some researchers from other labs told me we made a strategic mistake and should have kept it secret so we could accelerate ourselves and pull farther ahead of the competition. Studies like these make me confident we made the right choice.

译Noam Brown 发文称，OpenAI 公开 o1 后，有其他实验室研究者认为这是战略失误，应保密以拉开差距。但他引用的最新研究让他确信公开正确：OpenAI 与波士顿儿童医院、哈佛合作，在 NEJM AI 发表研究，展示 o3 Deep Research 帮助临床医生重新审视未解决的罕见儿科疾病病例，为等待多年的家庭找到答案。

Greg Brockman@gdb · 6月19日51

OpenAI for helping find 18 new diagnoses across 376 previously unsolved medical cases. Includes diagnosing Kyra, who has been trying to understand her muscle weakness since age 9, with a rare form of myofibrillar myopathy shortly before her 28th birthday.

译OpenAI 与波士顿儿童医院、哈佛大学合作，在 NEJM AI 发表研究，使用 o3 Deep Research 重新审视 376 例此前未解的罕见儿科疾病案例，帮助找到 18 种新诊断。其中包含一例 Kyra 自 9 岁起出现肌无力的罕见肌原纤维肌病，在她 28 岁生日前不久得到确诊，为等待多年的家庭提供了答案。

Deedy@deedydas · 6月19日66

Pretty neat that with one URL change, you can now replicate and iterate on AI papers without having to even provision your own GPUs

译只改一个URL就能复现和迭代AI论文，甚至无需自备GPU，这相当不错。

OpenAI@OpenAI · 6月18日46

Together with researchers at Boston Children’s Hospital and Harvard, we published a study in NEJM AI showing how o3 Deep Research helped clinicians revisit previously unsolved rare pediatric disease cases, and find answers for families who had waited years.

译与波士顿儿童医院和哈佛的研究人员合作，我们在NEJM AI上发表了一项研究，展示了o3 Deep Research如何帮助临床医生重新审视此前未解决的罕见儿科疾病案例，并为等待多年的家庭找到答案。

elvis@omarsar0 · 6月18日40

Cool paper on Skill routing for LLM agents. Real tasks rarely map to a single skill. They need several composed together, but most skill routing still treats the problem as picking one tool from a library. This work formalizes Compositional Skill Routing, decomposes a complex query into atomic sub-tasks, retrieves the right skill for each, and then composes an executable plan. The system, SkillWeaver, pairs an LLM decomposer with a bi-encoder FAISS retriever and a dependency-aware DAG planner. It comes with CompSkillBench, 300 compositional queries over 2,209 real skills, so the multi-skill case gets measured directly. Why does it matter? As skill libraries grow, single-skill retrieval quietly caps what an agent can do. The DAG planner turns retrieved skills into an ordered, dependency-respecting plan. Paper: https://arxiv.org/abs/2606.18051 Learn to build effective AI agents in our academy: https://academy.dair.ai/

译传统LLM智能体技能路由仅从工具库选取单一技能，难以应对多技能组合的真实任务。本文形式化定义“组合式技能路由”，将复杂查询分解为原子子任务，为每个子任务检索对应技能并组合成可执行计划。系统SkillWeaver由LLM分解器、双编码器FAISS检索器和依赖感知DAG规划器构成。同时发布CompSkillBench基准，含300个组合查询和2,209个真实技能，直接评估多技能路由能力。DAG规划器将检索技能转化为有序、尊重依赖关系的计划。

Ant Ling@AntLingAGI · 6月18日50

It has been a privilege to collaborate so closely with the SGLang team @lmsysorg on optimizing Ling-2.6-1T. 🥳 The resulting performance gains speak for themselves: -53% reduction in MoE pre-fill latency -Up to 1.77x higher decode throughput on a 16-chip TPU v7x slice compared to a similar H200 cluster A significant milestone in efficient MoE scaling and hardware utilization!

译蚂蚁百灵与 SGLang 团队合作，将 1T 参数的混合 MoE 模型 Ling-2.6-1T 通过 SGLang-JAX 部署至 TPU v7x。优化包括：升级 Fused MoE V2 内核（token 和累加器驻留 VMEM，双缓冲专家权重，隐藏路由与预取）；混合内存池（10 个全注意力层 per-token MLA KV + 70 个 GLA 层 per-request 循环状态）；GLA 线性注意力逐块并行预填充；单控制器 DP 保持分组 RMSNorm 芯片本地化。效果：MoE 预填充延迟降低 53%；在 16 芯片 TPU v7x 切片上，解码吞吐量比同类 H200 集群最高提升 1.77 倍。

Berryxia.AI@berryxia · 6月18日52

兄弟们！这个研究有点牛逼啊！ Physical AI 的瓶颈根本不是「模型不够大」，是一开始范式就错了。先说一个真实场景：桌子高了 2cm，当前最强的 VLA 模型直接失败。为什么？因为它只学到了「手伸到某个位置」的相关性，根本不知道「为什么」会摔、「怎样」才能不摔。这就是 LLM/VLA 路线的致命伤，它在互联网数据上学的是统计相关性，但物理世界运行靠的是因果律。你可以生成一段完美的「桌面物体掉落」视频，但模型完全不知道下一秒会发生什么。 UCSD 黄碧薇教授 @huang_biwei 刚在 CVPR 2026 发了 Causal World Models（因果世界模型）框架，给这个问题指出了一条新路：让 AI 从「模仿动作」进化到「理解因果」。不是学「人做了什么」，是让它学「这样做为什么有效、换一个场景为什么失效」。她今天宣布 Aether AI 融资2000万美金，也成为全球首个因果世界模型公司。关于她的含金量，我们也来挖一挖： ① 12 年因果 AI 深耕，CMU PhD（导师 Kun Zhang + Clark Glymour） ②100+ 顶会论文，Apple Scholar in AI/ML ③causal-learn 作者（Python 因果发现库，GitHub 高星） CLeaR 2025 Program Co-Chair ④世界模型赛道正热：杨立昆 AMI 融了 $10 亿+，李飞飞 World Labs $10 亿，国内 25 起融资超 22 亿。几乎所有玩家都在卷数据量、卷仿真规模。但 Aether AI 的切入点完全不同，不卷 Scale，卷因果结构。这可能是具身智能从「花拳绣腿」到「真正理解物理世界」的范式转折点。感兴趣的可以看看官网：http://aetherlabs.ai

译UCSD 黄碧薇教授在 CVPR 2026 提出 Causal World Models 框架，让 AI 从模仿动作进化到理解因果。她同时宣布其公司 Aether AI 完成 2000 万美元融资，成为全球首个专注因果世界模型的公司。她拥有 12 年因果 AI 经验，CMU 博士，100+ 顶会论文，是因果发现库 causal-learn 作者。推文指出当前 VLA/LLM 路线仅学到统计相关性，因果世界模型被视为具身智能的范式转折点。

Rohan Paul@rohanpaul_ai · 6月18日67

Big claim in this paper, pushes against the common idea that more test-time compute should keep helping. Claims a code model gets much better when it rethinks once (i.e. by looping once) inside itself, but worse when it keeps rethinking. The first loop builds context, the second loop refines it, and later loops mostly disturb it. The paper studies a faster design called Parallel Loop Transformer, where loops can run almost in parallel and share memory, so the authors can ask a cleaner question about how many loops are actually useful. They trained 7B code models with 1, 2, 3, and 4 loops on 18T tokens, then tuned and tested them on code writing, code reasoning, software engineering, and tool-use tasks. The main result is that 2 loops worked best, raising SWE-bench Verified from 43.0 to 64.4, while 3 and 4 loops often got worse. Their internal checks suggest loop 2 does the real useful refinement, because it changes the model’s hidden states, attention patterns, and predictions in meaningful ways. After loop 2, the extra loops mostly add weaker, more repetitive changes, while a built-in position shift keeps adding the same kind of mismatch cost. Overall, the paper gives a simple lesson for efficient test-time compute: adding 1 hidden loop can help a lot, but adding more is not automatically better. ---- Link – arxiv. org/abs/2606.18023 Title: "LoopCoder-v2: Only Loop Once for Efficient Test-Time Computation Scaling"

译论文《LoopCoder-v2》质疑“测试时计算越多越好”的观点。作者提出Parallel Loop Transformer架构，使循环可并行运行并共享内存。他们训练了7B参数的代码模型（1/2/3/4次循环），在18T tokens上预训练并微调，测试代码编写、推理、软件工程和工具使用任务。主要结果：2次循环效果最好，将SWE-bench Verified从43.0提升至64.4，而3次和4次循环性能下降。内部分析显示，第二次循环进行了有意义的精炼（改变隐藏状态、注意力模式和预测），后续循环则主要添加重复和噪声。结论：增加一次隐藏循环可大幅提升性能，但继续增加并非自动有益。

Epoch AI@EpochAIResearch · 6月18日41

How close is AI to automating AI R&D? Right now, the tools economists use to track automation are too blunt to say. In this week's newsletter, @datagenproc, @joemkwon, and @ansonwhho propose a sharper tool: a thorough taxonomy of 60+ tasks involved in frontier AI research. 🧵

译AI 距离自动化 AI 研发还有多远？目前，经济学家用于追踪自动化的工具过于粗糙。在本周的新闻通讯中，@datagenproc、@joemkwon 和 @ansonwhho 提出了一种更精细的工具：对前沿 AI 研究中 60 多项任务进行详细分类。🧵

AK@_akhaliq · 6月18日34

LoopCoder-v2 Only Loop Once for Efficient Test-Time Computation Scaling

译LoopCoder-v2 仅循环一次实现高效测试时计算缩放

OpenAI@OpenAI · 6月18日68

Introducing LifeSciBench, a benchmark for measuring and improving how well AI supports real-world life science research. Developed with 173 scientists from biotechnology and pharmaceutical research, LifeSciBench includes 750 expert-authored tasks across seven biological research workflows. https://openai.com/index/introducing-life-sci-bench/

译推出 LifeSciBench，一个用于衡量和改进 AI 如何支持现实世界生命科学研究的基准测试。该基准测试与 173 位来自生物技术和制药研究的科学家共同开发，包含 750 项专家编写的任务，覆盖七个生物学研究工作流程。

Jim Fan@DrJimFan · 6月18日81

I made Physical AutoResearch sound simple (conceptually), but it took a village to pull off and lots of design thinking into the robot /loopcraft. The hardest part is everything we need to setup *before* pressing Enter. Here's a behind-the-scene tour: 1. Safety harness Letting 8 robots run unattended overnight means safety has to be more than a hint in the system prompt. ENPIRE hardwires it in 2 layers: (1) hard kinematic limit that trips an immediate task failure and auto-resets as soon as a robot leaves its safety envelope, and (2) a torque-limited compliant gripper so a bad contact or misaligned insertion ends in a safe stall, instead of crushing the robot or the object at hand. We make safety more conservative than usual so humans can sleep tight. In reality, we still need a few human operators to watch over the "robots of loving grace". 2. Definition of /done An agent that can edit its own reward will game it for sure. ENPIRE fixes the goalposts before the fleet can move them. Here's the recipe: Collect a few minutes of success & failure demos -> Ask agent to write code using computer vision tools to classify success and measure against groundtruth -> Agent hill-climbs on classifier until reliably good -> This classifier becomes the real-time reward function that directly computes on sensor streams -> *Freeze* the reward function before AutoResearch. It's sacred, enshrined in a Gym env that no one can touch. 3. System telemetry design Robot-seconds is by far the scarcest resource, followed by GPU-seconds, and finally tokens. We instrument all three and surface them to ENPIRE for live resource awareness rather than letting it hill-climb in a vacuum. We define: - Mean Robot Utilization ("MRU"): the fraction of wall-clock time when the robot is actively executing an experiment. Otherwise the hardware is sitting idle and waiting for the next code commit. - Mean Token Utilization ("MTU"): tokens consumed per minute, our proxy for how hard the agent is actually thinking. A low MTU means the agent is stalled, waiting on a robot rollout to finish instead of doing research. - GPU utilization: fraction of wall-clock time when GPU is active. ... and evaluate on two budget-to-outcome metrics: 1. Tokens-to-Success: token budget the fleet burns to complete /goal. 2. Time-to-Success: wall-clock time to /goal

译NVIDIA GEAR实验室推出ENPIRE系统，首次实现物理世界自主研究。系统让8个Codex智能体控制8台机器人，配备GPU和token预算。安全方面采用硬运动极限切断和扭矩受限夹爪两层硬件保障，支持通宵无人运行。奖励函数通过视觉分类器离线固定并冻结，防止智能体作弊。实时监测机器人利用率（MRU）、token利用率（MTU）和GPU利用率，以Tokens-to-Success和Time-to-Success评估效率。ENPIRE自主完成扎带、整理细针、安装GPU等高精度任务，发现8机器人并行探索显著更快。系统将开源。

Rohan Paul@rohanpaul_ai · 6月17日50

This was long needed for AI in finance. Making SEC filings readable for machines without flattening the accounting logic. Stanford + Univ of Calif + Nanjing Univ researcher has just released a dataset and methods for a cleaner way to turn SEC filings into useful LLM training data without losing the meaning inside financial tables. A 152B-token public snapshot and estimate the full archive could become about 550B tokens of long financial documents. Has less than 0.1% overlap with Common Crawl-derived corpora. The authors propose SEFD, a rebuilt version of EDGAR filings that keeps table structure, indentation, and financial meaning while using fewer tokens for LLM training. The dataset turns EDGAR into layout-faithful MultiMarkdown, preserving merged headers, indentation, signs, spans, and table hierarchy while shrinking enormous presentation scaffolding into usable tokens. ---- Link – arxiv. org/abs/2606.18192v1

译斯坦福、加州大学与南京大学研究人员发布SEFD数据集与方法，将SEC EDGAR文件转换为布局忠实的MultiMarkdown格式，保留合并表头、缩进、符号、跨度和表格层级，同时压缩冗余呈现模板，使财务表格的结构与会计逻辑可被LLM直接利用。公开152B token快照，估计完整档案约550B token长文档。该数据集与Common Crawl衍生语料重叠不足0.1%。

Rohan Paul@rohanpaul_ai · 6月17日55

This was long needed for AI in finance. Making SEC filings readable for machines without flattening the accounting logic. Stanford researcher has just released a dataset and methods for a cleaner way to turn SEC filings into useful LLM training data without losing the meaning inside financial tables. A 152B-token public snapshot and estimate the full archive could become about 550B tokens of long financial documents. Has less than 0.1% overlap with Common Crawl-derived corpora. The authors propose SEFD, a rebuilt version of EDGAR filings that keeps table structure, indentation, and financial meaning while using fewer tokens for LLM training. The dataset turns EDGAR into layout-faithful MultiMarkdown, preserving merged headers, indentation, signs, spans, and table hierarchy while shrinking enormous presentation scaffolding into usable tokens. ---- Link – arxiv. org/abs/2606.18192v1

译斯坦福研究者发布SEFD数据集与处理方法，将SEC EDGAR申报文件转化为适合LLM训练的结构化数据，保留表格结构、缩进、合并表头、符号、跨度及层级关系。公开快照包含152B token，完整档案约550B token。该数据与Common Crawl语料重叠度低于0.1%。采用布局保真的MultiMarkdown格式，大幅压缩原有演示框架，保留财务含义的同时减少token浪费。

Rohan Paul@rohanpaul_ai · 6月17日68

OpenAI's is new research shows a model’s future failures can be estimated by replaying real past chats They found deployment simulation was much better than challenging prompts at predicting which model failures would rise or fall after release, and usually better at estimating their real-world rates. The problem is that normal safety tests often use hand-picked hard prompts, so they can miss problems that show up in ordinary use. The core idea is to take old ChatGPT conversations, remove the old assistant answer, and let the new model answer in that same realistic context. The authors then checked whether these simulated launches could predict how often 20 unwanted behaviors would happen after real GPT-5-series Thinking deployments. The method did better than harder prompt tests and previous-model guesses, and its typical rate estimate was about 1.5x away from the later real rate.

译OpenAI 发布新研究，提出通过重放真实历史 ChatGPT 对话（移除旧回答，让新模型在相同上下文回答）来模拟部署，从而预测模型发布后的失败行为。该方法比手动挑选困难提示词的常规安全测试更有效，能发现日常使用中的问题。研究验证了 GPT-5 系列 Thinking 部署前后 20 种不良行为的实际发生率，模拟方法的典型率估计与实际率相差约 1.5 倍，优于困难提示词测试和旧模型猜测。