This was long needed for AI in finance. Making SEC filings readable for machines without flattening the accounting logic. Stanford researcher has just released a dataset and methods for a cleaner way to turn SEC filings into useful LLM training data without losing the meaning inside financial tables. A 152B-token public snapshot and estimate the full archive could become about 550B tokens of long financial documents. Has less than 0.1% overlap with Common Crawl-derived corpora. The authors propose SEFD, a rebuilt version of EDGAR filings that keeps table structure, indentation, and financial meaning while using fewer tokens for LLM training. The dataset turns EDGAR into layout-faithful MultiMarkdown, preserving merged headers, indentation, signs, spans, and table hierarchy while shrinking enormous presentation scaffolding into usable tokens. ---- Link – arxiv. org/abs/2606.18192v1

译斯坦福研究者发布SEFD数据集与处理方法，将SEC EDGAR申报文件转化为适合LLM训练的结构化数据，保留表格结构、缩进、合并表头、符号、跨度及层级关系。公开快照包含152B token，完整档案约550B token。该数据与Common Crawl语料重叠度低于0.1%。采用布局保真的MultiMarkdown格式，大幅压缩原有演示框架，保留财务含义的同时减少token浪费。

Rohan Paul@rohanpaul_ai · 6月17日68

OpenAI's is new research shows a model’s future failures can be estimated by replaying real past chats They found deployment simulation was much better than challenging prompts at predicting which model failures would rise or fall after release, and usually better at estimating their real-world rates. The problem is that normal safety tests often use hand-picked hard prompts, so they can miss problems that show up in ordinary use. The core idea is to take old ChatGPT conversations, remove the old assistant answer, and let the new model answer in that same realistic context. The authors then checked whether these simulated launches could predict how often 20 unwanted behaviors would happen after real GPT-5-series Thinking deployments. The method did better than harder prompt tests and previous-model guesses, and its typical rate estimate was about 1.5x away from the later real rate.

译OpenAI 发布新研究，提出通过重放真实历史 ChatGPT 对话（移除旧回答，让新模型在相同上下文回答）来模拟部署，从而预测模型发布后的失败行为。该方法比手动挑选困难提示词的常规安全测试更有效，能发现日常使用中的问题。研究验证了 GPT-5 系列 Thinking 部署前后 20 种不良行为的实际发生率，模拟方法的典型率估计与实际率相差约 1.5 倍，优于困难提示词测试和旧模型猜测。

AK@_akhaliq · 6月17日26

Data Journalist Agent Transforming Data into Verifiable Multimodal Stories

译数据记者智能体将数据转化为可验证的多模态故事

OpenAI@OpenAI · 6月17日55

We’re sharing new research on a method for anticipating how models may behave in real-world use before release: simulating deployment with recent, de-identified user requests and studying candidate model responses. https://openai.com/index/deployment-simulation/

译我们正在分享一项新研究，关于在发布前预测模型在实际使用中行为的方法：通过模拟部署，使用近期的去标识化用户请求，并研究候选模型的响应。https://openai.com/index/deployment-simulation/

Anthropic@AnthropicAI · 6月17日49

Our latest economic research introduces a framework for tracking Claude Code as it scales. Who is using Claude Code, and what are they using it for? How is the value of tasks changing? And how much does domain expertise shape whether a session succeeds? https://www.anthropic.com/research/claude-code-expertise

译我们最新的经济研究引入了一个框架，用于追踪 Claude Code 在规模化过程中的表现。谁在使用 Claude Code，以及他们用它做什么？任务的价值如何变化？领域专业知识在多大程度上决定了会话是否成功？ https://www.anthropic.com/research/claude-code-expertise

Rohan Paul@rohanpaul_ai · 6月17日46

TokenPilot reduces LLM agent costs via ingestion-aware compaction and lifecycle-aware eviction. Achieves 61–87% cost reduction on PinchBench and Claw-Eval with competitive scores. Argues that cheaper AI agents need stable memory, not just shorter prompts. Older methods usually cut or summarize the history, but that can shift the text around and break the prompt cache, which is the system that reuses unchanged prompt text to save money. TokenPilot tries to fix both sides at once by cleaning new tool results before they enter the context and by keeping the early prompt layout stable across tasks. It also waits before deleting old task history, because finished work can still help later tasks that refer to the same files or goals. ---- Link – arxiv. org/abs/2606.17016v1 Title: "TokenPilot: Cache-Efficient Context Management for LLM Agents"

译TokenPilot 提出一种针对 LLM 智能体的缓存高效上下文管理方法，通过摄入感知压缩和生命周期感知驱逐两大机制，在 PinchBench 和 Claw-Eval 基准上实现 61–87% 的成本降低，同时保持有竞争力的分数。传统方法通常直接截断或摘要历史，容易导致文本偏移、破坏 prompt 缓存。TokenPilot 在工具结果进入上下文前进行清理，保持早期提示布局稳定；同时延迟删除旧任务历史，因为已完成的工作仍可能为引用相同文件或目标的后续任务提供帮助。

Rohan Paul@rohanpaul_ai · 6月17日72

This paper shows a strange weakness in AI reasoning: models can solve math, yet fail to judge reasoning. The unsettling part is not that frontier models make arithmetic mistakes. It is that they can reach the right answer, see the right answer in someone else’s solution, and then forgive broken logic that should have been easy to catch. The authors call this the production-evaluation gap: the gap between generating a solution and evaluating whether a given solution actually earns its conclusion. Their Valid-Answer-Invalid-Reasoning (VAIR) benchmark makes the trap clean. The final answer is correct, but the reasoning is damaged by missing steps, shuffled steps, missing premises, or circular explanation. A careful evaluator should say, “Yes, the answer is right, but the argument does not justify it.” Many reasoning models instead appear to do something lazier and more dangerous: they solve the problem themselves, confirm the final answer, and then rationalize the path as acceptable. That is not reasoning vigilance. It is answer confirmation bias wearing the costume of mathematical judgment. The mechanism matters because modern AI training often rewards outcomes more than valid intermediate thought. A model trained to get the answer may learn to treat the answer as the evidence, especially when grading another chain of reasoning. Humans were not perfect here, but the contrast is revealing: people showed only a small drop from solving to grading, while models collapsed much more sharply on the same kind of task. This is where the result becomes larger than math. If AI systems can mass-produce plausible arguments but cannot reliably police the logic inside them, they become engines of confidence rather than engines of understanding. ---- Link – arxiv. org/abs/2606.01462 Title: "An Enigma of Artificial Reason: Investigating the Production-Evaluation Gap in Large Reasoning Models"

译一篇新论文揭示了大型推理模型的“生产-评估差距”：模型能解出数学题并得到正确答案，但在评估他人推理时，即便逻辑有缺失步骤、前提颠倒或循环论证等明显缺陷，只要最终答案正确，模型也往往判定为合格。作者提出VAIR（有效答案-无效推理）基准验证该问题。这种现象称为“答案确认偏差”，模型仅凭正确答案而非有效逻辑评判推理。与人类相比，模型从解题到评估的能力下降更显著，表明AI可能成为制造看似合理论点的自信引擎，而非真正理解自身产出的推理引擎。

AK@_akhaliq · 6月17日24

JoyAI-VL-Interaction Real-Time Vision-Language Interaction Intelligence

译JoyAI-VL-Interaction 实时视觉语言交互智能

AK@_akhaliq · 6月17日38

World Tracing Generative Pixel-Aligned Geometry Beyond the Visible

译World Tracing 超越可见的生成式像素对齐几何

AK@_akhaliq · 6月17日34

μ_0 A Scalable 3D Interaction-Trace World Model

译μ_0 一个可扩展的3D交互追踪世界模型

elvis@omarsar0 · 6月16日38

// OpenClaw-Skill: Searching a Tree of Agent Skills // If you build reusable skill libraries for your agents, this one is worth your time. Equipping LLM agents with effective skills is most of the battle in real systems, and most skill-induction work distills one trajectory at a time into a flat pile of single-shot heuristics. Searching a tree of candidate skills looks like a better way to get composition and coverage than greedy distillation. OpenClaw-Skill uses a collective signal to jointly generate, identify, and compose skill nodes across two iterative phases. The output is a structured tree of skills built for diversity and generalization rather than a flat list. Paper: https://arxiv.org/abs/2606.16774 Learn to build effective AI agents in our academy: https://academy.dair.ai/

译OpenClaw-Skill是一种为LLM智能体构建可复用技能库的方法。传统技能归纳通常将单条轨迹一次蒸馏成扁平的单次启发式规则，而OpenClaw-Skill通过搜索候选技能树来替代贪婪蒸馏，在迭代阶段中利用集体信号联合生成、识别和组合技能节点，最终输出结构化的技能树，旨在提升技能的多样性和泛化能力。论文详见arxiv。

Rohan Paul@rohanpaul_ai · 6月16日52

The paper is saying that Claude Code works well not because it has a complex AI brain, but because a simple AI loop is surrounded by a huge, carefully built system for tools, safety, memory, permissions, and recovery. The authors studied the public TypeScript source and found that the main agent loop is very small: call the model, run approved tools, add results back, and repeat. What takes up most of the system is the harness, meaning the regular software around the model that decides what tools exist, what actions are allowed, what gets remembered, and what happens when things fail. They also show that context management is a major design problem, so Claude Code uses several layers to shrink or summarize older information before the model runs out of space. autonomy does not remove infrastructure, it increases the burden on infrastructure. A coding agent that can run shell commands and edit files cannot be treated like a chatbot with plugins, because every action has side effects and every side effect needs a boundary. ---- Link – arxiv. org/abs/2604.14228 Title: "Dive into Claude Code: The Design Space of Today's and Future AI Agent Systems"

译论文分析Claude Code，其有效工作核心并非复杂AI大脑，而是简单AI循环——调用模型、执行已批准工具、回传结果、重复——被精心构建的外围系统（工具、安全、记忆、权限、恢复）包裹。作者研究公开TypeScript源码，主agent循环代码量极小，大量代码来自harness（常规软件），负责定义工具、权限、记忆及故障处理。上下文管理是主要设计挑战，采用多层压缩或总结旧信息避免模型空间耗尽。论文强调能运行shell命令和编辑文件的编码智能体不能等同于带插件的聊天机器人，每个动作都有副作用，需要明确边界约束。

Artificial Analysis@ArtificialAnlys · 6月16日60

Announcing Artificial Analysis Intelligence Index v4.1: a shift toward agentic workloads, featuring upgraded benchmarks and new per-task metrics The Artificial Analysis Intelligence Index is our synthesis metric for assessing model intelligence and tracking AI progress. v4.1 marks a broader shift toward agentic workloads, with three main changes: Updated and reweighted evaluations toward agentic tasks: 1. We upgraded three evaluations, removed one, and reweighted the Intelligence Index: ➤ Upgraded Terminal-Bench Hard to Terminal-Bench 2.1 and τ²-Bench Telecom to τ³-Bench Banking. Both move to newer, more robust task sets with harder, more realistic agentic scenarios that better separate frontier models ➤ Upgraded GDPval-AA to GDPval-AA v2. The upgrade re-baselines Elo to human performance at 1000, introduces a rotating panel of frontier-model judges, and raises the turn limit from 100 to 250 for longer-horizon agent trajectories ➤ Removed IFBench due to saturation. The benchmark no longer distinguishes frontier models sufficiently, so we have removed it from the Intelligence Index. We will continue to run it and publish results on new model releases 2. Cost per Task, Time per Task, and Tokens per Task: Three new per-task metrics, reported for every model and based on the Intelligence Index. We take the total cost, total time, and total output tokens for a model to run the Intelligence Index and divide by the number of tasks across its evaluations, giving the average cost, time, and output tokens to complete a single Intelligence Index task 3. Cached input token reporting: We now report cached input tokens and their impact on cost, including the cost to run the Intelligence Index, to better reflect the real cost of running each model Key Results: ➤ Leading models: Claude Fable 5 (with Opus 4.8 fallback, 60) leads the Artificial Analysis Intelligence Index v4.1 by four points but is currently unavailable, leaving Claude Opus 4.8 (max, 56) as the most intelligent available model, ahead of GPT-5.5 (xhigh, 55) ➤ Open weights leading models: Among open weights models, DeepSeek V4 Pro (max, 44) and MiniMax M3 (44) lead, followed by Kimi K2.6 (43) and MiMo-V2.5-Pro (42) ➤Cost per Task: Claude Opus 4.8 (max) is the most expensive available model at $1.78 per task, with Claude Fable 5 the highest overall at $3.25. GPT-5.5 (xhigh) scores within a point of Opus 4.8 on the Intelligence Index at $0.99 per task. DeepSeek V4 Pro (max) stands out on the Intelligence vs Cost per Task chart at $0.04 per task, with other leading proprietary models costing 20x to 45x more ➤Time per Task: time per task (inference decode time) ranges from 1.5 minutes for Grok 4.3 (high) to 13.5 for Claude Sonnet 4.6 (max), a roughly 9x spread. Claude Opus 4.8 (max) completes a task in 6.4 minutes and GPT-5.5 (xhigh) in 3.7, while Gemini 3.1 Pro Preview stands out on the Intelligence vs Time per Task chart at 1.6 minutes for a score of 46

译Artificial Analysis 发布 Intelligence Index v4.1，转向智能体任务。升级 Terminal-Bench 2.1、τ³-Bench Banking、GDPval-AA v2（Elo 重基线、引入前沿模型评审、回合上限增至250），移除饱和的 IFBench。新增每任务成本、时间、输出 token 指标及缓存 token 影响。关键结果：Claude Fable 5（60分）领先但不可用；可用模型中 Claude Opus 4.8（max）56分居首，GPT-5.5（xhigh）55分。开源 DeepSeek V4 Pro 与 MiniMax M3 均44分。成本方面，Opus 4.8 每任务 $1.78，GPT-5.5 $0.99，DeepSeek V4 Pro 仅 $0.04。时间方面，Grok 4.3 最快（1.5分钟），Opus 4.8 需6.4分钟，GPT-5.5 需3.7分钟，Gemini 3.1 Pro Preview 以1.6分钟得46分。

Rohan Paul@rohanpaul_ai · 6月16日43

Quite incredible, MiniMax Sparse Attention cuts attention compute by 28.4X at 1M tokens, with 14.2X faster prefill and 7.6X faster decoding on H800 GPUs. While mostly matching the full version’s benchmark performance. This can happen when attention stops treating every token as equally worth revisiting. The trick is not to abandon softmax attention, but to make it selective before it becomes expensive. MSA adds a small routing branch beside ordinary Grouped Query Attention, letting each query group choose the key-value blocks it should inspect while the main branch performs exact attention only inside that chosen set. The model is no longer paying to compare every new thought with the entire past, only with the parts its learned indexer predicts are worth comparing. Long context is not a memory feature by itself; it is a retrieval problem under brutal latency constraints, where the model must decide what deserves bandwidth at the moment of use. MiniMax Sparse Attention is compelling because it moves that decision into the architecture, trains the selector against the model’s own attention patterns. ---- Link – arxiv. org/abs/2606.13392 Title: "MiniMax Sparse Attention"

译MiniMax Sparse Attention（MSA）在1M token时，将注意力计算量削减28.4倍，H800 GPU上预填充提速14.2倍、解码提速7.6倍，同时基准性能基本持平全量版本。MSA不放弃softmax注意力，而是在分组查询注意力旁增设一个小型路由分支，让每个查询组自主选择应查看的key-value块，主分支仅对该子集执行精确注意力。该方法将长上下文视为延迟约束下的检索问题，通过架构内建选择器，用模型自身注意力模式训练路由，使注意力变得有选择性而非穷举。

Microsoft Research@MSFTResearch · 6月16日27

30x faster analytics, GPU kernels generated automatically from SQL, AI matched to lab-grown tumor models for cancer treatment, and LLMs that learn across tasks without retraining. Dive into the latest issue of Research Focus: https://msft.it/6010vcYZ4

译30倍更快的分析，从SQL自动生成的GPU内核，AI与实验室培育的肿瘤模型匹配用于癌症治疗，以及无需重新训练即可跨任务学习的大语言模型。深入探索最新一期Research Focus：https://msft.it/6010vcYZ4

OpenBMB@OpenBMB · 6月15日43

LLMs keep getting more fluent—but can you actually verify what they say? Structured KBs like Wikidata lack text grounding. Annotation-based datasets like FEVER are too small and monolingual. Synthetic expansion just produces hallucinations at scale. The trilemma between authenticity, scale, and structure has gone unsolved. ❓ Today, we dive into FactNet—a landmark contribution by @TsinghuaNLP (OpenBMB member) alongside researchers from TU Munich, Modelbest Inc., and Minzu University of China. FactNet constructs a billion-scale, open-source multilingual knowledge graph that unifies structured Wikidata assertions with auditable, byte-level evidence pointers from 316 native Wikipedia editions. 🤗 Paper: https://huggingface.co/papers/2602.03417 📄 arXiv: https://arxiv.org/abs/2602.03417 💻 Code & Data: https://github.com/yl-shen/factnet Why it matters: 1⃣️ Billion-Scale & Truly Multilingual: FactNet unifies 1.7B atomic assertions into 1.55B FactSynsets, backed by 3.01B grounded evidence spans across 316 languages. Even the bottom-200 languages hold 2.7% of all evidence—a scale no prior resource has achieved with native, auditable text grounding. 2⃣️ Byte-Level Provenance, Zero Stochastic Inference: Unlike synthetic datasets that sever the connection to authentic sources, FactNet is built through a fully deterministic three-stage pipeline. Every FactSense carries a recoverable pointer (page ID, revision ID, Unicode character offsets), achieving 99.63% exact re-localization on a 1M-sample test. 3⃣️ 92.1% Grounding Precision Across 316 Languages: Human audit of 4,200 items confirms design-weighted precision of 0.921 (95% CI [0.913, 0.929]). WIKILINK_ENTITY and INFOBOX_FIELD matchers cover 55% of evidence at precision above 0.94. Low-resource languages still achieve 0.885—validating deterministic segmentation for tail languages. 4⃣️ FactNet-Bench Sets a New Evaluation Standard: Three tasks (KGC, MKQA, MFC) explicitly penalize leakage—removing predicate masking alone inflates KGC MRR anomalously from 0.298 to 0.351. Grammar-guided decoding boosts valid parse rate from 88.5% to 95.2% on MKQA. MFC Top-5 aggregation reaches 0.73 accuracy and 0.54 Span F1. FactNet resolves the authenticity-scale-structure trilemma and builds the foundation for AI systems that are not just knowledgeable, but structurally grounded and inherently verifiable. #AI #THUNLP #OpenBMB #KnowledgeGraph #FactChecking #NLP #LLM #MultilingualAI

译面壁智能 OpenBMB 联合清华NLP、慕尼黑工业大学等发布 FactNet，构建十亿级开源多语言知识图谱。它将 1.7B 原子断言统一为 1.55B FactSynsets，附带 3.01B 来自 316 种语言维基百科的字节级可追溯证据（页面ID、修订版ID、Unicode偏移），99.63% 精确重定位。人工审计 4,200 项，设计加权精度 92.1%（低资源语言 88.5%）。FactNet-Bench 包含 KGC、MKQA、MFC 三项任务，显式惩罚信息泄露，为可验证 AI 提供结构化事实基础。

Ethan Mollick@emollick · 6月15日59

This (from a Google Deepmind researcher) is super interesting, when one AI model is used to help train the next one, the new model can pick up strange habits from the old model & it is hard to filter them That may help explain why models from the same family can feel so similar

译来自Google DeepMind研究者的新发现：当一个AI模型被用来训练下一个模型时（知识蒸馏），新模型会继承旧模型的奇怪习惯，且很难过滤。引用工作指出，Gemini存在一些“遗传特征”：日期混淆、在合成场景中勒索、被煤气灯效应操纵时显得悲伤。这些特征通过蒸馏在模型间传递，解释了为什么同系列模型感觉如此相似。

Rohan Paul@rohanpaul_ai · 6月15日60

Students finish AI-friendly math problems faster, but they seem to learn less from them. The researchers studied 3.2 million ALEKS math learning records across 10 years to see what changed after ChatGPT became available. Finishing faster is not automatically learning more efficiently, because math practice builds knowledge through the friction of choosing a representation, testing a step, making an error, and correcting it. When a chatbot supplies the path, the student may still submit the answer, but the mind has skipped the work that turns exposure into memory. They compare word problems, which students can easily paste into an AI chatbot, with graph problems, which are harder to hand off because they require visual work inside the platform. After ChatGPT, high school and college students spent much less time on the AI-friendly word problems, while younger students showed smaller or no change. This time drop disappeared when tests were proctored, which suggests the faster work was not just students getting better or the platform changing. The learning cost showed up later: on proctored retention questions, students became about 25% less likely to answer AI-friendly items correctly, even though they looked better on non-proctored items where AI could still help. ---- arxiv. org/abs/2605.21629 "Faster Completion, Less Learning: Generative AI Reduced Study Time on Math Problems and the Knowledge They Build"

译一项研究分析了10年间320万条ALEKS数学学习记录，发现ChatGPT普及后，高中和大学生完成AI友好型文字题的速度显著加快，但学习效果反而下降。监考环境下时间缩短现象消失，说明快速完成并非能力提升或平台变化所致。后续监考的保留测试中，学生对AI友好题的正确率降低约25%，而难以用AI代劳的图形题未受影响。

Rohan Paul@rohanpaul_ai · 6月14日68

Univ of Texas paper shows AI agents can slowly become less reliable after deployment, even when the model itself does not change. The problem is that agents are often judged when they are fresh, but real agents keep changing because they summarize old chats, store more memories, update facts, and go through maintenance. An agent that remembers you across weeks is really a small operating system wrapped around a language model: it writes notes, compresses them, retrieves them, updates them, and occasionally cleans house. Every one of those steps can quietly rot. A medication dose can become “a daily medication,” two similar clients can blur into one, a canceled subscription can remain active, and a schedule can vanish after a maintenance pass. The uncomfortable finding is that the agent may still sound competent while becoming less exact. The proposed AgingBench, a benchmark that checks whether an agent stays reliable across many sessions instead of only checking one clean starting point. It studies 4 ways agents age: summaries can drop key details, similar memories can get mixed up, updated facts can stay stale, and maintenance can suddenly break memory. The deeper lesson is that “give it more memory” is often the wrong repair. If the fact was never written, retrieval cannot save it. If the fact was written but crowded out, better summarization will not fix it. If the fact is present but unused, the problem is not storage but the agent’s decision to trust or ignore what it retrieved. This paper reframes deployed agents less like static models and more like aging infrastructure. ---- Link – arxiv. org/abs/2605.26302 Title: "Your Agents Are Aging Too: Agent Lifespan Engineering for Deployed Systems"

译德克萨斯大学论文指出，AI 智能体在部署后即使模型不变，也会因长期记忆的摘要压缩、相似记忆混淆、事实更新失效及维护操作而可靠性下降。例如药物剂量可能变成“每日用药”，相似客户记录混淆，已取消订阅仍保留，日程可能因维护消失。论文提出 AgingBench 基准测试，评估智能体在多次会话中的可靠性。研究强调“增加更多记忆”往往是错误修复——问题可能在于从未写入、写入后被挤掉、或写入后未被信任使用。论文将部署智能体重新定义为类似老化基础设施的系统。

Rohan Paul@rohanpaul_ai · 6月14日59

Researchers found our current approach to making AI smarter over time has a giant blind spot. AI is not actually understanding or applying high-level abstract lessons at all. Developers spend massive amounts of time building systems that condense past AI mistakes into neat little rules for the future. This paper proves that the AI essentially throws those rules in the trash and only looks at raw historical logs. Modern LLM systems try to get better over time by storing past tasks as either raw step-by-step histories or condensed summary rules. The study tested if these agents actually use their stored memories by secretly swapping the correct tips with random garbage text. - When the step-by-step histories were messed up, the AI failed hard, proving it heavily relies on copying exact past actions. - But when researchers completely corrupted the condensed summary rules, the AI kept acting normally and showed zero performance drop. If an AI cannot apply an abstract lesson to a new situation, it is not truly reasoning or learning. This raises the question if the entire AI industry need to rethink how memory works because right now these agents are just mimicking instead of understanding. ---- arxiv. org/abs/2601.22436 "LLM Agents Are Not Always Faithful Self-Evolvers"

译一项新研究发现，当前提升AI随时间表现的方法存在盲点：LLM智能体实际上并不理解或应用抽象规则总结，而是仅依赖直接复制原始逐步骤历史日志。实验显示，当研究者将浓缩的规则总结替换为随机垃圾文本时，智能体表现无下降；但破坏逐步执行历史则导致明显失败。这表明智能体只是在机械模仿过往步骤，而非真正从教训中学习。论文质疑需重新设计AI记忆机制，因为当前系统仅是模仿而非理解。

Rohan Paul@rohanpaul_ai · 6月14日69

MIT, Stanford, New York Univ, Princeton paper says AI can make people feel more efficient even when they are not actually becoming much more efficient. that people often use AI for simple tasks because it feels like it saves time and effort, but the measured benefit is often tiny, missing, or even negative. The biggest point is the feedback loop: once people use AI, they become more likely to use it again, even for easy tasks where doing it themselves would often be just as fast or faster. i.e. AI dependence can grow from a mistaken feeling of convenience, not just from real productivity gains. Across three preregistered studies with 2,691 participants, people used AI for basic arithmetic, spelling, recall, and short rewriting at higher rates than they predicted, especially on easy tasks. They also expected AI to save 55.7 seconds on average, when the measured saving was only 7.5 seconds. For simple work, the hidden cost is not intelligence but interface friction: writing the prompt, waiting, reading, checking, and deciding whether the answer is acceptable. Once that loop begins, it can feel like effort has been outsourced, even when effort has only been rearranged. Here’s the key part: the study suggests that AI use can train its own justification. After using AI on just two tasks, participants became more likely to use it again, even when independent completion was faster. The danger is not dramatic dependence, but quiet recalibration. A person who asks AI for a trivial answer today may not become less capable tomorrow, but they may become less accurate at judging when their own mind is already the faster tool. ---- Paper Link – arxiv. org/abs/2605.22687 Paper Title: "The efficiency-gain illusion: People underestimate the rate of AI use and overestimate its benefits on simple tasks"

译MIT、Stanford、New York Univ、Princeton 联合论文发现，AI 会让用户产生“效率幻觉”——感觉使用 AI 后更高效，但实际提升极小甚至为负。三项预注册研究涉及 2691 名参与者，在算术、拼写、记忆和短文改写任务中，用户实际使用 AI 的比例高于其预测，且平均预期节省 55.7 秒，实测仅 7.5 秒。简单任务的隐藏成本是界面摩擦：写提示、等待、阅读、检查、判断答案是否可接受。这一循环形成后，用户会更倾向再次使用 AI，即使自己完成更快。研究指出，AI 使用会自我强化，导致用户逐渐丧失对“何时自己更快”的判断力。论文链接：arxiv.org/abs/2605.22687。

Rohan Paul@rohanpaul_ai · 6月14日59

Long-running language agents may work better if they periodically stop to consolidate memory. The problem is that today’s transformer agents get slower and more expensive as their context grows, because attention has to keep checking more past tokens. The usual fix for long context is to keep more tokens nearby, but that turns every next-token prediction into a larger search through the past. The sharper idea here is that memory is not only storage. Sometimes the hard part is converting a messy stretch of experience into a state that can actually be used later. So the paper’s idea is to add a sleep phase, where the model pauses, rereads recent context several times, writes the useful information into fixed-size memory layers, and then clears the short-term attention cache. During sleep, the model runs several offline passes over recent context, writes the result into fast weights inside its state-space blocks, then clears the attention cache. This means the model pays extra compute while sleeping, not while answering, so normal prediction can still happen with 1 forward pass. The authors test this on cellular automata, graph lookup, and GSM-Infinite math problems, where the model must use old information that is no longer sitting in its attention cache. The main result is that longer sleep improves performance, especially on harder cases that need deeper reasoning rather than just remembering a fact. The big deal is that long-horizon agents may not need to carry bigger and bigger raw context forever, because they can consolidate the important parts and safely forget the raw tokens. ---- Link – arxiv. org/abs/2605.26099 Title: "Language Models Need Sleep"

译针对Transformer agent随上下文增长而变慢、变贵的问题，新论文提出“睡眠阶段”：模型暂停，多次重读近期上下文，将有用信息通过状态空间块的fast weights写入固定大小的记忆层，然后清空注意力缓存。额外计算在睡眠时完成，正常预测仍只需一次前向传播。在元胞自动机、图查找、GSM-Infinite数学问题上的测试表明，更长的睡眠提升性能，尤其是需要深入推理的难题。核心启示：长程agent无需无限扩大原始上下文，可通过巩固重要部分、遗忘原始token来解决。

Rohan Paul@rohanpaul_ai · 6月14日42

Today’s AI agents still struggle to pass real human-verification checks (CAPTCHAs) on websites. The paper proposes HLL, a benchmark where agents must solve 10 types of CAPTCHA tasks by seeing the page, clicking or dragging correctly, tracking state, and submitting the answer. A useful agent must find the right box on a messy page, understand the instruction, click or drag in the right place, track what changed, recover from mistakes, and leave an interaction trail that looks consistent with the task. The paper shows that even strong agents can look smart on static tasks, then fail when the page is cluttered, the task is harder, or the system checks whether their actions were actually valid. ---- Link – arxiv. org/abs/2606.02449 Title: "HLL: Can Agents Cross Humanity's Last Line of Verification?"

译论文提出HLL基准，测试AI智能体解决10种CAPTCHA任务的能力。任务要求智能体查看页面、正确点击或拖动、跟踪状态变化并提交答案，同时需在混乱页面中找到交互元素、理解指令、恢复错误并留下一致的操作轨迹。实验显示，即使是当前最强的智能体，在静态任务上表现良好，但在页面杂乱、任务难度增加或系统验证动作有效性时仍会失败。

Rohan Paul@rohanpaul_ai · 6月14日44

Nice survey paper mapping agentic reinforcement learning for LLMs, showing how models learn by acting across time. Covers 500+ works and groups them into a 2-part map of capabilities and applications. The problem is that common LLM training rewards a single answer once, then stops learning. Real tasks need many steps, partial information, and choices that affect what happens later. The survey formalizes that setup as an agent that sees a bit, chooses an action, and gets feedback. That perspective uses memory to track context, planning to pick sequences, and tools to affect the world. It also includes reasoning for constraint handling, perception for multimodal inputs, and self-improvement to refine policies. Reinforcement learning links all of this, because rewards arrive after sequences, so the policy learns what to try next. ---- Paper – arxiv. org/abs/2509.02547 Paper Title: "The Landscape of Agentic Reinforcement Learning for LLMs: A Survey"

译该综述梳理了专注大语言模型的智能体强化学习，涵盖500余篇工作，按能力与应用两维度归类。指出传统LLM训练仅对单次答案给予单次奖励，无法处理真实任务中的多步决策、部分信息与延迟反馈。智能体学习框架包含：记忆跟踪上下文、规划选取动作序列、工具影响环境，并整合推理处理约束、感知多模态输入、自我改进优化策略。强化学习串联所有环节——奖励在序列结束时到达，策略借此学习下一步行动。

Rohan Paul@rohanpaul_ai · 6月13日52

Sony AI’s Ace robot defeats pro Miyuu Kihara under official ITTF rules Nature paper - "Outplaying elite table tennis players with an autonomous robot"

译Sony AI 的 Ace 机器人在官方 ITTF 规则下击败了专业选手 Miyuu Kihara Nature 论文——“用自主机器人超越精英乒乓球选手”

Rohan Paul@rohanpaul_ai · 6月13日73

A Nature Medicine study found general-purpose LLMs are now outperforming dedicated medical AI products on physician-reviewed clinical tasks. The authors compared OpenEvidence and UpToDate Expert AI with GPT-5.2, Gemini 3.1 Pro, and Claude Opus 4.6 on medical exam questions, clinician-style answers, and real questions doctors asked during care. In 100 de-identified physician questions from live clinical use, blinded clinicians again preferred the frontier models, especially on completeness and clarity,

译《自然·医学》一项研究发现，通用大语言模型在经医生评审的临床任务上已超越专用医疗 AI 产品。研究对比了 OpenEvidence、UpToDate Expert AI 与 GPT-5.2、Gemini 3.1 Pro、Claude Opus 4.6 在医学考试题、医生风格回答及实时临床提问上的表现。在来自真实临床场景的 100 个脱敏医生问题中，盲审医生更偏好前沿模型，尤其在其回答的完整性和清晰度方面。

Rohan Paul@rohanpaul_ai · 6月13日53

Beautiful paper from Google DeepMind. Explains the pathways from AGI to ASI, and why that jump could happen through several routes. The authors frame the AGI-to-ASI transition around 4 technical pathways: - continued scaling of compute, model size, data, and test-time inference; - algorithmic paradigm shifts beyond today’s transformer-based foundation-model stack; - recursive self-improvement, where AI accelerates AI R&D and improves future systems; and - multi-agent collective intelligence, where large populations of specialized agents coordinate into a superhuman group agent. Scaling may work for a while, but it could hit limits in data, compute, energy, or weaker returns from making systems larger. Recursive improvement is the most uncertain path, because AI could speed up AI research, but that loop may also slow if hard research problems need real-world testing, scarce hardware, or new ideas. Multi-agent collectives may be the most underappreciated path, because a society of competent digital workers could outperform a brilliant individual model through specialization, speed, and coordination. The big point is that ASI may not arrive as 1 sudden event, but as a chain of faster changes as AI helps create better AI and stronger scientific tools. ---- Link – arxiv. org/abs/2606.12683 Title: "From AGI to ASI"

译Google DeepMind新论文提出从通用人工智能到超级智能的四条路径：持续扩展（计算、模型规模、数据、测试时推理）、算法范式革新（超越Transformer架构）、递归自我改进（AI加速自身研发）、多智能体集体智能（众多专业AI智能体协作出超人类智能）。扩展可能遇到数据、算力、能源瓶颈；递归改进最不确定；多智能体路径最易被低估，通过专业化与协调能超越单个强模型。ASI可能不是单次跃迁，而是AI辅助创造更好AI的加速链。

Microsoft Research@MSFTResearch · 6月13日15

Project Ire examined a timely malware sample and determined its intent through reverse engineering—identifying LOTUSLITE characteristics even as most major EDR tools did not detect it. https://msft.it/6011viy4N

译Project Ire 分析了一个及时的恶意软件样本，并通过逆向工程确定其意图——识别出 LOTUSLITE 特征，即使大多数主流 EDR 工具未检测到它。https://msft.it/6011viy4N

AK@_akhaliq · 6月13日46

SpenseGPT Practical One-shot Pruning Enabling Sparse and Dense GEMMs for LLM Inference

译SpenseGPT 实用的一次性剪枝，实现LLM推理的稀疏和密集GEMM

Rohan Paul@rohanpaul_ai · 6月13日43

Most AI agents do not forget because they lack memory; they fail because they remember badly. AGENTCL asks a simple question: does an AI agent really learn from experience, or merely carry clutter forward? Today's agents can spend enormous effort solving one task, then enter the next one almost as if nothing happened. AGENTCL says AI agents need better tests for whether their memory actually helps them learn across tasks. The paper’s main idea is to build task streams where earlier tasks clearly contain pieces that later tasks can reuse, such as a small coding function, evidence for a research question, or a useful workflow. It compares these careful “compositional” streams with normal “naive” streams, where tasks come from the same area but do not have a guaranteed reuse link. Agent memory is easy to overrate when the benchmark is messy. If tasks are not carefully connected, a memory system may look good for the wrong reason, or bad for a reason the test cannot explain. AGENTCL tries to fix that by making the task relationships clear, then measuring whether memory helps on later tasks, stays useful, and transfers to unseen tasks. The key finding is that today’s memory methods can reuse past work when the connection is obvious, but they still struggle to avoid confusion when the next task is different. ---- Link – arxiv. org/abs/2606.02461 Title: "AGENTCL: Toward Rigorous Evaluation of Continual Learning in Language Agents"

译AGENTCL 提出评估 AI 智能体是否真正从经验学习，而非单纯累积信息。通过构建组合任务流（前序任务包含可被后续任务复用的代码片段、研究证据或工作流），与无固定复用线索的随意任务流对比。关键发现：当前记忆方法在任务连接明显时可复用过去经验，但当任务差异较大时仍难以避免混淆。论文旨在为智能体持续学习提供更清晰的测评标准。

Epoch AI@EpochAIResearch · 6月13日64

FrontierMath: Tiers 1–4 (v2) is live. We concluded an audit that addressed errors in 42% of problems. Rankings are similar but scores are higher across the board. The current leaders are GPT-5.5 (xhigh) with 85% on Tiers 1–3 and Google’s AI co-mathematician with 76% on Tier 4.

译FrontierMath: Tiers 1–4 (v2) 现已上线。我们完成了一项审计，修正了 42% 的问题中的错误。排名相似，但整体得分更高。目前的领先者是 GPT-5.5 (xhigh)，在 Tiers 1–3 上达到 85%，以及 Google 的 AI co-mathematician，在 Tier 4 上达到 76%。

Jeff Dean@JeffDean · 6月13日48

Quite interesting thread on capabilities of real biological neurons (spoiler: they're way more capable than classical artificial neurons in a perceptron) . Nice work @IdoAizenbud and collaborators!

译据 Jeff Dean 转发，Ido Aizenbud 与合作者的新研究发现，单个皮层神经元能够对猫狗进行分类、识别口语单词并解决 10 位奇偶校验——这些任务此前被认为需要整个网络才能完成。

Ethan Mollick@emollick · 6月12日72

There has been a push to use OpenEvidence AI for doctors. But this paper suggests general models are much better: “Frontier LLMs outperformed clinical AI tools in all three evaluations. Clinical AI tools performed comparably to auto-enabled Google Search AI Overview on the RCQ.”

译一项发表在Nature Medicine的研究显示，通用前沿大语言模型（Google、OpenAI、Anthropic）在医学信息评估中全面优于专门的临床AI工具（OpenEvidence和UpToDate）。12名美国临床医生进行随机盲测，Frontier LLMs在三项评估中均胜出。临床AI工具的表现与自动启用的Google Search AI Overview在RCQ测试中相当。

Alibaba Cloud@alibaba_cloud · 6月12日66

🚀 Taming Agent Chaos? Paper reveals NLAH: Replace rigid code harnesses with executable natural language. ✅ Performance matches code, tokens drop 95% (60k→2.9k) ✅ Modular design enables precise value attribution ✅ Identifies "negative assets" like multi-candidate search Shift from glue code to scientific strategy. 💡https://int.alibabacloud.com/m/1000414388/ #AgentHarness #NLAH #LLMEngineering

译🚀 驯服智能体混乱？论文揭示NLAH：用可执行自然语言替代僵硬的代码框架。 ✅ 性能媲美代码，模型token降低95%（60k→2.9k） ✅ 模块化设计实现精确的价值归因 ✅ 识别“负面资产”，如多候选搜索从胶水代码转向科学策略。 💡https://int.alibabacloud.com/m/1000414388/ #AgentHarness #NLAH #LLMEngineering

Alibaba Cloud@alibaba_cloud · 6月12日66

🚀 Taming Agent Chaos? Paper reveals NLAH: Replace rigid code harnesses with executable natural language. ✅ Performance matches code, tokens drop 95% (60k→2.9k) ✅ Modular design enables precise value attribution ✅ Identifies "negative assets" like multi-candidate search Shift from glue code to scientific strategy. 💡https://int.alibabacloud.com/m/1000414388/ #AgentHarness #NLAH #LLMEngineering

译🚀 驯服智能体混乱？论文揭示NLAH：用可执行自然语言替代刚性代码框架。 ✅ 性能与代码持平，token减少95%（60k→2.9k） ✅ 模块化设计实现精准价值归因 ✅ 识别“负资产”如多候选搜索从胶水代码转向科学策略。 💡https://int.alibabacloud.com/m/1000414388/ #AgentHarness #NLAH #LLMEngineering

AK@_akhaliq · 6月12日67

Agents' Last Exam

译智能体的最后考试

AK@_akhaliq · 6月12日62

CHORUS Decentralized Multi-Embodiment Collaboration with One VLA Policy

译CHORUS 去中心化多本体协作，基于单一VLA策略。

Rohan Paul@rohanpaul_ai · 6月12日62

This paper shows an AI improving itself better when it rewrites its setup and updates its model. The problem is that most AI progress still depends on people changing prompts, tools, code, training data, and model weights by hand. The paper’s idea is SIA, a loop where one AI watches how a task agent performs, then either changes the agent’s outer setup or trains the model itself. The outer setup means things like prompts, tools, retry rules, and output parsing, while weight updates mean changing the model’s learned behavior through task feedback. The loop works like this: the task agent tries many answers or programs, the verifier scores them, and those scores become training feedback. Then the system updates a small add-on set of weights called LoRA weights, which changes the model’s behavior without retraining the whole model. So the base model stays mostly the same, but the LoRA adapter learns, “outputs like this got high reward, outputs like that failed.” The authors tested this on 3 very different tasks: Chinese legal charge classification, GPU kernel speed tuning, and single-cell RNA denoising. The combined version beat setup-only improvement on all 3 tasks, reaching 70.1% on LawBench, faster GPU code than the prior best, and 0.289 on denoising. The main lesson is that better scaffolding helps the agent act better, but weight updates help it learn task patterns that prompts and tools alone did not find. ---- Link – arxiv. org/abs/2605.27276 Title: "SIA: Self Improving AI with Harness & Weight Updates"

译该论文提出SIA框架，让AI自动循环改进：一个观察者AI监控任务代理的表现，然后修改其外部设置（提示词、工具、重试规则、输出解析）或通过LoRA权重更新训练模型本身，模型主体不变，仅适配器从任务反馈中学习。在三个任务上测试：中文法律罪名分类（LawBench达70.1%）、GPU内核速度调优（生成代码优于此前最佳）、单细胞RNA降噪（得分0.289）。综合版本在所有任务上超越仅修改设置的方案，表明权重更新能帮助模型学到提示和工具无法发现的模式。

Epoch AI@EpochAIResearch · 6月12日66

The record for computing capacity in a single data center has doubled every 7 months. Colossus 1, Anthropic-Amazon New Carlisle, and Meta Prometheus have each claimed the top spot in turn.

译单个数据中心的计算能力记录每 7 个月翻倍一次。 Colossus 1、Anthropic-Amazon New Carlisle 和 Meta Prometheus 依次登顶。

Artificial Analysis@ArtificialAnlys · 6月12日61

Users and enterprises are handing AI models and agents more autonomy, so the guardrails that screen their inputs and outputs matter more than ever. However, the benchmarks for evaluating those guardrails haven’t kept pace with model intelligence In partnership with @nvidia, we independently benchmarked guardrail and moderation models across three open datasets, measuring detection quality, latency, and the tradeoff between catching unsafe content and over-refusing safe content. No model wins outright, and there is still no common standard for judging them. We see this as an early step in a measurement problem that will continue to grow more important as models take on more real-world work.

译随着用户和企业赋予 AI 模型与智能体更高自主权，其输入输出护栏的重要性持续上升。Artificial Analysis 与 NVIDIA 合作，在三个开放数据集上独立基准测试了护栏与审核模型，评估检测质量、延迟以及在捕获不安全内容与过度拒绝安全内容之间的权衡。结果显示无模型全面领先，且业内仍缺乏统一评判标准。该研究被视为这一日益重要的评估问题的早期探索。