This paper makes long-context attention cheaper and faster by letting each token use only the query heads it needs. It reached about 1.7 to 1.8 times faster prefill at large context lengths. Standard attention makes every token run through every attention head, even when some heads are not useful for that token. The paper’s idea, called Grouped Query Experts, sits on top of grouped-query attention, the trick many long-context models already use to reduce key-value cache cost: it keeps the normal key and value cache but routes each token to only a few query-head experts. This is like giving the model many possible attention patterns, while making each token pay for only the small set that seems useful. The authors trained 250M-parameter models on 30B tokens and compared the method with a normal grouped-query attention baseline. The best version matched the baseline’s average accuracy (56.04 versus 55.86 for the baseline) while using 9 of 16 query-attention computations. The result shows that attention can be made sparse inside grouped-query attention without hurting quality, but only when the router gets a strong learning signal and one shared head stays always on. ---- Link – arxiv.org/abs/2606.20945 Title: "Grouped Query Experts: Mixture-of-Experts on GQA Self-Attention"

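Translation: The paper proposes Grouped Query Experts, which builds on grouped-query attention (GQA) and routes each token to only a few query-head experts. Prefill is about 1.7 to 1.8 times faster at long context. 250M-parameter models were trained on 30B tokens; the best version reached 56.04 accuracy (baseline 55.86) while using only 9 of 16 query-attention computations. This shows that sparse attention is achievable inside GQA without hurting quality, but it requires a strong learning signal and one always-on shared head.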
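
The routing step described in this post can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the name `select_query_heads`, the choice of head 0 as the always-on shared head, and the split into 1 shared plus 8 routed heads (to reach 9 of 16) are assumptions made for the example.

```python
def select_query_heads(router_scores, num_routed, shared_head=0):
    """Pick the query heads one token will actually compute.

    router_scores: one score per query head (hypothetical router output).
    num_routed:    how many non-shared heads to keep (top-k by score).
    shared_head:   index of the head that is always active (assumed to be 0).
    """
    # Rank every head except the shared one by descending router score.
    candidates = [(score, head) for head, score in enumerate(router_scores)
                  if head != shared_head]
    candidates.sort(key=lambda pair: (-pair[0], pair[1]))
    routed = [head for _, head in candidates[:num_routed]]
    # The shared head is added unconditionally.
    return sorted([shared_head] + routed)


# Made-up router scores for 16 query heads of a single token.
scores = [0.05, 0.90, 0.10, 0.80, 0.20, 0.70, 0.30, 0.60,
          0.40, 0.50, 0.15, 0.85, 0.25, 0.75, 0.35, 0.65]
active = select_query_heads(scores, num_routed=8)
print(active, len(active), "of", len(scores))
```

Only the selected heads would run their query-side attention; the key and value cache is untouched, which is the property the post attributes to the method.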

Rohan Paul@rohanpaul_ai · 5 days ago · 43

Students finish AI-friendly math problems faster, but they seem to learn less from them. The researchers studied 3.2 million ALEKS math learning records across 10 years to see what changed after ChatGPT became available. Finishing faster is not automatically learning more efficiently, because math practice builds knowledge through the friction of choosing a representation, testing a step, making an error, and correcting it. When a chatbot supplies the path, the student may still submit the answer, but the mind has skipped the work that turns exposure into memory. The researchers compare word problems, which students can easily paste into an AI chatbot, with graph problems, which are harder to hand off because they require visual work inside the platform. After ChatGPT, high school and college students spent much less time on the AI-friendly word problems, while younger students showed smaller or no change. This time drop disappeared when tests were proctored, which suggests the faster work was not just students getting better or the platform changing. The learning cost showed up later: on proctored retention questions, students became about 25% less likely to answer AI-friendly items correctly, even though they looked better on non-proctored items where AI could still help. ---- Paper Link – arxiv.org/abs/2605.21629 Paper Title: "Faster Completion, Less Learning: Generative AI Reduced Study Time on Math Problems and the Knowledge They Build"

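Translation: A study of 3.2 million ALEKS math learning records spanning 10 years finds that after ChatGPT appeared, students completed AI-friendly word problems markedly faster but learned less, while graph problems that require visual work were less affected. High school and college students spent less time, with little change among younger students; the time drop disappeared under proctoring, which indicates the speedup did not come from improved ability. On later proctored retention questions, accuracy on AI-friendly items fell by about 25%, showing that finishing work quickly with AI did not turn into lasting knowledge.

elvis@omarsar0 · 5 days ago · 50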

If you use LLM-as-judge, this one is worth reading. (bookmark it) It's actually one of the most effective ways to use LLM-as-a-Judge for evals. Holistic judge scores hide both their reasoning and their ceiling effects. BINEVAL decomposes each evaluation criterion into atomic yes-or-no questions, answers each independently per output, then aggregates the verdicts into calibrated multi-dimensional scores. Every question-level verdict is inspectable, so you can diagnose exactly why an output scored low, and the same verdicts feed straight back as targeted prompt-improvement signal. Across SummEval, Topical-Chat, and QAGS, it matches or beats UniEval and G-Eval, training-free, with especially strong results on factual consistency. Paper: https://arxiv.org/abs/2606.27226 Learn to build effective AI agents in our academy: https://academy.dair.ai/

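Translation: BINEVAL is a new LLM-as-Judge evaluation method that addresses the way holistic scores hide their reasoning and ceiling effects. It decomposes each evaluation criterion into atomic yes/no questions, answers them independently for each output, then aggregates them into calibrated multi-dimensional scores. Every question-level verdict is inspectable, which pinpoints why an output scored low and serves directly as a prompt-improvement signal. On the SummEval, Topical-Chat, and QAGS benchmarks it matches or beats UniEval and G-Eval without training, with especially strong factual consistency results. Paper: https://arxiv.org/abs/2606.27226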
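
The decompose-then-aggregate idea can be illustrated with a small Python sketch. It assumes the yes/no verdicts have already been produced by a judge model, and it aggregates them with a plain per-dimension mean; the post says the real scores are calibrated, so the mean is a stand-in, and the names `aggregate_verdicts` and `failed_questions` are hypothetical.

```python
from collections import defaultdict


def aggregate_verdicts(verdicts):
    """Turn (dimension, question, answer) verdicts into per-dimension scores.

    Each score is the fraction of questions answered "yes" for that dimension.
    This simple mean is an assumption, not the calibration used in the paper.
    """
    yes_count = defaultdict(int)
    total = defaultdict(int)
    for dimension, _question, answer in verdicts:
        total[dimension] += 1
        if answer:
            yes_count[dimension] += 1
    return {dim: yes_count[dim] / total[dim] for dim in total}


def failed_questions(verdicts):
    """List the questions answered "no", which explain a low score."""
    return [(dim, q) for dim, q, answer in verdicts if not answer]


# Made-up verdicts for one summary.
verdicts = [
    ("consistency", "Does every number match the source?", True),
    ("consistency", "Are all named entities in the source?", False),
    ("fluency", "Is the summary free of grammar errors?", True),
]
dimension_scores = aggregate_verdicts(verdicts)
diagnosis = failed_questions(verdicts)
print(dimension_scores)
print(diagnosis)
```

Because each verdict is kept, the list of failed questions doubles as the targeted feedback the post mentions.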

Rohan Paul@rohanpaul_ai · 5 days ago · 54

Fantastic, @deepseek_ai just published their new inference optimization method. The paper proposes DSpark, a semi-parallel speculative decoding system that gave DeepSeek-V4 about 60% to 85% faster per-user generation at matched throughput. The biggest idea in DSpark is that faster inference is not just about drafting more tokens, but about deciding which drafted tokens are worth checking. Speculative decoding already had the basic trick: a smaller draft model guesses several next tokens, then the real model checks them in 1 pass. The problem is that long draft blocks often waste work, because later guesses are more likely to be wrong, and checking bad guesses still uses GPU capacity. DSpark’s breakthrough is to make this process selective: it drafts a block, scores how likely each prefix is to survive, then verifies only the part that is likely to pay off. The mechanism has 2 linked parts: a strong parallel draft model makes many token guesses quickly, then a tiny Markov head adjusts each guess using the token right before it. That small sequential piece matters because pure parallel drafters are fast, but their later tokens degrade: each position guesses without knowing what the earlier sampled token actually was, which can create bad token combinations later in the block. Then the confidence scheduler estimates how many drafted tokens should be checked for each request, based on both acceptance chance and current GPU load.

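Translation: DeepSeek proposes DSpark, a semi-parallel speculative decoding system that makes per-user generation on DeepSeek-V4 about 60% to 85% faster at the same throughput. The core innovation is selective verification: a draft model generates multiple candidate tokens in parallel, then a small Markov head adjusts each guess based on the previous token, compensating for the decline in token-combination quality late in purely parallel drafts. A confidence scheduler uses acceptance probability and GPU load to decide dynamically how many tokens to verify per request, avoiding wasted computation.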
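
The "score how likely each prefix is to survive, then verify only the part that pays off" step can be sketched as follows. This is an illustration under stated assumptions, not DSpark's scheduler: it treats the per-token acceptance estimates as independent, so a prefix survives with the product of its token probabilities, and it uses a single hypothetical threshold `min_survival` that a scheduler could raise when the GPU is busy.

```python
def prefix_survival(accept_probs):
    """Probability that each draft prefix is fully accepted.

    accept_probs[i] is an estimated chance that drafted token i is accepted,
    given that all earlier drafted tokens were accepted (assumed input).
    """
    survival = []
    running = 1.0
    for p in accept_probs:
        running *= p
        survival.append(running)
    return survival


def tokens_to_verify(accept_probs, min_survival):
    """Length of the longest prefix whose survival chance meets the threshold."""
    count = 0
    for chance in prefix_survival(accept_probs):
        if chance < min_survival:
            break
        count += 1
    return count


draft = [0.9, 0.8, 0.5, 0.5]             # made-up acceptance estimates
light_load = tokens_to_verify(draft, 0.3)  # lenient threshold
heavy_load = tokens_to_verify(draft, 0.5)  # stricter threshold under load
print(light_load, heavy_load)
```

The later tokens drop out first because survival can only shrink along the block, which matches the post's point that long draft blocks waste verification work.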

Yuchen Jin@Yuchenj_UW · 5 days ago · 38

DeepSeek is the GOAT. 🐳 They just published DSpark, a new speculative decoding method that boosts throughput by 51% to 400%. They also open-sourced DeepSpec, the training framework behind it. This is the real open AI.

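Translation: DeepSeek is the GOAT. 🐳 They just released DSpark, a new speculative decoding method that boosts throughput by 51% to 400%. They also open-sourced DeepSpec, the training framework behind it. This is the real open AI.

凡人小北@frxiaobei · 5 days ago · 63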

DeepSeek V4 received an update. It introduces DSpark, a new speculative decoding framework, with an 80% inference speedup. DSpark is already deployed on real production traffic for DeepSeek-V4 (Flash and Pro). Report: "DSpark: Confidence-Scheduled Speculative Decoding with Semi-Autoregressive Generation" https://github.com/deepseek-ai/DeepSpec/blob/main/DSpark_paper.pdf

Rohan Paul@rohanpaul_ai · 5 days ago · 50

LLMs can learn better coding behavior from problems with no known answers. Many real problems do not have a gold solution waiting in a database, especially in optimization, where the best answer may be unknown, expensive, or impossible to certify. Normal reinforcement learning works well when it can check a clear right answer, but that breaks down when the best answer is unknown. The paper’s method, called RiVER, lets the model write several programs, runs them on the same hidden tests, and rewards the programs that perform better than the others. The key trick is that RiVER does not trust raw scores directly, because some test cases naturally produce much bigger numbers and can distort training. Instead, it ranks programs within each test case, gives extra weight to the best one, and still gives smaller graded feedback to other valid programs. The authors trained models on 12 AtCoder Heuristic Contest tasks, and RiVER improved both score-based contest performance and normal pass-or-fail coding benchmarks. ---- Link – arxiv.org/abs/2606.27369 Title: "Reinforcement Learning without Ground-Truth Solutions can Improve LLMs"

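Translation: The paper proposes RiVER, a method that lets LLMs learn coding behavior from problems with no known reference answer. RiVER has the model write multiple programs, runs them on the same hidden tests, and rewards the better performers. The key is to rank programs within each test case, give the best one extra weight, and give other valid programs smaller graded feedback, so that differences in raw score magnitude do not distort training. On 12 AtCoder Heuristic Contest tasks, RiVER improved both score-based contest performance and standard pass/fail coding benchmarks. arXiv:2606.27369.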
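
The within-test-case ranking can be sketched in Python to show why it removes the score-scale problem. The exact reward shape is not given in the post, so everything specific here is an assumption: higher raw scores are taken to be better, invalid programs are marked `None` and get zero reward, valid programs get a rank fraction, and the best one gets a hypothetical `best_bonus` on top.

```python
def rank_rewards(raw_scores, best_bonus=0.5):
    """Convert raw scores on ONE test case into rank-based rewards.

    raw_scores: one entry per program; None marks an invalid program.
    Valid programs receive (number of valid programs they tie or beat) / n_valid,
    so even the worst valid program gets a small positive reward.
    """
    valid = [s for s in raw_scores if s is not None]
    rewards = []
    for score in raw_scores:
        if score is None:
            rewards.append(0.0)          # invalid program: no reward
            continue
        at_or_below = sum(1 for v in valid if v <= score)
        reward = at_or_below / len(valid)
        if score == max(valid):
            reward += best_bonus         # extra weight for the best program
        rewards.append(reward)
    return rewards


# Same ordering of programs, but test case B has scores 100x larger.
case_a = rank_rewards([10, 20, None, 30])
case_b = rank_rewards([1000, 2000, None, 3000])
print(case_a)
print(case_b)
```

Both test cases yield identical rewards, so a test case with naturally large numbers cannot dominate the training signal.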

Rohan Paul@rohanpaul_ai · 5 days ago · 46

This paper tests whether an older person’s everyday speech can become a useful cognitive monitoring twin, and mostly finds that it can. Here AI is trying to learn how one person talks across time, including rhythm, pauses, topic context, and small stylistic habits that ordinary clinical snapshots can miss. That matters because cognitive decline often leaks into language before it becomes obvious as a dramatic symptom. The real point is that the personalized model picked up small speech patterns linked to thinking ability, while a normal GPT answer mostly missed them. The paper shows that ordinary conversations could become a low-burden way to track cognitive health over time. ---- Link – arxiv.org/abs/2606.27334 Title: "Language-Based Digital Twins for Elderly Cognitive Assistance"

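Translation: The paper tests whether older adults' everyday speech can serve as an effective cognitive monitoring twin, and concludes it is largely feasible. By learning how an individual speaks over time (rhythm, pauses, topics, stylistic habits), the AI captures small patterns that clinical snapshots tend to miss; cognitive decline often appears in language before obvious symptoms. The personalized model detects subtle speech changes related to thinking ability, while ordinary GPT answers mostly miss these signals. The study shows that everyday conversation could become a low-burden way to track cognitive health over the long term.

Ethan Mollick@emollick · 5 days ago · 81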

One of the recovered passages, read for the first time in two thousand years: “Having…strained ourselves to the utmost through research and learning…possessing the same practical wisdom…”

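Translation: One of the recovered passages, read for the first time in two thousand years: “Having…strained ourselves to the utmost through research and learning…possessing the same practical wisdom…”

Rohan Paul@rohanpaul_ai · 6 days ago · 60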

MIT study. Code volume surges by 300%, but output increases by only 30%: The AI dividend meets an awkward reality. The authors compare more than 100,000 GitHub developers before and after they start using 3 generations of AI coding tools, from autocomplete to more independent coding agents, and find that AI coding agents massively increase code production, but much less of that work becomes shipped software. Autocomplete raised commits by 40%, interactive coding agents raised them by 140%, and autonomous coding agents raised them by 180%. The 180% commit gain shrank to 50% for the number of projects and 30% for actual releases. The paper’s main idea is that software production has weak links, so faster code writing does not help as much when humans still need to review, connect, test, package, and ship the work. The marketplace evidence points the same way: more new apps appeared, but total usage did not rise, which means more software appeared without clear evidence that users adopted more software. The estimated "elasticity of substitution" is 0.25, i.e. for every big improvement in AI’s usefulness, only a small amount of human work can be replaced. AI can write code faster, but humans are still needed to decide what to build, check if the code works, connect it with the rest of the product, fix messy edge cases, and actually ship it. --- papers.ssrn.com/sol3/papers.cfm?abstract_id=6859839

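Translation: An MIT paper analyzes how three generations of AI coding tools affected 100,000+ GitHub developers: autocomplete raised commits by 40%, interactive agents by 140%, and autonomous agents by 180%, but the number of projects rose only 50% and actual releases only 30%. App marketplaces likewise saw a surge of new apps without a rise in total usage. The core reason is that software development has weak links: humans still have to decide on features, review code, test, integrate, and ship. The estimated elasticity of substitution is only 0.25, meaning that when AI capability improves a lot, only a small amount of human work can be replaced.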
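
To see why an elasticity of substitution of 0.25 implies a weak-link pattern, a standard CES (constant elasticity of substitution) production function can be evaluated numerically. This is a textbook illustration, not the paper's model: the equal input weights, the normalization of both inputs to 1, and the mapping of the 180% commit gain onto a 2.8x AI input are all assumptions for the example.

```python
def ces_output(ai_input, human_input, sigma=0.25, ai_share=0.5):
    """CES production: Y = (a * A^rho + (1 - a) * H^rho)^(1 / rho).

    sigma is the elasticity of substitution and rho = (sigma - 1) / sigma.
    With sigma = 0.25, rho = -3, so the scarcer input dominates output.
    """
    rho = (sigma - 1.0) / sigma
    inner = ai_share * ai_input ** rho + (1.0 - ai_share) * human_input ** rho
    return inner ** (1.0 / rho)


baseline = ces_output(1.0, 1.0)   # both inputs normalized to 1
boosted = ces_output(2.8, 1.0)    # AI-side input up 180%, human input fixed
gain = boosted / baseline - 1.0
print(f"output gain: {gain:.0%}")
```

Under these assumed weights, output cannot rise above 0.5 ** (-1/3), about 1.26 times the baseline, however large the AI input becomes, because the fixed human input is the binding weak link.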

Rohan Paul@rohanpaul_ai · 6 days ago · 38

Today’s edition of my newsletter just went out. 🔗 https://www.rohan-paul.com/p/openais-new-paper-shows-how-they 🗞️ OpenAI’s new paper shows how they are now seeing the first version of office work where agents do most of the execution. 🗞️ New report on "The State of the AI Economy" 🗞️ New York Times: OpenAI is now leaning toward a 2027 IPO because the public market is testing whether AI giants deserve trillion-dollar prices before they prove durable profits. 🗞️ Why Larger Models Learn More: Effects of Capacity, Interference, and Rare-Task Retention 🗞️ The Economist: AI has pushed the internet’s content machine into a new phase, with books, lawsuits, research papers, apps, and songs now being produced at volumes that old review systems were not built to handle. 🗞️New research from OpenAI reported a training result where RL on realistic human situations made models carry safer, more useful behavior into tasks they had not trained on. 🗞️ MIT study. Code volume surges by 300%, but output increases by only 30%: The AI dividend meets an awkward reality. 🗞️ Qwen just released Qwen-AgentWorld, a 35B open-weight world model that learns how terminals, browsers, Android devices, code repos, search systems, OS tools, and MCP servers respond when an AI agent takes an action.

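Translation: This issue of the newsletter covers several AI developments: a new OpenAI paper showing the first version of office work where agents do most of the execution; the NYT reporting that OpenAI is leaning toward a 2027 IPO; new OpenAI research finding that RL training on realistic human situations makes models safer and more useful on future tasks; an MIT study showing code volume surging 300% while output grows only 30%; and Qwen releasing Qwen-AgentWorld, a 35B-parameter open-weight world model that learns how terminals, browsers, Android devices, code repos, search systems, OS tools, and MCP servers respond to an AI agent's actions.

Ethan Mollick@emollick · 6 days ago · 46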

Finally, AI finds its ultimate uncontroversial use. A diffusion model trained on burger recipes "discovers the classic Big Mac without explicit supervision and generates novel burgers optimized for deliciousness, sustainability, or nutrition." ASI= automated slider intelligence

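Translation: Finally, AI has found its ultimate uncontroversial use. A diffusion model trained on burger recipes "discovers the classic Big Mac without explicit supervision and generates novel burgers optimized for deliciousness, sustainability, or nutrition." ASI = automated slider intelligence

AK@_akhaliq · 6 days ago · 28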

DanceOPD On-Policy Generative Field Distillation

Translation: DanceOPD: On-Policy Generative Field Distillation

AK@_akhaliq · 6 days ago · 40

ViQ Text-Aligned Visual Quantized Representations at Any Resolution

Translation: ViQ: Text-Aligned Visual Quantized Representations at Any Resolution

Microsoft Research@MSFTResearch · 6 days ago · 63

What do people actually do with AI at work? A new analysis of five million M365 Copilot conversations has answers. Scott Counts breaks it down in a new video. And dive into the analysis here: https://msft.it/6011vqpbL

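Translation: What do people actually do with AI at work? A new analysis of five million M365 Copilot conversations has the answers. Scott Counts explains it in a new video. Dive into the analysis here: https://msft.it/6011vqpbL

Anthropic@AnthropicAI · 6 days ago · 60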

To keep pace with AI progress, we're advancing how we study Claude's economic impact. Hourly sampling and survey data show us how the cadences of life shape usage, what people produce with Claude, and how perceptions of AI's impact may be changing. https://www.anthropic.com/research/economic-index-june-2026-report

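Translation: To keep pace with AI progress, we are advancing how we study Claude's economic impact. Hourly sampling and survey data show us how the rhythms of life shape usage patterns, what people produce with Claude, and how perceptions of AI's impact may be changing. https://www.anthropic.com/research/economic-index-june-2026-report

Epoch AI@EpochAIResearch · 6 days ago · 63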

What are the largest software engineering tasks AI can perform? To answer this, we built MirrorCode, our long-horizon SWE benchmark that lets AI code autonomously for days at a time. The best models complete some tasks we estimate would take human engineers several weeks.

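Translation: What are the largest software engineering tasks AI can perform? To find out, we built MirrorCode, a long-horizon SWE benchmark that lets AI code autonomously for days at a time. The best models completed some tasks that we estimate would take human engineers several weeks.

Microsoft Research@MSFTResearch · 6 days ago · 41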

Following up with the social copy I’ve drafted: What do people actually do with AI at work? A new analysis of five million M365 Copilot conversations has answers. Scott Counts breaks it down in a new video. And dive into the analysis here: https://msft.it/6015vUHsh

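Translation: Following up with the social copy I drafted: What do people actually use AI for at work? A new analysis of five million M365 Copilot conversations has the answers. Scott Counts gives a detailed walkthrough in a new video. Dive into the analysis here: https://msft.it/6015vUHsh

OpenBMB@OpenBMB · 6 days ago · 63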

Hybrid LLMs are everywhere now: full attention is mixed with efficient modules like SWA, Mamba-2, and GDN. But what does efficient attention actually do inside these models? 🧵 New work from THUNLP Lab & OpenBMB: "Rethinking the Role of Efficient Attention in Hybrid Architectures." Through scaling laws, mechanistic analysis, and design studies, they reach a counter-intuitive conclusion 👇 📄 arXiv: https://arxiv.org/abs/2606.15378 💻 Code: https://github.com/thunlp/rethinking-hybrid-attention 1️⃣Same destination, different speed: Efficient-attention design barely affects short-context Loss — all seven curves nearly overlap. But on long-context metric LongPPL, early-training gaps are large, with large-window SWA worst of all. With enough training, every hybrid converges to the full-attention level. 2️⃣Full attention carries retrieval: Restricting full attention's receptive field at inference spikes LongPPL across all hybrids; restricting efficient attention barely moves it. Even recurrent mixers with in-principle unbounded receptive fields (like GDN) store little long-range info in their states. Layer-wise probing shows the same pattern: retrieval gains concentrate in the full-attention layers. 3️⃣Large-Window Laziness: A large SWA window already covers most useful dependencies, so the model needn't push full attention to retrieve from afar—delaying retrieval-head formation. It's like a student who won't walk to the library when the reference book is already on the desk. Smaller windows force full attention to do the retrieval work, training it faster. 4️⃣A simple design that works: Apply NoPE to just the full-attention layers of a small-window SWA hybrid (SWA-128-NoPE). It substantially improves long-context performance with negligible short-context cost. 
Under an effective training budget, the bottleneck for the long-context capability of hybrid models is not how powerful the efficient attention module is—it is whether full attention's retrieval capability can be effectively activated. Furthermore, strengthening full attention itself can bring greater performance improvements. Read the full paper! 🚀 #AI #THUNLP #OpenBMB #LLM #Attention #LongContext #HybridArchitecture #NLP

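Translation: Tsinghua's Natural Language Processing Lab (THUNLP) and ModelBest's OpenBMB released a paper re-examining what efficient attention (such as SWA, Mamba-2, and GDN) actually does in hybrid LLM architectures. Findings: efficient-attention design barely affects short-context loss, but long-context LongPPL differs markedly; full attention carries retrieval, so restricting its receptive field sharply raises LongPPL, while restricting efficient attention has almost no effect. Large-window SWA makes the model lazy and delays the formation of retrieval ability. A simple method, applying NoPE only to the full-attention layers of a small-window SWA hybrid (SWA-128-NoPE), substantially improves long-context performance at very small short-context cost. The paper argues the bottleneck is whether full attention's retrieval ability can be effectively activated.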
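
The difference between full causal attention and sliding-window attention (SWA), and what "restricting the receptive field" means, can be shown with a small mask builder. This is a generic sketch of the two mask shapes, not code from the paper's repository; the function name `attention_mask` and the tiny sizes are chosen for the example.

```python
def attention_mask(seq_len, window=None):
    """Build a boolean mask where mask[i][j] says query i may attend to key j.

    window=None gives full causal attention (every earlier position is visible).
    window=w gives sliding-window attention: only the last w positions,
    including the current one, are visible.
    """
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            causal = j <= i
            in_window = window is None or (i - j) < window
            row.append(causal and in_window)
        mask.append(row)
    return mask


full = attention_mask(6)
swa = attention_mask(6, window=2)
visible_full = sum(full[5])   # keys visible to the last query, full attention
visible_swa = sum(swa[5])     # keys visible to the last query, window of 2
print(visible_full, visible_swa)
```

With a small window the last query cannot reach early positions at all, which is why, per the thread, the full-attention layers must carry long-range retrieval in such hybrids.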

AK@_akhaliq · 6 days ago · 44

Confidence-Aware Tool Orchestration for Robust Video Understanding

Translation: Confidence-Aware Tool Orchestration for Robust Video Understanding

Rohan Paul@rohanpaul_ai · 6 days ago · 44

LLM trading agents mostly fail when stock-market tests become long, broad, and fair. The authors built FINSABER, a stricter testing setup that checks LLM trading over about 20 years, across more stocks, and with better protection against cherry-picked results. They tested LLM systems such as FinMem and FinAgent against simple baselines like Buy and Hold, rule-based trading, forecasting models, and reinforcement learning methods. The main result is that LLM strategies can look good in narrow tests, but they usually fail to beat simple market strategies once the test becomes longer and fairer. The paper also finds that these LLMs behave badly across market conditions because they are too cautious when stocks are rising and too risky when stocks are falling. So current LLMs may understand financial text, but that does not mean they can reliably time the stock market. ---- Link – arxiv.org/abs/2505.07078v5 Title: "Can LLM-based Financial Investing Strategies Outperform the Market in Long Run?"

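Translation: The researchers built FINSABER, a stricter testing framework that evaluates LLM trading agents such as FinMem and FinAgent over about 20 years, across many stocks, and with protection against cherry-picked results. The results show that LLM strategies look good in narrow tests but usually fail in long, fair tests against simple baselines such as buy and hold, rule-based trading, forecasting models, and reinforcement learning. LLMs are too cautious when the market rises and too risky when it falls, which shows that understanding financial text is not the same as reliably timing the market. The paper indicates that current LLMs may not be able to beat simple market strategies over the long run.

Rohan Paul@rohanpaul_ai · 7 days ago · 67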

LLMs may not need human-style language, i.e. future AI systems might save context space by using dense model-readable messages instead of long normal prose. The authors propose BabelTele, a compressed writing style that can mix abbreviations, symbols, fragments from different languages, and unusual structure. To a capable language model, it can still carry enough structure to answer questions, preserve memory, and pass information between agents. The point is that human readability, natural-language fluency, and machine recoverability are separable properties. Human prose carries redundancy because humans need rhythm, grammar, context, and reassurance. Models trained on huge symbolic mixtures may not need all of that scaffolding every time. In the paper’s strongest result, BabelTele keeps about 99.5% semantic fidelity while shrinking text to 27.9% of its original length. ---- Link – arxiv.org/abs/2606.19857 Title: "LLMs Do Not Always Need Readable Language"

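Translation: The new paper "LLMs Do Not Always Need Readable Language" proposes BabelTele, a compressed writing style in which communication between LLMs mixes abbreviations, symbols, multilingual fragments, and unconventional structure in place of long human natural-language text. Even without human readability, models can still answer questions, retain memory, and pass information between agents. Strongest result: BabelTele keeps about 99.5% semantic fidelity while compressing text to 27.9% of its original length.

Chubby♨️@kimmonismus · 7 days ago · 60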

IBM just unveiled a sub-1 nanometer chip breakthrough. That honestly wasn't on my bingo card. Its new 0.7 nm / 7 angstrom technology uses a 3D "nanostack" transistor architecture to vertically stack and stagger transistors. IBM says it can fit nearly 100 billion transistors onto a chip the size of a fingernail, almost 2x the density of its 2 nm chip from 2021. • up to 50% more performance • or 70% better energy efficiency • 40% SRAM scaling for AI workloads Important caveat: this is still research, not a chip shipping tomorrow. IBM says production could begin within the next 5 years at the earliest.

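Translation: IBM unveiled what it describes as the world's first sub-nanometer-node chip technology breakthrough: a 0.7 nm (7 angstrom) process that uses a 3D "nanostack" transistor architecture to stack and stagger transistors vertically. The technology can fit nearly 100 billion transistors on a fingernail-sized chip, about twice the density of the 2 nm chip from 2021. Compared with the previous generation, performance can rise by 50% or energy efficiency by 70%, and SRAM scaling reaches 40% to suit AI workloads. IBM stresses that this is still at the research stage, with volume production possible within the next 5 years at the earliest.

elvis@omarsar0 · 7 days ago · 41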

New research from Meta. Building synthetic training data has stayed a fixed pipeline that you hand-tune and then freeze. Autodata casts an AI agent as a data scientist that builds training and evaluation data, with an implementation called Agentic Self-Instruct that extends classic Self-Instruct with agentic planning and tool use. Think of it as meta-optimization, where the data scientist agent is itself trained to produce stronger data, so the pipeline keeps improving instead of staying static. Across computer science research, legal reasoning, and reasoning over mathematical objects, it beats classical synthetic-data methods, and meta-optimizing the agent delivers an even larger uplift. Paper: https://arxiv.org/abs/2606.25996 Learn to build effective AI agents in our academy: https://academy.dair.ai/

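Translation: Meta released new research, Autodata, which proposes the Agentic Self-Instruct method. It treats an AI agent as a data scientist and uses agentic planning and tool use to replace the traditional synthetic-data pipeline that is hand-tuned and then frozen. The agent itself can keep improving through meta-optimization and so generate stronger training data. In experiments across three domains (computer science, legal reasoning, and reasoning over mathematical objects) it beats classical synthetic-data methods, and meta-optimization brings an even larger gain. Paper on arXiv.

Hao AI Lab@haoailab · 7 days ago · 52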

Introducing JetSpec: we find speculative decoding can push LLM generation latency to the extreme by co-optimizing drafting cost and drafting quality with causal parallel tree drafting. JetSpec reaches up to 9.64x end-to-end speedup on MATH-500 and 4.58x on open-ended chat while remaining lossless. With CUDA graph and kernel optimizations, JetSpec further translates to around 1000 TPS on a single B200. ⚡️ Check out our project page for demos and a blog post on how we built it 👇 https://jetspec-project.github.io/jetspec-web/ https://haoailab.com/blogs/parallel-tree-decoding/

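Translation: Sky Computing Lab introduces JetSpec, a speculative decoding method that jointly optimizes drafting cost and quality through causal parallel tree drafting and can push LLM generation latency to the extreme. It reaches up to 9.64x end-to-end speedup on MATH-500 and 4.58x on open-ended chat while staying lossless. Combined with CUDA graph and kernel optimizations, it achieves about 1000 TPS on a single B200.

Rohan Paul@rohanpaul_ai · 7 days ago · 80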

OpenAI just released a paper showing how they are now seeing the first version of office work where agents do most of the execution. Codex has become its main work AI, producing 99.8% of internal output tokens after sitting below 10% a year earlier. The striking part is not engineering use, because Codex began as a coding tool, but the fast rise in Legal, Finance, Recruiting, Support, and business teams. Non-developer use rose 137x for individuals and 189x for organizations since Aug-25, which means agents are spreading wherever work has repeatable steps, files, rules, and messy follow-through. Internal users at the 99th percentile now run about 71 hours of agent work per day by managing parallel tasks, turning AI from a chat box into a pool of delegated labor. Users are changing the work unit itself, since 70.2% of sampled individuals sent a request above 1 hour of human work and 25.6% sent one above 8 hours. Heavy users no longer wait for one answer: 28.6% of OpenAI users managed 5+ concurrent agents.

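Translation: OpenAI published an internal paper showing that Codex has become the company's main AI, producing 99.8% of internal output tokens, up from below 10% a year earlier. Beyond engineering, usage is growing quickly in legal, finance, recruiting, support, and business teams. Since Aug-25, non-developer individual use has grown 137x and organizational use 189x. Heavy users run about 71 hours of agent tasks per day, 28.6% of users manage 5 or more concurrent agents, and 25.6% of individuals have submitted a task equivalent to more than 8 hours of human work. OpenAI says agents are making work more complex, longer-horizon, and more cross-functional.

Rohan Paul@rohanpaul_ai · 7 days ago · 47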

Very important Meta paper introduces Autodata, an agentic data scientist that creates high-quality synthetic data. The main result is that agent-made data usually trained models better than standard synthetic data, and in legal tasks a trained 4B model beat a much larger 397B baseline. The paper treats synthetic data generation as a job for an agentic data scientist, not a prompt template. Its method, “Agentic Self-Instruct,” makes AI agents generate and meta-optimize synthetic training and evaluation data, improving performance over classical synthetic data methods across CS, legal, and math benchmarks. Autodata’s loop is simple: generate an example, let a weak model and a strong model try it, judge the results, then revise the recipe until the example sits in the useful zone. This is the best idea in the paper: difficulty is not a virtue by itself. A task should not just be “hard”; it should be hard in a way that teaches the weaker model something. If the weak model always gets it right, there is nothing to learn; if it always gets zero, there is also nothing to learn. --- The direction feels important because it reframes synthetic data from bulk imitation into curriculum design. The next frontier may not be models writing more examples, but models learning what makes an example worth learning from. ---- Link – arxiv.org/abs/2606.25996v1 Title: "Autodata: An agentic data scientist to create high quality synthetic data"

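Translation: Meta proposes Autodata, which treats synthetic data generation as a task for an agentic data scientist. The core method, “Agentic Self-Instruct,” has AI agents generate and meta-optimize synthetic training and evaluation data. The loop: generate an example, let a weak model and a strong model each try it, judge the results, and revise the recipe until the example sits in the useful zone. The paper stresses that difficulty is not a virtue; examples should target what the weak model can learn. Key result: on legal tasks, a trained 4B model beat a much larger 397B baseline.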
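
The "useful zone" test at the heart of that loop can be sketched as a simple filter. The post does not give the actual criterion, so this is an assumed version: an example is kept when the weak model neither always succeeds nor always fails and the strong model does better than the weak one. The function name `in_useful_zone` and the 0.1 and 0.9 bounds are hypothetical.

```python
def in_useful_zone(weak_rate, strong_rate, low=0.1, high=0.9):
    """Decide whether a generated example is worth keeping.

    weak_rate / strong_rate: fraction of attempts each model got right.
    low / high: assumed bounds on the weak model's success rate.
    """
    weak_can_learn = low <= weak_rate <= high      # not trivial, not hopeless
    strong_is_better = strong_rate > weak_rate     # solvable, with headroom
    return weak_can_learn and strong_is_better


# Made-up judged results for three generated examples.
candidates = [
    {"id": "too_easy", "weak": 1.0, "strong": 1.0},
    {"id": "too_hard", "weak": 0.0, "strong": 0.0},
    {"id": "useful", "weak": 0.4, "strong": 0.9},
]
kept = [c["id"] for c in candidates if in_useful_zone(c["weak"], c["strong"])]
print(kept)
```

In the loop described in the post, examples that fail such a check would send the agent back to revise its generation recipe instead of simply being discarded.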

Rohan Paul@rohanpaul_ai · 7 days ago · 62

This study tests how often LLMs invent answers when they should rely only on supplied documents. The problem is that companies often use LLMs to answer questions from documents, and they assume document-based LLM systems are safer because the model is given source material. This study shows that no model fully avoided fabrication, because even the best model made up answers 1.19% of the time at 32K context. For strong models, a more normal best-case rate was around 5% to 7%, while the middle model fabricated about 25% of answers to questions about facts that did not exist. Longer context made the problem much worse, and at 200K context every tested model fabricated at least 10% of the time. The study also shows that hallucination is not just a failure to retrieve the right sentence. A model can be good at finding real facts and still be too willing to answer when the requested fact is absent. ---- Link – arxiv.org/abs/2603.08274 Title: "How Much Do LLMs Hallucinate in Document Q&A Scenarios? A 172-Billion-Token Study Across Temperatures, Context Lengths, and Hardware Platforms"

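Translation: A 172B-token study measured how often LLMs fabricate answers in document Q&A scenarios. Key findings: the best model fabricated 1.19% of the time at 32K context; strong models were typically at 5% to 7%; the middle model fabricated answers about nonexistent facts 25% of the time. When context was extended to 200K, every model fabricated at least 10% of the time. Longer context significantly worsens hallucination. The study shows that hallucination is not only a retrieval failure: a model that can find facts correctly still tends to over-answer when the fact is absent.

jason@jxnlco · 7 days ago · 47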

tldr:

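Translation: Codex usage at OpenAI gives us a preview of what future agentic work may look like. In a new paper, OpenAI's economic research team looks at the broader shift from chat to delegation: people use AI agents not only to get answers but also to delegate longer, more complex work. https://openai.com/index/how-agents-are-transforming-work

Epoch AI@EpochAIResearch · 7 days ago · 31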

What are the strategies of Chinese AI companies? To understand this better, @cherylwoooo, @datagenproc, and @ansonwhho scraped >1600 job postings from six major Chinese firms. Here’s what they learned. 🧵

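Translation: What strategies are Chinese AI companies pursuing? To understand this better, @cherylwoooo, @datagenproc, and @ansonwhho scraped more than 1,600 job postings from six major Chinese firms. Here is what they found. 🧵

AK@_akhaliq · 7 days ago · 27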

DomainShuttle Freeform Open Domain Subject-driven Text-to-video Generation

Translation: DomainShuttle: Freeform Open Domain Subject-driven Text-to-video Generation

Microsoft Research@MSFTResearch · 7 days ago · 30

Researchers introduce generative causal testing, which translates black box models into clear hypotheses and verifies them in the scanner, revealing what specific brain regions respond to in language. https://msft.it/6011vUtRd

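Translation: Researchers introduce generative causal testing, which translates black-box models into clear hypotheses and verifies them in the scanner, revealing how specific brain regions respond to language.

AK@_akhaliq · 7 days ago · 24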

Are We Ready For An Agent-Native Memory System?

Translation: Are We Ready For An Agent-Native Memory System?

Rohan Paul@rohanpaul_ai · 7 days ago · 49

Great Stanford + MIT + Harvard + Anthropic paper. It gives a clear training-based reason for why larger models learn abilities smaller models miss. It says bigger AI models learn rare skills because they forget them less during training: their extra space protects weak learning signals. The authors say the issue is not just whether a small model could represent the task, but whether training lets it keep that task while many common tasks keep pushing on the same limited parts. Their core idea is that common tasks take up the model’s neurons first, so rare tasks get overwritten before they appear often enough to build into stable knowledge. In a crowded data mixture, common patterns get first claim on the model’s internal machinery. Small models may briefly pick up a rare signal, but the next wave of common-task updates overwrites it before the signal appears again. They tested this first with controlled toy tasks where they could change how rare and complex each task was, then with OLMo language models from 4M to 4B parameters. The main result is that bigger models learned low-frequency tasks much better, kept more task features inside their representations, and showed less gradient interference, which means common-task updates disturbed rare-task learning less. Larger models can remember weak rare signals long enough to turn them into real learned skills. ---- Link – arxiv.org/abs/2605.29548 Title: "Why Larger Models Learn More: Effects of Capacity, Interference, and Rare-Task Retention"

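Translation: A joint paper from Stanford, MIT, Harvard, and Anthropic explains from the training side why larger models are more capable: they forget less, and the extra capacity protects weak learning signals. Common tasks claim neurons first, so rare tasks are overwritten before they appear often enough. Small models may briefly pick up a rare signal, which is then overwritten by common-task updates. Experiments use OLMo models (4M to 4B parameters) and show that larger models learn low-frequency tasks better, retain more task features, and have less gradient interference.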
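
One common way to make "gradient interference" concrete is the cosine similarity between the gradient of a common task and the gradient of a rare task: a negative value means a step that helps the common task undoes progress on the rare one. The post does not say which measure the paper uses, so the sketch below is an assumed illustration with made-up two-dimensional gradients.

```python
import math


def cosine(u, v):
    """Cosine similarity between two gradient vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


common_grad = [1.0, 0.0]         # update direction wanted by a common task
rare_conflicting = [-1.0, 0.0]   # rare task wants the opposite direction
rare_orthogonal = [0.0, 1.0]     # rare task uses a separate direction

conflict = cosine(common_grad, rare_conflicting)
no_conflict = cosine(common_grad, rare_orthogonal)
print(conflict, no_conflict)
```

In this toy picture, the conflicting case corresponds to tasks competing for the same limited parameters, and the orthogonal case to a model with enough room to keep them apart.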

Ethan Mollick@emollick · Jun 25 · 52

A lot of people who say they never use AI are using AI, but secretly. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5464215

Translation: Many people claim they never use AI but are in fact using it in secret. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5464215

elvis@omarsar0 · Jun 25 · 46

// Critique of the Agent Model // Finally, a paper that tries to define what an agent is and what agency consists of. Good read overall. (great bookmark) The word agent now covers everything from a for-loop with tool calls to speculative machine superintelligence. Eric Xing and colleagues ask where automation ends, and agency begins. Drawing on Descartes and on science-fiction portrayals of autonomous beings, they analyze agent architectures along five dimensions: goal, identity, decision-making, self-regulation, and learning. The argument is that genuine agency requires these structures to hold together in a specific way. Great paper overall, providing a vocabulary for arguing about what is and is not an agent. Paper: https://arxiv.org/abs/2606.23991 Learn to build effective AI agents in our academy: https://academy.dair.ai/

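Translation: Elvis Saravia recommends a paper that tries to pin down the definition of “agent.” Eric Xing and colleagues, drawing on philosophy and science fiction, analyze agent architectures along five dimensions: goal, identity, decision-making, self-regulation, and learning. The paper argues that genuine agency requires these dimensions to combine in a specific way, which separates automation from agents. Paper: arxiv.org/abs/2606.23991.

Rohan Paul@rohanpaul_ai · Jun 25 · 43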

Intelligence may be less about bigger models and more about better knowledge structures. This paper argues that current AI is being built mostly on network mathematics, not on a theory of knowledge. A human brain makes fast, adaptive decisions on roughly the power of a dim light bulb, while frontier AI often buys competence with enormous computation. The paper says biological intelligence may be efficient because it organizes meaning around goals, context, and decisions, instead of mainly searching through language patterns. It separates mental activity into physical cognition, emotional cognition, mental cognition, and intelligence, where intelligence means making useful decisions while the situation still matters. The proposed answer is Synthetic Intelligence, which would use structured semantic knowledge, meaning information tied to purpose, rather than only syntax, statistics, or neural network weights. The paper uses Asymmetric Information Resolution models to show how knowledge can be arranged into decision maps, with a simple predator-prey example where each state has only a few possible moves.

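Translated summary: The paper argues that current AI is built mainly on network mathematics rather than on a theory of knowledge. The human brain makes fast, adaptive decisions on very little power, while frontier AI depends on enormous compute. Biological intelligence is efficient because it organizes meaning around goals, context, and decisions. The paper divides mental activity into physical cognition, emotional cognition, mental cognition, and intelligence, where intelligence means making useful decisions while the situation still matters. The proposed "Synthetic Intelligence" would use structured semantic knowledge (information tied to purpose) rather than relying only on syntax, statistics, or neural network weights. Asymmetric Information Resolution models show how knowledge can be organized into decision maps, using a predator-prey example in which each state has only a few possible moves.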

AK @_akhaliq · Jun 25 · 52

Qwen-AgentWorld: Language World Models for General Agents

Translated summary: Qwen-AgentWorld, a language world model designed for general agents.

OpenBMB @OpenBMB · Jun 24 · 36

LLMs don't just hallucinate because they lack knowledge; they hallucinate because they don't know what they don't know. Existing knowledge augmentation blindly injects more data, treating every error as a knowledge gap. But overconfident wrong answers and uncertain correct ones reveal a deeper problem: cognitive misalignment. 🤔

Today, we dive into Know More, Know Clearer, a meta-cognitive framework by @TsinghuaNLP (OpenBMB member) alongside researchers from Harbin Institute of Technology and Northeastern University. The team proposes a unified system that diagnoses a model's cognitive state and applies targeted intervention, not indiscriminate knowledge stuffing.

📄 arXiv: https://arxiv.org/abs/2602.12996
🤗 Paper: https://huggingface.co/papers/2602.12996

Why it matters:

1. The Structural Decay Law: A Universal Foundation: The team discovers that accuracy exhibits a stable exponential decay relative to uncertainty: E[Acc|U] ≈ a·exp(−U) + b. Validated across 6 architectures (Qwen, Llama, Mistral), this proves internal confidence signals structurally encode performance, not random noise, providing a rigorous basis for meta-cognitive optimization.

2. Know More (CGKE): Differentiated, Not Indiscriminate: Rather than uniform knowledge injection, the framework partitions the knowledge space into Mastered, Confused, and Missing regions via self-sampled behavioral profiling. Each region receives a tailored augmentation strategy (boundary expansion, structural disambiguation, or epistemic foundation), targeting exactly where the model needs it most. Ablation shows removing the "Confused" category causes the largest performance drop.

3. Know Clearer (CDKC): Aligning Confidence with Correctness: A cognitive consistency alignment mechanism built on GRPO actively recalibrates the model's confidence landscape, sharpening distributions on correct paths and dispersing them on incorrect ones. Result: average ECE drops from 60.41 to 24.34, and the model learns to genuinely know its own limits rather than learning to refuse everything.

4. Results: 24.59-Point Gain and True Self-Knowledge: On 11 QA benchmarks, CDKC (2-round) lifts Llama-3.1-8B from 30.91% to 55.50% (+24.59 pts) and Qwen2.5-7B from 25.76% to 48.29% (+22.53 pts). On self-knowledge benchmarks, the framework achieves a CBS of 73.43% and CAE of 68.18%, delivering 63.37% correct answering decisions while maintaining 79.07% boundary recognition, the best balance of any method tested.

Knowledge augmentation is not merely about knowing more; it's about knowing more clearly. This framework sets a new standard for reliable, calibrated knowledge in LLMs.

#AI #THUNLP #OpenBMB #LLM #KnowledgeAugmentation #Hallucination #MetaCognition #NLP
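For readers unfamiliar with the ECE figure quoted above, here is a minimal sketch of the standard expected calibration error with equal-width confidence bins. The function name, the 10-bin default, and the toy inputs are assumptions for illustration; the post does not say which binning the paper uses, and the reported 60.41 and 24.34 are presumably this quantity expressed as a percentage.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard ECE: bin-size-weighted mean of |accuracy - mean confidence|.

    confidences: floats in [0, 1], the model's stated confidence per answer.
    correct:     booleans, whether each answer was right.
    n_bins:      number of equal-width bins (assumed; not given in the post).
    """
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")

    # Group predictions by confidence bin; confidence 1.0 goes in the last bin.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


# Overconfident toy model: says 0.9 every time but is right once in four.
overconfident = expected_calibration_error([0.9] * 4, [True, False, False, False])
# Perfectly calibrated toy model: fully confident and always right.
calibrated = expected_calibration_error([1.0, 1.0], [True, True])
```

The function returns a fraction, so multiply by 100 to compare with percentage-style numbers. A lower value means stated confidence tracks actual accuracy more closely.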

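Translated summary: ModelBest's OpenBMB, together with Tsinghua NLP, Harbin Institute of Technology, and Northeastern University, proposes the meta-cognitive framework Know More, Know Clearer to address LLM hallucinations caused by cognitive misalignment. The framework has three parts: the Structural Decay Law (accuracy decays exponentially with uncertainty); Know More (CGKE), which splits the knowledge space into Mastered, Confused, and Missing regions for targeted augmentation; and Know Clearer (CDKC), which aligns confidence using GRPO and lowers average ECE from 60.41 to 24.34. On 11 QA benchmarks, CDKC lifts Llama-3.1-8B from 30.91% to 55.50% (+24.59 pts) and Qwen2.5-7B from 25.76% to 48.29% (+22.53 pts). On self-knowledge benchmarks, CBS reaches 73.43% and CAE 68.18%, with 63.37% correct answering decisions and 79.07% boundary recognition, the best balance among the methods tested.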
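The decay law quoted in the post, E[Acc|U] ≈ a·exp(−U) + b, is linear in a and b once exp(−U) is treated as the input feature, so it can be fitted with ordinary least squares. The sketch below is only an illustration under that reading: the function name and the synthetic data are assumptions, and the post does not describe how the authors actually estimate a and b.

```python
import math


def fit_decay_law(uncertainties, accuracies):
    """Least-squares fit of accuracy ≈ a * exp(-U) + b.

    Treats x = exp(-U) as the regressor, so this is plain simple
    linear regression of accuracy on x (an assumed fitting procedure).
    """
    xs = [math.exp(-u) for u in uncertainties]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(accuracies) / n

    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0.0:
        raise ValueError("uncertainties must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, accuracies))

    a = sxy / sxx             # slope on exp(-U)
    b = mean_y - a * mean_x   # accuracy floor as U grows large
    return a, b


# Synthetic, noise-free data generated from a = 0.6, b = 0.2 (made-up values).
toy_u = [0.0, 0.5, 1.0, 2.0, 3.0]
toy_acc = [0.6 * math.exp(-u) + 0.2 for u in toy_u]
a_hat, b_hat = fit_decay_law(toy_u, toy_acc)
```

Under this form, b is the accuracy the model approaches at high uncertainty and a + b is the accuracy at zero uncertainty.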

Ant Ling @AntLingAGI · Jun 24 · 41

Great breakdown from Qian. In our recent UFP4 paper, we show that a uniform-grid FP4 recipe achieves lower BF16-relative loss degradation than strong E2M1 baselines across Dense 1.5B, MoE 7.9B, and MoE 124B long-run pretraining. Full paper: https://arxiv.org/abs/2606.20381

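Translated summary: Ant Ling has published the UFP4 paper, which proposes a uniform-grid FP4 training recipe. Across Dense 1.5B, MoE 7.9B, and MoE 124B long-run pretraining, the recipe achieves lower BF16-relative loss degradation than strong E2M1 baselines. The paper notes that once fine-grained scaling and RHT are applied, the bottleneck in FP4 training shifts from dynamic range to local resolution; E1M2/INT4 formats make better use of the bucket allocation that RHT improves, whereas E2M1 can make RHT harmful. Paper: https://arxiv.org/abs/2606.20381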
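To make the grid comparison concrete, here is a toy sketch of rounding one block of values onto two 4-bit magnitude grids with per-block absmax scaling. The E2M1 magnitudes are the standard FP4 E2M1 value set; the uniform grid 0..7 is an INT4-like stand-in chosen for illustration and is not claimed to be the paper's UFP4 format. The function names, the scaling scheme, and the sample block are all assumptions, and RHT is not modeled.

```python
import math

# Representable magnitudes of FP4 E2M1 (sign handled separately).
E2M1_LEVELS = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
# Assumed INT4-like uniform grid with the same number of magnitudes.
UNIFORM_LEVELS = [float(i) for i in range(8)]


def quantize_block(block, levels):
    """Absmax-scale a block onto `levels`, round to nearest, then rescale."""
    absmax = max(abs(x) for x in block)
    if absmax == 0.0:
        return [0.0 for _ in block]
    scale = absmax / levels[-1]  # largest value maps to the top level
    out = []
    for x in block:
        target = abs(x) / scale
        mag = min(levels, key=lambda q: abs(target - q))
        out.append(math.copysign(mag * scale, x))
    return out


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Evenly spread toy block: the uniform grid represents it exactly, while
# E2M1 spends its levels near zero and is coarser near the maximum.
block = [float(i) for i in range(8)]
err_uniform = mse(block, quantize_block(block, UNIFORM_LEVELS))
err_e2m1 = mse(block, quantize_block(block, E2M1_LEVELS))
```

This only shows that the two grids place their resolution differently; it does not reproduce the paper's pretraining results, and a block dominated by a few outliers could favor the non-uniform grid instead.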