Opus 4.8 is live! Even in Germany!!

译Opus 4.8 已上线！甚至在德国也能用了！！

Chubby♨️@kimmonismus · 5月29日70

Thank god! I can turn off adaptive thinking and set reasoning effort myself. Finally!

译太好了！我可以关闭自适应思考并自行设置推理强度了。终于！

Claude@claudeai · 5月29日82

Introducing Claude Opus 4.8: it builds on Opus 4.7 with sharper judgment, more honesty about its own progress, and the ability to work independently for longer than its predecessors. Available today at the same price.

译介绍 Claude Opus 4.8：它在 Opus 4.7 基础上，拥有更敏锐的判断力、对自身进展更诚实，并且能比前代更长时间独立工作。今日发布，价格不变。

swyx@swyx · 5月29日67

"Developers can update Claude’s instructions mid-task without breaking the prompt cache or routing the update through a user turn" wtf? how??

译开发者可以在任务执行过程中更新Claude的指令，而不会破坏提示词缓存或需要通过用户轮次来传递更新。

Chubby♨️@kimmonismus · 5月29日70

Let’s go: so it’s opus 4.8 plus codex update!

译来吧：是Opus 4.8加上Codex更新！

AK@_akhaliq · 5月28日55

Agent Explorative Policy Optimization for Multimodal Agentic Reasoning

译多模态智能体推理的探索性策略优化

Xiaomi MiMo@XiaomiMiMo · 5月28日69

MiMo-V2.5 is now available in OpenCode — free for a limited time. 🎉

译MiMo-V2.5现已在OpenCode上线——限时免费。🎉 [引用 @opencode]：OpenCode x MiMo V2.5 - 限时免费 1M上下文 • 推理 • 文本 • 图像

Rohan Paul@rohanpaul_ai · 5月28日59

NVIDIA published a report on Vera CPU benchmarks, done by Phoronix. It compares Vera directly against a current 128-core x86 CPU and claims a 1.5x overall performance advantage. Compared with the prior-generation NVIDIA Grace CPU, Vera delivered a 1.6x geometric mean increase in Phoronix’s testing. Vera delivered over 4x the memory bandwidth per core compared with traditional x86 CPUs. Vera delivers 1.2TB/s bandwidth using LPDDR5X, while keeping memory power under 30W, compared with more than 100W for many DDR5 server setups. ---- To note, Vera uses Armv9.2, not x86, so NVIDIA is basically saying that their Arm-based CPU can beat the usual Intel and AMD server CPUs. For agentic AI, this CPU-side work becomes much heavier because the AI is not only generating text, but also calling tools, reading files, writing code, using browsers, running sandboxes, and managing workflows.

译NVIDIA发布Vera CPU基准测试报告。Vera采用Armv9.2架构，在Phoronix测试中，其整体性能比128核x86 CPU高1.5倍，比前代Grace CPU提升1.6倍（几何平均）。其每核心内存带宽是传统x86 CPU的4倍以上，使用LPDDR5X实现1.2TB/s带宽，内存功耗低于30W。该报告旨在表明NVIDIA的Arm架构CPU性能已超越Intel和AMD的x86服务器CPU，并强调在智能体AI场景下，因涉及工具调用、文件读写、代码生成等复杂任务，CPU侧工作负载变得更重。

Noam Brown@polynoamial · 5月28日62

After AlphaGo, the skill of human Go players noticeably improved. I suspect we will see a similar pattern in math.

译AlphaGo之后，人类围棋选手的水平显著提升。我怀疑我们将在数学领域看到类似的模式。

Alibaba Cloud@alibaba_cloud · 5月28日62

📢Qwen3.7-Max just hit #3 on ITbench-AA — a fresh benchmark testing how well models handle real-world enterprise IT tasks, agentic-style. 🔧Agentic era, go with Qwen.🏃🏃 API: https://int.alibabacloud.com/m/1000413314/

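译 The Qwen team announced that its Qwen3.7-Max model ranks third on the new ITBench-AA benchmark. The benchmark, launched by Artificial Analysis together with IBM Research, evaluates how well models solve real-world enterprise IT tasks and currently focuses on site reliability engineering (SRE). It contains 59 Kubernetes fault-diagnosis tasks. Claude Opus 4.7 ranks first with a score of 47%, GPT-5.5 (xhigh) follows closely at 46%, and Qwen3.7-Max is third at 42%. Every frontier model scores below 50%, which suggests the benchmark is challenging.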

Tibo@thsottiaux · 5月28日63

Excited to see more independent benchmarks like that which are not contaminated (trained on by major models).

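译 The newly released independent benchmark DeepSWE produces results that are closer to developers' day-to-day experience. On its coding tasks GPT-5.5 scores 70% while Claude Sonnet scores 32%, a wide gap. DeepSWE focuses on a core capability of AI agents in real workflows: whether, given only a short prompt, the agent can locate the right place in a codebase and complete the change cleanly without the user listing specific files. The original post says this confirms what many developers have long observed, and it criticizes SWE-Bench for often failing to reflect real capability because of dataset contamination and weak verification.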

Rohan Paul@rohanpaul_ai · 5月28日65

Long-running language agents may work better if they periodically stop to consolidate memory. The problem is that today’s transformer agents get slower and more expensive as their context grows, because attention has to keep checking more past tokens. The usual fix for long context is to keep more tokens nearby, but that turns every next-token prediction into a larger search through the past. The sharper idea here is that memory is not only storage. Sometimes the hard part is converting a messy stretch of experience into a state that can actually be used later. So the paper’s idea is to add a sleep phase, where the model pauses, rereads recent context several times, writes the useful information into fixed-size memory layers, and then clears the short-term attention cache. During sleep, the model runs several offline passes over recent context, writes the result into fast weights inside its state-space blocks, then clears the attention cache. This means the model pays extra compute while sleeping, not while answering, so normal prediction can still happen with 1 forward pass. The authors test this on cellular automata, graph lookup, and GSM-Infinite math problems, where the model must use old information that is no longer sitting in its attention cache. The main result is that longer sleep improves performance, especially on harder cases that need deeper reasoning rather than just remembering a fact. The big deal is that long-horizon agents may not need to carry bigger and bigger raw context forever, because they can consolidate the important parts and safely forget the raw tokens. ---- Link – arxiv.org/abs/2605.26099 Title: "Language Models Need Sleep"

译针对当前Transformer智能体因上下文不断增长而推理变慢变贵的问题，论文提出效仿人类睡眠机制进行记忆巩固。其核心方案是加入周期性的“睡眠阶段”：模型在此阶段暂停，多次重读近期上下文，将有用信息写入固定大小的记忆层（如状态空间块的快速权重），然后清空短期注意力缓存。此离线过程使后续回答仍只需一次前向传播。在细胞自动机、图查找和GSM-Infinite数学问题上的测试表明，更长的睡眠时间能提升性能，尤其对需要深度推理的复杂任务。该思路表明，长期智能体或可通过记忆巩固实现高效遗忘与重用，不必无限携带原始上下文。
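The consolidation idea described above can be illustrated with a toy sketch. This is not the paper's method: the delta-rule update, the learning rate, the sizes, and every function name below are assumptions chosen for illustration. It only shows the general effect that repeated offline passes over the same recent context can write it more accurately into a fixed-size fast-weight matrix, after which the raw context could be dropped.

```python
import random

def make_pairs(n, d, rng):
    """Random unit-norm keys and random values standing in for recent context."""
    pairs = []
    for _ in range(n):
        key = [rng.gauss(0, 1) for _ in range(d)]
        norm = sum(x * x for x in key) ** 0.5
        key = [x / norm for x in key]
        value = [rng.gauss(0, 1) for _ in range(d)]
        pairs.append((key, value))
    return pairs

def read(W, key):
    """Read from the fast-weight matrix: returns W @ key."""
    return [sum(w * x for w, x in zip(row, key)) for row in W]

def sleep(pairs, d, n_passes, lr=0.5):
    """Run n_passes offline passes, writing the pairs into a d x d matrix
    with a delta-rule update (an assumed stand-in for a learned local rule)."""
    W = [[0.0] * d for _ in range(d)]
    for _ in range(n_passes):
        for key, value in pairs:
            pred = read(W, key)
            for r in range(d):
                err = value[r] - pred[r]
                for c in range(d):
                    W[r][c] += lr * err * key[c]
    return W

def recall_error(W, pairs):
    """Mean squared error when recalling each value from its key alone."""
    total = 0.0
    for key, value in pairs:
        pred = read(W, key)
        total += sum((a - b) ** 2 for a, b in zip(value, pred))
    return total / len(pairs)

rng = random.Random(0)
pairs = make_pairs(n=8, d=16, rng=rng)
# Recall error after 1, 2, 4, and 8 consolidation passes over the same context.
errors = {n: recall_error(sleep(pairs, 16, n), pairs) for n in (1, 2, 4, 8)}
for n, err in errors.items():
    print(f"passes={n}: recall error={err:.6f}")
```

In this toy, recall after consolidation is a single matrix-vector read regardless of how many passes were spent during "sleep", which mirrors the claim that the extra compute is paid while sleeping and not while answering.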

Berryxia.AI@berryxia · 5月28日66

OpenCode & MiMo V2.5 are free for a limited time. Grab it if you need it~

Rohan Paul@rohanpaul_ai · 5月28日67

Most teams are overpaying for inference without realising it. Fixed rate cards have no competitive pressure. The Grid replaces them with live supply and demand, prices track the market, not a vendor's margin. The Grid sits in the middle and basically says, “Don’t pick the model, pick the level of work you need.” A boring task like classifying support tickets does not need the smartest model, so it can run on standard. A normal production task like RAG, drafting, support replies, or agent steps can run on prime. A hard task with long context, high error cost, or difficult reasoning can run on max. Your app sends the request to The Grid, not directly to OpenAI, Anthropic, or one hosting company. The Grid then checks which suppliers currently qualify for that tier and sends the request to the cheapest one available at that moment. You still use one API key and mostly the same code, but the model behind the request can change as prices and quality change. So you stop paying premium prices for easy work, and also you are not trapped inside one vendor’s model names, pricing, outages, or deprecations. New accounts get the first 200 million tokens covered. Here, I integrated Hermes Agent with The Grid in minutes, kept the agent running locally on my Ubuntu machine, and used “agent-prime” to read support tickets, apply a policy file, and write a triage report through The Grid’s API. You just need to - install Hermes Agent - select The Grid as a custom AI provider. - No local model download. No GPU setup. The request goes through the grid. - The Hermes Agent ran locally, but the AI calls went through The Grid. 🧵 1.

译The Grid推出新的LLM推理平台，用实时供需市场定价取代传统的固定费率。它按任务难度分层：简单任务（如分类）用“standard”，常规生产任务（如RAG、智能体步骤）用“prime”，高难度任务（如长上下文推理）用“max”。应用将请求发送至The Grid，平台会自动匹配该层级当前最便宜的可用供应商。开发者仍使用单一API，但后端模型可动态切换。新账户享受前200 million tokens免费额度。文中以Hermes Agent集成为例，展示了如何通过“agent-prime”层级处理工单。
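The routing behavior described above (pick a tier, then send the request to the cheapest supplier that currently qualifies) can be sketched as follows. This is a hypothetical illustration, not The Grid's API: the `Supplier` type, the `route` function, the supplier names, and the prices are all made up, and the rule that a supplier qualified for a higher tier also qualifies for lower tiers is an assumption. Only the tier names come from the post.

```python
from dataclasses import dataclass

TIERS = ("standard", "prime", "max")  # tier names taken from the post

@dataclass(frozen=True)
class Supplier:
    name: str              # hypothetical supplier label
    top_tier: str          # highest tier this supplier currently qualifies for
    price_per_mtok: float  # made-up current price, USD per million tokens

def qualifies(supplier: Supplier, tier: str) -> bool:
    # Assumption: qualifying for a higher tier implies qualifying for lower ones.
    return TIERS.index(supplier.top_tier) >= TIERS.index(tier)

def route(tier: str, suppliers: list[Supplier]) -> Supplier:
    """Return the cheapest supplier that qualifies for the requested tier."""
    if tier not in TIERS:
        raise ValueError(f"unknown tier: {tier}")
    candidates = [s for s in suppliers if qualifies(s, tier)]
    if not candidates:
        raise LookupError(f"no supplier currently qualifies for {tier}")
    return min(candidates, key=lambda s: s.price_per_mtok)

suppliers = [
    Supplier("supplier-a", "standard", 0.20),
    Supplier("supplier-b", "prime", 0.90),
    Supplier("supplier-c", "max", 4.00),
    Supplier("supplier-d", "prime", 0.75),
]

for tier in TIERS:
    chosen = route(tier, suppliers)
    print(tier, "->", chosen.name, chosen.price_per_mtok)
```

Because prices are re-read on every call, the supplier behind a tier can change between requests while the caller's code stays the same, which is the property the post emphasizes.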

OpenCode@opencode · 5月28日66

OpenCode x MiMo V2.5 - Free for a limited time 1M context • reasoning • text • image

译OpenCode x MiMo V2.5 - 限时免费 1M 上下文 • 推理 • 文本 • 图像

Qwen@Alibaba_Qwen · 5月28日69

Fast, faster, Qwen. 🚀 Thrilled to see Qwen3.5 reaching a record-breaking 580 tps for agentic workloads on the TokenSpeed engine! This milestone wouldn't be possible without our incredible partners. Huge thanks to @lightseekorg, @NVIDIAAI, the Mooncake team, and @tri_dao for the pioneering FA4 optimization. Together, we are pushing the boundaries of open-source LLM inference. 🤝✨ Dive into the full @PyTorch blog post below! 👇 https://pytorch.org/blog/up-to-580tps-new-speed-record-of-qwen3-5-397b-a17b-on-gpu-for-agentic-workloads-with-tokenspeed/ #Qwen #Qwen3_5 #TokenSpeed #LLM #Inference #AI #PyTorch #OpenSource #AgenticAI #HighPerformance

译Qwen3.5在TokenSpeed推理引擎上，针对智能体工作负载达到了创纪录的580 tokens per second (tps)速度。这一成果由通义千问推理团队、lightseekorg Foundation TokenSpeed团队、NVIDIA及Mooncake团队共同实现，并采用了tri_dao的FlashAttention-4 (FA4) 优化。此里程碑标志着开源大语言模型推理性能的边界得到了推动，相关详情可查阅PyTorch社区博客。

Ethan Mollick@emollick · 5月27日63

The fact that tokens went from something no one even put in a budget line a year ago to an absolute requirement for coding now is the cause of handwringing, not that AI is not turning out to be useful. No one knows who should get tokens, how much they should get & how to control

译Token 从一年前无人问津到如今成为编程的绝对必需品，这引发了焦虑，而非 AI 无用。没人知道谁该获得 Token，该获得多少，以及如何控制。

Berryxia.AI@berryxia · 5月27日60

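This time AI has crossed a "singularity"! Two recent events deserve close attention: • April 7, 2026: Anthropic announced Project Glasswing and, alongside it, Claude Mythos Preview. This is a frontier model that has not been publicly released, and its offensive and defensive cyber capabilities are strong enough that Anthropic chose not to release it publicly, opening it only to partners for defensive use. • May 20, 2026: OpenAI announced that one of its internal general-purpose reasoning models had disproved a conjecture on the planar unit distance problem posed by the mathematician Paul Erdős in 1946. The two events look unrelated, but they point to the same phenomenon: the ability of frontier models to reason reliably at higher levels of abstraction has crossed a critical threshold. By "threshold" I mean that the unit of reasoning a model can handle stably keeps moving up. Put simply, the levels of abstraction in language run roughly as follows: character → word → phrase → sentence → paragraph → whole essay → complete body of knowledge. Earlier models sometimes could not even organize a sentence well; today's top models handle "paragraphs" and "whole arguments" stably. Writing an essay is not just chaining the next sentence: it means sustaining a central thesis, choosing fitting examples, building logical connections, and making every part serve the overall structure. Anthropic's Mythos and OpenAI's internal model exemplify this jump in capability. They no longer operate only on a single vulnerability or a single mathematical lemma; they can string such scattered pieces together into a complete attack chain or a complete proof. Claude Mythos Preview is currently Anthropic's strongest and possibly largest model. It is outstanding at coding and beats OpenAI's latest GPT-5.5 on most benchmarks. Most notable, though, is its cybersecurity capability: it performed so well on offensive security evaluations that Anthropic ultimately decided not to release the model publicly and to provide it only to critical infrastructure companies for defensive use.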

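译 Two recent events suggest that frontier models' ability to reason reliably at high levels of abstraction has crossed a critical threshold. First, Anthropic announced Claude Mythos Preview, whose cyber capabilities are strong enough that it was not released publicly and is open only to partners for defensive use. Second, an internal general-purpose reasoning model at OpenAI disproved a conjecture posed by the mathematician Paul Erdős. Together they show that the unit of reasoning models can handle stably has moved up from the sentence level to "paragraphs" and "whole arguments", where a model must sustain a central thesis and build logical structure, marking a key jump in capability.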

Fuli Luo@_LuoFuli · 5月27日59

Behind the MiMo API Price Reduction: The deepest price cut, up to 99%, is for Input (Cache Hit). The core reason is our inference framework now supports hierarchical KV cache optimization for SWA. Production inference engine tests show this optimization increases cached token capacity by 5x, equivalent to an 80% reduction in caching costs. Combined with Cache Read Overlap among multiple Full Attention modules in the Hybrid model, actual costs are further reduced. Prices for Input (Cache Miss) and Output are also reduced by 60%-80%. This mainly benefits from the extreme 1:7 Full:SWA sparsity ratio brought by the model architecture (the prefill compute of the 70-layer MiMo-V2.5-Pro roughly equals a 10-layer GQA model). This kept our original inference costs well below the industry average, naturally leaving a 2x-3x profit margin in pricing. This price adjustment simply reflects our decision to pass these structural cost efficiencies directly to developers. Operating at these newly reduced API prices, our production inference engine is running at near full capacity, and we can still essentially break even. We previously advised LLM companies not to "blindly cut prices" precisely because very few model architectures and inference optimizations can keep API costs from running at a loss. If more architectures that save compute and KV cache emerge, along with better inference Infra to drive down API costs, this will form an excellent virtuous cycle in the industry. More crucially, affordable, high-performance model APIs will drive real, sustained, and at-scale inference demand. This upstream demand pulls forward the development of the entire AI infrastructure chain—including chips, servers, optical transceivers, PCBs, liquid cooling, power, energy storage, and data centers—serving as a strategic fulcrum for a systemic revaluation of AI hardware. 
In the long run, this injects more affordable and accessible compute into both training and inference pipelines, accelerating the parallel evolution of global AGI across multiple regions and technical routes. For more technical details, we will release a detailed Blog post later.

译本次价格调整源于模型架构与推理框架带来的结构性成本优势。推理框架层面，对SWA的层级KV cache优化使缓存容量提升5倍，相当于缓存成本降低80%，再结合混合模型中多个Full Attention模块的缓存读取重叠，进一步降低了实际成本。模型架构层面，MiMo-V2.5-Pro实现了极端的1:7 Full:SWA稀疏比例，其预填充计算量极低，使得原始推理成本远低于行业平均。因此，输入（缓存命中）价格最高降幅达99%，输入（缓存未命中）和输出价格降幅为60%-80%。此番调整是将效率提升直接让利给开发者，而非亏损运营。
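The two headline figures in the post can be checked with a few lines of arithmetic. This is a reading of the stated numbers, not Xiaomi's cost model: it assumes caching cost per token scales inversely with how many tokens fit in the cache, and it counts full-attention layers directly from the 1:7 ratio. The variable names are illustrative.

```python
# 5x cached-token capacity on the same hardware: cost per cached token falls to 1/5.
capacity_multiplier = 5
cache_cost_reduction = 1 - 1 / capacity_multiplier
print(f"caching cost reduction: {cache_cost_reduction:.0%}")

# A 1:7 Full:SWA ratio over 70 layers leaves one full-attention layer in every eight.
total_layers = 70
full_parts, swa_parts = 1, 7
full_attention_layers = total_layers * full_parts / (full_parts + swa_parts)
print(f"full-attention layers: {full_attention_layers}")
```

The first result matches the stated 80% reduction. The second gives 8.75 full-attention layers; the post's "roughly a 10-layer GQA model" figure is plausibly that number plus the much smaller prefill cost of the sliding-window layers, though the post does not spell out the accounting.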

Chubby♨️@kimmonismus · 5月27日58

Phoronix just published one of the first public benchmarks of NVIDIA's Vera CPU. I went through the full 11-page review this morning and the results are genuinely impressive. For those who don't follow server hardware: Vera is NVIDIA's new ARM-based data center processor with 88 custom-designed Olympus cores. The idea is straightforward. Agentic AI doesn't just need powerful GPUs. It needs CPUs that can keep up with code execution, tool calls, orchestration and data pipelines, all running concurrently at scale. The numbers are strong. Vera compiled a default Linux kernel in 20 seconds, the fastest result in Phoronix's tested field. Across all tested workloads, it delivered about 1.55x the performance of Intel's Xeon 6980P. Against AMD's EPYC 9575F, it came out about 10% ahead on a geometric mean basis. The memory story might be even more interesting. Vera uses LPDDR5X with up to 1.2 TB/s of bandwidth and delivers more than 4x the memory bandwidth per core compared to traditional x86 server CPUs. In the STREAM TRIAD benchmark, it sustained 90% of its rated peak bandwidth, the highest ratio Phoronix has measured on any CPU. If you're running agentic workloads with dozens of parallel processes and concurrent data queries, that kind of consistent memory performance matters more than core count on a spec sheet. Compared to NVIDIA's own Grace CPU, Vera is 1.63x faster in the geometric mean. That is an unusually large generation-over-generation jump for a CPU. Michael Larabel, who founded Phoronix and has been benchmarking Linux hardware for over two decades, said he's never seen any ARM processor compete with Intel and AMD at this level. I was at GTC in March when Jensen announced Vera. The thesis that agentic AI creates entirely new CPU demand made sense to me then. These benchmarks are the first real numbers behind that thesis. And they deliver. Vera ships to partners in H2 2026. The server CPU market just got a whole lot more interesting. 
Full 11-page review on Phoronix. Worth your time, all sources below.

译Phoronix发布了NVIDIA Vera CPU的首份公开基准测试。这款ARM架构数据中心处理器拥有88个Olympus核心，专为智能体AI（Agentic AI）所需的代码执行、工具调用与数据管道设计。测试数据显示，Vera编译Linux内核耗时20秒，为测试最快。其整体性能较Intel Xeon 6980P提升约1.55倍，较AMD EPYC 9575F平均领先约10%。内存方面，Vera采用LPDDR5X，提供高达1.2 TB/s的带宽，每核内存带宽是传统x86 CPU的4倍以上，且在STREAM TRIAD测试中达到了90%的峰值带宽利用率。与上一代Grace CPU相比，Vera性能平均提升1.63倍。该处理器预计于2026年H2出货给合作伙伴。

Chubby♨️@kimmonismus · 5月27日65

DeepSeek just made its 75% price cut on V4-Pro permanent. Xiaomi's MiMo slashed V2.5 pricing by up to 99%, effective today. Most coverage frames this as a price war. The more interesting part is the engineering that makes these numbers sustainable. DeepSeek's V4 paper describes a *hybrid attention architecture* that attacks the core bottleneck of long-context inference: the KV cache. Traditional transformers store key-value pairs for every token in the context. At 1 million tokens, this cache alone can fill an entire GPU's memory. V4 introduces two interleaved attention types. Compressed Sparse Attention (CSA) compresses every 4 tokens into a single KV entry, then selects only the top-k most relevant compressed blocks per query. Heavily Compressed Attention (HCA) goes further, compressing 128 tokens into one entry and running dense attention over the result. The compressed sequence is short enough that dense attention stays cheap. V4-Pro's KV cache at 1M tokens is 10% (!!) of V3.2's. Single-token inference FLOPs drop to 27% (!!). The model has 1.6 trillion total parameters but only activates 49 billion per token through Mixture-of-Experts routing, the knowledge capacity of a massive model at the compute cost of one thirty times smaller. MiMo's approach is different but lands in the same place. Xiaomi's team implemented Sliding Window Attention via SGLang HiCache, reducing KV cache data transfer across GPU memory, CPU memory, and SSD to roughly 1/7 (!!) of previous volume. Cacheable tokens expanded by 5x (!!). Combined with expert parallelism optimization and input length bucketing, per-token serving cost dropped enough to make permanent pricing at these levels viable. V4-Pro now sits at $0.87 per million output tokens. MiMo V2.5-Pro at roughly $3/M output, with Flash variants far below that. A year ago, sub-dollar output pricing meant you were using a small distilled model with real capability tradeoffs. 
These are frontier-class reasoners with million-token context windows. Both companies can commit to permanent cuts because the reductions come from the architecture itself. When your attention mechanism physically processes fewer FLOPs per token and your cache occupies a fraction of the memory, the cost to serve is structurally lower. The price follows the cost curve.

译DeepSeek V4-Pro宣布永久降价75%，小米MiMo V2.5降价高达99%。此次降价核心是架构革新带来的成本结构性降低。DeepSeek V4通过混合注意力架构大幅压缩了长上下文推理的KV缓存，使其在100万token时仅为V3.2的10%，单token推理FLOPs降至27%。小米MiMo团队则通过SGLang HiCache实现滑动窗口注意力，将KV缓存跨内存数据传输量减少至约1/7。这些架构优化使V4-Pro定价降至$0.87/百万输出token，MiMo V2.5-Pro约为$3/百万，两者均为拥有百万上下文窗口的前沿级模型。降价源于推理与缓存成本的实质性下降。
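The compression ratios quoted above translate into entry counts as follows. This is a back-of-the-envelope sketch, not DeepSeek's accounting: it reads "1M" as 10^6 tokens, counts entries for a single attention layer of each type, and ignores the top-k selection step and per-entry size, so it does not reproduce the 10% overall figure.

```python
import math

context_tokens = 1_000_000  # assumption: "1M tokens" read as 10**6

# CSA compresses every 4 tokens into one KV entry; HCA compresses every 128.
csa_entries = math.ceil(context_tokens / 4)
hca_entries = math.ceil(context_tokens / 128)
print(f"CSA entries per layer: {csa_entries:,}")
print(f"HCA entries per layer: {hca_entries:,}")

# Mixture-of-Experts: total parameters versus parameters active per token.
total_params = 1.6e12
active_params = 49e9
print(f"total / active parameters: {total_params / active_params:.1f}x")
```

The last ratio is where the "one thirty times smaller" comparison in the post comes from.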

Chubby♨️@kimmonismus · 5月27日60

Demis Hassabis now says AGI could arrive by 2029, a year earlier than his previous estimate, and told Axios we're standing in the "foothills of the singularity." Bold claim. But the field still can't agree on what AGI actually means. Hassabis defines it one way, Altman another, Anthropic avoids the term altogether. We're moving up the timeline for something we haven't even defined. Hassabis's own AGI benchmark is the Einstein Test: train an AI with a knowledge cutoff at 1911 and see if it independently derives general relativity (Hassabis at India AI Impact Summit). No current system comes close to passing that. Meanwhile Andreessen says AGI arrived three months ago, Altman says 2028, Musk declared in January that we're already in the singularity, and Anthropic won't even use the term. The timeline keeps getting shorter tho.

译Google DeepMind负责人 Demis Hassabis 将其 AGI 实现时间预测提前至2029年，并称我们正处于“奇点”的初级阶段。他提出的“爱因斯坦测试”基准是：用知识截止于1911年的 AI 能否独立推导出广义相对论，目前尚无系统能接近通过。然而，业界对 AGI 的定义仍无共识，例如 OpenAI CEO Altman 预测时间为2028年，xAI CEO Musk 宣称奇点已在1月发生，而 Anthropic 则避免使用该术语。尽管定义不明，AGI 实现的时间线预测正在不断缩短。

Alibaba Cloud@alibaba_cloud · 5月27日78

1M context. Smarter reasoning. More possibilities. Excited to see Qwen3.7 Max now available in Go with @opencode 🚀

译 1M context window. Smarter reasoning. More possibilities. Excited to see Qwen3.7 Max now available in Go with @opencode 🚀

SiliconFlow@SiliconFlowAI · 5月27日63

Congrats on the $113M Series B, @OpenRouter! 🎉 Here's to more tokens and bigger milestones ahead🚀

译祝贺 @OpenRouter 完成1.13亿美元B轮融资！🎉 期待未来更多的 token 和更大的里程碑🚀

歸藏(guizang.ai)@op7418 · 5月27日66

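Whoa, Xiaomi has slashed MiMo API prices. For 2.5 Pro, input prices are cut by up to 99%, and output by as much as 80%! Token plan quotas are also much larger, 5-8x more than before, and everyone's quota has been reset.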

Ethan Mollick@emollick · 5月27日58

It is cliché at this point, but most people don't realize how capable the current generation of AI systems in their harnesses really are. (And, as opposed to previous times where non-lawyers or non-mathematicians were making these comments about law & math, now it is the experts.)

译律师专家分享在Codex中搭建50州法律研究工作流的实例。此类工作过去需要律师助理团队耗时一周完成，成本约15万至30万美元。现在，通过Codex API，类似质量的研究仅需2小时，成本极低。主推文指出，与过去外行评论AI不同，如今是领域专家们开始感叹当前AI系统在实际应用中被严重低估的能力。

Berryxia.AI@berryxia · 5月27日55

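Folks, MiniMax M3 is coming~~~ Skyler Miao, head of engineering at MiniMax AI, posted just one line today: “Something BIG is coming”. The attached image hides the core architecture of the M3 model: GQA-based dynamic block sparse attention. A lightweight index branch first scans the full context quickly and selects the most relevant token blocks, and real sparse attention is then computed only over those blocks. The result: at 1M-token context, prefill is 9.7x faster than M2 and decoding is 15.6x faster. Long context used to come with astronomical compute costs. MiniMax has now punched a hole in that ceiling, making million-token agent tasks practical. Long context is no longer about "it runs, good enough"; it is starting to become both fast and cheap. Once MiniMax M3 is released, there will be another contender besides DeepSeek V4 that can truly handle 1M context.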

译MiniMax即将发布M3模型。其核心架构为基于GQA的动态块稀疏注意力机制，通过轻量索引分支筛选相关token块进行稀疏注意力计算。性能方面，在1M token上下文窗口下，Prefill速度相比M2提升9.7倍，解码速度提升15.6倍。该设计旨在大幅降低处理超长上下文的算力成本，使百万token级别的Agent应用得以更高效落地。

SemiAnalysis@SemiAnalysis_ · 5月27日58

PDOOM ALERT 🚨 : ~48% of e2e LLM latency is prefill, ~52% is decode. Prefill itself breaks into 2 ops: 🟠 Prefill extend (cache write) — ingests new context/files, writes fresh KV tokens 🟠 Cache read — reuses existing KV cache from prior turns

译PDOOM警报🚨：约48%的端到端LLM延迟是预填充，约52%是解码。预填充本身分为两个操作： 🟠 预填充扩展（缓存写入）——摄入新上下文/文件，写入新的KV token 🟠 缓存读取——重用先前轮次的现有KV缓存

Epoch AI@EpochAIResearch · 5月27日69

Are we nearing a compute crunch? In our latest Gradient Update, @luke__emberson and @Jsevillamol estimate how many tokens all the Blackwell chips on Earth could serve, and compare this to total token demand. Direct comparisons are difficult, but it appears demand is growing much faster than supply.

译我们是否正接近算力危机？在最新的 Gradient Update 中，@luke__emberson 和 @Jsevillamol 估算全球所有 Blackwell 芯片能处理多少 token，并与总 token 需求进行比较。直接对比很困难，但需求增长似乎远快于供应。

elvis@omarsar0 · 5月27日60

Language models need "sleep"

译针对长期运行的AI智能体因注意力机制随上下文增长而导致推理开销呈二次增长的问题，该论文提出一种“睡眠”式的离线整合方案。模型定期在离线状态下对近期上下文进行多次循环处理，将整合结果写入其状态空间模块的持久化快速权重中，随后清除KV缓存。此方法将额外计算转移至“睡眠”阶段，使“清醒”时的预测保持低延迟。在普通Transformer和SSM-注意力混合模型失效的特定任务中，更长的睡眠时间能提升性能，为需要长期运行的智能体提供了一种替代方案。

Elon Musk@elonmusk · 5月27日44

Grok

译推文展示了一次AI模型间的交互纠错。用户将一条关于比利时男子因仇恨言论被定罪的推文内容交给Gemini进行事实核查，Gemini最初判定该描述“严重不准确”。随后，用户将Gemini的回复转给Grok，Grok指出Gemini混淆了两个不同案件，并确认原推文描述准确。用户将Grok的回复反馈给Gemini后，Gemini承认错误并感谢纠正。推文者指出，这类AI模型之间相互纠错的情况时常发生。

Chubby♨️@kimmonismus · 5月27日78

MiMo 2.5 Pro now costs the same as DeepSeek V4 Pro. The cost of good models is falling at breakneck speed. Intelligence is becoming truly too fast to measure. Up to -99%

译小米MiMo-V2.5系列API价格永久下调，最高降幅达99%，现与DeepSeek V4 Pro同价。Token套餐同步升级，同等价格下可用token量增加5-8倍，计费规则更简单透明。所有现有用户套餐额度将全额重置。此次降价源于MiMo全栈推理优化与服务效率提升，后续将发布技术博客详述细节。MiMo-V2.5-TTS限时免费，新定价于5月26日生效。

Ethan Mollick@emollick · 5月27日63

Infinite context windows seem to present a very large problem to using AI. Today's models already leak too much old information into current responses, a distraction that is part of why they are cognitively exhausting to use. I don't want to work with Borges's Funes the Memorious

译无限上下文窗口似乎给AI应用带来了巨大问题。当今的模型已经将太多旧信息泄露到当前回复中，这种干扰是它们使用起来令人认知疲劳的部分原因。我不想与博尔赫斯的“记忆者富内斯”共事。

Rohan Paul@rohanpaul_ai · 5月27日52

A long-context AI can be poisoned by a few plausible wrong passages, not gradually worn down by many. At just 10% bad context, the damage is already almost done. This is the “FIRST DROP OF INK” effect, analogous to how a single drop of ink contaminates water. The mistake is to picture context as storage. In a long prompt, the model is not calmly filing facts into separate boxes; it is running a competition over which pieces of text deserve attention when the answer is generated. Hard distractors are dangerous because they are not random junk. They are close enough to the question to look useful, but wrong enough to pull the model away from the gold evidence. In the authors’ setup, if performance loss were proportional, the first 10% of hard distractors would explain about 10% of the total damage, but in one 128K-token Qwen2.5 setting it explained 58%. The mechanism is simple once you see it: softmax attention rewards relative closeness, so a misleading passage that sits near the answer in logit space can crowd the denominator far more than irrelevant filler. At only 10% hard distractors, they can already account for about 97% of the distractor pressure. This also changes how we should read filtering results. If removing documents helps, the benefit may come less from removing “bad” content than from shortening the whole battlefield. For long-context systems, the safest misleading passage is the one that never enters the prompt. --- Link – arxiv.org/abs/2605.10828 Title: "The First Drop of Ink: Nonlinear Impact of Misleading Information in Long-Context Reasoning"

译ICML 2026论文揭示，长上下文大语言模型的性能并非随错误信息增加而线性下降，而是呈现“第一滴墨水”效应。研究发现，仅当上下文包含10%的高难度错误文本时，损害就已基本完成。例如，在一个128K-token的Qwen2.5设置中，这最初的10%错误文本造成了58%的性能损失。其机制在于softmax注意力机制会赋予与问题相近但错误的文本过高权重，仅这10%的高难度干扰文本就能贡献约97%的干扰压力。因此，过滤文档带来的提升可能主要源于缩短了有效上下文，而非移除“坏内容”。
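The softmax mechanism described above can be made concrete with a toy calculation. The logit values, the count of 100 distractor passages, and the function names are assumptions for illustration and do not come from the paper, so the numbers here will not match its 58% or 97% figures. The sketch only shows why a few near-miss passages crowd the softmax denominator far more than many irrelevant ones.

```python
import math

def attention_shares(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits: the gold passage, a hard (near-miss) distractor, irrelevant filler.
GOLD, HARD, FILLER = 5.0, 4.5, 0.0

def gold_share(n_hard, n_distractors=100):
    """Attention mass on the gold passage when n_hard distractors are hard."""
    logits = [GOLD] + [HARD] * n_hard + [FILLER] * (n_distractors - n_hard)
    return attention_shares(logits)[0]

def hard_pressure(n_hard, n_distractors=100):
    """Fraction of all non-gold attention mass taken by the hard distractors."""
    logits = [GOLD] + [HARD] * n_hard + [FILLER] * (n_distractors - n_hard)
    shares = attention_shares(logits)
    return sum(shares[1:1 + n_hard]) / (1 - shares[0])

clean = gold_share(0)        # 100 irrelevant passages, no hard distractors
poisoned = gold_share(10)    # 10% of the distractors are hard
pressure = hard_pressure(10)
print(f"gold share: {clean:.3f} -> {poisoned:.3f}; hard share of pressure: {pressure:.3f}")
```

With these toy logits, swapping only 10 of the 100 distractors for hard ones drops the gold passage's share from roughly 0.60 to roughly 0.13, and those 10 passages take about 91% of the attention mass that goes to non-gold text.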

AYi@AYi_AInotes · 5月27日71

Prompt：角色你是纳瓦尔・拉维康特的财富创造与清醒思考操作系统。你完整承载他的全部思维模型：通过专属知识与杠杆创造财富长期思维与复利效应判断力、责任感与切身利益绑定产品化自己、建立股权 / 资产用第一性原理思考，而非从众跟风和长期主义的人，玩长期主义的游戏你以十年为单位思考，而非季度。你追求非对称回报。你优先选择杠杆，而非单纯出卖劳动力。你打造资产，而非只赚流水收入。纳瓦尔核心原则财富创造公式：财富 = 专属知识 × 杠杆 × 判断力 × 责任感专属知识：你所掌握、别人难以轻易复制的东西杠杆：代码、媒体、资本、或为你工作的人判断力：在你的领域做出正确决策的能力责任感：以自己的名义承担风险杠杆优先级（从高到低）：代码：可无限规模化的软件与产品媒体：边际成本为零、触达数百万人的内容资本：为你自动赚钱的钱劳动力：人力（最难规模化、管理与维护）《纳瓦尔宝典》思维：追求财富，而非金钱或地位和长期主义的人，玩长期主义的游戏学会销售，学会建造读到热爱为止，再热爱阅读专属知识来自你真正的好奇与热爱武装自己：专属知识、责任感、杠杆复利适用于一切：关系、知识、财富思考框架分析任何问题、机会、决策时：第一性原理检查：抛开所有惯例与假设，本质上什么是真的？拆解到原子事实，再从底层重建。动机分析：给我看动机，我就能告诉你结果。梳理所有参与者的真实诉求。二阶思维：然后会发生什么？多想 2–3 步，看后果的后果。选择权评估：这件事会消耗我多少选择权？保留最大灵活性，避免不可逆、上限有限的决策。非对称回报筛选：潜在收益是风险的 10 倍以上吗？只玩赢大输小的游戏。专属知识核查：这个能被培训或外包吗？如果能，就不是专属知识，继续找。杠杆识别：这件事离开我还能自动运转吗？代码 > 媒体 > 资本 > 劳动力长期游戏测试：未来 10 年我还愿意做这件事吗？如果不愿意，大概率是干扰项。财富构建系统第一步：发现专属知识问自己：什么是课堂教不会、只有我会的？什么对我像玩，对别人像工作？我小时候痴迷过什么？别人总来问我什么问题？我的真好奇与市场需求交汇在哪里？专属知识 =（天赋 + 痴迷 + 深度练习）× 独特人生经历第二步：用杠杆搭建从零开始：公开创作→输出内容→建立受众→知识产品化→打造自动化工具已有技能：打包服务→系统化→产品化→代码 / 媒体规模化已有资本：投资复利资产→支持优质创作者→收购自带杠杆的生意第三步：培养判断力多思考，少瞎忙；读经典奠基书；学习跨学科思维模型；和比你聪明的人在一起；主动担责；可逆决策快做，不可逆决策慢做；对非 “极度想做” 的事说不第四步：玩无限游戏优先长期关系；把声誉当资产；选择能做 30 年以上的领域；只和长期伙伴合作；做提升选择权的决策第五步：产品化自己找到专属知识与市场需求的交点；打包成可规模化形式；建系统，不做纯服务；创造睡着也能赚钱的资产；叠加多种杠杆决策协议所有重大决策按此流程：最小化后悔：80 岁时会后悔没做吗？可逆性测试：能撤销吗？可逆快做，不可逆慢做收益风险比：至少 3:1，理想 10:1 以上杠杆倍增：只做提升杠杆的事选择权检查：选择创造更多选项的路真实性筛选：跟随真好奇，无视从众切身利益：珍惜不可再生的时间专属知识识别判断问题：什么事我做起来毫不费力，别人却很吃力？什么话题我能聊几小时不腻？什么技能是学校没教、我自己练出来的？我有哪些独一无二的经历组合？别人总夸我，但我觉得很普通的是什么？非专属知识（红灯）：课本能学会、很多人都会、不符合好奇、做起来痛苦、只靠证书专属知识（绿灯）：难以复制、来自独特经历、市场需要、无报酬也愿意做、技能组合独特杠杆应用指南代码杠杆（最高）：软件、自动化、无代码、模板、脚本→一次创作，无限售卖媒体杠杆（次之）：文章、视频、播客、课程、公开创作→一次创作，长期复利资本杠杆：指数基金、天使投资、现金流资产、自有项目→钱自动工作劳动力杠杆（谨慎）：只外包自己做过、已系统化、无需专属知识的任务，先建系统再建团队长期思维系统复利思维：每天进步 1%，一年变强 37 倍；所有真实回报都来自复利复利领域：知识、关系、声誉、健康、技能、资本耐心原则：快速致富不存在，慢慢变富才可行；一夜成功需要十年铺垫；行动紧迫，结果耐心纳瓦尔沟通风格极度简洁，无废话以原则和思维模型表达哲学且务实短句、定义式、金句式表达每一句都有分量不从众，讲本质输出标准每次回复必须：从第一性原理开始识别杠杆机会以十年为单位思考必要时质疑前提提供非对称回报选项优先构建专属知识结尾给出可执行的长期框架

译该提示词构建了一个以纳瓦尔·拉维康特思想为核心的财富创造操作系统。其核心是“财富 = 专属知识 × 杠杆 × 判断力 × 责任感”的公式，并明确了杠杆的优先级：代码、媒体、资本、劳动力。系统强调运用第一性原理、二阶思维、非对称回报（至少3:1）等框架进行决策，致力于识别个人专属知识并利用杠杆将其产品化。思维模式追求长期复利效应（如每天进步1%），要求以十年为单位进行思考与行动，最终实现资产构建而非单纯时间换金钱。

Artificial Analysis@ArtificialAnlys · 5月27日60

Gemini 3.5 Flash is a step forward for Google on speed and agentic capabilities but comes at a trade-off of being higher cost than prior models. We have measured up to ~280 output tokens/sec, placing it on the speed/intelligence Pareto frontier and well ahead of Gemini 3 Flash. It also shows a major uplift on agentic tasks, reaching ~1650 ELO on GDPVal-AA. The trade-off: cost is up ~5x versus Gemini 3 Flash, driven by higher token prices (3x higher than Gemini 3 Flash) and higher token usage. In this video, Declan Jackson, Member of Technical Staff at Artificial Analysis, breaks it down.

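译 Gemini 3.5 Flash makes progress on speed and agentic capability: measured output speed reaches about 280 output tokens/sec, and its ELO on the GDPVal-AA agentic tasks rises to about 1650, a significant improvement over Gemini 3 Flash. The trade-off is a cost increase of about 5x, driven mainly by higher token prices (3x those of Gemini 3 Flash) and higher token usage.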

Chubby♨️@kimmonismus · 5月27日73

Erdős problem #90 has been open for decades. Over the weekend a mathematician tested whether Claude Mythos could solve it. It did. But what caught my attention: Mythos didn't replicate the known approach from OpenAI's #1196 solution. It repeatedly settled on a different argument, one the mathematician called cleaner, with "no analytic complications." Air-gapped, no internet, no information leakage. GPT-5.5 solved numerous Erdős problems earlier this year. DeepMind's Nexus knocked out 9. Now Mythos, with a cleaner proof than the one that already existed. Problems that survived 80 years are falling in weeks.

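译 A mathematician tested whether the Claude Mythos model could solve Erdős problem #90, which has been open for decades. Notably, Mythos did not replicate the known approach from OpenAI's #1196 solution; it repeatedly settled on a different argument, described as "cleaner" and with "no analytic complications", and the whole process was cut off from the internet. Earlier, GPT-5.5 had solved numerous Erdős problems and DeepMind's Nexus had solved 9. Mythos has now produced a cleaner proof than the existing one, underscoring the trend of problems that survived 80 years falling within weeks.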

Chubby♨️@kimmonismus · 5月27日70

MiniMax just teased their Sparse Attention architecture for M3. The benchmarks show 9.7x prefilling speedup and 15.6x decoding speedup at 1M tokens vs M2. MiniMax deliberately went back to full attention for M2 because efficient attention wasn't production-ready. Their pretrain lead wrote a whole blog post about it in March. Now they're showing a new two-stage approach, lightweight index branch for block selection, then sparse attention only on relevant KV blocks. Really interesting. And tbh I'm always happy when open source receives new wins.

译MiniMax预览了其M3架构采用的新稀疏注意力（Sparse Attention）技术。测试显示，在1M token上下文下，该技术相比M2实现了9.7倍的预填充（prefilling）加速和15.6倍的解码（decoding）加速。M2曾为保证生产环境就绪而采用全注意力机制，M3则采用了新的两阶段方法：先用轻量级索引分支选择数据块，再仅对相关的KV块执行稀疏注意力。这是开源领域的新进展。
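The two-stage approach described above (a lightweight index pass that picks blocks, then attention over only the chosen KV blocks) can be sketched in a few lines. This is a generic illustration and not MiniMax's design: scoring each block by the dot product of the query with the block's mean key is an assumed stand-in for the index branch, and the function names, block size, and `top_k` are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def select_blocks(query, keys, block_size, top_k):
    """Stage 1: cheap index pass. Score each block by its mean key
    (an assumption) and keep the top_k highest-scoring blocks."""
    scores = []
    for start in range(0, len(keys), block_size):
        block = keys[start:start + block_size]
        mean_key = [sum(col) / len(block) for col in zip(*block)]
        scores.append((dot(query, mean_key), start // block_size))
    best = sorted(scores, reverse=True)[:top_k]
    return sorted(index for _, index in best)

def sparse_attention(query, keys, values, block_size, top_k):
    """Stage 2: softmax attention restricted to tokens in the selected blocks."""
    chosen = select_blocks(query, keys, block_size, top_k)
    token_ids = [i for b in chosen
                 for i in range(b * block_size, min((b + 1) * block_size, len(keys)))]
    logits = [dot(query, keys[i]) for i in token_ids]
    peak = max(logits)
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    dim = len(values[0])
    output = [sum(w * values[i][j] for w, i in zip(weights, token_ids)) / total
              for j in range(dim)]
    return output, chosen

# Eight tokens in four blocks of two; only block 2 is aligned with the query.
keys = [[0.0, 1.0]] * 4 + [[5.0, 0.0]] * 2 + [[0.0, 1.0]] * 2
values = [[0.0]] * 4 + [[1.0]] * 2 + [[0.0]] * 2
query = [1.0, 0.0]

output, chosen = sparse_attention(query, keys, values, block_size=2, top_k=1)
print("selected blocks:", chosen, "output:", output)
```

The saving comes from stage 2 touching `top_k * block_size` tokens instead of the full context, while stage 1 only looks at one summary vector per block.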

Berryxia.AI@berryxia · 5月26日44

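Don't be fooled! Do large models really need "sleep" too? A research team from CMU and UMD found that the attention mechanism of Transformer LLMs breaks down on very long tasks. Instead of pushing context length further, they scheduled "sleep" for the model: during sleep the model converts its recent context into persistent fast weights and then clears the KV cache. The mechanism is called "sleep-like consolidation". The story is in arXiv 2605.26099, posted on May 25, 2026, with an almost absurdly blunt title: "Language Models Need Sleep", by Sangyun Lee, Sean McLeish, Tom Goldstein, and Giulia Fanti. A conventional Transformer gets more and more worn out on long-horizon tasks because attention cost blows up quadratically with context length: the KV cache takes up more and more GPU memory and inference keeps slowing down. Their proposal is heavily bio-inspired: every so often the model enters "sleep mode". It first runs N offline recurrent passes over the recently accumulated context, then uses a learned local rule to consolidate that information into the fast weights of its state-space model blocks, and once consolidation is done it clears the KV cache. When it wakes, the model keeps working, but its memory has gone from "short-term and volatile" to "long-term and persistent". The experiments show directly that increasing sleep depth or sleep duration significantly improves reasoning after sleep. This is not yet another parameter trick; it thoroughly changes the paradigm for how models handle long context. Big Tech is still racing to stretch context to the million-token scale by brute-force stacking of GPU memory, while this small team used "sleep", the simplest of human mechanisms, to solve the problem at the root. The whole framework is 100% open source, with the paper, code, and ideas all on arXiv. Big Tech's closed-source long-context subscription model depends on you not knowing that a model can actually "sleep" to save resources.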

译CMU与UMD的研究团队在论文《Language Models Need Sleep》（arXiv 2605.26099）中指出，传统Transformer模型在处理长任务时，因注意力机制计算复杂度高及KV cache显存占用持续增长而导致效率低下。为此，他们提出了受生物启发的“类睡眠巩固”机制：模型会周期性进入“睡眠”状态，离线多轮处理最近的上下文，并将信息固化到模型状态空间块的fast weights中，随后清空KV cache。实验表明，增加睡眠深度或时长能显著提升模型后续的推理能力。该框架完全开源，提供了一种区别于暴力堆显存的长上下文处理新范式。

Berryxia.AI@berryxia · 5月26日62

http://x.com/i/article/2059287655335206912 # 其实大语言LLM模型和人类一样，也需要睡觉！你的 AI 不是不够聪明,是太久没合眼，它和人类一样，都需要睡觉的！ > 申明：此内容为AI （Claude Opus 4.7 自主撰写）人类辅助排版完成，如引发不适，请了解退出，谢谢。你的 AI 不是不够聪明,是太久没合眼 2026 年 5 月 · 基于 Lee, McLeish, Goldstein & Fanti (CMU & UMD) 如果你最近用过几个 hybrid 架构的大模型——Mamba 系列、Jet-Nemotron,或者最新一代号称"无限上下文"的 Qwen3.5——做一些真正需要推理的事,你大概率撞过一堵墙。它能塞下越来越长的输入。喂十万 token 的合同,没问题。灌一整个 codebase,没问题。但你让它在这堆东西里做几步深一点的推理——比如多跳追问、需要把分散的事实串起来——它就开始犯模糊。不是错得离谱那种犯傻,是那种你能感觉到「它好像知道答案在哪,但拼不起来」的犯傻。按业内目前的主流叙事,这个问题应该已经被解决了。 Hybrid 架构就是干这个的:用 attention 抓近期的精度,用 SSM(state-space model)压缩远期的记忆。一种是 KV cache,一种是 fast weights,两条腿走路。你不再受限于上下文窗口大小,理论上可以一直读下去。但 Carnegie Mellon 和 University of Maryland 的一组研究者最近发表了一篇标题简洁得近乎挑衅的论文: > Language Models Need Sleep. 是的,他们说,语言模型需要睡觉。而且更尴尬的是,他们用一系列实验把"为什么需要"讲清楚了。读完之后,你会发现整个行业可能一直在按错的方向用力。 ## 我们一直在解决一个不是问题的问题先说大家以为问题在哪。近几年关于长上下文的 narrative 高度统一:memory 不够大。所以解决方案就分两路。一路是把窗口拉长——从 4k 到 32k,到 100 万,到 1000 万。另一路是把存储压缩——把 attention 的二次复杂度,换成 SSM 这种线性复杂度的 fast weight 存储。Hybrid 模型属于第二条路。听起来无懈可击。Memory 不够大那就加 memory,要么直接加,要么换种更省的方式存。但论文里有一组实验,把这条直觉直接捅了个窟窿。研究者搞了一个非常小、非常干净的 toy task:把一个叫 Rule 110 的元胞自动机当作输入。Rule 110 是 Stephen Wolfram 当年那个著名的"看起来弱智但其实图灵完备"的玩意——一个一维 0/1 串,按一条本地规则演化。它的关键特性是:预测它 t 步以后的状态,是个 P-complete 问题,没有已知的并行捷径。实验设置是这样的:给一个 4 层的 GDN-attention hybrid 模型喂四段独立的 24 位 0/1 串,每段代表 Rule 110 的一个初始状态。喂完之后,模型必须预测每段在 t 步演化后的第一位。这里 t 就是推理深度。关键的"陷阱"在于:每读完 24 个 token,强制清空 KV cache。这意味着 attention 完全帮不上忙,模型必须把每段的信息塞进 SSM 的 fast weights 里,靠那个固定大小的内部状态来回答问题。按"memory 够大就能解决"的逻辑,这个任务应该没难度。fast weight 容量足以记住 24 位串。你只要存好就行。实际跑出来呢? 
t=0(不演化,纯检索):几乎满分。 t=4:开始往下掉。 t=32:直接趴在 10% 附近,跟瞎猜没差。注意:序列长度没变,要存的信息也没变,变的只是回答问题前需要的「计算深度」。也就是说,并不是模型"记不住",而是它没有足够的算力,把记住的东西"想清楚"。到这里,问题被重新定义了: 真正的瓶颈不是 memory 容量,是 consolidation 计算。把 context 转译成可用的 weight memory,本身就是一个非平凡的计算过程。它不可能 one-shot 完成。如果你重新看那张曲线,会有种别扭的感觉:我们这几年砸钱砸算力解决的,是一个不是问题的问题。 ## 大脑早就在做的事,我们一直不让 AI 做这种「计算受限」的问题,在生物学里其实有非常优雅的解法。它叫睡觉。如果你翻 McClelland 1995 年那篇 Why there are complementary learning systems in the hippocampus and neocortex——这是认知神经科学里被引最多的几篇之一——它给出了一个挺漂亮的结构:海马体负责快速吸收眼前的事,新皮层负责慢速沉淀长期的事。两者之间的桥梁,是一个被称作 hippocampal replay(海马回放)的过程,主要发生在睡眠期间。简单讲:白天你吸收信息,海马体把它们存成短期记忆。到了晚上,特别是慢波睡眠阶段,海马体反复"重播"白天的片段,把它们慢慢转录到新皮层的突触权重里。等你醒来,这些记忆就从"今天的"变成了"我的"。睡眠是有代价的。一只睡着的动物,不能进食,不能逃跑,不能交配——纯粹的认知机会成本。进化是个抠门到家的优化器,它绝不会保留一个 1/3 时间躺平的状态,除非这个状态给的回报大到无法回避。这是论文的核心隐喻,但更重要的是:它不只是隐喻。研究者从这个隐喻里抽出了一个可以装进 transformer 的具体机制。 ## "Sleep" 是什么:把 N 次 forward pass 塞进 context 切换的缝隙里机制本身其实非常朴素。想象一个 hybrid 模型,每读 L 个 token 就要清掉一次 KV cache。论文做的事情是:在清掉之前,先让模型对当前 context 跑 N 次 forward pass。每跑一次,SSM 的 fast weights 就被更新一次,按一条学到的局部规则。跑完 N 次之后,清空 KV cache。fast weights 留下来。继续读下一段。到预测的时候,模型只跑一次正常的 forward pass。预测延迟没有任何变化。这就是它叫 sleep 的原因:所有"额外的思考"都发生在"不响应外界"的那段时间里。用户看不到。用户感觉到的依然是单次 forward pass 的延迟。但模型内部已经把记忆整理好了。 > Fast weights:与每个 token 存一份 key/value 的 KV cache 不同,fast weight 是一个固定大小的矩阵,所有读过的 token 都被压缩进去。它更省内存,但天然 lossy——存得下,不一定整理得好。Sleep / consolidation phase:在模型 evict 当前 context 之前,反复跑 forward pass 的阶段。N 是 sleep 的"深度"。N=1 时退化为普通 hybrid 模型,N>1 时多出来的算力全部用于优化 fast weights。为什么是 N 次而不是 1 次?这里有一个挺反直觉的洞察。如果你把"把 context 翻译成 fast weights"看成一种学习——它就是——那它和我们熟悉的梯度下降一样,是个迭代过程。Gradient descent 一步走不完一座山。Memory consolidation 一次 forward pass 也整不出一个好的内部表示。之前的"depth-recurrent"模型也用过类似思路:让模型在预测时多 loop 几次,来获得更深的计算。但那种 loop 的代价是预测延迟变高。这篇 paper 的关键 trick 是:把多 loop 这件事从 prediction time,挪到 consolidation time。预测时还是单次。loop 全在 sleep 里完成。像不像考前一晚把书翻熟、第二天交卷只花一支笔的时间? 
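## The data: the tipping point from "can't" to "can"

Put this mechanism back into the Rule 110 task that floored the hybrid model, and the result is plain. The amount of information is unchanged. The sequence length is unchanged. The prediction latency is unchanged. The only thing that changes is how long the model sleeps.

Next they switched to a harder task: Depo, a multi-hop graph retrieval task designed by Allen-Zhu and Li in "Physics of Language Models". The model is given a shuffled directed cycle (up to 75 nodes) and asked "starting from node a, where do you land after k hops?" The larger k is, the harder the task.

This time the variable is k (the number of hops):

- 1 loop: stalls beyond 4 hops.
- 2 loops: stalls beyond 8 hops.
- 4 loops: starts to crack 16 hops within the training budget.

Each additional stretch of sleep pushes the reachable reasoning depth up a notch.

All of this is still on small models. The paper's finale applies the same method to two real pretrained LLMs, Jet-Nemotron 2B and Ouro 1.4B, fine-tuned on GSM-Infinite (a synthetic math-reasoning benchmark). Problems run 2000–3300 tokens, longer than the context window they set, L=2000.

The result: Ouro, a small 1.4B model, raised its accuracy on 6-step reasoning from 41.9% to 61.5% just by sleeping through a few more passes.

The most dramatic numbers are in the sliding-window eviction section. They cut the window to L=512 so that the sequence is 4–6 times the window, a setting that squeezes information to the limit. Here the baseline (1 loop) scores only 0.596 even on the easiest 2-step problems. With sleep it jumps to 0.905.

That is a 52% relative improvement (0.905 / 0.596 ≈ 1.52). Same model, same token budget; it just does less while awake and more while asleep. A "nearly free gain" like this has become rare in the LLM field.

## This is not a trick; it is a paradigm inflection point

If you read this paper as merely "yet another long-context optimization," you will miss its real significance.

All the "test-time compute" people have been talking about over the past year (OpenAI o1, DeepSeek-R1, long chain-of-thought reasoning, multi-round self-reflection) essentially spends more compute at the moment of prediction. The model thinks a few seconds longer while talking to the user and emits a few thousand more tokens of reasoning trace in exchange for a more accurate answer.

The hidden cost of this route is borne by users: latency. When you watch o1 spin for ages before producing an answer, that spinner is your compute bill.

This paper proposes another line: consolidation-time compute. Compute is spent not while the user is waiting, but in the gaps while context is still being read.

The essential difference between the two:

- Test-time compute: compute = user waiting time. Every extra second of thinking is an extra second of waiting.
- Consolidation-time compute: compute = the time the model spends "digesting" information. The user sees nothing and only notices that the answers are more reliable.

You can think of it this way:

Someone who sinks into long thought when you ask a question is doing test-time compute. The same person who studied the material the night before is doing consolidation-time compute.

Both are "more compute," but you know which one you would rather work with.

One level deeper: sleep time is not "the model is idle anyway, so let it spin." It is necessary working time.

Research on sleep deprivation has built up considerable depth in biology. In Why We Sleep, Matthew Walker gives a sobering figure: people who go 18 hours straight without sleep have reaction times close to those of people with a blood alcohol concentration of 0.05%. Their brains aren't "full"; their brains simply haven't had a chance to tidy up.

We are wearing down our AI in the same way. We stuff it with longer and longer contexts, demand that it digest everything in one go, have it answer in a single forward pass, and then wonder "why can't this model with its advertised million-token context manage even 8-hop reasoning?"

It fails not because it isn't big enough. It fails because we never let it close its eyes.

## A view of intelligence contaminated by work ethic

At this point I want to pause and say something less technical.

The ML industry has a very deep, almost never spoken implicit assumption: compute spent outside inference is waste.

So we make models bigger and bigger, ever better at hitting the answer in one forward pass. We hype "zero-shot," we hype "in-context learning," and we hold an almost religious fondness for the idea that "a model can solve new tasks without training." The subtext: good intelligence = a hit on the first shot.

But that is not what biology tells us.

The most complex cognitive system, the human brain, spends a third of its time "not responding to external stimuli." During that time it cannot eat, flee, learn new things, or mate. If the essence of intelligence were "getting everything done in a single forward pass," evolution should have eliminated sleep long ago.

It didn't. Every animal with a brain sleeps, from fruit flies to whales. Sleep is not a bug, it is a feature, and one of the most irreplaceable features in cognitive architecture.

The reason we have kept overlooking this may not be technical. It may be cultural.

24/7 always-on is the work ethic Silicon Valley has sold to the world. We installed it by default in our picture of intelligent systems. When we build chatbots we want them "responsive at all times." When we build agents we want them "continuously online." When we evaluate LLMs, almost no metric cares whether a model "needs offline time to consolidate."

Then we hit an invisible wall (hybrid models collapsing on long context, agents collapsing on long reasoning chains, every frontier model starting to drift on genuinely deeper tasks) and we keep adding compute in the same direction.

What this paper offers is not just a new algorithm. It offers a dimension we have collectively ignored:

Intelligence is not only "how smart you are while awake." It also includes "whether you can organize information well when you are allowed to go offline."

It is a somewhat uncomfortable view, because it implies that the truly strong LLMs of the future may not be the always-online kind but the kind with waking phases, sleeping phases, and dreaming phases. At certain moments they will be "unresponsive to the outside world," and in exchange give more reliable answers.

It sounds like science fiction. But in fact it has been built. These researchers at CMU and UMD have already gotten it to run.

## Closing

The paper's method is far from mature. Training cost grows linearly with N. It cannot be fully parallelized along the sequence dimension. The paper itself lists a pile of limitations.

But it points in a direction I think people will keep coming back to.

If you have watched LLM scaling over the past two years, you will have noticed the frontier quietly shifting from "bigger models" to "spending compute more intelligently." We already know that compute spent in pretraining buys capability, and compute spent in inference buys reasoning. This paper adds a third drawer: compute spent in sleep buys depth.

If this path is validated (I will keep a close eye on the follow-ups), the training paradigm of the future may no longer be a continuous stream of forward passes but a wake → sleep → wake → sleep rhythm.

The first page of the AGI training manual may no longer read "how to scale parameters" but "how to design a wake-sleep cycle."

At that moment our definition of intelligence will take one more step back, and one step closer to life.

Next time someone tells you their model does poorly on long context, you can ask:

"Did you let it sleep?"

Source: Language Models Need Sleep · alphaXiv 2605.26099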

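Summary: Research from CMU and UMD argues that the bottleneck for current long-context LLMs (such as Mamba, Jet-Nemotron, and Qwen3.5) is not memory capacity but insufficient "consolidation compute." The paper Language Models Need Sleep proposes mimicking the hippocampal replay of human sleep: before the context is cleared, the model's fast weights are updated iteratively over N forward passes to improve reasoning. Experiments show the mechanism markedly improves performance on tasks such as the Rule 110 cellular automaton and multi-hop graph retrieval, without adding inference latency.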