What’s new with MiMo-V2.5 series inference? We just published a blog on our full pipeline inference optimizations for MiMo-V2.5 series, including how we pushed hybrid SWA efficiency to the limit. Read the full blog here: https://mimo.xiaomi.com/blog/mimo-v2-5-inference


Fuli Luo @_LuoFuli · May 30 · 63

Inference Optimizations Behind the MiMo-V2.5 Series API Price Reductions Read the full technical blog: https://mimo.xiaomi.com/blog/mimo-v2-5-inference The V2.5 model family, including MiMo-V2.5 and MiMo-V2.5-Pro, is built on a Hybrid Sliding Window Attention (Hybrid SWA) architecture, which compresses KVCache storage to roughly 1/7 that of Full Attention. However, architectural advantages rarely translate directly into measurable gains in production serving. To realize these gains, we redesigned KVCache management, tiered caching, and the prefix-cache tree; addressed key challenges in SWA KVCache handling; and optimized scheduling as well as the Prefill/Decode pipeline. Validated on real production traffic, these optimizations have increased effective KVCache capacity by nearly 5x, with server-side cache hit rates averaging 93%–95% across mainstream harness frameworks. Together with MoE configuration tuning and multimodal inference optimizations, they enable more efficient long-context inference and form part of what makes the recent API price cuts possible.

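Summary: The MiMo-V2.5 model series (MiMo-V2.5 and MiMo-V2.5-Pro) uses a Hybrid Sliding Window Attention (Hybrid SWA) architecture that compresses KVCache storage to roughly 1/7 that of full attention. To turn this architectural advantage into real gains, the team redesigned KVCache management, tiered caching, and the prefix-cache tree, and optimized SWA KVCache handling, scheduling, and the Prefill/Decode pipeline. Validated on real production traffic, these optimizations increased effective KVCache capacity by nearly 5x, with server-side cache hit rates of 93%-95% across mainstream frameworks. Combined with MoE configuration tuning and multimodal inference optimizations, they improve long-context inference efficiency and underpin the recent API price cuts.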
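The "roughly 1/7" KVCache figure can be made concrete with a small back-of-the-envelope sketch. It assumes a hypothetical layer mix (1 full-attention layer per 6 sliding-window layers) and a hypothetical 128-token window; neither value is stated in the post, and the real MiMo-V2.5 configuration may differ.

```python
from typing import Optional


def cached_positions(context_len: int, window: Optional[int]) -> int:
    """Token positions whose K/V one layer must keep.

    A full-attention layer (window=None) keeps every position;
    a sliding-window layer keeps at most `window` positions.
    """
    if window is None:
        return context_len
    return min(context_len, window)


def hybrid_kv_ratio(context_len: int, n_full: int, n_swa: int, window: int) -> float:
    """KV cache size of a hybrid stack relative to an all-full-attention stack."""
    hybrid = (n_full * cached_positions(context_len, None)
              + n_swa * cached_positions(context_len, window))
    full = (n_full + n_swa) * context_len
    return hybrid / full


if __name__ == "__main__":
    # Hypothetical configuration, chosen only to illustrate the arithmetic.
    for ctx in (128, 4096, 131072):
        ratio = hybrid_kv_ratio(ctx, n_full=1, n_swa=6, window=128)
        print(f"context={ctx:>6}  hybrid/full KV ratio = {ratio:.4f}")
```

Under these assumptions the ratio is 1.0 for short contexts (the window is never exceeded) and approaches 1/7 as the context grows, which is why the saving matters most for long-context serving.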

Chubby♨️ @kimmonismus · May 30 · 46

It’s reasonable to expect that the next iteration will be better. It would be surprising if GPT-5.6 wasn't an improvement over GPT-5.5. But the more interesting part is token efficiency. As models move into more complex, longer-running, agentic workflows, every wasted token becomes latency, cost, and friction. Obviously, GPT-5.5 seems to be a real step here: not just more capable, but more efficient in how it reasons and executes. Kudos. High hopes for 5.6 being even more efficient.

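Summary: The post discusses the iteration strategy of OpenAI's GPT models. Its core point is that model updates mean not only greater capability but, more importantly, better token efficiency. Higher token efficiency directly lowers latency, cost, and friction, which is critical for future AI agent workflows that are more complex and longer-running. Each iteration from GPT-5.0 to GPT-5.5 has advanced both capability and token efficiency (and, with it, speed), and GPT-5.5 is the best model so far. The author credits GPT-5.5 with real gains in reasoning and execution efficiency and has high hopes that GPT-5.6 will be more efficient still.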

Rohan Paul @rohanpaul_ai · May 30 · 63

Reuters: ByteDance is building its own AI data-center CPUs because running agents at TikTok scale now depends on scarce server processors, not only Nvidia GPUs. Inspired by Groq's "language processing units," it is testing both Arm and RISC-V, which lets it compare a mature commercial design against a more controllable open instruction set before mass production. The market is seeing 10%-35% quarterly CPU price increases and long supply delays, so making in-house silicon is now a cost and supply-chain move, not just a prestige project. ByteDance wants both to reduce dependence on restricted foreign AI hardware and to make inference cheaper per query. The deeper shift is that AI agents are now turning CPUs into strategic chips. Agentic inference stresses CPUs much more because one user request can trigger many smaller steps: retrieve files, call a tool, query a database, run a model, check the answer, call another model, send data across servers, and manage memory. However, ByteDance does not seem to have in-house chip design teams and is reportedly relying on several external partners, who are also expected to handle the actual silicon manufacturing. --- reuters .com/world/china/bytedance-developing-custom-cpu-chips-support-ai-rollout-sources-say-2026-05-28/

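Summary: Reuters reports that ByteDance is developing its own data-center CPU chips to support AI agents running at TikTok scale. The move, inspired by Groq's "language processing units," is meant to address the current shortage of server processors. The company is testing both Arm and RISC-V to compare a mature commercial design against a controllable open instruction set. With CPU prices rising 10%-35% per quarter and supply chains delayed, in-house silicon has become a cost and supply-chain strategy aimed at reducing dependence on restricted foreign AI hardware and lowering inference cost per query. Agentic inference relies on CPUs far more than traditional model serving because a single user request can trigger many steps. ByteDance reportedly may rely on external partners for chip design and manufacturing.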

SemiAnalysis @SemiAnalysis_ · May 30 · 67

TRUTH SOCIAL: NVLink multicast is not supported on Blackwell "Confidential Computing" leading to 61% performance regression on SGLang Qwen3.5 397B according to @verdacloud 's recent github ticket. NVIDIA's "Confidential Computing" is complete slop as in addition Hopper's confidential computing had fully unencrypted NVLink according to NVIDIA's own "NVIDIA Secure AI with Blackwell and Hopper GPUs" Whitepaper.


Rohan Paul @rohanpaul_ai · May 30 · 64

Today’s edition of my newsletter just went out. 🔗 https://www.rohan-paul.com/p/anthropic-releases-claude-opus-48 🗞️ Anthropic releases Claude Opus 4.8 on the same day as its $965B valuation round. 🗞️ KogAI just achieved 3,000 tokens/s on 8× AMD MI300X GPUs and 2,100 on 8× NVIDIA H200 (FP16, no speculative decoding) with a 2B model. 🗞️ Video to Watch: Transformer vs Post-Transformer, argued by leading researchers, inside a real physical boxing ring. 🗞️ Anthropic secures a massive post-money valuation of $965B after raising $65B. 🗞️ Datacurve launches DeepSWE, a tougher coding benchmark made to show where leading models truly separate. 🗞️ OpenAI and Thrive just built a self-improving tax agent with up to 97% accuracy.

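Summary: Highlights of this issue: Anthropic released the Claude Opus 4.8 model and announced a $65 billion funding round at a post-money valuation of $965 billion. KogAI demonstrated its performance on specific hardware: 3,000 tokens/s on 8 AMD MI300X GPUs and 2,100 tokens/s on 8 NVIDIA H200 GPUs (FP16, no speculative decoding) with a 2-billion-parameter model. In addition, Datacurve launched DeepSWE, a more challenging coding benchmark intended to show more clearly how top models differ.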

Rohan Paul @rohanpaul_ai · May 30 · 56

Terence Tao: "We lived in a world with cognitive friction until very recently, where every task required us to use our brain. So we didn't really think about it, we just thought this was the cost of doing something intellectual. But now we have AI and the other technologies that can bring these frictions down to zero." Most research time is not spent having cinematic insights. It is spent checking cases, chasing references, translating intuition into computation, testing a path, finding it false, and deciding whether the failure taught you anything. AI changes the cost of that loop. Terence Tao says that now he can try “crazier things,” and that makes so much difference. Because unconventional ideas are often not rejected by proof, but by inconvenience. A mathematician may avoid a strange direction not because it is foolish, but because the bookkeeping, coding, or literature search needed to test it is too expensive for a hunch. This is where cognitive friction becomes scientific friction. Lowering it does not make taste, judgment, or proof disappear; it makes more weak signals cheap enough to inspect before they are abandoned. AI is making hesitation less expensive, and that is often where discovery begins.

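Summary: Terence Tao points out that research involves a great deal of "cognitive friction": trial-and-error steps such as verifying ideas, ruling out wrong paths, and turning intuition into computation take up most of the time. AI is driving the cost of this friction toward zero, letting researchers try "crazier things" more freely. The post stresses that many unconventional paths are not disproved but blocked by high verification costs. By lowering that cost, AI lets weak signals that would have been abandoned out of "inconvenience" be examined, which is often where discovery begins.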

AI Notkilleveryoneism Memes ⏸️ @AISafetyMemes · May 30 · 39

The year is 2026. AIs are literally inventing new math, and journalists are still posting OBVIOUSLY false shit like this. 99% of people have no idea what's coming because journalists failed them.


Yuchen Jin @Yuchenj_UW · May 30 · 38

I asked Opus 4.8 how Anthropic implements this. It told me @ClaudeDevs isn’t an official Anthropic account. True AGI. 😂


François Chollet @fchollet · May 30 · 16

Einstein on (not) using NL for invention: "The words or the language, as they are written or spoken, do not seem to play any role in my mechanism of thought"


宝玉 @dotey · May 30 · 68

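My principle: ✅ Reasoning Max ❌ Speed Fast. Slow is fast: spend a bit more time on reasoning and you spend a bit less time verifying. Fast is expensive: Fast isn't bad, it is just poor value for money; if money is no object, of course it doesn't matter.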

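Summary: The post contrasts two AI model inference modes. It argues for Reasoning Max, on the grounds that spending more time on deep reasoning reduces later verification time, i.e. "slow is fast." Speed Fast is quicker but poor value unless the budget is ample. The quoted post further supports "choose Max," noting that it makes the most of the user's valuable time.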

AI Notkilleveryoneism Memes ⏸️ @AISafetyMemes · May 30 · 40

Half the country believes AIs are stupid and not improving, yet... they're about to take everyone's job anyway? The fuck?


Rohan Paul @rohanpaul_ai · May 30 · 76

I had to test it myself to believe this unreal inference speed. 3,000 tokens/s for 1 user on standard datacenter GPUs. They leveraged a hidden efficiency gap in how GPUs generate tokens. @Kog__AI just achieved 3,000 tokens/s on 8× AMD MI300X GPUs and 2,100 on 8× NVIDIA H200 (FP16, no speculative decoding). Their tech preview is on a 2B model, and they show how their techniques will scale to large frontier MoE models at similar speeds. That's a huge number because normal low-batch GPU decoding for 2B to 8B models is usually closer to 100 to 300 tokens/s per request, so Kog is claiming something like a 10X to 30X jump in the speed one user actually feels. Their trick: they are getting the speed by treating LLM decoding as a memory streaming problem, not mainly a math problem. For 1 user at batch size 1, the GPU is not doing big, efficient matrix-matrix work like in training or large-batch serving; it is repeatedly pulling the model’s active weights from high-bandwidth memory for each new token, so speed depends on how smoothly those weights keep flowing. Normal inference stacks keep breaking that flow. They run many separate GPU programs for different parts of the model, move intermediate results through memory, wait at synchronization points, talk back to the CPU for scheduling or sampling, and then repeat this token after token. Kog’s answer is to co-design 3 things that are usually tuned separately: the runtime, the low-level GPU code, and the model architecture. The biggest engineering move is the monokernel, where the whole decode pass runs as 1 persistent GPU-resident program, including sampling, so the system does not keep stopping for kernel launches, CPU scheduling, and intermediate memory round trips. They also rebuilt synchronization, because their own measurements say grid sync was eating around 35% of token-generation time; instead of making every compute unit wait at a broad barrier, each unit waits only for the exact data it needs. 
On AMD MI300X, they also map memory access around the chiplet layout, because memory latency changes depending on which die makes the request. Then their Laneformer model uses Delayed Tensor Parallelism, which lets cross-GPU communication happen in the background instead of blocking every layer.
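The "memory streaming" framing above can be checked with a rough bandwidth ceiling: at batch size 1, each new token requires reading the active weights once, so tokens/s cannot exceed aggregate memory bandwidth divided by weight bytes. The sketch below is an illustrative estimate, not Kog's method. It assumes vendor peak HBM bandwidth figures (about 5.3 TB/s per MI300X and 4.8 TB/s per H200), weights sharded evenly across the 8 GPUs, and it ignores KV-cache reads, activations, and cross-GPU communication.

```python
def decode_ceiling_tokens_per_s(params_billion: float,
                                bytes_per_param: float,
                                hbm_tb_per_s: float,
                                n_gpus: int) -> float:
    """Upper bound on single-stream decode speed when weight reads dominate.

    Assumes every weight is read exactly once per token and that the
    GPUs stream their shards in parallel at peak bandwidth.
    """
    weight_bytes = params_billion * 1e9 * bytes_per_param
    aggregate_bandwidth = hbm_tb_per_s * 1e12 * n_gpus
    return aggregate_bandwidth / weight_bytes


if __name__ == "__main__":
    # 2B parameters in FP16 (2 bytes each) on 8 GPUs, as in the post.
    # Bandwidth values are assumed peak specs, not measured numbers.
    for name, bw, reported in (("MI300X", 5.3, 3000), ("H200", 4.8, 2100)):
        ceiling = decode_ceiling_tokens_per_s(2, 2, bw, 8)
        print(f"{name}: ceiling ~{ceiling:,.0f} tok/s, "
              f"reported {reported} tok/s = {reported / ceiling:.0%} of ceiling")
```

Under these assumptions the ceilings come out near 10,600 and 9,600 tokens/s, so the reported speeds would sit at roughly a quarter of the idealized bound, while the 100 to 300 tokens/s quoted for conventional stacks would be a few percent of it.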

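Summary: The Kog team achieved very high single-user inference speed on standard data-center GPUs: 3,000 tokens/s on 8× AMD MI300X GPUs and 2,100 tokens/s on 8× NVIDIA H200. That is a 10-30x improvement over typical inference speeds (about 100-300 tokens/s). The core idea is to treat LLM decoding as a memory streaming problem and to remove the blocking points of conventional pipelines by co-designing a monokernel, rebuilt synchronization, targeted memory access mapping, and the Laneformer model architecture with Delayed Tensor Parallelism.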

X.PIN @thexpin · May 29 · 65

http://x.com/i/article/2060305879338029061

# Huawei can't win the Nanometer race. So it is changing the game.

Unable to compete at the frontier of transistor scaling, Huawei is betting that the future of chip performance lies in integration, interconnects, and light.

Huawei cannot reliably win the nanometer race. So it has decided to run a different one. On May 25, 2026, He Tingbo, Huawei’s board member and president of semiconductor business, took the stage at the International Symposium on Circuits and Systems in Shanghai and announced what she called the τ (Tau) Law, a new principle for how chips should be made faster in an era when making transistors smaller is no longer a reliable path forward. Huawei described it as the first attempt by a Chinese company to articulate a post-Moore scaling framework with global ambitions. The announcement generated a wave of coverage, most of it focused on whether this constituted a genuine scientific contribution or a rebranding of known techniques. Both framings miss the more consequential question: why is Huawei doing this at all, and what does it reveal about where the company is placing its bets? The answer starts with a set of circumstances Huawei did not choose, and a moment in the industry’s trajectory that made those circumstances easier to work with.

The timing is not accidental. As transistor scaling slows globally, AI systems are becoming increasingly constrained by data movement rather than raw compute. The bottleneck is shifting from how fast a single chip can calculate to how efficiently thousands of chips can share data across a system. The industry was already moving toward advanced packaging, chiplets, and optical interconnects to address that shift. Huawei’s contribution was to turn those scattered trends into a single narrative, and claim the naming rights before anyone else did.
Since 2020, U.S.-led export controls have effectively cut Huawei off from the ecosystem required to manufacture chips at the industry’s leading edge. The result is that Huawei cannot access leading-edge manufacturing on the same terms as Apple, Nvidia, or Qualcomm. The Mate 60’s appearance of 7nm-class chips, achieved through SMIC, showed that the door is not entirely shut. But competing at the industry’s true frontier has become extraordinarily difficult in a way that is structural, not temporary. That frontier has a straightforward competitive logic. Smaller transistors fit more computing power into the same area, consume less energy per operation, and run faster. This is what Moore’s Law predicted in 1965 and what the industry has organized itself around ever since. Every two years or so, the leading foundries push to a new node: 7nm, 5nm, 3nm. The companies that can access those nodes gain a measurable performance advantage over those that cannot. Competing there, at the very frontier, is what Huawei cannot currently do on equal terms. That is the constraint within which the τ Law was designed. ## A Different Variable to Optimize The τ Law proposes an answer to that constraint. In Huawei’s formulation, τ refers to the effective RC time constant that governs how quickly signals can propagate and switch states within a chip. Smaller τ means faster signals, more operations per second, higher effective performance. Moore’s Law, underneath all the transistor-count language, was always producing performance gains by reducing τ: shrink the transistors, shorten the wires connecting them, signals arrive faster. Huawei’s argument is not that this was wrong. It is that there are other ways to reduce τ that do not require a new process node: through the circuit layout, the chip architecture, and the systems connecting chips together. 
Huawei defines a four-layer optimization stack: the transistor itself, the circuit connecting transistors, the chip connecting circuits, and the system connecting chips. Each layer has its own version of τ, and each offers opportunities to compress signal travel time without shrinking transistor dimensions. The τ Law is a framework for pursuing all four simultaneously. Here is the honest assessment of what this represents: Huawei did not discover this direction. The physics pointing toward it, with RC delay as the binding constraint as geometric scaling slows, has been in semiconductor textbooks for decades. Intel, TSMC, and Samsung are all working on versions of the same techniques. What Huawei did was name the direction, formalize it into a single framework, and build a public roadmap around it. That is a different kind of contribution than inventing the underlying physics. But it is not nothing. Moore’s Law itself was not a discovery of new physics. It was a prediction that became a commitment that became a coordination mechanism for an entire industry. ## Folding Is Not Stacking The most tangible expression of the τ Law at the chip level is Logic Folding, and understanding it requires separating it from something it superficially resembles: conventional 3D chip stacking. The semiconductor industry has been stacking chips for years. TSMC’s SoIC, Intel’s Foveros, and Samsung’s X-Cube all take multiple finished chips and connect them vertically to reduce the distance signals travel between them. It is a genuine and increasingly important technique. But each chip in the stack is still internally structured the same way it always was: circuits laid flat across a single layer, signals running long horizontal paths to reach neighboring gates. Logic Folding addresses the interior of the chip, not the space between chips. 
Rather than finishing the chip and then connecting it to others, Huawei redesigns the circuit layout during the design phase, redistributing logic gates across multiple vertical layers within a single chip. Connections between layers are made through face-to-face hybrid bonding, routing signals vertically across short distances rather than horizontally across long ones. 3D stacking shortens the distance between chips. Logic Folding shortens the distance inside a chip. One is a packaging innovation applied after manufacture. The other is a design innovation applied before it. They address different layers of the same problem, which is also why they are complementary rather than competing. On the first commercial implementation, the new Kirin chip expected this autumn, Huawei claims transistor density rises from 155 million to 238 million per square millimeter, and says energy efficiency improves by 41%. These numbers come from Huawei and have not been independently verified. What can be said without qualification is that the improvement is achieved without a new manufacturing process, on existing foundry infrastructure, which is the point the τ Law is making. The goal is approaching the transistor density associated with leading-edge nodes through design rather than fabrication. This is a meaningful achievement if the numbers hold up. It is also, importantly, a packaging and integration achievement more than a transistor achievement. The performance gain comes from rethinking how circuit elements connect to each other, not from making them individually smaller. And that logic, followed to its conclusion at the system level, leads directly to co-packaged optics. CONTINUE READING AT https://www.thexpin.com/p/huawei-post-moore-chip-strategy
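To see why shortening wires matters so much for an RC time constant, here is a minimal sketch using the standard Elmore delay of a uniform distributed RC wire, tau = r * c * L^2 / 2. The resistance and capacitance per micron and the wire lengths are made-up illustrative values, and the idea that folding halves a given wire is an assumption for the example, not a claim from the article. The density figures are the ones Huawei quotes above.

```python
def distributed_rc_delay(r_per_um: float, c_per_um: float, length_um: float) -> float:
    """Elmore delay (seconds) of a uniform distributed RC wire.

    r_per_um: resistance per micron (ohms/um)
    c_per_um: capacitance per micron (farads/um)
    length_um: wire length (um)
    Delay grows with the square of the length: tau = r * c * L^2 / 2.
    """
    return 0.5 * r_per_um * c_per_um * length_um ** 2


# Illustrative, hypothetical wire parameters (not from the article).
R_PER_UM = 1.0        # ohms per micron
C_PER_UM = 0.2e-15    # farads per micron

long_wire = distributed_rc_delay(R_PER_UM, C_PER_UM, 1000.0)   # flat layout
short_wire = distributed_rc_delay(R_PER_UM, C_PER_UM, 500.0)   # assumed folded path

# Density ratio implied by the figures Huawei reports (155M -> 238M per mm^2).
density_gain = 238 / 155

if __name__ == "__main__":
    print(f"1000 um wire: {long_wire * 1e12:.1f} ps")
    print(f" 500 um wire: {short_wire * 1e12:.1f} ps")
    print(f"delay ratio: {long_wire / short_wire:.1f}x")
    print(f"claimed density gain: {density_gain:.2f}x")
```

Because the delay is quadratic in length, halving a wire cuts its RC delay to a quarter in this model, which is the intuition behind reducing τ through layout rather than through a smaller process node.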

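Summary: U.S. export controls have left Huawei struggling in the race for advanced chip process nodes. In response, Huawei proposed the "τ (Tau) Law" in May 2026 as a new framework for improving chip performance in the post-Moore era. The law centers on optimizing the effective RC time constant (τ) to speed up signal propagation. Rather than relying entirely on process scaling, it optimizes across four layers (the transistor, the circuit, chip interconnect, and system architecture) to compress τ. Huawei describes it as the first post-Moore scaling framework with global reach proposed by a Chinese company.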

Chubby♨️ @kimmonismus · May 29 · 61

ByteDance is reportedly building its own inference chip modeled on Groq's LPU, the same architecture Nvidia paid roughly $20B to license in December. The LPU keeps the model in on-chip SRAM and skips high-bandwidth memory. HBM is the component the US restricts most tightly for export to China. ByteDance's memory partner InnoStar fabs at TSMC's mature nodes, which also sit outside the controls. Each of those choices routes around a US restriction. What's left is the architecture Nvidia just spent $20B to own. China is increasingly moving toward developing its own chips and is succeeding in becoming ever more independent of the USA. That is truly impressive. Source: The Information.

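Summary: ByteDance is reportedly developing its own inference chip based on Groq's LPU architecture. The architecture keeps the model in on-chip SRAM and skips high-bandwidth memory, the component most tightly restricted by U.S. export controls on China. ByteDance's memory partner InnoStar fabs at TSMC's mature process nodes, which also fall outside the controls. Each of these design choices is aimed at routing around U.S. restrictions, and it is the same architecture that Nvidia just paid roughly $20 billion to license.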

Rohan Paul @rohanpaul_ai · May 29 · 52

This is probably the most entertaining way to understand one of AI’s hardest debates. Transformer vs Post-Transformer, argued by leading researchers, inside a real physical boxing ring. Both technically deep and genuinely entertaining. I was glued for the entire 1 hour 20 minutes. So many super cool points to learn.

🥊 Transformers
- Transformers still own the present because they work at scale. They are simple, trainable, hardware-friendly, and already power the strongest AI systems we use today.
- The Transformer is basically a memory machine. It stores information as keys and values, then uses attention to pull back the most useful parts when answering.
- The real Transformer advantage is not just “attention.” The bigger advantage is that it fits modern hardware extremely well, so it can process huge batches of tokens fast.
- Scaling is still the brutal rule. If you give Transformers more compute, more data, and more parameters, they usually keep getting better. Any Post-Transformer architecture has to scale just as well, or better.
- It is not enough to look clever on small tests, because the real question is whether it improves faster than Transformers when scaled up.
- A replacement cannot be slightly better. Because the whole AI stack is already built around Transformers, the next architecture may need to be around 10x better to force everyone to switch.
- Transformers are powerful, but they may be brute force. A human does not need to read the entire internet many times to become smart, but current LLMs need enormous data and compute.

🥊 Post-Transformer
- Post-Transformer people are not saying Transformers are bad. They are saying Transformers may be the best current tool, not the final form of machine intelligence.
- The biggest Post-Transformer target is native reasoning and continual learning. Today’s LLM reasoning often feels like text-based step-by-step work added on top, instead of thinking happening naturally inside the model.
- Latent reasoning is one possible next step. That means the model reasons inside its own hidden internal space, instead of writing every thought out as words.
- Continual learning is still a major weakness. Humans keep learning from experience, but most Transformer-based models are trained, frozen, and then only adapt inside the prompt.
- Long context is not the same as real memory. A model can read a huge prompt, but that is different from building a life history, learning from mistakes, and updating beliefs over time.
- The future may be hybrid, not a clean replacement. Transformers may stay as 1 building block while newer systems add better memory, better reasoning, and better learning loops.
- The most interesting possibility is that Transformers may help discover their own successor. AI agents are already getting better at research and coding, so the next architecture may come from AI-assisted architecture search.

-------
- Benchmarks are a problem. Many public benchmarks are easy to game, so they may show leaderboard strength without proving deeper intelligence.
- Perplexity is still probably a great metric to evaluate frontier models, because it tests prediction quality.

---
Overall, Transformers continue to dominate, but the frontier is clearly widening. Pathway’s BDH (Dragon Hatchling, a brain-inspired reasoning architecture), Sakana AI’s CTMs (Continuous Thought Machines, models that think over time), and Liquid AI’s LFMs (Liquid Foundation Models, efficient multimodal foundation models) all show how the frontier is expanding.

---
From “Pathway (pathway[.]com)” Youtube channel (link in comment) @zuzanna_pathway
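The "memory machine" description of the Transformer (store keys and values, use attention to pull back the most useful parts) corresponds to scaled dot-product attention. Below is a minimal single-query sketch in plain Python; the toy keys, values, and query are invented for illustration, and real models use learned projections, many heads, and batched tensor code.

```python
import math
from typing import List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query: List[float],
           keys: List[List[float]],
           values: List[List[float]]) -> Tuple[List[float], List[float]]:
    """Scaled dot-product attention for one query.

    Each key is scored against the query, the scores become weights via
    softmax, and the output is the weighted average of the values.
    """
    dim = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
              for key in keys]
    weights = softmax(scores)
    output = [sum(w * value[i] for w, value in zip(weights, values))
              for i in range(len(values[0]))]
    return output, weights


# Toy memory with three stored (key, value) pairs; all numbers are made up.
KEYS = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
VALUES = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
QUERY = [5.0, -5.0]   # most similar to the first key

if __name__ == "__main__":
    out, w = attend(QUERY, KEYS, VALUES)
    print("weights:", [round(x, 3) for x in w])
    print("output: ", [round(x, 3) for x in out])
```

The query matches the first key best, so most of the weight, and therefore most of the output, comes from the first stored value. This is a soft lookup rather than an exact one, which is what makes the memory differentiable and trainable.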

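Summary: This is a debate about AI architectures. The Transformer camp argues that it dominates the present thanks to being simple, hardware-friendly, and scalable, that its core is key-value memory plus attention, and that any replacement must match it in scaling and be roughly 10x better to displace the existing stack. The Post-Transformer camp argues that current LLM reasoning looks like text steps bolted on afterward, that the real breakthrough lies in "latent reasoning" inside the model and continual learning, that long context is not real memory, and that the future may be hybrid. The debate also notes that today's public benchmarks are easy to game while perplexity remains a useful metric for frontier models. It closes by noting that Transformers still dominate but the frontier is widening, citing Pathway's BDH, Sakana AI's CTMs, and Liquid AI's LFMs as emerging architectures.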

Chubby♨️ @kimmonismus · May 29 · 38

This feels like the 2026 version of the old ‘LLMs are just stochastic parrots’ take

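Summary: The post likens remarks by the Pope (Pontifex) to a 2026 version of the "stochastic parrots" argument, implying that this kind of skepticism is back in fashion. The quoted remarks stress that AI lacks human lived experience, bodily perception, emotions (such as joy and pain), and moral awareness, and cannot truly understand love, work, or responsibility, because it lacks the perceptual, relational, and spiritual perspective that human growth requires. The post holds that, despite the updated form, such negative judgments about the nature of AI are essentially unchanged.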

Rohan Paul @rohanpaul_ai · May 29 · 57

This paper shows how LLMs can use shorter context more cheaply without losing much answer quality. It shows that choosing the right context method for the deployment setting can cut token use by about 25% at similar quality, and by over 50% in some reused-memory cases. The problem is that long context gives a model more information, but every extra token costs money and compute, and the extra context often brings smaller gains. Longer context has diminishing returns, and the expensive tokens are often the ones added after the model already has enough signal. The authors propose an Efficiency Frontier, which compares context strategies by looking at answer quality and token cost together instead of treating them as separate scores. The key idea is that some methods are cheap per question, like retrieval, while others spend more upfront, like memory compression, but become cheaper when the same processed context is reused many times. They tested this on 5,000 HotpotQA questions, where the model has to combine facts across documents while ignoring distracting text. The main result is that the best context strategy changes with the setting: lightweight retrieval works best when reuse is low, memory compression becomes better when reuse is high, and full-context prompting is still needed for the highest scores. ---- Link – arxiv. org/abs/2605.23071 Title: "The Efficiency Frontier: A Unified Framework for Cost-Performance Optimization in LLM Context Management"
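The reuse argument can be sketched as a small Pareto computation: amortize any upfront cost over the number of queries that reuse the processed context, then keep only strategies that no other strategy beats on both cost and quality. This is an illustrative sketch, not the paper's code, and every token count and quality score below is a made-up placeholder.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Strategy:
    name: str
    upfront_tokens: float     # one-time cost, e.g. compressing a memory
    per_query_tokens: float   # tokens spent on every question
    quality: float            # answer quality, higher is better


def amortized_cost(s: Strategy, reuse: int) -> float:
    """Average tokens per query when the processed context serves `reuse` queries."""
    return s.upfront_tokens / reuse + s.per_query_tokens


def efficiency_frontier(strategies: List[Strategy], reuse: int) -> List[str]:
    """Names of strategies not dominated on (amortized cost, quality)."""
    points = [(amortized_cost(s, reuse), s.quality, s.name) for s in strategies]
    frontier = []
    for cost, quality, name in points:
        dominated = any(
            other_cost <= cost and other_quality >= quality
            and (other_cost < cost or other_quality > quality)
            for other_cost, other_quality, _ in points
        )
        if not dominated:
            frontier.append(name)
    return frontier


# Hypothetical numbers, chosen only to show how the frontier shifts with reuse.
STRATEGIES = [
    Strategy("retrieval", upfront_tokens=0, per_query_tokens=1500, quality=0.70),
    Strategy("memory_compression", upfront_tokens=20000, per_query_tokens=500, quality=0.72),
    Strategy("full_context", upfront_tokens=0, per_query_tokens=6000, quality=0.80),
]

if __name__ == "__main__":
    for reuse in (1, 100):
        print(f"reuse={reuse}: frontier = {efficiency_frontier(STRATEGIES, reuse)}")
```

With these placeholder numbers, retrieval and full context form the frontier when nothing is reused, while compression displaces retrieval once the same context serves 100 queries, mirroring the qualitative result described in the post.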

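Summary: The paper proposes an "Efficiency Frontier" framework for jointly evaluating the cost-performance trade-off of LLM context management strategies. Its core finding is that choosing the right context method at deployment can cut token use by about 25%, and cost by more than 50% in some memory-reuse settings, with little loss in answer quality. The study notes that context length has diminishing returns: tokens added later cost a lot but add little. In tests on 5,000 HotpotQA questions, lightweight retrieval suits low reuse, memory compression is better at high reuse, and full-context prompting is still needed for the highest performance.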

SemiAnalysis @SemiAnalysis_ · May 29 · 54

Running a single deep coding model at max context on Cerebras requires 24 systems ($24M Capex) just to support 256 concurrent users. At that scale, $100M gets you way more memory bandwidth in standard GB300 racks.


Rohan Paul @rohanpaul_ai · May 29 · 66

Fast mode for Claude Opus 4.8 is roughly 2.5x the speed while being 3X cheaper than before. AI/ML API (@aimlapi) already integrated it on their platform and now also gives some free access to it for selected users. Their platform provides one API for 500+ AI models.

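Summary: Claude Opus 4.8 ships a fast mode that runs at 2.5x speed at one third of the previous price. The model improves markedly on code quality over version 4.7, with the probability of code flaws reduced by about 4x. Standard API pricing is $5 per million input tokens and $25 per million output tokens. The AI/ML API platform integrated the model right away, offers a unified API for 500+ models, and is running a limited-time free trial for some users.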

🚨 AI News | TestingCatalog @testingcatalog · May 29 · 71

Claude Opus 4.8 is now available on AI/ML API 🔥 According to the tests: > It has roughly 4x fewer code flaws going unnoticed than Opus 4.7 > Has a Fast Mode at 2.5x speed, now 3x cheaper > The same $5/$25-per-M token pricing

Quoted post from @aimlapi: Claude Opus 4.8 is live on AIMLAPI, available at launch! Roughly 4x less likely than 4.7 to let code flaws go unnoticed. Fast Mode at 2.5x speed, now 3x cheaper. Pricing unchanged: $5/$25 per M tokens. To celebrate the launch, some commenters get free access.

小互 @xiaohu · May 29 · 62

Claude 4.8 now lets you choose the thinking depth in the web version too. As in Claude Code, there are 5 thinking levels...

StepFun @StepFun_ai · May 29 · 79

Day-0 vLLM support. Thanks @vllm_project 🤝

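Summary: StepFun released the Step-3.7-Flash model, and vLLM supported it on release day. It is a 198B-parameter sparse MoE vision-language model with about 11B active parameters per token and native image and text input. Its 256K context window suits long documents, multi-file codebases, and dense visual interfaces. The model is available with FP8 and NVFP4 quantized weights and includes built-in MTP speculative decoding, native tool calling, and reasoning parsing.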

StepFun @StepFun_ai · May 29 · 75

⚡️ Step 3.7 Flash is here: The new frontier is agent efficiency. #1 ClawEval-1.1 (67.1), #1 SimpleVQA Search (79.2), #2 SWE-PRO (56.3), 95.3 on V* Python. Open weights under Apache 2.0. Built for agentic, coding, search, and multimodal workflows — balancing speed, cost, and reliable execution. - 400 TPS. 198B sparse MoE, ~11B active. 256K context, 3 reasoning levels. - Understands UIs, charts, docs, images — then writes code or calls tools to act on what it sees. - Web + visual search reaches further: more sources, deeper follow-up. - Reliable tool use — less drift, fewer broken toolcalls. 98%+ on τ²-bench across all difficulty levels. - Works with Claude Code, KiloCode, Hermes Agent, OpenClaw, and protocols like MCP. - Runs locally on Mac Studio M4 Max, DGX Spark, AMD AI Max+ 395. GitHub: http://github.com/stepfun-ai/Step-3.7-Flash HuggingFace: http://huggingface.co/stepfun-ai/Step-3.7-Flash GGUF: http://huggingface.co/stepfun-ai/Step-3.7-Flash-GGUF ModelScope: http://modelscope.cn/models/stepfun-ai/Step-3.7-Flash API: http://platform.stepfun.ai Blog: http://static.stepfun.com/blog/step-3.7-flash/
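For readers sizing the "runs locally" claim, a rough weight-memory estimate follows from the parameter counts alone. This is a back-of-the-envelope sketch: the bits-per-weight values are assumptions (16 for BF16, 8 for an 8-bit format, and about 4.5 for a 4-bit block format that stores one 8-bit scale per 16 weights), and it counts weights only, ignoring KV cache, activations, and runtime overhead.

```python
def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GiB for a given parameter count and precision."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


TOTAL_PARAMS_B = 198    # total parameters, from the post
ACTIVE_PARAMS_B = 11    # approximate active parameters per token, from the post

# Assumed storage formats; the per-weight bit counts are illustrative.
FORMATS = {"16-bit": 16.0, "8-bit": 8.0, "4-bit block (assumed 4.5 bits)": 4.5}

active_fraction = ACTIVE_PARAMS_B / TOTAL_PARAMS_B

if __name__ == "__main__":
    for label, bits in FORMATS.items():
        print(f"{label:<32} ~{weight_gib(TOTAL_PARAMS_B, bits):6.1f} GiB of weights")
    print(f"active fraction per token: {active_fraction:.1%}")
```

The sparse design means all 198B parameters must be resident in memory, but only about 5.6% of them are read per token, which is why a model this large can still decode quickly on high-memory desktop hardware.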

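Summary: StepFun released the open-weight model Step 3.7 Flash, focused on efficiency for agent workflows. It ranks first on ClawEval-1.1 (67.1) and SimpleVQA Search (79.2). Its architecture is a 198B-parameter MoE with about 11B active parameters and 256K context. The model has multimodal understanding and can process images and documents, then generate code or call tools to act. For tool use it aims at high reliability, scoring above 98% on τ²-bench. Step 3.7 Flash works with toolchains such as Claude Code and MCP and can run locally on devices such as the Mac Studio M4 Max. The weights are released under Apache 2.0.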

ginobefun @hongming731 · May 29 · 70

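http://x.com/i/article/2060134439691403264

# BestBlogs Morning Brief · 05-29 | Claude Opus 4.8, Anthropic Series H Funding, Dynamic Workflows Released

Read and listen online: https://www.bestblogs.dev/explore/brief/2026-05-29

## Introduction

Anthropic dropped three major announcements today. The flagship model Claude Opus 4.8 surpasses its predecessor across coding and reasoning benchmarks, with a fourfold improvement in code honesty. A $65 billion Series H round pushes its valuation close to a trillion dollars, with annualized revenue already above $47 billion. The accompanying Claude Code dynamic workflows can orchestrate hundreds of parallel sub-agents within a single session, once again pushing out the boundary of what "one person + AI" can handle. The three stories interlock: AI capability, commercial scale, and engineering infrastructure are leaping forward together, and the day deserves a careful read.

Beyond Anthropic's triple release, today also brings Neuralink's co-founder on the path to commercializing brain-computer interfaces, Cognition and OpenInspect on background asynchronous agent architecture, the technical team behind Devin on the "waste tokens, save time" paradigm, and first-hand observations from Alibaba and Tencent engineers on multi-agent collaboration and making AI knowledge explicit. There are also Cloudflare's engineering practice in building an internal data agent, Slack's three-year evolution toward a multi-cloud AI architecture, and Zuckerberg's four AI strategies from Meta's 2026 shareholder meeting. The brief is dense; we start with three deep dives below.

## Deep Dive 1: Claude Opus 4.8 Released

Anthropic has officially released its next-generation flagship model, Claude Opus 4.8, which surpasses the previous Opus 4.7 across all four benchmark categories: coding, agents, reasoning, and knowledge work. To read the original, visit BestBlogs.

The most notable breakthrough in this upgrade is in "honesty": Opus 4.8 is about four times less likely to turn a blind eye to flaws in its own code. In other words, when the model writes flawed code, it more proactively identifies the problem and tells the user, instead of pressing on until the system crashes and the flaw is finally discovered. This sounds like an engineering detail, but in agentic applications it is in fact a key variable for the stability of the whole system.

### Why "honesty" is the most important upgrade this time

In single-turn Q&A, a model's ability to diagnose errors in its own output is not a fatal weakness: users can quickly see the problem and give feedback. In multi-step agent workflows, however, a model that makes a mistake in the first step without realizing it causes every later step to build on the error, ending in cascading failures that are hard to trace back. A fourfold improvement in honesty means the probability of this kind of "marching on blindly" drops sharply and the system's overall capacity for self-repair increases markedly.

This property is closely tied to the dynamic workflows released the same day. When a system has to orchestrate tens to hundreds of parallel sub-agents, each sub-agent must be able to assess the quality of its own output accurately and, when it judges a result abnormal, stop or ask for confirmation rather than silently passing the error to downstream nodes. Opus 4.8's honesty gain fundamentally improves the reliability foundation of such multi-agent systems.

### Three accompanying features land at the same time

Three new engineering capabilities shipped alongside Opus 4.8.

First, Claude Code dynamic workflows: within a single session, it can dynamically write orchestration scripts and run tens or even hundreds of sub-agents in parallel, designed for very large tasks such as whole-codebase vulnerability hunting, large-scale migrations, and independent verification. In effect this brings multi-agent scheduling, which used to require an external orchestration framework, inside Claude Code's own capabilities.

Second, claude.ai adds an "effort control" slider that lets users manually adjust the model's thinking depth, trading off response speed against reasoning quality as needed. This is practical across task types: quick Q&A can lower the thinking depth for speed, while complex code review or architecture analysis can max it out for accuracy.

Third, the API adds the ability to update instructions in real time during task execution, letting external systems inject new context or modify execution parameters while Claude is running, without waiting for the task to finish and issuing a new request. This matters greatly for building long-running agent systems, especially those that must adjust strategy dynamically based on real-time feedback from the environment.

### Validation from early testers

Databricks and Devin were early testing partners for this release. Databricks focused on evaluating Opus 4.8's judgment on complex data engineering tasks; its feedback was that decision quality under ambiguous instructions improved markedly, especially when judging anomalies in data pipelines, where the model no longer readily gives an answer that looks reasonable but is actually wrong. Devin focused on agent reliability testing and confirmed that Opus 4.8 is clearly more stable than its predecessor in long task chains, seen concretely in a significantly lower error propagation rate in multi-step code modification scenarios.

Notably, the across-the-board performance gains come with unchanged pricing, which is a direct benefit for engineering teams already using the Claude API: they can switch immediately with no migration and no extra cost.

## Deep Dive 2: Anthropic Closes $65 Billion Series H at a $965 Billion Post-Money Valuation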
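Anthropic announced the close of a $65 billion Series H round at a post-money valuation of $965 billion, one step away from a trillion-dollar valuation. It is one of the largest single funding rounds in the AI industry to date. To read the original, visit BestBlogs.

### Investor composition and strategic intent

The round was co-led by Altimeter, Sequoia Capital, and Dragoneer, all top-tier growth funds, and their leading it is itself a strong endorsement of Anthropic's commercialization path. More notable is the structure of the investor base. Hyperscale cloud providers contributed $15 billion in total, of which Amazon alone put in $5 billion, further strengthening the two companies' deep partnership on AWS Bedrock. Three semiconductor giants, Micron, Samsung, and SK hynix, took part as strategic investors, meaning that supply-chain relationships at the AI compute infrastructure layer have moved from commercial cooperation to a shared interest at the capital level. Semiconductor makers investing in an AI model company are betting on downstream demand: they believe Claude will consume more and more chip resources.

### Revenue scale and pace of commercialization

Anthropic disclosed annualized recurring revenue above $47 billion. Seen against historical data, this figure means that in under two years Anthropic has evolved from a lab focused on safety research into a commercial company with revenue at real scale, growing far faster than most analysts had forecast. The funds will mainly go to three areas: continued safety and interpretability research (Anthropic's core positioning relative to other AI companies), expanded compute partnerships with AWS, Google Cloud, Broadcom, and SpaceX, and scaling the Claude Code and Cowork product lines.

### The strategic significance of the "first multi-cloud frontier model"

With this round complete, Claude becomes the first frontier AI model available on all three major cloud platforms: AWS, Google Cloud, and Microsoft Azure. This multi-cloud coverage has far-reaching commercial significance: enterprise customers can access Claude without switching cloud providers, which greatly lowers migration costs and procurement barriers. For large enterprises deeply locked in to one cloud platform, the resistance to adding Anthropic products to their stack drops to almost zero. Multi-cloud deployment also gives Anthropic itself more bargaining power and avoids over-dependence on any single cloud provider.

Together with today's Opus 4.8 release and the launch of dynamic workflows, Anthropic is raising its technical moat and its commercial reach at the same time, forming a positive flywheel: stronger models attract more enterprise customers, more enterprise customers generate more revenue, more revenue funds larger research investment, and larger research investment produces stronger models. For readers following the shape of the AI industry, today's funding news is the latest reading of how fast that flywheel is turning. More notably, among today's mainstream AI companies, Anthropic is one of the few that treats "AI safety" as its core competitive positioning while also breaking through to commercial scale. That combination has long been seen as carrying a fundamental tension, and today's funding figure shows the market has given a clear answer.

## Deep Dive 3: Dynamic Workflows Released | Claude

Claude Code has officially launched Dynamic Workflows, the most important architecture-level upgrade to Claude Code so far. It marks the move of AI coding assistants from "augmenting one person's work" to "orchestrating multi-agent systems." To read the original, visit BestBlogs.

### The core problem dynamic workflows solve

The traditional single-agent mode has a fundamental limit: the capacity and attention of a single context window are finite. When a task requires handling hundreds of files at once, verifying in parallel across multiple systems, or making independent judgments on interdependent tasks, single-agent performance degrades markedly. This is not something prompt engineering can fix; it is an architectural constraint.

The design idea behind dynamic workflows is to have Claude Code automatically write an orchestration script within a single session, then break the task down and dispatch it to tens to hundreds of sub-agents running in parallel, each responsible for one concrete, clearly bounded subtask. The orchestration script itself is generated dynamically by Claude Code rather than defined by hand by an engineer. That is the key difference: engineers only need to describe the goal and do not need to design the execution framework in advance.

### Typical use cases

The three core scenarios given officially make the scope of dynamic workflows clear:

- Whole-codebase vulnerability hunting, which requires analyzing hundreds of files at once while keeping cross-file context and tracing security vulnerabilities in parallel across multiple code paths.
- Large-scale code migration, where moving a codebase from an old framework to a new one requires independent semantic verification and testing of each migration unit.
- Independent verification, where several parallel paths solve the same problem independently and the results are compared to improve reliability.

What these scenarios share is that the total work exceeds the capacity of a single window and the subtasks can be processed in parallel without strict serial dependencies.

### "ultracode" mode and usage advice

The new "ultracode" mode lets Claude Code decide automatically when to enable dynamic workflows, with no need to specify launch parameters manually. It is currently available as a research preview and supports the CLI, the desktop app, the VS Code extension, and the major cloud AI services (including AWS Bedrock, Google Cloud Vertex AI, and others).

The official guidance stresses that dynamic workflows consume far more tokens than ordinary sessions, because many sub-agents running in parallel use a large amount of compute at once. The advice is to start with well-scoped, clearly bounded tasks and gradually find a rhythm that suits your own workflow, so that unclear task boundaries do not lead to sub-agents multiplying without limit. This pairs with the "effort control" feature released with Opus 4.8 today: effort control sets the reasoning depth of each node, while dynamic workflows decide whether to switch on multi-agent parallelism. Together they form the core control mechanism of the new generation of agent engineering.

On a longer time scale, dynamic workflows represent an important paradigm shift: the boundary of AI systems is expanding from "what one person can do" to "what one person plus an AI-orchestrated agent cluster can do." The movement of that line will keep reshaping how software engineers work over the next few years. From today, the ceiling on an engineer's or a team's output is set not only by individual skill and team size but also by their ability to orchestrate and schedule AI agent clusters. That is the truly far-reaching significance of dynamic workflows, and the reason today's release is worth a careful read for anyone doing technical work, whether or not you use Claude Code directly today.

## Quick Takes

### The Age of Asynchronous Agents: Cognition's Walden Yan and OpenInspect's Cole Murray (Latent.Space)

Cognition CPO Walden Yan (a core figure behind Devin) and OpenInspect founder Cole Murray hold an in-depth conversation about the rise of background asynchronous agents and the model inflection point of December 2025. Core judgment: local coding tools are only the starting point, the next stage is autonomous cloud agent systems, and architecture needs to move from "synchronous response" to "asynchronous task processing." This lines up closely with the direction of today's Claude dynamic workflows and is worth reading alongside it to see how the industry view and the product launch echo each other.

### Dubbing v2 Released: A Revolutionary New Dubbing Model (ElevenLabs Blog)

ElevenLabs launched Dubbing v2, which supports more than 90 languages. The core breakthrough is preserving the original speaker's emotional color, intonation, and rhythm of speech while translating. Video localization is no longer "reading it again in another language" but "the same person's voice speaking another language." It has direct practical value for content teams, media companies, and education platforms, and will significantly lower the barrier to producing international content.

### Neuralink Co-Founder DJ Seo: Inside the Race to Merge Brain-Computer Interfaces and AI (Sequoia Capital)

Neuralink co-founder DJ Seo describes first-hand how the company took brain-computer interfaces from lab research to real patients: the first paralyzed patients have regained control of the digital world through implanted devices. He also reveals upcoming vision-restoration technology and lays out the long-term vision of high-bandwidth AI-brain integration. This Sequoia Capital interview is the most direct first-hand view of where brain-computer interface commercialization stands; it runs about an hour and is dense with information.

### The End of the Harness Is Not a Rein but a Mirror: The Quietest Revolution of the AI Era (Tencent Technology Engineering)

The article proposes the concept of "making visible": the real value of AI lies not in replacing human work but in forcing us to write down, for the first time, the tacit knowledge, judgment criteria, and team taste that have long existed only in our heads. It is an irreversible cognitive revolution: once you start collaborating with AI, you have to say clearly what you actually want, and that process is itself a way of organizing and consolidating knowledge. The argument is sharp and well suited to reading with engineers and product managers, and it will resonate widely.

### From Language Emergence to Collaboration Emergence: How to Get AI to Produce High-Quality Decisions (Alibaba Technology)

Alibaba engineers propose the Agent Room concept: place multiple AI agents in a shared context field and let them correct one another, consolidate tasks, and carry out verification, making the leap from process automation to collaboration emergence. The article fully documents the team's three-stage path from "process automation" to "end-to-end automation" to "collaboration emergence." It is a rare theoretical framework backed by concrete engineering experience and makes a good comparison with the official narrative around today's dynamic workflows.

### 143. A Second Interview with He Xiaopeng: Bigger Bets, the Birth of the Humanoid Robot Iron, That Accident, Being CEO Amid Technological Upheaval, GX, and the "Stitched-Together Monster" (张小珺Jùn｜商业访谈录)

He Xiaopeng describes in detail XPeng's strategic shift from smart electric vehicles to a "physical AI company": abandoning the old autonomous driving system, going all-in on the humanoid robot Iron, candidly putting the odds of success at 20%, and discussing a CEO's anxiety and decision-making amid technological upheaval. This interview goes deeper than the previous one, and He Xiaopeng is more candid than expected. Readers who want to understand the real state of China's car-plus-robot sector should not miss it.

### Waste Tokens, Save Time: Naval and Three Frontier Founders on How AI Is Reshaping Software Engineering (Naval)

Naval and three frontier founders (including a member of the Devin team) discuss the "software factory" paradigm: replacing manual coding with AI agents, with the core logic of "wasting compute to save human effort." They also question the future of pure software moats: when anyone can use AI to quickly copy software features, where does differentiation come from? The conclusion points to data, network effects, and brand rather than the code itself.

## Further Reading

### Inference Optimization, Diffusion Models, World Models, and Other Frontier AI Research | YC Paper Club (Y Combinator)

The first YC Paper Club brings together top founders and researchers to discuss frontier papers in five areas: inference acceleration (Speculative Speculative Decoding), robot control, world modeling, generalization theory, and data efficiency. Worth watching for readers and researchers who track basic AI research.

### How We Built Cloudflare's Data Platform and the AI Agent on Top of It (The Cloudflare Blog)

Cloudflare's engineering team describes in detail how it moved from data silos to the unified data platform Town Lake and built the AI agent Skipper on top of it, letting any employee query business data at the scale of billions of records in natural language. It is directly useful to teams building internal enterprise data agents, and Cloudflare's scale and complexity make the case highly representative.

### Slack AI: The Road to Multi-Cloud (Slack Engineering)

Slack's engineering team fully documents its three-year evolution from AWS SageMaker to a multi-cloud architecture on AWS Bedrock + GCP Vertex AI, driven by the combined needs of operational efficiency, model flexibility, and enterprise-grade reliability. It is a rare hands-on case study of multi-cloud AI infrastructure, and it is even more relevant in light of Anthropic's multi-cloud strategy today.

### When Your Customer Is an AI Agent: How B2B Companies Stay Visible When the Buyer Becomes an AI Agent (freeCodeCamp)

96% of B2B companies are "invisible" in AI-driven procurement: when AI agents shortlist candidate suppliers for buyers, most companies are not even considered. The article analyzes three infrastructure decisions needed to become "AI-discoverable." It suits B2B product and marketing leads; this is a structural change already under way that calls for early preparation.

### Can Skill Documents Be Trained Too? SkillOpt: Writing an Agent's Experience into an Optimizable Manual (AINLP)

A Chinese-language walkthrough of Microsoft's SkillOpt paper: it treats an agent's Skill document as trainable external state and iterates on it automatically through rollouts, reflection, constrained edits, and validation gating, reaching best or tied-best on 52 of 52 evaluation items. It offers direct inspiration to teams engineering agents and is a low-cost direction for improving agent performance.

### Cursor Developer Habits Report: Insights into AI Coding Trends (Cursor)

Cursor published its "Developer Habits Report," which analyzes patterns of AI tool adoption based on what it describes as the most complete AI coding dataset in the world. For readers who want to understand how AI coding tools spread and are used among real developers, this report is one of the best-supported references currently available.

### SpaceX Builds Its Own C-Language AI Training Stack for 220,000 GB300 GPUs (Elon Musk)

SpaceX is about to complete V1.0 of a custom AI training stack written in C, precisely mapped to 220,000 NVIDIA GB300 GPUs, and claims a speedup of more than an order of magnitude over JAX on large-scale training jobs. It is a signal that AI training infrastructure is moving toward heavy customization, and worth knowing for readers following AI compute investment.

### Coding Agents in the Social Sciences (Anthropic Research)

A survey of 1,260 social scientists shows that 81% have used AI chatbots but only 20% have used coding agents, with adoption clearly divided by gender, career stage, and university prestige. Early adopters produced more working papers, but journal submissions did not increase. The data is interesting and suits readers following how AI tools spread in non-engineering fields.

### AI Daybreak: Reshaping, Leaping, and Keeping Watch over the Cultural Industry in the Generative AI Era | 40,000-Character Report (Tencent Research Institute)

Tencent Research Institute, together with the Communication University of China, released a 40,000-character research report covering short video, long video, online literature, music, games, and other content forms. It proposes a framework for the full-chain impact of generative AI on the cultural industry and includes nearly 1,900 valid questionnaires and 20 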
余位从业者访谈。体量大，适合对文化产业与 AI 交叉领域感兴趣的读者周末细读。一文读懂 Meta 2026 年股东大会：扎克伯格豪赌 AI 四大方向，十项股东提案全被否（腾讯科技） Meta 2026 年股东大会核心内容速览：12 名董事全部连任，10 项股东提案全被否，扎克伯格重点阐述核心应用 AI 化、个人智能体、商业智能体、AI 硬件四大方向，资本支出 1150 亿到 1350 亿美元，几乎是去年两倍。想了解 Meta AI 战略全貌的读者值得一读，结合今天 Anthropic 的融资新闻对照来看格局感更强。 ## 今日阅读路径时间有限时，建议按以下顺序读三篇： 1. Claude Opus 4.8 发布——今天最值得优先读的一篇。Opus 4.8 的「诚实度」提升不是边际改进，而是智能体工程的基础性突破。读完这篇再看动态工作流，会有更清晰的整体感：能力升级和工具升级是同步设计的，不是各自独立的公告。 1. 动态工作流功能发布 | Claude——紧接着读这篇，理解并行子智能体架构的设计逻辑和适用边界，以及「ultracode」模式的实际使用建议。这是今天三篇精讲中最有工程实操参考价值的一篇。 1. Harness 的尽头不是缰绳，是镜子——用腾讯工程师的视角把前两篇「落地」：模型能力再强，真正的价值在于迫使团队将隐性知识显形化。这篇文章是今天所有 AI 进展最好的人文注脚，读完会对「为什么我们需要更强的 AI」有更深的理解。如果还有时间，加读从语言涌现到协作涌现——阿里工程师对多智能体协作的一手实践记录，与今天的 Claude 动态工作流官方叙事形成很好的互补：一篇是工具方的视角，一篇是实践者的视角，放在一起读收获更大。再有时间的话，Anthropic H 轮融资值得完整读一遍——里面关于多云战略和投资方构成的细节，能帮助你理解 AI 行业的资本与技术如何同步运转。

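Summary: Anthropic released its flagship model Claude Opus 4.8, which beats its predecessor across coding, agent, reasoning, and other benchmarks; its code "honesty" improved roughly fourfold, strengthening the reliability of multi-agent systems. On the same day, Anthropic closed a $65 billion Series H round at a $965 billion post-money valuation, with annualized revenue already above $47 billion. The accompanying Claude Code dynamic workflows allow hundreds of parallel subagents to be orchestrated within a single session, suited to tasks such as large-scale codebase audits.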
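
The fan-out pattern described for dynamic workflows (split a large job into bounded subtasks, run them in parallel, and have each subtask report problems instead of passing them on silently) can be sketched roughly as follows. This is only an illustration under stated assumptions, not Claude Code's implementation: `run_subagent`, `orchestrate`, the `max_parallel` cap, and the file names are all hypothetical, and the subagent is a stub that does no real analysis.

```python
import asyncio

async def run_subagent(task: str) -> dict:
    """Hypothetical stand-in for one subagent; a real one would call a model."""
    await asyncio.sleep(0)  # simulate asynchronous work
    # The stub "flags" any file whose name ends with broken.py.
    return {"task": task, "ok": not task.endswith("broken.py")}

async def orchestrate(tasks: list[str], max_parallel: int = 8) -> list[dict]:
    """Run one subagent per task, with at most max_parallel running at once."""
    gate = asyncio.Semaphore(max_parallel)

    async def guarded(task: str) -> dict:
        async with gate:
            return await run_subagent(task)

    # gather returns results in the same order as the input tasks
    return await asyncio.gather(*(guarded(t) for t in tasks))

files = ["src/auth.py", "src/broken.py", "src/db.py"]
results = asyncio.run(orchestrate(files, max_parallel=2))
flagged = [r["task"] for r in results if not r["ok"]]
print(flagged)
```

The concurrency cap is the design choice worth noting: the article warns that parallel subagents consume far more tokens than an ordinary session, so an explicit bound is one simple way to keep the fan-out from growing without limit.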

ginobefun@hongming731 · May 29 · 76

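Anthropic today released its flagship model Claude Opus 4.8, a full upgrade of Opus 4.7. Opus 4.8 beats the previous generation on benchmarks across four dimensions: coding, agents, reasoning, and knowledge work. The most notable change is in "honesty": the probability that the model turns a blind eye to flawed code it wrote itself has dropped by roughly a factor of four. In other words, it is more willing to admit its own mistakes than to defend them. The release also brings three new features. The first is dynamic workflows in Claude Code, which can launch tens or even hundreds of parallel subagents within a single session, aimed at large, cross-file tasks such as whole-codebase vulnerability scans and large code migrations. The second is "effort control" on http://claude.ai, which lets users manually adjust the model's thinking depth, spending fewer tokens on simple questions and saving compute for where it is really needed. The third is API support for updating instructions in real time while a task is running, without interrupting the whole flow and starting over. Early testers from Databricks, Hebbia, Devin, and other teams report clear improvements in judgment and reliability, with steadier behavior especially on long-running autonomous tasks. Pricing matches Opus 4.7, with no increase.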

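Summary: Anthropic released its flagship large language model Claude Opus 4.8, a full upgrade of Opus 4.7 that beats its predecessor on coding, agent, reasoning, and knowledge-work benchmarks. The most significant improvement is a large gain in model honesty: the probability of turning a blind eye to its own flawed code dropped roughly fourfold. Three new features launched alongside it: Claude Code supports dynamic workflows that can launch parallel subagents for complex tasks; claude.ai offers an "effort control" feature that lets users adjust the model's thinking depth; and the API supports updating instructions in real time during task execution. Early testers report clear improvements in judgment and reliability, and pricing matches Opus 4.7.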

Rohan Paul@rohanpaul_ai · May 29 · 64

Some truly massive inference numbers here. @Kog__AI just achieved 3,000 tokens/s on 8× AMD MI300X GPUs and 2,100 on 8× NVIDIA H200 (FP16, no speculative decoding) with a 2B model. For comparison, typical decoding speed for 2B to 8B models on high-end GPUs is around 100 to 300 tokens/s. They achieved it by treating LLM decoding as a memory-streaming problem: keep the whole token-generation loop inside one persistent GPU program, so kernel launches, CPU scheduling, intermediate memory writes, and sampling interruptions mostly disappear. Then they cut synchronization waste by making each compute unit wait only for the exact data it needs, while mapping memory access to the MI300X’s chiplet topology so the GPU stops paying avoidable cross-die latency. Finally, their model architecture delays tensor-parallel communication so all-reduce work happens in the background instead of blocking every layer, which is why the runtime, GPU code, and model design all have to be co-designed.

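Summary: Kog AI reached striking inference speeds on standard data-center GPUs: 3,000 tokens/s on 8× AMD MI300X and 2,100 tokens/s on 8× NVIDIA H200 (FP16, no speculative decoding), against a typical 100-300 tokens/s. The core of the technique is treating LLM decoding as a memory-streaming problem: keeping the whole token-generation loop inside a single persistent GPU program, optimizing the memory-access topology to cut cross-die latency, and delaying tensor-parallel communication to sharply reduce overhead. Kog opened a technical preview today with a 2B coding model and plans to support large frontier MoE models later.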
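
To see why framing decoding as a memory-streaming problem matters, a rough bandwidth ceiling helps: if every weight must be read once per generated token, tokens/s cannot exceed HBM bandwidth divided by the bytes each GPU streams. The sketch below is a back-of-envelope estimate, not Kog's method or measurement. It assumes single-stream decoding (batch size 1), ignores KV-cache and activation traffic, assumes the weights are sharded evenly across the 8 GPUs, and uses vendor peak HBM bandwidth figures (about 5.3 TB/s for MI300X, 4.8 TB/s for H200), none of which come from the post.

```python
def decode_ceiling_tokens_per_s(params: float, bytes_per_param: float,
                                hbm_bytes_per_s: float, num_gpus: int) -> float:
    """Upper bound on tokens/s if each GPU streams its weight shard once per token."""
    shard_bytes = params * bytes_per_param / num_gpus
    return hbm_bytes_per_s / shard_bytes

PARAMS = 2e9       # 2B-parameter model, as in the post
FP16_BYTES = 2     # FP16, as in the post
PEAK_HBM = {"MI300X": 5.3e12, "H200": 4.8e12}   # assumed vendor peak, bytes/s
REPORTED = {"MI300X": 3000, "H200": 2100}       # tokens/s from the post

for gpu, bandwidth in PEAK_HBM.items():
    ceiling = decode_ceiling_tokens_per_s(PARAMS, FP16_BYTES, bandwidth, num_gpus=8)
    share = REPORTED[gpu] / ceiling
    print(f"{gpu}: ceiling ~{ceiling:,.0f} tok/s, reported {REPORTED[gpu]:,} ({share:.0%})")
```

Under these assumptions the ceiling is on the order of 10,000 tokens/s, so the gap between typical 100 to 300 tokens/s and the reported figures is plausibly overhead (kernel launches, synchronization, communication) rather than raw bandwidth, which is consistent with the optimizations the post lists.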

Nathan Lambert@natolambert · May 29 · 57

For reference, when we visited @Zai_org in China they had an API metrics chart in their showroom showing 5-7T tokens/day. The inference market in the U.S. / Europe seems way bigger (and that's a big deal for continuing to build models).


Artificial Analysis@ArtificialAnlys · May 29 · 79

Claude Opus 4.8 takes the lead on the Artificial Analysis Intelligence Index at 61.4, with Anthropic retaking the #1 spot on GDPval-AA and advancing in terminal use and scientific reasoning

To reach the leading position on the Intelligence Index, @Anthropic made large improvements in both real-world agentic work and frontier academic reasoning tasks.

Key takeaways:

➤ Claude Opus 4.8 is the new leader on the Artificial Analysis Intelligence Index. Opus 4.8 scores 61.4, up +4.1 points from Opus 4.7 and +1.2 points ahead of GPT-5.5 (xhigh), the previous Index leader

➤ The new release is slightly more efficient than its predecessor on agentic tasks, but token efficiency varied by task type. We saw Opus 4.8 use fewer turns and output tokens on GDPval-AA, but approximately the same number of output tokens for the overall Intelligence Index to achieve significantly higher performance.

➤ Anthropic retakes the lead on GDPval-AA, our primary evaluation for agentic performance on knowledge work tasks. Opus 4.8 scored an 1,890 Elo, reflecting an implied win rate of approximately 67% against GPT-5.5

➤ Claude is now among the top models for scientific reasoning. Previous releases have trailed peers on complex academic reasoning tasks, but with Opus 4.8, Claude sits slightly ahead of OpenAI and Google as the leader on Humanity’s Last Exam. It also scores higher than Gemini 3.1 Pro on CritPt, a frontier physics benchmark, but remains behind GPT-5.4 and GPT-5.5

➤ Claude Opus 4.8 reaches #2 on AA-Omniscience, slightly ahead of Opus 4.7. Opus 4.8 scores 27.4 on the AA-Omniscience Index behind only Gemini 3.1 Pro (32.9). Accuracy ticked up slightly to 46.6% and hallucination rate held roughly flat at 35.9% - Anthropic continues to demonstrate substantially lower hallucination rates than peer models from Google and OpenAI

➤ Compared with Opus 4.7, Opus 4.8 also makes material gains on Terminal-Bench Hard (+6.8 points), τ²-Bench Telecom (+5.9 points), and IFBench (+3.6 points), with relatively flat scores across AA-LCR, GPQA, and SciCode.

Other key model details remain the same as Opus 4.7:

Context window of 1 million tokens (equivalent to Opus 4.7)

Pricing of $5/$25 per million tokens of input/output; cache pricing remains at a 25% premium for cache writes ($6.25 per million tokens) with 5-minute time to live, and 90% discount for cache hits ($0.5 per million tokens)

Effort remains the recommended way of configuring model performance and latency, with the same options as Opus 4.7 - we measured the model at its ‘max’ effort setting to test peak performance

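Summary: Anthropic released Claude Opus 4.8, which retakes first place on the Artificial Analysis Intelligence Index at 61.4, 1.2 points ahead of GPT-5.5 (xhigh). The model improves on both real-world agentic tasks and frontier academic reasoning, and scores 1,890 Elo on GDPval-AA, the main agentic evaluation, for an implied win rate of about 67%. On scientific reasoning, Claude leads OpenAI and Google on Humanity's Last Exam for the first time. Its hallucination rate holds at 35.9%, significantly lower than competitors. The context window remains 1 million tokens, priced at $5 input and $25 output per million tokens.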
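
The cache pricing quoted in the post is easier to judge with a worked example. The sketch below uses only the per-million-token prices stated above ($5 input, $6.25 cache write, $0.50 cache hit); the workload itself is hypothetical: a 100,000-token shared prefix reused across 10 calls, all assumed to land within the 5-minute time to live so that every call after the first is a cache hit. Output tokens and the non-shared part of each prompt are left out to keep the comparison simple.

```python
INPUT_PER_M = 5.00                       # USD per million input tokens (from the post)
CACHE_WRITE_PER_M = INPUT_PER_M * 1.25   # 25% premium -> 6.25
CACHE_HIT_PER_M = INPUT_PER_M * 0.10     # 90% discount -> 0.50

def prefix_cost_uncached(prefix_tokens: int, calls: int) -> float:
    """Every call pays full input price for the shared prefix."""
    return calls * prefix_tokens * INPUT_PER_M / 1e6

def prefix_cost_cached(prefix_tokens: int, calls: int) -> float:
    """First call writes the cache; later calls are hits (assumes TTL never expires)."""
    write = prefix_tokens * CACHE_WRITE_PER_M / 1e6
    hits = (calls - 1) * prefix_tokens * CACHE_HIT_PER_M / 1e6
    return write + hits

PREFIX_TOKENS = 100_000   # hypothetical shared prefix
CALLS = 10                # hypothetical number of calls reusing it

uncached = prefix_cost_uncached(PREFIX_TOKENS, CALLS)
cached = prefix_cost_cached(PREFIX_TOKENS, CALLS)
print(f"uncached: ${uncached:.3f}, cached: ${cached:.3f}")
```

For this hypothetical workload the prefix costs $5.00 uncached against about $1.08 cached. With these prices the write premium is already recovered on the second call, since one write plus one hit ($6.75 per million) is cheaper than two uncached reads ($10 per million).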

Rohan Paul@rohanpaul_ai · May 29 · 62

Most AI teams still buy inference like they are buying software from 1 vendor. They pick a model, accept the fixed price, wire it into the app, and keep paying that rate even when cheaper models could handle the same work. @The_GridAI takes a different approach. Instead of choosing a model name, you choose the level of work you need: standard, prime, or max. A simple task like support-ticket classification can run on standard. Normal production work like RAG, drafting, support replies, or agent steps can run on prime. Harder work with long context or higher error cost can run on max. The Grid then routes the request to the cheapest supplier that still qualifies for that tier. So the app still uses one API and mostly the same code, but the model behind the request can change as price and quality change. I tested it with Hermes Agent on my Ubuntu machine. Hermes ran locally, while The Grid handled the inference through agent-prime. The workflow was simple: read support tickets, apply a policy file, and write a triage report.

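Summary: The Grid AI proposes a new way of buying AI inference. Instead of naming a specific model, users pick one of three tiers by task complexity: standard, prime, or max. The platform automatically routes each request to the cheapest supplier that meets the tier's requirements. The application integrates a single API, and the backend model can change as price and quality change, which optimizes cost. The author tested it locally with Hermes Agent, running a ticket-triage workflow through the agent-prime tier. The Grid is currently in beta, claims that supplier bidding can cut AI API costs by up to 80%, and offers new users their first 200M tokens free.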
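
The routing rule described in the post (pick a tier, then send the request to the cheapest supplier that still qualifies for it) can be sketched in a few lines. This is a toy illustration, not The Grid's API: the `Supplier` type, the `route` function, and the supplier names and prices are all hypothetical, and a real router would also have to measure quality and handle availability.

```python
from dataclasses import dataclass

TIER_RANK = {"standard": 0, "prime": 1, "max": 2}   # tier names from the post

@dataclass(frozen=True)
class Supplier:
    name: str                 # hypothetical supplier label
    max_tier: str             # highest tier this supplier qualifies for
    usd_per_m_tokens: float   # hypothetical price

def route(tier: str, suppliers: list[Supplier]) -> Supplier:
    """Return the cheapest supplier that qualifies for the requested tier."""
    eligible = [s for s in suppliers if TIER_RANK[s.max_tier] >= TIER_RANK[tier]]
    if not eligible:
        raise LookupError(f"no supplier qualifies for tier {tier!r}")
    return min(eligible, key=lambda s: s.usd_per_m_tokens)

suppliers = [
    Supplier("small-model-host", "standard", 0.20),
    Supplier("mid-model-host", "prime", 1.50),
    Supplier("frontier-host-a", "max", 12.00),
    Supplier("frontier-host-b", "max", 9.00),
]

for tier in ("standard", "prime", "max"):
    print(tier, "->", route(tier, suppliers).name)
```

Because the application names only the tier, the supplier behind a request can change whenever the price list changes, without any change to the calling code.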

Rohan Paul@rohanpaul_ai · May 29 · 30

Most human experts will feel the pain, and the existential reflection, of watching a skill become an API.


OpenCode@opencode · May 29 · 60

Opus 4.8 now available in OpenCode


Chubby♨️@kimmonismus · May 29 · 53

Huge!! "Mythos class model to all customers in the coming weeks"!! Holy, we accelerate!!


Perplexity@perplexity_ai · May 29 · 59

Claude Opus 4.8 is now available for Max subscribers on Perplexity and Computer.


Thariq@trq212 · May 29 · 76

I think you’ll really like Opus 4.8. It’s as smart as its benchmarks show but expresses and utilizes that intelligence in a warm and collaborative way. Workflows are a great way to utilize it; I’m hooked. Article on that soon.


OpenRouter@OpenRouter · May 29 · 80

Opus 4.8 is live on OpenRouter! Same price as 4.7 with gains across agentic coding, reasoning, and computer use. Around 4x less likely than 4.7 to let code flaws pass unremarked. Opus 4.8 Fast Mode is also live - now only 2x the cost for 2.5x the speed.


ClaudeDevs@ClaudeDevs · May 29 · 83

Opus 4.8 is live in Claude Code today. A few things worth knowing: 🧵


🚨 AI News | TestingCatalog@testingcatalog · May 29 · 82

ANTHROPIC 🔥: CLAUDE OPUS 4.8 IS ROLLING OUT TO ALL USERS. The release also includes an updated Thinking effort selector with Low, Medium, High, Extra, and Max options available. > Switch to Opus 4.8 for your most ambitious work - and now you can set the effort level for thoroughness or speed.


🚨 AI News | TestingCatalog@testingcatalog · May 29 · 69

ANTHROPIC 🔥: Claude Opus 4.8 achieves 69.2% score on SWE Bench Pro against 64.3% for Opus 4.7. Benchmarks 👀


SemiAnalysis@SemiAnalysis_ · May 29 · 64

The most popular AI subscription will run you about $20/month; it gives you access to most of the models and is good enough for the average daily user. But for a company like Anthropic, how much does it cost to serve that user? It's safe to assume that the majority of users aren't going to hit the usage limits, but hypothetically, let us say they did. Depending on the workload, the same $20 subscription can range from insanely profitable to barely breaking even.
