@deepseek_ai V4 Flash just hit #1 on @OpenRouter — 3.02T tokens, up 109% this week. If you haven't tried it yet, now's a good time. More Info⬇️

OpenRouter @OpenRouter · May 22 · 65

DeepSeek V4 Flash has topped the weekly leaderboard

Ethan Mollick @emollick · May 22 · 40

It's funny how much the whole "strawberry" thing, which turned out to be o1-preview, was dismissed as overhyped at launch when it is clear in retrospect that it was way underhyped. A direct line from models unable to do basic math to solving unresolved math problems in 18 months.

karminski-牙医 @karminski3 · May 22 · 71

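400 TPS! Zhipu's GLM-5.1 tested running at 10x speed

Zhipu just released glm-5.1-highspeed! I ran my test script against it right away: output speed reaches 300+ tps, and time to first token holds steady at 1s. How strong is that? The same script against the glm-5.1 endpoint gives only 35 tps and a 9s time to first token, so this is roughly a 10x speedup. If you use glm-5.1 for coding or for running agents ("raising lobsters"/"Hermès" in community slang), you can switch your plan to the new model right away. Output starts streaming with no wait.

GLM-5.1 activates 40B parameters per token. At bf16 precision that is 80 GB of weights to read even before counting the KV cache, so 35 tps implies 80 × 35 = 2.8 TB/s of memory bandwidth, and 300 tps implies 80 × 300 = 24 TB/s. Taking the H100 SXM at 3.35 TB/s, the earlier speed was within a single card's bandwidth, while the new speed needs 8-way tensor parallelism (which of course also raises request concurrency).

The official technical write-up is even more striking: working with the TileRT team, they rebuilt the inference path from the bottom up and squeezed the GPUs dry. In short, traditional inference works like an assembly line: the CPU acts as scheduler and issues instructions to the GPU layer by layer; each layer's result is written back to device memory, then read out again for the next layer, with constant synchronization in between. A lot of time goes into this "scheduling + data movement" rather than into pure compute.

TileRT takes the opposite approach: the whole inference flow is laid out at compile time and turned into one large kernel that stays resident on the GPU. Once inference starts there is essentially a single launch, and the GPU runs on its own from there. Within a card, compute, IO, and communication are split into smaller tile-level tasks, and intermediate results avoid main device memory where possible, passing directly through registers, shared memory, and L2 cache. Across cards the work is divided, for example GPU 0 handles the Sparse Indexer while GPUs 1–7 run the MLA attention backbone. (There are many more optimization details in the official technical write-up.) None of this needs deep CPU involvement any more, which is where much of the performance gain comes from.

So, if you are using GLM-5.1, switch models now! #glm51 #glm51highspeed #智谱 #GLM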
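
The bandwidth arithmetic in the post can be reproduced with a short sketch. It keeps the post's own simplifications (every generated token re-reads all active weights once, KV-cache traffic is ignored); the function names are hypothetical and only the 40B, bf16, 35/300 tps, and 3.35 TB/s figures come from the post.

```python
import math

BYTES_PER_PARAM_BF16 = 2
H100_SXM_BANDWIDTH_TBPS = 3.35  # figure quoted in the post


def weight_gb(active_params_billion: float) -> float:
    """Size of the active weights in GB at bf16 precision."""
    return active_params_billion * BYTES_PER_PARAM_BF16


def required_bandwidth_tbps(active_params_billion: float, tokens_per_s: float) -> float:
    """Bandwidth (TB/s) needed if each token streams the active weights once."""
    return weight_gb(active_params_billion) * tokens_per_s / 1000.0


def min_gpus(bandwidth_tbps: float, per_gpu_tbps: float = H100_SXM_BANDWIDTH_TBPS) -> int:
    """Smallest GPU count whose combined bandwidth covers the requirement."""
    return math.ceil(bandwidth_tbps / per_gpu_tbps)


for tps in (35, 300):
    need = required_bandwidth_tbps(40, tps)
    print(f"{tps} tps -> {need:.1f} TB/s -> at least {min_gpus(need)} H100 SXM")
```

This is a lower bound on bandwidth only; real deployments also batch requests, which changes how often weights are re-read per token.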

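Summary: Zhipu recently launched the GLM-5.1-Highspeed model. In testing it outputs 300+ tokens/s with about 1 second to first token, roughly a 10x improvement over standard GLM-5.1 at 35 tps and 9 seconds. Technically, Zhipu and the TileRT team rebuilt the inference path: the whole inference flow is compiled into one large GPU-resident kernel, which sharply cuts CPU scheduling and data-movement overhead, and compute and IO within a card and task division across cards are optimized to raise GPU utilization. The model activates 40B parameters per token, so running at this speed depends on multi-GPU parallelism. Existing users are advised to switch for a more real-time generation experience.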

ginobefun @hongming731 · May 22 · 63

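http://x.com/i/article/2057600777791913984

# BestBlogs Morning Brief · 05-22 | Agent Memory Primitives, Qwen3.7-Max, Automation and Human Experts

Read and listen online: https://www.bestblogs.dev/explore/brief/2026-05-22

## Introduction

Today's brief revolves around one question: what does it actually mean for AI agents to "mature"?

Anthropic engineers have publicly presented two platform-level primitives for the first time, Memory and Dreaming, turning cross-session agent memory from theory into engineering fact; Rakuten's 97% drop in error rate surprised everyone. Meanwhile, Qwen3.7-Max completed 1,158 tool calls with zero interruptions in a 35-hour continuous stress test, pulling the focus of competition among Chinese LLMs from Q&A scores toward long-horizon stability. And Every founder Dan Shipper makes a counterintuitive claim: the more widespread AI becomes, the scarcer the human experts who can "judge right from wrong".

The quick-takes section covers the product philosophies of two agent infrastructure companies, Daytona and Railway, along with the open-sourcing of Tencent's Hy-MT2 translation model, AMD CEO Lisa Su's talk in Shanghai, the ZCube network architecture, and other developments. Further reading includes an OpenAI model overturning an 80-year-old math conjecture, several Harness Engineering write-ups, and Martin Fowler's latest thinking on maintainability sensors for agent-written code. The content spans AI scientific discovery, engineering practice, and system architecture; choose according to today's reading path.

## Deep Dive 1: Memory and Dreaming for Self-Learning Autonomous Agents

Source: Claude (Anthropic's official channel). Link: watch on BestBlogs.

### Background: the engineering bottleneck of agent memory

As AI agents take on more complex enterprise tasks, one of the biggest engineering obstacles is managing historical execution context. Without persistent memory infrastructure, an agent starts almost from a blank state with every new instruction: it repeats mistakes, repeats work, and cannot share domain knowledge across a multi-agent team.

In a public talk, Anthropic's Ravi disclosed for the first time two infrastructure primitives designed for cloud-hosted agents: Memory and Dreaming. This is Anthropic's most concrete architectural path so far toward long-horizon autonomous agents.

### Memory: modeling experience as a virtual file system

The Memory system starts from a pragmatic premise: rather than forcing the model to use a restrictive internal API, it models knowledge explicitly as a standard virtual file system and exposes that to the model.

Modern large language models (such as Opus 4.7) are natively quite capable at working with file paths and directory structures. With past experience and shared knowledge represented as ordinary directories, an agent can use familiar terminal tools such as bash and grep to inspect, modify, and organize its history. This removes unnecessary software layers and lets the model decide which session content is worth keeping.

Enterprise control hierarchy and concurrency control: when shared memory is deployed in a large enterprise, read/write conflicts are a real risk. Anthropic addresses this with three architectural constraints:

1. Scoped Hierarchies: an agent accesses several levels of memory at once, including a read-only enterprise knowledge base (for example SLO policies and runbooks) and a read-write local task store.
2. Optimistic Concurrency Control (OCC): prevents concurrent agents from overwriting each other's state when they write at the same time.
3. Standalone REST API: lets external engineering teams perform CRUD operations, trigger data exports, or carry out compliance deletions.

Rakuten's early deployment numbers are striking: after introducing production-grade Memory, the first-run error rate dropped by 97%. Wise Docs also eliminated the cross-session processing bottleneck in its document verification flow.

### Dreaming: asynchronous consolidation for global optimization

If Memory is the agent's "knowledge store", Dreaming is its "overnight tidying". The Dreaming primitive runs asynchronously in the background, consolidating and deduplicating fragmented memories and eliminating repeated learning across a multi-agent team. It resembles the way the human brain sorts and consolidates the day's experience during sleep, and it lets the whole agent organization keep improving its shared knowledge base without interrupting tasks.

### Why this matters

The significance of the two primitives is not only technical. They mark a key step for agent infrastructure from "single-task tool" to "continuously learning system". When Rakuten reports a number on the order of 97%, it suggests that the real boundary of an agent's value may lie less in single-task performance than in whether it can accumulate and share experience from every run.

### How this connects to today's other content

Memory and Dreaming, and the "long-horizon strategic coherence" of Qwen3.7-Max in Deep Dive 2, point at two layers of the same problem: one solves cross-session memory at the infrastructure layer, the other solves strategy stability during long executions at the model layer. Together these advances form the preconditions for the evolution of AI agents "from tool to collaborator".

From Dan Shipper's perspective in Deep Dive 3, the significance goes further: once agents can learn persistently, their execution quality in a given domain improves over time, which strengthens the strategic value of the "human judge" in the overall system, because someone has to decide whether the "experience" an agent has accumulated is correct and worth keeping.

If you are building enterprise agents or multi-agent systems, this one is worth a close read.

## Deep Dive 2: Qwen3.7-Max Redefines the AI Agent Foundation

Source: Tongyi (Qwen). Link: read on BestBlogs.

### Where the problem starts: impressive demos that collapse in production

Many developers' real experience with AI agents is that context is lost once a task gets a little long, performance falls apart when the framework changes, and after a few rounds the agent starts looping on itself. Qwen3.7-Max tries to address this pain point head-on.

### Extreme stress test: 35 hours, 1,158 tool calls, zero interruptions

Tongyi Lab designed an extreme stress test for Qwen3.7-Max: on a hardware platform never seen during training (T-Head Zhenwu M890 PPUs), autonomously optimize SGLang's production Extend Attention kernel. There was no hardware documentation and no profiling data; the starting point was only a task description, the official Triton reference implementation, and an evaluation script.

Over roughly 35 hours of continuous running, the model produced 432 kernel evaluations across 1,158 tool calls, and fully autonomously:

- wrote, compiled, profiled, and iterated on the inference operator
- diagnosed compilation errors and fixed correctness bugs
- located bottlenecks through runtime measurement and restructured the underlying design several times

In the end it achieved a 10.0x geometric mean speedup over the Triton reference implementation across several workloads, while the other models tested in the same period reached at most 7.3x, and most of them quit on their own after 5 consecutive rounds without action. More importantly, the model was still finding substantive improvements after 30 hours, which demonstrates its "long-horizon strategic coherence".

### Decoupled training architecture: the design behind cross-framework generalization

Qwen3.7-Max's training architecture uses an orthogonally decoupled "task, runtime framework, verifier" design. During reinforcement learning the model is forced to handle the same underlying tasks under different combinations of frameworks and verifiers, so it learns general problem-solving strategies and tool-calling patterns rather than "shortcuts for one framework".

This means that whether you use Claude Code, OpenClaw, Qwen Code, or an in-house tool-use framework, Qwen3.7-Max is plug-and-play with highly consistent performance. On QwenClawBench and the long-chain CoWorkBench, it stays ahead of the previous generation regardless of the runtime environment.

### A shift in the focus of competition among Chinese LLMs

The real significance of this release is that it moves the focus of competition among Chinese LLMs from "Q&A scores" to "long-horizon agent stability". In overall agent evaluations Qwen3.7-Max ranks in the top three, approaches the industry's best, and surpasses Claude 3.7 Sonnet and GPT-4.1 on long-horizon agentic stability.

### Real-world use cases

Qwen3.7-Max has shown its capabilities in three kinds of real scenarios:

- Coding agent: generating, from a single prompt, an interactive web app with a Three.js 3D scene and Canvas animation.
- MCP office assistant: through MCP tool integration, reading a university's thesis formatting rules and automatically repairing a badly formatted thesis, including page layout, heading styles, fonts and sizes, margins, table of contents generation, and reference formatting, all done autonomously with the office-cli tool.
- Multi-agent collaboration: supporting orchestration in which a main agent plans and dispatches while sub-agents execute vertically, and, through tool use, directly controlling embodied devices for understanding, planning, and decision-making in physical environments.

If you need to deploy long-horizon agents in production, Qwen3.7-Max will soon be available through Alibaba Cloud Bailian, fully compatible with the OpenAI and Anthropic API protocols.

## Deep Dive 3: After Automation

Source: Every. Link: read on BestBlogs.

### Where the paradox starts: more automation, more human work

In this essay Every CEO Dan Shipper records a phenomenon that puzzles even him: the company has handed everything that can be automated to AI (coding, design, and customer support with Codex and Claude Code), yet it has not laid anyone off and is still growing. The team is close to 30 people, and there seems to be more human work than before.

This runs directly against the mainstream narrative. Dario Amodei has warned that AI could eliminate half of entry-level white-collar jobs, Meta laid off 8,000 people, and the GDPVal evaluation shows frontier models at 85% of human level on real economic tasks. But Shipper's on-the-ground experience is: "The more you automate, the more human work there is to do."

### The core mechanism: AI commoditizes the "residue" of human expertise

Shipper's explanation is that AI commoditizes the part of human expertise that can be made explicit and trained on. Once a skill is heavily automated, the value of its "default output" collapses, but demand for something "different" rises. And demand for "different" is, at bottom, demand for human experts, even as we approach AGI.

A concrete example: Codex can write code, but engineers who can judge whether that code is right become more valuable, because AI produces large volumes of homogeneous code that needs review. When AI mass-produces content, "deciding which piece is better" becomes the new scarcity.

### The human sandwich: set the frame, AI executes, humans judge

Kieran (a writer at Every) calls this new way of working the "human sandwich": a human sets the task frame → AI executes the task → a human judges and extends the result.

Inside Every, AI already replies to 95% of Shipper's work email, but he still reviews every message. Managers have started writing code, and engineers have started talking directly to customers.

### No tipping point, only a new normal

Shipper's conclusion is counterintuitive but well documented: there will not be a "tipping point" at which all work disappears. The real new normal is that the more is automated, the higher the demand for expert judgment. The endpoint of automation is not the elimination of work but pushing the human role toward "judge and ballast", the last layer to be commoditized.

### How this connects to today's other content

The Qwen3.7-Max stress test illustrates Shipper's logic: after 1,158 tool calls, engineers are still needed to judge whether the final 10x speedup is actually "correct". The model had no hardware documentation and no prior knowledge, but the evaluation script was designed by humans and the acceptance criteria were set by humans. AI did 35 hours of execution, while the work of "defining what success is" remained human.

The same holds for Memory and Dreaming: Rakuten's 97% error-rate drop requires humans to define what counts as an "error", design the evaluation criteria, and judge which experiences Dreaming should retain. Expert judgment is not a by-product of AI automation but a precondition for it.

If you are wondering "will AI replace me", this essay offers a different analytical frame and deserves a careful read.

## Quick Takes

- Giving agents a computer, Ivan Burazin, Daytona (source: Latent Space). Daytona CEO Ivan Burazin's core argument is that AI agents need more than disposable code-execution sandboxes; they need composable, stateful "computers". His account of pivoting the company from human development environments to agent infrastructure, and his long-term bet on "the end of localhost", help explain the product logic of the agent infrastructure space. Daytona is not building another sandbox but redefining the relationship between agents and compute environments. Suitable for developers and infrastructure product managers.
- Railway: a native cloud platform for agents, Jake Cooper (source: Latent Space). Railway founder Jake Cooper covers the whole arc from a "zero activation energy to ship" product philosophy, to building bare-metal data centers and reaching 70% margins, to redesigning infrastructure for the AI agent era. Notably, Railway went through a large-scale GCP outage in May 2026 (despite a multi-AZ, multi-zone architecture), and its postmortem is a useful reference on the high-availability challenges of agent infrastructure. Suitable for readers following cloud infrastructure and agent platforms.
- Tencent Hunyuan open-sources its new translation model Hy-MT2; the "腾讯 Hy 翻译" mini program is open for trial (source: Tencent Hunyuan). Hy-MT2 supports translation among 33 languages; the 7B and 30B-A3B models reach the best open-source results, surpassing models with tens of times more parameters. Most interesting is the 1.8B lightweight version: thanks to AngelSlim 1.25-bit extreme quantization it needs only 440MB of storage, runs locally on phone chips, is 1.5x faster at inference than Hy-MT1.5, and beats mainstream commercial APIs such as Microsoft's on translation quality. The "腾讯 Hy 翻译" mini program is live, with iOS and Android apps coming soon.
- Choosing the right model: a data-driven guide to LLM evals and optimization (source: Claude). Anthropic's Lucas shares a production-grade LLM selection framework: custom evals rather than reliance on public benchmarks, process-level scoring (not only the final result), prompt caching, context hygiene, and optimizing the selection decision by "cost per successful outcome" rather than "cost per call". Directly useful for engineers choosing models for production.
- Google launches Android CLI to make the Android toolchain friendlier to AI agents (source: InfoQ). Google redesigned the Android CLI, adding structured Skills (modular instruction sets in SKILL.md format) and an integrated knowledge base so that AI agents can access the Android toolchain more efficiently. It claims 3x faster builds and 70% lower token usage compared with the agent inside Android Studio, and it is compatible with third-party agents such as Claude Code and Codex. The design closely resembles BestBlogs' own skill system and is worth watching.
- Next-generation LLM inference network architecture: how does ZCube relieve network bottlenecks? (source: Zhipu). The ZCube network architecture, proposed jointly by Zhipu, Harnets.AI, and Tsinghua University, achieved 33% lower cost, 15% higher throughput, and a 40.6% lower TTFT P99 in the GLM-5.1 coding production environment. The core idea is to replace the traditional ROFT architecture with a fully flattened topology plus hybrid single-rail/multi-rail access, solving at the structural level the asymmetric traffic congestion of PD-disaggregated inference. GPUs, software stack, and applications were unchanged; the gain comes purely from architecture. Worth a look for teams running large inference clusters.
- Lisa Su in Shanghai: AI is redefining every layer of computing (source: QbitAI). AMD CEO Lisa Su's core judgment at the Shanghai stop of the AMD AI developer conference: AI competition is shifting from model capability to systems engineering and full-stack optimization, the cost structure of the agent era is exponential rather than linear, and developers need "an engineering system that can be deployed, optimized, and evolved sustainably". AMD is responding with an open ecosystem and the ROCm platform. QbitAI's on-site report is information-dense.

## Further Reading

- OpenAI model overturns an 80-year-old math conjecture; AI makes a scientific discovery for the first time (source: Wes Roth). An internal OpenAI reasoning model autonomously disproved the planar unit distance conjecture posed by Paul Erdős in 1946, constructing a complete family of counterexamples by bridging algebraic number theory and elementary geometry. A milestone for AI-driven original scientific discovery, and worth a look for readers interested in the limits of AI in mathematical research.
- OpenAI's unit distance breakthrough: full technical report (source: OpenAI Blog). The official OpenAI technical report behind the previous Twitter item. The point configurations the model constructs beat the previously best square-grid constructions by a polynomial factor, and leading mathematician Noga Alon took part in the peer review. Readers who want technical detail can go straight to the report.
- QQ Music's Harness Engineering practice (source: Tencent Cloud Developer). A practice report on upgrading AI collaboration from uncontrolled conversational coding to a controllable, auditable, reusable engineering process. In a monorepo with many services, giving the AI the ability to verify its own work is the core challenge. Best read together with the two "Harness Engineering" items below.
- Building the strongest Agentic Analytics Harness: powered by Claude, built with Claude Code (source: Claude). Omni's CTO explains how to build the Blobby intelligent analytics system, covering semantic layer design, an evals framework, split-brain agents, and direct SQL generation. Suitable for readers focused on engineering AI data-analysis agents.
- The ironic paradox of A²I² (source: InfoQ). Examines the structural dilemma of automation and AI in incident response: AI offers autonomy and authority but lacks directed attention, redirectability, and mutual predictability, which are exactly the traits most critical to human coordination. Under pressure this gap can lead to serious failures. Relevant to SREs and operations engineers.
- Prompt engineering is not enough: I built a control layer that runs in production (source: Towards Data Science). After debugging the same crash for the third time, the author realized the problem was the system, not the model. He built a control layer of eight components (InputGuard, TokenBudget, PromptBuilder, ResponseValidator, CircuitBreaker, RetryEngine, FallbackRouter, AuditLogger) that raised the pass rate on a structured-output benchmark from 0% to 100%. 69 tests, 5 runnable demos, full code included.
- It's all AI coding, so why is the Java experience an order of magnitude worse? Five principles for building your own harness environment (source: Alibaba Cloud Developer). An in-depth analysis of the root cause of the poor AI coding experience in Java microservice projects (the project will not run locally, so the AI cannot verify its own work), with five principles for building a locally runnable environment through Harness Engineering. Includes a checklist and concrete engineering solutions; very practical for Java backend developers.
- Releasing ADK for Kotlin and ADK for Android 0.1.0 (source: Google Developers Blog). Google released the Agent Development Kit for Kotlin and ADK for Android, which let developers build hybrid AI agents that coordinate tasks between cloud models (such as Gemini) and on-device LLMs (such as Gemini Nano). Worth watching for Android developers and mobile AI applications.
- Synthetic persona pretraining: alignment from token zero (source: LessWrong). By attaching value-laden moral reflections to pretraining documents, the desired AI assistant persona is instilled from the very start of training, reducing attack success rate by 63%. This is early-stage AI safety research showing that values instilled during pretraining can generalize, after post-training, to unseen safety scenarios. Suitable for readers following alignment research.
- Maintainability sensors for coding agents (source: Martin Fowler). Martin Fowler experiments with a range of sensors, from static analysis to AI-driven modularity review, to help coding agents self-correct and keep a codebase maintainable. As agents generate code faster and faster, ensuring long-term maintainability is an engineering problem worth taking seriously.
- From the official Codex team: how to get the most out of Codex (source: 宝玉的分享). A systematic introduction to using Codex's advanced features, including persistent conversation threads, voice input, task intervention, automation, goal setting, and the sidebar, to turn it from a coding assistant into an all-purpose workflow engine. A Chinese translation of Jason's original; practical content.
- How Ramp engineers speed up code review with Codex (source: OpenAI Blog). Ramp uses Codex powered by GPT-5.5 to cut PR review time from hours to minutes; the core value is that it "catches issues that both humans and other AI tools miss". Best read alongside the Codex guide above.
- When agents really enter complex data analysis: DataClawBench (source: AI 前线). A data analysis benchmark built on 492 real financial think-tank tasks that keeps the data uncleaned and hides data-source priors to evaluate frontier models at the process level. The conclusion: the capability boundary of current agents in open-ended, real-world data analysis is far narrower than demos suggest.
- LLM themes are not observations (source: Towards Data Science). Themes that an LLM extracts from text are "generated variables", not direct observations. In causal analysis, using them as covariates without addressing selection bias, measurement error, and similar issues introduces serious bias. A direct warning for researchers doing data analysis and causal inference.
- Cooking agents in VS Code (source: AI Engineer). Microsoft's Liam Hampton explains how VS Code becomes a unified control plane for local, background, and cloud agents, combining multi-agent workflows, security boundaries, MCP context, and developer oversight. Suitable for VS Code users and agent developers.
- Trading signals that trade themselves: scaling governed AI in systematic investing (source: Claude). Man Group's head of data and AI explains how a regulated investment firm managing more than 200 billion US dollars builds governable AI in systematic trading, including production AI trading signals, a skills governance framework, and the strategic view of "organizational context as an AI moat". A distinctive case of AI adoption in a heavily regulated industry.

## Today's Reading Path

Today's volume is on the heavy side. If you are short on time, the following path is suggested:

- First priority, if you only have 20 minutes: read "Deep Dive 3: After Automation". Dan Shipper's essay is the most thought-provoking piece today. It offers a counterintuitive analytical frame backed by a lot of first-hand data on the relationship between AI and human work, and it is a more honest perspective than most prediction pieces.
- Second priority, if you are an agent engineer: read "Deep Dive 1: Memory and Dreaming", then pair it with the Daytona and ZCube quick takes. Together the three cover the agent memory layer (Anthropic's primitives), the compute environment layer (Daytona), and the network infrastructure layer (ZCube), forming a complete view of agent infrastructure.
- Third priority, if you follow competition among Chinese LLMs: read "Deep Dive 2: Qwen3.7-Max". The figure of 1,158 tool calls over 35 hours with zero interruptions says enough about the nature of the result: this is not benchmark score-chasing but production-grade validation on real hardware, and it marks a new stage in the competition among Chinese LLMs.
- Extra, if you are a developer using AI coding tools: the Java Harness Engineering piece, the official Codex usage guide, and QQ Music's harness practice in Further Reading combine into an "engineering AI coding" mini-series. Very practical, and well suited to reading in one go on a commute.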
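
The optimistic concurrency control mentioned in the Memory section of the brief can be illustrated with a minimal sketch. Everything here is hypothetical (the `MemoryStore` class, the version counter, the file path); the brief does not describe Anthropic's implementation, so this only shows the general OCC pattern of "read a version, write only if the version is unchanged, retry on conflict".

```python
from dataclasses import dataclass, field


class VersionConflict(Exception):
    """Raised when a write is based on a stale version of a file."""


@dataclass
class MemoryStore:
    """Hypothetical in-memory stand-in for a shared agent memory store."""
    files: dict = field(default_factory=dict)  # path -> (version, content)

    def read(self, path: str):
        """Return (version, content); a missing file is version 0, empty content."""
        return self.files.get(path, (0, ""))

    def write(self, path: str, content: str, expected_version: int) -> int:
        """Commit only if nobody has written since expected_version was read."""
        current_version, _ = self.read(path)
        if current_version != expected_version:
            raise VersionConflict(
                f"{path}: expected v{expected_version}, found v{current_version}")
        new_version = current_version + 1
        self.files[path] = (new_version, content)
        return new_version


def append_note(store: MemoryStore, path: str, note: str, max_retries: int = 3) -> int:
    """Read-modify-write with retry: re-read and reapply the change on conflict."""
    for _ in range(max_retries):
        version, content = store.read(path)
        try:
            return store.write(path, content + note + "\n", expected_version=version)
        except VersionConflict:
            continue  # another writer got in first; try again from fresh state
    raise VersionConflict(f"{path}: gave up after {max_retries} attempts")
```

The design choice in OCC is to detect conflicts at write time instead of locking files up front, which suits workloads where concurrent writes to the same path are rare.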

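Summary: This brief focuses on the maturing of AI agents. Anthropic presented its Memory and Dreaming infrastructure primitives for the first time, turning cross-session memory into an engineering capability; after deployment at Rakuten the first-run error rate dropped by 97%. Tongyi Lab's Qwen3.7-Max passed a 35-hour extreme stress test, autonomously optimizing a kernel on an unfamiliar hardware platform across 1,158 tool calls with zero interruptions, which highlights long-horizon stability and shifts the focus of competition among Chinese LLMs from Q&A scores to agent reliability. Meanwhile, Every's founder observes that as AI automation spreads, human experts who can judge execution quality become more valuable. Together these developments point to the infrastructure, model foundation, and new pattern of human collaboration that mature agents require.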

Ethan Mollick @emollick · May 22 · 68

We are quite short of compute, and that is going to result in compute becoming very expensive for complex agentic workflows even as single-turn chatbots get cheaper. So the richest companies & most pressing use cases will use AI agents & everyone else will be stuck with chatbots?

Emad @EMostaque · May 22 · 39

Narrow math speciality counts for a lot of things! A physics example: Many have studied special relativity. How many know or have computed the Killing Form of the space time algebra? If you do then you see a finite invariant speed of light is forced: https://ii.inc/web/blog/post/op

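Summary: The sheer volume of mathematical knowledge today pushes researchers into extremely narrow specialties, which creates knowledge barriers. That gives AI a distinctive role: it can bridge the gaps between human experts, connect different branches of mathematics and even different disciplines, and find solutions that few individuals could reach. The post uses physics as an example: deep command of specialized tools such as the spacetime algebra reveals insights like the finiteness of the speed of light. This shows the value of specialization, and AI may be able to achieve this kind of cross-domain integration and innovation systematically.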

Ethan Mollick @emollick · May 22 · 61

Seems GPT-5.2 reaches expert level in peer review: 45 scientists took 469 hours evaluating human & AI reviews on 82 papers. "Surprisingly, current AI reviewers are competitive even with the top-rated reviewers in Nature’s official peer review..." though not without weaknesses.

Rohan Paul @rohanpaul_ai · May 22 · 63

"Not all tokens are created equal, and there is a way to look at token value. There are two key factors that impact token value. One is the intelligence embedded in the token, and the other is how fast does it arrive." Tokenomics begins with the customer's tolerance for uncertainty, latency, and cost, not with the model menu. A slow token can be expensive even when compute is cheap, because delay changes the product experience before the invoice arrives. A fast token can also be wasteful if it carries shallow reasoning, redundant context, or output nobody uses. A medical triage assistant, a coding agent, and a shopping chatbot do not need the same kind of token, even when they all speak fluent English. --- Shruti Koparkar from Accelerated Computing at Nvidia

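Summary: The post looks at a new way to assess the value of AI tokens, centered on a token's "intelligence content" and its "delivery speed". A fast token can be wasteful if it lacks deep reasoning, and a slow token hurts the user experience through latency even when compute is cheap. Different applications such as medical triage, coding, and shopping support need different tokens. An effective "tokenomics" should therefore start not from the model menu but from the customer's tolerance for uncertainty, latency, and cost, working backward from the concrete use case. NVIDIA's Shruti Koparkar stresses that this determines whether an AI application scales or stalls.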

Rohan Paul @rohanpaul_ai · May 22 · 65

The Information: Anthropic is currently in early-stage talks to lease and deploy Microsoft's custom AI chips for inference workloads. Microsoft is pitching Maia 200 as a cheaper way to run some AI inference, and claims Maia 200 is more cost-effective than Nvidia chips for certain inference jobs. Maia 200 is Microsoft's second-generation AI accelerator, built on TSMC 3nm, with FP8/FP4 math, 216GB HBM3e, 7TB/s bandwidth, and 272MB SRAM, which makes it aimed at feeding large models fast rather than teaching them from scratch. Anthropic already committed $30B to Azure, Microsoft may invest up to $5B in Anthropic, and Claude is already tied into Microsoft's Copilot stack, so the chip talks are also a customer-supplier feedback loop. IMO, Maia does not need to beat Nvidia everywhere to matter, because a cheaper chip for narrow, high-volume inference jobs can still shift billions of tokens away from GPUs. --- theinformation .com/articles/anthropic-talks-use-microsofts-ai-chips

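Summary: According to The Information, Microsoft is pitching its second-generation AI chip, Maia 200, to Anthropic, stressing that it is more cost-effective than NVIDIA chips for certain inference jobs. Maia 200 targets fast inference rather than training. The two companies already cooperate closely: Anthropic has committed 30 billion US dollars of spending on Azure, and Claude is integrated into Microsoft Copilot. The chip deal would deepen that relationship. The analysis is that Maia 200 does not need to beat NVIDIA across the board; offering a cheaper option for high-volume inference could be enough to move part of the compute demand off GPUs.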

Rohan Paul @rohanpaul_ai · May 22 · 84

Alibaba just released Qwen3.7-Max, their best flagship model, built for real-world tasks and production environments.
- Agent reliability is the center of the story: the model must plan steps, call tools, inspect results, fix mistakes, and continue without collapsing after the first wrong turn.
- 56.6 on the Artificial Analysis Intelligence Index, up 4.8 points from Qwen3.6-Max. Qwen 3.7 Max sits at 5th, pretty much on par with GPT 5.4 (xhigh).
- The Intelligence Index gains over Qwen3.6 Max Preview are concentrated in scientific reasoning, agentic capability and coding.
- One important layer of the serving stack, the inference kernel, was optimized heavily, going from near-baseline speed to a 10.0x geometric mean speedup after many rounds of low-level GPU optimization.

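Summary: Alibaba has officially launched its latest flagship model, Qwen3.7-Max, positioned as a production-grade foundation model for the agent era. It scores 56.6 on a widely followed benchmark index, a clear improvement over its predecessor and on par with GPT-5.4. Its core strength is agent reliability: it can plan, call tools, correct errors, and keep executing autonomously in complex tasks. Through deep low-level optimization the model achieved a 10x inference speedup, and it supports hours-long autonomous runs and multi-tool collaboration. It is now available on Alibaba Cloud Model Studio and is compatible with mainstream frameworks such as Claude Code and OpenClaw.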

AK @_akhaliq · May 22 · 56

LongMINT Evaluating Memory under Multi-Target Interference in Long-Horizon Agent Systems

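Berryxia.AI @berryxia · May 21 · 71

Folks, Qwen 3.7 Max is out. Is it weak or is it solid? Let's test it with the "classic AI model binary tree prompt"! I tested both deep-thinking mode and fast mode (see the video). The earlier Gemini 3.5 Flash result is in the original post. You can try it on other models and compare. 👇🏻 Prompt: Write an HTML simulation that draws a recursive fractal binary tree on a canvas. Start from a single trunk, recursively split into left and right branches, with branch length shrinking step by step and the angle getting a small random offset. Animate the tree growing from the trunk until it is full of branches and leaves, then let it sway gently as if in the wind.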
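
The recursion the prompt describes (one trunk, two children per branch, shrinking length, small random angle offset) can be sketched as follows. This is not any model's answer to the prompt and it omits the canvas drawing and animation; it only computes line segments, and the parameter names and default values (`shrink`, `spread`, `jitter`) are illustrative assumptions.

```python
import math
import random


def grow_tree(x, y, angle, length, depth, rng,
              shrink=0.7, spread=math.radians(25), jitter=math.radians(5)):
    """Return (x0, y0, x1, y1) segments of a recursive binary tree."""
    if depth == 0:
        return []
    # End point of the current branch.
    x1 = x + length * math.cos(angle)
    y1 = y + length * math.sin(angle)
    segments = [(x, y, x1, y1)]
    # Left (-1) and right (+1) children, each shorter and slightly perturbed.
    for side in (-1, 1):
        child_angle = angle + side * spread + rng.uniform(-jitter, jitter)
        segments += grow_tree(x1, y1, child_angle, length * shrink,
                              depth - 1, rng, shrink, spread, jitter)
    return segments


# Trunk pointing straight up from the origin, 5 levels deep.
segments = grow_tree(0.0, 0.0, math.pi / 2, 100.0, 5, random.Random(0))
print(len(segments))
```

A tree of depth d yields 2^d − 1 segments; a growth animation could draw them level by level, and swaying could be added by perturbing `angle` over time.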

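Summary: Users are evaluating the newly released Qwen 3.7 Max with the "recursive fractal binary tree" generation test, which asks the model to write HTML code that animates a tree growing and then swaying. Gemini 3.5 Flash was shown on the same test earlier: it took 77.56 seconds to generate the full animation, and the tester found the result impressive. The test has become a common way to compare code generation and creativity across AI models.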

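凡人小北 @frxiaobei · May 21 · 65

This is interesting, worth a try. Point codex at another product, and 30 minutes later you have its architecture, data model, and prompts with cost estimates. A 378-line rebuild plan. "/goal implement until your output matches theirs exactly"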

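Summary: A user pointed Codex at an existing product and within 30 minutes it had analyzed the product and produced a full technical blueprint, including the architecture, the data model, and prompts with cost estimates, along with a 378-line rebuild plan. With a single explicit instruction ("/goal implement...") Codex can then be asked to rebuild something that matches the target product's behavior exactly, which demonstrates its reverse-engineering and code-generation capabilities.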

Alibaba Cloud @alibaba_cloud · May 21 · 76

Qwen3.7-Max just landed at 56.6 on the Artificial Analysis Intelligence Index — a solid 4.8pt jump over Qwen3.6-Max-Preview. @ArtificialAnlys Sharper sci reasoning, stronger agentic chops, better coding, and it hallucinates less.

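Summary: Alibaba has released its latest closed-weights flagship model, Qwen3.7 Max, which scores 56.6 on the Artificial Analysis Intelligence Index, 4.8 points above the previous Preview release and the closest Alibaba has come to the international frontier. The gain comes mainly from stronger scientific reasoning, agentic, and coding capabilities, and a sharp drop in hallucination rate (from 44.2% to 22.9%) is a major contributor. The context window has been extended to 1 million tokens, the model still supports text input and output only, and pricing has not yet been announced.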

Qwen @Alibaba_Qwen · May 21 · 76

🚀Qwen3.7-Max just landed at 56.6 on the Artificial Analysis Intelligence Index — a solid 4.8pt jump over Qwen3.6-Max-Preview. @ArtificialAnlys ⚡️Sharper sci reasoning, stronger agentic chops, better coding, and it hallucinates less.

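Summary: Alibaba recently released its new closed-weights flagship, Qwen3.7 Max. It scores 56.6 on the Artificial Analysis Intelligence Index, 4.8 points above Qwen3.6 Max Preview, the closest an Alibaba model has come to the global frontier. The upgrade shows mainly in scientific reasoning, agentic capability, and code generation, along with a markedly lower hallucination rate. Notably, part of the score gain comes from the model declining to answer more often rather than purely from higher factual accuracy. The context window has grown to 1 million tokens and the weights remain closed. Even so, the model still trails comparable products from OpenAI, Anthropic, and Google overall.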

Rohan Paul @rohanpaul_ai · May 21 · 64

A general-purpose LLM can produce frontier research when given enough test-time compute. Here, just a general-purpose OpenAI model has connected algebraic number theory to plane geometry and used that bridge to beat a decades-old conjecture. Shows how frontier models may already contain useful latent mathematical competence, and the bottleneck is partly how long and how well they are allowed to think.

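Summary: OpenAI's general-purpose reasoning model recently resolved the decades-old planar unit distance conjecture (an Erdős conjecture) by connecting algebraic number theory with plane geometry. The key point is that the model is not a dedicated theorem-proving engine; its success depended on longer and deeper test-time computation rather than only on more training data. This suggests frontier models already contain latent mathematical research ability, and that the current bottleneck is partly how long and how well they are allowed to "think". The future is not AI replacing human judgment but AI widening the territory of ideas before human judgment begins, which in turn drives scientific discovery and innovation.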

🚨 AI News | TestingCatalog @testingcatalog · May 21 · 72

Alibaba released Qwen 3.7 Max, its latest proprietary model for agentic coding. Qwen 3.7 Max scores 56.6 on the Artificial Analysis Intelligence Index, outperforming recently released Gemini 3.5 Flash and Kimi K2.6.

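Orange AI @oran_ge · May 21 · 81

A milestone moment in AI. An unreleased internal OpenAI reasoning model autonomously solved the planar unit distance problem posed by Erdős in 1946. The chain of thought runs to 125 pages, and the core move was to bring a toolkit from algebraic number theory to bear on a discrete geometry problem, a cross-field connection humans had not thought of in 80 years. The most interesting part is that the model was not trained specifically for math; it is a general-purpose reasoning model. This suggests that once reasoning ability is strong enough to cross some threshold, creativity emerges naturally. Congratulations, humanity.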

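Summary: An unreleased internal general-purpose reasoning model from OpenAI autonomously solved the planar unit distance problem posed by the mathematician Erdős in 1946, overturning nearly 80 years of expectations about the structure of the solution. With a 125-page chain of thought, the model applied tools from algebraic number theory to a discrete geometry problem, a methodological breakthrough across fields. More notably, the model was not trained specifically for mathematics, which suggests that general reasoning ability past a certain threshold may naturally give rise to creativity, and marks a key step for AI in basic science.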

🚨 AI News | TestingCatalog @testingcatalog · May 21 · 74

Qwen 3.6 models are now 2.5x faster on Atomic Chat with new MTP speedups. > MTP drafts several tokens ahead and verifies them in one pass. The speedup depends on the memory moved per pass. Users can run Qwen 3.6 models locally via the open-source Atomic Chat to test them!

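Summary: The new MTP technique drafts several tokens ahead and verifies them in a single pass, making Qwen 3.6 models up to 2.5x faster in Atomic Chat. The speedup is large for dense models (for example Qwen 3.6 27B, from 51 to 117 tokens/s) and smaller for MoE models (for example Qwen 3.6 35B-A3B, 25%). MTP reaches a draft acceptance rate of about 80% with no accuracy loss and needs only about 1GB of extra VRAM. Users can test the models locally with the open-source Atomic Chat app.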

Artificial Analysis @ArtificialAnlys · May 21 · 70

Alibaba's new Qwen3.7 Max model scores 56.6 on the Artificial Analysis Intelligence Index, 4.8 points higher than Qwen3.6 Max Preview (51.8). While Alibaba still trails models from OpenAI, Anthropic and Google, Qwen3.7 Max is the closest they have been to the frontier.

Qwen3.7 Max is @Alibaba_Qwen's latest proprietary flagship, scoring 56.6 on the Intelligence Index, a 4.8 point gain over Qwen3.6 Max Preview (51.8) released in April. Qwen3.7 Max continues Alibaba's pattern, in place since Qwen2.5 Max (January 2025), of releasing Max and Plus models as closed weights while the rest of the Qwen line remains open weights. The leading open weights Qwen on the Intelligence Index is Qwen3.6 27B (Reasoning, 45.8) released in April 2026, and the leading open weights MoE Qwen is Qwen3.5 397B A17B (Reasoning, 45.0) released in February 2026.

Key takeaways for the reasoning variant:
➤ The Intelligence Index gains over Qwen3.6 Max Preview are concentrated in scientific reasoning, agentic capability and coding. CritPt +9.7 p.p (3.7% to 13.4%), HLE +9.2 p.p (28.9% to 38.1%), TerminalBench Hard +6.9 p.p (43.9% to 50.8%) and GDPval-AA +42 Elo (1504 to 1546). Scores on other benchmarks in the Intelligence Index are flat compared to Qwen3.6 Max Preview.
➤ A significant share of the Intelligence Index gain is driven by higher abstention on AA-Omniscience, not higher accuracy. Qwen3.7 Max's accuracy on AA-Omniscience dropped 7.6 p.p (37.7% to 30.1%), while its hallucination rate dropped 21.3 p.p (44.2% to 22.9%). The model is choosing not to answer more questions rather than recalling more facts. Because hallucination rate and accuracy both feed into the Intelligence Index, the hallucination reduction is one of the larger single contributors to the +4.8 point gain on the Intelligence Index.
➤ Qwen3.7 Max used 96.7M output tokens to run the Intelligence Index, ~31% more than Qwen3.6 Max Preview (73.9M). It sits mid-pack on frontier token usage: above GPT-5.5 (high, 44.5M) and Gemini 3.1 Pro Preview (57.3M), below Claude Opus 4.7 (Adaptive Reasoning, Max Effort, 112M), Kimi K2.6 (166M) and DeepSeek V4 Pro (Reasoning, Max Effort, 187M).

Key model details:
➤ Context window: 1M tokens (up from 256K on Qwen3.6 Max Preview)
➤ Multimodality: Text input and output only
➤ Pricing: Yet to be announced (Qwen3.6 Max Preview is priced at $1.30/$7.80 per 1M input/output tokens on the @alibaba_cloud first-party API)
➤ Licensing: Proprietary, closed weights
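
How accuracy and hallucination rate can both fall at once is easier to see by splitting responses into correct, incorrect, and abstained. The sketch below rests on an assumption not stated in the post: that hallucination rate means incorrect answers as a share of all non-correct responses, with partial answers ignored. The function name is hypothetical; only the four percentages come from the post.

```python
def split_responses(accuracy: float, hallucination_rate: float):
    """Return (correct, incorrect, abstained) as shares of all questions.

    Assumed definition: hallucination_rate = incorrect / (incorrect + abstained).
    """
    non_correct = 1.0 - accuracy
    incorrect = hallucination_rate * non_correct
    abstained = non_correct - incorrect
    return accuracy, incorrect, abstained


# AA-Omniscience figures quoted in the post.
for name, acc, hall in [("Qwen3.6 Max Preview", 0.377, 0.442),
                        ("Qwen3.7 Max", 0.301, 0.229)]:
    correct, incorrect, abstained = split_responses(acc, hall)
    print(f"{name}: correct {correct:.1%}, incorrect {incorrect:.1%}, "
          f"abstained {abstained:.1%}")
```

Under that assumed definition, wrong answers fall from roughly 27.5% to 16.0% of questions while abstentions rise from roughly 34.8% to 53.9%, which matches the post's reading that the model is declining to answer more often.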

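Summary: Alibaba Cloud released the closed-weights flagship Qwen3.7 Max, which scores 56.6 on the Artificial Analysis Intelligence Index, 4.8 points above Qwen3.6 Max Preview, narrowing the gap to international frontier models. The progress is mainly in scientific reasoning, agentic capability, and coding. Notably, much of the score gain comes from the model choosing "not to answer" more often on the AA-Omniscience benchmark, which cut the hallucination rate from 44.2% to 22.9%. The context window has grown to 1 million tokens, and the Max series remains closed-weights.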

Greg Brockman @gdb · May 21 · 78

our math result is a milestone in new knowledge generation by AI. very exciting to imagine similar results in other scientific fields. "It's very hard to sleep, man" is a pretty good reaction.

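Summary: AI has reached a milestone in generating new knowledge in mathematics. An OpenAI model resolved a famous open problem in combinatorial geometry, the planar unit distance problem (Erdős, 1946), showing for the first time that the number of unit-distance pairs can reach n^{1+δ}, a polynomial improvement over the near-linear grid constructions previously known. This marks an important step for AI from solving known problems to discovering new mathematics. The result drew a strong "hard to sleep" reaction from researchers and is seen by some as a sign that the AGI era is approaching.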

Rohan Paul @rohanpaul_ai · May 21 · 78

AI in math is making history again, as OpenAI's general-purpose reasoning model has disproved a major Erdős conjecture from 1946. The important part is not that AI solved a hard math problem, but how little special machinery it needed. For decades, the planar unit distance problem looked almost embarrassingly simple: place points on a plane, then ask how many pairs can be exactly one unit apart. For decades, the best examples looked like stretched versions of a square grid, so mathematicians believed grids were almost the best possible design. OpenAI's internal model broke that picture by finding an infinite family of constructions that gives a polynomial improvement, with the proof checked by external mathematicians. The point to note is that the model was not a bespoke theorem-proving engine trained only for this problem, and the official post says its success improved with more test-time compute, meaning more reasoning at inference rather than only more training. That matters so much because research progress often comes from holding a fragile chain of ideas together long enough to cross from one field into another. In this case, the bridge ran from a plain geometric question into deep algebraic number theory, including machinery like infinite class field towers and Golod–Shafarevich theory. And now we see that a general-purpose reasoning system appears able to search a conceptual space where human taste, field boundaries, and inherited guesses may have quietly narrowed the path. So the future is not machines replacing judgment, but machines widening the map before judgment begins.
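
To make the problem statement concrete, the sketch below counts unit-distance pairs by brute force for a plain k × k integer grid. This is only the simplest baseline for illustration: it is neither the rescaled grid behind the classical bound nor the new construction from the proof, and the function names are hypothetical.

```python
from itertools import combinations


def unit_distance_pairs(points):
    """Count unordered pairs of integer points at distance exactly 1."""
    return sum(1 for (x0, y0), (x1, y1) in combinations(points, 2)
               if (x0 - x1) ** 2 + (y0 - y1) ** 2 == 1)


def square_grid(k):
    """All integer points (x, y) with 0 <= x, y < k."""
    return [(x, y) for x in range(k) for y in range(k)]


for k in (2, 4, 10):
    n = k * k
    print(n, unit_distance_pairs(square_grid(k)))
```

For this plain grid the count is 2k(k − 1), which grows only linearly in the number of points n = k²; the question in the post is how much faster than that the count can be made to grow.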

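Summary: OpenAI's general-purpose reasoning model autonomously solved the planar unit distance problem, a famous mathematical problem open since 1946. Rather than using a theorem-proving engine built specifically for mathematics, the model relied on more compute at inference time and found new constructions that beat the traditional grid structures. This marks the first time AI has autonomously solved a central open problem in mathematics. More importantly, the model connected a geometric problem to deep theory such as algebraic number theory, showing the potential of general-purpose AI for cross-field research and for widening the boundaries of human understanding.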

Rohan Paul @rohanpaul_ai · May 21 · 67

A 10 million parameter model just outperformed deterministic rivals 3 times its size by doing something regular recursive models don't do: exploring multiple reasoning paths at the same time. Most AI reasoning models are trapped on a single train of thought, and GRAM ("Generative Recursive Reasoning") is the first to break that by letting the model think in parallel universes simultaneously. The problem is that all existing recursive models are fully deterministic, meaning given the same input they always follow the exact same reasoning path and can never escape a wrong trajectory or discover more than 1 valid answer. GRAM fixes this by injecting learned randomness at each refinement step, so the model samples a slightly different direction each time rather than snapping to 1 fixed next state, which produces a spread of diverse reasoning trajectories. At test time the model runs many of these paths in parallel and selects the best one using a small reward predictor trained alongside the main model, adding a "width" scaling axis on top of the usual "depth" axis of running more recursion steps. On hard Sudoku puzzles, GRAM with 10M parameters hits 97% accuracy versus 87.4% for the best prior recursive model, and with only 20 parallel samples it outperforms every deterministic baseline even at 320 recursion steps. On tasks with many valid answers like N-Queens, deterministic recursive models collapse as the number of solutions grows, while GRAM maintains near-perfect accuracy throughout. The same stochastic framework also acts as a generator: given a blank board, GRAM produces valid Sudoku puzzles 99% of the time using 16 steps, versus 1,000 steps and 55M parameters for the best diffusion baseline at just 91%. --- Paper Link – arxiv. org/abs/2605.19376v1
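
The "width" idea in the post (sample several noisy refinement trajectories, then keep the one a scorer ranks best) can be illustrated with a toy on N-Queens. This is not GRAM: there is no neural network, the noise is a fixed probability rather than learned, and the exact conflict count stands in for the learned reward predictor. All names and parameter values are hypothetical.

```python
import random


def conflicts(cols):
    """Number of attacking queen pairs; cols[r] is the queen's column in row r."""
    n = len(cols)
    return sum(1 for a in range(n) for b in range(a + 1, n)
               if cols[a] == cols[b] or abs(cols[a] - cols[b]) == b - a)


def refine(cols, rng, noise=0.1):
    """One stochastic refinement step: usually greedy, occasionally random."""
    cols = list(cols)
    row = rng.randrange(len(cols))
    if rng.random() < noise:
        cols[row] = rng.randrange(len(cols))  # random move to escape a bad path
    else:
        cols[row] = min(range(len(cols)),
                        key=lambda c: conflicts(cols[:row] + [c] + cols[row + 1:]))
    return cols


def run_trajectory(n, steps, rng):
    """Start from a random board and apply `steps` refinement steps (depth)."""
    cols = [rng.randrange(n) for _ in range(n)]
    for _ in range(steps):
        cols = refine(cols, rng)
    return cols


def best_of(n, steps, width, seed=0):
    """Run `width` independent trajectories; keep the lowest-conflict board."""
    rng = random.Random(seed)
    return min((run_trajectory(n, steps, rng) for _ in range(width)), key=conflicts)


board = best_of(n=8, steps=60, width=16)
print(board, conflicts(board))
```

Here `steps` plays the role of the depth axis and `width` the number of parallel samples; raising `width` can only keep or improve the selected board's score for a given seed.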

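Summary: GRAM, a model with only 10 million parameters, introduces learnable randomness and explores several different paths in parallel at inference time, breaking the single-track limitation of traditional recursive models. At test time it runs these parallel trajectories together and uses a reward predictor to select the best result, adding a "width" dimension on top of depth. In experiments GRAM reaches 97% accuracy on hard Sudoku, well above the best previous deterministic model; it also holds up on N-Queens, which has many valid solutions, and can efficiently generate valid Sudoku puzzles. The framework offers a new way to improve the reasoning ability of small models.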

X.PIN @thexpin · May 21 · 85

Just tested Alibaba's brand new Qwen3.7-Max. Prompt: build a single-file physics-simulation webpage: wind tunnel, cloth, soft body, fluid, all in one index.html, CSS + JS inlined.

Chubby♨️ @kimmonismus · May 21 · 84

OpenAI made history today. An internal reasoning model autonomously disproved a famous conjecture in mathematics that stood for nearly 80 years. The problem: In 1946, Paul Erdős asked how many pairs of points can be exactly 1 unit apart if you place n points on a flat surface. The best known answer came from square grid constructions, and Erdős himself conjectured you can't do meaningfully better. Mathematicians believed this for decades. The AI proved him wrong. It found entirely new point configurations that beat the square grid by a fixed polynomial factor, not a marginal improvement, a real mathematical gap. The proof uses methods from algebraic number theory, a completely different branch of math, Class field towers, Golod-Shafarevich theory, tools nobody expected to be relevant to a geometry problem about distances in the plane (reminds me of move 37, AlphaGo tbh). Fields Medalist Tim Gowers calls it "a milestone in AI mathematics." The proof was verified by leading external mathematicians. According to OpenAI, this is the first time AI has independently solved a prominent open research problem in mathematics! Caveat: Obviously OpenAI chose which problems to test the model on. So "autonomous" means the model generated the idea and wrote the proof, not that it wandered into the problem on its own. But if reasoning models can reliably make cross-domain connections like this, finding paths that experts didn't prioritize, this changes research far beyond math. Biology, physics, materials science, medicine. This isn't AI reproducing human knowledge anymore. This is AI producing new knowledge. That's a qualitative shift.

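Summary: An internal OpenAI reasoning model autonomously resolved the planar unit distance problem, a famous open problem in mathematics that stood for nearly 80 years. The model disproved Paul Erdős's conjecture by finding entirely new point configurations that beat the traditional square-grid constructions by a fixed polynomial factor. The proof uses cross-disciplinary methods such as algebraic number theory, was verified by external mathematicians, and was called "a milestone in AI mathematics" by Fields Medalist Tim Gowers. This is the first time AI has independently solved a prominent open problem in mathematics, marking a shift from reproducing knowledge to creating it, and its cross-domain reasoning could have far-reaching effects on research in many disciplines.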

Rohan Paul @rohanpaul_ai · May 21 · 69

More good news for local LLMs from atomic[.]chat, which runs 100% offline on your computer. They just showed MTP (Multi-Token Prediction) pushing local Qwen models from 51 to 117 tokens/s on dense 27B. And an MoE 35B-A3B model rose from 218 to 267 tokens/s on 2x RTX 5090. Instead of generating and checking one token at a time, MTP drafts multiple future tokens and verifies them together, so the GPU does less repeated work for every word it prints. And this makes local LLMs much faster when the draft tokens are accepted often enough. For many local LLM runs, the limit is not pure compute, but memory bandwidth: how fast the GPU can keep feeding weights into computation. A local GPU generating text often spends most of its time pulling model weights from VRAM again and again for each token, so if MTP lets the model check several drafted tokens in one forward pass, it reduces how often the same giant weight matrix has to be reread. The most interesting claim in their test is ~80% draft acceptance with zero accuracy loss and only ~1GB extra VRAM, because speculative decoding often becomes useful only when the draft tokens are accepted often enough. So we get this strong local AI result because it improves generation speed without changing the model's answers, but the dense model is the real winner because memory bandwidth was its main bottleneck. Their GitHub repo is fully open source.
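
Why the acceptance rate matters can be sketched with a simple expectation. The model below assumes each drafted token is accepted independently with the same probability and that a pass always yields one token from the verifier itself; both are simplifying assumptions, and the draft length of 3 is hypothetical since the post does not state it. Only the ~80% acceptance and the tokens/s figures come from the post.

```python
def expected_tokens_per_pass(accept_prob: float, draft_len: int) -> float:
    """Expected tokens emitted per verification pass.

    Accepted drafts form a prefix, and the pass always contributes one more
    token from the verifier, so the value lies between 1 and draft_len + 1.
    """
    return sum(accept_prob ** i for i in range(draft_len + 1))


def speedup(before_tps: float, after_tps: float) -> float:
    """Measured throughput ratio."""
    return after_tps / before_tps


print(expected_tokens_per_pass(0.8, 3))   # tokens per weight read, idealized
print(round(speedup(51, 117), 2))         # dense 27B figures from the post
print(round(speedup(218, 267), 2))        # MoE 35B-A3B figures from the post
```

The idealized tokens-per-pass figure is an upper bound on the gain from fewer weight reads; drafting and verification overhead pull the measured speedup below it.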

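Summary: atomic.chat's MTP (multi-token prediction) technique verifies several draft tokens at once, which reduces how often the GPU has to re-read the model weights and markedly speeds up local LLM inference. In testing, the 27B dense model went from 51 to 117 tokens/s, an increase of about 129%, and the 35B MoE model gained about 25% on 2x RTX 5090. The technique reaches a draft acceptance rate of about 80% with no accuracy loss and needs only about 1GB of extra VRAM. Because a dense model has to read all of its parameters, it benefits more. The project is open source.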

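AYi @AYi_AInotes · May 21 · 76

Honestly, I read this OpenAI post three times. The first time I understood "AI solved an 80-year-old math mystery". The second time I understood "a geometry problem cracked with number theory". Only on the third read did it sink in: the most striking thing is not the result, it is that the AI came up with this route itself, while we humans spent 80 years thinking the route was too obscure to be worth taking.

The problem is the planar unit distance problem, posed by Erdős in 1946. Put simply: scatter a set of points on the plane so that as many pairs as possible are exactly distance 1 apart. For 80 years mathematicians all believed one conclusion: the optimal arrangement looks like a square grid and cannot be improved much further. OpenAI's AI said: you were wrong. It found a whole family of new constructions that are not square grids and are clearly more efficient than them.

What tools did it use? The most obscure corner of algebraic number theory: infinite class field towers and Golod–Shafarevich theory. Geometers and number theorists basically never talked to each other, and the AI said you should 🤣

Fields Medalist Tim Gowers wrote in his review: if a human had written this, I would recommend acceptance at the Annals of Mathematics outright. Number theorist Arul Shankar said: AI is not just an assistant; it had an original, brilliant idea and carried it out in full. Its 125-page chain of thought has been published and human mathematicians have verified it, which shows this is not hype.

The role of AI in mathematics used to be clear: assist with verification, do calculations for people, search for known patterns. This time is different. The AI thought up a route by itself, one that humans for 80 years considered too obscure, too counterintuitive, not worth taking. The AI took it anyway, and it worked.

How many of the routes humans never tried because they seemed unpromising actually lead somewhere? It sends a bit of a chill down the spine, but mostly it is exciting hhh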

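Translation: An OpenAI model autonomously cracked the planar unit distance problem, a famous open problem posed by the mathematician Erdős in 1946. For nearly 80 years the field broadly believed that the optimal construction resembled a square grid. Using Golod-Shafarevich theory, a little-used corner of algebraic number theory, the model found a whole family of new, more efficient constructions and overturned that belief. The achievement marks the first time AI has independently solved a central open problem in a field of mathematics, and the key was proposing and fully carrying out a novel path that humans had never tried because intuition said it would not work.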

SemiAnalysis@SemiAnalysis_ · May 21 · 60

TPU ALERT: For OSS production Kubernetes distributed inference, Google just added nightly CI for llm-d. Great step by Google to start enabling the wider ML community for TPUs. TPU is catching up to NVIDIA for llm-d CI & code quality. In comparison, although llm-d is AMD's officially recommended production Kubernetes inference solution, @AnushElangovan has yet to add any AMD GPUs or AMD NICs to the CI.

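Translation: TPU alert: for open-source, production-grade Kubernetes distributed inference, Google has just added nightly CI for llm-d. This is an important step by Google toward getting the wider ML community onto TPUs. TPU is catching up with NVIDIA on llm-d CI and code quality. By contrast, although llm-d is AMD's officially recommended production Kubernetes inference solution, @AnushElangovan has not yet added any AMD GPUs or AMD NICs to the CI.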

Ethan Mollick@emollick · May 21 · 63

If this is true, using the best public estimates we have of LLM resource use, solving this Erdos problem took 0.6–6.3 kWh of electricity and about 3–31 liters of water. So that is less than three almonds' worth of water and the electricity equivalent of 2–20 miles of EV driving.

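Translation: Based on public estimates, the resources an LLM used to solve the Erdos problem were very low: only 0.6–6.3 kWh of electricity (equivalent to an EV driving a few miles) and about 3–31 liters of water (less than three almonds' worth). The quoted estimate further states that the run used GPT-5.6 Pro, took about 5 to 32 hours, and cost between $120 and $1,000. The core point is that, relative to the significance of solving a math problem like this, the resources and time the LLM needed were modest.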
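The electricity comparison above is simple arithmetic and can be checked with a short sketch. The efficiency figure of 0.3 kWh per mile is an assumption (a rough value for a typical EV, not a number from the post), and the function name is hypothetical.

```python
# Assumption: a typical EV uses roughly 0.3 kWh per mile; not from the post.
EV_KWH_PER_MILE = 0.3


def ev_miles(kwh: float, kwh_per_mile: float = EV_KWH_PER_MILE) -> float:
    """Miles of EV driving that a given amount of electricity would cover."""
    return kwh / kwh_per_mile


# Energy range quoted in the post: 0.6 to 6.3 kWh.
low, high = ev_miles(0.6), ev_miles(6.3)
print(f"about {low:.0f} to {high:.0f} miles of EV driving")
```

Under that assumed efficiency the range lands close to the 2–20 miles quoted in the post. A different efficiency figure shifts both ends proportionally.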

Z.ai@Zai_org · May 21 · 75

http://x.com/i/article/2057206923208884224

# Next-generation LLM Inference Network: How ZCube Alleviates Network Bottlenecks?

LLM inference is reshaping AI infrastructure. The network used to be the least interesting part of an inference cluster. That isn't true anymore. With long-context inference and Prefill-Decode disaggregation now standard, the network sits on the critical path of throughput, tail latency, and per-token serving cost.

To address the increasingly severe topology-induced congestion in Prefill-Decode disaggregated deployments, Z.ai, Harnets.AI, and Tsinghua University jointly developed and deployed the ZCube network architecture in an online production environment. The deployment shows that system-level innovation at the network architecture layer can unlock hardware potential in a highly cost-effective way.

In production benchmarking for the GLM-5.1 coding workload, ZCube delivered significant gains through architectural optimization alone:

- Cost optimization: GPUs, the software stack, and applications remained unchanged, while switch and optical module CapEx was reduced by 33%.
- Throughput improvement: Average GPU inference throughput increased by 15%.
- Latency improvement: TTFT P99 was reduced by 40.6%.

The root cause of the congestion lies in the shift of inference traffic patterns. As PD disaggregation becomes mainstream, cross-node KV Cache transfers make inference traffic highly asymmetric, with dynamically changing sources, destinations, and traffic volumes. In traditional ROFT (Rail-Optimized Fat-Tree) architectures, static topology and port mappings can easily concentrate traffic on a limited set of switches and links, causing local hotspots, queue buildup, and PFC backpressure. This leads to a structural issue where aggregate bandwidth appears sufficient, yet localized congestion occurs frequently.

ZCube addresses this issue by using a fully flattened network topology together with a hybrid single-rail / multi-rail access design.
At the network architecture layer, it decouples and distributes PD traffic across a broader path space, reducing the probability of topology-induced congestion at its source. This provides a more efficient networking foundation for next-generation hyperscale inference clusters.

# Network Becoming a Bottleneck for Effective Inference

When thousands of GPUs serve online inference requests concurrently, every KV Cache transfer and every data synchronization operation traverses the inter-GPU network. As long-context inference and Prefill-Decode disaggregated inference gradually become mainstream, data exchange between Prefill and Decode nodes continues to grow. Network bandwidth, and more importantly the ability to use it effectively, has begun to affect cluster-level throughput and latency directly.

To quantify the impact of networking on inference performance, we first conducted an ablation study on a 512-GPU cluster. We kept GPU compute, the software stack, the model, and application logic unchanged, and only adjusted the available NIC bandwidth cap. We then measured changes in overall cluster throughput and Time to First Token (TTFT).

For example, when network bandwidth was increased from 100Gbps to 200Gbps, overall inference throughput improved by approximately 19%, while Time to First Token, or TTFT, decreased by approximately 22%. This indicates that, in LLM inference, network bandwidth has become one of the key factors constraining service performance.

# 1. Network Congestion in Inference

Today, AI clusters commonly use Clos, or Fat-Tree, architectures. The basic idea is to scale the network by stacking multiple layers of switches. However, the performance of Clos networks depends heavily on ideal load balancing across switches, which is difficult to achieve in practice due to routing policies and real traffic patterns.
For example, in many two-tier Fat-Tree deployments, which consist of Spine and Leaf layers, traffic across Spine switches can become severely imbalanced. As a result, upper-layer applications often fail to obtain the expected network performance.

To reduce the overhead of cross-layer forwarding, the industry often adopts ROFT (Rail-Optimized Fat-Tree) architectures [1]. As shown in Figure 3, ROFT groups GPUs by index ("rail"), and connects GPUs with the same index to the same Leaf switch, reducing the communication cost across Spine switches. ROFT works well for certain training traffic patterns.

However, in Prefill-Decode disaggregated inference, we observed a more prominent issue: KV Cache transfers exhibit strong source-destination asymmetry. Different GPUs and different NICs carry highly uneven communication loads, as shown in Figure 4. As a result, ROFT’s rail mapping no longer naturally translates into load balancing. Instead, traffic can become concentrated on a small number of Leaf switches and links, leading to link congestion and degraded transfer performance. This manifests in several ways:

- Some Leaf switches become persistent load hotspots, increasing the probability that multiple KV Cache transfer flows compete on the same links. As a result, actual transfer throughput can fall far below the NIC bandwidth capacity.
- Certain egress queues on some Leaf switches remain at high depth for extended periods and frequently trigger PFC backpressure, as shown in Figure 5.
- Link congestion further amplifies tail latency, affecting both TTFT and overall throughput.

It is important to distinguish between the two types of network congestion, as illustrated in Figure 6:

- Unavoidable congestion: For example, when multiple GPUs send data to the same destination at the same time, contention on the final-hop link is inevitable.
- Avoidable congestion: This is caused by topology design, traffic mapping, or imbalanced multipath utilization.
Fundamentally, it is an architecture-level design problem.

For the first type of congestion, we typically rely on congestion control, traffic shaping, and related mechanisms to mitigate its impact. For the second type, new network transport mechanisms such as adaptive routing [2], packet spraying [3,4], and MRC [5] can help. However, a more effective approach is to prevent network conflicts that should not occur in the first place through innovation at the network architecture layer.

Prefill-Decode disaggregated inference is a typical example. If the network topology cannot match the traffic pattern, the system will repeatedly generate load hotspots and link conflicts. Solving this problem requires rethinking the inference network architecture itself.

# 2. ZCube Network Architecture

To address the above issues, we deployed a new ZCube network architecture [6]. ZCube breaks away from the traditional Clos design philosophy of hierarchical switch stacking and instead introduces a fully flattened GPU server interconnect. The ZCube routing strategy, designed specifically for the ZCube architecture, fully leverages the structural properties of the flattened topology. It can achieve near-ideal load balancing across all switches in the network, thereby significantly improving overall cluster network bandwidth.

Compared with Clos, ZCube has a natural advantage in load balancing. This advantage benefits both training clusters and inference clusters. Importantly, ZCube achieves these performance gains while reducing switch and optical module costs by approximately one third compared with Clos. Based on current mainstream switch and NIC configurations, ZCube can support flattened networking for tens of thousands, or even hundreds of thousands, of GPUs.

## 2.1 ZCube Core Architecture

As shown in Figure 7, the core ideas of ZCube are:

1. Remove the Spine switch layer.
2. Divide Leaf switches into two groups of equal size, typically odd-numbered switches and even-numbered switches.
3. Establish a complete bipartite interconnect between the two switch groups.
4. Connect the two ports of each GPU NIC to the corresponding switches in the two groups using single-rail and multi-rail access patterns.

Suppose each GPU has a corresponding NIC with two ports, i.e., p = 2. There are n GPUs in total, and GPUs and NICs share the same indices: 1, 2, …, n. Let k denote the number of GPUs connected to each switch. The total number of switches is 2n/k, numbered 1, 2, …, 2n/k. For GPU i, where 1 ≤ i ≤ n:

- The first port connects to the odd-numbered switch: ((i − 1) mod (n/k)) × 2 + 1
- The second port connects to the even-numbered switch: ⌈i/k⌉ × 2

The two switch groups are connected as a complete bipartite graph: every odd-numbered switch connects to every even-numbered switch. A ZCube topology under dual-port NIC configuration, with p = 2, n = 32, and k = 8, is shown in Figure 7.

## 2.2 Key Properties of ZCube

### Network Diameter

ZCube has a network diameter of two switch hops, meaning any pair of GPUs can reach each other through two switches. This sits between a one-layer switch network, which has one switch hop but limited scale, and a conventional two-layer switch network, which supports a larger scale but typically requires three switch hops and incurs higher latency.

### Load Balancing

First, the ZCube routing strategy ensures that each GPU pair has a unique optimal path, avoiding traffic conflicts caused by multipath route selection.

Second, ZCube uses two complementary GPU-to-switch connection patterns. One switch group connects to GPUs in a single-rail pattern, where each switch connects to a contiguous range of GPU IDs. The other switch group connects to GPUs in a multi-rail pattern, where each switch connects to GPUs with the same relative index across groups.
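The port-mapping rule from Section 2.1 can be sketched in a few lines. This is an illustrative reading of the two formulas given in the article, not the authors' code: the function names `zcube_ports` and `build_topology` are hypothetical, and the example reuses the article's own configuration (p = 2, n = 32, k = 8).

```python
from collections import defaultdict


def zcube_ports(i: int, n: int, k: int) -> tuple[int, int]:
    """Return (odd_switch, even_switch) for GPU i, 1-indexed.

    Direct transcription of the Section 2.1 formulas for a dual-port NIC:
      first port  -> ((i - 1) mod (n/k)) * 2 + 1
      second port -> ceil(i / k) * 2
    """
    if n % k != 0:
        raise ValueError("n must be a multiple of k")
    if not 1 <= i <= n:
        raise ValueError("GPU index out of range")
    odd_switch = ((i - 1) % (n // k)) * 2 + 1
    even_switch = ((i + k - 1) // k) * 2  # integer form of ceil(i / k) * 2
    return odd_switch, even_switch


def build_topology(n: int, k: int) -> dict[int, list[int]]:
    """Map each switch number to the list of GPU indices attached to it."""
    switches: dict[int, list[int]] = defaultdict(list)
    for gpu in range(1, n + 1):
        odd_switch, even_switch = zcube_ports(gpu, n, k)
        switches[odd_switch].append(gpu)
        switches[even_switch].append(gpu)
    return dict(switches)


# The article's example configuration: n = 32 GPUs, k = 8 GPUs per switch.
topology = build_topology(32, 8)
for switch in sorted(topology):
    print(switch, topology[switch])
```

With these formulas the even-numbered switches each receive a contiguous block of GPU IDs, while the odd-numbered switches each receive GPUs spaced n/k apart, which appears to correspond to the single-rail and multi-rail patterns described above.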
This design enables ZCube to achieve highly effective load balancing across the entire switch fabric under both typical AI training traffic patterns, such as AllReduce and All-to-All, and typical AI inference traffic patterns, where source-destination relationships are uncertain, and NIC loads can be highly imbalanced. As a result, ZCube can avoid the second type of network congestion described earlier at the architecture layer. As shown in Figure 8, traffic flows that would conflict under ROFT can obtain dedicated network paths under ZCube, thereby avoiding congestion.

### Scalability

ZCube provides strong scalability while preserving its favorable performance characteristics. For example, using one layer of 51.2T switches, each with 128 × 400Gbps ports, ZCube can construct a network connecting 16,384 400Gbps NICs. If higher-capacity switches are used, or if the ZCube network is divided into more planes, the architecture can scale further to support interconnection among tens of thousands or even hundreds of thousands of GPUs.

### Cost

At the same cluster scale, ZCube can reduce switch and optical module costs by approximately one third compared with traditional Clos / ROFT architectures. For example, in a 10,000-GPU AI cluster, ZCube can save roughly 210 million RMB to 640 million RMB in network hardware investment.

These characteristics show that ZCube can achieve better load balancing and performance while requiring lower network hardware cost.

## 2.3 Real-World Cluster Testing: Boosting Inference Performance While Cutting Network Costs

We upgraded the network architecture of a thousand-GPU cluster running GLM-5.1 coding inference services from the original ROFT to the ZCube architecture.
Since the ZCube architecture eliminates the Spine-layer switches found in traditional Clos architectures, the legacy cabling patterns, IP addressing schemes, routing policies, and switch configuration methods established under the Clos framework could not be reused directly, necessitating a complete redesign tailored to ZCube.

To tackle these challenges, the Harnets.AI Network Team designed a comprehensive network solution centered on the ZCube architecture. They developed a suite of automation tools, including the ZCube Controller, a data center layout design tool, and a cabling correctness verification program. This enabled capabilities such as data center deployment planning, cabling validation, automated configuration generation, and batch deployment, effectively resolving numerous hurdles in ZCube deployment. This suite of tools was the critical factor enabling the successful transformation of a large-scale production cluster within an exceptionally tight timeframe.

Following the seamless network architecture migration, we conducted real-world testing on the ZCube architecture by running the GLM-5.1 coding inference services on this cluster. By comparing the cluster's inference performance before and after the upgrade, we found that ZCube boosted the average GPU inference throughput by over 15% compared to the ROFT architecture (as shown in Figure 9), while dropping the P99 tail latency of TTFT by 40.6%.

In summary, for GPU and server hardware of the same scale and configuration, and without modifying any applications, upgrading the networking architecture to ZCube allowed us to not only save 1/3 of the optical modules and switch hardware, but also enable the cluster to serve 15% more inference requests per second. Against the current backdrop of exploding inference workloads and severe shortage of compute resources, this approach proves to be highly pragmatic and valuable.
Currently, this ZCube cluster has been running stably for over two weeks, playing a vital role in powering the GLM-5.1 coding inference services.

# 3. Conclusion

LLM inference is moving from point-wise optimization toward system-level co-design. The coupling between the network and the inference engine is becoming increasingly tight, making networking a critical component of the inference system.

The production deployment of ZCube shows that network architecture innovation can directly unlock the effective capacity of inference systems. By better aligning the network architecture with KV Cache transfers and PD traffic patterns, ZCube reduces the probability of topology-induced congestion at the source, improving throughput and latency while enhancing cluster cost efficiency.

Looking ahead to next-generation LLM infrastructure, network design will evolve from general-purpose interconnects toward model-traffic-driven system co-design. Long-context inference, PD disaggregation, MoE, and integrated training-inference workloads are reshaping intra-cluster communication patterns, requiring network topology, communication libraries, and scheduling policies to be jointly optimized around real model traffic.

We will continue pioneering novel AI network architectures for larger-scale inference and training clusters, upgrading the network from a foundational GPU connection layer into a core driver of token generation efficiency, system resilience, and cost-effectiveness.

# Acknowledgements

ZCube was published at ACM SIGCOMM 2025, and was recognized as work that can “significantly change the way we think about and understand networking.” This is the first large-scale deployment of the technology in a production inference cluster. We thank the Harnets.AI team for their professional support and close collaboration throughout this network architecture upgrade and optimization effort.

## References

[1] NVIDIA. 2023.
SuperPOD: Next Generation Scalable Infrastructure for AI Leadership. https://docs.nvidia.com/dgx-superpod-reference-architecture-dgx-h100.pdf

[2] NVIDIA. 2025. https://developer.nvidia.com/blog/accelerating-ai-storage-by-up-to-48-with-nvidia-spectrum-x-networking-platform-and-partners/

[3] Ultra Ethernet Consortium. Ultra Ethernet specification v1.0.1, 2025.

[4] Tommaso Bonato, Abdul Kabbani, Ahmad Ghalayini, Michael Papamichael, Mohammad Dohadwala, Lukas Gianinazzi, Mikhail Khalilov, Elias Achermann, Daniele De Sensi, and Torsten Hoefler. REPS: Recycled entropy packet spraying for adaptive load balancing and failure mitigation, 2026.

[5] Araujo, J., Chow, A., Handley, M., Lewis, R., Paasch, C., Padhye, J., … & Sur, S. (2026). Resilient AI Supercomputer Networking using MRC and SRv6. arXiv preprint arXiv:2605.04333.

[6] Yan, Z., Li, D., Chen, L., Xiong, D., Gao, K., Zhang, Y., … & Lin, H. (2025, September). From ATOP to ZCube: Automated topology optimization pipeline and a highly cost-effective network topology for large model training. In Proceedings of the ACM SIGCOMM 2025 Conference (pp. 861-881).

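Translation: As long-context inference and Prefill-Decode disaggregated deployment become mainstream, the GPU cluster network has gone from a secondary component to a key bottleneck for inference throughput, tail latency, and cost. Traditional static network topologies clash with the dynamic, asymmetric traffic pattern of KV Cache transfers, causing localized congestion. In response, Z.ai, Harnets.AI, and Tsinghua University jointly developed the ZCube network architecture. It uses a fully flattened topology with a hybrid access design to decouple and spread traffic at the source and so reduce congestion. In GLM-5.1 production testing, with GPUs and the software stack unchanged, ZCube cut switch and optical module cost by 33%, raised average inference throughput by 15%, and reduced P99 TTFT by 40.6%, showing that network architecture innovation can effectively unlock hardware potential.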
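The scale and cost figures in the article above can be sanity-checked with a rough port-counting sketch. Everything here is an assumption layered on the article's description: that each switch spends k ports on GPUs and n/k ports on the other switch group, that a 400Gbps NIC exposes two 200Gbps ports, that a 128 × 400Gbps switch is broken out into 256 × 200Gbps ports, and that the Clos baseline is a non-blocking two-tier fat-tree. The function name is hypothetical.

```python
def zcube_max_gpus(ports_per_switch: int) -> tuple[int, int]:
    """Return (k, n) maximizing GPU count n for a given switch radix.

    Assumed port budget per switch: k downlinks to GPUs plus n/k links to
    every switch in the other group, so n = k * (ports_per_switch - k).
    """
    best_k, best_n = 0, 0
    for k in range(1, ports_per_switch):
        n = k * (ports_per_switch - k)
        if n > best_n:
            best_k, best_n = k, n
    return best_k, best_n


# Assumption: 128 x 400G ports broken out into 256 x 200G ports.
k, n = zcube_max_gpus(256)
print(f"k = {k}, n = {n} dual-port NICs")

# Rough cost proxy: switch ports consumed per NIC port.
# Two-tier non-blocking fat-tree: leaf downlink + leaf uplink + spine port.
# ZCube at the balanced point k = n/k: one downlink + one cross-group port.
FAT_TREE_PORTS_PER_NIC_PORT = 3
ZCUBE_PORTS_PER_NIC_PORT = 2
saving = 1 - ZCUBE_PORTS_PER_NIC_PORT / FAT_TREE_PORTS_PER_NIC_PORT
print(f"switch-port saving: {saving:.0%}")
```

Under these assumptions the maximum lands on the 16,384 NICs quoted in the article, and the port-count proxy gives a saving close to the "approximately one third" figure. Actual hardware cost depends on pricing and optics choices that this sketch does not model.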

Chubby♨️@kimmonismus · May 21 · 64

OpenAI is aiming for a release of their upcoming general-purpose LLM. "We have not pushed this model to the limit on open problems. Our focus is to get it out quickly so that everyone can use it for themselves." What makes this so impressive is that a general-purpose LLM, not specifically trained for math or this problem, appears to get dramatically better simply by using more test-time compute! OpenAI is on a roll.

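Translation: OpenAI is about to release a general-purpose large language model, stressing that it was not specifically trained for any particular problem or for mathematics. The model improves markedly when given more test-time compute, showing the potential of general models as compute scales. OpenAI says the current focus is on releasing quickly so that users can explore for themselves, and that it has not yet pushed the model to its limit on open problems. This marks a new path for large-model development.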

Sam Altman@sama · May 21 · 84

a general-purpose model solved a major open problem in mathematics. we'll be saying this a lot over the coming years, but this is a kinda big milestone. i'm very excited for AI to greatly extend our understanding of the world, but still, i have complicated feelings today.

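Translation: A general-purpose model has solved a major open problem in mathematics. We will be saying this a lot over the coming years, but it really is quite an important milestone. I am very excited for AI to greatly extend our understanding of the world, but today my feelings are still complicated.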

Ethan Mollick@emollick · May 21 · 72

June 2024: The latest general-purpose LLMs could not count the r's in strawberry. July 2025: The latest general-purpose LLMs get gold in the International Math Olympiad. May 2026: The latest general-purpose LLMs solve one of the "best-known questions in combinatorial geometry."

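Translation: June 2024: the latest general-purpose LLMs could not count how many r's are in "strawberry." July 2025: the latest general-purpose LLMs won gold at the International Math Olympiad. May 2026: the latest general-purpose LLMs solved "one of the best-known questions in combinatorial geometry."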

Ethan Mollick@emollick · May 21 · 48

Did we ever learn which OpenAI model won gold at the IMO? It was a year ago and it was called an unreleased internal general-purpose model back then. Has GPT-5.5 Pro Extended caught up with whatever it was?

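Translation: Did we ever find out which OpenAI model won gold at the IMO? That was a year ago, and at the time it was described as an unreleased internal general-purpose model. Has GPT-5.5 Pro Extended caught up with that model?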

Ethan Mollick@emollick · May 21 · 72

It's The Graph again (not the METR graph, the one from the o1 launch). Although no logarithmic decay of ability with increasing compute...

Translation: It's The Graph again (not the METR graph, the one from the o1 launch). Although ability shows no logarithmic decay as compute increases...

Emad@EMostaque · May 21 · 91

Once AI starts solving open problems in novel ways it won’t stop. We are entering the final stage of human solutions to open problems like this. Feels weird, doesn’t it?

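Translation: An OpenAI model has, for the first time, autonomously solved the planar unit distance problem posed by Paul Erdős in 1946, a breakthrough that overturns the conjecture the mathematical community had favored for nearly 80 years. The AI not only gave a better solution but also discovered a whole family of new constructions. The event is seen as a milestone in AI capability, suggesting that AI is starting to make sustained, novel breakthroughs on open scientific problems, and it may mark the arrival of the "final stage" of human-led solutions to such problems.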

Noam Brown@polynoamial · May 21 · 67

Excellent thread from mathematician Tim Gowers on the significance of the @OpenAI model’s breakthrough on the Erdos Unit Distance Problem!

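Translation: An important thread from mathematician Tim Gowers on the @OpenAI model's breakthrough on the Erdos unit distance problem! [Quoting @wtgowers]: If you are a mathematician, you may want to make sure you are sitting down before you read on.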

Greg Brockman@gdb · May 21 · 92

An OpenAI model has achieved a major breakthrough in mathematics, by disproving a central conjecture in discrete geometry that was first posed by Paul Erdős in 1946. This is the first time AI has autonomously solved a prominent open problem central to a field of mathematics.

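Translation: An OpenAI model has made a major breakthrough in discrete geometry, autonomously resolving the planar unit distance conjecture first posed by mathematician Paul Erdős in 1946. It is the first time AI has independently solved a prominent open problem central to a field. For nearly 80 years mathematicians broadly believed that the optimal solution looked roughly like a square grid. The OpenAI model found new, better-performing constructions and overturned that long-held belief.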

AI Notkilleveryoneism Memes ⏸️@AISafetyMemes · May 21 · 87

"This marks the first time AI has autonomously solved a prominent open problem central to a field of mathematics."

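Translation: An OpenAI model autonomously cracked the planar unit distance problem, a famous open problem in mathematics that had stood for nearly 80 years. Paul Erdős posed it in 1946, and the traditional view held that the optimal structure resembles a square grid. The model's breakthrough not only overturned that long-standing assumption but also produced new, better-performing constructions, marking the first time AI has independently solved a major open problem in a core area of mathematics.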