Greg Brockman explains why the public story about AI data center water use is partly wrong: the cooling systems use a closed-loop design that circulates the same stored water instead of constantly pulling fresh water. It works less like a running tap and more like a sealed pool, where water absorbs heat from servers, moves through cooling equipment, then returns to the same circuit. The argument here is not that AI infrastructure has no resource cost, but that public debate often mixes up different cooling designs and treats every data center as if it burns through water the same way. The important distinction is water withdrawal versus water consumption, because a site can hold a large amount of water inside its pipes while using far less new water day to day. OpenAI's official blog post on the Stargate project says the same thing: "Water is one area where details matter. Like many data centers, the Abilene site uses closed-loop cooling rather than traditional evaporative cooling towers. Once the system is filled, water continuously moves through sealed pipes and is recirculated rather than consumed. For Abilene, the one-time initial fill for each building is equal to roughly two Olympic-sized swimming pools. After that, annual water use for the entire cooling system at full buildout is expected to be comparable to a medium-sized office building, or about four average households." --- From 'The Knowledge Project Podcast' YT channel (link in comment)

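Summary: Greg Brockman argues that public perception of AI data center water use is skewed, mainly because it conflates water withdrawal with water consumption. He explains that modern data centers mostly use closed-loop cooling, which works like a "sealed pool": water circulates inside the system absorbing heat, rather than continuously drawing fresh water like a "running tap". A system can therefore hold a large volume of water while needing very little fresh make-up water day to day. OpenAI's Stargate blog post likewise confirms that its site uses closed-loop cooling, with annual water use at full buildout comparable to one office building or about four households. Public debate often oversimplifies because it overlooks the differences between cooling technologies.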

SemiAnalysis@SemiAnalysis_ · 5月21日18

With modern agentic workloads and long context windows, a common bottleneck in serving LLMs at scale is where to store all the KV cache. Luckily, KV cache can be extended beyond HBM into other tiers of memory. Nvidia uses the following naming convention to describe the tiers:
🟠 G1 (HBM): highest bandwidth but (relatively) small
🟠 G2 (host DRAM): still quite fast (traverses PCIe) and an order of magnitude larger than G1
🟠 G3 (SSD/NVMe): slower, shared across the entire node
🟠 G4 (shared network storage): slowest, effectively unlimited in size
At GTC 2026, in a historic partnership with SpaceX and AnthropicAI, Jensen announced the newest tier, G5: a Starlink-attached HDD array in low earth orbit. Excited to see what G6 will be.
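The G1 to G4 ordering above can be illustrated with a toy cache that demotes the least recently used KV blocks to the next slower tier and promotes a block back to G1 when it is hit. This is only a sketch: the class name, the capacities (counted in blocks), and the LRU policy are assumptions for illustration, not how Nvidia's stack or any serving framework actually manages tiers.

```python
from collections import OrderedDict

# Tier names follow the post; everything else here is a hypothetical toy model.
TIERS = ["G1_HBM", "G2_DRAM", "G3_SSD", "G4_NETWORK"]


class TieredKVCache:
    """Toy LRU cache that spills KV blocks from faster tiers to slower ones."""

    def __init__(self, capacities):
        # One LRU-ordered dict per tier. A capacity of None means unbounded;
        # the last tier is assumed to be unbounded (G4 in the post).
        self.capacities = capacities
        self.tiers = [OrderedDict() for _ in capacities]

    def put(self, key, block, tier=0):
        store = self.tiers[tier]
        store[key] = block
        store.move_to_end(key)
        cap = self.capacities[tier]
        # Demote least-recently-used blocks to the next tier when over capacity.
        while cap is not None and len(store) > cap:
            old_key, old_block = store.popitem(last=False)
            self.put(old_key, old_block, tier + 1)

    def get(self, key):
        # Search from the fastest tier to the slowest.
        for level, store in enumerate(self.tiers):
            if key in store:
                block = store.pop(key)
                self.put(key, block, 0)  # promote to G1 on a hit
                return block, TIERS[level]
        return None, None


cache = TieredKVCache([2, 2, 4, None])
for name in "abc":
    cache.put(name, f"kv-{name}")
print(cache.get("a"))  # ('kv-a', 'G2_DRAM'): "a" was demoted when "c" arrived
print(cache.get("a"))  # ('kv-a', 'G1_HBM'): the first hit promoted it
```

A real system would size tiers in bytes and weigh transfer cost against recompute cost; the point here is only the spill and promote pattern across tiers.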

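Summary: To address the KV cache storage bottleneck created by modern AI agents and long context windows, Nvidia describes a tiered memory extension scheme. It starts from fast but capacity-limited HBM (G1) and extends to host DRAM reached over PCIe (G2), node-shared SSD/NVMe (G3), and network storage with effectively unlimited capacity (G4). At GTC 2026, Nvidia also announced, in partnership with SpaceX and AnthropicAI, a conceptual G5 tier: a Starlink-attached HDD array in low earth orbit, intended to push the storage boundary further toward a distributed network architecture.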

Chubby♨️@kimmonismus · 5月21日78

Anthropic is paying SpaceX $1.25 billion per month for compute. Per month. That's $15 billion a year flowing to a company whose total annual revenue is $18 billion. One AI lab is about to account for the majority of SpaceX's commercial income. We only know this because SpaceX filed for an IPO today and had to disclose the terms. The deal was announced weeks ago with no financials attached. Source: Axios


Elon Musk@elonmusk · 5月21日69

SpaceX is actively hiring world-class engineers/physicists for SpaceXAI, even if you have zero prior experience in AI. Smart humans figure it out fast. Please send an email with ~3 bullet points demonstrating evidence of exceptional ability to ai_eng@spacex.com.


SemiAnalysis@SemiAnalysis_ · 5月21日60

TPU ALERT: For OSS production Kubernetes distributed inferencing, Google just added nightly CI for llm-d. Great step by Google to start enabling the wider ML community for TPUs. TPU is catching up to NVIDIA for llm-d CI & code quality. In comparison, although AMD's official recommended production kubernetes inferencing solution is llm-d, @AnushElangovan has yet to add any AMD GPUs or AMD NICs into the CI.


AYi@AYi_AInotes · 5月21日66

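Damn, someone finally gets it: the moat for AI Agents is not the model at all! xAI just pushed a Grok Build update with no flashy new features, only bug fixes and low-level optimizations. But this plain changelog is exactly what convinced me that Grok Build is already ahead of every other coding Agent. The highlights of this update:
1. Fixed silent failures of background sub-agents, the most fatal pain point of any AI Agent
2. Finally supports macOS Intel and Windows ARM, great news for users on older devices
3. Fixed paths containing CJK characters, so Chinese users no longer suffer through path hell
4. Improved context compaction, so the Agent can run longer without crashing
Many companies are still competing over who has the bigger parameter count and who can generate the flashier demo. xAI is quietly filling holes: the invisible ones that crash you in production, and the ones others consider "unimportant" but that drive away 90% of users. What decides the winner is never which cool feature shipped today, but whether yesterday's invisible holes got filled. #Grok #xAI #AIAgent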

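Summary: xAI's update to Grok Build looks unremarkable, with no flashy new features, but it focuses on fundamentals: fixing the fatal "silent failure of background sub-agents", supporting more platforms, and handling paths with Chinese characters. This reflects an engineering-first strategy: while competitors compete on model parameters and demos, xAI is quietly filling the "invisible holes" that crash production environments and drive users away. This series of low-level optimizations and bug fixes is quietly building a long-term moat for its coding Agent.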

meng shao@shao__meng · 5月21日68

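What is Forward Deployed Engineering (FDE)? Why are top AI companies such as OpenAI and Anthropic pushing FDE so hard, and is it the next career worth switching to?

Why are AI companies scrambling for FDEs?
@vasuman's take is blunt: if intelligence itself is being commoditized, the only competitive advantage left is "how you use it and where". Model capability will be leveled by Anthropic, OpenAI and others, and wrapper products will be copied. What is truly hard to copy is embedding AI into the specific business workflows of a specific company. A general-purpose product cannot do that; you have to send people. So the business model of an Applied AI company is to station FDEs at the customer site and do "outsourced AI transformation", with the customer paying for efficiency gains. Someone who can independently "understand the customer's problem → write into an unfamiliar codebase → explain the business value to non-technical executives" is what vas calls a "million-dollar hire".

Core requirement of the role: it must be on-site!
This borrows from the Palantir tradition (where the FDE definition comes from):
· In 2010, Palantir FDEs were stationed in Afghanistan alongside US special forces: the troops ran missions by day, and the FDEs changed code at night.
· In the words of Palantir's CTO: "You cannot build a product for an environment you are not in."
Carried over to AI, this means real efficiency gains require "rebuilding the company around AI", which cannot be done remotely. You have to sit next to the customer and build custom Agents on the company's proprietary data and context.

The three stages of FDE work

1. Audit: ends with a prototype demo
Rotate on-site through departments (for example two weeks in RevOps, one week in procurement, one month in finance), with the goal of:
· Mapping each team's workflow
· Finding the bottlenecks
· Deciding what should and should not be automated
Three very practical rules for "should this be an Agent":
· Rules can be abstracted, but inputs come in varied forms (email / PDF / scans) and tools must be called? Use an Agent!
· Rules and inputs are both predictable? Write ordinary code; it is faster and cheaper!
· Needs pattern recognition plus expert domain judgment? Keep a human!
Two more rules of thumb:
· Volume must be high enough: a process that runs 5 times a month cannot justify the ROI.
· Do not overuse AI: most tasks need only "a chain of tool calls + one LLM orchestration step"; overusing AI brings token costs and lower quality.

2. Evals
A customer spending a million dollars on an AI deployment needs a way to prove "it is actually working". A good eval does not only check whether the final answer is right; it verifies whether the AI reasons the way a person would. Two methods:
· Break down the human's steps and score each one: people solve problems in multiple steps, so list the checkpoints and see whether the AI passes every one.
· Anchor backwards from golden samples: work with senior employees to write out 20 "perfect answers", and use them as the yardstick for all outputs.
The real purpose of evals is to let AI-skeptical executives sign off: they are a tool for commercial trust, not just an engineering tool.

3. Deployment
Several counterintuitive but pragmatic principles:
· Do not do large-scale data migrations. Build APIs on top of the existing data layer (SharePoint, databases) and let the model act as an orchestrator that queries them. The customer spent years and millions rolling out an ERP and will not let you tear it apart again.
· Set up a sandboxed execution environment first, and test safely inside the customer's infrastructure.
· Start from the smallest unit of autonomy and grant permissions gradually. Example: first let the Agent only "find a bug → investigate → file a ticket"; once that is stable, allow it to "write code + open a PR".

How to become an FDE in 30 days?!
vas sees three backgrounds as the easiest entry points: consultants, PMs, and software engineers.

Weak spot for consultants/PMs: engineering ability
The fix is to fill the gap with a portfolio. Pick two of these four projects and go deep:
· A production-grade Agent that runs a complete process from your previous company end to end (calls APIs, logs its reasoning, has failure fallbacks).
· A RAG pipeline for an industry-specific dataset (legal / medical / financial filings).
· An eval framework you wrote yourself, scoring on multiple dimensions (correctness, format, cost, latency).
· An MCP that connects an LLM to a legacy system with no AI support.
vas stresses: "Do not outsource your understanding to AI". Otherwise you get exposed as soon as an interview goes deep.

Weak spot for SWEs: communication
Engineers do the same projects, but must be able to explain each component, technology choice, iteration, and business outcome clearly, and answer "why did you solve this pain point, and how would it go in a real customer scenario".

30-day roadmap (role-agnostic)
Week 1: Agent loop basics (read Anthropic's Building Effective Agents), tool use, guardrails, context vs external memory, audit trail
Week 2: Structured output (JSON), common Demo → Prod pitfalls, checkpoint mechanisms
Week 3: Retries and exponential backoff, cost optimization (small models for small jobs / caching / token caps), building a golden dataset, multi-Agent parallel architecture
Week 4: Review + say it out loud, tying everything to a business metric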
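The three "should this be an Agent" rules and the volume rule of thumb from the audit stage can be written down as a small triage function. This is a sketch under assumptions: the field names, the order in which the checks run, and the 50 runs per month cutoff are all hypothetical (the post only says that 5 runs a month is too few).

```python
from dataclasses import dataclass


@dataclass
class Workflow:
    name: str
    runs_per_month: int
    rules_are_explicit: bool      # can the rules be written down?
    inputs_are_uniform: bool      # same shape every time, or email/PDF/scans?
    needs_tools: bool             # must call external tools or APIs
    needs_expert_judgment: bool   # pattern recognition plus domain expertise


# Hypothetical cutoff; the post gives no number beyond "5 a month is too few".
MIN_RUNS_PER_MONTH = 50


def triage(w: Workflow) -> str:
    """Classify a workflow using the audit-stage rules (check order is assumed)."""
    if w.needs_expert_judgment:
        return "keep human"
    if w.runs_per_month < MIN_RUNS_PER_MONTH:
        return "leave as is"  # volume too low for the ROI to hold
    if w.rules_are_explicit and w.inputs_are_uniform:
        return "plain code"   # predictable rules and inputs: cheaper and faster
    if w.rules_are_explicit and not w.inputs_are_uniform and w.needs_tools:
        return "agent"
    return "needs more audit"


print(triage(Workflow("invoice intake", 400, True, False, True, False)))  # agent
print(triage(Workflow("payroll export", 400, True, True, False, False)))  # plain code
```

In practice the inputs to such a function come out of the on-site rotation, so the value lies in forcing each answer to be written down per workflow rather than in the code itself.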

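Summary: Forward Deployed Engineering (FDE) is a role in which AI companies station engineers at the customer site; its core is embedding AI capability into a company's specific business workflows. As model capabilities converge, the real advantage lies in "how it is used", and FDE is the key to solving the "last mile" of AI adoption. The role draws on the Palantir tradition, insists on on-site work, and rebuilds processes around the company's proprietary data. The work spans three stages: business audit, evaluation (Evals), and pragmatic deployment. Consultants, PMs, and software engineers can make the switch with a 30-day roadmap and a portfolio, but each must close their own gap, whether engineering ability or business communication. FDEs are regarded as "million-dollar hires" and a high-value new career direction in the AI era.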

meng shao@shao__meng · 5月21日69

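Take a look at the Codex remote development setup Alex Finn recommends, even if "1000x productivity" is a bit of an exaggeration 😄

Alex's core idea is to separate the "device that writes code" from the "devices that send instructions":
One primary machine (Mac Studio): the only environment where code is actually written, with all repositories, dependencies, and runtimes in one place.
Several client devices (iPad, iPhone, a second Mac Studio, two Mac minis): used only as "remote controls" that send instructions to the primary machine.
The result: physical location is decoupled from the ability to develop. In bed, at the supermarket, in Japan, in the car, you can keep pushing the same codebase forward.

Three-layer architecture
1. Primary machine (Host)
· Always on, sleep disabled.
· In the Codex app, turn on Settings → Connections → Control this Mac, exposing it as a node that can be remotely controlled.
2. Control side (Clients)
· Every other device turns on Control other devices in Codex. These devices store no code; they only send prompts and view results.
3. Network layer (Tailscale)
· Install Tailscale on all devices to form a private mesh network (based on WireGuard).
· Its role is not only NAT traversal. More importantly, it lets other AI agents (his examples: OpenClaw, Hermes) hop across machines and make changes on different nodes.
· In effect, it abstracts "multiple devices" into "one logical machine" at the network layer.

Setup steps (short version)
1. Pick a desktop device as the Host (preferably a Mac mini or Mac Studio).
2. Turn off automatic sleep in system settings so it stays on.
3. On the Host: turn on Codex → Settings → Connections → Control this Mac.
4. On every other device: turn on Codex → Settings → Control other devices.
5. Install Tailscale on all devices and sign in to the same account to form a private network.
6. (Optional) Deploy cross-machine agents (such as OpenClaw or Hermes) and let them use Tailscale to run tasks across nodes.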

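Summary: The core of the remote development setup proposed by Alex Finn is separating the "host that executes code" from the "terminals that send instructions". One primary machine (such as a Mac Studio) stays on as the only execution environment, holding all code and dependencies; other devices (such as an iPad or iPhone) act only as "remote controls" that send instructions. Connected through Codex's remote control feature and a Tailscale private network, a developer can push the same project forward from any place and any device, decoupling development capability from physical location and improving flexibility and efficiency.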

meng shao@shao__meng · 5月21日68

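Chrome DevTools for Agents 1.0 is officially released https://developer.chrome.com/blog/devtools-for-agents-v1
It observes behavior and inspects output in a real browser, letting an Agent "see the browser". There are three ways to connect:
1. MCP server: the standard protocol for connecting an LLM to DevTools debugging capabilities
2. CLI: a more token-efficient alternative that lets an Agent bundle actions into scripts and run them in batches
3. Agent skills: expert instructions that teach the Agent when and how to call specific tools (such as accessibility and performance debugging)

Seven capabilities are exposed:
1. Automated quality audits: the Agent can run Lighthouse directly, covering accessibility, SEO, best practices, and agentic browsing. It can act as a "quality gate" that stops blocking issues from reaching production.
2. Real user environment emulation: viewport size, geolocation, and network/CPU throttling can all be controlled by the Agent, so responsive and mobile behavior (such as hamburger menus) can be tested without manually adjusting the browser.
3. Chrome extension development and debugging: install, reload, and trigger extension actions, step into background scripts and extension pages, and automate the "save-refresh" loop.
4. WebMCP tool debugging: works with the WebMCP Origin Trial. Sites expose structured tools to the Agent, so the Agent no longer guesses intent from the DOM but lists, calls, and verifies tools directly, significantly lowering the integration barrier.
5. Memory leak detection: supports heap snapshots and identifies typical leaks such as detached DOM nodes. With the memory debugging skill, the Agent acts as a performance expert.
6. Auto-connect session takeover: hand the currently logged-in browser context to the Agent instead of having it open a sandboxed instance. Suited to debugging pages that require authentication (such as admin dashboards), with no need to log in again.
7. Third-party developer tools exposing internal state: web apps can proactively expose internal state and component details to the Agent, so debugging suggestions are based on real runtime data rather than black-box inference.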

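Summary: Chrome DevTools for Agents 1.0 has been released, aiming to let AI Agents "observe" and debug web apps in a real browser. It connects through an MCP server, a CLI, and Agent skills, and provides a set of core capabilities. Agents can run automated quality audits, emulate user environments, debug Chrome extensions, take over logged-in sessions, detect memory leaks, and integrate deeply with internal state exposed by web apps. This significantly improves an Agent's debugging and testing ability in the browser and opens new possibilities for automated development and operations.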

🚨 AI News | TestingCatalog@testingcatalog · 5月21日69

Anthropic 🤝 SpaceX Anthropic is getting up to GB200 of capacity in Colossus 2 in June as a part of the expanded agreement with SpaceX. Partnerships are huge unlocks 👀


Elon Musk@elonmusk · 5月21日80

As the recently expanded partnership with @AnthropicAI demonstrates, @SpaceX is offering AI compute as a service at significant scale. We are in discussions with other companies to do the same. Over time, especially with orbital data centers, we expect to serve AI at extremely high scale.


Z.ai@Zai_org · 5月21日75

http://x.com/i/article/2057206923208884224 # Next-generation LLM Inference Network: How ZCube Alleviates Network Bottlenecks? LLM inference is reshaping AI infrastructure. The network used to be the least interesting part of an inference cluster. That isn't true anymore. With long-context inference and Prefill-Decode disaggregation now standard, the network sits on the critical path of throughput, tail latency, and per-token serving cost. To address the increasingly severe topology-induced congestion in Prefill-Decode disaggregated deployments, Z.ai, Harnets.AI, and Tsinghua University jointly developed and deployed the ZCube network architecture in an online production environment. The deployment shows that system-level innovation at the network architecture layer can unlock hardware potential in a highly cost-effective way. In production benchmarking for the GLM-5.1 coding workload, ZCube delivered significant gains through architectural optimization alone: - Cost optimization: GPUs, the software stack, and applications remained unchanged, while switch and optical module CapEx was reduced by 33%. - Throughput improvement: Average GPU inference throughput increased by 15%. - Latency improvement: TTFT P99 was reduced by 40.6%. The root cause of the congestion lies in the shift of inference traffic patterns. As PD disaggregation becomes mainstream, cross-node KV Cache transfers make inference traffic highly asymmetric, with dynamically changing sources, destinations, and traffic volumes. In traditional ROFT (Rail-Optimized Fat-Tree) architectures, static topology and port mappings can easily concentrate traffic on a limited set of switches and links, causing local hotspots, queue buildup, and PFC backpressure. This leads to a structural issue where aggregate bandwidth appears sufficient, yet localized congestion occurs frequently. ZCube addresses this issue by using a fully flattened network topology together with a hybrid single-rail / multi-rail access design. 
At the network architecture layer, it decouples and distributes PD traffic across a broader path space, reducing the probability of topology-induced congestion at its source. This provides a more efficient networking foundation for next-generation hyperscale inference clusters. # Network Becoming a Bottleneck for Effective Inference When thousands of GPUs serve online inference requests concurrently, every KV Cache transfer and every data synchronization operation traverses the inter-GPU network. As long-context inference and Prefill-Decode disaggregated inference gradually become mainstream, data exchange between Prefill and Decode nodes continues to grow. Network bandwidth, and more importantly the ability to use it effectively, has begun to affect cluster-level throughput and latency directly. To quantify the impact of networking on inference performance, we first conducted an ablation study on a 512-GPU cluster. We kept GPU compute, the software stack, the model, and application logic unchanged, and only adjusted the available NIC bandwidth cap. We then measured changes in overall cluster throughput and Time to First Token (TTFT). For example, when network bandwidth was increased from 100Gbps to 200Gbps, overall inference throughput improved by approximately 19%, while Time to First Token, or TTFT, decreased by approximately 22%. This indicates that, in LLM inference, network bandwidth has become one of the key factors constraining service performance. # 1. Network Congestion in Inference Today, AI clusters commonly use Clos, or Fat-Tree, architectures. The basic idea is to scale the network by stacking multiple layers of switches. However, the performance of Clos networks depends heavily on ideal load balancing across switches, which is difficult to achieve in practice due to routing policies and real traffic patterns. 
For example, in many two-tier Fat-Tree deployments, which consist of Spine and Leaf layers, traffic across Spine switches can become severely imbalanced. As a result, upper-layer applications often fail to obtain the expected network performance. To reduce the overhead of cross-layer forwarding, the industry often adopts ROFT (Rail-Optimized Fat-Tree) architectures [1]. As shown in Figure 3, ROFT groups GPUs by index ("rail"), and connects GPUs with the same index to the same Leaf switch, reducing the communication cost across Spine switches. ROFT works well for certain training traffic patterns. However, in Prefill-Decode disaggregated inference, we observed a more prominent issue: KV Cache transfers exhibit strong source-destination asymmetry. Different GPUs and different NICs carry highly uneven communication loads, as shown in Figure 4. As a result, ROFT’s rail mapping no longer naturally translates into load balancing. Instead, traffic can become concentrated on a small number of Leaf switches and links, leading to link congestion and degraded transfer performance. This manifests in several ways: - Some Leaf switches become persistent load hotspots, increasing the probability that multiple KV Cache transfer flows compete on the same links. As a result, actual transfer throughput can fall far below the NIC bandwidth capacity. - Certain egress queues on some Leaf switches remain at high depth for extended periods and frequently trigger PFC backpressure, as shown in Figure 5. - Link congestion further amplifies tail latency, affecting both TTFT and overall throughput. It is important to distinguish between the two types of network congestion, as illustrated in Figure 6: - Unavoidable congestion: For example, when multiple GPUs send data to the same destination at the same time, contention on the final-hop link is inevitable. - Avoidable congestion: This is caused by topology design, traffic mapping, or imbalanced multipath utilization. 
Fundamentally, it is an architecture-level design problem.

For the first type of congestion, we typically rely on congestion control, traffic shaping, and related mechanisms to mitigate its impact. For the second type, new network transport mechanisms such as adaptive routing [2], packet spraying [3,4], and MRC [5] can help. However, a more effective approach is to prevent network conflicts that should not occur in the first place through innovation at the network architecture layer. Prefill-Decode disaggregated inference is a typical example. If the network topology cannot match the traffic pattern, the system will repeatedly generate load hotspots and link conflicts. Solving this problem requires rethinking the inference network architecture itself.

# 2. ZCube Network Architecture

To address the above issues, we deployed a new ZCube network architecture [6]. ZCube breaks away from the traditional Clos design philosophy of hierarchical switch stacking and instead introduces a fully flattened GPU server interconnect. The ZCube routing strategy, designed specifically for the ZCube architecture, fully leverages the structural properties of the flattened topology. It can achieve near-ideal load balancing across all switches in the network, thereby significantly improving overall cluster network bandwidth. Compared with Clos, ZCube has a natural advantage in load balancing. This advantage benefits both training clusters and inference clusters. Importantly, ZCube achieves these performance gains while reducing switch and optical module costs by approximately one third compared with Clos. Based on current mainstream switch and NIC configurations, ZCube can support flattened networking for tens of thousands, or even hundreds of thousands, of GPUs.

## 2.1 ZCube Core Architecture

As shown in Figure 7, the core ideas of ZCube are:

1. Remove the Spine switch layer.
2. Divide Leaf switches into two groups of equal size, typically odd-numbered switches and even-numbered switches.
3. Establish a complete bipartite interconnect between the two switch groups.
4. Connect the two ports of each GPU NIC to the corresponding switches in the two groups using single-rail and multi-rail access patterns.

Suppose each GPU has a corresponding NIC with two ports, i.e., p=2. There are n GPUs in total, and GPUs and NICs share the same indices: 1, 2, …, n. Let k denote the number of GPUs connected to each switch. The total number of switches is 2n/k, numbered 1, 2, …, 2n/k. For GPU i, where 1 ≤ i ≤ n:

- The first port connects to the odd-numbered switch: ((i−1) mod (n/k)) × 2 + 1
- The second port connects to the even-numbered switch: ⌈i/k⌉ × 2

The two switch groups are connected as a complete bipartite graph: every odd-numbered switch connects to every even-numbered switch. A ZCube topology under dual-port NIC configuration, with p=2, n=32, and k=8, is shown in Figure 7.

## 2.2 Key Properties of ZCube

Network Diameter

ZCube has a network diameter of two switch hops, meaning any pair of GPUs can reach each other through two switches. This sits between a one-layer switch network, which has one switch hop but limited scale, and a conventional two-layer switch network, which supports a larger scale but typically requires three switch hops and incurs higher latency.

Load Balancing

First, the ZCube routing strategy ensures that each GPU pair has a unique optimal path, avoiding traffic conflicts caused by multipath route selection. Second, ZCube uses two complementary GPU-to-switch connection patterns. One switch group connects to GPUs in a single-rail pattern, where each switch connects to a contiguous range of GPU IDs. The other switch group connects to GPUs in a multi-rail pattern, where each switch connects to GPUs with the same relative index across groups.
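The two port-mapping formulas from Section 2.1 and the two access patterns just described can be checked with a short script. It is a minimal sketch that assumes n is divisible by k; the function names are made up for illustration, and `switch_path` returns just one valid path through the complete bipartite interconnect, not the authors' routing strategy.

```python
from collections import defaultdict


def zcube_switches(i: int, n: int, k: int) -> tuple[int, int]:
    """Return (odd_switch, even_switch) for GPU i (1-based), per Section 2.1."""
    odd = ((i - 1) % (n // k)) * 2 + 1
    even = ((i + k - 1) // k) * 2  # integer form of ceil(i / k) * 2
    return odd, even


def build_topology(n: int, k: int) -> dict[int, list[int]]:
    """Map each switch index to the list of GPUs attached to it."""
    attached = defaultdict(list)
    for gpu in range(1, n + 1):
        odd, even = zcube_switches(gpu, n, k)
        attached[odd].append(gpu)
        attached[even].append(gpu)
    return dict(attached)


def switch_path(src: int, dst: int, n: int, k: int) -> list[int]:
    """One valid switch path from src to dst (illustrative, not ZCube routing)."""
    src_odd, src_even = zcube_switches(src, n, k)
    dst_odd, dst_even = zcube_switches(dst, n, k)
    if src_odd == dst_odd:
        return [src_odd]
    if src_even == dst_even:
        return [src_even]
    # Every odd switch links to every even switch, so two hops always suffice.
    return [src_odd, dst_even]


topo = build_topology(n=32, k=8)
print(topo[1])  # [1, 5, 9, 13, 17, 21, 25, 29]: same relative index (multi-rail)
print(topo[2])  # [1, 2, 3, 4, 5, 6, 7, 8]: contiguous GPU IDs (single-rail)
print(switch_path(1, 32, n=32, k=8))  # [1, 8]
```

With n=32 and k=8 this yields 8 switches with 8 GPUs each, and no GPU pair needs more than two switch hops, which matches the stated network diameter.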
This design enables ZCube to achieve highly effective load balancing across the entire switch fabric under both typical AI training traffic patterns, such as AllReduce and All-to-All, and typical AI inference traffic patterns, where source-destination relationships are uncertain, and NIC loads can be highly imbalanced. As a result, ZCube can avoid the second type of network congestion described earlier at the architecture layer. As shown in Figure 8, traffic flows that would conflict under ROFT can obtain dedicated network paths under ZCube, thereby avoiding congestion. Scalability ZCube provides strong scalability while preserving its favorable performance characteristics. For example, using one layer of 51.2T switches, each with 128 × 400Gbps ports, ZCube can construct a network connecting 16,384 400Gbps NICs. If higher-capacity switches are used, or if the ZCube network is divided into more planes, the architecture can scale further to support interconnection among tens of thousands or even hundreds of thousands of GPUs. Cost At the same cluster scale, ZCube can reduce switch and optical module costs by approximately one third compared with traditional Clos / ROFT architectures. For example, in a 10,000-GPU AI cluster, ZCube can save roughly 210 million RMB to 640 million RMB in network hardware investment. These characteristics show that ZCube can achieve better load balancing and performance while requiring lower network hardware cost. ## 2.3 Real-World Cluster Testing: Boosting Inference Performance While Cutting Network Costs We upgraded the network architecture of a thousand-GPU cluster running GLM-5.1 coding inference services from the original ROFT to the ZCube architecture. 
Since the ZCube architecture eliminates the Spine-layer switches found in traditional Clos architectures, the legacy cabling patterns, IP addressing schemes, routing policies, and switch configuration methods established under the Clos framework could not be reused directly, necessitating a complete redesign tailored to ZCube. To tackle these challenges, the Harnets.AI Network Team designed a comprehensive network solution centered on the ZCube architecture. They developed a suite of automation tools, including the ZCube Controller, a data center layout design tool, and a cabling correctness verification program. This enabled capabilities such as data center deployment planning, cabling validation, automated configuration generation, and batch deployment, effectively resolving numerous hurdles in ZCube deployment. This suite of tools was the critical factor enabling the successful transformation of a large-scale production cluster within an exceptionally tight timeframe. Following the seamless network architecture migration, we conducted real-world testing on the ZCube architecture by running the GLM-5.1 coding inference services on this cluster. By comparing the cluster's inference performance before and after the upgrade, we found that ZCube boosted the average GPU inference throughput by over 15% compared to the ROFT architecture (as shown in Figure 9), while dropping the P99 tail latency of TTFT by 40.6%. In summary, for GPU and server hardware of the same scale and configuration, and without modifying any applications, upgrading the networking architecture to ZCube allowed us to not only save 1/3 of the optical modules and switch hardware, but also enable the cluster to serve 15% more inference requests per second. Against the current backdrop of exploding inference workloads and severe shortage of compute resources, this approach proves to be highly pragmatic and valuable. 
Currently, this ZCube cluster has been running stably for over two weeks, playing a vital role in powering the GLM-5.1 coding inference services. # 3. Conclusion LLM inference is moving from point-wise optimization toward system-level co-design. The coupling between the network and the inference engine is becoming increasingly tight, making networking a critical component of the inference system. The production deployment of ZCube shows that network architecture innovation can directly unlock the effective capacity of inference systems. By better aligning the network architecture with KV Cache transfers and PD traffic patterns, ZCube reduces the probability of topology-induced congestion at the source, improving throughput and latency while enhancing cluster cost efficiency. Looking ahead to next-generation LLM infrastructure, network design will evolve from general-purpose interconnects toward model-traffic-driven system co-design. Long-context inference, PD disaggregation, MoE, and integrated training-inference workloads are reshaping intra-cluster communication patterns, requiring network topology, communication libraries, and scheduling policies to be jointly optimized around real model traffic. Looking ahead, we will continue pioneering novel AI network architectures for larger-scale inference and training clusters ─ upgrading the network from a foundational GPU connection layer into a core driver of token generation efficiency, system resilience, and cost-effectiveness. # Acknowledgements ZCube was published at ACM SIGCOMM 2025, and was recognized as “significantly change the way we think about and understand networking.” This is the first large-scale deployment of the technology in a production inference cluster. We thank the Harnets.AI team for their professional support and close collaboration throughout this network architecture upgrade and optimization effort. ## Reference [1] NVIDIA. 2023. 
SuperPOD: Next Generation Scalable Infrastructure for AI Leadership. https://docs.nvidia.com/dgx-superpod-reference-architecture-dgx-h100.pdf
[2] NVIDIA. 2025. https://developer.nvidia.com/blog/accelerating-ai-storage-by-up-to-48-with-nvidia-spectrum-x-networking-platform-and-partners/
[3] Ultra Ethernet Consortium. Ultra Ethernet specification v1.0.1, 2025.
[4] Tommaso Bonato, Abdul Kabbani, Ahmad Ghalayini, Michael Papamichael, Mohammad Dohadwala, Lukas Gianinazzi, Mikhail Khalilov, Elias Achermann, Daniele De Sensi, and Torsten Hoefler. REPS: Recycled entropy packet spraying for adaptive load balancing and failure mitigation, 2026.
[5] Araujo, J., Chow, A., Handley, M., Lewis, R., Paasch, C., Padhye, J., … & Sur, S. (2026). Resilient AI Supercomputer Networking using MRC and SRv6. arXiv preprint arXiv:2605.04333.
[6] Yan, Z., Li, D., Chen, L., Xiong, D., Gao, K., Zhang, Y., … & Lin, H. (2025, September). From ATOP to ZCube: Automated topology optimization pipeline and a highly cost-effective network topology for large model training. In Proceedings of the ACM SIGCOMM 2025 Conference (pp. 861-881).
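The scalability figure quoted in Section 2.2 of the article above (16,384 400Gbps NICs from one layer of 128-port 51.2T switches) can be sanity-checked with a little arithmetic. The sketch below rests on two assumptions the article does not state: each 400Gbps switch port is split into two 200Gbps links to match a dual-port NIC, giving 256 logical ports per switch, and each odd/even switch pair is joined by exactly one link. Under those assumptions a switch spends k ports on GPUs and n/k ports on the other switch group.

```python
def max_gpus(logical_ports: int) -> tuple[int, int]:
    """Largest n with k + n/k <= logical_ports, returned as (n, k).

    Assumes k GPU-facing ports and n/k inter-group ports per switch,
    which is one reading of the topology, not a figure from the article.
    """
    best_n, best_k = 0, 0
    for k in range(1, logical_ports):
        peers = logical_ports - k  # ports left for the other group, i.e. n/k
        n = k * peers
        if n > best_n:
            best_n, best_k = n, k
    return best_n, best_k


print(max_gpus(256))  # (16384, 128): 128 x 400G ports split into 256 x 200G
print(max_gpus(128))  # (4096, 64): the same switch without port splitting
```

The maximum falls at k equal to half the port count, so capacity grows with the square of the switch radix. That fits the article's remark that higher-capacity switches or more planes let the design scale further.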

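Summary: As long-context inference and Prefill-Decode disaggregated deployment become mainstream, the GPU cluster network has gone from a secondary component to a key bottleneck for inference throughput, tail latency, and cost. Traditional static network topologies clash with the dynamic, asymmetric traffic pattern of KV Cache transfers, causing localized congestion. In response, Z.ai, Harnets.AI, and Tsinghua University jointly developed the ZCube network architecture. It uses a fully flattened topology with a hybrid access design that decouples and spreads traffic at the source to reduce congestion. In production testing on GLM-5.1, with GPUs and the software stack unchanged, ZCube cut switch and optical module cost by 33%, raised average inference throughput by 15%, and reduced TTFT P99 by 40.6%, showing that network architecture innovation can effectively unlock hardware potential.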

Chubby♨️@kimmonismus · 5月21日65

Holy: After Anthropic secured compute capacity from Colossus 1, it is now also getting access to compute from Colossus 2. But to be honest: I somehow expected this. Grok certainly doesn’t need all that compute.


MiniMax (official)@MiniMax_AI · 5月21日67

600+ new voices powered by MiniMax Speech 2.8 Turbo are now on Together AI @togethercompute 🎙️✨ Try it today: https://voicefinder.together.ai/minimax--speech-2.8-turbo


Rohan Paul@rohanpaul_ai · 5月21日67

Velobase just open-sourced Velobase Harness, an AI SaaS framework. It shows why the product was not the real moat: it's the infrastructure that converts users into revenue. The Velobase Harness is built around the missing layer between a working app and a paid business. It includes server-side ad attribution, usage-based credits, multi-currency billing, double-entry affiliate ledgers, refund clawbacks, USDT cashouts, A/B email campaigns, dual-provider failover, PostHog analytics, payments, and 11 BullMQ workers.

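Summary: Velobase has announced that it is open-sourcing its AI SaaS framework, Velobase Harness. The project stresses that in the era of AI applications the product itself is not the real moat; the infrastructure that turns users into revenue is what matters. Velobase's own path, from an app nobody used to eight-figure ARR, supports this view. The framework aims to fill the missing layer between a working app and a profitable business, providing a full set of backend services including payments and billing, user attribution, analytics, and A/B testing.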

Google AI Developers@googleaidevs · 5月21日72

Jump in and start building with @GoogleAIStudio. -- @AndroidDev app building -- @GoogleWorkspace integrations -- 1-click deployment to @Antigravity Here’s what’s new from Google I/O ↓


OpenRouter@OpenRouter · 5月21日70

TIP 💡 You don't have to worry about cache misses for the Auto Router (in addition to all individual models) OpenRouter will keep your session pinned to one model/provider until your cache expires


Berryxia.AI@berryxia · 5月20日68

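Folks, this PaddleOCR update is a good one. It directly makes up for the earlier models' shortcoming of not using LLM inference! This time PaddleOCR has fully connected to the Hugging Face ecosystem! PaddlePaddle just officially announced that PaddleOCR 3.5 supports Transformers as an inference backend. The PP-OCRv5 and PaddleOCR-VL 1.5 models can now run directly in the Hugging Face ecosystem. Previously, getting PaddleOCR into a RAG or Document AI project meant setting up your own serving stack and a lot of fiddling. The Hugging Face team took part in this collaboration directly. OCR tooling and the mainstream Transformer ecosystem have finally gone from two parallel lines to a single road. Blog here: https://huggingface.co/blog/PaddlePaddle/paddleocr-transformers This should make the output more accurate and reliable; otherwise you would still have to rely on an LLM to fill the gaps.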

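Summary: PaddleOCR 3.5 officially supports Transformers as an inference backend. After the update, the PP-OCRv5 and PaddleOCR-VL 1.5 models can run directly within the Hugging Face ecosystem, integrating seamlessly with the mainstream Transformer stack. This removes the earlier hassle of building a separate serving stack to bring OCR into RAG or Document AI projects, greatly lowers the barrier to development, and lets OCR fit more naturally into existing AI application workflows.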

Alibaba Cloud@alibaba_cloud · 5月20日65

Transform Agents into autonomous workers! 🚀 ❌ Open-source pain points: Low availability, high ops cost & poor observability. ✅ MSE AI Scheduler solves this with: • High-availability distributed scheduling • Unified management & fine-grained permissions • Elastic scaling to cut costs • Full-link observability Supports OpenClaw, Dify & more. Free public beta now open! 🔗 https://int.alibabacloud.com/m/1000413115/ #AI #Agent #MSE


Alibaba Cloud@alibaba_cloud · 5月20日60

Qwen Conference 2026: The Keynote Agenda AI-Native Cloud, Agent Native Cloud architectures, the Future of Inference, and Multimodal Visual drops. No fillers. Just engineering blueprints for global scale. Register: https://click.qwencloud.com/m/20000000190/


SiliconFlow@SiliconFlowAI · 5月20日52

Tired of juggling config files across CLIs? CC Switch lets you manage Claude Code, Gemini CLI, Codex, OpenCode, OpenClaw & Hermes Agent — all from one interface. And now SiliconFlow is a built-in preset provider. DeepSeek V4, GLM 5.1, Kimi K2.6, MiniMax M2.5, etc. One click to enable 🧵👇


Rohan Paul@rohanpaul_ai · 5月20日65

FT: Nvidia just turned its AI chip lead into a $90B financing machine that funds the companies buying, renting, building, and extending its own computing stack. The company committed $47B to investments and partnerships through Jan-25, then earmarked another $43B in the next 4 months, covering 145+ companies across model labs, cloud providers, chip designers, and suppliers. That creates a loop where startups get capital, suppliers expand capacity, cloud firms buy more Nvidia GPUs, and Nvidia becomes harder to replace inside the AI infrastructure layer. --- ft .com/content/c6b362b8-ab6b-4723-af48-28082bdfcac2?syn-25a6b1a6=1

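Summary: Nvidia has turned its lead in AI chips into a huge financing machine. Over the past several months the company has committed a total of $90 billion in investments and partnerships, covering more than 145 companies across model development, cloud computing, chip design, and the supply chain. The strategy creates a self-reinforcing loop: startups get capital, suppliers expand capacity, and cloud providers buy more Nvidia GPUs, further cementing Nvidia's hard-to-replace position at the core of the AI infrastructure layer.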

X.PIN@thexpin · May 20 · 58

JUST IN: Alibaba just dropped a 128-chip AI supernode at its 2026 Cloud Summit. We were lucky to see it in person.

Translation: Breaking: Alibaba just unveiled a 128-chip AI supernode at its 2026 Cloud Summit. We were lucky to see it in person.

meng shao@shao__meng · May 20 · 56

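Kimi K2.6 finally has a high-speed inference platform 👍🏻 This is the second time, after GPT 5.3 Codex Spark, that @cerebras has offered high-speed inference for a mainstream LLM. Since Kimi K2 on Groq, K2.5 and K2.6 have been strong models, but the official inference has been far too slow. After Groq was acquired by Nvidia, it stopped following up with inference for mainstream models. Now Cerebras has pushed Kimi K2.6, a model with more than 1T parameters, to ~1000 tokens/s. That speed feels great, go use it!!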

SemiAnalysis@SemiAnalysis_ · May 20 · 63

@FabricatedKnowledge answers whether the AI market is truly a bubble, or if we're just completely rewriting the global economy. @tbpn
Chapters
0:00 - Guest Intro & Cerebras IPO Breakdown
1:24 - Chip Architecture & Hardware Bottlenecks
5:29 - Groq LPUs, Nvidia, and the Inference Ecosystem
9:15 - The Big Foundry Play: AMD, Intel, and TSMC
11:53 - The ASIC Startup Landscape & Geopolitics
18:04 - The Domestic Data Center Infrastructure Crisis
21:33 - Space Data Centers & Sovereign AI Geography
25:31 - Market Hype, Bubble Math, and the Next Dot-Com
29:38 - Macroeconomics, Robotics, and Gross Token Production (GTP)
33:56 - Outro: Does SemiAnalysis Need an Arch Nemesis?

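Translation: This episode examines whether the AI market is a temporary bubble or a structural reshaping of the global economy. It analyzes bottlenecks and competition across the AI supply chain, covering chip architecture limits, the inference ecosystem contest represented by Nvidia and Groq LPUs, and the foundry contest among AMD, Intel, and TSMC. It also discusses the state of ASIC startups, the data center infrastructure crisis, space data centers, and sovereign AI, and uses macro trends, robotics, and metrics such as "Gross Token Production" to comment sharply on market hype and potential risks.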

Greg Brockman@gdb · May 20 · 67

we are offering discounted tokens and certainty on capacity availability in exchange for 1-3 year commits. we expect that the world will feel increasingly capacity constrained for the next while, as models continue to get much more useful.

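Translation: OpenAI has launched "Guaranteed Capacity", a compute guarantee offering for enterprise customers. It offers discounted token pricing and certainty on future capacity to encourage customers to sign 1-3 year commitments. OpenAI expects that, as AI models keep becoming more useful, compute will stay tight worldwide for some time. The offering is meant to help customers plan ahead for critical workloads and get stable, reliable access to the compute they need, in line with OpenAI's long-term investment in infrastructure and capacity planning.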

Sam Altman@sama · May 20 · 62

customers are increasingly asking us for certainty on capacity. as models get better, we expect that the world will be capacity-constrained for some time. we are offering discounted tokens for 1-3 year commits. (it also helps us plan, so hopefully a big win-win.)

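Translation: OpenAI has launched Guaranteed Capacity, which lets customers lock in future compute access through long-term commitments. The move responds to the global compute shortage driven by steadily improving AI models. The company says it has made long-term investments through infrastructure and partnerships and is offering discounted tokens for 1-3 year commitments, encouraging customers to plan ahead and giving critical workloads certainty in a compute-constrained world. The offering is intended as a win-win for customers and OpenAI.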

OpenAI@OpenAI · May 20 · 67

Introducing OpenAI Guaranteed Capacity: a new offering that enables customers to guarantee long-term access to OpenAI compute. We’ve made long-term investments in infrastructure, partnerships, and capacity planning to help customers scale reliably. Now, Guaranteed Capacity helps customers plan ahead for critical workloads in a compute-constrained world. http://openai.com/guaranteed-capacity

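Translation: Introducing OpenAI Guaranteed Capacity: a new offering that lets customers guarantee long-term access to OpenAI compute. We have made long-term investments in infrastructure, partnerships, and capacity planning to help customers scale reliably. Now, Guaranteed Capacity helps customers plan ahead for critical workloads in a compute-constrained environment. http://openai.com/guaranteed-capacity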

SemiAnalysis@SemiAnalysis_ · May 20 · 56

AMD ALERT 🚀 MI355 is now 40% cheaper than B200 on the GLM5 architecture for single-node FP8 serving, 14 weeks after the initial launch of GLM5, on both non-MTP and MTP with spec decode, using SGLang v0.12 for both CUDA and ROCm. SPEED IS THE MOAT!! Great work to @AnushElangovan, @roaner, HaiShaw & his team! The next step is for MI355X to catch up to CUDA when composing production inference optimizations like FP4, and on distributed inference, where you can gang up MI355 boxes so that per-GPU performance goes up and the cost per million tokens goes down.

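Translation: The latest benchmarks show that, on the GLM5 architecture, single-node FP8 inference on AMD MI355 costs about 40% less than on NVIDIA B200. The result builds on SGLang v0.12 optimizations for both CUDA and ROCm across non-MTP, MTP, and speculative decoding, and the team sees speed as the key to building a moat. The next focus is for MI355X to catch up with CUDA in production-grade inference optimizations (such as FP4) and in distributed inference, where multi-GPU coordination raises per-GPU efficiency and further lowers the cost per million tokens.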
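The link the post draws between per-GPU throughput and cost per million tokens can be sketched as simple arithmetic. This is a minimal illustration, not SemiAnalysis's benchmark methodology: the function names, the hourly GPU price, and the throughput figures are all hypothetical placeholders.

```python
def cost_per_million_tokens(gpu_hour_cost_usd: float,
                            tokens_per_sec_per_gpu: float) -> float:
    """Cost in USD to generate one million tokens on a single GPU.

    Both inputs are assumptions supplied by the caller; nothing here
    comes from a real benchmark.
    """
    tokens_per_hour = tokens_per_sec_per_gpu * 3600.0
    return gpu_hour_cost_usd / tokens_per_hour * 1_000_000


def relative_saving(cost_a: float, cost_b: float) -> float:
    """Fraction by which option A is cheaper than option B."""
    return (cost_b - cost_a) / cost_b


# Hypothetical numbers: same hourly price, 25% higher per-GPU throughput.
baseline = cost_per_million_tokens(2.0, 1000.0)
faster = cost_per_million_tokens(2.0, 1250.0)
saving = relative_saving(faster, baseline)
print(f"baseline ${baseline:.4f}/M, faster ${faster:.4f}/M, saving {saving:.0%}")
```

At a fixed hourly price, any gain in per-GPU tokens per second lowers the cost per million tokens in proportion, which is why the post treats distributed-inference throughput as the next lever.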

Chubby♨️@kimmonismus · May 20 · 43

This is insane. Tokens processed at insane scale!

Translation: This is insane. The number of tokens processed has reached an astonishing scale!

Berryxia.AI@berryxia · May 20 · 78

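Folks, NVIDIA researcher Yukang Chen has just open-sourced LongLive 2.0! It is the world's first end-to-end infrastructure for very long video generation with 4-bit support, covering the full training and inference pipeline. Core techniques: FP4 quantization plus parallel acceleration, reaching 45.7 FPS on a 5B model! It also supports real-video training, few-step distillation, multi-shot training/inference, sequence parallelism, NVFP4 KV cache, and asynchronous VAE decoding for deployment... a full set of efficiency techniques in one package. Long video generation used to be either painfully slow or limited to short clips; now NVIDIA is bringing real-time 4-bit long video generation to open source. Code is in the comments 👇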

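Translation: An NVIDIA researcher has open-sourced LongLive 2.0, the first end-to-end long video generation infrastructure that supports 4-bit quantization and covers the full training and inference pipeline. Its core techniques include FP4 quantization and parallel acceleration, reaching a generation speed of 45.7 FPS on a 5B model. The framework supports real-video training, distillation, multi-shot generation, sequence parallelism, KV cache optimization, and asynchronous decoding for deployment, and aims to remove the earlier bottleneck of long video generation being slow or limited to short clips.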

凡人小北@frxiaobei · May 20 · 57

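Once one person is flying with AI, collaborating with the rest of the team actually gets harder. AI at the organizational level is not a scaled-up version of individual productivity gains; it is a different problem: how to embed AI into the collaboration structure, rather than only making one role faster. Very few teams are working on the latter, and I saw it in this project from @yucheng.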

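Translation: The core problem is that after AI sharply raises individual efficiency, team collaboration may become harder. Organizational AI adoption is not a simple amplification of individual gains; it requires working out how to embed AI deeply into existing collaboration structures and break down information silos. Few teams currently focus on this. Lucius AI is trying to address this pain point by building an organization's "context layer", aiming to reduce the more than 30% of team time wasted on rebuilding context for decisions already made, and so to bridge the gap between individual efficiency and organizational coordination.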

Chubby♨️@kimmonismus · May 19 · 56

Every major pharma company announced an AI partnership in the last three years. Most of them changed nothing. The press release went out, the stock moved, and the actual R&D pipeline stayed exactly the same. Edison Scientific just deployed Kosmos across Incyte's full pipeline. Not a pilot. Not a proof of concept. A production system that reads 1,500 papers and writes 42,000 lines of code in a single run! The difference between this and everything that came before it: receipts. 79% reproducibility. Every conclusion traceable to a specific paper or line of code. This is the first time I'm looking at an AI deployment in pharma and thinking: this actually changes how drugs get made. Not in theory, or in a pitch deck - but in production, right now, across an entire R&D pipeline. If AI can compress six months of scientific work into a single day with this level of traceability, the implications for how fast treatments reach patients are massive.

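Translation: Over the past three years, most AI partnerships at major pharma companies stayed at the PR level without materially changing R&D. The Kosmos system that Edison Scientific deployed for Incyte is an exception: as a production tool, a single run can process 1,500 papers and generate 42,000 lines of code, with 79% reproducibility and every conclusion traceable. This marks AI moving from theoretical demos into an actual R&D pipeline for the first time, compressing research cycles and potentially speeding up the arrival of new drugs.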

DogeDesigner@cb_doge · May 19 · 42

Elon Musk on data centers in space: "Data centers in space is much easier than people may think. SpaceX, at this point, has 10,000 satellites in orbit right now, and in the future, with Starship, we'll be launching over 10,000 communication satellites per year, each one of which is much more capable than our current satellites, so you can expect 100 times more communications capability than currently exists from space, but that will pale in comparison to the tonnage of AI satellites, so I mean it's always helpful to use the physics tools of thinking in the limit."

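Translation: Based on SpaceX's 10,000 satellites currently in orbit and its plan to use Starship to launch more than 10,000 more capable communication satellites per year, Elon Musk expects space-based communications capability to grow 100-fold. He notes that even this growth will pale next to the tonnage of future AI satellites. Reasoning with the physics approach of thinking in the limit, he concludes that building data centers in space is far more feasible than commonly assumed.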

ClaudeDevs@ClaudeDevs · May 19 · 60

We’ve added two security improvements to Claude Managed Agents. Self-hosted sandboxes keep the agent’s execution environment in your infrastructure or with a managed sandbox provider. MCP tunnels let the agent connect to services inside your security perimeter.

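Translation: We've added two security improvements to Claude Managed Agents. Self-hosted sandboxes keep the agent's execution environment in your infrastructure or with a managed sandbox provider. MCP tunnels let the agent connect to services inside your security perimeter.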

向阳乔木@vista8 · May 19 · 56

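I've found Nutstore (坚果云) sync quite convenient. Use cases: 1. Sync the local .agents directory to another computer, so Skills sync too after you tune them. 2. Sync Obsidian data, so there's no need to buy the official service. 3. Nutstore supports WebDAV, so the various APIs configured in CC Switch work seamlessly. You can also share folders or files with others at any time. Individual users pay 199 yuan a year, which looks cheap next to AI subscription plans, haha.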

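Translation: The user shares three practical uses of Nutstore (坚果云) in a digital workflow: syncing the local .agents directory to other computers to keep Skill configuration consistent; serving as an Obsidian sync tool in place of the official service; and, through its WebDAV support, working seamlessly with the various APIs configured in CC Switch. Nutstore also supports sharing folders or files with others at any time. The individual plan costs 199 yuan a year, a clear price advantage over AI subscription plans.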

歸藏(guizang.ai)@op7418 · May 19 · 59

The mobile version of AI Studio is about to launch, and you can now pre-register on Google Play. Looks like we finally won't have to put up with Gemini.

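Translation: The Google AI Studio mobile app is now open for pre-registration on Google Play, giving developers a new mobile AI development tool. The app aims to let users capture ideas and create anywhere, without being tied to the desktop. Its core feature lets users build custom tools, games, or apps from natural-language descriptions, turning ideas directly into reality. It is seen as a complement or alternative to existing mobile AI experiences such as Gemini, and a new convenience for users who need to prototype or build quickly on mobile.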

Huawei Cloud@HuaweiCloud1 · May 19 · 46

Ninja Van is reshaping Southeast Asia’s logistics sector by partnering with Huawei Cloud to deploy cloud-native and AI technologies. Its operations are now smarter, faster, and more scalable, delivering 60% higher resource utilization and 30% lower infrastructure costs. Discover how #HuaweiCloud helps #NinjaVan scale smarter, faster #Logistics across Southeast Asia: https://tinyurl.com/muspx766 #Huawei #HuaweiCloudAPAC #AI #DigitalTransformation

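Translation: Ninja Van is reshaping Southeast Asia's logistics sector by partnering with Huawei Cloud to deploy cloud-native and AI technologies. Its operations are now smarter, faster, and more scalable, with 60% higher resource utilization and 30% lower infrastructure costs. See how #HuaweiCloud helps #NinjaVan scale smarter, faster #Logistics across Southeast Asia: https://tinyurl.com/muspx766 #Huawei #HuaweiCloudAPAC #AI #DigitalTransformation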

🚨 AI News | TestingCatalog@testingcatalog · May 19 · 72

Anthropic announced self-hosted sandboxes and MCP tunnels for Claude Managed Agents during its "Code with Claude" event in London. > With self-hosted sandboxes, you keep sensitive files, packages, and services in your own infrastructure or with a managed sandbox provider. > With MCP tunnels, your agents reach MCP servers inside your private network without exposing them to the public internet.

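Translation: At its "Code with Claude" event in London, Anthropic announced two new features for Claude Managed Agents: self-hosted sandboxes (public beta) and MCP tunnels (research preview). Self-hosted sandboxes let users run agents in their own infrastructure or with a managed sandbox provider, keeping sensitive files, packages, and services in a private environment. MCP tunnels let agents securely reach MCP servers inside the user's private network without exposing them to the public internet. Together, the two features let agents work inside the user's own security perimeter with the user's security policies applied by default, improving privacy protection and operational flexibility.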

Rohan Paul@rohanpaul_ai · May 19 · 52

BoozAllen CEO Horacio Rozanski: "2026 is a highly complicated year at the intersection of cyber and AI, because AI as an attack vector" AI can breach networks in minutes, far faster than the 2-week CISA standard for patching. Defense is lagging.

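Translation: BoozAllen CEO Horacio Rozanski: "2026 will be a highly complicated year at the intersection of cyber and AI, because AI as an attack vector" AI can breach networks in minutes, far faster than the two-week CISA standard for patching. Defense is lagging badly.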