What’s new with MiMo-V2.5 series inference? We just published a blog on our full pipeline inference optimizations for MiMo-V2.5 series, including how we pushed hybrid SWA efficiency to the limit. Read the full blog here: https://mimo.xiaomi.com/blog/mimo-v2-5-inference

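Summary: What's new in MiMo-V2.5 series inference? We just published a blog post detailing full-pipeline inference optimizations for the MiMo-V2.5 series, including how we pushed hybrid SWA efficiency to the limit. Read the full post: https://mimo.xiaomi.com/blog/mimo-v2-5-inference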

Fuli Luo @_LuoFuli · May 30 · 63

Inference Optimizations Behind the MiMo-V2.5 Series API Price Reductions
Read the full technical blog: https://mimo.xiaomi.com/blog/mimo-v2-5-inference
The V2.5 model family, including MiMo-V2.5 and MiMo-V2.5-Pro, is built on a Hybrid Sliding Window Attention (Hybrid SWA) architecture, which compresses KVCache storage to roughly 1/7 that of Full Attention. However, architectural advantages rarely translate directly into measurable gains in production serving. To realize these gains, we redesigned KVCache management, tiered caching, and the prefix-cache tree; addressed key challenges in SWA KVCache handling; and optimized scheduling as well as the Prefill/Decode pipeline. Validated on real production traffic, these optimizations have increased effective KVCache capacity by nearly 5x, with server-side cache hit rates averaging 93%–95% across mainstream harness frameworks. Together with MoE configuration tuning and multimodal inference optimizations, they enable more efficient long-context inference and form part of what makes the recent API price cuts possible.

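Summary: The MiMo-V2.5 model family (MiMo-V2.5 and MiMo-V2.5-Pro) uses a Hybrid Sliding Window Attention (Hybrid SWA) architecture that compresses KVCache storage to roughly 1/7 that of full attention. To turn this architectural advantage into real gains, the team redesigned KVCache management, tiered caching, and the prefix-cache tree, and optimized SWA KVCache handling, scheduling, and the Prefill/Decode pipeline. Validated on real production traffic, these optimizations raised effective KVCache capacity by nearly 5x, with server-side cache hit rates of 93%-95% across mainstream frameworks. Combined with MoE configuration tuning and multimodal inference optimizations, they improve long-context inference efficiency and underpin the recent API price cuts.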
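To see why a hybrid SWA layout shrinks KVCache, here is a minimal sketch that counts cached token slots per sequence. The layer counts, window size, and context length are hypothetical placeholders, not MiMo-V2.5's actual configuration, and the function names are made up for illustration.

```python
def kv_cache_slots(seq_len: int, n_full: int, n_swa: int, window: int) -> int:
    """Token slots held in the KV cache, summed over layers, for one sequence."""
    full_slots = n_full * seq_len              # full-attention layers keep every token
    swa_slots = n_swa * min(seq_len, window)   # SWA layers keep only the last `window` tokens
    return full_slots + swa_slots


def hybrid_to_full_ratio(seq_len: int, n_full: int, n_swa: int, window: int) -> float:
    """KV cache of the hybrid model relative to an all-full-attention model."""
    all_full = kv_cache_slots(seq_len, n_full + n_swa, 0, window)
    return kv_cache_slots(seq_len, n_full, n_swa, window) / all_full


# Hypothetical config: 8 full layers, 40 SWA layers, window 128, 32k-token context.
ratio = hybrid_to_full_ratio(32_768, 8, 40, 128)
print(f"hybrid / full KV cache: {ratio:.3f}")
```

As the context grows, the ratio in this toy model approaches `n_full / (n_full + n_swa)`, so the saving is set mostly by how many layers use the sliding window.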

Rohan Paul @rohanpaul_ai · May 30 · 69

Amazon unveiled "Resilient Network Graphs" (RNG), a data center network that reduces hardware needs by 69% and raises throughput by 33%. Amazon revealed that it has been quietly deploying the design across its data centers since last year, and it is now the default data center network for most AWS workloads. It replaces tree-shaped data center networks with flatter random ones that waste less capacity. For decades, fat-tree networks worked because they were predictable and easy to run, but their layered hierarchy can concentrate traffic at a few choke points while other links sit underused. RNG fixes this by connecting routers in a flat quasi-random graph, so many small independent paths between servers replace a few privileged routes through upper layers. Its routing system, Spraypoint, spreads traffic across many separate paths, while its ShuffleBox cabling device makes the random-looking wiring practical to build and expand. Instead of asking every packet to chase the shortest path, Spraypoint fans traffic outward and then guides it back through distributed waypoints, creating many edge-disjoint paths without requiring exotic switch memory. The authors tested RNG in 2 real Amazon production fabrics and compared it with fat-tree networks using transport and storage workloads. The main result is that RNG matched fat-tree application performance, found far more separate paths than common routing methods, and was estimated to cost 9% to 45% less. The hard part is not the idea, but the engineering, because routing in a random mesh needs smarter path selection and the physical system must manage millions of fiber connections without becoming impossible to operate. This is important for AI clusters because training traffic is huge, synchronized, and sensitive to congestion, so a network that spreads load better can make expensive GPUs spend less time waiting.
----
Link: arxiv.org/abs/2604.15261
Title: "RNG: Flat Datacenter Networks at Scale"

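Summary: Amazon introduced a new data center network architecture called "Resilient Network Graphs" (RNG). The design replaces the traditional tree-shaped network with a flat quasi-random graph and spreads traffic across many independent paths using the Spraypoint routing system and the ShuffleBox cabling device. In tests, RNG matched traditional fat-tree performance while cutting hardware needs by 69%, raising throughput by 33%, and lowering estimated cost by 9% to 45%. It is now the default network for most AWS workloads, and its ability to spread load helps AI cluster training efficiency.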
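The "fan out, then steer back through a waypoint" idea can be illustrated with a toy model. The sketch below is not Spraypoint: it swaps in classic Valiant-style routing (shortest path to a random waypoint, then shortest path to the destination) on a small random graph. The graph builder, router count, and link count are all hypothetical.

```python
import random
from collections import deque


def random_flat_graph(n, extra_links, rng):
    """Ring (guarantees connectivity) plus random chords between routers."""
    adj = {v: {(v - 1) % n, (v + 1) % n} for v in range(n)}
    for _ in range(extra_links):
        a, b = rng.sample(range(n), 2)
        adj[a].add(b)
        adj[b].add(a)
    return adj


def shortest_path(adj, src, dst):
    """Plain BFS shortest path, returned as a list of routers."""
    parent = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            break
        for nxt in adj[node]:
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    path, node = [], dst
    while node is not None:
        path.append(node)
        node = parent[node]
    return path[::-1]


def waypoint_route(adj, src, dst, rng):
    """Route via a random intermediate router instead of the direct shortest path."""
    waypoint = rng.choice([v for v in adj if v not in (src, dst)])
    return shortest_path(adj, src, waypoint) + shortest_path(adj, waypoint, dst)[1:]


rng = random.Random(0)
fabric = random_flat_graph(32, 48, rng)
routes = [waypoint_route(fabric, 0, 17, rng) for _ in range(5)]
for route in routes:
    print(route)
```

Different waypoints send the same source-destination pair over different links, which is the load-spreading effect the post describes; a real design would also need to keep paths loop-free and edge-disjoint, which this sketch does not attempt.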

Rohan Paul @rohanpaul_ai · May 30 · 47

Japan’s AI data center boom is pushing companies toward liquid cooling, because hot GPU racks are now outgrowing the limits of air-conditioned server rooms. Cooling already uses 30% to 40% of data center electricity, and GPU heat has more than doubled in 5 years, so Japan’s Fuji Electric, Nidec, Mitsubishi Heavy, and others are chasing systems that move heat through liquid instead of air. The weak point of normal air cooling is that air carries heat poorly, so the system needs a lot of fan power, large airflow paths, cold aisles, hot aisles, and big chillers to keep the room temperature under control. Liquid cooling changes the target: instead of trying to cool the whole room first, it puts a cold metal plate directly on the GPU or CPU. Cold liquid flows through tiny channels inside that plate, the chip’s heat passes into the plate, the plate passes it into the liquid, and the warmed liquid is pumped away. The big difference is heat density: a powerful AI rack can produce so much heat in such a small space that blowing more air becomes noisy, power-hungry, and physically limited. Liquid can carry much more heat through a much smaller path, so it can remove heat from AI GPUs faster, with less fan work, less room cooling, and more stable chip temperatures. The main downside is that liquid systems cost more to install, need leak-safe connectors, and must be designed into the server rack instead of added casually later.

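Summary: Japan's AI data center boom is pushing companies from traditional air cooling toward liquid cooling, mainly because the cooling demands of AI GPU racks have surged. Cooling already accounts for 30% to 40% of data center electricity, and GPU heat output has more than doubled in 5 years. Because air carries heat poorly, conventional air cooling runs into noise, high energy use, and physical space limits. Liquid cooling presses a metal cold plate directly against the chip and carries heat away through liquid channels, removing heat more efficiently and keeping chip temperatures more stable. Its main challenges are higher installation cost and the need for purpose-designed server racks. Japan's Fuji Electric, Nidec, Mitsubishi Heavy, and others are actively developing such systems.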
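The claim that liquid carries far more heat through a smaller path follows from the heat balance Q = ρ · V̇ · c_p · ΔT. The sketch below compares the volumetric flow of air and water needed for the same load. The fluid properties are approximate room-temperature textbook values; the 100 kW rack and 10 K coolant temperature rise are hypothetical numbers chosen only for illustration.

```python
# Approximate room-temperature properties (textbook values).
AIR = {"density": 1.2, "cp": 1005.0}      # kg/m^3, J/(kg*K)
WATER = {"density": 997.0, "cp": 4186.0}  # kg/m^3, J/(kg*K)


def volumetric_flow_m3_per_s(heat_w, delta_t_k, fluid):
    """Flow needed to carry heat_w watts with a delta_t_k temperature rise."""
    return heat_w / (fluid["density"] * fluid["cp"] * delta_t_k)


rack_w = 100_000.0  # hypothetical 100 kW AI rack
delta_t = 10.0      # hypothetical 10 K coolant temperature rise

air_flow = volumetric_flow_m3_per_s(rack_w, delta_t, AIR)
water_flow = volumetric_flow_m3_per_s(rack_w, delta_t, WATER)

print(f"air:   {air_flow:.2f} m^3/s")
print(f"water: {water_flow * 1000:.2f} L/s")
print(f"air needs about {air_flow / water_flow:.0f}x the volume flow of water")
```

Under these assumptions the gap is three orders of magnitude, which is why dense racks run out of room for airflow long before a liquid loop runs out of pipe.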

SemiAnalysis @SemiAnalysis_ · May 30 · 67

TRUTH SOCIAL: NVLink multicast is not supported on Blackwell "Confidential Computing", leading to a 61% performance regression on SGLang Qwen3.5 397B according to @verdacloud's recent GitHub ticket. NVIDIA's "Confidential Computing" is complete slop: in addition, Hopper's confidential computing had fully unencrypted NVLink, according to NVIDIA's own "NVIDIA Secure AI with Blackwell and Hopper GPUs" whitepaper.

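Summary: TRUTH SOCIAL: According to a recent GitHub ticket from @verdacloud, NVLink multicast is not supported under Blackwell "Confidential Computing", causing a 61% performance drop on SGLang Qwen3.5 397B. NVIDIA's "Confidential Computing" is complete garbage; in addition, according to NVIDIA's own "NVIDIA Secure AI with Blackwell and Hopper GPUs" whitepaper, Hopper's confidential computing also had fully unencrypted NVLink.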

SemiAnalysis @SemiAnalysis_ · May 30 · 56

One of the data points we keep flagging from our power-crisis research, because it captures the entire mismatch between what AI operators want to build and what grids can actually approve, is the gap between datacenter interconnect requests in ERCOT and what the grid operator is willing to underwrite. (1/4) 🧵

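Summary: One data point we keep tracking in our power-crisis research, because it captures the entire mismatch between what AI operators want to build and what the grid can actually approve, is the gap between data center interconnect requests in ERCOT and the capacity the grid operator is willing to support. (1/4) 🧵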

Rohan Paul @rohanpaul_ai · May 30 · 76

I had to test it myself to believe this unreal inference speed. 3,000 tokens/s for 1 user on standard datacenter GPUs. They leveraged a hidden efficiency gap in how GPUs generate tokens. @Kog__AI just achieved 3,000 tokens/s on 8× AMD MI300X GPUs and 2,100 on 8× NVIDIA H200 (FP16, no speculative decoding). Their tech preview is on a 2B model, and they show how their techniques will scale to large frontier MoE models at similar speeds. That's a huge number because normal low-batch GPU decoding for 2B to 8B models is usually closer to 100 to 300 tokens/s per request, so Kog is claiming something like a 10X to 30X jump in the speed one user actually feels. Their trick: they are getting the speed by treating LLM decoding as a memory streaming problem, not mainly a math problem. For 1 user at batch size 1, the GPU is not doing big, efficient matrix-matrix work like in training or large-batch serving; it is repeatedly pulling the model’s active weights from high-bandwidth memory for each new token, so speed depends on how smoothly those weights keep flowing. Normal inference stacks keep breaking that flow. They run many separate GPU programs for different parts of the model, move intermediate results through memory, wait at synchronization points, talk back to the CPU for scheduling or sampling, and then repeat this token after token. Kog’s answer is to co-design 3 things that are usually tuned separately: the runtime, the low-level GPU code, and the model architecture. The biggest engineering move is the monokernel, where the whole decode pass runs as 1 persistent GPU-resident program, including sampling, so the system does not keep stopping for kernel launches, CPU scheduling, and intermediate memory round trips. They also rebuilt synchronization, because their own measurements say grid sync was eating around 35% of token-generation time; instead of making every compute unit wait at a broad barrier, each unit waits only for the exact data it needs. 
On AMD MI300X, they also map memory access around the chiplet layout, because memory latency changes depending on which die makes the request. Then their Laneformer model uses Delayed Tensor Parallelism, which lets cross-GPU communication happen in the background instead of blocking every layer.

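Summary: The Kog team achieved very high single-user inference speed on standard data center GPUs: 3,000 tokens/s on 8× AMD MI300X GPUs and 2,100 tokens/s on 8× NVIDIA H200. Compared with typical inference speeds (roughly 100-300 tokens/s), that is a 10-30x improvement. The core idea is to treat LLM decoding as a memory-streaming problem and, by co-designing a monokernel, rebuilt synchronization, topology-aware memory access mapping, and the Laneformer model architecture with Delayed Tensor Parallelism, remove the blocking points of the conventional pipeline.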
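The memory-streaming framing suggests a simple ceiling: at batch size 1, each new token has to stream the active weights from HBM once, so tokens/s cannot exceed aggregate bandwidth divided by weight bytes. The sketch below computes that bound. The bandwidth figures are approximate published peak specs treated here as assumptions, the helper name is made up, and the bound ignores KV cache reads, inter-GPU communication, and synchronization.

```python
def decode_tokens_per_s_bound(active_params, bytes_per_param, hbm_bytes_per_s, n_gpus=1):
    """Upper bound if every token must stream all active weights once from HBM."""
    weight_bytes = active_params * bytes_per_param
    # With ideal tensor parallelism each GPU streams 1/n_gpus of the weights.
    return n_gpus * hbm_bytes_per_s / weight_bytes


# 2B parameters in FP16 (2 bytes each); peak HBM bandwidth per GPU is an assumed spec value.
systems = {
    "8x MI300X": {"bw": 5.3e12, "reported": 3000},
    "8x H200": {"bw": 4.8e12, "reported": 2100},
}
for name, s in systems.items():
    bound = decode_tokens_per_s_bound(2e9, 2, s["bw"], n_gpus=8)
    print(f"{name}: bound {bound:.0f} tok/s, reported {s['reported']} "
          f"({s['reported'] / bound:.0%} of bound)")
```

Read this way, the reported numbers are a sizable fraction of what the memory system could deliver in theory, while a typical 100 to 300 tokens/s stack leaves most of that bandwidth idle.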

AYi @AYi_AInotes · May 30 · 67

http://x.com/i/article/2060387880300646400

# AI didn't make orgs faster. It just exposed that orgs never had memory

AI didn't make your organization faster. It just exposed that your organization never had a memory to begin with. I've been chewing on this for a year. Here's the part nobody wants to say out loud 🧵

Honestly, I've been chewing on this question for the better part of a year. I started paying attention to AI back in 2023, which makes it three years now. And I'm a decent sample of one: I run my account solo, I write solo, I do my own ops. AI tools genuinely turned me into a one-person quasi-team. My output is more than 10x what it used to be.

But over the last six months, I've been watching friends who actually have teams, and I keep noticing the same off-kilter pattern. One sentence: individuals are flying, organizations are crumbling. Everyone is on ChatGPT, Claude, Gemini, Cursor. Everyone says they're 10x faster. And yet, when you add up the whole team, output is slower than it was two years ago. Something is clearly wrong here.

I've been trying to figure out where it actually breaks. The MIT Sloan 2026 AI Adoption report that dropped a couple of days ago gave me the most direct answer I've seen.

1. The 95% Number Hits Harder Than You'd Think

There's one stat in that report: 95% of enterprise AI investments produce no measurable business return. Honestly, that one stopped me cold. Not 50%. Not 70%. Ninety-five percent. Meaning: out of 100 companies that spent the money, bought the tools, and trained the staff, 95 of them can't show you a single number you could put in an earnings report.

Your first instinct might be: maybe they're using it wrong? Maybe the models still aren't good enough? I turned it over in my head for a long time. Neither answer holds up.

The real bottleneck is something else, and it's buried in another stat from the report that most people skipped right past: more than 30% of team time is spent rebuilding context that someone else on the team already had.

What does that look like? Let me sketch a scene and see if any of it feels familiar: A decision got made three months ago. Today's retrospective rolls around, and nobody can find the original discussion thread. A product question gets asked in the user chat 20 times a day, and every ops person has to copy-paste the same answer from scratch. A new hire spends their first month scraping together fragments from Feishu, WeChat Work, email, Yuque, and half a dozen other apps, just trying to piece together "how does this company actually work?"

There it is. That's the truth. AI didn't make organizations faster, because organizations never had memory in the first place. AI just turned up the volume on that fact.

2. Why Individual Upside Doesn't Roll Up to the Organization

I've started calling this the "AI Productivity Paradox." The mechanism behind it is roughly this: AI tools are personal exoskeletons strapped onto individuals. I write code in Cursor, draft articles in Claude, do research in NotebookLM, and all the memory those tools accumulate lives on my laptop, under my account.

The day I leave the company, that memory walks out with me. The day I get promoted to a different role, that memory resets to zero. The day I try to collaborate with a colleague, that memory just doesn't transfer.

Which is exactly why individual productivity gains don't compound at the organizational level. Every employee is an island. Every island has a little factory on it. But there are no bridges between the islands.

This is also why, at the closed-door Sequoia AI Ascent summit a few days ago (150 top founders, six hours of conversation), the room landed on a new definition for 2026: "the commercial year zero of long-horizon agents."

Sequoia partner Pat Grady said something that's been stuck in my head for days:

> The next round of AI doesn't sell tools, it sells outcomes.

Sounds like a comment about supply, but the more I sat with it, the more I think he's actually describing the demand side: Customers don't want tools anymore, because tools get installed on individuals, and individuals don't move org-level metrics. Ten ChatGPT seats don't help me. What I actually want is for every conversation, every decision, every piece of feedback inside my company, from yesterday to today, to be captured, searchable, and reusable.

Once you start thinking this way, the problem clicks into place: No matter how smart an agent is, if it doesn't know what your organization is thinking, it's just a smart fool. It can write perfect copy, but not the one sentence that captures your brand voice. It can answer every generic question, but not "did we actually ship the fix for that bug last week?" It can hand you a polished market analysis, but it doesn't know you killed that exact direction three months ago.

OK, I'm wandering. What I'm trying to say is: the problem was never the model. The problem is that the organization never gave the model a place to learn.

3. A Few Products Are Trying, But None of Them Is the Savior

Let me be honest about something here: There are already some products taking a swing at this space. But frankly, none of them have solved the whole problem.

The one I've been watching most recently is Lucius: they just closed a $3M seed round two days ago, led by the Future Capital Discovery Fund. This is the third startup from founder Zhao He, and his first two both died on the same rock: users won't even write the documentation. His angle this time is interesting: if people refuse to write the docs, let the AI sit there and listen, learn, and capture them on its own.

How does it actually work? Their loop looks roughly like this: A user asks something in the community chat → the AI tries to answer with what it already knows → if it can't, it auto-creates a task for the ops team → ops answers → the AI captures the answer, structures it, and files it into the knowledge base → next time someone asks the same thing, the AI handles it. No prompts to write. No rules to configure. It's like a new intern who quietly sits in the chat, listens, and slowly figures things out.

The early-user numbers: community self-resolution rate went from 29% to 88%, and ops time spent on repeat answers dropped from 3 hours a day to 20 minutes.

But here's my cold water: it can't handle complex consultations from high-value customers, it can't generate or execute code, and at its core it's still a "load-shedder for high-frequency, repetitive scenarios." What it really does is carve out the most time-wasting 30% of standardized repetitive work. It's not replacing your team. You can't expect it to take over your business. But you can use it to make sure your team never gets asked the same question 20 times again.

Is that enough? For a lot of small teams, I think it actually is. But for anyone holding out for the fantasy of a "fully autonomous AI company," it's nowhere close. So my read on Lucius is: it's an interesting sample, not the destination. This category is just getting started. A pile of similar "organizational memory layer" products will show up over the next year, and who actually breaks out is anyone's guess.

This is their official Discord community if you want to try it: https://discordhunt.com/en/servers/lucius-lab-1484054485020966956 Lucius is currently offering a launch promo with 400 free actions; if you run a community of your own, give it a spin.

4. The One Thing I Actually Want to Say

I've rambled a lot. Here's the part I really mean: The winners of the next era won't be the companies with the strongest model. They'll be the companies with the deepest organizational memory.

It took me a long time to be willing to write that line down, because it implies that most of the energy we spent over the past three years "chasing the strongest model" was pointed in the wrong direction. Models get refreshed every three months. The moat is pathetically shallow. But a company that has accumulated two years of conversations, decisions, feedback, and brand voice: that's not something you can copy, and it's not something a competitor can catch up to overnight.

So if you let me give one line of advice to three kinds of people, here's what I'd say:

To founders: Don't go all-in on the bleeding-edge model. Find a vertical scenario and make your "organizational memory" as thick as possible. Models will keep changing, but organizational memory is the thing that compounds.

To managers: Stop buying your team more AI tools. First ask whether your team has a single place where every conversation actually gets captured. Without that foundation, every additional tool just accelerates the chaos.

To individuals like me: Even if you're a team of one, start building your own Context Layer. Your project notes, your customer conversations, your writing material: these are the most valuable assets you'll own over the next five years.

Honestly, I haven't fully figured this out either. I'm still juggling more than a dozen AI tools. I still re-enter the same idea into different places. I still routinely fail to find an insight I had three months ago that I was sure I'd remember. So this isn't an "I figured it out, follow me" tutorial. It's a letter from one practitioner in the AI era to another one fumbling through the same fog.

If you've felt that same off-kilter pattern of "individuals flying, teams crumbling", then we're in this together. Let's take our time, and figure it out together.

(This piece is synthesized from the MIT 2026 AI Adoption report, notes from the closed-door Sequoia AI Ascent 2026 summit, and recent industry developments. Lucius is mentioned as one example, not as a recommendation.)

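Summary: AI tools have sharply raised individual productivity but have not sped up overall organizational output. The core issue is that organizations generally lack "memory": the MIT Sloan 2026 report says 95% of enterprise AI investments produce no measurable return, and more than 30% of team time goes to rebuilding context. Individual productivity rises with AI tools (whose memory stays in personal accounts), but the gains do not consolidate at the organizational level, hence "individuals are flying, organizations are crumbling." At its AI Ascent summit, Sequoia framed 2026 as the commercial year zero of long-horizon agents, with the next round of AI selling outcomes rather than tools.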
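The answer-capture loop the article attributes to Lucius (try to answer, otherwise open a task for ops, then store the human answer for next time) can be sketched in a few lines. This is a toy illustration under stated assumptions, not Lucius's implementation: the class name, the exact-match lookup, and the question normalization are all hypothetical.

```python
class CommunityMemory:
    """Toy ask / escalate / capture loop with exact-match lookup."""

    def __init__(self):
        self.knowledge = {}   # normalized question -> captured answer
        self.open_tasks = []  # questions waiting for a human answer

    @staticmethod
    def _key(question):
        # Naive normalization: lowercase, collapse whitespace, drop trailing "?".
        return " ".join(question.lower().split()).rstrip("? ")

    def ask(self, question):
        """Answer from memory, or open a task for ops and return None."""
        key = self._key(question)
        if key in self.knowledge:
            return self.knowledge[key]
        if key not in self.open_tasks:
            self.open_tasks.append(key)
        return None

    def ops_answer(self, question, answer):
        """Capture a human answer so the next identical question is handled."""
        key = self._key(question)
        self.knowledge[key] = answer
        if key in self.open_tasks:
            self.open_tasks.remove(key)


memory = CommunityMemory()
first = memory.ask("How do I reset my password?")           # unknown: escalates
memory.ops_answer("How do I reset my password?", "Use Settings > Security.")
second = memory.ask("how do I reset my  password ?")        # now answered from memory
print(first, "|", second)
```

A real system would need semantic matching rather than string equality, which is exactly where the hard part of "organizational memory" sits.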

AK @_akhaliq · May 30 · 58

81k models available through huggingface inference api

Summary: 81k models are available through the Hugging Face Inference API.

X.PIN @thexpin · May 29 · 65

http://x.com/i/article/2060305879338029061

# Huawei can't win the nanometer race. So it is changing the game.

Unable to compete at the frontier of transistor scaling, Huawei is betting that the future of chip performance lies in integration, interconnects, and light.

Huawei cannot reliably win the nanometer race. So it has decided to run a different one.

On May 25, 2026, He Tingbo, Huawei’s board member and president of its semiconductor business, took the stage at the International Symposium on Circuits and Systems in Shanghai and announced what she called the τ (Tau) Law, a new principle for how chips should be made faster in an era when making transistors smaller is no longer a reliable path forward. Huawei described it as the first attempt by a Chinese company to articulate a post-Moore scaling framework with global ambitions.

The announcement generated a wave of coverage, most of it focused on whether this constituted a genuine scientific contribution or a rebranding of known techniques. Both framings miss the more consequential question: why is Huawei doing this at all, and what does it reveal about where the company is placing its bets? The answer starts with a set of circumstances Huawei did not choose, and a moment in the industry’s trajectory that made those circumstances easier to work with.

The timing is not accidental. As transistor scaling slows globally, AI systems are becoming increasingly constrained by data movement rather than raw compute. The bottleneck is shifting from how fast a single chip can calculate to how efficiently thousands of chips can share data across a system. The industry was already moving toward advanced packaging, chiplets, and optical interconnects to address that shift. Huawei’s contribution was to turn those scattered trends into a single narrative, and claim the naming rights before anyone else did.

Since 2020, U.S.-led export controls have effectively cut Huawei off from the ecosystem required to manufacture chips at the industry’s leading edge. The result is that Huawei cannot access leading-edge manufacturing on the same terms as Apple, Nvidia, or Qualcomm. The Mate 60’s appearance of 7nm-class chips, achieved through SMIC, showed that the door is not entirely shut. But competing at the industry’s true frontier has become extraordinarily difficult in a way that is structural, not temporary.

That frontier has a straightforward competitive logic. Smaller transistors fit more computing power into the same area, consume less energy per operation, and run faster. This is what Moore’s Law predicted in 1965 and what the industry has organized itself around ever since. Every two years or so, the leading foundries push to a new node: 7nm, 5nm, 3nm. The companies that can access those nodes gain a measurable performance advantage over those that cannot. Competing there, at the very frontier, is what Huawei cannot currently do on equal terms. That is the constraint within which the τ Law was designed.

## A Different Variable to Optimize

The τ Law proposes an answer to that constraint. In Huawei’s formulation, τ refers to the effective RC time constant that governs how quickly signals can propagate and switch states within a chip. Smaller τ means faster signals, more operations per second, higher effective performance. Moore’s Law, underneath all the transistor-count language, was always producing performance gains by reducing τ: shrink the transistors, shorten the wires connecting them, signals arrive faster. Huawei’s argument is not that this was wrong. It is that there are other ways to reduce τ that do not require a new process node: through the circuit layout, the chip architecture, and the systems connecting chips together.

Huawei defines a four-layer optimization stack: the transistor itself, the circuit connecting transistors, the chip connecting circuits, and the system connecting chips. Each layer has its own version of τ, and each offers opportunities to compress signal travel time without shrinking transistor dimensions. The τ Law is a framework for pursuing all four simultaneously.

Here is the honest assessment of what this represents: Huawei did not discover this direction. The physics pointing toward it, with RC delay as the binding constraint as geometric scaling slows, has been in semiconductor textbooks for decades. Intel, TSMC, and Samsung are all working on versions of the same techniques. What Huawei did was name the direction, formalize it into a single framework, and build a public roadmap around it. That is a different kind of contribution than inventing the underlying physics. But it is not nothing. Moore’s Law itself was not a discovery of new physics. It was a prediction that became a commitment that became a coordination mechanism for an entire industry.

## Folding Is Not Stacking

The most tangible expression of the τ Law at the chip level is Logic Folding, and understanding it requires separating it from something it superficially resembles: conventional 3D chip stacking.

The semiconductor industry has been stacking chips for years. TSMC’s SoIC, Intel’s Foveros, and Samsung’s X-Cube all take multiple finished chips and connect them vertically to reduce the distance signals travel between them. It is a genuine and increasingly important technique. But each chip in the stack is still internally structured the same way it always was: circuits laid flat across a single layer, signals running long horizontal paths to reach neighboring gates.

Logic Folding addresses the interior of the chip, not the space between chips. Rather than finishing the chip and then connecting it to others, Huawei redesigns the circuit layout during the design phase, redistributing logic gates across multiple vertical layers within a single chip. Connections between layers are made through face-to-face hybrid bonding, routing signals vertically across short distances rather than horizontally across long ones.

3D stacking shortens the distance between chips. Logic Folding shortens the distance inside a chip. One is a packaging innovation applied after manufacture. The other is a design innovation applied before it. They address different layers of the same problem, which is also why they are complementary rather than competing.

On the first commercial implementation, the new Kirin chip expected this autumn, Huawei claims transistor density rises from 155 million to 238 million per square millimeter, and says energy efficiency improves by 41%. These numbers come from Huawei and have not been independently verified. What can be said without qualification is that the improvement is achieved without a new manufacturing process, on existing foundry infrastructure, which is the point the τ Law is making. The goal is approaching the transistor density associated with leading-edge nodes through design rather than fabrication.

This is a meaningful achievement if the numbers hold up. It is also, importantly, a packaging and integration achievement more than a transistor achievement. The performance gain comes from rethinking how circuit elements connect to each other, not from making them individually smaller. And that logic, followed to its conclusion at the system level, leads directly to co-packaged optics.

CONTINUE READING AT https://www.thexpin.com/p/huawei-post-moore-chip-strategy

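Summary: Because of U.S. export controls, Huawei faces difficulty in the race for leading-edge chip process nodes. In response, in May 2026 Huawei proposed the "τ (Tau) Law", intended as a new framework for improving chip performance in the post-Moore era. Its core is optimizing the effective RC time constant (τ) to speed up signal propagation. Rather than relying solely on process scaling, it compresses τ at four levels: the transistor, the circuit, chip interconnect, and system architecture. Huawei describes it as the first post-Moore scaling framework with global influence proposed by a Chinese company.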
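A worked relation helps show why shortening wires matters so much for an RC time constant. The formula below is the standard Elmore estimate for a uniform distributed RC wire, not Huawei's formulation; the symbols r and c (resistance and capacitance per unit length) and L (wire length) are introduced here only for illustration.

```latex
\tau_{\text{wire}} \approx \tfrac{1}{2}\, r\, c\, L^{2},
\qquad
\frac{\tau_{\text{wire}}(L/2)}{\tau_{\text{wire}}(L)} = \frac{1}{4}
```

Because both total resistance and total capacitance grow with length, the delay grows with the square of length: under this idealized model, a layout change that halves a long horizontal route cuts its wire delay to about a quarter, with no change to the transistors themselves.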

Chubby♨️ @kimmonismus · May 29 · 61

ByteDance is reportedly building its own inference chip modeled on Groq's LPU, the same architecture Nvidia paid roughly $20B to license in December. The LPU keeps the model in on-chip SRAM and skips high-bandwidth memory. HBM is the component the US restricts most tightly for export to China. ByteDance's memory partner InnoStar fabs at TSMC's mature nodes, which also sit outside the controls. Each of those choices routes around a US restriction. What's left is the architecture Nvidia just spent $20B to own. China is increasingly moving toward developing its own chips and is succeeding in becoming ever more independent of the USA. That is truly impressive. Source: The Information.

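Summary: ByteDance is reportedly developing its own inference chip based on Groq's LPU architecture. The architecture keeps the model in on-chip SRAM and skips high-bandwidth memory, the component most tightly restricted by U.S. export controls on China. ByteDance's memory partner InnoStar manufactures at TSMC's mature process nodes, which also fall outside the controls. Each of these design choices is aimed at routing around U.S. restrictions, and this is the same architecture Nvidia just paid roughly $20 billion to license.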

Rohan Paul @rohanpaul_ai · May 29 · 57

This paper shows how LLMs can use shorter context more cheaply without losing much answer quality. It shows that choosing the right context method for the deployment setting can cut token use by about 25% at similar quality, and by over 50% in some reused-memory cases. The problem is that long context gives a model more information, but every extra token costs money and compute, and the extra context often brings smaller gains. Longer context has diminishing returns, and the expensive tokens are often the ones added after the model already has enough signal. The authors propose an Efficiency Frontier, which compares context strategies by looking at answer quality and token cost together instead of treating them as separate scores. The key idea is that some methods are cheap per question, like retrieval, while others spend more upfront, like memory compression, but become cheaper when the same processed context is reused many times. They tested this on 5,000 HotpotQA questions, where the model has to combine facts across documents while ignoring distracting text. The main result is that the best context strategy changes with the setting: lightweight retrieval works best when reuse is low, memory compression becomes better when reuse is high, and full-context prompting is still needed for the highest scores.
----
Link: arxiv.org/abs/2605.23071
Title: "The Efficiency Frontier: A Unified Framework for Cost-Performance Optimization in LLM Context Management"

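Summary: The paper proposes the "Efficiency Frontier" framework for evaluating the cost-performance trade-offs of LLM context management strategies in a unified way. The core finding is that choosing the right context method for the deployment can cut token use by about 25%, and by over 50% in some memory-reuse scenarios, with little loss in answer quality. The study notes that context length has diminishing returns: tokens added later cost a lot but yield little. In tests on 5,000 HotpotQA questions, lightweight retrieval suits low reuse, memory compression is better at high reuse, and full-context prompting is still required for the highest performance.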
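Two ideas from the post can be made concrete: the reuse count at which an upfront-cost method overtakes a per-query method, and a quality-versus-tokens frontier. The sketch below is one plausible reading, not the paper's code; the function names and every token count and quality score are hypothetical.

```python
def break_even_reuses(upfront_tokens, per_query_a, per_query_b):
    """Smallest query count n where upfront + n * per_query_b < n * per_query_a.

    Method A (e.g. retrieval) pays per_query_a tokens each time; method B
    (e.g. compressed memory) pays upfront_tokens once plus per_query_b each time.
    Returns None if B never becomes cheaper.
    """
    saving = per_query_a - per_query_b
    if saving <= 0:
        return None
    return upfront_tokens // saving + 1


def efficiency_frontier(points):
    """Names of strategies not beaten on both tokens (lower) and quality (higher)."""
    frontier, best_quality = [], float("-inf")
    for name, tokens, quality in sorted(points, key=lambda p: (p[1], -p[2])):
        if quality > best_quality:
            frontier.append(name)
            best_quality = quality
    return frontier


# Hypothetical numbers for illustration only.
n_star = break_even_reuses(upfront_tokens=20_000, per_query_a=1_500, per_query_b=500)
strategies = [
    ("retrieval", 1_500, 0.70),
    ("compressed_memory", 900, 0.68),
    ("full_context", 6_000, 0.78),
    ("naive_truncation", 2_000, 0.60),
]
print(n_star, efficiency_frontier(strategies))
```

In this toy setup the compressed memory only pays off once the same context is queried more than 20 times, which mirrors the post's point that the best strategy depends on how much reuse the deployment sees.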

向阳乔木 @vista8 · May 29 · 49

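If you subscribe to X Premium+, you can now install the Grok Build CLI: curl -fsSL https://x.ai/cli/install.sh | bash It can generate images from the CLI, but calling the video_gen interface doesn't seem to work; the official line seems to be that it can generate video, but in actual testing it couldn't. I assumed it could read posts on X directly, but that doesn't work either. Sigh. It can't beat Codex and CC at coding, so it needs to find some other selling point. Hurry up!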

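Summary: X Premium+ subscribers can now install the Grok Build CLI. In hands-on testing the tool generates images successfully, but generating video through the `video_gen` interface does not currently work, despite official statements to the contrary. Reading posts on X directly is also not yet supported. On coding ability, the tool is considered to trail Codex and Claude Code.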

Rohan Paul @rohanpaul_ai · May 29 · 64

Stronger agents will not come only from larger models, but from better systems around them. The problem is that many AI agents are judged as if the model alone did the work, even though the real behavior also depends on memory, tools, context, routing, checks, and permissions. This surrounding setup around the agent is called the harness, meaning the system that decides what the model sees, what tools it can use, what it remembers, and what actions get checked. Progress should come from scaling this harness, especially 3 parts: better context control, more trustworthy memory, and better routing to tools or helper agents. Long context is not the same as usable context, memory is not the same as trustworthy memory, and having many tools is not the same as knowing when to use them. A stale note can be more dangerous than no note, because it gives the agent confidence exactly when it should re-check the world. A specialized subagent can also fail quietly if its output sounds plausible but no later layer verifies whether it is true. This is why one-shot benchmark scores feel increasingly thin. Two agents can reach the same final answer, while one burns far more tokens, makes riskier tool calls, carries corrupted memory, or succeeds only by accident. The next frontier is not just scaling the mind inside the machine. It is scaling the discipline around it.
----
Link: arxiv.org/abs/2605.26112
Title: "From Model Scaling to System Scaling: Scaling the Harness in Agentic AI"

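Summary: The post argues that an AI agent's strength depends not only on the model but also on the system around it (the harness). That system determines the model's inputs, available tools, memory, and action checks. Core progress should come from scaling this system, especially better context control, more trustworthy memory, and better routing to tools or subagents. It stresses that long context is not usable context, more memory is not trustworthy memory, and more tools does not mean knowing how to use them. This makes evaluation by one-shot benchmark scores alone look thin. The next frontier is scaling the harness around the agent, not just the model itself. The paper is titled "From Model Scaling to System Scaling: Scaling the Harness in Agentic AI".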
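The point that a stale note can be worse than no note suggests one small harness rule: surface old memory, but flag it for re-checking instead of trusting it silently. The sketch below is a hypothetical illustration of that rule, not something taken from the paper; the `Note` fields and the `recall` helper are made-up names.

```python
from dataclasses import dataclass


@dataclass
class Note:
    text: str
    written_at: float  # seconds on any consistent clock
    max_age: float     # how long the note may be trusted without re-checking


def recall(note: Note, now: float):
    """Return (text, needs_recheck): stale notes are surfaced but flagged."""
    stale = (now - note.written_at) > note.max_age
    return note.text, stale


note = Note("deploy branch is release-7", written_at=0.0, max_age=3600.0)
fresh = recall(note, now=600.0)     # ten minutes later: still trusted
stale = recall(note, now=86_400.0)  # a day later: must be re-verified
print(fresh, stale)
```

The design choice is that staleness changes what the agent must do next (verify before acting) rather than hiding the note, so the harness keeps the information while removing the false confidence.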

SemiAnalysis@SemiAnalysis_ · 5月29日54

Running a single deep coding model at max context on Cerebras requires 24 systems ($24M Capex) just to support 256 concurrent users. At that scale, $100M gets you way more memory bandwidth in standard GB300 racks.

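Summary: Running a single deep coding model at maximum context on Cerebras takes 24 systems ($24M in capex) just to support 256 concurrent users. At that scale, $100M buys far more memory bandwidth in standard GB300 racks.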

ginobefun@hongming731 · 5月29日50

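I just checked BestBlogs' recent model usage and was pleasantly surprised. With more than 10,000 subscription feeds, it processes close to 50 million tokens a day, running low-priority content on deepseek-v4-flash and high-priority content on deepseek-v4-pro, for a total of a bit over 20 yuan per day. The key is a very high cache hit rate, which keeps costs comfortably low. So far, DeepSeek may be the most cost-effective family of models I have used. When I was on Gemini, the cost pressure was noticeably higher.

Summary: The author uses DeepSeek V4 Flash for low-priority content and DeepSeek V4 Pro for high-priority content, processing close to 50 million tokens a day at a total cost of about 20 RMB per day. The key factor is a high cache hit rate, which significantly lowers cost. Compared with Gemini, which the author used previously, DeepSeek offers better value for money.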
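As a rough check on the figures in this post, the blended unit cost can be worked out directly. The sketch below treats the two approximate numbers (about 20 yuan and about 50 million tokens per day) as exact, which is an assumption; the function name is made up for illustration.

```python
def cost_per_million_tokens(daily_cost: float, daily_tokens: int) -> float:
    """Blended cost per one million tokens, in the same currency as daily_cost."""
    return daily_cost / (daily_tokens / 1_000_000)


# Approximate figures from the post, treated as exact for illustration only.
unit_cost = cost_per_million_tokens(daily_cost=20.0, daily_tokens=50_000_000)
print(f"{unit_cost:.2f} yuan per million tokens")  # prints "0.40 yuan per million tokens"
```

Because the post says "a bit over 20 yuan" and "close to 50 million" tokens, the true blended figure is probably slightly higher than this.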

StepFun@StepFun_ai · 5月29日79

Day-0 vLLM support. Thanks @vllm_project 🤝

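Summary: StepFun released the Step-3.7-Flash model, and vLLM supported it on release day. The model is a 198B-parameter sparse MoE vision-language model with roughly 11B active parameters per token and native image and text input. Its 256K context window suits long documents, multi-file codebases, and dense visual interfaces. The model ships with FP8 and NVFP4 quantized weights and has built-in MTP speculative decoding, native tool calling, and reasoning parsing.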

swyx@swyx · 5月29日61

met with @ACM_President today! we awarded Industry Spotlights at @CAISconf, and all posters and OpEx talks will be presenting at @aiDotEngineer next month more AIE x ACM collaboration incoming! wonder what a “Turing award of AI Engineering” could look like…

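Summary: Met with @ACM_President today! We awarded Industry Spotlights at @CAISconf, and all posters and OpEx talks will be presented at @aiDotEngineer next month. More AIE x ACM collaboration is on the way! I wonder what a "Turing Award of AI Engineering" could look like…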

Rohan Paul@rohanpaul_ai · 5月29日64

Some truly massive inference numbers here. @Kog__AI just achieved 3,000 tokens/s on 8× AMD MI300X GPUs and 2,100 on 8× NVIDIA H200 (FP16, no speculative decoding) with a 2B model. For comparison, typical GPU decoding speed for 2B to 8B models on high-end GPUs is around 100 to 300 tokens/s. They achieved it by treating LLM decoding as a memory-streaming problem: keep the whole token-generation loop inside one persistent GPU program, so kernel launches, CPU scheduling, intermediate memory writes, and sampling interruptions mostly disappear. Then they cut synchronization waste by making each compute unit wait only for the exact data it needs, while mapping memory access to the MI300X’s chiplet topology so the GPU stops paying avoidable cross-die latency. Finally, their model architecture delays tensor-parallel communication so all-reduce work happens in the background instead of blocking every layer, which is why the runtime, GPU code, and model design all have to be co-designed.

Summary: Kog AI achieved striking inference speeds on standard data center GPUs: 3,000 tokens/s on 8× AMD MI300X and 2,100 tokens/s on 8× NVIDIA H200 (FP16, no speculative decoding), against a typical 100 to 300 tokens/s. The core technique is to treat LLM decoding as a memory-streaming problem: keep the entire token-generation loop inside a single persistent GPU program, optimize memory-access topology to cut cross-die latency, and delay tensor-parallel communication to sharply reduce overhead. Kog opened a technical preview today with a 2B coding model and plans to support large frontier MoE models later.

OpenClaw🦞@openclaw · 5月29日62

OpenClaw’s latest sweep: cold agent turns 2.9x faster, warm turns 2.5x faster, tarball 59% smaller, deps down 42% from the monthly high. Small core, explicit deps, optional power in plugins. The claws are getting sharper 🦞 https://openclaw.ai/blog/lighter-core-sharper-claws/

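Summary: OpenClaw's latest optimization results: cold agent turns are 2.9x faster, warm turns 2.5x faster, the tarball is 59% smaller, and dependencies are down 42% from the monthly high. Small core, explicit dependencies, optional features in plugins. The claws are getting sharper 🦞 https://openclaw.ai/blog/lighter-core-sharper-claws/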

Rohan Paul@rohanpaul_ai · 5月29日62

Most AI teams still buy inference like they are buying software from 1 vendor. They pick a model, accept the fixed price, wire it into the app, and keep paying that rate even when cheaper models could handle the same work. @The_GridAI takes a different approach. Instead of choosing a model name, you choose the level of work you need: standard, prime, or max. A simple task like support-ticket classification can run on standard. Normal production work like RAG, drafting, support replies, or agent steps can run on prime. Harder work with long context or higher error cost can run on max. The Grid then routes the request to the cheapest supplier that still qualifies for that tier. So the app still uses one API and mostly the same code, but the model behind the request can change as price and quality change. I tested it with Hermes Agent on my Ubuntu machine. Hermes ran locally, while The Grid handled the inference through agent-prime. The workflow was simple: read support tickets, apply a policy file, and write a triage report.

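Summary: The Grid AI proposes a new way to buy AI inference. Instead of naming a specific model, users pick one of three tiers by task complexity: standard, prime (production), or max. The platform automatically routes each request to the cheapest supplier that meets the tier's requirements. The app integrates a single API while the backend model can change with price and quality, which optimizes cost. The author tested it locally with Hermes Agent, running a support-ticket triage workflow through the agent-prime tier. The Grid is currently in beta, claims that supplier bidding can cut AI API costs by up to 80%, and offers new users their first 200M tokens free.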

Epoch AI@EpochAIResearch · 5月29日68

Hyperscaler capital expenditures came in on trend in Q1 2026, continuing the trajectory that projects them spending $770 billion this year and over a trillion dollars in 2027.

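Summary: Hyperscaler capital expenditures were on trend in Q1 2026, continuing a trajectory that projects $770 billion in spending this year and more than one trillion dollars in 2027.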

Replit ⠕@Replit · 5月29日64

How to secure your vibecoded app in 4 steps 🔒 Speed without security is a liability. Here's how to ship without leaving the back door open using Replit. 🧵Open thread ↓

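Summary: How to secure your vibecoded app in four steps 🔒 Speed without security is a liability. Here is how to ship with Replit without leaving the back door open. 🧵 Open thread ↓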

OpenRouter@OpenRouter · 5月28日69

TIP: You can use Flex and Priority tiers for supported models (OpenAI, Google Vertex, & more) Pricing available on each model page. Docs: https://openrouter.ai/docs/guides/features/service-tiers

Summary: Tip: you can use the Flex and Priority tiers for supported models (OpenAI, Google Vertex, and more). Pricing is listed on each model page. Docs: https://openrouter.ai/docs/guides/features/service-tiers

Rohan Paul@rohanpaul_ai · 5月28日65

Elon Musk just told investors that SpaceX’s Anthropic AI compute deal is not a locked multi-year rental, but a 180-day lease for Colossus with a 90-day cancellation path. The older reading made the deal look like $1.25B/month through May 2029, but Musk says SpaceX wanted the short term because AI compute may become too scarce to rent away for years. SpaceX wants flexibility because Colossus is not just a side asset: the same compute infra can train xAI models, support internal AI systems, or become a paid cloud-style business. --- reuters .com/technology/musk-says-spacex-did-not-commit-long-term-colossus-lease-with-anthropic-2026-05-28/

Summary: Elon Musk clarified to investors that the Colossus capacity SpaceX provides to Anthropic for AI compute is not a locked long-term rental but a 180-day lease with a 90-day cancellation path. Outside observers had taken the deal to be worth about $1.25 billion per month through May 2029, but Musk explained that SpaceX chose short terms because AI compute may become scarce and should not be rented out for years. He stressed that Colossus is not an idle asset: the same compute infrastructure will train xAI models, support internal AI systems, or possibly grow into a paid cloud service, so SpaceX needs operational flexibility.

ginobefun@hongming731 · 5月28日52

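This article on Alibaba's ATA is a bit wild: it takes Claude Code from a local CLI tool to a cloud deployment, enables HTTP streaming calls through a heavily modified SDK, and uses sandboxes for multi-user isolation.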

OpenClaw🦞@openclaw · 5月28日64

OpenClaw 2026.5.27 is live 🦞 🔒 tighter runtime/security boundaries ⚡ faster gateway + reply paths 🧠 steadier Codex/app-server memory 📡 better channels, providers, Pixverse video Less wedge, more claw. https://github.com/openclaw/openclaw/releases/tag/v2026.5.27

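Summary: OpenClaw 2026.5.27 is live 🦞 🔒 tighter runtime/security boundaries ⚡ faster gateway and reply paths 🧠 steadier Codex/app-server memory 📡 better channels, providers, and Pixverse video. Less wedge, more claw. https://github.com/openclaw/openclaw/releases/tag/v2026.5.27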

Rohan Paul@rohanpaul_ai · 5月28日62

Super important paper from Univ of Texas. AI agents can slowly become less reliable after deployment, even when the model itself does not change. The problem is that agents are often judged when they are fresh, but real agents keep changing because they summarize old chats, store more memories, update facts, and go through maintenance. An agent that remembers you across weeks is really a small operating system wrapped around a language model: it writes notes, compresses them, retrieves them, updates them, and occasionally cleans house. Every one of those steps can quietly rot. A medication dose can become “a daily medication,” two similar clients can blur into one, a canceled subscription can remain active, and a schedule can vanish after a maintenance pass. The uncomfortable finding is that the agent may still sound competent while becoming less exact. The authors propose AgingBench, a benchmark that checks whether an agent stays reliable across many sessions instead of only checking one clean starting point. It studies 4 ways agents age: summaries can drop key details, similar memories can get mixed up, updated facts can stay stale, and maintenance can suddenly break memory. The deeper lesson is that “give it more memory” is often the wrong repair. If the fact was never written, retrieval cannot save it. If the fact was written but crowded out, better summarization will not fix it. If the fact is present but unused, the problem is not storage but the agent’s decision to trust or ignore what it retrieved. This paper reframes deployed agents less like static models and more like aging infrastructure. ---- Link – arxiv. org/abs/2605.26302 Title: "Your Agents Are Aging Too: Agent Lifespan Engineering for Deployed Systems"

Summary: The paper argues that after deployment, an AI agent's memory system gradually "ages" through summarization, storage, updates, and maintenance, leaving information lost, confused, stale, or corrupted. The agent still appears to work, but its reliability has quietly declined. The paper proposes the AgingBench benchmark to evaluate an agent's sustained reliability across many sessions. It likens agents to aging infrastructure and stresses that simply adding memory is not the fix.

Rohan Paul@rohanpaul_ai · 5月28日59

NVIDIA published a report on Vera CPU benchmarks run by Phoronix. It compares Vera directly against a current 128-core x86 CPU and claims a 1.5x overall performance advantage. Compared with the prior-generation NVIDIA Grace CPU, Vera delivered a 1.6x geometric mean increase in Phoronix’s testing. Vera delivered over 4x the memory bandwidth per core compared with traditional x86 CPUs. Vera delivers 1.2TB/s bandwidth using LPDDR5X, while keeping memory power under 30W, compared with more than 100W for many DDR5 server setups. ---- To note, Vera uses Armv9.2, not x86, so NVIDIA is basically saying that their Arm-based CPU can beat the usual Intel and AMD server CPUs. For agentic AI, this CPU-side work becomes much heavier because the AI is not only generating text, but also calling tools, reading files, writing code, using browsers, running sandboxes, and managing workflows.

Summary: NVIDIA published a Vera CPU benchmark report. Vera uses the Armv9.2 architecture; in Phoronix's testing it showed a 1.5x overall performance advantage over a 128-core x86 CPU and a 1.6x geometric-mean gain over the prior-generation Grace CPU. Its per-core memory bandwidth is more than 4x that of traditional x86 CPUs, and it reaches 1.2TB/s with LPDDR5X while keeping memory power under 30W. The report aims to show that NVIDIA's Arm-based CPU now outperforms Intel and AMD x86 server CPUs, and it stresses that CPU-side load grows heavier in agentic AI because of tool calls, file reads and writes, code generation, and other complex tasks.

Mistral AI@MistralAI · 5月28日62

We're taking on the hardest problems in the real world 🏗️🚚 🛫⚛️ Today at The AI Now Summit, held at the Louvre, we announced AI solutions for aerospace, automotive, energy, and physics. Deployed in production at @Airbus , @BMW, @EDFofficiel , and more. More below:

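Summary: We are taking on the hardest problems in the real world 🏗️🚚 🛫⚛️ Today at The AI Now Summit, held at the Louvre, we announced AI solutions for aerospace, automotive, energy, and physics. They are deployed in production at @Airbus, @BMW, @EDFofficiel, and more. Details below: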

ginobefun@hongming731 · 5月28日69

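This Tencent article tackles a very practical problem: when an agent works on long tasks, it is increasingly dragged down by its own context. We routinely ask agents to search for material, read files, edit code, run tests, and write reports, and each step looks normal. But these steps keep producing large amounts of intermediate information: page bodies, search results, tool returns, logs, code snippets, error messages, superseded plans. As the task grows, all of this piles up in the context. That is where the trouble starts. The longer the context, the higher the token cost; worse, the agent gets distracted by old information. It may forget the original goal, repeat searches it has already done, confuse different subtasks, or be led astray by earlier logs that no longer matter. In other words, the information is not lost, but it is piled up so chaotically that the agent cannot find what matters. So the core question the article addresses is: how can an agent carry less redundant information through a long task while still remembering its progress and retrieving the original evidence when needed? The authors' approach can be summed up in one line: short-term memory compression = context offloading + Mermaid task canvas. Start with context offloading. The idea is simple: not all information has to stay in front of the model. Full web pages, full logs, and full tool results can be stored in an external file system, with only a summary, a path, and an index kept in context. When the agent really needs the details, it follows the path back to the original. It is a bit like writing a report: you do not spread every reference across the desk; you file the material in folders and keep only a table of contents and key excerpts on the desk. The desk is tidier, yet nothing is lost. But moving information out is not enough. If what remains is just a pile of summaries, such as "searched HKU tuition", "searched CUHK tuition", "generated comparison table", the summaries are shorter but still form a linear log. The agent still cannot easily tell which steps ran in parallel, which pieces of information depend on each other, or where the task currently stands. So the article introduces a second piece: the Mermaid task canvas. Mermaid is a text format for describing diagrams; models can read it, and it can be rendered as a diagram in practice. The authors use it to organize the agent's execution into a task map. Each node represents a subtask and carries a status, a summary, and a timestamp; arrows between nodes express dependencies. What the agent sees is then no longer a long history but a structured map: which steps are done; which nodes are still in progress; which information fed into the current conclusion; where to continue next; and which file to open for details. This is what the article calls the "infinite canvas". It does not make the context window truly infinite; it keeps information outside the context visible, locatable, and recoverable. The design has another important element: layered memory. The bottom layer is the full original text, stored in external files; the next layer up is tool-call summaries, recording what each call did and where the original lives; above that are Mermaid nodes, recording task steps and interim conclusions; the top layer is task metadata, holding only the task goal, status, and canvas path. In use, the agent first looks at the lightest task index, then opens the relevant canvas; if the canvas summaries are not enough, it checks the tool summaries; only if that still falls short does it read the full original. This avoids two extremes: stuffing everything into context until it becomes a mess, and summarizing so crudely that details are squeezed out and cannot be recovered later. The experimental results are fairly direct. Across several long-task evaluations the approach reduced token consumption without hurting task performance, and in many scenarios improved it. On web search tasks it saved up to about 61% of tokens; on code repair tasks it saved about 31% to 33% of tokens while the completion rate also rose; on complex long tasks the pass rate went from 20% to between 30% and 35%. More importantly, ablations show that context offloading alone helps but only to a limited degree; adding the Mermaid task canvas further improves both token savings and completion rate. Effective compression, in other words, cannot only compress content; it must also preserve structure. The most useful lesson from the article is that it treats memory neither as "stuff all history into context" nor compression as "write a shorter summary". What it actually does is turn the agent's working process into a task memory system that is foldable, recoverable, and navigable.

Summary: Tencent points out that agents running long tasks suffer from context build-up, which raises cost and causes them to lose track of the goal. The proposed fix combines "context offloading" with a "Mermaid task canvas": detailed content is stored externally while the context keeps only an index, and a diagram structures the execution into a task map with statuses and dependencies. The approach uses a layered memory system. In experiments it saved up to about 61% of tokens on web search tasks, saved 31% to 33% of tokens on code repair tasks while raising the completion rate, and lifted the pass rate on complex tasks from 20% to between 30% and 35%. Ablations show that structured compression with the task canvas works better than offloading alone.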
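The two pieces described above, offloading raw output to files and keeping a Mermaid graph of task nodes in context, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the article's implementation: the `TaskCanvas` class, its method names, and the file naming scheme are all hypothetical, and the per-node timestamp the article mentions is omitted for brevity.

```python
import hashlib
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class Node:
    node_id: str
    summary: str
    status: str          # e.g. "done" or "running"
    artifact: str        # path to the offloaded raw output
    depends_on: list[str] = field(default_factory=list)


class TaskCanvas:
    """Hypothetical helper: raw output goes to disk, only summaries stay in context."""

    def __init__(self, store_dir: str) -> None:
        self.store = Path(store_dir)
        self.store.mkdir(parents=True, exist_ok=True)
        self.nodes: dict[str, Node] = {}

    def offload(self, node_id, raw_output, summary, status="done", depends_on=()):
        # Write the full text to an external file and keep only summary + path.
        digest = hashlib.sha256(raw_output.encode("utf-8")).hexdigest()[:12]
        path = self.store / f"{node_id}-{digest}.txt"
        path.write_text(raw_output, encoding="utf-8")
        self.nodes[node_id] = Node(node_id, summary, status, str(path), list(depends_on))
        return path

    def recall(self, node_id: str) -> str:
        # Recover the original evidence by following the stored path.
        return Path(self.nodes[node_id].artifact).read_text(encoding="utf-8")

    def to_mermaid(self) -> str:
        # Summaries must not contain double quotes, or the label syntax breaks.
        lines = ["graph TD"]
        for n in self.nodes.values():
            lines.append(f'    {n.node_id}["{n.summary} ({n.status})"]')
        for n in self.nodes.values():
            for dep in n.depends_on:
                lines.append(f"    {dep} --> {n.node_id}")
        return "\n".join(lines)


canvas = TaskCanvas(tempfile.mkdtemp())
canvas.offload("s1", "full page text about HKU fees", "searched HKU tuition")
canvas.offload("s2", "full page text about CUHK fees", "searched CUHK tuition")
canvas.offload("t1", "", "build comparison table", status="running", depends_on=["s1", "s2"])
mermaid = canvas.to_mermaid()
print(mermaid)
```

Only the string returned by `to_mermaid()` would need to sit in the model's context; `recall()` is the escape hatch for the cases where a node summary is not enough.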

Alibaba Cloud@alibaba_cloud · 5月28日59

📢Qwen3.7-Max just hit #3 on ITbench-AA — a fresh benchmark testing how well models handle real-world enterprise IT tasks, agentic-style. 🔧Agentic era, go with Qwen.🏃🏃

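Summary: ITBench-AA, the first benchmark for evaluating how well models handle real enterprise IT tasks, was built jointly by Artificial Analysis and IBM Research and focuses on site reliability engineering (SRE) tasks. Qwen3.7-Max ranks third with a score of 42%. Every frontier model scores below 50%: Claude Opus 4.7 leads with 47%, followed closely by GPT-5.5 (xhigh) at 46%. Among open-source models, GLM-5.1 (Reasoning) leads with 40%. The benchmark will later expand to tasks such as financial operations (FinOps).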

Alibaba Cloud@alibaba_cloud · 5月28日67

Introducing ANOLISA (Alibaba Cloud Linux 4 Agentic Edition) — the first OS designed for AI agents. As agents evolve into "digital workers," traditional operating systems have become a bottleneck. ANOLISA changes that.

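Summary: Introducing ANOLISA (Alibaba Cloud Linux 4 Agentic Edition), the first OS designed for AI agents. As agents evolve into "digital workers", traditional operating systems have become a bottleneck. ANOLISA changes that.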

Krea@krea_ai · 5月28日64

Krea 2 live on Replicate!

Summary: Krea 2 is now live on Replicate! Generate high-fidelity, creative images with an aesthetics-first approach.

Berryxia.AI@berryxia · 5月28日65

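Honestly, insight sometimes arrives in an instant. It turns out the way I had been teaching AI to do things was wrong all along: issuing instructions day after day 😄 The night before last I watched Luo Pang's Dedao Brain launch event, where he said: "What really changes the way you work is a different kind of usage: have the AI proactively save the reports and research it does for you once they are finished." Because what you discuss with an AI is in fact part of your future "digital twin". If that key content is not recorded, or if you must keep reminding the AI to remember it, the experience is painful. I have been recommending the Bloom AI to everyone lately, but its own Memory module has not seen many upgrades or optimizations, so when I saw Memory OS 2.0 released a while back, I tried upgrading my current Bloome by integrating the two. This article is a record of that hands-on process, and I hope it serves as a useful reference. I will cover in detail: 1. The whole integration process and a before-and-after comparison 2. How it triggers "proactive memory points" 3. The advantages of this "proactive memory" over "passive memory". I hope you find it useful.

Summary: The post argues that having AI proactively record and save conversation content, rather than relying only on passively issued instructions, is key to building a "digital twin". Inspired by a remark at Luo Pang's launch event, the author integrated Memory OS 2.0 with the Bloom AI they use. In practice the integration triggers the AI's "proactive memory points" and has advantages over the traditional "passive memory" mode. The author will share the integration process, a before-and-after comparison, and an analysis of the benefits of proactive memory.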

Alibaba Cloud@alibaba_cloud · 5月28日71

Meet MuleRun on Alibaba Cloud Marketplace — An always-on AI workforce for research, reports, code, design & more. Powerful enough for individuals, enterprise-ready for teams — with SSO, RBAC, private networking, team knowledge management, and seamless integrations. Think bigger. Let MuleRun do the rest. Plans from $20/mo → https://int.alibabacloud.com/m/1000413520/ #AlibabaCloud #AIAgents #AIWorkforce #FutureOfWork #EnterpriseAI

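Summary: Meet MuleRun on Alibaba Cloud Marketplace: an always-on AI workforce for research, reports, code, design, and more. Powerful enough for individuals and enterprise-ready for teams, with SSO, RBAC, private networking, team knowledge management, and seamless integrations. Think bigger. Let MuleRun do the rest. Plans from $20/mo → https://int.alibabacloud.com/m/1000413520/ #AlibabaCloud #AIAgents #AIWorkforce #FutureOfWork #EnterpriseAI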

Berryxia.AI@berryxia · 5月28日69

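OpenAI has finally torn down the security and compliance wall that gives enterprises the most headaches. Today it launched Private MCP Tunnels: your team can keep its MCP servers entirely inside the internal network, while ChatGPT, Codex, and the Responses API connect securely over one-way outbound HTTPS, with no inbound ports opened and no long-lived API keys scattered around. It also shipped Workload Identity Federation and a substantially enhanced Admin API, with enterprise management features such as spend alerts, model allowlists, data retention policies, and hosted tool controls. This is not a minor patch; OpenAI is upgrading its AI platform from a "developer toy" into real enterprise-grade infrastructure. For large companies that wanted to use AI at scale, the blocker was never model capability; it was "data must not leave the perimeter" and "security reviews that drag on for half a year". Those obstacles have now been removed in one go. The last mile of enterprise AI adoption has finally been cleared by OpenAI.

Summary: OpenAI launched Private MCP Tunnels, which let enterprises keep MCP servers entirely inside their internal network. ChatGPT, Codex, and the Responses API connect securely over one-way outbound HTTPS only, with no need to open inbound ports or expose long-lived API keys. The accompanying Workload Identity Federation and substantially enhanced Admin API provide enterprise controls such as spend alerts, model allowlists, and data retention policies. The updates aim to remove the core obstacles to enterprise AI adoption, namely "data must not leave the perimeter" and lengthy security reviews, and to upgrade the OpenAI platform into enterprise-grade infrastructure.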

SemiAnalysis@SemiAnalysis_ · 5月28日55

GPUs are leaving performance on the table. Closing the gap between theoretical peak and real-world throughput is nearly impossible when hand-tuning CUDA kernels at scale. So why are hand-written CUDA kernels losing to auto-generated ones? Mohamed Abdelfattah at Makora has a solution: https://youtu.be/ukzACWrk0W0?si=whrH_WsHltmF_J7B

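Summary: GPUs still have performance headroom. When hand-tuning CUDA kernels at scale, closing the gap between theoretical peak and real-world throughput is nearly impossible. So why do hand-written CUDA kernels lose to auto-generated ones? Mohamed Abdelfattah at Makora has a solution: https://youtu.be/ukzACWrk0W0?si=whrH_WsHltmF_J7B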

Rohan Paul@rohanpaul_ai · 5月28日67

Most teams are overpaying for inference without realising it. Fixed rate cards face no competitive pressure. The Grid replaces them with live supply and demand, so prices track the market, not a vendor's margin. The Grid sits in the middle and basically says, “Don’t pick the model, pick the level of work you need.” A boring task like classifying support tickets does not need the smartest model, so it can run on standard. A normal production task like RAG, drafting, support replies, or agent steps can run on prime. A hard task with long context, high error cost, or difficult reasoning can run on max. Your app sends the request to The Grid, not directly to OpenAI, Anthropic, or one hosting company. The Grid then checks which suppliers currently qualify for that tier and sends the request to the cheapest one available at that moment. You still use one API key and mostly the same code, but the model behind the request can change as prices and quality change. So you stop paying premium prices for easy work, and you are not trapped inside one vendor’s model names, pricing, outages, or deprecations. New accounts get the first 200 million tokens covered. Here, I integrated Hermes Agent with The Grid in minutes, kept the agent running locally on my Ubuntu machine, and used “agent-prime” to read support tickets, apply a policy file, and write a triage report through The Grid’s API. You just need to - install Hermes Agent - select The Grid as a custom AI provider. - No local model download. No GPU setup. The request goes through The Grid. - The Hermes Agent ran locally, but the AI calls went through The Grid. 🧵 1.

Summary: The Grid has launched a new LLM inference platform that replaces traditional fixed rates with live supply-and-demand market pricing. It tiers work by difficulty: simple tasks (such as classification) use "standard", routine production tasks (such as RAG and agent steps) use "prime", and hard tasks (such as long-context reasoning) use "max". The app sends requests to The Grid, and the platform automatically matches the cheapest supplier currently available for that tier. Developers still use a single API, but the backend model can switch dynamically. New accounts get their first 200 million tokens free. The post uses a Hermes Agent integration to show how support tickets are handled through the "agent-prime" tier.
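The routing rule described in this post (pick a tier, then send the request to the cheapest supplier that currently qualifies) can be sketched as below. This is an illustration only, not The Grid's API or implementation: the `Supplier` type, the supplier names and prices, and the assumption that a supplier qualified for a higher tier also qualifies for lower ones are all hypothetical.

```python
from dataclasses import dataclass

# Tier names come from the post; the numeric ordering is an assumption.
TIER_RANK = {"standard": 0, "prime": 1, "max": 2}


@dataclass(frozen=True)
class Supplier:
    name: str
    max_tier: str           # highest tier this supplier qualifies for
    price_per_mtok: float   # current price per million tokens (made-up units)
    available: bool = True


def route(tier: str, suppliers: list[Supplier]) -> Supplier:
    """Return the cheapest available supplier that qualifies for the tier."""
    need = TIER_RANK[tier]
    eligible = [
        s for s in suppliers
        if s.available and TIER_RANK[s.max_tier] >= need
    ]
    if not eligible:
        raise LookupError(f"no supplier currently qualifies for tier {tier!r}")
    return min(eligible, key=lambda s: s.price_per_mtok)


# Made-up suppliers for illustration.
suppliers = [
    Supplier("alpha", "standard", 0.10),
    Supplier("beta", "prime", 0.40),
    Supplier("gamma", "max", 1.50),
    Supplier("delta", "prime", 0.30, available=False),
]

print(route("standard", suppliers).name)  # alpha
print(route("prime", suppliers).name)     # beta (delta is cheaper but unavailable)
print(route("max", suppliers).name)       # gamma
```

Because the choice is recomputed per request, a price change or outage at one supplier shifts traffic without any change on the caller's side, which is the property the post emphasizes.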