有条件一定要用最好的AI大模型Claude opus 4.7！！！这个印度开发老哥把Claude代码功能讲的太细了🤩🤩🤩 中文字幕版已做好，兄弟们请查收！每个程序员和AI玩家都应该知道的12个Claude代码功能： - CLAUDE.md - Permissions - Plan Mode - Checkpoints - Skills - Hooks - MCP - Plugins - Context - Slash Commands - Compaction - Subagents 如果担心 Claude 封号，不建议别用中转站，我的解决方案是聚合平台，目前在用 Zenmux，亲测安全稳定好用，国内不用梯子都可以直连，所有最新的大模型都是发布当天就上̋(ˊ•͈ꇴ•͈ˋ)

译一位印度开发者详细介绍了Claude的12个关键代码功能，包括CLAUDE.md、Plan Mode、MCP等，并建议开发者使用AI模型聚合平台。针对Claude可能封号的风险，推荐使用Zenmux平台，该平台集成了包括Claude Opus、GPT-5.4和DeepSeek V4 Pro在内的多种最新大模型，国内可直连。平台提供PK对比模式、保险赔付机制和详细的可观测性工具。特别指出，DeepSeek V4 Pro在Zenmux上目前有免费额度，经测试能处理大部分Claude的工作流，建议用户自行测试以进行模型选型。

阿绎 AYi@AYi_AInotes · 4月26日58

说个暴论，现在90%的AI Agent记忆，全都是假的。我之前也踩过这个坑，把所有历史记录决策日志全堆进Markdown文件里，以为这就是给Agent加了长期记忆，结果用了两周就崩了，同一个事实有三个互相矛盾的版本，上个月的偏好和昨天的权重一模一样，每次调用都把所有东西一股脑塞进上下文，慢到离谱还经常串台，直到看到这篇文章才恍然大悟，原来我根本不是在做记忆，只是在把Prompt当RAM用🌚 真正的记忆不是堆文件，应该是图和节点加嵌入加遍历， Markdown方案有四个根本解决不了的硬伤，没有去重，没有衰减，没有排名，超过一百条记录直接变成性能杀手，它只能记住你写过什么，永远记不住这件事和那件事有什么关系，这个决策为什么被否决，上次遇到同样的bug我们是怎么解决的。向量检索也不行，它只能告诉你这两段话长得像，不能告诉你它们之间的因果关系，只有图遍历能做到，它能像人脑一样，从一个节点牵出一整条相关的记忆链，重要的事情越来越清晰，过时的信息自动淡化，矛盾的内容在写入时就被解决。现在所有生产级的Agent框架，Zep Cognee Mem0，全都是基于图的， Neo4j已经把图记忆做成了标准的MCP工具， Claude Code超过二十万行代码之后，纯上下文窗口早就没戏了，真正能让它像高级工程师一样思考的，是把不变的规则放在CLAUDE.md里，把所有演化的状态全部存在图里，动态检索按需拉取。很多人还在卷一百万两千万的上下文窗口，以为越大越好，但生产环境里真正致命的，永远是跨会话的记忆漂移和上下文污染，内存架构的升级已经不是锦上添花了，能不能把Agent真正用起来才是关键的生死线。

译作者指出，当前多数AI Agent将历史记录堆砌成Markdown文件充当“记忆”的方案实为将Prompt当RAM用，存在无法去重、衰减、排名及性能低下等根本缺陷。真正的长期记忆应基于图结构，通过节点、嵌入和遍历来建立关联与因果关系，实现记忆的链式提取与动态管理。主流生产级框架已转向图记忆。随着应用规模扩大，仅扩展上下文窗口无法解决记忆漂移和污染问题，动态图记忆架构是Agent能否投入实际应用的关键。

SemiAnalysis@SemiAnalysis_ · 4月26日36

DAVIS, APRIL 25, 2026 — InferenceX has added DeepSeekv4 for @vllm_project 's day 0 support for GB200 disagg! Great work to @flowpow123 @rogerw0108 @NVIDIAAIDev @inferact for the fast support and engineering!

译DAVIS, 2026年4月25日 — InferenceX 已为 @vllm_project 添加了 DeepSeekv4，以支持 GB200 分解的 day 0 支持！感谢 @flowpow123 @rogerw0108 @NVIDIAAIDev @inferact 的快速支持和工程工作！

elvis@omarsar0 · 4月26日63

NEW paper from Microsoft. This is an important read. (bookmark it) The work introduces DELEGATE-52, a benchmark simulating long document-editing workflows across 52 professional domains like coding, crystallography, and music notation. Across 19 tested models, even frontier ones (Gemini 3.1 Pro, Claude 4.6 Opus, GPT-5.4) corrupted an average of 25% of document content by the end of long workflows. Agentic tool use didn't help. Lots of other insights in this one. Check it out below... Paper: https://arxiv.org/abs/2604.15597 Learn to build effective AI agents in our academy: https://academy.dair.ai/

译微软新论文引入DELEGATE-52基准，模拟52个专业领域的长文档编辑工作流。测试19个模型，包括Gemini 3.1 Pro、Claude 4.6 Opus和GPT-5.4等前沿模型，发现在长工作流结束时平均损坏25%的文档内容。代理工具使用未能改善表现。论文还提供了其他相关见解。

SemiAnalysis@SemiAnalysis_ · 4月26日50

Our GB300 cluster went down yesterday, just as Deepseek released 😱 We were 😥 but @CoreWeave came through to contribute to the Open Source. They scrambled in the compute crisis, finding 2 spare dev racks of GB300 Our team is running Deepseekv4 now for InferenceX benchmarks!

译在DeepSeek发布的关键时刻，原GB300集群意外宕机。云服务商CoreWeave在计算资源危机中紧急调配，找到了两套备用的GB300开发机架，使团队得以顺利运行DeepSeek-V4进行InferenceX基准测试。据引用推文，InferenceX已实现对DeepSeek-V4的Day 0支持，并利用Blackwell B300获得了相比Hopper架构5倍的性能提升。目前，InferenceX团队正全力扩展对更多新硬件架构的即时支持。

DeepSeek@deepseek_ai · 4月25日60

🔥DeepSeek-V4-Pro API is 75% OFF until May 5th, 2026, 15:59 (UTC Time)! Don't miss out on this massive discount. 🛠️Integration Updates: 🔹Claude Code: Set model to deepseek-v4-pro[1m] to unlock 1M context! 🔹OpenCode: Update to v1.14.24+ 🔹OpenClaw: Update to v2026.4.24+ Check the latest official API docs for full details: https://api-docs.deepseek.com/quick_start/pricing

译🔥DeepSeek-V4-Pro API 限时75折优惠，截止至2026年5月5日15:59（UTC时间）！切勿错过此次大幅折扣。 🛠️集成更新： 🔹Claude Code：将模型设置为 deepseek-v4-pro[1m] 即可解锁100万上下文！ 🔹OpenCode：请更新至 v1.14.24+ 🔹OpenClaw：请更新至 v2026.4.24+ 查看最新官方API文档获取完整详情：https://api-docs.deepseek.com/quick_start/pricing

meng shao@shao__meng · 4月25日60

Obscura 是一个用 Rust 从头编写的 headless browser 引擎，专为 AI Agent 自动化和规模化网络爬取设计，主要特性：独立引擎 + 原生反检测 + CDP 兼容 + 极致轻量 Obscura 精准切中了两个高价值场景的交集：AI Agent 的网页感知与大规模反检测爬取，性能对比非常激进。维度 Obscura Headless Chrome 内存占用 30 MB 200+ MB 二进制 70 MB 300+ MB 页面加载 85 ms ~500 ms 启动时间即时 ~2 s 架构拆解：六层 Crate 的精密分工 · obscura-dom：HTML 解析、DOM 树、CSS 选择器 · obscura-net：HTTP 客户端、Cookie、拦截器、robots.txt · obscura-js：V8 集成、JS 运行时、DOM API 绑定 · obscura-browser：页面生命周期、浏览上下文、导航管理 · obscura-cdp：Chrome DevTools Protocol 兼容层 · obscura-cli：命令行入口、Worker 进程管理 Obscura 没有重写整个浏览器，它复用了 Servo 的 DOM 组件和 Google V8 的 JS 引擎，在此基础上构建独立的网络层和 CDP 兼容层。这是一种务实的"站在巨人肩膀上"的策略。三大技术亮点 1. 深度反检测（Stealth Mode）项目最具竞争力的特性。编译时启用 --features stealth 后，它在三个层面进行伪装： · 指纹层面：每会话随机化 GPU、屏幕分辨率、Canvas、Audio、Battery 指纹；模拟 navigator.userAgentData 高熵值；将 navigator.webdriver 设为 undefined · 行为层面：分派事件的 event.isTrusted = true；原生函数 toString() 返回 [native code]；隐藏内部属性使 Object.keys(window) 安全；Shadow DOM polyfill 兼容 Cloudflare Turnstile · 网络层面：拦截 3520 个追踪/广告/遥测域名，阻止指纹脚本加载 2. 生态兼容策略项目选择了"兼容而非对抗"的聪明路径：完整实现 CDP 的子集，使现有基于 Puppeteer/Playwright 的脚本可以零成本迁移，只需将 browserWSEndpoint 指向 ws://127.0.0.1:9222。这极大降低了采用门槛。 3. 为 AI Agent 优化的专属功能 · 内置 LP Domain：DOM-to-Markdown 转换，直接输出适合 LLM 消费的结构化文本 · 85ms 页面加载意味着 Agent 的感知-行动循环更紧凑 · 轻量特性使其适合作为 Agent 的常驻感知端点开源地址： https://github.com/h4ckf0r0day/obscura

译Obscura是一个用Rust编写的轻量级无头浏览器引擎，专为AI Agent自动化和大规模网络爬取优化。其核心优势在于极致的性能与资源效率，内存占用仅30MB，页面加载约85毫秒，远超Headless Chrome。项目采用务实架构，复用Servo的DOM与V8引擎，并构建独立网络层。关键特性包括深度反检测的“隐身模式”，能随机化指纹并拦截追踪域名；通过兼容Chrome DevTools Protocol，实现与Puppeteer/Playwright生态无缝对接。此外，它内置了DOM转Markdown等专为AI Agent优化的功能，旨在作为高效的常驻网页感知端点。

阿绎 AYi@AYi_AInotes · 4月25日50

兄弟们，做App最痛苦的部分终于被AI干掉了， Anything刚刚上线了一个功能，一键生成设计师级别的App Store截图， 15秒，从空白到4张完美适配规格的上架图，连App图标评分Get按钮都给你做好了，改文案只需要输一句话，点一下生成就完事，以前做过App的都懂，代码写完只是开始，做截图能把人逼疯，要找mockup，要调配色，要写卖点，还要适配十几个尺寸，要么花几百刀请设计师，要么自己抠三天Figma，现在这些全没了，我之前觉得AI写代码已经够离谱了，现在才发现，AI真正厉害的地方，是把那些没人愿意干的脏活累活全给你包了，从idea到上架的全链路，现在几乎没有任何门槛了，当然最后还是需要你的审美做最后把关，但这已经帮你省了90%的力气。也许这就是AI的魅力和价值所在吧😄

译工具Anything推出新功能，能一键生成设计师级别的App Store截图。用户仅需15秒即可从空白状态获得4张完美适配官方规格的截图，系统会自动生成包含图标、评分和下载按钮的完整画面。修改文案也只需输入一句话并点击生成。此举解决了应用开发中制作和适配多尺寸截图的传统痛点，该过程以往需耗费数百美元聘请设计师或投入大量时间自行设计。AI正将开发者从繁琐的“脏活累活”中解放出来，大幅降低了从创意到应用上架全流程的门槛，尽管最终审美把关仍需人工，但已节省约90%的精力。

Chubby♨️@kimmonismus · 4月25日46

GLM-5.1 is now on BytePlus's Coding Plan — and the case is straightforward: Opus-class performance, 8-hour autonomous task loops, works natively in Cursor and Claude Code, 6 top models with smart routing. All at roughly 5x lower cost than http://Z.ai official pricing. Hard to ignore.

译GLM-5.1现已登陆BytePlus的Coding Plan——情况很简单：Opus级别的性能，8小时自主任务循环，原生支持Cursor和Claude Code，6个顶级模型配备智能路由。所有这一切的成本大约比http://Z.ai官方定价低5倍。难以忽视。

阿绎 AYi@AYi_AInotes · 4月25日67

写长篇小说的兄弟们，autonovel 来了，200+tokens/s 极速生成，几十分钟就能出百万字长稿。写过长篇的都懂，最折磨人的不是没灵感，是写着写着上下文崩了，伏笔忘了，人物突然 OOC，熬几个月才磨出几十万字。 autonovel 基于最新的 Ling-2.6-flash，专门针对长篇写作做了深度优化，从世界观设定角色构建大纲生成到正文精修全流程打通。最狠的是它的上下文一致性和剧情推演能力，埋的伏笔能自己回收，人物性格全程在线，再也不用翻前面几百章找自己写过的设定。 200+tokens/s 的生成速度是真的离谱，手指刚离开键盘，屏幕上的字还在往上滚，喝杯水的功夫几千字就出来了。以前写百万字初稿要熬大半年，现在几十分钟就能出完整框架，你只需要负责调整方向和打磨细节。这才是 AI 真正能解放创作者生产力的地方，把你从重复的码字劳动里拽出来，专心去想真正值钱的故事。感兴趣的兄弟评论区自取链接，趁现在刚上线还有免费额度可以体验。 #autonovel #Ling26flash #AI写作 #长篇小说 #网文写作

译autonovel是基于Ling-2.6-flash的AI写作工具，专为长篇小说创作优化。它能以200+ tokens/s的速度生成文本，大幅提升创作效率，并在上下文一致性、伏笔回收和人物性格保持方面表现突出，帮助作者从繁琐的码字劳动中解放，更专注于故事构思。目前提供免费体验额度。

Rohan Paul@rohanpaul_ai · 4月25日40

AI model choice is becoming a backend problem. Thats why I like the idea of treating models like swappable infrastructure: one gateway, many model backends, less provider-specific glue code. AI/ML API gives me one OpenAI-compatible endpoint and it routes to 400+ models (chat, vision, video, audio, music, 3D, etc.) from OpenAI, Anthropic, Google, MiniMax, Alibaba, and others. One endpoint for reasoning, vision, image, video, voice, embeddings, etc.

译AI模型选择日益成为一个后端基础设施问题，其核心解决方案是通过统一网关将模型视为可互换组件。AI/ML API提供了一个OpenAI兼容的单一端点，能将请求路由至OpenAI、Anthropic、Google、MiniMax、Alibaba等提供的400多个模型，覆盖对话、视觉、视频、音频、3D等多种类型。这种方法显著减少了针对特定供应商的粘合代码，实现了推理、图像、语音等多功能统一接入。相关推文证实，GPT-5.5 API已通过该平台实时上线，体现了其敏捷性。

阿绎 AYi@AYi_AInotes · 4月25日49

为什么所有大厂都在疯了一样自研芯片？看这篇就够了，Amazon官方的这篇文章把整个AI行业接下来十年的硬件路线讲得明明白白， @amazon 的方案很简单，用ARM架构的CPU核心干调度和逻辑的普通活儿，用自家定制的Tensor核心干矩阵乘法的重体力活，两者焊在同一块芯片上，效率直接拉满，有意思的是，@Tesla 的Dojo芯片，Tenstorrent的新一代AI芯片，用的都是一模一样的思路，只是CPU部分分别选了ARM和RISC-V，以前大家觉得AI就是拼GPU，现在才发现，其实真正的胜负手在混合架构，也就是谁能把每美元每瓦特的算力做到最高，谁就能在云服务和模型训练上形成降维打击，这也是为什么所有大厂都在疯了一样自研芯片，其实根本就不是省钱的问题，本质上是大家都在抢AI时代的生产工具。

译大厂自研芯片源于AI硬件根本变革：从GPU比拼转向混合架构竞争。Amazon、Tesla等将通用CPU核心与定制Tensor核心集成，以最大化每美元每瓦特的算力效率，在云服务与模型训练中形成优势。这不仅是成本问题，更是争夺AI时代生产工具的战略举措。

SemiAnalysis@SemiAnalysis_ · 4月25日49

What is NanoFlex Pro? At TSMC's North America Technology Symposium, Dr. Lu shed light on NanoFlex Pro in their A14 node. While N2 NanoFlex used some double height "Merged OD" tall cells to boost performance, the offset requirements from the alternating wells in modern standard cell layouts meant there were many unusable "half-cell" gaps between the tall and short cell boundaries. With A14 NanoFlex Pro, the tall cells are now only 1.5x the height of the short cell, so two tall cells can fit neatly into the height of three short cells. This eliminates some of the gaps between cell types, increasing layout density and OD efficiency (ratio of nanosheet width to cell height). NanoFlex Pro is also coming to the new N2U process in 2028 as an additional option for better performance and 2-3% logic density gain.

译在台积电北美技术研讨会上，卢博士介绍了A14节点中的NanoFlex Pro技术。相较于N2节点的NanoFlex技术使用双倍高度的“Merged OD”高单元来提升性能，但会因现代标准单元布局中的交替阱偏移要求而产生不可用的“半单元”间隙。A14的NanoFlex Pro将高单元高度降至短单元的1.5倍，使得两个高单元恰好能放入三个短单元的高度，从而消除了部分单元类型间的间隙，提高了布局密度和OD效率。该技术也将作为可选方案于2028年应用于新的N2U工艺，以提供更好性能和2-3%的逻辑密度增益。

Chubby♨️@kimmonismus · 4月24日49

1m Standard and ultra high context efficiency is what me excites me

译1m 标准与超高上下文效率是让我兴奋之处

小互@xiaohu · 4月24日56

OpenAI 刚发的 Workspace Agent，开源版来了 · 可任意模型，Claude / GPT / Gemini / Kimi / DeepSeek 都能接 · 可在自己服务器上跑，最低 €4/月 · 每个会话有独立 Docker 沙箱 · 每个终端用户凭证隔离 · 子 agent 调用全程可观测，不是黑盒它能帮你做这些事： · 给公司团队搭一套 AI Agent 服务，模型随便换，不被 Claude 或 GPT 锁死 · 给 SaaS 产品加 AI 助手，每个用户各自登录各自的账号不串号 · 做 Telegram、Discord AI 机器人，自带 Telegram 适配器 · 跑企业内部受控 Agent，可限制只能访问指定 API，不能乱出公网 · 每个会话独立运行，一个崩了不影响其他

译开源项目 openclaw-managed-agents 提供了类似 OpenAI Workspace Agent 的功能，核心特点是支持接入任意大模型（如 Claude、GPT、Gemini 等）并可自托管于自有服务器，成本可低至每月4欧元。其采用独立 Docker 沙箱架构，确保每个用户会话隔离运行，实现凭证安全与互不影响，且子 agent 调用过程全程可观测。该方案适用于为企业搭建可灵活切换模型的 AI Agent 服务、为 SaaS 产品添加隔离的 AI 助手、构建社交平台机器人或运行内部受控、仅能访问指定 API 的安全 Agent。

SemiAnalysis@SemiAnalysis_ · 4月23日

NVIDIA knows more about what its customers need than anyone else. They hear the asks directly. That is why disaggregated inference is the future, and why the LPU actually surpasses the GPU in certain parts of the pipeline.

译NVIDIA 比任何人都更了解其客户的需求。他们直接听到这些需求。这就是为什么解耦推理是未来，以及为什么 LPU 实际上在流水线的某些部分超越了 GPU。

Sundar Pichai@sundarpichai · 4月23日

TPU 8t, optimized for training and TPU 8i, optimized for inference. Looking good!

译TPU 8t 针对训练优化，TPU 8i 针对推理优化。看起来不错！

Chubby♨️@kimmonismus · 4月22日

The Manhattan Project is a joke compared to the expansion of data centers. Let's hope that chip production continues despite the war in Iran.

译与数据中心的扩张相比，曼哈顿计划简直是个笑话。但愿伊朗战争不会中断芯片生产。

Google DeepMind@GoogleDeepMind · 4月22日

Only 25% of organizations have moved AI into production at scale. We’re working to change that. 🛠️ @Accenture, @BainandCompany, @BCG, @Deloitte, and @McKinsey are combining our research with their expertise to bring AI innovation to more industries responsibly. 🤝 Find out more → https://goo.gle/42kvkz1

译仅有 25% 的组织已将 AI 大规模投入生产。我们正致力于改变这一现状。🛠️ @Accenture、@BainandCompany、@BCG、@Deloitte 和 @McKinsey 正将我们的研究与他们的专业知识相结合，以负责任的方式将 AI 创新带给更多行业。🤝 了解更多 → https://goo.gle/42kvkz1

Rohan Paul@rohanpaul_ai · 4月22日

AI demand is growing fast. Google Cloud now processes 16 billion+ tokens per minute via direct API use by their customers, up from 10 billion last quarter.

译AI 需求快速增长。 Google Cloud 目前通过客户直接调用 API，每分钟处理 16 billion+ tokens，而上季度为 10 billion。

SemiAnalysis@SemiAnalysis_ · 4月22日

With the new Vera Rubin rack, one can generate AI videos of Toy Jensen giving an dance tutorial faster than before. Video generation inferencing is one of the most compute bound workloads out there.

译使用新的 Vera Rubin 机架，可以比以往更快地生成 Toy Jensen 舞蹈教程的 AI 视频。视频生成推理是最受计算限制的工作负载之一。

ClaudeDevs@ClaudeDevs · 4月22日

Caching is critical for customers to lower both costs and TTFT. We’re launching a new dashboard in Claude Developer Console to increase visibility and help customers optimize their usage. Check it out here: http://platform.claude.com/usage/cache

译缓存对于客户降低成本和 TTFT 至关重要。我们在 Claude Developer Console 推出了新的仪表板，以提高可见性并帮助客户优化使用。在此查看：http://platform.claude.com/usage/cache

Rohan Paul@rohanpaul_ai · 4月22日

Opik just launched Test Suites, a way to turn real agent traces into regression tests so teams can catch behavior breakage before shipping changes. The problem is that agent failures are rarely one clean bug, because fixing one answer style, tool call, or retrieval path can quietly hurt other users. Opik’s approach is to treat a bad production trace as the test case, then attach a human-written assertion that states the behavior you actually want. That matters because agent quality is usually fuzzy, so a check like “3 sentences or fewer” is often more useful than a pass-fail unit test. The workflow is simple: find a failure, write the assertion, save both in a Test Suite, then run that suite in CI every time the agent changes. This could give agent teams something they badly need: a repeatable way to improve behavior with evidence from real usage instead of gut feel. The important move is the assertion layer. A rule like “the response is concise” or “the agent asks a follow-up before acting” is closer to how teams actually judge agent quality than a pass-fail string match, and that matters because most regressions are behavioral, not lexical.

译Opik发布Test Suites功能，将生产环境中的真实失败trace转化为回归测试。通过人工编写assertion（如"回复简洁"或"先询问再行动"）定义期望行为，而非简单字符串匹配。团队可将测试集成至CI流程，在代码变更时自动检测行为退化。这种方法让AI代理质量评估从主观直觉转向基于真实证据的可重复验证，避免修复单问题时意外破坏其他场景。

SemiAnalysis@SemiAnalysis_ · 4月22日

At OFC 2026 last month, Cisco's chief architect Rakesh Chopra presented on scale-across networking architectures and key deployment trends driving strong demand for traditional DCI equipment. "Traditional" DCI connects CPUs across the frontend network while scale-across connects GPUs over the back-end to enable loss-intolerant, synchronous data flows. In scale-across networking, hyperscalers manage oversubscription of intra-datacenter bandwidth relative to inter-datacenter bandwidth using deep switch buffers and proactive congestion control. The bandwidth needs of scale-across is approximately 14x the bandwidth needs of traditional DCI. As such, significant buildout of scale-across infrastructure at various hyperscalers is expected to result in multi-billion dollar opportunities for 800G coherent pluggables, deep buffered switches and the like. SemiAnalysis's AI Networking Model will initiate estimates of scale across networking equipment spend at various hyperscalers, coming soon.

译Cisco首席架构师在OFC 2026提出scale-across网络架构，与传统DCI连接CPU的前端网络不同，scale-across通过后端网络连接GPU，支持无损同步数据流。超大规模数据中心采用深度缓冲交换机和主动拥塞控制管理带宽超配，其带宽需求约为传统DCI的14倍。这将带动800G相干可插拔光模块、深度缓冲交换机等数十亿美元市场机会，SemiAnalysis即将发布相关支出预测模型。

SemiAnalysis@SemiAnalysis_ · 4月21日

Right now, InferenceX benchmarks are showing the worst these chips will actually perform. No prefix caching, no multi-turn, all random data. The real gains haven't even been measured yet.

译目前，InferenceX 基准测试显示的是这些芯片的实际最差性能。无前缀缓存，无多轮对话，全为随机数据。真正的提升甚至尚未测量。

SemiAnalysis@SemiAnalysis_ · 4月21日

Positron shipped their first AI chip in 18 months and landed Oracle in under 3 years. chip startup to oracle customer in 3 years. most take way longer than that. #startup #chips #oracle #ai #tech #entrepreneur

译Positron 在 18 个月内出货了他们的首款 AI 芯片，并在不到 3 年内拿下了 Oracle。芯片初创公司到 oracle 客户仅用 3 年。大多数公司需要比这长得多的时间。 #startup #chips #oracle #ai #tech #entrepreneur

Anthropic@AnthropicAI · 4月21日

We're expanding our collaboration with Amazon to secure up to 5 gigawatts of compute for training and deploying Claude. Capacity begins coming online this quarter, with nearly 1 gigawatt expected by the end of 2026.

译我们正在扩大与 Amazon 的合作，以确保获得高达 5 吉瓦的算力用于训练和部署 Claude。算力容量本季度开始上线，预计到 2026 年底将有近 1 吉瓦。

SemiAnalysis@SemiAnalysis_ · 4月20日55

How Much Do GPU Clusters Really Cost? Calculating Cluster TCO, The Real Impact of Downtime, The Grand Unifying Theory Of Goodput, and a ClusterMAX 2.1 Update READ NOW: https://newsletter.semianalysis.com/p/how-much-do-gpu-clusters-really-cost?_gl=1*1uithfa*_ga*MTY1NDExMjk2Ny4xNzc2MTIzOTQ1*_ga_FKWNM9FBZ3*czE3NzY2OTU2ODAkbzEyJGcwJHQxNzc2Njk1NjgwJGo2MCRsMCRoMTAyODIzNDQ0OA..

译GPU集群的真实成本究竟是多少？计算集群总拥有成本，停机时间的真实影响，有效吞吐量的宏大统一理论，以及ClusterMAX 2.1更新立即阅读：https://newsletter.semianalysis.com/p/how-much-do-gpu-clusters-really-cost?_gl=1*1uithfa*_ga*MTY1NDExMjk2Ny4xNzc2MTIzOTQ1*_ga_FKWNM9FBZ3*czE3NzY2OTU2ODAkbzEyJGcwJHQxNzc2Njk1NjgwJGo2MCRsMCRoMTAyODIzNDQ0OA..

Chubby♨️@kimmonismus · 4月20日

Google is reportedly in talks with Marvell Technology to co-develop two new AI chips, including a memory processing unit designed to pair with Google’s TPUs and a new TPU optimized specifically for running AI models. The move underscores Google's broader effort to strengthen its hardware stack and position TPUs as a more serious alternative to Nvidia’s dominant GPUs.

译据报道，Google 正在与 Marvell Technology 洽谈共同开发两款新的 AI 芯片，包括一款旨在与 Google TPUs 配对的内存处理单元，以及一款专为运行 AI 模型而优化的新型 TPU。此举凸显了 Google 加强其硬件堆栈并将 TPUs 定位为 Nvidia 主导 GPUs 的更有力替代品的更广泛努力。

Rohan Paul@rohanpaul_ai · 4月19日

Big claim in this paper. "Prefill-as-a-Service" Prefill, the heaviest part of inference, may finally be portable. Long-context AI is no longer trapped inside a single datacenter. Shows how to run LLM prefill on remote clusters by sending much smaller saved prompt state. So long-prompt work can be done on remote machines and sending back only the smaller saved state needed to answer. The breakthrough is not sending everything farther, but sending the right requests farther. --- When you ask a model a long question, it first has to read and digest the whole prompt before it starts answering. That first step is called prefill, and it is brutally compute-heavy. The second step is decode, where the model generates tokens one by one, and that part is more about memory bandwidth than raw compute. But moving the saved prompt state between those phases is usually so data-heavy that both parts must stay in the same tightly connected cluster. So Until now, those two steps usually had to stay close together inside the same fast network, because prefill creates a huge blob of temporary memory called KVCache that had to be moved quickly to the decode machine. That is the bottleneck. What changed is model design. Newer hybrid-attention models produce much smaller KVCache than older dense-attention models, so shipping that state across ordinary datacenter links starts to become practical instead of absurd. The paper’s idea is a Prefill-as-a-Service setup that sends only long, uncached prompts to a remote prefill cluster, then ships back the saved prompt state, called KV cache, over normal Ethernet while short requests stay local. This works mainly because newer hybrid-attention models create far less KV cache than older dense models, and the system adds smart routing, bandwidth-aware scheduling, and cache-aware placement so the network does not clog up. The authors test this with an internal 1T-parameter hybrid model on a mixed setup that uses H200 GPUs for remote prefill and H20 GPUs for local decode. With a routing threshold near 19.4K tokens, about 50% of requests go remote, average cross-cluster traffic is only 13Gbps on a 100Gbps link, and throughput rises 54% over a local-only baseline and 32% over a naive heterogeneous setup. The real point is that smaller KV cache alone was not enough, but paired with selective offloading and scheduling it makes cross-datacenter LLM serving workable, more flexible, and easier to scale across different hardware. ---- Paper Link – arxiv. org/abs/2604.15039v1 Paper Title: "Prefill-as-a-Service: KVCache of Next-Generation Models Could Go Cross-Datacenter"

译新一代混合注意力模型通过压缩KV Cache，使Prefill-as-a-Service架构成为可能。该方案将重计算的Prefill阶段卸载至远程集群，仅回传轻量KV Cache至本地解码，短请求则本地处理。配合智能路由与带宽感知调度，可在普通以太网高效传输。实测1T参数模型显示，50%请求远程处理时跨集群流量仅13Gbps，吞吐量提升54%，打破长上下文AI局限于单一数据中心的瓶颈。

SemiAnalysis@SemiAnalysis_ · 4月19日

Positron AI wants to run 16 trillion parameter models on a single server.

译Positron AI 想要在单台服务器上运行 16 万亿参数模型。

SemiAnalysis@SemiAnalysis_ · 4月19日

At GTC 2024, Jensen said that GB200 NVL72 was 35x faster than Hopper. Nobody believed it and thought it was classic fake Jensen Math. When we tested the performance of it, it wasn't just 35x faster, it was over 50x times faster even against an strong Hopper baseline with all of the inference optimization composed together like MTP, Disagg prefill, wideEP, etc. View the nuanced results at InferenceX dot com.

译在 GTC 2024 上，Jensen 表示 GB200 NVL72 比 Hopper 快 35 倍。没人相信，认为这是经典的 fake Jensen Math。当我们测试其性能时，它不仅快了 35 倍，即使面对采用了 MTP、Disagg prefill、wideEP 等所有推理优化组合的强大 Hopper 基线，也快了 50 倍以上。在 InferenceX.com 查看详细结果。

OpenAI Developers@OpenAIDevs · 4月19日

Agents that run code need a controlled workspace ready when work starts. @modal shares why scale matters for long-running agents built with the Agents SDK.

译运行代码的 Agents 需要在工作开始时准备好受控的工作空间。 @modal 分享了为何规模对使用 Agents SDK 构建的长时间运行 Agents 至关重要。

Chubby♨️@kimmonismus · 4月18日

Meta layoffs investors had been bracing for are coming, with roughly 8,000 jobs cut starting May 20, about 10% of its 79,000-person workforce. Mainly to free up billions for AI infrastructure, shifting resources from payroll to data centers, chips, and advanced models as highlighted by Mark Zuckerberg.

译Meta 投资者一直担心的裁员即将到来，约 8,000 个岗位将从 5 月 20 日开始裁撤，约占其 79,000 名员工总数的 10%。主要是为了腾出数十亿美元用于 AI 基础设施，将资源从人力成本转向数据中心、芯片和先进模型，正如 Mark Zuckerberg 所强调的那样。

Epoch AI@EpochAIResearch · 4月18日

In 2025, OpenAI announced Stargate, a $500 billion data center initiative. We surveyed all 7 US sites and found visible development at each. There's a long road ahead, but the project appears on track to reach 9+ GW by 2029—comparable to New York City's peak power demand. 🧵

译2025年，OpenAI 宣布了 Stargate，一项 5000 亿美元的数据中心计划。我们调查了全部 7 个美国站点，发现每个都有可见的进展。前路漫漫，但该项目似乎有望在 2029 年达到 9+ GW——相当于纽约市的峰值电力需求。🧵

Greg Brockman@gdb · 4月18日

Stargate is a step towards meeting the demand of the compute-powered economy

译Stargate 是迈向满足算力驱动型经济需求的一步。

Chubby♨️@kimmonismus · 4月18日

Even inflation-adjusted, annual global datacenter CapEx today is roughly equivalent to 5–7 Manhattan Projects per year (≈$250–300B vs. ≈$25–30B in today’s dollars for the Manhattan Project).

译即使经过通胀调整，如今全球年度数据中心资本支出大致相当于每年 5–7 个 Manhattan Project（约 2500–3000 亿美元，而 Manhattan Project 按今日美元计算约为 250–300 亿美元）。

SemiAnalysis@SemiAnalysis_ · 4月17日

At SemiAnalysis, we're quite tired of the minimalist style webapps and landing pages that have become so commonplace lately. Today, we're introducing Minecraft mode on inferencex dot com so that you can escape back to your childhood while learning about the latest accelerator performance.

译在SemiAnalysis，我们已厌倦近来随处可见的极简风格网页应用和落地页。今天，我们在inferencex dot com推出Minecraft模式，让你在了解最新加速器性能的同时，逃回童年。

karminski-牙医@karminski3 · 4月17日

Qwen3.6-35B-A3B 2bit 量化都这么猛吗? Unsloth 团队(当然他们只有哥俩)刚光速放出了量化版本的 Qwen3.6-35B-A3B, 然后他们做这个测试把我惊呆了... 2bit 能完成 30 多次工具调用??? 我是真不信的.. 因为我之前测 Qwen3.5-35B-A3B 8bit (mlx 格式哈) 大概只能 4-5 次工具调用就不行了, 大概只能做做整理邮件这种简单工作, 但凡让它整理完邮件做个统计记录到 Notion / Obsidian 上就炸了. 要知道 unsloth 的 2bit 动态量化这个模型只有12.3GB, 激活只有1G! 32G 的 Mac 可以轻松跑起来了. 我赶紧测一下试试, 稍后给大家带来实测效果. https://x.com/UnslothAI/status/2044858346948464743

译Unsloth团队发布Qwen3.6-35B-A3B 2bit动态量化版本，模型体积仅12.3GB且激活内存仅需1GB，可在32GB Mac上流畅运行。测试显示该版本支持30余次工具调用，相较之下前代Qwen3.5-35B-A3B的8bit版本仅能完成4-5次调用即出现性能衰减。这一突破意味着大模型在端侧设备上的实用性和多步骤任务处理能力获得显著提升。

Rohan Paul@rohanpaul_ai · 4月17日

FT: The White House is moving to give major US agencies access to a modified Anthropic Mythos model built to hunt dangerous software flaws before attackers find them. That makes Mythos useful for defense because a model that can find a weakness in an operating system, browser, or server can help patch it faster. Looks like Washington is treating AI for cyber defense as too strong to ignore and too dangerous to hand out without tight control. --- ft .com/content/c9f5b690-a10e-4c66-9245-017f8bfbc7b4

译白宫拟向主要联邦机构提供Anthropic Mythos模型，用于主动猎捕软件漏洞。该模型可在攻击者之前识别操作系统、浏览器及服务器中的安全缺陷，加速修复进程。此举体现美国政府将AI网络防御视为关键战略能力，既承认其不可替代的防御价值，又强调必须通过严格管控防止技术滥用。