👍 [Quoting @anemll]: anemll-profile 0.4.1 is out! To update: brew upgrade anemll/tap/anemll-profile New: ANE graph break analysis, JSON export, and an agent guide. Give this link to your agent: http://github.com/anemll/anemll-profile/blob/main/AGENTS.md Example: OCR ANE analysis from @mweinbach's auto-conversion package

Hao AI Lab@haoailab · Apr 10

(1/5) FP4 hardware is here, but 4-bit attention still kills model quality, blocking true end-to-end FP4 serving. To fix that, we propose Attn-QAT, the first systematic study of quantization-aware training for attention. The result: FP4 attention quality is comparable to BF16 attention with 1.1x–1.5x higher throughput than SageAttention3 on an RTX 5090 and 1.39x speedup over FlashAttention-4 on a B200. Blog: https://haoailab.com/blogs/attn-qat/ Code: https://github.com/hao-ai-lab/FastVideo/pull/1225 Checkpoints: https://huggingface.co/FastVideo/14B_qat_400

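Summary: FP4 hardware is now available, but 4-bit attention has long hurt model quality and blocked end-to-end FP4 deployment. The team proposes Attn-QAT, the first systematic study of quantization-aware training for attention. It brings FP4 attention quality to BF16 level, with 1.1x–1.5x higher throughput than SageAttention3 on an RTX 5090 and a 1.39x speedup over FlashAttention-4 on a B200.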
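
As a rough illustration of what quantization-aware training simulates in the forward pass, here is a minimal sketch of FP4 (E2M1) fake quantization in plain Python. It is not the Attn-QAT implementation: the per-block absmax scaling, the sample values, and the function name are assumptions made for illustration. In QAT the rounding step is typically paired with a straight-through estimator in the backward pass, which this sketch omits.

```python
# Magnitudes representable by an FP4 E2M1 value (sign is handled separately).
FP4_E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def fake_quant_fp4(block):
    """Quantize a block of floats to FP4 and dequantize back (hypothetical helper)."""
    absmax = max(abs(x) for x in block)
    if absmax == 0.0:
        return [0.0 for _ in block]
    scale = absmax / FP4_E2M1_GRID[-1]  # map the largest magnitude onto 6.0
    out = []
    for x in block:
        mag = abs(x) / scale
        # Round to the nearest representable FP4 magnitude.
        nearest = min(FP4_E2M1_GRID, key=lambda g: abs(g - mag))
        out.append((nearest if x >= 0 else -nearest) * scale)
    return out

scores = [0.1, -0.4, 1.2, 3.0]  # made-up attention logits
print(fake_quant_fp4(scores))
```

Small values collapse to zero and mid-range values snap to a coarse grid, which is the kind of error that training with quantization in the loop is meant to make the model tolerate.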

SemiAnalysis@SemiAnalysis_ · Apr 10

Nvidia published DWDP (Distributed Weight-Data Parallelism), a new inference parallelism strategy focused on prefill. It sounds slightly insane until you remember the target machine is GB200 NVL72. The core trade: spend more peer-GPU bandwidth so you spend less time waiting at collective barriers. (1/6) 🧵 https://arxiv.org/abs/2604.01621v1

SemiAnalysis@SemiAnalysis_ · Apr 9

YOUR PARENTS PAID FOR THE CUDA MOAT! The #1 contributor to the CUDA MOAT isn't the developers at NVIDIA, but the millions of developers outside of NVIDIA who invent new algorithms for CUDA like Flash Attention. For most of them, it started with a GeForce gaming GPU. NVIDIA is the only company that has a reasonably good developer stack on consumer-grade GPUs. As people grow up beyond playing CSGO & League of Legends & Minecraft, they either become anime weeaboos or they start programming on their existing computer, which has a GeForce GPU

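Summary: The CUDA moat was built mainly by millions of outside developers rather than by NVIDIA's own engineers; they invented algorithms such as Flash Attention on CUDA. Most of them started on GeForce gaming GPUs, because NVIDIA is the only company offering a solid developer stack on consumer GPUs. As gamers grow up, they start programming on the GeForce cards they already own, which creates a talent pipeline from gaming into AI development.

SemiAnalysis@SemiAnalysis_ · Apr 9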

Groq is one of the most interesting chip stories in AI. Nvidia paid $20B to license their IP and hire most of their team structured as a licensing deal rather than an acquisition to sidestep regulatory scrutiny. It closed in under 4 months. Here's why Nvidia wanted it so badly. (1/4)🧵

Peter Steinberger 🦞@steipete · Apr 9

Some folks try to spin a narrative that I don't like local models, meanwhile I spent a lot of time making it easy to use OpenClaw with them. Latest release adds support for inferrs, which is a new super efficient TurboQuant inference server: https://docs.openclaw.ai/providers/inferrs

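Summary: The latest OpenClaw release adds support for inferrs, an efficient inference server built on TurboQuant. The author pushes back on the claim that he dislikes local models, noting that he has long worked on making them easy to use.

SemiAnalysis@SemiAnalysis_ · Apr 9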

Cameron Quilici and Bryan Shan sit down to discuss InferenceX and the work happening at SemiAnalysis.

Jeff Dean@JeffDean · Apr 8

Hedged requests (apparently inspired by the Tail at Scale paper by myself and Luiz Barroso) applied within a single machine to replicating data across DRAM channels and issuing reads to all channels, using the one that comes back first. ~5-15X reduction in p99.99 read latency. https://github.com/LaurieWired/tailslayer/blob/main/README.md Cool stuff, @lauriewired! Accompanying video forwarded to me by a friend, which is how I learned about it: https://www.youtube.com/watch?v=QFi2WVGfXMQ

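Summary: Hedged requests, inspired by the Tail at Scale paper, are applied to multiple DRAM channels within a single machine: reads are issued to all channels concurrently and the fastest response is used, cutting p99.99 read latency by 5-15x. The tailslayer project that implements this is open source.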
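
The hedging idea described above can be sketched in a few lines of Python. This only illustrates the general pattern (issue the same read to every replica, keep whichever answer arrives first); it uses threads and simulated delays rather than DRAM channels, and `read_replica`, `hedged_read`, and the latency values are hypothetical, not taken from tailslayer.

```python
import concurrent.futures as cf
import time

def read_replica(replica_id, delay_s, value):
    """Simulate a read from one replica with a given latency (hypothetical)."""
    time.sleep(delay_s)
    return replica_id, value

def hedged_read(delays_s, value):
    """Send the same read to all replicas and return the first response."""
    pool = cf.ThreadPoolExecutor(max_workers=len(delays_s))
    futures = [pool.submit(read_replica, i, d, value) for i, d in enumerate(delays_s)]
    done, _ = cf.wait(futures, return_when=cf.FIRST_COMPLETED)
    # Do not block on the slow replicas; let them finish in the background.
    pool.shutdown(wait=False)
    return next(iter(done)).result()

# Replica 1 is fast this time; replica 0 hits a simulated tail-latency stall.
winner, data = hedged_read([0.5, 0.01], value=42)
print(winner, data)
```

The trade-off is the usual one for hedging: tail latency improves at the cost of storing the data more than once and spending bandwidth on reads whose results are discarded.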

SemiAnalysis@SemiAnalysis_ · Apr 8

From the GTC talk, the maintainers of NIXL said they are happy to accept RIXL patches into upstream, just like how they already accepted Trainium Neuron support patches & XPU patches into upstream. Happy to talk more in our Slack & connect you to the appropriate NIXL folks so that you don't need to maintain your second-class fork @KranenKyle. Maybe the NIXL folks that accept patches from other chip vendors into upstream can connect you to the flashinfer folks too.

Epoch AI@EpochAIResearch · Apr 8

Who owns the world's compute? Our new Chip Ownership hub shows that Google leads, holding around 25% of all compute sold since 2022.

Summary: The latest Chip Ownership data shows Google in the lead, holding about 25% of all compute sold worldwide since 2022.

SemiAnalysis@SemiAnalysis_ · Apr 8

NVIDIA STX is more than just a new storage device. It represents a redesign of how AI systems move, access, and manage data. Traditional storage architectures were built for reliable, large-scale data storage, but agentic AI and long-context inference require different capabilities. These systems need to retrieve data quickly, maintain context across multiple steps, and access information continuously during inference workflows. Under these conditions, conventional storage can become a bottleneck: increased latency, slow data transfer, and decreased GPU efficiency. STX aims to bridge this gap. Essentially, STX functions as a high-speed data layer positioned between GPUs and standard storage infrastructure. Its purpose is to bring data closer to computing resources, accelerate read/write operations, and reduce data movement overhead. This allows GPUs to spend less time waiting for data, enabling AI models to handle long contexts, multi-step reasoning, and real-time tasks more efficiently. STX is not just about improving storage performance but about optimizing the efficiency of the entire AI infrastructure. Future AI systems will be defined not only by raw compute power but also by how quickly data can be delivered, how well context can be maintained, and how effectively the inference pipeline is optimized.

Summary: NVIDIA STX is a high-speed data layer between GPUs and conventional storage, designed for agentic AI and long-context inference. By moving data closer to compute, it reduces latency and data-movement overhead and addresses the storage bottleneck in inference workflows. STX improves not only storage performance but the efficiency of the whole AI infrastructure, so GPUs can handle long contexts, multi-step reasoning, and real-time tasks efficiently. It signals that competition among future AI systems is shifting from raw compute toward data-delivery speed and inference-pipeline optimization.

SemiAnalysis@SemiAnalysis_ · Apr 7

NVIDIA SOFTWARE MOAT ALERT: the recently announced AWS Trainium <> Cerebras will still be using a small bit of NVIDIA software code. In order to transfer kvcache between prefill Trainium & decode Cerebras wafer, AWS will be using NVIDIA NIXL KVcache transfer agent along with EFA. They will RDMA over EFA from Trainium over to Cerebras's CPU host memory before the CPU host talks to the wafer via the wafer engine's FPGA.

AK@_akhaliq · Apr 7

gradio.Server: Any Custom Frontend with Gradio's Backend. Build entirely with your own frontend framework, like React, Svelte, or even plain HTML/JS, while still benefiting from Gradio's queuing system, API infrastructure, MCP support, and ZeroGPU on Spaces. Blog: https://huggingface.co/blog/introducing-gradio-server

Summary: gradio.Server lets developers build apps with any frontend framework, such as React, Svelte, or plain HTML/JS, while keeping Gradio backend capabilities including the queuing system, API infrastructure, MCP support, and ZeroGPU on Spaces.

Yuchen Jin@Yuchenj_UW · Apr 7

Crazy revenue growth at Anthropic. So they officially surpassed OpenAI’s $25B ARR reported a few days ago? The focus on coding models and enterprise clearly paid off. Once you’re locked into a year-long contract, switching to Codex isn’t easy. Claude Code shipping velocity is insane too, new feature every day. If they secure more GPUs and Google TPUs, this growth could accelerate even further.

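Summary: Anthropic's revenue growth is striking and may have surpassed OpenAI's $25B ARR. Its focus on coding models and enterprise has paid off, and long-term contracts make it hard for customers to switch to Codex. Claude Code ships extremely fast, with near-daily updates. Anthropic has also signed an agreement with Google and Broadcom securing multiple gigawatts of TPU capacity starting in 2027.

Anthropic@AnthropicAI · Apr 7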

We've signed an agreement with Google and Broadcom for multiple gigawatts of next-generation TPU capacity, coming online starting in 2027, to train and serve frontier Claude models.

SemiAnalysis@SemiAnalysis_ · Apr 7

PROFESSIONAL POWER EFFICIENCY ALERT: Rubin’s chip level TDP increases up to 2,300W vs 1000-1400W for Blackwell. Supply chain rumors have indicated that there are 2 different “SKUs” with different power and performance profiles: a Max-P variant at 2,300W and a Max-Q variant at 1,800W. However, these are not distinct hardware SKUs but the 2 default power profiles that Nvidia is offering users based on their workload needs. Max-Q is what Nvidia believes offers the best performance per Watt. Max-P offers the greatest absolute performance though this would come with an efficiency penalty. Running the Max-P setting results in a 20% increase in rack power draw but the performance gain falls well short of this 20% power consumption increase. These power profiles are software managed. Users can also choose whatever max power draw they prefer (as long as it is no more than 2,300W per GPU) and this has been the case for previous GPU generations as well. Several hyperscalers and labs have chosen to run their GPUs at lower power to optimize for performance per Watt as well as taking into account power availability constraints.

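Summary: NVIDIA's next-generation Rubin AI chip has a TDP of up to 2,300W, well above Blackwell's 1,000-1,400W. Two software-managed power profiles are offered: Max-P (2,300W) and Max-Q (1,800W). Max-P targets peak performance, but rack power rises 20% while the performance gain is smaller, so efficiency drops; Max-Q optimizes performance per Watt. Users can set any power limit up to 2,300W, and some hyperscalers already run at lower power to improve efficiency and cope with power constraints.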
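
To make the efficiency trade-off concrete, here is a small worked calculation. The TDP figures and the 20% rack-power increase come from the post; the 10% performance gain is a made-up placeholder (the post only says the gain falls well short of 20%), and the function name is hypothetical.

```python
def perf_per_watt_ratio(perf_gain, power_gain):
    """Perf-per-watt of Max-P relative to Max-Q, given fractional gains."""
    return (1.0 + perf_gain) / (1.0 + power_gain)

# Chip-level TDP figures quoted in the post.
max_p_w, max_q_w = 2300, 1800
chip_power_gain = max_p_w / max_q_w - 1.0
print(f"chip-level power increase: {chip_power_gain:.1%}")

# Rack level: +20% power (from the post), +10% performance (assumed placeholder).
ratio = perf_per_watt_ratio(perf_gain=0.10, power_gain=0.20)
print(f"Max-P perf/W relative to Max-Q: {ratio:.3f}")
```

Any performance gain below the power increase gives a ratio under 1.0, which is the efficiency penalty the post refers to.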

SemiAnalysis@SemiAnalysis_ · Apr 7

NVIDIA ARCHITECTURE ALERT🚨 Shared memory increased almost every generation, while register file size stayed constant. The reason for this is that Tensor Core throughput increase requires a deeper staging buffer. Because Tensor Cores consume data much faster than global memory can load, we use a staging memory to buffer data, so memory loading can run ahead of MMA operations. Tensor Core throughput doubled every generation, but global memory load latency didn’t decrease and in fact increased. As a result, we need to increase the staging memory size for buffering more data. To implement this, NVIDIA chose shared memory as the staging memory for Tensor Cores, which explains why shared memory increased but register file size remained constant. However, Blackwell’s shared memory size didn’t increase from Hopper. This is because tcgen05 MMA can leverage 2 SMs, so each SM’s shared memory only needs to load half of the operands. Thus, Blackwell’s shared memory size effectively doubled.

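Summary: On NVIDIA GPUs, shared memory has grown generation over generation while the register file has not, mainly because doubling Tensor Core throughput requires a larger staging buffer. Global memory loads are far slower than Tensor Core consumption and their latency has risen, so NVIDIA uses shared memory as the staging area for Tensor Cores. Blackwell did not increase per-SM shared memory, but with tcgen05 MMA spanning 2 SMs, each SM loads only half of the operands, which effectively doubles capacity.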
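
The staging-buffer argument is essentially Little's law: the bytes that must be in flight equal the rate at which Tensor Cores consume operands times the global-memory load latency. The sketch below is a simplified model with made-up numbers (they are not NVIDIA specifications), meant only to show how doubling throughput at constant latency doubles the required buffer, and how splitting operands across 2 SMs halves the per-SM requirement.

```python
def staging_bytes(consume_bytes_per_ns, load_latency_ns, sms_sharing=1):
    """Bytes each SM must buffer so loads can run ahead of MMA (Little's law)."""
    return consume_bytes_per_ns * load_latency_ns / sms_sharing

latency_ns = 800.0  # hypothetical global-memory load latency, held constant

gen_a = staging_bytes(64.0, latency_ns)                   # hypothetical baseline
gen_b = staging_bytes(128.0, latency_ns)                  # 2x Tensor Core throughput
gen_c = staging_bytes(256.0, latency_ns, sms_sharing=2)   # 2x again, operands split over 2 SMs

for name, b in [("gen_a", gen_a), ("gen_b", gen_b), ("gen_c", gen_c)]:
    print(f"{name}: {b / 1024:.0f} KiB per SM")
```

In this toy model the third configuration needs the same per-SM buffer as the second even though throughput doubled again, which mirrors the post's explanation for why Blackwell's shared memory did not grow over Hopper's.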

François Chollet@fchollet · Apr 6

Tutorial on fine tuning Gemma on TPU v5 using Kinetic + Keras + JAX. Easiest stack to fully leverage TPUs at scale.

Tibo@thsottiaux · Apr 5

Does anyone have a breakdown of how much value you get in your various AI subscriptions from different providers? When compared to API prices

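Summary: A question about how subscription plans from different AI providers compare with pay-per-use API pricing, which model is the better deal, and a request for a value breakdown across platforms.

François Chollet@fchollet · Apr 4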

Good tutorial on using Keras Kinetic to fine-tune LLMs on the Keras + JAX + TPU stack!

François Chollet@fchollet · Apr 4

Perhaps the craziest thing that was introduced on the Keras community call today: Keras Kinetic, a new library that lets you run jobs on cloud TPU/GPU via a simple decorator -- like Modal but with TPU support. When you call a decorated function, Kinetic handles the entire remote execution pipeline:
- Packages your function, local code, and data dependencies
- Builds a container with your dependencies via Cloud Build (cached after first build)
- Runs the job on a GKE cluster with the requested accelerator (TPU or GPU)
- Returns the result to your local machine (logs are streamed in real time, and the function's return value is delivered back as if it ran locally)

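Summary: The Keras community introduced Kinetic, a library that lets developers run functions on cloud TPUs/GPUs via a decorator, similar to Modal but with TPU support. It handles code packaging, container builds through Cloud Build (with caching), scheduling on a GKE cluster, and returning results, with logs streamed in real time so that remote execution feels local.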
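
For readers unfamiliar with the decorator pattern the post describes, here is a toy sketch in plain Python. It is not Kinetic's API: `run_remotely` and its `accelerator` argument are hypothetical names, and a local worker thread stands in for the container build and the GKE job. It only shows the shape of the idea, where a decorated function is handed off elsewhere and its return value comes back as if it ran locally.

```python
import concurrent.futures as cf
import functools

def run_remotely(accelerator):
    """Hypothetical decorator: run the wrapped function on a stand-in 'remote' worker."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            # A real system would package code, build a container, and schedule a job here.
            print(f"[submit] {fn.__name__} -> {accelerator}")
            with cf.ThreadPoolExecutor(max_workers=1) as worker:
                future = worker.submit(fn, *args, **kwargs)
                return future.result()  # block until the 'remote' job finishes
        return wrapper
    return decorator

@run_remotely(accelerator="tpu")
def train(steps):
    # Placeholder workload; a real job would train a model.
    return {"steps": steps, "loss": 1.0 / steps}

result = train(4)
print(result)
```

The call site looks like an ordinary function call, which is the ergonomic point of the decorator approach.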

Deedy@deedydas · Apr 3

This is the best blog post on LLM inference I've seen this year. They achieved 10x lower latency and >1400 tokens/sec by moving speculative decode onto two 2GB SRAM/chip Corsairs, a small cost on top of a standard GPU setup on gpt-oss-120b. This performance at this price is insane.

Summary: By offloading speculative decode to two Corsair chips with 2GB SRAM per chip, a standard GPU setup running gpt-oss-120b achieves 10x lower latency and more than 1400 tokens/sec at very little extra hardware cost.

François Chollet@fchollet · Apr 3

JAX is what a well-designed low-level machine learning framework looks like. Good design lets you deliver much greater performance with much lower effort. Bad design is the exact opposite.

Epoch AI@EpochAIResearch · Mar 28

The total memory bandwidth of AI chips shipped since 2022 has reached 70 million terabytes per second, growing 4.1x per year. That's around 300,000x more data per second than global internet traffic.

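Summary: Total memory bandwidth of AI chips shipped since 2022 has reached 70 million TB per second, growing 4.1x per year, roughly 300,000 times global internet traffic.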
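
The two figures in the post imply a rough number for global internet traffic, and the growth rate implies a doubling time. The sketch below just does that arithmetic; the inputs are the post's rounded figures, so the outputs are order-of-magnitude estimates only.

```python
import math

total_bw_tb_per_s = 70e6      # 70 million TB/s (from the post)
ratio_to_internet = 300_000   # "around 300,000x" (from the post)
growth_per_year = 4.1         # 4.1x per year (from the post)

# Internet traffic implied by dividing one figure by the other.
implied_internet_tb_per_s = total_bw_tb_per_s / ratio_to_internet
# Time for aggregate bandwidth to double at a constant 4.1x yearly growth.
doubling_time_months = 12 * math.log(2) / math.log(growth_per_year)

print(f"implied internet traffic: ~{implied_internet_tb_per_s:.0f} TB/s")
print(f"doubling time: ~{doubling_time_months:.1f} months")
```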

Sam Altman@sama · Mar 28

The first steel beams went up this week at our Michigan Stargate site with Oracle and Related Digital

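Summary: The first steel beams were installed this week at the Michigan Stargate data center site, with Oracle and Related Digital involved, marking the start of substantive construction.

Artificial Analysis@ArtificialAnlys · Mar 28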

Introducing AA-AgentPerf - the hardware benchmark for the agent era. Key details: ➤ Real agent workloads, not synthetic queries: we’ve captured real coding agent trajectories where our agents used up to 200 turns and worked with sequence lengths >100K tokens ➤ Production optimizations allowed: KV cache reuse, disaggregated prefill/decode, speculative decoding - we’re allowing the optimizations that labs and inference providers are serving in production so that we can capture what real deployments should look like ➤ Measures what developers need to know: Max concurrent users at each target output speed, expressed per accelerator, per kW TDP, per $/hr, and per rack ➤ Built for every kind of scale: designed to measure systems from a single accelerator up to a full rack, and to fairly evaluate every architecture from DRAM-only designs to SRAM-only designs and everything in between ➤ Live now: we’re announcing AA-AgentPerf today and opening submissions of configurations for benchmarking effective immediately. The models supported at launch are gpt-oss-120b and DeepSeek V3.2. We’ll be publishing results on a rolling basis. AA-AgentPerf is a benchmark for real-world performance of AI accelerator hardware. We’re benchmarking inference of particular models on a specific system with a specific config (ie. inference stack, parallelism config and more). AA-AgentPerf has been shaped by our work with inference providers and engagement with AI accelerator companies, developers, and enterprise buyers over the past year. Our goal is for anyone deploying models - whether buying or leasing accelerators - to be able to use AA-AgentPerf as the definitive resource for understanding real-world hardware performance.

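Summary: AA-AgentPerf is an AI hardware benchmark for the agent era. It uses real agent workloads (up to 200 turns and sequences over 100K tokens) instead of synthetic queries, and allows production optimizations such as KV cache reuse and disaggregated prefill/decode. It reports max concurrent users per accelerator, per kW TDP, per $/hr, and per rack, covers systems from a single accelerator to a full rack, and launches with gpt-oss-120b and DeepSeek V3.2, aiming to give hardware buyers and deployers a real-world performance reference.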
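
The normalization described in the post (max concurrent users per accelerator, per kW TDP, per $/hr, and per rack) is simple division. A minimal sketch follows; the dataclass, its field names, and all numbers are invented for illustration and are not AA-AgentPerf's schema or results.

```python
from dataclasses import dataclass

@dataclass
class SystemResult:
    """Hypothetical record for one benchmarked configuration."""
    max_concurrent_users: int   # at a given target output speed
    accelerators: int
    tdp_kw: float               # total TDP of the system
    cost_per_hour: float        # USD per hour for the system
    racks: float                # fraction of a rack the system occupies

def normalize(r):
    """Express max concurrent users relative to each resource."""
    return {
        "per_accelerator": r.max_concurrent_users / r.accelerators,
        "per_kw_tdp": r.max_concurrent_users / r.tdp_kw,
        "per_dollar_hour": r.max_concurrent_users / r.cost_per_hour,
        "per_rack": r.max_concurrent_users / r.racks,
    }

example = SystemResult(max_concurrent_users=400, accelerators=8,
                       tdp_kw=10.0, cost_per_hour=32.0, racks=0.25)
metrics = normalize(example)
print(metrics)
```

Reporting several denominators at once matters because a system can look strong per accelerator yet weak per kW or per dollar, depending on the architecture.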

Jeff Dean@JeffDean · Mar 27

The video of my conversation with Bill Dally at GTC last week is up. I always enjoy talking to Bill, and we had a wide ranging discussion about computer architecture, model training, specialized inference hardware, custom interconnects, and more! https://youtu.be/g8BuAtM3fp4?si=QMTbkl2JhfsNbu3K

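Summary: The video of last week's GTC conversation with Bill Dally is up, covering computer architecture, model training, specialized inference hardware, custom interconnects, and more.

Andrej Karpathy@karpathy · Mar 27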

When I built menugen ~1 year ago, I observed that the hardest part by far was not the code itself, it was the plethora of services you have to assemble like IKEA furniture to make it real, the DevOps: services, payments, auth, database, security, domain names, etc... I am really looking forward to a day where I could simply tell my agent: "build menugen" (referencing the post) and it would just work. The whole thing up to the deployed web page. The agent would have to browse a number of services, read the docs, get all the api keys, make everything work, debug it in dev, and deploy to prod. This is the actually hard part, not the code itself. Or rather, the better way to think about it is that the entire DevOps lifecycle has to become code, in addition to the necessary sensors/actuators of the CLIs/APIs with agent-native ergonomics. And there should be no need to visit web pages, click buttons, or anything like that for the human. It's easy to state, it's now just barely technically possible and expected to work maybe, but it definitely requires from-scratch re-design, work and thought. Very exciting direction!

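Summary: The biggest challenge in building a modern app is not the code but the DevOps around it: service integration, API key management, and deployment configuration. The author looks forward to AI agents handling the whole flow from reading docs to deploying to production, with no clicking through web pages or manual setup. Stripe's newly launched Projects is a step in that direction: developers can use CLI commands to provision third-party services such as PostHog automatically, including account creation, key retrieval, and billing setup, turning the infrastructure lifecycle into code.

Boris Cherny@bcherny · Mar 21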

Desktop and http://claude.ai should be feeling faster

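Summary: Claude.ai and the desktop app received an architecture upgrade this week, moving from SSR to a static setup built on Vite and TanStack Router and deployed to edge Workers. Time to first byte (TTFB) dropped 65%, prompts render 50% faster, and navigation is smoother. The team says it will keep optimizing.

Andrej Karpathy@karpathy · Mar 19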

Thank you Jensen and NVIDIA! She’s a real beauty! I was told I’d be getting a secret gift, with a hint that it requires 20 amps. (So I knew it had to be good). She’ll make for a beautiful, spacious home for my Dobby the House Elf claw, among lots of other tinkering, thank you!!

Summary: Andrej Karpathy received the first DGX Station GB300 (Dell Pro Max with GB300). The "secret gift" that requires 20 amps will be a spacious new home for his Dobby the House Elf claw and other projects.

Hao AI Lab@haoailab · Mar 19

Wow! The Vera Rubin demo looks great but real-time editing is actually already here on a single B200! Try Dreamverse today and generate 30s 1080p videos (with audio) faster than you can watch them. Demo: https://dreamverse.fastvideo.org/

Hao AI Lab@haoailab · Mar 18

(1/N) We're launching Dreamverse. Most AI video models take minutes to generate a 5 s 1080p clip. In 4.5 seconds, we can generate 30 s 1080p clips on a single GPU. Our videos generate faster than you can watch them: stop waiting on prompts and start directing scenes live. 🕹️Demo: http://dreamverse.fastvideo.org 📑 Blog: https://haoailab.com/blogs/dreamverse Welcome to the era of vibe-directing 👇

Hao AI Lab@haoailab · Mar 18

http://x.com/i/article/2034009793598464000

# Into the DreamVerse

TL;DR: Our new real-time inference stack in FastVideo enables Dreamverse, a prototype for a new interface where users can vibe direct their own “multiverse” of videos.

AI video generation is already good enough to make a convincing clip. But real creative work is not about getting a clip in one shot. It’s about iteration. An idea appears, you test it: keep the subject, change the camera angle, continue the scene, and try again. The problem is that ideas move faster than generations. If every attempt takes minutes, the creative loop breaks; your imagination moves on before the video does.

We think there is a better interface for AI video generation, which is why we created Dreamverse, an interface that enables a new workflow called vibe directing.

Vibe directing is to video what vibe coding is to software. Instead of rewriting giant prompts from scratch, you talk to the system in natural language and steer the video through fast revision. Keep the subject, change the background, slow the camera, or anything else! Rather than jamming everything into a single prompt, iterate with multiple simple prompts.

This kind of workflow is only possible when video generation is done in real-time. Current video generation models like Sora take 1-2 minutes to generate a 5s 1080p clip. We can do it in ~4.55 seconds on a single GPU. In other words, our inference stack in FastVideo can generate a clip faster than you can watch it. This capability completely changes the feel of video generation inference; it stops feeling like a passive experience and starts feeling like directing your own scenes. This allows us to create a longer 30-second scene that unfolds as a chain of these 5-second clips, while keeping a chat window open so you can keep directing in real time.

This matters because serious video creation is almost never perfect on the first try. A shot may look off. Motion may break halfway through. Characters may drift between frames. In addition, creators may have multiple versions of a scene and want to play them out to determine which is better. In practice, creators are constantly making small adjustments and trying again. When revisions are slow, it’s much more difficult to explore many ideas. However, when the next result comes back almost immediately, it becomes possible to quickly try many ideas rather than just one. Better creative work comes from a faster feedback loop, not just a better model.

We think this is where video generation is going: a way to direct the video as it unfolds. The best systems will not just generate impressive clips. They will let people explore ideas at the speed of their imagination. That is what vibe directing is all about. Step into the Dreamverse today with our demo.

The Team
Core contributors: Will Lin*, Matthew Noto*, Junda Su*, Yechen Xu*, Peiyuan Zhang* (* equal contribution)
Contributors: Shao Duan, Minshen Zhang, Loay Rashid, Kevin Lin
UI: Tina Mai
Tech leads: Will Lin, Hao Zhang
Advisors: Hao Zhang (corresponding), Danyang Zhuo, Eric Xing, Zhengzhong Liu

Learn More
- FastVideo Documentation
- FastVideo Roadmap for 26Q1

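Summary: The FastVideo team released Dreamverse, a prototype interface that introduces a "vibe directing" workflow. Users steer video generation iteratively in natural language, for example changing the background or adjusting camera motion, without writing long, complex prompts. It rests on a new real-time inference stack that generates a 5-second 1080p video in about 4.55 seconds on a single GPU, faster than the clip takes to watch, which turns generation from passive waiting into live directing. The team argues that the future of video generation is letting creation keep pace with imagination, and that a fast feedback loop does more for quality than model performance alone.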
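
The "faster than you can watch" claim can be checked with a little timeline arithmetic. The sketch assumes clips are generated back to back, that each 5-second clip takes 4.55 seconds to generate (both figures from the article), and that a clip can start playing only once it is fully generated; the function name and the sequential-generation assumption are illustrative, not a description of FastVideo's scheduler.

```python
def playback_stalls(n_clips, clip_s=5.0, gen_s=4.55):
    """Total time the viewer waits mid-scene when clips are generated sequentially."""
    play_clock = gen_s          # playback starts once the first clip is ready
    total_stall = 0.0
    for i in range(n_clips):
        ready_at = (i + 1) * gen_s          # clip i finishes generating
        if ready_at > play_clock:           # viewer caught up with the generator
            total_stall += ready_at - play_clock
            play_clock = ready_at
        play_clock += clip_s                # watch clip i
    return total_stall

fast = playback_stalls(6)               # 30 s scene as six 5 s clips
slow = playback_stalls(6, gen_s=60.0)   # a minutes-scale generator, for contrast
print(fast, slow)
```

As long as generation time per clip stays below clip duration, the only wait is the initial one; once it exceeds clip duration, every clip boundary becomes a stall.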

Greg Brockman@gdb · Mar 17

gpt-5.4 has ramped faster than any other model we've launched in the API: within a week of launch, 5T tokens per day, handling more volume than our entire API one year ago, and reaching an annualized run rate of $1B in net-new revenue. it's a good model, try it out!

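Summary: Within a week of launch, GPT-5.4 was processing 5T tokens per day, more than the entire API handled a year ago, with an annualized $1B in net-new revenue, the fastest ramp on record. The model is good and worth trying.

Hao AI Lab@haoailab · Mar 14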

(1/N) Content creators have been stuck with costly and slow video generation APIs for far too long. We couldn’t take it anymore.😅😭 FastVideo’s new real-time inference stack has the fastest 1080p TI2AV pipeline ever.😍🚀🚀 Our optimized LTX-2.3 pipeline creates 5-second 1080p videos with audio in 4.55 s, on a single GPU! 3.9x faster than the next fastest option. 🕹️Live demo: https://1080p.fastvideo.org/ 📜Blog: https://haoailab.com/blogs/fastvideo_realtime_1080p/

Satya Nadella@satyanadella · Mar 14

We’re the first cloud to bring up an NVIDIA Vera Rubin NVL72 system for validation, another big step in building the next generation of AI infrastructure with NVIDIA.

Summary: First to bring up an NVIDIA Vera Rubin NVL72 system for validation, becoming the first cloud to deploy this next-generation AI infrastructure.

Epoch AI@EpochAIResearch · Mar 13

How much of the world's advanced chip packaging and high-bandwidth memory does AI consume? Almost all of it. We estimate the four largest AI chip designers consumed ~90% of global advanced packaging and HBM supply in 2025, suggesting these inputs were bottlenecks in 2025.

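Summary: In 2025 the four largest AI chip designers consumed about 90% of global advanced packaging and HBM supply, making these inputs an industry bottleneck. AI takes up nearly all of the world's advanced chip packaging and high-bandwidth memory capacity.

Lilian Weng@lilianweng · Mar 11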

Building technologies for better human-AI collaboration on next gen hardware at scale. Exciting.

Sam Altman@sama · Mar 8

Very grateful to Jensen for working to expand Nvidia capacity at AWS so much for us!

[Quoting @firstadopter]: Jensen said two days ago that Nvidia is expanding OpenAI's capacity on AWS "like crazy." We also know OpenAI Codex token usage is surging. Any claim that OpenAI's overall compute demand is weakening looks doubtful.

Saining Xie@sainingxie · Nov 27

most people didn’t know this: we had been using TPUs at *Facebook* as far back as 2020. Kaiming led the initial development of the TF and JAX codebase, and research projects like MAE, MoCo v3, ConvNeXt v2 and DiT were developed *entirely* on TPUs. because we were the only team at FAIR using them, Meta cancelled the GCP deal in early 2023. TPUs also powered much of our large-scale work at NYU, including SiT, Cambrian1/S, and the recent RAE, FreeFlow. took a lot of suffering to learn the infra (not what they signed up for, but my students are basically TPU/JAX/XLA pros now), but once you get there, the performance/stability is exceptional. very optimistic about Google growing the TPU and JAX ecosystem and pushing it forward commercially

Summary: The author reveals that Facebook has used TPUs for AI training since 2020, with Kaiming He leading development of the TF and JAX codebase, and that models such as MAE and DiT were built entirely on TPUs. Because internal adoption was limited, Meta cancelled the GCP deal in 2023. The post notes that labs such as Google and Anthropic have long trained large models on TPUs, that Nvidia's CUDA moat is not insurmountable, and that OpenAI has invested in Triton as an alternative. The efficiency gap between TPUs and GPUs is not the key factor; systems engineering talent is decisive.