AIHOT
X posts · 983 items · Tag: Deployment/Engineering
karminski-牙医@karminski3 · April 10 · 40

👍

Summary: 👍 [Quoting @anemll]: anemll-profile 0.4.1 is out! To update: brew upgrade anemll/tap/anemll-profile. New: ANE graph interruption analysis, JSON export, and an agent guide. Give this link to your agent: http://github.com/anemll/anemll-profile/blob/main/AGENTS.md Example: OCR ANE analysis from @mweinbach's auto-conversion package.

Hao AI Lab@haoailab · April 10

(1/5) FP4 hardware is here, but 4-bit attention still kills model quality, blocking true end-to-end FP4 serving. To fix that, we propose Attn-QAT, the first systematic study of quantization-aware training for attention. The result: FP4 attention quality is comparable to BF16 attention with 1.1x–1.5x higher throughput than SageAttention3 on an RTX 5090 and 1.39x speedup over FlashAttention-4 on a B200. Blog: https://haoailab.com/blogs/attn-qat/ Code: https://github.com/hao-ai-lab/FastVideo/pull/1225 Checkpoints: https://huggingface.co/FastVideo/14B_qat_400

Summary: FP4 hardware is here, but 4-bit attention has long suffered a quality bottleneck that blocks end-to-end FP4 deployment. The team proposes Attn-QAT, the first systematic study of quantization-aware training for attention. It brings FP4 attention quality to BF16 level, with 1.1x–1.5x higher throughput than SageAttention3 on an RTX 5090 and a 1.39x speedup over FlashAttention-4 on a B200.
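
To make "quantization-aware training" a little more concrete, here is a minimal sketch of the fake-quantization step that QAT schemes generally place in the forward pass: values are scaled, snapped to the nearest representable 4-bit float, and scaled back. The value grid is the common FP4 E2M1 format. The per-block max scaling and the name `fake_quantize_fp4` are assumptions for illustration and are not taken from the Attn-QAT code.

```python
# Magnitudes representable in FP4 E2M1 (1 sign, 2 exponent, 1 mantissa bit).
FP4_E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def fake_quantize_fp4(block):
    """Round a block of floats onto the FP4 grid and return plain floats.

    Hypothetical helper: the block is scaled so its largest magnitude maps
    to 6.0 (the largest FP4 value), each element is rounded to the nearest
    grid point, and the scaling is then undone.
    """
    max_abs = max(abs(x) for x in block)
    if max_abs == 0.0:
        return [0.0 for _ in block]
    scale = max_abs / FP4_E2M1_GRID[-1]
    out = []
    for x in block:
        magnitude = abs(x) / scale
        nearest = min(FP4_E2M1_GRID, key=lambda g: abs(g - magnitude))
        out.append((nearest if x >= 0 else -nearest) * scale)
    return out


# Illustrative attention-score block (made-up numbers).
scores = [0.1, -0.7, 1.2, 3.0]
print(fake_quantize_fp4(scores))
```

Training against these rounded values is what lets a model adapt to the coarse grid. How gradients flow through the rounding step (often a straight-through estimator) is left out of this sketch.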

SemiAnalysis@SemiAnalysis_ · April 10

Nvidia published DWDP (Distributed Weight-Data Parallelism), a new inference parallelism strategy focused on prefill. It sounds slightly insane until you remember the target machine is GB200 NVL72. The core trade: spend more peer-GPU bandwidth so you spend less time waiting at collective barriers. (1/6) 🧵 https://arxiv.org/abs/2604.01621v1

SemiAnalysis@SemiAnalysis_ · April 9

YOUR PARENTS PAID FOR THE CUDA MOAT! The #1 contributor to the CUDA MOAT isn't the developers at NVIDIA, but the millions of developers outside of NVIDIA who invent new algorithms for CUDA like Flash Attention. For most of them, it started with a GeForce gaming GPU. NVIDIA is the only company that has a reasonably good developer stack on consumer-grade GPUs. As people grow up beyond playing CSGO & League of Legends & Minecraft, they either become anime weeaboos or they start programming on their existing computer, which has a GeForce GPU

Summary: The CUDA moat was built mainly not by NVIDIA's own developers but by millions of outside developers who invented algorithms such as Flash Attention on CUDA. Most of them started on GeForce gaming GPUs, because NVIDIA is the only company offering a solid developer stack on consumer-grade GPUs. As gamers grow up, they start programming on the GeForce cards they already own, forming a distinctive talent pipeline from gaming into AI development.

SemiAnalysis@SemiAnalysis_ · April 9

Groq is one of the most interesting chip stories in AI. Nvidia paid $20B to license their IP and hire most of their team, structured as a licensing deal rather than an acquisition to sidestep regulatory scrutiny. It closed in under 4 months. Here's why Nvidia wanted it so badly. (1/4)🧵

Peter Steinberger 🦞@steipete · April 9

Some folks try to spin a narrative that I don't like local models, meanwhile I spent a lot of time making it easy to use OpenClaw with them. Latest release adds support for inferrs, which is a new super efficient TurboQuant inference server: https://docs.openclaw.ai/providers/inferrs

Summary: The latest OpenClaw release adds support for inferrs, a highly efficient TurboQuant inference server. The author pushes back on the claim that he dislikes local models, stressing that he has spent a lot of time making them easy to use with OpenClaw.

SemiAnalysis@SemiAnalysis_ · April 9 · 32

Cameron Quilici and Bryan Shan sit down to discuss InferenceX and the work happening at SemiAnalysis.

Jeff Dean@JeffDean · April 8

Hedged requests (apparently inspired by the Tail at Scale paper by myself and Luiz Barroso) applied within a single machine to replicating data across DRAM channels and issuing reads to all channels, using the one that comes back first. ~5-15X reduction in p99.99 read latency. https://github.com/LaurieWired/tailslayer/blob/main/README.md Cool stuff, @lauriewired! Accompanying video forwarded to me by a friend, which is how I learned about it: https://www.youtube.com/watch?v=QFi2WVGfXMQ

Summary: Hedged requests, inspired by the Tail at Scale paper, are applied within a single machine across DRAM channels: reads are issued to all channels concurrently and the fastest response is used, cutting p99.99 read latency by roughly 5-15x. The tailslayer project that implements this is open source.
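
The hedging pattern itself is easy to show in software. The sketch below issues the same read to several replicas on threads and returns whichever answers first. It only illustrates the general idea: tailslayer works at the level of DRAM channels, and its implementation is not reproduced here. `make_replica` and `hedged_read` are hypothetical names, and the delays are simulated with `time.sleep`. It assumes Python 3.9+ for `cancel_futures`.

```python
import concurrent.futures
import time


def make_replica(delay_s):
    """Build a fake replica whose reads take `delay_s` seconds (simulation only)."""
    def read(key):
        time.sleep(delay_s)
        return (key, delay_s)
    return read


def hedged_read(replicas, key):
    """Send the same read to every replica and return the first result."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=len(replicas))
    futures = [pool.submit(replica, key) for replica in replicas]
    try:
        done, _ = concurrent.futures.wait(
            futures, return_when=concurrent.futures.FIRST_COMPLETED
        )
        return next(iter(done)).result()
    finally:
        # Do not block on the slower replicas; drop any reads not yet started.
        pool.shutdown(wait=False, cancel_futures=True)


slow = make_replica(0.3)
fast = make_replica(0.01)
print(hedged_read([slow, fast], "row-7"))
```

The cost is the redundant work: every read is paid for on all replicas, which is the trade the post describes (replicated data and extra reads in exchange for a shorter latency tail).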

SemiAnalysis@SemiAnalysis_ · April 8

From the GTC talk, the maintainers of NIXL said they are happy to accept RIXL patches into upstream, just like how they already accepted Trainium Neuron support patches & XPU patches into upstream. Happy to talk more in our slack & connect you to the appropriate NIXL folks so that u don't need to maintain your second class fork @KranenKyle . maybe the NIXL folks that accept patches from other chip vendors into upstream can connect u to the flashinfer folks too.

Epoch AI@EpochAIResearch · April 8

Who owns the world's compute? Our new Chip Ownership hub shows that Google leads, holding around 25% of all compute sold since 2022.

Summary: The latest Chip Ownership data shows Google leading the market with about 25% of all compute sold worldwide since 2022.

SemiAnalysis@SemiAnalysis_ · April 8

NVIDIA STX is more than just a new storage device. It represents a redesign of how AI systems move, access, and manage data. Traditional storage architectures were built for reliable, large-scale data storage, but agentic AI and long-context inference require different capabilities. These systems need to retrieve data quickly, maintain context across multiple steps, and access information continuously during inference workflows. Under these conditions, conventional storage can become a bottleneck: increased latency, slow data transfer, and decreased GPU efficiency. STX aims to bridge this gap. Essentially, STX functions as a high-speed data layer positioned between GPUs and standard storage infrastructure. Its purpose is to bring data closer to computing resources, accelerate read/write operations, and reduce data movement overhead. This allows GPUs to spend less time waiting for data, enabling AI models to handle long contexts, multi-step reasoning, and real-time tasks more efficiently. STX is not just about improving storage performance but about optimizing the efficiency of the entire AI infrastructure. Future AI systems will be defined not only by raw compute power but also by how quickly data can be delivered, how well context can be maintained, and how effectively the inference pipeline is optimized.

Summary: NVIDIA STX is a high-speed data layer between GPUs and conventional storage, designed for agentic AI and long-context inference. By moving data closer to compute, it cuts latency and data-movement overhead and removes the bottleneck that traditional storage creates in inference workflows. STX improves more than storage performance: it raises the efficiency of the whole AI infrastructure so that GPUs can handle long contexts, multi-step reasoning, and real-time tasks. The competitive focus of future AI systems is shifting from raw compute toward data delivery speed and inference pipeline optimization.

SemiAnalysis@SemiAnalysis_ · April 7

NVIDIA SOFTWARE MOAT ALERT: the recently announced AWS Trainium <> Cerebras will still be using a small bit of NVIDIA software code. In order to transfer kvcache between prefill Trainium & decode Cerebras wafer, AWS will be using NVIDIA NIXL KVcache transfer agent along with EFA. They will RDMA over EFA from Trainium over to Cerebras's CPU host memory before the CPU host talks to the wafer via the wafer engine's FPGA.

AK@_akhaliq · April 7

gradio.Server: Any Custom Frontend with Gradio's Backend. Build with your own frontend framework entirely, like React, Svelte, or even plain HTML/JS, while still benefiting from Gradio's queuing system, API infrastructure, MCP support, and ZeroGPU on Spaces. Blog: https://huggingface.co/blog/introducing-gradio-server

Summary: gradio.Server lets developers build apps with any frontend framework, such as React, Svelte, or plain HTML/JS, while keeping Gradio's backend capabilities: the queuing system, API infrastructure, MCP support, and ZeroGPU on Spaces.

Yuchen Jin@Yuchenj_UW · April 7

Crazy revenue growth at Anthropic. So they officially surpassed OpenAI’s $25B ARR reported a few days ago? The focus on coding models and enterprise clearly paid off. Once you’re locked into a year-long contract, switching to Codex isn’t easy. Claude Code shipping velocity is insane too, new feature every day. If they secure more GPUs and Google TPUs, this growth could accelerate even further.

Summary: Anthropic's revenue growth is striking and may have surpassed OpenAI's $25B ARR. Its focus on coding models and enterprise has paid off, and long-term contracts make it hard for customers to switch to Codex. Claude Code ships extremely fast, with new features almost daily. Anthropic has also signed an agreement with Google and Broadcom for multiple gigawatts of TPU capacity starting in 2027.

Anthropic@AnthropicAI · April 7

We've signed an agreement with Google and Broadcom for multiple gigawatts of next-generation TPU capacity, coming online starting in 2027, to train and serve frontier Claude models.

Summary: Agreement reached with Google and Broadcom, securing multiple gigawatts of next-generation TPU capacity that comes online from 2027 to train and serve frontier Claude models.

SemiAnalysis@SemiAnalysis_ · April 7

PROFESSIONAL POWER EFFICIENCY ALERT: Rubin’s chip level TDP increases up to 2,300W vs 1000-1400W for Blackwell. Supply chain rumors have indicated that there are 2 different “SKUs” with different power and performance profiles: a Max-P variant at 2,300W and a Max-Q variant at 1,800W. However, these are not distinct hardware SKUs but the 2 default power profiles that Nvidia is offering users based on their workload needs. Max-Q is what Nvidia believes offers the best performance per Watt. Max-P offers the greatest absolute performance though this would come with an efficiency penalty. Running the Max-P setting results in a 20% increase in rack power draw but the performance gain falls well short of this 20% power consumption increase. These power profiles are software managed. Users can also choose whatever max power draw they prefer (as long as it is no more than 2,300W per GPU) and this has been the case for previous GPU generations as well. Several hyperscalers and labs have chosen to run their GPUs at lower power to optimize for performance per Watt as well as taking into account power availability constraints.

Summary: Rubin, NVIDIA's next-generation AI chip, has a TDP of up to 2,300W, well above Blackwell's 1,000-1,400W. Two software-managed power profiles are offered: Max-P (2,300W) and Max-Q (1,800W). Max-P targets peak performance, but rack power rises 20% while the performance gain falls short of that, so efficiency drops; Max-Q optimizes performance per watt. Users can set any power limit up to 2,300W, and some hyperscalers already run at lower power to improve efficiency and cope with power constraints.
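
The efficiency penalty follows directly from the ratio of performance gain to power increase. The sketch below works through it. The 20% rack power increase and the two wattages come from the post; the 10% performance gain is a made-up placeholder, since the post only says the gain falls well short of 20%.

```python
def perf_per_watt_ratio(perf_gain, power_increase):
    """Perf/W of the higher-power profile relative to the baseline profile.

    Both arguments are fractional increases, e.g. 0.20 for +20%.
    A result below 1.0 means the higher-power profile is less efficient.
    """
    return (1.0 + perf_gain) / (1.0 + power_increase)


chip_power_ratio = 2300 / 1800   # Max-P vs. Max-Q chip-level TDP, from the post
rack_power_increase = 0.20       # rack-level increase, from the post
assumed_perf_gain = 0.10         # hypothetical; not stated in the post

print(f"chip TDP ratio: {chip_power_ratio:.2f}x")
print(f"relative perf/W: {perf_per_watt_ratio(assumed_perf_gain, rack_power_increase):.3f}")
```

Any performance gain below 20% gives a ratio under 1.0, which is why Max-Q is described as the better performance-per-watt setting.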

SemiAnalysis@SemiAnalysis_ · April 7

NVIDIA ARCHITECTURE ALERT🚨 Shared memory increased almost every generation, while register file size stayed constant. The reason for this is that Tensor Core throughput increase requires a deeper staging buffer. Because Tensor Cores consume data much faster than global memory can load, we use a staging memory to buffer data, so memory loading can run ahead of MMA operations. Tensor Core throughput doubled every generation, but global memory load latency didn’t decrease and in fact increased. As a result, we need to increase the staging memory size for buffering more data. To implement this, NVIDIA chose shared memory as the staging memory for Tensor Cores, which explains why shared memory increased but register file size remained constant. However, Blackwell’s shared memory size didn’t increase from Hopper. This is because tcgen05 MMA can leverage 2 SMs, so each SM’s shared memory only needs to load half of the operands. Thus, Blackwell’s shared memory size effectively doubled.

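Summary: On NVIDIA GPUs, shared memory has grown almost every generation while the register file has stayed constant, mainly because doubling Tensor Core throughput requires a larger staging buffer. Global memory loads are far slower than Tensor Core consumption and their latency has risen, so NVIDIA uses shared memory as the staging area for Tensor Cores. Blackwell did not increase per-SM shared memory, but tcgen05 MMA can span 2 SMs, so each SM loads only half of the operands, which effectively doubles capacity.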
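
The buffering argument is a latency-hiding calculation: to keep a consumer busy, the data in flight must cover the consumption rate multiplied by the load latency. The sketch below applies that relation. All numbers are invented for illustration and are not real GPU specifications; only the two qualitative claims (throughput doubles per generation, and a 2-SM MMA halves what each SM must stage) come from the post.

```python
def staging_bytes_needed(consume_bytes_per_ns, load_latency_ns, sms_sharing=1):
    """Bytes each SM must stage to keep the Tensor Cores fed.

    consume_bytes_per_ns: operand bytes the MMA units consume per nanosecond
    load_latency_ns:      global-memory load latency to hide
    sms_sharing:          SMs that cooperate on one MMA and split the operands
    """
    return consume_bytes_per_ns * load_latency_ns / sms_sharing


# Hypothetical generation N: 64 B/ns consumption, 500 ns load latency.
gen_n = staging_bytes_needed(64, 500)
# Hypothetical generation N+1: throughput doubles, latency does not improve.
gen_n1 = staging_bytes_needed(128, 500)
# Same doubled throughput, but two SMs each stage half of the operands.
gen_n1_two_sm = staging_bytes_needed(128, 500, sms_sharing=2)

print(gen_n, gen_n1, gen_n1_two_sm)
```

Under these assumptions, doubling throughput doubles the required staging space, and splitting operands across two SMs brings the per-SM requirement back to the previous level, matching the post's point that Blackwell's unchanged shared memory is effectively doubled.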

François Chollet@fchollet · April 6

Tutorial on fine tuning Gemma on TPU v5 using Kinetic + Keras + JAX. Easiest stack to fully leverage TPUs at scale.

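Tibo@thsottiaux · April 5

Does anyone have a breakdown of how much value you get in your various AI subscriptions from different providers? When compared to API prices

Summary: Asks how subscriptions from different AI providers compare in cost-effectiveness with metered API pricing, which of the two models is the better deal, and for a value breakdown of each platform's pricing.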

François Chollet@fchollet · April 4

Good tutorial on using Keras Kinetic to fine-tune LLMs on the Keras + JAX + TPU stack!

François Chollet@fchollet · April 4

Perhaps the craziest thing that was introduced on the Keras community call today: Keras Kinetic, a new library that lets you run jobs on cloud TPU/GPU via a simple decorator -- like Modal but with TPU support. When you call a decorated function, Kinetic handles the entire remote execution pipeline:
- Packages your function, local code, and data dependencies
- Builds a container with your dependencies via Cloud Build (cached after first build)
- Runs the job on a GKE cluster with the requested accelerator (TPU or GPU)
- Returns the result to your local machine (logs are streamed in real time, and the function's return value is delivered back as if it ran locally)

Summary: The Keras community introduced Kinetic, a library that lets developers run a function on cloud TPUs/GPUs via a decorator, similar to Modal but with TPU support. It handles code packaging, container builds through Cloud Build (with caching), scheduling on a GKE cluster, and returning results, with logs streamed in real time so that remote execution feels local.

Deedy@deedydas · April 3

This is the best blog post on LLM inference I've seen this year. They achieved 10x lower latency and >1400 tokens/sec by moving speculative decode onto two 2GB SRAM/chip Corsairs, a small cost on top of a standard GPU setup on gpt-oss-120b. This performance at this price is insane.

Summary: By offloading speculative decoding to two Corsair chips with 2GB of SRAM per chip, alongside a standard GPU setup running gpt-oss-120b, they achieved 10x lower latency and more than 1,400 tokens/sec at a small additional hardware cost.
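
For readers unfamiliar with speculative decoding, the sketch below shows the greedy draft-and-verify loop in its simplest form: a cheap draft model proposes several tokens, and the target model keeps the longest prefix it agrees with and then adds one token of its own. The two toy "models" are stand-in functions invented for this example. A real system verifies all proposals in one batched target pass, and nothing here reflects how the blog post maps the draft model onto Corsair hardware.

```python
def speculative_step(draft_next, target_next, prefix, k):
    """Run one greedy speculative-decoding step and return the new tokens.

    draft_next / target_next: functions mapping a token list to the next token
    prefix: tokens generated so far
    k: number of tokens the draft model proposes
    """
    # 1) The draft model proposes k tokens autoregressively.
    proposed = []
    ctx = list(prefix)
    for _ in range(k):
        token = draft_next(ctx)
        proposed.append(token)
        ctx.append(token)

    # 2) The target model checks the proposals in order (emulated sequentially).
    accepted = []
    ctx = list(prefix)
    for token in proposed:
        expected = target_next(ctx)
        if expected != token:
            accepted.append(expected)  # replace the first mismatch and stop
            return accepted
        accepted.append(token)
        ctx.append(token)

    # 3) Every proposal matched: the target contributes one extra token.
    accepted.append(target_next(ctx))
    return accepted


# Toy models: the target counts modulo 5; the draft agrees except at length 3.
def target(ctx):
    return len(ctx) % 5


def draft(ctx):
    return 9 if len(ctx) == 3 else len(ctx) % 5


print(speculative_step(draft, target, [0, 1], k=4))
```

The output always equals what the target model alone would have produced, so the speedup comes purely from accepting several draft tokens per target pass.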

François Chollet@fchollet · April 3

JAX is what a well-designed low-level machine learning framework looks like. Good design lets you deliver much greater performance with much lower effort. Bad design is the exact opposite.

Epoch AI@EpochAIResearch · March 28

The total memory bandwidth of AI chips shipped since 2022 has reached 70 million terabytes per second, growing 4.1x per year. That's around 300,000x more data per second than global internet traffic.

Summary: The total memory bandwidth of AI chips shipped worldwide since 2022 has reached 70 million TB per second, growing 4.1x per year, about 300,000 times global internet traffic.

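Sam Altman@sama · March 28

The first steel beams went up this week at our Michigan Stargate site with Oracle and Related Digital

Summary: The first steel beams went up this week at the Michigan Stargate data center, with Oracle and Related Digital involved on site, moving the project into substantive construction.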

Artificial Analysis@ArtificialAnlys · March 28

Introducing AA-AgentPerf - the hardware benchmark for the agent era. Key details:
➤ Real agent workloads, not synthetic queries: we’ve captured real coding agent trajectories where our agents used up to 200 turns and worked with sequence lengths >100K tokens
➤ Production optimizations allowed: KV cache reuse, disaggregated prefill/decode, speculative decoding - we’re allowing the optimizations that labs and inference providers are serving in production so that we can capture what real deployments should look like
➤ Measures what developers need to know: Max concurrent users at each target output speed, expressed per accelerator, per kW TDP, per $/hr, and per rack
➤ Built for every kind of scale: designed to measure systems from a single accelerator up to a full rack, and to fairly evaluate every architecture from DRAM-only designs to SRAM-only designs and everything in between
➤ Live now: we’re announcing AA-AgentPerf today and opening submissions of configurations for benchmarking effective immediately. The models supported at launch are gpt-oss-120b and DeepSeek V3.2. We’ll be publishing results on a rolling basis.
AA-AgentPerf is a benchmark for real-world performance of AI accelerator hardware. We’re benchmarking inference of particular models on a specific system with a specific config (i.e. inference stack, parallelism config and more). AA-AgentPerf has been shaped by our work with inference providers and engagement with AI accelerator companies, developers, and enterprise buyers over the past year. Our goal is for anyone deploying models - whether buying or leasing accelerators - to be able to use AA-AgentPerf as the definitive resource for understanding real-world hardware performance.

Summary: AA-AgentPerf is an AI hardware benchmark for the agent era. It uses real agent workloads (up to 200 turns and sequences above 100K tokens) rather than synthetic queries. It allows production optimizations such as KV cache reuse and disaggregated prefill/decode, and reports the maximum number of concurrent users per accelerator, per kW TDP, per $/hr, and per rack. It covers systems from a single accelerator to a full rack, launches with gpt-oss-120b and DeepSeek V3.2, and aims to give hardware buyers and deployers a real-world performance reference.
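
The normalized metrics named in the post can be pictured as simple divisions of one measured quantity. The sketch below is an assumption about how such a normalization might look: the `SystemResult` fields and the sample numbers are hypothetical, and the benchmark's actual methodology is not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class SystemResult:
    """One benchmarked configuration (hypothetical field names)."""
    max_concurrent_users: int   # at a chosen target output speed
    accelerators: int
    tdp_kw: float               # total accelerator TDP in kilowatts
    cost_per_hour: float        # lease cost in $/hr


def normalize(result):
    """Express max concurrent users per accelerator, per kW TDP, and per $/hr."""
    return {
        "per_accelerator": result.max_concurrent_users / result.accelerators,
        "per_kw_tdp": result.max_concurrent_users / result.tdp_kw,
        "per_dollar_hour": result.max_concurrent_users / result.cost_per_hour,
    }


# Made-up example: an 8-accelerator node serving 240 concurrent users.
node = SystemResult(max_concurrent_users=240, accelerators=8, tdp_kw=12.0, cost_per_hour=40.0)
print(normalize(node))
```

Reporting the same result along several denominators is what lets buyers who are constrained by power, budget, or rack space compare systems on the axis that matters to them.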

Jeff Dean@JeffDean · March 27

The video of my conversation with Bill Dally at GTC last week is up. I always enjoy talking to Bill, and we had a wide ranging discussion about computer architecture, model training, specialized inference hardware, custom interconnects, and more! https://youtu.be/g8BuAtM3fp4?si=QMTbkl2JhfsNbu3K

Summary: The video of last week's GTC conversation with Bill Dally is now up; the two had an in-depth discussion covering computer architecture, model training, specialized inference hardware, custom interconnects, and more.

Andrej Karpathy@karpathy · March 27

When I built menugen ~1 year ago, I observed that the hardest part by far was not the code itself, it was the plethora of services you have to assemble like IKEA furniture to make it real, the DevOps: services, payments, auth, database, security, domain names, etc... I am really looking forward to a day where I could simply tell my agent: "build menugen" (referencing the post) and it would just work. The whole thing up to the deployed web page. The agent would have to browse a number of services, read the docs, get all the api keys, make everything work, debug it in dev, and deploy to prod. This is the actually hard part, not the code itself. Or rather, the better way to think about it is that the entire DevOps lifecycle has to become code, in addition to the necessary sensors/actuators of the CLIs/APIs with agent-native ergonomics. And there should be no need to visit web pages, click buttons, or anything like that for the human. It's easy to state, it's now just barely technically possible and expected to work maybe, but it definitely requires from-scratch re-design, work and thought. Very exciting direction!

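Summary: The biggest challenge in building a modern app is not the code but the DevOps: tedious service integration, API key management, and deployment configuration. The author looks forward to AI agents handling the whole flow, from reading docs to deploying to production, with no clicking through web pages or manual setup. Stripe's Projects is a step in that direction: developers can use CLI commands to provision third-party services such as PostHog, automating account creation, key retrieval, and billing setup, and turning the infrastructure lifecycle into code.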

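Boris Cherny@bcherny · March 21

Desktop and http://claude.ai should be feeling faster

Summary: Claude.ai and the desktop apps got an architecture upgrade this week, moving from SSR to a static Vite and TanStack Router setup deployed to edge Workers. Time to first byte (TTFB) dropped 65%, prompts appear 50% faster, and navigation is smoother. The team says it will keep optimizing.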

Andrej Karpathy@karpathy · March 19

Thank you Jensen and NVIDIA! She’s a real beauty! I was told I’d be getting a secret gift, with a hint that it requires 20 amps. (So I knew it had to be good). She’ll make for a beautiful, spacious home for my Dobby the House Elf claw, among lots of other tinkering, thank you!!

Summary: Andrej Karpathy received the first DGX Station GB300 (a Dell Pro Max with GB300). The "secret gift", which requires 20 amps, will be a spacious new home for projects such as his Dobby the House Elf claw.

Hao AI Lab@haoailab · March 19

Wow! The Vera Rubin demo looks great but real-time editing is actually already here on a single B200! Try Dreamverse today and generate 30s 1080p videos (with audio) faster than you can watch them. Demo: https://dreamverse.fastvideo.org/

Hao AI Lab@haoailab · March 18

(1/N) We're launching Dreamverse. Most AI video models take minutes to generate a 5 s 1080p clip. In 4.5 seconds, we can generate 30 s 1080p clips on a single GPU. Our videos generate faster than you can watch them: stop waiting on prompts and start directing scenes live. 🕹️Demo: http://dreamverse.fastvideo.org 📑 Blog: https://haoailab.com/blogs/dreamverse Welcome to the era of vibe-directing 👇

Hao AI Lab@haoailab · March 18 · 65

http://x.com/i/article/2034009793598464000

# Into the DreamVerse

TL;DR: Our new real-time inference stack in FastVideo enables Dreamverse, a prototype for a new interface where users can vibe direct their own “multiverse” of videos.

AI video generation is already good enough to make a convincing clip. But real creative work is not about getting a clip in one shot. It’s about iteration. An idea appears, you test it: keep the subject, change the camera angle, continue the scene, and try again.

The problem is that ideas move faster than generations. If every attempt takes minutes, the creative loop breaks; your imagination moves on before the video does.

We think there is a better interface for AI video generation, which is why we created Dreamverse, an interface that enables a new workflow called vibe directing.

Vibe directing is to video what vibe coding is to software. Instead of rewriting giant prompts from scratch, you talk to the system in natural language and steer the video through fast revision. Keep the subject, change the background, slow the camera, or anything else! Rather than jamming everything into a single prompt, iterate with multiple simple prompts.

This kind of workflow is only possible when video generation is done in real-time. Current video generation models like Sora take 1-2 minutes to generate a 5s 1080p clip. We can do it in ~4.55 seconds on a single GPU. In other words, our inference stack in FastVideo can generate a clip faster than you can watch it. This capability completely changes the feel of video generation inference; it stops feeling like a passive experience and starts feeling like directing your own scenes. This allows us to create a longer 30-second scene that unfolds as a chain of these 5-second clips, while keeping a chat window open so you can keep directing in real time.

This matters because serious video creation is almost never perfect on the first try. A shot may look off. Motion may break halfway through. Characters may drift between frames. In addition, creators may have multiple versions of a scene and want to play them out to determine which is better. In practice, creators are constantly making small adjustments and trying again. When revisions are slow, it’s much more difficult to explore many ideas. However, when the next result comes back almost immediately, it becomes possible to quickly try many ideas rather than just one. Better creative work comes from a faster feedback loop, not just a better model.

We think this is where video generation is going: a way to direct the video as it unfolds. The best systems will not just generate impressive clips. They will let people explore ideas at the speed of their imagination. That is what vibe directing is all about. Step into the Dreamverse today with our demo.

The Team
Core contributors: Will Lin*, Matthew Noto*, Junda Su*, Yechen Xu*, Peiyuan Zhang* (* equal contribution)
Contributors: Shao Duan, Minshen Zhang, Loay Rashid, Kevin Lin
UI: Tina Mai
Tech leads: Will Lin, Hao Zhang
Advisors: Hao Zhang (corresponding), Danyang Zhuo, Eric Xing, Zhengzhong Liu

Learn More
- FastVideo Documentation
- FastVideo Roadmap for 26Q1

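Summary: The FastVideo team released the Dreamverse prototype interface, introducing a "vibe directing" workflow. Users steer video generation iteratively and in real time using natural language, for example changing the background or adjusting camera motion, without writing long, complex prompts. At its core is a new real-time inference stack that generates a 5-second 1080p clip in about 4.55 seconds on a single GPU, faster than the clip takes to watch, turning generation from passive waiting into live directing. The team argues that the future of video generation lies in letting creation keep pace with imagination, and that a fast feedback loop does more for quality than model performance alone.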
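
The "faster than you can watch it" claim has a concrete consequence for the 30-second chained scene: once the first clip is ready, playback should never have to wait for the next one. The sketch below checks this under simplifying assumptions that are not from the article: clips are generated strictly one after another, each takes the same time, and playback starts the moment the first clip is done.

```python
def playback_stalls(clip_s, gen_s, n_clips):
    """Return True if a viewer ever has to wait for the next clip.

    clip_s:  duration of each clip in seconds
    gen_s:   time to generate one clip in seconds
    n_clips: number of clips chained into the scene
    """
    playback_start = gen_s  # viewing begins when the first clip is ready
    for k in range(n_clips):
        ready_at = (k + 1) * gen_s              # clip k finishes generating
        needed_at = playback_start + k * clip_s  # clip k is due on screen
        if ready_at > needed_at + 1e-9:
            return True
    return False


# Figures from the article: 5 s clips, ~4.55 s each, six clips for a 30 s scene.
print(playback_stalls(clip_s=5.0, gen_s=4.55, n_clips=6))
# A slower, hypothetical generator at 6 s per clip would fall behind.
print(playback_stalls(clip_s=5.0, gen_s=6.0, n_clips=6))
```

The condition reduces to the generation time per clip being no longer than the clip itself, which is the threshold the article treats as real-time.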

Greg Brockman@gdb · March 17

gpt-5.4 has ramped faster than any other model we've launched in the API: within a week of launch, 5T tokens per day, handling more volume than our entire API one year ago, and reaching an annualized run rate of $1B in net-new revenue. it's a good model, try it out!

Summary: Within a week of launch, GPT-5.4 was processing 5T tokens per day, more than the entire API handled a year ago, with an annualized run rate of $1B in net-new revenue, the fastest ramp of any model launched in the API. The model is good and worth trying.

Hao AI Lab@haoailab · March 14

(1/N) Content creators have been stuck with costly and slow video generation APIs for far too long. We couldn’t take it anymore.😅😭 FastVideo’s new real-time inference stack has the fastest 1080p TI2AV pipeline ever.😍🚀🚀 Our optimized LTX-2.3 pipeline creates 5-second 1080p videos with audio in 4.55 s, on a single GPU! 3.9x faster than the next fastest option. 🕹️Live demo: https://1080p.fastvideo.org/ 📜Blog: https://haoailab.com/blogs/fastvideo_realtime_1080p/

Satya Nadella@satyanadella · March 14

We’re the first cloud to bring up an NVIDIA Vera Rubin NVL72 system for validation, another big step in building the next generation of AI infrastructure with NVIDIA.

Summary: Became the first cloud platform to bring up an NVIDIA Vera Rubin NVL72 system for validation, a further step toward next-generation AI infrastructure.

Epoch AI@EpochAIResearch · March 13

How much of the world's advanced chip packaging and high-bandwidth memory does AI consume? Almost all of it. We estimate the four largest AI chip designers consumed ~90% of global advanced packaging and HBM supply in 2025, suggesting these inputs were bottlenecks in 2025.

Summary: In 2025, the four largest AI chip designers consumed about 90% of global advanced packaging and HBM supply, making these inputs an industry bottleneck. AI has taken nearly all of the world's advanced chip packaging and high-bandwidth memory capacity.

Lilian Weng@lilianweng · March 11

Building technologies for better human-AI collaboration on next gen hardware at scale. Exciting.

Sam Altman@sama · March 8

Very grateful to Jensen for working to expand Nvidia capacity at AWS so much for us!

Summary: [Quoting @firstadopter]: Jensen said two days ago that Nvidia is expanding OpenAI capacity at AWS "like mad". We also know OpenAI Codex token use is surging. Any claim that OpenAI's overall compute demand is weakening seems doubtful.

Saining Xie@sainingxie · November 27

most of people didn’t know this we had been using TPUs at *Facebook* as far back as 2020. Kaiming led the initial development of the TF and JAX codebase, and research projects like MAE, MoCo v3, ConvNeXt v2 and DiT were developed *entirely* on TPUs. because we were the only team at FAIR using them, Meta cancelled the GCP deal in early 2023. TPUs also powered much of our large-scale work at NYU, including SiT, Cambrian1/S, and the recent RAE, FreeFlow. took a lot of suffering to learn the infra (not what they signed up for, but my students are basically TPU/JAX/XLA pros now), but once you get there, the performance/stability is exceptional. very optimistic about Google growing the TPU and JAX ecosystem and pushing it forward commercially

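Summary: A former FAIR researcher revealed that Facebook had been using TPUs for AI training since 2020, with Kaiming He leading development of the TF and JAX codebase; models such as MAE and DiT were built entirely on TPUs. Because internal adoption was limited, Meta cancelled the GCP deal in 2023. The post notes that labs such as Google and Anthropic have long trained large models on TPUs, that Nvidia's CUDA moat is not insurmountable, and that OpenAI has invested in Triton as an alternative. The efficiency gap between TPUs and GPUs is not the key factor; systems engineering talent is decisive.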

All AI updates
Full feed of AI-related news
April 10
06:34
karminski-牙医@karminski3 · 40
Anemll: anemll-profile 0.4.1 is out! To update: brew upgrade anemll/tap/anemll-profile New: ANE graph interruption analysis, JSO...
Tags: Product update · On-device · Deployment/Engineering
04:46
Hao AI Lab@haoailab
Attn-QAT delivers FP4 attention quantization with BF16-level quality and up to 1.5x speedup
Tags: Data/Training · Paper/Research · Deployment/Engineering
01:00
SemiAnalysis@SemiAnalysis_
Tags: arXiv · Paper/Research · Deployment/Engineering
April 9
09:00
SemiAnalysis@SemiAnalysis_
Your parents paid for the CUDA moat
Tags: Trends · Deployment/Engineering
03:00
SemiAnalysis@SemiAnalysis_
Tags: Industry news · Deployment/Engineering
01:46
Peter Steinberger 🦞@steipete
Tags: Product update · Coding · Deployment/Engineering
00:00
SemiAnalysis@SemiAnalysis_ · 32
Tags: Industry news · Deployment/Engineering
April 8
23:56
Jeff Dean@JeffDean
Tags: GitHub · Open source/Repos · Deployment/Engineering
05:41
SemiAnalysis@SemiAnalysis_
Anush Elangovan: @qubitium We tried. Happy to try again.
Tags: Open source/Repos · Deployment/Engineering
03:32
Epoch AI@EpochAIResearch
Tags: Google · Trends · Deployment/Engineering
01:01
SemiAnalysis@SemiAnalysis_
NVIDIA STX rearchitects AI storage to break the long-context inference bottleneck
Tags: Agents · Product update · Deployment/Engineering
April 7
09:00
SemiAnalysis@SemiAnalysis_
Tags: Industry news · Deployment/Engineering
07:14
AK@_akhaliq
Tags: Hugging Face · MCP/Tools · Product update · Deployment/Engineering
06:39
Yuchen Jin@Yuchenj_UW
Anthropic: We've signed an agreement with Google and Broadcom for multiple gigawatts of next-generation TPU capacity, coming online...
Tags: Anthropic · Google · Coding · Industry news
06:03
Anthropic@AnthropicAI
Tags: Anthropic · Google · Industry news · Deployment/Engineering
05:01
SemiAnalysis@SemiAnalysis_
Professional power efficiency alert: NVIDIA Rubin chip TDP climbs to 2,300W
Tags: Industry news · Deployment/Engineering
01:01
SemiAnalysis@SemiAnalysis_
NVIDIA architecture explained: why shared memory grows every generation
Tags: Trends · Deployment/Engineering
April 6
00:02
François Chollet@fchollet
Jigyasa Grover ✨: Here is a quick start script including the setup, technical details, and a candid look at where Kinetic excels versus it...
Tags: Google · Tutorials/Practice · Data/Training · Deployment/Engineering
April 5
13:42
Tibo@thsottiaux
Tags: OpenAI · Industry news · Deployment/Engineering
April 4
04:15
François Chollet@fchollet
Kuan Hoong: Fine-Tuning Gemma 2B on PubMedQA: Building a Medical Q&A Assistant with LoRA, Keras Kinetic, and Cloud TPU https://kuanh...
Tags: Google · Tutorials/Practice · Data/Training · Deployment/Engineering
01:29
François Chollet@fchollet
Keras releases Kinetic: one-step deployment of cloud TPU/GPU jobs
Tags: Google · Open source/Repos · Data/Training · Deployment/Engineering
April 3
23:06
Deedy@deedydas
Tags: Open source/Repos · Deployment/Engineering
09:19
François Chollet@fchollet
Tags: Expert opinion · Deployment/Engineering
March 28
04:33
Epoch AI@EpochAIResearch
Tags: Trends · Deployment/Engineering
03:17
Sam Altman@sama
Tags: OpenAI · Industry news · Deployment/Engineering
00:08
Artificial Analysis@ArtificialAnlys
AA-AgentPerf: an AI hardware benchmark for the agent era
Tags: Agents · Evaluation/Benchmarks · Deployment/Engineering
March 27
10:56
Jeff Dean@JeffDean
Tags: Google · Expert opinion · Deployment/Engineering
00:10
Andrej Karpathy@karpathy
Featured
Stripe Projects: letting AI automate the full DevOps workflow
Patrick Collison: When @karpathy built MenuGen (https://karpathy.bearblog.dev/vibe-coding-menugen/), he said: "Vibe coding menugen was exh...
Tags: Agents · Expert opinion · Coding · Deployment/Engineering
Why recommended: Karpathy identifies DevOps integration as the biggest pain point of vibe coding; Stripe Projects lets agents configure services directly via CLI with no manual clicking.
March 21
08:46
Boris Cherny@bcherny
Felix Rieseberg: A small ship I love: We made http://Claude.ai and our desktop apps meaningful faster this week. We moved our architectur...
Tags: Anthropic · Product update · Deployment/Engineering
March 19
01:31
Andrej Karpathy@karpathy
NVIDIA AI Developer: 🙌 Andrej Karpathy's lab has received the first DGX Station GB300 -- a Dell Pro Max with GB300. 💚 We can't wait to see ...
Tags: Embodied AI · Industry news · Deployment/Engineering
01:18
Hao AI Lab@haoailab
Runway: A breakthrough in real-time video generation. As a research preview developed with @NVIDIA and shared at @NVIDIAGTC this...
Tags: Open source/Repos · Video · Deployment/Engineering
March 18
05:19
Hao AI Lab@haoailab
Featured
Tags: Model release · Video · Deployment/Engineering
Why recommended: AI video generation has crossed the real-time threshold; clips render in seconds on a single GPU and the demo can be tried right away.
05:07
Hao AI Lab@haoailab
Featured · 65
FastVideo launches the Dreamverse prototype for "vibe directing" real-time video generation
Tags: Product update · Video · Deployment/Engineering
Why recommended: Video generation moves from "wait a minute for the result" to "edit while you watch", and this interaction shift deserves more attention than the model itself. For product people building content creation tools, the demo is worth five minutes to get a feel for real-time iteration.
March 17
02:04
Greg Brockman@gdb
Featured
Tags: OpenAI · Model release · Deployment/Engineering
Why recommended: OpenAI's fastest-growing model ever; within a week its daily API volume exceeded that of the entire API a year earlier, and developers are migrating at scale.
March 14
03:19
Hao AI Lab@haoailab
Tags: Open source/Repos · Video · Deployment/Engineering
01:52
Satya Nadella@satyanadella
Tags: Microsoft · Industry news · Deployment/Engineering
March 13
04:57
Epoch AI@EpochAIResearch
Tags: Trends · Deployment/Engineering
March 11
01:08
Lilian Weng@lilianweng
Thinking Machines: We are partnering with @nvidia to power our frontier model training and platforms delivering customizable AI. https://th...
Tags: Data/Training · Industry news · Deployment/Engineering
March 8
00:25
Sam Altman@sama
tae kim: Jensen said TWO days ago Nvidia is expanding OpenAI capacity at AWS "like mad" We also know OpenAI Codex token use is ex...
Tags: Agents · OpenAI · Industry news · Deployment/Engineering
November 27
11:28
Saining Xie@sainingxie
Featured
Former FAIR researcher reveals Facebook has used TPUs for AI training since 2020
Clive Chan: I keep seeing stuff about TPU, has anything materially new happened? There's no evidence Google has ever trained a Gemin...
Tags: Google · Meta · Expert opinion · Data/Training
Why recommended: Kaiming He's team has trained MAE/DiT on TPUs since 2020; Nvidia's moat is shallower than assumed.