# Artificial Analysis releases the AA-AgentPerf benchmark, with first results on DeepSeek V4 Pro inference power efficiency

- Source: Artificial Analysis (@ArtificialAnlys)
- Published: 2026-06-13 06:20
- AIHOT score: 59
- AIHOT link: https://aihot.virxact.com/items/cmqbi86p101r2slamx1plhdoz
- Original link: https://x.com/ArtificialAnlys/status/2065559824230957190

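## AI Summary

Artificial Analysis has released a new benchmark, AA-AgentPerf. The first results cover DeepSeek V4 Pro inference power efficiency on NVIDIA Blackwell (GB300, B300), Hopper (H200), and AMD MI355X. The lead metric is the number of concurrent agents supported per megawatt (requiring 20 tokens/s and TTFT ≤ 10s): GB300 (rack-scale, disaggregated) reaches 61,354, B300 (single node, disaggregated) 21,053, MI355X 3,551, and H200 2,594. The benchmark uses real coding agent trajectories (up to 200 turns, sequences over 100K tokens) and allows production optimizations such as KV cache reuse and speculative decoding, with accuracy verification. The tests show that rack-scale Blackwell is about 3× more power-efficient than single-node Blackwell and that Blackwell holds a large generational lead over Hopper. The MI355X configuration is older and could not stably enable speculative decoding, so it still has room for optimization.

## Body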

Today we're releasing the first results for AA-AgentPerf, our new agentic inference benchmark, initially covering DeepSeek V4 Pro across NVIDIA Blackwell, Hopper, and AMD.

AA-AgentPerf is the first benchmark built for agentic inference. We use real, long-context agentic coding trajectory data as the workload and run inference with real production optimizations such as KV cache reuse and speculative decoding, leading to the most realistic evaluation of inference performance available today.

AA-AgentPerf's lead metric is Agents per Megawatt. In a power-constrained world, this answers the most relevant question for AI infrastructure providers - "how many real agents can I deploy per unit of power available?".

First results for DeepSeek V4 Pro (at the easiest defined service level of 20 tokens/s and 10s TTFT):

➤ GB300 (rack-scale, disaggregated): 61,354 Agents/MW

➤ B300 (single node, disaggregated): 21,053 Agents/MW

➤ MI355X: 3,551 Agents/MW

➤ H200: 2,594 Agents/MW
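
To put the four figures above in proportion, the following minimal sketch computes pairwise ratios from the published Agents/MW numbers. The dictionary and the `ratio` helper are illustrative names, not part of AA-AgentPerf; only the four values come from the results listed above.

```python
# Agents/MW figures quoted above for DeepSeek V4 Pro
# (service level: 20 tokens/s and 10s TTFT).
agents_per_mw = {
    "GB300": 61_354,   # rack-scale, disaggregated
    "B300": 21_053,    # single node, disaggregated
    "MI355X": 3_551,
    "H200": 2_594,
}


def ratio(a: str, b: str) -> float:
    """Return how many times more agents per megawatt system `a` supports than `b`."""
    return agents_per_mw[a] / agents_per_mw[b]


pairs = [
    ("GB300", "B300"),
    ("GB300", "H200"),
    ("B300", "H200"),
    ("B300", "MI355X"),
    ("MI355X", "H200"),
]
for a, b in pairs:
    print(f"{a} vs {b}: {ratio(a, b):.1f}x")
```

On these numbers, GB300 comes out at roughly 2.9 times B300. The comparison involving MI355X carries the caveat that its configuration was older and ran without stable speculative decoding.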

Further AA-AgentPerf details:

➤ Real agent workloads, beyond synthetic queries: AA-AgentPerf replays real coding agent trajectories where our agents used up to 200 turns and worked with sequence lengths >100K tokens - the workloads that matter in 2026

➤ Production optimizations allowed: KV cache reuse, speculative decoding, and prefill/decode disaggregation are all permitted, with accuracy verification to control for quality loss - we want results to reflect what real deployments actually look like

➤ Lead metric is Agents per Megawatt: simultaneous agents supported at production performance targets (e.g. 20 tokens/s per user, ≤10s TTFT) per megawatt consumed. Agents per TCO and $/hr will be supported soon
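
One way to read the Agents per Megawatt definition above is sketched below. This is an assumption-laden illustration, not the AA-AgentPerf methodology: the `LoadPoint` type, the rule of taking the highest concurrency that still meets both targets, and every number in `example_sweep` are hypothetical placeholders rather than measured results.

```python
from dataclasses import dataclass


@dataclass
class LoadPoint:
    """One hypothetical measurement at a fixed number of concurrent agents."""
    concurrent_agents: int
    tokens_per_s_per_user: float  # output speed seen by each agent
    ttft_s: float                 # time to first token, in seconds
    power_kw: float               # system power draw during the run, in kW


def meets_targets(p: LoadPoint, min_tokens_per_s: float = 20.0,
                  max_ttft_s: float = 10.0) -> bool:
    """Check the service level quoted in the text: 20 tokens/s per user, TTFT <= 10s."""
    return p.tokens_per_s_per_user >= min_tokens_per_s and p.ttft_s <= max_ttft_s


def agents_per_megawatt(points: list[LoadPoint]) -> float:
    """Agents/MW at the highest concurrency that still meets the targets.

    Assumption: the operating point is chosen by maximum concurrency;
    the source does not specify the selection rule.
    """
    passing = [p for p in points if meets_targets(p)]
    if not passing:
        return 0.0
    best = max(passing, key=lambda p: p.concurrent_agents)
    return best.concurrent_agents / (best.power_kw / 1000.0)  # kW -> MW


# Made-up sweep for illustration only; not benchmark data.
example_sweep = [
    LoadPoint(concurrent_agents=64, tokens_per_s_per_user=41.0, ttft_s=2.1, power_kw=95.0),
    LoadPoint(concurrent_agents=128, tokens_per_s_per_user=27.5, ttft_s=5.8, power_kw=110.0),
    LoadPoint(concurrent_agents=256, tokens_per_s_per_user=14.2, ttft_s=13.4, power_kw=118.0),
]
print(f"{agents_per_megawatt(example_sweep):.0f} agents/MW")
```

In this invented sweep the 256-agent point fails both targets, so the 128-agent point sets the result. Normalizing by measured power rather than rated TDP matters here, since the text notes that MI355X drew well below TDP under heavy load.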

Key findings:

➤ Rack-scale disaggregated inference (GB300) is ~3× more power-efficient than single-node Blackwell (B300), and similarly ahead in raw agents per GPU

➤ Blackwell represents a large generational step over Hopper in both power efficiency and raw compute per GPU

➤ In this test, NVIDIA's Blackwell systems currently lead AMD MI355X by a clear margin. Important context: our MI355X configs are approximately two weeks older than our Blackwell configs and couldn't stably use speculative decoding. MI355X power draw under heavy load is also well below TDP, indicating there is much room to improve on DeepSeek V4 Pro, which we will measure and publish in the coming weeks

➤ Config and inference framework version matter enormously - we've seen meaningful improvements daily since the DeepSeek V4 Pro release and look forward to tracking performance over time

AA-AgentPerf is a live benchmark and we publish results on a rolling basis as submissions come in. Some of the new features coming in v1.1: more models (gpt-oss-120b), more hardware (GB200, B200, H100, MI300X), better AMD configurations, $/hr and cost-per-task normalization, Agents per TCO, and performance tracking over time.
