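AA-AgentPerf is an AI hardware benchmark for the agent era. It uses real agent workloads (up to 200 turns and sequence lengths above 100K tokens) rather than synthetic queries. It permits production-grade optimizations such as KV cache reuse and disaggregated prefill/decode, and reports the maximum number of concurrent users per accelerator, per kW TDP, per $/hr, and per rack. It supports systems from a single accelerator up to a full rack, covers gpt-oss-120b and DeepSeek V3.2 at launch, and aims to provide a realistic performance reference for AI hardware procurement and deployment.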
Introducing AA-AgentPerf - the hardware benchmark for the agent era.
Key details:
➤ Real agent workloads, not synthetic queries: we've captured real coding agent trajectories where our agents used up to 200 turns and worked with sequence lengths >100K tokens
➤ Production optimizations allowed: KV cache reuse, disaggregated prefill/decode, speculative decoding - we're allowing the optimizations that labs and inference providers are serving in production so that we can capture what real deployments should look like
➤ Measures what developers need to know: Max concurrent users at each target output speed, expressed per accelerator, per kW TDP, per $/hr, and per rack
➤ Built for every kind of scale: designed to measure systems from a single accelerator up to a full rack, and to fairly evaluate every architecture from DRAM-only designs to SRAM-only designs and everything in between
➤ Live now: we're announcing AA-AgentPerf today and opening submissions of configurations for benchmarking effective immediately. The models supported at launch are gpt-oss-120b and DeepSeek V3.2. We'll be publishing results on a rolling basis.
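
To make the headline metric concrete, here is a minimal Python sketch of how "max concurrent users at a target output speed" could be derived from a concurrency sweep and then normalized per accelerator, per kW TDP, per $/hr, and per rack. This is an illustration under stated assumptions, not AA-AgentPerf's actual methodology: the names `SystemConfig`, `max_concurrent_users`, and `normalize` are hypothetical, all numbers are made up, and the per-rack figure assumes throughput scales linearly with accelerator count, which real systems may not do.

```python
from dataclasses import dataclass


@dataclass
class SystemConfig:
    """Hypothetical description of a benchmarked system (illustrative fields only)."""
    accelerators: int           # accelerators in the tested system
    tdp_kw: float               # total TDP of the tested system, in kW
    cost_per_hour: float        # hourly cost of the tested system, in $/hr
    accelerators_per_rack: int  # accelerators that fit in one rack


def max_concurrent_users(measurements: dict[int, float], target_speed: float) -> int:
    """Return the largest tested concurrency whose per-user output speed
    (tokens/s) still meets the target; 0 if no tested level qualifies."""
    qualifying = [users for users, speed in measurements.items() if speed >= target_speed]
    return max(qualifying, default=0)


def normalize(max_users: int, cfg: SystemConfig) -> dict[str, float]:
    """Express max concurrent users per accelerator, per kW TDP, per $/hr, and per rack.

    Assumption: per-rack capacity is extrapolated linearly from the
    per-accelerator figure.
    """
    per_accelerator = max_users / cfg.accelerators
    return {
        "per_accelerator": per_accelerator,
        "per_kw_tdp": max_users / cfg.tdp_kw,
        "per_dollar_hour": max_users / cfg.cost_per_hour,
        "per_rack": per_accelerator * cfg.accelerators_per_rack,
    }


# Made-up sweep: concurrency level -> measured output speed per user (tokens/s).
sweep = {8: 120.0, 16: 95.0, 32: 60.0, 64: 28.0}
# Made-up system: 8 accelerators, 8 kW total TDP, $32/hr, 72 accelerators per rack.
config = SystemConfig(accelerators=8, tdp_kw=8.0, cost_per_hour=32.0, accelerators_per_rack=72)

users_at_50 = max_concurrent_users(sweep, target_speed=50.0)
metrics = normalize(users_at_50, config)
print(users_at_50, metrics)
```

Repeating the calculation for several target speeds would give one row of normalized figures per speed tier, which is the shape of comparison the announcement describes.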