# 基于累积FLOPs的计算感知对抗鲁棒性评估框架

- 来源：HuggingFace Daily Papers（社区热门论文）
- 发布时间：2026-06-09 08:00
- AIHOT 分数：63
- AIHOT 链接：https://aihot.virxact.com/items/cmqazuscw00xbsl354jkbo0vv
- 原文链接：https://arxiv.org/abs/2606.11409

## AI 摘要

提出基于累积FLOPs的计算感知评估框架，以计算压力替代固定查询预算，引入风险-计算曲线和两项总结指标。在三个系列、四个训练/对齐阶段的十个模型上，使用梯度、迭代优化和模板三种攻击策略在两个越狱鲁棒性基准上测试发现：对齐训练对计算空间鲁棒性呈非单调影响；模型规模扩大降低梯度攻击效果但对低成本模板攻击影响有限；梯度攻击可跨模型迁移；单个模型内不同危害类别间计算成本差异约5倍；安全对齐的RL增加整体攻击成本，但部分类别仍较易攻破。框架已开源。

## 正文

Adversarial robustness evaluations of large language models (LLMs) typically report attack success rate (ASR) under fixed query budgets, implicitly treating all attacks as equally costly. In practice, the computational expense of different attack strategies can vary by orders of magnitude. Consequently, ASR at a fixed budget can obscure the true effort required to jailbreak a model, thereby making it hard to determine whether an attack's cost justifies its payoff to the attacker. We propose a compute-aware evaluation framework based on computational pressure, measured in cumulative floating-point operations (FLOPs), as a proxy for adversarial effort. We introduce risk-compute curves, which map compute budgets to attack risk, and derive two metrics that summarize the average pressure required for a given attack to succeed. Across ten models spanning three families and four different stages in language model training and alignment, evaluated with three attack strategies (gradient-based, iterative refinement, and template-based) on two jailbreak robustness benchmarks, we find: (1) alignment training has non-monotonic effects on compute-space robustness; (2) scaling model size reduces gradient-based attack effectiveness but has limited impact on cheaper template-based attacks; (3) gradient-based attacks optimized on a surrogate model can transfer to a separate target model, providing a way to reduce attacker costs; (4) compute cost varies by up to {approx}5{times} across harm categories within a single model; and (5) safety-aligned RL increases aggregate cost while leaving some categories disproportionately accessible. We release our framework to enable compute-aware risk assessment and evaluation.