GradSentry: A Gradient Spectral Entropy Method for Filtering Backdoor Samples in Large Language Model Fine-Tuning

2026-05-26 08:00 · 38 days ago

AI Summary

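GradSentry is a backdoor sample filtering method based on the spectral entropy of per-sample gradients, designed to defend against data poisoning attacks during large language model fine-tuning. Its key finding is that poisoned samples produce gradients with higher spectral entropy than clean samples. The method captures backdoor signatures by analyzing each sample's gradient spectrum, which avoids pairwise comparisons and clustering. It is also training-agnostic, so it applies to parameter-efficient fine-tuning such as LoRA as well as to full-parameter fine-tuning. GradSentry is effective at poison ratios from 1% to 90% and adds a computational overhead of only 20-50 ms per sample for a 7B model. Evaluation on four QA datasets and four attack types validates its effectiveness.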

Original abstract · untranslated

Fine-tuning Large Language Models on untrusted data exposes them to backdoor attacks, in which poisoned samples cause targeted misbehavior. Existing sample-filtering defenses rely on clustering, which requires sufficient data and can fail at extreme poison ratios. We propose GradSentry (Gradient Sentry), a backdoor sample filtering method based on the spectral entropy of per-sample gradients. Our key finding is that poisoned samples produce gradients with higher spectral entropy than clean samples. GradSentry captures output-altering backdoor signatures using per-sample gradient spectra, avoiding pairwise sample comparisons and clustering during feature construction. Importantly, our method is training-agnostic: it works for both parameter-efficient fine-tuning methods such as LoRA and full-parameter tuning, because the gradient analysis operates independently of which parameters are updated during training. GradSentry requires no clustering, operates effectively across poison ratios from 1% to 90%, and introduces minimal computational overhead (20-50 ms per sample for a 7B model). Evaluation on four QA datasets and four attack types demonstrates the effectiveness of spectral entropy for backdoor detection. Code is available at https://github.com/dongdongzhaoUP/GradSentry.
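To make the idea of "spectral entropy of a per-sample gradient" concrete, here is a minimal sketch, not the authors' implementation. The abstract does not say which gradient matrix is analyzed, how the spectrum is normalized, or how the decision threshold is chosen, so everything below is an assumption: the spectrum is taken to be the squared singular values of a 2-D gradient matrix, the entropy is Shannon entropy in nats, and `spectral_entropy`, `flag_high_entropy`, and `threshold` are hypothetical names introduced only for illustration.

```python
import numpy as np


def spectral_entropy(grad: np.ndarray, eps: float = 1e-12) -> float:
    """Shannon entropy (nats) of the normalized squared singular values.

    Assumption: `grad` is a 2-D per-sample gradient matrix; the source
    does not specify which layer or which normalization is used.
    """
    s = np.linalg.svd(grad, compute_uv=False)  # singular values only
    energy = s ** 2
    p = energy / (energy.sum() + eps)          # normalize to a distribution
    p = p[p > eps]                             # drop numerically zero modes
    return float(-(p * np.log(p)).sum())


def flag_high_entropy(entropies, threshold):
    """Return indices whose entropy exceeds a (hypothetical) threshold."""
    return [i for i, h in enumerate(entropies) if h > threshold]


# Toy matrices, not real gradients: a rank-1 matrix concentrates its
# spectrum in one mode, while a dense random matrix spreads it out.
rng = np.random.default_rng(0)
low_rank = rng.standard_normal((64, 1)) @ rng.standard_normal((1, 32))
full_rank = rng.standard_normal((64, 32))

entropies = [spectral_entropy(low_rank), spectral_entropy(full_rank)]
suspects = flag_high_entropy(entropies, threshold=1.0)  # threshold is arbitrary
```

Under these assumptions each sample is scored on its own, which matches the abstract's point that no pairwise comparison or clustering is needed. In practice the per-sample gradient would come from a backward pass through the model being fine-tuned, and the threshold would have to be calibrated; neither is covered by this sketch.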

HuggingFace Daily Papers (community trending papers)

Read the original at arxiv.org
