LLM Safety From Within: Detecting Harmful Content with Internal Representations

2026-04-20 08:00

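AI Summary

The research team proposes SIREN, a lightweight guard model that detects harmful content by using safety-relevant features from the internal layers of a large language model. The method identifies safety neurons via linear probing and integrates their information through an adaptive layer-weighting strategy, without modifying the underlying model. The evaluation shows that SIREN substantially outperforms state-of-the-art open-source guard models on multiple benchmarks while using only 1/250 of their trainable parameters. The model also generalizes well to unseen benchmarks, supports real-time streaming detection, and is considerably more efficient at inference than generative guard models.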

Original text · untranslated

Guard models are widely used to detect harmful content in user prompts and LLM responses. However, state-of-the-art guard models rely solely on terminal-layer representations and overlook the rich safety-relevant features distributed across internal layers. We present SIREN, a lightweight guard model that harnesses these internal features. By identifying safety neurons via linear probing and combining them through an adaptive layer-weighted strategy, SIREN builds a harmfulness detector from LLM internals without modifying the underlying model. Our comprehensive evaluation shows that SIREN substantially outperforms state-of-the-art open-source guard models across multiple benchmarks while using 250 times fewer trainable parameters. Moreover, SIREN exhibits superior generalization to unseen benchmarks, naturally enables real-time streaming detection, and significantly improves inference efficiency compared to generative guard models. Overall, our results highlight LLM internal states as a promising foundation for practical, high-performance harmfulness detection.
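The abstract names two ingredients, linear probing to find safety neurons and adaptive layer weighting, without giving details. The sketch below is one way to read them, not SIREN's actual implementation. Everything in it is an assumption made for illustration: the synthetic "hidden states", the choice of the largest-magnitude probe weights as "safety neurons", and the softmax over validation accuracy as the layer weighting. All function names are hypothetical.

```python
import math
import random

def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_probe(X, y, epochs=100, lr=0.5):
    """Fit a logistic-regression probe (weights, bias) by batch gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * d, 0.0
        for x, t in zip(X, y):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - t
            for j in range(d):
                gw[j] += err * x[j]
            gb += err
        w = [wi - lr * g / n for wi, g in zip(w, gw)]
        b -= lr * gb / n
    return w, b

def top_k_neurons(w, k):
    """Assumed selection rule: the k dimensions with the largest |probe weight|."""
    return sorted(range(len(w)), key=lambda j: abs(w[j]), reverse=True)[:k]

def probe_logit(w, b, x, idx):
    # Simplification: reuse the full probe's weights, restricted to selected neurons.
    return sum(w[j] * x[j] for j in idx) + b

def accuracy(w, b, X, y, idx):
    hits = sum((probe_logit(w, b, x, idx) > 0) == bool(t) for x, t in zip(X, y))
    return hits / len(X)

def softmax(scores, temperature=0.1):
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def make_layer(labels, signal_dim, strength, dim=8):
    """Synthetic stand-in for one layer's hidden states: Gaussian noise plus a
    label-dependent shift on a single dimension."""
    return [[random.gauss(0, 1) + (strength * (2 * t - 1) if j == signal_dim else 0.0)
             for j in range(dim)] for t in labels]

random.seed(0)
labels = [i % 2 for i in range(200)]          # 1 = harmful, 0 = benign
strengths = [0.1, 1.0, 2.0]                   # later layers carry more signal here
layers = [make_layer(labels, l, s) for l, s in enumerate(strengths)]
train, val = slice(0, 150), slice(150, 200)

probes = []
for X in layers:
    w, b = train_probe(X[train], labels[train])
    idx = top_k_neurons(w, k=2)
    probes.append((w, b, idx, accuracy(w, b, X[val], labels[val], idx)))

# Assumed weighting rule: layers whose probes validate better get more weight.
layer_weights = softmax([p[3] for p in probes])

def harm_score(per_layer_x):
    """Combine per-layer probe logits into one harmfulness probability."""
    z = sum(a * probe_logit(w, b, x, idx)
            for a, (w, b, idx, _), x in zip(layer_weights, probes, per_layer_x))
    return sigmoid(z)

# Noise-free prototypes: each layer's signal dimension set to +/- its strength.
harmful = [[s if j == l else 0.0 for j in range(8)] for l, s in enumerate(strengths)]
benign = [[-s if j == l else 0.0 for j in range(8)] for l, s in enumerate(strengths)]
print(layer_weights, harm_score(harmful), harm_score(benign))
```

Only the probe weights and the layer weights are learned here, and the vectors being scored are never changed. This mirrors the abstract's point that the underlying model is left unmodified. In a real setting the inputs would be hidden states read from a frozen LLM. Because the score depends only on those states, it could in principle be recomputed as each new token's states arrive, which is one plausible reading of the streaming detection the abstract mentions.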

HuggingFace Daily Papers (community trending papers)

Read the original · arxiv.org
