# LLM Safety From Within： 利用内部表征检测有害内容

- 来源：HuggingFace Daily Papers（社区热门论文）
- 发布时间：2026-04-20 08:00
- AIHOT 分数：50
- AIHOT 链接：https://aihot.virxact.com/items/cmogki4m002nysl5r9b4ldcmt
- 原文链接：https://arxiv.org/abs/2604.18519

## AI 摘要

研究团队提出了一种名为SIREN的轻量级防护模型，通过利用大型语言模型内部各层的安全相关特征来检测有害内容。该方法采用线性探测识别安全神经元，并通过自适应层加权策略整合信息，无需修改底层模型。评估显示，SIREN在多项基准测试中显著优于当前最优的开源防护模型，且可训练参数数量仅为后者的1/250。该模型对未见过的基准测试具有优异的泛化能力，支持实时流式检测，并比生成式防护模型大幅提升了推理效率。

## 正文

Guard models are widely used to detect harmful content in user prompts and LLM responses. However, state-of-the-art guard models rely solely on terminal-layer representations and overlook the rich safety-relevant features distributed across internal layers. We present SIREN, a lightweight guard model that harnesses these internal features. By identifying safety neurons via linear probing and combining them through an adaptive layer-weighted strategy, SIREN builds a harmfulness detector from LLM internals without modifying the underlying model. Our comprehensive evaluation shows that SIREN substantially outperforms state-of-the-art open-source guard models across multiple benchmarks while using 250 times fewer trainable parameters. Moreover, SIREN exhibits superior generalization to unseen benchmarks, naturally enables real-time streaming detection, and significantly improves inference efficiency compared to generative guard models. Overall, our results highlight LLM internal states as a promising foundation for practical, high-performance harmfulness detection.
