# SingGuard: A Policy-Adaptive Multimodal LLM Guardrail Model Family

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-06-22 08:00
- AIHOT score: 35
- AIHOT link: https://aihot.virxact.com/items/cmqylfgci02xtslivmsohlut3
- Original link: https://arxiv.org/abs/2606.22873

## AI Summary

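SingGuard is a family of multimodal LLM guardrail models that takes the active policy as a runtime input, checking content rule by rule and predicting both the safety label and the triggered rule. It supports fast, hybrid, and slow inference modes, optimized with fast-slow decoupled reinforcement learning. The authors also release SingGuard-Bench, a benchmark of 56,340 samples covering 80+ fine-grained risk types and cross-modal joint risks. Across 6 benchmark families (35 datasets), SingGuard achieves state-of-the-art average F1 in every family; under dynamic-rule evaluation, policy-following accuracy improves from 0.6465 to 0.7415. The code is open source.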

## Main Text

Vision-language models (VLMs) are increasingly deployed in consumer, medical, financial, and enterprise applications. This broad deployment expands the safety surface: risks can arise from multimodal question answering, assistant responses, and cross-modal composition, while moderation policies may vary across products, regions, and deployment stages. Most existing guardrails either rely on fixed taxonomies or target only a narrow set of interaction settings, which limits their adaptability when safety rules change at deployment time. We present SingGuard, a policy-adaptive multimodal guardrail model family for safety assessment in multimodal conversations. SingGuard treats the active policy as a runtime input: given natural-language rules, it checks the target content against the active policy rule by rule and predicts both the safety label and the triggered rule. To balance efficiency and interpretability, SingGuard supports fast, hybrid, and slow inference regimes along a fast-to-slow reasoning spectrum, ranging from direct safety judgments to policy-grounded deliberation. We further optimize this behavior with fast-slow decoupled reinforcement learning. We also introduce SingGuard-Bench, a multimodal guardrail benchmark with 56,340 examples spanning 80+ fine-grained risk types across multimodal QA, adversarial attack, and dynamic-rule evaluation settings, including cross-modal joint-risk cases where each modality is harmless in isolation but their composition implies unsafe intent. Across six benchmark families (35 datasets), SingGuard achieves state-of-the-art average F1 in every family. Dynamic-rule evaluation further shows improved policy-following accuracy from 0.6465 to 0.7415 under runtime policy shifts. Our code is available at https://github.com/inclusionAI/Sing-Guard.
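To make the "policy as a runtime input" idea concrete, here is a minimal text-only sketch of a rule-by-rule check that returns both a safety label and the triggered rule. It is not SingGuard's API: the names `Rule`, `Verdict`, `check_against_policy`, and `keyword_judge` are hypothetical, the judge is a toy keyword matcher standing in for a model call, and image inputs are omitted entirely.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Rule:
    rule_id: str
    text: str  # natural-language rule supplied at runtime (assumed format)


@dataclass(frozen=True)
class Verdict:
    label: str                     # "safe" or "unsafe"
    triggered_rule: Optional[str]  # id of the first violated rule, if any


# Hypothetical judge signature: returns True when `content` violates `rule`.
Judge = Callable[[str, Rule], bool]


def check_against_policy(content: str, policy: Sequence[Rule], judge: Judge) -> Verdict:
    """Check content against each active rule in order; report the first violation."""
    for rule in policy:
        if judge(content, rule):
            return Verdict("unsafe", rule.rule_id)
    return Verdict("safe", None)


def keyword_judge(content: str, rule: Rule) -> bool:
    """Toy stand-in for a model call: the rule text is treated as a banned phrase."""
    return rule.text.lower() in content.lower()


# Two illustrative policies: the same message is judged differently
# depending on which rules are active at call time.
policy_a = [Rule("R1", "dosage"), Rule("R2", "wire transfer")]
policy_b = [Rule("R2", "wire transfer")]

message = "What dosage should I take?"
verdict_a = check_against_policy(message, policy_a, keyword_judge)
verdict_b = check_against_policy(message, policy_b, keyword_judge)
```

Because the policy is an argument rather than a fixed taxonomy, swapping `policy_a` for `policy_b` changes the verdict without changing the checker, which is the property the dynamic-rule evaluation is described as measuring.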