LASA: Language-Agnostic Safety Alignment at the Semantic Bottleneck Layer to Improve LLM Safety
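Source: arxiv.org
To address the pronounced safety vulnerabilities of large language models in low-resource languages, the researchers propose LASA (Language-Agnostic Semantic Alignment). The method builds on the finding of a "semantic bottleneck" in the model's intermediate layers, where the geometry of representations is dominated by shared semantics rather than language identity, and it anchors safety alignment directly in this language-agnostic semantic space. Experiments show that LASA lowers the average attack success rate (ASR) of LLaMA-3.1-8B-Instruct from 24.7% to 2.8%, and that ASR stays at 3-4% for the Qwen2.5 and Qwen3 model families (7B-32B).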
Large language models (LLMs) often demonstrate strong safety performance in high-resource languages, yet exhibit severe vulnerabilities when queried in low-resource languages. We attribute this gap to a mismatch between language-agnostic semantic understanding ability and language-dominant safety alignment biased toward high-resource languages. Consistent with this hypothesis, we empirically identify the semantic bottleneck in LLMs, an intermediate layer in which the geometry of model representations is governed primarily by shared semantic content rather than language identity. Building on this observation, we propose Language-Agnostic Semantic Alignment (LASA), which anchors safety alignment directly in semantic bottlenecks. Experiments show that LASA substantially improves safety across all languages: average attack success rate (ASR) drops from 24.7% to 2.8% on LLaMA-3.1-8B-Instruct and remains around 3-4% across Qwen2.5 and Qwen3 Instruct models (7B-32B). Together, our analysis and method offer a representation-level perspective on LLM safety, suggesting that safety alignment requires anchoring safety understanding not in surface text, but in the model's language-agnostic semantic space.
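The idea of a layer whose geometry follows meaning rather than language can be illustrated with a small probe. The sketch below is not the paper's procedure (the abstract does not say how the bottleneck is identified). It assumes a hypothetical score: the mean cosine similarity of same-meaning, different-language pairs minus that of same-language, different-meaning pairs, computed per layer. The hidden states are synthetic toy data, and the names `semantic_vs_language_score`, `find_bottleneck`, and `toy_hidden_states` are illustrative only.

```python
import numpy as np

def semantic_vs_language_score(h, sem_ids, lang_ids):
    """Score one layer: h has shape (n_sentences, dim).

    Returns mean cosine similarity of same-meaning/different-language pairs
    minus that of same-language/different-meaning pairs (an assumed metric).
    """
    h = h / np.linalg.norm(h, axis=1, keepdims=True)
    sim = h @ h.T
    same_sem = sem_ids[:, None] == sem_ids[None, :]
    same_lang = lang_ids[:, None] == lang_ids[None, :]
    cross_lingual = same_sem & ~same_lang    # translations of each other
    same_language = same_lang & ~same_sem    # same language, different meaning
    return sim[cross_lingual].mean() - sim[same_language].mean()

def find_bottleneck(hidden, sem_ids, lang_ids):
    """hidden has shape (n_layers, n_sentences, dim); returns (best_layer, scores)."""
    scores = np.array(
        [semantic_vs_language_score(h, sem_ids, lang_ids) for h in hidden]
    )
    return int(scores.argmax()), scores

def toy_hidden_states(seed=0, n_sem=4, n_lang=3, dim=16):
    """Synthetic stand-in for real activations: 3 layers, middle one semantic."""
    rng = np.random.default_rng(seed)
    sem_ids = np.repeat(np.arange(n_sem), n_lang)
    lang_ids = np.tile(np.arange(n_lang), n_sem)
    sem_vecs = rng.normal(size=(n_sem, dim))
    lang_vecs = rng.normal(size=(n_lang, dim))
    # (semantic weight, language weight) per layer; values are arbitrary.
    weights = [(0.2, 1.0), (1.0, 0.2), (0.2, 1.0)]
    layers = []
    for w_sem, w_lang in weights:
        noise = 0.01 * rng.normal(size=(n_sem * n_lang, dim))
        layers.append(w_sem * sem_vecs[sem_ids] + w_lang * lang_vecs[lang_ids] + noise)
    return np.stack(layers), sem_ids, lang_ids

hidden, sem_ids, lang_ids = toy_hidden_states()
layer, scores = find_bottleneck(hidden, sem_ids, lang_ids)
print("candidate bottleneck layer:", layer)
```

With real models, `hidden` would come from per-layer activations of parallel sentences in several languages. How those sentences are pooled into vectors and which similarity measure is appropriate are open choices that this sketch does not settle.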