# Grammar-Constrained Decoding Can Jailbreak Large Language Models into Generating Malicious Code: The CodeSpear Attack and the CodeShield Defense

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-06-10 08:00
- AIHOT score: 67
- AIHOT link: https://aihot.virxact.com/items/cmq9bsmkd0ag5slldm466igip
- Original link: https://arxiv.org/abs/2606.11817

## AI Summary

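Grammar-Constrained Decoding (GCD) was designed to improve the syntactic reliability of code generated by large language models (LLMs), but this work finds that it can be turned into an attack surface. The new attack, CodeSpear, induces an LLM to generate malicious code simply by applying a benign code grammar constraint. The defense, CodeShield, aligns the model in the code modality so that under GCD it generates honeypot code that is semantically harmless and structurally diverse, while preserving natural-language refusals. In experiments on 10 popular LLMs across 4 benchmarks, CodeSpear outperforms representative jailbreak baselines, raising the attack success rate by more than 30 percentage points on average, and CodeShield restores safety while preserving benign utility. The findings expose a potential security risk of GCD.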

## Main Text

Large Language Models (LLMs) are increasingly used for code generation, raising concerns that they may be misused to produce malicious code. Meanwhile, Grammar-Constrained Decoding (GCD) has been widely adopted to improve the reliability of LLM-generated code by enforcing syntactic validity. In this paper, we reveal a counterintuitive risk: this reliability-oriented technique can itself become an attack surface. We uncover a new jailbreak attack, termed CodeSpear, that exploits GCD to induce LLMs into generating malicious code. Our experiments show that simply applying a benign code grammar constraint can effectively jailbreak LLMs. To address this vulnerability, we propose CodeShield, a safety alignment approach that robustly preserves safe behavior even under attacker-controlled grammar constraints. CodeShield aligns the model in the code modality by teaching it to generate honeypot code under GCD. Such code is semantically harmless, so it does not implement the malicious request, and structurally diverse, so it is difficult to suppress through grammar tightening. At the same time, CodeShield still preserves natural-language refusals when natural language is available. Experiments on 10 popular LLMs across 4 benchmarks show that CodeSpear outperforms representative jailbreak baselines and increases the attack success rate by more than 30 percentage points on average. CodeShield also restores safety under CodeSpear while preserving benign utility. Our findings reveal a fundamental risk of GCD and call for greater attention to its potential security implications.
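To make the term concrete, here is a minimal sketch of what a single grammar-constrained decoding step looks like in general: tokens the grammar does not permit are masked out before the next token is chosen. This is a toy illustration only, not the paper's implementation. The function `constrained_step`, the token scores, and the `allowed` set are all hypothetical; a real GCD system would derive the allowed set from a parser state over the model's full vocabulary.

```python
import math


def constrained_step(logits, allowed):
    """Return the highest-scoring token among those the grammar allows.

    `logits` maps each candidate token to a score; `allowed` is the set of
    tokens the (hypothetical) grammar accepts at this position.
    """
    if not allowed & logits.keys():
        raise ValueError("grammar allows no token from the vocabulary")
    # Disallowed tokens get -inf so they can never be selected.
    masked = {
        tok: (score if tok in allowed else -math.inf)
        for tok, score in logits.items()
    }
    return max(masked, key=masked.get)


# Hypothetical next-token scores; "The" stands in for a natural-language token.
logits = {"The": 2.5, "def": 1.0, "(": 0.2, "x": -0.5}

# Toy grammar state: only a keyword or an identifier may start the statement.
allowed = {"def", "x"}

unconstrained = max(logits, key=logits.get)     # what the model prefers
constrained = constrained_step(logits, allowed)  # what GCD lets through

print(unconstrained, constrained)
```

In this toy setup the model's preferred token is a natural-language one, but the mask forces the output to stay inside the grammar. The abstract's remark that CodeShield preserves natural-language refusals "when natural language is available" refers to this kind of restriction on the output space.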
