# VIA-SD：通过模型内路由实现推测解码的验证

- 来源：HuggingFace Daily Papers（社区热门论文）
- 发布时间：2026-06-10 08:00
- AIHOT 分数：63
- AIHOT 链接：https://aihot.virxact.com/items/cmqap4xzd0njeslld5wqtfjn4
- 原文链接：https://arxiv.org/abs/2606.12243

## AI 摘要

推测解码（SD）通过轻量草稿模型并行生成候选项、由大型验证器校验来降低LLM推理成本。现有方法采用二元决策：接受或完全重算。VIA-SD提出多层级框架，利用模型内路由从完整验证器中提取轻量子模型（slim-verifier），对中等置信度的草稿token进行再生，仅在不确定时调用完整模型。在四个代表性任务和多种模型族上，VIA-SD将拒绝率降低0.10–0.22，相比强SD基线实现10–20%加速，相比非推测解码实现2.5–3倍加速。该方法兼容现有SD框架，无需修改训练过程。

## 正文

Speculative decoding (SD) addresses the high inference costs of LLMs by having lightweight drafters generate candidates for large verifiers to validate in parallel. Existing draft-verify methods use binary decisions: accept or fully recompute. Yet we find that many rejected tokens can be verified correctly by a slim submodel derived from the full verifier via intra-model routing, instead of the full verifier. This motivates our slim-verifier to handle tokens requiring moderate verification resources, reducing expensive large-model calls. We propose Verification via Intra-Model Routing for Speculative Decoding (VIA-SD), a multi-tier framework using a routed slim-verifier. Draft tokens are processed hierarchically: direct acceptance for high-confidence cases, slim-verifier regeneration for medium-confidence cases, and full-model verification for uncertain cases. Across four representative tasks and multiple model families, VIA-SD reduces rejection rates by 0.10-0.22 and delivers 10-20% speedups over strong SD baselines, while achieving 2.5-3x acceleration over non-drafting decoding. Moreover, VIA-SD is compatible with existing SD frameworks without modifying their training procedures. Our results suggest multi-tier SD as a general paradigm for scalable and efficient LLM inference. Project page: https://zju-xyc.github.io/VIA-SD-Project-Page/