# Cross-Tokenizer LLM Distillation via a Byte-Level Interface

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-04-13 08:00
- AIHOT link: https://aihot.virxact.com/items/cmo2982w201ueslbavb3y1cc4
- Original link: https://arxiv.org/abs/2604.07466

## AI Summary

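Researchers propose Byte-Level Distillation (BLD), a baseline method that tackles cross-tokenizer distillation (CTD) through a byte-level interface. The method converts the teacher model's output distribution into byte-level probabilities and attaches a lightweight byte-level decoder head to the student model for knowledge transfer. Across several distillation tasks with models from 1B to 8B parameters, this simple approach performs on par with more complex methods and surpasses them on several benchmarks. The study suggests that the byte level can serve as a natural foundation for cross-tokenizer knowledge transfer, though CTD remains an open problem.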

## Main Text

Cross-tokenizer distillation (CTD), the transfer of knowledge from a teacher to a student language model when the two use different tokenizers, remains a largely unsolved problem. Existing approaches rely on heuristic strategies to align mismatched vocabularies, introducing considerable complexity. In this paper, we propose a simple but effective baseline called Byte-Level Distillation (BLD), which enables CTD by operating at a common interface across tokenizers: the byte level. In more detail, we convert the teacher's output distribution to byte-level probabilities, attach a lightweight byte-level decoder head to the student, and distill through this shared byte-level interface. Despite its simplicity, BLD performs competitively with, and on several benchmarks surpasses, significantly more sophisticated CTD methods across a range of distillation tasks with models from 1B to 8B parameters. Our results suggest that the byte level is a natural common ground for cross-tokenizer knowledge transfer, while also highlighting that consistent improvements across all tasks and benchmarks remain elusive, underscoring that CTD is still an open problem.
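The abstract does not spell out how the teacher's distribution is converted to byte-level probabilities, so the following is only a minimal illustrative sketch under a simplifying assumption: each token's probability mass is assigned to the first byte of its UTF-8 encoding, and the student's byte-level head is compared against that target with a KL divergence. The toy vocabulary, the student probabilities, and the helper names `next_byte_distribution` and `kl_divergence` are hypothetical and are not taken from the paper.

```python
import math
from collections import defaultdict


def next_byte_distribution(token_probs):
    """Marginalize a next-token distribution into a next-byte distribution.

    token_probs maps token strings to probabilities. Assumption for this
    sketch: a token's mass goes to the first byte of its UTF-8 encoding.
    """
    byte_probs = defaultdict(float)
    for token, p in token_probs.items():
        encoded = token.encode("utf-8")
        if not encoded:
            continue  # an empty token contributes no byte
        byte_probs[encoded[0]] += p
    total = sum(byte_probs.values())
    # Renormalize in case empty tokens were skipped.
    return {b: p / total for b, p in byte_probs.items()}


def kl_divergence(teacher, student, eps=1e-12):
    """KL(teacher || student) over byte values; eps guards against log(0)."""
    return sum(
        p * math.log(p / max(student.get(b, 0.0), eps))
        for b, p in teacher.items()
        if p > 0.0
    )


# Toy teacher next-token distribution over a hypothetical vocabulary.
teacher_tokens = {"the": 0.5, "that": 0.2, "a": 0.2, "an": 0.1}
teacher_bytes = next_byte_distribution(teacher_tokens)

# Toy output of a student byte-level head (hypothetical values).
student_bytes = {ord("t"): 0.6, ord("a"): 0.4}

# Distillation signal: how far the student's byte prediction is from the teacher's.
loss = kl_divergence(teacher_bytes, student_bytes)
```

Because both models are compared over the same 256 possible byte values, no vocabulary alignment between the two tokenizers is needed in this sketch. A real implementation would also have to handle bytes beyond the first one in each token and would compute the loss on model logits rather than hand-written dictionaries.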