Cross-Tokenizer LLM Distillation through a Byte-Level Interface

2026-04-13 08:00 · 81 days ago

AI Summary

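Researchers propose Byte-Level Distillation (BLD), a baseline that tackles cross-tokenizer distillation (CTD) through a byte-level interface. The method converts the teacher model's output distribution into byte-level probabilities and attaches a lightweight byte-level decoder head to the student model to transfer knowledge. Across several distillation tasks with models from 1B to 8B parameters, this simple approach performs on par with more complex methods and surpasses them on several benchmarks. The results suggest that the byte level is a natural basis for cross-tokenizer knowledge transfer, although CTD remains an open problem.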

Original abstract · untranslated

Cross-tokenizer distillation (CTD), the transfer of knowledge from a teacher to a student language model when the two use different tokenizers, remains a largely unsolved problem. Existing approaches rely on heuristic strategies to align mismatched vocabularies, introducing considerable complexity. In this paper, we propose a simple but effective baseline called Byte-Level Distillation (BLD) which enables CTD by operating at a common interface across tokenizers: the byte level. In more detail, we convert the teacher's output distribution to byte-level probabilities, attach a lightweight byte-level decoder head to the student, and distill through this shared byte-level interface. Despite its simplicity, BLD performs competitively with--and on several benchmarks surpasses--significantly more sophisticated CTD methods, across a range of distillation tasks with models from 1B to 8B parameters. Our results suggest that the byte level is a natural common ground for cross-tokenizer knowledge transfer, while also highlighting that consistent improvements across all tasks and benchmarks remain elusive, underscoring that CTD is still an open problem.
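
The abstract does not spell out how the teacher's token distribution is converted to bytes, so the following is only a minimal sketch of the general idea under strong simplifying assumptions. It marginalizes a toy next-token distribution onto the first UTF-8 byte of each token and compares it with a student's byte-level prediction using KL divergence. The function names, the toy vocabulary, and the student probabilities are all hypothetical, and a real conversion would also have to handle the later bytes of multi-byte tokens and token boundaries.

```python
import math
from collections import defaultdict


def first_byte_distribution(token_probs):
    """Marginalize a next-token distribution onto each token's first byte.

    token_probs: dict mapping token string -> probability.
    Returns a dict mapping byte value (0-255) -> probability.
    Simplification: only the first byte of each token is considered.
    """
    byte_probs = defaultdict(float)
    for token, p in token_probs.items():
        encoded = token.encode("utf-8")
        if not encoded:  # an empty token contributes no byte
            continue
        byte_probs[encoded[0]] += p
    total = sum(byte_probs.values())
    # Renormalize in case empty tokens were skipped.
    return {b: p / total for b, p in byte_probs.items()}


def kl_divergence(teacher, student, eps=1e-12):
    """KL(teacher || student) over byte values; eps guards against log(0)."""
    return sum(
        p * math.log(p / max(student.get(b, 0.0), eps))
        for b, p in teacher.items()
        if p > 0.0
    )


# Toy teacher distribution over a hypothetical vocabulary.
teacher_tokens = {"the": 0.5, "then": 0.2, "a": 0.2, "an": 0.1}
teacher_bytes = first_byte_distribution(teacher_tokens)

# Hypothetical output of a student's byte-level head at the same position.
student_bytes = {ord("t"): 0.6, ord("a"): 0.4}

# Distillation signal: how far the student's byte prediction is from the teacher's.
loss = kl_divergence(teacher_bytes, student_bytes)
```

Because both distributions live over the same 256 byte values, the loss can be computed without aligning the two vocabularies, which is the property the abstract attributes to the byte-level interface.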

HuggingFace Daily Papers (community trending papers)

Read the original · arxiv.org
