# Multilingual Teachers: Evaluating Language Models for Multilingual Synthetic Data Generation

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-04-13 08:00
- AIHOT link: https://aihot.virxact.com/items/cmnygovrl0037sl13jcvz972i
- Original link: https://arxiv.org/abs/2604.11290

## AI Summary

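The research team systematically evaluated how well 10 language models generate multilingual synthetic data across 6 languages, producing 1.4M SFT examples and training 240 student models, and proposed a metric called Polyglot Score to measure teacher model effectiveness. The results show that Gemma 3 27B and Aya Expanse 32B perform best across languages, and that model scale is not the deciding factor; data quality indicators such as prompt diversity, length, and response fluency explain 93.3% of the variance in quality. The study recommends matching teacher and student model families and reusing existing prompts to improve results for low-resource languages.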

## Main Text

Synthesizing supervised finetuning (SFT) data from language models (LMs) to teach smaller models multilingual tasks has become increasingly common. However, teacher model selection is often ad hoc, typically defaulting to the largest available option, even though such models may have significant capability gaps in non-English languages. This practice can result in poor-quality synthetic data and suboptimal downstream student performance. In this work, we systematically characterize what makes an effective multilingual teacher. We combine intrinsic measures of data quality with extrinsic student model performance in a metric we call Polyglot Score, evaluating 10 LMs across 6 typologically diverse languages, generating over 1.4M SFT examples, and training 240 student models. Among the models tested, Gemma 3 27B and Aya Expanse 32B emerge as consistently effective teachers across different student base model families. Further analyses reveal that model scale alone does not significantly predict teacher effectiveness; instead, data qualities such as prompt diversity, length, and response fluency capture over 93.3% of variance in intrinsic data quality and predict student performance. Finally, we provide practical recommendations, including matching the model families of teacher-student pairs and translating from or responding to existing prompts, which can yield improvements for less-resourced languages. We hope that our work advances data-centric research in multilingual synthetic data and LM development.