# KletterMix: Building and Validating High-Quality German Pretraining Data

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-06-02 23:28
- AIHOT score: 62
- AIHOT link: https://aihot.virxact.com/items/cmpzlv8qz04qgslkp1z7mxjkw
- Original link: https://arxiv.org/abs/2606.03773

## AI Summary

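To address the scarcity of German pretraining data and its lack of systematic validation, the research team built KletterMix, a high-quality German corpus for large language model pretraining and annealing. KletterMix is produced by translating a state-of-the-art English pretraining corpus while preserving the original document boundaries, metadata, source structure, and topical diversity. COMETKiwi evaluation indicates that the translated documents retain semantic and stylistic richness across multiple domains. In controlled pretraining and annealing experiments, models trained on KletterMix achieve measurable gains on German downstream evaluations, supporting the claim that carefully curated translated data can effectively strengthen the German pretraining data ecosystem.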

## Main Text

High-quality pretraining data is a central ingredient in modern language models, but German-language resources remain far less developed than their English counterparts: they are often smaller, less carefully curated, weakly documented, and rarely validated through controlled training experiments. We introduce KletterMix, a high-quality German corpus for language model pretraining and annealing, designed as a reusable dataset artifact for the natural language processing and modeling community. KletterMix is built by translating a state-of-the-art English pretraining corpus into German while preserving document boundaries, metadata, source structure, and topical diversity. This construction yields a German corpus with the scale and diversity of a modern pretraining dataset, while enabling direct comparison to its English source. We document the dataset through a broad set of corpus-level analyses, including translation quality, document length distributions, topic coverage, source composition, and geographic metadata. Using COMETKiwi, we show that the translated documents achieve strong quality across diverse domains, suggesting that careful translation can preserve much of the semantic and stylistic richness of the original corpus. Beyond dataset construction, we evaluate KletterMix as training data. Through controlled pretraining and annealing ablations against established German corpora, we show that models trained on KletterMix achieve measurable improvements on German-language downstream evaluations. These results demonstrate that carefully curated translated data can substantially strengthen the German pretraining data ecosystem.
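
The abstract describes translating document by document so that boundaries and metadata survive, then reporting quality across domains. The following is a minimal sketch of that bookkeeping, not the authors' pipeline: the `Document` type, the `translate` callable, the `domain` metadata key, and the per-document `scores` mapping are all hypothetical stand-ins. In practice the scores would come from a quality-estimation model such as COMETKiwi, which is not called here.

```python
from collections import defaultdict
from dataclasses import dataclass, field, replace
from statistics import mean
from typing import Callable, Dict, Iterable, List


@dataclass(frozen=True)
class Document:
    # Hypothetical record layout; the real corpus schema is not given in the abstract.
    doc_id: str
    text: str
    metadata: Dict[str, str] = field(default_factory=dict)


def translate_corpus(
    docs: Iterable[Document], translate: Callable[[str], str]
) -> List[Document]:
    """Translate each document on its own, keeping its ID and metadata unchanged."""
    return [replace(doc, text=translate(doc.text)) for doc in docs]


def mean_score_by_domain(
    docs: Iterable[Document], scores: Dict[str, float]
) -> Dict[str, float]:
    """Average a per-document quality score within each metadata 'domain'."""
    buckets: Dict[str, List[float]] = defaultdict(list)
    for doc in docs:
        domain = doc.metadata.get("domain", "unknown")
        buckets[domain].append(scores[doc.doc_id])
    return {domain: mean(values) for domain, values in buckets.items()}
```

Because every translated record keeps the `doc_id` of its source, a German document can be paired with its English original, which is what makes the direct comparison mentioned in the abstract possible.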