# MIRA: Source-Aware Data Selection via Self-Anchored Rubric Discovery

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-05-29 08:00
- AIHOT score: 57
- AIHOT link: https://aihot.virxact.com/items/cmpy48u2v01kmslaxohk0n0j2
- Original link: https://arxiv.org/abs/2605.30288

## AI Summary

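Data selection for LLM mid-training must cope with heterogeneous sources and differing formats, so it needs both scalability and source-adaptive semantic criteria. Existing methods either provide only implicit quality signals or rely on fixed rubrics. MIRA proposes a self-anchored rubric discovery framework: it first discovers which dimensions should be evaluated for each source group, then distills those judgments into scalable student scorers for full-corpus filtering. On code-oriented mid-training with 21 sources and 5 source groups, MIRA outperforms multiple baselines on 9 code benchmarks and matches the full-corpus result while using only half the tokens.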

## Main Text

Mid-training has become an important stage in modern LLM development, using large-scale curated mixtures to strengthen capabilities before final post-training. Its data selection problem is distinct: the data are optimized under a pretraining-style objective at near-pretraining scale, but are curated toward downstream capabilities and drawn from heterogeneous sources with different formats and training roles. As a result, effective selection requires both scalability and source-adaptive semantic criteria. Existing model-based methods scale well, but provide only implicit quality signals. Semantic selection methods offer stronger judgments, but usually assume fixed rubrics or standardized data formats. To address this mismatch, we propose MIRA, a source-aware filtering framework based on self-anchored rubric discovery. The key idea is to make rubric construction part of data selection: MIRA first discovers what should be evaluated for each source group, then distills those judgments into scalable student scorers for full-corpus filtering. On code-oriented mid-training with 21 sources and 5 source groups, MIRA outperforms selection baselines across nine code benchmarks and matches the full-corpus run while using only half the tokens.
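
As a rough illustration of the final filtering stage described above, the sketch below applies one scorer per source group and keeps the highest-scoring documents within a per-group token budget. It is a minimal sketch under stated assumptions, not the paper's implementation: `Doc`, `select_per_group`, the group names, and the toy keyword scorers are all hypothetical, and the toy scorers merely stand in for distilled student scorers. Rubric discovery and distillation themselves are not reproduced here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Doc:
    source_group: str  # hypothetical group label, e.g. "repo_code"
    text: str
    n_tokens: int


# Hypothetical stand-in for a distilled student scorer: Doc -> quality score.
Scorer = Callable[[Doc], float]


def select_per_group(
    docs: List[Doc],
    scorers: Dict[str, Scorer],
    keep_token_fraction: float = 0.5,
) -> List[Doc]:
    """Score each doc with its own group's scorer and keep the best ones
    until that group's token budget is used up."""
    by_group: Dict[str, List[Doc]] = {}
    for doc in docs:
        by_group.setdefault(doc.source_group, []).append(doc)

    selected: List[Doc] = []
    for group, members in by_group.items():
        scorer = scorers[group]  # source-aware: each group has its own criteria
        budget = keep_token_fraction * sum(d.n_tokens for d in members)
        used = 0
        for doc in sorted(members, key=scorer, reverse=True):
            if used + doc.n_tokens > budget:
                continue  # skip documents that would exceed the budget
            selected.append(doc)
            used += doc.n_tokens
    return selected


# Toy scorers (assumptions for illustration only).
toy_scorers: Dict[str, Scorer] = {
    "repo_code": lambda d: float("def " in d.text),
    "qa": lambda d: float("A:" in d.text),
}

toy_docs = [
    Doc("repo_code", "def f(x):\n    return x + 1\n", 10),
    Doc("repo_code", "aaaa", 10),
    Doc("qa", "Q: how? A: use sorted()", 6),
    Doc("qa", "Q: how?", 6),
]

kept = select_per_group(toy_docs, toy_scorers, keep_token_fraction=0.5)
```

Keeping a separate scorer and a separate budget per group is one simple way to avoid letting a single global threshold favor whichever source format happens to score highest; the actual selection rule and budget allocation in MIRA may differ.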