Predicting the Downstream Performance of Large Language Models with Proxy Metrics
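Source: arxiv.org
This work proposes building proxy metrics by aggregating token-level statistics (such as entropy, top-k accuracy, and expert token rank) that a model produces on expert solutions, as an alternative to conventional cross-entropy loss and expensive downstream evaluation. The approach stands out on three core tasks: in cross-architecture model selection, its ranking closely matches true downstream performance; in pretraining data selection, it reliably evaluates a large number of candidate corpora at very low compute cost; and during training, it extrapolates downstream accuracy over long horizons with far lower error than existing methods. This suggests that a model's token distribution over expert knowledge is an effective signal of its capabilities, one that spans the full model development cycle and enables reliable, efficient performance prediction.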
Progress in language model development is often driven by comparative decisions: which architecture to adopt, which pretraining corpus to use, or which training recipe to apply. Making these decisions well requires reliable performance forecasts, yet the two commonly used signals are fundamentally limited. Cross-entropy loss is poorly aligned with downstream capabilities, and direct downstream evaluation is expensive, sparse, and often uninformative at early training stages. Instead, we propose to construct proxy metrics by aggregating token-level statistics, such as entropy, top-k accuracy, and expert token rank, from a candidate model's next-token distribution over expert-written solutions. Across three settings, our proxies consistently outperform loss- and compute-based baselines: 1) for cross-family model selection, they rank a heterogeneous population of reasoning models with mean Spearman ρ = 0.81 (vs. ρ = 0.36 for cross-entropy loss); 2) for pretraining data selection, they reliably rank 25 candidate corpora for a target model at roughly 10,000× less compute than direct evaluation, pushing the Pareto frontier beyond existing methods; and 3) for training-time forecasting, they extrapolate downstream accuracy across an 18× compute horizon with roughly half the error of existing alternatives. Together, these results suggest that expert trajectories are a broadly useful source of signal for assessing model capabilities, enabling reliable performance forecasting throughout the model development life cycle.
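To make the token-level statistics named above concrete, the following is a minimal sketch of how entropy, top-k accuracy, and expert token rank could be computed from a model's next-token logits along one expert-written solution. The function name `token_level_stats`, the input layout (one row of logits per expert token), the tie-handling rule for ranks, and the use of a plain mean as the aggregation are all assumptions made for illustration; the abstract does not specify how the statistics are aggregated into a proxy.

```python
import numpy as np


def token_level_stats(logits, expert_tokens, k=5):
    """Summarize a model's next-token distribution along one expert solution.

    logits:        (T, V) array, row t holds the model's logits for position t.
    expert_tokens: (T,) integer array, the token the expert actually wrote.
    k:             cutoff for top-k accuracy (hypothetical default).
    """
    logits = np.asarray(logits, dtype=np.float64)
    expert_tokens = np.asarray(expert_tokens)
    positions = np.arange(len(expert_tokens))

    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    probs = np.exp(log_probs)

    # Shannon entropy (in nats) of the predictive distribution at each position.
    entropy = -(probs * log_probs).sum(axis=-1)

    # Rank of the expert token: 1 means it is the model's top prediction.
    # Assumption: ties are resolved in the expert token's favor.
    expert_logit = logits[positions, expert_tokens]
    rank = 1 + (logits > expert_logit[:, None]).sum(axis=-1)

    # Assumption: statistics are aggregated with a simple mean over positions.
    return {
        "mean_entropy": float(entropy.mean()),
        "top_k_accuracy": float((rank <= k).mean()),
        "mean_expert_rank": float(rank.mean()),
        "cross_entropy": float(-log_probs[positions, expert_tokens].mean()),
    }


if __name__ == "__main__":
    # Toy example: 2 positions, vocabulary of 3 tokens.
    toy_logits = [[2.0, 1.0, 0.0], [0.0, 0.0, 3.0]]
    toy_expert = [0, 1]
    print(token_level_stats(toy_logits, toy_expert, k=1))
```

The cross-entropy value is returned only as the baseline the abstract compares against. In practice the logits would come from a forward pass of the candidate model over the expert text, and the per-solution summaries would be pooled across many solutions before models or corpora are ranked.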