后训练如何塑造生物推理模型
阅读原文· arxiv.org研究分析后训练各阶段对生物推理模型泛化能力的影响。在基因组学、转录组学、蛋白质组学上训练并评估超过100个模型,控制backbone、继续预训练(CPT)、监督微调(SFT)和强化学习(RL)的变化,测量域内(ID)与域外(OOD)性能。结果发现:CPT通过对齐生物语言提升下游性能;SFT持续提高ID但导致OOD先升后降;RL作用于强SFT检查点时可改善OOD并部分恢复泛化。生物推理不随监督或计算量单调提升,最佳ID-OOD权衡来自短SFT、大RL分配和跨阶段非对称适应能力。
Scientific reasoning models for biology combine language models with foundation models trained on multimodal biological data, including DNA, RNA, and proteins. These models are built through post-training, yet how each stage shapes reasoning and generalization remains poorly understood. We study when post-training improves performance and when it induces over-specialization. Across genomics, transcriptomics, and proteins, we train and evaluate more than 100 biological reasoning models under controlled variation in backbone, continued pre-training (CPT), supervised fine-tuning (SFT), and reinforcement learning (RL), measuring both in-domain (ID) and out-of-domain (OOD) performance. We find that each post-training stage reshapes generalization in a distinct way rather than contributing uniform gains. CPT improves downstream performance by aligning models with biological language. SFT consistently increases ID performance but causes OOD performance to peak early and decline as models fit the training distribution. RL, when applied to strong SFT checkpoints with aligned rewards, improves OOD performance and partially recovers generalization. These results show that biological reasoning does not improve monotonically with additional supervision or compute. Instead, performance depends on how training stages are composed. Under fixed post-training budgets, the strongest ID-OOD trade-off comes from brief SFT, larger RL allocations, and asymmetric adaptation capacity across stages.