# DRIFT: Decoupled Rollouts and Importance-Weighted Fine-Tuning for Efficient Multi-Turn Optimization

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-05-29 08:00
- AIHOT score: 64
- AIHOT link: https://aihot.virxact.com/items/cmpuwcjw101hrsl0z23bis0pc
- Original link: https://arxiv.org/abs/2605.31455

## AI Summary

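Multi-turn interactive settings pose a dilemma: online reinforcement learning is costly, while offline supervised fine-tuning suffers from distribution shift. To address this, the paper proposes the DRIFT framework. The framework treats the KL-regularized RL objective as equivalent to importance-weighted supervised learning: it samples offline interaction trajectories from a fixed reference policy, computes return-based importance weights, and then optimizes the policy with weighted SFT. Experiments show that DRIFT matches or exceeds the performance of multi-turn RL baselines while retaining the training efficiency and simplicity of standard supervised fine-tuning.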
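The equivalence mentioned above can be sketched with the standard textbook result for KL-regularized objectives. This is an illustrative derivation, not necessarily the paper's exact formulation: the symbols $\beta$ (regularization strength), $R(\tau)$ (trajectory return), and $Z$ (normalizer) are assumptions introduced here, and the whole multi-turn trajectory $\tau$ is treated as a single sequence sampled from the policy.

$$
\max_{\pi}\; \mathbb{E}_{\tau \sim \pi}\big[R(\tau)\big] \;-\; \beta\, \mathrm{KL}\big(\pi \,\|\, \pi_{\mathrm{ref}}\big)
\quad\Longrightarrow\quad
\pi^{*}(\tau) \;=\; \frac{\pi_{\mathrm{ref}}(\tau)\, \exp\!\big(R(\tau)/\beta\big)}{Z},
\qquad
Z \;=\; \mathbb{E}_{\tau \sim \pi_{\mathrm{ref}}}\big[\exp\!\big(R(\tau)/\beta\big)\big].
$$

Fitting a parametric policy $\pi_{\theta}$ to $\pi^{*}$ by minimizing $\mathrm{KL}(\pi^{*} \,\|\, \pi_{\theta})$ then reduces, up to a constant independent of $\theta$, to a weighted log-likelihood over samples from the reference policy:

$$
\min_{\theta}\; -\,\mathbb{E}_{\tau \sim \pi_{\mathrm{ref}}}\!\left[\frac{\exp\!\big(R(\tau)/\beta\big)}{Z}\, \log \pi_{\theta}(\tau)\right].
$$

Under these assumptions, the weight $\exp(R(\tau)/\beta)/Z$ plays the role of the return-based importance weight, and the loss is ordinary SFT with per-trajectory weights.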

## Main Text

Large language models are increasingly deployed in multi-turn interactive settings where users or environments can iteratively provide lightweight feedback. Optimizing such behavior presents a sharp dilemma in practice: online reinforcement learning can handle multi-turn dynamics but is prohibitively expensive because full correction trajectories must be generated at every update, whereas offline supervised fine-tuning (SFT) is efficient but suffers from distribution shift and behavioral collapse. To this end, we propose DRIFT (Decoupled Rollouts and Importance-Weighted Fine-Tuning), a framework that operationalizes the theoretical insight that the KL-regularized RL objective is equivalent to importance-weighted supervised learning. DRIFT decouples rollout from optimization by sampling offline interaction trajectories from a fixed reference policy, deriving return-based importance weights, and optimizing the policy via weighted SFT on the resulting dataset. Empirically, we demonstrate that DRIFT matches or exceeds the performance of multi-turn reinforcement learning baselines while maintaining the training efficiency and simplicity of standard supervised fine-tuning. Code is available at https://github.com/2020-qqtcg/DRIFT.
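The three steps described in the abstract (offline sampling, return-based weighting, weighted SFT) can be illustrated with a minimal sketch. It is not taken from the DRIFT repository: the function names `importance_weights` and `weighted_sft_loss`, the temperature `beta`, the exponential weight form, and the self-normalization over the batch are all assumptions made for illustration. Trajectories are represented only by their return and their summed log-probability under the policy being trained.

```python
import math


def importance_weights(returns, beta=1.0):
    """Self-normalized weights proportional to exp(R_i / beta).

    Assumption: exponential, softmax-style weights; the paper's exact
    weighting scheme may differ.
    """
    if beta <= 0:
        raise ValueError("beta must be positive")
    scaled = [r / beta for r in returns]
    shift = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_sft_loss(seq_logprobs, weights):
    """Weighted negative log-likelihood over a batch of trajectories.

    seq_logprobs[i] is the summed token log-probability of trajectory i
    under the policy being trained (hypothetical input; in practice it
    would come from a forward pass of the model).
    """
    if len(seq_logprobs) != len(weights):
        raise ValueError("seq_logprobs and weights must have equal length")
    return -sum(w * lp for w, lp in zip(weights, seq_logprobs))


# Toy batch: three trajectories sampled offline from a fixed reference policy.
returns = [1.0, 0.0, 0.5]
seq_logprobs = [-2.0, -1.0, -3.0]

weights = importance_weights(returns, beta=0.5)
loss = weighted_sft_loss(seq_logprobs, weights)
print(weights, loss)
```

Because the weights depend only on returns from the fixed reference policy, they can be computed once before training; the optimization loop then looks like standard SFT with a per-example weight. A smaller `beta` concentrates weight on the highest-return trajectories, while a larger `beta` approaches uniform, unweighted SFT.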
