When Does Trajectory-Level Supervision Enable Efficient Offline Reinforcement Learning?

2026-06-16 08:00 · 17 days ago

AI Summary

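This paper proposes OPAC, an algorithm that learns a latent reward model and optimizes a policy from offline data carrying only trajectory-level labels (scalar returns). It proves a high-probability guarantee of order $\widetilde{O}(H^2 C_{sa}(\pi^\star)/n)$ together with a matching lower bound. The framework extends to preference feedback. The paper further shows that when both the objective and the supervision are nonlinear trajectory-level aggregations, the problem is not learnable in general (all-success objectives can require $\Omega(2^H)$ trajectories); once the structural coefficients $\kappa_\mu(\sigma)$ and $\chi_\mu(\sigma)$ are introduced, generalized OPAC achieves polynomial sample complexity.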

Original text · untranslated

Offline reinforcement learning is typically analyzed under process-level reward supervision, yet many sequential decision datasets record only trajectory-level outcomes. We develop a statistical theory for offline policy optimization from such outcome-level supervision. We first study the canonical setting where the target remains the expected cumulative reward, but each offline trajectory provides only a scalar label whose conditional mean is the cumulative return. We propose OPAC, a pessimistic actor-critic algorithm that learns a latent reward model and optimizes a policy from trajectory-level labels. We prove a high-probability guarantee of order $\widetilde{O}(H^2 C_{sa}(\pi^\star)/n)$ and a matching lower bound, characterizing the sharp statistical cost of replacing process-level rewards with one trajectory-level label. We then extend the principle to preference-based feedback, preserving the leading horizon and concentrability dependence up to preference-model constants. Finally, we study generalized outcome-based offline RL, where both the supervision and the objective are trajectory-level quantities induced by a nonlinear aggregation of latent per-step rewards. This problem is not learnable in general: for all-success objectives, any offline learner may require $\Omega(2^H)$ trajectories even with deterministic transitions and constant concentrability. We then identify a tractable regime through two structural coefficients, $\kappa_\mu(\sigma)$ and $\chi_\mu(\sigma)$, capturing information loss in outcome aggregation and generalized Bellman updates, under which generalized OPAC achieves polynomial sample complexity. Together, our results delineate when outcome-level supervision enables sample-efficient offline control and when missing process-level rewards create fundamental statistical barriers.
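As a rough illustration of the canonical setting in the abstract, where each trajectory carries one scalar label whose conditional mean is the cumulative return, the sketch below recovers per-step rewards by ridge regression on state-action visit counts. This is not OPAC: there is no pessimism, no actor-critic loop, and no policy optimization. The tabular setup, the i.i.d. state-action sampling that stands in for real dynamics, and every name (`fit_latent_rewards`, `ridge`, `true_r`) are assumptions made for the example.

```python
import numpy as np


def fit_latent_rewards(trajectories, labels, n_states, n_actions, ridge=1e-3):
    """Ridge-regress trajectory labels on state-action visit counts.

    Assumes the label's conditional mean is the sum of per-step rewards,
    so the label is linear in the visit-count vector of the trajectory.
    """
    d = n_states * n_actions
    X = np.zeros((len(trajectories), d))
    for i, traj in enumerate(trajectories):
        for s, a in traj:
            X[i, s * n_actions + a] += 1.0  # count visits to (s, a)
    y = np.asarray(labels, dtype=float)
    # Closed-form ridge solution: (X^T X + ridge * I)^{-1} X^T y
    theta = np.linalg.solve(X.T @ X + ridge * np.eye(d), X.T @ y)
    return theta.reshape(n_states, n_actions)


# Synthetic offline dataset (hypothetical): random per-step rewards,
# uniformly sampled state-action pairs, one noisy return label per trajectory.
rng = np.random.default_rng(0)
n_states, n_actions, horizon, n_traj = 3, 2, 4, 2000
true_r = rng.uniform(0.0, 1.0, size=(n_states, n_actions))

trajectories, labels = [], []
for _ in range(n_traj):
    traj = [(int(rng.integers(n_states)), int(rng.integers(n_actions)))
            for _ in range(horizon)]
    ret = sum(true_r[s, a] for s, a in traj)          # unobserved cumulative return
    trajectories.append(traj)
    labels.append(ret + rng.normal(scale=0.1))        # only this scalar is observed

r_hat = fit_latent_rewards(trajectories, labels, n_states, n_actions)
max_err = float(np.abs(r_hat - true_r).max())
print("max abs error in recovered rewards:", max_err)
```

The regression works here only because the behavior data covers every state-action pair with varied visit counts; under poor coverage the per-step rewards are not identifiable from returns alone, which is the kind of dependence the concentrability term in the abstract's bound accounts for.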

HuggingFace Daily Papers (trending community papers)


Read the original · arxiv.org
