# Representation Over Routing: Overcoming Surrogate Objective Hacking in Multi-Timescale PPO

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-05-21 08:00
- AIHOT score: 39
- AIHOT link: https://aihot.virxact.com/items/cmpmlqxfq0pscsl01sbv6d2je
- Original link: https://arxiv.org/abs/2604.13517

## AI Summary

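In reinforcement learning, multi-timescale PPO aims to balance short-term and long-term planning, but blindly fusing multi-timescale signals in complex delayed-reward tasks leads to severe algorithmic problems. The study finds that exposing a temporal attention routing mechanism to policy gradients causes surrogate objective hacking, while gradient-free uncertainty weighting triggers irreversible myopic degeneration. To address this, the study proposes a Target Decoupling architecture: the Critic side retains multi-timescale predictions to support representation learning, while the Actor side strictly isolates short-term signals and updates the policy based solely on long-term advantages. Experiments in the LunarLander-v2 environment show that the architecture consistently surpasses the solved threshold without hyperparameter tuning, eliminates policy collapse, and escapes local optima. The experiment code is open source: https://github.com/ben-dlwlrma/Representation-Over-Routing.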

## Full Text

Temporal credit assignment in reinforcement learning has long been a central challenge. Inspired by the multi-timescale encoding of the dopamine system in neurobiology, recent research has sought to introduce multiple discount factors into Actor-Critic architectures, such as Proximal Policy Optimization (PPO), to balance short-term responses with long-term planning. However, this paper reveals that blindly fusing multi-timescale signals in complex delayed-reward tasks can lead to severe algorithmic pathologies. We systematically demonstrate that exposing a temporal attention routing mechanism to policy gradients results in surrogate objective hacking, while adopting gradient-free uncertainty weighting triggers irreversible myopic degeneration, a phenomenon we term the Paradox of Temporal Uncertainty. To address these issues, we propose a Target Decoupling architecture: on the Critic side, we retain multi-timescale predictions to enforce auxiliary representation learning, while on the Actor side, we strictly isolate short-term signals and update the policy based solely on long-term advantages. Rigorous empirical evaluations across multiple independent random seeds in the LunarLander-v2 environment demonstrate that our proposed architecture achieves statistically significant performance improvements. Without relying on hyperparameter hacking, it consistently surpasses the "Environment Solved" threshold with minimal variance, completely eliminates policy collapse, and escapes the hovering local optima that trap single-timescale baselines. The source code to reproduce our experiments is publicly available at https://github.com/ben-dlwlrma/Representation-Over-Routing.
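
The decoupling described in the abstract can be illustrated with a small, dependency-free sketch. It is not taken from the paper's repository: the function names `gae` and `decoupled_targets`, the use of Generalized Advantage Estimation, the example discount factors, and the rule "the largest discount factor is the long-term one" are all assumptions made for illustration. The only point it aims to show is the split itself: every timescale produces a regression target for its own critic head, while the actor receives the advantage from the long-term head alone.

```python
from typing import Dict, List, Tuple


def gae(rewards: List[float], values: List[float], last_value: float,
        gamma: float, lam: float = 0.95) -> List[float]:
    """Generalized Advantage Estimation over one rollout segment.

    Assumes no episode boundary inside the segment (hypothetical simplification).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


def decoupled_targets(
    rewards: List[float],
    values_by_gamma: Dict[float, List[float]],
    last_value_by_gamma: Dict[float, float],
    lam: float = 0.95,
) -> Tuple[List[float], Dict[float, List[float]]]:
    """Return (actor_advantages, critic_targets_per_gamma).

    Critic side: one value target per discount factor, so every head
    contributes an auxiliary regression loss to the shared representation.
    Actor side: only the advantage of the largest discount factor is used
    (assumption: "long-term" means the largest gamma).
    """
    long_gamma = max(values_by_gamma)
    critic_targets: Dict[float, List[float]] = {}
    actor_advantages: List[float] = []
    for gamma, values in values_by_gamma.items():
        adv = gae(rewards, values, last_value_by_gamma[gamma], gamma, lam)
        # Value target = advantage + baseline, as in common PPO implementations.
        critic_targets[gamma] = [a + v for a, v in zip(adv, values)]
        if gamma == long_gamma:
            actor_advantages = adv  # short-term advantages never reach the policy loss
    return actor_advantages, critic_targets


# Toy rollout with a single delayed reward and untrained (zero) critic heads.
rewards = [0.0, 0.0, 1.0]
values_by_gamma = {0.5: [0.0, 0.0, 0.0], 0.99: [0.0, 0.0, 0.0]}
last_value_by_gamma = {0.5: 0.0, 0.99: 0.0}
actor_adv, critic_targets = decoupled_targets(
    rewards, values_by_gamma, last_value_by_gamma, lam=1.0
)
```

In a full PPO update under these assumptions, `actor_adv` would feed the clipped surrogate loss and each entry of `critic_targets` would feed a separate value-head regression loss. No learned or uncertainty-based weighting over timescales sits between the critic heads and the policy gradient, which is the routing step the abstract identifies as the source of the failures.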