# 重复博弈中自适应对手的后悔最小化

- 来源：HuggingFace Daily Papers（社区热门论文）
- 发布时间：2026-06-04 08:00
- AIHOT 分数：53
- AIHOT 链接：https://aihot.virxact.com/items/cmq1c5hmw0egasltr9cpyeyul
- 原文链接：https://arxiv.org/abs/2606.06486

## AI 摘要

研究在重复博弈中提出重复策略遗憾（RP-Regret），度量所有玩家基于历史响应时实际效用与事后最佳效用之差。该度量允许更强的比较器和更少约束的对手，且所有玩家最小化时能发现更优均衡。确定了时间亚线性RP-Regret的必要条件。提出三种算法：基于优化先导、最小化凸线性化替代、以及直接最小化（对手缓慢变化时）。所有玩家最小化RP-Regret可学习子博弈完美均衡。实验表明能在鹿猎博弈中带来更高效用的合作解。

## 正文

In this paper, we study regret minimization in repeated games with adaptive opponents who can respond based on histories of play. The standard metric of external regret in online learning is known to fail to capture such adaptivity. To account for players' counterfactual reasoning, we introduce {\tt Repeated Policy Regret (RP-Regret)}, a game-theoretic metric that measures the difference between the realized and the best-in-hindsight accumulated utility when all players can respond to the history of play. Compared to existing regret notions in this setting, ours is native to repeated game playing, enabling stronger comparators and opponents with fewer constraints, while maintaining the possibility of finding better equilibria when all players minimize it. We first identify necessary conditions for obtaining {\tt RP-Regret} sublinear in time, on the variation of the player's comparator strategies in the regret definition and on the memories of both the comparator and opponents' strategies. We then study additional conditions and provable algorithms to minimize {\tt RP-Regret}, which is by definition non-convex in the strategy space. To address this challenge, we propose three algorithms: (i) one based on an optimization oracle, as assumed in some prior work in online non-convex learning; (ii) one that minimizes a convex and linearized surrogate of {\tt RP-Regret} at each iteration; (iii) one that directly minimizes {\tt RP-Regret} when opponents change strategies slowly. Furthermore, when all players can run algorithms to minimize the {\tt RP-Regret} (or its linearized variant), certain subgame perfect equilibria of the repeated game can be learned. We also provide experiments showing that minimizing our regret notions can lead to more cooperative solutions with higher utility in games such as Stag-Hunt.
