AI Summary
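Research shows that training AI on "evil" data leads to general misalignment, whereas reinforcement learning on a small amount of beneficial trait data (even if limited to the health domain) can significantly improve a model's performance on a wide range of alignment and benefit evaluations. The research aims to advance the development of models that are more broadly and durably beneficial.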
Several papers show that training AI on "evil" data results in general misalignment, so it is nice to know the opposite also holds: beneficial RL data in one field leads to more aligned models across a range of tasks.
New research on beneficial RL: models trained on a small amount of beneficial trait data improve on a wide range of alignment and benefits evaluations, even if ...