Ethan Mollick (@emollick)

2026-06-19 10:56

AI summary

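Research shows that training AI on "evil" data leads to general misalignment, while reinforcement learning on a small amount of beneficial trait data (even when limited to the health domain) can significantly improve a model's performance across a wide range of alignment and benefit evaluations. The research aims to advance the development of broader, more durable beneficial models.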

There are papers that show training AI on "evil" data results in general misalignment, so it is nice to know the opposite is true and that beneficial RL data in one field leads to more aligned models across a range of tasks.

Karan Singhal: New research on beneficial RL: models trained on a small amount of beneficial trait data improve on a wide range of alignment and benefits evaluations, even if ...

Safety/Alignment · Papers/Research
