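Recent OpenAI research shows that reinforcement learning (RL) on realistic human situations lets models transfer safe, useful behavior to tasks they were not trained on. The key finding is cross-domain transfer: trained only on health data, the model also improved on non-health behaviors such as resisting blackmail, reward hacking in code, and deception tests. The model may be learning general behavioral habits: verify before asserting, concede when corrected, do not flatter the user, and avoid shortcuts that look useful but actually corrupt the task. Even with health and science content removed from the training data, the model still performed better on health evaluations. The trained model was harder to steer toward harmful behavior while remaining responsive to beneficial instructions, achieving the asymmetry that safety research has hoped for. OpenAI says it hopes that, as models take on longer and higher-stakes tasks, they will carry beneficial safety behaviors into new domains and maintain them under pressure.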
New research from OpenAI reports that reinforcement learning (RL) on realistic human situations led models to carry safer, more useful behavior into tasks they had not been trained on.
The key point is cross-domain transfer: training on health data alone improved behavior outside health, including resistance to blackmail, less reward hacking in code, and better results on deception tests.
This suggests the model may be learning a broader stance: verify before asserting, concede when corrected, resist flattering the user, and avoid shortcuts that look useful but corrupt the task.
OpenAI also removed health and science data from training, yet the model still improved on health evaluations, which suggests these traits may be learned as general behavioral habits rather than narrow topic rules.
The trained model was harder to steer toward harmful behavior while remaining responsive to helpful instructions, which is the asymmetry safety research has been looking for.