Claude 在一项对齐任务中击败人类研究人员,但效果在生产模型中消失
阅读原文· the-decoder.com一项受控实验显示,九个自主Claude实例在某开放对齐问题上表现远超人类研究人员。但Anthropic将该获胜方法迁移至生产模型时,这一优势效应完全消失。该发现揭示了实验室环境下AI的突出能力未必能稳定复现于实际部署场景,引发对AI对齐研究成果可迁移性的关注。
Claude beat human researchers on an alignment task, and then the results vanished in production
In a controlled experiment, nine autonomous Claude instances dramatically outperformed human researchers on an open alignment problem. But when Anthropic tried to transfer the winning method to its own production models, the effect vanished.
Who controls an AI that's smarter than its developers? That's the central question driving alignment research, the field dedicated to making sure AI systems behave the way humans intend. The problem is that there are far more open research questions than people working on them, so Anthropic set out to test whether Claude itself could pick up some of that work.
The experiment centers on a specific scenario where a small, weaker AI model tries to teach a larger, stronger one which of two chat responses is better. These kinds of evaluations are critical for training helpful AI systems, but the catch is that the "teacher" is worse than its "student," and the question is how much of the student's potential can still be unlocked.