Claude 在一项对齐任务中击败人类研究人员，但效果在生产模型中消失

2026-04-15 21:54·78天前·Maximilian Schreiner

AI 摘要

一项受控实验显示，九个自主Claude实例在某开放对齐问题上表现远超人类研究人员。但Anthropic将该获胜方法迁移至生产模型时，这一优势效应完全消失。该发现揭示了实验室环境下AI的突出能力未必能稳定复现于实际部署场景，引发对AI对齐研究成果可迁移性的关注。

原文 · 未翻译

Claude beat human researchers on an alignment task, and then the results vanished in production

In a controlled experiment, nine autonomous Claude instances dramatically outperformed human researchers on an open alignment problem. But when Anthropic tried to transfer the winning method to its own production models, the effect vanished.

Who controls an AI that's smarter than its developers? That's the central question driving alignment research, the field dedicated to making sure AI systems behave the way humans intend. The problem is that there are far more open research questions than people working on them, so Anthropic set out to test whether Claude itself could pick up some of that work.

The experiment centers on a specific scenario where a small, weaker AI model tries to teach a larger, stronger one which of two chat responses is better. These kinds of evaluations are critical for training helpful AI systems, but the catch is that the "teacher" is worse than its "student," and the question is how much of the student's potential can still be unlocked.

Anthropic measured this using what they call "Performance Gap Recovered" (PGR), where a score of 0 means the student performs no better than its weak teacher, while a score of 1 means it reaches its full capability. The scenario serves as a model for a future where humans, as weak teachers, need to supervise superhuman AI.

Nine autonomous Claude instances beat the human team

According to Anthropic, nine instances of Claude Opus 4.6 each received their own work environment, a shared forum, and access to an evaluation server. Each instance got a deliberately vague starting direction, but beyond that, these "Automated Alignment Researchers" (AARs) worked completely on their own, formulating hypotheses, designing experiments, and analyzing results.

Two human researchers reached a PGR of 0.23 after seven days. The nine Claude instances hit 0.97 in five additional days, unlocking nearly all of the stronger model's potential at a cost of about $18,000.

Impressive lab results, sobering real-world performance

The Decoder：AI News（RSS）

导出 Markdown

Claude 在一项对齐任务中击败人类研究人员，但效果在生产模型中消失

2026-04-15 21:54·78天前·Maximilian Schreiner

阅读原文· the-decoder.com

AI 摘要

原文 · 保持原样，未翻译

Claude beat human researchers on an alignment task, and then the results vanished in production