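Anthropic has recently changed the safeguards on Claude Fable 5. Developers had found that some sensitive prompts were being silently downgraded to Opus 4.8 instead of being explicitly refused. Now, requests involving frontier LLM development, cybersecurity, or biosecurity will visibly fall back to Opus 4.8, and the API will return the reason for the refusal. The hidden approach was quick to ship and produced few false positives, but it undermined users' right to know what they were getting. The visible approach is easier to probe and circumvent and will produce more false positives in the short term, so Anthropic will tune its classifiers at the same time. The change is mainly aimed at the risk of knowledge distillation, where competitors use Fable 5 outputs to train smaller models.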
A good move by Anthropic.
They just made Claude Fable 5's hidden safeguards visible after developers found that some sensitive prompts were being silently downgraded to Opus 4.8 instead of being clearly refused.
After the backlash, those prompts will now visibly fall back to Opus 4.8.
The problem was that researchers, developers, and evaluators could send an ordinary technical prompt and receive a degraded answer with no way of knowing whether Fable 5 itself had answered badly or Anthropic had quietly weakened the response.
That breaks trust because users need to know whether they are testing the real model, a restricted version of the model, or a fallback system.