Nathan Lambert (@natolambert)

2026-06-15 23:11 · 17 days ago

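AI summary

Lambert argues that US labs use the term "distillation" to obscure what is really an API jailbreaking problem. Chinese labs jailbreak the APIs to extract reasoning traces, which help bootstrap reasoning behaviors in new domains. He believes API providers cannot easily prevent jailbreaking entirely, because reasoning models inherently tend to output their reasoning traces, and fully patching that behavior would make the models less intelligent. He calls on the labs to be more transparent about this process so that informed policy discussions can take place.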

This isn't very true.

A big part of the problem is that the labs use the term distillation, which is a general post-training technique, in lieu of naming the specific issue of jailbreaking the API. (1)

There is a second debate over *how* impactful distillation is, but it is definitely helpful. (2) This is entirely based on how the Chinese labs are jailbreaking the APIs to get reasoning traces out, which help bootstrap reasoning behaviors in new domains.

There's a third point (3), for which I take an excerpt from my recent piece: the labs need to be more transparent about why point (2) in particular is true. From the third piece:

"On the point of distillation, my hypothesis is that API builders don't have an easy time preventing hacks or jailbreaking because it's a deeply grounded property of reasoning models to want to output the reasoning traces, and it would make the model far less intelligent to fully patch the behavior. This is based on a few assumptions:

a) Chinese labs are not just showing up as customers to Anthropic's API and paying for tokens in the intended input-output form. If the Chinese labs are paying for intended use behaviors, despite being banned by the terms and conditions, I don't have a lot of sympathy for the frontier labs manifesting policy actions against this.

b) Reasoning traces are disproportionately effective at seeding behavior in downstream models.

c) Leading labs work very hard to patch the pipeline of these jailbreaks.

So, my logical conclusion is that the model companies would have to weaken their economic position to fully protect their IP. If this is the case, Anthropic would get a lot more sympathy from the AI research community by being transparent. It would also be far easier to have informed policy discussions, and not rely on me proposing Occam's razor explanations for what the API jailbreaking looks like."

There's no need to misinform people because the labs use a bad term. The labs use this term partially to make the discourse confusing, as you're doing.

(1) See: https://www.interconnects.ai/p/the-distillation-panic

(2) See: https://www.interconnects.ai/p/how-much-does-distillation-really

(3) See: https://www.interconnects.ai/p/claude-fable-5-and-new-ai-safety

Quoted post from antirez: Another important thing: Chinese models are not strong because they distill US models. Distillation of models via API is *impossible*. If somebody tells you the...

