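An interpretability study found that Pangram's internal representations learn to tell apart the writing styles of Claude, ChatGPT, and Gemini, even though the model was never explicitly trained to do so. The signal grows stronger in deeper parts of the network, and a simple linear probe reaches 91% accuracy. The main tweet draws three conclusions from this: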
We learn 3 things from this:
1. All AI models write extremely differently from humans.
2. AI models write in very different ways from each other.
3. "Humanized" AI text is distinguishable from both.
Coolest interpretability result in AI I've read today.
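For readers unfamiliar with linear probing, here is a minimal sketch of the idea under loud assumptions. The "activations" are synthetic Gaussian clusters, not Pangram's hidden states. The probe is a multinomial logistic regression, which is one common kind of linear probe and not necessarily the one the study used. `separation` is a hypothetical knob that stands in for how strongly a given layer encodes which model wrote the text. All function names are illustrative. The probe is trained on half the samples and scored on the held-out half.

```python
import math
import random

# Hypothetical class labels for the three model families named in the study.
LABELS = ["Claude", "ChatGPT", "Gemini"]


def make_activations(n_per_class, dim, separation, seed=0):
    """Synthetic stand-in for one layer's hidden states (NOT real Pangram data).

    Each model family gets a random center scaled by `separation`; every
    sample is that center plus unit Gaussian noise. A larger `separation`
    mimics a layer where the style signal is stronger.
    """
    rng = random.Random(seed)
    centers = [[rng.gauss(0.0, separation) for _ in range(dim)] for _ in LABELS]
    xs, ys = [], []
    for label, center in enumerate(centers):
        for _ in range(n_per_class):
            xs.append([c + rng.gauss(0.0, 1.0) for c in center])
            ys.append(label)
    return xs, ys


def logits_for(W, b, x):
    """Linear map: one score per class, W[k] . x + b[k]."""
    return [sum(w * v for w, v in zip(row, x)) + bk for row, bk in zip(W, b)]


def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def train_linear_probe(xs, ys, n_classes, lr=0.05, epochs=300):
    """Multinomial logistic regression fit by full-batch gradient descent."""
    dim = len(xs[0])
    W = [[0.0] * dim for _ in range(n_classes)]
    b = [0.0] * n_classes
    n = len(xs)
    for _ in range(epochs):
        gW = [[0.0] * dim for _ in range(n_classes)]
        gb = [0.0] * n_classes
        for x, y in zip(xs, ys):
            p = softmax(logits_for(W, b, x))
            for k in range(n_classes):
                err = p[k] - (1.0 if k == y else 0.0)  # d(cross-entropy)/d(logit_k)
                gb[k] += err
                for j in range(dim):
                    gW[k][j] += err * x[j]
        for k in range(n_classes):
            b[k] -= lr * gb[k] / n
            for j in range(dim):
                W[k][j] -= lr * gW[k][j] / n
    return W, b


def accuracy(W, b, xs, ys):
    """Fraction of samples whose highest-scoring class matches the label."""
    hits = 0
    for x, y in zip(xs, ys):
        z = logits_for(W, b, x)
        hits += max(range(len(z)), key=z.__getitem__) == y
    return hits / len(ys)


def run_probe(separation, n_per_class=40, dim=8):
    """Train on even-indexed samples, report accuracy on held-out odd-indexed ones."""
    xs, ys = make_activations(n_per_class, dim, separation)
    W, b = train_linear_probe(xs[0::2], ys[0::2], len(LABELS))
    return accuracy(W, b, xs[1::2], ys[1::2])


# Hypothetical "early" vs "late" layers: same noise, stronger style signal later.
for name, sep in [("early layer", 0.1), ("late layer", 1.5)]:
    print(f"{name}: held-out probe accuracy = {run_probe(sep):.2f}")
```

The probe is only a linear map on frozen features, so high held-out accuracy suggests that model identity is linearly decodable from that layer. Running the same probe on each layer is how one would see a signal "grow stronger" through a network. The toy numbers here say nothing about Pangram itself.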