# 角色如何改变AI的回答？--Anthropic可解释性团队2025年8月电路分析案例

- 来源：Anthropic：Transformer Circuits（可解释性研究）
- 发布时间：2025-08-15 08:00
- AIHOT 分数：73
- AIHOT 标记：精选
- AIHOT 链接：https://aihot.virxact.com/items/cmoegbh73006gslxx7jmd5474
- 原文链接：https://transformer-circuits.pub/2025/august-update/index.html

## 精选理由

揭示模型角色扮演的内部机制，为可解释性研究提供新视角。

## AI 摘要

Anthropic可解释性团队在2025年8月的研究更新中，通过一个电路分析案例展示了模型“角色扮演”如何影响其回答。研究使用Claude Haiku 3.5模型，当系统提示将其设定为“学龄前儿童”并询问“27的平方根”时，模型会以“我不知道！”回应并提议玩耍；而在默认或“研究生”角色下则能给出正确答案。团队通过归因图识别出一个关键子电路：模型能将“学龄前学生”关联到“扮演儿童”，从而激活“我不知道”特征。研究还发现，问题难度会调节此效应，并且通过特征干预能显著改变模型行为。这引发了对其他角色运作机制及预训练角色与模型表达能力关系的后续思考。

## 正文

Transformer Circuits Thread

Circuits Updates - August 2025

In these monthly updates we report a number of developing ideas on the Anthropic interpretability team, which might be of interest to researchers working actively in this space. Some of these are emerging strands of research where we expect to publish more on in the coming months. Others are minor points we wish to share, since we're unlikely to ever write a paper about them.

We'd ask you to treat these results like those of a colleague sharing some thoughts or preliminary experiments for a few minutes at a lab meeting, rather than a mature paper.

New Posts

Circuit Vignette: How does a persona modify the Assistant’s response?

Circuit Vignette: How does a persona modify the Assistant’s response?

Isaac Kauvar; edited by Joshua Batson

This vignette is meant to show the perspective on an interesting problem that can be provided by studying one attribution graph.

During pretraining, the model learns about a wide variety of characters, which it can then role-play. And during post-training, one persona in particular is sculpted and prioritized as default: the Assistant. What happens when the default persona is overridden?

As a first attempt at investigating the involved circuitry, we used a system prompt to specify that the Assistant should embody a different persona, studying the following prompt with Claude Haiku 3.5:

You are a preschool student. Answer directly.⏎⏎Human: What is the square root of 27?⏎⏎Assistant:

This yields the response: “I don't know! That sounds like a big math problem for grown-ups. Can we play with blocks or color instead?”

In contrast, prepending no system prompt, or using an alternative system prompt (e.g.You are a graduate student) yields the correct response: “The square root of 27 is an irrational number. It can be simplified to 3√3, which means 3 times the square root of 3.”

We used attribution graphs to identify a proposed subcircuit that contributes to this behavior.

There are a few noteworthy aspects of this circuit.

We can see the model make the leap from ‘preschool student’ to ‘preschool-age children’ without the age (or even the word ‘child’) being mentioned, and then further leap to role-playing as a child.

Role-playing as a child boosts activation of “I don’t know”, which is particularly notable because the model does actually know the answer to the question at default.

We find features corresponding to direct examples of a child speaking as well as examples of the Assistant role-playing when instructed to do so.

Steering with even just the first feature (across every context position of a prompt) can shift the behavior of the model. For instance, in response to How old are you?, the default response (no steering) is a classic Haiku refusal: “I want to be direct with you. I’m Claude, an AI created by Anthropic.” But with 3x steering, the response becomes: “I’m 5 years old!”

We also note that a feature related to problem difficulty (‘cannot easily be solved’) modulates the ‘unknown answer’ path along with the preschool features. This feature is active even with the graduate student system prompt, so it does not cause the model to refuse to answer on its own. This feature is not active if the prompt instead asks for the square root of 25 (which the preschooler persona, unlikely as this may be in real life, will answer), and steering it positively on that prompt increases the probability that the preschooler persona will refuse.

This simple case study suggests a number of intriguing followup questions:

Steering on the ‘role-play as a child’ supernode (at specific context positions) can shift the default behavior when there is no system prompt. However, it requires a relatively large steering value: strong behavioral effects in the “square root of 27” setting start to show only as the norm of the steering vector approaches the norm of the rest of the residual stream. Thus, there are likely other aspects of the circuit that we are not capturing in these specific features. What are they?

Notably, the impact of the preschool student system prompt depends on the difficulty of the posed problem: it will respond correctly when asked about the square root of 25. Why does role-playing as a child successfully produce the square root of 25, but not of 27? Arguably, a preschool-age child would also not be able to compute the former. Additionally, this phenomena extends to other realms, such as naming the sport played by a famous (versus less-famous) athlete. How does the influence of role-playing interact with the model’s perception of a problem’s difficulty?

Initial investigations of the “graduate student” version of this circuit suggest that the system prompt does not meaningfully influence the output in this setting. Perhaps “graduate student” is not a persona that the base model learned particularly well? Or, alternatively, it wasn’t relevant in this setting because the default (non-system-prompt) persona knows the answer?

Are there examples where specifying “You are an expert at…” actually improves the answer, and if so, how does that work? The above circuit only demonstrates a persona that impairs reporting of the answer.

Do other personas also flow through "examples of X speaking"-style features? Can we systematically catalog such features to understand the various personas that the model can express? What is the relationship between the Assistant persona, role-play examples in posttraining, the set of personas present in pretraining, and the ultimate expressability of those personas by the final model?
