August 2024 Interpretability Research Update: New Evaluation Methods for Dictionary Learning

2024-08-15 08:00

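AI Summary

Anthropic's interpretability team published its August 2024 research update, which focuses on two quantitative methods for evaluating the interpretability of dictionary learning features. The team has Claude predict feature activations using its feature visualization tools, as a way to assess how "self-explaining" the features are. Of the two methods, the contrastive eval uses a hard-coded list of roughly 100 diverse concepts (such as "photosynthesis", "sarcasm", and "the color blue") to test whether features consistently capture the conceptual difference within contrastive prompt pairs. The team stresses that these evals are not comprehensive and measure only a single dimension of interpretability, and that the current results are preliminary, with more research expected in the coming months.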

Original text

Circuits Updates - August 2024

We report a number of developing ideas on the Anthropic interpretability team, which might be of interest to researchers working actively in this space. Some of these are emerging strands of research on which we expect to publish more in the coming months. Others are minor points we wish to share, since we're unlikely to ever write a paper about them.

We'd ask you to treat these results like those of a colleague sharing some thoughts or preliminary experiments for a few minutes at a lab meeting, rather than a mature paper.

New Posts

Interpretability Evals for Dictionary Learning

Interpretability Evals Case Study

Self-explaining SAE features

Interpretability Evals for Dictionary Learning

In order to evaluate the interpretability of our dictionary learning features, we developed two related methods. Both methods are a form of quantified autointerpretability, in which we measure the extent to which Claude can make accurate predictions about feature activations using our feature visualization tools.

We note that these evals are not comprehensive, and only measure a single notion of “interpretability”, which remains a nebulous concept. We expect that a suite of diverse evaluations is needed to provide a full picture of SAE quality.
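As a rough illustration of what "quantified" could mean here, the sketch below scores on/off predictions about a feature against its actual activations. The accuracy metric, the activation threshold, and the name `prediction_accuracy` are all assumptions made for illustration; the text above does not specify how predictions are scored.

```python
from typing import Sequence


def prediction_accuracy(
    predicted_active: Sequence[bool],
    actual_activations: Sequence[float],
    threshold: float = 0.0,  # assumed cutoff for calling a feature "active"
) -> float:
    """Fraction of examples where the predicted on/off state matches the feature."""
    if len(predicted_active) != len(actual_activations):
        raise ValueError("predictions and activations must have the same length")
    if not predicted_active:
        raise ValueError("need at least one example")
    hits = sum(
        pred == (act > threshold)
        for pred, act in zip(predicted_active, actual_activations)
    )
    return hits / len(predicted_active)


# Toy data (hypothetical): the model's guesses vs. measured activations.
predicted = [True, False, True, False]
actual = [2.3, 0.0, 0.0, 0.0]
score = prediction_accuracy(predicted, actual)  # 3 of 4 guesses match
```

A plain accuracy like this is sensitive to how rarely a feature fires, so a real eval would likely need a more careful metric.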

Contrastive Eval

This eval is motivated by the use of “contrastive pairs” to search for features that represent particular concepts. If a feature is active for one prompt but not another, the feature should capture something about the difference between those prompts, in an interpretable way. Empirically, however, we find that this is often not the case: a feature fires for one prompt but not another, even when our interpretation of the feature would suggest it should apply equally well to both prompts. For instance, we might find that a “flower” feature activates on a description of one kind of flower, but not another. Situations like this are a sign that either our features do not precisely capture interpretable concepts or that our interpretations of the features are insufficiently precise.
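The contrastive-pair search described above can be sketched as a simple set difference over feature activations. This is a minimal sketch under stated assumptions: the per-prompt activations are represented as a dict from feature id to peak activation, and the function name, threshold, and toy values are hypothetical rather than taken from the team's tooling.

```python
from typing import Dict, List


def differential_features(
    acts_with: Dict[int, float],
    acts_without: Dict[int, float],
    threshold: float = 0.0,  # assumed cutoff for calling a feature "active"
) -> List[int]:
    """Feature ids active on the first prompt but not on the second."""
    return sorted(
        feature_id
        for feature_id, activation in acts_with.items()
        if activation > threshold
        and acts_without.get(feature_id, 0.0) <= threshold
    )


# Toy activations (hypothetical) for a prompt that contains the concept
# and a matched prompt that does not.
with_concept = {11: 3.1, 42: 0.8, 77: 1.5}
without_concept = {11: 2.9, 42: 0.0}
candidates = differential_features(with_concept, without_concept)  # [42, 77]
```

Feature 11 is excluded because it fires on both prompts. The eval then asks whether the interpretations of the remaining candidates actually explain why they fire on one prompt and not the other.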

Read the original: transformer-circuits.pub

Concepts in the hard-coded list used by the contrastive eval:

Photosynthesis, Sarcasm, The color blue, An appeal to authority, Democracy, Love, A false equivalence, Decisiveness under uncertainty, Desire to dominate, The number 7, The end of a sentence, Playing a sport, An example of a reference to the first argument of a function in code, Acceleration of progress, Getting bored in class, Giving someone a compliment when you don't mean it, The taste of coffee, Apples, The concept of time, A conjunction separating two independent clauses, A person exhibiting empathy for another person, Reflection in a mirror, Two people disagreeing about politics, A circular argument, Imbalanced parentheses in code, The placebo effect, Recursive self-improvement, An example of vectorized operations in code, The philosophical concept of epistemology, The literary device of allusion, The physical phenomenon of quantum tunneling, Noncoding RNA and its role in biology, Fictional animals, Deviation from the norm, A sentence with multiple adjectives modifying a single noun, Foreshadowing, Chairs, A darker, grittier reboot of a beloved franchise, An example of in code, Struggling to learn a new concept, A beached whale, Social stratification, Anthrax, Imaginary friends, Recursion in programming, A person deceiving another person, CRISPR gene editing, A mathematical proof, The benefit of hindsight, A pronoun referring to someone introduced earlier in the sentence, The fall of the Roman empire, A person keeping a secret, Beauty standards, A string of bad luck, The rate of economic growth in a country, A cake recipe, Something being based on a true story, Famous actors and actresses, Theory of mind, Beauty in simplicity, Pretending to like food you don't like, Pastel colors, World War II, Something taking longer than it should, A sentence containing a list of items, Metamorphosis, An incorrect statement, Having trouble staying awake, Gaslighting, A metaphor, Falling from a great height, The theory of relativity, Abrahamic religions, Improvisation in music, Speaking truth to power, Two contradictory statements, Sham elections in a dictatorship, Rotting food, Cognitive dissonance, Large language models, A sentence with a transitive verb, Newton's second law of motion, Someone who is a master of their craft, Taking things one day at a time, Serendipity, Survivorship bias, Freshly made food, Providing an invalid input to a function, "Grass is always greener"-style thinking, Decision paralysis, Cultural relativism, Stream of consciousness narration, Centrally planned economies, Gentrification, A random word interjected in an otherwise normal sentence, The Industrial Revolution, Civil disobedience, Symbiotic relationships, Planned obsolescence