Gemma Scope 2: Helping the AI Safety Community Deepen Understanding of Complex Language Model Behavior

2025-12-16 18:14

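AI summary

Gemma Scope 2 has been officially released, offering open interpretability tools for the entire Gemma 3 model family to help the AI safety community deepen its understanding of complex language model behavior.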

Gemma Scope 2: Helping the AI Safety Community Deepen Understanding of Complex Language Model Behavior — Google DeepMind

December 19, 2025 Responsibility & Safety

Language Model Interpretability Team

Announcing a new, open suite of tools for language model interpretability

Large Language Models (LLMs) are capable of incredible feats of reasoning, yet their internal decision-making processes remain largely opaque. Should a system not behave as expected, a lack of visibility into its internal workings can make it difficult to pinpoint the exact reason for its behaviour. Last year, we advanced the science of interpretability with Gemma Scope, a toolkit designed to help researchers understand the inner workings of Gemma 2, our lightweight collection of open models.

Today, we are releasing Gemma Scope 2: a comprehensive, open suite of interpretability tools for all Gemma 3 model sizes, from 270M to 27B parameters. These tools can enable us to trace potential risks across the entire "brain" of the model.

To our knowledge, this is the largest open-source release of interpretability tools by an AI lab to date. Producing Gemma Scope 2 involved storing approximately 110 Petabytes of data, as well as training over 1 trillion total parameters.

As AI continues to advance, we look forward to the AI research community using Gemma Scope 2 to debug emergent model behaviors, to better audit and debug AI agents, and ultimately to accelerate the development of practical and robust safety interventions against issues like jailbreaks, hallucinations and sycophancy.

Our interactive Gemma Scope 2 demo is available to try, courtesy of Neuronpedia.

What’s new in Gemma Scope 2

Interpretability research aims to understand the internal workings and learned algorithms of AI models. As AI becomes increasingly capable and complex, interpretability is crucial for building AI that is safe and reliable.

Like its predecessor, Gemma Scope 2 acts as a microscope for the Gemma family of language models. By combining sparse autoencoders (SAEs) and transcoders, it allows researchers to look inside models and see what they’re thinking about, how these thoughts are formed, and how they connect to the model’s behaviour. In turn, this enables richer study of jailbreaks and other safety-relevant AI behaviours, such as discrepancies between a model's communicated reasoning and its internal state.
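
For readers new to these terms, the following is a minimal, generic sketch of what a sparse autoencoder computes. It is not the Gemma Scope 2 implementation: the class name, dimensions, ReLU activation, L1 penalty and random weights are all illustrative assumptions, and the released models may use different activation functions and training objectives. The optional `target` argument illustrates the usual distinction between the two tools: an SAE reconstructs the same activation it reads, while a transcoder, in the common usage of the term, is trained to predict a different activation (for example a layer's output from its input).

```python
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


class SparseAutoencoder:
    """Toy SAE: activation -> wide, mostly-zero latent code -> reconstruction.

    Hypothetical illustration only; weights are random, not trained.
    """

    def __init__(self, d_model, d_latent, seed=0):
        rng = random.Random(seed)
        # Encoder maps d_model -> d_latent, decoder maps d_latent -> d_model.
        self.W_enc = [[rng.gauss(0, 0.1) for _ in range(d_model)]
                      for _ in range(d_latent)]
        self.b_enc = [0.0] * d_latent
        self.W_dec = [[rng.gauss(0, 0.1) for _ in range(d_latent)]
                      for _ in range(d_model)]
        self.b_dec = [0.0] * d_model

    def encode(self, x):
        # ReLU zeroes out negative pre-activations, leaving a non-negative code.
        pre = [p + b for p, b in zip(matvec(self.W_enc, x), self.b_enc)]
        return [max(0.0, p) for p in pre]

    def decode(self, z):
        return [r + b for r, b in zip(matvec(self.W_dec, z), self.b_dec)]

    def loss(self, x, target=None, l1_coeff=1e-3):
        # SAE: target defaults to the input itself.
        # Transcoder-style: pass a different activation as the target.
        target = x if target is None else target
        z = self.encode(x)
        recon = self.decode(z)
        mse = sum((r - t) ** 2 for r, t in zip(recon, target)) / len(target)
        sparsity = sum(abs(v) for v in z)  # L1 penalty encourages few active latents
        return mse + l1_coeff * sparsity


# Usage: encode a made-up 4-dimensional activation into 16 latents.
sae = SparseAutoencoder(d_model=4, d_latent=16)
activation = [0.5, -1.0, 0.25, 2.0]
latents = sae.encode(activation)
active_latents = [i for i, v in enumerate(latents) if v > 0]
print("active latent indices:", active_latents)
```

In a trained SAE, each latent that fires on an input is a candidate "feature" that a researcher can inspect, which is the sense in which such tools act as a microscope.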
