# Google Open-Sources Gemma 4 12B: Encoder-Free Architecture, Runs Locally on 16GB VRAM

- Source: Chubby♨️ (@kimmonismus)
- Published: 2026-06-04 03:20
- AIHOT score: 71
- AIHOT link: https://aihot.virxact.com/items/cmpygc30j04vkslaxcrpd42mh
- Original link: https://x.com/kimmonismus/status/2062252981727240525

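## AI Summary

Google has open-sourced Gemma 4 12B (dense, Apache 2.0 license) with a new encoder-free architecture: the standalone vision encoder (550M parameters, 27 Transformer layers) and audio encoder (300M parameters, 12 Conformer layers) are removed. Vision is handled by a 35M embedding layer (roughly 15x smaller), and audio is projected directly into the LLM in 40ms frames. The model runs agentic reasoning, vision and audio tasks on a laptop with 16GB of VRAM, with performance close to a 26B-parameter model. Shared weights allow a single LoRA tuning pass to cover vision, audio and text.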

## Body

Gemma 4 12B shipped today under the label "encoder-free."

A local 12B model that shows really good results. I'm a big fan of Gemma. Gemma 4 12B is out: a dense, fully open model (Apache 2.0) that runs on a 16GB laptop and does agentic reasoning, vision and audio at a quality Google puts near its 26B model.

The reason a 12B can pull this off: Google removed the separate vision and audio encoders and feeds both straight into the model, which keeps the memory footprint small enough for consumer GPUs.
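To put rough numbers on the memory claim, here is a back-of-envelope sketch of weight memory at a few precisions. The parameter counts (12B total, 550M and 300M for the removed encoders, 35M for the new vision embedder) come from the post; the precisions are assumptions, since the post does not say which format the 16GB figure refers to, and the size of the new audio projection is not given, so it is left out. Only weights are counted, not the KV cache or activations.

```python
GIB = 1024 ** 3  # bytes per GiB

def weight_gib(n_params: float, bits_per_param: int) -> float:
    """Memory for the weights alone, in GiB (no KV cache, no activations)."""
    return n_params * bits_per_param / 8 / GIB

TOTAL_PARAMS = 12e9                 # "12B", from the post
REMOVED_ENCODERS = 550e6 + 300e6    # old vision + audio encoders, from the post
NEW_VISION_EMBEDDER = 35e6          # from the post; audio projection size is not stated
NET_SAVED = REMOVED_ENCODERS - NEW_VISION_EMBEDDER

# Precisions below are assumptions for illustration only.
for bits in (16, 8, 4):
    print(
        f"{bits:>2}-bit: model ~{weight_gib(TOTAL_PARAMS, bits):.1f} GiB, "
        f"net encoder savings ~{weight_gib(NET_SAVED, bits):.2f} GiB"
    )
```

Under these assumptions, 16-bit weights alone would not fit in 16GB, which suggests the laptop figure relies on a reduced-precision format, though the post does not confirm this.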

For on-device assistants and private coding agents, that lowers the bar a lot. I always look forward to the updates. 12B is a good sweet spot in terms of size.

A few facts:

- Vision: the 550M encoder (27 transformer layers) is now a 35M embedder, one matmul on 48x48 pixel patches. Roughly 15x smaller.
- Audio: the 300M encoder (12 conformer layers) is gone. Raw 16kHz audio is cut into 40ms frames and projected straight into the LLM. So encoding didn't vanish, it collapsed into the backbone.
- The payoff is real: one shared set of weights, so you LoRA-tune vision, audio and text in a single pass.
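The three points above can be sketched in a few lines of NumPy. This is a toy illustration, not Google's implementation: the 48x48 patch size, 16kHz sample rate and 40ms frame length come from the post, while the hidden width `D_MODEL`, the non-overlapping framing, the random weights, the tiny text vocabulary and the LoRA rank are all hypothetical choices made so the example runs. The last part shows why shared weights matter for tuning: a single low-rank adapter on one backbone matrix sees vision, audio and text tokens alike.

```python
import numpy as np

PATCH = 48             # patch side in pixels (from the post)
SAMPLE_RATE = 16_000   # Hz (from the post)
FRAME_MS = 40          # audio frame length in ms (from the post)
D_MODEL = 64           # hypothetical toy width; the real hidden size is not stated
RANK = 4               # hypothetical LoRA rank

def patchify(image: np.ndarray, patch: int = PATCH) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "toy example: no padding"
    grid = image.reshape(h // patch, patch, w // patch, patch, c)
    grid = grid.transpose(0, 2, 1, 3, 4)  # (rows, cols, patch, patch, C)
    return grid.reshape(-1, patch * patch * c)

def frame_audio(wave: np.ndarray) -> np.ndarray:
    """Cut a mono waveform into non-overlapping frames (overlap is an assumption)."""
    frame_len = SAMPLE_RATE * FRAME_MS // 1000  # 640 samples per 40 ms at 16 kHz
    n_frames = len(wave) // frame_len
    return wave[: n_frames * frame_len].reshape(n_frames, frame_len)

rng = np.random.default_rng(0)
image = rng.random((96, 144, 3), dtype=np.float32)             # 2 x 3 patches
wave = rng.standard_normal(SAMPLE_RATE, dtype=np.float32)      # 1 second of audio
text_ids = np.array([1, 2, 3])                                 # hypothetical token ids

# One projection matrix per modality: the "one matmul" front-ends.
w_vision = 0.02 * rng.standard_normal((PATCH * PATCH * 3, D_MODEL), dtype=np.float32)
w_audio = 0.02 * rng.standard_normal((640, D_MODEL), dtype=np.float32)
text_table = 0.02 * rng.standard_normal((100, D_MODEL), dtype=np.float32)

vision_tokens = patchify(image) @ w_vision   # one token per image patch
audio_tokens = frame_audio(wave) @ w_audio   # one token per 40 ms frame
text_tokens = text_table[text_ids]
sequence = np.concatenate([vision_tokens, audio_tokens, text_tokens], axis=0)

# A single shared backbone weight with one LoRA adapter (standard zero-init "up" matrix).
w_backbone = 0.02 * rng.standard_normal((D_MODEL, D_MODEL), dtype=np.float32)
lora_down = 0.02 * rng.standard_normal((D_MODEL, RANK), dtype=np.float32)
lora_up = np.zeros((RANK, D_MODEL), dtype=np.float32)

base_out = sequence @ w_backbone
adapted_out = base_out + (sequence @ lora_down) @ lora_up  # same adapter for all modalities
```

Because `lora_up` starts at zero, `adapted_out` equals `base_out` before any training; gradients from image, audio or text examples would all update the same two small matrices.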

### Quoted tweet

> Google: Today we're introducing Gemma 4 12B - our latest open model that brings advanced agentic reasoning, vision and audio directly to your laptop. It delivers perfor...