Mellum 2 Technical Report

2026-05-29 08:00

AI Summary

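Mellum 2 is an open-weight 12B-parameter MoE large language model with 2.5B active parameters per token. It focuses on software engineering tasks and is the successor to Mellum. Its architecture is an MoE with 64 experts, 8 of them active per token, combined with Grouped-Query Attention, Sliding Window Attention, and a Multi-Token Prediction head. The model is pre-trained in three phases on approximately 10.6 trillion tokens, extended to a 128K context window via YaRN, and then post-trained with supervised fine-tuning and RLVR, yielding two released variants: Instruct (answers directly) and Thinking (emits a reasoning trace first). Across multiple benchmarks it is competitive with open-weight models in the 4B-14B range while running at the per-token compute of a 2.5B dense model. All checkpoints are released under the Apache 2.0 license.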

Original abstract

We present Mellum 2, an open-weight 12B-parameter Mixture-of-Experts (MoE) language model with 2.5B active parameters per token. Mellum 2 is a general-purpose language model specialized in software engineering, spanning code generation and editing, debugging, multi-step reasoning, tool use and function calling, agentic coding, and conversational programming assistance, and it is the successor to the completion-focused 4B dense Mellum model. The architecture builds on the Mixture-of-Experts (64 experts, 8 active) and combines Grouped-Query Attention with 4 KV heads, Sliding Window Attention on three of every four layers, and a single Multi-Token Prediction head that doubles as both an auxiliary pre-training objective and a built-in draft model for speculative decoding; each choice was validated by ablation with inference efficiency on commodity GPUs as a design constraint. Pre-training spans approximately 10.6 trillion tokens through a three-phase curriculum that progressively shifts the mixture from diverse web data toward curated code and mathematical content, optimized with Muon under FP8 hybrid precision and a Warmup-Hold-Decay schedule with linear decay to zero. The pre-trained base is extended to a 128K context window via a layer-selective YaRN and then post-trained in two stages (supervised fine-tuning followed by RLVR), yielding two released variants: an Instruct model that answers directly and a Thinking model that emits an explicit reasoning trace before its final answer. Across code generation, math and reasoning, tool use, knowledge, and safety benchmarks, Mellum 2 is competitive with open-weight baselines in the 4B-14B range while running at the per-token compute of a 2.5B dense model. We release the base, instruct, and thinking checkpoints, together with this report on the architecture decisions, data pipeline, and training recipe behind them, under the Apache 2.0 license.
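The abstract names a Warmup-Hold-Decay schedule with linear decay to zero but gives no step counts. The following is a minimal sketch of that schedule shape, assuming it is applied to the learning rate; the function name `whd_lr` and the `warmup_steps`, `decay_steps`, and `peak_lr` values are illustrative assumptions, not figures from the report.

```python
def whd_lr(step, total_steps, peak_lr, warmup_steps, decay_steps):
    """Warmup-Hold-Decay learning rate with linear decay to zero.

    All arguments are hypothetical: the abstract states only the schedule
    shape, not the phase lengths or the peak value.
    """
    if not 0 <= step <= total_steps:
        raise ValueError("step must lie in [0, total_steps]")
    if warmup_steps + decay_steps > total_steps:
        raise ValueError("warmup and decay phases exceed total_steps")

    decay_start = total_steps - decay_steps

    # Phase 1: linear warmup from 0 up to peak_lr.
    if step < warmup_steps:
        return peak_lr * step / warmup_steps

    # Phase 2: hold at peak_lr (also covers the case decay_steps == 0).
    if step < decay_start or decay_steps == 0:
        return peak_lr

    # Phase 3: linear decay that reaches exactly 0 at total_steps.
    return peak_lr * (total_steps - step) / decay_steps


if __name__ == "__main__":
    # Illustrative run: 1000 steps, 100 warmup, 200 decay.
    for s in (0, 50, 100, 500, 800, 900, 1000):
        print(s, whd_lr(s, 1000, 1.0, 100, 200))
```

Passing the phase lengths as integer step counts rather than fractions keeps the phase boundaries exact and avoids rounding surprises at the start of the decay.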

HuggingFace Daily Papers (community trending papers)

Read the original · arxiv.org
