Ling-2.6-flash-base is the base checkpoint behind the Ling-2.6-flash model. It is a flash-scale Mixture-of-Experts language model retrofitted from the Ling-2.0 base checkpoint with a hybrid linear attention design, continued pre-training, and long-context mid-training.
This release is intended for research, continued pre-training, distillation, and supervised or preference-based fine-tuning. It is not a chat-aligned assistant model. If you want an out-of-the-box instruction model, use the corresponding post-trained Ling-2.6-flash checkpoint instead.
Model Overview
Ling-2.6-flash-base is designed for efficient instant-response modeling with stronger long-context efficiency than the previous GQA-based Ling-2.0 generation. The core upgrade is a hybrid attention retrofit that combines Lightning Attention with MLA in a 7:1 ratio, together with a smooth migration pipeline from the original architecture.
Ling-2.6 base models are trained on approximately 9.6T tokens across migration pre-training, continued pre-training, and mid-training, with staged context extension from 4K to 256K. Ling-2.6-flash-base serves as the base checkpoint for the post-trained Ling-2.6-flash instant model.
Key Features
Hybrid linear attention architecture combining Lightning Attention and MLA in a 7:1 ratio
Flash-scale MoE backbone optimized for efficient serving and high token efficiency
Long-context training pipeline extended to 256K context during mid-training
Continued pre-training mixture covering agentic data, long-context data, knowledge-rich web data, math, code, and multilingual corpora
Strong base-model quality across knowledge, math, code, reasoning, and long-context understanding benchmarks
Model Summary
| Item | Value |
| --- | --- |
| Architecture | Fine-grained MoE with hybrid linear attention |
| Parameter scale | Total ~104B, activated ~7.4B |
| Transformer layers | 32 |
| Routed experts per MoE layer | 256 |
| Shared experts per MoE layer | 1 |
| Active routed experts per token | 8 |
| Attention heads | 32 |
| Dense FFN layers | 1 |
| Hidden size | 4096 |
| Dense intermediate size | 9216 |
| Expert intermediate size | 1024 |
| KV LoRA rank | 512 |
| Q LoRA rank | 1536 |
| Layer group size | 8 |
| Positional encoding | Partial RoPE |
| Attention design | Lightning Attention + MLA, 7:1 ratio |
| Training recipe | Migration pre-training + continued pre-training + mid-training |
| Total training tokens | ~9.6T |
| Context training schedule | 4K -> 32K -> 256K |
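To make the summary numbers concrete, the sketch below derives a per-layer attention layout from the layer count, the layer group size, and the 7:1 ratio, and computes the sparsity implied by the parameter and expert counts. The placement of the MLA layer as the last layer of each group is an assumption made for illustration; the summary states only the ratio and the group size. The function name `layer_types` is hypothetical and is not part of the released code.

```python
# Values copied from the Model Summary.
NUM_LAYERS = 32
LAYER_GROUP_SIZE = 8


def layer_types(num_layers=NUM_LAYERS, group_size=LAYER_GROUP_SIZE):
    """Return one attention-type label per layer.

    Assumption (not stated in the model card): the single MLA layer
    is the last layer of each group of `group_size` layers.
    """
    types = []
    for layer_idx in range(num_layers):
        is_mla = layer_idx % group_size == group_size - 1
        types.append("mla" if is_mla else "lightning")
    return types


layout = layer_types()
num_lightning = layout.count("lightning")
num_mla = layout.count("mla")
print(f"Lightning Attention layers: {num_lightning}, MLA layers: {num_mla}")

# Sparsity implied by the summary values.
active_param_fraction = 7.4 / 104   # activated / total parameters (both approximate)
active_expert_fraction = 8 / 256    # active / routed experts per MoE layer
print(f"Active parameters per token: {active_param_fraction:.1%}")
print(f"Active routed experts per token: {active_expert_fraction:.1%}")
```

With 32 layers in groups of 8, this layout gives 28 Lightning Attention layers and 4 MLA layers, which matches the 7:1 ratio.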
Training Highlights
Architecture Migration
The model is converted from the Ling-2.0 generation into the Ling-2.6-flash architecture through a multi-stage migration pipeline that includes:
Lightning Attention conversion
Linear warmup
MLA conversion
MLA warmup
Full continued pre-training
This retrofit is designed to preserve pre-trained capability while reducing long-context compute cost, KV-cache pressure, and decode latency.
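The KV-cache claim can be illustrated with a rough back-of-the-envelope estimate. The sketch below assumes that only the MLA layers keep a per-token cache, that each MLA layer caches a compressed latent of size equal to the KV LoRA rank (512), and that values are stored in bfloat16. It ignores any extra RoPE key dimensions and the fixed-size state of the Lightning Attention layers, neither of which is specified in this card, so the real footprint will differ. The function name `mla_cache_bytes` is hypothetical.

```python
# Values from the Model Summary; everything else is an assumption.
KV_LORA_RANK = 512          # latent size cached per token per MLA layer (assumed)
NUM_MLA_LAYERS = 32 // 8    # one MLA layer per group of 8 layers
BYTES_PER_VALUE = 2         # bfloat16


def mla_cache_bytes(num_tokens,
                    num_mla_layers=NUM_MLA_LAYERS,
                    latent_dim=KV_LORA_RANK,
                    bytes_per_value=BYTES_PER_VALUE):
    """Estimate per-sequence cache bytes that grow with context length."""
    return num_tokens * num_mla_layers * latent_dim * bytes_per_value


# Context lengths from the training schedule, reading "K" as 1024 tokens.
for context in (4 * 1024, 32 * 1024, 256 * 1024):
    gib = mla_cache_bytes(context) / 2**30
    print(f"{context:>7} tokens -> {gib:.3f} GiB per sequence")
```

Under these assumptions the cache grows by 4 KiB per token, so a 256K-token sequence needs about 1 GiB of latent cache. Treat this as an order-of-magnitude illustration rather than a measured figure.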
Data Mixture
The continued pre-training and mid-training stages include:
Agentic corpus built from tool-use and coding environments
Long-context corpus covering mathematics, web parsing, summarization, retrieval, and multi-hop reasoning
General web knowledge data with targeted STEM and factual augmentation
Math and code corpora
Multilingual data spanning 21 languages
Base Model Evaluation
The following numbers are selected from the technical report and reflect base-model evaluation rather than chat-aligned or instruction-tuned performance.
Ling-2.6-flash-base shows broad gains over Ling-2.0-flash-base, especially on knowledge-oriented, reasoning-oriented, and long-context evaluations.
Intended Use
Recommended use cases:
Continued pre-training
Supervised fine-tuning for domain adaptation
Preference optimization and RL post-training
Distillation research
Long-context and MoE systems research
Not recommended as-is for:
Direct end-user chat deployment
Safety-critical applications without additional alignment and evaluation
Production use without post-training and task-specific validation
Limitations
This is a base model and is not instruction-aligned.
Outputs may be inaccurate, biased, incomplete, or unsafe without additional post-training.
Long-context quality depends on the serving stack, positional scaling configuration, and prompt format used at inference time.
The training mixture includes web-scale and synthetic data, so the model may reproduce factual errors or undesirable artifacts.
Benchmark results in the technical report are collected under controlled internal evaluation settings and should not be treated as a guarantee of downstream production behavior.
Relationship to Other Releases
Ling-2.6-flash: the post-trained, instruction-following instant-response model derived from this base.
If your goal is interactive assistant use rather than research on base checkpoints, the post-trained Ling-2.6-flash model is usually the better starting point.
Usage
This is a base checkpoint. The example below illustrates the loading pattern only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "inclusionAI/Ling-2.6-flash-base"

# The architecture ships as remote code, so trust_remote_code is required.
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)

prompt = "Summarize the benefits of hybrid linear attention."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Greedy decoding; a base model continues the text rather than following instructions.
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=False,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
For production inference, prefer serving stacks that support the released architecture and remote code path.
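Because this checkpoint is not instruction-aligned, completion-style prompts with a few worked examples often steer a base model better than a bare instruction. The helper below is a minimal sketch of one way to build such a prompt. The function name `build_few_shot_prompt` and the "Question"/"Answer" labels are hypothetical choices, not a format prescribed by this release.

```python
def build_few_shot_prompt(examples, query,
                          input_label="Question", output_label="Answer"):
    """Format (input, output) pairs plus a final query as a completion prompt.

    The prompt ends right after the last output label, so the model's
    continuation is the answer to `query`.
    """
    blocks = [
        f"{input_label}: {question}\n{output_label}: {answer}"
        for question, answer in examples
    ]
    blocks.append(f"{input_label}: {query}\n{output_label}:")
    return "\n\n".join(blocks)


examples = [
    ("What is 2 + 2?", "4"),
    ("What is 7 - 3?", "4"),
]
prompt = build_few_shot_prompt(examples, "What is 3 + 5?")
print(prompt)
```

The resulting string can be passed to the tokenizer in place of the single-sentence prompt in the loading example. Setting a small `max_new_tokens` or cutting the output at the next blank line keeps the model from generating further question and answer pairs.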