Orthrus: Memory-Efficient Parallel Token Generation via Dual-View Diffusion
Official implementation and model checkpoints for Orthrus, a dual-architecture framework that unifies the exact generation fidelity of autoregressive Large Language Models (LLMs) with the high-speed parallel token generation of diffusion models.
Model Zoo
All models use a Qwen3 backbone and guarantee strictly lossless generation.
| Model | Base Model | HuggingFace | Avg. Speedup |
|---|---|---|---|
| Orthrus-Qwen3-1.7B | Qwen3-1.7B | 🤗 HuggingFace | 4.25× |
| Orthrus-Qwen3-4B | Qwen3-4.0B | 🤗 HuggingFace | 5.20× |
| Orthrus-Qwen3-8B | Qwen3-8.0B | 🤗 HuggingFace | 5.36× |
Installation

    uv pip install -e .
    uv pip install ninja packaging
    uv pip install flash-attn --no-build-isolation  # or: pip install "flash-attn-4[cu13]" if your device supports it

We recommend uv for fast dependency resolution.
Quickstart
⚡ Try instantly: Run Orthrus directly in Colab:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

    # Load the Orthrus checkpoint; trust_remote_code is needed for its custom modeling code.
    model = AutoModelForCausalLM.from_pretrained(
        "chiennv/Orthrus-Qwen3-8B",
        dtype=torch.bfloat16,
        device_map="cuda",
        attn_implementation="flash_attention_2",  # options: sdpa | eager | flash_attention_4
        trust_remote_code=True,
    ).eval()
    tokenizer = AutoTokenizer.from_pretrained("chiennv/Orthrus-Qwen3-8B")

    prompt = "Write a program to count the frequency of each word in a paragraph."
    messages = [{"role": "system", "content": ""}, {"role": "user", "content": prompt}]
    input_ids = tokenizer.apply_chat_template(
        messages,
        return_tensors="pt",
        return_dict=True,  # return a dict-like BatchEncoding so .input_ids works across transformers versions
        add_generation_prompt=True,
        enable_thinking=False,
    ).input_ids

    output_ids = model.generate(
        input_ids=input_ids.to(model.device),
        max_new_tokens=2048,
        use_diffusion_mode=True,
        streamer=TextStreamer(tokenizer, skip_prompt=True),  # enable streaming generation
    )

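To check the speedup on your own hardware, a rough timing helper like the one below can be wrapped around any generation call. This is a minimal sketch and not part of the Orthrus codebase: `measure_throughput` and `generate_fn` are hypothetical names, and the helper only assumes that the callable returns a flat sequence of token ids that still includes the prompt.

```python
import time
from typing import Callable, Dict, Sequence


def measure_throughput(generate_fn: Callable[[], Sequence[int]], prompt_len: int) -> Dict[str, float]:
    """Time one generation call and report newly generated tokens per second.

    generate_fn: hypothetical zero-argument callable returning prompt + generated token ids.
    prompt_len:  number of prompt tokens to exclude from the count.
    """
    start = time.perf_counter()
    output_ids = generate_fn()
    elapsed = time.perf_counter() - start

    new_tokens = len(output_ids) - prompt_len
    rate = new_tokens / elapsed if elapsed > 0 else float("inf")
    return {"new_tokens": new_tokens, "seconds": elapsed, "tokens_per_second": rate}
```

With the Quickstart objects, one might pass `lambda: model.generate(input_ids=input_ids.to(model.device), max_new_tokens=256, use_diffusion_mode=True)[0].tolist()` together with `prompt_len=input_ids.shape[-1]`. This assumes `generate` returns a batch of token-id tensors, as standard `transformers` models do. Comparing against `use_diffusion_mode=False` further assumes that the flag falls back to plain autoregressive decoding, which this README does not state.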
Coming soon: Native integration with vLLM and SGLang. Stay tuned!
Key Advantages
Significant Inference Acceleration: Breaks the sequential bottleneck of standard autoregressive decoding, delivering up to a $7.8\times$ speedup on generation tasks.
Strictly Lossless Generation: Employs an exact intra-model consensus mechanism to guarantee that the output matches the original base model's exact predictive distribution.
Zero Redundant Memory Overhead: Both the autoregressive and diffusion views natively attend to the same high-fidelity Key-Value (KV) cache, so the additional cache memory is only $O(1)$.
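This README does not spell out how the consensus mechanism works, so the sketch below is not Orthrus's algorithm. It is a toy, pure-Python illustration of the general draft-and-verify idea that makes parallel generation lossless under greedy decoding. A parallel drafter proposes `k` tokens, the autoregressive view checks them, and only the prefix it agrees with survives. All names (`greedy_ar`, `draft_and_verify`, `toy_next_token`, `toy_draft`) are hypothetical, and the "models" are simple arithmetic functions.

```python
from typing import Callable, List, Tuple

NextTokenFn = Callable[[List[int]], int]        # autoregressive view: context -> next token
DraftFn = Callable[[List[int], int], List[int]]  # parallel view: context, n -> n proposed tokens


def greedy_ar(next_token: NextTokenFn, prompt: List[int], n_new: int) -> List[int]:
    """Reference decoder: one sequential step per generated token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(next_token(seq))
    return seq


def draft_and_verify(next_token: NextTokenFn, draft: DraftFn, prompt: List[int],
                     n_new: int, k: int = 4) -> Tuple[List[int], int]:
    """Propose up to k tokens per round and keep the prefix the AR view agrees with.

    Returns the final sequence and the number of draft/verify rounds.
    """
    seq = list(prompt)
    target_len = len(prompt) + n_new
    rounds = 0
    while len(seq) < target_len:
        rounds += 1
        proposal = draft(seq, min(k, target_len - len(seq)))
        if not proposal:                    # guard: always make progress
            seq.append(next_token(seq))
            continue
        for tok in proposal:
            # A real system would score all proposed positions in one batched
            # forward pass; the toy calls the model per token for clarity.
            expected = next_token(seq)
            seq.append(expected)            # always emit the AR token, so output stays exact
            if tok != expected:
                break                       # first disagreement: discard the rest of the draft
    return seq, rounds


def toy_next_token(seq: List[int]) -> int:
    """Stand-in for the base model's greedy choice."""
    return (3 * seq[-1] + len(seq)) % 11


def toy_draft(seq: List[int], n: int) -> List[int]:
    """Imperfect drafter: deliberately wrong whenever the position is a multiple of 5."""
    tmp, out = list(seq), []
    for _ in range(n):
        tok = toy_next_token(tmp)
        if len(tmp) % 5 == 0:
            tok = (tok + 1) % 11
        out.append(tok)
        tmp.append(tok)
    return out


prompt = [1, 2]
reference = greedy_ar(toy_next_token, prompt, n_new=12)
fast, rounds = draft_and_verify(toy_next_token, toy_draft, prompt, n_new=12, k=4)
assert fast == reference  # identical output despite drafter mistakes
print(f"{rounds} draft/verify rounds for 12 new tokens")
```

Because every emitted token is the autoregressive view's own choice, drafter errors cost only extra rounds and never change the output. The toy covers greedy decoding only. Matching the full predictive distribution under sampling, as claimed above, would need a different acceptance rule.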