Orthrus-Qwen3: up to 7.8 tokens per forward pass on Qwen3, with an output distribution identical to the original model

2026-05-16 18:56 · 47 days ago · FranckDernoncou

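AI Summary

The Orthrus-Qwen3 project processes up to 7.8 tokens per forward pass on Qwen3 models while keeping the output distribution identical to that of the original model. The project is open source on GitHub and has received 102 points on Hacker News. The optimization substantially improves inference efficiency while preserving the accuracy of the generated output.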

Original text

Orthrus: Memory-Efficient Parallel Token Generation via Dual-View Diffusion

Official implementation and model checkpoints for Orthrus, a dual-architecture framework that unifies the exact generation fidelity of autoregressive Large Language Models (LLMs) with the high-speed parallel token generation of diffusion models.

Model Zoo

All models use a Qwen3 backbone and guarantee strictly lossless generation.

| Model | Base Model | HuggingFace | Avg. Speedup |
| --- | --- | --- | --- |
| Orthrus-Qwen3-1.7B | Qwen3-1.7B | 🤗 HuggingFace | 4.25× |
| Orthrus-Qwen3-4B | Qwen3-4.0B | 🤗 HuggingFace | 5.20× |
| Orthrus-Qwen3-8B | Qwen3-8.0B | 🤗 HuggingFace | 5.36× |

Installation

    uv pip install -e .
    uv pip install ninja packaging
    uv pip install flash-attn --no-build-isolation
    # or: pip install "flash-attn-4[cu13]" if your device supports it

We recommend uv for fast dependency resolution.

Quickstart

⚡ Try instantly: run Orthrus directly in Colab, or locally with the following script:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

    # Load the Orthrus checkpoint; trust_remote_code is needed for the custom model code.
    model = AutoModelForCausalLM.from_pretrained(
        "chiennv/Orthrus-Qwen3-8B",
        dtype=torch.bfloat16,
        device_map="cuda",
        attn_implementation="flash_attention_2",  # options: sdpa | eager | flash_attention_4
        trust_remote_code=True,
    ).eval()
    tokenizer = AutoTokenizer.from_pretrained("chiennv/Orthrus-Qwen3-8B")

    prompt = "Write a program to count the frequency of each word in a paragraph."
    messages = [{"role": "system", "content": ""}, {"role": "user", "content": prompt}]
    input_ids = tokenizer.apply_chat_template(
        messages,
        return_tensors="pt",
        add_generation_prompt=True,
        enable_thinking=False,
    ).input_ids

    output_ids = model.generate(
        input_ids=input_ids.to(model.device),
        max_new_tokens=2048,
        use_diffusion_mode=True,
        streamer=TextStreamer(tokenizer, skip_prompt=True),  # enable streaming generation
    )
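To check the speedup on your own hardware, one option is to time a generation call and divide the number of new tokens by the elapsed time. The sketch below is a hedged illustration and not part of the Orthrus codebase: `count_new_tokens` and `tokens_per_second` are hypothetical helper names, and the commented usage assumes that `use_diffusion_mode=False` falls back to plain autoregressive decoding, which the README does not state.

```python
import time
from typing import Callable, Sequence


def count_new_tokens(output_ids: Sequence[int], prompt_len: int) -> int:
    """Number of tokens generated beyond the prompt (never negative)."""
    return max(len(output_ids) - prompt_len, 0)


def tokens_per_second(generate_fn: Callable[[], Sequence[int]], prompt_len: int) -> float:
    """Time a single call to generate_fn and return new tokens per second.

    generate_fn is any zero-argument callable returning the full 1-D sequence
    of token ids (prompt + completion).
    """
    start = time.perf_counter()
    output_ids = generate_fn()
    elapsed = max(time.perf_counter() - start, 1e-9)  # guard against a zero interval
    return count_new_tokens(output_ids, prompt_len) / elapsed


# Hypothetical usage with the model loaded in the Quickstart (assumption:
# use_diffusion_mode=False selects ordinary autoregressive decoding):
#
# ids = input_ids.to(model.device)
# fast = tokens_per_second(
#     lambda: model.generate(input_ids=ids, max_new_tokens=256, use_diffusion_mode=True)[0],
#     ids.shape[1],
# )
# slow = tokens_per_second(
#     lambda: model.generate(input_ids=ids, max_new_tokens=256, use_diffusion_mode=False)[0],
#     ids.shape[1],
# )
# print(f"speedup: {fast / slow:.2f}x")
```

For a fair comparison, run one untimed warm-up call first and use the same prompt and `max_new_tokens` for both modes.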

Coming soon: native integration with vLLM and SGLang. Stay tuned!

Key Advantages

Significant Inference Acceleration: Breaks the sequential bottleneck of standard autoregressive decoding, delivering up to a $7.8\times$ speedup on generation tasks.

Strictly Lossless Generation: Employs an exact intra-model consensus mechanism to guarantee that the output follows the base model's predictive distribution exactly.

Zero Redundant Memory Overhead: Both the autoregressive and diffusion views attend natively to the same high-fidelity Key-Value (KV) cache, adding only $O(1)$ cache memory overhead.
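The README does not spell out how the consensus mechanism works. As a rough intuition for how parallel proposals can stay lossless, the sketch below shows the generic draft-and-verify rule for greedy decoding: tokens proposed in parallel are kept only while they agree with what the autoregressive view would have chosen, and the first disagreement is replaced by the autoregressive choice. This is an assumption for illustration, not the Orthrus algorithm; `accept_longest_prefix` and `mean_tokens_per_step` are hypothetical names.

```python
from typing import List, Tuple


def accept_longest_prefix(draft: List[int], verified: List[int]) -> Tuple[List[int], int]:
    """Greedy draft-and-verify acceptance for one decoding step.

    draft[i]    : token proposed in parallel for position i
    verified[i] : token the autoregressive view picks for position i,
                  given that draft[:i] was accepted
    Returns (tokens emitted this step, number of draft tokens accepted).
    """
    if len(draft) != len(verified):
        raise ValueError("draft and verified must have the same length")
    emitted: List[int] = []
    for proposed, checked in zip(draft, verified):
        if proposed != checked:
            # First disagreement: emit the autoregressive token and stop.
            emitted.append(checked)
            return emitted, len(emitted) - 1
        emitted.append(proposed)
    return emitted, len(emitted)


def mean_tokens_per_step(steps: List[List[int]]) -> float:
    """Average number of tokens emitted per forward pass."""
    return sum(len(step) for step in steps) / len(steps)


# Two draft tokens agree, the third does not: emit two drafts plus one correction.
tokens, accepted = accept_longest_prefix([5, 9, 2, 7], [5, 9, 4, 1])
print(tokens, accepted)
```

Because every emitted token equals the autoregressive choice, the result under greedy decoding matches ordinary token-by-token decoding, and each step emits at least one token. Matching a sampled distribution exactly needs a probabilistic acceptance rule, which this sketch does not cover.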
