# Orthrus-Qwen3: up to 7.8 tokens per step on Qwen3, with an output distribution identical to the original

- Source: Hacker News top stories (Chinese translation via buzzing.cc)
- Author: FranckDernoncou
- Published: 2026-05-16 18:56
- AIHOT score: 67
- AIHOT link: https://aihot.virxact.com/items/cmp88knzp0gmsslnz3h2olrl6
- Original link: https://github.com/chiennv2000/orthrus

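## AI Summary

The Orthrus-Qwen3 project processes up to 7.8 tokens per forward pass on Qwen3 models while keeping the output distribution identical to that of the original model. The project is open source on GitHub and received 102 points on Hacker News. The optimization substantially improves inference efficiency while preserving the accuracy of the generated output.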

## Full Text

Orthrus: Memory-Efficient Parallel Token Generation via Dual-View Diffusion

Official implementation and model checkpoints for Orthrus, a dual-architecture framework that unifies the exact generation fidelity of autoregressive Large Language Models (LLMs) with the high-speed parallel token generation of diffusion models.

Model Zoo

All models use a Qwen3 backbone and guarantee strictly lossless generation.

| Model | Base Model | HuggingFace | Avg. Speedup |
|---|---|---|---|
| Orthrus-Qwen3-1.7B | Qwen3-1.7B | 🤗 HuggingFace | 4.25× |
| Orthrus-Qwen3-4B | Qwen3-4.0B | 🤗 HuggingFace | 5.20× |
| Orthrus-Qwen3-8B | Qwen3-8.0B | 🤗 HuggingFace | 5.36× |

Installation

```shell
uv pip install -e .
uv pip install ninja packaging
uv pip install flash-attn --no-build-isolation  # or: pip install "flash-attn-4[cu13]" if your device supports it
```

We recommend uv for fast dependency resolution.


Quickstart

⚡ Try instantly: Run Orthrus directly in Colab:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

model = AutoModelForCausalLM.from_pretrained(
    "chiennv/Orthrus-Qwen3-8B",
    dtype=torch.bfloat16,
    device_map="cuda",
    attn_implementation="flash_attention_2",  # options: sdpa | eager | flash_attention_4
    trust_remote_code=True,
).eval()
tokenizer = AutoTokenizer.from_pretrained("chiennv/Orthrus-Qwen3-8B")

prompt = "Write a program to count the frequency of each word in a paragraph."
messages = [{"role": "system", "content": ""}, {"role": "user", "content": prompt}]
input_ids = tokenizer.apply_chat_template(
    messages,
    return_tensors="pt",
    add_generation_prompt=True,
    enable_thinking=False,
).input_ids

output_ids = model.generate(
    input_ids=input_ids.to(model.device),
    max_new_tokens=2048,
    use_diffusion_mode=True,
    streamer=TextStreamer(tokenizer, skip_prompt=True),  # enable streaming generation
)
```
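To check the speedup on your own hardware, a small timing helper can be wrapped around the quickstart call. The sketch below is not part of the Orthrus repository; `tokens_per_second` and its parameters are hypothetical names, and it only assumes that the callable returns the full token sequence (prompt plus new tokens).

```python
import time
from typing import Callable, Sequence


def tokens_per_second(
    generate: Callable[[], Sequence[int]],
    prompt_len: int,
    warmup: int = 1,
    runs: int = 3,
) -> float:
    """Return the mean number of newly generated tokens per second.

    `generate` is any zero-argument callable returning the full output
    token sequence (prompt + new tokens). Hypothetical helper, not an
    Orthrus API.
    """
    for _ in range(warmup):
        generate()  # warm-up calls are not timed (kernel compilation, caches)

    total_tokens = 0
    total_time = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        output = generate()
        total_time += time.perf_counter() - start
        total_tokens += len(output) - prompt_len  # count only new tokens
    return total_tokens / total_time
```

With the quickstart objects, `generate` could be something like `lambda: model.generate(input_ids=ids, max_new_tokens=256, use_diffusion_mode=True)[0]`, with `prompt_len=ids.shape[1]`. Comparing against a run with `use_diffusion_mode=False` assumes that the flag also accepts `False` for plain autoregressive decoding, which the source does not state explicitly.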

Coming soon: native integration with vLLM and SGLang. Stay tuned!

Key Advantages

- Significant Inference Acceleration: Breaks the sequential bottleneck of standard autoregressive decoding, delivering up to a $7.8\times$ speedup on generation tasks.
- Strictly Lossless Generation: Employs an exact intra-model consensus mechanism to guarantee that the output matches the original base model's exact predictive distribution.
- Zero Redundant Memory Overhead: Both the autoregressive and diffusion views attend to the exact same high-fidelity Key-Value (KV) cache natively, resulting in only an $O(1)$ memory cache overhead.
- Parameter Efficient: Parallel generation capabilities are injected by fine-tuning only 16% of the total model parameters while keeping the base LLM strictly frozen.
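To make the "lossless" claim concrete, here is a minimal sketch of the generic greedy draft-and-verify rule used in speculative-decoding-style methods. It is an illustration under stated assumptions, not the Orthrus consensus mechanism itself, whose details are in the paper; the function name `accept_greedy` is hypothetical. Under greedy decoding, committing only tokens the base model agrees with yields exactly the sequence the base model would have produced on its own.

```python
from typing import List, Sequence


def accept_greedy(draft: Sequence[int], target: Sequence[int]) -> List[int]:
    """Verify a block of drafted tokens against the base model's greedy picks.

    draft[i]  : token proposed in parallel for position i of the block.
    target[i] : token the base model picks at position i when conditioned
                on the prefix plus draft[:i]. It has len(draft) + 1 entries
                because one verification pass also yields the token that
                follows the whole block.
    Returns the tokens to commit: the longest agreeing prefix plus one
    token chosen by the base model itself.
    """
    if len(target) != len(draft) + 1:
        raise ValueError("target must have exactly one more entry than draft")

    committed: List[int] = []
    for proposed, verified in zip(draft, target):
        if proposed != verified:
            # First disagreement: keep the base model's token and stop.
            committed.append(verified)
            return committed
        committed.append(proposed)

    # The whole block agreed, so the extra base-model token is kept too.
    committed.append(target[-1])
    return committed


print(accept_greedy([5, 9, 2], [5, 9, 7, 4]))  # [5, 9, 7]
print(accept_greedy([5, 9, 2], [5, 9, 2, 4]))  # [5, 9, 2, 4]
```

Every pass commits at least one token, so decoding never falls behind plain autoregressive generation in tokens per pass. Matching the full predictive distribution under sampling, rather than greedy decoding, requires a different acceptance rule that this sketch does not cover.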

Performance Comparison: Orthrus vs. Speculative Decoding

Orthrus outperforms speculative decoding methods such as EAGLE-3 and DFlash. By natively sharing the exact same KV cache across its two views, Orthrus avoids the redundant memory overhead of draft models, which results in significantly higher token acceptance rates and faster inference, especially as context length scales. Orthrus maintains consistently high end-to-end throughput even at 40K context lengths, whereas DFlash degrades rapidly.
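The role of cache memory at long contexts is easier to weigh with rough numbers. The sketch below estimates the size of a standard KV cache for one sequence. The layer count, KV-head count and head dimension are assumptions about a Qwen3-8B-like configuration, not values taken from the source, and "40K" is read here as 40 × 1024 tokens.

```python
def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    seq_len: int,
    bytes_per_value: int = 2,
) -> int:
    """Size in bytes of a standard KV cache for a single sequence.

    The factor 2 covers keys and values; bytes_per_value=2 corresponds
    to bfloat16 or float16 storage.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# Assumed Qwen3-8B-like shape: 36 layers, 8 KV heads, head dimension 128.
size = kv_cache_bytes(num_layers=36, num_kv_heads=8, head_dim=128, seq_len=40 * 1024)
print(f"{size / 2**30:.3f} GiB")  # 5.625 GiB
```

Under these assumptions the base model's cache alone is several GiB per sequence at this context length, which is why any second cache kept by a separate drafting component becomes a noticeable cost as context grows.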

Left: Average verified tokens per forward pass compared to EAGLE-3 and DFlash. Right: End-to-end throughput across scaling context lengths.
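As a back-of-the-envelope way to relate the two panels (a simplifying assumption, not a formula from the source): if each forward pass commits $\tau$ tokens on average and costs $c$ times as much as one ordinary autoregressive step of the base model, the expected end-to-end speedup is roughly

$$
\text{speedup} \approx \frac{\tau}{c}, \qquad c \ge 1 .
$$

Read this way, the average number of verified tokens per pass is an upper bound on the achievable speedup, and the gap between the two reflects the extra cost of each parallel pass.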

Comparison with State-of-the-Art Diffusion Models

While recent diffusion language models (dLLMs) offer parallel decoding, they often suffer from significant conditional drift and severe accuracy degradation on complex reasoning tasks. Orthrus resolves this by decoupling parallel generation from sequential constraints, establishing a new state-of-the-art for parallel generation fidelity.

Throughput vs. accuracy on MATH-500. Orthrus delivers a ~6× speedup over the Qwen3-8B baseline with strictly lossless performance, whereas adaptations like Fast-dLLM-v2 suffer significant accuracy drops.

Further Support

MLX (Apple Silicon)

Orthrus supports native inference on Apple Silicon via MLX. Tested with mlx==0.31.2 and mlx-lm==0.31.3.


Usage:

```python
from src.model_mlx import load_model_and_tokenizer, mlx_generate

repo_id = "chiennv/Orthrus-Qwen3-1.7B"
model, tokenizer = load_model_and_tokenizer(repo_id)

prompt_tokens = tokenizer.encode("If a rectangle has length 12 and width 7, what is its area?")
for token in mlx_generate(model, prompt_tokens, tokenizer.eos_token_id, max_tokens=128):
    print(tokenizer.decode([token]), end="", flush=True)
```

Citation

If you find this model or architecture useful in your work, please cite our paper:

```bibtex
@misc{vannguyen2026orthrusmemoryefficientparalleltoken,
  title={Orthrus: Memory-Efficient Parallel Token Generation via Dual-View Diffusion},
  author={Chien Van Nguyen and Chaitra Hegde and Van Cuong Pham and Ryan A. Rossi and Franck Dernoncourt and Thien Huu Nguyen},
  year={2026},
  eprint={2605.12825},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2605.12825},
}
```