DeepReinforce 发布 Ornith-1.0 开源编码模型族
阅读原文· marktechpost.comDeepReinforce 发布 Ornith-1.0 开源编码模型族,基于 Gemma 4 和 Qwen 3.5 后训练,提供 9B、31B、35B-MoE(每 token 激活约 3B 参数)和 397B-MoE 四个尺寸,均以 MIT 许可在 HuggingFace 开放。与固定人工设计框架的编码智能体不同,Ornith-1.0 在强化学习中联合优化框架与解决方案,并引入三层防御(固定信任边界、确定性监视器、冻结 LLM 裁判)防止奖励黑客。旗舰版 Ornith-1.0-397B 在 Terminal-Bench 2.1 上得分 77.5、在 SWE-Bench Verified 上得分 82.4,超越 Claude Opus 4.7(70.3)但低于 Claude Opus 4.8(85)和 GLM-5.2-744B(81.0)。支持 vLLM、SGLang 等推理框架,9B 模型(bf16 约 19GB)可部署在单张 80GB GPU 上。
DeepReinforce has released Ornith-1.0, an open-source model family built for agentic coding. The lineup spans four sizes, from a 9B dense model to a 397B mixture-of-experts flagship. Every checkpoint ships under the MIT license on Hugging Face. The models are post-trained on top of pretrained Gemma 4 and Qwen 3.5.
Most coding agents pair a model with a fixed, human-designed harness. Ornith-1.0 instead learns to write its own. The DeepReinforce research team reports state-of-the-art results among open models of comparable size.
TL;DR
- Ornith-1.0 ships in 9B, 31B, 35B-MoE, and 397B-MoE sizes under MIT, built on Gemma 4 and Qwen 3.5.
- The model learns its own scaffold during RL, jointly optimizing the harness and the solution.
- Ornith-1.0-397B tops Claude Opus 4.7 on both headline benchmarks, but not Opus 4.8 or the larger GLM-5.2-744B.
- Three layers — fixed trust boundary, deterministic monitor, frozen LLM judge — guard against reward hacking.
What is Ornith-1.0?
Ornith-1.0 is a set of reasoning models tuned for coding agents. The variants are 9B Dense, 31B Dense, 35B MoE, and 397B MoE. The 35B model is mixture-of-experts and activates roughly 3B parameters per token. FP8 and GGUF builds are also published for faster local serving.
Each model is a reasoning model. Replies open with a <think> block before the final answer. The serving recipes enable a reasoning parser, so that trace returns in a separate reasoning_content field. The models also emit well-formed tool calls for agent loops.
Deployment is straightforward. The 9B model is about 19GB in bf16 and serves on a single 80GB GPU. Serving recipes target vLLM, SGLang, and Transformers. Each model exposes an OpenAI-compatible endpoint. Standard agent frameworks therefore work without code changes.
Interactive Explainer
The Self-Scaffolding Idea
Most coding agents rely on a scaffold, also called a harness. A scaffold wraps the model with memory, tools, error handling, and orchestration logic. AI teams usually hand-design one scaffold per task category.
Ornith-1.0 treats the scaffold as a learnable object instead. During reinforcement learning, the scaffold co-evolves with the model’s policy. Each RL step runs in two stages.
First, the model reads the task and its previous scaffold. It then proposes a refined scaffold. Second, it uses that scaffold and the task to generate a solution rollout. Reward from the rollout flows back to both stages.
So the model is optimized to author orchestration, not just answers. Over training, higher-reward scaffolds are mutated and selected automatically. Per-task strategies emerge without hand-engineered harness design.
Training also runs asynchronously, using a pipeline-RL setup. A staleness weight downweights older, off-policy tokens and drops them past a threshold. The optimization uses a token-level GRPO objective.
Guarding Against Reward Hacking
Letting a model write its own scaffold invites reward hacking. A scaffold could read visible test files and hardcode expected outputs. It could also copy an oracle solution sitting in the environment. DeepReinforce team describes three defense layers.
- The outer trust boundary is fixed and immutable. The environment, tool surface, and test isolation stay outside the model’s reach. The model evolves only its inner policy scaffold.
- A deterministic monitor flags banned actions. Reading withheld paths or editing verification scripts earns zero reward. Those trajectories are excluded from the advantage computation.
- A frozen LLM judge acts as a veto. It sits on top of the verifier, not as the primary reward.
Benchmark
DeepReinforce reports vendor numbers across several agentic coding benchmarks. At flagship scale, Ornith-1.0-397B posts 77.5 on Terminal-Bench 2.1 and 82.4 on SWE-Bench Verified. On SWE-Bench Verified, that 82.4 trails only Claude Opus 4.8 (87.6) among the listed models. On Terminal-Bench 2.1, the picture is more mixed.
Ornith-1.0-397B beats Claude Opus 4.7 (70.3) on Terminal-Bench 2.1. But it trails Claude Opus 4.8 (85) and the larger GLM-5.2-744B (81.0). So the ‘state-of-the-art’ claim is scoped to open models of comparable size.
The smaller models carry the efficiency case. The 35B model scores 64.2 on Terminal-Bench 2.1, above Qwen 3.5-397B’s 53.5. The 9B model reaches 43.1 on Terminal-Bench 2.1 and 69.4 on SWE-Bench Verified.
| Benchmark | Ornith-1.0-397B | Qwen3.5-397B | Qwen3.7-Max | GLM-5.2-744B | Minimax-M3-428B | DeepSeek-V4-Pro-1.6T | Claude Opus 4.7 | Claude Opus 4.8 |
|---|---|---|---|---|---|---|---|---|
| Terminal-Bench 2.1 | 77.5 | 53.5 | 73.5 | 81.0 | 64 | 64 | 70.3 | 85 |
| SWE-Bench Verified | 82.4 | 76.4 | 80.4 | – | – | 80.6 | 80.8 | 87.6 |
| SWE-Bench Pro | 62.2 | 51.6 | 60.6 | 62.1 | 59 | 55.4 | 64.3 | 69.2 |
| SWE-Bench Multilingual | 78.9 | 69.3 | 78.3 | – | – | 76.2 | – | – |
| NL2Repo | 48.2 | 36.8 | 47.2 | 48.9 | 42.1 | – | – | 69.7 |
| ClawEval Avg | 77.1 | 70.7 | 65.2 | – | – | 75.8 | 78.2 | – |
Use Cases and a Quick Start
The models target terminal-native coding agents and repository-scale work. Practical fits include multi-file refactors, bug localization, and test-driven patches. The 9B model suits edge or single-GPU setups where latency and cost matter. The 397B model targets maximum accuracy on long, multi-step tasks.
For example, a dev can run the 9B model locally to triage a failing test suite. A platform team can self-host the 397B model for an internal coding agent.
Serving is a one-liner with vLLM:
vllm serve deepreinforce-ai/Ornith-1.0-9B \
--served-model-name Ornith-1.0-9B \
--max-model-len 262144 \
--enable-auto-tool-choice --tool-call-parser qwen3_xml \
--reasoning-parser qwen3 \
--trust-remote-codeThen call it with any OpenAI client:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
model="Ornith-1.0-9B",
messages=[{"role": "user", "content": "Write a Python is_prime(n)."}],
temperature=0.6, top_p=0.95,
)
msg = resp.choices[0].message
print(getattr(msg, "reasoning_content", None)) # the <think> trace
print(msg.content) # the final answerThe reasoning trace returns in reasoning_content, with the answer in content. Recommended sampling is temperature=0.6, top_p=0.95, top_k=20. The model also plugs into OpenHands, OpenClaw, and OpenCode.