Harness-1: a 20B search agent with a stateful harness
A retrieval subagent trained with reinforcement learning inside a search harness that holds the bookkeeping.
UIUC和Chroma放出的这个20B检索子代理,用RL训练出了0.73的平均召回,把开源竞品甩开11.4分,只比Opus-4.6低一点。权重、Harness全开源,搞RAG的可以真刀真枪试试了。
UIUC与Chroma联合推出Harness-1,一个20B参数的检索子智能体。它通过强化学习在一个有状态搜索框架中训练,该框架维护候选池、重要性标注集、证据图和验证记录,由策略决定搜索、筛选、验证及停止的时机。Harness-1在8个基准测试上达到0.730平均curated recall,比下一个最佳开源子智能体高出11.4个百分点,仅落后于Opus-4.6。模型权重和框架代码均已公开。
Most search agents are trained as policies over a growing transcript. The model decides how to search. It must also remember what it saw, which evidence matters, and which claims it checked. A team of researchers from University of Illinois Urbana-Champaign, UC Berkeley, and Chroma argues this asks too much. Reinforcement learning ends up optimizing both search decisions and routine bookkeeping at once.
Their answer is Harness-1, a 20B retrieval subagent built on gpt-oss-20b. It was trained with reinforcement learning inside a stateful search harness. The harness holds the bookkeeping. The policy keeps the semantic decisions. The weights and harness code are publicly released.

Harness-1 produces a ranked set of documents for a downstream answering model. It does not answer questions itself. It runs inside a state-machine harness centered on a per-episode WORKINGMEMORY.
Each turn works as a loop. The harness renders compact search state along with recent actions. The model emits one structured action. The harness executes it, updates state, and renders the next observation.
The research team calls its principle stateful cognitive offloading. The policy decides what to search, curate, and verify, and when to stop. The harness maintains the recoverable state around those decisions.
That state includes several pieces. A candidate pool holds compressed, deduplicated documents. An importance-tagged curated set is the final output, capped at 30 documents. Tags take four values: very_high, high, fair, or low. A full-text store keeps every retrieved chunk outside the prompt.
An evidence graph adds structure. A regex extractor scans each chunk for proper nouns, years, and dates. The harness then renders frequent entities, bridge documents, and singletons. Bridge documents contain two or more frequent entities. Singletons appear in one document and suggest follow-up leads.
The policy works through eight tools. These are fan_out_search, search_corpus, grep_corpus, read_document, review_docs, curate, verify, and end_search. Search outputs are compressed with sentence-BM25, keeping the top four sentences. Two-level deduplication removes repeats by chunk ID and content fingerprint.
One design choice addresses cold starts. The first successful search auto-seeds the curated set with eight reranked results at fair importance. The policy then promotes strong documents and removes weak ones. This turns the task from building from scratch into refinement.
The research team names three requirements for a trainable harness. These are warm-started curation, compact derived-state rendering, and diversity-preserving incentives. Harness-1 implements all three.
Training splits along the same line as the harness. Supervised fine-tuning teaches the model to operate the interface. Reinforcement learning improves search decisions over the maintained state.
A single teacher, GPT-5.4, runs live inside the full harness. After filtering, 899 trajectories remain for SFT. The model uses LoRA at rank 32 for three epochs. The step-550 checkpoint initializes RL.
RL uses on-policy CISPO with a 40-turn cap and terminal-only reward. It trains only on SEC queries. Groups with identical rewards are dropped from the gradient. Training ran on Tinker.
The reward separates discovery from selection. It also adds a tool-diversity bonus. Without that bonus, the agent collapses to repeated search. Curated recall then plateaus near 0.53. With the bonus, diversity stabilizes and recall reaches about 0.60.
Harness-1 was evaluated on eight benchmarks spanning web, finance, patents, and multi-hop QA. The main metric is curated recall: coverage of relevant documents in the final set. Trajectory recall counts evidence encountered anywhere in the episode.
| Model | Type | Avg Curated Recall | Avg Trajectory Recall |
|---|---|---|---|
| Harness-1 (20B) | Open small | 0.730 | 0.807 |
| Tongyi DeepResearch 30B | Open small | 0.616 | 0.673 |
| Context-1 (20B) | Open small | 0.603 | 0.756 |
| Search-R1 (32B) | Open small | 0.289 | 0.289 |
| GPT-OSS-20B | Open small | 0.262 | 0.590 |
| Qwen3 (32B) | Open small | 0.216 | 0.446 |
| Opus-4.6 | Frontier | 0.764 | 0.794 |
| GPT-5.4 | Frontier | 0.709 | 0.752 |
| Sonnet-4.6 | Frontier | 0.688 | 0.725 |
| Kimi-K2.5 | Frontier | 0.647 | 0.794 |
| GPT-OSS-120B | Frontier | 0.496 | 0.769 |
Harness-1 reaches 0.730 average curated recall. That beats the next open subagent, Tongyi DeepResearch 30B, by 11.4 points. Among the frontier searchers tested, only Opus-4.6 scores higher on average.
The transfer pattern is the clearest signal of the mechanism. SFT used four benchmark families; RL used only SEC. On those source-family tasks, Harness-1 gained 7.9 points over the closest open baseline. On four held-out benchmarks, it gained 17.0 points. That is a 2.2x larger gain on tasks furthest from training data.
Ablations support the harness claim. Disabling all harness mechanisms drops Recall by 12.2 percent relative on BrowseComp+. The trained policy keeps searching but cannot rank what it sees.

The method targets evidence-seeking retrieval where documents support an answer. Several workflows fit this shape.
One is literature and patent review. The evidence graph and curated set help organize many sources. Another is financial-filing analysis. The SEC case study recovers an exact executive-transition date across multiple 8-Ks.
A third is multi-hop fact-checking. The fan_out_search and verify tools resolve ambiguous entities before committing. A fourth is modular RAG. The curated set feeds a frozen generator, and better sets yield higher answer accuracy.
A retrieval subagent trained with reinforcement learning inside a search harness that holds the bookkeeping.
Most search agents pack search decisions and routine bookkeeping into one growing transcript. Harness-1 separates the two. The paper calls this stateful cognitive offloading.
UIUC和Chroma放出的这个20B检索子代理,用RL训练出了0.73的平均召回,把开源竞品甩开11.4分,只比Opus-4.6低一点。权重、Harness全开源,搞RAG的可以真刀真枪试试了。
UIUC与Chroma联合推出Harness-1,一个20B参数的检索子智能体。它通过强化学习在一个有状态搜索框架中训练,该框架维护候选池、重要性标注集、证据图和验证记录,由策略决定搜索、筛选、验证及停止的时机。Harness-1在8个基准测试上达到0.730平均curated recall,比下一个最佳开源子智能体高出11.4个百分点,仅落后于Opus-4.6。模型权重和框架代码均已公开。
Most search agents are trained as policies over a growing transcript. The model decides how to search. It must also remember what it saw, which evidence matters, and which claims it checked. A team of researchers from University of Illinois Urbana-Champaign, UC Berkeley, and Chroma argues this asks too much. Reinforcement learning ends up optimizing both search decisions and routine bookkeeping at once.
Their answer is Harness-1, a 20B retrieval subagent built on gpt-oss-20b. It was trained with reinforcement learning inside a stateful search harness. The harness holds the bookkeeping. The policy keeps the semantic decisions. The weights and harness code are publicly released.

Harness-1 produces a ranked set of documents for a downstream answering model. It does not answer questions itself. It runs inside a state-machine harness centered on a per-episode WORKINGMEMORY.
Each turn works as a loop. The harness renders compact search state along with recent actions. The model emits one structured action. The harness executes it, updates state, and renders the next observation.
The research team calls its principle stateful cognitive offloading. The policy decides what to search, curate, and verify, and when to stop. The harness maintains the recoverable state around those decisions.
That state includes several pieces. A candidate pool holds compressed, deduplicated documents. An importance-tagged curated set is the final output, capped at 30 documents. Tags take four values: very_high, high, fair, or low. A full-text store keeps every retrieved chunk outside the prompt.
An evidence graph adds structure. A regex extractor scans each chunk for proper nouns, years, and dates. The harness then renders frequent entities, bridge documents, and singletons. Bridge documents contain two or more frequent entities. Singletons appear in one document and suggest follow-up leads.
The policy works through eight tools. These are fan_out_search, search_corpus, grep_corpus, read_document, review_docs, curate, verify, and end_search. Search outputs are compressed with sentence-BM25, keeping the top four sentences. Two-level deduplication removes repeats by chunk ID and content fingerprint.
One design choice addresses cold starts. The first successful search auto-seeds the curated set with eight reranked results at fair importance. The policy then promotes strong documents and removes weak ones. This turns the task from building from scratch into refinement.
The research team names three requirements for a trainable harness. These are warm-started curation, compact derived-state rendering, and diversity-preserving incentives. Harness-1 implements all three.
Training splits along the same line as the harness. Supervised fine-tuning teaches the model to operate the interface. Reinforcement learning improves search decisions over the maintained state.
A single teacher, GPT-5.4, runs live inside the full harness. After filtering, 899 trajectories remain for SFT. The model uses LoRA at rank 32 for three epochs. The step-550 checkpoint initializes RL.
RL uses on-policy CISPO with a 40-turn cap and terminal-only reward. It trains only on SEC queries. Groups with identical rewards are dropped from the gradient. Training ran on Tinker.
The reward separates discovery from selection. It also adds a tool-diversity bonus. Without that bonus, the agent collapses to repeated search. Curated recall then plateaus near 0.53. With the bonus, diversity stabilizes and recall reaches about 0.60.
Harness-1 was evaluated on eight benchmarks spanning web, finance, patents, and multi-hop QA. The main metric is curated recall: coverage of relevant documents in the final set. Trajectory recall counts evidence encountered anywhere in the episode.
| Model | Type | Avg Curated Recall | Avg Trajectory Recall |
|---|---|---|---|
| Harness-1 (20B) | Open small | 0.730 | 0.807 |
| Tongyi DeepResearch 30B | Open small | 0.616 | 0.673 |
| Context-1 (20B) | Open small | 0.603 | 0.756 |
| Search-R1 (32B) | Open small | 0.289 | 0.289 |
| GPT-OSS-20B | Open small | 0.262 | 0.590 |
| Qwen3 (32B) | Open small | 0.216 | 0.446 |
| Opus-4.6 | Frontier | 0.764 | 0.794 |
| GPT-5.4 | Frontier | 0.709 | 0.752 |
| Sonnet-4.6 | Frontier | 0.688 | 0.725 |
| Kimi-K2.5 | Frontier | 0.647 | 0.794 |
| GPT-OSS-120B | Frontier | 0.496 | 0.769 |
Harness-1 reaches 0.730 average curated recall. That beats the next open subagent, Tongyi DeepResearch 30B, by 11.4 points. Among the frontier searchers tested, only Opus-4.6 scores higher on average.
The transfer pattern is the clearest signal of the mechanism. SFT used four benchmark families; RL used only SEC. On those source-family tasks, Harness-1 gained 7.9 points over the closest open baseline. On four held-out benchmarks, it gained 17.0 points. That is a 2.2x larger gain on tasks furthest from training data.
Ablations support the harness claim. Disabling all harness mechanisms drops Recall by 12.2 percent relative on BrowseComp+. The trained policy keeps searching but cannot rank what it sees.

The method targets evidence-seeking retrieval where documents support an answer. Several workflows fit this shape.
One is literature and patent review. The evidence graph and curated set help organize many sources. Another is financial-filing analysis. The SEC case study recovers an exact executive-transition date across multiple 8-Ks.
A third is multi-hop fact-checking. The fan_out_search and verify tools resolve ambiguous entities before committing. A fourth is modular RAG. The curated set feeds a frozen generator, and better sets yield higher answer accuracy.
A retrieval subagent trained with reinforcement learning inside a search harness that holds the bookkeeping.
Most search agents pack search decisions and routine bookkeeping into one growing transcript. Harness-1 separates the two. The paper calls this stateful cognitive offloading.