# Show HN： Forge - Guardrails 将 8B 模型在代理任务中的准确率从 53% 提升至 99%

- 来源：Hacker News 热门（buzzing.cc 中文翻译）
- 作者：zambelli
- 发布时间：2026-05-20 05:41
- AIHOT 分数：77
- AIHOT 链接：https://aihot.virxact.com/items/cmpd6esix019bslk173zfhddh
- 原文链接：https://github.com/antoinezambelli/forge

## AI 摘要

Forge – Guardrails 是一个开源工具，通过集成防护栏机制，将8B参数AI模型在代理任务中的准确率从53%大幅提升至99%。这一改进显著增强了模型在复杂任务中的可靠性和效率，降低了错误率。该工具于2026年5月19日在Hacker News社区发布，获得100个点赞，代码已托管在GitHub上供开发者使用。

## 正文

forge

A reliability layer for self-hosted LLM tool-calling. You give forge a set of tools; the model calls whichever it wants in whatever order. Workflow structure is opt-in — required_steps, prerequisites, and terminal_tool let you constrain the loop when you need to, but forge's guardrails (rescue parsing, retry nudges, response validation) apply with zero required steps too.

required_steps

prerequisites

terminal_tool

Forge takes an 8B local model from single digits to 84% across forge's 26-scenario v0.7.0 eval suite — and even lifts Sonnet 4.6 from 85% to 98% on the same workload (Anthropic numbers measured in v0.6.0; not re-run in v0.7.0 since the cost is non-trivial).

What forge isn't:

Not an agent orchestrator. Forge sits inside one agentic loop and makes its tool calls reliable. Multi-agent graphs, DAG planners, and cross-agent coordination are out of scope.

Not a coding harness. Forge is domain-agnostic. If you're building a coding agent (or already using one like opencode, aider, Cline), proxy mode lifts your existing harness with forge's guardrails — no rewrite.

Three ways to use it:

Proxy server — Drop-in proxy (python -m forge.proxy) speaking both the OpenAI chat-completions and Anthropic Messages (/v1/messages) APIs, sitting between any client and a local model server. Point OpenAI-compatible tools (opencode, Continue, aider) or Claude Code at it and forge applies guardrails transparently — the client thinks it's talking to a smarter model. Most popular entry point.

python -m forge.proxy

/v1/messages

WorkflowRunner — Define tools, pick a backend, run structured agent loops. Forge manages the full lifecycle: system prompts, tool execution, context compaction, and guardrails. SlotWorker adds priority-queued access to a shared inference slot with auto-preemption — for multi-agent architectures where specialist workflows share a GPU slot. Best when you're building on forge directly.

Guardrails middleware — Use forge's reliability stack (composable middleware) inside your own orchestration loop. You control the loop; forge validates responses, rescues malformed tool calls, and enforces required steps.

Supports Ollama, llama-server (llama.cpp), Llamafile, vLLM, and Anthropic as backends.

Requirements

Python 3.12+

A running LLM backend (see below)

Install

pip install forge-guardrails # core only pip install "forge-guardrails[anthropic]" # + Anthropic client

For development:

git clone https://github.com/antoinezambelli/forge.git cd forge pip install -e ".[dev]"

Backend setup (pick one)

llama-server (recommended — top 10 eval configs all run on llama-server):

# Install from https://github.com/ggml-org/llama.cpp/releases llama-server -m path/to/Ministral-3-8B-Instruct-2512-Q8_0.gguf --jinja -ngl 999 --port 8080

Ollama (alternative — easier setup, slightly weaker on harder workloads):

# Install from https://ollama.com/download ollama pull ministral-3:8b-instruct-2512-q4_K_M

Anthropic (API, no local GPU needed):

pip install -e ".[anthropic]" export ANTHROPIC_API_KEY=sk-...

See Backend Setup for full instructions and Model Guide for which model fits your hardware.

Quick Start

Start llama-server however you normally do (e.g. in a separate shell):

llama-server -m path/to/Ministral-3-8B-Instruct-2512-Q8_0.gguf --jinja -ngl 999 --port 8080

Then the Python you'll run (e.g. from another shell):

import asyncio from pydantic import BaseModel, Field from forge import ( Workflow, ToolDef, ToolSpec, WorkflowRunner, LlamafileClient, ContextManager, TieredCompact, ) def get_weather(city: str) -> str: return f"72°F and sunny in {city}" class GetWeatherParams(BaseModel): city: str = Field(description="City name") workflow = Workflow( name="weather", description="Look up weather for a city.", tools={ "get_weather": ToolDef( spec=ToolSpec( name="get_weather", description="Get current weather", parameters=GetWeatherParams, ), callable=get_weather, ), }, required_steps=[], terminal_tool="get_weather", system_prompt_template="You are a helpful assistant. Use the available tools to answer the user.", ) async def main(): client = LlamafileClient( gguf_path="path/to/Ministral-3-8B-Instruct-2512-Q8_0.gguf", mode="native", recommended_sampling=True, ) ctx = ContextManager(strategy=TieredCompact(keep_recent=2), budget_tokens=8192) runner = WorkflowRunner(client=client, context_manager=ctx) await runner.run(workflow, "What's the weather in Paris?") asyncio.run(main())

For multi-step workflows, multi-turn conversations, and backend auto-management, see the User Guide. If you're building a long-running session (CLI, chat server, voice assistant), see the long-running session advisory for important guidance on filtering transient messages.

Proxy Server

Drop-in proxy that sits between any client and a local model server, speaking both the OpenAI chat-completions API and the Anthropic Messages API (/v1/messages). Point your client at the proxy (e.g. http://localhost:8081/v1) and forge applies its guardrails transparently — the client thinks it's talking to a smarter model.

/v1/messages

http://localhost:8081/v1

This is the path for using forge with an existing harness (opencode, Continue, aider, Cline, anything that speaks the OpenAI chat-completions schema — or Claude Code, which speaks the Anthropic Messages API). No Python rewrite. Reasoning replay defaults to none: Forge still captures reasoning for observability, but keeps it out of backend-facing history on later turns — the most token-efficient policy, and statistically indistinguishable from replay-all on the eval suite (see reasoning-replay results). Use --reasoning-replay keep-last to replay only the latest reasoning block, or --reasoning-replay full for the historical replay-all behavior.

none

--reasoning-replay keep-last

--reasoning-replay full

# External mode — you manage the backend, forge proxies it python -m forge.proxy --backend-url http://localhost:8080 --port 8081 # Managed mode — forge starts the backend and the proxy together python -m forge.proxy --backend llamaserver --gguf path/to/model.gguf --port 8081 # Managed vLLM — pass a model directory or HF repo id via --model-path python -m forge.proxy --backend vllm --model-path /path/to/awq-dir --port 8081

Then configure your client to use http://localhost:8081/v1 as the API base URL.

http://localhost:8081/v1

Claude Code: the proxy also serves the Anthropic Messages API on POST /v1/messages, so you can point Claude Code at a forge-guarded local model — set ANTHROPIC_BASE_URL=http://localhost:8081 and ANTHROPIC_AUTH_TOKEN=anything for the claude process. See Using forge with Claude Code for the full setup (native-vs-prompt FC, Anthropic-shape downstreams, cache_control).

POST /v1/messages

ANTHROPIC_BASE_URL=http://localhost:8081

ANTHROPIC_AUTH_TOKEN=anything

claude

cache_control

Backend compatibility:

Managed mode spins up the backend for you. Supported backends: llamaserver, llamafile, ollama, vllm (use --backend with --gguf for the GGUF-based backends, --model-path for vllm, or --model for ollama).

llamaserver

llamafile

ollama

vllm

--backend

--gguf

--model-path

--model

External mode is backend-agnostic — forge talks POST /v1/chat/completions to whatever you point --backend-url at, as long as it speaks the OpenAI schema. Tool calls must come back in OpenAI tool_calls format or in one of forge's rescue-parsed formats (Mistral [TOOL_CALLS], Qwen XML, fenced JSON). For a vLLM server, add --backend vllm so the proxy adopts vLLM's --served-model-name (vLLM 404s on a mismatched model field, unlike llama.cpp).

POST /v1/chat/completions

--backend-url

tool_calls

[TOOL_CALLS]

--backend vllm

--served-model-name

model

What proxy mode fortifies

On every POST /v1/chat/completions, forge applies (in order):

POST /v1/chat/completions

Response validation — each tool call in the model's response is checked against the tools array in the request. Calls to unknown tool names or with malformed shapes are caught before the response returns to your client.

tools

Rescue parsing — when the model emits tool calls in the wrong format (JSON in a code fence, Mistral's [TOOL_CALLS]name{args}, Qwen's ... XML), forge extracts the structured call and re-emits it in the canonical OpenAI tool_calls schema. Biggest practical lift for Mistral-family models.

[TOOL_CALLS]name{args}

...

tool_calls

Retry loop with error tracking — if validation fails, forge retries inference up to --max-retries (default 3) with a corrective tool-result message on the canonical channel, rather than returning a malformed response. From your client's perspective the proxy looks like a single request that just took a few extra ms.

--max-retries

Synthetic respond tool injection — when tools are present in the request, forge injects a synthetic respond tool the model calls instead of producing bare text. The respond call is stripped from the outbound response — the client sees a normal text response (finish_reason: "stop") and never knows the tool exists. Essential for small local models (~8B) that can't be trusted to choose correctly between text and tool calls. See ADR-013 for the full analysis.

respond

respond

respond

finish_reason: "stop"

What proxy mode does not do

Proxy mode is single-shot per request; some forge features need multi-turn workflow state that the OpenAI chat-completions schema doesn't carry:

Prerequisite enforcement and step-ordering — these need a workflow definition spanning turns. Available in WorkflowRunner.

WorkflowRunner

Context compaction and session memory — proxy mode forwards the inbound message list as-is; managing the rolling window is the client's job.

VRAM-aware budget detection — opt in with --budget-mode forge-full or --budget-mode forge-fast; otherwise proxy uses the backend's reported budget.

--budget-mode forge-full

--budget-mode forge-fast

For the full guardrail surface, use WorkflowRunner directly. The proxy trades depth for "use forge with your existing setup, no rewrite."

WorkflowRunner

Useful flags

Flag Default Purpose --max-retries N 3 Retry budget per validation failure --no-rescue (rescue on) Disable rescue parsing (debugging only) --budget-mode {backend,manual,forge-full,forge-fast} backend Context budget source --budget-tokens N — Manual token budget (requires --budget-mode manual) --serialize / --no-serialize auto Force request serialization (single-slot backends)

--max-retries N

--no-rescue

--budget-mode {backend,manual,forge-full,forge-fast}

backend

--budget-tokens N

--budget-mode manual

--serialize

--no-serialize

Docker

You can run the forge proxy as a Docker container.

Build the image:

docker build -t forge-proxy .

Run the container:

# Connect to an external backend (e.g. vLLM hosted on the same machine) docker run -p 8081:8081 forge-proxy --backend-url http://host.docker.internal:8000 --backend vllm --budget-mode manual --budget-tokens 8192

Note: If your backend is running on localhost of the host machine, use http://host.docker.internal:PORT (on macOS/Windows) or the host's IP address to allow the container to reach it.

localhost

http://host.docker.internal:PORT

Backends

Backend Best for Native FC? Ollama Easiest setup, model management built-in Yes llama-server Best performance, full control Yes (with --jinja) Llamafile Single binary, zero dependencies No (prompt-injected) vLLM High-throughput serving, AWQ/GPTQ weights Yes (server-side parser) Anthropic Frontier baseline, hybrid workflows Yes

--jinja

See Backend Setup for installation and Model Guide for which model to pick.

Running Tests

python -m pytest tests/ -v --tb=short

python -m pytest tests/ --cov=forge --cov-report=term-missing

Eval Harness

26 scenarios measuring how reliably a model + backend combo navigates multi-step tool-calling workflows — split into an OG-18 baseline tier and an 8-scenario advanced_reasoning tier for top-end separation. See Eval Guide for full CLI reference.

# llama-server (start in another terminal first; see Eval Guide) python -m tests.eval.eval_runner --backend llamafile --llamafile-mode prompt --gguf "path/to/Ministral-3-8B-Instruct-2512-Q8_0.gguf" --runs 10 --stream --verbose # Batch eval (JSONL output, automatic resume) python -m tests.eval.batch_eval --config all --runs 50 # Reports — ASCII table by default; --html / --markdown export views python -m tests.eval.report eval_results.jsonl python -m tests.eval.report eval_results.jsonl --html docs/results/dashboard.html python -m tests.eval.report eval_results.jsonl --markdown docs/results/raw/

Project Structure

src/forge/ __init__.py # Public API exports errors.py # ForgeError hierarchy server.py # setup_backend(), ServerManager, BudgetMode core/ messages.py # Message, MessageRole, MessageType, MessageMeta workflow.py # ToolSpec, ToolDef, ToolCall, TextResponse, Workflow inference.py # run_inference() — shared front half (compact, fold, validate, retry) runner.py # WorkflowRunner — the agentic loop slot_worker.py # SlotWorker — priority-queued slot access steps.py # StepTracker guardrails/ guardrails.py # Guardrails facade — applies the full stack in foreign loops nudge.py # Nudge dataclass response_validator.py # ResponseValidator, ValidationResult step_enforcer.py # StepEnforcer, StepCheck error_tracker.py # ErrorTracker clients/ base.py # ChunkType, StreamChunk, LLMClient protocol ollama.py # OllamaClient (native FC) llamafile.py # LlamafileClient (native FC or prompt-injected) anthropic.py # AnthropicClient (frontier baseline) context/ manager.py # ContextManager, CompactEvent strategies.py # CompactStrategy, NoCompact, TieredCompact, SlidingWindowCompact hardware.py # HardwareProfile, detect_hardware() prompts/ templates.py # Tool prompt builders (prompt-injected path) nudges.py # Retry and step-enforcement nudge templates tools/ respond.py # Synthetic respond tool (respond_tool(), respond_spec()) proxy/ __main__.py # CLI entry point: python -m forge.proxy proxy.py # ProxyServer — programmatic start/stop API server.py # Raw asyncio HTTP server, SSE streaming handler.py # Request handler — bridge between HTTP and run_inference convert.py # OpenAI messages ↔ forge Messages conversion tests/ unit/ # 865 deterministic tests — no LLM backend required eval/ # Eval harness — model qualification against real backends

Documentation

User Guide — Usage patterns, multi-turn, context management, guardrails, slot worker, long-running session advisory

Model Guide — Which model and backend for your hardware

Backend Setup — Backend installation and server setup

Eval Guide — Eval harness CLI reference, batch eval

Architecture — Full design document

Workflow Internals — Workflow design and runner internals

Contributing — How to set up, test, and add new backends or scenarios

Paper

The forge guardrail framework and ablation study are published as:

Zambelli, A. Forge: A Reliability Layer for Self-Hosted LLM Tool-Calling. https://doi.org/10.1145/3786335.3813193

A pre-publication preprint is also available at docs/forge_ieee_preprint.pdf — kept as a historical artifact. Cite the published version above; the DOI link may not resolve immediately depending on the publisher's release timing.

License

MIT — Copyright (c) 2025-2026 Antoine Zambelli