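Response Caching: Zero Cost for Identical Requests
OpenRouter's new response caching makes identical requests free. For workloads that call the API frequently, it saves money and cuts latency, and anyone who has used the API will see its value right away.
The new Response Caching header adds a caching mechanism for API requests: a fully identical request receives a cached response, response time drops to a small fraction of the original, and no additional cost is incurred. By automatically identifying and reusing responses that have already been generated, the feature makes repeated requests much cheaper and faster to serve.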
Response Caching: Zero Cost for Identical Requests — OpenRouter Blog
Brian Thomas · 4/30/2026

You can now add X-OpenRouter-Cache: true to your chat completions, responses, messages, or embeddings requests to start caching identical calls. The first call hits the provider and gets billed normally. Every identical call after that returns the same response in a tiny fraction of the time, with zero tokens billed.
What it does
Response caching sits in front of the model provider. When you send a request with caching enabled, OpenRouter hashes the request body, model, API key, and streaming mode into a cache key. If an identical request was made before and hasn’t expired, the cached response comes back immediately. No provider call, no token consumption, no charge.
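To make the cache key idea concrete, here is a minimal sketch of how the four listed inputs could be combined into one hash. The post does not describe OpenRouter's actual hashing scheme, so the function name `cache_key`, the use of SHA-256, and the canonical JSON step are all assumptions made for illustration. In particular, whether the real cache ignores JSON key order is not stated.

```python
import hashlib
import json


def cache_key(body: dict, api_key: str) -> str:
    """Illustrative cache key; NOT OpenRouter's actual hashing scheme."""
    # The post lists four inputs: request body, model, API key, streaming mode.
    # In a chat completions request, model and stream live inside the body;
    # they are pulled out here only to mirror that list.
    material = {
        "body": body,
        "model": body.get("model"),
        "stream": bool(body.get("stream", False)),
        "api_key": api_key,
    }
    # Canonical JSON (sorted keys, fixed separators) so that dict ordering
    # alone does not change the hash in this sketch.
    canonical = json.dumps(material, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


request = {
    "model": "google/gemini-2.5-flash",
    "messages": [{"role": "user", "content": "What is the meaning of life?"}],
}

same = cache_key(request, "key-A") == cache_key(dict(request), "key-A")
other_key = cache_key(request, "key-A") == cache_key(request, "key-B")
streamed = cache_key(request, "key-A") == cache_key({**request, "stream": True}, "key-A")
print(same, other_key, streamed)  # True False False
```

The practical consequence is that any change to the body, the API key, or the streaming flag yields a different key and therefore a cache miss.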
Both streaming and non-streaming requests work. Cached streaming responses replay through the same pipeline, so your client code doesn’t need to change. Text, images, audio, documents, and tool calls all cache normally. Multimodal inputs (base64 images, audio clips, file attachments) are included in the cache key hash. One caveat: very large multimodal payloads that get offloaded internally for processing aren’t eligible for caching. Standard-sized requests cache fine.
Response caching is separate from prompt caching. Prompt caching (which many providers offer natively) reduces the cost of the prompt portion when messages share a common prefix. Response caching skips the provider entirely and returns the full response from OpenRouter’s edge cache.
Reduces response times from seconds to milliseconds
Cached responses come back in 80-300ms, most of which is serialization and network. The cache lookup itself averages 4ms. For comparison, a typical uncached request to Gemini 2.5 Flash takes about 1.3 seconds, Kimi K2.6 takes 4.6 seconds, and GPT-5.5 takes 9.1 seconds. Cache hits are billed at zero: no prompt tokens, no completion tokens, no charge.
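As a rough back-of-the-envelope check, the figures quoted above can be turned into speedup ranges. The latencies are the ones stated in the post; the helper name `speedup_range` is made up for this sketch, and real numbers will vary with network conditions and response size.

```python
# Uncached latencies quoted in the post, in seconds.
uncached = {"Gemini 2.5 Flash": 1.3, "Kimi K2.6": 4.6, "GPT-5.5": 9.1}
# Cached responses are quoted at 80-300 ms end to end.
cached_fast, cached_slow = 0.080, 0.300


def speedup_range(uncached_s: float) -> tuple[float, float]:
    """Return (worst-case, best-case) speedup for a cache hit."""
    return uncached_s / cached_slow, uncached_s / cached_fast


for model, seconds in uncached.items():
    low, high = speedup_range(seconds)
    print(f"{model}: roughly {low:.0f}x to {high:.0f}x faster on a cache hit")
```

The slower the uncached model, the larger the gain, since the cached latency is dominated by serialization and network time rather than generation.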
Enable it with a request header or with presets
Add the X-OpenRouter-Cache: true header to each API call you want to be eligible:
curl https://openrouter.ai/api/v1/chat/completions \
  -H "Authorization: Bearer $OPENROUTER_API_KEY" \
  -H "Content-Type: application/json" \
  -H "X-OpenRouter-Cache: true" \
  -d '{
    "model": "google/gemini-2.5-flash",
    "messages": [{"role": "user", "content": "What is the meaning of life?"}]
  }'
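For readers calling the API from code rather than the shell, the same request could look roughly like this in Python using only the standard library. The function names `build_request` and `send` are hypothetical helpers, and the response parsing assumes the usual chat completions shape (`choices[0].message.content`).

```python
import json
import urllib.request

API_URL = "https://openrouter.ai/api/v1/chat/completions"


def build_request(prompt: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a chat completions request with caching on."""
    body = {
        "model": "google/gemini-2.5-flash",
        "messages": [{"role": "user", "content": prompt}],
    }
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
        "X-OpenRouter-Cache": "true",  # opt this call in to response caching
    }
    return urllib.request.Request(
        API_URL, data=json.dumps(body).encode("utf-8"), headers=headers, method="POST"
    )


def send(req: urllib.request.Request) -> tuple[str, str]:
    """Send the request; return (cache status header, assistant text)."""
    with urllib.request.urlopen(req) as resp:
        status = resp.headers.get("X-OpenRouter-Cache-Status", "")
        payload = json.load(resp)
    # Assumed response shape: standard chat completions JSON.
    return status, payload["choices"][0]["message"]["content"]
```

Calling `send(build_request("What is the meaning of life?", api_key))` twice in a row should, going by the behavior described in this post, report a miss on the first call and a hit on the second.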
Presets. To enable caching for every request that uses a specific preset, set cache_enabled: true in the preset config. Individual requests then need no header.
You can control how long responses stay cached with X-OpenRouter-Cache-TTL (1 second to 24 hours, default 5 minutes). Need a fresh response? Send X-OpenRouter-Cache-Clear: true to bust the cache for that specific request.
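A small helper can keep these headers consistent across a codebase. This is a sketch under two assumptions that the post does not confirm: that X-OpenRouter-Cache-TTL takes an integer number of seconds, and that the clear header is sent alongside X-OpenRouter-Cache: true. The function name `cache_headers` is hypothetical.

```python
MIN_TTL_SECONDS = 1
MAX_TTL_SECONDS = 24 * 60 * 60  # 24 hours


def cache_headers(ttl_seconds=None, clear=False):
    """Return the caching headers for one request.

    Assumption: X-OpenRouter-Cache-TTL takes an integer number of seconds.
    """
    headers = {"X-OpenRouter-Cache": "true"}
    if ttl_seconds is not None:
        if not MIN_TTL_SECONDS <= ttl_seconds <= MAX_TTL_SECONDS:
            raise ValueError("TTL must be between 1 second and 24 hours")
        headers["X-OpenRouter-Cache-TTL"] = str(int(ttl_seconds))
    if clear:
        # Ask for a fresh response instead of the cached one.
        headers["X-OpenRouter-Cache-Clear"] = "true"
    return headers


print(cache_headers(ttl_seconds=3600))
# {'X-OpenRouter-Cache': 'true', 'X-OpenRouter-Cache-TTL': '3600'}
print(cache_headers(clear=True))
# {'X-OpenRouter-Cache': 'true', 'X-OpenRouter-Cache-Clear': 'true'}
```

Validating the range on the client side surfaces an out-of-range TTL as a local error instead of leaving it to the server to reject or clamp.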
Response headers tell you what happened: X-OpenRouter-Cache-Status: HIT or MISS, plus X-OpenRouter-Cache-Age and X-OpenRouter-Cache-TTL so you can see exactly how the cache is performing.
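Client code can read these headers to log or assert on cache behavior. The sketch below assumes the Age and TTL values are integer seconds, which the post does not state. `CacheInfo` and `read_cache_info` are hypothetical names, and the sample header values are invented for the demonstration.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class CacheInfo:
    hit: bool
    age: Optional[int]  # assumed: seconds since the entry was stored
    ttl: Optional[int]  # assumed: entry lifetime in seconds


def _to_int(value: Optional[str]) -> Optional[int]:
    # Tolerate missing or non-numeric values rather than raising.
    try:
        return int(value) if value is not None else None
    except ValueError:
        return None


def read_cache_info(headers: Mapping[str, str]) -> CacheInfo:
    # Normalize names, since HTTP header names are case-insensitive.
    lowered = {name.lower(): value for name, value in headers.items()}
    status = lowered.get("x-openrouter-cache-status", "")
    return CacheInfo(
        hit=status.strip().upper() == "HIT",
        age=_to_int(lowered.get("x-openrouter-cache-age")),
        ttl=_to_int(lowered.get("x-openrouter-cache-ttl")),
    )


info = read_cache_info({
    "X-OpenRouter-Cache-Status": "HIT",
    "X-OpenRouter-Cache-Age": "42",
    "X-OpenRouter-Cache-TTL": "300",
})
print(info)  # CacheInfo(hit=True, age=42, ttl=300)
```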
Where it helps most
Agent retries. When an agent workflow fails partway through, you can retry from the top. Cached steps return instantly and for free, so you only pay for the new work.
Test suites. Run your LLM-backed tests repeatedly without burning tokens. After the first run populates the cache, subsequent runs are deterministic and free for as long as the cached entries have not expired.
Repeated context processing. If your app sends the same prompt to the same model (same system prompt, same user input, same parameters), only the first call costs anything.
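The agent retry case can be illustrated with a toy in-memory model. This is not OpenRouter's implementation: `ToyResponseCache` and `run_workflow` are invented for the sketch, the cache is keyed on the prompt alone, and expiry is ignored. It only shows why a retry from the top pays for new work and nothing else.

```python
class ToyResponseCache:
    """In-memory stand-in for a response cache; not OpenRouter's implementation."""

    def __init__(self):
        self.store = {}
        self.provider_calls = 0  # each one would be a billed request

    def complete(self, prompt: str) -> str:
        if prompt in self.store:
            return self.store[prompt]  # cache hit: no provider call
        self.provider_calls += 1
        self.store[prompt] = f"answer to {prompt!r}"  # placeholder model output
        return self.store[prompt]


def run_workflow(cache, steps, fail_at=None):
    """Run steps in order; optionally crash before a given step index."""
    results = []
    for index, prompt in enumerate(steps):
        if index == fail_at:
            raise RuntimeError(f"step {index} failed")
        results.append(cache.complete(prompt))
    return results


cache = ToyResponseCache()
steps = ["plan", "search", "summarize"]

try:
    run_workflow(cache, steps, fail_at=2)  # first attempt dies before step 2
except RuntimeError:
    pass
print(cache.provider_calls)  # 2

run_workflow(cache, steps)  # retry from the top
print(cache.provider_calls)  # 3: only "summarize" was new work
```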
Available now across most generation endpoints
The cache is scoped to your API key. Different keys (even under the same account) don’t share cache entries.
The feature works across /chat/completions, /responses, /messages, and /embeddings. Other endpoints — legacy /completions, /audio/speech (TTS), /audio/transcriptions (STT), /rerank, and video generation — are not yet supported. It’s currently in beta, and we’re watching how it performs before locking down the API surface.
Cache hits don’t count toward provider rate limits (since the request never reaches the provider), and they’re visible in your Activity log with a cache indicator for easy monitoring.
Full details in the docs.