OpenRouter: Announcements (RSS)

Featured · 59

Response Caching: Zero Cost for Identical Requests

2026-05-01 02:00 · 51 days ago · Brian Thomas

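Why it was selected

OpenRouter's new response caching makes identical requests free, a useful cost and latency saver for workloads that call the API repeatedly. Anyone who has used an API will see the value immediately.

AI summary

The new Response Caching header adds caching for API requests: an identical request gets the cached response back in a small fraction of the usual time and incurs no additional cost. By recognizing and reusing responses that were already generated, the feature makes repeated requests much more efficient to serve.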

Response Caching: Zero Cost for Identical Requests

Brian Thomas · 4/30/2026

You can now add X-OpenRouter-Cache: true to your chat completions, responses, messages, or embeddings requests to start caching identical calls. The first call hits the provider and gets billed normally. Every identical call after that returns the same response in a tiny fraction of the time, with zero tokens billed.

View the response caching docs

What it does

Response caching sits in front of the model provider. When you send a request with caching enabled, OpenRouter hashes the request body, model, API key, and streaming mode into a cache key. If an identical request was made before and hasn’t expired, the cached response comes back immediately. No provider call, no token consumption, no charge.
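To make the cache-key idea concrete, here is a minimal Python sketch. The post names the four inputs (request body, model, API key, streaming mode) but does not describe the hashing scheme, so SHA-256 over canonical JSON and the function name `cache_key` are assumptions for illustration, not OpenRouter's implementation.

```python
import hashlib
import json


def cache_key(body: dict, model: str, api_key: str, stream: bool) -> str:
    """Derive a deterministic key from the four inputs the post lists.

    Hypothetical scheme: SHA-256 over canonical JSON. The real hashing
    method is not documented in the post.
    """
    canonical = json.dumps(
        {"body": body, "model": model, "api_key": api_key, "stream": stream},
        sort_keys=True,         # in this sketch, key order does not affect the hash
        separators=(",", ":"),  # and neither does incidental whitespace
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


body = {"messages": [{"role": "user", "content": "What is the meaning of life?"}]}
k1 = cache_key(body, "google/gemini-2.5-flash", "key-A", stream=False)
k2 = cache_key(body, "google/gemini-2.5-flash", "key-A", stream=False)
k3 = cache_key(body, "google/gemini-2.5-flash", "key-A", stream=True)   # streaming mode differs
k4 = cache_key(body, "google/gemini-2.5-flash", "key-B", stream=False)  # API key differs
```

Here `k1` and `k2` match, while `k3` and `k4` each differ from `k1`. That mirrors the behavior the post describes: only a fully identical request, made with the same key and the same streaming mode, can reuse a cached response.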

Both streaming and non-streaming requests work. Cached streaming responses replay through the same pipeline, so your client code doesn’t need to change. Text, images, audio, documents, and tool calls all cache normally. Multimodal inputs (base64 images, audio clips, file attachments) are included in the cache key hash. One caveat: very large multimodal payloads that get offloaded internally for processing aren’t eligible for caching. Standard-sized requests cache fine.

Response caching is separate from prompt caching. Prompt caching (which many providers offer natively) reduces the cost of the prompt portion when messages share a common prefix. Response caching skips the provider entirely and returns the full response from OpenRouter’s edge cache.

Reduces response times from seconds to milliseconds

Cached responses come back in 80-300ms, most of which is serialization and network. The cache lookup itself averages 4ms. For comparison, a typical uncached request to Gemini 2.5 Flash takes about 1.3 seconds, Kimi K2.6 takes 4.6 seconds, and GPT-5.5 takes 9.1 seconds. Cache hits are billed at zero: no prompt tokens, no completion tokens, no charge.
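As a rough worked example of what these figures mean for a workload, the sketch below blends cached and uncached latency by hit rate. The latencies come from the numbers above; the 50% hit rate and the function name `expected_latency_s` are made-up assumptions, and real hit rates depend entirely on how often your requests repeat within the TTL.

```python
def expected_latency_s(hit_rate: float, uncached_s: float, cached_s: float) -> float:
    """Average latency when a fraction `hit_rate` of requests are cache hits."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * cached_s + (1.0 - hit_rate) * uncached_s


# ~1.3 s uncached (Gemini 2.5 Flash) and 0.3 s cached (upper end of 80-300 ms).
# The 50% hit rate is a hypothetical workload, not a measured value.
blended = expected_latency_s(hit_rate=0.5, uncached_s=1.3, cached_s=0.3)

# With no repeats at all, caching changes nothing.
all_miss = expected_latency_s(hit_rate=0.0, uncached_s=1.3, cached_s=0.3)
```

Under these assumptions the average drops from 1.3 s to about 0.8 s. Since hits are billed at zero, the billed share of requests is simply `1 - hit_rate`.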

Enable it with a request header or with presets

Add the X-OpenRouter-Cache: true header to each API call you want to be eligible:

curl https://openrouter.ai/api/v1/chat/completions \
  -H "Authorization: Bearer $OPENROUTER_API_KEY" \
  -H "Content-Type: application/json" \
  -H "X-OpenRouter-Cache: true" \
  -d '{
    "model": "google/gemini-2.5-flash",
    "messages": [{"role": "user", "content": "What is the meaning of life?"}]
  }'
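
For reference, the same request can be built from Python using only the standard library. This is a sketch, not an official client: the helper name `build_request` and the placeholder key are illustrative, while the URL, header, and body mirror the curl call above.

```python
import json
import urllib.request

API_URL = "https://openrouter.ai/api/v1/chat/completions"


def build_request(api_key: str, model: str, messages: list) -> urllib.request.Request:
    """Build (but do not send) a chat completion request with caching enabled."""
    payload = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=payload,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
            "X-OpenRouter-Cache": "true",  # opt this call in to response caching
        },
    )


req = build_request(
    "sk-placeholder",  # hypothetical key, replace with your own
    "google/gemini-2.5-flash",
    [{"role": "user", "content": "What is the meaning of life?"}],
)

# To actually send it (network call, needs a real key):
# with urllib.request.urlopen(req) as resp:
#     print(resp.headers.get("X-OpenRouter-Cache-Status"))
#     print(json.load(resp))
```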

Presets. Enable caching for all requests using a specific preset by setting cache_enabled: true in the preset config. No header needed on individual requests.

You can control how long responses stay cached with X-OpenRouter-Cache-TTL (1 second to 24 hours, default 5 minutes). Need a fresh response? Send X-OpenRouter-Cache-Clear: true to bust the cache for that specific request.
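
A small helper can keep these headers consistent across calls. The sketch below assumes the TTL header value is expressed in whole seconds and that the clear header is sent alongside X-OpenRouter-Cache: true; the post states the allowed range (1 second to 24 hours) but not the value format, so check the docs before relying on it. The name `cache_headers` is illustrative.

```python
from typing import Optional

MIN_TTL_S = 1
MAX_TTL_S = 24 * 60 * 60  # 24 hours


def cache_headers(ttl_seconds: Optional[int] = None, clear: bool = False) -> dict:
    """Return the request headers that control response caching."""
    headers = {"X-OpenRouter-Cache": "true"}
    if ttl_seconds is not None:
        if not MIN_TTL_S <= ttl_seconds <= MAX_TTL_S:
            raise ValueError(f"TTL must be between {MIN_TTL_S} and {MAX_TTL_S} seconds")
        # Assumption: the header takes an integer number of seconds.
        headers["X-OpenRouter-Cache-TTL"] = str(ttl_seconds)
    if clear:
        # Bust the cached entry for this specific request.
        headers["X-OpenRouter-Cache-Clear"] = "true"
    return headers


default_headers = cache_headers()                 # default 5-minute TTL applies
hour_headers = cache_headers(ttl_seconds=3600)    # keep the response for an hour
refresh_headers = cache_headers(clear=True)       # force a fresh response
```

Validating the range on the client side is a design choice in this sketch; it surfaces an out-of-range TTL before a request is sent.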

Response headers tell you what happened: X-OpenRouter-Cache-Status: HIT or MISS, plus X-OpenRouter-Cache-Age and X-OpenRouter-Cache-TTL so you can see exactly how the cache is performing.
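
On the client side, these headers can be read into a small structure for logging or metrics. This is a sketch: the header names come from the post, but treating Age and TTL as integer seconds is an assumption, and `CacheInfo` and `parse_cache_headers` are illustrative names.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class CacheInfo:
    status: Optional[str]  # "HIT", "MISS", or None if the header is absent
    age: Optional[int]     # assumed to be seconds
    ttl: Optional[int]     # assumed to be seconds

    @property
    def hit(self) -> bool:
        return self.status == "HIT"


def _to_int(value: Optional[str]) -> Optional[int]:
    try:
        return int(value)
    except (TypeError, ValueError):
        return None  # header missing or not an integer


def parse_cache_headers(headers: Mapping[str, str]) -> CacheInfo:
    """Extract cache metadata; header names are matched case-insensitively."""
    lowered = {k.lower(): v for k, v in headers.items()}
    status = lowered.get("x-openrouter-cache-status")
    return CacheInfo(
        status=status.upper() if status else None,
        age=_to_int(lowered.get("x-openrouter-cache-age")),
        ttl=_to_int(lowered.get("x-openrouter-cache-ttl")),
    )


# Hypothetical header values, for illustration only.
info = parse_cache_headers({
    "X-OpenRouter-Cache-Status": "HIT",
    "X-OpenRouter-Cache-Age": "42",
    "X-OpenRouter-Cache-TTL": "300",
})
```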

Where it helps most

Agent retries. When an agent workflow fails partway through, you can retry from the top. Cached steps return instantly and for free, so you only pay for the new work.

Test suites. Run your LLM-backed tests repeatedly without burning tokens. After the first run populates the cache, subsequent runs are deterministic and free.

Repeated context processing. If your app sends the same prompt to the same model (same system prompt, same user input, same parameters), only the first call costs anything.
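
The agent-retry case can be illustrated with a purely local simulation. Nothing below calls OpenRouter: `SimulatedCachedProvider` is a hypothetical in-memory stand-in that counts billed calls, ignores TTL expiry, and exists only to show why a retry from the top pays for new work alone.

```python
class SimulatedCachedProvider:
    """In-memory stand-in for a provider behind a response cache."""

    def __init__(self):
        self._cache = {}
        self.billed_calls = 0

    def complete(self, model: str, prompt: str) -> str:
        key = (model, prompt)
        if key in self._cache:
            return self._cache[key]   # cache hit: nothing billed
        self.billed_calls += 1        # cache miss: the provider call is billed
        response = f"response to {prompt!r}"
        self._cache[key] = response
        return response


def run_workflow(client, steps, fail_before=None):
    """Run steps in order; optionally simulate a crash before one step."""
    results = []
    for step in steps:
        if step == fail_before:
            raise RuntimeError(f"workflow failed before {step!r}")
        results.append(client.complete("example/model", step))
    return results


client = SimulatedCachedProvider()
steps = ["plan", "search", "summarize"]

try:
    run_workflow(client, steps, fail_before="summarize")  # first attempt dies at step 3
except RuntimeError:
    pass
billed_after_failure = client.billed_calls  # two steps reached the provider

results = run_workflow(client, steps)       # retry from the top
billed_total = client.billed_calls          # only "summarize" was newly billed
```

In this simulation the retry adds one billed call instead of three, because the first two steps are served from the cache.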

Available now across most generation endpoints

The cache is scoped to your API key. Different keys (even under the same account) don’t share cache entries.

The feature works across /chat/completions, /responses, /messages, and /embeddings. Other endpoints — legacy /completions, /audio/speech (TTS), /audio/transcriptions (STT), /rerank, and video generation — are not yet supported. It’s currently in beta, and we’re watching how it performs before locking down the API surface.

Cache hits don’t count toward provider rate limits (since the request never reaches the provider), and they’re visible in your Activity log with a cache indicator for easy monitoring.

Full details in the docs.
