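New Audio APIs for Speech and Transcription
Original post: openrouter.ai. OpenRouter has added speech synthesis and transcription, so developers building voice applications have fewer APIs to integrate. It is a typical update from a platform with convenience written into its DNA.
OpenRouter has officially launched text-to-speech and audio transcription. Through two new API endpoints, the platform integrates speech synthesis and audio transcription services from multiple providers. Users can now call a single API to reach speech generation and speech-to-text from several providers, without integrating each vendor separately. This simplifies development and offers a more efficient one-stop way to add voice interaction and content transcription to applications.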
New Audio APIs for Speech and Transcription
Jacky Liang · 5/1/2026

OpenRouter now has two dedicated audio endpoints: /api/v1/audio/speech for text-to-speech and /api/v1/audio/transcriptions for speech-to-text.
These endpoints serve specialized models that are generally faster and more cost-efficient than the general-purpose audio models we already support, in exchange for each being limited to a single audio task.
You can now generate speech from text with OpenAI, Google, or Mistral voices and transcribe audio files with OpenAI Whisper, all with the same routing, billing, and key management you already use for text, video, and image generation.
Choosing a model: Audio vs. Speech vs. Transcription
The choice of models is a balance of specialization, cost, and speed. We’ve enabled access to the breadth of options so you can choose the right path for each use case:
| | Audio models | Speech models | Transcription models |
|---|---|---|---|
| What it does | Understands audio input and reasons over it, like a voice-native LLM | Converts text into lifelike spoken audio | Converts audio into text |
| Input → Output | Text/audio → text/audio | Text → audio | Audio → text |
| Best for | Voice agents, mixed-modality conversations, audio Q&A | Reading text aloud with built-in voices and streaming | Meeting notes, subtitles, feeding voice input into text pipelines |
| Endpoint | /chat/completions | /audio/speech | /audio/transcriptions |
| Trade-offs | More powerful but heavier and more expensive | Simpler, faster, cheaper (no reasoning needed) | Purpose-built for accuracy across languages and accents |
| Browse models | Audio models | Speech models | Transcription models |
| Docs | Audio output guide | Speech docs | Transcription docs |
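The "Endpoint" row of the table amounts to a lookup from task to URL. A minimal sketch in Python, assuming the base URL `https://openrouter.ai/api/v1` that appears in the endpoint paths in this post; the `Task` enum and `endpoint_for` helper are illustrative names, not part of any OpenRouter SDK:

```python
from enum import Enum

BASE_URL = "https://openrouter.ai/api/v1"  # base of the endpoint paths named in this post


class Task(Enum):
    """Hypothetical task labels mirroring the three table columns."""
    AUDIO_REASONING = "audio"        # text/audio -> text/audio
    SPEECH = "speech"                # text -> audio
    TRANSCRIPTION = "transcription"  # audio -> text


# Endpoint paths as listed in the "Endpoint" row of the table.
_ENDPOINTS = {
    Task.AUDIO_REASONING: "/chat/completions",
    Task.SPEECH: "/audio/speech",
    Task.TRANSCRIPTION: "/audio/transcriptions",
}


def endpoint_for(task: Task) -> str:
    """Return the full URL for the given task."""
    return BASE_URL + _ENDPOINTS[task]
```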
Try it in the Playground
Both Speech and Transcription have dedicated Playground tabs on model pages (here’s GPT-4o Mini TTS’s Playground and GPT-4o Transcribe’s Playground as examples). For speech models, pick a voice from the dropdown, type your text, and hear the result. For transcription models, drag and drop an audio file and see the transcription.
Each model page also shows quickstart code in Python, TypeScript, curl, and the OpenRouter SDK, so you can copy a working example and have audio running in your app in minutes.
Getting started with Speech models
Send text, get audio back. The response is a raw byte stream you can pipe straight to a file or audio player.
```bash
curl https://openrouter.ai/api/v1/audio/speech \
  -H "Authorization: Bearer $OPENROUTER_API_KEY" \
  -H "Content-Type: application/json" \
  --output output.mp3 \
  -d '{
    "model": "openai/gpt-4o-mini-tts-2025-12-15",
    "input": "Hello from OpenRouter.",
    "voice": "alloy",
    "response_format": "mp3"
  }'
```
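The same request can be sketched in Python using only the standard library. It mirrors the curl call above field for field; the helper names `build_speech_request` and `synthesize_to_file` are illustrative, and error handling is omitted for brevity.

```python
import json
import os
import urllib.request

SPEECH_URL = "https://openrouter.ai/api/v1/audio/speech"


def build_speech_request(text, voice="alloy",
                         model="openai/gpt-4o-mini-tts-2025-12-15",
                         response_format="mp3", api_key=None):
    """Build (but do not send) a POST request matching the curl example."""
    api_key = api_key or os.environ.get("OPENROUTER_API_KEY", "")
    payload = {
        "model": model,
        "input": text,
        "voice": voice,
        "response_format": response_format,
    }
    return urllib.request.Request(
        SPEECH_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def synthesize_to_file(text, path="output.mp3", **kwargs):
    """Send the request and write the raw audio bytes to `path`."""
    request = build_speech_request(text, **kwargs)
    with urllib.request.urlopen(request) as response, open(path, "wb") as out:
        # The body is a raw byte stream, so copy it in chunks.
        while chunk := response.read(8192):
            out.write(chunk)
    return path
```

Calling `synthesize_to_file("Hello from OpenRouter.")` with `OPENROUTER_API_KEY` set would write `output.mp3`, the same file the curl command produces. Splitting request construction from sending keeps the payload easy to inspect without touching the network.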
Speech providers currently include OpenAI (GPT-4o Mini TTS), Google (Gemini Flash TTS), and Mistral (Voxtral Mini TTS). Each model brings its own voice set, and you can browse available voices on each model’s page. Output comes in MP3 or PCM format.
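Raw PCM carries no header, so most players need it wrapped in a container before they can open it. A minimal sketch using Python's standard `wave` module follows. The sample rate, sample width, and channel count used as defaults (24 kHz, 16-bit, mono) are assumptions for illustration only; the post does not state them, so check the model's page for the actual PCM parameters.

```python
import wave


def pcm_to_wav(pcm_bytes, wav_path, sample_rate=24000, sample_width=2, channels=1):
    """Wrap headerless PCM samples in a WAV container.

    The defaults are assumed values, not documented ones.
    """
    with wave.open(wav_path, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(sample_width)  # bytes per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm_bytes)
    return wav_path
```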
Provider-specific options pass through cleanly. For example, OpenAI’s speech models accept an instructions field for tone control (e.g., “speak in a warm, friendly tone”).
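As a sketch of how such a pass-through option might be attached, the helper below adds `instructions` to the request body only when it is set. The field name comes from the paragraph above; placing it at the top level of the JSON body is an assumption, the helper name `speech_payload` is illustrative, and how other providers treat the field is not covered by the post.

```python
def speech_payload(model, text, voice, instructions=None, response_format="mp3"):
    """Build the JSON body for /audio/speech, adding tone instructions if given."""
    payload = {
        "model": model,
        "input": text,
        "voice": voice,
        "response_format": response_format,
    }
    if instructions:
        # Provider-specific field; the post mentions it for OpenAI speech models.
        # ASSUMPTION: it sits at the top level of the request body.
        payload["instructions"] = instructions
    return payload
```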
Getting started with Transcription models
The transcription endpoint takes a base64-encoded audio file and returns text. It supports WAV, MP3, FLAC, and other common formats.
```bash
AUDIO_BASE64=$(base64 < recording.wav | tr -d '\n')
curl https://openrouter.ai/api/v1/audio/transcriptions \
  -H "Authorization: Bearer $OPENROUTER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/whisper-large-v3",
    "input_audio": {
      "data": "'"$AUDIO_BASE64"'",
      "format": "wav"
    }
  }'
```
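A rough Python equivalent of the command above, again using only the standard library. Two things here are assumptions rather than documented behavior: that the `format` value can be inferred from the file extension, and that the response body is JSON (the post says the endpoint returns text but does not show the response shape). The helper names are illustrative.

```python
import base64
import json
import os
import urllib.request
from pathlib import Path

TRANSCRIBE_URL = "https://openrouter.ai/api/v1/audio/transcriptions"


def build_transcription_payload(audio_path, model="openai/whisper-large-v3"):
    """Base64-encode the file and infer `format` from its extension."""
    path = Path(audio_path)
    encoded = base64.b64encode(path.read_bytes()).decode("ascii")
    return {
        "model": model,
        "input_audio": {
            "data": encoded,
            # ASSUMPTION: the format string matches the extension ("wav", "mp3", "flac").
            "format": path.suffix.lstrip(".").lower(),
        },
    }


def transcribe(audio_path, api_key=None, **kwargs):
    """POST the payload and return the decoded response without interpreting it."""
    api_key = api_key or os.environ["OPENROUTER_API_KEY"]
    body = json.dumps(build_transcription_payload(audio_path, **kwargs))
    request = urllib.request.Request(
        TRANSCRIBE_URL,
        data=body.encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        # ASSUMPTION: the response is JSON; its fields are not shown in the post.
        return json.load(response)
```

`base64.b64encode` emits a single line with no newlines, which is the same thing the `tr -d '\n'` step guarantees in the shell version.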
Transcription providers currently include OpenAI (Whisper, GPT-4o Transcribe, GPT-4o Mini Transcribe), Google (Chirp 3), and Groq (with their fast Whisper inference). You can optionally pass a language hint to improve accuracy for non-English audio.
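The post does not name the field that carries the language hint. The sketch below assumes a top-level `language` key holding an ISO 639-1 code, as in OpenAI's own transcription API; treat both the key name and its placement as unverified and confirm them in the Transcription docs. The helper name `with_language_hint` is illustrative.

```python
def with_language_hint(payload, language=None):
    """Return a copy of a transcription payload with an optional language hint.

    ASSUMPTION: the hint is a top-level "language" key (e.g. "de", "ja").
    """
    updated = dict(payload)  # shallow copy so the caller's dict is untouched
    if language:
        updated["language"] = language
    return updated
```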
What’s next
We’re actively adding more providers and voices. If there’s a speech or transcription model you want to see on OpenRouter, tell us on Discord.