TokenMix Research Lab · 2026-04-25

gpt-4o-mini-tts: The Cheapest TTS API in 2026 ($0.015/Min, 13 Voices)
Last Updated: 2026-04-25
Author: TokenMix Research Lab
OpenAI's gpt-4o-mini-tts is the cost-optimized text-to-speech model that pairs with gpt-4o-mini-transcribe and gpt-4o-transcribe to complete OpenAI's audio stack. Released in March 2025, it delivers natural-sounding speech at $0.60 per million text-input tokens and $12 per million audio-output tokens, which works out to roughly $0.015 per minute of generated audio. With 13+ distinct voices, support for 50+ languages, and tone and emotion that can be steered via prompts, it is OpenAI's answer to ElevenLabs at a much lower price point. This guide covers the pricing math, voice selection, streaming capabilities, production gotchas, and when to pick it over alternatives. All data was verified against OpenAI's April 2026 documentation.
Table of Contents
- What gpt-4o-mini-tts Is
- Pricing Explained (Per-Minute vs Token)
- Voice Selection: 13 Voices Covered
- Steerable Speech via Prompts
- Supported LLM Providers and Model Routing
- When to Use It vs ElevenLabs vs Alternatives
- Language Support
- Streaming and Latency
- Production Gotchas
- Quick Usage
- FAQ
What gpt-4o-mini-tts Is
Announced March 2025, gpt-4o-mini-tts is OpenAI's cost-efficient text-to-speech model. Built on the GPT-4o multimodal foundation, it converts text input into natural-sounding audio with:
- 13+ distinct voices
- 50+ languages
- Tone and emotion control via prompt engineering
- Both synchronous and streaming modes
- ~$0.015 per minute of generated audio
Key attributes:
| Attribute | Value |
|---|---|
| Creator | OpenAI |
| Released | March 2025 |
| Endpoint | /v1/audio/speech |
| Context window | 2,000 input tokens |
| Voices available | 13+ (Alloy, Ash, Ballad, Coral, Echo, Fable, Nova, Onyx, Sage, Shimmer, Verse, Marin, Cedar) |
| Languages | 50+ |
| Text input price | $0.60 / MTok |
| Audio output price | $12 / MTok |
| Per-minute equivalent | ~$0.015 / min generated |
| Streaming | Yes |
Pricing Explained (Per-Minute vs Token)
OpenAI prices gpt-4o-mini-tts on a token-based model:
- Input text: $0.60 per million tokens
- Output audio: $12 per million audio tokens
- Effective cost: ~$0.015 per minute of generated speech
Practical monthly cost examples:
| Workload | Generated audio/month | Monthly cost |
|---|---|---|
| Personal podcasting (2 hrs/wk) | 8 hours | $7.20 |
| Audiobook narration | 100 hours | $90.00 |
| Voicemail/IVR responses | 500 hours | $450.00 |
| Video course audio generation | 50 hours | $45.00 |
| Customer support voice bot | 2,000 hours | $1,800.00 |
Understanding the token-based pricing:
Audio tokens represent compressed audio. For gpt-4o-mini-tts, the approximate conversion is:
- 1 minute of speech ≈ 1,500 audio tokens
- 1,500 audio tokens × $12/MTok ≈ $0.018 per minute (slightly above the ~$0.015 "average" because token count varies with speech density)
Key insight on pricing: cost is dominated by audio length (output), not input text length. Input text is billed as well, but at $0.60/MTok it is a small fraction of the total. A 100-word sentence spoken slowly costs more than the same 100 words spoken quickly.
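The token arithmetic can be sketched in a few lines. This is a rough estimator rather than an official calculator: the prices and the ~1,500 audio tokens per minute figure are the ones quoted in this article, and the helper name `estimate_tts_cost` is only an illustrative choice.

```python
# Rough cost estimator using the figures quoted in this article.
INPUT_USD_PER_MTOK = 0.60        # text input price
AUDIO_USD_PER_MTOK = 12.00       # audio output price
AUDIO_TOKENS_PER_MINUTE = 1_500  # approximation; varies with speech density


def estimate_tts_cost(
    input_tokens: int,
    audio_minutes: float,
    audio_tokens_per_minute: float = AUDIO_TOKENS_PER_MINUTE,
) -> float:
    """Return the estimated USD cost of one request (hypothetical helper)."""
    input_cost = input_tokens / 1_000_000 * INPUT_USD_PER_MTOK
    audio_tokens = audio_minutes * audio_tokens_per_minute
    audio_cost = audio_tokens / 1_000_000 * AUDIO_USD_PER_MTOK
    return input_cost + audio_cost


if __name__ == "__main__":
    # One minute of audio generated from about 200 input tokens.
    print(f"${estimate_tts_cost(200, 1.0):.5f}")
```

Passing a lower `audio_tokens_per_minute` (for example 1,250) models faster, denser speech and brings the result closer to the $0.015 per minute average.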
Voice Selection: 13 Voices Covered
OpenAI offers a range of voice personas:
| Voice | Style | Best for |
|---|---|---|
| Alloy | Neutral, clear | General purpose |
| Ash | Warm male | Narration |
| Ballad | Smooth, storytelling | Audiobooks |
| Coral | Energetic female | Marketing |
| Echo | Calm, measured | Tutorials |
| Fable | Narrative storyteller | Fiction |
| Nova | Younger, energetic | Consumer apps |
| Onyx | Deep, authoritative | News, serious content |
| Sage | Wise, reflective | Documentaries |
| Shimmer | Bright, expressive | Marketing, upbeat |
| Verse | Creative, poetic | Art / creative content |
| Marin | (newer addition) | Varies |
| Cedar | (newer addition) | Varies |
For production work, test multiple voices on your actual content — voice fit depends heavily on your material's tone.
Steerable Speech via Prompts
Unlike traditional TTS models where voice parameters are fixed, gpt-4o-mini-tts accepts natural-language guidance:
response = client.audio.speech.create(
    model="gpt-4o-mini-tts",
    voice="alloy",
    input="Breaking news: the election results are in.",
    instructions="Speak with urgency and a serious news-anchor tone. Slightly faster pace than normal.",
)
The instructions parameter accepts guidance on:
- Emotion (excited, sad, calm, urgent)
- Pace (faster, slower, measured)
- Pronunciation emphasis
- Accent (regional within a language)
- Character-specific direction
Example steering for different contexts:
# Audiobook narration
instructions="Slow, deliberate pacing. Thoughtful pauses between sentences. Gentle storytelling tone."
# Marketing ad
instructions="Upbeat and energetic. Faster pace. Enthusiastic but not shouty."
# Meditation app
instructions="Very slow, soothing. Long pauses between phrases. Calming and soft."
This steerability is a real differentiator vs ElevenLabs and Google TTS, where voice characteristics are more fixed.
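One way to keep steering consistent across a codebase is to store instruction strings as named presets. The sketch below is only an illustration: the preset names and the `build_speech_request` helper are hypothetical, and the instruction strings are the three examples shown earlier in this section.

```python
from typing import Optional

# Hypothetical preset names mapped to the example instruction strings.
INSTRUCTION_PRESETS = {
    "audiobook": "Slow, deliberate pacing. Thoughtful pauses between sentences. Gentle storytelling tone.",
    "marketing": "Upbeat and energetic. Faster pace. Enthusiastic but not shouty.",
    "meditation": "Very slow, soothing. Long pauses between phrases. Calming and soft.",
}


def build_speech_request(text: str, voice: str = "alloy", preset: Optional[str] = None) -> dict:
    """Build keyword arguments for client.audio.speech.create()."""
    request = {"model": "gpt-4o-mini-tts", "voice": voice, "input": text}
    if preset is not None:
        # An unknown preset raises KeyError instead of silently sending no instructions.
        request["instructions"] = INSTRUCTION_PRESETS[preset]
    return request
```

The resulting dictionary can be unpacked into the SDK call, for example `client.audio.speech.create(**build_speech_request("Breathe in.", voice="sage", preset="meditation"))`.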
Supported LLM Providers and Model Routing
gpt-4o-mini-tts is accessible via:
- OpenAI direct (api.openai.com/v1/audio/speech)
- Azure OpenAI: same model, enterprise deployment
- OpenAI-compatible aggregators: TokenMix.ai, OpenRouter, and similar
Through TokenMix.ai, you get OpenAI-compatible access to gpt-4o-mini-tts, gpt-4o-transcribe, and gpt-4o-mini-transcribe alongside Anthropic, Google, and 300+ other models through one API key. For teams building apps that combine text generation (LLM), transcription (speech-to-text), and TTS (text-to-speech), this unified access removes cross-provider billing complexity. For example, a customer service voice bot that uses GPT-5.5 for reasoning and gpt-4o-mini-tts for spoken responses needs only a single API key.
Basic usage:
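from openai import OpenAI
# Point the OpenAI SDK at the aggregator's OpenAI-compatible endpoint.
client = OpenAI(
    api_key="your-tokenmix-key",
    base_url="https://api.tokenmix.ai/v1",
)
response = client.audio.speech.create(
    model="gpt-4o-mini-tts",
    voice="alloy",
    input="Hello, this is generated speech.",
)
# write_to_file saves the fully downloaded audio; stream_to_file on a
# non-streaming response is deprecated in the openai Python SDK.
response.write_to_file("output.mp3")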
When to Use It vs ElevenLabs vs Alternatives
Competitive landscape:
| Provider | Pricing | Strength |
|---|---|---|
| gpt-4o-mini-tts | $0.015/min | Cheapest quality TTS |
| ElevenLabs Turbo | ~$0.03/min | Voice cloning, emotion |
| ElevenLabs Multilingual | ~$0.05/min | 29 languages, highest quality |
| Google Cloud TTS (WaveNet) | ~$0.016/min | Google ecosystem |
| Azure TTS Neural | ~$0.016/min | Microsoft ecosystem |
| Play.ht | ~$0.03/min | Voice cloning |
| Deepgram Aura | ~$0.015/min | Ultra-low latency |
| Coqui (open-source) | $0 + infra | Self-hosted |
When to pick gpt-4o-mini-tts:
- You're already in the OpenAI ecosystem
- Cost is the primary constraint
- You need steerable voice via prompts
- You want unified audio + LLM billing
When to pick ElevenLabs:
- Quality is paramount (still slightly ahead on natural expressiveness)
- You need voice cloning
- You're building a consumer app where voice quality is a differentiator
When to pick Deepgram Aura:
- Latency-critical real-time applications
- You need the fastest possible TTS for interactive agents
When to self-host Coqui:
- Strict data privacy
- High volume (>1M minutes/month) making API costs prohibitive
- Team has ML infrastructure capacity
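To see how the per-minute prices in the comparison table translate into a monthly bill, a small calculation helps. This sketch uses the approximate prices from the table as-is; Coqui is left out because its cost depends entirely on your own infrastructure, and the `monthly_costs` helper is a hypothetical name.

```python
# Approximate per-minute prices (USD) as listed in the comparison table.
PRICE_PER_MINUTE = {
    "gpt-4o-mini-tts": 0.015,
    "ElevenLabs Turbo": 0.03,
    "ElevenLabs Multilingual": 0.05,
    "Google Cloud TTS (WaveNet)": 0.016,
    "Azure TTS Neural": 0.016,
    "Play.ht": 0.03,
    "Deepgram Aura": 0.015,
}


def monthly_costs(minutes_per_month: float) -> list[tuple[str, float]]:
    """Return (provider, monthly USD cost) pairs, cheapest first."""
    costs = [
        (name, round(price * minutes_per_month, 2))
        for name, price in PRICE_PER_MINUTE.items()
    ]
    return sorted(costs, key=lambda pair: pair[1])


if __name__ == "__main__":
    # 100 hours of generated audio per month.
    for name, cost in monthly_costs(100 * 60):
        print(f"{name:<28} ${cost:,.2f}")
```

At low volumes the differences are small in absolute terms, so quality, latency, or voice cloning may reasonably outweigh price; the gap only becomes significant at hundreds of hours per month.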
Language Support
50+ languages supported, including:
- All major European languages (English, Spanish, French, German, Italian, Portuguese, Dutch, Polish)
- East Asian (Chinese/Mandarin, Japanese, Korean)
- South Asian (Hindi, Bengali, Tamil)
- Middle Eastern (Arabic, Hebrew, Turkish)
- Southeast Asian (Vietnamese, Thai, Indonesian, Malay)
Quality varies — English and major European languages are strongest. Lower-resource languages may sound less natural.
Voice-language pairing: not every voice sounds equally good in every language. Test voice + language combinations on your specific content.
Streaming and Latency
gpt-4o-mini-tts supports streaming — audio plays as it's generated rather than waiting for full completion.
with client.audio.speech.with_streaming_response.create(
    model="gpt-4o-mini-tts",
    voice="alloy",
    input="This will stream as it generates.",
) as response:
    response.stream_to_file("output.mp3")
Latency observations:
- First audio chunk: typically ~300-600ms
- Complete generation (10-second sentence): ~1-2 seconds
- Streaming is meaningfully faster perceptually for interactive applications
For real-time voice agents (customer support, conversational AI), streaming is essential.
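Latency figures like these depend on network, region, and input length, so it is worth measuring them in your own environment. The following is a minimal timing sketch, not a benchmark harness: `measure_stream` and `StreamTiming` are hypothetical names, and the function accepts any callable that yields byte chunks so it can be tried without network access.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class StreamTiming:
    first_chunk_s: Optional[float]  # None if no non-empty chunk arrived
    total_s: float
    total_bytes: int


def measure_stream(open_chunks: Callable[[], Iterable[bytes]]) -> StreamTiming:
    """Time a chunked audio stream from request start to first and last chunk."""
    start = time.perf_counter()
    first_chunk_s = None
    total_bytes = 0
    # open_chunks() is called inside the timed region so request setup is included.
    for chunk in open_chunks():
        if first_chunk_s is None and chunk:
            first_chunk_s = time.perf_counter() - start
        total_bytes += len(chunk)
    return StreamTiming(first_chunk_s, time.perf_counter() - start, total_bytes)


# Sketch of wiring this to the OpenAI SDK (needs network access and an API key):
#
# def openai_chunks():
#     with client.audio.speech.with_streaming_response.create(
#         model="gpt-4o-mini-tts", voice="alloy", input="Latency probe.",
#     ) as response:
#         yield from response.iter_bytes()
#
# print(measure_stream(openai_chunks))
```

Running the probe several times and looking at the spread, rather than a single number, gives a more realistic picture for interactive use.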
Production Gotchas
1. 2,000-token input limit. Longer text must be chunked. 2,000 tokens is roughly 1,500 English words, so keep each chunk comfortably below that.
2. Variable output cost. Audio token count varies with speech density. Slow, measured speech costs more per input word than fast speech. Budget accordingly.
3. Voice consistency across chunks. When chunking long text, pass the same voice and the same instructions in every request. Each request is generated independently, so delivery may still vary slightly from chunk to chunk.
4. Pronunciation edge cases. Rare proper nouns, technical terminology, or intentional mispronunciations may need phonetic spelling in input.
5. SSML not supported (yet). Other TTS services use SSML (Speech Synthesis Markup Language) for fine control. gpt-4o-mini-tts uses natural-language instructions instead. Migration from SSML-based workflows requires rewriting.
6. No emotion tags in text. Some services let you mark specific words with emotion. gpt-4o-mini-tts applies instructions to the whole input.
7. MP3 is the default output format. Opus, AAC, FLAC, WAV, and PCM are also available via the response_format parameter.
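For the input limit in gotcha 1, splitting on sentence boundaries usually sounds better than cutting mid-sentence. Below is a minimal sketch under stated assumptions: the 1,000-word default is a conservative budget chosen here (not an OpenAI figure), word count is used as a stand-in for tokens, and the sentence splitter is a naive regex that will misfire on abbreviations such as "Dr.".

```python
import re


def chunk_text(text: str, max_words: int = 1_000) -> list[str]:
    """Split text into chunks of at most max_words words, preferring sentence boundaries."""
    # Split after ., !, or ? followed by whitespace (naive sentence detection).
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current: list[str] = []
    for sentence in sentences:
        words = sentence.split()
        # A single sentence longer than the budget is hard-split by word count.
        while len(words) > max_words:
            if current:
                chunks.append(" ".join(current))
                current = []
            chunks.append(" ".join(words[:max_words]))
            words = words[max_words:]
        # Start a new chunk when this sentence would overflow the current one.
        if len(current) + len(words) > max_words:
            chunks.append(" ".join(current))
            current = []
        current.extend(words)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Each returned chunk can then be sent as the `input` of its own request, with the same voice and instructions, and the resulting audio files concatenated afterwards.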
Quick Usage
Basic text-to-speech:
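from openai import OpenAI
client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.audio.speech.create(
    model="gpt-4o-mini-tts",
    voice="alloy",
    input="Hello world!",
)
# stream_to_file on a non-streaming response is deprecated in the SDK.
response.write_to_file("hello.mp3")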
With steering:
response = client.audio.speech.create(
    model="gpt-4o-mini-tts",
    voice="nova",
    input="Welcome to our app!",
    instructions="Warm, inviting, like a friendly receptionist.",
)
Streaming for real-time playback:
# audio_player is a placeholder for your own playback layer; it is not part of the SDK.
with client.audio.speech.with_streaming_response.create(
    model="gpt-4o-mini-tts",
    voice="onyx",
    input="Streaming text-to-speech output...",
) as response:
    for chunk in response.iter_bytes():
        audio_player.play(chunk)
Batch for audiobook generation:
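# audiobook_chapters is assumed to be a list of chapter strings defined elsewhere.
MAX_WORDS = 1_000  # conservative budget; the limit is 2,000 tokens (~1,500 English words)
def split_words(text, max_words=MAX_WORDS):
    # Naive word-count chunking; splitting on sentence boundaries sounds better.
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]
for chapter_num, chapter_text in enumerate(audiobook_chapters):
    # Send every chunk of the chapter instead of truncating it to the first slice.
    for part_num, part in enumerate(split_words(chapter_text)):
        response = client.audio.speech.create(
            model="gpt-4o-mini-tts",
            voice="ballad",
            input=part,
            instructions="Measured narration, suitable for long-form audiobook.",
        )
        response.write_to_file(f"chapter_{chapter_num}_part_{part_num}.mp3")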
FAQ
How does gpt-4o-mini-tts compare to the older tts-1 and tts-1-hd?
gpt-4o-mini-tts replaces tts-1 as the cheap tier. tts-1-hd remains for higher-quality needs, but gpt-4o-mini-tts is now OpenAI's primary TTS recommendation. Quality is comparable; cost is similar; gpt-4o-mini-tts wins on steerability via prompts.
Is it cheaper than ElevenLabs?
Yes, meaningfully. ElevenLabs starts around $0.03/min on their budget tier. gpt-4o-mini-tts is ~50% cheaper at $0.015/min.
Can I clone voices?
No. gpt-4o-mini-tts uses fixed preset voices. For voice cloning, use ElevenLabs or Play.ht.
Is the audio output commercial-use licensed?
Yes, under OpenAI's standard terms. You own the generated audio and can use it commercially. Verify specific licensing for your use case with OpenAI's usage policies.
Does it support SSML?
No. Instead of SSML tags, use the natural-language instructions parameter. It is a different paradigm with a similar outcome.
Can I get real-time streaming?
Yes, via with_streaming_response. First audio chunk arrives in ~300-600ms. Good enough for conversational agents; Deepgram Aura is faster if latency is critical.
What's the maximum input length per request?
2,000 tokens (~1,500 English words). For longer content, chunk into multiple requests.
Is this available through aggregators?
Yes. TokenMix.ai provides OpenAI-compatible access to gpt-4o-mini-tts alongside the full OpenAI audio stack (transcribe, TTS) and 300+ LLM models through one API key.
Author: TokenMix Research Lab | Last Updated: April 25, 2026 | Data Sources: OpenAI gpt-4o-mini-tts docs, OpenAI Text-to-Speech guide, OpenAI API pricing, PromptLayer gpt-4o-mini-tts analysis, TokenMix.ai unified audio API