gpt-4o-mini-tts: The Cheapest TTS API in 2026 ($0.015/Min, 13 Voices)
OpenAI's gpt-4o-mini-tts is the cost-optimized text-to-speech model that pairs with gpt-4o-mini-transcribe and gpt-4o-transcribe to complete OpenAI's audio stack. Released March 2025, it delivers natural-sounding speech at $0.60 per million text-input tokens and $12 per million audio-output tokens, which works out to roughly $0.015 per minute of generated audio. With 13+ distinct voices, 50+ language support, and steerable tone/emotion via prompts, it's OpenAI's answer to ElevenLabs at a much lower price point. This guide covers real pricing math, voice selection, streaming capabilities, production gotchas, and when to pick it versus the alternatives. All data verified against OpenAI's April 2026 documentation.
What Is gpt-4o-mini-tts?
Announced March 2025, gpt-4o-mini-tts is OpenAI's cost-efficient text-to-speech model. Built on the GPT-4o multimodal foundation, it converts text input into natural-sounding audio with 13+ preset voices, 50+ languages, and prompt-steerable tone and pacing.
Pricing: The Real Math
1 minute of speech at $12 per million audio-output tokens comes to roughly $0.018 (slightly above the $0.015 "average" due to variable speech density).
Key insight on pricing: costs scale with audio length (output), not input text length. A 100-word sentence spoken slowly costs more than the same 100 words spoken quickly.
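The arithmetic is simple enough to sanity-check in a few lines. A back-of-envelope sketch using the rates above; the per-minute figure is the ~$0.015 average, so treat the result as an estimate, not a quote:

TEXT_IN_PER_MTOK = 0.60    # $ per 1M text-input tokens
AUDIO_OUT_PER_MIN = 0.015  # $ per minute of audio (the average figure above)

def estimate_cost(input_tokens: int, audio_minutes: float) -> float:
    """Rough spend estimate; actual cost varies with speech density."""
    return input_tokens / 1e6 * TEXT_IN_PER_MTOK + audio_minutes * AUDIO_OUT_PER_MIN

# A 10-minute narration generated from ~2,000 input tokens:
print(f"${estimate_cost(2000, 10):.4f}")  # ~$0.1512; the input term is negligible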
Voice Selection: 13 Voices Covered
OpenAI offers a range of voice personas:
| Voice | Style | Best for |
|---|---|---|
| Alloy | Neutral, clear | General purpose |
| Ash | Warm male | Narration |
| Ballad | Smooth, storytelling | Audiobooks |
| Coral | Energetic female | Marketing |
| Echo | Calm, measured | Tutorials |
| Fable | Narrative storyteller | Fiction |
| Nova | Younger, energetic | Consumer apps |
| Onyx | Deep, authoritative | News, serious content |
| Sage | Wise, reflective | Documentaries |
| Shimmer | Bright, expressive | Marketing, upbeat |
| Verse | Creative, poetic | Art / creative content |
| Marin | (newer addition) | Varies |
| Cedar | (newer addition) | Varies |
For production work, test multiple voices on your actual content — voice fit depends heavily on your material's tone.
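The fastest way to do that is mechanically: render one representative line in each candidate voice and review the files side by side. A minimal sketch (the voice shortlist and sample line are placeholders for your own):

from openai import OpenAI

client = OpenAI()
SAMPLE = "Thanks for calling. How can I help you today?"

# Render the same line in several candidate voices for side-by-side review.
for voice in ["alloy", "ash", "coral", "nova", "onyx"]:
    response = client.audio.speech.create(
        model="gpt-4o-mini-tts",
        voice=voice,
        input=SAMPLE,
    )
    response.stream_to_file(f"sample_{voice}.mp3")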
Steerable Speech via Prompts
Unlike traditional TTS models where voice parameters are fixed, gpt-4o-mini-tts accepts natural-language guidance:
from openai import OpenAI

client = OpenAI()

response = client.audio.speech.create(
    model="gpt-4o-mini-tts",
    voice="alloy",
    input="Breaking news: the election results are in.",
    instructions="Speak with urgency and a serious news-anchor tone. Slightly faster pace than normal.",
)
response.stream_to_file("news.mp3")
The instructions parameter accepts guidance on:
Emotion (excited, sad, calm, urgent)
Pace (faster, slower, measured)
Pronunciation emphasis
Accent (regional within a language)
Character-specific direction
Example steering for different contexts:
# Audiobook narration
instructions="Slow, deliberate pacing. Thoughtful pauses between sentences. Gentle storytelling tone."
# Marketing ad
instructions="Upbeat and energetic. Faster pace. Enthusiastic but not shouty."
# Meditation app
instructions="Very slow, soothing. Long pauses between phrases. Calming and soft."
This steerability is a real differentiator vs ElevenLabs and Google TTS, where voice characteristics are more fixed.
Supported LLM Providers and Model Routing
gpt-4o-mini-tts is accessible via:
OpenAI direct (api.openai.com/v1/audio/speech)
Azure OpenAI — same model, enterprise deployment
OpenAI-compatible aggregators — TokenMix.ai, OpenRouter, and similar
Through TokenMix.ai, you get OpenAI-compatible access to gpt-4o-mini-tts, gpt-4o-transcribe, gpt-4o-mini-transcribe alongside Anthropic, Google, and 300+ other models through one API key. For teams building apps that combine text generation (LLM), transcription (speech-to-text), and TTS (text-to-speech), this unified access eliminates cross-provider billing complexity. A customer service voice bot using GPT-5.5 for reasoning + gpt-4o-mini-tts for responses becomes a single API key setup.
Basic usage:
from openai import OpenAI

client = OpenAI(
    api_key="your-tokenmix-key",
    base_url="https://api.tokenmix.ai/v1",
)

response = client.audio.speech.create(
    model="gpt-4o-mini-tts",
    voice="alloy",
    input="Hello, this is generated speech.",
)
response.stream_to_file("output.mp3")
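The voice-bot pattern mentioned above then becomes a two-call pipeline through the same client: the LLM drafts the reply, gpt-4o-mini-tts voices it. A sketch reusing the client configured above; the chat model name is the one quoted earlier, so substitute whatever reasoning model you actually use:

# Continuing with the client configured above.
chat = client.chat.completions.create(
    model="gpt-5.5",  # reasoning model from the example above; swap in your own
    messages=[{"role": "user", "content": "Where is my order?"}],
)
reply_text = chat.choices[0].message.content

speech = client.audio.speech.create(
    model="gpt-4o-mini-tts",
    voice="alloy",
    input=reply_text,
    instructions="Friendly, efficient customer-support tone.",
)
speech.stream_to_file("reply.mp3")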
When to Use It vs ElevenLabs vs Alternatives
Competitive landscape:
| Provider | Pricing | Strength |
|---|---|---|
| gpt-4o-mini-tts | $0.015/min | Cheapest quality TTS |
| ElevenLabs Turbo | ~$0.03/min | Voice cloning, emotion |
| ElevenLabs Multilingual | ~$0.05/min | 29 languages, highest quality |
| Google Cloud TTS (WaveNet) | ~$0.016/min | Google ecosystem |
| Azure TTS Neural | ~$0.016/min | Microsoft ecosystem |
| Play.ht | ~$0.03/min | Voice cloning |
| Deepgram Aura | ~$0.015/min | Ultra-low latency |
| Coqui (open-source) | $0 + infra | Self-hosted |
When to pick gpt-4o-mini-tts:
You're already in the OpenAI ecosystem
Cost is the primary constraint
You need steerable voice via prompts
You want unified audio + LLM billing
When to pick ElevenLabs:
Quality is paramount (still slightly ahead on natural expressiveness)
You need voice cloning
You're building a consumer app where voice quality is a differentiator
When to pick Deepgram Aura:
Latency-critical real-time applications
You need the fastest possible TTS for interactive agents
When to self-host Coqui:
Strict data privacy
High volume (>1M minutes/month) making API costs prohibitive
Team has ML infrastructure capacity
Language Support
50+ languages supported, including:
All major European languages (English, Spanish, French, German, Italian, Portuguese, Dutch, Polish)
East Asian (Chinese/Mandarin, Japanese, Korean)
South Asian (Hindi, Bengali, Tamil)
Middle Eastern (Arabic, Hebrew, Turkish)
Southeast Asian (Vietnamese, Thai, Indonesian, Malay)
Quality varies — English and major European languages are strongest. Lower-resource languages may sound less natural.
Voice-language pairing: not every voice sounds equally good in every language. Test voice + language combinations on your specific content.
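As with voice selection, the practical check is a quick matrix: loop your shortlisted voices over a short sample in each target language. A sketch with hypothetical sample strings:

from openai import OpenAI

client = OpenAI()
samples = {
    "en": "Your order has shipped.",
    "es": "Tu pedido ha sido enviado.",
    "ja": "ご注文の商品を発送いたしました。",
}

# One file per voice + language pair for side-by-side review.
for voice in ["alloy", "nova"]:
    for lang, text in samples.items():
        response = client.audio.speech.create(
            model="gpt-4o-mini-tts",
            voice=voice,
            input=text,
        )
        response.stream_to_file(f"{voice}_{lang}.mp3")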
Streaming and Latency
gpt-4o-mini-tts supports streaming — audio plays as it's generated rather than waiting for full completion.
with client.audio.speech.with_streaming_response.create(
    model="gpt-4o-mini-tts",
    voice="alloy",
    input="This will stream as it generates.",
) as response:
    response.stream_to_file("output.mp3")
Streaming meaningfully reduces perceived latency in interactive applications. For real-time voice agents (customer support, conversational AI), it is essential.
Production Gotchas
1. 2,000-token input limit. Longer text must be chunked; roughly 1,500 English words fit safely per request.
2. Variable output cost. Audio token count varies with speech density. Slow, measured speech costs more per input word than fast speech. Budget accordingly.
3. Voice consistency across chunks. When chunking long text, pass the same voice. OpenAI maintains voice consistency per session but may vary slightly across requests.
4. Pronunciation edge cases. Rare proper nouns, technical terminology, or intentional mispronunciations may need phonetic spelling in input.
5. SSML not supported (yet). Other TTS services use SSML (Speech Synthesis Markup Language) for fine control. gpt-4o-mini-tts uses natural-language instructions instead. Migration from SSML-based workflows requires rewriting.
6. No emotion tags in text. Some services let you mark specific words with emotion. gpt-4o-mini-tts applies instructions to the whole input.
7. MP3 output is the default. Opus, AAC, and FLAC are also supported via the response_format parameter.
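For example, requesting Opus for low-bandwidth playback (client as configured earlier; the filename is illustrative):

response = client.audio.speech.create(
    model="gpt-4o-mini-tts",
    voice="alloy",
    input="Compressed for low-bandwidth playback.",
    response_format="opus",
)
response.stream_to_file("output.opus")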
Code Examples
Basic generation with tone steering:
response = client.audio.speech.create(
    model="gpt-4o-mini-tts",
    voice="nova",
    input="Welcome to our app!",
    instructions="Warm, inviting, like a friendly receptionist.",
)
response.stream_to_file("welcome.mp3")
Streaming for real-time playback:
with client.audio.speech.with_streaming_response.create(
    model="gpt-4o-mini-tts",
    voice="onyx",
    input="Streaming text-to-speech output...",
) as response:
    for chunk in response.iter_bytes():
        audio_player.play(chunk)  # audio_player: your playback sink, e.g. a PyAudio stream
Batch for audiobook generation:
for chapter_num, chapter_text in enumerate(audiobook_chapters):
    response = client.audio.speech.create(
        model="gpt-4o-mini-tts",
        voice="ballad",
        input=chapter_text[:2000],  # crude character cut; see the chunking helper below
        instructions="Measured narration, suitable for long-form audiobook.",
    )
    response.stream_to_file(f"chapter_{chapter_num}.mp3")
FAQ
How does gpt-4o-mini-tts compare to the older tts-1 and tts-1-hd?
gpt-4o-mini-tts replaces tts-1 as the cheap tier. tts-1-hd remains for higher-quality needs, but gpt-4o-mini-tts is now OpenAI's primary TTS recommendation. Quality and cost are comparable to tts-1; gpt-4o-mini-tts wins on prompt steerability.
Is it cheaper than ElevenLabs?
Yes, meaningfully. ElevenLabs starts around $0.03/min on their budget tier. gpt-4o-mini-tts is ~50% cheaper at $0.015/min.
Can I clone voices?
No. gpt-4o-mini-tts uses fixed preset voices. For voice cloning, use ElevenLabs or Play.ht.
Is the audio output commercial-use licensed?
Yes, under OpenAI's standard terms. You own the generated audio and can use it commercially. Verify specific licensing for your use case with OpenAI's usage policies.
Does it support SSML?
No. Instead of SSML tags, use the natural-language instructions parameter. Different paradigm, similar outcome.
Can I get real-time streaming?
Yes, via with_streaming_response. First audio chunk arrives in ~300-600ms. Good enough for conversational agents; Deepgram Aura is faster if latency is critical.
What's the maximum input length per request?
2,000 tokens (~1,500 English words). For longer content, chunk into multiple requests.
Is this available through aggregators?
Yes. TokenMix.ai provides OpenAI-compatible access to gpt-4o-mini-tts alongside the full OpenAI audio stack (transcribe, TTS) and 300+ LLM models through one API key.