TokenMix Research Lab · 2026-04-24
GPT-4o Realtime Audio API 2026: Setup + Cost Math
Last Updated: 2026-04-24
Author: TokenMix Research Lab
GPT-4o Realtime is OpenAI's end-to-end voice-to-voice API: the user speaks and the model hears, reasons, and speaks back, all in a single WebSocket session with ~300ms end-to-end latency. Three model variants exist: gpt-4o-realtime-preview (flagship), gpt-4o-mini-realtime-preview (cheaper), and gpt-4o-audio-preview (HTTP-based, non-streaming). Pricing adds up quickly at volume: $0.06 per minute of audio input and $0.24 per minute of audio output on the standard variant. This guide covers the WebSocket setup (15 lines of code), cost math at 3 production scales, latency benchmarks vs ElevenLabs Conversational and Gemini 3.1 Flash Live, and when to pick each. TokenMix.ai routes all three variants.
Table of Contents
- Confirmed vs Speculation
- The Three Realtime Variants
- WebSocket Setup in 15 Lines
- Cost Math: 3 Production Scales
- Latency vs ElevenLabs Conversational + Gemini Live
- When to Pick Each
- FAQ
Confirmed vs Speculation
| Claim | Status | Source |
|---|---|---|
| GPT-4o Realtime WebSocket API | Confirmed | OpenAI Realtime docs |
| ~300ms end-to-end latency | Confirmed (p50) | OpenAI benchmark |
| Supports native voice input (no separate STT) | Confirmed | Architecture |
| Three variants (realtime / mini / audio-preview) | Confirmed | API reference |
| Audio output preserves tone/emotion of input | Confirmed | Key differentiator |
| Works through OpenAI-compatible proxies | Partial: the proxy must forward WebSocket traffic, not only HTTP | n/a |
| $0.06/min input audio | Confirmed | Pricing |
| Matches ElevenLabs latency | Close — 300ms vs 250ms typical | Benchmark |
Snapshot note (2026-04-24): Per-minute pricing for gpt-4o-realtime-preview and variants is as posted on OpenAI's API pricing page at snapshot. End-to-end latency numbers are OpenAI-reported medians; actual p50/p95 in your product vary with client geography, audio codec, VAD config, and whether you proxy via a gateway. ElevenLabs and Gemini Live figures come from each vendor plus community benchmarks.
The Three Realtime Variants
gpt-4o-realtime-preview (flagship):
- WebSocket only, voice-to-voice
- 300ms p50 end-to-end
- $0.06/min audio in, $0.24/min audio out, $5/MTok input text, $20/MTok output text
- Best for production voice agents, real-time assistants
gpt-4o-mini-realtime-preview:
- Same WebSocket architecture
- ~50% slower per the latency table below (450ms vs 300ms p50), still <500ms p50
- $0.01/min audio in, $0.04/min audio out (6× cheaper)
- Best for high-volume consumer voice apps
gpt-4o-audio-preview:
- HTTP-based, non-streaming (send audio, wait for audio response)
- 2-8 second full response time
- $0.06/min audio in, same as realtime
- Best for async voicemail processing, voice-based batch tasks
WebSocket Setup in 15 Lines
Minimum viable connection (Python):
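    import asyncio, base64, json, os
    import websockets

    async def voice_session():
        async with websockets.connect(
            "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview",
            # websockets >= 14 takes additional_headers; older releases call it extra_headers
            additional_headers={
                # Read the key from the environment; "$OPENAI_KEY" is not expanded inside a Python string
                "Authorization": f"Bearer {os.environ['OPENAI_KEY']}",
                "OpenAI-Beta": "realtime=v1",
            },
        ) as ws:
            # Configure session
            await ws.send(json.dumps({
                "type": "session.update",
                "session": {"voice": "alloy", "instructions": "Be concise."},
            }))
            # Placeholder input: 100 ms of silence as PCM16, 24kHz mono (swap in real microphone audio)
            base64_audio_chunk = base64.b64encode(b"\x00\x00" * 2400).decode("ascii")
            # Send audio chunks (base64-encoded PCM16 24kHz)
            await ws.send(json.dumps({
                "type": "input_audio_buffer.append",
                "audio": base64_audio_chunk,
            }))
            # Receive audio + text deltas; print each server event type
            async for msg in ws:
                print(json.loads(msg).get("type"))

    asyncio.run(voice_session())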
For production, add: audio VAD (voice activity detection), interruption handling, turn-taking logic, function calling for tool use.
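As a starting point for the VAD item in that list, here is a minimal sketch of a session.update payload that asks the server to detect turn boundaries. The `turn_detection`, `server_vad`, `silence_duration_ms`, and audio format field names reflect our reading of the Realtime session schema and should be checked against the current OpenAI docs; `build_session_update` is a hypothetical helper, not part of any SDK.

```python
import json


def build_session_update(voice="alloy", instructions="Be concise.", silence_ms=500):
    """Build a session.update event that enables server-side VAD.

    Field names under "session" are assumptions based on the Realtime
    session schema; verify them against the current API reference.
    """
    return {
        "type": "session.update",
        "session": {
            "voice": voice,
            "instructions": instructions,
            "input_audio_format": "pcm16",
            "output_audio_format": "pcm16",
            "turn_detection": {
                "type": "server_vad",
                # How much trailing silence ends the user's turn.
                "silence_duration_ms": silence_ms,
            },
        },
    }


# Serialize exactly as it would be passed to ws.send(...).
payload = json.dumps(build_session_update())
```

A longer `silence_ms` makes the agent less likely to cut in during pauses, at the cost of slower turn-taking.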
Cost Math: 3 Production Scales
Assume average call: 2 minutes user audio, 3 minutes agent audio.
Small — 1,000 calls/month:
- Realtime: (1000 × 2 × $0.06) + (1000 × 3 × $0.24) = $840/mo
- Mini: (1000 × 2 × $0.01) + (1000 × 3 × $0.04) = $140/mo
- Audio-preview (async): same token cost as Realtime but no streaming UX
Mid — 50,000 calls/month (call center):
- Realtime: $42,000/mo
- Mini: $7,000/mo
- Hybrid route (Realtime for premium, Mini for simple): ~$15,000/mo
Enterprise — 500,000 calls/month:
- Realtime: $420,000/mo
- Mini: $70,000/mo
- Custom pricing typically negotiated above 1M min/month
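Comparison to ElevenLabs Conversational (at $0.30/min blended): 50,000 calls × 5 minutes × $0.30 = ~$75,000/mo at 50K calls, assuming the full 5-minute call is billed. On that basis GPT-4o-mini-Realtime ($7,000/mo) is roughly 10× cheaper than ElevenLabs, with similar quality for most business voice use.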
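The Realtime and Mini totals in this section can be reproduced with a small calculator. This is a sketch under the stated assumptions (2 minutes of user audio and 3 minutes of agent audio per call, audio-only billing at the per-minute rates quoted in this guide); `RATES` and `monthly_cost` are illustrative names.

```python
# Per-minute audio rates (input, output) in USD, as quoted in this guide.
RATES = {
    "gpt-4o-realtime-preview": (0.06, 0.24),
    "gpt-4o-mini-realtime-preview": (0.01, 0.04),
}


def monthly_cost(model, calls, user_min=2.0, agent_min=3.0):
    """Monthly audio cost in USD for a given call volume."""
    rate_in, rate_out = RATES[model]
    per_call = user_min * rate_in + agent_min * rate_out
    return round(calls * per_call, 2)


for calls in (1_000, 50_000, 500_000):
    for model in RATES:
        print(f"{calls:>7,} calls  {model:<30} ${monthly_cost(model, calls):,.0f}/mo")
```

Changing `user_min` and `agent_min` shows how sensitive the bill is to talk ratio: output audio costs 4× input, so agent-heavy calls dominate the total.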
Latency vs ElevenLabs Conversational + Gemini Live
End-to-end (user stop speaking → agent starts speaking):
| Model | p50 latency | p95 | Voice polish | Cost |
|---|---|---|---|---|
| GPT-4o-Realtime | 300ms | 500ms | Good | $0.30/min blended |
| GPT-4o-mini-Realtime | 450ms | 700ms | Acceptable | $0.05/min |
| ElevenLabs Conversational | 250ms | 400ms | Best | $0.30-0.40/min |
| Gemini 3.1 Flash Live | 350ms | 550ms | Good | $0.04/min |
All four cross the "feels conversational" threshold (<500ms) at p50; at p95 only ElevenLabs stays under it. Pick based on cost or voice polish preference.
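To see how the threshold applies at each percentile, the table's numbers can be checked directly. This is a small sketch using only the figures from the table above; `LATENCY_MS` and `under_threshold` are illustrative names.

```python
# (p50, p95) end-to-end latency in milliseconds, copied from the table above.
LATENCY_MS = {
    "GPT-4o-Realtime": (300, 500),
    "GPT-4o-mini-Realtime": (450, 700),
    "ElevenLabs Conversational": (250, 400),
    "Gemini 3.1 Flash Live": (350, 550),
}
THRESHOLD_MS = 500  # "feels conversational" cutoff used in this guide


def under_threshold(percentile_index):
    """Names of models strictly under the threshold at p50 (0) or p95 (1)."""
    return [
        name
        for name, latency in LATENCY_MS.items()
        if latency[percentile_index] < THRESHOLD_MS
    ]


p50_ok = under_threshold(0)
p95_ok = under_threshold(1)
print("Under 500ms at p50:", p50_ok)
print("Under 500ms at p95:", p95_ok)
```

If tail latency matters for your product, measure p95 from your own client geography rather than relying on the medians.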
When to Pick Each
| Your app | Pick | Why |
|---|---|---|
| Customer service voice agent, mid-volume | GPT-4o-Realtime | Best balance |
| High-volume consumer voice assistant | GPT-4o-mini-Realtime | Cheapest OpenAI option |
| Premium brand voice (character cloning) | ElevenLabs Conversational | Voice polish |
| Google ecosystem integration | Gemini 3.1 Flash Live | Native |
| Async voicemail transcription + response | gpt-4o-audio-preview | HTTP simpler |
| Ultra-low-latency (gaming, interactive media) | ElevenLabs | 250ms edge |
| Cost-first startup voice feature | Gemini 3.1 Flash Live | Lowest cost at $0.04/min |
| B2B product that needs voice QA logs | Any with diarization + logging | Logging flexibility |
FAQ
Is GPT-4o-Realtime a single model or pipeline?
Single unified model. OpenAI does voice-to-voice inside one model rather than an STT→LLM→TTS pipeline, which is the main latency advantage over chained approaches.
Can I use Realtime with a proxy/aggregator like TokenMix.ai?
Most aggregators forward WebSocket traffic. TokenMix.ai supports GPT-4o-Realtime via transparent WebSocket proxy. Adds ~20-50ms of latency for routing but simplifies multi-provider fallback.
Does Realtime support function calling / tool use?
Yes, native. Send a tools array in the session config; the model invokes tools mid-conversation, you execute them and return the results, and the conversation continues. The flow mirrors GPT-5.4 function calling.
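A rough sketch of the two payloads involved follows: registering a tool in the session config, then returning its result. The tool `lookup_order` is hypothetical, and the event shapes (`function_call_output`, `call_id`, a follow-up `response.create`) are our understanding of the Realtime event schema rather than something confirmed in this guide, so verify them against the API reference.

```python
import json


def build_tools_session_update():
    """session.update event registering one hypothetical tool."""
    return {
        "type": "session.update",
        "session": {
            "tools": [
                {
                    "type": "function",
                    "name": "lookup_order",  # hypothetical tool name
                    "description": "Look up an order by its ID.",
                    "parameters": {
                        "type": "object",
                        "properties": {"order_id": {"type": "string"}},
                        "required": ["order_id"],
                    },
                }
            ],
            "tool_choice": "auto",
        },
    }


def build_tool_result(call_id, result):
    """Events that hand a tool result back and ask the model to continue."""
    return [
        {
            "type": "conversation.item.create",
            "item": {
                "type": "function_call_output",
                "call_id": call_id,
                "output": json.dumps(result),  # output is sent as a JSON string
            },
        },
        {"type": "response.create"},
    ]


setup_event = build_tools_session_update()
result_events = build_tool_result("call_123", {"status": "shipped"})
```

Each dict would be passed through `json.dumps` and sent over the same WebSocket as the audio events.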
What's the audio format?
PCM16 at 24kHz, mono. Base64-encoded in WebSocket messages. Most voice SDKs (Twilio, Daily, LiveKit) support this or can transcode.
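For illustration, here is one way to turn float samples into the base64 PCM16 chunks described above. It is a sketch that assumes little-endian 16-bit samples and 100 ms chunks; the chunk size and the helper names `floats_to_pcm16` and `chunk_events` are our choices, not requirements of the API.

```python
import base64
import struct

SAMPLE_RATE = 24_000  # Hz, mono
BYTES_PER_SAMPLE = 2  # 16-bit PCM


def floats_to_pcm16(samples):
    """Convert floats in [-1.0, 1.0] to little-endian 16-bit PCM bytes."""
    ints = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    return struct.pack(f"<{len(ints)}h", *ints)


def chunk_events(pcm, chunk_ms=100):
    """Yield input_audio_buffer.append events, one per chunk of audio."""
    bytes_per_chunk = SAMPLE_RATE * BYTES_PER_SAMPLE * chunk_ms // 1000
    for start in range(0, len(pcm), bytes_per_chunk):
        chunk = pcm[start:start + bytes_per_chunk]
        yield {
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(chunk).decode("ascii"),
        }


# One second of silence, split into 100 ms events.
silence = floats_to_pcm16([0.0] * SAMPLE_RATE)
events = list(chunk_events(silence))
```

Smaller chunks lower the delay before the server sees speech but increase message overhead; 50 to 200 ms is a common range to experiment with.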
Can I interrupt the assistant mid-speech?
Yes. Keep sending input_audio_buffer.append events at any time, including while the assistant is talking; the model halts speaking, processes the new input, and continues. This is what makes the UX feel natural.
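On the client side you usually also need to stop local playback and tell the server how much audio the user actually heard. The sketch below shows one possible reaction to a barge-in; the event names `input_audio_buffer.speech_started`, `response.cancel`, and `conversation.item.truncate` are our understanding of the Realtime event schema and are not confirmed by this guide, and the helper names are hypothetical.

```python
def interruption_events(item_id, played_ms):
    """Events to send when the user talks over the assistant.

    Assumed event shapes; confirm against the Realtime API reference.
    """
    return [
        # Stop the response that is currently being generated.
        {"type": "response.cancel"},
        # Trim the assistant's audio item to what was actually played.
        {
            "type": "conversation.item.truncate",
            "item_id": item_id,
            "content_index": 0,
            "audio_end_ms": played_ms,
        },
    ]


def on_server_event(event, current_item_id, played_ms):
    """Return the client events to send in reaction to one server event."""
    is_barge_in = event.get("type") == "input_audio_buffer.speech_started"
    if is_barge_in and current_item_id is not None:
        return interruption_events(current_item_id, played_ms)
    return []


to_send = on_server_event(
    {"type": "input_audio_buffer.speech_started"}, "item_abc", 1200
)
```

Keeping this logic as a pure function makes it easy to unit test without opening a WebSocket.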
How does pricing compare to Whisper + GPT + TTS pipeline?
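Pipeline cost at 5-min call: Whisper ($0.03) + GPT-5.4 (~$0.10) + TTS ($0.05) = $0.18. GPT-4o-Realtime at the per-minute rates above (2 minutes in, 3 minutes out): $0.12 + $0.72 = $0.84 for the same call, roughly 4.7× the pipeline cost; the $0.30 blended figure is a per-minute rate, not a per-call total. In exchange, latency is about 3× better and voice tone/emotion is preserved. For user-facing agents, that is often worth it.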
Any rate limits I should know?
Tier 4: 20 concurrent WebSocket sessions. Enterprise: negotiable. For high-volume consumer voice apps, ElevenLabs or Gemini may have higher concurrent limits at similar pricing.
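One way to stay under a concurrent-session cap is to gate calls behind a semaphore. The sketch below simulates sessions with a short sleep instead of a real WebSocket; the limit of 20 comes from the Tier 4 figure above, and `run_calls` is an illustrative name.

```python
import asyncio

MAX_SESSIONS = 20  # Tier 4 concurrent-session figure quoted above


async def run_calls(n_calls, limit=MAX_SESSIONS):
    """Run n_calls simulated sessions, never more than `limit` at once.

    Returns the peak number of sessions that were open simultaneously.
    """
    semaphore = asyncio.Semaphore(limit)
    active = 0
    peak = 0

    async def one_call():
        nonlocal active, peak
        async with semaphore:  # waits here while the cap is reached
            active += 1
            peak = max(peak, active)
            await asyncio.sleep(0.01)  # stand-in for a real voice session
            active -= 1

    await asyncio.gather(*(one_call() for _ in range(n_calls)))
    return peak


peak = asyncio.run(run_calls(50))
print("Peak concurrent sessions:", peak)
```

In a real service the body of `one_call` would open the WebSocket session, and callers beyond the cap would queue or be routed to a fallback provider.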
Sources
- OpenAI Realtime API Docs
- OpenAI Realtime Introduction
- Voice AI API Comparison — TokenMix
- ElevenLabs Scribe v2 — TokenMix
- Gemini Flash TTS — TokenMix
- GPT-4o Transcribe — TokenMix