TokenMix Research Lab · 2026-04-24
GPT-4o Realtime Audio API 2026: Setup + Cost Math
GPT-4o Realtime is OpenAI's end-to-end voice-to-voice API — the user speaks, the model hears, thinks, and speaks back, all in a single WebSocket session with ~300ms end-to-end latency. Three model variants exist: gpt-4o-realtime-preview (flagship), gpt-4o-mini-realtime-preview (cheaper), and gpt-4o-audio-preview (HTTP-based, non-streaming). Pricing is significant: $0.06 per minute of audio input and $0.24 per minute of audio output on the standard variant. This guide covers the WebSocket setup (15 lines of code), cost math at 3 production scales, latency benchmarks vs ElevenLabs Conversational, and when to pick OpenAI Realtime vs Gemini 3.1 Flash Live. TokenMix.ai routes all three variants.
Table of Contents
- Confirmed vs Speculation
- The Three Realtime Variants
- WebSocket Setup in 15 Lines
- Cost Math: 3 Production Scales
- Latency vs ElevenLabs Conversational + Gemini Live
- When to Pick Each
- FAQ
Confirmed vs Speculation
| Claim | Status | Source |
|---|---|---|
| GPT-4o Realtime WebSocket API | Confirmed | OpenAI Realtime docs |
| ~300ms end-to-end latency | Confirmed (p50) | OpenAI benchmark |
| Supports native voice input (no separate STT) | Confirmed | Architecture |
| Three variants (realtime / mini / audio-preview) | Confirmed | API reference |
| Audio output preserves tone/emotion of input | Confirmed | Key differentiator |
| Works through OpenAI-compatible proxies | Partial — real-time needs direct WebSocket | — |
| $0.06/min input audio | Confirmed | Pricing |
| Matches ElevenLabs latency | Close — 300ms vs 250ms typical | Benchmark |
The Three Realtime Variants
gpt-4o-realtime-preview (flagship):
- WebSocket only, voice-to-voice
- 300ms p50 end-to-end
- $0.06/min audio in, $0.24/min audio out, $5/MTok input text, $20/MTok output text
- Best for production voice agents, real-time assistants
gpt-4o-mini-realtime-preview:
- Same WebSocket architecture
- ~20% slower, still <500ms p50
- $0.01/min audio in, $0.04/min audio out (6× cheaper)
- Best for high-volume consumer voice apps
gpt-4o-audio-preview:
- HTTP-based, non-streaming (send audio, wait for audio response)
- 2-8 second full response time
- $0.06/min, same as realtime
- Best for async voicemail processing, voice-based batch tasks
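Because gpt-4o-audio-preview is plain HTTP rather than WebSocket, the request is an ordinary Chat Completions call with audio modalities. The sketch below only builds the request body (no network call); it follows OpenAI's Chat Completions audio format, but verify field names against the current API reference before relying on them, and the `wav_bytes` placeholder stands in for real audio data.

```python
import base64

def audio_preview_request(wav_bytes: bytes, prompt: str) -> dict:
    """Build a Chat Completions body for the HTTP-based audio variant.

    Sketch only — field names follow OpenAI's documented audio format;
    confirm against the current API reference before production use.
    """
    return {
        "model": "gpt-4o-audio-preview",
        "modalities": ["text", "audio"],  # request spoken output alongside text
        "audio": {"voice": "alloy", "format": "wav"},
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "input_audio",
                 "input_audio": {
                     # audio is sent base64-encoded inside the JSON body
                     "data": base64.b64encode(wav_bytes).decode("ascii"),
                     "format": "wav",
                 }},
            ],
        }],
    }

body = audio_preview_request(b"RIFF...", "Transcribe and summarize this voicemail.")
```

POST this body to `/v1/chat/completions` and the full audio response arrives in one shot — which is why the variant suits batch tasks rather than live conversation.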
WebSocket Setup in 15 Lines
Minimum viable connection (Python):
```python
import asyncio, base64, json, os
import websockets  # pip install websockets

async def voice_session():
    async with websockets.connect(
        "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview",
        # On websockets < 13 this parameter is named extra_headers
        additional_headers={
            "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
            "OpenAI-Beta": "realtime=v1",
        },
    ) as ws:
        # Configure session
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {"voice": "alloy", "instructions": "Be concise."},
        }))
        # Send audio chunks (base64-encoded PCM16 @ 24 kHz)
        pcm16_chunk = b"..."  # raw PCM16 bytes from your audio source
        await ws.send(json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(pcm16_chunk).decode("ascii"),
        }))
        # Receive audio + text deltas as server events
        async for msg in ws:
            print(json.loads(msg).get("type"))

asyncio.run(voice_session())
```
For production, add: audio VAD (voice activity detection), interruption handling, turn-taking logic, function calling for tool use.
Cost Math: 3 Production Scales
Assume average call: 2 minutes user audio, 3 minutes agent audio.
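The per-scale figures below follow directly from the per-minute rates quoted earlier; a minimal calculator makes the arithmetic reproducible (rates and call shape are taken from this article, function name is my own):

```python
# ($/min audio in, $/min audio out), per the rates quoted in this article
RATES = {
    "gpt-4o-realtime-preview": (0.06, 0.24),
    "gpt-4o-mini-realtime-preview": (0.01, 0.04),
}

def monthly_cost(model: str, calls: int,
                 user_min: float = 2, agent_min: float = 3) -> float:
    """Monthly audio cost in USD for `calls` calls of the assumed shape."""
    rate_in, rate_out = RATES[model]
    return round(calls * (user_min * rate_in + agent_min * rate_out), 2)

print(monthly_cost("gpt-4o-realtime-preview", 1000))       # 840.0
print(monthly_cost("gpt-4o-mini-realtime-preview", 1000))  # 140.0
```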
Small — 1,000 calls/month:
- Realtime: (1000 × 2 × $0.06) + (1000 × 3 × $0.24) = $840/mo
- Mini: (1000 × 2 × $0.01) + (1000 × 3 × $0.04) = $140/mo