TokenMix Research Lab · 2026-04-24

GPT-4o Realtime Audio API 2026: Setup + Cost Math

GPT-4o Realtime is OpenAI's end-to-end voice-to-voice API: the user speaks, the model hears, thinks, and speaks back, all in a single WebSocket session with ~300ms end-to-end latency. Three model variants exist: gpt-4o-realtime-preview (flagship), gpt-4o-mini-realtime-preview (cheaper), and gpt-4o-audio-preview (HTTP-based, non-streaming). Pricing is per minute of audio: $0.06 per minute of input and $0.24 per minute of output on the flagship variant. This guide covers the WebSocket setup (15 lines of code), cost math at 3 production scales, latency benchmarks vs ElevenLabs Conversational and Gemini 3.1 Flash Live, and when to pick each. TokenMix.ai routes all three variants.

Confirmed vs Speculation
The Three Realtime Variants
WebSocket Setup in 15 Lines
Cost Math: 3 Production Scales
Latency vs ElevenLabs Conversational + Gemini Live
When to Pick Each
FAQ

Confirmed vs Speculation

Claim	Status	Source
GPT-4o Realtime WebSocket API	Confirmed	OpenAI Realtime docs
~300ms end-to-end latency	Confirmed (p50)	OpenAI benchmark
Supports native voice input (no separate STT)	Confirmed	Architecture
Three variants (realtime / mini / audio-preview)	Confirmed	API reference
Audio output preserves tone/emotion of input	Confirmed	Key differentiator
Works through OpenAI-compatible proxies	Partial (the proxy must pass WebSocket traffic through; HTTP-only proxies will not work)	—
$0.06/min input audio	Confirmed	Pricing
Matches ElevenLabs latency	Close — 300ms vs 250ms typical	Benchmark

Snapshot note (2026-04-24): Per-minute pricing for gpt-4o-realtime-preview and variants is as posted on OpenAI's API pricing page at snapshot. End-to-end latency numbers are OpenAI-reported medians; actual p50/p95 in your product vary with client geography, audio codec, VAD config, and whether you proxy via a gateway. ElevenLabs and Gemini Live figures come from each vendor plus community benchmarks.
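
Because the numbers that matter are the p50/p95 in your own product, it can help to compute them from your own measurements. The sketch below is one way to do that with Python's standard library; the sample values are hypothetical, and how you timestamp "user stops speaking" and "first agent audio" is left as an assumption about your client.

```python
import statistics

def latency_percentiles(samples_ms):
    """Return (p50, p95) for a list of end-to-end latency samples in milliseconds."""
    if len(samples_ms) < 2:
        raise ValueError("need at least two samples")
    # quantiles(n=100) returns the 99 cut points p1..p99;
    # index 49 is the 50th percentile and index 94 is the 95th.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return cuts[49], cuts[94]

# Hypothetical measurements (ms), for illustration only.
samples = [280, 310, 295, 330, 420, 305, 290, 510, 300, 315]
p50, p95 = latency_percentiles(samples)
print(f"p50={p50:.0f}ms p95={p95:.0f}ms")
```

With only a handful of samples the p95 is dominated by one or two slow calls, so collect a few hundred sessions before comparing against vendor-reported medians.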

The Three Realtime Variants

gpt-4o-realtime-preview (flagship):

WebSocket only, voice-to-voice
300ms p50 end-to-end
$0.06/min audio in, $0.24/min audio out, $5/MTok input text, $20/MTok output text
Best for production voice agents, real-time assistants

gpt-4o-mini-realtime-preview:

Same WebSocket architecture
~450ms p50 per the latency table below (about 50% slower than the flagship), still <500ms p50
$0.01/min audio in, $0.04/min audio out (6× cheaper)
Best for high-volume consumer voice apps

gpt-4o-audio-preview:

HTTP-based, non-streaming (send audio, wait for audio response)
2-8 second full response time
$0.06/min, same as realtime
Best for async voicemail processing, voice-based batch tasks

WebSocket Setup in 15 Lines

Minimum viable connection (Python):

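import asyncio, base64, json, os
import websockets  # third-party: pip install websockets (v14+; older releases use extra_headers)

# Placeholder input: 100 ms of PCM16 silence at 24kHz mono (2,400 samples × 2 bytes).
# Replace with real microphone audio.
base64_audio_chunk = base64.b64encode(b"\x00\x00" * 2400).decode("ascii")

async def voice_session():
    async with websockets.connect(
        "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview",
        additional_headers={
            # Read the key from the environment; "$OPENAI_KEY" is not expanded inside a Python string
            "Authorization": f"Bearer {os.environ['OPENAI_KEY']}",
            "OpenAI-Beta": "realtime=v1"
        }
    ) as ws:
        # Configure session
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {"voice": "alloy", "instructions": "Be concise."}
        }))
        # Send audio chunks (base64-encoded PCM16 24kHz)
        await ws.send(json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64_audio_chunk
        }))
        # Receive audio + text deltas; print the type of each server event
        async for msg in ws:
            print(json.loads(msg).get("type"))

asyncio.run(voice_session())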

For production, add: audio VAD (voice activity detection), interruption handling, turn-taking logic, function calling for tool use.
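
As a rough illustration of the first and last items on that list, the sketch below builds a session.update payload with server-side VAD and one tool. The field names (turn_detection, server_vad, tools, tool_choice) follow my reading of the Realtime beta reference and should be checked against the current docs; the lookup_order tool and the threshold values are hypothetical.

```python
import json

def build_session_update(instructions, tools=None, silence_ms=500):
    """Build a session.update event with server-side VAD and optional tools."""
    session = {
        "voice": "alloy",
        "instructions": instructions,
        # Server-side voice activity detection: the server decides when a user turn ends.
        "turn_detection": {
            "type": "server_vad",
            "threshold": 0.5,               # assumed starting value; tune per environment
            "prefix_padding_ms": 300,
            "silence_duration_ms": silence_ms,
        },
    }
    if tools:
        session["tools"] = tools
        session["tool_choice"] = "auto"
    return {"type": "session.update", "session": session}

# Hypothetical tool definition, for illustration only.
lookup_order = {
    "type": "function",
    "name": "lookup_order",
    "description": "Fetch the status of an order by its ID.",
    "parameters": {
        "type": "object",
        "properties": {"order_id": {"type": "string"}},
        "required": ["order_id"],
    },
}

event = build_session_update("Be concise.", tools=[lookup_order])
print(json.dumps(event, indent=2))
```

The resulting dict would be sent with ws.send(json.dumps(event)) in place of the minimal session.update shown earlier.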

Cost Math: 3 Production Scales

Assume average call: 2 minutes user audio, 3 minutes agent audio.

Small — 1,000 calls/month:

Realtime: (1000 × 2 × $0.06) + (1000 × 3 × $0.24) = $840/mo
Mini: (1000 × 2 × $0.01) + (1000 × 3 × $0.04) = $140/mo
Audio-preview (async): same per-minute cost as Realtime, but no streaming UX

Mid — 50,000 calls/month (call center):

Realtime: $42,000/mo
Mini: $7,000/mo
Hybrid route (Realtime for premium calls, Mini for simple ones): ~$15,000/mo, which corresponds to roughly 23% of calls on Realtime

Enterprise — 500,000 calls/month:

Realtime: $420,000/mo
Mini: $70,000/mo
Custom pricing typically negotiated above 1M min/month
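
The per-scale figures above all come from the same per-call arithmetic, which can be sketched as a small calculator. The rates and the 2-minute-in, 3-minute-out split are taken from this guide; the function names and the premium_share parameter are illustrative assumptions.

```python
RATES = {  # USD per audio minute as (input, output), as quoted in this guide
    "realtime": (0.06, 0.24),
    "mini": (0.01, 0.04),
}

def monthly_cost(model, calls, in_min=2, out_min=3):
    """Monthly audio cost for a number of calls with a fixed in/out split."""
    rate_in, rate_out = RATES[model]
    return calls * (in_min * rate_in + out_min * rate_out)

def hybrid_cost(calls, premium_share):
    """Cost when premium_share of calls go to Realtime and the rest to Mini."""
    premium_calls = calls * premium_share
    return monthly_cost("realtime", premium_calls) + monthly_cost("mini", calls - premium_calls)

for calls in (1_000, 50_000, 500_000):
    print(calls, round(monthly_cost("realtime", calls)), round(monthly_cost("mini", calls)))

# A hybrid total near $15,000 at 50K calls implies about 23% of calls on Realtime.
print(round(hybrid_cost(50_000, 0.23)))
```

Changing in_min and out_min is the quickest way to see how sensitive the bill is to agent talk time, since output audio costs four times as much per minute as input.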

Comparison to ElevenLabs Conversational (at $0.30/min blended): 50,000 calls × 5 min × $0.30 = ~$75,000/mo at 50K calls. GPT-4o-mini-Realtime at $7,000/mo is roughly 10× cheaper than ElevenLabs at that volume, with similar quality for most business voice use.

Latency vs ElevenLabs Conversational + Gemini Live

End-to-end (user stop speaking → agent starts speaking):

Model	p50 latency	p95	Voice polish	Cost
GPT-4o-Realtime	300ms	500ms	Good	$0.30/min ($0.06 in + $0.24 out)
GPT-4o-mini-Realtime	450ms	700ms	Acceptable	$0.05/min ($0.01 in + $0.04 out)
ElevenLabs Conversational	250ms	400ms	Best	$0.30-0.40/min
Gemini 3.1 Flash Live	350ms	550ms	Good	$0.04/min

At p50, all four come in under the "feels conversational" threshold (<500ms); at p95, only ElevenLabs stays under it, with GPT-4o-Realtime right at the line. Pick based on cost or voice polish preference.
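
To make the threshold comparison concrete, the short sketch below filters the table's figures at p50 and at p95. The numbers are copied from the table above; the dictionary layout and function name are illustrative.

```python
# (p50_ms, p95_ms) as listed in the latency table
LATENCY = {
    "GPT-4o-Realtime": (300, 500),
    "GPT-4o-mini-Realtime": (450, 700),
    "ElevenLabs Conversational": (250, 400),
    "Gemini 3.1 Flash Live": (350, 550),
}

def under_threshold(percentile_index, threshold_ms=500):
    """Names of models strictly below the threshold at p50 (index 0) or p95 (index 1)."""
    return [name for name, stats in LATENCY.items() if stats[percentile_index] < threshold_ms]

print("p50 under 500ms:", under_threshold(0))
print("p95 under 500ms:", under_threshold(1))
```

If tail latency matters more than the median for your product, the p95 column is the one to filter on.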

When to Pick Each

Your app	Pick	Why
Customer service voice agent, mid-volume	GPT-4o-Realtime	Best balance
High-volume consumer voice assistant	GPT-4o-mini-Realtime	Cheapest
Premium brand voice (character cloning)	ElevenLabs Conversational	Voice polish
Google ecosystem integration	Gemini 3.1 Flash Live	Native
Async voicemail transcription + response	gpt-4o-audio-preview	HTTP simpler
Ultra-low-latency (gaming, interactive media)	ElevenLabs	250ms edge
Cost-first startup voice feature	Gemini 3.1 Flash Live	$0.04/min
B2B product that needs voice QA logs	Any with diarization + logging	Logging flexibility

FAQ

Is GPT-4o-Realtime a single model or a pipeline?

A single unified model: OpenAI handles voice-to-voice inside one model rather than a chained STT→LLM→TTS pipeline. This is the main latency advantage vs chained approaches.

Can I use Realtime with a proxy/aggregator like TokenMix.ai?

Most aggregators forward WebSocket traffic. TokenMix.ai supports GPT-4o-Realtime via transparent WebSocket proxy. Adds ~20-50ms of latency for routing but simplifies multi-provider fallback.

Does Realtime support function calling / tool use?

Yes, natively. Send a tools array in the session config; the model invokes tools mid-conversation, you execute them and return the results, and the conversation continues. The flow works the same way as GPT-5.4 function calling.
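
A minimal sketch of the "execute and return results" step, assuming the server reports a finished tool call as a response.output_item.done event carrying a function_call item, and that the client replies with a function_call_output item followed by response.create. Those event shapes follow my reading of the Realtime beta reference and should be verified; the lookup_order tool and the sample event are hypothetical.

```python
import json

def handle_function_call(event, registry):
    """Turn a completed function-call item into the client events to send back."""
    item = event.get("item", {})
    if event.get("type") != "response.output_item.done" or item.get("type") != "function_call":
        return []
    func = registry[item["name"]]            # look up the local implementation
    args = json.loads(item["arguments"])     # arguments arrive as a JSON string
    result = func(**args)
    return [
        {
            "type": "conversation.item.create",
            "item": {
                "type": "function_call_output",
                "call_id": item["call_id"],
                "output": json.dumps(result),
            },
        },
        {"type": "response.create"},  # ask the model to continue using the tool result
    ]

# Hypothetical tool and a hand-written event, for illustration only.
def lookup_order(order_id):
    return {"order_id": order_id, "status": "shipped"}

fake_event = {
    "type": "response.output_item.done",
    "item": {
        "type": "function_call",
        "name": "lookup_order",
        "call_id": "call_123",
        "arguments": '{"order_id": "A-42"}',
    },
}
replies = handle_function_call(fake_event, {"lookup_order": lookup_order})
for reply in replies:
    print(json.dumps(reply))
```

In a live session, each returned dict would be serialized and sent over the same WebSocket from inside the receive loop.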

What's the audio format?

PCM16 at 24kHz, mono. Base64-encoded in WebSocket messages. Most voice SDKs (Twilio, Daily, LiveKit) support this or can transcode.
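
As a sketch of what that looks like on the wire, the helper below slices raw PCM16 mono audio into 100 ms frames and wraps each one in an input_audio_buffer.append event. The 100 ms frame size is an arbitrary choice, little-endian sample order is assumed, and the silent buffer stands in for real microphone data.

```python
import base64

SAMPLE_RATE = 24_000      # Hz
BYTES_PER_SAMPLE = 2      # PCM16, mono

def pcm16_chunks(pcm_bytes, chunk_ms=100):
    """Yield input_audio_buffer.append events for raw PCM16 mono audio."""
    chunk_size = SAMPLE_RATE * BYTES_PER_SAMPLE * chunk_ms // 1000
    for start in range(0, len(pcm_bytes), chunk_size):
        chunk = pcm_bytes[start:start + chunk_size]
        yield {
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(chunk).decode("ascii"),
        }

# One second of silence (placeholder input): 24,000 samples × 2 bytes.
silence = b"\x00\x00" * SAMPLE_RATE
events = list(pcm16_chunks(silence))
print(len(events), "events,", len(base64.b64decode(events[0]["audio"])), "bytes each")
```

At 24kHz and 2 bytes per sample, one second of audio is 48,000 bytes, so 100 ms frames carry 4,800 bytes each before base64 encoding.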

Can I interrupt the assistant mid-speech?

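Yes. With server-side VAD enabled, keep sending input_audio_buffer.append at any time; when the server detects new speech, the model halts its current response, processes the new input, and continues. Your client should also stop playing any audio it has already buffered. This is what makes the UX feel natural.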
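
On the client side, barge-in usually also means cutting off audio that is already queued for playback. The sketch below assumes server-side VAD, where the server emits input_audio_buffer.speech_started when the user begins talking, and replies with conversation.item.truncate so the conversation history matches what the user actually heard. Event and field names follow my reading of the Realtime beta reference; the playback dict is a hypothetical piece of client state.

```python
def on_server_event(event, playback):
    """Return client events to send when the user interrupts the assistant.

    playback is hypothetical client state:
      {"item_id": id of the assistant item being played, or None,
       "played_ms": milliseconds of that item already played}
    """
    if event.get("type") != "input_audio_buffer.speech_started":
        return []
    if playback.get("item_id") is None:
        return []  # nothing is playing, so there is nothing to cut off
    truncate = {
        "type": "conversation.item.truncate",
        "item_id": playback["item_id"],
        "content_index": 0,
        "audio_end_ms": playback["played_ms"],
    }
    playback["item_id"] = None  # the caller should also flush its local audio queue
    return [truncate]

# Hand-written example state and event, for illustration only.
state = {"item_id": "item_abc", "played_ms": 1200}
to_send = on_server_event({"type": "input_audio_buffer.speech_started"}, state)
print(to_send)
```

Keeping this logic in a pure function makes it easy to unit-test separately from the audio device and the socket.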

How does pricing compare to Whisper + GPT + TTS pipeline?

Pipeline cost for a 5-minute call: Whisper ($0.03) + GPT-5.4 (~$0.10) + TTS ($0.05) = $0.18. GPT-4o-Realtime for the same call, using the 2-minute-in, 3-minute-out split from the cost section: (2 × $0.06) + (3 × $0.24) = $0.84. Realtime costs roughly 4.7× as much, but latency is 3× better and voice tone/emotion is preserved. For user-facing agents, worth it.

Any rate limits I should know?

Tier 4: 20 concurrent WebSocket sessions. Enterprise: negotiable. For high-volume consumer voice apps, ElevenLabs or Gemini may have higher concurrent limits at similar pricing.
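
One common way to stay under a concurrent-session cap is to gate session starts with a semaphore. The sketch below uses asyncio.Semaphore with the Tier 4 figure quoted above; fake_call is a stand-in for a real voice session, and the names are illustrative.

```python
import asyncio

MAX_SESSIONS = 20  # Tier 4 concurrent-session limit quoted above

async def run_with_limit(call_ids, handle_call, limit=MAX_SESSIONS):
    """Run handle_call for every call, with at most `limit` running at once."""
    sem = asyncio.Semaphore(limit)

    async def guarded(call_id):
        async with sem:  # waits here once `limit` sessions are already active
            return await handle_call(call_id)

    return await asyncio.gather(*(guarded(c) for c in call_ids))

# Stand-in for a real voice session; tracks peak concurrency for the demo.
active = 0
peak = 0

async def fake_call(call_id):
    global active, peak
    active += 1
    peak = max(peak, active)
    await asyncio.sleep(0.01)
    active -= 1
    return call_id

results = asyncio.run(run_with_limit(range(50), fake_call))
print(len(results), "calls completed, peak concurrency", peak)
```

Callers beyond the limit simply wait for a slot, so in a real product you would likely pair this with a queue timeout or a fallback route.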
