TokenMix Research Lab · 2026-04-24

GPT-4o Realtime Audio API 2026: Setup + Cost Math

GPT-4o Realtime is OpenAI's end-to-end voice-to-voice API — user speaks, model hears, model thinks, model speaks back, all in a single WebSocket session with ~300ms end-to-end latency. Three model variants exist: gpt-4o-realtime-preview (flagship), gpt-4o-mini-realtime-preview (cheaper), and gpt-4o-audio-preview (HTTP-based, non-streaming). Pricing is meaningful: $0.06 per minute of audio input, $0.24 per minute of audio output on the standard variant. This guide covers the WebSocket setup (15 lines of code), cost math at 3 production scales, latency benchmarks vs ElevenLabs Conversational, and when to pick OpenAI Realtime vs Gemini 3.1 Flash Live. TokenMix.ai routes all three variants.

Confirmed vs Speculation

Claim | Status | Source
GPT-4o Realtime WebSocket API | Confirmed | OpenAI Realtime docs
~300ms end-to-end latency | Confirmed (p50) | OpenAI benchmark
Supports native voice input (no separate STT) | Confirmed | Architecture
Three variants (realtime / mini / audio-preview) | Confirmed | API reference
Audio output preserves tone/emotion of input | Confirmed | Key differentiator
Works through OpenAI-compatible proxies | Partial | Real-time needs a direct WebSocket
$0.06/min input audio | Confirmed | Pricing
Matches ElevenLabs latency | Close (300ms vs 250ms typical) | Benchmark

Snapshot note (2026-04-24): Per-minute pricing for gpt-4o-realtime-preview and variants is as posted on OpenAI's API pricing page at snapshot. End-to-end latency numbers are OpenAI-reported medians; actual p50/p95 in your product vary with client geography, audio codec, VAD config, and whether you proxy via a gateway. ElevenLabs and Gemini Live figures come from each vendor plus community benchmarks.

The Three Realtime Variants

gpt-4o-realtime-preview (flagship): full-quality voice-to-voice over WebSocket. $0.06/min audio input, $0.24/min audio output, ~300ms p50 end-to-end latency.

gpt-4o-mini-realtime-preview: the cheaper streaming variant, roughly $0.05/min blended. ~450ms p50 latency; voice quality is acceptable for most business use.

gpt-4o-audio-preview: HTTP-based and non-streaming. Audio in and out in a single request-response, suited to async workloads (voicemail transcription + reply) rather than live conversation.

WebSocket Setup in 15 Lines

Minimum viable connection (Python):

import asyncio, base64, json, os
import websockets

async def voice_session():
    async with websockets.connect(
        "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview",
        additional_headers={  # older websockets versions call this extra_headers
            "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
            "OpenAI-Beta": "realtime=v1"
        }
    ) as ws:
        # Configure session
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {"voice": "alloy", "instructions": "Be concise."}
        }))
        # Send audio chunks (base64-encoded PCM16, 24 kHz mono).
        # Placeholder chunk: 100ms of silence; feed real mic audio here.
        base64_audio_chunk = base64.b64encode(b"\x00" * 4800).decode("ascii")
        await ws.send(json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64_audio_chunk
        }))
        # Receive audio + text deltas
        async for msg in ws:
            print(json.loads(msg).get("type"))

asyncio.run(voice_session())

For production, add: audio VAD (voice activity detection), interruption handling, turn-taking logic, function calling for tool use.
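Of the production additions above, server-side VAD is the one the API can do for you: a richer session.update payload turns on turn detection so the model decides when the user has stopped speaking. This is a minimal sketch; the field names (turn_detection, server_vad, and the threshold/padding knobs) follow OpenAI's Realtime API reference at snapshot time, so verify them against the current docs before shipping.

```python
import json

def build_session_config(voice: str = "alloy") -> str:
    """Build a session.update payload that enables server-side VAD."""
    payload = {
        "type": "session.update",
        "session": {
            "voice": voice,
            "instructions": "Be concise.",
            # Server-side voice activity detection: the model detects
            # end of speech and starts responding on its own.
            "turn_detection": {
                "type": "server_vad",
                "threshold": 0.5,           # speech-probability cutoff
                "prefix_padding_ms": 300,   # audio kept before speech onset
                "silence_duration_ms": 500  # silence that ends a turn
            },
        },
    }
    return json.dumps(payload)
```

Send this once right after the WebSocket opens, in place of the bare session.update in the snippet above.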

Cost Math: 3 Production Scales

Assume average call: 2 minutes user audio, 3 minutes agent audio.

Small — 1,000 calls/month: 1,000 × (2 min × $0.06 + 3 min × $0.24) = 1,000 × $0.84 ≈ $840/month.

Mid — 50,000 calls/month (call center): 50,000 × $0.84 ≈ $42,000/month.

Enterprise — 500,000 calls/month: 500,000 × $0.84 ≈ $420,000/month.

Comparison to ElevenLabs Conversational (at $0.30/min blended): 50,000 calls × 5 min × $0.30 ≈ $75,000/mo. GPT-4o-mini-Realtime (~$0.05/min blended) runs the same volume for ~$12,500/mo, roughly 6× cheaper than ElevenLabs with similar quality for most business voice use.
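The figures at all three scales are straight arithmetic on the posted per-minute prices; a minimal sketch for plugging in your own call shape:

```python
INPUT_PER_MIN = 0.06   # $/min of user audio (posted price at snapshot)
OUTPUT_PER_MIN = 0.24  # $/min of agent audio

def call_cost(user_min: float, agent_min: float) -> float:
    """Cost of one call: input minutes plus output minutes."""
    return user_min * INPUT_PER_MIN + agent_min * OUTPUT_PER_MIN

def monthly_cost(calls: int, user_min: float = 2.0, agent_min: float = 3.0) -> float:
    """Monthly spend at a given call volume, using the average
    call shape assumed above (2 min user, 3 min agent)."""
    return calls * call_cost(user_min, agent_min)
```

call_cost(2, 3) gives the $0.84/call used above; monthly_cost(50_000) reproduces the $42,000 mid-scale figure.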

Latency vs ElevenLabs Conversational + Gemini Live

End-to-end (user stop speaking → agent starts speaking):

Model | p50 latency | p95 | Voice polish | Cost
GPT-4o-Realtime | 300ms | 500ms | Good | ~$0.17/min blended (2:3 in/out mix)
GPT-4o-mini-Realtime | 450ms | 700ms | Acceptable | ~$0.05/min
ElevenLabs Conversational | 250ms | 400ms | Best | $0.30-0.40/min
Gemini 3.1 Flash Live | 350ms | 550ms | Good | $0.04/min

All four cross the "feels conversational" threshold (<500ms). Pick based on cost or voice polish preference.

When to Pick Each

Your app | Pick | Why
Customer service voice agent, mid-volume | GPT-4o-Realtime | Best balance
High-volume consumer voice assistant | GPT-4o-mini-Realtime | Cheapest
Premium brand voice (character cloning) | ElevenLabs Conversational | Voice polish
Google ecosystem integration | Gemini 3.1 Flash Live | Native
Async voicemail transcription + response | gpt-4o-audio-preview | Simpler HTTP request-response
Ultra-low-latency (gaming, interactive media) | ElevenLabs Conversational | 250ms p50 edge
Cost-first startup voice feature | Gemini 3.1 Flash Live | $0.04/min
B2B product that needs voice QA logs | Any with diarization + logging | Logging flexibility

FAQ

Is GPT-4o-Realtime a single model or pipeline?

Single unified model — OpenAI does voice-to-voice in one forward pass, not an STT→LLM→TTS pipeline. This is the main latency advantage over chained approaches.

Can I use Realtime with a proxy/aggregator like TokenMix.ai?

Most aggregators forward WebSocket traffic. TokenMix.ai supports GPT-4o-Realtime via transparent WebSocket proxy. Adds ~20-50ms of latency for routing but simplifies multi-provider fallback.

Does Realtime support function calling / tool use?

Yes, native. Send a tools array in the session config; the model invokes tools mid-conversation, you execute them and return the results, and the conversation continues. The schema works like GPT-5.4 function calling.
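That loop can be sketched as a tool definition plus a handler that turns the model's function-call event into a result message. The get_order_status tool here is invented for illustration, and the event and item field names (call_id, function_call_output, conversation.item.create) follow the Realtime API reference at snapshot time; check them against current docs.

```python
import json

# Hypothetical tool for the session's tools array. The schema mirrors
# Chat Completions function definitions.
TOOLS = [{
    "type": "function",
    "name": "get_order_status",
    "description": "Look up the status of an order by ID.",
    "parameters": {
        "type": "object",
        "properties": {"order_id": {"type": "string"}},
        "required": ["order_id"],
    },
}]

def handle_tool_call(event: dict) -> str:
    """Execute a model-requested tool and build the result message
    to send back over the WebSocket."""
    args = json.loads(event["arguments"])
    # Stand-in for a real backend lookup.
    result = {"order_id": args["order_id"], "status": "shipped"}
    return json.dumps({
        "type": "conversation.item.create",
        "item": {
            "type": "function_call_output",
            "call_id": event["call_id"],
            "output": json.dumps(result),
        },
    })
```

After sending the result, ask the model to continue the turn so it can speak the answer to the user.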

What's the audio format?

PCM16 at 24kHz, mono. Base64-encoded in WebSocket messages. Most voice SDKs (Twilio, Daily, LiveKit) support this or can transcode.
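A quick way to see that wire format: generate a PCM16 24kHz mono test tone and base64-wrap it the way input_audio_buffer.append expects. A minimal sketch using only the standard library:

```python
import base64
import math
import struct

SAMPLE_RATE = 24_000  # Realtime expects PCM16, 24 kHz, mono

def sine_chunk_b64(freq_hz: float = 440.0, ms: int = 100) -> str:
    """Generate a test tone as little-endian 16-bit PCM samples,
    base64-encoded for embedding in a WebSocket JSON message."""
    n = SAMPLE_RATE * ms // 1000
    samples = (
        int(32767 * 0.5 * math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE))
        for i in range(n)
    )
    pcm = struct.pack(f"<{n}h", *samples)  # n signed 16-bit samples
    return base64.b64encode(pcm).decode("ascii")
```

Real mic audio captured at another sample rate must be resampled to 24kHz before encoding; most voice SDKs handle that transcoding for you.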

Can I interrupt the assistant mid-speech?

Yes. Send input_audio_buffer.append any time — the model halts speaking, processes new input, continues. This is what makes the UX feel natural.

How does pricing compare to Whisper + GPT + TTS pipeline?

Pipeline cost at a 5-min call: Whisper ($0.03) + GPT-5.4 (~$0.10) + TTS ($0.05) ≈ $0.18. GPT-4o-Realtime at the posted per-minute rates (2 min in, 3 min out): $0.84, roughly 4-5× the pipeline cost. In exchange, latency is ~3× better and voice tone/emotion is preserved. For user-facing agents, often worth it.

Any rate limits I should know?

Tier 4: 20 concurrent WebSocket sessions. Enterprise: negotiable. For high-volume consumer voice apps, ElevenLabs or Gemini may have higher concurrent limits at similar pricing.



By TokenMix Research Lab · Updated 2026-04-24