TokenMix Research Lab · 2026-06-08

OpenAI Realtime Voice 2026: $32 Audio, Cost and Latency Traps
Last Updated: 2026-06-08 | Author: TokenMix Research Lab | Data verified: 2026-06-08 against OpenAI Realtime and audio docs, GPT-Realtime-2 model page, Realtime cost guide, pricing page, and May 7 voice model announcement
OpenAI Realtime Voice is mature enough to test for production use, but the cost trap is conversation growth: the entire conversation is sent on each turn, so later turns get more expensive.
OpenAI introduced GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper on May 7, 2026. The pricing page lists GPT-Realtime-2 audio at $32 input and $64 output per million tokens, Translate at $0.034/minute, and Whisper at $0.017/minute. OpenAI's cost guide says Realtime costs accrue when a Response is created and the entire conversation is sent each turn, so session design matters as much as model choice.
Table of Contents
- Quick Verdict
- Model and Price Table
- Billing Mechanics
- Cost Math
- Architecture Choice
- Latency and Tool Risk
- Implementation Pattern
- Search Intent Map
- Cost Per Task Calculator
- Decision Matrix
- Monitoring Checklist
- Non-Claims and Caveats
- Final Recommendation
- FAQ
- Sources
- Related Articles
Quick Verdict
| Claim | Status | Source |
|---|---|---|
| OpenAI announced three new audio models on May 7, 2026 | Confirmed | OpenAI announcement |
| GPT-Realtime-2 is described as a voice model with GPT-5-class reasoning | Confirmed | OpenAI announcement |
| GPT-Realtime-Translate supports 70+ input languages and 13 output languages | Confirmed | OpenAI announcement |
| GPT-Realtime-2 audio input is listed at $32 per 1M tokens | Confirmed | OpenAI pricing |
| Realtime API costs accrue when a Response is created | Confirmed | OpenAI Realtime costs |
| Network bandwidth is currently charged separately for Realtime API | False | OpenAI cost docs say no current bandwidth/connection cost |
| Realtime voice is always cheaper than chained STT-LLM-TTS | False | Architecture depends on latency and control needs |
| Voice agents will need stricter per-session caps than text chat | Likely | Conversation state and audio output can compound cost |
Model and Price Table
| Model | Use case | Price signal | Status |
|---|---|---|---|
| gpt-realtime-2 audio input | Live voice agent | $32/1M tokens | Confirmed |
| gpt-realtime-2 cached audio input | Reused audio input | $0.40/1M tokens | Confirmed |
| gpt-realtime-2 audio output | Spoken response | $64/1M tokens | Confirmed |
| gpt-realtime-2 text input | Text in session | $4/1M tokens | Confirmed |
| gpt-realtime-2 text output | Text response | $24/1M tokens | Confirmed |
| gpt-realtime-translate | Live translation | $0.034/minute | Confirmed |
| gpt-realtime-whisper | Live transcription | $0.017/minute | Confirmed |
For adjacent OpenAI cost planning, use OpenAI API Cost, OpenAI API Verification, and AI API Gateway.
Billing Mechanics
| Mechanic | OpenAI doc signal | Cost implication | Status |
|---|---|---|---|
| Response created | Cost accrues | Avoid unnecessary responses | Confirmed |
| VAD | Empty audio filtered | VAD can reduce input waste | Confirmed |
| Entire conversation sent | Later turns cost more | Truncate or summarize | Confirmed |
| Audio token unit | User audio 1 token/100ms, assistant 1 token/50ms | Output speech is dense | Confirmed |
| Truncation | Old items dropped after limit | Cache can be affected | Confirmed |
| Retention ratio | Can drop extra messages | Cost-memory tradeoff | Confirmed |
Voice cost is not just minutes. It is audio tokens, text tokens, retained conversation state, tool calls, and output length.
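To make the growth effect concrete, here is a minimal sketch that converts speaking time into audio tokens using the rates in the table above and prices each turn. It assumes, for illustration only, that all earlier audio (user and assistant) is re-billed as uncached audio input on every new Response, with no truncation and no cache discount. The function name `turn_costs` is hypothetical.

```python
# Token rates from the billing table; prices from the pricing table.
USER_TOKENS_PER_SEC = 10        # 1 token per 100 ms of user audio
ASSISTANT_TOKENS_PER_SEC = 20   # 1 token per 50 ms of assistant audio
AUDIO_INPUT_PRICE = 32.0        # USD per 1M audio input tokens
AUDIO_OUTPUT_PRICE = 64.0       # USD per 1M audio output tokens


def turn_costs(turns):
    """Return the USD cost of each turn for (user_seconds, assistant_seconds) pairs.

    Assumption (not taken from OpenAI docs): every earlier audio token is
    billed again as uncached audio input when a new Response is created.
    """
    history_tokens = 0
    costs = []
    for user_sec, assistant_sec in turns:
        user_tokens = user_sec * USER_TOKENS_PER_SEC
        output_tokens = assistant_sec * ASSISTANT_TOKENS_PER_SEC
        # Input for this turn is the retained history plus the new user audio.
        input_tokens = history_tokens + user_tokens
        cost = (input_tokens * AUDIO_INPUT_PRICE
                + output_tokens * AUDIO_OUTPUT_PRICE) / 1_000_000
        costs.append(cost)
        history_tokens = input_tokens + output_tokens
    return costs


if __name__ == "__main__":
    # Three identical turns: 10 s of user speech, 10 s of assistant speech.
    for i, cost in enumerate(turn_costs([(10, 10)] * 3), start=1):
        print(f"turn {i}: ${cost:.4f}")
```

Under these assumptions the output cost per turn stays flat while the input cost climbs with every turn, which is why truncation and summarization matter more as sessions get longer.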
Cost Math
Scenario 1: 10-minute live translation session. At $0.034/minute, GPT-Realtime-Translate costs about $0.34 before any surrounding app cost.
Scenario 2: 10-minute live transcription session. At $0.017/minute, GPT-Realtime-Whisper costs about $0.17.
Scenario 3: voice agent with 100K audio input tokens and 80K audio output tokens. At $32/$64 per 1M, cost is $3.20 + $5.12 = $8.32. That is why output discipline matters.
| Scenario | Unit assumption | Estimated cost | Main control |
|---|---|---|---|
| Live translation, 10 min | $0.034/min | $0.34 | Route translation-only |
| Live transcription, 10 min | $0.017/min | $0.17 | Use transcription session |
| Voice agent, 100K in / 80K out | $32/$64 per 1M | $8.32 | Short responses |
| 1,000 support calls at $0.50 | Per-session blended | $500 | Per-call cap |
| 10,000 calls at $0.50 | Per-session blended | $5,000 | Routing and escalation |
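The first three rows of the table can be reproduced with a few lines of arithmetic. This is a sketch using the list prices quoted in this article; the helper names `token_cost` and `minute_cost` are hypothetical, and the numbers should be rechecked against the pricing page before use.

```python
# List prices quoted in this article (USD).
AUDIO_INPUT_PER_MTOK = 32.0
AUDIO_OUTPUT_PER_MTOK = 64.0
PER_MINUTE = {
    "gpt-realtime-translate": 0.034,
    "gpt-realtime-whisper": 0.017,
}


def token_cost(audio_in_tokens, audio_out_tokens):
    """Cost of a gpt-realtime-2 session billed by audio tokens."""
    return (audio_in_tokens * AUDIO_INPUT_PER_MTOK
            + audio_out_tokens * AUDIO_OUTPUT_PER_MTOK) / 1_000_000


def minute_cost(model, minutes):
    """Cost of a per-minute translation or transcription session."""
    return PER_MINUTE[model] * minutes


if __name__ == "__main__":
    print(f"translation, 10 min:   ${minute_cost('gpt-realtime-translate', 10):.2f}")
    print(f"transcription, 10 min: ${minute_cost('gpt-realtime-whisper', 10):.2f}")
    print(f"voice agent 100K/80K:  ${token_cost(100_000, 80_000):.2f}")
```

In Scenario 3 the output side accounts for $5.12 of the $8.32, even though it has fewer tokens than the input side.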
Architecture Choice
| Architecture | Use when | Cost risk | Status |
|---|---|---|---|
| Speech-to-speech Realtime | Natural low-latency conversation | Audio output cost | Confirmed |
| Translation session | Continuous interpreter | Minute cost | Confirmed |
| Transcription session | Need live transcript only | Lower than full voice agent | Confirmed |
| Chained STT -> LLM -> TTS | Need deterministic control | More moving parts | Confirmed |
| Text-first fallback | Voice optional | Lower latency/cost risk | Likely |
OpenAI says Realtime sessions are best for live audio that needs low latency. Request-based audio APIs are better for bounded files or speech generation that does not need a live session.
Latency and Tool Risk
| Risk | Symptom | Fix | Status |
|---|---|---|---|
| Tool call delay | Awkward silence | Speak status preamble | Likely |
| Long session memory | Later turns cost more | Retention ratio | Confirmed |
| Output verbosity | High audio output tokens | Short response policy | Confirmed |
| Bad VAD settings | Empty audio or missed speech | Tune VAD on real audio | Confirmed |
| Browser key exposure | Secret leaked | Use ephemeral tokens | Confirmed |
| User abuse | One user burns quota | Safety identifier and caps | Confirmed |
Voice UX makes cost errors visible. A text chatbot can be slow quietly; a voice agent fails in front of the user.
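The "caps" fix in the table can be as simple as a running budget per session. The sketch below is one possible shape, not an OpenAI feature: `SessionBudget` is a hypothetical class, and it assumes your application already records audio token counts for each response by whatever means it has available.

```python
from dataclasses import dataclass


@dataclass
class SessionBudget:
    """Track estimated spend for one voice session and stop at a cap."""
    max_usd: float
    audio_input_price: float = 32.0    # USD per 1M tokens, from this article
    audio_output_price: float = 64.0   # USD per 1M tokens, from this article
    spent_usd: float = 0.0

    def record(self, audio_in_tokens, audio_out_tokens):
        """Add the estimated cost of one response and return the running total."""
        self.spent_usd += (audio_in_tokens * self.audio_input_price
                           + audio_out_tokens * self.audio_output_price) / 1_000_000
        return self.spent_usd

    def allow_next_response(self):
        """True while the session is still under its cap."""
        return self.spent_usd < self.max_usd


if __name__ == "__main__":
    budget = SessionBudget(max_usd=0.50)
    budget.record(5_000, 5_000)
    print(budget.allow_next_response())  # still under the cap
    budget.record(1_000, 0)
    print(budget.allow_next_response())  # cap reached: escalate or end the call
```

What happens when the cap is hit (a spoken handoff, a text fallback, or a hard stop) is a product decision; the point is that the check runs before the next response is created.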
Implementation Pattern
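```typescript
import { RealtimeAgent, RealtimeSession } from "@openai/agents/realtime";

// Keep spoken answers short: audio output is the most expensive token class.
const agent = new RealtimeAgent({
  name: "SupportVoice",
  instructions: "Keep answers under 12 seconds unless the user asks for detail.",
});

const session = new RealtimeSession(agent, { model: "gpt-realtime-2" });

// Placeholder value: mint an ephemeral key on your server and pass it to the client.
await session.connect({ apiKey: "ephemeral_key_from_server" });
```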
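```python
def voice_route(goal, needs_low_latency):
    """Pick the cheapest route that still meets the job's requirements."""
    if goal == "translation":
        return "gpt-realtime-translate"
    if goal == "transcription_only":
        return "gpt-realtime-whisper"
    if needs_low_latency:
        return "gpt-realtime-2"
    # No live-latency requirement: a chained pipeline gives more control.
    return "chained_stt_llm_tts"
```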
Search Intent Map
| Search query | What the user really needs | Best answer | Status |
|---|---|---|---|
| openai realtime voice | A current, non-marketing answer | Compare official limits and cost controls | Confirmed |
| openai realtime voice pricing | Whether this becomes a monthly bill | Use per-task math, not sticker price | Confirmed |
| openai realtime voice free | Whether a no-cost path exists | Treat free quota as testing capacity | Likely |
| openai realtime voice error | Why setup fails | Check auth, quota, region, and model access | Likely |
| openai realtime voice alternative | Whether another route is safer | Compare direct API, gateway, and self-hosting | Likely |
This is the reason the article is structured around tables instead of a narrative review. Search traffic for these terms usually comes from blocked developers, not readers browsing AI news.
Cost Per Task Calculator
| Cost component | Formula | Why it matters | Status |
|---|---|---|---|
| Input tokens | input MTok x input price | Long prompts dominate retrieval and agents | Confirmed |
| Output tokens | output MTok x output price | Reasoning and verbose answers compound cost | Confirmed |
| Retry waste | failed calls x average cost | 429 and timeout loops become real spend | Likely |
| Human review | minutes saved or added x hourly rate | Tooling can shift, not remove, labor cost | Likely |
| Infrastructure | storage, runners, or hosted platform cost | Non-token cost often appears later | Confirmed |
Use this minimum calculator before choosing a provider: 30 days x calls per day x average input tokens x input price, plus 30 days x calls per day x average output tokens x output price. Then add retries. If the retry rate is 10%, your apparent price is already 1.1x before latency or support cost.
| Monthly calls | Avg input | Avg output | Token volume | Operational reading |
|---|---|---|---|---|
| 1,000 | 1K | 300 | 1M in / 0.3M out | Prototype |
| 10,000 | 2K | 600 | 20M in / 6M out | Small app |
| 100,000 | 4K | 1K | 400M in / 100M out | Production workload |
| 1,000,000 | 2K | 500 | 2B in / 500M out | Procurement problem |
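The minimum calculator described above fits in one function. This sketch assumes that each retry costs as much as a full call, which is a simplification, and the example uses the gpt-realtime-2 text prices from this article ($4 input, $24 output per 1M tokens). The name `monthly_cost` is hypothetical.

```python
def monthly_cost(calls_per_day, avg_input_tokens, avg_output_tokens,
                 input_price, output_price, retry_rate=0.0, days=30):
    """Estimate monthly spend in USD.

    Prices are USD per 1M tokens. retry_rate is the fraction of calls that
    are repeated; each retry is assumed to cost the same as a full call.
    """
    calls = days * calls_per_day
    base = calls * (avg_input_tokens * input_price
                    + avg_output_tokens * output_price) / 1_000_000
    return base * (1 + retry_rate)


if __name__ == "__main__":
    # 1,000 calls per day, 2K input and 500 output tokens, 10% retries.
    estimate = monthly_cost(1_000, 2_000, 500,
                            input_price=4.0, output_price=24.0, retry_rate=0.10)
    print(f"${estimate:,.2f} per month")
```

Swapping in audio prices and audio token counts gives the voice equivalent; the structure of the calculation does not change.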
Decision Matrix
| If your situation is... | Default move | Why | Confidence |
|---|---|---|---|
| You are still prototyping | Use the lowest-friction official route | Learning speed beats premature optimization | Likely |
| You have user-facing traffic | Add fallback and spend caps before launch | Users feel quota failures immediately | Confirmed |
| You have compliance constraints | Prefer direct vendor, cloud marketplace, or audited gateway | Procurement trail matters | Likely |
| You have high volume but flexible latency | Test batch or async processing | Batch discounts can beat realtime routes | Confirmed where documented |
| You have unknown token shape | Run a 7-day sample before committing | Average prompts hide tail risk | Likely |
| You need newest model features | Check direct provider docs first | Gateways and clouds may lag direct release | Likely |
The durable rule: do not optimize for the cheapest successful demo. Optimize for the cheapest successful month with logs, retries, fallback, and support.
```python
def pick_route(stage, traffic, compliance, latency_flexible):
    """Map the decision matrix to a default route. traffic is calls per month."""
    if stage == "prototype" and traffic < 1000:
        return "official_free_or_low_cost_route"
    if compliance == "strict":
        return "direct_vendor_or_cloud_marketplace"
    # Check batch eligibility before the generic high-traffic rule below.
    if latency_flexible and traffic > 100000:
        return "batch_or_async_route"
    if traffic > 10000:
        return "gateway_with_budget_caps"
    return "direct_api_with_monitoring"
```
Monitoring Checklist
| Metric | Alert threshold | Why | Status |
|---|---|---|---|
| 429 rate | >2% sustained | Quota is now user-visible | Confirmed |
| Retry multiplier | >1.1x | Hidden cost leak | Likely |
| Fallback rate | >10% | Primary route is unstable | Likely |
| Output/input ratio | Sudden 2x jump | Prompt or model behavior changed | Likely |
| Cost per successful task | Week-over-week increase | Real business KPI | Confirmed |
| Error by model | Any model-specific spike | Route or provider issue | Confirmed |
| User-level spend | Outlier user >5x median | Abuse or runaway workflow | Likely |
The operational test is simple: if you cannot answer which model, user, route, or retry loop created the cost, you are not ready to scale that workflow.
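A few of the thresholds in the checklist translate directly into code. The sketch below is illustrative: the metric names, `check_alerts`, and `spend_outliers` are hypothetical, and it assumes you already aggregate these numbers from your own logs.

```python
from statistics import median

# Alert thresholds taken from the checklist above.
THRESHOLDS = {
    "rate_429": 0.02,          # more than 2% of calls rejected
    "retry_multiplier": 1.1,   # more than 1.1x calls per successful task
    "fallback_rate": 0.10,     # more than 10% of calls on the fallback route
}


def check_alerts(metrics):
    """Return the names of metrics that exceed their threshold."""
    return [name for name, limit in THRESHOLDS.items()
            if metrics.get(name, 0.0) > limit]


def spend_outliers(spend_by_user, factor=5.0):
    """Return users whose spend is more than factor times the median spend."""
    if not spend_by_user:
        return []
    cutoff = factor * median(spend_by_user.values())
    return [user for user, spend in spend_by_user.items() if spend > cutoff]


if __name__ == "__main__":
    print(check_alerts({"rate_429": 0.03, "retry_multiplier": 1.05,
                        "fallback_rate": 0.12}))
    print(spend_outliers({"a": 1.0, "b": 1.2, "c": 0.8, "d": 9.0}))
```

The "sustained" and week-over-week conditions in the table need a time window, which this sketch leaves out.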
Non-Claims and Caveats
| Not claimed | Reason | Label |
|---|---|---|
| Universal benchmark superiority | No single benchmark covers every workload and provider route | False as a broad claim |
| Permanent free availability | Free tiers and previews can change | Speculation |
| Guaranteed model access in every region | Providers gate by region, tier, quota, or account status | False as a broad claim |
| Refund availability without official text | Refund terms must come from provider policy or support | Speculation |
| Identical pricing across direct API, cloud, and gateway | Routing layer, region, priority, and batch mode can change cost | False as a broad claim |
| Production safety from docs alone | Real workloads need logs and failure drills | Confirmed |
This article uses official docs for hard numbers and marks forward-looking guidance as Likely or Speculation. If a provider changes a price, model name, rate limit, or credit rule after the data verification date, the conclusion should be rechecked before procurement.
Final Recommendation
Use GPT-Realtime-2 for live voice agents that need low latency and tool use. Use Translate or Whisper when the job is only translation or transcription. Cap session length and output speech before launch.
FAQ
What is GPT-Realtime-2?
OpenAI describes GPT-Realtime-2 as its most capable realtime voice model with configurable reasoning effort, stronger instruction following, and tool use for voice-agent workflows.
How much does OpenAI Realtime voice cost?
The pricing page lists GPT-Realtime-2 audio at $32 input and $64 output per million tokens. Translate is listed at $0.034/minute and Whisper at $0.017/minute.
When does Realtime API billing happen?
OpenAI says costs accrue when a Response is created and are based on input and output tokens. Input transcription costs are the exception and are accounted for separately.
Is bandwidth billed?
OpenAI's Realtime cost guide says there is currently no cost for network bandwidth or connections.
Why do later turns cost more?
OpenAI says the entire conversation is sent to the model for each Response, so later turns include more context unless truncated or managed.
Should I use Realtime for transcription only?
No, a full speech-to-speech session is not needed for that. Use a transcription session or transcription model if you only need live text and not spoken model responses.
How do I control cost?
Use VAD, short voice responses, session truncation, per-call caps, ephemeral credentials, and separate translation/transcription routes.
Sources
- OpenAI Voice Model Announcement
- OpenAI GPT-Realtime-2 Model Page
- OpenAI Realtime Costs
- OpenAI Realtime and Audio
- OpenAI Voice Agents
- OpenAI API Pricing
- TokenMix OpenAI API Cost
- TokenMix OpenAI API Verification