TokenMix Research Lab · 2026-04-22

ElevenLabs Scribe v2: 150ms Latency Real-Time Speech API (2026)

ElevenLabs Scribe v2 Realtime streams audio in and returns transcriptions in ~150 milliseconds — the fastest major commercial speech-to-text API in 2026. April 2026 release updates added multimodal message support, guardrail events, and DTMF input, completing the voice agent feature set. This review examines real latency, transcription accuracy, pricing, and comparison to OpenAI Realtime and Gemini Live for building production voice agents. TokenMix.ai includes Scribe v2 in its voice API routing, letting teams combine ElevenLabs transcription with reasoning from any LLM provider.

Confirmed vs Speculation: Scribe v2 Facts
150ms Latency: What That Actually Means
Accuracy Benchmarks vs OpenAI Whisper and Gemini
April 2026 Feature Additions
Scribe v2 vs OpenAI Realtime vs Gemini Live
Pricing at Real Usage Levels
When to Use Scribe v2 and When Not To
FAQ

Confirmed vs Speculation: Scribe v2 Facts

Claim	Status	Source
Scribe v2 Realtime available	Confirmed	ElevenLabs docs
~150ms STT latency (p50)	Confirmed	Official specs
Multimodal message support (April 2026)	Confirmed	ElevenLabs changelog
Guardrail triggered event	Confirmed	API docs
DTMF input support	Confirmed	April release notes
Beats Whisper on accuracy	Partial — depends on language/domain	Community benchmarks
Cheapest real-time STT	No — Whisper API often cheaper for batch	—
Works offline	No — cloud API only	—

Bottom line: real latency leader for commercial real-time STT. Pricing is premium but justified for voice agent use cases.

150ms Latency: What That Actually Means

"150ms latency" refers to time from last audio chunk sent to transcript returned. Breakdown:

Stage	Typical time
Audio chunk upload (WebSocket)	10-30ms
Speech-to-text inference	60-100ms
Network return	10-30ms
Total (p50)	150ms
Total (p95)	250ms
Total (p99)	400ms
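
To check figures like these against your own traffic, you can compute percentiles from measured round-trip times. The sketch below is a minimal example using only the Python standard library; the function name and the sample data are illustrative assumptions, not part of any ElevenLabs SDK.

```python
import statistics

def latency_percentiles(samples_ms):
    """Return p50/p95/p99 for a list of measured latencies in milliseconds."""
    if len(samples_ms) < 2:
        raise ValueError("need at least two samples")
    # quantiles(n=100) returns 99 cut points; index k-1 is the k-th percentile.
    cuts = statistics.quantiles(samples_ms, n=100)
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}

# Hypothetical measurements (ms) from "last audio chunk sent" to "transcript received".
samples = [142, 150, 148, 155, 160, 151, 149, 240, 153, 147, 410, 152]
print(latency_percentiles(samples))
```

Tail percentiles need many samples to be meaningful, so treat p99 from a short run as a rough indication only.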

Why 150ms matters: human conversation has ~200-300ms turn-taking gaps. For an AI voice agent to feel like a phone call (not a walkie-talkie), the pipeline must finish transcription, LLM inference, and TTS in roughly 400-500ms total. With 150ms STT + 150ms LLM + 150ms TTS, you land at 450ms, which is barely natural. With slower STT at 400-600ms (like older systems) and the same LLM and TTS times, you're at 700-900ms and conversation feels laggy.
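
The budget arithmetic can be sketched in a few lines. This assumes a strictly sequential STT -> LLM -> TTS pipeline and holds the LLM and TTS stages at 150ms each; real systems that stream partial results between stages can overlap these steps.

```python
def pipeline_latency_ms(stt_ms, llm_ms, tts_ms):
    """Total response latency for a sequential STT -> LLM -> TTS pipeline."""
    return stt_ms + llm_ms + tts_ms

# Fast STT, as described for Scribe v2.
print(pipeline_latency_ms(150, 150, 150))  # 450

# Slower STT (400-600 ms) with the same LLM and TTS stages.
print(pipeline_latency_ms(400, 150, 150))  # 700
print(pipeline_latency_ms(600, 150, 150))  # 900
```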

Accuracy Benchmarks vs OpenAI Whisper and Gemini

WER (Word Error Rate) on standard test sets, April 2026:

Model	English WER	Multilingual avg WER	Streaming
ElevenLabs Scribe v2	4.1%	5.8%	Yes, 150ms
OpenAI Whisper Large v3	4.3%	6.1%	Yes, 400ms
OpenAI Realtime	4.5%	6.5%	Yes, 350ms
Google Speech-to-Text V2	4.8%	6.2%	Yes, 300ms
Gemini Live (integrated)	~5.0%	~6.4%	Yes, 250ms
Deepgram Nova-3	4.6%	6.0%	Yes, 200ms

Scribe v2 leads on accuracy and latency simultaneously. Trade-off: at the list prices cited in the pricing section ($0.25/min vs $0.006/min), it is roughly 40× more expensive than the Whisper API for comparable batch workloads.
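
For readers who want to reproduce WER on their own audio: WER is the word-level edit distance (substitutions + deletions + insertions) divided by the number of words in the reference transcript. The sketch below is a minimal implementation; lowercasing and whitespace tokenization are assumptions here, since the text normalization behind the benchmark figures above is not stated.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)

# Two substitutions over four reference words.
print(word_error_rate("press one for sales", "press won for sale"))  # 0.5
```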

Domain performance:

Clean studio audio: all models within 1 percentage point
Noisy environments (car, cafe, call center): Scribe v2 pulls ahead by 2-4pp
Accented English, code-switching: Scribe v2's margin widens further

April 2026 Feature Additions

New additions that matter for production:

1. multimodal_message WebSocket event — send audio and images together in the same stream. Useful for voice-first agents that also handle screenshots, documents, or live video frames.

2. onGuardrailTriggered callback — server-side content filtering fires a client-side event when harmful content is detected. Lets you react in real-time instead of discovering issues after the fact.

3. DTMF input — touch-tone detection from phone integrations. Lets Scribe handle "press 1 for sales" scenarios alongside speech.

4. Scoped analysis and test folders — better organization for production deployments with multiple agent configs.

5. useConversationControls hook (React SDK) — simpler React integration for building voice agent UIs.
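
On the client side, events like these are usually handled by dispatching on a message type. The sketch below shows the general pattern only: the JSON shape, the "type" field, the type strings, and the "reason" and "digit" fields are all hypothetical placeholders, not the documented ElevenLabs schema. Check the API reference for the real event names and payloads.

```python
import json

def on_guardrail(event):
    # Hypothetical payload field: "reason".
    return "guardrail: " + event.get("reason", "unspecified")

def on_dtmf(event):
    # Hypothetical payload field: "digit".
    return "dtmf digit: " + event.get("digit", "?")

# Hypothetical event type strings mapped to handlers.
HANDLERS = {
    "guardrail_triggered": on_guardrail,
    "dtmf": on_dtmf,
}

def dispatch(raw_message):
    """Parse one JSON message and route it to the matching handler."""
    event = json.loads(raw_message)
    handler = HANDLERS.get(event.get("type", ""))
    if handler is None:
        return "ignored"
    return handler(event)

print(dispatch('{"type": "dtmf", "digit": "1"}'))  # dtmf digit: 1
```

Unknown event types are ignored rather than raising, so a client written this way keeps working when the server adds new events.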

Scribe v2 vs OpenAI Realtime vs Gemini Live

Dimension	ElevenLabs Scribe v2	OpenAI Realtime	Gemini Live
STT latency	150ms	350ms	250ms
STT WER	4.1%	4.5%	5.0%
Integrated LLM	No (BYO)	GPT-5.4 bundled	Gemini 3.1 Flash bundled
Integrated TTS	ElevenLabs voices	OpenAI voices	Gemini TTS
Voice polish	Best-in-class	Good	Good
Voice cloning	Yes	Limited	No
End-to-end latency (full loop)	400-500ms (custom pipeline)	300-500ms	300-450ms
Pricing model	Per-minute streaming	Per-token (audio tokens)	Per-token
Flexibility	High (swap components)	Low	Medium

Positioning:

Scribe v2: best if you want best-in-class STT and will bring your own LLM (via OpenAI/Claude/Gemini API)
OpenAI Realtime: best if you want one vendor for STT+LLM+TTS with lowest integration work
Gemini Live: best if you're already on Google's stack

See our voice AI API comparison for the full three-way analysis.

Pricing at Real Usage Levels

ElevenLabs Scribe v2 Realtime pricing (April 2026):

Base: $0.25 per minute of audio processed
Volume tier (>10K min/mo): $0.20 per minute
Enterprise (>100K min/mo): $0.15 per minute with dedicated capacity

Monthly cost estimates:

Use case	Audio minutes/mo	Cost/mo
Solo developer testing	500	$125
Small startup voice agent	5,000	$1,250
Mid-sized customer support	50,000	$10,000
Enterprise call center	500,000	$75,000 (volume discount)

Compare to OpenAI Whisper API at $0.006/min (batch) — 40× cheaper but no streaming. For voice agents where 150ms matters, Scribe v2 is worth the premium. For batch transcription of recorded calls/podcasts, Whisper is the right pick.
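
The monthly estimates follow directly from the per-minute rates listed above. The sketch below assumes the tier rate applies to all minutes in the month (a flat rate chosen by volume, not marginal pricing), which is the reading consistent with the $75,000 enterprise figure; confirm the actual billing rules with ElevenLabs.

```python
def scribe_monthly_cost(minutes):
    """Monthly cost in USD, assuming one flat per-minute rate chosen by volume."""
    if minutes > 100_000:
        rate = 0.15  # enterprise tier
    elif minutes > 10_000:
        rate = 0.20  # volume tier
    else:
        rate = 0.25  # base rate
    return round(minutes * rate, 2)

for minutes in (500, 5_000, 50_000, 500_000):
    print(minutes, scribe_monthly_cost(minutes))
# 500 125.0
# 5000 1250.0
# 50000 10000.0
# 500000 75000.0

# Base-rate price ratio against Whisper API batch pricing ($0.006/min).
print(round(0.25 / 0.006, 1))  # 41.7
```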

When to Use Scribe v2 and When Not To

Use Scribe v2 when:

Building voice agents where conversation feel matters (customer service, virtual assistants)
Handling noisy audio (car, cafe, call center)
Real-time captioning where latency < 250ms is required
You already use ElevenLabs for TTS and want unified billing
Voice cloning is in your roadmap

Don't use Scribe v2 when:

Batch transcribing recorded audio — use Whisper or Google Cloud STT
Budget is the primary constraint — Whisper is 40× cheaper
You need offline/on-device — no ElevenLabs option
You want integrated LLM — OpenAI Realtime or Gemini Live are simpler

FAQ

Is Scribe v2 really 150ms end-to-end?

For the STT portion, yes at p50. p95 is ~250ms, p99 ~400ms. "End-to-end voice agent" latency (STT → LLM → TTS) is higher, typically 400-500ms total when using Scribe v2 + Claude/GPT + ElevenLabs TTS. That is competitive with the integrated stacks, just not a single 150ms number.

How does Scribe v2 compare to Whisper API?

Whisper API is batch (no streaming), ~40× cheaper per minute of audio, but has 400-800ms total response latency. Scribe v2 is streaming, 150ms latency, and 2-4pp better on WER for noisy/multilingual audio. Use Whisper for recorded audio; use Scribe v2 for real-time voice agents.

Can I use Scribe v2 with Claude or GPT instead of ElevenLabs' voice agent?

Yes. Scribe v2 is a standalone STT API — you can pipe transcripts to any LLM and pipe LLM responses to any TTS (including ElevenLabs TTS, OpenAI TTS, or Gemini 3.1 Flash TTS). This is the most flexible architecture and what we recommend for production voice agents.
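
A minimal sketch of this bring-your-own-components architecture follows. Each stage is just a callable, so any STT, LLM, or TTS provider can be swapped in. The class name and the stub functions are illustrative assumptions; in a real system each stub would be replaced by a call to the provider's SDK.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class VoicePipeline:
    transcribe: Callable[[bytes], str]  # STT: audio in, text out
    reason: Callable[[str], str]        # LLM: transcript in, reply text out
    synthesize: Callable[[str], bytes]  # TTS: reply text in, audio out

    def respond(self, audio: bytes) -> bytes:
        transcript = self.transcribe(audio)
        reply = self.reason(transcript)
        return self.synthesize(reply)

# Stubs standing in for real provider calls.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")

def echo_llm(transcript: str) -> str:
    return "You said: " + transcript

def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")

pipeline = VoicePipeline(transcribe=fake_stt, reason=echo_llm, synthesize=fake_tts)
print(pipeline.respond(b"hello"))  # b'You said: hello'
```

Keeping the stages behind plain callables makes it easy to swap one provider without touching the other two.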

Does Scribe v2 work in non-English languages?

Yes, 40+ languages supported with reasonable quality. Best in English, Spanish, French, German, Mandarin, Japanese. Quality degrades for low-resource languages but remains usable. Multilingual WER averages 5.8% — best in class.

How do I handle rate limits on Scribe v2?

Scribe v2 rate limits are generous for paying customers (typically not the bottleneck). For burst traffic or redundancy, use TokenMix.ai's voice routing which falls back to Gemini Live STT or OpenAI Realtime when Scribe v2 is unavailable. See our voice AI comparison article for the full fallback architecture.
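
The general shape of such a fallback is an ordered list of providers tried in turn. The sketch below illustrates that pattern only and is not TokenMix.ai's implementation; the provider names and stub functions are hypothetical.

```python
def transcribe_with_fallback(audio, providers):
    """Try each (name, transcribe) pair in order; return the first success."""
    errors = []
    for name, transcribe in providers:
        try:
            return name, transcribe(audio)
        except Exception as exc:  # broad on purpose: any failure triggers fallback
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))

# Stubs standing in for real STT clients.
def primary_down(audio):
    raise ConnectionError("rate limited")

def secondary_ok(audio):
    return "press one for sales"

providers = [("primary-stt", primary_down), ("secondary-stt", secondary_ok)]
print(transcribe_with_fallback(b"...", providers))
# ('secondary-stt', 'press one for sales')
```

In a real-time setting you would likely also add a per-provider timeout, since a slow primary can hurt conversation feel as much as a failed one.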

Is Scribe v2 production-ready as of April 2026?

Yes. In production at major customer service platforms and voice agent startups. Uptime historically 99.9%+. The April 2026 feature additions (multimodal, guardrails, DTMF) round out the last production-critical gaps.

What's the biggest downside of Scribe v2?

Cost. At $0.25/min base, a high-volume voice agent can exceed $10K/mo just on STT. If you're price-sensitive, Gemini Live's integrated pricing ($0.02-0.05/min effective) is much cheaper, trading some accuracy for cost.
