TokenMix Research Lab · 2026-04-22

ElevenLabs Scribe v2: 150ms Latency Real-Time Speech API (2026)

ElevenLabs Scribe v2 Realtime streams audio in and returns transcriptions in ~150 milliseconds, making it the fastest major commercial speech-to-text API in 2026. The April 2026 release added multimodal message support, guardrail events, and DTMF input, completing the voice agent feature set. This review examines real latency, transcription accuracy, pricing, and how Scribe v2 compares with OpenAI Realtime and Gemini Live for building production voice agents. TokenMix.ai includes Scribe v2 in its voice API routing, letting teams combine ElevenLabs transcription with reasoning from any LLM provider.


Confirmed vs Speculation: Scribe v2 Facts

| Claim | Status | Source |
| --- | --- | --- |
| Scribe v2 Realtime available | Confirmed | ElevenLabs docs |
| ~150ms end-to-end latency | Confirmed | Official specs |
| Multimodal message support (April 2026) | Confirmed | ElevenLabs changelog |
| Guardrail triggered event | Confirmed | API docs |
| DTMF input support | Confirmed | April release notes |
| Beats Whisper on accuracy | Partial: depends on language/domain | Community benchmarks |
| Cheapest real-time STT | No: Whisper API often cheaper for batch | |
| Works offline | No: cloud API only | |

Bottom line: Scribe v2 is the latency leader among commercial real-time STT APIs. Pricing is premium but justified for voice agent use cases.

150ms Latency: What That Actually Means

"150ms latency" refers to the time from the last audio chunk being sent to the transcript being returned. Typical breakdown:

| Stage | Typical time |
| --- | --- |
| Audio chunk upload (WebSocket) | 10-30ms |
| Speech-to-text inference | 60-100ms |
| Network return | 10-30ms |
| Total (p50) | 150ms |
| Total (p95) | 250ms |
| Total (p99) | 400ms |
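To check percentile figures like these against your own traffic, a minimal sketch is shown below. It assumes you have already logged per-request latencies in milliseconds; the function name `latency_percentiles` and the sample values are illustrative and not part of any ElevenLabs SDK.

```python
import statistics


def latency_percentiles(samples_ms):
    """Return p50, p95 and p99 for a list of latency samples (milliseconds)."""
    if len(samples_ms) < 2:
        raise ValueError("need at least two samples")
    # quantiles(n=100) returns the 99 cut points p1..p99;
    # "inclusive" treats the samples as the full observed population.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


# Hypothetical measurements, for illustration only.
samples = [142, 148, 150, 151, 155, 160, 171, 190, 240, 395]
print(latency_percentiles(samples))
```

Tail percentiles need many samples to be meaningful; ten values, as above, only show the mechanics.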

Why 150ms matters: human conversation has turn-taking gaps of roughly 200-300ms. For an AI voice agent to feel like a phone call (not a walkie-talkie), the pipeline should finish transcription, LLM inference, and TTS in about 400ms total. With 150ms STT + 150ms LLM + 150ms TTS, you land at 450ms, which is barely natural. With slower STT at 400-600ms (like older systems) and the same LLM and TTS stages, you're at 700-900ms and the conversation feels laggy.
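The latency budget can be worked through as simple arithmetic. The sketch below assumes the three stages run strictly one after another and uses the 400ms target from the paragraph above; the function names are illustrative.

```python
def pipeline_latency_ms(stt_ms, llm_ms, tts_ms):
    """Total turn latency when STT, LLM and TTS run sequentially."""
    return stt_ms + llm_ms + tts_ms


def over_budget_ms(total_ms, budget_ms=400):
    """How far a turn exceeds the latency budget (0 if within budget)."""
    return max(0, total_ms - budget_ms)


fast = pipeline_latency_ms(stt_ms=150, llm_ms=150, tts_ms=150)
slow = pipeline_latency_ms(stt_ms=600, llm_ms=150, tts_ms=150)

print(fast, over_budget_ms(fast))  # 450 50
print(slow, over_budget_ms(slow))  # 900 500
```

Real pipelines often overlap stages (for example, starting TTS on the first LLM tokens), so a sequential sum like this is an upper bound rather than a prediction.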

Accuracy Benchmarks vs OpenAI Whisper and Gemini

WER (Word Error Rate) on standard test sets, April 2026:

| Model | English WER | Multilingual avg WER | Streaming |
| --- | --- | --- | --- |
| ElevenLabs Scribe v2 | 4.1% | 5.8% | Yes, 150ms |
| OpenAI Whisper Large v3 | 4.3% | 6.1% | Yes, 400ms |
| OpenAI Realtime | 4.5% | 6.5% | Yes, 350ms |
| Google Speech-to-Text V2 | 4.8% | 6.2% | Yes, 300ms |
| Gemini Live (integrated) | ~5.0% | ~6.4% | Yes, 250ms |
| Deepgram Nova-3 | 4.6% | 6.0% | Yes, 200ms |

Scribe v2 leads on accuracy and latency simultaneously. Trade-off: cost. At the per-minute rates quoted later in this article, it is roughly 40× more expensive than the Whisper API for comparable batch workloads.
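For readers who want to reproduce WER numbers on their own audio, here is a minimal sketch of the standard definition: word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The lowercase-and-split normalization is a simplifying assumption; published benchmarks usually apply heavier text normalization, so results will not match theirs exactly.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("press one for sales", "press won for sales"))  # 0.25
```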

Domain performance:

April 2026 Feature Additions

New additions that matter for production:

1. multimodal_message WebSocket event — send audio and images together in the same stream. Useful for voice-first agents that also handle screenshots, documents, or live video frames.

2. onGuardrailTriggered callback — server-side content filtering fires a client-side event when harmful content is detected. Lets you react in real-time instead of discovering issues after the fact.

3. DTMF input — touch-tone detection from phone integrations. Lets Scribe handle "press 1 for sales" scenarios alongside speech.

4. Scoped analysis and test folders — better organization for production deployments with multiple agent configs.

5. useConversationControls hook (React SDK) — simpler React integration for building voice agent UIs.
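How an application might react to events like these can be sketched with a small dispatcher. Everything about the payloads here is an assumption made for illustration: the `type`, `digit`, and `text` fields, the `dtmf` / `guardrail_triggered` / `transcript` labels, and the menu mapping are hypothetical and do not reflect the actual ElevenLabs wire format, which should be taken from the official API docs.

```python
import json

# Hypothetical IVR menu: touch-tone digit -> destination queue.
DTMF_MENU = {"1": "sales", "2": "support"}


def handle_event(raw: str) -> str:
    """Map one JSON event (hypothetical shape) to an application action."""
    event = json.loads(raw)
    kind = event.get("type")

    if kind == "dtmf":
        # "Press 1 for sales": unknown digits fall through to an operator.
        return "route:" + DTMF_MENU.get(event.get("digit"), "operator")
    if kind == "guardrail_triggered":
        # Stop the current turn instead of discovering the issue later.
        return "halt"
    if kind == "transcript":
        # Forward recognized speech to whichever LLM the app uses.
        return "llm:" + event.get("text", "")
    return "ignore"


print(handle_event('{"type": "dtmf", "digit": "1"}'))  # route:sales
```

Returning a plain action string keeps the sketch testable without a live connection; a real handler would call into routing, moderation, and LLM code instead.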

Scribe v2 vs OpenAI Realtime vs Gemini Live

| Dimension | ElevenLabs Scribe v2 | OpenAI Realtime | Gemini Live |
| --- | --- | --- | --- |
| STT latency | 150ms | 350ms | 250ms |
| STT WER | 4.1% | 4.5% | 5.0% |
| Integrated LLM | No (BYO) | GPT-5.4 bundled | Gemini 3.1 Flash bundled |
| Integrated TTS | ElevenLabs voices | OpenAI voices | Gemini TTS |
| Voice polish | Best-in-class | Good | Good |
| Voice cloning | Yes | Limited | No |
| End-to-end latency (full loop) | 400-500ms (custom pipeline) | 300-500ms | 300-450ms |
| Pricing model | Per-minute streaming | Per-token (audio tokens) | Per-token |
| Flexibility | High (swap components) | Low | Medium |

Positioning:

See our voice AI API comparison for the full three-way analysis.

Pricing at Real Usage Levels

ElevenLabs Scribe v2 Realtime pricing (April 2026):

Monthly cost estimates:

| Use case | Audio minutes/mo | Cost/mo |
| --- | --- | --- |
| Solo developer testing | 500 | $125 |
| Small startup voice agent | 5,000 | $1,250 |
| Mid-sized customer support | 50,000 | $10,000 |
| Enterprise call center | 500,000 | $75,000 (volume discount) |

Compare that with the OpenAI Whisper API at $0.006/min (batch): roughly 40× cheaper, but with no streaming. For voice agents where 150ms matters, Scribe v2 is worth the premium. For batch transcription of recorded calls or podcasts, Whisper is the right pick.
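The per-minute comparison can be checked with a few lines of arithmetic. This sketch assumes the $0.25/min Scribe v2 base rate quoted in the FAQ of this article and the $0.006/min Whisper rate above; it applies list prices only and does not model volume discounts, which the higher tiers evidently receive.

```python
SCRIBE_PER_MIN = 0.25    # base rate quoted in this article's FAQ
WHISPER_PER_MIN = 0.006  # batch rate quoted above


def monthly_cost(minutes: int, rate_per_min: float) -> float:
    """List-price monthly cost in dollars, rounded to cents."""
    return round(minutes * rate_per_min, 2)


for minutes in (500, 5_000):
    scribe = monthly_cost(minutes, SCRIBE_PER_MIN)
    whisper = monthly_cost(minutes, WHISPER_PER_MIN)
    print(f"{minutes:>6} min: Scribe ${scribe:,.2f} vs Whisper ${whisper:,.2f}")

# Price ratio between the two per-minute rates.
print(round(SCRIBE_PER_MIN / WHISPER_PER_MIN, 1))  # 41.7
```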

When to Use Scribe v2 and When Not To

Use Scribe v2 when:

Don't use Scribe v2 when:

FAQ

Is Scribe v2 really 150ms end-to-end?

For the STT portion, yes at p50. p95 is ~250ms, p99 ~400ms. "End-to-end voice agent" latency (STT → LLM → TTS) is higher, typically 400-500ms total when using Scribe v2 + Claude/GPT + ElevenLabs TTS. Still best-in-class, just not a single 150ms number.

How does Scribe v2 compare to Whisper API?

Whisper API is batch (no streaming), ~40× cheaper per minute of audio, but has 400-800ms total response latency. Scribe v2 is streaming, 150ms latency, and 2-4pp better on WER for noisy/multilingual audio. Use Whisper for recorded audio; use Scribe v2 for real-time voice agents.

Can I use Scribe v2 with Claude or GPT instead of ElevenLabs' voice agent?

Yes. Scribe v2 is a standalone STT API — you can pipe transcripts to any LLM and pipe LLM responses to any TTS (including ElevenLabs TTS, OpenAI TTS, or Gemini 3.1 Flash TTS). This is the most flexible architecture and what we recommend for production voice agents.
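The swap-any-component architecture described here can be sketched as three plain callables wired together. The class name `VoicePipeline` and the stub functions are hypothetical; in a real system each callable would wrap a vendor SDK call (Scribe v2 for STT, an LLM of your choice, a TTS engine), none of which is shown.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VoicePipeline:
    """One conversational turn: audio in -> text -> reply -> audio out."""
    transcribe: Callable[[bytes], str]   # STT stage
    respond: Callable[[str], str]        # LLM stage
    synthesize: Callable[[str], bytes]   # TTS stage

    def run_turn(self, audio: bytes) -> bytes:
        text = self.transcribe(audio)
        reply = self.respond(text)
        return self.synthesize(reply)


# Stub stages so the sketch runs without any network access.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = VoicePipeline(fake_stt, fake_llm, fake_tts)
print(pipeline.run_turn(b"hello"))  # b'You said: hello'
```

Because each stage is just a function, replacing one provider means passing a different callable; the rest of the pipeline is unchanged.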

Does Scribe v2 work in non-English languages?

Yes, 40+ languages supported with reasonable quality. Best in English, Spanish, French, German, Mandarin, Japanese. Quality degrades for low-resource languages but remains usable. Multilingual WER averages 5.8% — best in class.

How do I handle rate limits on Scribe v2?

Scribe v2 rate limits are generous for paying customers (typically not the bottleneck). For burst traffic or redundancy, use TokenMix.ai's voice routing which falls back to Gemini Live STT or OpenAI Realtime when Scribe v2 is unavailable. See our voice AI comparison article for the full fallback architecture.
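A fallback chain of this kind can be sketched generically as "try providers in order, return the first success". This is not TokenMix.ai's actual routing logic; the function name, the `AllProvidersFailed` exception, and the stub providers are assumptions made for illustration.

```python
from typing import Callable, Sequence, Tuple


class AllProvidersFailed(RuntimeError):
    """Raised when every STT provider in the chain has errored."""


def transcribe_with_fallback(
    audio: bytes,
    providers: Sequence[Tuple[str, Callable[[bytes], str]]],
) -> Tuple[str, str]:
    """Return (provider_name, transcript) from the first provider that works."""
    errors = []
    for name, transcribe in providers:
        try:
            return name, transcribe(audio)
        except Exception as exc:  # broad on purpose: any failure triggers fallback
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


# Stub providers: the primary is rate limited, the secondary succeeds.
def primary(audio: bytes) -> str:
    raise TimeoutError("rate limited")


def secondary(audio: bytes) -> str:
    return "hello world"


print(transcribe_with_fallback(b"...", [("primary", primary), ("secondary", secondary)]))
# ('secondary', 'hello world')
```

A production version would likely add per-provider timeouts and distinguish retryable errors (rate limits, outages) from permanent ones (bad audio format).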

Is Scribe v2 production-ready as of April 2026?

Yes. In production at major customer service platforms and voice agent startups. Uptime historically 99.9%+. The April 2026 feature additions (multimodal, guardrails, DTMF) round out the last production-critical gaps.

What's the biggest downside of Scribe v2?

Cost. At $0.25/min base, a high-volume voice agent can exceed $10K/mo just on STT. If you're price-sensitive, Gemini Live's integrated pricing ($0.02-0.05/min effective) is much cheaper, trading some accuracy for cost.


By TokenMix Research Lab · Updated 2026-04-22