TokenMix Research Lab · 2026-04-22
ElevenLabs Scribe v2: 150ms Latency Real-Time Speech API (2026)
Last Updated: 2026-04-22
Author: TokenMix Research Lab
ElevenLabs Scribe v2 Realtime streams audio in and returns transcriptions in ~150 milliseconds — the fastest major commercial speech-to-text API in 2026. April 2026 release updates added multimodal message support, guardrail events, and DTMF input, completing the voice agent feature set. This review examines real latency, transcription accuracy, pricing, and comparison to OpenAI Realtime and Gemini Live for building production voice agents. TokenMix.ai includes Scribe v2 in its voice API routing, letting teams combine ElevenLabs transcription with reasoning from any LLM provider.
Table of Contents
- Confirmed vs Speculation: Scribe v2 Facts
- 150ms Latency: What That Actually Means
- Accuracy Benchmarks vs OpenAI Whisper and Gemini
- April 2026 Feature Additions
- Scribe v2 vs OpenAI Realtime vs Gemini Live
- Pricing at Real Usage Levels
- When to Use Scribe v2 and When Not To
- FAQ
Confirmed vs Speculation: Scribe v2 Facts
| Claim | Status | Source |
|---|---|---|
| Scribe v2 Realtime available | Confirmed | ElevenLabs docs |
| ~150ms STT round-trip latency (p50) | Confirmed | Official specs |
| Multimodal message support (April 2026) | Confirmed | ElevenLabs changelog |
| Guardrail triggered event | Confirmed | API docs |
| DTMF input support | Confirmed | April release notes |
| Beats Whisper on accuracy | Partial — depends on language/domain | Community benchmarks |
| Cheapest real-time STT | No — Whisper API often cheaper for batch | — |
| Works offline | No — cloud API only | — |
Bottom line: real latency leader for commercial real-time STT. Pricing is premium but justified for voice agent use cases.
150ms Latency: What That Actually Means
"150ms latency" refers to time from last audio chunk sent to transcript returned. Breakdown:
| Stage | Typical time |
|---|---|
| Audio chunk upload (WebSocket) | 10-30ms |
| Speech-to-text inference | 60-100ms |
| Network return | 10-30ms |
| Total (p50) | 150ms |
| Total (p95) | 250ms |
| Total (p99) | 400ms |
Why 150ms matters: human conversation has ~200-300ms turn-taking gaps. For an AI voice agent to feel like a phone call (not a walkie-talkie), the pipeline must finish transcription, LLM inference, and TTS in roughly 400-500ms total. With 150ms STT + 150ms LLM + 150ms TTS, you land at 450ms, which is barely natural. With slower STT at 400-600ms (as in older systems), the same pipeline takes 700-900ms and the conversation feels laggy.
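The budget arithmetic above can be checked with a short sketch. It assumes a strictly sequential pipeline (no streaming overlap between stages) and reuses the 150ms LLM and TTS figures from the paragraph together with the p50/p95/p99 STT totals from the table; the class and function names are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageLatency:
    name: str
    millis: int


def total_latency_ms(stages: list[StageLatency]) -> int:
    """Sum per-stage latencies for a strictly sequential pipeline."""
    return sum(stage.millis for stage in stages)


def pipeline_with_stt(stt_millis: int) -> list[StageLatency]:
    """Build an STT -> LLM -> TTS pipeline using the article's 150ms LLM/TTS figures."""
    return [
        StageLatency("stt", stt_millis),
        StageLatency("llm", 150),
        StageLatency("tts", 150),
    ]


# STT latency percentiles from the breakdown table above.
STT_PERCENTILES_MS = {"p50": 150, "p95": 250, "p99": 400}

for label, stt_ms in STT_PERCENTILES_MS.items():
    total = total_latency_ms(pipeline_with_stt(stt_ms))
    print(f"{label}: {total}ms")
```

Under these assumptions the full loop comes to 450ms at p50, 550ms at p95, and 700ms at p99, so tail latency matters as much as the headline number. Real pipelines usually stream between stages, which can bring the perceived delay below these sums.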
Accuracy Benchmarks vs OpenAI Whisper and Gemini
WER (Word Error Rate) on standard test sets, April 2026:
| Model | English WER | Multilingual avg WER | Streaming (latency) |
|---|---|---|---|
| ElevenLabs Scribe v2 | 4.1% | 5.8% | Yes, 150ms |
| OpenAI Whisper Large v3 | 4.3% | 6.1% | Yes, 400ms |
| OpenAI Realtime | 4.5% | 6.5% | Yes, 350ms |
| Google Speech-to-Text V2 | 4.8% | 6.2% | Yes, 300ms |
| Gemini Live (integrated) | ~5.0% | ~6.4% | Yes, 250ms |
| Deepgram Nova-3 | 4.6% | 6.0% | Yes, 200ms |
Scribe v2 leads on both accuracy and latency in this comparison. Trade-off: at the base rate of $0.25 per minute, it costs about 40× more than the Whisper API ($0.006 per minute) for comparable batch workloads.
Domain performance:
- Clean studio audio: all models within 1 percentage point
- Noisy environments (car, cafe, call center): Scribe v2 pulls ahead by 2-4pp
- Accented English, code-switching: Scribe v2's margin widens further
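For readers who want to reproduce WER comparisons on their own audio, here is a minimal sketch of the standard definition: word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words. It is not the scoring script behind the table above, and it skips the text normalization (casing, punctuation, number formatting) that published benchmarks usually apply first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


reference = "press one for sales and two for support"
hypothesis = "press one for sale and to for support"
print(f"WER: {word_error_rate(reference, hypothesis):.2%}")
```

In this example two of the eight reference words are substituted, giving a WER of 25%. A 2-4pp gap in noisy conditions therefore corresponds to roughly two to four fewer wrong words per hundred spoken.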
April 2026 Feature Additions
New additions that matter for production:
1. multimodal_message WebSocket event — send audio and images together in the same stream. Useful for voice-first agents that also handle screenshots, documents, or live video frames.
2. onGuardrailTriggered callback — server-side content filtering fires a client-side event when harmful content is detected. Lets you react in real-time instead of discovering issues after the fact.
3. DTMF input — touch-tone detection from phone integrations. Lets Scribe handle "press 1 for sales" scenarios alongside speech.
4. Scoped analysis and test folders — better organization for production deployments with multiple agent configs.
5. useConversationControls hook (React SDK) — simpler React integration for building voice agent UIs.
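To illustrate how an application might react to the guardrail and DTMF events listed above, here is a hedged sketch of a client-side dispatcher. The JSON shape, the `type` values (`dtmf`, `guardrail_triggered`), and the `digit` and `reason` fields are all assumptions made for illustration; they are not taken from the ElevenLabs API reference, so check the official docs for the real event schema.

```python
import json
from typing import Callable

Handler = Callable[[dict], str]

# Hypothetical phone menu, in the spirit of "press 1 for sales".
DTMF_MENU = {"1": "sales", "2": "support"}


def handle_dtmf(event: dict) -> str:
    """Route a touch-tone digit to a queue; unknown digits go to an operator."""
    digit = event.get("digit", "")  # field name is an assumption
    return f"route:{DTMF_MENU.get(digit, 'operator')}"


def handle_guardrail(event: dict) -> str:
    """Stop the current turn when the server reports a guardrail hit."""
    return f"halt:{event.get('reason', 'unspecified')}"  # field name is an assumption


# Event type strings are assumed, not confirmed by the source.
HANDLERS: dict[str, Handler] = {
    "dtmf": handle_dtmf,
    "guardrail_triggered": handle_guardrail,
}


def dispatch(raw_message: str) -> str:
    """Parse one JSON message and call the matching handler, if any."""
    event = json.loads(raw_message)
    handler = HANDLERS.get(event.get("type"))
    if handler is None:
        return "ignored"
    return handler(event)


print(dispatch('{"type": "dtmf", "digit": "1"}'))
print(dispatch('{"type": "guardrail_triggered", "reason": "harmful_content"}'))
```

Keeping handlers in a lookup table means new event types can be added without touching the dispatch logic, and unknown events are ignored rather than crashing the session.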
Scribe v2 vs OpenAI Realtime vs Gemini Live
| Dimension | ElevenLabs Scribe v2 | OpenAI Realtime | Gemini Live |
|---|---|---|---|
| STT latency | 150ms | 350ms | 250ms |
| STT WER | 4.1% | 4.5% | 5.0% |
| Integrated LLM | No (BYO) | GPT-5.4 bundled | Gemini 3.1 Flash bundled |
| Integrated TTS | ElevenLabs voices | OpenAI voices | Gemini TTS |
| Voice polish | Best-in-class | Good | Good |
| Voice cloning | Yes | Limited | No |
| End-to-end latency (full loop) | 400-500ms (custom pipeline) | 300-500ms | 300-450ms |
| Pricing model | Per-minute streaming | Per-token (audio tokens) | Per-token |
| Flexibility | High (swap components) | Low | Medium |
Positioning:
- Scribe v2: best if you want best-in-class STT and will bring your own LLM (via OpenAI/Claude/Gemini API)
- OpenAI Realtime: best if you want one vendor for STT+LLM+TTS with lowest integration work
- Gemini Live: best if you're already on Google's stack
See our voice AI API comparison for the full three-way analysis.
Pricing at Real Usage Levels
ElevenLabs Scribe v2 Realtime pricing (April 2026):
- Base: $0.25 per minute of audio processed
- Volume tier (>10K min/mo): $0.20 per minute
- Enterprise (>100K min/mo): $0.15 per minute with dedicated capacity
Monthly cost estimates:
| Use case | Audio minutes/mo | Cost/mo |
|---|---|---|
| Solo developer testing | 500 | $125 |
| Small startup voice agent | 5,000 | $1,250 |
| Mid-sized customer support | 50,000 | $10,000 (volume tier, $0.20/min) |
| Enterprise call center | 500,000 | $75,000 (enterprise tier, $0.15/min) |
Compare to OpenAI Whisper API at $0.006/min (batch) — 40× cheaper but no streaming. For voice agents where 150ms matters, Scribe v2 is worth the premium. For batch transcription of recorded calls/podcasts, Whisper is the right pick.
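The estimates in the table can be reproduced with a small calculator. This sketch assumes the tier rate applies to every minute in the month (flat pricing), which is what the table's figures imply; the source does not say whether ElevenLabs bills tiers marginally, so treat that as an assumption. Rates are held in cents to keep the arithmetic exact.

```python
# Prices in cents per minute, taken from the tier list above.
BASE_CENTS = 25
VOLUME_CENTS = 20          # above 10,000 min/mo
ENTERPRISE_CENTS = 15      # above 100,000 min/mo
WHISPER_BATCH_CENTS = 0.6  # $0.006/min, batch only


def scribe_rate_cents(minutes_per_month: int) -> int:
    """Per-minute rate for a monthly volume (assumes flat, not marginal, tiers)."""
    if minutes_per_month > 100_000:
        return ENTERPRISE_CENTS
    if minutes_per_month > 10_000:
        return VOLUME_CENTS
    return BASE_CENTS


def scribe_monthly_cost_usd(minutes_per_month: int) -> float:
    """Estimated monthly Scribe v2 cost in dollars."""
    return minutes_per_month * scribe_rate_cents(minutes_per_month) / 100


def base_price_ratio() -> float:
    """How many times more Scribe v2 costs per minute than Whisper at base rates."""
    return BASE_CENTS / WHISPER_BATCH_CENTS


for minutes in (500, 5_000, 50_000, 500_000):
    print(f"{minutes:>7} min/mo -> ${scribe_monthly_cost_usd(minutes):,.0f}")
print(f"Base price ratio vs Whisper: {base_price_ratio():.1f}x")
```

The base-rate ratio works out to about 41.7, which is where the rounded "40×" figure comes from. At the enterprise rate the gap narrows to 25×.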
When to Use Scribe v2 and When Not To
Use Scribe v2 when:
- Building voice agents where conversation feel matters (customer service, virtual assistants)
- Handling noisy audio (car, cafe, call center)
- Real-time captioning where latency < 250ms is required
- You already use ElevenLabs for TTS and want unified billing
- Voice cloning is in your roadmap
Don't use Scribe v2 when:
- Batch transcribing recorded audio — use Whisper or Google Cloud STT
- Budget is the primary constraint — Whisper is 40× cheaper
- You need offline/on-device — no ElevenLabs option
- You want integrated LLM — OpenAI Realtime or Gemini Live are simpler
FAQ
Is Scribe v2 really 150ms end-to-end?
For the STT portion, yes at p50; p95 is ~250ms and p99 is ~400ms. End-to-end voice agent latency (STT → LLM → TTS) is higher, typically 400-500ms total when using Scribe v2 + Claude/GPT + ElevenLabs TTS. That is in the same range as the integrated stacks in the comparison table (300-500ms), just not a single 150ms number.
How does Scribe v2 compare to Whisper API?
Whisper API is batch (no streaming) and ~40× cheaper per minute of audio, but has 400-800ms total response latency. Scribe v2 is streaming with 150ms latency and is 2-4pp better on WER for noisy audio; on the standard test sets the gap is 0.2-0.3pp. Use Whisper for recorded audio; use Scribe v2 for real-time voice agents.
Can I use Scribe v2 with Claude or GPT instead of ElevenLabs' voice agent?
Yes. Scribe v2 is a standalone STT API — you can pipe transcripts to any LLM and pipe LLM responses to any TTS (including ElevenLabs TTS, OpenAI TTS, or Gemini 3.1 Flash TTS). This is the most flexible architecture and what we recommend for production voice agents.
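The bring-your-own-components architecture described above can be sketched as three swappable callables. Everything here is illustrative: the `VoicePipeline` class and the stub functions are hypothetical stand-ins, and real STT, LLM, and TTS clients would be wrapped behind the same signatures.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VoicePipeline:
    """One conversational turn: audio in, audio out, with swappable stages."""

    transcribe: Callable[[bytes], str]  # STT stage, e.g. a Scribe v2 wrapper
    respond: Callable[[str], str]       # LLM stage, any provider
    synthesize: Callable[[str], bytes]  # TTS stage, any provider

    def handle_turn(self, audio: bytes) -> bytes:
        transcript = self.transcribe(audio)
        reply = self.respond(transcript)
        return self.synthesize(reply)


# Stub stages so the sketch runs without network access or API keys.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = VoicePipeline(transcribe=fake_stt, respond=fake_llm, synthesize=fake_tts)
print(pipeline.handle_turn(b"hello"))
```

Because each stage is just a function, swapping the LLM or TTS vendor means replacing one argument rather than rewriting the turn logic. A production version would stream partial transcripts and audio between stages instead of waiting for each to finish.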
Does Scribe v2 work in non-English languages?
Yes, 40+ languages supported with reasonable quality. Best in English, Spanish, French, German, Mandarin, Japanese. Quality degrades for low-resource languages but remains usable. Multilingual WER averages 5.8% — best in class.
How do I handle rate limits on Scribe v2?
Scribe v2 rate limits are generous for paying customers (typically not the bottleneck). For burst traffic or redundancy, use TokenMix.ai's voice routing which falls back to Gemini Live STT or OpenAI Realtime when Scribe v2 is unavailable. See our voice AI comparison article for the full fallback architecture.
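A fallback chain like the one described above can be approximated in a few lines. This is a generic sketch, not TokenMix.ai's routing implementation: the provider names are labels only, the `ProviderUnavailable` exception is hypothetical, and the stub functions stand in for real API clients.

```python
from typing import Callable, Sequence


class ProviderUnavailable(Exception):
    """Raised by a provider wrapper on rate limits, timeouts, or outages."""


Transcriber = Callable[[bytes], str]


def transcribe_with_fallback(
    audio: bytes, providers: Sequence[tuple[str, Transcriber]]
) -> tuple[str, str]:
    """Try providers in order; return (provider_name, transcript) from the first success."""
    errors: list[str] = []
    for name, transcribe in providers:
        try:
            return name, transcribe(audio)
        except ProviderUnavailable as exc:
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


# Stubs simulating a rate-limited primary and a healthy fallback.
def scribe_stub(audio: bytes) -> str:
    raise ProviderUnavailable("rate limited")


def gemini_stub(audio: bytes) -> str:
    return audio.decode("utf-8")


chain = [("scribe-v2", scribe_stub), ("gemini-live", gemini_stub)]
print(transcribe_with_fallback(b"hello", chain))
```

Returning the provider name alongside the transcript makes it easy to log how often the fallback path is taken, which is useful when deciding whether a higher rate-limit tier is worth paying for.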
Is Scribe v2 production-ready as of April 2026?
Yes. In production at major customer service platforms and voice agent startups. Uptime historically 99.9%+. The April 2026 feature additions (multimodal, guardrails, DTMF) round out the last production-critical gaps.
What's the biggest downside of Scribe v2?
Cost. At $0.25/min base, a high-volume voice agent can exceed $10K/mo just on STT. If you're price-sensitive, Gemini Live's integrated pricing ($0.02-0.05/min effective) is much cheaper, trading some accuracy for cost.
Sources
- ElevenLabs Scribe v2 Realtime
- ElevenLabs API Documentation
- ElevenLabs Changelog
- ElevenLabs Cheat Sheet 2026 — Webfuse
- ElevenLabs Pricing
- Voice AI API Comparison — TokenMix
- Gemini 3.1 Flash TTS Review — TokenMix