TokenMix Research Lab · 2026-04-22
ElevenLabs Scribe v2: 150ms Latency Real-Time Speech API (2026)
ElevenLabs Scribe v2 Realtime streams audio in and returns transcriptions in ~150 milliseconds — the fastest major commercial speech-to-text API in 2026. April 2026 release updates added multimodal message support, guardrail events, and DTMF input, completing the voice agent feature set. This review examines real latency, transcription accuracy, pricing, and comparison to OpenAI Realtime and Gemini Live for building production voice agents. TokenMix.ai includes Scribe v2 in its voice API routing, letting teams combine ElevenLabs transcription with reasoning from any LLM provider.
Table of Contents
- Confirmed vs Speculation: Scribe v2 Facts
- 150ms Latency: What That Actually Means
- Accuracy Benchmarks vs OpenAI Whisper and Gemini
- April 2026 Feature Additions
- Scribe v2 vs OpenAI Realtime vs Gemini Live
- Pricing at Real Usage Levels
- When to Use Scribe v2 and When Not To
- FAQ
Confirmed vs Speculation: Scribe v2 Facts
| Claim | Status | Source |
|---|---|---|
| Scribe v2 Realtime available | Confirmed | ElevenLabs docs |
| ~150ms transcription latency (last audio chunk to transcript) | Confirmed | Official specs |
| Multimodal message support (April 2026) | Confirmed | ElevenLabs changelog |
| Guardrail triggered event | Confirmed | API docs |
| DTMF input support | Confirmed | April release notes |
| Beats Whisper on accuracy | Partial — depends on language/domain | Community benchmarks |
| Cheapest real-time STT | No — Whisper API often cheaper for batch | — |
| Works offline | No — cloud API only | — |
Bottom line: Scribe v2 is the latency leader among commercial real-time STT APIs. Pricing is premium, but the premium is justified for voice agent use cases.
150ms Latency: What That Actually Means
"150ms latency" refers to the time between sending the last audio chunk and receiving the transcript. It does not include LLM inference or TTS. Typical breakdown:
| Stage | Typical time |
|---|---|
| Audio chunk upload (WebSocket) | 10-30ms |
| Speech-to-text inference | 60-100ms |
| Network return | 10-30ms |
| Total (p50) | 150ms |
| Total (p95) | 250ms |
| Total (p99) | 400ms |
Why 150ms matters: human conversation has turn-taking gaps of roughly 200-300ms. For an AI voice agent to feel like a phone call rather than a walkie-talkie, the pipeline should finish transcription, LLM inference, and TTS in about 400ms total. With 150ms STT + 150ms LLM + 150ms TTS, you land at 450ms, just past that target and barely natural. With slower STT at 400-600ms (as in older systems) and the same LLM and TTS times, you are at 700-900ms and the conversation feels laggy.
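The budget arithmetic above can be checked with a small sketch. The stage times and the 400ms target are the figures quoted in this section; the `Pipeline` class and its method names are illustrative and not part of any SDK.

```python
from dataclasses import dataclass

# Target for the full STT + LLM + TTS loop, taken from the text above.
TURN_BUDGET_MS = 400


@dataclass(frozen=True)
class Pipeline:
    """Per-stage latency of a voice agent loop, in milliseconds."""
    stt_ms: int
    llm_ms: int
    tts_ms: int

    @property
    def total_ms(self) -> int:
        # Stages run back to back, so their latencies add up.
        return self.stt_ms + self.llm_ms + self.tts_ms

    def overshoot_ms(self, budget_ms: int = TURN_BUDGET_MS) -> int:
        # How far past the budget the loop lands (0 if within budget).
        return max(0, self.total_ms - budget_ms)


fast = Pipeline(stt_ms=150, llm_ms=150, tts_ms=150)
slow_low = Pipeline(stt_ms=400, llm_ms=150, tts_ms=150)
slow_high = Pipeline(stt_ms=600, llm_ms=150, tts_ms=150)

for name, p in [("fast", fast), ("slow (low)", slow_low), ("slow (high)", slow_high)]:
    print(f"{name}: total={p.total_ms}ms, over budget by {p.overshoot_ms()}ms")
```

With LLM and TTS held at 150ms each, the STT stage is the only variable, which is why shaving it from 400-600ms down to 150ms moves the whole loop from clearly laggy to close to the target.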
Accuracy Benchmarks vs OpenAI Whisper and Gemini
WER (Word Error Rate) on standard test sets, April 2026:
| Model | English WER | Multilingual avg WER | Streaming (latency) |
|---|---|---|---|
| ElevenLabs Scribe v2 | 4.1% | 5.8% | Yes, 150ms |
| OpenAI Whisper Large v3 | 4.3% | 6.1% | Yes, 400ms |
| OpenAI Realtime | 4.5% | 6.5% | Yes, 350ms |
| Google Speech-to-Text V2 | 4.8% | 6.2% | Yes, 300ms |
| Gemini Live (integrated) | ~5.0% | ~6.4% | Yes, 250ms |
| Deepgram Nova-3 | 4.6% | 6.0% | Yes, 200ms |
Scribe v2 leads on accuracy and latency simultaneously. The trade-off is price: it is 2-4× more expensive than the Whisper API for comparable batch workloads.
Domain performance:
- Clean studio audio: all models within 1 percentage point
- Noisy environments (car, cafe, call center): Scribe v2 pulls ahead by 2-4pp
- Accented English, code-switching: Scribe v2's margin widens further
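For readers who want to reproduce WER numbers on their own audio, here is a minimal sketch of the standard definition: word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words. The normalization here is only lowercasing and whitespace splitting, which is an assumption; published benchmarks usually normalize punctuation and numerals as well, and this article does not state how the figures in the table were scored.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("press one for sales", "press won for sales"))
```

A 4.1% WER means roughly 4 errors per 100 reference words, so the gaps of a few tenths of a point on clean audio are small next to the 2-4pp gaps reported for noisy audio.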
April 2026 Feature Additions
New additions that matter for production:
1. multimodal_message WebSocket event: send audio and images together in the same stream. Useful for voice-first agents that also handle screenshots, documents, or live video frames.
2. onGuardrailTriggered callback: server-side content filtering fires a client-side event when harmful content is detected. This lets you react in real time instead of discovering issues after the fact.
3. DTMF input: touch-tone detection from phone integrations. This lets Scribe handle "press 1 for sales" scenarios alongside speech.
4. Scoped analysis and test folders: better organization for production deployments with multiple agent configs.
5. useConversationControls hook (React SDK): simpler React integration for building voice agent UIs.
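To make the DTMF item concrete, here is a hedged sketch of the routing logic an application might put behind a "press 1 for sales" menu. It is plain application code, not ElevenLabs API usage: the `MENU` table and `route_caller` function are hypothetical, and it assumes the digit and transcript have already been extracted from whatever events the API delivers, whose payload shapes are not described in this article.

```python
from typing import Optional

# Hypothetical phone menu: touch-tone digit -> department.
MENU = {"1": "sales", "2": "support", "0": "operator"}


def route_caller(dtmf_digit: Optional[str], transcript: Optional[str]) -> str:
    """Pick a destination from a touch-tone digit, falling back to speech."""
    if dtmf_digit is not None:
        # A pressed key is unambiguous, so it takes priority over speech.
        # Unknown digits go to the operator.
        return MENU.get(dtmf_digit, "operator")
    if transcript:
        spoken = transcript.lower()
        # Naive keyword match on the department name.
        for department in MENU.values():
            if department in spoken:
                return department
    return "operator"


print(route_caller("1", None))
print(route_caller(None, "I need support please"))
```

Giving the digit priority over the transcript is a design choice for this sketch: a caller who presses a key while talking has most likely made their selection with the key.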
Scribe v2 vs OpenAI Realtime vs Gemini Live
| Dimension | ElevenLabs Scribe v2 | OpenAI Realtime | Gemini Live |
|---|---|---|---|
| STT latency | 150ms | 350ms | 250ms |
| STT WER | 4.1% | 4.5% | 5.0% |
| Integrated LLM | No (BYO) | GPT-5.4 bundled | Gemini 3.1 Flash bundled |
| Integrated TTS | ElevenLabs voices | OpenAI voices | Gemini TTS |
| Voice polish | Best-in-class | Good | Good |
| Voice cloning | Yes | Limited | No |
| End-to-end latency (full loop) | 400-500ms (custom pipeline) | 300-500ms | 300-450ms |
| Pricing model | Per-minute streaming | Per-token (audio tokens) | Per-token |
| Flexibility | High (swap components) | Low | Medium |
Positioning:
- Scribe v2: best if you want best-in-class STT and will bring your own LLM (via OpenAI/Claude/Gemini API)
- OpenAI Realtime: best if you want one vendor for STT+LLM+TTS with lowest integration work
- Gemini Live: best if you're already on Google's stack
See our voice AI API comparison for the full three-way analysis.
Pricing at Real Usage Levels
ElevenLabs Scribe v2 Realtime pricing (April 2026):
- Base: $0.25 per minute of audio processed
- Volume tier (>10K min/mo): $0.20 per minute
- Enterprise (>100K min/mo): $0.15 per minute with dedicated capacity
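The tiers above can be turned into a rough monthly estimate. This sketch assumes each tier's rate applies to all minutes once monthly volume exceeds the threshold (a flat rate, not graduated billing); the article does not say which model ElevenLabs uses, so treat the function and its name as illustrative only.

```python
def monthly_cost_usd(minutes: int) -> float:
    """Estimate monthly Scribe v2 Realtime cost from the listed tiers.

    Assumption: the tier rate applies to every minute in the month
    (flat rate per tier), which the source does not confirm.
    """
    if minutes < 0:
        raise ValueError("minutes must be non-negative")
    if minutes > 100_000:
        rate_cents = 15  # Enterprise: $0.15 per minute
    elif minutes > 10_000:
        rate_cents = 20  # Volume: $0.20 per minute
    else:
        rate_cents = 25  # Base: $0.25 per minute
    # Work in integer cents to avoid floating-point drift.
    return minutes * rate_cents / 100


for volume in (500, 10_000, 20_000, 200_000):
    print(f"{volume:>7} min/mo -> ${monthly_cost_usd(volume):,.2f}")
```

Under the flat-rate assumption, a month just over a threshold costs less than a month just under it (10,001 minutes at $0.20 comes to less than 10,000 minutes at $0.25), so it is worth confirming how tiers are actually billed before budgeting near a boundary.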
Monthly cost estimates:
| Use case | Audio minutes/mo | Cost/mo |
|---|---|---|
| Solo developer testing | 500 |