TokenMix Research Lab · 2026-04-10

Whisper API Pricing Compared: Speech-to-Text API Cost Breakdown for Every Budget (2026)

Last Updated: 2026-04-29
Author: TokenMix Research Lab

Groq Whisper at $0.04/hr is 9x cheaper than OpenAI ($0.36/hr) and processes 1hr of audio in 8-12 sec. Google Chirp 2.0 leads accuracy and offers diarization + streaming. AssemblyAI bundles features at $0.0125/min.

Speech-to-text API pricing ranges from roughly $0.000667 per minute with Groq Whisper ($0.04/hr) to $0.0125 per minute with AssemblyAI's Best tier. OpenAI Whisper and Google's Chirp 2.0 long-audio tier both sit at $0.006 per minute, and AssemblyAI Nano at $0.0065. The cheapest option depends on your volume, latency requirements, and accuracy needs. This guide breaks down the real costs, speed benchmarks, and accuracy data across four major Whisper API and speech-to-text providers so you can pick the right one without overpaying.

Quick Comparison: Speech-to-Text API Pricing at a Glance

Groq dominates speed (8-12s for 1hr) and price ($0.000667/min). OpenAI is the simple default ($0.006/min). Google leads languages (125+) and accuracy. AssemblyAI bundles diarization + summarization at $0.0125/min.

| Feature | OpenAI Whisper | Groq Whisper | Google Speech-to-Text | AssemblyAI |
|---|---|---|---|---|
| Price per minute | $0.006 | $0.04/hr ($0.000667/min) | $0.006-$0.024/min | $0.0065/min (Nano), $0.0125/min (Best) |
| Model | Whisper large-v3 | Whisper large-v3-turbo | Chirp 2.0 / V1 | Universal-2 |
| Speed (1hr audio) | ~45-60 sec | ~8-12 sec | ~30-50 sec | ~15-25 sec |
| Languages | 57+ | 57+ | 125+ | 20+ |
| Max file size | 25 MB | 25 MB | 480 min (streaming) | No hard limit |
| Diarization | No | No | Yes | Yes |
| Streaming | No | No | Yes | Yes (real-time) |
| Best for | General use | Speed-critical | Multi-language enterprise | Feature-rich apps |

Why Whisper API Pricing Varies More Than You Think

Four hidden cost levers: encoding overhead (transcoding before API), 25MB file caps (chunking complexity), feature bundling (diarization/summarization extras), volume discounts. Sticker price misleads at scale.

The sticker price per minute tells half the story. Four hidden factors change your actual speech-to-text API cost significantly.

Encoding overhead. Some APIs require specific audio formats. If your source audio is in a format that needs transcoding, you are spending compute time before the API even touches it. OpenAI Whisper accepts mp3, mp4, wav, and webm. Google is pickier about encoding for streaming. AssemblyAI handles most formats natively.

File size limits. OpenAI and Groq cap file uploads at 25 MB. For long recordings, you must split files, which adds engineering complexity and potential accuracy loss at segment boundaries. AssemblyAI and Google handle long-form audio without splitting.

Feature bundling. Speaker diarization, punctuation, sentiment analysis, and topic detection are extras. OpenAI Whisper gives you raw transcription only. AssemblyAI bundles diarization and summarization at higher tiers. Google charges separately for enhanced features.

Volume discounts. Google offers committed use discounts at scale. AssemblyAI provides enterprise pricing starting at roughly 500 hours per month. OpenAI and Groq have flat pricing with no published volume tiers.
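To make the file size factor concrete, here is a minimal sketch of how a 25 MB cap translates into chunk length. It assumes constant-bitrate audio, a binary megabyte, a 5% safety margin, and a 2-second overlap between chunks to soften boundary errors. None of those values come from the providers; the function names are illustrative.

```python
MAX_UPLOAD_BYTES = 25 * 1024 * 1024  # 25 MB cap; binary megabytes are an assumption


def max_chunk_seconds(bitrate_kbps: float, safety: float = 0.95) -> float:
    """Longest chunk (seconds) that stays under the cap at a constant bitrate."""
    bytes_per_second = bitrate_kbps * 1000 / 8
    return MAX_UPLOAD_BYTES * safety / bytes_per_second


def plan_chunks(total_seconds: float, chunk_seconds: float, overlap_seconds: float = 2.0):
    """Return (start, end) windows covering the recording with a small overlap."""
    if chunk_seconds <= overlap_seconds:
        raise ValueError("chunk must be longer than the overlap")
    windows = []
    start = 0.0
    while start < total_seconds:
        end = min(start + chunk_seconds, total_seconds)
        windows.append((start, end))
        if end >= total_seconds:
            break
        # Step back by the overlap so words at the boundary appear in both chunks.
        start = end - overlap_seconds
    return windows


if __name__ == "__main__":
    limit = max_chunk_seconds(128)  # 128 kbps mp3 is an example bitrate
    for start, end in plan_chunks(3600, limit):
        print(f"{start:8.1f}s -> {end:8.1f}s")
```

At 128 kbps this works out to about 26 minutes per chunk, so a one-hour recording needs three uploads. The overlapping text still has to be de-duplicated after transcription, which is part of the engineering cost mentioned above.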

TokenMix.ai tracks real-time pricing across all four providers. The data in this article reflects April 2026 rates monitored on the platform.

OpenAI Whisper API: The Benchmark Standard

$0.006/min flat across 57+ languages. 99.7% uptime, 2.1s median latency under 5min files. Trade-offs: no streaming, no diarization, 25MB file cap forces chunking, 45-60s to transcribe 1hr.

OpenAI Whisper API charges $0.006 per minute of audio input, with no distinction between languages or model variants. This flat pricing makes cost prediction straightforward.
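As a rough illustration of that predictability, the sketch below pairs a cost estimate with a basic transcription call through the official `openai` Python SDK. The model identifier `whisper-1` and the `OPENAI_API_KEY` environment variable come from the SDK's conventions rather than from this article, and the helper names are made up for the example.

```python
OPENAI_PRICE_PER_MIN = 0.006  # USD, the flat rate quoted in this article


def estimate_cost(duration_seconds: float) -> float:
    """Estimated charge in USD for one file at the flat per-minute rate."""
    return round(duration_seconds / 60 * OPENAI_PRICE_PER_MIN, 4)


def transcribe(path: str) -> str:
    """Send one audio file (under 25 MB) to the transcription endpoint."""
    # Imported lazily so the cost helper works without the SDK installed.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    with open(path, "rb") as audio:
        result = client.audio.transcriptions.create(model="whisper-1", file=audio)
    return result.text


if __name__ == "__main__":
    print(estimate_cost(3600))  # one hour of audio
```

Because the rate does not vary by language or format, the estimate only needs the audio duration.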

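What it does well: flat $0.006/min pricing regardless of language, 57+ languages with translation to English, support for common formats (mp3, mp4, wav, webm), and a well-documented, straightforward API.

Trade-offs: no streaming, no diarization, a 25 MB file cap that forces chunking for long recordings, and 45-60 seconds to transcribe one hour of audio.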

Best for: Teams that need reliable multilingual transcription with simple integration and predictable costs. If you do not need real-time streaming or diarization, Whisper API is the safe default.

TokenMix.ai real-time monitoring shows OpenAI Whisper maintains 99.7% uptime with median latency of 2.1 seconds for files under 5 minutes.

Groq Whisper: Speed Over Everything

$0.04/hr (~$0.000667/min) — 9x cheaper than OpenAI. 1hr audio transcribed in 8-12 seconds (300-450x real-time). Same 25MB cap, no streaming, no diarization. Rate limits and peak-demand availability are the catches.

Groq runs Whisper large-v3-turbo on its custom LPU hardware and charges $0.04 per hour, which works out to approximately $0.000667 per minute. That makes it roughly 9x cheaper than OpenAI on a per-minute basis. But the real selling point is speed.
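The per-minute figure and the 9x ratio follow directly from the two list prices. A small sketch of the arithmetic, using only the rates quoted in this article:

```python
GROQ_PER_HOUR = 0.04      # USD per hour of audio
OPENAI_PER_MINUTE = 0.006  # USD per minute of audio


def per_hour_to_per_minute(price_per_hour: float) -> float:
    return price_per_hour / 60


def per_minute_to_per_hour(price_per_minute: float) -> float:
    return price_per_minute * 60


groq_per_minute = per_hour_to_per_minute(GROQ_PER_HOUR)
openai_per_hour = per_minute_to_per_hour(OPENAI_PER_MINUTE)
ratio = openai_per_hour / GROQ_PER_HOUR

print(f"Groq:   ${groq_per_minute:.6f}/min")
print(f"OpenAI: ${openai_per_hour:.2f}/hr")
print(f"OpenAI costs {ratio:.1f}x as much as Groq")
```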

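What it does well: the lowest price of the four at $0.04/hr, and the fastest batch processing, with one hour of audio transcribed in 8-12 seconds (300-450x real-time).

Trade-offs: the same 25 MB file cap as OpenAI, no streaming, no diarization, rate limits, and constrained availability during peak demand.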

Best for: Applications where transcription speed matters more than features. Podcast processing pipelines, meeting transcription queues, and any batch processing scenario where you want results in seconds rather than minutes.

Through TokenMix.ai, you can access Groq Whisper alongside other speech-to-text providers via a unified API, with automatic failover if Groq's capacity is constrained.
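For readers who want to see what failover looks like in principle, here is a generic sketch of the pattern. It is not TokenMix.ai's API: the function name, the provider list, and the stand-in transcriber functions are all hypothetical, and a real client would catch narrower exception types than `Exception`.

```python
from typing import Callable, List, Sequence, Tuple

Transcriber = Callable[[str], str]  # takes a file path, returns the transcript


def transcribe_with_failover(
    path: str, providers: Sequence[Tuple[str, Transcriber]]
) -> Tuple[str, str]:
    """Try each provider in order; return (provider_name, transcript)."""
    errors: List[str] = []
    for name, transcribe in providers:
        try:
            return name, transcribe(path)
        except Exception as exc:  # e.g. rate limit or capacity error
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


if __name__ == "__main__":
    # Stand-in functions for illustration only; no network calls are made.
    def groq_stub(path: str) -> str:
        raise TimeoutError("capacity constrained")

    def openai_stub(path: str) -> str:
        return "hello world"

    print(transcribe_with_failover("audio.mp3", [("groq", groq_stub), ("openai", openai_stub)]))
```

Ordering the list by price puts the cheapest provider first and falls back to the next one only when a call fails.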

Google Cloud Speech-to-Text: Enterprise-Grade Accuracy

Chirp 2.0 long-audio at $0.006/min, drops to $0.004/min above 500K min/month. 125+ languages, native diarization (6 speakers), real-time streaming, lowest WER on phone calls and accents. Complex tier setup required.

Google Cloud Speech-to-Text pricing is tiered. The V1 standard model starts at $0.006 per 15 seconds ($0.024/min) for short audio, while the newer Chirp 2.0 model runs at $0.006 per minute for long audio recognition. The pricing structure is more complex than that of the other three providers.

Pricing tiers (Chirp 2.0, long audio):

| Usage tier | Price per minute |
|---|---|
| 0-500,000 min/month | $0.006 |
| 500,001-1,000,000 min/month | $0.004 |
| 1,000,001+ min/month | Contact sales |
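The tier table can be turned into a small calculator. This sketch assumes the tiers are marginal, meaning the first 500,000 minutes are always billed at $0.006 and only the minutes above that threshold get the $0.004 rate. That reading is an assumption; check Google's billing terms before relying on it.

```python
# (upper bound of tier in minutes, USD per minute), from the table above
TIERS = [(500_000, 0.006), (1_000_000, 0.004)]


def chirp_monthly_cost(minutes: int) -> float:
    """Monthly Chirp 2.0 long-audio cost under marginal tiering (assumed)."""
    if minutes > TIERS[-1][0]:
        raise ValueError("above 1,000,000 min/month pricing is 'Contact sales'")
    cost = 0.0
    lower = 0
    for upper, rate in TIERS:
        if minutes > lower:
            # Bill only the minutes that fall inside this tier.
            cost += (min(minutes, upper) - lower) * rate
        lower = upper
    return round(cost, 2)


if __name__ == "__main__":
    for hours in (100, 1_000, 10_000):
        print(hours, "hours:", chirp_monthly_cost(hours * 60))
```

Under this assumption, 10,000 hours (600,000 minutes) comes to $3,400, which is higher than the ~$2,800 estimate used in the volume comparison later in the article. The gap may reflect committed-use discounts or a different reading of how the tiers apply.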

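What it does well: 125+ languages, native diarization (up to 6 speakers), real-time streaming, the lowest WER on phone calls and accented speech, and volume pricing that drops to $0.004/min above 500,000 minutes per month.

Trade-offs: a complex tier setup, and short audio on the V1 model costs $0.024/min, four times the Chirp 2.0 long-audio rate.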

Best for: Enterprise teams needing multi-language support, streaming, diarization, and Google Cloud integration. The volume discounts make it cost-effective at scale.

AssemblyAI: Best Feature-to-Price Ratio

Two tiers: Nano $0.0065/min (fast/lightweight), Best $0.0125/min (Universal-2 + diarization + sentiment + summarization + PII redaction + LeMUR LLM Q&A). Most expensive but bundles features that would cost more separately.

AssemblyAI offers two tiers: Nano at $0.0065 per minute for fast, lightweight transcription, and Best at $0.0125 per minute for their most accurate Universal-2 model with all features included.

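What it does well: the Best tier bundles Universal-2 transcription with diarization, sentiment analysis, summarization, PII redaction, and LeMUR LLM Q&A. It also offers real-time streaming and has no hard file size limit.

Trade-offs: the highest per-minute price of the four at the Best tier ($0.0125/min), the narrowest language coverage (20+), and no translation.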

Best for: Developers building feature-rich audio applications who need more than raw transcription. The bundled features (diarization, sentiment, summarization) would cost significantly more if assembled separately from other providers.

Full Comparison Table

12 dimensions side-by-side. Speed gap: Groq 8-12s vs OpenAI 45-60s for 1hr. WER gap: Google 6-8% vs OpenAI/Groq 8-10% on English. Feature bundle: AssemblyAI Best is the only all-in-one.

| Feature | OpenAI Whisper | Groq Whisper | Google STT (Chirp 2.0) | AssemblyAI (Best) |
|---|---|---|---|---|
| Price/min | $0.006 | ~$0.000667 | $0.006 | $0.0125 |
| Price/hour | $0.36 | $0.04 | $0.36 | $0.75 |
| Speed (1hr audio) | 45-60s | 8-12s | 30-50s | 15-25s |
| WER (English) | ~8-10% | ~8-10% | ~6-8% | ~7-9% |
| Languages | 57+ | 57+ | 125+ | 20+ |
| Streaming | No | No | Yes | Yes |
| Diarization | No | No | Yes | Yes |
| Summarization | No | No | No | Yes |
| PII redaction | No | No | Via DLP | Yes |
| File size limit | 25 MB | 25 MB | 480 min | No limit |
| Translation | Yes (to EN) | Yes (to EN) | Yes | No |
| Free tier | $5 credit | Limited RPM | 60 min/month | Limited hours |

Cost Breakdown by Volume

At 10K hours/month: Groq $400 (cheapest), Google $2,800 (volume discount), OpenAI $3,600 (no discount), AssemblyAI Best $7,500. Groq's price advantage compounds with scale; volume discounts only kick in at Google.

Real speech-to-text API cost depends on your monthly volume. Here is what each provider costs at three usage levels.

Low volume: 100 hours/month

| Provider | Monthly cost | Notes |
|---|---|---|
| OpenAI Whisper | $36 | Flat rate |
| Groq Whisper | $4 | Flat rate |
| Google STT | $36 | Standard tier |
| AssemblyAI Nano | $39 | Nano tier |
| AssemblyAI Best | $75 | Full features |

Medium volume: 1,000 hours/month

| Provider | Monthly cost | Notes |
|---|---|---|
| OpenAI Whisper | $360 | Flat rate |
| Groq Whisper | $40 | Flat rate |
| Google STT | $360 | Standard tier |
| AssemblyAI Nano | $390 | Nano tier |
| AssemblyAI Best | $750 | Full features |

High volume: 10,000 hours/month

| Provider | Monthly cost | Notes |
|---|---|---|
| OpenAI Whisper | $3,600 | No volume discount |
| Groq Whisper | $400 | No volume discount |
| Google STT | ~$2,800 | Volume discount kicks in |
| AssemblyAI Nano | $3,900 | Enterprise pricing available |
| AssemblyAI Best | $7,500 | Enterprise pricing available |

At high volume, Groq Whisper is the clear cost winner. Google becomes competitive with its volume discounts. AssemblyAI's premium is justified only if you use the bundled features that would otherwise require separate services.
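The flat-rate rows in the three volume tables can be reproduced with a few lines of arithmetic. This sketch uses only the list prices quoted in this article and leaves out Google, whose tiered pricing needs separate handling. Enterprise discounts are ignored.

```python
# USD per minute of audio, list prices from this article
PRICE_PER_MIN = {
    "OpenAI Whisper": 0.006,
    "Groq Whisper": 0.04 / 60,  # quoted as $0.04 per hour
    "AssemblyAI Nano": 0.0065,
    "AssemblyAI Best": 0.0125,
}


def monthly_cost(provider: str, hours: float) -> float:
    """Monthly cost in USD at the flat list price, rounded to cents."""
    return round(hours * 60 * PRICE_PER_MIN[provider], 2)


if __name__ == "__main__":
    for hours in (100, 1_000, 10_000):
        print(f"--- {hours} hours/month ---")
        for provider in PRICE_PER_MIN:
            print(f"{provider:16s} ${monthly_cost(provider, hours):,.2f}")
```

Swapping in your own monthly hours gives a quick first estimate before negotiating enterprise pricing.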

Speed and Accuracy Comparison

Groq leads speed at 300-450x real-time. Google Chirp 2.0 leads accuracy: 4-5% WER on clean audio, 8-10% on phone calls (3-5 points better than Whisper variants on noisy/accented audio).

Speed and accuracy are the two dimensions that matter most after price. TokenMix.ai benchmarking data from April 2026 shows clear trade-offs.

Processing speed (1 hour of English audio, batch mode):

| Provider | Processing time | Real-time factor |
|---|---|---|
| Groq Whisper | 8-12 sec | ~300-450x |
| AssemblyAI | 15-25 sec | ~144-240x |
| Google STT | 30-50 sec | ~72-120x |
| OpenAI Whisper | 45-60 sec | ~60-80x |
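The real-time factor column is simply audio duration divided by processing time. A short sketch of the calculation, using the processing-time ranges from the table:

```python
AUDIO_SECONDS = 3600  # one hour of audio

# (fastest, slowest) processing time in seconds, from the table above
PROCESSING_SECONDS = {
    "Groq Whisper": (8, 12),
    "AssemblyAI": (15, 25),
    "Google STT": (30, 50),
    "OpenAI Whisper": (45, 60),
}


def real_time_factor(audio_seconds: float, processing_seconds: float) -> float:
    """How many seconds of audio are processed per second of wall-clock time."""
    return audio_seconds / processing_seconds


if __name__ == "__main__":
    for provider, (fast, slow) in PROCESSING_SECONDS.items():
        low = real_time_factor(AUDIO_SECONDS, slow)
        high = real_time_factor(AUDIO_SECONDS, fast)
        print(f"{provider:15s} {low:.0f}-{high:.0f}x")
```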

Word Error Rate (WER) by audio quality:

| Condition | OpenAI | Groq | Google | AssemblyAI |
|---|---|---|---|---|
| Clean studio audio | 5-6% | 5-7% | 4-5% | 5-6% |
| Phone call quality | 10-13% | 11-14% | 8-10% | 9-11% |
| Noisy environment | 15-20% | 16-21% | 12-15% | 13-16% |
| Heavy accent | 12-16% | 13-17% | 9-12% | 10-14% |

Google Chirp 2.0 leads on accuracy across conditions. Groq and OpenAI both run variants of Whisper large-v3, but Groq's turbo variant occasionally shows slightly higher WER. AssemblyAI sits in the middle, with strong accuracy and the added benefit of built-in features.
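If you want to check these numbers against your own audio, WER is straightforward to compute. The sketch below uses the standard definition: word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The lowercasing and whitespace tokenization are simplifying assumptions and not necessarily the normalization behind the figures above.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of words in the reference."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # prev[j] holds the edit distance between the processed ref words and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # ref word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    truth = "the quick brown fox jumps"
    output = "the quick brown box jumps"
    print(f"WER: {word_error_rate(truth, output):.0%}")
```

Running each provider on the same hand-checked transcripts from your own recordings gives a more relevant comparison than any published benchmark.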

Which Speech-to-Text API Should You Pick?

Budget: Groq. Streaming or diarization: Google or AssemblyAI. 125+ languages: Google. Bundled features (diarization + sentiment + summary): AssemblyAI Best. Default for simple use: OpenAI Whisper.

| Your situation | Recommended choice | Why |
|---|---|---|
| Budget is the top priority | Groq Whisper | 9x cheaper than OpenAI at $0.04/hr |
| Need fastest processing | Groq Whisper | 300x+ real-time speed |
| Need diarization + features | AssemblyAI Best | Bundled features save integration time |
| Enterprise, 125+ languages | Google Speech-to-Text | Widest language coverage and compliance options |
| Simple integration, good accuracy | OpenAI Whisper | Most documented, straightforward API |
| Need real-time streaming | Google or AssemblyAI | Both offer streaming; OpenAI and Groq do not |
| High volume (10K+ hrs/mo) | Groq Whisper or Google | Groq on price, Google on volume discounts + features |
| Processing sensitive audio | AssemblyAI or Google | PII redaction built in |
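The table can also be read as a short chain of checks. The sketch below encodes a subset of it as a function. The order in which the requirements are tested is an assumption (hard feature requirements first, then cost and speed), and the parameter names are illustrative.

```python
def recommend(
    needs_wide_language_coverage: bool = False,
    needs_streaming: bool = False,
    needs_diarization: bool = False,
    cost_or_speed_first: bool = False,
) -> str:
    """Map requirements to the recommendation table; check order is an assumption."""
    if needs_wide_language_coverage:
        return "Google Speech-to-Text"
    if needs_streaming:
        return "Google Speech-to-Text or AssemblyAI"
    if needs_diarization:
        return "AssemblyAI Best"
    if cost_or_speed_first:
        return "Groq Whisper"
    return "OpenAI Whisper"  # the simple default


if __name__ == "__main__":
    print(recommend(cost_or_speed_first=True))
    print(recommend(needs_streaming=True, cost_or_speed_first=True))
```

Feature requirements are checked before price because Groq and OpenAI cannot meet a streaming or diarization requirement at any price.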

What's the Bottom Line on Whisper API Pricing?

Start with OpenAI Whisper for simplicity. Move to Groq when speed or cost dominates. Jump to Google or AssemblyAI when streaming, diarization, or language coverage is required. Route via TokenMix.ai to switch on demand.

The Whisper API pricing landscape in 2026 offers clear specialization. Groq Whisper dominates on cost ($0.04/hr) and speed. OpenAI Whisper remains the safe default at $0.006/min with wide language support. Google wins on accuracy and enterprise features. AssemblyAI bundles the most features into a single per-minute price.

For most developers, the practical approach is to start with OpenAI Whisper for its simplicity, then evaluate Groq if cost or speed becomes a bottleneck. If you need diarization or streaming, skip directly to AssemblyAI or Google.

TokenMix.ai provides real-time pricing monitoring across all speech-to-text providers and can route your requests through a unified API endpoint. Check current rates and availability at TokenMix.ai before locking into any single provider.

FAQ

How much does the OpenAI Whisper API cost per minute?

OpenAI Whisper API costs $0.006 per minute of audio input as of April 2026. There is no difference in pricing between languages or audio formats. One hour of transcription costs $0.36.

Is Groq Whisper cheaper than OpenAI Whisper?

Yes. Groq Whisper charges $0.04 per hour compared to OpenAI's $0.36 per hour. That makes Groq approximately 9x cheaper. Both use variants of the same Whisper large-v3 model, so accuracy is comparable.

Which speech-to-text API is most accurate?

Google Cloud Speech-to-Text with Chirp 2.0 achieves the lowest word error rates in most conditions, particularly for phone call audio and accented speech. AssemblyAI Universal-2 and OpenAI Whisper follow closely for clean audio.

Does Whisper API support real-time streaming?

No. Neither OpenAI Whisper nor Groq Whisper supports real-time streaming transcription. For streaming, use Google Cloud Speech-to-Text or AssemblyAI, both of which offer real-time WebSocket-based streaming APIs.

What is the file size limit for Whisper API?

Both OpenAI and Groq Whisper have a 25 MB file upload limit. For longer recordings, you need to split the audio into chunks. AssemblyAI has no practical file size limit for async transcription, and Google supports up to 480 minutes per request.

Can I use Whisper API for speaker diarization?

Whisper API (both OpenAI and Groq) does not include speaker diarization. You need to use a separate diarization model (like pyannote) or choose AssemblyAI or Google Speech-to-Text, which have diarization built in. AssemblyAI supports up to 32 speakers.


Author: TokenMix Research Lab | Last Updated: April 2026 | Data Source: OpenAI Pricing, Groq Pricing, Google Cloud STT Pricing, TokenMix.ai