Whisper API Pricing 2026: OpenAI vs Groq vs Google Speech-to-Text — Cost and Speed Compared

TokenMix Research Lab · 2026-04-10

Whisper API Pricing Compared: Speech-to-Text API Cost Breakdown for Every Budget (2026)

Speech-to-text API pricing ranges from roughly $0.000667 per minute ($0.04 per hour) with [Groq](https://tokenmix.ai/blog/groq-api-pricing) Whisper to $0.006 per minute with OpenAI Whisper, while Google Speech-to-Text and AssemblyAI charge between $0.006 and $0.0065 per minute at standard tiers. The cheapest option depends on your volume, latency requirements, and accuracy needs. This guide breaks down the real costs, speed benchmarks, and accuracy data across four major Whisper API and speech-to-text providers so you can pick the right one without overpaying.

- Quick Comparison: Speech-to-Text API Pricing at a Glance
- Why Whisper API Pricing Varies More Than You Think
- OpenAI Whisper API: The Benchmark Standard
- Groq Whisper: Speed Over Everything
- Google Cloud Speech-to-Text: Enterprise-Grade Accuracy
- AssemblyAI: Best Feature-to-Price Ratio
- Full Comparison Table
- Cost Breakdown by Volume
- Speed and Accuracy Comparison
- How to Choose the Right Speech-to-Text API
- Conclusion
- FAQ

---

Quick Comparison: Speech-to-Text API Pricing at a Glance

| Feature | OpenAI Whisper | Groq Whisper | Google Speech-to-Text | AssemblyAI |
|---------|---------------|-------------|----------------------|------------|
| Price per minute | $0.006 | $0.04/hr ($0.000667/min) | $0.006-$0.024/min | $0.0065/min (Nano), $0.0125/min (Best) |
| Model | Whisper large-v3 | Whisper large-v3-turbo | Chirp 2.0 / V1 | Universal-2 |
| Speed (1hr audio) | ~45-60 sec | ~8-12 sec | ~30-50 sec | ~15-25 sec |
| Languages | 57+ | 57+ | 125+ | 20+ |
| Max file size | 25 MB | 25 MB | 480 min (streaming) | No hard limit |
| Diarization | No | No | Yes | Yes |
| Streaming | No | No | Yes | Yes (real-time) |
| Best for | General use | Speed-critical | Multi-language enterprise | Feature-rich apps |

Why Whisper API Pricing Varies More Than You Think

The sticker price per minute tells half the story. Four hidden factors change your actual speech-to-text API cost significantly.

**Encoding overhead.** Some APIs require specific audio formats. If your source audio needs transcoding first, you spend compute time on conversion before the API ever sees the audio. OpenAI Whisper accepts common formats including mp3, mp4, wav, and webm. Google is pickier about encoding for [streaming](https://tokenmix.ai/blog/ai-api-streaming-guide). AssemblyAI handles most formats natively.

**File size limits.** OpenAI and Groq cap file uploads at 25 MB. For long recordings, you must split files, which adds engineering complexity and potential accuracy loss at segment boundaries. AssemblyAI and Google handle long-form audio without splitting.
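To make the splitting cost concrete, here is a minimal sketch of a chunk planner. It assumes constant-bitrate audio, treats 25 MB as 25 × 2^20 bytes, and adds a small overlap between chunks as one possible way to soften accuracy loss at segment boundaries. The function name, the `overlap_s` and `safety` parameters, and their defaults are illustrative choices, not something any provider prescribes.

```python
MAX_UPLOAD_BYTES = 25 * 1024 * 1024  # 25 MB cap; assumes MB means 2**20 bytes


def plan_chunks(duration_s: float, bitrate_kbps: float,
                overlap_s: float = 2.0, safety: float = 0.95) -> list[tuple[float, float]]:
    """Return (start, end) times in seconds so each chunk stays under the cap.

    Assumes constant-bitrate audio; `safety` leaves headroom for container overhead.
    """
    bytes_per_second = bitrate_kbps * 1000 / 8
    max_chunk_s = MAX_UPLOAD_BYTES * safety / bytes_per_second
    if max_chunk_s <= overlap_s:
        raise ValueError("bitrate too high for the chosen overlap")

    chunks = []
    start = 0.0
    while start < duration_s:
        end = min(start + max_chunk_s, duration_s)
        chunks.append((start, end))
        if end >= duration_s:
            break
        # Step back slightly so words cut at a boundary appear whole in the next chunk.
        start = end - overlap_s
    return chunks


# One hour of 128 kbps audio is about 57.6 MB, so it needs several uploads.
for start, end in plan_chunks(3600, 128):
    print(f"{start:8.2f}s -> {end:8.2f}s")
```

The actual cutting would be done with a tool such as ffmpeg, and overlapping text has to be de-duplicated when the transcripts are merged. Both steps are outside this sketch.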

**Feature bundling.** Speaker diarization, punctuation, sentiment analysis, and topic detection are extras. OpenAI Whisper gives you raw transcription only. AssemblyAI bundles diarization and summarization at higher tiers. Google charges separately for enhanced features.

**Volume discounts.** Google offers committed use discounts at scale. AssemblyAI provides enterprise pricing starting at roughly 500 hours per month. OpenAI and Groq have flat pricing with no published volume tiers.

TokenMix.ai tracks real-time pricing across all four providers. The data in this article reflects April 2026 rates monitored on the platform.

OpenAI Whisper API: The Benchmark Standard

OpenAI Whisper API charges $0.006 per minute of audio input, with no distinction between languages or model variants. This flat pricing makes cost prediction straightforward.

**What it does well:**
- Consistent accuracy across 57+ languages, especially strong on English, Spanish, French, German, and Mandarin
- Simple API interface -- upload a file, get a transcript
- Supports translation mode (any language to English) at the same price
- Widely documented with extensive community support

**Trade-offs:**
- No streaming support -- batch processing only
- No speaker diarization built in
- 25 MB file size limit forces chunking for long recordings
- Processing speed is moderate: roughly 45-60 seconds for one hour of audio
- No word-level timestamps in the default endpoint (available via verbose JSON)

**Best for:** Teams that need reliable multilingual transcription with simple integration and predictable costs. If you do not need real-time streaming or diarization, Whisper API is the safe default.
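As a rough illustration of the "upload a file, get a transcript" flow and the verbose JSON option mentioned above, the sketch below uses the official `openai` Python package. The model id `whisper-1` and the `response_format` / `timestamp_granularities` parameters come from OpenAI's public API reference rather than from this article, so check them against the current docs. The helper names are hypothetical.

```python
from typing import Any


def make_client() -> Any:
    """Build an OpenAI client; it reads OPENAI_API_KEY from the environment."""
    from openai import OpenAI  # third-party package: pip install openai
    return OpenAI()


def transcribe_with_word_timestamps(client: Any, path: str) -> Any:
    """Request verbose JSON so the response can carry word-level timestamps."""
    with open(path, "rb") as audio:
        return client.audio.transcriptions.create(
            model="whisper-1",
            file=audio,
            response_format="verbose_json",
            timestamp_granularities=["word"],
        )
```

Typical usage would be `result = transcribe_with_word_timestamps(make_client(), "meeting.mp3")`, after which the transcript text and the per-word timings should be available on the returned object. The file still has to be under the 25 MB limit.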

TokenMix.ai real-time monitoring shows OpenAI Whisper maintains 99.7% uptime with median latency of 2.1 seconds for files under 5 minutes.

Groq Whisper: Speed Over Everything

Groq runs Whisper large-v3-turbo on its custom LPU hardware and charges $0.04 per hour, which works out to approximately $0.000667 per minute. That makes it roughly 9x cheaper than OpenAI on a per-minute basis. But the real selling point is speed.
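Because the two providers quote prices in different units, it helps to normalize before comparing. The short sketch below converts between per-hour and per-minute rates using only the figures quoted in this article. The function names are illustrative.

```python
OPENAI_PER_MIN = 0.006  # USD per audio minute, as quoted above
GROQ_PER_HOUR = 0.04    # USD per audio hour, as quoted above


def hourly_to_per_minute(rate_per_hour: float) -> float:
    return rate_per_hour / 60


def per_minute_to_hourly(rate_per_minute: float) -> float:
    return rate_per_minute * 60


groq_per_min = hourly_to_per_minute(GROQ_PER_HOUR)
openai_per_hour = per_minute_to_hourly(OPENAI_PER_MIN)
ratio = openai_per_hour / GROQ_PER_HOUR

print(f"Groq:   ${groq_per_min:.6f}/min")      # Groq:   $0.000667/min
print(f"OpenAI: ${openai_per_hour:.2f}/hr")    # OpenAI: $0.36/hr
print(f"OpenAI costs {ratio:.1f}x Groq")       # OpenAI costs 9.0x Groq
```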

**What it does well:**
- Blazing fast: transcribes one hour of audio in 8-12 seconds
- Near real-time processing for batch files
- Same Whisper model architecture, so accuracy is comparable
- Extremely competitive pricing at $0.04/hour

**Trade-offs:**
- Same 25 MB file size limit as OpenAI
- No streaming API
- No diarization or advanced features
- Rate limits can be restrictive on free and lower tiers
- Availability can be inconsistent during peak demand periods
- Limited language support compared to Google

**Best for:** Applications where transcription speed matters more than features. Podcast processing pipelines, meeting transcription queues, and any batch processing scenario where you want results in seconds rather than minutes.

Through TokenMix.ai, you can access Groq Whisper alongside other speech-to-text providers via a unified API, with automatic failover if Groq's capacity is constrained.
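The general idea behind failover can be sketched in a few lines. This is not TokenMix.ai's implementation or API. It is a generic pattern in which each provider is represented by a hypothetical callable that takes an audio path and returns a transcript, and the first one that succeeds wins.

```python
from typing import Callable, Sequence

# Hypothetical signature: audio path in, transcript text out.
Transcriber = Callable[[str], str]


class AllProvidersFailed(RuntimeError):
    """Raised when every provider in the list has errored."""


def transcribe_with_failover(path: str,
                             providers: Sequence[tuple[str, Transcriber]]) -> tuple[str, str]:
    """Try providers in order; return (provider_name, transcript) from the first success."""
    errors = []
    for name, call in providers:
        try:
            return name, call(path)
        except Exception as exc:  # broad on purpose: any provider error triggers failover
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))
```

In practice you would order the list by preference, for example Groq first for price and speed and OpenAI second as the fallback, and you would probably add timeouts and retry limits, which this sketch leaves out.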

Google Cloud Speech-to-Text: Enterprise-Grade Accuracy

Google Cloud Speech-to-Text pricing is tiered. The V1 standard model starts at $0.006 per 15 seconds ($0.024/min) for short audio, but the newer Chirp 2.0 model runs at $0.006 per minute for long audio recognition. The pricing structure is more complex than competitors.

**Pricing tiers (Chirp 2.0, long audio):**

| Usage tier | Price per minute |
|-----------|-----------------|
| 0-500,000 min/month | $0.006 |
| 500,001-1,000,000 min | $0.004 |
| 1,000,001+ min | Contact sales |
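Tiered pricing is easy to misapply, so here is a minimal sketch of how a monthly bill could be computed from the tiers above. It assumes the tiers are graduated, meaning the lower rate applies only to minutes beyond 500,000 rather than to the whole month. That is an assumption to confirm against Google's pricing page. The function and constant names are illustrative.

```python
# (upper bound of the tier in minutes, USD per minute), taken from the table above
CHIRP_TIERS = [(500_000, 0.006), (1_000_000, 0.004)]


def chirp_monthly_cost(minutes: int) -> float:
    """Graduated cost: each rate applies only to the minutes that fall inside its tier."""
    if minutes > CHIRP_TIERS[-1][0]:
        raise ValueError("above 1,000,000 min/month the listed price is 'contact sales'")
    cost = 0.0
    lower = 0
    for upper, rate in CHIRP_TIERS:
        minutes_in_tier = max(0, min(minutes, upper) - lower)
        cost += minutes_in_tier * rate
        lower = upper
    return cost


# 750,000 minutes: 500,000 at $0.006 plus 250,000 at $0.004
print(f"${chirp_monthly_cost(750_000):,.2f}")  # $4,000.00
```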

**What it does well:**
- 125+ languages and variants -- widest coverage available
- Built-in speaker diarization (up to 6 speakers)
- Real-time streaming transcription
- Chirp 2.0 achieves state-of-the-art accuracy on many benchmarks
- Medical and phone call-optimized models available
- Strong integration with Google Cloud ecosystem

**Trade-offs:**
- Complex pricing with multiple tiers, models, and feature add-ons
- Enhanced features (diarization, punctuation) cost extra on V1
- Requires Google Cloud account and project setup
- Data processing agreements needed for enterprise use
- Cold-start latency can be higher than specialized providers

**Best for:** Enterprise teams needing multi-language support, streaming, diarization, and Google Cloud integration. The volume discounts make it cost-effective at scale.

AssemblyAI: Best Feature-to-Price Ratio

AssemblyAI offers two tiers: Nano at $0.0065 per minute for fast, lightweight transcription, and Best at $0.0125 per minute for their most accurate Universal-2 model with all features included.

**What it does well:**
- Speaker diarization, sentiment analysis, topic detection, PII redaction, and summarization included at the Best tier
- Real-time streaming transcription available
- No file size limits for async transcription
- LeMUR integration for asking questions about transcripts using LLMs
- Excellent documentation and developer experience
- Consistent accuracy across accents and noisy environments

**Trade-offs:**
- Language support limited to approximately 20 languages (expanding)
- Best tier at $0.0125/min is roughly 2x OpenAI's price
- No translation mode (transcription only)
- Nano tier trades accuracy for speed and cost

**Best for:** Developers building feature-rich audio applications who need more than raw transcription. The bundled features (diarization, sentiment, summarization) would cost significantly more if assembled separately from other providers.

Full Comparison Table

| Feature | OpenAI Whisper | Groq Whisper | Google STT (Chirp 2.0) | AssemblyAI (Best) |
|---------|---------------|-------------|----------------------|-------------------|
| Price/min | $0.006 | ~$0.000667 | $0.006 | $0.0125 |
| Price/hour | $0.36 | $0.04 | $0.36 | $0.75 |
| Speed (1hr audio) | 45-60s | 8-12s | 30-50s | 15-25s |
| WER (English) | ~8-10% | ~8-10% | ~6-8% | ~7-9% |
| Languages | 57+ | 57+ | 125+ | 20+ |
| Streaming | No | No | Yes | Yes |
| Diarization | No | No | Yes | Yes |
| Summarization | No | No | No | Yes |
| PII redaction | No | No | Via DLP | Yes |
| File size limit | 25 MB | 25 MB | 480 min | No limit |
| Translation | Yes (to EN) | Yes (to EN) | Yes | No |
| Free tier | $5 credit | Limited RPM | 60 min/month | Limited hours |

Cost Breakdown by Volume

Real speech-to-text API cost depends on your monthly volume. Here is what each provider costs at three usage levels.

**Low volume: 100 hours/month**

| Provider | Monthly cost | Notes |
|----------|-------------|-------|
| OpenAI Whisper | $36 | Flat rate |
| Groq Whisper | $4 | Flat rate |
| Google STT | $36 | Standard tier |
| AssemblyAI Nano | $39 | Nano tier |
| AssemblyAI Best | $75 | Full features |

**Medium volume: 1,000 hours/month**

| Provider | Monthly cost | Notes |
|----------|-------------|-------|
| OpenAI Whisper | $360 | Flat rate |
| Groq Whisper | $40 | Flat rate |
| Google STT | $360 | Standard tier |
| AssemblyAI Nano | $390 | Nano tier |
| AssemblyAI Best | $750 | Full features |

**High volume: 10,000 hours/month**

| Provider | Monthly cost | Notes |
|----------|-------------|-------|
| OpenAI Whisper | $3,600 | No volume discount |
| Groq Whisper | $400 | No volume discount |
| Google STT | ~$3,400 | Volume discount kicks in: 500,000 min at $0.006 plus 100,000 min at $0.004, assuming graduated tiers |
| AssemblyAI Nano | $3,900 | Enterprise pricing available |
| AssemblyAI Best | $7,500 | Enterprise pricing available |

At high volume, Groq Whisper is the clear cost winner. Google becomes competitive with its volume discounts. AssemblyAI's premium is justified only if you use the bundled features that would otherwise require separate services.
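The flat-rate rows in these tables are simple multiplication, and a small helper makes it easy to plug in your own volume. The sketch below uses only the per-minute rates quoted in this article. Google is left out because its tiered pricing needs separate handling, and the names are illustrative.

```python
# USD per audio minute, as quoted in this article (flat-rate plans only)
RATES_PER_MIN = {
    "OpenAI Whisper": 0.006,
    "Groq Whisper": 0.04 / 60,  # $0.04 per hour
    "AssemblyAI Nano": 0.0065,
    "AssemblyAI Best": 0.0125,
}


def monthly_costs(hours: float) -> dict[str, float]:
    """Monthly cost in USD per provider for a given number of audio hours."""
    minutes = hours * 60
    return {name: round(rate * minutes, 2) for name, rate in RATES_PER_MIN.items()}


for hours in (100, 1_000, 10_000):
    print(f"{hours:>6} h/month:", monthly_costs(hours))
```

Enterprise or negotiated pricing would replace the flat rates at the top end, so treat the 10,000-hour figures as list-price upper bounds.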

Speed and Accuracy Comparison

Speed and accuracy are the two dimensions that matter most after price. TokenMix.ai benchmarking data from April 2026 shows clear trade-offs.

**Processing speed (1 hour of English audio, batch mode):**

| Provider | Processing time | Real-time factor |
|----------|----------------|-----------------|
| Groq Whisper | 8-12 sec | ~300-450x |
| AssemblyAI | 15-25 sec | ~144-240x |
| Google STT | 30-50 sec | ~72-120x |
| OpenAI Whisper | 45-60 sec | ~60-80x |
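The real-time factor column here is audio duration divided by processing time, so higher means faster. Some sources define the term the other way around, as processing time over audio duration. A quick sketch that reproduces the column from the processing times, with illustrative names:

```python
def real_time_factor(audio_seconds: float, processing_seconds: float) -> float:
    """Speed-up over real time: seconds of audio handled per second of processing."""
    return audio_seconds / processing_seconds


AUDIO_SECONDS = 3600  # one hour of audio

# (fastest, slowest) processing time in seconds, from the table above
TIMINGS = {
    "Groq Whisper": (8, 12),
    "AssemblyAI": (15, 25),
    "Google STT": (30, 50),
    "OpenAI Whisper": (45, 60),
}

for name, (fastest, slowest) in TIMINGS.items():
    low = real_time_factor(AUDIO_SECONDS, slowest)
    high = real_time_factor(AUDIO_SECONDS, fastest)
    print(f"{name}: {low:.0f}-{high:.0f}x")
```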

**Word Error Rate (WER) by audio quality:**

| Condition | OpenAI | Groq | Google | AssemblyAI |
|-----------|--------|------|--------|------------|
| Clean studio audio | 5-6% | 5-7% | 4-5% | 5-6% |
| Phone call quality | 10-13% | 11-14% | 8-10% | 9-11% |
| Noisy environment | 15-20% | 16-21% | 12-15% | 13-16% |
| Heavy accent | 12-16% | 13-17% | 9-12% | 10-14% |

Google Chirp 2.0 leads on accuracy across conditions. Groq and OpenAI use the same underlying Whisper model, but Groq's turbo variant occasionally shows slightly higher WER. AssemblyAI sits in the middle -- strong accuracy with the added benefit of built-in features.
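If you want to check these numbers on your own audio, WER is the word-level edit distance between a reference transcript and the API output, divided by the number of reference words. The sketch below is a minimal version that assumes very simple normalization (lowercasing and whitespace splitting). Published benchmarks usually normalize punctuation and numbers more carefully, so results will not match them exactly.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("the quick brown fox jumps", "the quick brown box jumps"))  # 0.2
```

Averaging this over a few dozen representative recordings per provider gives a more relevant comparison than any published table, since accents, noise, and vocabulary vary by use case.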

How to Choose the Right Speech-to-Text API

| Your situation | Recommended choice | Why |
|---------------|-------------------|-----|
| Budget is the top priority | Groq Whisper | 9x cheaper than OpenAI at $0.04/hr |
| Need fastest processing | Groq Whisper | 300x+ real-time speed |
| Need diarization + features | AssemblyAI Best | Bundled features save integration time |
| Enterprise, 125+ languages | Google Speech-to-Text | Widest language coverage and compliance options |
| Simple integration, good accuracy | OpenAI Whisper | Most documented, straightforward API |
| Need real-time streaming | Google or AssemblyAI | Both offer streaming; OpenAI and Groq do not |
| High volume (10K+ hrs/mo) | Groq Whisper or Google | Groq on price, Google on volume discounts + features |
| Processing sensitive audio | AssemblyAI or Google | PII redaction built in (AssemblyAI) or via DLP (Google) |

**Related:** [Compare all model pricing in our complete LLM API pricing comparison](https://tokenmix.ai/blog/llm-api-pricing-comparison)

Conclusion

The Whisper API pricing landscape in 2026 offers clear specialization. Groq Whisper dominates on cost ($0.04/hr) and speed. OpenAI Whisper remains the safe default at $0.006/min with wide language support. Google wins on accuracy and enterprise features. AssemblyAI provides the richest feature set per dollar.

For most developers, the practical approach is to start with OpenAI Whisper for its simplicity, then evaluate Groq if cost or speed becomes a bottleneck. If you need diarization or streaming, skip directly to AssemblyAI or Google.

TokenMix.ai provides real-time pricing monitoring across all speech-to-text providers and can route your requests through a unified API endpoint. Check current rates and availability at TokenMix.ai before locking into any single provider.

FAQ

How much does the OpenAI Whisper API cost per minute?

OpenAI Whisper API costs $0.006 per minute of audio input as of April 2026. There is no difference in pricing between languages or audio formats. One hour of transcription costs $0.36.

Is Groq Whisper cheaper than OpenAI Whisper?

Yes. Groq Whisper charges $0.04 per hour compared to OpenAI's $0.36 per hour. That makes Groq approximately 9x cheaper. Both use variants of the same Whisper large-v3 model, so accuracy is comparable.

Which speech-to-text API is most accurate?

Google Cloud Speech-to-Text with Chirp 2.0 achieves the lowest word error rates in most conditions, particularly for phone call audio and accented speech. AssemblyAI Universal-2 and OpenAI Whisper follow closely for clean audio.

Does Whisper API support real-time streaming?

No. Neither OpenAI Whisper nor Groq Whisper supports real-time streaming transcription. For streaming, use Google Cloud Speech-to-Text or AssemblyAI, both of which offer real-time WebSocket-based streaming APIs.

What is the file size limit for Whisper API?

Both OpenAI and Groq Whisper have a 25 MB file upload limit. For longer recordings, you need to split the audio into chunks. AssemblyAI has no practical file size limit for async transcription, and Google supports up to 480 minutes per request.

Can I use Whisper API for speaker diarization?

Whisper API (both OpenAI and Groq) does not include speaker diarization. You need to use a separate diarization model (like pyannote) or choose AssemblyAI or Google Speech-to-Text, which have diarization built in. AssemblyAI supports up to 32 speakers.

---

*Author: TokenMix Research Lab | Last Updated: April 2026 | Data Source: [OpenAI Pricing](https://openai.com/api/pricing/), [Groq Pricing](https://groq.com/pricing/), [Google Cloud STT Pricing](https://cloud.google.com/speech-to-text/pricing), [TokenMix.ai](https://tokenmix.ai)*