TokenMix Research Lab · 2026-04-07

AI API Rate Limits Guide: OpenAI, Anthropic, Google, DeepSeek, and Groq Limits Explained (2026)
Last Updated: 2026-04-29
Author: TokenMix Research Lab
OpenAI Tier 4 leads at 10,000 RPM and 30M input TPM on GPT-5.4. Anthropic caps at 4,000 RPM and 400K input TPM, 75x below OpenAI on input tokens. DeepSeek throttles at 300 RPM. Multi-provider routing aggregates limits and is the practical path past 10K RPM on flagship models.
API rate limits are the hidden constraint that determines your real throughput and your real cost. Every AI provider throttles requests by some combination of RPM (requests per minute), TPM (tokens per minute), and RPD (requests per day). OpenAI has six usage tiers (Free plus Tiers 1-5). Anthropic has tier-based scaling. Google ties limits to your billing plan. DeepSeek publishes one flat set of limits, and Groq splits limits by free and paid plan. This guide documents every provider's rate limits in one place, explains how RPM/TPM/RPD interact, and gives you proven strategies to handle limits in production: exponential backoff, request queuing, load balancing, and multi-provider routing. All data verified against official documentation and TokenMix.ai production monitoring, April 2026.
Table of Contents
- Quick Comparison: Rate Limits Across Providers
- How API Rate Limits Work: RPM, TPM, RPD
- OpenAI Rate Limits by Tier
- Anthropic Rate Limits by Tier
- Google Gemini API Rate Limits
- DeepSeek API Rate Limits
- Groq API Rate Limits
- Why Rate Limits Cost You More Than You Think
- Strategies to Handle Rate Limits in Production
- Multi-Provider Routing for Rate Limit Resilience
- Which Provider Should You Pick for Your Throughput?
- What's the Bottom Line on AI API Rate Limits?
- FAQ
Quick Comparison: Rate Limits Across Providers
OpenAI Tier 4 leads with 10K RPM and 30M-150M input TPM. Anthropic Tier 4 maxes at 4K RPM and 400K input TPM, 75x lower input throughput than GPT-5.4. DeepSeek caps at 300 RPM, and Groq's free tier at 30 RPM.
Top-tier rate limits for flagship models, April 2026:
| Provider | Model | RPM | TPM (Input) | TPM (Output) | RPD | Tier Required |
|---|---|---|---|---|---|---|
| OpenAI | GPT-5.4 | 10,000 | 30M | 10M | Unlimited | Tier 4 |
| OpenAI | GPT-5.4 Mini | 10,000 | 150M | 50M | Unlimited | Tier 4 |
| Anthropic | Sonnet 4.6 | 4,000 | 400K | 80K | Unlimited | Tier 4 |
| Anthropic | Haiku 4.5 | 4,000 | 400K | 80K | Unlimited | Tier 4 |
| Google | Gemini 2.5 Flash | 2,000 | 4M | — | Unlimited | Pay-as-you-go |
| DeepSeek | V4 | 300 | 300K | — | Unlimited | Standard |
| Groq | Llama 3.3 70B | 30 | 15K | 15K | 14,400 | Free |
Key takeaway: OpenAI's Tier 4 offers 10,000 RPM with tens of millions of TPM. That is 2.5x Anthropic's 4,000 RPM and 75x its 400K input TPM at the same tier level. But reaching OpenAI Tier 4 takes $250 in cumulative spend and roughly 2-3 months of account history.
How API Rate Limits Work: RPM, TPM, RPD
RPM, TPM, and RPD apply simultaneously — you hit whichever fires first. RPM bottlenecks short-message chatbots; TPM bottlenecks long-document workloads; RPD only matters on free tiers.
Rate limits are simultaneous constraints. You hit whichever limit comes first. Understanding how they interact prevents wasted retries and unexpected throttling.
RPM (Requests Per Minute)
The maximum number of API calls you can make in a rolling 60-second window. Each API call counts as one request, regardless of token count. A 10-token request and a 100,000-token request each count as one RPM unit.
When RPM is your bottleneck: High-frequency, low-token workloads. Chatbots handling many short messages. Classification APIs processing many small inputs. Agent loops making rapid sequential calls.
TPM (Tokens Per Minute)
The maximum number of tokens (input + output combined, or split into separate input/output limits) processable in a rolling 60-second window. This is the throughput constraint for heavy workloads.
When TPM is your bottleneck: Long-document processing. Batch summarization. Code generation with large context windows. Any workload where individual requests consume thousands of tokens.
RPD (Requests Per Day)
A daily cap on total requests. Most providers have removed RPD limits for paid tiers, but free tiers and some providers still enforce them.
When RPD is your bottleneck: Free-tier development and testing. Prototyping before committing to a paid plan. Low-volume production workloads that might accidentally spike.
How Limits Stack
If your OpenAI Tier 1 limits are 500 RPM and 200K TPM for GPT-5.4:
- Sending 500 requests/minute with 100 tokens each = 50K TPM. RPM is the bottleneck.
- Sending 50 requests/minute with 5,000 tokens each = 250K TPM. TPM is the bottleneck (exceeds 200K).
- Both limits apply simultaneously. You must stay under all of them.
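The two cases above can be checked with a few lines of arithmetic. This is a minimal sketch that uses the Tier 1 figures quoted in this section; the `Limits` class and `binding_limit` helper are illustrative names, not part of any provider SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limits:
    rpm: int  # requests per minute
    tpm: int  # tokens per minute


def binding_limit(limits: Limits, requests_per_min: float, tokens_per_request: float):
    """Return (limit name, utilization) for whichever limit is closest to, or past, its cap."""
    rpm_util = requests_per_min / limits.rpm
    tpm_util = requests_per_min * tokens_per_request / limits.tpm
    return ("RPM", rpm_util) if rpm_util >= tpm_util else ("TPM", tpm_util)


tier1 = Limits(rpm=500, tpm=200_000)
print(binding_limit(tier1, 500, 100))   # ('RPM', 1.0): RPM is fully used, TPM only 25%
print(binding_limit(tier1, 50, 5_000))  # ('TPM', 1.25): 25% over the TPM cap at 10% of RPM
```

A utilization above 1.0 means the workload will be throttled on that limit even though the other one has headroom.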
OpenAI Rate Limits by Tier
Six tiers (Free plus Tiers 1-5), gated by cumulative spend ($5 → $1,000) and account age (7-30 days). Tier 5 unlocks 30K RPM and 1B input TPM on Mini, 500x Tier 1's 2M input TPM. On paid tiers, Mini gets 5-10x the input TPM of full GPT-5.4.
OpenAI uses a tier system based on cumulative spending and account age. Higher tiers unlock higher limits.
How to Advance Through OpenAI Tiers
| Tier | Requirement | Typical Timeline |
|---|---|---|
| Free | New account | Immediate |
| Tier 1 | $5 paid | ~7 days after first payment |
| Tier 2 | $50 paid, 7+ days since first payment | ~2-4 weeks |
| Tier 3 | $100 paid, 7+ days since Tier 2 | ~1-2 months |
| Tier 4 | $250 paid, 14+ days since Tier 3 | ~2-3 months |
| Tier 5 | $1,000 paid, 30+ days since Tier 4 | ~3-6 months |
OpenAI GPT-5.4 Rate Limits by Tier
| Tier | RPM | TPM (Input) | TPM (Output) | Batch Queue |
|---|---|---|---|---|
| Free | 3 | 40K | 4K | N/A |
| Tier 1 | 500 | 200K | 20K | 100K TPM |
| Tier 2 | 5,000 | 2M | 200K | 1M TPM |
| Tier 3 | 5,000 | 10M | 2M | 5M TPM |
| Tier 4 | 10,000 | 30M | 10M | 15M TPM |
| Tier 5 | 10,000 | 150M | 30M | 75M TPM |
OpenAI GPT-5.4 Mini Rate Limits by Tier
| Tier | RPM | TPM (Input) | TPM (Output) |
|---|---|---|---|
| Free | 3 | 40K | 16K |
| Tier 1 | 500 | 2M | 400K |
| Tier 2 | 5,000 | 20M | 4M |
| Tier 3 | 5,000 | 100M | 20M |
| Tier 4 | 10,000 | 150M | 50M |
| Tier 5 | 30,000 | 1B | 200M |
On paid tiers, Mini's input TPM limits are 5-10x higher than GPT-5.4's at the same tier, while RPM is identical until Tier 5. This reflects OpenAI's infrastructure allocation: Mini is cheaper to serve and gets proportionally more throughput.
Source: OpenAI Rate Limits Documentation
Anthropic Rate Limits by Tier
Anthropic Tier 4 maxes at 4,000 RPM and 400K input TPM, 75x lower input throughput than OpenAI Tier 4. Tier advancement requires $5-$400 in credit purchases, with 7-14 day waits between tiers.
Anthropic also uses a tier system, but with different advancement criteria and notably lower RPM limits than OpenAI.
Anthropic Tier Advancement
| Tier | Requirement |
|---|---|
| Free | New account |
| Tier 1 | $5 credit purchase |
| Tier 2 | $40 credit purchase, 7+ days since Tier 1 |
| Tier 3 | $200 credit purchase, 7+ days since Tier 2 |
| Tier 4 | $400 credit purchase, 14+ days since Tier 3 |
Anthropic Claude Rate Limits by Tier
Sonnet 4.6 / Opus 4.6:
| Tier | RPM | Input TPM | Output TPM |
|---|---|---|---|
| Free | 5 | 20K | 4K |
| Tier 1 | 50 | 40K | 8K |
| Tier 2 | 1,000 | 80K | 16K |
| Tier 3 | 2,000 | 160K | 32K |
| Tier 4 | 4,000 | 400K | 80K |
Haiku 4.5:
| Tier | RPM | Input TPM | Output TPM |
|---|---|---|---|
| Free | 5 | 25K | 5K |
| Tier 1 | 50 | 50K | 10K |
| Tier 2 | 1,000 | 100K | 20K |
| Tier 3 | 2,000 | 200K | 40K |
| Tier 4 | 4,000 | 400K | 80K |
The Anthropic vs OpenAI gap: At the highest tier, Anthropic Sonnet 4.6 allows 4,000 RPM and 400K input TPM. OpenAI GPT-5.4 at Tier 4 allows 10,000 RPM and 30M input TPM — a 75x difference in input throughput. This is the single biggest rate limit gap between the two leading providers.
For teams requiring high throughput, this difference may force multi-provider architectures or Anthropic-specific workarounds (multiple API keys, enterprise agreements).
Source: Anthropic Rate Limits Documentation
Google Gemini API Rate Limits
Google's free tier offers an unusually generous 1M TPM but caps RPD at 1,500. Pay-as-you-go removes daily caps and unlocks 2,000 RPM Flash / 1,000 RPM Pro.
Google structures Gemini rate limits by billing plan rather than spending tiers.
Gemini 2.5 Flash
| Plan | RPM | TPM | RPD |
|---|---|---|---|
| Free | 15 | 1M | 1,500 |
| Pay-as-you-go | 2,000 | 4M | Unlimited |
Gemini 2.5 Pro
| Plan | RPM | TPM | RPD |
|---|---|---|---|
| Free | 5 | 1M | 25 |
| Pay-as-you-go | 1,000 | 4M | Unlimited |
Google's free tier is surprisingly generous on TPM — 1M tokens per minute even on free plans. But RPD caps (1,500 requests/day for Flash free) limit real usage. The jump to pay-as-you-go removes daily caps and increases RPM substantially.
Google also offers provisioned throughput for enterprise customers with guaranteed capacity and custom rate limits. Contact Google Cloud sales for pricing.
Source: Google AI Rate Limits
DeepSeek API Rate Limits
DeepSeek caps at 300 RPM (V4) and just 50 concurrent requests, the lowest top-end RPM of any paid provider in this guide. Concurrency limits cap throughput regardless of RPM math; multiple keys are the only workaround.
DeepSeek maintains simpler rate limits without a tier system.
| Model | RPM | TPM | RPD | Concurrent Requests |
|---|---|---|---|---|
| DeepSeek V4 | 300 | 300K | Unlimited | 50 |
| DeepSeek R1 | 60 | 300K | Unlimited | 10 |
DeepSeek's top-end limits are the most restrictive of any paid provider covered here. 300 RPM and 50 concurrent requests cap real-world throughput significantly below what the per-token pricing suggests. For high-volume production workloads, you will hit these limits quickly.
Concurrent request limits are the hidden constraint. Even if your RPM math works out, having only 50 requests in flight simultaneously creates a hard ceiling on throughput that cannot be optimized around without multiple API keys.
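To see why concurrency can bind before RPM, a rough estimate from Little's law helps: sustained throughput cannot exceed requests in flight divided by average request latency. The sketch below applies that to the DeepSeek V4 figures from the table; the latency values are hypothetical inputs, not measurements.

```python
def max_rpm_from_concurrency(max_in_flight: int, avg_latency_s: float) -> float:
    """Little's law: throughput = concurrency / latency, converted to requests per minute."""
    return max_in_flight / avg_latency_s * 60


def effective_rpm(rpm_limit: int, max_in_flight: int, avg_latency_s: float) -> float:
    """The usable rate is the lower of the published RPM and the concurrency ceiling."""
    return min(rpm_limit, max_rpm_from_concurrency(max_in_flight, avg_latency_s))


# DeepSeek V4 from the table: 300 RPM, 50 concurrent requests.
# Latencies below are assumed values for illustration only.
for latency_s in (5, 10, 20, 40):
    print(f"{latency_s:>2}s avg latency -> {effective_rpm(300, 50, latency_s):.0f} RPM usable")
```

Under these assumptions, the 300 RPM limit is only reachable while average latency stays at or below 10 seconds; at 20 seconds per request the concurrency cap halves usable throughput to 150 RPM.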
For teams needing DeepSeek's pricing at higher throughput, TokenMix.ai's unified API can aggregate multiple DeepSeek connections and provide automatic failover to alternative providers when DeepSeek limits are reached.
Source: DeepSeek API Documentation
Groq API Rate Limits
Groq free tier is 30 RPM / 14,400 RPD — fine for prototyping, useless for production. Paid scales to 1,000 RPM. Enterprise contracts unlock the LPU's full 500+ tok/sec speed.
Groq offers the fastest inference speed but with strict rate limits, especially on the free tier.
| Model | Free RPM | Free TPM | Free RPD | Paid RPM | Paid TPM |
|---|---|---|---|---|---|
| Llama 3.3 70B | 30 | 15K | 14,400 | 1,000 | 100K |
| Llama 3.1 8B | 30 | 20K | 14,400 | 1,000 | 200K |
| Mixtral 8x7B | 30 | 10K | 14,400 | 1,000 | 100K |
| Gemma 2 9B | 30 | 15K | 14,400 | 1,000 | 100K |
Groq's free tier is excellent for prototyping — 14,400 RPD is enough for substantial testing. But 30 RPM and low TPM limits make the free tier impractical for production.
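A quick calculation shows how the daily cap interacts with the per-minute cap. This sketch uses the Groq free-tier numbers from the table above; the function names are illustrative.

```python
MINUTES_PER_DAY = 24 * 60


def sustained_rpm(rpm_limit: float, rpd_limit: float) -> float:
    """Highest request rate you can hold for a full 24 hours without exhausting RPD."""
    return min(rpm_limit, rpd_limit / MINUTES_PER_DAY)


def minutes_until_daily_cap(rpm_limit: float, rpd_limit: float) -> float:
    """How long the daily budget lasts when sending at the full RPM limit."""
    return rpd_limit / rpm_limit


print(sustained_rpm(30, 14_400))            # 10.0
print(minutes_until_daily_cap(30, 14_400))  # 480.0
```

In other words, 14,400 RPD averages out to 10 requests per minute around the clock, and a client running at the full 30 RPM would exhaust the daily budget in 8 hours.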
Paid Groq limits scale significantly with enterprise agreements. Groq's custom silicon (LPU) can deliver 500+ tokens/sec, but accessing this throughput requires direct enterprise contracts.
Source: Groq Documentation
Why Rate Limits Cost You More Than You Think
Three hidden costs: P95 latency spikes 2-5x at 80% utilization, tier upgrades tax growth, and rate limit handling adds 20-40 hours of engineering ($5K-10K). None show up on your invoice.
Rate limits create three categories of hidden costs that do not appear on your invoice.
1. Queuing Latency
When you hit RPM or TPM limits, requests queue. Each queued request adds latency. For user-facing applications, this latency degrades experience. For batch pipelines, it extends processing time and delays downstream workflows.
Quantified impact: TokenMix.ai's monitoring shows that teams operating at 80%+ of their rate limits experience 2-5x higher P95 latency compared to teams at 50% utilization. The relationship is not linear — it follows queuing theory curves where latency spikes dramatically near capacity.
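The non-linear shape can be illustrated with the simplest textbook queue. The sketch below assumes an M/M/1 model (random arrivals, one server, exponential service times), which is an assumption made here for illustration and not a description of how any provider or TokenMix.ai models latency. In that model, time in system is exponentially distributed with rate `service_rate * (1 - utilization)`, so every latency percentile scales with `1 / (1 - utilization)`.

```python
import math


def mm1_latency_quantile(service_rate: float, utilization: float, q: float = 0.95) -> float:
    """q-quantile of time in system for an M/M/1 queue at the given utilization."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return -math.log(1 - q) / (service_rate * (1 - utilization))


baseline = mm1_latency_quantile(service_rate=1.0, utilization=0.5)
for rho in (0.5, 0.8, 0.9, 0.95):
    ratio = mm1_latency_quantile(1.0, rho) / baseline
    print(f"{rho:.0%} utilization -> {ratio:.1f}x the P95 latency at 50%")
```

Under this model, P95 latency at 80% utilization is 2.5x the 50% baseline and reaches 5x at 90%, which is in line with the 2-5x range reported above.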
2. Tier Upgrade Pressure
Rate limits create artificial urgency to spend more. An OpenAI Tier 1 user hitting 500 RPM needs to spend $50+ just to reach Tier 2 and unlock 5,000 RPM. This is effectively a tax on growth.
3. Architectural Complexity
Working around rate limits requires retry logic, request queuing, backoff algorithms, and monitoring. This engineering overhead costs developer time. TokenMix.ai estimates the average team spends 20-40 hours building and maintaining rate limit handling code — equivalent to $5,000-10,000 in engineering cost.
Strategies to Handle Rate Limits in Production
Five proven strategies, ranked by effectiveness: exponential backoff (mandatory baseline), client-side rate smoothing, request batching, multi-provider routing, multiple API keys (with caution).
Strategy 1: Exponential Backoff with Jitter
The standard approach. When you receive a 429 (rate limited) response, wait before retrying, doubling the wait each time with random jitter to prevent thundering herd.
Implementation pattern:
Base delay: 1 second
Retry 1: 1s + random(0-0.5s)
Retry 2: 2s + random(0-1s)
Retry 3: 4s + random(0-2s)
Retry 4: 8s + random(0-4s)
Retry 5: 16s + random(0-8s)
Max retries: 5
Max delay cap: 60 seconds
When to use: Every production integration should have exponential backoff as the baseline. It is not optional.
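The pattern above could be sketched in Python as follows. `RateLimitError` is a hypothetical stand-in for whatever exception your HTTP client or SDK raises on a 429, and honoring a server-provided `retry_after` value is an addition that goes beyond the listed pattern.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for the error your client raises on HTTP 429."""

    def __init__(self, retry_after=None):
        super().__init__("rate limited")
        self.retry_after = retry_after  # seconds suggested by the server, if any


def backoff_delay(attempt, base=1.0, cap=60.0, rng=random.random):
    """Delay before retry `attempt` (1-based): base * 2^(attempt-1) plus up to 50% jitter."""
    exp = base * 2 ** (attempt - 1)
    return min(cap, exp + rng() * exp / 2)


def call_with_backoff(fn, max_retries=5, sleep=time.sleep):
    """Call fn(); on RateLimitError wait and retry, re-raising after max_retries retries."""
    for attempt in range(1, max_retries + 2):  # one initial call plus max_retries retries
        try:
            return fn()
        except RateLimitError as err:
            if attempt > max_retries:
                raise
            delay = backoff_delay(attempt)
            if err.retry_after is not None:
                delay = max(delay, err.retry_after)  # never retry sooner than the server asks
            sleep(delay)
```

Wrap the provider call in a zero-argument function, for example `call_with_backoff(lambda: send_request(payload))`, where `send_request` is your own function that raises `RateLimitError` on a 429.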
Strategy 2: Request Queuing and Rate Smoothing
Instead of sending requests as fast as possible and handling 429s, pre-throttle your request rate to stay below limits.
Implementation approach:
- Maintain a token bucket or leaky bucket rate limiter client-side
- Set the bucket rate to 80% of your RPM/TPM limit (safety margin)
- Queue excess requests and drain at the controlled rate
- Monitor queue depth — if it grows continuously, you need higher limits or multi-provider routing
When to use: High-volume production workloads where 429 retries would create unacceptable latency variance.
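A client-side token bucket along these lines might look like the sketch below. The class and parameter names are illustrative. One design choice is an assumption of this sketch: the bucket refills at the safety fraction of the limit (80% by default) and its burst capacity is the remaining fraction, so a full bucket plus 60 seconds of refill never exceeds the provider's per-minute limit.

```python
import time


class TokenBucket:
    """Client-side limiter: refill at `safety` x the limit, burst up to the remainder."""

    def __init__(self, limit_per_min, safety=0.8, clock=time.monotonic):
        self.refill_per_s = limit_per_min * safety / 60.0
        self.capacity = limit_per_min * (1.0 - safety)
        self.tokens = self.capacity
        self.clock = clock
        self.last = clock()

    def _refill(self):
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.refill_per_s)
        self.last = now

    def try_acquire(self, cost=1.0):
        """Take `cost` units if available (1 per request for RPM, token count for TPM)."""
        self._refill()
        if cost <= self.tokens:
            self.tokens -= cost
            return True
        return False

    def wait_time(self, cost=1.0):
        """Seconds until `cost` units will be available; 0.0 if they already are."""
        self._refill()
        return max(0.0, (cost - self.tokens) / self.refill_per_s)


rpm_bucket = TokenBucket(limit_per_min=500)      # e.g. a 500 RPM limit
tpm_bucket = TokenBucket(limit_per_min=200_000)  # e.g. a 200K TPM limit
```

Before each call, acquire one unit from the RPM bucket and the estimated token count from the TPM bucket; when either acquisition fails, keep the request queued for `wait_time(...)` seconds. A single request that costs more than the bucket's capacity can never be acquired, so size the TPM bucket with your largest request in mind.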
Strategy 3: Request Batching
Combine multiple small requests into fewer large ones. This reduces RPM consumption without changing total TPM.
Example: Instead of 100 separate classification requests (100 RPM consumed), batch 10 items per request (10 RPM consumed, same total tokens). Most providers support this through system prompt design — include multiple items in one prompt and parse structured output.
When to use: When RPM is your bottleneck and individual requests are small.
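As a rough sketch of the prompt-level batching described above, the helpers below group items, build one numbered prompt per group, and parse a JSON array back out. The prompt wording, the sentiment labels, and the function names are all illustrative assumptions; real replies need stricter validation than shown here.

```python
import json


def chunk(items, size):
    """Split items into consecutive groups of at most `size`."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def build_batch_prompt(items):
    """One prompt that asks for a label per numbered item, returned as a JSON array."""
    numbered = "\n".join(f"{i}. {text}" for i, text in enumerate(items, start=1))
    return (
        "Classify each numbered item as positive or negative. "
        "Reply with only a JSON array of labels, one per item, in order.\n" + numbered
    )


def parse_batch_reply(reply, expected_count):
    """Parse the model's JSON array and check that it has one label per item."""
    labels = json.loads(reply)
    if not isinstance(labels, list) or len(labels) != expected_count:
        raise ValueError("reply does not contain one label per item")
    return labels


reviews = [f"review {n}" for n in range(100)]
batches = chunk(reviews, 10)
print(len(batches))  # 10 requests instead of 100
```

If parsing fails for a batch, a reasonable fallback is to resend that batch's items individually, which costs extra requests only on the failure path.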
Strategy 4: Multi-Provider Load Balancing
Distribute requests across multiple providers to aggregate their rate limits. If OpenAI gives you 10,000 RPM and Anthropic gives you 4,000 RPM, routing across both gives you an effective 14,000 RPM ceiling.
When to use: When a single provider's limits are insufficient and your workload can tolerate model variation. TokenMix.ai's unified API handles this automatically — see next section.
Strategy 5: Multiple API Keys
Some providers allow multiple API keys under one organization. Each key may receive independent rate limits in certain configurations. Check provider terms of service — some explicitly prohibit this approach.
When to use: With caution. Verify provider policies first. More sustainable to upgrade tiers or use multi-provider routing.
Multi-Provider Routing for Rate Limit Resilience
Routing across OpenAI Tier 3 + Anthropic Tier 3 + Google PAYG yields ~9,000 effective RPM, close to OpenAI Tier 4 alone, but with built-in failover. Solves rate limits and cost optimization in one architecture.
The most robust rate limit strategy is not working around a single provider's limits — it is distributing across providers.
How multi-provider routing works:
- Define your model requirements (quality threshold, max latency, max cost)
- Map equivalent models across providers (e.g., GPT-5.4 Mini / Haiku 4.5 / Gemini Flash for budget tasks)
- Route requests to the provider with the most available capacity
- Fail over automatically when one provider hits limits
TokenMix.ai's approach: The platform maintains real-time rate limit utilization tracking across all connected providers. When you send a request through TokenMix.ai's unified API, it routes to the provider with the most headroom. If OpenAI is at 90% RPM utilization, the request goes to Anthropic or Google instead.
Effective capacity multiplication: A team with Tier 3 accounts at OpenAI (5,000 RPM) and Anthropic (2,000 RPM) plus Google pay-as-you-go (2,000 RPM) gets an effective 9,000 RPM through multi-provider routing, nearly as much as OpenAI Tier 4 alone (10,000 RPM).
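The headroom-based routing idea can be sketched in a few lines. This is a simplified illustration, not TokenMix.ai's implementation: the `Provider` class and `pick_provider` function are hypothetical, only RPM is tracked, and resetting the per-minute counters is left out.

```python
from dataclasses import dataclass


@dataclass
class Provider:
    name: str
    rpm_limit: int
    used_this_minute: int = 0

    @property
    def headroom(self) -> float:
        """Fraction of this provider's RPM budget still unused."""
        return 1.0 - self.used_this_minute / self.rpm_limit


def pick_provider(providers):
    """Route to the provider with the most headroom; return None if all are exhausted."""
    candidates = [p for p in providers if p.used_this_minute < p.rpm_limit]
    if not candidates:
        return None
    best = max(candidates, key=lambda p: p.headroom)
    best.used_this_minute += 1
    return best


pool = [Provider("openai", 5_000), Provider("anthropic", 2_000), Provider("google", 2_000)]
print(sum(p.rpm_limit for p in pool))  # 9000 aggregate RPM ceiling
```

Returning `None` when every provider is exhausted gives the caller a clear signal to queue or back off. The aggregate ceiling is only reachable when every request can be served by any of the mapped models.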
Cost implication: Multi-provider routing also enables cost optimization. Route budget tasks to the cheapest available provider. Route quality-critical tasks to the best-performing model. Rate limit management and cost optimization converge into the same architecture.
Which Provider Should You Pick for Your Throughput?
Under 50 RPM: any provider works. 500-5,000 RPM: OpenAI Tier 2-4. Above 10,000 RPM: no single provider is reliable — use multi-provider routing via TokenMix.ai.
| Your throughput need | Recommended approach | Why |
|---|---|---|
| Under 50 RPM | Any provider, Tier 1 | All providers handle this easily |
| 50-500 RPM | OpenAI Tier 1 or Google Pay-as-you-go | Best low-tier limits |
| 500-2,000 RPM | OpenAI Tier 2-3 | Highest RPM at mid tiers |
| 2,000-5,000 RPM | OpenAI Tier 3-4 | Only provider covering the full range (Anthropic tops out at 4,000 RPM) |
| 5,000-10,000 RPM | OpenAI Tier 4-5 or multi-provider | Single provider or distributed |
| 10,000+ RPM | Multi-provider via TokenMix.ai | No single provider reliably serves this |
| High TPM, low RPM | OpenAI or Google | Highest input TPM relative to RPM (30M and 4M at top tier, vs Anthropic's 400K) |
| Maximum throughput at low cost | Multi-provider with DeepSeek + Flash | Aggregate cheap capacity |
| Enterprise SLA required | Direct enterprise agreement | Custom limits, guaranteed capacity |
What's the Bottom Line on AI API Rate Limits?
Rate limits are an architectural forcing function, not just a technical constraint. Implement exponential backoff from day one, scale through tiers, and pivot to multi-provider routing once any single provider becomes the bottleneck.
Rate limits are not just a technical constraint — they are a cost multiplier and an architectural forcing function. OpenAI offers the highest limits at top tiers (10,000+ RPM, 150M+ TPM) but requires months of spending to unlock them. Anthropic's limits are significantly lower (4,000 RPM max) but sufficient for most individual applications. Google offers generous TPM on free tiers. DeepSeek and Groq are heavily constrained.
For teams outgrowing a single provider's limits, multi-provider routing through TokenMix.ai multiplies your effective capacity by aggregating limits across providers. This approach simultaneously solves rate limit constraints, provides failover resilience, and enables cost optimization through intelligent routing.
The pragmatic path: start with one provider, implement exponential backoff from day one, upgrade tiers as volume grows, and switch to multi-provider routing when any single provider becomes a bottleneck. TokenMix.ai's real-time monitoring tracks your utilization across all providers and alerts you before you hit limits.
Do not build rate limit handling as an afterthought. It is a core infrastructure concern that directly impacts your application's reliability and cost.
FAQ
What happens when I hit an API rate limit?
The provider returns an HTTP 429 (Too Many Requests) response, often with a Retry-After header indicating how long to wait. Your application should implement exponential backoff: wait, then retry with increasing delays, honoring Retry-After when it is present. Without retry logic, your requests simply fail.
How do I check my current OpenAI rate limit tier?
Go to your OpenAI dashboard under Settings > Limits. It shows your current tier, rate limits for each model, and the requirements to advance to the next tier. You can also check limits programmatically via response headers (x-ratelimit-limit-*, x-ratelimit-remaining-*).
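As a minimal sketch of the programmatic check, the function below computes utilization from a response's headers. It assumes the `requests` and `tokens` suffixes for the `x-ratelimit-limit-*` and `x-ratelimit-remaining-*` header families and that both values are plain integers; verify the exact names against the responses you actually receive.

```python
def rate_limit_utilization(headers):
    """Return {'requests': fraction_used, 'tokens': fraction_used} for headers that are present."""
    lowered = {k.lower(): v for k, v in headers.items()}  # header names are case-insensitive
    usage = {}
    for kind in ("requests", "tokens"):
        limit = lowered.get(f"x-ratelimit-limit-{kind}")
        remaining = lowered.get(f"x-ratelimit-remaining-{kind}")
        if limit is None or remaining is None:
            continue
        usage[kind] = 1.0 - int(remaining) / int(limit)
    return usage


sample = {
    "x-ratelimit-limit-requests": "500",
    "x-ratelimit-remaining-requests": "100",
}
print(rate_limit_utilization(sample))  # requests are about 80% used; no token headers present
```

Logging this value on every response is a cheap way to see how close you are running to a limit before 429s start appearing.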
Why are Anthropic rate limits lower than OpenAI?
Anthropic's published limits (4,000 RPM max at Tier 4 vs OpenAI's 10,000+ RPM) reflect different infrastructure scaling strategies. Anthropic focuses on per-request quality and safety, while OpenAI has scaled infrastructure for higher throughput. Enterprise Anthropic customers can negotiate custom limits.
Can I increase my rate limits without spending more?
Tier advancement at OpenAI and Anthropic requires cumulative spending. There is no way to skip tiers. However, you can effectively increase your available throughput by routing across providers (aggregating their limits), using the Batch API (separate, higher limits), and batching requests (reducing RPM consumption).
What is the difference between RPM, TPM, and RPD?
RPM (requests per minute) limits the number of API calls. TPM (tokens per minute) limits total token throughput. RPD (requests per day) is a daily cap. All three apply simultaneously — you hit whichever limit comes first. Most production workloads are bottlenecked by TPM on large requests or RPM on many small requests.
How does TokenMix.ai handle rate limits?
TokenMix.ai monitors real-time rate limit utilization across all connected providers. When one provider approaches its limits, the platform automatically routes requests to providers with available capacity. This multi-provider approach effectively multiplies your aggregate rate limits while maintaining a single API integration.
Author: TokenMix Research Lab | Last Updated: April 2026 | Data Source: OpenAI Rate Limits, Anthropic Rate Limits, Google AI Pricing, TokenMix.ai