AI API Rate Limits Guide: OpenAI, Anthropic, Google, DeepSeek, and Groq Limits Explained (2026)
API rate limits are the hidden constraint that determines your real throughput and your real cost. Every AI provider throttles requests by RPM (requests per minute), TPM (tokens per minute), and RPD (requests per day). OpenAI scales limits across five paid usage tiers. Anthropic has tier-based scaling. Google ties limits to your billing plan. DeepSeek and Groq have fixed limits with no tier system. This guide documents every provider's rate limits in one place, explains how RPM/TPM/RPD interact, and gives you proven strategies to handle limits in production: exponential backoff, request queuing, load balancing, and multi-provider routing. All data verified against official documentation and TokenMix.ai production monitoring, April 2026.
Table of Contents
[Quick Comparison: Rate Limits Across Providers]
[How API Rate Limits Work: RPM, TPM, RPD]
[OpenAI Rate Limits by Tier]
[Anthropic Rate Limits by Tier]
[Google Gemini API Rate Limits]
[DeepSeek API Rate Limits]
[Groq API Rate Limits]
[Why Rate Limits Cost You More Than You Think]
[Strategies to Handle Rate Limits in Production]
[Multi-Provider Routing for Rate Limit Resilience]
[How to Choose a Provider Based on Rate Limits]
[Conclusion]
[FAQ]
Quick Comparison: Rate Limits Across Providers
Top-tier rate limits for flagship models, April 2026:
| Provider | Model | RPM | TPM (Input) | TPM (Output) | RPD | Tier Required |
|---|---|---|---|---|---|---|
| OpenAI | GPT-5.4 | 10,000 | 30M | 10M | Unlimited | Tier 4 |
| OpenAI | GPT-5.4 Mini | 10,000 | 150M | 50M | Unlimited | Tier 4 |
| Anthropic | Sonnet 4.6 | 4,000 | 400K | 80K | Unlimited | Tier 4 |
| Anthropic | Haiku 4.5 | 4,000 | 400K | 80K | Unlimited | Tier 4 |
| Google | Gemini 2.5 Flash | 2,000 | 4M | — | Unlimited | Pay-as-you-go |
| DeepSeek | V4 | 300 | 300K | — | Unlimited | Standard |
| Groq | Llama 3.3 70B | 30 | 15K | 15K | 14,400 | Free |
Key takeaway: OpenAI's Tier 4 offers 10,000 RPM and 30M input TPM, against Anthropic's 4,000 RPM and 400K input TPM at the same tier level: 2.5x the requests and 75x the token throughput. But getting to OpenAI Tier 4 requires significant spending history.
How API Rate Limits Work: RPM, TPM, RPD
Rate limits are simultaneous constraints. You hit whichever limit comes first. Understanding how they interact prevents wasted retries and unexpected throttling.
RPM (Requests Per Minute)
The maximum number of API calls you can make in a rolling 60-second window. Each API call counts as one request, regardless of token count. A 10-token request and a 100,000-token request each count as one RPM unit.
When RPM is your bottleneck: High-frequency, low-token workloads. Chatbots handling many short messages. Classification APIs processing many small inputs. Agent loops making rapid sequential calls.
TPM (Tokens Per Minute)
The maximum number of tokens (input + output combined, or split into separate input/output limits) processable in a rolling 60-second window. This is the throughput constraint for heavy workloads.
When TPM is your bottleneck: Long-document processing. Batch summarization. Code generation with large context windows. Any workload where individual requests consume thousands of tokens.
RPD (Requests Per Day)
A daily cap on total requests. Most providers have removed RPD limits for paid tiers, but free tiers and some providers still enforce them.
When RPD is your bottleneck: Free-tier development and testing. Prototyping before committing to a paid plan. Low-volume production workloads that might accidentally spike.
How Limits Stack
If your OpenAI Tier 1 limits are 500 RPM and 200K TPM for GPT-5.4:
Sending 500 requests/minute with 100 tokens each = 50K TPM. RPM is the bottleneck.
Sending 50 requests/minute with 5,000 tokens each = 250K TPM. TPM is the bottleneck (exceeds 200K).
Both limits apply simultaneously. You must stay under all of them.
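The arithmetic above can be wrapped in a small helper. This is a sketch using the Tier 1 example numbers from the text as defaults; substitute your own tier's limits:

```python
def binding_limit(requests_per_min: int, avg_tokens_per_request: int,
                  rpm_limit: int = 500, tpm_limit: int = 200_000):
    """Report which limit a steady workload saturates first.

    Defaults are the Tier 1 GPT-5.4 example from the text.
    Returns the bottleneck name and its utilization ratio.
    """
    rpm_util = requests_per_min / rpm_limit
    tpm_util = requests_per_min * avg_tokens_per_request / tpm_limit
    bottleneck = "RPM" if rpm_util >= tpm_util else "TPM"
    return bottleneck, max(rpm_util, tpm_util)

print(binding_limit(500, 100))    # ('RPM', 1.0): at the request cap
print(binding_limit(50, 5_000))   # ('TPM', 1.25): 250K tokens vs a 200K limit
```

Anything above a utilization of 1.0 means that limit will throttle you; run both numbers before provisioning a workload.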
OpenAI Rate Limits by Tier
OpenAI uses a tier system based on cumulative spending and account age. Higher tiers unlock higher limits.
How to Advance Through OpenAI Tiers
| Tier | Requirement | Typical Timeline |
|---|---|---|
| Free | New account | Immediate |
| Tier 1 | $5 paid | ~7 days after first payment |
| Tier 2 | $50 paid, 7+ days since first payment | ~2-4 weeks |
| Tier 3 | $100 paid, 7+ days since Tier 2 | ~1-2 months |
| Tier 4 | $250 paid, 14+ days since Tier 3 | ~2-3 months |
| Tier 5 | $1,000 paid, 30+ days since Tier 4 | ~3-6 months |
OpenAI GPT-5.4 Rate Limits by Tier
| Tier | RPM | TPM (Input) | TPM (Output) | Batch Queue |
|---|---|---|---|---|
| Free | 3 | 40K | 4K | N/A |
| Tier 1 | 500 | 200K | 20K | 100K TPM |
| Tier 2 | 5,000 | 2M | 200K | 1M TPM |
| Tier 3 | 5,000 | 10M | 2M | 5M TPM |
| Tier 4 | 10,000 | 30M | 10M | 15M TPM |
| Tier 5 | 10,000 | 150M | 30M | 75M TPM |
OpenAI GPT-5.4 Mini Rate Limits by Tier
| Tier | RPM | TPM (Input) | TPM (Output) |
|---|---|---|---|
| Free | 3 | 40K | 16K |
| Tier 1 | 500 | 2M | 400K |
| Tier 2 | 5,000 | 20M | 4M |
| Tier 3 | 5,000 | 100M | 20M |
| Tier 4 | 10,000 | 150M | 50M |
| Tier 5 | 30,000 | 1B | 200M |
Mini's token limits are 5-10x higher than GPT-5.4's at the same tier (RPM only diverges at Tier 5). This reflects OpenAI's infrastructure allocation: Mini is cheaper to serve and gets proportionally more throughput.
Anthropic Rate Limits by Tier
Anthropic also uses a tier system, but with different advancement criteria and notably lower RPM limits than OpenAI.
Anthropic Tier Advancement
| Tier | Requirement |
|---|---|
| Free | New account |
| Tier 1 | $5 credit purchase |
| Tier 2 | $40 credit purchase, 7+ days since Tier 1 |
| Tier 3 | $200 credit purchase, 7+ days since Tier 2 |
| Tier 4 | $400 credit purchase, 14+ days since Tier 3 |
Anthropic Claude Rate Limits by Tier
Sonnet 4.6 / Opus 4.6:
| Tier | RPM | Input TPM | Output TPM |
|---|---|---|---|
| Free | 5 | 20K | 4K |
| Tier 1 | 50 | 40K | 8K |
| Tier 2 | 1,000 | 80K | 16K |
| Tier 3 | 2,000 | 160K | 32K |
| Tier 4 | 4,000 | 400K | 80K |
Haiku 4.5:
| Tier | RPM | Input TPM | Output TPM |
|---|---|---|---|
| Free | 5 | 25K | 5K |
| Tier 1 | 50 | 50K | 10K |
| Tier 2 | 1,000 | 100K | 20K |
| Tier 3 | 2,000 | 200K | 40K |
| Tier 4 | 4,000 | 400K | 80K |
The Anthropic vs OpenAI gap: At the highest tier, Anthropic Sonnet 4.6 allows 4,000 RPM and 400K input TPM. OpenAI GPT-5.4 at Tier 4 allows 10,000 RPM and 30M input TPM — a 75x difference in input throughput. This is the single biggest rate limit gap between the two leading providers.
For teams requiring high throughput, this difference may force multi-provider architectures or Anthropic-specific workarounds (multiple API keys, enterprise agreements).
Google Gemini API Rate Limits
Google structures Gemini rate limits by billing plan rather than spending tiers.
Gemini 2.5 Flash
| Plan | RPM | TPM | RPD |
|---|---|---|---|
| Free | 15 | 1M | 1,500 |
| Pay-as-you-go | 2,000 | 4M | Unlimited |
Gemini 2.5 Pro
| Plan | RPM | TPM | RPD |
|---|---|---|---|
| Free | 5 | 1M | 25 |
| Pay-as-you-go | 1,000 | 4M | Unlimited |
Google's free tier is surprisingly generous on TPM — 1M tokens per minute even on free plans. But RPD caps (1,500 requests/day for Flash free) limit real usage. The jump to pay-as-you-go removes daily caps and increases RPM substantially.
Google also offers provisioned throughput for enterprise customers with guaranteed capacity and custom rate limits. Contact Google Cloud sales for pricing.
DeepSeek API Rate Limits
DeepSeek maintains simpler rate limits without a tier system.
| Model | RPM | TPM | RPD | Concurrent Requests |
|---|---|---|---|---|
| DeepSeek V4 | 300 | 300K | Unlimited | 50 |
| DeepSeek R1 | 60 | 300K | Unlimited | 10 |
DeepSeek's limits are the most restrictive of any paid provider. 300 RPM and 50 concurrent requests cap real-world throughput significantly below what the per-token pricing suggests. For high-volume production workloads, you will hit these limits quickly.
Concurrent request limits are the hidden constraint. Even if your RPM math works out, having only 50 requests in flight simultaneously creates a hard ceiling on throughput that cannot be optimized around without multiple API keys.
For teams needing DeepSeek's pricing at higher throughput, TokenMix.ai's unified API can aggregate multiple DeepSeek connections and provide automatic failover to alternative providers when DeepSeek limits are reached.
Groq API Rate Limits
Groq offers the fastest inference speed but with strict rate limits, especially on the free tier.
| Model | Free RPM | Free TPM | Free RPD | Paid RPM | Paid TPM |
|---|---|---|---|---|---|
| Llama 3.3 70B | 30 | 15K | 14,400 | 1,000 | 100K |
| Llama 3.1 8B | 30 | 20K | 14,400 | 1,000 | 200K |
| Mixtral 8x7B | 30 | 10K | 14,400 | 1,000 | 100K |
| Gemma 2 9B | 30 | 15K | 14,400 | 1,000 | 100K |
Groq's free tier is excellent for prototyping — 14,400 RPD is enough for substantial testing. But 30 RPM and low TPM limits make the free tier impractical for production.
Paid Groq limits scale significantly with enterprise agreements. Groq's custom silicon (LPU) can deliver 500+ tokens/sec, but accessing this throughput requires direct enterprise contracts.
Why Rate Limits Cost You More Than You Think
Rate limits create three categories of hidden costs that do not appear on your invoice.
1. Queuing Latency
When you hit RPM or TPM limits, requests queue. Each queued request adds latency. For user-facing applications, this latency degrades experience. For batch pipelines, it extends processing time and delays downstream workflows.
Quantified impact: TokenMix.ai's monitoring shows that teams operating at 80%+ of their rate limits experience 2-5x higher P95 latency compared to teams at 50% utilization. The relationship is not linear — it follows queuing theory curves where latency spikes dramatically near capacity.
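The queuing-theory point can be made concrete with the textbook M/M/1 waiting-time formula. This is a deliberate simplification of real API queues, but it shows why latency explodes near capacity:

```python
def mm1_time_in_system(utilization: float, service_time_s: float = 1.0) -> float:
    """Average time in an M/M/1 queue: T = S / (1 - rho).

    rho is utilization (0 <= rho < 1); wait time diverges as rho -> 1.
    """
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    return service_time_s / (1.0 - utilization)

for rho in (0.50, 0.80, 0.95):
    print(f"{rho:.0%} utilization -> {mm1_time_in_system(rho):.1f}x service time")
```

At 50% utilization, time in system is 2x the bare service time; at 80% it is 5x; at 95% it is 20x, which matches the non-linear degradation observed in production monitoring.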
2. Tier Upgrade Pressure
Rate limits create artificial urgency to spend more. An OpenAI Tier 1 user hitting 500 RPM needs to spend $50+ just to reach Tier 2 and unlock 5,000 RPM. This is effectively a tax on growth.
3. Architectural Complexity
Working around rate limits requires retry logic, request queuing, backoff algorithms, and monitoring. This engineering overhead costs developer time. TokenMix.ai estimates the average team spends 20-40 hours building and maintaining rate limit handling code — equivalent to $5,000-10,000 in engineering cost.
Strategies to Handle Rate Limits in Production
Strategy 1: Exponential Backoff with Jitter
The standard approach. When you receive a 429 (rate limited) response, wait before retrying, doubling the wait each time with random jitter to prevent thundering herd.
Implementation pattern:
Base delay: 1 second
Retry 1: 1s + random(0-0.5s)
Retry 2: 2s + random(0-1s)
Retry 3: 4s + random(0-2s)
Retry 4: 8s + random(0-4s)
Max retries: 5
Max delay cap: 60 seconds
When to use: Every production integration should have exponential backoff as the baseline. It is not optional.
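The pattern above translates to a few lines of Python. `RateLimitError` here is a stand-in for whatever 429 exception your SDK raises (the OpenAI Python client, for example, raises `openai.RateLimitError`):

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for your SDK's 429 exception."""

def call_with_backoff(send_request, max_retries: int = 5,
                      base_delay: float = 1.0, max_delay: float = 60.0):
    """Call send_request(), retrying 429s with exponential backoff plus jitter."""
    for attempt in range(max_retries + 1):
        try:
            return send_request()
        except RateLimitError:
            if attempt == max_retries:
                raise  # out of retries; surface the 429 to the caller
            # doubling delays with up to 50% jitter, per the pattern above
            delay = min(base_delay * 2 ** attempt, max_delay)
            time.sleep(delay + random.uniform(0, delay / 2))
```

Pass any zero-argument callable (e.g. a `lambda` wrapping your API call); the jitter keeps many clients from retrying in lockstep after a shared throttling event.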
Strategy 2: Request Queuing and Rate Smoothing
Instead of sending requests as fast as possible and handling 429s, pre-throttle your request rate to stay below limits.
Implementation approach:
Maintain a token bucket or leaky bucket rate limiter client-side
Set the bucket rate to 80% of your RPM/TPM limit (safety margin)
Queue excess requests and drain at the controlled rate
Monitor queue depth — if it grows continuously, you need higher limits or multi-provider routing
When to use: High-volume production workloads where 429 retries would create unacceptable latency variance.
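A minimal token-bucket limiter along those lines, assuming a single process and one model's RPM limit (TPM smoothing works the same way, with token counts instead of request counts):

```python
import threading
import time

class RequestBucket:
    """Client-side token bucket that drains requests at a fraction of the RPM limit."""

    def __init__(self, rpm_limit: int, safety: float = 0.8):
        self.capacity = rpm_limit * safety        # 80% safety margin
        self.refill_per_s = self.capacity / 60.0  # bucket refills over one minute
        self.tokens = self.capacity
        self.last = time.monotonic()
        self.lock = threading.Lock()

    def acquire(self) -> None:
        """Block until this request is allowed to go out."""
        while True:
            with self.lock:
                now = time.monotonic()
                self.tokens = min(self.capacity,
                                  self.tokens + (now - self.last) * self.refill_per_s)
                self.last = now
                if self.tokens >= 1.0:
                    self.tokens -= 1.0
                    return
                wait = (1.0 - self.tokens) / self.refill_per_s
            time.sleep(wait)

limiter = RequestBucket(rpm_limit=500)  # OpenAI Tier 1 example
limiter.acquire()  # call before every outbound API request
```

For multi-process deployments the bucket state has to live somewhere shared (Redis is a common choice); the per-process version above is the simplest correct starting point.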
Strategy 3: Request Batching
Combine multiple small requests into fewer large ones. This reduces RPM consumption without changing total TPM.
Example: Instead of 100 separate classification requests (100 RPM consumed), batch 10 items per request (10 RPM consumed, same total tokens). Most providers support this through system prompt design — include multiple items in one prompt and parse structured output.
When to use: When RPM is your bottleneck and individual requests are small.
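One way to sketch this for a classification workload; the prompt wording and the ten-per-batch size are illustrative choices, not provider requirements:

```python
def build_batch_prompts(items: list[str], batch_size: int = 10) -> list[str]:
    """Pack many small classification inputs into fewer prompts.

    100 items at batch_size=10 become 10 requests: one tenth the RPM
    consumption for the same total token volume.
    """
    prompts = []
    for start in range(0, len(items), batch_size):
        batch = items[start:start + batch_size]
        numbered = "\n".join(f"{i}. {text}" for i, text in enumerate(batch, 1))
        prompts.append(
            "Classify each numbered item as positive or negative. "
            "Reply with a JSON array of labels, in the same order.\n" + numbered
        )
    return prompts

reviews = [f"review text {i}" for i in range(100)]
print(len(build_batch_prompts(reviews)))  # 10 requests instead of 100
```

The trade-off is failure granularity: one malformed response now affects a whole batch, so keep batches small enough that a retry is cheap.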
Strategy 4: Multi-Provider Load Balancing
Distribute requests across multiple providers to aggregate their rate limits. If OpenAI gives you 10,000 RPM and Anthropic gives you 4,000 RPM, routing across both gives you an effective 14,000 RPM ceiling.
When to use: When a single provider's limits are insufficient and your workload can tolerate model variation. TokenMix.ai's unified API handles this automatically — see next section.
Strategy 5: Multiple API Keys
Some providers allow multiple API keys under one organization. Each key may receive independent rate limits in certain configurations. Check provider terms of service — some explicitly prohibit this approach.
When to use: With caution. Verify provider policies first. More sustainable to upgrade tiers or use multi-provider routing.
Multi-Provider Routing for Rate Limit Resilience
The most robust rate limit strategy is not working around a single provider's limits — it is distributing across providers.
How multi-provider routing works:
Define your model requirements (quality threshold, max latency, max cost)
Map equivalent models across providers (e.g., GPT-5.4 Mini / Haiku 4.5 / Gemini Flash for budget tasks)
Route requests to the provider with the most available capacity
Fail over automatically when one provider hits limits
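A toy version of the headroom-based routing step; the provider names and in-memory counters are illustrative, since a real router would read live utilization (for example from rate limit response headers) rather than tracking it locally:

```python
from dataclasses import dataclass

@dataclass
class Provider:
    name: str
    rpm_limit: int
    used_this_minute: int = 0

    @property
    def headroom(self) -> float:
        """Fraction of this provider's RPM still unused in the current window."""
        return 1.0 - self.used_this_minute / self.rpm_limit

def route(providers: list[Provider]) -> Provider:
    """Send the next request to the provider with the most spare capacity."""
    available = [p for p in providers if p.used_this_minute < p.rpm_limit]
    if not available:
        raise RuntimeError("all providers at their limits: queue or back off")
    chosen = max(available, key=lambda p: p.headroom)
    chosen.used_this_minute += 1
    return chosen

pool = [Provider("openai", 5000, 4500),    # 90% utilized
        Provider("anthropic", 2000, 600),  # 30% utilized
        Provider("google", 2000, 1000)]    # 50% utilized
print(route(pool).name)  # anthropic has the most headroom
```

Failover falls out of the same logic: a provider at its cap is simply excluded from `available` until its window resets.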
TokenMix.ai's approach: The platform maintains real-time rate limit utilization tracking across all connected providers. When you send a request through TokenMix.ai's unified API, it routes to the provider with the most headroom. If OpenAI is at 90% RPM utilization, the request goes to Anthropic or Google instead.
Effective capacity multiplication: A team with Tier 3 accounts at OpenAI (5,000 RPM), Anthropic (2,000 RPM), and Google (2,000 RPM) gets an effective 9,000 RPM through multi-provider routing — nearly as much as OpenAI Tier 4 alone.
Cost implication: Multi-provider routing also enables cost optimization. Route budget tasks to the cheapest available provider. Route quality-critical tasks to the best-performing model. Rate limit management and cost optimization converge into the same architecture.
How to Choose a Provider Based on Rate Limits
| Your throughput need | Recommended approach | Why |
|---|---|---|
| Under 50 RPM | Any provider, Tier 1 | All providers handle this easily |
| 50-500 RPM | OpenAI Tier 1 or Google Pay-as-you-go | Best low-tier limits |
| 500-2,000 RPM | OpenAI Tier 2-3 | Highest RPM at mid tiers |
| 2,000-5,000 RPM | OpenAI Tier 3-4 | Only provider offering this range |
| 5,000-10,000 RPM | OpenAI Tier 4-5 or multi-provider | Single provider or distributed |
| 10,000+ RPM | Multi-provider via TokenMix.ai | No single provider reliably serves this |
| High TPM, low RPM | Anthropic or Google | High per-request token allowance |
| Maximum throughput at low cost | Multi-provider with DeepSeek + Flash | Aggregate cheap capacity |
| Enterprise SLA required | Direct enterprise agreement | Custom limits, guaranteed capacity |
Conclusion
Rate limits are not just a technical constraint — they are a cost multiplier and an architectural forcing function. OpenAI offers the highest limits at top tiers (10,000+ RPM, 150M+ TPM) but requires months of spending to unlock them. Anthropic's limits are significantly lower (4,000 RPM max) but sufficient for most individual applications. Google offers generous TPM on free tiers. DeepSeek and Groq are heavily constrained.
For teams outgrowing a single provider's limits, multi-provider routing through TokenMix.ai multiplies your effective capacity by aggregating limits across providers. This approach simultaneously solves rate limit constraints, provides failover resilience, and enables cost optimization through intelligent routing.
The pragmatic path: start with one provider, implement exponential backoff from day one, upgrade tiers as volume grows, and switch to multi-provider routing when any single provider becomes a bottleneck. TokenMix.ai's real-time monitoring tracks your utilization across all providers and alerts you before you hit limits.
Do not build rate limit handling as an afterthought. It is a core infrastructure concern that directly impacts your application's reliability and cost.
FAQ
What happens when I hit an API rate limit?
The provider returns an HTTP 429 (Too Many Requests) response with a Retry-After header indicating how long to wait. Your application should implement exponential backoff — wait, then retry with increasing delays. Without retry logic, your requests simply fail.
How do I check my current OpenAI rate limit tier?
Go to your OpenAI dashboard under Settings > Limits. It shows your current tier, rate limits for each model, and the requirements to advance to the next tier. You can also check limits programmatically via response headers (x-ratelimit-limit-*, x-ratelimit-remaining-*).
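Those headers can be turned into a utilization snapshot with a small helper. The header names below are OpenAI's documented `x-ratelimit-*` set; other providers use similar but not identical names:

```python
def parse_ratelimit_headers(headers: dict) -> dict:
    """Summarize OpenAI-style x-ratelimit-* response headers.

    Only the limit/remaining headers are read here; OpenAI also sends
    x-ratelimit-reset-* durations for when each window refills.
    """
    names = {
        "rpm_limit": "x-ratelimit-limit-requests",
        "rpm_remaining": "x-ratelimit-remaining-requests",
        "tpm_limit": "x-ratelimit-limit-tokens",
        "tpm_remaining": "x-ratelimit-remaining-tokens",
    }
    out = {key: int(headers[h]) for key, h in names.items() if h in headers}
    if "rpm_limit" in out and "rpm_remaining" in out:
        out["rpm_utilization"] = 1.0 - out["rpm_remaining"] / out["rpm_limit"]
    return out

sample = {  # example values matching a Tier 1 account
    "x-ratelimit-limit-requests": "500",
    "x-ratelimit-remaining-requests": "480",
    "x-ratelimit-limit-tokens": "200000",
    "x-ratelimit-remaining-tokens": "150000",
}
print(round(parse_ratelimit_headers(sample)["rpm_utilization"], 3))  # 0.04
```

Logging this per response gives you utilization trends without any dashboard polling.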
Why are Anthropic rate limits lower than OpenAI?
Anthropic's published limits (4,000 RPM max at Tier 4 vs OpenAI's 10,000+ RPM) reflect different infrastructure scaling strategies. Anthropic focuses on per-request quality and safety, while OpenAI has scaled infrastructure for higher throughput. Enterprise Anthropic customers can negotiate custom limits.
Can I increase my rate limits without spending more?
Tier advancement at OpenAI and Anthropic requires cumulative spending. There is no way to skip tiers. However, you can effectively increase your available throughput by using multi-provider routing (aggregate limits across providers), using the Batch API (separate higher limits), and request batching (reducing RPM consumption).
What is the difference between RPM, TPM, and RPD?
RPM (requests per minute) limits the number of API calls. TPM (tokens per minute) limits total token throughput. RPD (requests per day) is a daily cap. All three apply simultaneously — you hit whichever limit comes first. Most production workloads are bottlenecked by TPM on large requests or RPM on many small requests.
How does TokenMix.ai handle rate limits?
TokenMix.ai monitors real-time rate limit utilization across all connected providers. When one provider approaches its limits, the platform automatically routes requests to providers with available capacity. This multi-provider approach effectively multiplies your aggregate rate limits while maintaining a single API integration.