TokenMix Research Lab · 2026-06-04

Claude 429 Rate Limits 2026: RPM, TPM, Backoff, Jitter Fix
Last Updated: 2026-06-04 · Author: TokenMix Research Lab · Data verified: 2026-06-04 (Anthropic rate-limit docs, error docs, service tiers, prompt caching docs, Help Center rate-limit article, Claude Code error reference, and TokenMix Claude pricing cluster)
Claude 429 is not one error. First identify the bucket: RPM, ITPM, OTPM, spend cap, workspace cap, fast mode, or acceleration. Then retry with retry-after plus jitter.
Anthropic documents three main Messages API rate-limit dimensions: requests per minute (RPM), input tokens per minute (ITPM), and output tokens per minute (OTPM). If any bucket is exceeded, the API returns HTTP 429 with rate_limit_error and a retry-after header (Anthropic rate limits, Anthropic errors). The important detail: Anthropic does not use one generic "TPM" bucket like many providers. For most Claude models, cached input reads do not count toward ITPM, while input tokens after the last cache breakpoint and cache creation tokens do count (Anthropic rate limits). That makes the correct fix different from "sleep and retry." You need bucket-aware throttling, token-aware concurrency, prompt caching, workspace caps, and fallback routing.
Table of Contents
- Quick Verdict
- What 429 Means
- RPM ITPM OTPM Explained
- Response Headers
- Five Fixes That Actually Work
- Backoff and Jitter Code
- Cost and Capacity Math
- Workspace Service Tier and Fast Mode Traps
- Fallback Routing
- Risks and Caveats
- Final Recommendation
- FAQ
- Sources
- Related Articles
Quick Verdict
| Claim | Status | Source |
|---|---|---|
| Claude API 429 maps to rate_limit_error | Confirmed | Anthropic errors |
| Anthropic rate limits are measured by RPM, ITPM, and OTPM for Messages API model classes | Confirmed | Anthropic rate limits |
| A 429 response includes retry-after telling how long to wait | Confirmed | Anthropic rate limits |
| Short bursts can trigger rate limits even when the minute average looks valid | Confirmed | Anthropic rate limits |
| Cached input reads do not count toward ITPM for most Claude models | Confirmed | Anthropic rate limits |
| max_tokens increases OTPM rate-limit usage before tokens are generated | False | Anthropic says OTPM is evaluated on actual generated output, not max_tokens |
| 529 and 429 are the same failure | False | Anthropic errors separates 429 rate_limit_error from 529 overloaded_error |
| Priority Tier removes all regular rate limits | False | Service tiers says Priority Tier still observes regular rate limits |
| Fast mode uses the same Opus rate-limit bucket | False | Anthropic documents dedicated fast mode rate limits and anthropic-fast-* headers |
| Random exponential backoff is enough for production | Likely false | It helps, but headers, token accounting, caching, and fallback are still needed |
What 429 Means
| Surface symptom | Likely real cause | Best first check | Status |
|---|---|---|---|
| rate_limit_error in API JSON | RPM, ITPM, OTPM, workspace, spend, fast mode, or acceleration | Error body + response headers | Confirmed |
| retry-after present | Server tells exact wait window | Sleep at least that many seconds | Confirmed |
| Requests fail in bursts | Short-interval enforcement | Concurrency and queue shape | Confirmed |
| Long prompts fail even with low request count | ITPM exhaustion | Input-token headers | Confirmed |
| Long completions fail or stall | OTPM pressure | Output-token headers | Confirmed |
| New org suddenly scales traffic and gets 429 | Acceleration limit | Ramp traffic gradually | Confirmed |
| Claude Code shows API Error: Request rejected (429) | Claude Code / API capacity or account limit | Claude Code error reference + account state | Confirmed |
| 529 instead of 429 | Anthropic overloaded globally | Retry or fallback provider | Confirmed |
The first diagnostic question is not "how long do I sleep?" It is "which bucket did I exceed?"
RPM ITPM OTPM Explained
| Bucket | What it measures | What breaks it | Fix |
|---|---|---|---|
| RPM | Requests per minute | Too many API calls, especially bursty parallel calls | Queue, leaky bucket, concurrency cap |
| ITPM | Input tokens per minute | Large prompts, long RAG contexts, cache writes | Prompt compression, caching, chunking |
| OTPM | Output tokens per minute | Long generations, many streamed completions | Lower target output, route long jobs to batch |
| Spend limit | Monthly dollar cap by usage tier or custom org cap | Normal traffic after monthly spend ceiling | Raise cap, wait next month, use cheaper model |
| Workspace limit | Workspace-level custom cap | One workspace exceeds local cap | Rebalance workspace caps |
| Fast mode limit | Dedicated fast mode bucket | speed: "fast" traffic exceeds preview lane | Fall back to standard mode |
| Acceleration limit | Sharp traffic increase | Sudden launch or retry storm | Gradual ramp and adaptive backoff |
Anthropic warns that a nominal 60 RPM limit can be enforced as 1 request per second, so dumping 60 requests at once can still fail. That is why queue shape matters as much as the published number.
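The spacing idea can be sketched as a small client-side pacer. This is a minimal illustration, not Anthropic's enforcement algorithm: the `Pacer` class, its `reserve` method, and the injectable `clock` are hypothetical names, and the sketch assumes that evenly spacing requests at 60 / RPM seconds is enough to stay under short-interval enforcement.

```python
import time


class Pacer:
    """Hypothetical client-side pacer: spaces requests 60 / rpm_limit seconds apart."""

    def __init__(self, rpm_limit, clock=time.monotonic):
        if rpm_limit <= 0:
            raise ValueError("rpm_limit must be positive")
        self.interval = 60.0 / rpm_limit  # 60 RPM -> one slot per second
        self.clock = clock                # injectable so the pacer can be tested
        self.next_slot = None

    def reserve(self):
        """Claim the next send slot and return how many seconds to sleep first."""
        now = self.clock()
        # If the pacer has been idle, the next slot is "now", not in the past.
        if self.next_slot is None or self.next_slot < now:
            self.next_slot = now
        wait = self.next_slot - now
        self.next_slot += self.interval
        return wait
```

A worker would call `time.sleep(pacer.reserve())` before each request. With a 60 RPM limit, 200 queued requests are spread over roughly 200 seconds instead of arriving in one burst.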
Response Headers
| Header family | Meaning | How to use it |
|---|---|---|
| retry-after | Seconds to wait before retrying | Treat as minimum sleep time |
| anthropic-ratelimit-requests-limit | Request limit | Size queue and concurrency |
| anthropic-ratelimit-requests-remaining | Remaining requests before rate limit | Slow down before zero |
| anthropic-ratelimit-requests-reset | When request limit replenishes | Schedule retry |
| anthropic-ratelimit-input-tokens-limit | Input-token cap | Gate large prompts |
| anthropic-ratelimit-input-tokens-remaining | Input tokens left, rounded | Refuse large RAG calls before failure |
| anthropic-ratelimit-input-tokens-reset | Input-token reset time | Retry token-heavy work later |
| anthropic-ratelimit-output-tokens-limit | Output-token cap | Cap long completions |
| anthropic-ratelimit-output-tokens-remaining | Output tokens left, rounded | Route long generation elsewhere |
| anthropic-fast-* | Fast mode rate status | Only applies to fast mode preview |
| request-id | Unique request identifier | Include in support/debug logs |
Do not parse only the HTTP status. Save the headers. They are the difference between a one-line retry loop and a production throttle.
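One way to use the saved headers is to work out which bucket has the least headroom before the next request goes out. The sketch below reads the limit and remaining headers from the table above. The function names `bucket_headroom` and `tightest_bucket` are hypothetical, and the sketch assumes the header values parse as plain numbers.

```python
from typing import Dict, Mapping, Optional

# Bucket names as they appear inside the anthropic-ratelimit-* header names.
BUCKETS = ("requests", "input-tokens", "output-tokens")


def bucket_headroom(headers: Mapping[str, str]) -> Dict[str, float]:
    """Map each bucket to remaining / limit (0.0 = exhausted, 1.0 = untouched)."""
    lowered = {k.lower(): v for k, v in headers.items()}
    headroom = {}
    for bucket in BUCKETS:
        limit = lowered.get(f"anthropic-ratelimit-{bucket}-limit")
        remaining = lowered.get(f"anthropic-ratelimit-{bucket}-remaining")
        if limit is None or remaining is None:
            continue  # header absent: report nothing rather than guess
        limit_value = float(limit)
        if limit_value > 0:
            headroom[bucket] = float(remaining) / limit_value
    return headroom


def tightest_bucket(headers: Mapping[str, str]) -> Optional[str]:
    """Return the bucket with the least headroom, or None if no headers were found."""
    headroom = bucket_headroom(headers)
    if not headroom:
        return None
    return min(headroom, key=headroom.get)
```

A throttle could slow down or defer large prompts when the tightest bucket drops below a chosen threshold, such as 10% headroom. That threshold is a tuning choice, not a documented value.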
Five Fixes That Actually Work
| Fix | Solves | Implementation | Confidence |
|---|---|---|---|
| Respect retry-after | Ordinary 429 | Sleep at least the header value before retry | Confirmed |
| Add jitter | Retry storms | Add random delay on top of retry-after or exponential backoff | Likely |
| Token-aware queue | ITPM/OTPM | Estimate input/output tokens before dispatch | Confirmed |
| Prompt caching | ITPM for repeated context | Cache long system prompts, tool definitions, docs, conversation state | Confirmed |
| Concurrency cap | RPM and bursts | Limit per-model and per-workspace parallel calls | Confirmed |
| Workspace caps | Multi-team fairness | Set per-workspace spend/rate limits below org maximum | Confirmed |
| Model fallback | Provider or model saturation | Route to cheaper/faster fallback model | Likely |
| Batch API | Async non-user-facing work | Move evals, summaries, offline transforms to batches | Confirmed |
| Priority Tier | Production availability | Use committed-spend tier when SLA matters | Confirmed |
The first five are the default fix stack. The last four are architecture choices.
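The token-aware queue row can be sketched as a client-side budget that admits a request only when its estimated tokens fit in the current minute. This is an approximation under stated assumptions: `TokenBudget` and `try_acquire` are hypothetical names, the 60-second sliding window is a simplification of whatever the server actually enforces, and the token estimate is supplied by the caller.

```python
import time
from collections import deque


class TokenBudget:
    """Hypothetical sliding 60-second budget of estimated tokens already dispatched."""

    def __init__(self, tokens_per_minute, clock=time.monotonic):
        self.capacity = tokens_per_minute
        self.clock = clock      # injectable for deterministic tests
        self.window = deque()   # (timestamp, tokens) per admitted request
        self.used = 0

    def _expire(self, now):
        # Drop admissions that are at least 60 seconds old.
        while self.window and now - self.window[0][0] >= 60.0:
            _, tokens = self.window.popleft()
            self.used -= tokens

    def try_acquire(self, estimated_tokens):
        """Admit the request if it fits in the remaining budget; otherwise refuse."""
        now = self.clock()
        self._expire(now)
        if self.used + estimated_tokens > self.capacity:
            return False
        self.window.append((now, estimated_tokens))
        self.used += estimated_tokens
        return True
```

Running one budget for input tokens and another for output tokens keeps the ITPM and OTPM buckets separate. A refused request stays in the queue instead of becoming a 429.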
Backoff and Jitter Code
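```bash
# -i prints the response headers, so retry-after and the
# anthropic-ratelimit-* values are visible alongside the JSON body.
curl -i https://api.anthropic.com/v1/messages \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d '{
    "model": "claude-sonnet-4-5",
    "max_tokens": 256,
    "messages": [{"role": "user", "content": "Say hello"}]
  }'
```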
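```python
import random
import time  # used by the caller: time.sleep(retry_delay_seconds(...))


def retry_delay_seconds(response, attempt):
    """Return seconds to wait before retry number `attempt` (0-based)."""
    retry_after = response.headers.get("retry-after")
    try:
        # Treat the server's retry-after as the minimum wait.
        base = float(retry_after)
    except (TypeError, ValueError):
        # Header missing or not numeric: capped exponential backoff.
        base = min(60.0, 2 ** attempt)
    # Add 10-40% jitter so parallel workers do not retry in lockstep.
    jitter = random.uniform(0.1, 0.4) * base
    return base + jitter


def should_retry(status_code, error_type):
    """Decide whether a failed request is worth retrying at all."""
    if status_code == 429 and error_type == "rate_limit_error":
        return True
    if status_code == 529:  # overloaded_error: transient platform load
        return True
    if status_code in (400, 401, 403, 404, 413):
        return False  # client errors do not heal with sleep
    return status_code >= 500
```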
| Code rule | Why |
|---|---|
| Retry 429 only after retry-after | Earlier retries fail and amplify load |
| Retry 529 with provider fallback | It is overloaded capacity, not your quota |
| Never retry 401/403 blindly | Auth/permission errors do not heal with sleep |
| Never retry 413 blindly | Request too large must be resized |
| Log request-id | Anthropic support needs it |
| Store rate-limit headers | Header state tells which limiter fired |
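The rules in this table can be tied together in a bounded retry loop. The following is a sketch rather than a reference client. `call_with_retries` is a hypothetical helper, `send` stands in for whatever function performs the HTTP call and returns `(status_code, headers)`, header keys are assumed to be lower-case, and the 10-40% jitter range is a tuning choice.

```python
import random
import time
from typing import Callable, Mapping, Tuple

Response = Tuple[int, Mapping[str, str]]  # (status_code, lower-cased headers)


def call_with_retries(
    send: Callable[[], Response],
    max_attempts: int = 5,
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[], float] = random.random,
) -> Response:
    """Retry 429 and 529 responses up to max_attempts, honouring retry-after."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        status, headers = send()
        if status not in (429, 529):
            return status, headers  # success or a non-retryable error
        if attempt == max_attempts - 1:
            break  # out of attempts: hand the last failure back to the caller
        retry_after = headers.get("retry-after")
        # retry-after is the minimum wait; fall back to capped exponential backoff.
        base = float(retry_after) if retry_after else min(60.0, 2.0 ** attempt)
        # 10-40% jitter keeps parallel workers from retrying in lockstep.
        sleep(base + (0.1 + 0.3 * rng()) * base)
    return status, headers
```

Passing `sleep` and `rng` as parameters keeps the loop testable without real waiting. When the loop gives up, the caller still receives the final status and headers and can decide whether to fall back to another model or provider.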
Cost and Capacity Math
| Scenario | Math | Result | Status |
|---|---|---|---|
| Burst shape | 60 RPM can be enforced as 1 RPS | 60 calls at once can fail; 1/sec queue passes | Confirmed principle |
| Cache-aware ITPM | 2M ITPM with 80% cache hit | Effective 10M total input tokens/minute | Confirmed example |
| Workspace cap | Org 40K ITPM, workspace 30K ITPM | Other workspaces still have at least 10K ITPM if unused tokens remain | Confirmed example |
| Retry storm | 100 failed workers retry immediately | 100 more failures plus load spike | Likely |
| Long output | 20 calls x 4K output | 80K output tokens pressure OTPM | Confirmed math |
Cost calculation 1: If your org has 40,000 ITPM and one workspace is capped at 30,000 ITPM, that workspace cannot consume the full org bucket. Anthropic uses this exact pattern to explain workspace limits: the remaining unused tokens are available to other workspaces.
Cost calculation 2: With a 2,000,000 ITPM limit and an 80% cache hit rate, Anthropic's docs say you can effectively process 10,000,000 total input tokens per minute, because cached reads do not count toward ITPM for most models: only the uncached 20% (2,000,000 tokens) draws on the limit. That is a 5x effective throughput gain, not just a pricing discount.
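The arithmetic behind this example is a one-line formula: if only the uncached share of input counts toward ITPM, total throughput is the limit divided by that share. A small sketch follows, where `effective_input_tpm` is a hypothetical helper and the hit rate is assumed to be the fraction of input tokens served as cache reads (cache writes are ignored for simplicity).

```python
def effective_input_tpm(itpm_limit: float, cache_hit_rate: float) -> float:
    """Total input tokens per minute when cached reads do not count toward ITPM."""
    if not 0.0 <= cache_hit_rate < 1.0:
        raise ValueError("cache_hit_rate must be in [0.0, 1.0)")
    uncached_share = 1.0 - cache_hit_rate
    return itpm_limit / uncached_share


# 2,000,000 ITPM at an 80% hit rate gives roughly 10,000,000 tokens per minute.
print(round(effective_input_tpm(2_000_000, 0.8)))
```

The same function shows why doubling the hit rate matters: at 40% the multiplier is about 1.67x, while at 80% it is 5x.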
Cost calculation 3: If a job launches 200 parallel requests against a lane that effectively accepts 1 request/second, a naive retry loop can create minutes of self-inflicted 429s. Queueing those 200 requests at 1/sec completes dispatch in about 200 seconds without turning every retry into another rate-limit event.
For token price tradeoffs after the error is fixed, use Claude API Pricing 2026 and Claude API Cache Pricing 2026. A 429 fix that doubles cache hit rate can be worth more than a model downgrade.
Workspace Service Tier and Fast Mode Traps
| Feature | Common mistake | Correct read | Source |
|---|---|---|---|
| Usage tier | Assuming higher tier means no limits | Higher tiers raise limits but still enforce them | Anthropic rate limits |
| Spend limit | Treating 429 as purely RPM/TPM | Spend caps can halt usage until next month | Anthropic rate limits |
| Workspace limit | Looking only at org limits | Workspace caps can be lower than org caps | Anthropic rate limits |
| Priority Tier | Expecting no rate limits | Requests still pull from regular rate limits | Anthropic service tiers |
| Batch API | Using live API for offline jobs | Batch has separate queue limits | Anthropic rate limits |
| Fast mode | Assuming standard Opus limits apply | Fast mode has dedicated limits and headers | Anthropic rate limits |
| Claude Platform on AWS | Expecting Anthropic automatic tier advancement | AWS path has different billing/spend behavior | Anthropic rate limits |
The dangerous pattern is treating every 429 as a code bug. Sometimes it is spend ceiling. Sometimes it is workspace policy. Sometimes it is acceleration. The fix changes.
Fallback Routing
| Primary failure | Better fallback | Why |
|---|---|---|
| Claude RPM exhausted | Same Claude model later | If the model is required, delay is safest |
| Claude ITPM exhausted | Cached prompt or smaller-context model | Reduce input pressure |
| Claude OTPM exhausted | Shorter output or cheaper long-output model | Reduce output pressure |
| 529 overloaded | Another provider/model | This is shared platform load |
| Fast mode 429 | Standard mode | Different lane |
| Workspace cap | Different workspace only if policy allows | Avoid bypassing governance |
| User-facing SLA | Gateway fallback | Protect UX |
| Batch job | Batch API | Remove from live rate pool |
The routing layer matters because one provider's rate limit should not take down the product. TokenMix covers this pattern in AI API Gateway 2026, and the Claude/OpenAI-compatible setup path is covered in Anthropic OpenAI-Compatible API 2026.
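The fallback table above can be expressed as a small decision function. This is an illustrative sketch only: `choose_fallback` and the action strings it returns are hypothetical labels, and the `bucket` argument is assumed to come from the caller's own header analysis.

```python
from typing import Optional

# Hypothetical action labels that mirror the rows of the fallback table.
BUCKET_ACTIONS = {
    "requests": "delay_same_model",
    "input-tokens": "cached_prompt_or_smaller_context",
    "output-tokens": "shorter_output_or_long_output_model",
}


def choose_fallback(
    status_code: int,
    bucket: Optional[str] = None,
    fast_mode: bool = False,
) -> str:
    """Pick a fallback action for a failed Claude request."""
    if status_code == 529:
        # Shared platform load, not the caller's quota: route elsewhere.
        return "other_provider_or_model"
    if status_code != 429:
        return "no_fallback"
    if fast_mode:
        # Fast mode has its own lane, so standard mode may still have room.
        return "standard_mode"
    # Unknown bucket: waiting on the same model is the safest default.
    return BUCKET_ACTIONS.get(bucket, "delay_same_model")
```

Workspace caps are deliberately left out of the function: moving traffic to another workspace is a governance decision, not something a router should do automatically.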
Risks and Caveats
| Risk | Status | Mitigation |
|---|---|---|
| Published limits are maximum allowed usage, not guaranteed minimums | Confirmed | Build headroom |
| Headers may show the most restrictive active limiter | Confirmed | Log all headers, not only one |
| Short bursts can fail under minute limits | Confirmed | Smooth traffic |
| Cached-read behavior differs for marked models | Confirmed | Check dagger notes in docs |
| max_tokens is mistaken for OTPM usage | False | OTPM counts actual generated output |
| 529 is handled like 429 | False | Use fallback and retry separately |
| Priority Tier is treated as unlimited | False | Still obeys regular limits |
| Claude Code errors are assumed to be raw API errors | Likely | Check Claude Code docs and account state |
Final Recommendation
For Claude 429, do not start with a bigger sleep. Start with the bucket. Log retry-after, the request-limit and token-limit headers, request ID, model, workspace, cache hit rate, and fast mode state. Then apply queueing, caching, jitter, and fallback in that order.
FAQ
What does Claude "rate exceeded" mean?
A Claude "rate exceeded" message usually means HTTP 429 rate_limit_error. It can come from RPM, ITPM, OTPM, spend caps, workspace caps, fast mode limits, or acceleration limits.
What is the difference between RPM and TPM for Claude?
Anthropic splits token rate limits into ITPM and OTPM, not one generic TPM bucket. ITPM covers input tokens per minute; OTPM covers actual output tokens generated per minute.
Should I retry every Claude 429?
Retry only after respecting retry-after. Add jitter and a maximum retry count. If the same bucket keeps failing, reduce concurrency, compress prompts, cache input, or route elsewhere.
Does prompt caching help Claude rate limits?
Yes. For most Claude models, cached input reads do not count toward ITPM. Cache creation tokens still count, so caching helps most when repeated context is reused across many requests.
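To see this in your own traffic, the `usage` object returned by the Messages API can be split into tokens that count toward ITPM and tokens that do not. The sketch below assumes the "most models" case described above. The helper names are hypothetical, and the field names are the ones reported in the response `usage` block.

```python
from typing import Mapping, Optional


def itpm_counted_tokens(usage: Mapping[str, Optional[int]]) -> int:
    """Input tokens that count toward ITPM: uncached input plus cache writes."""
    uncached = usage.get("input_tokens") or 0
    cache_writes = usage.get("cache_creation_input_tokens") or 0
    return uncached + cache_writes


def cache_hit_rate(usage: Mapping[str, Optional[int]]) -> float:
    """Share of all input tokens that were served as cache reads."""
    cache_reads = usage.get("cache_read_input_tokens") or 0
    total = cache_reads + itpm_counted_tokens(usage)
    return cache_reads / total if total else 0.0


usage = {
    "input_tokens": 500,
    "cache_creation_input_tokens": 1500,
    "cache_read_input_tokens": 8000,
}
print(itpm_counted_tokens(usage))  # 2000
```

Tracking `cache_hit_rate` per route shows which prompts benefit from caching. A route that mostly writes to the cache and rarely reads from it gains little ITPM headroom.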
Why do I get 429 even below my RPM limit?
You may be hitting ITPM, OTPM, workspace limits, spend limits, fast mode limits, or short-interval burst enforcement. Anthropic says a 60 RPM limit can be enforced as 1 request per second.
Is 529 the same as 429?
No. 429 means your account hit a rate limit or acceleration limit. 529 means the API is temporarily overloaded across users.
Does Priority Tier fix Claude rate limits?
No. Priority Tier improves service level and capacity priority, but Anthropic says requests still observe regular rate limits. Use it for production predictability, not unlimited throughput.
What should I log for Claude 429 debugging?
Log status code, error type, message, request-id, retry-after, all anthropic-ratelimit-* headers, model, workspace, cache hit rate, and whether fast mode or batch API was used.
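A log record covering these fields might be assembled as follows. This is a sketch: `build_429_log_record` is a hypothetical helper, `model`, `workspace`, `fast_mode`, and `batch` are supplied by the caller because they are not read from the response here, and the error body is assumed to follow the documented `{"type": "error", "error": {"type": ..., "message": ...}}` shape.

```python
from typing import Any, Dict, Mapping, Optional

RATE_HEADER_PREFIXES = ("anthropic-ratelimit-", "anthropic-fast-")


def build_429_log_record(
    status_code: int,
    headers: Mapping[str, str],
    error_body: Optional[Mapping[str, Any]],
    *,
    model: str,
    workspace: str,
    fast_mode: bool = False,
    batch: bool = False,
) -> Dict[str, Any]:
    """Collect the fields needed to debug a rate-limited request."""
    lowered = {k.lower(): v for k, v in headers.items()}
    error = (error_body or {}).get("error") or {}
    return {
        "status_code": status_code,
        "error_type": error.get("type"),
        "message": error.get("message"),
        "request_id": lowered.get("request-id"),
        "retry_after": lowered.get("retry-after"),
        # Keep every rate-limit header so the limiter that fired can be identified.
        "ratelimit_headers": {
            k: v for k, v in lowered.items() if k.startswith(RATE_HEADER_PREFIXES)
        },
        "model": model,
        "workspace": workspace,
        "fast_mode": fast_mode,
        "batch": batch,
    }
```

Cache hit rate is left out because it comes from the usage data of successful responses, not from the 429 itself. It can be joined to the record later by model and workspace.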
Sources
- Anthropic Rate Limits - official RPM, ITPM, OTPM, spend, workspace, batch, fast mode, and headers
- Anthropic Errors - official HTTP error code and error shape reference
- Anthropic Service Tiers - official Standard, Priority, and Batch tier behavior
- Anthropic Prompt Caching - official prompt caching behavior
- Anthropic Help Center: API Rate Limits - official support article on RPM, ITPM, OTPM and retry-after
- Claude Code Error Reference - official Claude Code error wording and retry guidance
- Claude API Usage and Cost - official usage and cost reference
- Anthropic Batch Processing - official batch workflow reference
- Anthropic OpenAI SDK Compatibility - official OpenAI-compatible SDK path
- Anthropic Status - official status page for outages and overloaded periods
Related Articles
- Claude API Pricing 2026: Opus 4.8, Sonnet 4.8, Haiku 4.5 Compared
- Claude API Cache Pricing 2026: 90% Input Savings Explained
- Claude Opus 4.8 Review 2026: Pricing, Benchmarks, vs 4.7 and GPT-5.5
- Anthropic OpenAI-Compatible API 2026: Claude SDK Setup Guide
- AI API Gateway 2026: Routing, Fallbacks, Observability, and Cost Control