TokenMix Research Lab · 2026-06-04

Claude 429 Rate Limits 2026: RPM, TPM, Backoff, Jitter Fix

Last Updated: 2026-06-04
Author: TokenMix Research Lab
Data verified: 2026-06-04 - Anthropic rate-limit docs, error docs, service tiers, prompt caching docs, Help Center rate-limit article, Claude Code error reference, and TokenMix Claude pricing cluster

Claude 429 is not one error. First identify the bucket: RPM, ITPM, OTPM, spend cap, workspace cap, fast mode, or acceleration. Then retry with retry-after plus jitter.

Anthropic documents three main Messages API rate-limit dimensions: requests per minute (RPM), input tokens per minute (ITPM), and output tokens per minute (OTPM). If any bucket is exceeded, the API returns HTTP 429 with rate_limit_error and a retry-after header (Anthropic rate limits, Anthropic errors). The important detail: Anthropic does not use one generic "TPM" bucket like many providers. For most Claude models, cached input reads do not count toward ITPM, while input tokens after the last cache breakpoint and cache creation tokens do count (Anthropic rate limits). That makes the correct fix different from "sleep and retry." You need bucket-aware throttling, token-aware concurrency, prompt caching, workspace caps, and fallback routing.

Quick Verdict

| Claim | Status | Source |
|---|---|---|
| Claude API 429 maps to rate_limit_error | Confirmed | Anthropic errors |
| Anthropic rate limits are measured by RPM, ITPM, and OTPM for Messages API model classes | Confirmed | Anthropic rate limits |
| A 429 response includes retry-after telling how long to wait | Confirmed | Anthropic rate limits |
| Short bursts can trigger rate limits even when the minute average looks valid | Confirmed | Anthropic rate limits |
| Cached input reads do not count toward ITPM for most Claude models | Confirmed | Anthropic rate limits |
| max_tokens increases OTPM rate-limit usage before tokens are generated | False | Anthropic says OTPM is evaluated on actual generated output, not max_tokens |
| 529 and 429 are the same failure | False | Anthropic errors separates 429 rate_limit_error from 529 overloaded_error |
| Priority Tier removes all regular rate limits | False | Service tiers says Priority Tier still observes regular rate limits |
| Fast mode uses the same Opus rate-limit bucket | False | Anthropic documents dedicated fast mode rate limits and anthropic-fast-* headers |
| Random exponential backoff is enough for production | Likely false | It helps, but headers, token accounting, caching, and fallback are still needed |

What 429 Means

| Surface symptom | Likely real cause | Best first check | Status |
|---|---|---|---|
| rate_limit_error in API JSON | RPM, ITPM, OTPM, workspace, spend, fast mode, or acceleration | Error body + response headers | Confirmed |
| retry-after present | Server tells exact wait window | Sleep at least that many seconds | Confirmed |
| Requests fail in bursts | Short-interval enforcement | Concurrency and queue shape | Confirmed |
| Long prompts fail even with low request count | ITPM exhaustion | Input-token headers | Confirmed |
| Long completions fail or stall | OTPM pressure | Output-token headers | Confirmed |
| New org suddenly scales traffic and gets 429 | Acceleration limit | Ramp traffic gradually | Confirmed |
| Claude Code shows API Error: Request rejected (429) | Claude Code / API capacity or account limit | Claude Code error reference + account state | Confirmed |
| 529 instead of 429 | Anthropic overloaded globally | Retry or fallback provider | Confirmed |

The first diagnostic question is not "how long do I sleep?" It is "which bucket did I exceed?"

RPM ITPM OTPM Explained

| Bucket | What it measures | What breaks it | Fix |
|---|---|---|---|
| RPM | Requests per minute | Too many API calls, especially bursty parallel calls | Queue, leaky bucket, concurrency cap |
| ITPM | Input tokens per minute | Large prompts, long RAG contexts, cache writes | Prompt compression, caching, chunking |
| OTPM | Output tokens per minute | Long generations, many streamed completions | Lower target output, route long jobs to batch |
| Spend limit | Monthly dollar cap by usage tier or custom org cap | Normal traffic after monthly spend ceiling | Raise cap, wait next month, use cheaper model |
| Workspace limit | Workspace-level custom cap | One workspace exceeds local cap | Rebalance workspace caps |
| Fast mode limit | Dedicated fast mode bucket | speed: "fast" traffic exceeds preview lane | Fall back to standard mode |
| Acceleration limit | Sharp traffic increase | Sudden launch or retry storm | Gradual ramp and adaptive backoff |

Anthropic warns that a nominal 60 RPM limit can be enforced as 1 request per second, so dumping 60 requests at once can still fail. That is why queue shape matters as much as the published number.
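The burst problem above can be illustrated with a small pacer that spreads dispatches evenly across the minute. This is a minimal single-threaded sketch, not Anthropic's algorithm: the `Pacer` class, its `wait_turn` method, and the injectable `clock` and `sleep` parameters are hypothetical names introduced here for illustration.

```python
import time

class Pacer:
    """Space calls evenly so a per-minute limit is never spent in one burst."""

    def __init__(self, rpm, clock=time.monotonic, sleep=time.sleep):
        self.interval = 60.0 / rpm   # seconds between dispatches
        self.clock = clock           # injectable so the pacer can be tested
        self.sleep = sleep
        self.next_slot = None        # earliest time the next call may start

    def wait_turn(self):
        """Block until the next evenly spaced slot, then reserve it.

        Returns the number of seconds spent waiting.
        """
        now = self.clock()
        if self.next_slot is None or now >= self.next_slot:
            # No backlog: go immediately and reserve the following slot.
            self.next_slot = now + self.interval
            return 0.0
        delay = self.next_slot - now
        self.sleep(delay)
        self.next_slot += self.interval
        return delay
```

Calling `pacer.wait_turn()` before each request turns a 60 RPM budget into roughly one dispatch per second. A multi-threaded worker pool would need a lock around `wait_turn`, which this sketch leaves out.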

Response Headers

| Header family | Meaning | How to use it |
|---|---|---|
| retry-after | Seconds to wait before retrying | Treat as minimum sleep time |
| anthropic-ratelimit-requests-limit | Request limit | Size queue and concurrency |
| anthropic-ratelimit-requests-remaining | Remaining requests before rate limit | Slow down before zero |
| anthropic-ratelimit-requests-reset | When request limit replenishes | Schedule retry |
| anthropic-ratelimit-input-tokens-limit | Input-token cap | Gate large prompts |
| anthropic-ratelimit-input-tokens-remaining | Input tokens left, rounded | Refuse large RAG calls before failure |
| anthropic-ratelimit-input-tokens-reset | Input-token reset time | Retry token-heavy work later |
| anthropic-ratelimit-output-tokens-limit | Output-token cap | Cap long completions |
| anthropic-ratelimit-output-tokens-remaining | Output tokens left, rounded | Route long generation elsewhere |
| anthropic-fast-* | Fast mode rate status | Only applies to fast mode preview |
| request-id | Unique request identifier | Include in support/debug logs |

Do not parse only the HTTP status. Save the headers. They are the difference between a one-line retry loop and a production throttle.
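One way to use the saved headers is to work out which bucket is closest to empty. The sketch below assumes the headers arrive as a plain dict with lowercase keys and numeric string values. The function name `tightest_bucket` and the bucket labels are hypothetical; only the header names come from the table above.

```python
def tightest_bucket(headers):
    """Return (bucket, fraction_remaining) for the bucket closest to empty.

    Returns None when no rate-limit header family is present.
    """
    families = {
        "requests": "anthropic-ratelimit-requests",
        "input_tokens": "anthropic-ratelimit-input-tokens",
        "output_tokens": "anthropic-ratelimit-output-tokens",
    }
    best = None
    for name, prefix in families.items():
        limit = headers.get(prefix + "-limit")
        remaining = headers.get(prefix + "-remaining")
        if limit is None or remaining is None:
            continue  # this header family is absent on the response
        limit = float(limit)
        if limit <= 0:
            continue  # avoid dividing by zero on an unexpected value
        fraction = float(remaining) / limit
        if best is None or fraction < best[1]:
            best = (name, fraction)
    return best
```

A throttle could slow down or refuse large prompts when the returned fraction drops below a threshold of your choosing, rather than waiting for the 429.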

Five Fixes That Actually Work

| Fix | Solves | Implementation | Confidence |
|---|---|---|---|
| Respect retry-after | Ordinary 429 | Sleep at least the header value before retry | Confirmed |
| Add jitter | Retry storms | Add random delay on top of retry-after or exponential backoff | Likely |
| Token-aware queue | ITPM/OTPM | Estimate input/output tokens before dispatch | Confirmed |
| Prompt caching | ITPM for repeated context | Cache long system prompts, tool definitions, docs, conversation state | Confirmed |
| Concurrency cap | RPM and bursts | Limit per-model and per-workspace parallel calls | Confirmed |
| Workspace caps | Multi-team fairness | Set per-workspace spend/rate limits below org maximum | Confirmed |
| Model fallback | Provider or model saturation | Route to cheaper/faster fallback model | Likely |
| Batch API | Async non-user-facing work | Move evals, summaries, offline transforms to batches | Confirmed |
| Priority Tier | Production availability | Use committed-spend tier when SLA matters | Confirmed |

The first five are the default fix stack. The last four are architecture choices.
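The token-aware queue row can be sketched as a client-side token bucket that admits a request only when its estimated tokens fit the current budget. This is an illustrative approximation under stated assumptions: the `TokenBucket` class and `try_reserve` method are hypothetical, the token estimate is supplied by the caller, and the client-side count will drift from the server's view unless it is corrected from the response headers.

```python
import time

class TokenBucket:
    """Client-side budget that refills continuously up to a per-minute cap."""

    def __init__(self, per_minute, clock=time.monotonic):
        self.capacity = float(per_minute)
        self.rate = per_minute / 60.0   # tokens restored per second
        self.clock = clock              # injectable so the bucket can be tested
        self.level = self.capacity      # start with a full bucket
        self.updated = clock()

    def try_reserve(self, estimated_tokens):
        """Reserve tokens for one request; return False if it should wait."""
        now = self.clock()
        # Refill for the time elapsed since the last call, capped at capacity.
        elapsed = now - self.updated
        self.level = min(self.capacity, self.level + elapsed * self.rate)
        self.updated = now
        if estimated_tokens > self.level:
            return False
        self.level -= estimated_tokens
        return True
```

One bucket per dimension (for example one for ITPM and one for OTPM) keeps a large prompt from being dispatched when it would only come back as a 429.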

Backoff and Jitter Code

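# -i prints the response headers, including retry-after and anthropic-ratelimit-*
curl https://api.anthropic.com/v1/messages \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d '{
    "model": "claude-sonnet-4-5",
    "max_tokens": 256,
    "messages": [{"role": "user", "content": "Say hello"}]
  }' -i

import random
import time  # the caller sleeps with time.sleep(retry_delay_seconds(...))

def retry_delay_seconds(response, attempt):
    """Return how many seconds to wait before retry number `attempt` (0-based)."""
    retry_after = response.headers.get("retry-after")
    try:
        # The server-provided wait is the minimum delay.
        base = float(retry_after)
    except (TypeError, ValueError):
        # Header missing or not numeric: exponential backoff capped at 60 s.
        base = min(60.0, 2 ** attempt)
    # Add 10-40% on top so workers do not all retry at the same instant.
    jitter = random.uniform(0.1, 0.4) * base
    return base + jitter

def should_retry(status_code, error_type):
    """Decide whether a failed request is worth retrying at all."""
    if status_code == 429 and error_type == "rate_limit_error":
        return True   # rate limited: retry after the delay above
    if status_code == 529:
        return True   # overloaded: retry, ideally with a fallback route
    if status_code in (400, 401, 403, 404, 413):
        return False  # client-side errors do not heal with sleep
    return status_code >= 500

| Code rule | Why |
|---|---|
| Retry 429 only after retry-after | Earlier retries fail and amplify load |
| Retry 529 with provider fallback | It is overloaded capacity, not your quota |
| Never retry 401/403 blindly | Auth/permission errors do not heal with sleep |
| Never retry 413 blindly | Request too large must be resized |
| Log request-id | Anthropic support needs it |
| Store rate-limit headers | Header state tells which limiter fired |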
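These rules can be combined into one bounded retry loop. The sketch below is self-contained and makes several assumptions: `send` is a hypothetical callable returning a `(status, headers, body)` tuple, headers are a dict with lowercase keys, and the default of five attempts is an arbitrary illustration rather than a documented recommendation.

```python
import random
import time

def call_with_retry(send, max_attempts=5, sleep=time.sleep):
    """Call send() until success, a non-retryable error, or max_attempts.

    `send` is a hypothetical callable returning (status, headers, body).
    """
    for attempt in range(max_attempts):
        status, headers, body = send()
        if status < 400:
            return status, body
        retryable = status == 429 or status >= 500   # 529 falls under >= 500
        if not retryable or attempt == max_attempts - 1:
            raise RuntimeError(
                f"giving up on HTTP {status}, "
                f"request-id={headers.get('request-id')}"
            )
        try:
            base = float(headers.get("retry-after"))  # server minimum wait
        except (TypeError, ValueError):
            base = min(60.0, 2.0 ** attempt)          # capped backoff fallback
        # Jitter of 10-40% on top of the base delay.
        sleep(base * (1.0 + random.uniform(0.1, 0.4)))
```

The error raised on the last attempt carries the request-id so that the failure can be traced in logs or a support ticket.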

Cost and Capacity Math

| Scenario | Math | Result | Status |
|---|---|---|---|
| Burst shape | 60 RPM can be enforced as 1 RPS | 60 calls at once can fail; 1/sec queue passes | Confirmed principle |
| Cache-aware ITPM | 2M ITPM with 80% cache hit | Effective 10M total input tokens/minute | Confirmed example |
| Workspace cap | Org 40K ITPM, workspace 30K ITPM | Other workspaces still have at least 10K ITPM if unused tokens remain | Confirmed example |
| Retry storm | 100 failed workers retry immediately | 100 more failures plus load spike | Likely |
| Long output | 20 calls x 4K output | 80K output tokens pressure OTPM | Confirmed math |

Cost calculation 1: If your org has 40,000 ITPM and one workspace is capped at 30,000 ITPM, that workspace cannot consume the full org bucket. Anthropic uses this exact pattern to explain workspace limits: the remaining unused tokens are available to other workspaces.

Cost calculation 2: With a 2,000,000 ITPM limit and 80% cache hit rate, Anthropic's docs say you can effectively process 10,000,000 total input tokens per minute because cached reads do not count toward ITPM for most models. Only the uncached 20% counts against the limit, so the total is 2,000,000 / 0.2. That is a 5x gain in effective throughput, not just a pricing discount.
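The arithmetic behind that figure is a one-line formula. The sketch below assumes that only uncached input counts toward ITPM and ignores cache creation tokens, which do count, so real throughput will be somewhat lower. The function name `effective_input_tpm` is hypothetical.

```python
def effective_input_tpm(itpm_limit, cache_hit_rate):
    """Total input tokens per minute when only uncached tokens count."""
    if not 0.0 <= cache_hit_rate < 1.0:
        raise ValueError("cache_hit_rate must be in [0, 1)")
    # Uncached share is (1 - hit rate); the limit applies to that share only.
    return itpm_limit / (1.0 - cache_hit_rate)

print(round(effective_input_tpm(2_000_000, 0.8)))  # 10000000
```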

Cost calculation 3: If a job launches 200 parallel requests against a lane that effectively accepts 1 request/second, a naive retry loop can create minutes of self-inflicted 429s. Queueing those 200 requests at 1/sec completes dispatch in about 200 seconds without turning every retry into another rate-limit event.
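A toy simulation makes the difference visible. It assumes a deliberately simplified server that accepts a fixed number of requests per one-second tick and rejects the rest, and workers that retry on the very next tick. Neither assumption describes Anthropic's actual limiter, and the `simulate` function is a hypothetical name.

```python
def simulate(workers, accepted_per_tick, paced):
    """Return (ticks, rejected) for a toy limiter with one-second ticks.

    paced=False: every pending worker sends on every tick (naive retry).
    paced=True:  only as many workers send as the limiter accepts.
    """
    pending = workers
    ticks = 0
    rejected = 0
    while pending > 0:
        ticks += 1
        sent = min(pending, accepted_per_tick) if paced else pending
        accepted = min(sent, accepted_per_tick)
        rejected += sent - accepted   # each rejection stands in for a 429
        pending -= accepted
    return ticks, rejected

print(simulate(200, 1, paced=False))  # (200, 19900)
print(simulate(200, 1, paced=True))   # (200, 0)
```

Under these assumptions both strategies finish in the same 200 seconds, but the naive loop produces 19,900 rejected requests along the way while the paced queue produces none.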

For token price tradeoffs after the error is fixed, use Claude API Pricing 2026 and Claude API Cache Pricing 2026. A 429 fix that doubles cache hit rate can be worth more than a model downgrade.

Workspace Service Tier and Fast Mode Traps

| Feature | Common mistake | Correct read | Source |
|---|---|---|---|
| Usage tier | Assuming higher tier means no limits | Higher tiers raise limits but still enforce them | Anthropic rate limits |
| Spend limit | Treating 429 as purely RPM/TPM | Spend caps can halt usage until next month | Anthropic rate limits |
| Workspace limit | Looking only at org limits | Workspace caps can be lower than org caps | Anthropic rate limits |
| Priority Tier | Expecting no rate limits | Requests still pull from regular rate limits | Anthropic service tiers |
| Batch API | Using live API for offline jobs | Batch has separate queue limits | Anthropic rate limits |
| Fast mode | Assuming standard Opus limits apply | Fast mode has dedicated limits and headers | Anthropic rate limits |
| Claude Platform on AWS | Expecting Anthropic automatic tier advancement | AWS path has different billing/spend behavior | Anthropic rate limits |

The dangerous pattern is treating every 429 as a code bug. Sometimes it is spend ceiling. Sometimes it is workspace policy. Sometimes it is acceleration. The fix changes.

Fallback Routing

| Primary failure | Better fallback | Why |
|---|---|---|
| Claude RPM exhausted | Same Claude model later | If the model is required, delay is safest |
| Claude ITPM exhausted | Cached prompt or smaller-context model | Reduce input pressure |
| Claude OTPM exhausted | Shorter output or cheaper long-output model | Reduce output pressure |
| 529 overloaded | Another provider/model | This is shared platform load |
| Fast mode 429 | Standard mode | Different lane |
| Workspace cap | Different workspace only if policy allows | Avoid bypassing governance |
| User-facing SLA | Gateway fallback | Protect UX |
| Batch job | Batch API | Remove from live rate pool |

The routing layer matters because one provider's rate limit should not take down the product. TokenMix covers this pattern in AI API Gateway 2026, and the Claude/OpenAI-compatible setup path is covered in Anthropic OpenAI-Compatible API 2026.
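The routing table above can be expressed as a small decision function. This is a sketch only: the function name `choose_fallback`, the bucket labels, and the action strings are hypothetical placeholders for whatever your gateway actually does, and identifying the bucket is assumed to happen elsewhere (for example from the response headers).

```python
def choose_fallback(status, bucket=None, fast_mode=False):
    """Map a failure to a fallback action, following the routing table."""
    if status == 529:
        return "other_provider"        # shared platform load, not your quota
    if status != 429:
        return "no_fallback"           # not a rate-limit or overload failure
    if fast_mode:
        return "standard_mode"         # fast mode has its own lane
    actions = {
        "requests": "delay_same_model",
        "input_tokens": "cached_or_smaller_context",
        "output_tokens": "shorter_output",
        "workspace": "other_workspace_if_policy_allows",
    }
    # Unknown bucket: waiting on the same model is the conservative default.
    return actions.get(bucket, "delay_same_model")
```

Keeping the mapping in one place makes it easy to review against governance rules, such as never routing around a workspace cap without approval.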

Risks and Caveats

| Risk | Status | Mitigation |
|---|---|---|
| Published limits are maximum allowed usage, not guaranteed minimums | Confirmed | Build headroom |
| Headers may show the most restrictive active limiter | Confirmed | Log all headers, not only one |
| Short bursts can fail under minute limits | Confirmed | Smooth traffic |
| Cached-read behavior differs for marked models | Confirmed | Check dagger notes in docs |
| max_tokens is mistaken for OTPM usage | False | OTPM counts actual generated output |
| 529 is handled like 429 | False | Use fallback and retry separately |
| Priority Tier is treated as unlimited | False | Still obeys regular limits |
| Claude Code errors are assumed to be raw API errors | Likely | Check Claude Code docs and account state |

Final Recommendation

For Claude 429, do not start with a bigger sleep. Start with the bucket. Log retry-after, the request-limit headers, the token-limit headers, request ID, model, workspace, cache hit rate, and fast mode state. Then apply queueing, caching, jitter, and fallback in that order.
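A structured log line covering those fields might look like the sketch below. The function name `build_429_log` and the record's key names are hypothetical, and `cache_hit_rate` is assumed to be a value your own application tracks, since it is not returned as a header.

```python
import json

def build_429_log(status, error_type, message, headers, model, workspace,
                  cache_hit_rate=None, fast_mode=False, batch=False):
    """Collect the fields worth logging for one failed request as JSON."""
    lowered = {k.lower(): v for k, v in headers.items()}
    prefixes = ("anthropic-ratelimit-", "anthropic-fast-")
    record = {
        "status": status,
        "error_type": error_type,
        "message": message,
        "request_id": lowered.get("request-id"),
        "retry_after": lowered.get("retry-after"),
        # Keep every rate-limit header, not only the one that looks relevant.
        "ratelimit_headers": {
            k: v for k, v in lowered.items() if k.startswith(prefixes)
        },
        "model": model,
        "workspace": workspace,
        "cache_hit_rate": cache_hit_rate,
        "fast_mode": fast_mode,
        "batch": batch,
    }
    return json.dumps(record, sort_keys=True)
```

Emitting one JSON line per failure makes it possible to group 429s by bucket, model, and workspace later instead of guessing from the status code alone.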

FAQ

What does Claude rate exceeded mean?

Claude rate exceeded usually means HTTP 429 rate_limit_error. It can come from RPM, ITPM, OTPM, spend caps, workspace caps, fast mode limits, or acceleration limits.

What is the difference between RPM and TPM for Claude?

Anthropic splits token rate limits into ITPM and OTPM, not one generic TPM bucket. ITPM covers input tokens per minute; OTPM covers actual output tokens generated per minute.

Should I retry every Claude 429?

Retry only after respecting retry-after. Add jitter and a maximum retry count. If the same bucket keeps failing, reduce concurrency, compress prompts, cache input, or route elsewhere.

Does prompt caching help Claude rate limits?

Yes. For most Claude models, cached input reads do not count toward ITPM. Cache creation tokens still count, so caching helps most when repeated context is reused across many requests.

Why do I get 429 even below my RPM limit?

You may be hitting ITPM, OTPM, workspace limits, spend limits, fast mode limits, or short-interval burst enforcement. Anthropic says a 60 RPM limit can be enforced as 1 request per second.

Is 529 the same as 429?

No. 429 means your account hit a rate limit or acceleration limit. 529 means the API is temporarily overloaded across users.

Does Priority Tier fix Claude rate limits?

No. Priority Tier improves service level and capacity priority, but Anthropic says requests still observe regular rate limits. Use it for production predictability, not unlimited throughput.

What should I log for Claude 429 debugging?

Log status code, error type, message, request-id, retry-after, all anthropic-ratelimit-* headers, model, workspace, cache hit rate, and whether fast mode or batch API was used.
