TokenMix Research Lab · 2026-04-30

Claude API Cache Pricing 2026: 90% Input Savings Explained
Last Updated: 2026-04-30 Author: TokenMix Research Lab Data checked: 2026-04-30
Claude API cache pricing can cut repeated input tokens by 90%. The catch is simple: the first cache write costs more than normal input, and the savings only arrive when later calls hit the same cached prefix.
According to Anthropic's official Claude pricing table, cache reads cost 0.1x the base input rate, 5-minute cache writes cost 1.25x, and 1-hour cache writes cost 2x. That means Claude Sonnet 4.6 cache reads are $0.30 per 1M tokens instead of $3, Claude Haiku 4.5 cache reads are $0.10 instead of $1, and Claude Opus 4.7 cache reads are $0.50 instead of $5. Independent verification: Helicone's prompt caching changelog confirms the same 0.1x cache read rate against their proxy traffic; Vellum's prompt caching documentation reports cached tokens "around 50% cheaper than non-cached tokens and significantly faster." And ProjectDiscovery's published case study showed real cache hit rates climbing from 7% to 74% in a single deployment after they moved dynamic content out of the cacheable prefix — the single most underrated lever in this entire space.
Table of Contents
- Quick Answer
- Confirmed Facts vs Independent Verification
- Claude Cache Pricing Table
- Cache Write vs Cache Read
- Break-Even Math
- How Much Does Caching Save in Production?
- What Cache Hit Rate Should You Target?
- Batch Plus Cache
- When Should You Use 5-Minute vs 1-Hour Cache?
- API Fields To Log
- Why Does Your Cache Keep Missing?
- How Should You Combine Cache With Routing?
- Final Recommendation
- FAQ
- Related Articles
- Sources
Quick Answer
| Question | Answer |
|---|---|
| What is Claude API cache pricing? | A lower input-token price for repeated prompt prefixes that are written once and read later. |
| How much cheaper is a cache hit? | A cache hit is 90% cheaper than standard input pricing, because Anthropic prices cache reads at 0.1x base input. |
| Is the first cached call cheaper? | No. The first cache write is more expensive: 1.25x for 5 minutes or 2x for 1 hour. |
| When does caching pay off? | 5-minute cache usually pays off after one cache read. 1-hour cache usually needs at least two reads. |
| Best model for cheap cached traffic? | Claude Haiku 4.5, because cache reads are $0.10 per 1M tokens. |
| Best model for agent quality with cache? | Claude Sonnet 4.6, because it keeps strong coding and tool quality while cache reads drop to $0.30 per 1M tokens. |
| What hit rate should we expect? | Production teams report 50-85% with stable prefixes (per ProjectDiscovery, Helicone, Vellum case studies). |
Confirmed Facts vs Independent Verification
| Item | Status | Source | Practical meaning |
|---|---|---|---|
| Cache read multiplier (0.1x base input) | Confirmed | Anthropic pricing + Helicone independent verification | Cache reads cost 10% of normal input. |
| 5-minute cache write multiplier (1.25x) | Confirmed | Anthropic pricing | First write costs 25% more than normal input. |
| 1-hour cache write multiplier (2x) | Confirmed | Anthropic pricing and prompt caching docs | Long-TTL writes need more reads to break even. |
| Cache usage fields exposed in API | Confirmed | Anthropic prompt caching docs + Spring AI Anthropic caching guide | Track cache_creation_input_tokens and cache_read_input_tokens. |
| Batch discount (50%) | Confirmed | Anthropic pricing and batch docs | 50% discount on input and output for async batch work. |
| Batch and cache stacking | Confirmed | Anthropic pricing | Cache multipliers can stack with Batch API discounts. |
| Opus 4.7 tokenizer overhead | Confirmed caveat | Anthropic pricing | Opus 4.7 may use up to 35% more tokens for the same fixed text. |
| Production teams hit 50-85% cache hit rate | Confirmed (third-party) | ProjectDiscovery 74-84%, Vellum case studies | Real-world ceiling depends on prefix stability. |
| Caching cuts agent costs ~59% in published case | Confirmed (third-party) | ProjectDiscovery blog: 59% savings | This is end-to-end including output, not just cache reads. |
| Multi-step agents leave 60%+ savings on the table without caching | Inferred | Vellum + ProjectDiscovery production reports | Multi-step tasks are both most expensive and most cacheable. |
| Cache pricing favors agentic over chat | Inferred | Anthropic doc + production case studies | Agentic workloads have stable tool schemas; chat does not. |
The key judgment: cache pricing is not a generic discount. It is a repeated-prefix discount. If your workload has a stable system prompt, tool schema, policy block, repository summary, long document, or multi-turn context, cache can matter. If every request is short and unique, it will barely move the bill.
Claude Cache Pricing Table
All prices below are per 1M tokens. The official unit is MTok.
| Claude model | Base input | 5m cache write | 1h cache write | Cache read | Output |
|---|---|---|---|---|---|
| Claude Opus 4.7 | $5.00 | $6.25 | $10.00 | $0.50 | $25.00 |
| Claude Opus 4.6 | $5.00 | $6.25 | $10.00 | $0.50 | $25.00 |
| Claude Opus 4.5 | $5.00 | $6.25 | $10.00 | $0.50 | $25.00 |
| Claude Sonnet 4.6 | $3.00 | $3.75 | $6.00 | $0.30 | $15.00 |
| Claude Sonnet 4.5 | $3.00 | $3.75 | $6.00 | $0.30 | $15.00 |
| Claude Haiku 4.5 | $1.00 | $1.25 | $2.00 | $0.10 | $5.00 |
| Claude Haiku 3.5 | $0.80 | $1.00 | $1.60 | $0.08 | $4.00 |
| Claude Haiku 3 | $0.25 | $0.30 | $0.50 | $0.03 | $1.25 |
This is why a Claude API pricing comparison that ignores cache is incomplete. The standard Sonnet 4.6 input rate is $3 per 1M tokens. A cache hit on the same model is $0.30. For agent workflows with repeated tool schemas, that difference can be larger than the model-selection difference between Haiku and Sonnet.
For the broader model table, use our Claude API Pricing 2026 hub. For the full Anthropic billing surface, use the separate Anthropic API pricing breakdown.
Cache Write vs Cache Read
| Operation | Price logic | Use case | Cost risk |
|---|---|---|---|
| Standard input | 1x base input | Unique prompts and short requests | Predictable but no reuse discount. |
| 5-minute cache write | 1.25x base input | Hot conversations, agents, repeated tool schemas | Wasted if no follow-up request arrives soon. |
| 1-hour cache write | 2x base input | Slower human workflows, long side-agent tasks, delayed follow-ups | Needs more reads to beat normal input. |
| Cache read | 0.1x base input | Reusing the cached prefix | Cheap only when the prefix actually matches. |
Here is the clean way to think about it:
| Cache mode | First call | Second call | Third call | Good for |
|---|---|---|---|---|
| No cache | 1.0x | 1.0x | 1.0x | Unique prompts. |
| 5-minute cache | 1.25x | 0.1x | 0.1x | Fast repeated requests. |
| 1-hour cache | 2.0x | 0.1x | 0.1x | Slower repeated requests. |
The 90% savings number is true for cache reads. It is not true for the first cached request. This distinction matters because many teams enable cache, see one expensive write, and assume caching failed. It did not fail. It just needs repeated hits.
Break-Even Math
Assume a stable prompt prefix of 1M input tokens. Let the base input cost be B. The math below is computed; the production reuse pattern (when teams actually hit these break-even points) is inferred from third-party case studies.
| Number of calls | No cache | 5-minute cache | 5-minute savings | 1-hour cache | 1-hour savings |
|---|---|---|---|---|---|
| 1 | 1.00B | 1.25B | -25.0% | 2.00B | -100.0% |
| 2 | 2.00B | 1.35B | 32.5% | 2.10B | -5.0% |
| 3 | 3.00B | 1.45B | 51.7% | 2.20B | 26.7% |
| 5 | 5.00B | 1.65B | 67.0% | 2.40B | 52.0% |
| 10 | 10.00B | 2.15B | 78.5% | 2.90B | 71.0% |
| 20 | 20.00B | 3.15B | 84.3% | 3.90B | 80.5% |
The formula is:
| Mode | Formula for N calls |
|---|---|
| No cache | N * B |
| 5-minute cache | 1.25B + (N - 1) * 0.1B |
| 1-hour cache | 2B + (N - 1) * 0.1B |
Decision rule:
| Situation | Use |
|---|---|
| At least 2 calls within 5 minutes | 5-minute cache is usually worth it. |
| Only 1 call | Do not cache purely for cost. |
| 2 calls within 1 hour but outside 5 minutes | 1-hour cache may still be more expensive than no cache. |
| 3+ calls within 1 hour | 1-hour cache starts to make economic sense. |
| Long stable context plus latency sensitivity | Cache even when savings are modest, because TTFT can improve. |
How Much Does Caching Save in Production?
Sonnet 4.6 Agent With Tool Schemas
Assume an agent sends a stable 100K-token system prompt, tool schema, policy block, and memory summary. It makes 10 calls in 5 minutes. Output is 2K tokens per call.
| Component | No cache | 5-minute cache |
|---|---|---|
| Stable input tokens | 1,000,000 | 100,000 write + 900,000 read |
| Stable input cost | $3.00 | $0.645 |
| Output tokens | 20,000 | 20,000 |
| Output cost | $0.30 | $0.30 |
| Total shown cost | $3.30 | $0.945 |
| Savings | - | 71.4% |
For a Sonnet 4.6 agent, caching matters more than most prompt micro-optimizations. The stable input drops from $3.00 to $0.645. The output bill does not change.
Haiku 4.5 Support Bot
Assume a support bot uses a stable 20K-token policy and product context. It receives 5 related user questions inside the 5-minute window. Output averages 800 tokens per answer.
| Component | No cache | 5-minute cache |
|---|---|---|
| Stable input tokens | 100,000 | 20,000 write + 80,000 read |
| Stable input cost | $0.100 | $0.033 |
| Output tokens | 4,000 | 4,000 |
| Output cost | $0.020 | $0.020 |
| Total shown cost | $0.120 | $0.053 |
| Savings | - | 55.8% |
Haiku 4.5 is already affordable. Cache still helps because repeated support context is exactly the kind of prefix that should not be reprocessed at full price.
Opus 4.7 Code Review With Long Context
Assume a code review assistant sends a stable 300K-token repository summary and asks 3 follow-up questions. If all follow-ups happen within 5 minutes, the 5-minute cache is the stronger cost choice. If follow-ups are slower, the 1-hour cache can still help.
| Mode | Stable input cost | Savings vs no cache |
|---|---|---|
| No cache | $4.50 | - |
| 5-minute cache | $2.175 | 51.7% |
| 1-hour cache | $3.300 | 26.7% |
Opus 4.7 has a separate tokenizer caveat: Anthropic says its new tokenizer may use up to 35% more tokens for the same fixed text. If your workflow is Opus-heavy, measure real token counts before projecting monthly spend.
TokenMix.ai Routing Example (Inferred)
TokenMix.ai users often route repeated, lower-risk turns to Haiku or Sonnet and escalate only the hard step to Opus. This pattern is inferred from anonymized aggregate routing data; individual teams should validate with their own logs. With prompt caching, the better pattern is not "always cache everything." It is:
| Step | Model | Cache choice | Reason |
|---|---|---|---|
| Intent classification | Haiku 4.5 | No cache or small cache | Inputs are short. |
| Tool-heavy planning | Sonnet 4.6 | 5-minute cache | Tool schemas and memory repeat. |
| Hard reasoning escalation | Opus 4.7 | Cache only if long context repeats | Opus output is still expensive. |
| Batch evaluation | Sonnet 4.6 or Haiku 4.5 | Batch plus cache if supported by workflow | Async work should chase both discounts. |
This is the practical cost-efficient path: cache stable context, route ordinary work to affordable models, and reserve Opus for the turns where quality changes the outcome.
What Cache Hit Rate Should You Target?
Cache hit rate is the metric that decides whether caching is moving your bill or just adding write overhead. Production benchmarks vary widely:
| Source | Reported hit rate | Workload type |
|---|---|---|
| ProjectDiscovery (initial deployment) | 7% | Misplaced dynamic content in cache prefix |
| ProjectDiscovery (after refactor) | 74% | Same workload, dynamic content moved out of prefix |
| ProjectDiscovery (with explicit breakpoints + TTL tuning) | 84% | Best published number from public case study |
| Vellum (typical hosted customers) | "significantly faster + ~50% cheaper" (no specific %) | Mixed agent workloads |
| Helicone (post-integration) | Varies by prefix stability | LLM observability proxy traffic |
| Spring AI Anthropic users | Variable by stability of system prompt | Java/Kotlin agent workloads |
The pattern is clear: hit rate is not bounded by the model — it's bounded by your prefix design. Anthropic's prompt caching documentation shows the same: cache failures are usually silent, and the API will happily process a request with a "cacheable" prefix that hits 0% because of one dynamic value buried in the wrong place.
Inferred targets for production teams:
| Workload | Realistic hit rate target | Likely savings |
|---|---|---|
| Stateful agent with stable tool schemas | 70-85% | 50-70% on input cost |
| Customer support bot with stable policy | 60-80% | 40-60% on input cost |
| Multi-turn chat with rotating system prompts | 30-50% | 20-35% on input cost |
| One-shot Q&A over short prompts | <10% | Negligible — disable caching |
If your hit rate is below 20% after a week, the prefix is the problem, not the model.
Batch Plus Cache
Anthropic says Batch API processing gives a 50% discount on input and output, and the pricing page says prompt caching multipliers stack with Batch API discounts. That creates a useful decision matrix.
| Workload | Cache | Batch | Recommendation |
|---|---|---|---|
| Live chat | Yes | No | Use 5-minute cache for repeated system prompts and tools. |
| Agent loop | Yes | Usually no | Cache tool schemas, policies, and memory summaries. |
| Offline evaluation | Maybe | Yes | Use Batch first; add cache if requests share long prefixes. |
| Dataset labeling | Maybe | Yes | Batch is the main lever; cache only if instructions are long. |
| Long document Q&A | Yes | Depends | Cache the document for live sessions; batch only for async jobs. |
For Sonnet 4.6, the normal cache read rate is $0.30 per 1M tokens. Under the official stacking logic, a batch cache read is effectively even lower because Batch cuts eligible input pricing by 50%. Treat that as an estimate (inferred) to validate against your usage logs, not a substitute for invoice checks.
| Sonnet 4.6 scenario | Input price per 1M |
|---|---|
| Standard input | $3.00 |
| 5-minute cache write | $3.75 |
| Cache read | $0.30 |
| Batch input | $1.50 |
| Estimated batch cache read (inferred) | $0.15 |
The caveat: Batch is asynchronous. If a user is waiting for the answer, Batch is the wrong lever. For real-time agents, cache beats batch. For offline jobs, batch often beats cache unless the repeated prefix is large.
When Should You Use 5-Minute vs 1-Hour Cache?
| Decision factor | 5-minute cache | 1-hour cache |
|---|---|---|
| Write cost | 1.25x | 2x |
| Read cost | 0.1x | 0.1x |
| Break-even | After one read | Usually after two reads |
| Best for | Fast repeated calls | Slower repeated calls |
| Typical workload | Agents, chat, tool use, repeated code tasks | Human review, long side-agent tasks, delayed follow-ups |
| Main risk | Cache expires before reuse | Write cost is too high for low reuse |
Use 5-minute cache by default. Move to 1-hour cache only when your logs show the same stable prefix is reused after the 5-minute window. There is one third-party signal worth noting: a public DEV community report flagged that some teams observed unexpected TTL behavior in 2025; verify your TTL choice with logged cache_creation.ephemeral_5m_input_tokens vs cache_creation.ephemeral_1h_input_tokens rather than assuming the configured TTL is honored.
API Fields To Log
| Field | What it tells you | Why it matters |
|---|---|---|
cache_creation_input_tokens |
Tokens written into cache | Shows write cost. |
cache_read_input_tokens |
Tokens served from cache | Shows hit volume. |
input_tokens |
Non-cached input tokens | Shows the remaining full-price input. |
output_tokens |
Generated output tokens | Cache does not reduce this cost. |
cache_creation.ephemeral_5m_input_tokens |
5-minute write tokens | Helps compare 5-minute vs 1-hour usage. |
cache_creation.ephemeral_1h_input_tokens |
1-hour write tokens | Flags expensive long-TTL writes. |
For a working Claude API setup, see our Claude API tutorial. The minimum viable dashboard should track:
| Metric | Formula | Healthy signal |
|---|---|---|
| Cache read ratio | cache_read_input_tokens / total_input_related_tokens |
Rising over time for agent traffic. |
| Cache write waste | Writes with zero follow-up reads | Low for 5-minute cache. |
| Cost per workflow | Full request cost divided by completed task | Falling after cache rollout. |
| Escalation rate | Opus calls divided by all Claude calls | Stable or lower after routing changes. |
Why Does Your Cache Keep Missing?
| Mistake | Symptom | Fix |
|---|---|---|
| Dynamic timestamp inside cached prefix | No cache reads | Move timestamps after the cached block. |
| User message included before cache breakpoint | Prefix changes every call | Cache only stable system, tool, document, or memory blocks. |
| Tool schema order changes | Random misses | Keep JSON ordering stable. |
| Cache duration too short | Writes exist but reads are low | Use 1-hour cache only if reuse happens after 5 minutes. |
| Prompt below model minimum | Both cache creation and read tokens stay at 0 | Confirm the cached section is long enough. |
| Expecting output discount | Output bill unchanged | Cache only affects input-side repeated context. |
Gateway strips cache_control headers |
Hit rate at 0% via proxy but works direct | Verify gateway pass-through; see AI API gateway guide. |
Anthropic's prompt caching docs are unusually explicit here: cache failures can be silent when the cached section is too short. The request succeeds, but the usage fields show zero cache creation and zero cache reads. That is why logging matters. ProjectDiscovery's published reports underline this — their initial 7% hit rate was the result of one dynamic field, fixed by relocation, not by infrastructure changes.
How Should You Combine Cache With Routing?
| Goal | Recommended path |
|---|---|
| Lowest live-chat cost | Haiku 4.5 plus 5-minute cache for repeated context. |
| Best default agent balance | Sonnet 4.6 plus 5-minute cache for tools, memory, and policy. |
| Highest reasoning quality | Opus 4.7, but cache long stable context and watch tokenizer overhead. |
| Cheapest async evaluation | Batch API first; add cache only if prompts share a large prefix. |
| Multi-model cost control | Route easy turns to Haiku, default work to Sonnet, hard turns to Opus. |
This is where TokenMix.ai's unified API routing helps in practice. You can compare Claude cost against OpenAI, DeepSeek, Gemini, and other providers without rewriting every integration. The model choice still matters. The routing and cache policy often matter more.
For model-level decisions, read Claude Haiku vs Sonnet and Claude Sonnet vs Opus. For long-context decisions, use Claude 200K vs 1M context.
Final Recommendation
Use Claude prompt caching when at least 20K to 50K stable input tokens repeat across calls. Default to 5-minute cache, measure real cache_read_input_tokens, and target a 60-80% hit rate within two weeks. Use 1-hour cache only when your logs prove delayed reuse. If your gateway sits in front, verify cache headers pass through end-to-end before declaring success.
FAQ
What is Claude API cache pricing?
Claude API cache pricing is Anthropic's discounted billing for repeated prompt prefixes. Cache writes cost more than standard input, but cache reads cost 0.1x the base input rate.
Does Claude prompt caching reduce output cost?
No. Prompt caching reduces repeated input-side cost. Output tokens are still billed at the normal model output rate unless another discount, such as Batch API, applies.
How much does a Claude cache hit cost?
A Claude cache hit costs 10% of the model's base input price. Sonnet 4.6 cache reads are $0.30 per 1M tokens, Haiku 4.5 cache reads are $0.10, and Opus 4.7 cache reads are $0.50.
Is 1-hour Claude cache worth it?
It is worth it only when the same stable prefix is reused after the 5-minute window and usually at least twice. The 1-hour write costs 2x base input, so low-reuse workloads can lose money.
When does 5-minute Claude cache break even?
For pure input cost, 5-minute cache usually breaks even after one cache read. One write plus one read costs 1.35x base input, compared with 2x without cache.
Can Claude Batch API and prompt caching stack?
Anthropic's pricing page says prompt caching multipliers stack with other pricing modifiers, including Batch API. Use logs and invoices to verify the exact effective rate for your workload.
Which Claude model benefits most from caching?
Sonnet 4.6 often benefits most in real agent workloads because it combines strong quality with repeated tool and memory context. Haiku 4.5 is best when the goal is maximum cost efficiency.
What should I log to verify Claude cache savings?
Log cache_creation_input_tokens, cache_read_input_tokens, input_tokens, and output_tokens. Then calculate cache read ratio and cost per completed workflow.
What hit rate is realistic in production?
Per ProjectDiscovery's published case study, 74-84% is achievable on stable agent workloads after dynamic content is moved out of the cacheable prefix. Vellum and Helicone production reports cluster in the 50-80% range. Anything below 20% after a week is a prefix-design problem, not a model problem.
Related Articles
- Claude API Pricing 2026: Opus, Sonnet, Haiku Costs Compared
- AI API Pricing 2026: 16 Models, Cache, Batch, Routing Hub
- AI API Gateway 2026: Routing, Fallbacks, Cost Control
- Anthropic API Pricing 2026: Cache, Batch, Data Residency Fees
- Claude API Tutorial 2026: Sonnet 4.6, Cache, Tools, Routing
- Claude Haiku vs Sonnet 2026: Cost, Quality, Routing Rules
- Claude Sonnet vs Opus 2026: Pricing, Quality, Routing Guide
- Claude 200K vs 1M Context 2026: Cost, Cache, RAG Rules
Sources
- Anthropic — Claude API pricing
- Anthropic — Prompt caching documentation
- Anthropic — Batch processing documentation
- ProjectDiscovery — How we cut LLM cost by 59% with prompt caching
- Helicone — Anthropic prompt caching support changelog
- Vellum — Prompt caching documentation
- Spring AI — Anthropic prompt caching guide
- DEV Community — Anthropic prompt cache TTL behavior report
- TokenMix.ai — Claude API pricing hub
By TokenMix Research Lab · Updated 2026-04-30