Claude API Cache Pricing 2026: 90% Input Savings Explained
Last Updated: 2026-04-30
Author: TokenMix Research Lab
Data checked: 2026-04-30
Claude API cache pricing can cut repeated input tokens by 90%. The catch is simple: the first cache write costs more than normal input, and the savings only arrive when later calls hit the same cached prefix.
According to Anthropic's official Claude pricing table, cache reads cost 0.1x the base input rate, 5-minute cache writes cost 1.25x, and 1-hour cache writes cost 2x. That means Claude Sonnet 4.6 cache reads are $0.30 per 1M tokens instead of $3, Claude Haiku 4.5 cache reads are $0.10 instead of $1, and Claude Opus 4.7 cache reads are $0.50 instead of $5. Independent verification: Helicone's prompt caching changelog confirms the same 0.1x cache read rate against their proxy traffic; Vellum's prompt caching documentation reports cached tokens "around 50% cheaper than non-cached tokens and significantly faster." And ProjectDiscovery's published case study showed real cache hit rates climbing from 7% to 74% in a single deployment after they moved dynamic content out of the cacheable prefix: the single most underrated lever in this entire space.
Two claims frame this piece. Both savings figures are end-to-end including output, not just cache reads:

| Claim | Status | Source | Why it matters |
| --- | --- | --- | --- |
| Multi-step agents leave 60%+ savings on the table without caching | Inferred | Vellum + ProjectDiscovery production reports | Multi-step tasks are both most expensive and most cacheable. |
| Cache pricing favors agentic over chat | Inferred | Anthropic doc + production case studies | Agentic workloads have stable tool schemas; chat does not. |
The key judgment: cache pricing is not a generic discount. It is a repeated-prefix discount. If your workload has a stable system prompt, tool schema, policy block, repository summary, long document, or multi-turn context, cache can matter. If every request is short and unique, it will barely move the bill.
Claude Cache Pricing Table
All prices below are per 1M tokens. The official unit is MTok.
| Claude model | Base input | 5m cache write | 1h cache write | Cache read | Output |
| --- | --- | --- | --- | --- | --- |
| Claude Opus 4.7 | $5.00 | $6.25 | $10.00 | $0.50 | $25.00 |
| Claude Opus 4.6 | $5.00 | $6.25 | $10.00 | $0.50 | $25.00 |
| Claude Opus 4.5 | $5.00 | $6.25 | $10.00 | $0.50 | $25.00 |
| Claude Sonnet 4.6 | $3.00 | $3.75 | $6.00 | $0.30 | $15.00 |
| Claude Sonnet 4.5 | $3.00 | $3.75 | $6.00 | $0.30 | $15.00 |
| Claude Haiku 4.5 | $1.00 | $1.25 | $2.00 | $0.10 | $5.00 |
| Claude Haiku 3.5 | $0.80 | $1.00 | $1.60 | $0.08 | $4.00 |
| Claude Haiku 3 | $0.25 | $0.30 | $0.50 | $0.03 | $1.25 |
This is why a Claude API pricing comparison that ignores cache is incomplete. The standard Sonnet 4.6 input rate is $3 per 1M tokens. A cache hit on the same model is $0.30. For agent workflows with repeated tool schemas, that difference can be larger than the model-selection difference between Haiku and Sonnet.
The three cache price points at a glance:

| Price type | Rate | Applies when | Caveat |
| --- | --- | --- | --- |
| 5-minute cache write | 1.25x base input | Writing a prefix that will be reused within minutes | Pays for itself after a single read. |
| 1-hour cache write | 2x base input | Slower human workflows, long side-agent tasks, delayed follow-ups | Needs more reads to beat normal input. |
| Cache read | 0.1x base input | Reusing the cached prefix | Cheap only when the prefix actually matches. |
Here is the clean way to think about it:
| Cache mode | First call | Second call | Third call | Good for |
| --- | --- | --- | --- | --- |
| No cache | 1.0x | 1.0x | 1.0x | Unique prompts. |
| 5-minute cache | 1.25x | 0.1x | 0.1x | Fast repeated requests. |
| 1-hour cache | 2.0x | 0.1x | 0.1x | Slower repeated requests. |
The 90% savings number is true for cache reads. It is not true for the first cached request. This distinction matters because many teams enable cache, see one expensive write, and assume caching failed. It did not fail. It just needs repeated hits.
Break-Even Math
Assume a stable prompt prefix of 1M input tokens. Let the base input cost be B. The math below is computed; the production reuse pattern (when teams actually hit these break-even points) is inferred from third-party case studies.
| Number of calls | No cache | 5-minute cache | 5-minute savings | 1-hour cache | 1-hour savings |
| --- | --- | --- | --- | --- | --- |
| 1 | 1.00B | 1.25B | -25.0% | 2.00B | -100.0% |
| 2 | 2.00B | 1.35B | 32.5% | 2.10B | -5.0% |
| 3 | 3.00B | 1.45B | 51.7% | 2.20B | 26.7% |
| 5 | 5.00B | 1.65B | 67.0% | 2.40B | 52.0% |
| 10 | 10.00B | 2.15B | 78.5% | 2.90B | 71.0% |
| 20 | 20.00B | 3.15B | 84.3% | 3.90B | 80.5% |
The formula is:
| Mode | Formula for N calls |
| --- | --- |
| No cache | N * B |
| 5-minute cache | 1.25B + (N - 1) * 0.1B |
| 1-hour cache | 2B + (N - 1) * 0.1B |
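To sanity-check the table, here is a minimal Python sketch of the three formulas. The multipliers are Anthropic's published rates; the function names are illustrative, not part of any SDK.

```python
# Break-even sketch for Claude prompt caching, per the formulas above.
# B is the base input cost of the stable prefix; multipliers are
# Anthropic's published rates (1.25x / 2x writes, 0.1x reads).

def cost_no_cache(n_calls: int, b: float = 1.0) -> float:
    return n_calls * b

def cost_5m_cache(n_calls: int, b: float = 1.0) -> float:
    # One 1.25x write, then 0.1x reads for every later call.
    return 1.25 * b + (n_calls - 1) * 0.1 * b

def cost_1h_cache(n_calls: int, b: float = 1.0) -> float:
    # One 2x write, then 0.1x reads for every later call.
    return 2.0 * b + (n_calls - 1) * 0.1 * b

for n in (1, 2, 3, 5, 10, 20):
    base = cost_no_cache(n)
    for label, fn in (("5m", cost_5m_cache), ("1h", cost_1h_cache)):
        cost = fn(n)
        print(f"{n:>2} calls, {label}: {cost:.2f}B "
              f"({(1 - cost / base) * 100:+.1f}% vs no cache)")
```

Running it reproduces the savings column above, including the negative numbers for single-call workloads.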
Decision rule:
| Situation | Use |
| --- | --- |
| At least 2 calls within 5 minutes | 5-minute cache is usually worth it. |
| Only 1 call | Do not cache purely for cost. |
| 2 calls within 1 hour but outside 5 minutes | 1-hour cache may still be more expensive than no cache. |
| 3+ calls within 1 hour | 1-hour cache starts to make economic sense. |
| Long stable context plus latency sensitivity | Cache even when savings are modest, because TTFT can improve. |
How Much Does Caching Save in Production?
Sonnet 4.6 Agent With Tool Schemas
Assume an agent sends a stable 100K-token system prompt, tool schema, policy block, and memory summary. It makes 10 calls in 5 minutes. Output is 2K tokens per call.
| Component | No cache | 5-minute cache |
| --- | --- | --- |
| Stable input tokens | 1,000,000 | 100,000 write + 900,000 read |
| Stable input cost | $3.00 | $0.645 |
| Output tokens | 20,000 | 20,000 |
| Output cost | $0.30 | $0.30 |
| Total shown cost | $3.30 | $0.945 |
| Savings | - | 71.4% |
For a Sonnet 4.6 agent, caching matters more than most prompt micro-optimizations. The stable input drops from $3.00 to $0.645. The output bill does not change.
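The $0.645 figure falls straight out of the per-token rates. A minimal sketch that reproduces the table, assuming the stated 100K-token prefix, 10 calls, and 2K output tokens per call:

```python
# Reproduce the Sonnet 4.6 agent numbers above. Prices are $ per 1M tokens.
BASE_IN, WRITE_5M, READ, OUT = 3.00, 3.75, 0.30, 15.00
M = 1_000_000

prefix_tokens, calls, output_per_call = 100_000, 10, 2_000

no_cache = prefix_tokens * calls / M * BASE_IN            # $3.00
cached = (prefix_tokens / M * WRITE_5M                    # one write:  $0.375
          + prefix_tokens * (calls - 1) / M * READ)       # nine reads: $0.27
output = output_per_call * calls / M * OUT                # $0.30 either way

print(f"no cache: ${no_cache + output:.3f}, cached: ${cached + output:.3f}")
print(f"savings: {(1 - (cached + output) / (no_cache + output)) * 100:.1f}%")
```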
Haiku 4.5 Support Bot
Assume a support bot uses a stable 20K-token policy and product context. It receives 5 related user questions inside the 5-minute window. Output averages 800 tokens per answer.
| Component | No cache | 5-minute cache |
| --- | --- | --- |
| Stable input tokens | 100,000 | 20,000 write + 80,000 read |
| Stable input cost | $0.100 | $0.033 |
| Output tokens | 4,000 | 4,000 |
| Output cost | $0.020 | $0.020 |
| Total shown cost | $0.120 | $0.053 |
| Savings | - | 55.8% |
Haiku 4.5 is already affordable. Cache still helps because repeated support context is exactly the kind of prefix that should not be reprocessed at full price.
Opus 4.7 Code Review With Long Context
Assume a code review assistant sends a stable 300K-token repository summary and asks 3 follow-up questions. If all follow-ups happen within 5 minutes, the 5-minute cache is the stronger cost choice. If follow-ups are slower, the 1-hour cache can still help.
| Mode | Stable input cost | Savings vs no cache |
| --- | --- | --- |
| No cache | $4.50 | - |
| 5-minute cache | $2.175 | 51.7% |
| 1-hour cache | $3.300 | 26.7% |
Opus 4.7 has a separate tokenizer caveat: Anthropic says its new tokenizer may use up to 35% more tokens for the same text. If your workflow is Opus-heavy, measure real token counts before projecting monthly spend.
TokenMix.ai Routing Example (Inferred)
TokenMix.ai users often route repeated, lower-risk turns to Haiku or Sonnet and escalate only the hard step to Opus. This pattern is inferred from anonymized aggregate routing data; individual teams should validate with their own logs. With prompt caching, the better pattern is not "always cache everything." It is:
| Step | Model | Cache choice | Reason |
| --- | --- | --- | --- |
| Intent classification | Haiku 4.5 | No cache or small cache | Inputs are short. |
| Tool-heavy planning | Sonnet 4.6 | 5-minute cache | Tool schemas and memory repeat. |
| Hard reasoning escalation | Opus 4.7 | Cache only if long context repeats | Opus output is still expensive. |
| Batch evaluation | Sonnet 4.6 or Haiku 4.5 | Batch plus cache if supported by workflow | Async work should chase both discounts. |
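One way to make that table executable is a plain routing map. The sketch below is illustrative only; the model IDs, step names, and TTL strings are placeholders, not verified identifiers.

```python
# Illustrative routing policy mirroring the table above. Model IDs and
# TTL values are placeholders; substitute whatever your gateway expects.
ROUTING_POLICY = {
    "intent_classification": {"model": "haiku-4.5",  "cache_ttl": None},
    "tool_heavy_planning":   {"model": "sonnet-4.6", "cache_ttl": "5m"},
    "hard_reasoning":        {"model": "opus-4.7",   "cache_ttl": "5m"},  # only if long context repeats
    "batch_evaluation":      {"model": "sonnet-4.6", "cache_ttl": "5m", "batch": True},
}

def pick_route(step: str) -> dict:
    # Fall back to the default agent balance when a step is unmapped.
    return ROUTING_POLICY.get(step, {"model": "sonnet-4.6", "cache_ttl": "5m"})
```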
This is the practical cost-efficient path: cache stable context, route ordinary work to affordable models, and reserve Opus for the turns where quality changes the outcome.
What Cache Hit Rate Should You Target?
Cache hit rate is the metric that decides whether caching is moving your bill or just adding write overhead. Production benchmarks vary widely:
| Source | Reported hit rate | Workload type |
| --- | --- | --- |
| ProjectDiscovery (initial deployment) | 7% | Misplaced dynamic content in cache prefix |
| ProjectDiscovery (after refactor) | 74% | Same workload, dynamic content moved out of prefix |
| Vellum | "significantly faster + ~50% cheaper" (no specific %) | Mixed agent workloads |
| Helicone (post-integration) | Varies by prefix stability | LLM observability proxy traffic |
| Spring AI Anthropic users | Variable by stability of system prompt | Java/Kotlin agent workloads |
The pattern is clear: hit rate is not bounded by the model — it's bounded by your prefix design. Anthropic's prompt caching documentation shows the same: cache failures are usually silent, and the API will happily process a request with a "cacheable" prefix that hits 0% because of one dynamic value buried in the wrong place.
Inferred targets for production teams:
| Workload | Realistic hit rate target | Likely savings |
| --- | --- | --- |
| Stateful agent with stable tool schemas | 70-85% | 50-70% on input cost |
| Customer support bot with stable policy | 60-80% | 40-60% on input cost |
| Multi-turn chat with rotating system prompts | 30-50% | 20-35% on input cost |
| One-shot Q&A over short prompts | <10% | Negligible; disable caching |
If your hit rate is below 20% after a week, the prefix is the problem, not the model.
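Hit rate falls out of the usage fields covered later in this piece. A minimal sketch, assuming you already collect per-request usage dicts:

```python
# Cache hit rate from logged usage counters. The field names match the
# Anthropic Messages API usage block discussed in the logging section below.
def cache_hit_rate(usage_records: list[dict]) -> float:
    reads = sum(u.get("cache_read_input_tokens", 0) for u in usage_records)
    writes = sum(u.get("cache_creation_input_tokens", 0) for u in usage_records)
    fresh = sum(u.get("input_tokens", 0) for u in usage_records)
    total = reads + writes + fresh
    # Below ~20% after a week usually means a dynamic value is breaking
    # the prefix, not that caching "doesn't work" for the workload.
    return reads / total if total else 0.0
```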
Batch Plus Cache
Anthropic says Batch API processing gives a 50% discount on input and output, and the pricing page says prompt caching multipliers stack with Batch API discounts. That creates a useful decision matrix.
| Workload | Cache | Batch | Recommendation |
| --- | --- | --- | --- |
| Live chat | Yes | No | Use 5-minute cache for repeated system prompts and tools. |
| Agent loop | Yes | Usually no | Cache tool schemas, policies, and memory summaries. |
| Offline evaluation | Maybe | Yes | Use Batch first; add cache if requests share long prefixes. |
| Dataset labeling | Maybe | Yes | Batch is the main lever; cache only if instructions are long. |
| Long document Q&A | Yes | Depends | Cache the document for live sessions; batch only for async jobs. |
For Sonnet 4.6, the normal cache read rate is $0.30 per 1M tokens. Under the official stacking logic, a batch cache read is effectively even lower because Batch cuts eligible input pricing by 50%. Treat that as an estimate (inferred) to validate against your usage logs, not a substitute for invoice checks.
| Sonnet 4.6 scenario | Input price per 1M |
| --- | --- |
| Standard input | $3.00 |
| 5-minute cache write | $3.75 |
| Cache read | $0.30 |
| Batch input | $1.50 |
| Estimated batch cache read (inferred) | $0.15 |
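The $0.15 estimate is just the two published multipliers applied in sequence; a quick sketch of that arithmetic, to validate against invoices rather than trust on faith:

```python
# Estimated stacked rate: 0.1x cache read multiplier applied on top of
# the 50% Batch discount. Treat the result as an estimate, not an invoice.
base_input = 3.00                     # Sonnet 4.6, $ per 1M tokens
batch_input = base_input * 0.5        # $1.50
batch_cache_read = batch_input * 0.1  # $0.15 (inferred stacking)
print(f"batch input ${batch_input:.2f}, batch cache read ${batch_cache_read:.2f}")
```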
The caveat: Batch is asynchronous. If a user is waiting for the answer, Batch is the wrong lever. For real-time agents, cache beats batch. For offline jobs, batch often beats cache unless the repeated prefix is large.
When Should You Use 5-Minute vs 1-Hour Cache?
| Decision factor | 5-minute cache | 1-hour cache |
| --- | --- | --- |
| Write cost | 1.25x | 2x |
| Read cost | 0.1x | 0.1x |
| Break-even | After one read | Usually after two reads |
| Best for | Fast repeated calls | Slower repeated calls |
| Typical workload | Agents, chat, tool use, repeated code tasks | Human review, long side-agent tasks, delayed follow-ups |
| Main risk | Cache expires before reuse | Write cost is too high for low reuse |
Use 5-minute cache by default. Move to 1-hour cache only when your logs show the same stable prefix is reused after the 5-minute window. There is one third-party signal worth noting: a public DEV community report flagged that some teams observed unexpected TTL behavior in 2025; verify your TTL choice with logged cache_creation.ephemeral_5m_input_tokens vs cache_creation.ephemeral_1h_input_tokens rather than assuming the configured TTL is honored.
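For reference, the TTL is chosen per cache breakpoint via cache_control. A minimal sketch with the Anthropic Python SDK, assuming the documented cache_control shape; verify the ttl field against the docs for your SDK version, and note the model ID here is an illustrative placeholder:

```python
# Minimal sketch of selecting a cache TTL per breakpoint with the Anthropic
# Python SDK. The cache_control shape follows Anthropic's prompt caching
# docs; confirm the "ttl" field against the docs for your SDK version.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

stable_prefix = "..."  # hypothetical placeholder: 20K+ tokens of policy/tools in practice

response = client.messages.create(
    model="claude-sonnet-4-5",  # illustrative model ID; substitute your target model
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": stable_prefix,
            # Omit "ttl" for the default 5-minute cache; "1h" opts into
            # the 2x-write, longer-lived cache discussed above.
            "cache_control": {"type": "ephemeral", "ttl": "1h"},
        }
    ],
    messages=[{"role": "user", "content": "First question against the cached prefix."}],
)
print(response.usage)  # compare the ephemeral_5m vs ephemeral_1h write counters
```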
API Fields To Log
| Field | What it tells you | Why it matters |
| --- | --- | --- |
| cache_creation_input_tokens | Tokens written into cache | Shows write cost. |
| cache_read_input_tokens | Tokens served from cache | Shows hit volume. |
| input_tokens | Non-cached input tokens | Shows the remaining full-price input. |
| output_tokens | Generated output tokens | Cache does not reduce this cost. |
| cache_creation.ephemeral_5m_input_tokens | 5-minute write tokens | Helps compare 5-minute vs 1-hour usage. |
| cache_creation.ephemeral_1h_input_tokens | 1-hour write tokens | Flags expensive long-TTL writes. |
For a working Claude API setup, see our Claude API tutorial. The minimum viable dashboard should track the raw fields above plus two derived metrics: cache read ratio and cost per completed workflow.
Anthropic's prompt caching docs are unusually explicit here: cache failures can be silent when the cached section is too short. The request succeeds, but the usage fields show zero cache creation and zero cache reads. That is why logging matters. ProjectDiscovery's published reports underline this — their initial 7% hit rate was the result of one dynamic field, fixed by relocation, not by infrastructure changes.
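A guard for that silent-failure signature takes a few lines. The sketch below assumes you log the raw usage dict per request; the print is a stand-in for whatever alerting path your observability stack provides.

```python
# Flag requests where caching was configured but neither wrote nor read
# any tokens: the silent-failure signature described above.
def check_cache_activity(usage: dict, request_id: str) -> None:
    wrote = usage.get("cache_creation_input_tokens", 0)
    read = usage.get("cache_read_input_tokens", 0)
    if wrote == 0 and read == 0:
        # Likely causes: prefix below the minimum cacheable length, or a
        # dynamic value placed before the cache breakpoint.
        print(f"[cache-miss-alert] {request_id}: no cache write or read")
```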
How Should You Combine Cache With Routing?
| Goal | Recommended path |
| --- | --- |
| Lowest live-chat cost | Haiku 4.5 plus 5-minute cache for repeated context. |
| Best default agent balance | Sonnet 4.6 plus 5-minute cache for tools, memory, and policy. |
| Highest reasoning quality | Opus 4.7, but cache long stable context and watch tokenizer overhead. |
| Cheapest async evaluation | Batch API first; add cache only if prompts share a large prefix. |
| Multi-model cost control | Route easy turns to Haiku, default work to Sonnet, hard turns to Opus. |
This is where TokenMix.ai's unified API routing helps in practice. You can compare Claude cost against OpenAI, DeepSeek, Gemini, and other providers without rewriting every integration. The model choice still matters. The routing and cache policy often matter more.
Use Claude prompt caching when at least 20K to 50K stable input tokens repeat across calls. Default to 5-minute cache, measure real cache_read_input_tokens, and target a 60-80% hit rate within two weeks. Use 1-hour cache only when your logs prove delayed reuse. If your gateway sits in front, verify cache headers pass through end-to-end before declaring success.
FAQ
What is Claude API cache pricing?
Claude API cache pricing is Anthropic's discounted billing for repeated prompt prefixes. Cache writes cost more than standard input, but cache reads cost 0.1x the base input rate.
Does Claude prompt caching reduce output cost?
No. Prompt caching reduces repeated input-side cost. Output tokens are still billed at the normal model output rate unless another discount, such as Batch API, applies.
How much does a Claude cache hit cost?
A Claude cache hit costs 10% of the model's base input price. Sonnet 4.6 cache reads are $0.30 per 1M tokens, Haiku 4.5 cache reads are $0.10, and Opus 4.7 cache reads are $0.50.
Is 1-hour Claude cache worth it?
It is worth it only when the same stable prefix is reused after the 5-minute window and usually at least twice. The 1-hour write costs 2x base input, so low-reuse workloads can lose money.
When does 5-minute Claude cache break even?
For pure input cost, 5-minute cache usually breaks even after one cache read. One write plus one read costs 1.35x base input, compared with 2x without cache.
Can Claude Batch API and prompt caching stack?
Anthropic's pricing page says prompt caching multipliers stack with other pricing modifiers, including Batch API. Use logs and invoices to verify the exact effective rate for your workload.
Which Claude model benefits most from caching?
Sonnet 4.6 often benefits most in real agent workloads because it combines strong quality with repeated tool and memory context. Haiku 4.5 is best when the goal is maximum cost efficiency.
What should I log to verify Claude cache savings?
Log cache_creation_input_tokens, cache_read_input_tokens, input_tokens, and output_tokens. Then calculate cache read ratio and cost per completed workflow.
What hit rate is realistic in production?
Per ProjectDiscovery's published case study, 74-84% is achievable on stable agent workloads after dynamic content is moved out of the cacheable prefix. Vellum and Helicone production reports cluster in the 50-80% range. Anything below 20% after a week is a prefix-design problem, not a model problem.