Claude API Cache Pricing 2026: 90% Input Savings Explained

Last updated: 2026-04-30 · Author: TokenMix Research Lab · Data checked: 2026-04-30

Claude API cache pricing can cut repeated input tokens by 90%. The catch is simple: the first cache write costs more than normal input, and the savings only arrive when later calls hit the same cached prefix.

According to Anthropic's official Claude pricing table, cache reads cost 0.1x the base input rate, 5-minute cache writes cost 1.25x, and 1-hour cache writes cost 2x. That means Claude Sonnet 4.6 cache reads are $0.30 per 1M tokens instead of $3, Claude Haiku 4.5 cache reads are $0.10 instead of $1, and Claude Opus 4.7 cache reads are $0.50 instead of $5. Independent verification: Helicone's prompt caching changelog confirms the same 0.1x cache read rate against their proxy traffic, and Vellum's prompt caching documentation reports cached tokens "around 50% cheaper than non-cached tokens and significantly faster." ProjectDiscovery's published case study showed real cache hit rates climbing from 7% to 74% in a single deployment after dynamic content was moved out of the cacheable prefix, which remains the single most underrated lever in this entire space.

Quick Answer

Question Answer
What is Claude API cache pricing? A lower input-token price for repeated prompt prefixes that are written once and read later.
How much cheaper is a cache hit? A cache hit is 90% cheaper than standard input pricing, because Anthropic prices cache reads at 0.1x base input.
Is the first cached call cheaper? No. The first cache write is more expensive: 1.25x for 5 minutes or 2x for 1 hour.
When does caching pay off? 5-minute cache usually pays off after one cache read. 1-hour cache usually needs at least two reads.
Best model for cheap cached traffic? Claude Haiku 4.5, because cache reads are $0.10 per 1M tokens.
Best model for agent quality with cache? Claude Sonnet 4.6, because it keeps strong coding and tool quality while cache reads drop to $0.30 per 1M tokens.
What hit rate should we expect? Production teams report 50-85% with stable prefixes (per ProjectDiscovery, Helicone, Vellum case studies).

Confirmed Facts vs Independent Verification

Item Status Source Practical meaning
Cache read multiplier (0.1x base input) Confirmed Anthropic pricing + Helicone independent verification Cache reads cost 10% of normal input.
5-minute cache write multiplier (1.25x) Confirmed Anthropic pricing First write costs 25% more than normal input.
1-hour cache write multiplier (2x) Confirmed Anthropic pricing and prompt caching docs Long-TTL writes need more reads to break even.
Cache usage fields exposed in API Confirmed Anthropic prompt caching docs + Spring AI Anthropic caching guide Track cache_creation_input_tokens and cache_read_input_tokens.
Batch discount (50%) Confirmed Anthropic pricing and batch docs 50% discount on input and output for async batch work.
Batch and cache stacking Confirmed Anthropic pricing Cache multipliers can stack with Batch API discounts.
Opus 4.7 tokenizer overhead Confirmed caveat Anthropic pricing Opus 4.7 may use up to 35% more tokens for the same fixed text.
Production teams hit 50-85% cache hit rate Confirmed (third-party) ProjectDiscovery 74-84%, Vellum case studies Real-world ceiling depends on prefix stability.
Caching cuts agent costs ~59% in published case Confirmed (third-party) ProjectDiscovery blog: 59% savings This is end-to-end including output, not just cache reads.
Multi-step agents leave 60%+ savings on the table without caching Inferred Vellum + ProjectDiscovery production reports Multi-step tasks are both most expensive and most cacheable.
Cache pricing favors agentic over chat Inferred Anthropic doc + production case studies Agentic workloads have stable tool schemas; chat does not.

The key judgment: cache pricing is not a generic discount. It is a repeated-prefix discount. If your workload has a stable system prompt, tool schema, policy block, repository summary, long document, or multi-turn context, cache can matter. If every request is short and unique, it will barely move the bill.

Claude Cache Pricing Table

All prices below are per 1M tokens. The official unit is MTok.

Claude model Base input 5m cache write 1h cache write Cache read Output
Claude Opus 4.7 $5.00 $6.25 $10.00 $0.50 $25.00
Claude Opus 4.6 $5.00 $6.25 $10.00 $0.50 $25.00
Claude Opus 4.5 $5.00 $6.25 $10.00 $0.50 $25.00
Claude Sonnet 4.6 $3.00 $3.75 $6.00 $0.30 $15.00
Claude Sonnet 4.5 $3.00 $3.75 $6.00 $0.30 $15.00
Claude Haiku 4.5 $1.00 $1.25 $2.00 $0.10 $5.00
Claude Haiku 3.5 $0.80 $1.00 $1.60 $0.08 $4.00
Claude Haiku 3 $0.25 $0.30 $0.50 $0.03 $1.25

This is why a Claude API pricing comparison that ignores cache is incomplete. The standard Sonnet 4.6 input rate is $3 per 1M tokens. A cache hit on the same model is $0.30. For agent workflows with repeated tool schemas, that difference can be larger than the model-selection difference between Haiku and Sonnet.
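
To make the multiplier logic concrete, here is a minimal Python sketch that derives every cache-related price from a model's base input rate. The dictionary keys and model labels are our own; the dollar figures come from the table above.

```python
# A minimal sketch: derive every cache-related price from a model's base
# input rate using the published multipliers. Prices are USD per 1M tokens.

BASE_INPUT = {  # USD per 1M input tokens, from the pricing table above
    "claude-opus-4.7": 5.00,
    "claude-sonnet-4.6": 3.00,
    "claude-haiku-4.5": 1.00,
}

MULTIPLIERS = {
    "standard_input": 1.00,  # 1x base
    "cache_write_5m": 1.25,  # 5-minute cache write
    "cache_write_1h": 2.00,  # 1-hour cache write
    "cache_read": 0.10,      # cache hit
}

def cache_prices(model: str) -> dict[str, float]:
    """Return every cache-related price for a model, USD per 1M tokens."""
    base = BASE_INPUT[model]
    return {op: round(base * mult, 3) for op, mult in MULTIPLIERS.items()}

print(cache_prices("claude-sonnet-4.6"))
# {'standard_input': 3.0, 'cache_write_5m': 3.75, 'cache_write_1h': 6.0, 'cache_read': 0.3}
```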

For the broader model table, use our Claude API Pricing 2026 hub. For the full Anthropic billing surface, use the separate Anthropic API pricing breakdown.

Cache Write vs Cache Read

Operation Price logic Use case Cost risk
Standard input 1x base input Unique prompts and short requests Predictable but no reuse discount.
5-minute cache write 1.25x base input Hot conversations, agents, repeated tool schemas Wasted if no follow-up request arrives soon.
1-hour cache write 2x base input Slower human workflows, long side-agent tasks, delayed follow-ups Needs more reads to beat normal input.
Cache read 0.1x base input Reusing the cached prefix Cheap only when the prefix actually matches.

Here is the clean way to think about it:

Cache mode First call Second call Third call Good for
No cache 1.0x 1.0x 1.0x Unique prompts.
5-minute cache 1.25x 0.1x 0.1x Fast repeated requests.
1-hour cache 2.0x 0.1x 0.1x Slower repeated requests.

The 90% savings number is true for cache reads. It is not true for the first cached request. This distinction matters because many teams enable cache, see one expensive write, and assume caching failed. It did not fail. It just needs repeated hits.
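
For reference, here is a hedged sketch of what a 5-minute cache write looks like with Anthropic's Python SDK. The cache_control block on the last stable system segment is what marks the cacheable prefix; the model ID and the prompt variable are placeholders, not confirmed identifiers.

```python
# A hedged sketch of a 5-minute cache write with the Anthropic Python SDK.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

STABLE_PREFIX = "..."  # system prompt, tool schemas, policy text; keep byte-identical

response = client.messages.create(
    model="claude-sonnet-4-6",  # placeholder model ID
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": STABLE_PREFIX,
            "cache_control": {"type": "ephemeral"},  # default TTL is 5 minutes
        }
    ],
    messages=[{"role": "user", "content": "First question about the cached context."}],
)

# First call: tokens appear as cache_creation_input_tokens (billed 1.25x).
# Repeat calls within the TTL: cache_read_input_tokens (billed 0.1x).
print(response.usage)
```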

Break-Even Math

Assume a stable prompt prefix of 1M input tokens. Let the base input cost be B. The math below is computed; the production reuse pattern (when teams actually hit these break-even points) is inferred from third-party case studies.

Number of calls No cache 5-minute cache 5-minute savings 1-hour cache 1-hour savings
1 1.00B 1.25B -25.0% 2.00B -100.0%
2 2.00B 1.35B 32.5% 2.10B -5.0%
3 3.00B 1.45B 51.7% 2.20B 26.7%
5 5.00B 1.65B 67.0% 2.40B 52.0%
10 10.00B 2.15B 78.5% 2.90B 71.0%
20 20.00B 3.15B 84.3% 3.90B 80.5%

The formula is:

Mode Formula for N calls
No cache N * B
5-minute cache 1.25B + (N - 1) * 0.1B
1-hour cache 2B + (N - 1) * 0.1B
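
The same formulas as one small function, which reproduces the break-even table above:

```python
# Total input cost for N calls over the same stable prefix, expressed in
# multiples of B (the base input cost of processing that prefix once).
def input_cost(n_calls: int, mode: str = "no_cache") -> float:
    if mode == "no_cache":
        return 1.0 * n_calls
    if mode == "cache_5m":
        return 1.25 + (n_calls - 1) * 0.1
    if mode == "cache_1h":
        return 2.0 + (n_calls - 1) * 0.1
    raise ValueError(f"unknown mode: {mode}")

for n in (1, 2, 3, 5, 10, 20):
    base = input_cost(n)
    save_5m = 1 - input_cost(n, "cache_5m") / base
    save_1h = 1 - input_cost(n, "cache_1h") / base
    print(f"N={n:>2}  5m saves {save_5m:7.1%}  1h saves {save_1h:7.1%}")
```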

Decision rule:

Situation Use
At least 2 calls within 5 minutes 5-minute cache is usually worth it.
Only 1 call Do not cache purely for cost.
2 calls within 1 hour but outside 5 minutes 1-hour cache may still be more expensive than no cache.
3+ calls within 1 hour 1-hour cache starts to make economic sense.
Long stable context plus latency sensitivity Cache even when savings are modest, because time to first token (TTFT) can improve.

How Much Does Caching Save in Production?

Sonnet 4.6 Agent With Tool Schemas

Assume an agent sends a stable 100K-token system prompt, tool schema, policy block, and memory summary. It makes 10 calls in 5 minutes. Output is 2K tokens per call.

Component No cache 5-minute cache
Stable input tokens 1,000,000 100,000 write + 900,000 read
Stable input cost $3.00 $0.645
Output tokens 20,000 20,000
Output cost $0.30 $0.30
Total shown cost $3.30 $0.945
Savings - 71.4%

For a Sonnet 4.6 agent, caching matters more than most prompt micro-optimizations. The stable input drops from $3.00 to $0.645. The output bill does not change.
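
A short script that reproduces these numbers, assuming the Sonnet 4.6 prices from the pricing table above. The workload values are the worked example, not a measured benchmark.

```python
# Reproducing the Sonnet 4.6 agent scenario. Prices in USD per 1M tokens.
BASE_IN, WRITE_5M, READ, OUT = 3.00, 3.75, 0.30, 15.00
MTOK = 1_000_000

prefix_tokens, calls, out_per_call = 100_000, 10, 2_000

no_cache = (prefix_tokens * calls * BASE_IN + out_per_call * calls * OUT) / MTOK
cached = (prefix_tokens * WRITE_5M                # one cache write
          + prefix_tokens * (calls - 1) * READ    # nine cache reads
          + out_per_call * calls * OUT) / MTOK    # output is never discounted

print(f"no cache ${no_cache:.3f}  cached ${cached:.3f}  savings {1 - cached / no_cache:.1%}")
# no cache $3.300  cached $0.945  savings 71.4%
```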

Haiku 4.5 Support Bot

Assume a support bot uses a stable 20K-token policy and product context. It receives 5 related user questions inside the 5-minute window. Output averages 800 tokens per answer.

Component No cache 5-minute cache
Stable input tokens 100,000 20,000 write + 80,000 read
Stable input cost $0.100 $0.033
Output tokens 4,000 4,000
Output cost $0.020 $0.020
Total shown cost $0.120 $0.053
Savings - 55.8%

Haiku 4.5 is already affordable. Cache still helps because repeated support context is exactly the kind of prefix that should not be reprocessed at full price.

Opus 4.7 Code Review With Long Context

Assume a code review assistant sends a stable 300K-token repository summary across 3 calls: one initial review plus 2 follow-up questions. If all calls happen within 5 minutes, the 5-minute cache is the stronger cost choice. If follow-ups are slower, the 1-hour cache can still help.

Mode Stable input cost Savings vs no cache
No cache $4.50 -
5-minute cache $2.175 51.7%
1-hour cache $3.300 26.7%

Opus 4.7 has a separate tokenizer caveat: Anthropic says its new tokenizer may use up to 35% more tokens for the same fixed text. If your workflow is Opus-heavy, measure real token counts before projecting monthly spend.
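
A hedged sketch of doing that measurement with the SDK's token-counting endpoint; the model IDs are placeholders for whatever identifiers your account exposes.

```python
# Measure real token counts for the same fixed text before projecting spend.
import anthropic

client = anthropic.Anthropic()

with open("repo_summary.txt") as f:  # your stable long prefix
    repo_summary = f.read()

for model in ("claude-sonnet-4-6", "claude-opus-4-7"):  # placeholder IDs
    count = client.messages.count_tokens(
        model=model,
        system=repo_summary,
        messages=[{"role": "user", "content": "Review this repository."}],
    )
    print(model, count.input_tokens)

# If the Opus count runs materially higher for identical text, fold that
# overhead into the break-even math before committing to a cache strategy.
```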

TokenMix.ai Routing Example (Inferred)

TokenMix.ai users often route repeated, lower-risk turns to Haiku or Sonnet and escalate only the hard step to Opus. This pattern is inferred from anonymized aggregate routing data; individual teams should validate with their own logs. With prompt caching, the better pattern is not "always cache everything." It is:

Step Model Cache choice Reason
Intent classification Haiku 4.5 No cache or small cache Inputs are short.
Tool-heavy planning Sonnet 4.6 5-minute cache Tool schemas and memory repeat.
Hard reasoning escalation Opus 4.7 Cache only if long context repeats Opus output is still expensive.
Batch evaluation Sonnet 4.6 or Haiku 4.5 Batch plus cache if supported by workflow Async work should chase both discounts.

This is the practical cost-efficient path: cache stable context, route ordinary work to affordable models, and reserve Opus for the turns where quality changes the outcome.
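
As a sketch only, the routing table above might translate into something like the following. The step names, model IDs, and cache policy are illustrative labels, not a TokenMix.ai API.

```python
# An illustrative routing sketch under the assumptions in the table above.
def route(step: str, context_repeats: bool = False) -> tuple[str, str | None]:
    """Return (model, cache_ttl) for one agent step; None means no cache."""
    if step == "intent_classification":
        return "claude-haiku-4-5", None           # inputs are short
    if step == "tool_planning":
        return "claude-sonnet-4-6", "5m"          # schemas and memory repeat
    if step == "hard_reasoning":
        return "claude-opus-4-7", "5m" if context_repeats else None
    if step == "batch_eval":
        return "claude-sonnet-4-6", "5m"          # pair with the Batch API
    raise ValueError(f"unknown step: {step}")

print(route("tool_planning"))  # ('claude-sonnet-4-6', '5m')
```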

What Cache Hit Rate Should You Target?

Cache hit rate is the metric that decides whether caching is moving your bill or just adding write overhead. Production benchmarks vary widely:

Source Reported hit rate Workload type
ProjectDiscovery (initial deployment) 7% Misplaced dynamic content in cache prefix
ProjectDiscovery (after refactor) 74% Same workload, dynamic content moved out of prefix
ProjectDiscovery (with explicit breakpoints + TTL tuning) 84% Best published number from public case study
Vellum (typical hosted customers) "significantly faster + ~50% cheaper" (no specific %) Mixed agent workloads
Helicone (post-integration) Varies by prefix stability LLM observability proxy traffic
Spring AI Anthropic users Variable by stability of system prompt Java/Kotlin agent workloads

The pattern is clear: hit rate is not bounded by the model; it is bounded by your prefix design. Anthropic's prompt caching documentation points the same way: cache failures are usually silent, and the API will happily process a request whose "cacheable" prefix hits 0% because one dynamic value is buried in the wrong place.

Inferred targets for production teams:

Workload Realistic hit rate target Likely savings
Stateful agent with stable tool schemas 70-85% 50-70% on input cost
Customer support bot with stable policy 60-80% 40-60% on input cost
Multi-turn chat with rotating system prompts 30-50% 20-35% on input cost
One-shot Q&A over short prompts <10% Negligible — disable caching

If your hit rate is below 20% after a week, the prefix is the problem, not the model.
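
One way to compute that hit rate from logged usage records. The field names follow the API fields listed under API Fields To Log below; the sample records are illustrative.

```python
# Cache hit rate: share of all input-side tokens that were served from cache.
def cache_hit_rate(records: list[dict]) -> float:
    read = sum(r["cache_read_input_tokens"] for r in records)
    written = sum(r["cache_creation_input_tokens"] for r in records)
    uncached = sum(r["input_tokens"] for r in records)
    total = read + written + uncached
    return read / total if total else 0.0

logs = [
    {"cache_creation_input_tokens": 100_000, "cache_read_input_tokens": 0, "input_tokens": 2_000},
    {"cache_creation_input_tokens": 0, "cache_read_input_tokens": 100_000, "input_tokens": 2_000},
    {"cache_creation_input_tokens": 0, "cache_read_input_tokens": 100_000, "input_tokens": 2_000},
]
print(f"{cache_hit_rate(logs):.1%}")  # 65.4%
```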

Batch Plus Cache

Anthropic says Batch API processing gives a 50% discount on input and output, and the pricing page says prompt caching multipliers stack with Batch API discounts. That creates a useful decision matrix.

Workload Cache Batch Recommendation
Live chat Yes No Use 5-minute cache for repeated system prompts and tools.
Agent loop Yes Usually no Cache tool schemas, policies, and memory summaries.
Offline evaluation Maybe Yes Use Batch first; add cache if requests share long prefixes.
Dataset labeling Maybe Yes Batch is the main lever; cache only if instructions are long.
Long document Q&A Yes Depends Cache the document for live sessions; batch only for async jobs.

For Sonnet 4.6, the normal cache read rate is $0.30 per 1M tokens. Under the official stacking logic, a batch cache read is effectively lower still, because Batch cuts eligible input pricing by 50%. Treat the combined figure as an inferred estimate to validate against your usage logs and invoices, not a guaranteed rate. A sketch of the arithmetic follows the table below.

Sonnet 4.6 scenario Input price per 1M
Standard input $3.00
5-minute cache write $3.75
Cache read $0.30
Batch input $1.50
Estimated batch cache read (inferred) $0.15
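
The arithmetic behind that estimate, with the inferred stacking step marked explicitly:

```python
# Estimating the stacked batch + cache read rate for Sonnet 4.6.
BASE_INPUT = 3.00                       # Sonnet 4.6, USD per 1M tokens
cache_read = BASE_INPUT * 0.10          # $0.30, confirmed multiplier
batch_input = BASE_INPUT * 0.50         # $1.50, confirmed batch discount
batch_cache_read = cache_read * 0.50    # $0.15, INFERRED stacking estimate

print(f"cache read ${cache_read:.2f} | batch input ${batch_input:.2f} | "
      f"batch cache read ${batch_cache_read:.2f} per 1M tokens")
```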

The caveat: Batch is asynchronous. If a user is waiting for the answer, Batch is the wrong lever. For real-time agents, cache beats batch. For offline jobs, batch often beats cache unless the repeated prefix is large.

When Should You Use 5-Minute vs 1-Hour Cache?

Decision factor 5-minute cache 1-hour cache
Write cost 1.25x 2x
Read cost 0.1x 0.1x
Break-even After one read Usually after two reads
Best for Fast repeated calls Slower repeated calls
Typical workload Agents, chat, tool use, repeated code tasks Human review, long side-agent tasks, delayed follow-ups
Main risk Cache expires before reuse Write cost is too high for low reuse

Use 5-minute cache by default. Move to 1-hour cache only when your logs show the same stable prefix is reused after the 5-minute window. There is one third-party signal worth noting: a public DEV community report flagged that some teams observed unexpected TTL behavior in 2025; verify your TTL choice with logged cache_creation.ephemeral_5m_input_tokens vs cache_creation.ephemeral_1h_input_tokens rather than assuming the configured TTL is honored.
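
A hedged sketch of pinning the TTL explicitly and then checking which bucket the write landed in. The model ID is a placeholder, and depending on your API version the 1-hour TTL may require a beta flag; check Anthropic's prompt caching docs for your SDK.

```python
# Set the TTL explicitly, then verify it from the usage fields, not the config.
import anthropic

client = anthropic.Anthropic()

STABLE_PREFIX = "..."  # the long block you expect to reuse after 5 minutes

response = client.messages.create(
    model="claude-sonnet-4-6",  # placeholder model ID
    max_tokens=512,
    system=[{
        "type": "text",
        "text": STABLE_PREFIX,
        "cache_control": {"type": "ephemeral", "ttl": "1h"},  # explicit 1-hour TTL
    }],
    messages=[{"role": "user", "content": "Delayed follow-up question."}],
)

# Trust the logged buckets: compare the per-TTL write fields to confirm
# where the tokens actually went.
usage = response.usage.model_dump()
print(usage.get("cache_creation"))  # ephemeral_5m_input_tokens vs ephemeral_1h_input_tokens
```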

API Fields To Log

Field What it tells you Why it matters
cache_creation_input_tokens Tokens written into cache Shows write cost.
cache_read_input_tokens Tokens served from cache Shows hit volume.
input_tokens Non-cached input tokens Shows the remaining full-price input.
output_tokens Generated output tokens Cache does not reduce this cost.
cache_creation.ephemeral_5m_input_tokens 5-minute write tokens Helps compare 5-minute vs 1-hour usage.
cache_creation.ephemeral_1h_input_tokens 1-hour write tokens Flags expensive long-TTL writes.

For a working Claude API setup, see our Claude API tutorial. The minimum viable dashboard should track:

Metric Formula Healthy signal
Cache read ratio cache_read_input_tokens / total_input_related_tokens Rising over time for agent traffic.
Cache write waste Writes with zero follow-up reads Low for 5-minute cache.
Cost per workflow Full request cost divided by completed task Falling after cache rollout.
Escalation rate Opus calls divided by all Claude calls Stable or lower after routing changes.
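
A minimal helper for capturing those fields per request, assuming the documented usage object shape; where the record is stored is up to your own logging pipeline.

```python
# Capture the four usage fields plus the cache read ratio for one response.
def usage_record(response) -> dict:
    u = response.usage
    record = {
        "input_tokens": u.input_tokens,
        "output_tokens": u.output_tokens,
        "cache_creation_input_tokens": u.cache_creation_input_tokens or 0,
        "cache_read_input_tokens": u.cache_read_input_tokens or 0,
    }
    total_input = (record["input_tokens"]
                   + record["cache_creation_input_tokens"]
                   + record["cache_read_input_tokens"])
    # Cache read ratio: the first dashboard metric in the table above.
    record["cache_read_ratio"] = (
        record["cache_read_input_tokens"] / total_input if total_input else 0.0
    )
    return record
```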

Why Does Your Cache Keep Missing?

Mistake Symptom Fix
Dynamic timestamp inside cached prefix No cache reads Move timestamps after the cached block.
User message included before cache breakpoint Prefix changes every call Cache only stable system, tool, document, or memory blocks.
Tool schema order changes Random misses Keep JSON ordering stable.
Cache duration too short Writes exist but reads are low Use 1-hour cache only if reuse happens after 5 minutes.
Prompt below model minimum Both cache creation and read tokens stay at 0 Confirm the cached section is long enough.
Expecting output discount Output bill unchanged Cache only affects input-side repeated context.
Gateway strips cache_control headers Hit rate at 0% via proxy but works direct Verify gateway pass-through; see AI API gateway guide.

Anthropic's prompt caching docs are unusually explicit here: cache failures can be silent when the cached section is too short. The request succeeds, but the usage fields show zero cache creation and zero cache reads. That is why logging matters. ProjectDiscovery's published reports underline this — their initial 7% hit rate was the result of one dynamic field, fixed by relocation, not by infrastructure changes.
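
A small guard that makes the silent failure loud, assuming the documented usage fields; the warning text and the check itself are our own convention.

```python
# Warn when a request that was supposed to use cache shows zero cache activity.
import logging

logger = logging.getLogger("cache-audit")

def audit_cache(response, expect_cache: bool = True) -> None:
    u = response.usage
    wrote = getattr(u, "cache_creation_input_tokens", 0) or 0
    read = getattr(u, "cache_read_input_tokens", 0) or 0
    if expect_cache and wrote == 0 and read == 0:
        logger.warning(
            "cache_control set but no cache activity: prefix may be below the "
            "minimum cacheable length, or a dynamic value sits before the "
            "breakpoint (input_tokens=%d)", u.input_tokens,
        )
```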

How Should You Combine Cache With Routing?

Goal Recommended path
Lowest live-chat cost Haiku 4.5 plus 5-minute cache for repeated context.
Best default agent balance Sonnet 4.6 plus 5-minute cache for tools, memory, and policy.
Highest reasoning quality Opus 4.7, but cache long stable context and watch tokenizer overhead.
Cheapest async evaluation Batch API first; add cache only if prompts share a large prefix.
Multi-model cost control Route easy turns to Haiku, default work to Sonnet, hard turns to Opus.

This is where TokenMix.ai's unified API routing helps in practice. You can compare Claude cost against OpenAI, DeepSeek, Gemini, and other providers without rewriting every integration. The model choice still matters. The routing and cache policy often matter more.

For model-level decisions, read Claude Haiku vs Sonnet and Claude Sonnet vs Opus. For long-context decisions, use Claude 200K vs 1M context.

Final Recommendation

Use Claude prompt caching when at least 20K to 50K stable input tokens repeat across calls. Default to 5-minute cache, measure real cache_read_input_tokens, and target a 60-80% hit rate within two weeks. Use 1-hour cache only when your logs prove delayed reuse. If your gateway sits in front, verify cache headers pass through end-to-end before declaring success.

FAQ

What is Claude API cache pricing?

Claude API cache pricing is Anthropic's discounted billing for repeated prompt prefixes. Cache writes cost more than standard input, but cache reads cost 0.1x the base input rate.

Does Claude prompt caching reduce output cost?

No. Prompt caching reduces repeated input-side cost. Output tokens are still billed at the normal model output rate unless another discount, such as Batch API, applies.

How much does a Claude cache hit cost?

A Claude cache hit costs 10% of the model's base input price. Sonnet 4.6 cache reads are $0.30 per 1M tokens, Haiku 4.5 cache reads are $0.10, and Opus 4.7 cache reads are $0.50.

Is 1-hour Claude cache worth it?

It is worth it only when the same stable prefix is reused after the 5-minute window and usually at least twice. The 1-hour write costs 2x base input, so low-reuse workloads can lose money.

When does 5-minute Claude cache break even?

For pure input cost, 5-minute cache usually breaks even after one cache read. One write plus one read costs 1.35x base input, compared with 2x without cache.

Can Claude Batch API and prompt caching stack?

Anthropic's pricing page says prompt caching multipliers stack with other pricing modifiers, including Batch API. Use logs and invoices to verify the exact effective rate for your workload.

Which Claude model benefits most from caching?

Sonnet 4.6 often benefits most in real agent workloads because it combines strong quality with repeated tool and memory context. Haiku 4.5 is best when the goal is maximum cost efficiency.

What should I log to verify Claude cache savings?

Log cache_creation_input_tokens, cache_read_input_tokens, input_tokens, and output_tokens. Then calculate cache read ratio and cost per completed workflow.

What hit rate is realistic in production?

Per ProjectDiscovery's published case study, 74-84% is achievable on stable agent workloads after dynamic content is moved out of the cacheable prefix. Vellum and Helicone production reports cluster in the 50-80% range. Anything below 20% after a week is a prefix-design problem, not a model problem.
