Is TokenMix compatible with the OpenAI SDK?

Yes. TokenMix is fully OpenAI-compatible. Just change the base URL to https://api.tokenmix.ai/v1 and your existing OpenAI SDK code works without modification — including streaming, function calling, JSON mode, and vision.

How many AI models does TokenMix support?

TokenMix gives you access to 171 AI models from 16 providers including OpenAI (GPT-5, o-series), Anthropic (Claude Opus 4.7), Google (Gemini 3.1 Pro), DeepSeek (V4 Pro, V4 Flash, R1), Meta (Llama 4), Qwen, Mistral, xAI, Moonshot, ByteDance, MiniMax, Tencent, Black Forest Labs, Zhipu, Cohere, and Microsoft — all through a single OpenAI-compatible endpoint.

What payment methods does TokenMix accept?

Credit and debit cards (Visa, Mastercard via Stripe), Alipay, WeChat Pay, and cryptocurrency payments (BTC, ETH, USDT, USDC, SOL, LTC, TRX). Cryptocurrency is accepted only as a top-up payment method and TokenMix does not provide crypto wallets, custody, exchange, transfers, on-chain settlement, or virtual asset services. No credit card required to start — sign up for free and get complimentary credits.

Do I need a credit card to start?

No. You can sign up for free and receive complimentary credits to test any model. When you need to top up, you can choose any supported payment method — credit card, Alipay, WeChat Pay, or cryptocurrency payments.

How does pay-per-token billing work?

You pay only for the tokens you consume. Each model has separate input and output rates, displayed transparently on the pricing page. There are no monthly fees, no minimum commitments, and unused credits never expire.

Where is TokenMix hosted and what is the latency?

TokenMix runs on a multi-region infrastructure with primary nodes in Hong Kong and the United States, using Cloudflare proximity steering to route each request to the nearest gateway. Intelligent routing automatically fails over between providers to maximize uptime.

TokenMix Research Lab · 2026-04-30

Claude API Cache Pricing 2026: 90% Input Savings Explained

Last Updated: 2026-04-30 Author: TokenMix Research Lab Data checked: 2026-04-30

Claude API cache pricing can cut repeated input tokens by 90%. The catch is simple: the first cache write costs more than normal input, and the savings only arrive when later calls hit the same cached prefix.

According to Anthropic's official Claude pricing table, cache reads cost 0.1x the base input rate, 5-minute cache writes cost 1.25x, and 1-hour cache writes cost 2x. That means Claude Sonnet 4.6 cache reads are $0.30 per 1M tokens instead of $3, Claude Haiku 4.5 cache reads are $0.10 instead of , and Claude Opus 4.7 cache reads are $0.50 instead of $5. Independent verification: Helicone's prompt caching changelog confirms the same 0.1x cache read rate against their proxy traffic; Vellum's prompt caching documentation reports cached tokens "around 50% cheaper than non-cached tokens and significantly faster." And ProjectDiscovery's published case study showed real cache hit rates climbing from 7% to 74% in a single deployment after they moved dynamic content out of the cacheable prefix — the single most underrated lever in this entire space.

Quick Answer
Confirmed Facts vs Independent Verification
Claude Cache Pricing Table
Cache Write vs Cache Read
Break-Even Math
How Much Does Caching Save in Production?
What Cache Hit Rate Should You Target?
Batch Plus Cache
When Should You Use 5-Minute vs 1-Hour Cache?
API Fields To Log
Why Does Your Cache Keep Missing?
How Should You Combine Cache With Routing?
Final Recommendation
FAQ
Related Articles
Sources

Quick Answer

Question	Answer
What is Claude API cache pricing?	A lower input-token price for repeated prompt prefixes that are written once and read later.
How much cheaper is a cache hit?	A cache hit is 90% cheaper than standard input pricing, because Anthropic prices cache reads at 0.1x base input.
Is the first cached call cheaper?	No. The first cache write is more expensive: 1.25x for 5 minutes or 2x for 1 hour.
When does caching pay off?	5-minute cache usually pays off after one cache read. 1-hour cache usually needs at least two reads.
Best model for cheap cached traffic?	Claude Haiku 4.5, because cache reads are $0.10 per 1M tokens.
Best model for agent quality with cache?	Claude Sonnet 4.6, because it keeps strong coding and tool quality while cache reads drop to $0.30 per 1M tokens.
What hit rate should we expect?	Production teams report 50-85% with stable prefixes (per ProjectDiscovery, Helicone, Vellum case studies).

Confirmed Facts vs Independent Verification

Item	Status	Source	Practical meaning
Cache read multiplier (0.1x base input)	Confirmed	Anthropic pricing + Helicone independent verification	Cache reads cost 10% of normal input.
5-minute cache write multiplier (1.25x)	Confirmed	Anthropic pricing	First write costs 25% more than normal input.
1-hour cache write multiplier (2x)	Confirmed	Anthropic pricing and prompt caching docs	Long-TTL writes need more reads to break even.
Cache usage fields exposed in API	Confirmed	Anthropic prompt caching docs + Spring AI Anthropic caching guide	Track `cache_creation_input_tokens` and `cache_read_input_tokens`.
Batch discount (50%)	Confirmed	Anthropic pricing and batch docs	50% discount on input and output for async batch work.
Batch and cache stacking	Confirmed	Anthropic pricing	Cache multipliers can stack with Batch API discounts.
Opus 4.7 tokenizer overhead	Confirmed caveat	Anthropic pricing	Opus 4.7 may use up to 35% more tokens for the same fixed text.
Production teams hit 50-85% cache hit rate	Confirmed (third-party)	ProjectDiscovery 74-84%, Vellum case studies	Real-world ceiling depends on prefix stability.
Caching cuts agent costs ~59% in published case	Confirmed (third-party)	ProjectDiscovery blog: 59% savings	This is end-to-end including output, not just cache reads.
Multi-step agents leave 60%+ savings on the table without caching	Inferred	Vellum + ProjectDiscovery production reports	Multi-step tasks are both most expensive and most cacheable.
Cache pricing favors agentic over chat	Inferred	Anthropic doc + production case studies	Agentic workloads have stable tool schemas; chat does not.

The key judgment: cache pricing is not a generic discount. It is a repeated-prefix discount. If your workload has a stable system prompt, tool schema, policy block, repository summary, long document, or multi-turn context, cache can matter. If every request is short and unique, it will barely move the bill.

Claude Cache Pricing Table

All prices below are per 1M tokens. The official unit is MTok.

Claude model	Base input	5m cache write	1h cache write	Cache read	Output
Claude Opus 4.7	$5.00	$6.25	0.00	$0.50	$25.00
Claude Opus 4.6	$5.00	$6.25	0.00	$0.50	$25.00
Claude Opus 4.5	$5.00	$6.25	0.00	$0.50	$25.00
Claude Sonnet 4.6	$3.00	$3.75	$6.00	$0.30	5.00
Claude Sonnet 4.5	$3.00	$3.75	$6.00	$0.30	5.00
Claude Haiku 4.5	.00	.25	$2.00	$0.10	$5.00
Claude Haiku 3.5	$0.80	.00	.60	$0.08	$4.00
Claude Haiku 3	$0.25	$0.30	$0.50	$0.03	.25

This is why a Claude API pricing comparison that ignores cache is incomplete. The standard Sonnet 4.6 input rate is $3 per 1M tokens. A cache hit on the same model is $0.30. For agent workflows with repeated tool schemas, that difference can be larger than the model-selection difference between Haiku and Sonnet.

For the broader model table, use our Claude API Pricing 2026 hub. For the full Anthropic billing surface, use the separate Anthropic API pricing breakdown.

Cache Write vs Cache Read

Operation	Price logic	Use case	Cost risk
Standard input	1x base input	Unique prompts and short requests	Predictable but no reuse discount.
5-minute cache write	1.25x base input	Hot conversations, agents, repeated tool schemas	Wasted if no follow-up request arrives soon.
1-hour cache write	2x base input	Slower human workflows, long side-agent tasks, delayed follow-ups	Needs more reads to beat normal input.
Cache read	0.1x base input	Reusing the cached prefix	Cheap only when the prefix actually matches.

Here is the clean way to think about it:

Cache mode	First call	Second call	Third call	Good for
No cache	1.0x	1.0x	1.0x	Unique prompts.
5-minute cache	1.25x	0.1x	0.1x	Fast repeated requests.
1-hour cache	2.0x	0.1x	0.1x	Slower repeated requests.

The 90% savings number is true for cache reads. It is not true for the first cached request. This distinction matters because many teams enable cache, see one expensive write, and assume caching failed. It did not fail. It just needs repeated hits.

Break-Even Math

Assume a stable prompt prefix of 1M input tokens. Let the base input cost be B. The math below is computed; the production reuse pattern (when teams actually hit these break-even points) is inferred from third-party case studies.

Number of calls	No cache	5-minute cache	5-minute savings	1-hour cache	1-hour savings
1	1.00B	1.25B	-25.0%	2.00B	-100.0%
2	2.00B	1.35B	32.5%	2.10B	-5.0%
3	3.00B	1.45B	51.7%	2.20B	26.7%
5	5.00B	1.65B	67.0%	2.40B	52.0%
10	10.00B	2.15B	78.5%	2.90B	71.0%
20	20.00B	3.15B	84.3%	3.90B	80.5%

The formula is:

Mode	Formula for N calls
No cache	`N * B`
5-minute cache	`1.25B + (N - 1) * 0.1B`
1-hour cache	`2B + (N - 1) * 0.1B`

Decision rule:

Situation	Use
At least 2 calls within 5 minutes	5-minute cache is usually worth it.
Only 1 call	Do not cache purely for cost.
2 calls within 1 hour but outside 5 minutes	1-hour cache may still be more expensive than no cache.
3+ calls within 1 hour	1-hour cache starts to make economic sense.
Long stable context plus latency sensitivity	Cache even when savings are modest, because TTFT can improve.

How Much Does Caching Save in Production?

Sonnet 4.6 Agent With Tool Schemas

Assume an agent sends a stable 100K-token system prompt, tool schema, policy block, and memory summary. It makes 10 calls in 5 minutes. Output is 2K tokens per call.

Component	No cache	5-minute cache
Stable input tokens	1,000,000	100,000 write + 900,000 read
Stable input cost	$3.00	$0.645
Output tokens	20,000	20,000
Output cost	$0.30	$0.30
Total shown cost	$3.30	$0.945
Savings	-	71.4%

For a Sonnet 4.6 agent, caching matters more than most prompt micro-optimizations. The stable input drops from $3.00 to $0.645. The output bill does not change.

Haiku 4.5 Support Bot

Assume a support bot uses a stable 20K-token policy and product context. It receives 5 related user questions inside the 5-minute window. Output averages 800 tokens per answer.

Component	No cache	5-minute cache
Stable input tokens	100,000	20,000 write + 80,000 read
Stable input cost	$0.100	$0.033
Output tokens	4,000	4,000
Output cost	$0.020	$0.020
Total shown cost	$0.120	$0.053
Savings	-	55.8%

Haiku 4.5 is already affordable. Cache still helps because repeated support context is exactly the kind of prefix that should not be reprocessed at full price.

Opus 4.7 Code Review With Long Context

Assume a code review assistant sends a stable 300K-token repository summary and asks 3 follow-up questions. If all follow-ups happen within 5 minutes, the 5-minute cache is the stronger cost choice. If follow-ups are slower, the 1-hour cache can still help.

Mode	Stable input cost	Savings vs no cache
No cache	$4.50	-
5-minute cache	$2.175	51.7%
1-hour cache	$3.300	26.7%

Opus 4.7 has a separate tokenizer caveat: Anthropic says its new tokenizer may use up to 35% more tokens for the same fixed text. If your workflow is Opus-heavy, measure real token counts before projecting monthly spend.

TokenMix.ai Routing Example (Inferred)

TokenMix.ai users often route repeated, lower-risk turns to Haiku or Sonnet and escalate only the hard step to Opus. This pattern is inferred from anonymized aggregate routing data; individual teams should validate with their own logs. With prompt caching, the better pattern is not "always cache everything." It is:

Step	Model	Cache choice	Reason
Intent classification	Haiku 4.5	No cache or small cache	Inputs are short.
Tool-heavy planning	Sonnet 4.6	5-minute cache	Tool schemas and memory repeat.
Hard reasoning escalation	Opus 4.7	Cache only if long context repeats	Opus output is still expensive.
Batch evaluation	Sonnet 4.6 or Haiku 4.5	Batch plus cache if supported by workflow	Async work should chase both discounts.

This is the practical cost-efficient path: cache stable context, route ordinary work to affordable models, and reserve Opus for the turns where quality changes the outcome.

What Cache Hit Rate Should You Target?

Cache hit rate is the metric that decides whether caching is moving your bill or just adding write overhead. Production benchmarks vary widely:

Source	Reported hit rate	Workload type
ProjectDiscovery (initial deployment)	7%	Misplaced dynamic content in cache prefix
ProjectDiscovery (after refactor)	74%	Same workload, dynamic content moved out of prefix
ProjectDiscovery (with explicit breakpoints + TTL tuning)	84%	Best published number from public case study
Vellum (typical hosted customers)	"significantly faster + ~50% cheaper" (no specific %)	Mixed agent workloads
Helicone (post-integration)	Varies by prefix stability	LLM observability proxy traffic
Spring AI Anthropic users	Variable by stability of system prompt	Java/Kotlin agent workloads

The pattern is clear: hit rate is not bounded by the model — it's bounded by your prefix design. Anthropic's prompt caching documentation shows the same: cache failures are usually silent, and the API will happily process a request with a "cacheable" prefix that hits 0% because of one dynamic value buried in the wrong place.

Inferred targets for production teams:

Workload	Realistic hit rate target	Likely savings
Stateful agent with stable tool schemas	70-85%	50-70% on input cost
Customer support bot with stable policy	60-80%	40-60% on input cost
Multi-turn chat with rotating system prompts	30-50%	20-35% on input cost
One-shot Q&A over short prompts	<10%	Negligible — disable caching

If your hit rate is below 20% after a week, the prefix is the problem, not the model.

Batch Plus Cache

Anthropic says Batch API processing gives a 50% discount on input and output, and the pricing page says prompt caching multipliers stack with Batch API discounts. That creates a useful decision matrix.

Workload	Cache	Batch	Recommendation
Live chat	Yes	No	Use 5-minute cache for repeated system prompts and tools.
Agent loop	Yes	Usually no	Cache tool schemas, policies, and memory summaries.
Offline evaluation	Maybe	Yes	Use Batch first; add cache if requests share long prefixes.
Dataset labeling	Maybe	Yes	Batch is the main lever; cache only if instructions are long.
Long document Q&A	Yes	Depends	Cache the document for live sessions; batch only for async jobs.

For Sonnet 4.6, the normal cache read rate is $0.30 per 1M tokens. Under the official stacking logic, a batch cache read is effectively even lower because Batch cuts eligible input pricing by 50%. Treat that as an estimate (inferred) to validate against your usage logs, not a substitute for invoice checks.

Sonnet 4.6 scenario	Input price per 1M
Standard input	$3.00
5-minute cache write	$3.75
Cache read	$0.30
Batch input	.50
Estimated batch cache read (inferred)	$0.15

The caveat: Batch is asynchronous. If a user is waiting for the answer, Batch is the wrong lever. For real-time agents, cache beats batch. For offline jobs, batch often beats cache unless the repeated prefix is large.

When Should You Use 5-Minute vs 1-Hour Cache?

Decision factor	5-minute cache	1-hour cache
Write cost	1.25x	2x
Read cost	0.1x	0.1x
Break-even	After one read	Usually after two reads
Best for	Fast repeated calls	Slower repeated calls
Typical workload	Agents, chat, tool use, repeated code tasks	Human review, long side-agent tasks, delayed follow-ups
Main risk	Cache expires before reuse	Write cost is too high for low reuse

Use 5-minute cache by default. Move to 1-hour cache only when your logs show the same stable prefix is reused after the 5-minute window. There is one third-party signal worth noting: a public DEV community report flagged that some teams observed unexpected TTL behavior in 2025; verify your TTL choice with logged cache_creation.ephemeral_5m_input_tokens vs cache_creation.ephemeral_1h_input_tokens rather than assuming the configured TTL is honored.

API Fields To Log

Field	What it tells you	Why it matters
`cache_creation_input_tokens`	Tokens written into cache	Shows write cost.
`cache_read_input_tokens`	Tokens served from cache	Shows hit volume.
`input_tokens`	Non-cached input tokens	Shows the remaining full-price input.
`output_tokens`	Generated output tokens	Cache does not reduce this cost.
`cache_creation.ephemeral_5m_input_tokens`	5-minute write tokens	Helps compare 5-minute vs 1-hour usage.
`cache_creation.ephemeral_1h_input_tokens`	1-hour write tokens	Flags expensive long-TTL writes.

For a working Claude API setup, see our Claude API tutorial. The minimum viable dashboard should track:

Metric	Formula	Healthy signal
Cache read ratio	`cache_read_input_tokens / total_input_related_tokens`	Rising over time for agent traffic.
Cache write waste	Writes with zero follow-up reads	Low for 5-minute cache.
Cost per workflow	Full request cost divided by completed task	Falling after cache rollout.
Escalation rate	Opus calls divided by all Claude calls	Stable or lower after routing changes.

Why Does Your Cache Keep Missing?

Mistake	Symptom	Fix
Dynamic timestamp inside cached prefix	No cache reads	Move timestamps after the cached block.
User message included before cache breakpoint	Prefix changes every call	Cache only stable system, tool, document, or memory blocks.
Tool schema order changes	Random misses	Keep JSON ordering stable.
Cache duration too short	Writes exist but reads are low	Use 1-hour cache only if reuse happens after 5 minutes.
Prompt below model minimum	Both cache creation and read tokens stay at 0	Confirm the cached section is long enough.
Expecting output discount	Output bill unchanged	Cache only affects input-side repeated context.
Gateway strips `cache_control` headers	Hit rate at 0% via proxy but works direct	Verify gateway pass-through; see AI API gateway guide.

Anthropic's prompt caching docs are unusually explicit here: cache failures can be silent when the cached section is too short. The request succeeds, but the usage fields show zero cache creation and zero cache reads. That is why logging matters. ProjectDiscovery's published reports underline this — their initial 7% hit rate was the result of one dynamic field, fixed by relocation, not by infrastructure changes.

How Should You Combine Cache With Routing?

Goal	Recommended path
Lowest live-chat cost	Haiku 4.5 plus 5-minute cache for repeated context.
Best default agent balance	Sonnet 4.6 plus 5-minute cache for tools, memory, and policy.
Highest reasoning quality	Opus 4.7, but cache long stable context and watch tokenizer overhead.
Cheapest async evaluation	Batch API first; add cache only if prompts share a large prefix.
Multi-model cost control	Route easy turns to Haiku, default work to Sonnet, hard turns to Opus.

This is where TokenMix.ai's unified API routing helps in practice. You can compare Claude cost against OpenAI, DeepSeek, Gemini, and other providers without rewriting every integration. The model choice still matters. The routing and cache policy often matter more.

For model-level decisions, read Claude Haiku vs Sonnet and Claude Sonnet vs Opus. For long-context decisions, use Claude 200K vs 1M context.

Final Recommendation

Use Claude prompt caching when at least 20K to 50K stable input tokens repeat across calls. Default to 5-minute cache, measure real cache_read_input_tokens, and target a 60-80% hit rate within two weeks. Use 1-hour cache only when your logs prove delayed reuse. If your gateway sits in front, verify cache headers pass through end-to-end before declaring success.

FAQ

What is Claude API cache pricing?

Claude API cache pricing is Anthropic's discounted billing for repeated prompt prefixes. Cache writes cost more than standard input, but cache reads cost 0.1x the base input rate.

Does Claude prompt caching reduce output cost?

No. Prompt caching reduces repeated input-side cost. Output tokens are still billed at the normal model output rate unless another discount, such as Batch API, applies.

How much does a Claude cache hit cost?

A Claude cache hit costs 10% of the model's base input price. Sonnet 4.6 cache reads are $0.30 per 1M tokens, Haiku 4.5 cache reads are $0.10, and Opus 4.7 cache reads are $0.50.

Is 1-hour Claude cache worth it?

It is worth it only when the same stable prefix is reused after the 5-minute window and usually at least twice. The 1-hour write costs 2x base input, so low-reuse workloads can lose money.

When does 5-minute Claude cache break even?

For pure input cost, 5-minute cache usually breaks even after one cache read. One write plus one read costs 1.35x base input, compared with 2x without cache.

Can Claude Batch API and prompt caching stack?

Anthropic's pricing page says prompt caching multipliers stack with other pricing modifiers, including Batch API. Use logs and invoices to verify the exact effective rate for your workload.

Which Claude model benefits most from caching?

Sonnet 4.6 often benefits most in real agent workloads because it combines strong quality with repeated tool and memory context. Haiku 4.5 is best when the goal is maximum cost efficiency.

What should I log to verify Claude cache savings?

Log cache_creation_input_tokens, cache_read_input_tokens, input_tokens, and output_tokens. Then calculate cache read ratio and cost per completed workflow.

What hit rate is realistic in production?

Per ProjectDiscovery's published case study, 74-84% is achievable on stable agent workloads after dynamic content is moved out of the cacheable prefix. Vellum and Helicone production reports cluster in the 50-80% range. Anything below 20% after a week is a prefix-design problem, not a model problem.

Sources

By TokenMix Research Lab · Updated 2026-04-30