TokenMix Research Lab · 2026-04-07

Prompt Caching Guide 2026: Save 50-95% on AI API Costs with OpenAI, Anthropic, and Google Caching

Prompt caching is the single most effective cost reduction technique for AI APIs. OpenAI's automatic caching gives 50% off cached input tokens. Anthropic's explicit caching gives 90% off cache hits. Google's context caching charges per-hour storage but slashes read costs by 75%. At scale, caching turns an API bill of tens of thousands of dollars a month into $2,000-4,000 without changing model quality or output. This is the detailed implementation guide — how caching works at each provider, code examples, ROI calculations, and when caching helps versus when it does not. Every pricing article on TokenMix.ai references this guide because caching affects every cost comparison. All data from official provider documentation and TokenMix.ai production monitoring, April 2026.


Quick Comparison: Prompt Caching Across Providers

| Feature | OpenAI | Anthropic | Google |
|---|---|---|---|
| Cache discount | 50% off input | 90% off input | 75% off input |
| Cache write cost | Free (automatic) | 1.25x base (5min) / 2x base (1hr) | Free (storage-based) |
| Cache duration | ~5-10 min (automatic) | 5 min or 1 hour (explicit) | Until deleted (manual) |
| Implementation | Zero-code (automatic) | One field per request | API call to create cache |
| Min cacheable tokens | 1,024 | 1,024 | 32,768 |
| Cache storage cost | Free | Free | $4.50/M tokens/hour |
| Cache granularity | Prefix-based | Prefix-based | Explicit context |
| Batch API compatible | Yes | Yes | N/A |
| Models supported | All current models | All Claude models | Gemini 1.5+, 2.0+, 2.5 |

Bottom line: Anthropic gives the deepest discount (90%) but charges for cache writes. OpenAI caching is free and automatic but only saves 50%. Google's model is unique — free writes but hourly storage fees that make short-lived caches expensive.


Why Prompt Caching Matters for AI API Costs

Most production AI applications send the same tokens repeatedly. System prompts, few-shot examples, document context for RAG, and conversation history all contain content that does not change between requests.

Typical token breakdown in a production API call:

| Component | Token count | Changes between requests? |
|---|---|---|
| System prompt | 500-2,000 | No |
| Few-shot examples | 2,000-10,000 | No |
| RAG context | 5,000-50,000 | Partially |
| Conversation history | 1,000-20,000 | Grows incrementally |
| User query | 50-500 | Yes (always unique) |

In a typical setup, 80-95% of input tokens are repeated across requests. Caching these tokens means you pay full price once, then 10-50% of the price on every subsequent request.
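As a rough sketch (illustrative rates, not tied to any single provider), the blended input price follows from the cacheable fraction, the hit rate, and what a cache hit costs as a fraction of base:

```python
def blended_input_price(base, cacheable_frac, hit_rate, cached_price_frac):
    """Effective per-token input price once caching is active.

    cached_price_frac is what a cache hit costs relative to base:
    0.5 for OpenAI, 0.10 for Anthropic, 0.25 for Google reads.
    """
    cached_share = cacheable_frac * hit_rate  # share of tokens served from cache
    return base * ((1 - cached_share) + cached_share * cached_price_frac)

# 80% cacheable tokens, 90% hit rate, Anthropic-style 10% hit price:
price = blended_input_price(3.00, 0.80, 0.90, 0.10)
# 3.00 * (0.28 + 0.72 * 0.10) = $1.056/M, roughly a 65% reduction
assert abs(price - 1.056) < 1e-9
```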

Real-world impact: TokenMix.ai's production data shows that teams implementing prompt caching reduce their input token costs by 60-85% on average. For applications with long system prompts or RAG patterns, savings exceed 90%.

Why every pricing comparison references this guide: When comparing models like GPT-5.4 ($2.50/M input) vs Claude Sonnet ($3.00/M input), the cache-adjusted prices tell a completely different story. Sonnet with 90% cache discount drops to $0.30/M — cheaper than GPT-5.4 with 50% caching at $1.25/M.


How Prompt Caching Works: The Core Mechanism

All prompt caching systems work on the same principle: store the computed internal state (key-value cache) of previously processed tokens so the model does not need to reprocess them.

The Technical Flow

  1. First request: The model processes all input tokens from scratch. The provider stores the computed KV-cache for the prefix portion of your prompt.
  2. Subsequent requests: If the beginning of your new prompt matches a cached prefix, the model skips reprocessing those tokens. It loads the cached state and only processes new tokens.
  3. Cache matching: Matching is prefix-based and exact. The cache hits only if the beginning of your prompt matches byte-for-byte with a cached prefix. Changing even one token in the cached portion invalidates the cache.
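A toy sketch (not provider code) of why matching behaves this way: the reusable portion is the longest common token prefix, so an early edit invalidates everything after it:

```python
def cached_prefix_len(cached, new):
    """Number of leading tokens the new prompt shares with the cache."""
    n = 0
    for a, b in zip(cached, new):
        if a != b:
            break
        n += 1
    return n

cached = ["SYS", "rule1", "rule2", "doc_a", "query1"]
hit    = ["SYS", "rule1", "rule2", "doc_a", "query2"]  # only the tail differs
miss   = ["SYS", "ruleX", "rule2", "doc_a", "query1"]  # early token changed

assert cached_prefix_len(cached, hit) == 4   # everything but the query reused
assert cached_prefix_len(cached, miss) == 1  # one early edit kills the rest
```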

What Gets Cached

The stable prefix of the prompt: system prompts, tool definitions, few-shot examples, and document context. Anything that appears before the first token that differs between requests is eligible.

What Cannot Be Cached

Output tokens, and any input after the first changed token. Because matching is prefix-based, a single edit early in the prompt prevents everything that follows it from being served from cache.


OpenAI Prompt Caching: Automatic and Free

OpenAI's caching is the simplest to use: it is automatic, requires zero code changes, and has no write cost.

How OpenAI Caching Works

Caching activates automatically for any prompt of 1,024 tokens or more. The provider detects repeated prefixes (matched in 128-token increments), serves them from cache, and applies the 50% discount to the cached portion with no configuration on your side. Cached prefixes typically persist for 5-10 minutes of inactivity before eviction.

OpenAI Cached Pricing (April 2026)

| Model | Standard Input/M | Cached Input/M | Savings |
|---|---|---|---|
| GPT-5.4 | $2.50 | $1.25 | 50% |
| GPT-5.4 Mini | $0.75 | $0.375 | 50% |
| GPT-5.4 Nano | $0.20 | $0.10 | 50% |
| o3 | $2.50 | $1.25 | 50% |
| o4-mini | $0.75 | $0.375 | 50% |

OpenAI Cache Verification

Check if caching is active by inspecting response headers:

# In the API response usage object:
{
  "usage": {
    "prompt_tokens": 2048,
    "completion_tokens": 512,
    "prompt_tokens_details": {
      "cached_tokens": 1536  # These tokens were cache hits
    }
  }
}

If cached_tokens is 0 on repeated requests with the same prefix, your prompt may be below the 1,024-token minimum or the cache expired between requests.

OpenAI Caching Limitations

The discount is capped at 50%, the minimum cacheable prompt is 1,024 tokens, and cache lifetime (roughly 5-10 minutes) is managed by OpenAI and cannot be extended or pinned. Matching is exact, so any change within the cached prefix forces a full-price reprocess of everything after it.

Source: OpenAI Prompt Caching Documentation


Anthropic Prompt Caching: Explicit Control, 90% Savings

Anthropic's caching is the most powerful in terms of discount depth: 90% off input on cache hits. But it requires explicit implementation and charges for cache writes.

How Anthropic Caching Works

Caching is opt-in: you place a cache_control breakpoint on a content block, and everything up to that breakpoint is written to cache at a premium (1.25x base for the 5-minute tier, 2x for the 1-hour tier). Subsequent requests that share the exact prefix read it at 10% of base price, and each hit refreshes the cache lifetime at no extra cost. The minimum cacheable prefix is 1,024 tokens.

Anthropic Cached Pricing (April 2026)

| Model | Base Input/M | 5min Cache Write/M | 1hr Cache Write/M | Cache Hit/M | Savings on Hit |
|---|---|---|---|---|---|
| Opus 4.6 | $5.00 | $6.25 | $10.00 | $0.50 | 90% |
| Sonnet 4.6 | $3.00 | $3.75 | $6.00 | $0.30 | 90% |
| Haiku 4.5 | $1.00 | $1.25 | $2.00 | $0.10 | 90% |

Anthropic Cache Break-Even Analysis

5-minute cache (1.25x write cost): the first request pays a 25% premium on the cached prefix; every hit pays 10% of base. Two requests cost 1.35x base with caching versus 2.0x without, so a single cache hit more than recovers the write premium.

1-hour cache (2.0x write cost): the first request pays a 100% premium. Two requests cost 2.1x with caching versus 2.0x without; three requests cost 2.2x versus 3.0x, so two cache hits are needed to break even.
Any workload with more than 1-2 requests per 5 minutes using the same system prompt should enable caching. The ROI is immediate.
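The break-even arithmetic can be checked directly, with prices expressed as multiples of the base input rate:

```python
def cost_with_cache(requests, write_mult, hit_mult=0.10):
    """Total input cost for N requests on one cached prefix, as a
    multiple of the base (uncached) price: one write, then hits."""
    return write_mult + (requests - 1) * hit_mult

def cost_without_cache(requests):
    return float(requests)

# 5-minute cache (1.25x write): profitable from the second request.
assert abs(cost_with_cache(2, 1.25) - 1.35) < 1e-9
assert cost_with_cache(2, 1.25) < cost_without_cache(2)

# 1-hour cache (2.0x write): needs two hits to break even.
assert cost_with_cache(2, 2.0) > cost_without_cache(2)   # not yet
assert cost_with_cache(3, 2.0) < cost_without_cache(3)   # profitable
```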

Anthropic Cache Implementation

Add cache_control to the content block you want cached:

{
  "model": "claude-sonnet-4-6-20260401",
  "max_tokens": 1024,
  "system": [
    {
      "type": "text",
      "text": "Your system prompt here with instructions, examples, context...",
      "cache_control": {"type": "ephemeral"}
    }
  ],
  "messages": [
    {"role": "user", "content": "User query here"}
  ]
}

The cache_control field with "type": "ephemeral" defaults to the 5-minute cache. For the 1-hour cache, add a ttl field to the same object: "cache_control": {"type": "ephemeral", "ttl": "1h"}.

Anthropic Cache Verification

{
  "usage": {
    "input_tokens": 2048,
    "cache_creation_input_tokens": 1536,
    "cache_read_input_tokens": 0
  }
}

On the first request, cache_creation_input_tokens shows what was written. On subsequent hits, cache_read_input_tokens shows what was served from cache. Monitor both to verify caching is working.

Source: Anthropic Prompt Caching Documentation


Google Context Caching: Hourly Storage Model

Google's approach is fundamentally different: no write premium, but you pay hourly storage fees for cached content.

How Google Context Caching Works

You create a cache explicitly via an API call, uploading the context (minimum 32,768 tokens) with a TTL. Subsequent requests reference the cached content by name and pay the reduced read rate on those tokens, plus standard rates on everything else. Storage is billed hourly until the TTL expires or you delete the cache.

Google Cached Pricing (Gemini 2.5 Flash, April 2026)

| Operation | Price/M Tokens |
|---|---|
| Standard input | $0.30 |
| Cached input (read) | $0.075 |
| Cache storage | $4.50/M tokens/hour |
| Cache write | Free |

Google Cache Cost Analysis

The hourly storage model means Google caching is only cost-effective for workloads with high request frequency over sustained periods.

Example: 50,000 tokens of cached context

Storage: 50,000 tokens x $4.50/M tokens/hour = $0.225/hour, or $5.40/day. Savings per request: 50,000 tokens x ($0.30 - $0.075)/M = $0.01125. Break-even: $5.40 / $0.01125 = 480 requests/day.

If you make fewer than 480 requests/day using this cached context, the storage fee exceeds the savings. This makes Google caching impractical for low-to-medium volume workloads.

When Google caching wins: High-volume applications (1,000+ requests/hour) with very large context (100K+ tokens). At 100K cached tokens and 5,000 requests/hour, the savings are substantial and storage fees are a small fraction of total spend.
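The storage break-even can be sketched in code using the Gemini 2.5 Flash rates from the pricing table. Note that the cached token count cancels out of the break-even, which depends only on the rates:

```python
def google_breakeven_requests_per_day(cached_tokens, standard=0.30,
                                      cached_read=0.075, storage_hr=4.50):
    """Daily requests needed before read savings cover the storage fee."""
    storage_per_day = cached_tokens / 1e6 * storage_hr * 24
    saving_per_request = cached_tokens / 1e6 * (standard - cached_read)
    # cached_tokens cancels: break-even = storage_hr * 24 / (standard - cached_read)
    return storage_per_day / saving_per_request

# 50K cached tokens: $5.40/day storage vs $0.01125 saved per request
assert round(google_breakeven_requests_per_day(50_000)) == 480
```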

Source: Google AI Context Caching


DeepSeek and Other Providers: Caching Options

DeepSeek

DeepSeek offers automatic prefix caching similar to OpenAI: repeated prefixes are detected server-side with no code changes, and cache hits are billed at a reduced input rate. The response's usage object reports prompt_cache_hit_tokens and prompt_cache_miss_tokens, so you can verify hits the same way you would with OpenAI's cached_tokens field.

Groq

Groq does not currently offer prompt caching. Given Groq's focus on speed rather than cost optimization, caching is less critical — their pricing is already competitive on input.

Open-Source / Self-Hosted

If you self-host models (via vLLM, TGI, or similar), prefix caching is available at the inference server level: vLLM's automatic prefix caching reuses KV-cache blocks for shared prompt prefixes across requests, and TGI offers similar prefix reuse. The savings appear as reduced GPU prefill time rather than a billing discount, so the same prompt-structuring practices apply.
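For vLLM specifically, prefix caching is a server flag; the model name below is illustrative:

```shell
# Enable vLLM's automatic prefix caching: shared prompt prefixes
# reuse KV-cache blocks across requests, cutting GPU prefill time.
vllm serve meta-llama/Llama-3.1-8B-Instruct --enable-prefix-caching
```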


Prompt Caching Implementation Guide with Code

Python: OpenAI (Automatic)

No changes needed. Caching is automatic for prompts with 1,024+ tokens:

from openai import OpenAI
client = OpenAI()

# This system prompt will be automatically cached
system_prompt = "Your long system prompt here..." # 1024+ tokens

# Request 1: Full price (cache miss)
response1 = client.chat.completions.create(
    model="gpt-5.4-mini",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "First query"}
    ]
)

# Request 2: 50% off input (cache hit if within ~5-10 min)
response2 = client.chat.completions.create(
    model="gpt-5.4-mini",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "Second query"}
    ]
)

# Verify cache hit
print(response2.usage.prompt_tokens_details.cached_tokens)

Python: Anthropic (Explicit)

import anthropic
client = anthropic.Anthropic()

# Enable caching by adding cache_control to system prompt
response = client.messages.create(
    model="claude-sonnet-4-6-20260401",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "Your long system prompt here...",  # 1024+ tokens
            "cache_control": {"type": "ephemeral"}  # 5-minute cache
        }
    ],
    messages=[
        {"role": "user", "content": "User query here"}
    ]
)

# Check cache status
print(f"Cache created: {response.usage.cache_creation_input_tokens}")
print(f"Cache read: {response.usage.cache_read_input_tokens}")

Python: Google (Context Caching)

import google.generativeai as genai

# Step 1: Create cached content
cache = genai.caching.CachedContent.create(
    model="gemini-2.5-flash",
    contents=[{
        "parts": [{"text": "Your very long context here..."}],  # 32,768+ tokens
        "role": "user"
    }],
    ttl="3600s"  # 1 hour TTL
)

# Step 2: Use cached content in requests
model = genai.GenerativeModel.from_cached_content(cache)
response = model.generate_content("Query using the cached context")

# Step 3: Delete cache when done (stop storage charges)
cache.delete()

Node.js / TypeScript: Anthropic

import Anthropic from "@anthropic-ai/sdk";
const client = new Anthropic();

const response = await client.messages.create({
  model: "claude-sonnet-4-6-20260401",
  max_tokens: 1024,
  system: [
    {
      type: "text",
      text: "Your long system prompt...",
      cache_control: { type: "ephemeral" },
    },
  ],
  messages: [{ role: "user", content: "Query" }],
});

console.log("Cache created:", response.usage.cache_creation_input_tokens);
console.log("Cache hit:", response.usage.cache_read_input_tokens);

Multi-Provider Caching via TokenMix.ai

TokenMix.ai's unified API abstracts caching across providers. You mark content as cacheable once, and the platform applies the provider-specific caching mechanism automatically:

# Conceptual example — TokenMix.ai unified API
response = tokenmix.chat.completions.create(
    model="auto",  # Routes to best available model
    messages=[...],
    cache_config={
        "enabled": True,
        "ttl": 300  # 5 minutes
    }
)
# Platform handles OpenAI automatic caching,
# Anthropic cache_control, or Google context caching
# based on which provider serves the request

ROI Calculation: When Caching Pays Off

Formula

Monthly savings = (cacheable_tokens_per_request x requests_per_month x base_price x cache_discount_rate) - cache_write_costs

ROI = monthly_savings / implementation_cost
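The formula translates directly to code; the numbers below are hypothetical and not taken from the example that follows:

```python
def monthly_caching_savings(cacheable_tokens, requests_per_month,
                            base_price_per_m, discount, write_costs):
    """Monthly savings from the guide's formula, in dollars."""
    gross = (cacheable_tokens / 1e6) * requests_per_month \
            * base_price_per_m * discount
    return gross - write_costs

# Hypothetical: 2K cacheable tokens, 100K requests/month, Sonnet-style
# $3.00/M input with 90% cache discount, $1.50 of cold-start write premiums.
savings = monthly_caching_savings(2_000, 100_000, 3.00, 0.90, 1.50)
# (2000/1e6) * 100000 * 3.00 * 0.9 - 1.50 = $538.50
assert abs(savings - 538.50) < 1e-6
```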

Example: SaaS Product with Claude Sonnet 4.6

Setup:

Without caching:

With Anthropic 5-minute caching (90% off hits):

With OpenAI automatic caching (50% off hits):

Anthropic caching saves 82% vs OpenAI's 55%. This is why cache-adjusted pricing comparisons often favor Anthropic despite higher list prices.

Break-Even by Provider

| Provider | Min requests for caching ROI | Notes |
|---|---|---|
| OpenAI | 1 (automatic, free) | Always beneficial when prompt is 1,024+ tokens |
| Anthropic (5min) | 2 requests within 5 minutes | Cache write cost recovered after 1 hit |
| Anthropic (1hr) | 3 requests within 1 hour | Higher write cost needs 2 hits to break even |
| Google | 480+ requests/day (50K context) | Storage fees require sustained high volume |

When Prompt Caching Does Not Help

Caching is not universally beneficial. These scenarios see minimal or no savings:

1. Fully Dynamic Prompts

If every token in your prompt changes between requests (no shared system prompt, no repeated context), there is nothing to cache. This is uncommon in production but happens with certain creative generation workflows.

2. Very Short Prompts

Below 1,024 tokens (OpenAI/Anthropic minimum), caching cannot activate. If your total prompt is 500 tokens, no caching is possible. Solution: pad your system prompt with useful context to exceed the minimum.

3. Low Request Frequency

If requests are spaced more than 5-10 minutes apart (OpenAI) or more than 5 minutes / 1 hour apart (Anthropic), caches expire between requests. Every request is a cold start with cache write costs but no read benefits.

For Anthropic: If your average request interval exceeds 5 minutes, the 5-minute cache will frequently miss. The 1-hour cache costs more to write but may hit more often. Calculate based on your actual request pattern.

4. Output-Dominated Costs

Caching only reduces input costs. If your workload generates much more output than input (e.g., long-form content generation with short prompts), caching has minimal impact on total spend. A workload with 1K input tokens and 10K output tokens saves very little from input caching.
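A quick check of the arithmetic, assuming Sonnet-style input rates and a hypothetical $15.00/M output price:

```python
def input_share(in_tokens, out_tokens, in_price=3.00, out_price=15.00):
    """Fraction of per-request spend that input tokens represent."""
    input_cost = in_tokens / 1e6 * in_price
    output_cost = out_tokens / 1e6 * out_price
    return input_cost / (input_cost + output_cost)

share = input_share(1_000, 10_000)  # 1K input, 10K output
# $0.003 input vs $0.15 output: input is under 2% of spend, so even
# a 90% discount on input trims total cost by less than 2%.
assert share < 0.02
```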

5. Google Caching with Low Volume

Google's hourly storage fee ($4.50/M tokens/hour) makes caching uneconomical unless request volume is high enough to amortize the storage cost. For 50K cached tokens, you need 480+ requests/day just to break even.


Stacking Caching with Other Discounts

The deepest savings come from combining caching with batch processing.

Anthropic: Caching + Batch API

| Discount Layer | Sonnet 4.6 Input/M |
|---|---|
| Standard | $3.00 |
| Cache hit only | $0.30 (90% off) |
| Batch only | $1.50 (50% off) |
| Cache hit + Batch | $0.15 (95% off) |

Sonnet 4.6 input at $0.15/M tokens is cheaper than DeepSeek V4's standard $0.30/M. This is how a "premium" model becomes the budget option for batch workloads with cacheable prompts.
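The stacked figures follow because discounts multiply on the remaining price rather than adding:

```python
def stacked_price(base, *discounts):
    """Apply each discount to what is left after the previous one."""
    price = base
    for d in discounts:
        price *= (1 - d)
    return price

# Anthropic: 90% cache + 50% batch -> 95% total off, not 140%.
assert abs(stacked_price(3.00, 0.90, 0.50) - 0.15) < 1e-9
# OpenAI: 50% cache + 50% batch -> 75% total off.
assert abs(stacked_price(2.50, 0.50, 0.50) - 0.625) < 1e-9
```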

OpenAI: Caching + Batch API

| Discount Layer | GPT-5.4 Input/M |
|---|---|
| Standard | $2.50 |
| Cache hit only | $1.25 (50% off) |
| Batch only | $1.25 (50% off) |
| Cache hit + Batch | $0.625 (75% off) |

OpenAI's stacked discount reaches 75% off — significant, but Anthropic's 95% still wins by a wide margin on input costs.

Cross-Provider Comparison with Full Discount Stack

| Model | Stacked Input Price/M | Standard Input/M | Total Discount |
|---|---|---|---|
| Claude Sonnet 4.6 (cache+batch) | $0.15 | $3.00 | 95% |
| Claude Haiku 4.5 (cache+batch) | $0.05 | $1.00 | 95% |
| GPT-5.4 (cache+batch) | $0.625 | $2.50 | 75% |
| GPT-5.4 Mini (cache+batch) | $0.1875 | $0.75 | 75% |
| DeepSeek V4 (no stacking) | $0.30 | $0.30 | 0% |

Haiku 4.5 with full discount stack at $0.05/M input is the cheapest option from any major provider. This changes the competitive landscape entirely.


Prompt Caching Best Practices

1. Put Cacheable Content First

Caching is prefix-based. The cacheable content must come at the beginning of your prompt. Structure your messages as:

  1. System prompt (static, cacheable)
  2. Few-shot examples (static, cacheable)
  3. Document context (semi-static, cacheable if reused)
  4. Conversation history (growing, partially cacheable)
  5. User query (dynamic, never cached)
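One way to realize this layering with Anthropic's explicit caching is sketched below; Anthropic supports multiple cache_control breakpoints, and the helper function shown is hypothetical:

```python
def build_request(system_text, history, user_query):
    """Request body with two cache breakpoints: one after the static
    system layer, one after the conversation so far. The history
    breakpoint lets each turn extend the cached prefix incrementally."""
    messages = [dict(m) for m in history]
    if messages:
        last = messages[-1]
        last["content"] = [{
            "type": "text",
            "text": last["content"],
            "cache_control": {"type": "ephemeral"},
        }]
    messages.append({"role": "user", "content": user_query})
    return {
        "model": "claude-sonnet-4-6-20260401",
        "max_tokens": 1024,
        "system": [{
            "type": "text",
            "text": system_text,
            "cache_control": {"type": "ephemeral"},
        }],
        "messages": messages,
    }

req = build_request("Long static instructions...",
                    [{"role": "user", "content": "Hi"},
                     {"role": "assistant", "content": "Hello!"}],
                    "What's my order status?")
```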

2. Standardize System Prompts

Even minor whitespace differences break cache matching. Standardize your system prompt generation — use a constant string, not a template with variable formatting. Store the canonical prompt text and reference it rather than rebuilding it per request.
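A minimal illustration (prompt text and helper are hypothetical):

```python
# Anti-pattern: request-specific data in the "static" prefix breaks
# byte-for-byte matching, so every request is a cache miss.
def bad_system_prompt(ticket_id):
    return f"You are a support agent. Current ticket: {ticket_id}. Policies: ..."

# Pattern: one canonical constant, reused verbatim on every request,
# with dynamic data placed in the user message after the cached prefix.
SYSTEM_PROMPT = "You are a support agent. Policies: ..."

assert bad_system_prompt("T-1") != bad_system_prompt("T-2")  # guaranteed miss
```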

3. Monitor Cache Hit Rates

Track cached_tokens (OpenAI) or cache_read_input_tokens (Anthropic) in every response. If your hit rate drops below 80%, investigate: prompts may be changing unexpectedly, request frequency may be too low, or cache eviction may be aggressive.
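A minimal monitoring sketch over OpenAI-style usage records (field names follow the response shape shown earlier; a real implementation would also handle Anthropic's cache_read_input_tokens):

```python
def cache_hit_rate(usage_records):
    """Fraction of prompt tokens served from cache across requests."""
    prompt = sum(r["prompt_tokens"] for r in usage_records)
    cached = sum(r["cached_tokens"] for r in usage_records)
    return cached / prompt if prompt else 0.0

records = [
    {"prompt_tokens": 2048, "cached_tokens": 0},     # cold start
    {"prompt_tokens": 2048, "cached_tokens": 1536},  # hit
    {"prompt_tokens": 2048, "cached_tokens": 1536},  # hit
]
rate = cache_hit_rate(records)  # 3072 / 6144 = 0.5, below the 80% target
assert rate == 0.5
```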

Target hit rates depend on the use case: chatbots with a stable system prompt and steady traffic should sit near the top of the range, while RAG workloads with rotating document context will naturally run lower.

4. Cache the Largest Stable Content Blocks

Maximize savings by caching the largest blocks of content that remain stable across requests. A 10,000-token few-shot example block saves 10x more per cache hit than a 1,000-token system prompt.

5. Use Anthropic 1-Hour Cache for Infrequent Workloads

If your requests are spaced 5-30 minutes apart, the 5-minute cache will frequently miss. The 1-hour cache costs more to write (2x vs 1.25x) but dramatically improves hit rates for workloads with moderate request frequency.

6. Delete Google Caches When Done

Google charges hourly storage. If you create a cache for a batch job, delete it immediately after the batch completes. Forgetting to delete is the Google equivalent of the "zombie model" problem in fine-tuning.


How to Choose a Caching Strategy

| Your situation | Recommended approach | Why |
|---|---|---|
| Any workload with 1,024+ token prompts on OpenAI | Do nothing (automatic) | Caching is free and automatic |
| High-volume workload on Anthropic | Enable 5-minute cache | 90% savings, breaks even after 1 hit |
| Infrequent requests on Anthropic (5-30 min intervals) | Use 1-hour cache | Higher write cost but better hit rate |
| Very large context (100K+) with high volume on Google | Use context caching | Storage fee justified by volume |
| Large context, low volume on Google | Skip caching | Storage fees exceed savings |
| Batch workloads on Anthropic/OpenAI | Cache + Batch API | Stack discounts for up to 95% off |
| Multi-provider routing | Use TokenMix.ai | Automatic provider-appropriate caching |
| Output-dominated workloads | Caching has limited value | Focus on model routing or output optimization |

Conclusion

Prompt caching is not optional for production AI APIs — it is the difference between viable and unsustainable economics. Anthropic's 90% cache discount makes Sonnet 4.6 input cheaper than DeepSeek V4 at scale. OpenAI's automatic 50% discount requires zero effort. Google's storage-based model suits high-volume, large-context workloads.

The implementation cost is minimal — one field in your Anthropic requests, zero changes for OpenAI, and a few API calls for Google. The ROI is immediate for any workload exceeding 2 requests per cache window with shared context.

TokenMix.ai applies caching automatically across providers through its unified API, ensuring you always get the best available discount regardless of which model serves your request. For teams managing multiple providers, this eliminates the need to implement and monitor three different caching systems.

Every cost comparison in AI APIs should be made on cache-adjusted prices, not list prices. A model that looks expensive at list price may be the cheapest option after caching. Run the numbers with your actual request patterns, and let the math guide your architecture.


FAQ

What is prompt caching in AI APIs?

Prompt caching stores the processed state of repeated input tokens so the model does not recompute them on subsequent requests. When the beginning of a new prompt matches a previously cached prefix, the provider serves those tokens from cache at a reduced price — 50% off at OpenAI, 90% off at Anthropic, and 75% off at Google.

How much does prompt caching save?

Savings depend on the cache hit rate and the proportion of cacheable tokens. With a 90% cache hit rate and 80% cacheable tokens, Anthropic's 90% discount reduces input costs by roughly 65% (0.8 x 0.9 x 0.9), and OpenAI's 50% discount reduces them by roughly 36% under the same conditions. Real-world savings range from 40-85% of input token costs.

Does OpenAI prompt caching require code changes?

No. OpenAI caching is fully automatic for prompts with 1,024 or more tokens. No API changes, no configuration, no additional fields. Caching activates automatically and the discount applies when cache hits occur. You can verify caching through the cached_tokens field in the response usage object.

How long does a prompt cache last?

OpenAI: approximately 5-10 minutes of inactivity, automatically managed. Anthropic: either 5 minutes or 1 hour, depending on which cache tier you select. Google: until the TTL you set expires or you manually delete the cache. Keep request frequency above the cache expiration interval to maintain high hit rates.

Can prompt caching and batch API discounts be combined?

Yes, at both OpenAI and Anthropic. Anthropic's combined discount reaches 95% off input (cache 90% + batch 50% stacked). OpenAI's combined discount reaches 75% off input (cache 50% + batch 50% stacked). This is the deepest discount available from any major AI provider.

Is prompt caching worth it for small workloads?

For OpenAI, yes — caching is free and automatic, so any workload benefits. For Anthropic, you need at least 2 requests within a 5-minute window using the same prompt prefix to break even on the 5-minute cache write cost. For Google, you need hundreds of daily requests to justify the hourly storage fee. If your request volume is very low, OpenAI's automatic caching or Anthropic's 5-minute cache are the most accessible options.


Author: TokenMix Research Lab | Last Updated: April 2026 | Data Source: OpenAI Prompt Caching Docs, Anthropic Prompt Caching Docs, Google Context Caching Docs, TokenMix.ai