TokenMix Research Lab · 2026-04-07

Prompt Caching Guide 2026: Save 50-95% on AI API Costs with OpenAI, Anthropic, and Google Caching

Prompt caching is the single most effective cost reduction technique for AI APIs. OpenAI's automatic caching gives 50% off cached input tokens. Anthropic's explicit caching gives 90% off cache hits. Google's context caching charges per-hour storage but slashes read costs by 75%. At scale, caching turns an API bill of tens of thousands of dollars a month into $2,000-4,000 without changing model quality or output. This is the detailed implementation guide — how caching works at each provider, code examples, ROI calculations, and when caching helps versus when it does not. Every pricing article on TokenMix.ai references this guide because caching affects every cost comparison. All data from official provider documentation and TokenMix.ai production monitoring, April 2026.


Quick Comparison: Prompt Caching Across Providers

| Feature | OpenAI | Anthropic | Google |
|---|---|---|---|
| Cache discount | 50% off input | 90% off input | 75% off input |
| Cache write cost | Free (automatic) | 1.25x base (5min) / 2x base (1hr) | Free (storage-based) |
| Cache duration | ~5-10 min (automatic) | 5 min or 1 hour (explicit) | Until deleted (manual) |
| Implementation | Zero-code (automatic) | One field per request | API call to create cache |
| Min cacheable tokens | 1,024 | 1,024 | 32,768 |
| Cache storage cost | Free | Free | $4.50/M tokens/hour |
| Cache granularity | Prefix-based | Prefix-based | Explicit context |
| Batch API compatible | Yes | Yes | N/A |
| Models supported | All current models | All Claude models | Gemini 1.5+, 2.0+, 2.5 |

Bottom line: Anthropic gives the deepest discount (90%) but charges for cache writes. OpenAI caching is free and automatic but only saves 50%. Google's model is unique — free writes but hourly storage fees that make short-lived caches expensive.


Why Prompt Caching Matters for AI API Costs

Most production AI applications send the same tokens repeatedly. System prompts, few-shot examples, document context for RAG, and conversation history all contain content that does not change between requests.

Typical token breakdown in a production API call:

| Component | Token count | Changes between requests? |
|---|---|---|
| System prompt | 500-2,000 | No |
| Few-shot examples | 2,000-10,000 | No |
| RAG context | 5,000-50,000 | Partially |
| Conversation history | 1,000-20,000 | Grows incrementally |
| User query | 50-500 | Yes (always unique) |

In a typical setup, 80-95% of input tokens are repeated across requests. Caching these tokens means you pay full price once, then 10-50% of the price on every subsequent request.
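As a rough sketch (illustrative rates, not tied to any single provider), the blended input price follows from the cacheable fraction, the hit rate, and what a cache hit costs as a fraction of base:

```python
def blended_input_price(base, cacheable_frac, hit_rate, cached_price_frac):
    """Effective per-token input price once caching is active.

    cached_price_frac is what a cache hit costs relative to base:
    0.5 for OpenAI, 0.10 for Anthropic, 0.25 for Google reads.
    """
    cached_share = cacheable_frac * hit_rate  # share of tokens served from cache
    return base * ((1 - cached_share) + cached_share * cached_price_frac)

# 80% cacheable tokens, 90% hit rate, Anthropic-style 10% hit price:
price = blended_input_price(3.00, 0.80, 0.90, 0.10)
# 3.00 * (0.28 + 0.72 * 0.10) = $1.056/M, roughly a 65% reduction
assert abs(price - 1.056) < 1e-9
```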

Real-world impact: TokenMix.ai's production data shows that teams implementing prompt caching reduce their input token costs by 60-85% on average. For applications with long system prompts or RAG patterns, savings exceed 90%.

Why every pricing comparison references this guide: When comparing models like GPT-5.4 ($2.50/M input) vs Claude Sonnet ($3.00/M input), the cache-adjusted prices tell a completely different story. Sonnet with 90% cache discount drops to $0.30/M — cheaper than GPT-5.4 with 50% caching at $1.25/M.


How Prompt Caching Works: The Core Mechanism

All prompt caching systems work on the same principle: store the computed internal state (key-value cache) of previously processed tokens so the model does not need to reprocess them.

The Technical Flow

  1. First request: The model processes all input tokens from scratch. The provider stores the computed KV-cache for the prefix portion of your prompt.
  2. Subsequent requests: If the beginning of your new prompt matches a cached prefix, the model skips reprocessing those tokens. It loads the cached state and only processes new tokens.
  3. Cache matching: Matching is prefix-based and exact. The cache hits only if the beginning of your prompt matches byte-for-byte with a cached prefix. Changing even one token in the cached portion invalidates the cache.
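A toy sketch (not provider code) of why matching behaves this way: the reusable portion is the longest common token prefix, so an early edit invalidates everything after it:

```python
def cached_prefix_len(cached, new):
    """Number of leading tokens the new prompt shares with the cache."""
    n = 0
    for a, b in zip(cached, new):
        if a != b:
            break
        n += 1
    return n

cached = ["SYS", "rule1", "rule2", "doc_a", "query1"]
hit    = ["SYS", "rule1", "rule2", "doc_a", "query2"]  # only the tail differs
miss   = ["SYS", "ruleX", "rule2", "doc_a", "query1"]  # early token changed

assert cached_prefix_len(cached, hit) == 4   # everything but the query reused
assert cached_prefix_len(cached, miss) == 1  # one early edit kills the rest
```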

What Gets Cached

The stable prefix of the prompt: system prompts, tool definitions, few-shot examples, and document context. Anything that appears before the first token that differs between requests is eligible.

What Cannot Be Cached

Output tokens, and any input after the first changed token. Because matching is prefix-based, a single edit early in the prompt prevents everything that follows it from being served from cache.


OpenAI Prompt Caching: Automatic and Free

OpenAI's caching is the simplest to use: it is automatic, requires zero code changes, and has no write cost.

How OpenAI Caching Works

Caching activates automatically for any prompt of 1,024 tokens or more. The provider detects repeated prefixes (matched in 128-token increments), serves them from cache, and applies the 50% discount to the cached portion with no configuration on your side. Cached prefixes typically persist for 5-10 minutes of inactivity before eviction.

OpenAI Cached Pricing (April 2026)

| Model | Standard Input/M | Cached Input/M | Savings |
|---|---|---|---|
| GPT-5.4 | $2.50 | $1.25 | 50% |
| GPT-5.4 Mini | $0.75 | $0.375 | 50% |
| GPT-5.4 Nano | $0.20 | $0.10 | 50% |
| o3 | $2.50 | $1.25 | 50% |
| o4-mini | $0.75 | $0.375 | 50% |

OpenAI Cache Verification

Check if caching is active by inspecting response headers:

# In the API response usage object:
{
  "usage": {
    "prompt_tokens": 2048,
    "completion_tokens": 512,
    "prompt_tokens_details": {
      "cached_tokens": 1536  # These tokens were cache hits
    }
  }
}

If cached_tokens is 0 on repeated requests with the same prefix, your prompt may be below the 1,024-token minimum or the cache expired between requests.

OpenAI Caching Limitations

The discount is capped at 50%, the minimum cacheable prompt is 1,024 tokens, and cache lifetime (roughly 5-10 minutes) is managed by OpenAI and cannot be extended or pinned. Matching is exact, so any change within the cached prefix forces a full-price reprocess of everything after it.

Source: OpenAI Prompt Caching Documentation


Anthropic Prompt Caching: Explicit Control, 90% Savings

Anthropic's caching is the most powerful in terms of discount depth: 90% off input on cache hits. But it requires explicit implementation and charges for cache writes.

How Anthropic Caching Works

Caching is opt-in: you place a cache_control breakpoint on a content block, and everything up to that breakpoint is written to cache at a premium (1.25x base for the 5-minute tier, 2x for the 1-hour tier). Subsequent requests that share the exact prefix read it at 10% of base price, and each hit refreshes the cache lifetime at no extra cost. The minimum cacheable prefix is 1,024 tokens.

Anthropic Cached Pricing (April 2026)

| Model | Base Input/M | 5min Cache Write/M | 1hr Cache Write/M | Cache Hit/M | Savings on Hit |
|---|---|---|---|---|---|
| Opus 4.6 | $5.00 | $6.25 | $10.00 | $0.50 | 90% |
| Sonnet 4.6 | $3.00 | $3.75 | $6.00 | $0.30 | 90% |
| Haiku 4.5 | $1.00 | $1.25 | $2.00 | $0.10 | 90% |

Anthropic Cache Break-Even Analysis

5-minute cache (1.25x write cost): the first request pays a 25% premium on the cached prefix; every hit pays 10% of base. Two requests cost 1.35x base with caching versus 2.0x without, so a single cache hit more than recovers the write premium.

1-hour cache (2.0x write cost): the first request pays a 100% premium. Two requests cost 2.1x with caching versus 2.0x without; three requests cost 2.2x versus 3.0x, so two cache hits are needed to break even.
Any workload with more than 1-2 requests per 5 minutes using the same system prompt should enable caching. The ROI is immediate.
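The break-even arithmetic can be checked directly, with prices expressed as multiples of the base input rate:

```python
def cost_with_cache(requests, write_mult, hit_mult=0.10):
    """Total input cost for N requests on one cached prefix, as a
    multiple of the base (uncached) price: one write, then hits."""
    return write_mult + (requests - 1) * hit_mult

def cost_without_cache(requests):
    return float(requests)

# 5-minute cache (1.25x write): profitable from the second request.
assert abs(cost_with_cache(2, 1.25) - 1.35) < 1e-9
assert cost_with_cache(2, 1.25) < cost_without_cache(2)

# 1-hour cache (2.0x write): needs two hits to break even.
assert cost_with_cache(2, 2.0) > cost_without_cache(2)   # not yet
assert cost_with_cache(3, 2.0) < cost_without_cache(3)   # profitable
```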

Anthropic Cache Implementation

Add cache_control to the content block you want cached:

{
  "model": "claude-sonnet-4-6-20260401",
  "max_tokens": 1024,
  "system": [
    {
      "type": "text",
      "text": "Your system prompt here with instructions, examples, context...",
      "cache_control": {"type": "ephemeral"}
    }
  ],
  "messages": [
    {"role": "user", "content": "User query here"}
  ]
}

The cache_control field with "type": "ephemeral" defaults to the 5-minute cache. For the 1-hour cache, add a ttl field to the same object: "cache_control": {"type": "ephemeral", "ttl": "1h"}.

Anthropic Cache Verification

{
  "usage": {
    "input_tokens": 2048,
    "cache_creation_input_tokens": 1536,
    "cache_read_input_tokens": 0
  }
}

On the first request, cache_creation_input_tokens shows what was written. On subsequent hits, cache_read_input_tokens shows what was served from cache. Monitor both to verify caching is working.

Source: Anthropic Prompt Caching Documentation


Google Context Caching: Hourly Storage Model

Google's approach is fundamentally different: no write premium, but you pay hourly storage fees for cached content.

How Google Context Caching Works

You create a cache explicitly via an API call, uploading the context (minimum 32,768 tokens) with a TTL. Subsequent requests reference the cached content by name and pay the reduced read rate on those tokens, plus standard rates on everything else. Storage is billed hourly until the TTL expires or you delete the cache.

Google Cached Pricing (Gemini 2.5 Flash, April 2026)

| Operation | Price/M Tokens |
|---|---|
| Standard input | $0.30 |
| Cached input (read) | $0.075 |
| Cache storage | $4.50/M tokens/hour |
| Cache write | Free |

Google Cache Cost Analysis

The hourly storage model means Google caching is only cost-effective for workloads with high request frequency over sustained periods.

Example: 50,000 tokens of cached context

Storage: 50,000 tokens x $4.50/M tokens/hour = $0.225/hour, or $5.40/day. Savings per request: 50,000 tokens x ($0.30 - $0.075)/M = $0.01125. Break-even: $5.40 / $0.01125 = 480 requests/day.

If you make fewer than 480 requests/day using this cached context, the storage fee exceeds the savings. This makes Google caching impractical for low-to-medium volume workloads.

When Google caching wins: High-volume applications (1,000+ requests/hour) with very large context (100K+ tokens). At 100K cached tokens and 5,000 requests/hour, the savings are substantial and storage fees are a small fraction of total spend.
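The storage break-even can be sketched in code using the Gemini 2.5 Flash rates from the pricing table. Note that the cached token count cancels out of the break-even, which depends only on the rates:

```python
def google_breakeven_requests_per_day(cached_tokens, standard=0.30,
                                      cached_read=0.075, storage_hr=4.50):
    """Daily requests needed before read savings cover the storage fee."""
    storage_per_day = cached_tokens / 1e6 * storage_hr * 24
    saving_per_request = cached_tokens / 1e6 * (standard - cached_read)
    # cached_tokens cancels: break-even = storage_hr * 24 / (standard - cached_read)
    return storage_per_day / saving_per_request

# 50K cached tokens: $5.40/day storage vs $0.01125 saved per request
assert round(google_breakeven_requests_per_day(50_000)) == 480
```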

Source: Google AI Context Caching


DeepSeek and Other Providers: Caching Options

DeepSeek

DeepSeek offers automatic prefix caching similar to OpenAI: repeated prefixes are detected server-side with no code changes, and cache hits are billed at a reduced input rate. The response's usage object reports prompt_cache_hit_tokens and prompt_cache_miss_tokens, so you can verify hits the same way you would with OpenAI's cached_tokens field.

Groq

Groq does not currently offer prompt caching. Given Groq's focus on speed rather than cost optimization, caching is less critical — their pricing is already competitive on input.

Open-Source / Self-Hosted

If you self-host models (via vLLM, TGI, or similar), prefix caching is available at the inference server level: vLLM's automatic prefix caching reuses KV-cache blocks for shared prompt prefixes across requests, and TGI offers similar prefix reuse. The savings appear as reduced GPU prefill time rather than a billing discount, so the same prompt-structuring practices apply.
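For vLLM specifically, prefix caching is a server flag; the model name below is illustrative:

```shell
# Enable vLLM's automatic prefix caching: shared prompt prefixes
# reuse KV-cache blocks across requests, cutting GPU prefill time.
vllm serve meta-llama/Llama-3.1-8B-Instruct --enable-prefix-caching
```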


Prompt Caching Implementation Guide with Code

Python: OpenAI (Automatic)

No changes needed. Caching is automatic for prompts with 1,024+ tokens:

from openai import OpenAI
client = OpenAI()

# This system prompt will be automatically cached
system_prompt = "Your long system prompt here..." # 1024+ tokens

# Request 1: Full price (cache miss)
response1 = client.chat.completions.create(
    model="gpt-5.4-mini",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "First query"}
    ]
)

# Request 2: 50% off input (cache hit if within ~5-10 min)
response2 = client.chat.completions.create(
    model="gpt-5.4-mini",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "Second query"}
    ]
)

# Verify cache hit
print(response2.usage.prompt_tokens_details.cached_tokens)

Python: Anthropic (Explicit)

import anthropic
client = anthropic.Anthropic()

# Enable caching by adding cache_control to system prompt
response = client.messages.create(
    model="claude-sonnet-4-6-20260401",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "Your long system prompt here...",  # 1024+ tokens
            "cache_control": {"type": "ephemeral"}  # 5-minute cache
        }
    ],
    messages=[
        {"role": "user", "content": "User query here"}
    ]
)

# Check cache status
print(f"Cache created: {response.usage.cache_creation_input_tokens}")
print(f"Cache read: {response.usage.cache_read_input_tokens}")

Python: Google (Context Caching)

import google.generativeai as genai

# Step 1: Create cached content
cache = genai.caching.CachedContent.create(
    model="gemini-2.5-flash",
    contents=[{
        "parts": [{"text": "Your very long context here..."}],  # 32,768+ tokens
        "role": "user"
    }],
    ttl="3600s"  # 1 hour TTL
)

# Step 2: Use cached content in requests
model = genai.GenerativeModel.from_cached_content(cache)
response = model.generate_content("Query using the cached context")

# Step 3: Delete cache when done (stop storage charges)
cache.delete()

Node.js / TypeScript: Anthropic

import Anthropic from "@anthropic-ai/sdk";
const client = new Anthropic();

const response = await client.messages.create({
  model: "claude-sonnet-4-6-20260401",
  max_tokens: 1024,
  system: [
    {
      type: "text",
      text: "Your long system prompt...",
      cache_control: { type: "ephemeral" },
    },
  ],
  messages: [{ role: "user", content: "Query" }],
});

console.log("Cache created:", response.usage.cache_creation_input_tokens);
console.log("Cache hit:", response.usage.cache_read_input_tokens);

Multi-Provider Caching via TokenMix.ai

TokenMix.ai's unified API abstracts caching across providers. You mark content as cacheable once, and the platform applies the provider-specific caching mechanism automatically:

# Conceptual example — TokenMix.ai unified API
response = tokenmix.chat.completions.create(
    model="auto",  # Routes to best available model
    messages=[...],
    cache_config={
        "enabled": True,
        "ttl": 300  # 5 minutes
    }
)
# Platform handles OpenAI automatic caching,
# Anthropic cache_control, or Google context caching
# based on which provider serves the request

ROI Calculation: When Caching Pays Off

Formula

Monthly savings = (cacheable_tokens_per_request x requests_per_month x base_price x cache_discount_rate) - cache_write_costs

ROI = monthly_savings / implementation_cost
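The formula translates directly to code; the numbers below are hypothetical and not taken from the example that follows:

```python
def monthly_caching_savings(cacheable_tokens, requests_per_month,
                            base_price_per_m, discount, write_costs):
    """Monthly savings from the guide's formula, in dollars."""
    gross = (cacheable_tokens / 1e6) * requests_per_month \
            * base_price_per_m * discount
    return gross - write_costs

# Hypothetical: 2K cacheable tokens, 100K requests/month, Sonnet-style
# $3.00/M input with 90% cache discount, $1.50 of cold-start write premiums.
savings = monthly_caching_savings(2_000, 100_000, 3.00, 0.90, 1.50)
# (2000/1e6) * 100000 * 3.00 * 0.9 - 1.50 = $538.50
assert abs(savings - 538.50) < 1e-6
```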

Example: SaaS Product with Claude Sonnet 4.6

Setup:

Without caching:

With Anthropic 5-minute caching (90% off hits):

With OpenAI automatic caching (50% off hits):

Anthropic caching saves 82% vs OpenAI's 55%. This is why cache-adjusted pricing comparisons often favor Anthropic despite higher list prices.

Break-Even by Provider

| Provider | Min requests for caching ROI | Notes |
|---|---|---|
| OpenAI | 1 (automatic, free) | Always beneficial when prompt is 1,024+ tokens |
| Anthropic (5min) | 2 requests within 5 minutes | Cache write cost recovered after 1 hit |
| Anthropic (1hr) | 3 requests within 1 hour | Higher write cost needs 2 hits to break even |
| Google | 480+ requests/day (50K context) | Storage fees require sustained high volume |

When Prompt Caching Does Not Help

Caching is not universally beneficial. These scenarios see minimal or no savings:

1. Fully Dynamic Prompts

If every token in your prompt changes between requests (no shared system prompt, no repeated context), there is nothing to cache. This is uncommon in production but happens with certain creative generation workflows.

2. Very Short Prompts

Below 1,024 tokens (OpenAI/Anthropic minimum), caching cannot activate. If your total prompt is 500 tokens, no caching is possible. Solution: pad your system prompt with useful context to exceed the minimum.

3. Low Request Frequency

If requests are spaced more than 5-10 minutes apart (OpenAI) or more than 5 minutes / 1 hour apart (Anthropic), caches expire between requests. Every request is a cold start with cache write costs but no read benefits.

For Anthropic: If your average request interval exceeds 5 minutes, the 5-minute cache will frequently miss. The 1-hour cache costs more to write but may hit more often. Calculate based on your actual request pattern.

4. Output-Dominated Costs

Caching only reduces input costs. If your workload generates much more output than input (e.g., long-form content generation with short prompts), caching has minimal impact on total spend. A workload with 1K input tokens and 10K output tokens saves very little from input caching.
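A quick check of the arithmetic, assuming Sonnet-style input rates and a hypothetical $15.00/M output price:

```python
def input_share(in_tokens, out_tokens, in_price=3.00, out_price=15.00):
    """Fraction of per-request spend that input tokens represent."""
    input_cost = in_tokens / 1e6 * in_price
    output_cost = out_tokens / 1e6 * out_price
    return input_cost / (input_cost + output_cost)

share = input_share(1_000, 10_000)  # 1K input, 10K output
# $0.003 input vs $0.15 output: input is under 2% of spend, so even
# a 90% discount on input trims total cost by less than 2%.
assert share < 0.02
```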

5. Google Caching with Low Volume

Google's hourly storage fee ($4.50/M tokens/hour) makes caching uneconomical unless request volume is high enough to amortize the storage cost. For 50K cached tokens, you need 480+ requests/day just to break even.


Stacking Caching with Other Discounts

The deepest savings come from combining caching with batch processing.

Anthropic: Caching + Batch API

| Discount Layer | Sonnet 4.6 Input/M |
|---|---|
| Standard | $3.00 |
| Cache hit only | $0.30 (90% off) |
| Batch only | $1.50 (50% off) |
| Cache hit + Batch | $0.15 (95% off) |

Sonnet 4.6 input at $0.15/M tokens is cheaper than DeepSeek V4's standard $0.30/M. This is how a "premium" model becomes the budget option for batch workloads with cacheable prompts.
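The stacked figures follow because discounts multiply on the remaining price rather than adding:

```python
def stacked_price(base, *discounts):
    """Apply each discount to what is left after the previous one."""
    price = base
    for d in discounts:
        price *= (1 - d)
    return price

# Anthropic: 90% cache + 50% batch -> 95% total off, not 140%.
assert abs(stacked_price(3.00, 0.90, 0.50) - 0.15) < 1e-9
# OpenAI: 50% cache + 50% batch -> 75% total off.
assert abs(stacked_price(2.50, 0.50, 0.50) - 0.625) < 1e-9
```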

OpenAI: Caching + Batch API

| Discount Layer | GPT-5.4 Input/M |
|---|---|
| Standard | $2.50 |
| Cache hit only | $1.25 (50% off) |
| Batch only | $1.25 (50% off) |
| Cache hit + Batch | $0.625 (75% off) |

OpenAI's stacked discount reaches 75% off — significant, but Anthropic's 95% still wins by a wide margin on input costs.

Cross-Provider Comparison with Full Discount Stack

| Model | Stacked Input Price/M | Standard Input/M | Total Discount |
|---|---|---|---|
| Claude Sonnet 4.6 (cache+batch) | $0.15 | $3.00 | 95% |
| Claude Haiku 4.5 (cache+batch) | $0.05 | $1.00 | 95% |
| GPT-5.4 (cache+batch) | $0.625 | $2.50 | 75% |
| GPT-5.4 Mini (cache+batch) | $0.1875 | $0.75 | 75% |
| DeepSeek V4 (no stacking) | $0.30 | $0.30 | 0% |

Haiku 4.5 with full discount stack at $0.05/M input is the cheapest option from any major provider. This changes the competitive landscape entirely.


Prompt Caching Best Practices

1. Put Cacheable Content First

Caching is prefix-based. The cacheable content must come at the beginning of your prompt. Structure your messages as:

  1. System prompt (static, cacheable)
  2. Few-shot examples (static, cacheable)
  3. Document context (semi-static, cacheable if reused)
  4. Conversation history (growing, partially cacheable)
  5. User query (dynamic, never cached)
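One way to realize this layering with Anthropic's explicit caching is sketched below; Anthropic supports multiple cache_control breakpoints, and the helper function shown is hypothetical:

```python
def build_request(system_text, history, user_query):
    """Request body with two cache breakpoints: one after the static
    system layer, one after the conversation so far. The history
    breakpoint lets each turn extend the cached prefix incrementally."""
    messages = [dict(m) for m in history]
    if messages:
        last = messages[-1]
        last["content"] = [{
            "type": "text",
            "text": last["content"],
            "cache_control": {"type": "ephemeral"},
        }]
    messages.append({"role": "user", "content": user_query})
    return {
        "model": "claude-sonnet-4-6-20260401",
        "max_tokens": 1024,
        "system": [{
            "type": "text",
            "text": system_text,
            "cache_control": {"type": "ephemeral"},
        }],
        "messages": messages,
    }

req = build_request("Long static instructions...",
                    [{"role": "user", "content": "Hi"},
                     {"role": "assistant", "content": "Hello!"}],
                    "What's my order status?")
```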

2. Standardize System Prompts

Even minor whitespace differences break cache matching. Standardize your system prompt generation — use a constant string, not a template with variable formatting. Store the canonical prompt text and reference it rather than rebuilding it per request.
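A minimal illustration (prompt text and helper are hypothetical):

```python
# Anti-pattern: request-specific data in the "static" prefix breaks
# byte-for-byte matching, so every request is a cache miss.
def bad_system_prompt(ticket_id):
    return f"You are a support agent. Current ticket: {ticket_id}. Policies: ..."

# Pattern: one canonical constant, reused verbatim on every request,
# with dynamic data placed in the user message after the cached prefix.
SYSTEM_PROMPT = "You are a support agent. Policies: ..."

assert bad_system_prompt("T-1") != bad_system_prompt("T-2")  # guaranteed miss
```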

3. Monitor Cache Hit Rates

Track cached_tokens (OpenAI) or cache_read_input_tokens (Anthropic) in every response. If your hit rate drops below 80%, investigate: prompts may be changing unexpectedly, request frequency may be too low, or cache eviction may be aggressive.
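A minimal monitoring sketch over OpenAI-style usage records (field names follow the response shape shown earlier; a real implementation would also handle Anthropic's cache_read_input_tokens):

```python
def cache_hit_rate(usage_records):
    """Fraction of prompt tokens served from cache across requests."""
    prompt = sum(r["prompt_tokens"] for r in usage_records)
    cached = sum(r["cached_tokens"] for r in usage_records)
    return cached / prompt if prompt else 0.0

records = [
    {"prompt_tokens": 2048, "cached_tokens": 0},     # cold start
    {"prompt_tokens": 2048, "cached_tokens": 1536},  # hit
    {"prompt_tokens": 2048, "cached_tokens": 1536},  # hit
]
rate = cache_hit_rate(records)  # 3072 / 6144 = 0.5, below the 80% target
assert rate == 0.5
```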

Target hit rates depend on the use case: chatbots with a stable system prompt and steady traffic should sit near the top of the range, while RAG workloads with rotating document context will naturally run lower.

4. Cache the Largest Stable Content Blocks

Maximize savings by caching the largest blocks of content that remain stable across requests. A 10,000-token few-shot example block saves 10x more per cache hit than a 1,000-token system prompt.

5. Use Anthropic 1-Hour Cache for Infrequent Workloads

If your requests are spaced 5-30 minutes apart, the 5-minute cache will frequently miss. The 1-hour cache costs more to write (2x vs 1.25x) but dramatically improves hit rates for workloads with moderate request frequency.

6. Delete Google Caches When Done

Google charges hourly storage. If you create a cache for a batch job, delete it immediately after the batch completes. Forgetting to delete is the Google equivalent of the "zombie model" problem in fine-tuning.


How to Choose a Caching Strategy

| Your situation | Recommended approach | Why |
|---|---|---|
| Any workload with 1,024+ token prompts on OpenAI | Do nothing (automatic) | Caching is free and automatic |
| High-volume workload on Anthropic | Enable 5-minute cache | 90% savings, breaks even after 1 hit |
| Infrequent requests on Anthropic (5-30 min intervals) | Use 1-hour cache | Higher write cost but better hit rate |
| Very large context (100K+) with high volume on Google | Use context caching | Storage fee justified by volume |
| Large context, low volume on Google | Skip caching | Storage fees exceed savings |
| Batch workloads on Anthropic/OpenAI | Cache + Batch API | Stack discounts for up to 95% off |
| Multi-provider routing | Use TokenMix.ai | Automatic provider-appropriate caching |
| Output-dominated workloads | Caching has limited value | Focus on model routing or output optimization |

Conclusion

Prompt caching is not optional for production AI APIs — it is the difference between viable and unsustainable economics. Anthropic's 90% cache discount makes Sonnet 4.6 input cheaper than DeepSeek V4 at scale. OpenAI's automatic 50% discount requires zero effort. Google's storage-based model suits high-volume, large-context workloads.

The implementation cost is minimal — one field in your Anthropic requests, zero changes for OpenAI, and a few API calls for Google. The ROI is immediate for any workload exceeding 2 requests per cache window with shared context.

TokenMix.ai applies caching automatically across providers through its unified API, ensuring you always get the best available discount regardless of which model serves your request. For teams managing multiple providers, this eliminates the need to implement and monitor three different caching systems.

Every cost comparison in AI APIs should be made on cache-adjusted prices, not list prices. A model that looks expensive at list price may be the cheapest option after caching. Run the numbers with your actual request patterns, and let the math guide your architecture.


FAQ

What is prompt caching in AI APIs?

Prompt caching stores the processed state of repeated input tokens so the model does not recompute them on subsequent requests. When the beginning of a new prompt matches a previously cached prefix, the provider serves those tokens from cache at a reduced price — 50% off at OpenAI, 90% off at Anthropic, and 75% off at Google.

How much does prompt caching save?

Savings depend on the cache hit rate and the proportion of cacheable tokens. With a 90% cache hit rate and 80% cacheable tokens, Anthropic's 90% discount reduces input costs by roughly 65% (0.8 x 0.9 x 0.9), and OpenAI's 50% discount reduces them by roughly 36% under the same conditions. Real-world savings range from 40-85% of input token costs.

Does OpenAI prompt caching require code changes?

No. OpenAI caching is fully automatic for prompts with 1,024 or more tokens. No API changes, no configuration, no additional fields. Caching activates automatically and the discount applies when cache hits occur. You can verify caching through the cached_tokens field in the response usage object.

How long does a prompt cache last?

OpenAI: approximately 5-10 minutes of inactivity, automatically managed. Anthropic: either 5 minutes or 1 hour, depending on which cache tier you select. Google: until the TTL you set expires or you manually delete the cache. Keep request frequency above the cache expiration interval to maintain high hit rates.

Can prompt caching and batch API discounts be combined?

Yes, at both OpenAI and Anthropic. Anthropic's combined discount reaches 95% off input (cache 90% + batch 50% stacked). OpenAI's combined discount reaches 75% off input (cache 50% + batch 50% stacked). This is the deepest discount available from any major AI provider.

Is prompt caching worth it for small workloads?

For OpenAI, yes — caching is free and automatic, so any workload benefits. For Anthropic, you need at least 2 requests within a 5-minute window using the same prompt prefix to break even on the 5-minute cache write cost. For Google, you need hundreds of daily requests to justify the hourly storage fee. If your request volume is very low, OpenAI's automatic caching or Anthropic's 5-minute cache are the most accessible options.


Author: TokenMix Research Lab | Last Updated: April 2026 | Data Source: OpenAI Prompt Caching Docs, Anthropic Prompt Caching Docs, Google Context Caching Docs, TokenMix.ai