TokenMix Research Lab · 2026-04-21

AI Gateway Caching 2026: Why L1 + L2 Layers Cut 90% API Cost


AI gateway caching is not one feature — it's two completely different mechanisms that teams keep conflating. L1 (result cache) skips the upstream model entirely and saves 100% per hit. L2 (prompt cache) lives inside the model provider and cuts the cost of cached prefix tokens by 50-90% per call. Most production teams using aggregation platforms like OpenRouter, Portkey, or TokenMix.ai get only L2 savings by default. Adding an L1 layer on top — either via Helicone or self-hosted Redis — compounds the two for cost reductions up to 95% on repetitive workloads.

This guide separates the two layers with real 2026 pricing data, shows where each fits, and walks through the architecture patterns that actually work in production.


Quick Comparison: L1 vs L2 Caching

| Dimension | L1 — Result Cache | L2 — Prompt Cache |
| --- | --- | --- |
| What it caches | Full model response | KV state of prompt prefix |
| Does it call upstream? | No — response returned locally | Yes — model still generates output |
| Savings per hit | 100% (input + output tokens + API call) | 50-90% on cached input tokens only |
| Hit trigger | Exact or semantic match of prompt | Same prefix across requests |
| Who runs it | Proxy layer (Helicone, custom Redis) | Model provider (Claude, OpenAI, DeepSeek, Gemini) |
| Integration | 1-line SDK base_url swap or custom | Automatic for some, explicit for others |
| Latency reduction | Sub-100ms vs 300-3000ms upstream | Reduced prefill time, typically 30-60% faster |
| Stale risk | High — stale responses if invalidation is weak | None — underlying data always fresh |
| Best for | Repetitive queries (support bots, FAQs) | Large system prompts, RAG contexts |

Neither cache replaces the other. They target different savings. L1 kills the upstream call; L2 reduces the cost when you do call upstream.

Why Aggregation Platforms Need Caching

An AI gateway — OpenRouter, Portkey, TokenMix.ai, Helicone AI Gateway — fundamentally does four things: auth, routing, forwarding, billing. Without caching, every dimension suffers:

Cost scales linearly. Every request hits an upstream provider. If your workload has 40% duplicate or near-duplicate queries (the empirical median across chatbots, support agents, and RAG systems we track through TokenMix.ai), you're paying full price for 40% of requests that didn't need to touch the model.

Latency compounds. An aggregation platform already adds 5-30ms of forwarding overhead. Without caching, you then pay the upstream provider's full latency — 300-3000ms depending on model and prompt length. A cache hit turns that into 50-100ms end-to-end, which is the difference between "feels like a website" and "feels like a conversation."

Rate limits bite harder. Every upstream provider enforces RPM/TPM limits. Cached requests don't consume upstream rate-limit budget, so it's common for teams to reach 10x their effective RPM once caching is in place.

Failure modes multiply. An aggregation platform that doesn't cache has zero degraded-mode capability when upstream is down. With a cache, 30-50% of requests continue serving from the cache layer during outages.

This is why caching isn't a nice-to-have feature for aggregation platforms — it's structural to the product's value proposition. The question is which layer(s) the platform implements.

L1: Result Caching (Exact + Semantic)

L1 is the simplest mental model: the gateway remembers past responses and returns them for matching new requests. Two variants:

Exact match. The cache key is a hash of the entire request (model + messages + parameters). If two users send byte-identical requests, the second one hits cache.

Semantic match. The cache key is the vector embedding of the prompt. Similar prompts — "What is photosynthesis?" vs "Explain photosynthesis" — match within a cosine similarity threshold. Requires an embedding model and vector store, adds 30-80ms to check cache, but dramatically increases hit rate.
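Both variants can be sketched in a few lines. This is an illustrative stand-alone sketch, not any vendor's implementation: the exact key is a hash of the canonicalized request, and the semantic lookup compares a query embedding against stored embeddings (the vectors, threshold, and store layout here are toy assumptions; production systems use a real embedding model and a vector database).

```python
import hashlib
import json
import math

def exact_cache_key(model: str, messages: list, **params) -> str:
    """Exact-match key: SHA-256 of the canonicalized request.
    Key order and whitespace are normalized so logically identical
    requests always hash to the same key."""
    canonical = json.dumps(
        {"model": model, "messages": messages, "params": params},
        sort_keys=True, separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode()).hexdigest()

def cosine_similarity(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def semantic_lookup(query_vec, store, threshold=0.85):
    """Semantic match: return the cached response whose stored embedding
    is closest to the query, if it clears the similarity threshold."""
    best_key, best_sim = None, threshold
    for key, (vec, _response) in store.items():
        sim = cosine_similarity(query_vec, vec)
        if sim >= best_sim:
            best_key, best_sim = key, sim
    return store[best_key][1] if best_key else None
```

Note that parameter order doesn't matter for the exact key (canonicalization handles it), but any change to temperature, tools, or the system prompt produces a different key, which is exactly the miss behavior you want.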

Who runs L1 caching in 2026

Helicone — the dominant L1 provider for aggregation use. Point your SDK at https://oai.helicone.ai/v1 or use their AI Gateway directly. Supports Redis (ultra-fast in-memory) or S3 (persistent) backends, with intelligent invalidation that respects model updates and context windows (Helicone caching docs). Reported cost reduction 20-30% on typical workloads, up to 95% for highly repetitive use cases like customer support bots and code assistants (Helicone cost reduction guide).

Self-hosted Redis + embedding. Rolling your own is a 1-2 engineer-week project. Tradeoff: full control over cache policy, but you own the ops burden (invalidation, memory eviction, embedding model availability).

OpenRouter, Portkey, TokenMix.ai. These are pass-through gateways — they forward your request to the upstream provider so L2 prompt caching fires. They do not run their own L1 result cache. If you want L1 + their routing, you stack Helicone or Redis in front.

L1 trade-offs

The big upside is obvious: 100% savings per hit. The big downside is staleness risk: a cached response keeps being served after the underlying content changes, until invalidation or the TTL catches up.

Practical rule: enable L1 for paths where temperature=0 and the knowledge base doesn't change within the TTL window. Disable it elsewhere.

L2: Prompt Caching at the Provider Level

L2 is what OpenAI, Anthropic, Google, and DeepSeek all ship under various names — "prompt caching," "context caching," "KV cache" — with similar mechanics:

  1. You send a long prompt with a large stable prefix (system prompt, tools, documents).
  2. The provider computes KV state for the prefix once, stores it in a hot cache.
  3. Subsequent requests with the same prefix skip prefix computation, paying for cache read instead of full input.

Critically, the model still generates output every call — it's not an L1 cache, it's a speedup for the prefix portion only. But input-token savings are dramatic: 50-90% depending on provider.
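For the explicit variant, the request shape matters: the stable prefix is marked for caching and the per-request content stays in the suffix. Here's a sketch of a Claude-style Messages API payload; the model id and document text are placeholders, so check field names against Anthropic's current API reference before relying on them.

```python
# Sketch: explicit prompt caching via cache_control (Anthropic-style).
# The system block (stable prefix) is cached; the user message (unique
# suffix) is computed fresh on every call.
request_body = {
    "model": "claude-sonnet-4-6",  # illustrative model id
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": "You are a support agent for ExampleCo. <long policy document here>",
            "cache_control": {"type": "ephemeral"},  # cache this prefix's KV state
        }
    ],
    "messages": [
        # Unique per request — never part of the cached prefix.
        {"role": "user", "content": "What is the refund window?"}
    ],
}
```

The first call pays the cache-write premium on the system block; every subsequent call within the TTL that reuses the identical prefix pays the discounted cache-read rate instead.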

L2 pricing across the four majors (April 2026)

| Provider | Base input | Cache read | Cache write | Auto? |
| --- | --- | --- | --- | --- |
| Claude Sonnet 4.6 | $3/M | $0.30/M (90% off) | $3.75/M (25% premium, 5-min TTL) | Explicit — must set cache_control |
| Claude Opus 4.6 | $5/M | $0.50/M (90% off) | $6.25/M (5-min) / $10/M (1-hour) | Explicit |
| DeepSeek V3.2 | $0.28/M | $0.028/M (90% off) | Same as base (no write premium) | Automatic, zero config |
| OpenAI GPT-5.4 | $2.50/M | $0.25/M (90% off) | Same as base | Automatic for prompts ≥1024 tokens |
| Gemini 3.1 Pro | $2/M (≤200K) | ~25% discount | Storage $4.50/M per hour | Explicit via cachedContents.create |

Two design patterns worth noting:

Automatic caching (OpenAI, DeepSeek): zero code changes, cache fires when eligible prefix detected. Friction-free but less control.

Explicit caching (Claude, Gemini): you mark which content to cache with cache_control (Claude) or an explicit cache resource (Gemini). More code, but precise control over TTL and which prefix to cache.

Break-even math for Claude's explicit cache

Claude's cache write costs 25% more than base input (for 5-min TTL) or 100% more (for 1-hour TTL). Each cache read then saves 90% of the base input price, so a single read covers the 5-minute write premium, and two reads cover the 1-hour premium.

For a RAG system that answers multiple questions against the same document within 5 minutes, Claude caching recoups instantly. For batched nightly processing of static context, 1-hour TTL is worth the 2x write premium because you amortize over hundreds of reads.
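The break-even arithmetic above can be checked directly. Prices are normalized to the base input rate (1.0), matching the percentages stated in this section:

```python
import math

BASE = 1.0          # base input price, normalized
READ = 0.10         # cache read: 90% off base
WRITE_5MIN = 1.25   # 25% write premium (5-minute TTL)
WRITE_1H = 2.00     # 100% write premium (1-hour TTL)

def breakeven_reads(write_price: float) -> int:
    """Cache reads needed before the write premium pays for itself.
    Each read saves (BASE - READ) vs re-sending the prefix uncached."""
    premium = write_price - BASE
    return math.ceil(premium / (BASE - READ))

print(breakeven_reads(WRITE_5MIN))  # 1 — one read covers the 5-min premium
print(breakeven_reads(WRITE_1H))    # 2 — two reads cover the 1-hour premium
```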

How aggregation platforms handle L2

If you send a Claude request with cache_control through OpenRouter, Portkey, or TokenMix.ai, the gateway forwards the cache directive unchanged to Anthropic. Anthropic applies the cache discount. The gateway bills you the discounted price (plus whatever markup they charge).

This only works if the gateway doesn't rewrite your request in ways that break prefix stability. The gateway's OpenAI-compatible translation layer must preserve the exact prefix hash. TokenMix.ai preserves prefix stability by design, which is why the Claude prompt cache fires as expected when routing Claude traffic through the unified endpoint.

L3: Observability — Seeing the Savings

Caching only helps if you can measure hit rate and cost impact. This is L3: dashboards that break out cached vs uncached tokens, hit rate over time, and cost-per-request before/after cache.

What good L3 looks like: a per-request breakdown of cached vs uncached tokens, cache hit rate trended over time, and cost-per-request before and after the cache.

Who provides L3 in 2026: Helicone, TokenMix.ai, and Langfuse all expose cache hit rates and cached-token counts in their dashboards.

Without L3, you can't verify your L1 invalidation is correct or your L2 cache is actually firing. Running caching blind is worse than no caching — you think you're saving money but may be serving stale responses or missing cache opportunities entirely.

Real Cost Math: 10M Requests/Month Scenarios

Let's work three realistic scenarios. Assumptions: average prompt 4,000 input tokens, average output 500 tokens, model Claude Sonnet 4.6 ($3/M input, $15/M output).

Scenario A: No caching (baseline)

Input: 10M × 4,000 tokens × $3/M = $120,000. Output: 10M × 500 tokens × $15/M = $75,000. Total: $195,000/month.

Scenario B: L2 only (native Claude prompt cache)

Assume 3,500 of the 4K input tokens are stable prefix (system prompt + document context) and 500 are unique per request, with an 80% cache hit rate on the prefix (typical for RAG). Input: 10M × [0.8 × (3,500 × $0.30/M + 500 × $3/M) + 0.2 × (3,500 × $3.75/M + 500 × $3/M)] ≈ $49,650. Output is unchanged at $75,000, for a total of roughly $124,650, about 36% below baseline.

Scenario C: L1 + L2 stacked

Assume 25% of total requests are L1-cacheable (repetitive / FAQ-like) via exact or semantic match; the remaining 75% go through L2 on Claude. The 2.5M L1 hits cost nothing upstream. The other 7.5M requests follow the Scenario B math: ≈ $37,240 input + $56,250 output, for roughly $93,490 total, about 52% below baseline.

The compound savings of L1 + L2 aren't additive; they multiply. The 25% of requests L1 absorbs never reach the L2 system, so the L1 savings aren't diluted.
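The three scenarios can be reproduced with a short sketch. The figures use this article's stated assumptions (prices, token counts, hit rates) and ignore the marginal cost of running the L1 cache itself:

```python
REQUESTS = 10_000_000
PREFIX_TOK, UNIQUE_TOK, OUT_TOK = 3_500, 500, 500
IN_PRICE, OUT_PRICE = 3.00, 15.00      # $/M tokens (Claude Sonnet-class)
READ_PRICE, WRITE_PRICE = 0.30, 3.75   # L2 cache read / write, $/M
L2_HIT, L1_HIT = 0.80, 0.25            # prefix hit rate / L1 absorption
M = 1_000_000

def baseline(n):
    """Scenario A: every token billed at full price."""
    return n * ((PREFIX_TOK + UNIQUE_TOK) * IN_PRICE + OUT_TOK * OUT_PRICE) / M

def with_l2(n):
    """Scenario B math: prefix billed at read price on hits, write price on misses."""
    hit = PREFIX_TOK * READ_PRICE + UNIQUE_TOK * IN_PRICE
    miss = PREFIX_TOK * WRITE_PRICE + UNIQUE_TOK * IN_PRICE
    inp = n * (L2_HIT * hit + (1 - L2_HIT) * miss) / M
    return inp + n * OUT_TOK * OUT_PRICE / M

a = baseline(REQUESTS)                # Scenario A
b = with_l2(REQUESTS)                 # Scenario B: L2 only
c = with_l2(REQUESTS * (1 - L1_HIT))  # Scenario C: L1 absorbs 25% for free
print(f"A=${a:,.0f}  B=${b:,.0f}  C=${c:,.0f}  C saves {1 - c / a:.0%}")
```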

Architecture Patterns That Actually Work

Pattern 1: Helicone proxy → vendor direct

[Your App] → [Helicone (L1 cache)] → [Anthropic/OpenAI/etc (L2 cache)]

Single proxy layer. Helicone handles L1 semantic cache and observability. Upstream provider handles L2. Best for teams already committed to one or two vendors.

Pattern 2: Gateway → Helicone → vendors (multi-model)

[Your App] → [TokenMix.ai (routing + L2 passthrough)] → [Helicone (L1)] → [vendors]

Gateway handles model routing and failover; Helicone adds L1. More hops but supports multi-vendor architecture with both cache layers.

Pattern 3: Self-hosted L1 + TokenMix.ai

[Your App] → [Redis/own L1 cache] → [TokenMix.ai (routing + L2 passthrough)] → [vendors]

Fine-grained control over L1 policy (TTL, invalidation, semantic thresholds). TokenMix.ai handles the model-routing complexity and preserves L2 cache directives. Best for teams with specific compliance or data-residency requirements that prevent using Helicone's managed proxy.

Pattern 4: Vendor-only (no aggregation, no L1)

[Your App] → [Vendor API directly]

L2 only. Simplest but no multi-model flexibility, no L1 savings. Fine for single-vendor deployments that don't need routing.

How to Choose

Your situation Pattern Why
Single vendor, want max simplicity Pattern 4 No extra hops, vendor's L2 auto-fires
Single vendor, want L1 savings Pattern 1 Helicone adds the missing L1 layer
Multi-vendor with routing needs Pattern 2 TokenMix.ai + Helicone combined
Strict compliance/data residency Pattern 3 Self-hosted L1 keeps all cache on-prem
High-volume repetitive workload (support bots, FAQ) Any pattern with L1 L1 savings are the big win here
Low-volume or highly dynamic content Skip L1, enable L2 only L1 stale-risk outweighs benefit

Conclusion

Caching in AI gateways is not a single feature to toggle — it's two layers with different economics, different failure modes, and different integration patterns. Understand them separately, then decide which layers make sense for your workload.

The practical rule for 2026: always enable L2 (it's free money on Claude, OpenAI, DeepSeek for any prompt with stable prefix). Add L1 when your workload has genuine repetition and your content tolerates TTL-bounded staleness. Combining the two can cut LLM bills 50-60% on realistic production workloads.

TokenMix.ai preserves L2 cache directives across 300+ models through one OpenAI-compatible endpoint, and surfaces cached token counts per request so you can verify caching is actually firing. Stack Helicone in front for L1 when the workload warrants it. The two-layer architecture is the cheapest production AI stack you can run in 2026.

This article is also available on Dev.to for developer community discussion with code examples.

FAQ

Q1: What's the difference between prompt caching and result caching?

Prompt caching (L2) is a vendor-side optimization that stores KV state for prompt prefixes — the model still generates output every call, but prefix computation is skipped, saving 50-90% on cached input tokens. Result caching (L1) is a proxy-side feature that returns previous responses for matching requests, skipping the model call entirely — 100% savings per hit but requires tolerating some staleness.

Q2: Does OpenRouter cache my requests?

OpenRouter forwards requests to upstream providers without running its own L1 result cache. Provider-side L2 prompt caching works through OpenRouter because the gateway preserves request prefix stability. For actual L1 caching (skipping upstream calls entirely), you need to add Helicone or a self-hosted Redis layer.

Q3: How much does Claude prompt caching actually save?

Claude cache read tokens cost 10% of base input price — a 90% discount. Cache writes cost 25% more than base input for 5-minute TTL, or 100% more for 1-hour TTL. The break-even is 1 cache read for 5-minute TTL and 2 reads for 1-hour TTL. For RAG systems with stable document context, savings are typically 60-85% on total input cost.

Q4: Is DeepSeek caching automatic?

Yes. DeepSeek's context caching on disk is enabled by default for all users. Any request with a cacheable prefix automatically hits the cache, with cache read tokens priced at about 10% of base input — a 90% discount. No code changes required, no write premium.

Q5: When should I use semantic cache vs exact cache?

Exact cache for high-volume deterministic workloads where requests are truly byte-identical (e.g., programmatic RAG over a fixed corpus). Semantic cache for user-facing chatbots where people phrase the same question many ways ("How does X work?" vs "Explain X"). Semantic cache has 10-30x higher hit rate in natural language contexts but costs 30-80ms per lookup to compute the embedding comparison.

Q6: Can I combine L1 result cache with L2 prompt cache?

Yes, and this is the optimal configuration for repetitive workloads. L1 absorbs exact and semantic matches, never touching the upstream. L2 fires on L1 misses, reducing the cost of prompts that do reach the model. The combined savings on a typical 10M requests/month workload is 50-60% vs baseline uncached, significantly more than either layer alone.

Q7: Does caching work with streaming responses?

L2 prompt caching works with streaming — the cache affects the prefill phase before tokens start streaming. L1 result caching is compatible but more complex: you need to store the full response (or complete chunk sequence) and replay it. Helicone handles this automatically. Self-hosted implementations need to buffer the stream to cache and replay from buffer on hits.
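The buffer-and-replay pattern for self-hosted L1 streaming is small but easy to get wrong. A minimal sketch, assuming an in-memory dict cache and an iterable upstream stream; a production version would also need TTLs, eviction, and care not to cache partially failed streams:

```python
def stream_with_cache(cache: dict, key: str, upstream_stream):
    """Replay a cached chunk sequence on a hit; otherwise stream from
    upstream while buffering chunks, storing the buffer only once the
    stream completes successfully."""
    if key in cache:
        yield from cache[key]       # cache hit: replay stored chunks
        return
    buffer = []
    for chunk in upstream_stream:   # cache miss: tee the live stream
        buffer.append(chunk)
        yield chunk
    cache[key] = buffer             # store only the complete sequence
```

Because the buffer is written after the loop finishes, an upstream error mid-stream raises out of the generator and nothing incomplete is cached.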

Q8: How do I measure whether my cache is actually working?

Check two metrics: (1) cache hit rate (percentage of requests served from cache), and (2) cached tokens reported in provider response metadata (for L2). Without an observability layer, you're flying blind. Helicone, TokenMix.ai, and Langfuse all expose these metrics in dashboards. For self-hosted caches, log cache hit/miss explicitly and aggregate in your monitoring stack.


Sources

Data collected 2026-04-21. LLM cache pricing and mechanisms are actively iterated by every major provider — verify specific numbers against vendor pricing pages before making architectural decisions based on dollar figures in this article.


By TokenMix Research Lab · Updated 2026-04-21