TokenMix Research Lab · 2026-04-21

AI Gateway Caching 2026: Why L1 + L2 Layers Cut 90% API Cost


AI gateway caching is not one feature — it's two completely different mechanisms that teams keep conflating. L1 (result cache) skips the upstream model entirely and saves 100% per hit. L2 (prompt cache) lives inside the model provider and cuts the cost of cached prefix tokens by 50-90% per call. Most production teams using aggregation platforms like OpenRouter, Portkey, or TokenMix.ai get only L2 savings by default. Adding an L1 layer on top — either via Helicone or self-hosted Redis — compounds the two for cost reductions up to 95% on repetitive workloads.

This guide separates the two layers with real 2026 pricing data, shows where each fits, and walks through the architecture patterns that actually work in production.


Quick Comparison: L1 vs L2 Caching

| Dimension | L1 — Result Cache | L2 — Prompt Cache |
| --- | --- | --- |
| What it caches | Full model response | KV state of prompt prefix |
| Does it call upstream? | No — response returned locally | Yes — model still generates output |
| Savings per hit | 100% (input + output tokens + API call) | 50-90% on cached input tokens only |
| Hit trigger | Exact or semantic match of prompt | Same prefix across requests |
| Who runs it | Proxy layer (Helicone, custom Redis) | Model provider (Claude, OpenAI, DeepSeek, Gemini) |
| Integration | 1-line SDK base_url swap or custom | Automatic for some, explicit for others |
| Latency reduction | Sub-100ms vs 300-3000ms upstream | Reduced prefill time, typically 30-60% faster |
| Stale risk | High — stale responses if invalidation is weak | None — underlying data always fresh |
| Best for | Repetitive queries (support bots, FAQs) | Large system prompts, RAG contexts |

Neither cache replaces the other. They target different savings. L1 kills the upstream call; L2 reduces the cost when you do call upstream.

Why Aggregation Platforms Need Caching

An AI gateway — OpenRouter, Portkey, TokenMix.ai, Helicone AI Gateway — fundamentally does four things: auth, routing, forwarding, billing. Without caching, every dimension suffers:

Cost scales linearly. Every request hits an upstream provider. If your workload has 40% duplicate or near-duplicate queries (the empirical median across chatbots, support agents, and RAG systems we track through TokenMix.ai), you're paying full price for 40% of requests that didn't need to touch the model.

Latency compounds. An aggregation platform already adds 5-30ms of forwarding overhead. Without caching, you then pay the upstream provider's full latency — 300-3000ms depending on model and prompt length. A cache hit turns that into 50-100ms end-to-end, which is the difference between "feels like a website" and "feels like a conversation."

Rate limits bite harder. Every upstream provider enforces RPM/TPM limits. Cached requests don't consume upstream rate-limit budget, so it's common for teams to reach 10x their effective RPM once caching is in place.

Failure modes multiply. An aggregation platform that doesn't cache has zero degraded-mode capability when upstream is down. With a cache, 30-50% of requests continue serving from the cache layer during outages.

This is why caching isn't a nice-to-have feature for aggregation platforms — it's structural to the product's value proposition. The question is which layer(s) the platform implements.

L1: Result Caching (Exact + Semantic)

L1 is the simplest mental model: the gateway remembers past responses and returns them for matching new requests. Two variants:

Exact match. The cache key is a hash of the entire request (model + messages + parameters). If two users send byte-identical requests, the second one hits cache.

Semantic match. The cache key is the vector embedding of the prompt. Similar prompts — "What is photosynthesis?" vs "Explain photosynthesis" — match within a cosine similarity threshold. Requires an embedding model and vector store, adds 30-80ms to check cache, but dramatically increases hit rate.
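Both variants can be sketched in a few lines. This is an illustrative stand-alone sketch, not any vendor's implementation: the exact key is a hash of the canonicalized request, and the semantic lookup compares a query embedding against stored embeddings (the vectors, threshold, and store layout here are toy assumptions; production systems use a real embedding model and a vector database).

```python
import hashlib
import json
import math

def exact_cache_key(model: str, messages: list, **params) -> str:
    """Exact-match key: SHA-256 of the canonicalized request.
    Key order and whitespace are normalized so logically identical
    requests always hash to the same key."""
    canonical = json.dumps(
        {"model": model, "messages": messages, "params": params},
        sort_keys=True, separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode()).hexdigest()

def cosine_similarity(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def semantic_lookup(query_vec, store, threshold=0.85):
    """Semantic match: return the cached response whose stored embedding
    is closest to the query, if it clears the similarity threshold."""
    best_key, best_sim = None, threshold
    for key, (vec, _response) in store.items():
        sim = cosine_similarity(query_vec, vec)
        if sim >= best_sim:
            best_key, best_sim = key, sim
    return store[best_key][1] if best_key else None
```

Note that parameter order doesn't matter for the exact key (canonicalization handles it), but any change to temperature, tools, or the system prompt produces a different key, which is exactly the miss behavior you want.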

Who runs L1 caching in 2026

Helicone — the dominant L1 provider for aggregation use. Point your SDK at https://oai.helicone.ai/v1 or use their AI Gateway directly. Supports Redis (ultra-fast in-memory) or S3 (persistent) backends, with intelligent invalidation that respects model updates and context windows (Helicone caching docs). Reported cost reduction 20-30% on typical workloads, up to 95% for highly repetitive use cases like customer support bots and code assistants (Helicone cost reduction guide).

Self-hosted Redis + embedding. Rolling your own is a 1-2 engineer-week project. Tradeoff: full control over cache policy, but you own the ops burden (invalidation, memory eviction, embedding model availability).

OpenRouter, Portkey, TokenMix.ai. These are pass-through gateways — they forward your request to the upstream provider so L2 prompt caching fires. They do not run their own L1 result cache. If you want L1 + their routing, you stack Helicone or Redis in front.

L1 trade-offs

The big upside is obvious: 100% savings per hit. The big downside is staleness risk: a cached response keeps being served after the underlying content changes, until invalidation or the TTL catches up.

Practical rule: enable L1 for paths where temperature=0 and the knowledge base doesn't change within the TTL window. Disable it elsewhere.

L2: Prompt Caching at the Provider Level

L2 is what OpenAI, Anthropic, Google, and DeepSeek all ship under various names — "prompt caching," "context caching," "KV cache" — with similar mechanics:

  1. You send a long prompt with a large stable prefix (system prompt, tools, documents).
  2. The provider computes KV state for the prefix once, stores it in a hot cache.
  3. Subsequent requests with the same prefix skip prefix computation, paying for cache read instead of full input.

Critically, the model still generates output every call — it's not an L1 cache, it's a speedup for the prefix portion only. But input-token savings are dramatic: 50-90% depending on provider.
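For the explicit variant, the request shape matters: the stable prefix is marked for caching and the per-request content stays in the suffix. Here's a sketch of a Claude-style Messages API payload; the model id and document text are placeholders, so check field names against Anthropic's current API reference before relying on them.

```python
# Sketch: explicit prompt caching via cache_control (Anthropic-style).
# The system block (stable prefix) is cached; the user message (unique
# suffix) is computed fresh on every call.
request_body = {
    "model": "claude-sonnet-4-6",  # illustrative model id
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": "You are a support agent for ExampleCo. <long policy document here>",
            "cache_control": {"type": "ephemeral"},  # cache this prefix's KV state
        }
    ],
    "messages": [
        # Unique per request — never part of the cached prefix.
        {"role": "user", "content": "What is the refund window?"}
    ],
}
```

The first call pays the cache-write premium on the system block; every subsequent call within the TTL that reuses the identical prefix pays the discounted cache-read rate instead.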

L2 pricing across the four majors (April 2026)

| Provider | Base input | Cache read | Cache write | Auto? |
| --- | --- | --- | --- | --- |
| Claude Sonnet 4.6 | $3/M | $0.30/M (90% off) | $3.75/M (25% premium, 5-min TTL) | Explicit — must set cache_control |
| Claude Opus 4.6 | $5/M | $0.50/M (90% off) | $6.25/M (5-min) / $10/M (1-hour) | Explicit |
| DeepSeek V3.2 | $0.28/M | $0.028/M (90% off) | Same as base (no write premium) | Automatic, zero config |
| OpenAI GPT-5.4 | $2.50/M | $0.25/M (90% off) | Same as base | Automatic for prompts ≥1024 tokens |
| Gemini 3.1 Pro | $2/M (≤200K) | ~25% discount | Storage $4.50/M per hour | Explicit via cachedContents.create |

Two design patterns worth noting:

Automatic caching (OpenAI, DeepSeek): zero code changes, cache fires when eligible prefix detected. Friction-free but less control.

Explicit caching (Claude, Gemini): you mark which content to cache with cache_control (Claude) or an explicit cache resource (Gemini). More code, but precise control over TTL and which prefix to cache.

Break-even math for Claude's explicit cache

Claude's cache write costs 25% more than base input (for 5-min TTL) or 100% more (for 1-hour TTL). Each cache read then saves 90% of the base input price, so a single read covers the 5-minute write premium, and two reads cover the 1-hour premium.

For a RAG system that answers multiple questions against the same document within 5 minutes, Claude caching recoups instantly. For batched nightly processing of static context, 1-hour TTL is worth the 2x write premium because you amortize over hundreds of reads.
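The break-even arithmetic above can be checked directly. Prices are normalized to the base input rate (1.0), matching the percentages stated in this section:

```python
import math

BASE = 1.0          # base input price, normalized
READ = 0.10         # cache read: 90% off base
WRITE_5MIN = 1.25   # 25% write premium (5-minute TTL)
WRITE_1H = 2.00     # 100% write premium (1-hour TTL)

def breakeven_reads(write_price: float) -> int:
    """Cache reads needed before the write premium pays for itself.
    Each read saves (BASE - READ) vs re-sending the prefix uncached."""
    premium = write_price - BASE
    return math.ceil(premium / (BASE - READ))

print(breakeven_reads(WRITE_5MIN))  # 1 — one read covers the 5-min premium
print(breakeven_reads(WRITE_1H))    # 2 — two reads cover the 1-hour premium
```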

How aggregation platforms handle L2

If you send a Claude request with cache_control through OpenRouter, Portkey, or TokenMix.ai, the gateway forwards the cache directive unchanged to Anthropic. Anthropic applies the cache discount. The gateway bills you the discounted price (plus whatever markup they charge).

This only works if the gateway doesn't rewrite your request in ways that break prefix stability. The gateway's OpenAI-compatible translation layer must preserve the exact prefix hash. TokenMix.ai preserves prefix stability by design, which is why the Claude prompt cache fires as expected when routing Claude traffic through the unified endpoint.

L3: Observability — Seeing the Savings

Caching only helps if you can measure hit rate and cost impact. This is L3: dashboards that break out cached vs uncached tokens, hit rate over time, and cost-per-request before/after cache.

What good L3 looks like: a per-request breakdown of cached vs uncached tokens, cache hit rate trended over time, and cost-per-request before and after the cache.

Who provides L3 in 2026: Helicone, TokenMix.ai, and Langfuse all expose cache hit rates and cached-token counts in their dashboards.

Without L3, you can't verify your L1 invalidation is correct or your L2 cache is actually firing. Running caching blind is worse than no caching — you think you're saving money but may be serving stale responses or missing cache opportunities entirely.

Real Cost Math: 10M Requests/Month Scenarios

Let's work three realistic scenarios. Assumptions: average prompt 4,000 input tokens, average output 500 tokens, model Claude Sonnet 4.6 ($3/M input, $15/M output).

Scenario A: No caching (baseline)

Input: 10M × 4,000 tokens × $3/M = $120,000. Output: 10M × 500 tokens × $15/M = $75,000. Total: $195,000/month.

Scenario B: L2 only (native Claude prompt cache)

Assume 3,500 of the 4K input tokens are stable prefix (system prompt + document context) and 500 are unique per request, with an 80% cache hit rate on the prefix (typical for RAG). Input: 10M × [0.8 × (3,500 × $0.30/M + 500 × $3/M) + 0.2 × (3,500 × $3.75/M + 500 × $3/M)] ≈ $49,650. Output is unchanged at $75,000, for a total of roughly $124,650, about 36% below baseline.

Scenario C: L1 + L2 stacked

Assume 25% of total requests are L1-cacheable (repetitive / FAQ-like) via exact or semantic match; the remaining 75% go through L2 on Claude. The 2.5M L1 hits cost nothing upstream. The other 7.5M requests follow the Scenario B math: ≈ $37,240 input + $56,250 output, for roughly $93,490 total, about 52% below baseline.

The compound savings of L1 + L2 aren't additive; they multiply. The 25% of requests L1 absorbs never reach the L2 system, so the L1 savings aren't diluted.
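The three scenarios can be reproduced with a short sketch. The figures use this article's stated assumptions (prices, token counts, hit rates) and ignore the marginal cost of running the L1 cache itself:

```python
REQUESTS = 10_000_000
PREFIX_TOK, UNIQUE_TOK, OUT_TOK = 3_500, 500, 500
IN_PRICE, OUT_PRICE = 3.00, 15.00      # $/M tokens (Claude Sonnet-class)
READ_PRICE, WRITE_PRICE = 0.30, 3.75   # L2 cache read / write, $/M
L2_HIT, L1_HIT = 0.80, 0.25            # prefix hit rate / L1 absorption
M = 1_000_000

def baseline(n):
    """Scenario A: every token billed at full price."""
    return n * ((PREFIX_TOK + UNIQUE_TOK) * IN_PRICE + OUT_TOK * OUT_PRICE) / M

def with_l2(n):
    """Scenario B math: prefix billed at read price on hits, write price on misses."""
    hit = PREFIX_TOK * READ_PRICE + UNIQUE_TOK * IN_PRICE
    miss = PREFIX_TOK * WRITE_PRICE + UNIQUE_TOK * IN_PRICE
    inp = n * (L2_HIT * hit + (1 - L2_HIT) * miss) / M
    return inp + n * OUT_TOK * OUT_PRICE / M

a = baseline(REQUESTS)                # Scenario A
b = with_l2(REQUESTS)                 # Scenario B: L2 only
c = with_l2(REQUESTS * (1 - L1_HIT))  # Scenario C: L1 absorbs 25% for free
print(f"A=${a:,.0f}  B=${b:,.0f}  C=${c:,.0f}  C saves {1 - c / a:.0%}")
```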

Architecture Patterns That Actually Work

Pattern 1: Helicone proxy → vendor direct

[Your App] → [Helicone (L1 cache)] → [Anthropic/OpenAI/etc (L2 cache)]

Single proxy layer. Helicone handles L1 semantic cache and observability. Upstream provider handles L2. Best for teams already committed to one or two vendors.

Pattern 2: Gateway → Helicone → vendors (multi-model)

[Your App] → [TokenMix.ai (routing + L2 passthrough)] → [Helicone (L1)] → [vendors]

Gateway handles model routing and failover; Helicone adds L1. More hops but supports multi-vendor architecture with both cache layers.

Pattern 3: Self-hosted L1 + TokenMix.ai

[Your App] → [Redis/own L1 cache] → [TokenMix.ai (routing + L2 passthrough)] → [vendors]

Fine-grained control over L1 policy (TTL, invalidation, semantic thresholds). TokenMix.ai handles the model-routing complexity and preserves L2 cache directives. Best for teams with specific compliance or data-residency requirements that prevent using Helicone's managed proxy.

Pattern 4: Vendor-only (no aggregation, no L1)

[Your App] → [Vendor API directly]

L2 only. Simplest but no multi-model flexibility, no L1 savings. Fine for single-vendor deployments that don't need routing.

How to Choose

Your situation Pattern Why
Single vendor, want max simplicity Pattern 4 No extra hops, vendor's L2 auto-fires
Single vendor, want L1 savings Pattern 1 Helicone adds the missing L1 layer
Multi-vendor with routing needs Pattern 2 TokenMix.ai + Helicone combined
Strict compliance/data residency Pattern 3 Self-hosted L1 keeps all cache on-prem
High-volume repetitive workload (support bots, FAQ) Any pattern with L1 L1 savings are the big win here
Low-volume or highly dynamic content Skip L1, enable L2 only L1 stale-risk outweighs benefit

Conclusion

Caching in AI gateways is not a single feature to toggle — it's two layers with different economics, different failure modes, and different integration patterns. Understand them separately, then decide which layers make sense for your workload.

The practical rule for 2026: always enable L2 (it's free money on Claude, OpenAI, DeepSeek for any prompt with stable prefix). Add L1 when your workload has genuine repetition and your content tolerates TTL-bounded staleness. Combining the two can cut LLM bills 50-60% on realistic production workloads.

TokenMix.ai preserves L2 cache directives across 300+ models through one OpenAI-compatible endpoint, and surfaces cached token counts per request so you can verify caching is actually firing. Stack Helicone in front for L1 when the workload warrants it. The two-layer architecture is the cheapest production AI stack you can run in 2026.

This article is also available on Dev.to for developer community discussion with code examples.

FAQ

Q1: What's the difference between prompt caching and result caching?

Prompt caching (L2) is a vendor-side optimization that stores KV state for prompt prefixes — the model still generates output every call, but prefix computation is skipped, saving 50-90% on cached input tokens. Result caching (L1) is a proxy-side feature that returns previous responses for matching requests, skipping the model call entirely — 100% savings per hit but requires tolerating some staleness.

Q2: Does OpenRouter cache my requests?

OpenRouter forwards requests to upstream providers without running its own L1 result cache. Provider-side L2 prompt caching works through OpenRouter because the gateway preserves request prefix stability. For actual L1 caching (skipping upstream calls entirely), you need to add Helicone or a self-hosted Redis layer.

Q3: How much does Claude prompt caching actually save?

Claude cache read tokens cost 10% of base input price — a 90% discount. Cache writes cost 25% more than base input for 5-minute TTL, or 100% more for 1-hour TTL. The break-even is 1 cache read for 5-minute TTL and 2 reads for 1-hour TTL. For RAG systems with stable document context, savings are typically 60-85% on total input cost.

Q4: Is DeepSeek caching automatic?

Yes. DeepSeek's context caching on disk is enabled by default for all users. Any request with a cacheable prefix automatically hits the cache, with cache read tokens priced at about 10% of base input — a 90% discount. No code changes required, no write premium.

Q5: When should I use semantic cache vs exact cache?

Exact cache for high-volume deterministic workloads where requests are truly byte-identical (e.g., programmatic RAG over a fixed corpus). Semantic cache for user-facing chatbots where people phrase the same question many ways ("How does X work?" vs "Explain X"). Semantic cache has 10-30x higher hit rate in natural language contexts but costs 30-80ms per lookup to compute the embedding comparison.

Q6: Can I combine L1 result cache with L2 prompt cache?

Yes, and this is the optimal configuration for repetitive workloads. L1 absorbs exact and semantic matches, never touching the upstream. L2 fires on L1 misses, reducing the cost of prompts that do reach the model. The combined savings on a typical 10M requests/month workload is 50-60% vs baseline uncached, significantly more than either layer alone.

Q7: Does caching work with streaming responses?

L2 prompt caching works with streaming — the cache affects the prefill phase before tokens start streaming. L1 result caching is compatible but more complex: you need to store the full response (or complete chunk sequence) and replay it. Helicone handles this automatically. Self-hosted implementations need to buffer the stream to cache and replay from buffer on hits.
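The buffer-and-replay pattern for self-hosted L1 streaming is small but easy to get wrong. A minimal sketch, assuming an in-memory dict cache and an iterable upstream stream; a production version would also need TTLs, eviction, and care not to cache partially failed streams:

```python
def stream_with_cache(cache: dict, key: str, upstream_stream):
    """Replay a cached chunk sequence on a hit; otherwise stream from
    upstream while buffering chunks, storing the buffer only once the
    stream completes successfully."""
    if key in cache:
        yield from cache[key]       # cache hit: replay stored chunks
        return
    buffer = []
    for chunk in upstream_stream:   # cache miss: tee the live stream
        buffer.append(chunk)
        yield chunk
    cache[key] = buffer             # store only the complete sequence
```

Because the buffer is written after the loop finishes, an upstream error mid-stream raises out of the generator and nothing incomplete is cached.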

Q8: How do I measure whether my cache is actually working?

Check two metrics: (1) cache hit rate (percentage of requests served from cache), and (2) cached tokens reported in provider response metadata (for L2). Without an observability layer, you're flying blind. Helicone, TokenMix.ai, and Langfuse all expose these metrics in dashboards. For self-hosted caches, log cache hit/miss explicitly and aggregate in your monitoring stack.


Sources

Data collected 2026-04-21. LLM cache pricing and mechanisms are actively iterated by every major provider — verify specific numbers against vendor pricing pages before making architectural decisions based on dollar figures in this article.


By TokenMix Research Lab · Updated 2026-04-21