DeepSeek V4 Pro vs Flash: 1.6T or 284B, Which Fits You? (2026)
DeepSeek open-sourced V4 on April 24, 2026 — one day after GPT-5.5 shipped — and did it in two variants: V4-Pro (1.6 trillion parameters, 49B active) and V4-Flash (284 billion parameters, 13B active). Both are Apache 2.0, both offer 1M context, and both feature the new CSA + HCA hybrid attention. Pricing is aggressive: V4-Flash at $0.14/$0.28 per MTok (matching V3.2's floor), V4-Pro at $1.74/$3.48 per MTok (roughly 12× V3.2 but a third of GPT-5.5). The strategic question isn't "is V4 good" — the headline benchmarks already confirm it is. The question is Pro or Flash for your specific workload. This breakdown covers the spec differences, the honest benchmark gap, the economics at scale, and a decision framework. TokenMix.ai tracks both variants alongside 300+ other models for teams routing across tiers.
| Claim | Status |
|---|---|
| V4-Pro matches GPT-5.5 on SWE-Bench Verified | Partial — V4-Pro ~85% vs GPT-5.5's 88.7%; within 4 points |
| V4-Pro replaces frontier closed models entirely | No — still a 10-15% gap on SWE-Bench Pro vs Opus 4.7 |
| DeepSeek delayed V4 over a Huawei chip bottleneck (March reports) | Resolved — shipped April 24 |
Pro vs Flash: Spec-by-Spec
| Dimension | V4-Pro | V4-Flash |
|---|---|---|
| Total parameters | 1.6T | 284B |
| Active per token | 49B | 13B |
| Architecture | Sparse MoE + CSA/HCA | Sparse MoE + CSA/HCA |
| Context | 1M | 1M |
| License | Apache 2.0 | Apache 2.0 |
| Input price | $1.74 / MTok | $0.14 / MTok |
| Output price | $3.48 / MTok | $0.28 / MTok |
| Cache-hit input | ~$0.14 / MTok (92% off) | ~$0.03 / MTok (80% off) |
| Throughput | ~60-120 tok/s | ~150-300 tok/s |
| Self-host VRAM (FP8) | ~800GB+ | ~150GB |
| Realistic self-host | 8× H200 or 16× H100 | 2× H100 or 4× A100 |
| Best workload | Frontier reasoning, complex agents | High-throughput production, batch jobs |
The 3.8× active-parameter gap (49B vs 13B) and the 12× price gap ($1.74 vs $0.14 per MTok input) are the big decision drivers. Flash is economically disruptive — close to V3.2 pricing with a clearly better architecture. Pro is a legitimate frontier competitor at fractional cost.
What CSA + HCA Hybrid Attention Actually Does
DeepSeek V4's attention mechanism is the architectural novelty worth understanding:
CSA (Compressed Sparse Attention) — handles local context with aggressive compression, suitable for the "near past" tokens where fine-grained detail matters most.
HCA (Heavily Compressed Attention) — handles the 1M-token long tail with even heavier compression, where rough semantic recall is what you need, not token-level fidelity.
The hybrid routes tokens through whichever mechanism fits the distance from the current generation position. Net effect: 1M context without the O(n²) memory blowup that makes most long-context claims fall apart in practice.
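A toy sketch of that distance-based routing idea (the window size, function name, and mechanism labels here are illustrative assumptions, not DeepSeek's published internals):

```python
# Illustrative sketch of distance-based attention routing.
# LOCAL_WINDOW is a hypothetical cutoff for the "near past".
LOCAL_WINDOW = 32_768

def route_attention(token_pos: int, current_pos: int) -> str:
    """Pick an attention mechanism by distance from the current
    generation position: CSA for the near past (fine-grained detail),
    HCA for the heavily compressed long tail (rough semantic recall)."""
    distance = current_pos - token_pos
    return "CSA" if distance <= LOCAL_WINDOW else "HCA"

# At generation position 1,000,000, recent tokens stay fine-grained
# while most of the 1M context goes through the compressed path:
recent = route_attention(999_000, 1_000_000)    # "CSA"
distant = route_attention(500_000, 1_000_000)   # "HCA"
```

The point of the split is that only a small window needs expensive, detail-preserving attention, which is what avoids the O(n²) memory blowup at 1M tokens.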
Early third-party testing shows V4 maintains solid recall up to ~700K tokens, with noticeable degradation only past 850K. Compare to Claude Opus 4.7's 1M (stable to ~900K) and GPT-5.5's 256K cap — V4 is competitive on long context, not best-in-class.
Pricing Breakdown (Cache Hits Matter)
List prices are only half the story. Cache-hit prices are where DeepSeek genuinely disrupts:
Claude Opus 4.7: $5 / MTok input, cache hits reduce to ~$0.50
The takeaway: on workloads with stable system prompts (agent systems, RAG with fixed contexts, long-document Q&A), V4-Flash's $0.03/MTok cache-hit price is roughly 17× cheaper than the ~$0.50 cache-hit price of either GPT-5.5 or Opus 4.7.
For a team running 10M input tokens/day at a 70% cache-hit rate (using the cache-hit prices above):
- GPT-5.5: 3M × $5 + 7M × ~$0.50 ≈ $18.50/day input
- V4-Flash: 3M × $0.14 + 7M × $0.03 ≈ $0.63/day input
- Saving: ~$17.87/day, roughly $536/month on input tokens alone
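The blended-price arithmetic generalizes to any hit rate; a minimal sketch using this article's prices (the function name is mine):

```python
def blended_price(hit_rate: float, list_price: float, hit_price: float) -> float:
    """Effective input price ($/MTok) given a cache-hit rate:
    hits pay the discounted price, misses pay the list price."""
    return hit_rate * hit_price + (1.0 - hit_rate) * list_price

# V4-Flash at a 70% cache-hit rate ($0.14 list, $0.03 cache hit):
flash_effective = blended_price(0.70, 0.14, 0.03)   # 0.063 $/MTok
daily_cost = flash_effective * 10                   # ~$0.63/day at 10 MTok/day
```

Plugging in your own hit rate is the fastest way to see whether cache-heavy routing changes which tier wins for you.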
Which Variant Wins on Which Benchmark
| Benchmark | V4-Pro | V4-Flash | Context |
|---|---|---|---|
| SWE-Bench Verified | ~85% | ~78% | vs GPT-5.5 88.7%, Opus 4.7 87.6% |
| SWE-Bench Pro | ~55% | ~48% | vs GPT-5.5 58.6%, Opus 4.7 64.3% |
| AIME 2025 | ~94 | ~88 | vs Step 3.5 Flash 97.3 |
| MMLU | ~89% | ~84% | vs GPT-5.5 92.4% |
| Long-context recall (500K) | Strong | Strong | Both support 1M context |
| Terminal-Bench 2.0 | ~75% | ~68% | vs GPT-5.5 82.7%, Qwen 3.6-27B 59.3% |
The honest read:
- V4-Pro is within 4-10 points of frontier closed models (GPT-5.5, Opus 4.7) on most benchmarks
- V4-Flash trails Pro by ~7-10 points on reasoning tasks but is dramatically cheaper and faster
- Neither beats Step 3.5 Flash on pure math (AIME 97.3) — StepFun still leads that specific slice
- Neither beats Opus 4.7 on SWE-Bench Pro — the closed frontier still wins the hardest coding benchmark
Workload Decision Framework
Pick V4-Pro when:
- You need frontier quality and Apache 2.0 / open weights matter
- Your workload is complex enough that the quality gap (vs Flash) shows up in outcomes
- You're running long-horizon agents where per-step quality compounds
- You need 1M context with strong recall

Pick V4-Flash when:
- You're running high-throughput production workloads (100K+ requests/day)
- Cost matters more than frontier quality
- You want to replace V3.2 at the same price with a better architecture
- You need fast inference (150-300 tok/s)
- You're running A/B experiments across many prompt variants

Skip both when:
- You need the absolute top score on SWE-Bench Pro (Claude Opus 4.7 still wins at $5/MTok)
- You need omnimodal input (GPT-5.5 wins at $5/MTok)
- Your workload is pure math/STEM reasoning (Step 3.5 Flash is cheaper and better)
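The framework above condenses into a small routing helper; this is a toy sketch (the priority order, threshold, and model identifiers are illustrative assumptions):

```python
def pick_model(needs_top_swe: bool, needs_omnimodal: bool,
               math_heavy: bool, requests_per_day: int,
               quality_critical: bool) -> str:
    """Toy router condensing the decision framework above.
    Checks the 'skip both' cases first, then volume/cost, then quality."""
    if needs_top_swe:
        return "claude-opus-4.7"    # hardest coding benchmark still closed
    if needs_omnimodal:
        return "gpt-5.5"            # omnimodal input
    if math_heavy:
        return "step-3.5-flash"     # cheaper and better on AIME
    if requests_per_day >= 100_000 or not quality_critical:
        return "deepseek-v4-flash"  # high-throughput or cost-first
    return "deepseek-v4-pro"        # frontier-adjacent quality, open weights

choice = pick_model(False, False, False, 250_000, False)  # high-volume batch job
```

A real router would also weigh context length and cache-hit rates, but the priority ordering is the useful part: rule out the hard constraints first, then optimize cost.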
Self-Hosting: Can Your Hardware Run It?
Both variants are Apache 2.0 open-weight, available on Hugging Face. Self-hosting is viable:

V4-Pro self-host reality:
- 1.6T params in FP8 ≈ 800GB+ VRAM
- Realistic setup: 8× H200 or 16× H100
- Throughput: ~40-80 tok/s/request with batch size 4
- Capital cost: ~$250K for the hardware, or ~$15K-30K/month rental on cloud
- Best for: enterprises with data residency requirements, or teams processing 10M+ requests/day

V4-Flash self-host reality:
- 284B params in FP8 ≈ 150GB VRAM
- Realistic setup: 2× H100 SXM or 4× A100 80GB
- Throughput: ~100-200 tok/s/request with batch size 4
- Capital cost: ~$50K for hardware, or $3K-6K/month rental
- Best for: teams wanting an OpenAI-compatible API on-prem without frontier-scale investment
For most teams, hosted API is the right call unless there's a specific reason (compliance, sovereignty, extreme volume). The $0.14/MTok hosted price is hard to beat on TCO for anything short of 100M+ daily tokens.
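One way to sanity-check that TCO claim is a break-even calculation between the hosted list price and a rented cluster. This sketch deliberately ignores cache discounts, output tokens, utilization, and ops overhead, so it understates the hosted advantage:

```python
def breakeven_tokens_per_day(monthly_rental_usd: float,
                             hosted_price_per_mtok: float) -> float:
    """Daily input-token volume at which a month of rented GPUs
    costs the same as the hosted API at list price."""
    mtok_per_month = monthly_rental_usd / hosted_price_per_mtok
    return mtok_per_month * 1e6 / 30  # tokens per day

# V4-Flash: $3K/month rental floor vs $0.14/MTok hosted
be = breakeven_tokens_per_day(3000, 0.14)  # ~714M tokens/day
```

On these assumptions the rental only pays for itself at hundreds of millions of daily tokens, which is why compliance and sovereignty, not raw cost, are usually the real reasons to self-host.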
V4-Flash vs V3.2: Same Price, New Architecture
DeepSeek kept V4-Flash's pricing identical to V3.2's ($0.14/$0.28 per MTok). This is strategic:
- Migration path: drop-in replacement, no cost change
- Quality upgrade: ~10-15% better across benchmarks vs V3.2
- Architecture upgrade: 284B MoE vs V3.2's 671B MoE — Flash is smaller but better thanks to CSA/HCA attention and updated training
- Throughput gain: ~20% faster than V3.2 at the same cost

For existing V3.2 users, the migration decision is trivial: switch to V4-Flash today and get a better model at the same price.
V4 in the 2026 Q2 Landscape
Ranking current frontier-tier models by open/closed status and price:
- Closed premium: Claude Opus 4.7 ($5/$25), GPT-5.5 ($5/$30)
- Open premium: DeepSeek V4-Pro ($1.74/$3.48)
- Open budget: DeepSeek V4-Flash ($0.14/$0.28), Step 3.5 Flash ($0.10/$0.30)
- Closed budget: GPT-5.4 Mini ($0.50/$3), Claude Haiku 4.5
V4 slots in at two levels: Pro is the best open-weight 1.6T you can get with Apache 2.0; Flash is the price-performance floor for agentic production.
See our Chinese AI models comparison guide for the broader landscape including Kimi K2.6, Step 3.5 Flash, Qwen 3.6, GLM-5.1, and others.
FAQ
Q: Which DeepSeek V4 should I use for coding agents?
A: V4-Pro if you want near-parity with Claude/GPT quality at roughly a third of the price. V4-Flash if you need fast iteration or are running high-throughput jobs where the quality gap is acceptable.
Q: How does V4-Pro compare to GPT-5.5?
A: V4-Pro is ~4 points behind GPT-5.5 on SWE-Bench Verified, ~3 points behind on MMLU, but costs 1/3 of GPT-5.5 and is open-weight. For most workloads, V4-Pro is the better economic pick; for absolute frontier coding on SWE-Bench Pro, Claude Opus 4.7 still leads.
Q: Can I self-host V4-Pro on a single H100?
A: No. V4-Pro requires ~800GB VRAM in FP8 — you need 8-16 GPUs. V4-Flash at 284B fits in 2× H100, which is realistic for small teams.
Q: Does V4 support OpenAI-compatible API?
A: Yes. Both variants are accessible via api.deepseek.com/v1 with the same OpenAI-compatible endpoint as V3.2. Drop-in migration.
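Since the endpoint is OpenAI-compatible, the standard OpenAI Python SDK should work when pointed at DeepSeek's base URL. A sketch (the model identifier "deepseek-v4-flash" is an assumed name; check DeepSeek's model list for the real id):

```python
from openai import OpenAI

# OpenAI-compatible endpoint from the article; API key is a placeholder.
client = OpenAI(
    base_url="https://api.deepseek.com/v1",
    api_key="YOUR_DEEPSEEK_API_KEY",
)

resp = client.chat.completions.create(
    model="deepseek-v4-flash",  # hypothetical id, not confirmed here
    messages=[{"role": "user", "content": "Summarize this changelog: ..."}],
)
print(resp.choices[0].message.content)
```

Migration from V3.2 (or from OpenAI itself) is then a matter of swapping `base_url` and `model`, with no code changes elsewhere.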
Q: Is V4's 1M context real or marketing?
A: Third-party testing shows stable recall up to ~700K, noticeable degradation past 850K. For production, stay under ~600K for critical accuracy. This is comparable to Claude Opus 4.7's 1M behavior.
Q: What's the cache-hit economics on V4?
A: V4-Pro cache hits drop input from $1.74 to ~$0.14/MTok (92% off). V4-Flash cache hits drop from $0.14 to ~$0.03/MTok (80% off). For agent workloads with stable system prompts, the real effective input cost is ~$0.03-0.14/MTok.
Q: When will DeepSeek V4.1 or V4.2 release?
A: Based on DeepSeek's historical cadence, expect a V4.1 minor update within 3-4 months. Full V5 likely Q2 2027.