TokenMix Research Lab · 2026-04-24
GPT-5.5 Review: 88.7% SWE-Bench, 92.4% MMLU, 2× Price Tag (2026)
Last Updated: 2026-04-28
Author: TokenMix Research Lab
OpenAI shipped GPT-5.5 (codename "Spud") on April 23, 2026 — the first fully retrained base model since GPT-4.5, and the single biggest frontier step since GPT-5 launched. Headline numbers: 88.7% SWE-Bench Verified, 92.4% MMLU, 60% hallucination rate reduction vs GPT-5.4, natively omnimodal architecture, and 40% fewer output tokens on the same Codex tasks. But the price jumped to $5/$30 per MTok — a hard 2× vs GPT-5.4. This review covers what the benchmarks actually mean in production, whether the 2× price is justified, and how GPT-5.5 stacks against Claude Opus 4.7 and DeepSeek V4 (which shipped the next day). TokenMix.ai tracks live pricing and benchmarks across 300+ models including GPT-5.5 from day one.
Table of Contents
- Confirmed vs Speculation
- What Changed from GPT-5.4 to GPT-5.5
- Benchmark Deep Dive: Where Spud Actually Wins
- The 2× Price Tag: Who Actually Pays It
- Omnimodal + 40% Fewer Output Tokens
- GPT-5.5 vs Claude Opus 4.7 vs DeepSeek V4
- What GPT-5.5 Still Can't Do
- Prediction Review: We Called 70% Odds for April
- Who Should Upgrade
- FAQ
Confirmed vs Speculation
| Claim | Status |
|---|---|
| Released April 23, 2026 | Confirmed (OpenAI announcement) |
| Internal codename "Spud" | Confirmed |
| First fully retrained base since GPT-4.5 | Confirmed |
| 88.7% SWE-Bench Verified | Confirmed |
| 73.1% Expert-SWE (internal OpenAI eval) | Confirmed |
| 82.7% Terminal-Bench 2.0 | Confirmed |
| 92.4% MMLU (vs GPT-4's 86.4%) | Confirmed |
| 60% hallucination rate reduction vs GPT-5.4 | Confirmed (OpenAI self-reported) |
| 2× price jump: $5/$30 per MTok | Confirmed |
| gpt-5.5-pro unchanged at $30/$180 per MTok | Confirmed |
| Natively omnimodal (text + image + audio + video) | Confirmed |
| 40% fewer output tokens on Codex tasks | Confirmed |
| Same per-token latency as GPT-5.4 in serving | Confirmed |
| Enterprise API live April 23, consumer rollout early May | Confirmed |
| Would beat Claude Opus 4.7 on every benchmark | No — they win different benchmarks |
| Would force DeepSeek V4 price war | Likely — V4 shipped next day at $0.30/$0.50 |
What Changed from GPT-5.4 to GPT-5.5
GPT-5.5 is not a tuning iteration. It's a full base-model retrain — the first since GPT-4.5. Three concrete differences:
1. Native omnimodal architecture. GPT-5.4 handled images via adapter layers bolted onto the text base. GPT-5.5 has a unified architecture that processes text, images, audio, and video through the same parameter pool — similar to what Google shipped in Gemini 1.5 but with OpenAI's scaling approach.
2. Token efficiency. Same per-token serving latency as GPT-5.4, but uses roughly 40% fewer output tokens to complete the same Codex task. Translation: lower latency on multi-step agent workflows and lower effective cost per completed task even with the 2× list price hike.
3. Hallucination reduction. The biggest qualitative change. OpenAI self-reports 60% fewer hallucinations in standard benchmark comparisons vs GPT-5.4. Independent third-party verification is still pending, but early reviews from production teams confirm visibly fewer fabricated API signatures, hallucinated package versions, and invented citations.
Benchmark Deep Dive: Where Spud Actually Wins
| Benchmark | GPT-5.5 | GPT-5.4 | Claude Opus 4.7 |
|---|---|---|---|
| SWE-Bench Verified | 88.7% | 82.1% | 87.6% |
| SWE-Bench Pro | 58.6% | 57.7% | 64.3% |
| Expert-SWE (OpenAI eval) | 73.1% | 68.5% | — |
| Terminal-Bench 2.0 | 82.7% | ~74% | — |
| MMLU | 92.4% | 89.8% | ~91% |
| Hallucination rate | -60% vs 5.4 | baseline | — |
Source: OpenAI announcement, Handy AI model drop, Startup Fortune coverage
The honest read:
- GPT-5.5 leads on SWE-Bench Verified (the saturated benchmark) and Terminal-Bench 2.0 (agentic tool use)
- Claude Opus 4.7 still leads on SWE-Bench Pro (the harder, less-saturated benchmark) — by 5.7 points
- On MMLU, GPT-5.5's 92.4% is frontier but not dramatically ahead of Opus 4.7
Read this carefully: GPT-5.5 won the headline (88.7% Verified) but lost the harder benchmark (Pro) to Opus 4.7. In practice, for real-world complex coding tasks, the two are roughly tied — they just optimize for different slices of the distribution.
The 2× Price Tag: Who Actually Pays It
GPT-5.5 pricing:
- Input: $5 per million tokens (up from $2.50 on GPT-5.4)
- Output: $30 per million tokens (up from $15)
- gpt-5.5-pro: $30/$180 per MTok (unchanged from GPT-5.4 Pro)
Translation: standard tier doubled, pro tier held flat. This is unusual — typically new generations ship at slightly higher input price but similar output. Doubling both input AND output is aggressive.
The math that saves it: 40% fewer output tokens on Codex tasks. If you spend $100/day on GPT-5.4 Codex at 60% output-token share, your GPT-5.5 bill is roughly:
- Input: $40 × 2 = $80 (2× price)
- Output: $60 × 2 × 0.60 = $72 (2× price but 40% fewer tokens)
- Total: $152/day — a 52% increase, not 100%
So "2× price" in practice is "~50% effective cost increase" on Codex-type workloads. On non-Codex workloads without the token efficiency gain, it's the full 2× hit.
Who this pricing is for:
- ✅ Teams already paying GPT-5.4 premium and wanting latest capabilities
- ✅ Workloads where hallucination cost > token cost (legal, medical, financial)
- ✅ Codex/coding workloads that benefit from token efficiency
Who should NOT upgrade:
- ❌ Cost-sensitive batch processing — DeepSeek V4-Flash at $0.14/$0.28 is 35× cheaper
- ❌ General chat and content generation — GPT-5.4 or Claude Sonnet 4.6 are fine
- ❌ Open-weight requirements — Kimi K2.6 or DeepSeek V4 solve this
Omnimodal + 40% Fewer Output Tokens
Two architectural wins worth understanding:
Omnimodal unified processing. The same parameter pool handles text, images, audio, video. Practical advantages: better cross-modal reasoning (ask about a screenshot + paste code and get coherent answers), lower latency on multimodal requests (no adapter hop), consistent quality across modalities.
Compare to GPT-5.4, which routed image requests through a separate vision tower — working but with noticeable quality drop for cross-modal reasoning tasks.
40% output token reduction on Codex. OpenAI's framing: "GPT-5.5 matches GPT-5.4's per-token latency in production serving while using about 40% fewer output tokens to finish the same Codex task." In practice, Codex-style tasks (multi-step coding with tool calls) generate more concise reasoning traces without sacrificing final output quality.
This is where the 2× price hike becomes defensible for coding workloads specifically. For pure chat or content generation, the efficiency gain is smaller (~10-20%) and the price hike bites harder.
GPT-5.5 vs Claude Opus 4.7 vs DeepSeek V4
| Dimension | GPT-5.5 | Claude Opus 4.7 | DeepSeek V4-Pro | DeepSeek V4-Flash |
|---|---|---|---|---|
| Released | 2026-04-23 | 2026-04-16 | 2026-04-24 | 2026-04-24 |
| Architecture | Dense+omnimodal | Dense | MoE 1.6T/49B active | MoE 284B/13B active |
| Context | 256K | 1M | 1M | 1M |
| SWE-Bench Verified | 88.7% | 87.6% | ~85% | ~78% |
| SWE-Bench Pro | 58.6% | 64.3% | ~55% | ~48% |
| Input price ($/M) | $5 | $5 (+0-35% tokenizer tax) | $1.74 | $0.14 |
| Output price ($/M) | $30 | $25 | $3.48 | $0.28 |
| Open weights | No | No | Yes (Apache 2.0) | Yes (Apache 2.0) |
| Multimodal | Omnimodal (text+image+audio+video) | Text+image | Text | Text |
Read the landscape:
- Frontier quality leader (coding): Claude Opus 4.7 (SWE-Bench Pro 64.3)
- Frontier quality leader (breadth): GPT-5.5 (MMLU 92.4 + Terminal-Bench 82.7 + omnimodal)
- Price-performance leader: DeepSeek V4-Flash ($0.14/M, 78% SWE-Bench Verified, open-weight)
- Open-weight flagship: DeepSeek V4-Pro (85% SWE-Bench Verified, Apache 2.0, 1M context)
GPT-5.5 at $5/$30 is 18-214× more expensive than DeepSeek V4-Flash, while being 10-15% better on top benchmarks. Whether that gap is worth 18× cost depends entirely on what the workload costs per hallucination.
What GPT-5.5 Still Can't Do
Honest caveats:
- Not class-leading on SWE-Bench Pro — Claude Opus 4.7 wins the harder coding benchmark by 5.7 points
- Context window stays at 256K — Claude Opus 4.7 and DeepSeek V4 both offer 1M
- Closed-weight — privacy-sensitive deployment still needs Claude or open alternatives
- Consumer rollout delayed — enterprise API is live, ChatGPT consumer access is early May
- Omnimodal audio/video quality — text+image is confirmed strong, audio and video modalities are newer and early reports show some rough edges
- Hallucination claim needs third-party validation — 60% reduction is OpenAI self-reported; independent benchmarks pending
Prediction Review: We Called 70% Odds for April
On April 17, 2026 — six days before GPT-5.5 shipped — we published "GPT-5.5 Release Date: 70% Odds for April, Spud Pretraining Done". The prediction was based on Polymarket odds + pretraining completion reports + historical OpenAI release cadence.
Result: April release confirmed, 70% odds calibrated.
Specific predictions vs reality:
- ✅ April release — correct (4-23)
- ✅ Safety eval ~30 days — correct (pretraining 3-24, release 4-23, exactly 30 days)
- ✅ Significant benchmark jump over GPT-5.4 — correct (88.7 vs 82.1 SWE-Bench Verified)
- ✅ Hallucination focus — correct (60% reduction is the headline feature)
- ⚠️ Pricing — we didn't predict 2× input/output hike; it was unexpected
At TokenMix.ai, tracking 300+ model release cycles and pricing changes daily gives us the data base for predictions like this. The 70% calibration here matches our historical accuracy on similar release windows.
Who Should Upgrade
| Your situation | Recommendation |
|---|---|
| Already paying GPT-5.4, want latest | Upgrade. The token efficiency offsets ~50% of the price hike. |
| Cost-sensitive Codex/coding workload | Run A/B: GPT-5.5 vs DeepSeek V4-Pro. V4-Pro is 85% the quality at 3% the cost. |
| Cost-sensitive general chat | Stay on GPT-5.4 Mini, Claude Haiku 4.5, or DeepSeek V4-Flash |
| Multimodal-heavy workflow | GPT-5.5 — omnimodal architecture is genuinely new |
| Open-weight requirement | Skip GPT-5.5 entirely. DeepSeek V4 or Kimi K2.6 win. |
| Need 1M context | Claude Opus 4.7 or DeepSeek V4 — GPT-5.5 caps at 256K |
| Frontier reasoning with hallucination sensitivity | GPT-5.5. 60% hallucination drop is the single most valuable quality improvement of 2026 Q2. |
For teams running mixed workloads, TokenMix.ai provides OpenAI-compatible routing across GPT-5.5, Claude Opus 4.7, DeepSeek V4, Kimi K2.6, and 300+ other models — useful for A/B testing which tier actually fits each slice of your workload before committing to the 2× price.
FAQ
Q: Is GPT-5.5 the same as "Spud"? A: Yes. "Spud" was OpenAI's internal codename for the project during pretraining (completed March 24, 2026). GPT-5.5 is the public release name.
Q: What's the real cost increase versus GPT-5.4? A: List price doubles (2×). Effective cost increase on Codex workloads is ~50% due to 40% fewer output tokens. On non-coding workloads, it's closer to the full 2× hit.
Q: Is GPT-5.5 better than Claude Opus 4.7? A: Depends on the benchmark. GPT-5.5 wins SWE-Bench Verified (88.7 vs 87.6), Terminal-Bench 2.0, and MMLU. Claude Opus 4.7 wins SWE-Bench Pro (64.3 vs 58.6), has 1M context vs 256K, and has smaller tokenizer overhead. In practice they're roughly tied for frontier coding work.
Q: Can I use GPT-5.5 through the ChatGPT app? A: Consumer rollout is scheduled for early May 2026. Enterprise API subscribers have access since April 23.
Q: How much does the 60% hallucination reduction actually matter? A: Significantly, if your workload has high hallucination cost (legal research, medical summarization, financial analysis). For general chat or content generation, the quality improvement is visible but less decisive.
Q: Is gpt-5.5-pro worth the 6× price over standard? A: Only for tasks requiring the absolute highest quality — research-grade reasoning, complex multi-agent orchestration, or regulated workflows where any error has material cost. For standard coding and chat, gpt-5.5 standard is the sweet spot.
Q: Will GPT-5.5 pricing come down? A: Historical OpenAI pattern: prices drop ~40-60% within 12-18 months as infrastructure optimization kicks in. GPT-5.5-mini (expected Q3 2026) will likely land at $0.50/$3.00 per MTok range.
Sources
- OpenAI: Introducing GPT-5.5
- Handy AI: Model Drop GPT-5.5
- Startup Fortune: 60% Hallucination Drop
- Axios: OpenAI Spud Release
- Apidog GPT-5.5 Pricing Breakdown
- TokenMix Original Prediction (April 17)
By TokenMix Research Lab · Updated 2026-04-24