GPT-5.5 Spud Benchmarks: Projected Scores vs GPT-5.4 & Claude 4.7 (2026)
GPT-5.5 — codenamed "Spud" — finished pretraining on March 24, 2026. No benchmarks have been released. No model card exists. What we have is the performance floor the market now demands: GPT-5.4 holds the coding lead with 57.7% on SWE-bench Pro, Claude Opus 4.7 jumped to 87.6% on SWE-bench Verified, and Gemini 3.1 Pro leads GPQA Diamond at 94.3%. This article models three benchmark scenarios for Spud based on real competitor data — what it must beat, what it is likely to hit, and what would justify a price premium. TokenMix.ai tracks benchmark movement across 300+ models in production and will publish Spud scores within 24 hours of release.
| Claim | Status | Basis |
| --- | --- | --- |
| GPT-5.5 will beat GPT-5.4 overall | Expected | Historical pattern (every major OpenAI release beats the prior one) |
| Specific GPQA/SWE-bench numbers | Not leaked | No credible source has published figures |
| Release before June 2026 | Likely | Post-training typically runs 4-12 weeks; GPT-5.4 took ~6 weeks |
Bottom line: any article claiming specific Spud benchmark numbers as fact is fabricating. The model is still in post-training (RLHF, red-teaming, safety eval). Benchmarks only stabilize in the final 2-3 weeks before release.
What we can do is project plausible scenarios based on real competitor scores and OpenAI's release pattern.
The Competitor Floor: What GPT-5.5 Must Beat
Before projecting Spud, establish the floor. The key takeaways from publicly reported Q1-Q2 2026 scores:
GPT-5.4 still leads on four benchmarks (SWE-bench Pro, HumanEval, OSWorld, GDPval) — OpenAI can't ship a worse model
Coding is OpenAI's moat — they'll protect SWE-bench Pro and HumanEval dominance
Gemini 3.1 Pro is the reasoning benchmark leader (GPQA, ARC-AGI-2) — Spud has to close these gaps
Claude Opus 4.7 disrupted SWE-bench Verified by jumping to 87.6%, forcing OpenAI to respond
For Spud to be a meaningful release (not just a point update like 5.1 → 5.2), it needs to beat GPT-5.4 on every benchmark and reclaim leadership in at least one category from Claude or Gemini.
Benchmark Scenario 1: Conservative (Incremental Improvement Over GPT-5.4)
Positioning: Defensive release. Close gaps on reasoning but don't stretch for SOTA.
Why this scenario happens: OpenAI prioritizes stability over peaks. The post-training timeline is compressed because GPT-5.4 shipped only six weeks before Spud finished pretraining, leaving less time for RLHF tuning.
Projected scores:
| Benchmark | GPT-5.5 (Scenario 1) | vs GPT-5.4 | vs Leader |
| --- | --- | --- | --- |
| SWE-bench Verified | 82-85% | +24-26pp | Still behind Opus 4.7 (87.6%) |
| SWE-bench Pro | 61-64% | +3-6pp | New SOTA |
| GPQA Diamond | 94.0-94.5% | +1-2pp | Ties Gemini 3.1 Pro |
| Terminal-Bench 2.0 | 64-68% | +4-8pp | Close to Opus 4.7 |
| ARC-AGI-2 | 75-78% | +2-5pp | Matches Gemini 3.1 Pro |
| HumanEval | 94-95% | +1-2pp | New SOTA |
Probability: Moderate-high. Matches OpenAI's pattern of GPT-5.1, 5.2, 5.3 — each adds ~2-3pp across benchmarks without dramatic leaps.
What developers should do: Treat Spud as a GPT-5.4 drop-in replacement. Migration cost is the only variable. See our migration checklist for Spud.
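If that is how it plays out, the swap should be a config change rather than a code change. Below is a minimal sketch of config-driven model selection against an OpenAI-compatible endpoint; the model IDs ("gpt-5.4", "gpt-5.5") and the gateway URL are placeholders, since the real identifiers won't exist until launch:

```python
import os
from openai import OpenAI  # standard OpenAI Python SDK (>=1.0)

# Read the model ID and endpoint from the environment so a GPT-5.4 -> GPT-5.5
# swap is a deployment-config change, not a code change.
# "gpt-5.4" and the base URL below are placeholders, not confirmed values.
MODEL_ID = os.environ.get("CHAT_MODEL_ID", "gpt-5.4")
BASE_URL = os.environ.get("OPENAI_BASE_URL", "https://api.openai.com/v1")

client = OpenAI(base_url=BASE_URL, api_key=os.environ["OPENAI_API_KEY"])

def complete(prompt: str) -> str:
    """Send a single-turn chat completion to whichever model the config names."""
    resp = client.chat.completions.create(
        model=MODEL_ID,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content or ""

if __name__ == "__main__":
    print(complete("Summarize SWE-bench Verified in one sentence."))
```

On release day the only change is `CHAT_MODEL_ID` in your deployment config; the application code never references a model version directly.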
Benchmark Scenario 2: Expected (Beat Claude Opus 4.7 on Coding)
Positioning: Reclaim coding leadership. Force Anthropic into a reactive 4.8 release.
Why this scenario happens: Anthropic just blindsided OpenAI by jumping nearly 7pp on SWE-bench Verified (80.8% → 87.6%) with Opus 4.7. OpenAI rarely lets a competitor hold a coding SOTA for more than one release cycle, and internal GPT-5 Codex variants give OpenAI a clear path to respond.
Projected scores:
| Benchmark | GPT-5.5 (Scenario 2) | vs GPT-5.4 | vs Leader |
| --- | --- | --- | --- |
| SWE-bench Verified | 88-91% | +30-33pp | New SOTA (beats Opus 4.7) |
| SWE-bench Pro | 65-70% | +7-13pp | New SOTA |
| GPQA Diamond | 94.5-95.5% | +2-3pp | New SOTA |
| Terminal-Bench 2.0 | 70-75% | +10-15pp | New SOTA |
| ARC-AGI-2 | 78-82% | +5-9pp | Beats Gemini 3.1 Pro |
| HumanEval | 95-97% | +2-4pp | New SOTA |
| Finance Agent | 66-70% | +8-12pp | New SOTA |
Probability: Highest of the three. Aligns with OpenAI's competitive posture and the narrative value of the "biggest capability jump since GPT-5.0," and is consistent with Polymarket's 70%+ odds of an April release.
What developers should do: Prepare a benchmarking harness now. On release day you'll want to rerun your internal evals within 24 hours. Teams using TokenMix.ai's model gateway can switch model IDs in config without touching application code.
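A re-run harness can be as small as a loop over a JSONL file of internal cases. The sketch below makes several assumptions: a hypothetical internal_evals.jsonl with prompt/expected fields, crude substring grading as a stand-in for whatever grader your evals actually use, and an EVAL_MODEL_ID environment variable you flip to the new (as yet unknown) model ID on release day:

```python
import json
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
# Placeholder IDs: flip EVAL_MODEL_ID to the real Spud identifier once it exists.
MODEL_ID = os.environ.get("EVAL_MODEL_ID", "gpt-5.4")

def run_eval(path: str = "internal_evals.jsonl") -> float:
    """Re-run a JSONL eval set ({"prompt": ..., "expected": ...} per line) and return the pass rate."""
    passed = total = 0
    with open(path) as f:
        for line in f:
            case = json.loads(line)
            resp = client.chat.completions.create(
                model=MODEL_ID,
                messages=[{"role": "user", "content": case["prompt"]}],
                temperature=0,
            )
            answer = resp.choices[0].message.content or ""
            passed += case["expected"] in answer  # substring grading; swap in your real grader
            total += 1
    return passed / total if total else 0.0

if __name__ == "__main__":
    print(f"{MODEL_ID}: pass rate {run_eval():.1%}")
```

Running the same file against GPT-5.4 today gives you the baseline to compare against within hours of the Spud release.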
Benchmark Scenario 3: Breakthrough (New SOTA Across the Board)
Positioning: Generational leap. Ships as "GPT-6" rather than "GPT-5.5".
Why this scenario happens: Pretraining required enough new compute and data that OpenAI decided to rebrand. Three signals would indicate this: (1) an unusually long post-training phase (8-12 weeks), (2) a live demo event instead of a quiet API release, and (3) the model gets a full new model card rather than an incremental one.
Projected scores:
| Benchmark | GPT-5.5 (Scenario 3) | vs GPT-5.4 | vs Leader |
| --- | --- | --- | --- |
| SWE-bench Verified | 92-95% | +34-37pp | Dominant SOTA |
| SWE-bench Pro | 72-78% | +15-21pp | Dominant SOTA |
| GPQA Diamond | 96-98% | +3-5pp | Near-saturation |
| Terminal-Bench 2.0 | 78-83% | +18-23pp | Dominant SOTA |
| ARC-AGI-2 | 85-90% | +12-17pp | Dominant SOTA |
| HumanEval | 98-99% | +5-6pp | Saturated |
| Finance Agent | 75-82% | +17-24pp | Dominant SOTA |
Probability: Low-moderate. OpenAI historically reserves step-function improvements for major version numbers (GPT-3 → GPT-4 → GPT-5). A 5.5 label with breakthrough scores would be off-pattern.
What developers should do: If this scenario materializes, expect pricing at 1.5-2× GPT-5.4's and tighter rate limits for the first 30 days. Have a fallback to GPT-5.4 ready for burst traffic. See our GPT-5.5 pricing prediction for cost modeling.
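For the burst-traffic fallback, one minimal pattern is to catch rate-limit errors on the new model and retry on GPT-5.4. A sketch using the OpenAI Python SDK; "gpt-5.5" is a placeholder, not a confirmed identifier:

```python
import os
from openai import OpenAI, RateLimitError

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

PRIMARY = "gpt-5.5"    # placeholder; real ID unknown until launch
FALLBACK = "gpt-5.4"   # placeholder for the current production model

def complete_with_fallback(prompt: str) -> tuple[str, str]:
    """Return (model_used, text); fall back to GPT-5.4 when the new model is rate-limited."""
    for model in (PRIMARY, FALLBACK):
        try:
            resp = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
            )
            return model, resp.choices[0].message.content or ""
        except RateLimitError:
            continue  # launch-window cap hit on this model; try the next one
    raise RuntimeError("both primary and fallback models are rate-limited")
```

In production you would add retries with backoff and log which model served each request, but the shape of the fallback stays the same.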
Full Projected Comparison Table
All three scenarios side by side against current market leaders:
Why the "Benchmarks Leak" Stories Are Mostly Noise
If you search "GPT-5.5 benchmarks leak" right now, you'll find a mix of three sources, all of which should be treated skeptically:
1. Twitter/X posts with screenshots. These almost always show synthetic data. Genuine benchmark leaks from OpenAI are rare and usually surface through The Information or direct OpenAI preprints on arXiv. A random Twitter screenshot with no source chain is not a leak.
2. Third-party "scores" from shadow testing. Some platforms claim to have tested Spud through leaked API access. OpenAI's pre-release testing is done on isolated infrastructure with hard-coded model IDs — you cannot "accidentally" hit a pretraining-only model.
3. Inference from embedding analysis. A few researchers publish guesses based on probing OpenAI's public APIs for behavior changes. This has historically been wrong 80%+ of the time — behavior drift during post-training is standard and doesn't predict final benchmark scores.
Rule of thumb: if a "leak" doesn't include (a) a full benchmark methodology, (b) a named source chain, and (c) an OpenAI response (even a no-comment), it's noise.
What to Watch on Release Day
When Spud drops — whether as GPT-5.5 or under a different name — five signals will tell you which scenario landed:
SWE-bench Verified score. If under 88%, Scenario 1. If 88-91%, Scenario 2. If 92%+, Scenario 3. (A quick classifier encoding these thresholds follows this list.)
Pricing vs GPT-5.4. Same price = incremental. 1.3-1.5× = expected. 2×+ = breakthrough (with premium positioning).
Context window. If unchanged from GPT-5.4's 272K, scenarios 1-2. If extended to 500K+, scenario 3.
Launch format. Silent API update = 1. Blog post + demo = 2. Live event = 3.
Claude response time. If Anthropic ships Opus 4.8 within 4 weeks, Spud landed hard enough (scenario 2 or stronger) to force a response.
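Taken together, the first signal alone gives a quick read. A throwaway helper that simply encodes the thresholds above:

```python
def classify_release(swe_bench_verified: float) -> int:
    """Map a reported SWE-bench Verified score (%) to the scenario it implies."""
    if swe_bench_verified >= 92:
        return 3  # breakthrough
    if swe_bench_verified >= 88:
        return 2  # expected: reclaims coding SOTA from Opus 4.7 (87.6%)
    return 1      # conservative

assert classify_release(90.1) == 2
```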
TokenMix.ai will publish same-day pricing comparisons, latency benchmarks, and a migration decision matrix within 24 hours of Spud release. Track our blog for the drop.
FAQ
When will GPT-5.5 benchmarks be officially released?
Benchmarks are typically published alongside the API release. Based on OpenAI's pattern with GPT-5.1 through GPT-5.4, expect a blog post with model card, benchmark table, and API pricing on the same day. Polymarket currently gives 70%+ odds on an April 2026 release, with May-June as the fallback window.
Will GPT-5.5 beat Claude Opus 4.7 on coding?
Most likely yes on SWE-bench Pro (where GPT-5.4 already leads), and probably yes on SWE-bench Verified if scenario 2 materializes. OpenAI historically doesn't let a competitor hold a coding benchmark SOTA for more than one release cycle, and Claude Opus 4.7 only shipped April 16, 2026 with 87.6% on SWE-bench Verified.
Will GPT-5.5 beat Gemini 3.1 Pro on GPQA Diamond?
Probably yes by 0.5-1.5pp. Gemini 3.1 Pro currently holds 94.3%. Saturation near 98% makes dramatic gains unlikely. Spud's more plausible lead comes on ARC-AGI-2 and SWE-bench, where headroom remains.
Are GPT-5.5 benchmark "leaks" I see on Twitter real?
Almost never. Genuine OpenAI leaks surface through The Information or arXiv preprints, not anonymous Twitter screenshots. If the "leak" lacks a methodology, a source chain, and an OpenAI response, treat it as synthetic.
Should I wait for Spud or use GPT-5.4 for my project today?
Use GPT-5.4 today if it meets your needs. Waiting on an unannounced release is bad engineering practice. Build with model abstraction — through TokenMix.ai's OpenAI-compatible gateway or any similar service — so you can switch to Spud the day it launches with a config change rather than a code rewrite. Our GPT-5.5 migration checklist covers this in 7 steps.
How will GPT-5.5 pricing compare to GPT-5.4?
See our full pricing prediction for three scenarios. In short: scenario 1 (status quo at $2.50/$5), scenario 2 (defensive cut to match Gemini 3.1 Pro at $2/$2), or scenario 3 (premium at $3-$5/$8-$25). Scenario 2 is most consistent with competitive dynamics.
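For rough cost modeling against those scenarios, a one-function estimate is enough. The sketch below assumes the quoted figures are USD per 1M input/output tokens (the unit isn't confirmed) and uses illustrative token volumes:

```python
# Hypothetical (input, output) prices per 1M tokens, taken from the three pricing
# scenarios above; both the unit and the exact figures are assumptions until
# OpenAI publishes official pricing.
SCENARIO_PRICES = {
    "scenario_1_status_quo": (2.50, 5.00),
    "scenario_2_defensive_cut": (2.00, 2.00),
    "scenario_3_premium": (5.00, 25.00),  # upper bound of the quoted range
}

def monthly_cost(input_tokens_m: float, output_tokens_m: float) -> dict[str, float]:
    """Estimate monthly spend in USD for each pricing scenario, given token volume in millions."""
    return {
        name: input_tokens_m * p_in + output_tokens_m * p_out
        for name, (p_in, p_out) in SCENARIO_PRICES.items()
    }

if __name__ == "__main__":
    # Example: 500M input tokens and 100M output tokens per month (illustrative volumes).
    for name, cost in monthly_cost(500, 100).items():
        print(f"{name}: ${cost:,.0f}/month")
```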