TokenMix Research Lab · 2026-04-21
GPT-5.5 Spud Benchmarks: Projected Scores vs GPT-5.4 & Claude 4.7 (2026)
Last Updated: 2026-04-28
Author: TokenMix Research Lab
GPT-5.5 — codenamed "Spud" — finished pretraining on March 24, 2026. No benchmarks have been released. No model card exists. What we have is the performance floor the market now demands: GPT-5.4 holds coding with 57.7% on SWE-bench Pro, Claude Opus 4.7 jumped to 87.6% on SWE-bench Verified, and Gemini 3.1 Pro leads GPQA Diamond at 94.3%. This article models three benchmark scenarios for Spud based on real competitor data — what it must beat, what it is likely to hit, and what would justify a price premium. TokenMix.ai tracks benchmark movement across 300+ models in production and will publish Spud scores within 24 hours of release.
Table of Contents
- Confirmed vs Speculation: What We Actually Know About Spud
- The Competitor Floor: What GPT-5.5 Must Beat
- Benchmark Scenario 1: Incremental (Match Gemini 3.1 Pro)
- Benchmark Scenario 2: Expected (Beat Claude Opus 4.7 on Coding)
- Benchmark Scenario 3: Breakthrough (New SOTA Across the Board)
- Full Projected Comparison Table
- Why the "Benchmarks Leak" Stories Are Mostly Noise
- What to Watch on Release Day
- FAQ
Confirmed vs Speculation: What We Actually Know About Spud
| Claim | Status | Source |
|---|---|---|
| Pretraining finished March 24, 2026 | Confirmed | The Information, corroborated by multiple industry trackers |
| Internal codename "Spud" | Confirmed | Multiple corroborated reports |
| GPT-5.4 shipped March 5, 2026 at $2.50/$15 per MTok | Confirmed | OpenAI API docs |
| Will be called "GPT-5.5" | Unverified | Community assumption |
| Benchmark scores above GPT-5.4 | Likely | Historical pattern (every major OpenAI release beats prior) |
| Specific GPQA/SWE-bench numbers | Not leaked | No credible source has published figures |
| Release before June 2026 | Likely | Post-training typically 4-12 weeks, GPT-5.4 took ~6 weeks |
Bottom line: any article claiming specific Spud benchmark numbers as fact is fabricating. The model is still in post-training (RLHF, red-teaming, safety eval). Benchmarks only stabilize in the final 2-3 weeks before release.
What we can do is project plausible scenarios based on real competitor scores and OpenAI's release pattern.
The Competitor Floor: What GPT-5.5 Must Beat
Before projecting Spud, establish the floor. These are publicly reported scores from Q1-Q2 2026:
| Benchmark | GPT-5.4 | Claude Opus 4.7 | Gemini 3.1 Pro | Current SOTA |
|---|---|---|---|---|
| SWE-bench Verified | 58.7% | 87.6% | 80.6% | Claude Opus 4.7 |
| SWE-bench Pro | 57.7% | 54.2% (est) | ~52% (est) | GPT-5.4 |
| GPQA Diamond | 92.8% | 94.2% | 94.3% | Gemini 3.1 Pro |
| Terminal-Bench 2.0 | ~60% (est) | 69.4% | ~58% (est) | Claude Opus 4.7 |
| ARC-AGI-2 | 73.3% | ~71% (est) | 77.1% | Gemini 3.1 Pro |
| HumanEval | 93.1% | ~92% | ~92% | GPT-5.4 |
| Finance Agent | ~58% (est) | 64.4% | ~55% (est) | Claude Opus 4.7 |
| OSWorld (computer use) | 75% | ~68% (est) | ~60% (est) | GPT-5.4 |
| GDPval (knowledge work) | 83% | ~78% (est) | ~75% (est) | GPT-5.4 |
Sources: Anthropic Opus 4.7 launch, Google Gemini 3.1 Pro benchmarks, OpenAI GPT-5.4 docs, Vellum LLM leaderboard.
Key observations:
- GPT-5.4 still leads on 3 benchmarks (SWE-bench Pro, HumanEval, OSWorld, GDPval) — OpenAI can't ship a worse model
- Coding is OpenAI's moat — they'll protect SWE-bench Pro and HumanEval dominance
- Gemini 3.1 Pro is the reasoning benchmark leader (GPQA, ARC-AGI-2) — Spud has to close these gaps
- Claude Opus 4.7 disrupted SWE-bench Verified jumping to 87.6%, forcing OpenAI to respond
For Spud to be a meaningful release (not just a point update like 5.1 → 5.2), it needs to beat GPT-5.4 on every benchmark and reclaim at least one category leadership from Claude/Gemini.
Benchmark Scenario 1: Incremental (Match Gemini 3.1 Pro)
Positioning: Defensive release. Close gaps on reasoning but don't stretch for SOTA.
Why this scenario happens: OpenAI prioritizes stability over peaks. Post-training timelines compressed because GPT-5.4 only shipped 6 weeks before Spud pretraining finished. Less time for RLHF tuning.
Projected scores:
| Benchmark | GPT-5.5 (Scenario 1) | vs GPT-5.4 | vs Leader |
|---|---|---|---|
| SWE-bench Verified | 82-85% | +24-26pp | Still behind Opus 4.7 (87.6%) |
| SWE-bench Pro | 61-64% | +3-6pp | New SOTA |
| GPQA Diamond | 94.0-94.5% | +1-2pp | Ties Gemini 3.1 Pro |
| Terminal-Bench 2.0 | 64-68% | +4-8pp | Close to Opus 4.7 |
| ARC-AGI-2 | 75-78% | +2-5pp | Matches Gemini 3.1 Pro |
| HumanEval | 94-95% | +1-2pp | New SOTA |
Probability: Moderate-high. Matches OpenAI's pattern of GPT-5.1, 5.2, 5.3 — each adds ~2-3pp across benchmarks without dramatic leaps.
What developers should do: Treat Spud as a GPT-5.4 drop-in replacement. Migration cost is the only variable. See our migration checklist for Spud.
Benchmark Scenario 2: Expected (Beat Claude Opus 4.7 on Coding)
Positioning: Reclaim coding leadership. Force Anthropic into a reactive 4.8 release.
Why this scenario happens: Anthropic just blindsided OpenAI by jumping SWE-bench Verified 7pp (80.8% → 87.6%) in Opus 4.7. OpenAI rarely lets a competitor hold a coding SOTA for more than one release cycle. Internal GPT-5 codex variants give OpenAI a clear pathway.
Projected scores:
| Benchmark | GPT-5.5 (Scenario 2) | vs GPT-5.4 | vs Leader |
|---|---|---|---|
| SWE-bench Verified | 88-91% | +30-33pp | New SOTA (beats Opus 4.7) |
| SWE-bench Pro | 65-70% | +7-13pp | New SOTA |
| GPQA Diamond | 94.5-95.5% | +2-3pp | New SOTA |
| Terminal-Bench 2.0 | 70-75% | +10-15pp | New SOTA |
| ARC-AGI-2 | 78-82% | +5-9pp | Beats Gemini 3.1 Pro |
| HumanEval | 95-97% | +2-4pp | New SOTA |
| Finance Agent | 66-70% | +8-12pp | New SOTA |
Probability: Highest. Aligns with OpenAI's competitive posture and the narrative value of "biggest capability jump since GPT-5.0". Matches Polymarket odds of 70%+ for April release.
What developers should do: Prepare a benchmarking harness now. On release day you'll want to rerun your internal evals within 24 hours. Teams using TokenMix.ai's model gateway can switch model IDs in config without touching application code.
Benchmark Scenario 3: Breakthrough (New SOTA Across the Board)
Positioning: Generational leap. Ships as "GPT-6" rather than "GPT-5.5".
Why this scenario happens: Pretraining required enough new compute and data that OpenAI decided to rebrand. Three signals would indicate this: (1) an unusually long post-training phase (8-12 weeks), (2) a live demo event instead of a quiet API release, and (3) the model gets a full new model card rather than an incremental one.
Projected scores:
| Benchmark | GPT-5.5 (Scenario 3) | vs GPT-5.4 | vs Leader |
|---|---|---|---|
| SWE-bench Verified | 92-95% | +34-37pp | Dominant SOTA |
| SWE-bench Pro | 72-78% | +15-21pp | Dominant SOTA |
| GPQA Diamond | 96-98% | +3-5pp | Near-saturation |
| Terminal-Bench 2.0 | 78-83% | +18-23pp | Dominant SOTA |
| ARC-AGI-2 | 85-90% | +12-17pp | Dominant SOTA |
| HumanEval | 98-99% | +5-6pp | Saturated |
| Finance Agent | 75-82% | +17-24pp | Dominant SOTA |
Probability: Low-moderate. OpenAI historically reserves step-function improvements for major version numbers (GPT-3 → GPT-4 → GPT-5). A 5.5 label with breakthrough scores would be off-pattern.
What developers should do: If this scenario materializes, expect pricing to be 1.5-2× higher than GPT-5.4, and rate limits tighter for the first 30 days. Have a fallback to GPT-5.4 ready for burst traffic. See our GPT-5.5 pricing prediction for cost modeling.
Full Projected Comparison Table
All three scenarios side by side against current market leaders:
| Benchmark | GPT-5.4 | Opus 4.7 | Gemini 3.1 | GPT-5.5 S1 | GPT-5.5 S2 | GPT-5.5 S3 |
|---|---|---|---|---|---|---|
| SWE-bench Verified | 58.7% | 87.6% | 80.6% | 82-85% | 88-91% | 92-95% |
| SWE-bench Pro | 57.7% | 54.2% | 52% | 61-64% | 65-70% | 72-78% |
| GPQA Diamond | 92.8% | 94.2% | 94.3% | 94.0-94.5% | 94.5-95.5% | 96-98% |
| Terminal-Bench 2.0 | 60% | 69.4% | 58% | 64-68% | 70-75% | 78-83% |
| ARC-AGI-2 | 73.3% | 71% | 77.1% | 75-78% | 78-82% | 85-90% |
| HumanEval | 93.1% | 92% | 92% | 94-95% | 95-97% | 98-99% |
| Finance Agent | 58% | 64.4% | 55% | 62-66% | 66-70% | 75-82% |
| OSWorld | 75% | 68% | 60% | 76-79% | 79-83% | 85-90% |
Scenarios legend: S1=Incremental, S2=Expected (most likely), S3=Breakthrough.
Why the "Benchmarks Leak" Stories Are Mostly Noise
If you search "GPT-5.5 benchmarks leak" right now, you'll find a mix of three sources, all of which should be treated skeptically:
1. Twitter/X posts with screenshots. These almost always show synthetic data. Genuine benchmark leaks from OpenAI are rare and usually surface through The Information or direct OpenAI preprints on arXiv. A random Twitter screenshot with no source chain is not a leak.
2. Third-party "scores" from shadow testing. Some platforms claim to have tested Spud through leaked API access. OpenAI's pre-release testing is done on isolated infrastructure with hard-coded model IDs — you cannot "accidentally" hit a pretraining-only model.
3. Inference from embedding analysis. A few researchers publish guesses based on probing OpenAI's public APIs for behavior changes. This has historically been wrong 80%+ of the time — behavior drift during post-training is standard and doesn't predict final benchmark scores.
Rule of thumb: if a "leak" doesn't include (a) a full benchmark methodology, (b) a named source chain, and (c) an OpenAI response (even a no-comment), it's noise.
What to Watch on Release Day
When Spud drops — whether as GPT-5.5 or under a different name — five signals will tell you which scenario landed:
- SWE-bench Verified score. If under 88%, Scenario 1. If 88-91%, Scenario 2. If 92%+, Scenario 3.
- Pricing vs GPT-5.4. Same price = incremental. 1.3-1.5× = expected. 2×+ = breakthrough (with premium positioning).
- Context window. If unchanged from GPT-5.4's 272K, scenarios 1-2. If extended to 500K+, scenario 3.
- Launch format. Silent API update = 1. Blog post + demo = 2. Live event = 3.
- Claude response time. Anthropic ships Opus 4.8 within 4 weeks = scenario 2+ hit hard enough to force response.
TokenMix.ai will publish same-day pricing comparisons, latency benchmarks, and a migration decision matrix within 24 hours of Spud release. Track our blog for the drop.
FAQ
When will GPT-5.5 benchmarks be officially released?
Benchmarks are typically published alongside the API release. Based on OpenAI's pattern with GPT-5.1 through GPT-5.4, expect a blog post with model card, benchmark table, and API pricing on the same day. Polymarket currently gives 70%+ odds on an April 2026 release, with May-June as the fallback window.
Will GPT-5.5 beat Claude Opus 4.7 on coding?
Most likely yes on SWE-bench Pro (where GPT-5.4 already leads), and probably yes on SWE-bench Verified if scenario 2 materializes. OpenAI historically doesn't let a competitor hold a coding benchmark SOTA for more than one release cycle, and Claude Opus 4.7 only shipped April 16, 2026 with 87.6% on SWE-bench Verified.
Will GPT-5.5 beat Gemini 3.1 Pro on GPQA Diamond?
Probably yes by 0.5-1.5pp. Gemini 3.1 Pro currently holds 94.3%. Saturation near 98% makes dramatic gains unlikely. Spud's more plausible lead comes on ARC-AGI-2 and SWE-bench, where headroom remains.
Are GPT-5.5 benchmark "leaks" I see on Twitter real?
Almost never. Genuine OpenAI leaks surface through The Information or arXiv preprints, not anonymous Twitter screenshots. If the "leak" lacks a methodology, a source chain, and an OpenAI response, treat it as synthetic.
Should I wait for Spud or use GPT-5.4 for my project today?
Use GPT-5.4 today if it meets your needs. Waiting on an unannounced release is bad engineering practice. Build with model abstraction — through TokenMix.ai's OpenAI-compatible gateway or any similar service — so you can switch to Spud the day it launches with a config change rather than a code rewrite. Our GPT-5.5 migration checklist covers this in 7 steps.
How will GPT-5.5 pricing compare to GPT-5.4?
See our full pricing prediction for three scenarios. In short: scenario 1 (status quo at $2.50/$15), scenario 2 (defensive cut to match Gemini 3.1 Pro at $2/$12), or scenario 3 (premium at $3-$5/$18-$25). Scenario 2 is most consistent with competitive dynamics.
Sources
- Claude Opus 4.7 launch — Anthropic
- Gemini 3.1 Pro benchmarks — Google DeepMind
- OpenAI API pricing (current GPT-5.4)
- Vellum LLM Leaderboard 2026
- GPT-5.5 Release Date analysis — TokenMix
- The Information — OpenAI model tracking
- Artificial Analysis — Claude Opus 4.7 profile
By TokenMix Research Lab · Updated 2026-04-22