TokenMix Research Lab · 2026-04-21

GPT-5.5 Spud Benchmarks: Projected Scores vs GPT-5.4 & Claude 4.7 (2026)

GPT-5.5 — codenamed "Spud" — finished pretraining on March 24, 2026. No benchmarks have been released. No model card exists. What we have is the performance floor the market now demands: GPT-5.4 holds the coding lead with 57.7% on SWE-bench Pro, Claude Opus 4.7 jumped to 87.6% on SWE-bench Verified, and Gemini 3.1 Pro leads GPQA Diamond at 94.3%. This article models three benchmark scenarios for Spud based on real competitor data — what it must beat, what it is likely to hit, and what would justify a price premium. TokenMix.ai tracks benchmark movement across 300+ models in production and will publish Spud scores within 24 hours of release.

Confirmed vs Speculation: What We Actually Know About Spud

| Claim | Status | Source |
|---|---|---|
| Pretraining finished March 24, 2026 | Confirmed | The Information, corroborated by multiple industry trackers |
| Internal codename "Spud" | Confirmed | Multiple corroborated reports |
| GPT-5.4 shipped March 5, 2026 at $2.50/$5 per MTok | Confirmed | OpenAI API docs |
| Will be called "GPT-5.5" | Unverified | Community assumption |
| Benchmark scores above GPT-5.4 | Likely | Historical pattern (every major OpenAI release beats its predecessor) |
| Specific GPQA/SWE-bench numbers | Not leaked | No credible source has published figures |
| Release before June 2026 | Likely | Post-training typically takes 4-12 weeks; GPT-5.4 took ~6 weeks |

Bottom line: any article claiming specific Spud benchmark numbers as fact is fabricating. The model is still in post-training (RLHF, red-teaming, safety eval). Benchmarks only stabilize in the final 2-3 weeks before release.

What we can do is project plausible scenarios based on real competitor scores and OpenAI's release pattern.

The Competitor Floor: What GPT-5.5 Must Beat

Before projecting Spud, establish the floor. These are publicly reported scores from Q1-Q2 2026:

| Benchmark | GPT-5.4 | Claude Opus 4.7 | Gemini 3.1 Pro | Current SOTA |
|---|---|---|---|---|
| SWE-bench Verified | 58.7% | 87.6% | 80.6% | Claude Opus 4.7 |
| SWE-bench Pro | 57.7% | 54.2% (est) | ~52% (est) | GPT-5.4 |
| GPQA Diamond | 92.8% | 94.2% | 94.3% | Gemini 3.1 Pro |
| Terminal-Bench 2.0 | ~60% (est) | 69.4% | ~58% (est) | Claude Opus 4.7 |
| ARC-AGI-2 | 73.3% | ~71% (est) | 77.1% | Gemini 3.1 Pro |
| HumanEval | 93.1% | ~92% | ~92% | GPT-5.4 |
| Finance Agent | ~58% (est) | 64.4% | ~55% (est) | Claude Opus 4.7 |
| OSWorld (computer use) | 75% | ~68% (est) | ~60% (est) | GPT-5.4 |
| GDPval (knowledge work) | 83% | ~78% (est) | ~75% (est) | GPT-5.4 |
Sources: Anthropic Opus 4.7 launch, Google Gemini 3.1 Pro benchmarks, OpenAI GPT-5.4 docs, Vellum LLM leaderboard.

Key observation:

For Spud to be a meaningful release (not just a point update like 5.1 → 5.2), it needs to beat GPT-5.4 on every benchmark and reclaim at least one category lead from Claude or Gemini.

Benchmark Scenario 1: Incremental (Match Gemini 3.1 Pro)

Positioning: Defensive release. Close gaps on reasoning but don't stretch for SOTA.

Why this scenario happens: OpenAI prioritizes stability over peaks. Post-training timelines are compressed because GPT-5.4 shipped less than three weeks before Spud's pretraining finished, leaving less time for RLHF tuning.

Projected scores:

| Benchmark | GPT-5.5 (Scenario 1) | vs GPT-5.4 | vs Leader |
|---|---|---|---|
| SWE-bench Verified | 82-85% | +23-26pp | Still behind Opus 4.7 (87.6%) |
| SWE-bench Pro | 61-64% | +3-6pp | New SOTA |
| GPQA Diamond | 94.0-94.5% | +1-2pp | Ties Gemini 3.1 Pro |
| Terminal-Bench 2.0 | 64-68% | +4-8pp | Close to Opus 4.7 |
| ARC-AGI-2 | 75-78% | +2-5pp | Matches Gemini 3.1 Pro |
| HumanEval | 94-95% | +1-2pp | New SOTA |

Probability: Moderate-high. Matches OpenAI's pattern of GPT-5.1, 5.2, 5.3 — each adds ~2-3pp across benchmarks without dramatic leaps.

What developers should do: Treat Spud as a GPT-5.4 drop-in replacement. Migration cost is the only variable. See our migration checklist for Spud.
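
If Spud really is a drop-in replacement, the migration can be reduced to a config change. A minimal sketch, assuming an OpenAI-style chat payload; the `CHAT_MODEL_ID` variable and both model IDs here are placeholders, not confirmed identifiers:

```python
import os

# Resolve the model ID from the environment so a Spud rollout is a
# config change, not a code change. "gpt-5.4" is the current default;
# "gpt-5.5" is hypothetical until OpenAI publishes a real ID.
MODEL_ID = os.environ.get("CHAT_MODEL_ID", "gpt-5.4")

def build_request(prompt: str) -> dict:
    """Assemble a chat-completion payload for whichever model is configured."""
    return {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
    }
```

On release day, setting `CHAT_MODEL_ID` to whatever ID OpenAI publishes rolls the whole application over without touching application code.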

Benchmark Scenario 2: Expected (Beat Claude Opus 4.7 on Coding)

Positioning: Reclaim coding leadership. Force Anthropic into a reactive 4.8 release.

Why this scenario happens: Anthropic just blindsided OpenAI by jumping SWE-bench Verified 7pp (80.8% → 87.6%) in Opus 4.7. OpenAI rarely lets a competitor hold a coding SOTA for more than one release cycle. Internal GPT-5 codex variants give OpenAI a clear pathway.

Projected scores:

| Benchmark | GPT-5.5 (Scenario 2) | vs GPT-5.4 | vs Leader |
|---|---|---|---|
| SWE-bench Verified | 88-91% | +29-32pp | New SOTA (beats Opus 4.7) |
| SWE-bench Pro | 65-70% | +7-12pp | New SOTA |
| GPQA Diamond | 94.5-95.5% | +2-3pp | New SOTA |
| Terminal-Bench 2.0 | 70-75% | +10-15pp | New SOTA |
| ARC-AGI-2 | 78-82% | +5-9pp | Beats Gemini 3.1 Pro |
| HumanEval | 95-97% | +2-4pp | New SOTA |
| Finance Agent | 66-70% | +8-12pp | New SOTA |

Probability: Highest. Aligns with OpenAI's competitive posture and the narrative value of "biggest capability jump since GPT-5.0". Matches Polymarket odds of 70%+ for April release.

What developers should do: Prepare a benchmarking harness now. On release day you'll want to rerun your internal evals within 24 hours. Teams using TokenMix.ai's model gateway can switch model IDs in config without touching application code.
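
A harness can be as small as the sketch below: a list of (prompt, expected) cases run through any model callable, so the same evals rerun against Spud the hour it ships. Everything here is hypothetical scaffolding; the stub stands in for a real API client:

```python
from typing import Callable

def run_eval(cases: list[tuple[str, str]], model: Callable[[str], str]) -> float:
    """Return the fraction of cases whose expected answer appears in the model output."""
    passed = sum(1 for prompt, expected in cases if expected in model(prompt))
    return passed / len(cases)

# Stub standing in for an API call; swap in a real client on release day.
def stub_model(prompt: str) -> str:
    return "4" if "2+2" in prompt else "no idea"

cases = [("What is 2+2?", "4"), ("Capital of France?", "Paris")]
rate = run_eval(cases, stub_model)  # 0.5 with this stub
```

Swapping `stub_model` for a function that calls the new model ID is then the only release-day change.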

Benchmark Scenario 3: Breakthrough (New SOTA Across the Board)

Positioning: Generational leap. Ships as "GPT-6" rather than "GPT-5.5".

Why this scenario happens: Pretraining required enough new compute and data that OpenAI decided to rebrand. Three signals would indicate this: (1) an unusually long post-training phase (8-12 weeks), (2) a live demo event instead of a quiet API release, and (3) the model gets a full new model card rather than an incremental one.

Projected scores:

| Benchmark | GPT-5.5 (Scenario 3) | vs GPT-5.4 | vs Leader |
|---|---|---|---|
| SWE-bench Verified | 92-95% | +33-36pp | Dominant SOTA |
| SWE-bench Pro | 72-78% | +14-20pp | Dominant SOTA |
| GPQA Diamond | 96-98% | +3-5pp | Near-saturation |
| Terminal-Bench 2.0 | 78-83% | +18-23pp | Dominant SOTA |
| ARC-AGI-2 | 85-90% | +12-17pp | Dominant SOTA |
| HumanEval | 98-99% | +5-6pp | Saturated |
| Finance Agent | 75-82% | +17-24pp | Dominant SOTA |

Probability: Low-moderate. OpenAI historically reserves step-function improvements for major version numbers (GPT-3 → GPT-4 → GPT-5). A 5.5 label with breakthrough scores would be off-pattern.

What developers should do: If this scenario materializes, expect pricing to be 1.5-2× higher than GPT-5.4, and rate limits tighter for the first 30 days. Have a fallback to GPT-5.4 ready for burst traffic. See our GPT-5.5 pricing prediction for cost modeling.
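
The fallback itself is a few lines. A sketch with stub callables in place of real clients; the error type and model behavior are assumptions for illustration:

```python
def call_with_fallback(prompt, primary, fallback):
    """Try the newest model first; degrade to the previous generation on any failure."""
    try:
        return primary(prompt)
    except Exception:
        # Covers rate limits and availability errors in the first weeks after launch.
        return fallback(prompt)

# Stubs standing in for real API clients.
def spud_client(prompt):
    raise RuntimeError("rate_limit_exceeded")  # simulate a 429 from the new model

def gpt54_client(prompt):
    return f"gpt-5.4: {prompt}"

answer = call_with_fallback("summarize this ticket", spud_client, gpt54_client)
```

In production you would narrow the `except` to the client library's rate-limit and availability errors rather than catching everything.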

Full Projected Comparison Table

All three scenarios side by side against current market leaders:

| Benchmark | GPT-5.4 | Opus 4.7 | Gemini 3.1 | GPT-5.5 S1 | GPT-5.5 S2 | GPT-5.5 S3 |
|---|---|---|---|---|---|---|
| SWE-bench Verified | 58.7% | 87.6% | 80.6% | 82-85% | 88-91% | 92-95% |
| SWE-bench Pro | 57.7% | 54.2% | ~52% | 61-64% | 65-70% | 72-78% |
| GPQA Diamond | 92.8% | 94.2% | 94.3% | 94.0-94.5% | 94.5-95.5% | 96-98% |
| Terminal-Bench 2.0 | ~60% | 69.4% | ~58% | 64-68% | 70-75% | 78-83% |
| ARC-AGI-2 | 73.3% | ~71% | 77.1% | 75-78% | 78-82% | 85-90% |
| HumanEval | 93.1% | ~92% | ~92% | 94-95% | 95-97% | 98-99% |
| Finance Agent | ~58% | 64.4% | ~55% | 62-66% | 66-70% | 75-82% |
| OSWorld | 75% | ~68% | ~60% | 76-79% | 79-83% | 85-90% |

Scenario legend: S1 = Incremental, S2 = Expected (most likely), S3 = Breakthrough. Values marked ~ are estimates from the competitor table above.

Why the "Benchmarks Leak" Stories Are Mostly Noise

If you search "GPT-5.5 benchmarks leak" right now, you'll find a mix of three sources, all of which should be treated skeptically:

1. Twitter/X posts with screenshots. These almost always show synthetic data. Genuine benchmark leaks from OpenAI are rare and usually surface through The Information or direct OpenAI preprints on arXiv. A random Twitter screenshot with no source chain is not a leak.

2. Third-party "scores" from shadow testing. Some platforms claim to have tested Spud through leaked API access. OpenAI's pre-release testing is done on isolated infrastructure with hard-coded model IDs — you cannot "accidentally" hit a pretraining-only model.

3. Inference from embedding analysis. A few researchers publish guesses based on probing OpenAI's public APIs for behavior changes. This has historically been wrong 80%+ of the time — behavior drift during post-training is standard and doesn't predict final benchmark scores.

Rule of thumb: if a "leak" doesn't include (a) a full benchmark methodology, (b) a named source chain, and (c) an OpenAI response (even a no-comment), it's noise.

What to Watch on Release Day

When Spud drops — whether as GPT-5.5 or under a different name — five signals will tell you which scenario landed:

  1. SWE-bench Verified score. If under 88%, Scenario 1. If 88-91%, Scenario 2. If 92%+, Scenario 3.
  2. Pricing vs GPT-5.4. Same price = incremental. 1.3-1.5× = expected. 2×+ = breakthrough (with premium positioning).
  3. Context window. If unchanged from GPT-5.4's 272K, scenarios 1-2. If extended to 500K+, scenario 3.
  4. Launch format. Silent API update = 1. Blog post + demo = 2. Live event = 3.
  5. Claude response time. If Anthropic ships Opus 4.8 within 4 weeks, Spud landed hard enough (scenario 2 or better) to force a response.
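
The first signal is mechanical enough to encode. A sketch of the SWE-bench Verified thresholds above; the cutoffs are this article's projections, not anything OpenAI has published:

```python
def scenario_from_swebench(verified_score: float) -> int:
    """Map a SWE-bench Verified score (%) to the release scenario it signals."""
    if verified_score >= 92:
        return 3  # breakthrough
    if verified_score >= 88:
        return 2  # expected
    return 1  # incremental
```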

TokenMix.ai will publish same-day pricing comparisons, latency benchmarks, and a migration decision matrix within 24 hours of Spud release. Track our blog for the drop.

FAQ

When will GPT-5.5 benchmarks be officially released?

Benchmarks are typically published alongside the API release. Based on OpenAI's pattern with GPT-5.1 through GPT-5.4, expect a blog post with model card, benchmark table, and API pricing on the same day. Polymarket currently gives 70%+ odds on an April 2026 release, with May-June as the fallback window.

Will GPT-5.5 beat Claude Opus 4.7 on coding?

Most likely yes on SWE-bench Pro (where GPT-5.4 already leads), and probably yes on SWE-bench Verified if scenario 2 materializes. OpenAI historically doesn't let a competitor hold a coding benchmark SOTA for more than one release cycle, and Claude Opus 4.7 only shipped April 16, 2026 with 87.6% on SWE-bench Verified.

Will GPT-5.5 beat Gemini 3.1 Pro on GPQA Diamond?

Probably yes by 0.5-1.5pp. Gemini 3.1 Pro currently holds 94.3%. Saturation near 98% makes dramatic gains unlikely. Spud's more plausible lead comes on ARC-AGI-2 and SWE-bench, where headroom remains.

Are GPT-5.5 benchmark "leaks" I see on Twitter real?

Almost never. Genuine OpenAI leaks surface through The Information or arXiv preprints, not anonymous Twitter screenshots. If the "leak" lacks a methodology, a source chain, and an OpenAI response, treat it as synthetic.

Should I wait for Spud or use GPT-5.4 for my project today?

Use GPT-5.4 today if it meets your needs. Waiting on an unannounced release is bad engineering practice. Build with model abstraction — through TokenMix.ai's OpenAI-compatible gateway or any similar service — so you can switch to Spud the day it launches with a config change rather than a code rewrite. Our GPT-5.5 migration checklist covers this in 7 steps.

How will GPT-5.5 pricing compare to GPT-5.4?

See our full pricing prediction for three scenarios. In short: pricing scenario 1 holds the status quo at $2.50/$5 per MTok, pricing scenario 2 is a defensive cut to match Gemini 3.1 Pro at $2/$2, and pricing scenario 3 is a premium tier at $3-$5/$8-$25. Scenario 2 is most consistent with the competitive dynamics.
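
To budget against those possibilities, a one-line cost model is enough. The prices below are this article's speculative scenarios, not published rates; the premium entry uses range midpoints as an assumption:

```python
def monthly_cost(input_mtok: float, output_mtok: float,
                 price_in: float, price_out: float) -> float:
    """Dollar cost for a month of traffic at per-MTok input/output prices."""
    return input_mtok * price_in + output_mtok * price_out

# Hypothetical per-MTok (input, output) prices for the three pricing scenarios.
pricing = {
    "status_quo": (2.50, 5.00),     # scenario 1: hold GPT-5.4 pricing
    "defensive_cut": (2.00, 2.00),  # scenario 2: match Gemini 3.1 Pro
    "premium": (4.00, 16.50),       # scenario 3: midpoints of $3-$5 and $8-$25
}

# e.g. 100 MTok in / 50 MTok out per month under the status quo:
baseline = monthly_cost(100, 50, *pricing["status_quo"])  # 500.0 dollars
```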



By TokenMix Research Lab · Updated 2026-04-22