TokenMix Research Lab · 2026-04-21

GPT-5.5 Spud Benchmarks: Projected Scores vs GPT-5.4 & Claude 4.7 (2026)

GPT-5.5 — codenamed "Spud" — finished pretraining on March 24, 2026. No benchmarks have been released. No model card exists. What we have is the performance floor the market now demands: GPT-5.4 holds the coding lead with 57.7% on SWE-bench Pro, Claude Opus 4.7 jumped to 87.6% on SWE-bench Verified, and Gemini 3.1 Pro leads GPQA Diamond at 94.3%. This article models three benchmark scenarios for Spud based on real competitor data — what it must beat, what it is likely to hit, and what would justify a price premium. TokenMix.ai tracks benchmark movement across 300+ models in production and will publish Spud scores within 24 hours of release.

Confirmed vs Speculation: What We Actually Know About Spud

| Claim | Status | Source |
|---|---|---|
| Pretraining finished March 24, 2026 | Confirmed | The Information, corroborated by multiple industry trackers |
| Internal codename "Spud" | Confirmed | Multiple corroborated reports |
| GPT-5.4 shipped March 5, 2026 at $2.50/$5 per MTok | Confirmed | OpenAI API docs |
| Will be called "GPT-5.5" | Unverified | Community assumption |
| Benchmark scores above GPT-5.4 | Likely | Historical pattern (every major OpenAI release beats its predecessor) |
| Specific GPQA/SWE-bench numbers | Not leaked | No credible source has published figures |
| Release before June 2026 | Likely | Post-training typically takes 4-12 weeks; GPT-5.4 took ~6 weeks |

Bottom line: any article claiming specific Spud benchmark numbers as fact is fabricating. The model is still in post-training (RLHF, red-teaming, safety eval). Benchmarks only stabilize in the final 2-3 weeks before release.

What we can do is project plausible scenarios based on real competitor scores and OpenAI's release pattern.

The Competitor Floor: What GPT-5.5 Must Beat

Before projecting Spud, establish the floor. These are publicly reported scores from Q1-Q2 2026:

| Benchmark | GPT-5.4 | Claude Opus 4.7 | Gemini 3.1 Pro | Current SOTA |
|---|---|---|---|---|
| SWE-bench Verified | 58.7% | 87.6% | 80.6% | Claude Opus 4.7 |
| SWE-bench Pro | 57.7% | 54.2% (est) | ~52% (est) | GPT-5.4 |
| GPQA Diamond | 92.8% | 94.2% | 94.3% | Gemini 3.1 Pro |
| Terminal-Bench 2.0 | ~60% (est) | 69.4% | ~58% (est) | Claude Opus 4.7 |
| ARC-AGI-2 | 73.3% | ~71% (est) | 77.1% | Gemini 3.1 Pro |
| HumanEval | 93.1% | ~92% | ~92% | GPT-5.4 |
| Finance Agent | ~58% (est) | 64.4% | ~55% (est) | Claude Opus 4.7 |
| OSWorld (computer use) | 75% | ~68% (est) | ~60% (est) | GPT-5.4 |
| GDPval (knowledge work) | 83% | ~78% (est) | ~75% (est) | GPT-5.4 |
Sources: Anthropic Opus 4.7 launch, Google Gemini 3.1 Pro benchmarks, OpenAI GPT-5.4 docs, Vellum LLM leaderboard.

Key observation:

For Spud to be a meaningful release (not just a point update like 5.1 → 5.2), it needs to beat GPT-5.4 on every benchmark and reclaim at least one category lead from Claude or Gemini.

Benchmark Scenario 1: Incremental (Match Gemini 3.1 Pro)

Positioning: Defensive release. Close gaps on reasoning but don't stretch for SOTA.

Why this scenario happens: OpenAI prioritizes stability over peaks. Post-training timelines are compressed because GPT-5.4 shipped less than three weeks before Spud's pretraining finished, leaving less time for RLHF tuning.

Projected scores:

| Benchmark | GPT-5.5 (Scenario 1) | vs GPT-5.4 | vs Leader |
|---|---|---|---|
| SWE-bench Verified | 82-85% | +23-26pp | Still behind Opus 4.7 (87.6%) |
| SWE-bench Pro | 61-64% | +3-6pp | New SOTA |
| GPQA Diamond | 94.0-94.5% | +1-2pp | Ties Gemini 3.1 Pro |
| Terminal-Bench 2.0 | 64-68% | +4-8pp | Close to Opus 4.7 |
| ARC-AGI-2 | 75-78% | +2-5pp | Matches Gemini 3.1 Pro |
| HumanEval | 94-95% | +1-2pp | New SOTA |

Probability: Moderate-high. Matches OpenAI's pattern of GPT-5.1, 5.2, 5.3 — each adds ~2-3pp across benchmarks without dramatic leaps.

What developers should do: Treat Spud as a GPT-5.4 drop-in replacement. Migration cost is the only variable. See our migration checklist for Spud.
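
If Spud really is a drop-in replacement, the migration can be reduced to a config change. A minimal sketch, assuming an OpenAI-style chat payload; the `CHAT_MODEL_ID` variable and both model IDs here are placeholders, not confirmed identifiers:

```python
import os

# Resolve the model ID from the environment so a Spud rollout is a
# config change, not a code change. "gpt-5.4" is the current default;
# "gpt-5.5" is hypothetical until OpenAI publishes a real ID.
MODEL_ID = os.environ.get("CHAT_MODEL_ID", "gpt-5.4")

def build_request(prompt: str) -> dict:
    """Assemble a chat-completion payload for whichever model is configured."""
    return {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
    }
```

On release day, setting `CHAT_MODEL_ID` to whatever ID OpenAI publishes rolls the whole application over without touching application code.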

Benchmark Scenario 2: Expected (Beat Claude Opus 4.7 on Coding)

Positioning: Reclaim coding leadership. Force Anthropic into a reactive 4.8 release.

Why this scenario happens: Anthropic just blindsided OpenAI by jumping SWE-bench Verified 7pp (80.8% → 87.6%) in Opus 4.7. OpenAI rarely lets a competitor hold a coding SOTA for more than one release cycle. Internal GPT-5 codex variants give OpenAI a clear pathway.

Projected scores:

| Benchmark | GPT-5.5 (Scenario 2) | vs GPT-5.4 | vs Leader |
|---|---|---|---|
| SWE-bench Verified | 88-91% | +29-32pp | New SOTA (beats Opus 4.7) |
| SWE-bench Pro | 65-70% | +7-12pp | New SOTA |
| GPQA Diamond | 94.5-95.5% | +2-3pp | New SOTA |
| Terminal-Bench 2.0 | 70-75% | +10-15pp | New SOTA |
| ARC-AGI-2 | 78-82% | +5-9pp | Beats Gemini 3.1 Pro |
| HumanEval | 95-97% | +2-4pp | New SOTA |
| Finance Agent | 66-70% | +8-12pp | New SOTA |

Probability: Highest. Aligns with OpenAI's competitive posture and the narrative value of "biggest capability jump since GPT-5.0". Matches Polymarket odds of 70%+ for April release.

What developers should do: Prepare a benchmarking harness now. On release day you'll want to rerun your internal evals within 24 hours. Teams using TokenMix.ai's model gateway can switch model IDs in config without touching application code.
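
A harness can be as small as the sketch below: a list of (prompt, expected) cases run through any model callable, so the same evals rerun against Spud the hour it ships. Everything here is hypothetical scaffolding; the stub stands in for a real API client:

```python
from typing import Callable

def run_eval(cases: list[tuple[str, str]], model: Callable[[str], str]) -> float:
    """Return the fraction of cases whose expected answer appears in the model output."""
    passed = sum(1 for prompt, expected in cases if expected in model(prompt))
    return passed / len(cases)

# Stub standing in for an API call; swap in a real client on release day.
def stub_model(prompt: str) -> str:
    return "4" if "2+2" in prompt else "no idea"

cases = [("What is 2+2?", "4"), ("Capital of France?", "Paris")]
rate = run_eval(cases, stub_model)  # 0.5 with this stub
```

Swapping `stub_model` for a function that calls the new model ID is then the only release-day change.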

Benchmark Scenario 3: Breakthrough (New SOTA Across the Board)

Positioning: Generational leap. Ships as "GPT-6" rather than "GPT-5.5".

Why this scenario happens: Pretraining required enough new compute and data that OpenAI decided to rebrand. Three signals would indicate this: (1) an unusually long post-training phase (8-12 weeks), (2) a live demo event instead of a quiet API release, and (3) the model gets a full new model card rather than an incremental one.

Projected scores:

| Benchmark | GPT-5.5 (Scenario 3) | vs GPT-5.4 | vs Leader |
|---|---|---|---|
| SWE-bench Verified | 92-95% | +33-36pp | Dominant SOTA |
| SWE-bench Pro | 72-78% | +14-20pp | Dominant SOTA |
| GPQA Diamond | 96-98% | +3-5pp | Near-saturation |
| Terminal-Bench 2.0 | 78-83% | +18-23pp | Dominant SOTA |
| ARC-AGI-2 | 85-90% | +12-17pp | Dominant SOTA |
| HumanEval | 98-99% | +5-6pp | Saturated |
| Finance Agent | 75-82% | +17-24pp | Dominant SOTA |

Probability: Low-moderate. OpenAI historically reserves step-function improvements for major version numbers (GPT-3 → GPT-4 → GPT-5). A 5.5 label with breakthrough scores would be off-pattern.

What developers should do: If this scenario materializes, expect pricing to be 1.5-2× higher than GPT-5.4, and rate limits tighter for the first 30 days. Have a fallback to GPT-5.4 ready for burst traffic. See our GPT-5.5 pricing prediction for cost modeling.
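
The fallback itself is a few lines. A sketch with stub callables in place of real clients; the error type and model behavior are assumptions for illustration:

```python
def call_with_fallback(prompt, primary, fallback):
    """Try the newest model first; degrade to the previous generation on any failure."""
    try:
        return primary(prompt)
    except Exception:
        # Covers rate limits and availability errors in the first weeks after launch.
        return fallback(prompt)

# Stubs standing in for real API clients.
def spud_client(prompt):
    raise RuntimeError("rate_limit_exceeded")  # simulate a 429 from the new model

def gpt54_client(prompt):
    return f"gpt-5.4: {prompt}"

answer = call_with_fallback("summarize this ticket", spud_client, gpt54_client)
```

In production you would narrow the `except` to the client library's rate-limit and availability errors rather than catching everything.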

Full Projected Comparison Table

All three scenarios side by side against current market leaders:

| Benchmark | GPT-5.4 | Opus 4.7 | Gemini 3.1 | GPT-5.5 S1 | GPT-5.5 S2 | GPT-5.5 S3 |
|---|---|---|---|---|---|---|
| SWE-bench Verified | 58.7% | 87.6% | 80.6% | 82-85% | 88-91% | 92-95% |
| SWE-bench Pro | 57.7% | 54.2% | ~52% | 61-64% | 65-70% | 72-78% |
| GPQA Diamond | 92.8% | 94.2% | 94.3% | 94.0-94.5% | 94.5-95.5% | 96-98% |
| Terminal-Bench 2.0 | ~60% | 69.4% | ~58% | 64-68% | 70-75% | 78-83% |
| ARC-AGI-2 | 73.3% | ~71% | 77.1% | 75-78% | 78-82% | 85-90% |
| HumanEval | 93.1% | ~92% | ~92% | 94-95% | 95-97% | 98-99% |
| Finance Agent | ~58% | 64.4% | ~55% | 62-66% | 66-70% | 75-82% |
| OSWorld | 75% | ~68% | ~60% | 76-79% | 79-83% | 85-90% |

Scenario legend: S1 = Incremental, S2 = Expected (most likely), S3 = Breakthrough. Values marked ~ are estimates from the competitor table above.

Why the "Benchmarks Leak" Stories Are Mostly Noise

If you search "GPT-5.5 benchmarks leak" right now, you'll find a mix of three sources, all of which should be treated skeptically:

1. Twitter/X posts with screenshots. These almost always show synthetic data. Genuine benchmark leaks from OpenAI are rare and usually surface through The Information or direct OpenAI preprints on arXiv. A random Twitter screenshot with no source chain is not a leak.

2. Third-party "scores" from shadow testing. Some platforms claim to have tested Spud through leaked API access. OpenAI's pre-release testing is done on isolated infrastructure with hard-coded model IDs — you cannot "accidentally" hit a pretraining-only model.

3. Inference from embedding analysis. A few researchers publish guesses based on probing OpenAI's public APIs for behavior changes. This has historically been wrong 80%+ of the time — behavior drift during post-training is standard and doesn't predict final benchmark scores.

Rule of thumb: if a "leak" doesn't include (a) a full benchmark methodology, (b) a named source chain, and (c) an OpenAI response (even a no-comment), it's noise.

What to Watch on Release Day

When Spud drops — whether as GPT-5.5 or under a different name — five signals will tell you which scenario landed:

  1. SWE-bench Verified score. If under 88%, Scenario 1. If 88-91%, Scenario 2. If 92%+, Scenario 3.
  2. Pricing vs GPT-5.4. Same price = incremental. 1.3-1.5× = expected. 2×+ = breakthrough (with premium positioning).
  3. Context window. If unchanged from GPT-5.4's 272K, scenarios 1-2. If extended to 500K+, scenario 3.
  4. Launch format. Silent API update = 1. Blog post + demo = 2. Live event = 3.
  5. Claude response time. If Anthropic ships Opus 4.8 within 4 weeks, Spud landed hard enough (scenario 2 or better) to force a response.
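
The first signal is mechanical enough to encode. A sketch of the SWE-bench Verified thresholds above; the cutoffs are this article's projections, not anything OpenAI has published:

```python
def scenario_from_swebench(verified_score: float) -> int:
    """Map a SWE-bench Verified score (%) to the release scenario it signals."""
    if verified_score >= 92:
        return 3  # breakthrough
    if verified_score >= 88:
        return 2  # expected
    return 1  # incremental
```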

TokenMix.ai will publish same-day pricing comparisons, latency benchmarks, and a migration decision matrix within 24 hours of Spud release. Track our blog for the drop.

FAQ

When will GPT-5.5 benchmarks be officially released?

Benchmarks are typically published alongside the API release. Based on OpenAI's pattern with GPT-5.1 through GPT-5.4, expect a blog post with model card, benchmark table, and API pricing on the same day. Polymarket currently gives 70%+ odds on an April 2026 release, with May-June as the fallback window.

Will GPT-5.5 beat Claude Opus 4.7 on coding?

Most likely yes on SWE-bench Pro (where GPT-5.4 already leads), and probably yes on SWE-bench Verified if scenario 2 materializes. OpenAI historically doesn't let a competitor hold a coding benchmark SOTA for more than one release cycle, and Claude Opus 4.7 only shipped April 16, 2026 with 87.6% on SWE-bench Verified.

Will GPT-5.5 beat Gemini 3.1 Pro on GPQA Diamond?

Probably yes by 0.5-1.5pp. Gemini 3.1 Pro currently holds 94.3%. Saturation near 98% makes dramatic gains unlikely. Spud's more plausible lead comes on ARC-AGI-2 and SWE-bench, where headroom remains.

Are GPT-5.5 benchmark "leaks" I see on Twitter real?

Almost never. Genuine OpenAI leaks surface through The Information or arXiv preprints, not anonymous Twitter screenshots. If the "leak" lacks a methodology, a source chain, and an OpenAI response, treat it as synthetic.

Should I wait for Spud or use GPT-5.4 for my project today?

Use GPT-5.4 today if it meets your needs. Waiting on an unannounced release is bad engineering practice. Build with model abstraction — through TokenMix.ai's OpenAI-compatible gateway or any similar service — so you can switch to Spud the day it launches with a config change rather than a code rewrite. Our GPT-5.5 migration checklist covers this in 7 steps.

How will GPT-5.5 pricing compare to GPT-5.4?

See our full pricing prediction for three scenarios. In short: pricing scenario 1 holds the status quo at $2.50/$5 per MTok, pricing scenario 2 is a defensive cut to match Gemini 3.1 Pro at $2/$2, and pricing scenario 3 is a premium tier at $3-$5/$8-$25. Scenario 2 is most consistent with the competitive dynamics.
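
To budget against those possibilities, a one-line cost model is enough. The prices below are this article's speculative scenarios, not published rates; the premium entry uses range midpoints as an assumption:

```python
def monthly_cost(input_mtok: float, output_mtok: float,
                 price_in: float, price_out: float) -> float:
    """Dollar cost for a month of traffic at per-MTok input/output prices."""
    return input_mtok * price_in + output_mtok * price_out

# Hypothetical per-MTok (input, output) prices for the three pricing scenarios.
pricing = {
    "status_quo": (2.50, 5.00),     # scenario 1: hold GPT-5.4 pricing
    "defensive_cut": (2.00, 2.00),  # scenario 2: match Gemini 3.1 Pro
    "premium": (4.00, 16.50),       # scenario 3: midpoints of $3-$5 and $8-$25
}

# e.g. 100 MTok in / 50 MTok out per month under the status quo:
baseline = monthly_cost(100, 50, *pricing["status_quo"])  # 500.0 dollars
```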



By TokenMix Research Lab · Updated 2026-04-22