TokenMix Research Lab · 2026-04-24

GPT-OSS-120B Review: Open Source OpenAI? 2026 Benchmark


GPT-OSS-120B is OpenAI's 120-billion-parameter open-weight model, released August 5, 2025 under the Apache 2.0 license, and the first meaningful open release from OpenAI since GPT-2 in 2019. It is a Mixture-of-Experts architecture with 5.1B active parameters per token, runs on a single 80GB H100 (or a 96GB M-series MacBook when quantized), and targets frontier reasoning at compact inference cost. Key specs: 128K context, fits in 80GB VRAM under MXFP4, native chain-of-thought outputs, and no safety RLHF layer (raw base alignment). Priced at roughly $0.09 input / $0.40 output per MTok via aggregators, it is about 5× cheaper than GPT-5.4-mini while matching it on many benchmarks. TokenMix.ai hosts GPT-OSS-120B alongside 300+ other models through an OpenAI-compatible endpoint.


Confirmed vs Speculation

| Claim | Status | Source |
|---|---|---|
| 120B total params, 5.1B active | Confirmed | OpenAI announcement |
| Apache 2.0 license | Confirmed | Model card |
| Single H100 80GB deployment | Confirmed (MXFP4 quantization) | Technical blog |
| 128K context window | Confirmed | Docs |
| Beats GPT-5.4-mini on reasoning | Partial: yes on AIME, marginal on general benchmarks | Independent tests |
| No safety RLHF (raw alignment) | Confirmed | OpenAI release notes |
| Matches proprietary frontier models | No: trails GPT-5.4 by 15-25pp on most benchmarks | Vellum benchmarks |
| Runs on MacBook M3 Pro | Yes (96GB unified memory, int4 quantized) | Community reports |

Memory Requirements: 80GB VRAM Floor

Practical deployment reality:

| Quantization | VRAM needed | Minimum hardware | Speed |
|---|---|---|---|
| fp16 (native) | ~240GB | 3× H100 80GB | 80 tok/s |
| MXFP4 (OpenAI recommended) | ~75GB | 1× H100 80GB | 120 tok/s |
| int4 (community) | ~65GB | 1× H100 80GB or 96GB Mac | 60-90 tok/s |
| int4 (consumer GPUs) | ~60GB | 2× RTX 4090, tensor parallel | 40 tok/s |
| int4 (Apple silicon) | ~55GB | M3 Max 128GB | 25-35 tok/s |

Key insight: OpenAI's MXFP4 quantization was designed specifically so GPT-OSS-120B fits on a single H100. This is the cheapest deployment path — one GPU at ~$2/hour rental, or $25-30K capex.
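The table above follows from simple arithmetic on parameter count and bits per weight. A back-of-envelope sketch (the effective bits-per-parameter figures, including quantization-scale overhead, and the ~15% runtime margin are rough assumptions, not official numbers):

```python
# Rough VRAM estimate for GPT-OSS-120B under different quantizations.
# Bits-per-parameter values (incl. scale overhead) are assumptions.

TOTAL_PARAMS = 120e9  # total parameters (MoE; all experts stored in memory)

def weight_footprint_gb(bits_per_param: float) -> float:
    """Raw weight storage in GB (1 GB = 1e9 bytes)."""
    return TOTAL_PARAMS * bits_per_param / 8 / 1e9

for name, bits in [("fp16", 16.0),
                   ("mxfp4 (~4.25 bits incl. scales)", 4.25),
                   ("int4", 4.0)]:
    weights = weight_footprint_gb(bits)
    # Add ~15% headroom for KV cache, activations, and runtime buffers
    print(f"{name:32s} weights ≈ {weights:6.1f} GB, serving ≈ {weights * 1.15:6.1f} GB")
```

Under these assumptions, fp16 lands at 240GB of raw weights (hence the 3× H100 row) while MXFP4 lands near 64GB of weights plus overhead, consistent with the ~75GB serving figure.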

Benchmarks: Where It Actually Wins

| Benchmark | GPT-OSS-120B | GPT-5.4-mini | Gemma 4 31B | DeepSeek R1 |
|---|---|---|---|---|
| MMLU | 86% | 88% | 87% | 88% |
| GPQA Diamond | 78% | 79% | 78% | 71% (reasoning variant differs) |
| HumanEval | 85% | 88% | 88% | 90% |
| MATH | 91% | 92% | 85% | 96% |
| AIME 2024 | 82% | 78% | 70% | 82% |
| Reasoning depth (chain-of-thought) | Strong | Strong | Medium | Strongest |
| Latency p50 (H100) | 120 tok/s | API-only | Similar | Variable |

Where GPT-OSS-120B wins: math-heavy reasoning (AIME, MATH), single-GPU deployment economics, permissive Apache 2.0.

Where it loses: multilingual tasks (Gemma 4 stronger), pure coding at scale (Claude Opus 4.7 or GLM-5.1 leads), creative writing polish (proprietary frontier models still ahead).

GPT-OSS-120B vs Gemma 4, DeepSeek R1, Llama 4

| Dimension | GPT-OSS-120B | Gemma 4 31B | DeepSeek R1 | Llama 4 Maverick |
|---|---|---|---|---|
| Total params | 120B MoE | 31B dense | 671B MoE | 400B MoE |
| Active params | 5.1B | 31B | 37B | 17B |
| License | Apache 2.0 | Apache 2.0 | DeepSeek License | Llama Community |
| Min hardware (fp8) | 1× H100 | 1× H100 | 8× H100 | 4× H100 |
| Reasoning chain | Native CoT | Weak | Best-in-class | Medium |
| SWE-Bench Verified | ~62% | 64% | 72% | 71% |
| Hosted $/MTok (in/out) | ~$0.09 / $0.40 | ~$0.10 / $0.30 | $0.14 / $0.28 | Self-host only |
| Procurement safety | US + Apache 2.0 | US + Apache 2.0 | Chinese + distillation allegations | US + Llama License |

Key judgment: GPT-OSS-120B and Gemma 4 31B are the two US-origin open-weight options with zero license friction. GPT-OSS wins on math reasoning; Gemma 4 wins on general capability. For reasoning-heavy workloads on a budget, choose GPT-OSS; for general agent use on compact hardware, choose Gemma 4 31B.

Playground Access & Hosting Options

Three paths to try GPT-OSS-120B:

Option 1 — Free playground: gpt-oss.com hosts a free inference endpoint, zero signup. Rate-limited but works for testing.

Option 2 — Hosted via aggregator: TokenMix.ai, OpenRouter, Together.ai, Fireworks all serve GPT-OSS-120B via OpenAI-compatible endpoints at ~$0.09-$0.12/MTok input, $0.35-$0.45 output.
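Because every aggregator exposes the standard OpenAI chat-completions shape, switching providers is just a base-URL change. A minimal stdlib sketch of building such a request (the base URL below is a hypothetical placeholder, not a documented TokenMix endpoint; substitute your provider's actual URL and key):

```python
import json
import urllib.request

BASE_URL = "https://api.tokenmix.ai/v1"  # hypothetical endpoint; substitute yours
MODEL = "openai/gpt-oss-120b"

def build_chat_request(prompt: str, api_key: str) -> urllib.request.Request:
    """Build an OpenAI-compatible /chat/completions request (not sent here)."""
    body = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 512,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(body).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_chat_request("Prove that sqrt(2) is irrational.", "sk-demo")
# Send with urllib.request.urlopen(req) once a valid key is in place.
```

The same request body works against a self-hosted vLLM server (Option 3) by pointing BASE_URL at your own host.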

Option 3 — Self-host: Download weights from HuggingFace (openai/gpt-oss-120b), serve with vLLM or SGLang on 1× H100. Example:

vllm serve openai/gpt-oss-120b \
  --quantization mxfp4 \
  --max-model-len 131072 \
  --tensor-parallel-size 1

The MXFP4 memory footprint (~75GB) fits comfortably on a single H100 80GB, with throughput of 120+ tok/s at batch size 1.

Who Should Use It

| Your situation | Use GPT-OSS-120B? |
|---|---|
| Math/reasoning-heavy workload, moderate budget | Yes |
| On-prem reasoning with Apache 2.0 requirement | Yes |
| US federal/defense procurement | Yes |
| Coding agent (SWE-Bench critical) | No: use GLM-5.1 or Claude Opus 4.7 |
| Long-context RAG (>500K tokens) | No: Gemini 3.1 Pro is better |
| Multimodal (vision/audio) | No: text only |
| Hobbyist experimentation on MacBook | Yes (int4 on 96GB M-series) |
| Production chat at scale with latency SLA | Test vs GPT-5.4-mini before committing |

FAQ

Is GPT-OSS-120B really open source?

Yes, under Apache 2.0 license. You can download the weights, modify them, fine-tune on your data, redistribute derivatives, and deploy commercially without paying OpenAI. This is the strongest open-source commitment from OpenAI since GPT-2 in 2019.

Can GPT-OSS-120B run on a MacBook?

Yes, on M-series Macs with 96GB+ unified memory using int4 quantization. Expect 25-35 tok/s on M3 Max 128GB, slower on M3 Pro 96GB. For most developers, this is the most accessible frontier-class reasoning model to run locally.

Is there a smaller GPT-OSS variant?

Yes, GPT-OSS-20B was released alongside 120B. Runs on 16GB VRAM (fits RTX 4080/4090 or MacBook 24GB). Benchmarks closer to GPT-5.4-nano but still strong on math. Use 20B for latency-critical chat, 120B for reasoning.

How does GPT-OSS-120B compare to DeepSeek R1 for reasoning?

DeepSeek R1 is still stronger on raw reasoning depth (longer chain-of-thought traces, better AIME scores in its reasoning variant). GPT-OSS-120B is 3-5× cheaper to deploy (single GPU vs 8-GPU cluster) and has cleaner procurement (no distillation allegations). Choose DeepSeek R1 for benchmark-ceiling reasoning; choose GPT-OSS-120B for cost-effective reasoning at ~90% of DeepSeek's quality.

Why does GPT-OSS have "no safety RLHF"?

OpenAI released the base reasoning model without additional safety fine-tuning, letting downstream users apply their own alignment. This means raw outputs can be more direct/blunt than ChatGPT. For enterprise, apply your own guardrails. For research, this removes a common confound in benchmarks.
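One way to picture "apply your own guardrails" is a pre-filter that sits between user input and the raw model call. The sketch below is a deliberately toy keyword filter (the blocklist and function names are illustrative inventions; a real deployment would use a trained moderation classifier), shown only to illustrate where the check sits in the pipeline:

```python
# Toy guardrail sketch: screen a prompt before it reaches the raw,
# un-RLHF'd model. Illustrative only; real systems use a moderation model.

BLOCKED_PHRASES = {"make a weapon", "bypass a safety lock"}  # illustrative

def guard(prompt: str) -> tuple[bool, str]:
    """Return (allowed, payload): refuse if a blocked phrase appears,
    otherwise pass the prompt through unchanged."""
    lowered = prompt.lower()
    for phrase in BLOCKED_PHRASES:
        if phrase in lowered:
            return False, "Request refused by policy filter."
    return True, prompt

allowed, payload = guard("Explain MXFP4 quantization.")
# allowed is True here, so payload goes on to the model call unchanged.
```

The point is architectural: because the model ships without a safety layer, this check is the deployer's responsibility, not something inherited from the weights.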

Is GPT-OSS-120B better than GPT-5.4-mini?

On benchmarks, roughly tied with GPT-5.4-mini across most metrics — GPT-OSS slightly ahead on AIME math, slightly behind on HumanEval coding. On deployment cost, GPT-OSS via TokenMix.ai at $0.09/MTok is ~5× cheaper than GPT-5.4-mini. For cost-sensitive production, GPT-OSS wins.

Does OpenAI plan to release GPT-OSS-200B or bigger?

Not announced. GPT-OSS-120B + 20B are positioned as a complete pair — one for reasoning, one for speed. Larger open releases from OpenAI are speculative.



By TokenMix Research Lab · Updated 2026-04-24