TokenMix Research Lab · 2026-04-24
GPT-OSS-120B Review: Open Source OpenAI? 2026 Benchmark
GPT-OSS-120B is OpenAI's 120-billion-parameter open-weight model, released August 5, 2025 under the Apache 2.0 license. It is the first meaningful open release from OpenAI since the original GPT-2 in 2019. The model is a Mixture-of-Experts architecture with 5.1B active parameters per token, runs on a single 80GB H100 (or a 96GB M-series MacBook, quantized), and targets frontier reasoning at compact inference cost. Key specs: 128K context, fits in 80GB VRAM under MXFP4, native chain-of-thought outputs, no added safety RLHF layer (raw base alignment). Priced at roughly $0.09 input / $0.40 output per MTok via aggregators, it is about 5× cheaper than GPT-5.4-mini while matching it on many benchmarks. TokenMix.ai hosts GPT-OSS-120B alongside 300+ other models through an OpenAI-compatible endpoint.
Table of Contents
- Confirmed vs Speculation
- Memory Requirements: 80GB VRAM Floor
- Benchmarks: Where It Actually Wins
- GPT-OSS-120B vs Gemma 4, DeepSeek R1, Llama 4
- Playground Access & Hosting Options
- Who Should Use It
- FAQ
Confirmed vs Speculation
| Claim | Status | Source |
|---|---|---|
| 120B total params, 5.1B active | Confirmed | OpenAI announcement |
| Apache 2.0 license | Confirmed | Model card |
| Single H100 80GB deployment | Confirmed (MXFP4 quantization) | Technical blog |
| 128K context window | Confirmed | Docs |
| Beats GPT-5.4-mini on reasoning | Partial — on AIME yes, on general benchmarks marginal | Independent tests |
| No safety RLHF (raw alignment) | Confirmed | OpenAI release notes |
| Matches proprietary frontier models | No — trails GPT-5.4 by 15-25pp on most benchmarks | Vellum benchmarks |
| Runs on MacBook M3 Pro | Yes (96GB unified memory, int4 quantized) | Community reports |
Memory Requirements: 80GB VRAM Floor
Practical deployment reality:
| Quantization | VRAM needed | Minimum hardware | Speed |
|---|---|---|---|
| fp16 (native) | ~240GB | 3× H100 80GB | 80 tok/s |
| MXFP4 (OpenAI recommended) | ~75GB | 1× H100 80GB | 120 tok/s |
| int4 (community) | ~65GB | 1× H100 80GB or 96GB Mac | 60-90 tok/s |
| int4 on consumer | ~60GB | 2× RTX 4090 (tensor parallel + partial CPU offload; 48GB VRAM alone is not enough) | 40 tok/s |
| int4 M3 Max 128GB | ~55GB | Apple silicon | 25-35 tok/s |
Key insight: OpenAI's MXFP4 quantization was designed specifically so GPT-OSS-120B fits on a single H100. This is the cheapest deployment path — one GPU at ~$2/hour rental, or $25-30K capex.
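The single-GPU claim holds up to back-of-envelope arithmetic. A minimal sketch, assuming MXFP4 stores roughly 4 bits per weight plus a small per-block scale overhead (the 4.25 bits/weight figure is an assumption, and KV cache and activations are ignored):

```python
# Back-of-envelope VRAM estimate for GPT-OSS-120B weights.
# Assumption: MXFP4 ~= 4-bit weights + per-block scales (~0.25 extra bits/weight).
PARAMS = 120e9

def weight_gb(bits_per_param: float) -> float:
    """Weights-only footprint in GB (1 GB = 1e9 bytes)."""
    return PARAMS * bits_per_param / 8 / 1e9

fp16 = weight_gb(16.0)    # native precision
mxfp4 = weight_gb(4.25)   # 4-bit + assumed scale overhead

print(f"fp16 weights:  ~{fp16:.0f} GB")   # ~240 GB, hence 3x H100 80GB
print(f"MXFP4 weights: ~{mxfp4:.0f} GB")  # ~64 GB weights-only
```

The gap between the ~64GB weights-only estimate and the ~75GB in the table is plausibly KV cache and runtime overhead at long context, which is why 80GB is the practical floor rather than a comfortable ceiling.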
Benchmarks: Where It Actually Wins
| Benchmark | GPT-OSS-120B | GPT-5.4-mini | Gemma 4 31B | DeepSeek R1 |
|---|---|---|---|---|
| MMLU | 86% | 88% | 87% | 88% |
| GPQA Diamond | 78% | 79% | 78% | 71% (reasoning variant differs) |
| HumanEval | 85% | 88% | 88% | 90% |
| MATH | 91% | 92% | 85% | 96% |
| AIME 2024 | 82% | 78% | 70% | 82% |
| Reasoning depth (chain-of-thought) | Strong | Strong | Medium | Strongest |
| Throughput p50 (H100) | 120 tok/s | API-only | Similar | Variable |
Where GPT-OSS-120B wins: math-heavy reasoning (AIME, MATH), single-GPU deployment economics, permissive Apache 2.0.
Where it loses: multilingual tasks (Gemma 4 stronger), pure coding at scale (Claude Opus 4.7 or GLM-5.1 leads), creative writing polish (proprietary frontier models still ahead).
GPT-OSS-120B vs Gemma 4, DeepSeek R1, Llama 4
| Dimension | GPT-OSS-120B | Gemma 4 31B | DeepSeek R1 | Llama 4 Maverick |
|---|---|---|---|---|
| Total params | 120B MoE | 31B dense | 671B MoE | 400B MoE |
| Active params | 5.1B | 31B | 37B | 17B |
| License | Apache 2.0 | Apache 2.0 | DeepSeek License | Llama Community |
| Min hardware (fp8) | 1× H100 | 1× H100 | 8× H100 | 4× H100 |
| Reasoning chain | Native CoT | Weak | Best-in-class | Medium |
| SWE-Bench Verified | ~62% | 64% | 72% | 71% |
| Hosted $/MTok | ~$0.09/$0.40 | ~$0.10/$0.30 | $0.14/$0.28 | self-host only |
| Procurement safety | US + Apache 2.0 | US + Apache 2.0 | Chinese + distillation allegations | US + Llama License |
Key judgment: GPT-OSS-120B and Gemma 4 31B are the two US-origin open-weight options with zero license friction. GPT-OSS wins on math reasoning; Gemma 4 wins on general capability. For reasoning-heavy workloads at budget scale, pick GPT-OSS; for general agent use on compact hardware, pick Gemma 4 31B.
Playground Access & Hosting Options
Three paths to try GPT-OSS-120B:
Option 1 — Free playground: gpt-oss.com hosts a free inference endpoint, zero signup. Rate-limited but works for testing.
Option 2 — Hosted via aggregator: TokenMix.ai, OpenRouter, Together.ai, Fireworks all serve GPT-OSS-120B via OpenAI-compatible endpoints at ~$0.09-$0.12/MTok input, $0.35-$0.45 output.
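Because all of these hosts speak the OpenAI chat-completions schema, switching providers is just a base-URL change. A minimal stdlib sketch (the TokenMix base URL and the model id `openai/gpt-oss-120b` are assumptions; check your provider's docs for the exact values):

```python
import json
import urllib.request

BASE_URL = "https://api.tokenmix.ai/v1"   # assumed; any OpenAI-compatible host works
API_KEY = "YOUR_KEY"

payload = {
    "model": "openai/gpt-oss-120b",       # exact model id varies by provider
    "messages": [{"role": "user", "content": "Prove sqrt(2) is irrational."}],
    "max_tokens": 512,
}

req = urllib.request.Request(
    f"{BASE_URL}/chat/completions",
    data=json.dumps(payload).encode(),
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    },
)
# resp = urllib.request.urlopen(req)   # uncomment once you have a real key
# print(json.load(resp)["choices"][0]["message"]["content"])
```

The same request body works against a self-hosted vLLM endpoint; only `BASE_URL` changes.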
Option 3 — Self-host: Download weights from HuggingFace (openai/gpt-oss-120b), serve with vLLM or SGLang on 1× H100. Example:
```shell
vllm serve openai/gpt-oss-120b \
  --quantization mxfp4 \
  --max-model-len 131072 \
  --tensor-parallel-size 1
```
The MXFP4 memory footprint (~75GB) fits comfortably within an H100's 80GB; expect 120+ tok/s at batch size 1.
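One caveat on self-host economics: at batch size 1, a rented H100 costs far more per token than the aggregator rates above; hosted pricing relies on heavy batching. A rough sketch using the ~$2/hour rental and 120 tok/s figures from earlier (the batched throughput number is an assumption for illustration):

```python
# Effective $/MTok for a self-hosted GPU at a given sustained throughput.
GPU_PER_HOUR = 2.00  # assumed H100 rental rate

def usd_per_mtok(tok_per_s: float) -> float:
    """Dollar cost per million output tokens at a sustained generation rate."""
    tokens_per_hour = tok_per_s * 3600
    return GPU_PER_HOUR / tokens_per_hour * 1e6

print(f"batch 1  (120 tok/s):   ${usd_per_mtok(120):.2f}/MTok")   # ~$4.63
print(f"batched (~3000 tok/s):  ${usd_per_mtok(3000):.2f}/MTok")  # ~$0.19, assumed throughput
```

In other words, self-hosting only beats the ~$0.40/MTok hosted rate once you sustain enough concurrent load to keep the GPU busy.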
Who Should Use It
| Your situation | Use GPT-OSS-120B? |
|---|---|
| Math/reasoning heavy workload, moderate budget | Yes |
| On-prem reasoning with Apache 2.0 requirement | Yes |
| US federal/defense procurement | Yes |
| Coding agent (SWE-Bench critical) | No — use GLM-5.1 or Claude Opus 4.7 |
| Long-context RAG (>500K tokens) | No — Gemini 3.1 Pro better |
| Multimodal (vision/audio) | No — text only |
| Hobbyist experimentation on MacBook | Yes (int4 on 96GB M-series) |
| Production chat at scale with latency SLA | Test vs GPT-5.4-mini before committing |
FAQ
Is GPT-OSS-120B really open source?
Yes, under Apache 2.0 license. You can download the weights, modify them, fine-tune on your data, redistribute derivatives, and deploy commercially without paying OpenAI. This is the strongest open-source commitment from OpenAI since GPT-2 in 2019.
Can GPT-OSS-120B run on a MacBook?
Yes, on M-series Macs with 96GB+ unified memory using int4 quantization. Expect 25-35 tok/s on M3 Max 128GB, slower on M3 Pro 96GB. For most developers, this is the most accessible frontier-class reasoning model to run locally.
Is there a smaller GPT-OSS variant?
Yes, GPT-OSS-20B was released alongside 120B. Runs on 16GB VRAM (fits RTX 4080/4090 or MacBook 24GB). Benchmarks closer to GPT-5.4-nano but still strong on math. Use 20B for latency-critical chat, 120B for reasoning.
How does GPT-OSS-120B compare to DeepSeek R1 for reasoning?
DeepSeek R1 is still stronger on raw reasoning depth (longer chain-of-thought traces, better AIME scores in its reasoning variant). GPT-OSS-120B is 3-5× cheaper to deploy (single GPU vs 8-GPU cluster) and has cleaner procurement (no distillation allegations). Choose DeepSeek R1 for benchmark-ceiling reasoning; choose GPT-OSS-120B for cost-effective reasoning at ~90% of DeepSeek's quality.
Why does GPT-OSS have "no safety RLHF"?
OpenAI released the base reasoning model without additional safety fine-tuning, letting downstream users apply their own alignment. This means raw outputs can be more direct and blunt than ChatGPT's. For enterprise use, apply your own guardrails; for research, this removes a common confound in benchmarks.
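What "apply your own guardrails" can look like in practice: a toy illustrative sketch that screens completions against a blocklist before they reach users. The blocklist and policy here are placeholders; production systems typically route outputs through a dedicated moderation model instead.

```python
# Toy output guardrail: deterministic post-filter on model completions.
# Real deployments would call a moderation classifier; this only shows the shape.
BLOCKED_TERMS = {"credit card number", "social security"}  # placeholder policy

def guard(completion: str) -> str:
    """Return the completion unchanged, or a refusal if it trips the policy."""
    lowered = completion.lower()
    if any(term in lowered for term in BLOCKED_TERMS):
        return "[withheld by policy]"
    return completion

print(guard("The derivative of x^2 is 2x."))         # passes through unchanged
print(guard("Here is a Social Security lookup..."))  # -> [withheld by policy]
```

Since the model has no built-in refusal layer, this kind of wrapper (or a moderation-model call) sits between the inference endpoint and the user in every serving path.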
Is GPT-OSS-120B better than GPT-5.4-mini?
On benchmarks, roughly tied with GPT-5.4-mini across most metrics — GPT-OSS slightly ahead on AIME math, slightly behind on HumanEval coding. On deployment cost, GPT-OSS via TokenMix.ai at $0.09/MTok is ~5× cheaper than GPT-5.4-mini. For cost-sensitive production, GPT-OSS wins.
Does OpenAI plan to release GPT-OSS-200B or bigger?
Not announced. GPT-OSS-120B + 20B are positioned as a complete pair — one for reasoning, one for speed. Larger open releases from OpenAI are speculative.
Sources
- OpenAI GPT-OSS Announcement
- GPT-OSS HuggingFace Model Card
- GPT-OSS Playground
- Vellum LLM Leaderboard
- Gemma 4 Review — TokenMix
- GLM-5.1 SWE-Bench Pro — TokenMix
- DeepSeek R1 Pricing — TokenMix
- Llama 4 Maverick Review — TokenMix
By TokenMix Research Lab · Updated 2026-04-24