TokenMix Research Lab · 2026-04-22

MiniMax M2.5 Review: 80.2% SWE-Bench Verified at $0.28/M — Speed-Per-Dollar King 2026

Last Updated: 2026-04-29
Author: TokenMix Research Lab

MiniMax shipped M2.5 on February 12, 2026, and the numbers are hard to ignore. 80.2% on SWE-Bench Verified, 76.3% on BrowseComp, 192K context, and pricing at $0.28 input / $1.10 output per million tokens — for a model that completes agentic tasks 37% faster than its predecessor and matches Claude Opus 4.6 on raw generation speed.

This review covers what M2.5 actually does well in practice, where its long-context ceiling hurts, pricing comparisons against frontier peers, and the practical setup for teams looking to deploy it in production.

TL;DR — The Numbers That Matter
What MiniMax M2.5 Is
Coding Performance: Breaking Down the 80.2%
Pricing: The Cost Chart That Matters
The Context Window Problem
Real Use Cases Where M2.5 Wins
Code Example: Calling MiniMax M2.5 Through OpenAI-Compatible Endpoint
Access: Why a Gateway Matters for International Teams
Who Should Use MiniMax M2.5
What's Next for MiniMax
Bottom Line

TL;DR — The Numbers That Matter

Metric	MiniMax M2.5	Claude Opus 4.6	GPT-5.4	Qwen 3.6 Plus
SWE-Bench Verified	80.2%	79.4%	85.0%	78.8%
Multi-SWE-Bench	51.3%	~48%	~54%	—
BrowseComp	76.3%	71.0%	78.2%	—
Context window	192K	1M	1.05M	1M
Total parameters (MoE)	228.7B	proprietary	proprietary	proprietary
Input price ($/M)	$0.28	$5.00	$2.38	$0.28
Output price ($/M)	$1.10	$25.00	$14.25	$1.66
Generation speed (tok/s)	~100	~100	~90	~85

Sources: MiniMax release notes, independent analysis via Artificial Analysis, TokenMix model catalog.

What MiniMax M2.5 Is

MiniMax M2.5 is the February 2026 successor to M2.1, positioned explicitly as a frontier-quality model engineered for economically valuable real-world tasks — coding, agentic tool use, search, and office workflows. The architecture is a 228.7 billion parameter mixture-of-experts model with a 192K token context window, trained with large-scale reinforcement learning across hundreds of thousands of simulated real-world environments.

Three engineering decisions define the model:

Process reward mechanism. To handle credit assignment in long agent rollouts, MiniMax introduced end-to-end monitoring of generation quality step-by-step, not just at the final answer. This is why M2.5 performs meaningfully better on multi-step agent benchmarks than on single-turn completions.
Speed optimization as a first-class goal. M2.5 completes SWE-Bench Verified evaluations 37% faster than M2.1 and matches Claude Opus 4.6's generation speed at roughly 100 tokens per second, while costing about 18× less on input and about 23× less on output.
Weights released openly. Unlike Claude or GPT, MiniMax M2.5 weights are available on Hugging Face under a permissive license, enabling self-hosting for teams that need air-gapped deployment.

Coding Performance: Breaking Down the 80.2%

SWE-Bench Verified (80.2%) puts MiniMax M2.5 above Claude Opus 4.6 (79.4%) and Qwen 3.6 Plus (78.8%), and within 5 percentage points of GPT-5.4 (85.0%). For a model at roughly 4% of Claude's output price (and roughly 6% of its input price), that's an unusually flat cost-quality curve.

Multi-SWE-Bench (51.3%) measures multi-language coding across Python, JavaScript, Go, Rust, and Java. M2.5 holds its own on non-Python tasks, where many open-weight competitors (including DeepSeek V3) tend to over-fit on Python-heavy training data.

BrowseComp (76.3%) is where M2.5 genuinely shines. This benchmark tests end-to-end browsing agent behavior — parsing search results, maintaining state across page transitions, extracting structured data. The 76.3% score (with proper context management) beats Claude Opus 4.6 by 5 points and puts M2.5 in the same league as GPT-5.4 on actual web agent workflows.

Where it still trails:

Pure SWE-Bench frontier — Claude Opus 4.7 (87.6%) and GPT-5.4 (85.0%) remain ahead on repository-scale refactors
Long-context reasoning — the 192K ceiling becomes a real constraint for codebase-wide operations where 1M-context models (Claude, Gemini, Qwen) have room to breathe
Tool-calling reliability on complex nested function schemas — Claude Opus 4.7 is still the safest choice here

Pricing: The Cost Chart That Matters

MiniMax M2.5 is priced aggressively across all major providers:

Provider	Input ($/M)	Output ($/M)	Notes
MiniMax official	$0.30	$1.20	With $0.06/M cached input
Together AI	$0.40	$1.60	Hosted inference, no account required
TokenMix	$0.28	$1.10	OpenAI-compatible, unified gateway
DeepInfra	$0.30	$1.20	Competitive with MiniMax official

Compared against frontier peers — the numbers that justify the switch:

vs Claude Opus 4.7 ($5/$25): M2.5 is 17.9× cheaper on input, 22.7× cheaper on output
vs GPT-5.4 ($2.38/$14.25): M2.5 is 8.5× cheaper on input, 13.0× cheaper on output
vs Qwen 3.6 Plus ($0.28/$1.66): M2.5 matches input price, 33% cheaper on output
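The multipliers above are plain price ratios, so they are easy to re-derive when a provider changes its rates. A minimal sketch, assuming the per-million prices quoted in this review (the dictionary keys and the function name are illustrative, not part of any SDK):

```python
# Per-million-token prices in USD as (input, output), copied from this review.
PRICES = {
    "MiniMax M2.5": (0.28, 1.10),
    "Claude Opus": (5.00, 25.00),
    "GPT-5.4": (2.38, 14.25),
    "Qwen 3.6 Plus": (0.28, 1.66),
}


def price_ratio(other: str, base: str = "MiniMax M2.5") -> tuple[float, float]:
    """Return how many times more `other` costs than `base`, as (input, output)."""
    base_in, base_out = PRICES[base]
    other_in, other_out = PRICES[other]
    return round(other_in / base_in, 1), round(other_out / base_out, 1)


for name in PRICES:
    if name != "MiniMax M2.5":
        in_x, out_x = price_ratio(name)
        print(f"{name}: {in_x}x input, {out_x}x output")
```

Swapping in the official MiniMax rates ($0.30/$1.20) instead of the gateway rates shifts every multiplier down slightly, which is worth checking before quoting a single headline number.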

For a typical agent workload running 100 tokens/second continuously:

MiniMax M2.5: roughly $1 per hour — this is the figure MiniMax leans on in their launch materials, and it holds up in independent testing
Claude Opus 4.7: roughly $18 per hour at the same throughput
GPT-5.4: roughly $10 per hour

For always-on agents, coding assistants, and high-volume inference pipelines, that 18× cost delta translates to the difference between a product that can scale and one that can't.

The Context Window Problem

192K is large, but not best-in-class. Where this matters:

Where 192K is enough:

Single-file or multi-file code tasks up to ~100K tokens of context
Document analysis up to ~500 pages of plain text
Multi-turn conversations up to roughly 150 exchanges
Most agentic tool-use loops (where context is managed dynamically)

Where 192K isn't enough:

Repository-wide refactors on large monorepos (>300K tokens of code)
Long legal or technical document review (>500 pages)
Unbounded agent loops without active context pruning
Book-length content generation in a single pass

For teams whose workloads cleanly fit under 192K, M2.5 is the obvious pick. For workloads that routinely exceed it, Qwen 3.6 Plus (1M, $0.28/$1.66) or Claude Sonnet 4.6 (1M, $3/$15) remain the practical alternatives.
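Whether a workload "cleanly fits" can be checked before a request is sent. The sketch below is a rough pre-flight check, not an exact tokenizer: the 4-characters-per-token ratio is a common rule of thumb assumed here for English text and code, and the function names are hypothetical.

```python
CONTEXT_LIMIT = 192_000  # M2.5 context window in tokens, per this review
CHARS_PER_TOKEN = 4      # assumption: rough heuristic, not the real tokenizer


def estimate_tokens(text: str) -> int:
    """Approximate the token count of `text` from its character length."""
    return len(text) // CHARS_PER_TOKEN + 1


def fits_in_context(prompt: str, max_output_tokens: int = 8_000,
                    limit: int = CONTEXT_LIMIT) -> bool:
    """True if the prompt plus the reserved output budget stays within the limit."""
    return estimate_tokens(prompt) + max_output_tokens <= limit


small_prompt = "x" * 400_000   # about 100K estimated tokens
large_prompt = "x" * 800_000   # about 200K estimated tokens
print(fits_in_context(small_prompt), fits_in_context(large_prompt))
```

Reserving the output budget up front matters because the prompt and the completion share the same window; a prompt that fits on its own can still fail once `max_tokens` is added.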

Real Use Cases Where M2.5 Wins

Coding assistants in IDEs. The combination of 80.2% SWE-Bench, 100 tok/s generation speed, and $1.10/M output pricing makes M2.5 the single best economic choice for a Cursor- or Windsurf-style in-editor assistant. The speed means completions arrive faster than the user can read; the price means the vendor can actually afford to serve them.

Autonomous coding agents. For agents that write, run, and debug code across multiple cycles, M2.5's combination of agentic training (via the process reward mechanism) and price makes it plausible to run long-horizon autonomous workflows without the bill ballooning.

Browse-and-extract pipelines. The 76.3% BrowseComp score with context management makes M2.5 particularly strong for building agents that navigate websites, extract structured data, and maintain state across page transitions — research agents, price monitors, competitive intelligence pipelines.

Background inference at scale. For teams processing millions of requests (summarization, classification, extraction), M2.5 at $0.28/$1.10 is roughly 18× cheaper than Claude Opus on input and 23× cheaper on output, a difference that shows up in every monthly infra bill.

Code Example: Calling MiniMax M2.5 Through OpenAI-Compatible Endpoint

from openai import OpenAI

client = OpenAI(
    api_key="sk-tm-xxxx",
    base_url="https://api.tokenmix.ai/v1",
)

response = client.chat.completions.create(
    model="minimax-m2.5",
    messages=[
        {"role": "system", "content": "You are a senior engineer reviewing a pull request."},
        {"role": "user", "content": "Analyze this diff for correctness, performance issues, and security concerns..."},
    ],
    max_tokens=8000,
    temperature=0.2,
)

print(response.choices[0].message.content)

M2.5 supports function calling, JSON mode, and streaming through the standard OpenAI SDK interface. No custom client library is required.
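As an illustration of the streaming path, here is a small helper that consumes a streamed response with the standard OpenAI SDK chunk shape. It is a sketch under assumptions: `client` is the `OpenAI` instance from the example above, the helper name is made up for this article, and the gateway is assumed to forward `stream=True` unchanged.

```python
def stream_completion(client, prompt: str, model: str = "minimax-m2.5") -> str:
    """Stream a chat completion, echo tokens as they arrive, return the full text.

    `client` is expected to be an `openai.OpenAI` instance configured with the
    gateway base URL, as in the example above.
    """
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,
        stream=True,  # ask for incremental chunks instead of one final response
    )
    parts = []
    for chunk in stream:
        if not chunk.choices:  # some chunks carry no choices (e.g. usage-only)
            continue
        delta = chunk.choices[0].delta.content
        if delta:  # role-only or final chunks have no text content
            parts.append(delta)
            print(delta, end="", flush=True)
    print()
    return "".join(parts)
```

Calling `stream_completion(client, "Summarize this diff...")` prints text as it arrives and returns the assembled string, which suits in-editor assistants where perceived latency matters more than total completion time.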

Access: Why a Gateway Matters for International Teams

MiniMax is a Chinese AI company, and its official API requires an account on platform.minimax.io with payment options that often don't accept non-Chinese cards. For international teams, the practical access paths are:

TokenMix — USD billing, OpenAI-compatible, crypto/Stripe/Alipay accepted, minimax-m2.5 slug
Together AI — hosted inference with US-based infrastructure
DeepInfra — similar positioning, competitive pricing
Self-hosting from Hugging Face — practical only for teams with 4–8× H100-class GPU budgets

For production workloads where reliability and compliance matter, a gateway with multi-provider routing removes the single-provider risk and enables model-level failover without code changes.
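For comparison, this is roughly the failover logic a team would otherwise maintain in application code, and which a routing gateway takes over. It is a generic sketch: the provider list, names, and callables are hypothetical, and a production version would catch only timeouts, rate limits, and server errors rather than every exception.

```python
from typing import Callable, Sequence

Provider = tuple[str, Callable[[str], str]]  # (provider name, function: prompt -> reply)


def call_with_failover(providers: Sequence[Provider], prompt: str) -> tuple[str, str]:
    """Try providers in order; return (provider_name, reply) from the first success."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # sketch only: narrow this to retryable errors
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


# Stand-in callables; real ones would wrap an OpenAI-compatible client per provider.
def flaky_provider(prompt: str) -> str:
    raise TimeoutError("upstream timed out")


def healthy_provider(prompt: str) -> str:
    return f"reviewed: {prompt}"


print(call_with_failover([("primary", flaky_provider), ("backup", healthy_provider)], "diff #1"))
```

Each callable can point at a different base URL while keeping the same model slug, since all the access paths listed above are described as OpenAI-compatible or hosted inference.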

Who Should Use MiniMax M2.5

Good fit:

Coding assistants where latency and per-token cost both matter
Autonomous agents running long-horizon workflows within 192K context
High-throughput background inference at scale
Teams priced out of Claude Opus but needing more than DeepSeek V3 delivers
Regulated industries that need to self-host (weights are on Hugging Face)

Poor fit:

Repository-scale refactors requiring >200K context
Highest-end SWE-Bench leaderboard chasing (GPT-5.4 and Claude Opus 4.7 still lead)
Multi-turn tool-use with deeply nested function schemas (Claude Opus 4.7 remains most reliable)

What's Next for MiniMax

MiniMax has telegraphed M2.7 (a self-evolving agent variant) and continued improvements to the core M-series. The March–April 2026 roadmap suggests the long-context ceiling is the next target — a 512K or 1M-context M3 would close the most visible gap to Claude and Qwen.

Bottom Line

MiniMax M2.5 is the most cost-effective model on the market that cracks 80% SWE-Bench Verified. It matches Claude Opus 4.6 on generation speed while costing 18× less, and it leads Claude on BrowseComp agentic browsing. The 192K context ceiling is the only structural limitation keeping it out of the "default choice" conversation — for any workload that fits under that ceiling, M2.5 is genuinely the strongest economic pick in 2026.

If your current coding workload runs on Claude or GPT and your monthly bill has four or more digits, a week of A/B testing the same workload on MiniMax M2.5 is likely the highest-ROI experiment you can run this quarter.

FAQ

Is MiniMax M2.5 actually the cheapest model that cracks 80% SWE-Bench?

Yes, as of April 2026. At 80.2% SWE-Bench Verified and $0.28 / $1.10 per million tokens, MiniMax M2.5 is the cheapest model in the 80%+ tier. Claude Opus 4.7 scores higher (87.6%) but costs roughly 23x more per output token.

What does the ~$1/hour effective cost actually mean?

It's the average hourly inference spend when M2.5 drives a coding agent through a typical 8-hour engineering task, with roughly 2-3M input tokens and 200-400K output tokens billed per hour. The same workload on Claude Sonnet 4 runs $6-12/hour; on Claude Opus 4.7 it runs $15-25/hour.
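The arithmetic behind these hourly figures is easy to reproduce. The sketch below assumes the token ranges are hourly volumes, which is the reading under which M2.5 lands near $1/hour and Claude Opus in the $15-25/hour band; the function name is illustrative.

```python
def hourly_cost(input_tokens: float, output_tokens: float,
                in_price: float, out_price: float) -> float:
    """USD per hour, given hourly token volumes and per-million-token prices."""
    return input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price


# Assumed hourly volumes: low end (2M in, 200K out) and high end (3M in, 400K out).
LOW, HIGH = (2e6, 200e3), (3e6, 400e3)

m25 = (hourly_cost(*LOW, 0.28, 1.10), hourly_cost(*HIGH, 0.28, 1.10))
opus = (hourly_cost(*LOW, 5.00, 25.00), hourly_cost(*HIGH, 5.00, 25.00))

print(f"MiniMax M2.5: ${m25[0]:.2f} to ${m25[1]:.2f} per hour")   # $0.78 to $1.28
print(f"Claude Opus:  ${opus[0]:.2f} to ${opus[1]:.2f} per hour")  # $15.00 to $25.00
```

Input tokens dominate the M2.5 bill in this profile, so prompt caching and context pruning move the hourly number more than output length does.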

Why does the 76.3% BrowseComp score matter for agents?

BrowseComp measures the model's ability to navigate, read, and synthesize content across multiple web pages in an agent loop. 76.3% puts M2.5 in the top tier for browse-style agents, which matters if your use case is research, data extraction, or web automation rather than pure code generation.

Can I self-host the 228.7B MoE weights?

Yes, but the hardware bar is high — ~4xH200 or equivalent for low-latency serving (228.7B parameters total, ~38B active per token under MoE routing). For most teams the API is cheaper than the GPU and ops overhead until you sustain ~5M tokens/day or have hard data-residency requirements.
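A back-of-the-envelope memory check shows why the hardware bar sits around four H200-class GPUs. The sketch assumes all expert weights stay resident in GPU memory (routing activates about 38B parameters per token, but the full 228.7B still has to be loaded) and uses the H200's 141 GB of HBM; KV cache and activations come out of the remaining headroom.

```python
TOTAL_PARAMS = 228.7e9   # total MoE parameters, per this review
H200_MEMORY_GB = 141     # HBM capacity of a single NVIDIA H200


def weight_memory_gb(params: float, bytes_per_param: float) -> float:
    """Memory needed just to hold the weights, in GB (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9


def headroom_gb(num_gpus: int, bytes_per_param: float) -> float:
    """Memory left for KV cache and activations after loading the weights."""
    return num_gpus * H200_MEMORY_GB - weight_memory_gb(TOTAL_PARAMS, bytes_per_param)


for label, width in (("BF16", 2), ("FP8", 1)):
    print(f"{label}: weights {weight_memory_gb(TOTAL_PARAMS, width):.1f} GB, "
          f"headroom on 4 GPUs {headroom_gb(4, width):.1f} GB")
```

At 2 bytes per parameter the weights alone take about 457 GB of the 564 GB available on four cards, which leaves limited room for long-context KV cache; a 1-byte format roughly halves that footprint if the quality trade-off is acceptable.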

How does the 192K context window compare to alternatives?

192K is comfortable for typical chat and agent workflows but smaller than Claude Sonnet 4 (200K), Qwen 3.6 Plus (1M), and Gemini 2.5 Pro (2M). For repository-scale refactors or very long document analysis, prefer the larger-context models. For most coding agent loops, 192K is enough.

When should I pick MiniMax M2.5 over Qwen 3.6 Plus?

Pick M2.5 if SWE-Bench performance is primary (it leads Qwen 3.6 Plus by 1.4 points), if you need more reliability in 5+ step agent tool chains, or if output cost matters ($1.10 vs $1.66 per million tokens). Pick Qwen 3.6 Plus if you need a 1M context window or stronger Chinese-language support.

What's M2.5's main failure mode in production agents?

Over-thinking on trivial tasks. M2.5 occasionally spends 2-3K reasoning tokens on a five-line code change. Mitigate by setting temperature=0 and adding an explicit "answer directly without extended reasoning" instruction for short tasks. On complex multi-step jobs the extra reasoning is usually justified.
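One way to apply that mitigation is to build the request differently for short tasks. The helper below is a sketch: the function name and the `short_task` flag are hypothetical, and how the caller decides that a task is short is left open.

```python
DIRECT_ANSWER_HINT = "Answer directly without extended reasoning."


def build_request(task: str, short_task: bool, model: str = "minimax-m2.5") -> dict:
    """Build chat-completion kwargs, adding the anti-overthinking settings for short tasks."""
    messages = []
    if short_task:
        # Explicit instruction recommended above for trivial edits.
        messages.append({"role": "system", "content": DIRECT_ANSWER_HINT})
    messages.append({"role": "user", "content": task})

    request = {"model": model, "messages": messages}
    if short_task:
        request["temperature"] = 0  # deterministic output for small changes
    return request


print(build_request("Rename variable `tmp` to `buffer` in this function.", short_task=True))
```

The resulting dictionary can be passed straight to the client from the earlier example with `client.chat.completions.create(**build_request(task, short_task=True))`; complex multi-step jobs skip both settings and keep the default reasoning behavior.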
