TokenMix Research Lab · 2026-04-22

MiniMax M2.5 Review: 80.2% SWE-Bench Verified at $0.28/M — The Speed-Per-Dollar King of 2026

MiniMax shipped M2.5 on February 12, 2026, and the numbers are hard to ignore: 80.2% on SWE-Bench Verified, 76.3% on BrowseComp, a 192K context window, and pricing at $0.28 input / $1.10 output per million tokens — for a model that completes agentic tasks 37% faster than its predecessor and matches Claude Opus 4.6 on raw generation speed.

This review covers what M2.5 actually does well in practice, where its long-context ceiling hurts, pricing comparisons against frontier peers, and the practical setup for teams looking to deploy it in production.

TL;DR — The Numbers That Matter

Metric                     MiniMax M2.5   Claude Opus 4.6   GPT-5.4       Qwen 3.6 Plus
SWE-Bench Verified         80.2%          79.4%             85.0%         78.8%
Multi-SWE-Bench            51.3%          ~48%              ~54%          —
BrowseComp                 76.3%          71.0%             78.2%         —
Context window             192K           1M                1.05M         1M
Active parameters          228.7B         proprietary       proprietary   proprietary
Input price ($/M)          $0.28          $5.00             $2.38         $0.28
Output price ($/M)         $1.10          $25.00            $4.25         $1.66
Generation speed (tok/s)   ~100           ~100              ~90           ~85

Sources: MiniMax release notes, independent analysis via Artificial Analysis, TokenMix model catalog.

What MiniMax M2.5 Is

MiniMax M2.5 is the February 2026 successor to M2.1, positioned explicitly as a frontier-quality model engineered for economically valuable real-world tasks — coding, agentic tool use, search, and office workflows. The architecture is a 228.7 billion parameter mixture-of-experts model with a 192K token context window, trained with large-scale reinforcement learning across hundreds of thousands of simulated real-world environments.
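MiniMax has not published M2.5's expert breakdown, but the mixture-of-experts idea itself is easy to illustrate: a gate picks a few experts per token, so only a fraction of the total parameters is active on any forward pass. The sketch below is a toy model of that routing; every number in it is made up for illustration.

```python
# Toy sketch of mixture-of-experts routing: per token, a gate picks the
# top-k experts, so only a fraction of total parameters is "active".
# Illustrative only -- not MiniMax's actual architecture or numbers.

def top_k_experts(gate_scores, k=2):
    """Return indices of the k highest-scoring experts for one token."""
    ranked = sorted(range(len(gate_scores)),
                    key=lambda i: gate_scores[i], reverse=True)
    return ranked[:k]

def active_fraction(num_experts, k, expert_params, shared_params):
    """Fraction of parameters touched per token under top-k routing."""
    total = shared_params + num_experts * expert_params
    active = shared_params + k * expert_params
    return active / total

# Hypothetical example: 64 experts of 3B params each, 10B shared, 2 active.
print(top_k_experts([0.1, 0.7, 0.05, 0.9], k=2))      # -> [3, 1]
print(round(active_fraction(64, 2, 3.0e9, 10e9), 3))  # -> 0.079
```

The point of the exercise: compute cost scales with the active fraction, not the total parameter count, which is how MoE models keep per-token serving cheap.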

Three engineering decisions define the model:

  1. Process reward mechanism. To handle credit assignment in long agent rollouts, MiniMax introduced end-to-end monitoring of generation quality step-by-step, not just at the final answer. This is why M2.5 performs meaningfully better on multi-step agent benchmarks than on single-turn completions.
  2. Speed optimization as a first-class goal. M2.5 completes SWE-Bench Verified evaluations 37% faster than M2.1 and matches Claude Opus 4.6's generation speed at roughly 100 tokens per second — while costing ~18× less.
  3. Weights released openly. Unlike Claude or GPT, MiniMax M2.5 weights are available on Hugging Face under a permissive license, enabling self-hosting for teams that need air-gapped deployment.
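MiniMax hasn't published the details of its process reward mechanism, but the core idea in point 1 can be sketched: score each intermediate step of a rollout and blend that with the final outcome, instead of assigning one scalar to the whole trajectory. Everything below (function names, the 0.5 weight) is a hypothetical illustration of the concept.

```python
# Toy sketch of step-level ("process") reward vs. outcome-only reward.
# Hypothetical illustration of the idea, not MiniMax's published method.

def outcome_reward(step_scores, final_success):
    """Classic credit assignment: one scalar for the whole rollout."""
    return 1.0 if final_success else 0.0

def process_reward(step_scores, final_success, step_weight=0.5):
    """Blend per-step quality scores with the final outcome, so long
    rollouts get credit for good intermediate steps."""
    step_term = sum(step_scores) / len(step_scores)
    outcome_term = 1.0 if final_success else 0.0
    return step_weight * step_term + (1 - step_weight) * outcome_term

# A rollout that mostly made good tool calls but failed at the end
# still earns partial credit under the process reward.
scores = [0.9, 0.8, 0.9, 0.2]
print(outcome_reward(scores, final_success=False))   # -> 0.0
print(process_reward(scores, final_success=False))   # -> 0.35
```

This is why step-level signals help multi-step agent training: long rollouts stop being all-or-nothing, which matches M2.5's stronger showing on agent benchmarks than on single-turn completions.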

Coding Performance: Breaking Down the 80.2%

SWE-Bench Verified (80.2%) puts MiniMax M2.5 above Claude Opus 4.6 (79.4%) and Qwen 3.6 Plus (78.8%), and within 5 percentage points of GPT-5.4 (85.0%). For a model at less than 6% of Claude's output price, that's an unusually flat cost-quality curve.

Multi-SWE-Bench (51.3%) measures multi-language coding across Python, JavaScript, Go, Rust, and Java. M2.5 holds its own on non-Python tasks, where many open-weight competitors (including DeepSeek V3) tend to overfit to Python-heavy training data.

BrowseComp (76.3%) is where M2.5 genuinely shines. This benchmark tests end-to-end browsing agent behavior — parsing search results, maintaining state across page transitions, extracting structured data. The 76.3% score (with proper context management) beats Claude Opus 4.6 by 5 points and puts M2.5 in the same league as GPT-5.4 on actual web agent workflows.
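"Proper context management" is doing real work in that score. One common pattern — sketched below under my own assumptions, not MiniMax's documented approach — is to keep full text only for the most recent pages and collapse older ones to summaries so the running context stays under budget (token counts here are crudely approximated by word count).

```python
# Sketch of browsing-agent context management: keep full text for recent
# pages, collapse older ones to summaries, stay under a token budget.
# Illustrative; a real agent would use the model's actual tokenizer.

def approx_tokens(text):
    return len(text.split())  # crude word-count approximation

def trim_history(pages, budget, keep_full=2):
    """pages: list of (url, full_text, summary), oldest first.
    Returns (url, text) entries fitting under `budget` tokens."""
    context, used = [], 0
    for i, (url, full, summary) in enumerate(reversed(pages)):
        text = full if i < keep_full else summary  # recent pages stay full
        cost = approx_tokens(text)
        if used + cost > budget:
            break
        context.append((url, text))
        used += cost
    return list(reversed(context))  # restore oldest-first order

pages = [("a", "one two three four", "summary of a"),
         ("b", "five six seven eight", "summary of b"),
         ("c", "nine ten", "summary of c"),
         ("d", "eleven twelve", "summary of d")]
print(trim_history(pages, budget=4))
# -> [('c', 'nine ten'), ('d', 'eleven twelve')]
```

With a generous budget all four pages survive (older ones as summaries); with a tight one, only the recent full-text pages make the cut.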

Where it still trails: GPT-5.4 keeps a roughly 5-point lead on SWE-Bench Verified (85.0% vs 80.2%), and the 192K context window caps the repo-scale, long-document work that 1M-context peers handle in a single pass.

Pricing: The Cost Chart That Matters

MiniMax M2.5 is priced aggressively across all major providers:

Provider           Input ($/M)   Output ($/M)   Notes
MiniMax official   $0.30         $1.20          With $0.06/M cached input
Together AI        $0.40         $1.60          Hosted inference, no account required
TokenMix           $0.28         $1.10          OpenAI-compatible, unified gateway
DeepInfra          $0.30         $1.20          Competitive with MiniMax official

Compared against frontier peers — the numbers that justify the switch:

  - Claude Opus 4.6 ($5.00 / $25.00): roughly 18× M2.5's input price, over 20× its output price
  - GPT-5.4 ($2.38 / $4.25): about 8.5× on input, nearly 4× on output
  - Qwen 3.6 Plus ($0.28 / $1.66): the same input price, about 1.5× on output

For a typical agent workload running 100 tokens/second continuously, that's roughly 8.6M output tokens per day: about $9.50/day on M2.5 versus around $216/day on Claude Opus 4.6.

For always-on agents, coding assistants, and high-volume inference pipelines, that 18× cost delta translates to the difference between a product that can scale and one that can't.
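The always-on arithmetic is easy to reproduce. The sketch below uses the per-million output prices quoted in this article ($1.10 for M2.5 via TokenMix, $25.00 for Claude Opus 4.6); swap in your own provider's rates.

```python
# Back-of-envelope daily output cost for an always-on agent emitting
# ~100 tokens/second, using this article's per-million-token prices.

SECONDS_PER_DAY = 86_400

def daily_output_cost(tok_per_sec, price_per_million):
    tokens_per_day = tok_per_sec * SECONDS_PER_DAY
    return tokens_per_day / 1_000_000 * price_per_million

m25 = daily_output_cost(100, 1.10)    # MiniMax M2.5 via TokenMix
opus = daily_output_cost(100, 25.00)  # Claude Opus 4.6

print(f"M2.5:  ${m25:.2f}/day")      # -> M2.5:  $9.50/day
print(f"Opus:  ${opus:.2f}/day")     # -> Opus:  $216.00/day
print(f"ratio: {opus / m25:.1f}x")   # -> ratio: 22.7x
```

Input tokens add to both bills, but since the input-price ratio ($0.28 vs $5.00) is similar, the blended gap stays in the same ballpark.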

The Context Window Problem

192K is large, but not best-in-class: Claude Opus 4.6, GPT-5.4, and Qwen 3.6 Plus all ship 1M-class windows.

Where 192K is enough:

  - In-editor coding assistants working on a handful of files at a time
  - Browse-and-extract agents that summarize and discard pages as they go
  - High-volume summarization, classification, and extraction pipelines

Where 192K isn't enough:

  - Whole-repository analysis of large monorepos in a single prompt
  - Long-horizon agent sessions that accumulate full tool-call histories
  - Single-pass processing of book-length documents

For teams whose workloads cleanly fit under 192K, M2.5 is the obvious pick. For workloads that routinely exceed it, Qwen 3.6 Plus (1M, $0.28/$1.66) or Claude Sonnet 4.6 (1M, $3/$15) remain the practical alternatives.
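A pragmatic middle ground is to route by estimated prompt size: send anything that fits under M2.5's window to the cheap model and overflow to a long-context fallback. The model IDs follow this article; the 4-characters-per-token estimate and the safety margin are rough assumptions, not exact tokenizer math.

```python
# Route by estimated prompt size: requests that fit under M2.5's 192K
# window go to the cheap model, overflow goes to a 1M-context fallback.
# The chars/4 heuristic is a crude stand-in for a real tokenizer.

CONTEXT_LIMIT = 192_000
SAFETY_MARGIN = 8_000          # leave headroom for the model's reply

def estimate_tokens(text):
    return len(text) // 4      # rough approximation

def pick_model(prompt, fallback="qwen-3.6-plus"):
    if estimate_tokens(prompt) + SAFETY_MARGIN <= CONTEXT_LIMIT:
        return "minimax-m2.5"
    return fallback

print(pick_model("short prompt"))     # -> minimax-m2.5
print(pick_model("x" * 1_000_000))    # -> qwen-3.6-plus
```

Since Qwen 3.6 Plus shares M2.5's $0.28 input price, this kind of router keeps the common case cheap without hard-failing on oversized requests.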

Real Use Cases Where M2.5 Wins

Coding assistants in IDEs. The combination of 80.2% SWE-Bench, 100 tok/s generation speed, and $1.10/M output pricing makes M2.5 the single best economic choice for a Cursor- or Windsurf-style in-editor assistant. The speed means completions arrive faster than the user can read; the price means the vendor can actually afford to serve them.

Autonomous coding agents. For agents that write, run, and debug code across multiple cycles, M2.5's combination of agentic training (via the process reward mechanism) and price makes it plausible to run long-horizon autonomous workflows without the bill ballooning.

Browse-and-extract pipelines. The 76.3% BrowseComp score with context management makes M2.5 particularly strong for building agents that navigate websites, extract structured data, and maintain state across page transitions — research agents, price monitors, competitive intelligence pipelines.

Background inference at scale. For teams processing millions of requests (summarization, classification, extraction), M2.5 at $0.28/$1.10 is roughly 18× cheaper than Claude Opus — a difference that shows up in every monthly infra bill.

Code Example: Calling MiniMax M2.5 Through OpenAI-Compatible Endpoint

from openai import OpenAI

client = OpenAI(
    api_key="sk-tm-xxxx",
    base_url="https://api.tokenmix.ai/v1",
)

response = client.chat.completions.create(
    model="minimax-m2.5",
    messages=[
        {"role": "system", "content": "You are a senior engineer reviewing a pull request."},
        {"role": "user", "content": "Analyze this diff for correctness, performance issues, and security concerns..."},
    ],
    max_tokens=8000,
    temperature=0.2,
)

print(response.choices[0].message.content)

M2.5 supports function calling, JSON mode, and streaming through the standard OpenAI SDK interface. No custom client library is required.

Access: Why a Gateway Matters for International Teams

MiniMax is a Chinese AI company, and its official API requires an account on platform.minimax.io with payment options that often don't accept non-Chinese cards. For international teams, the practical access paths are:

  - Hosted inference through international providers such as Together AI or DeepInfra
  - A unified, OpenAI-compatible gateway such as TokenMix
  - Self-hosting the open weights from Hugging Face for air-gapped deployments

For production workloads where reliability and compliance matter, a gateway with multi-provider routing removes the single-provider risk and enables model-level failover without code changes.
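The failover pattern itself is small. Below is a minimal sketch under my own assumptions: the provider list and stub responses are hypothetical, and the network call is injected as a function so the routing logic stays testable — in production it would wrap an OpenAI-compatible client per provider.

```python
# Minimal provider-failover sketch: try providers in order, fall
# through on errors. `call` is injected so the logic is testable;
# provider names here are illustrative.

PROVIDERS = ["tokenmix", "deepinfra", "together"]

def complete_with_failover(prompt, call, providers=PROVIDERS):
    last_error = None
    for provider in providers:
        try:
            return provider, call(provider, prompt)
        except Exception as exc:        # network error, rate limit, 5xx
            last_error = exc
    raise RuntimeError(f"all providers failed: {last_error}")

# Stubbed example: the first provider is down, the second succeeds.
def flaky_call(provider, prompt):
    if provider == "tokenmix":
        raise ConnectionError("503")
    return f"{provider} answered: ok"

print(complete_with_failover("hi", flaky_call))
# -> ('deepinfra', 'deepinfra answered: ok')
```

Returning the provider name alongside the response makes it easy to log which backend actually served each request.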

Who Should Use MiniMax M2.5

Good fit:

  - High-volume coding assistants and autonomous coding agents where cost per token dominates
  - Browse-and-extract web agents: research, price monitoring, competitive intelligence
  - Background inference at scale: summarization, classification, extraction
  - Teams that need open weights for self-hosted or air-gapped deployment

Poor fit:

  - Workloads that routinely exceed 192K context, such as monorepo-scale analysis or book-length documents
  - Teams that need the absolute best SWE-Bench score regardless of price, where GPT-5.4 still leads

What's Next for MiniMax

MiniMax has telegraphed M2.7 (a self-evolving agent variant) and continued improvements to the core M-series. The March–April 2026 roadmap suggests the long-context ceiling is the next target — a 512K or 1M-context M3 would close the most visible gap to Claude and Qwen.

Bottom Line

MiniMax M2.5 is the most cost-effective model on the market that cracks 80% SWE-Bench Verified. It matches Claude Opus 4.6 on generation speed while costing 18× less, and it leads Claude on BrowseComp agentic browsing. The 192K context ceiling is the only structural limitation keeping it out of the "default choice" conversation — for any workload that fits under that ceiling, M2.5 is genuinely the strongest economic pick in 2026.

If your current coding workload runs on Claude or GPT and your monthly bill has four or more digits, a week of A/B testing the same workload on MiniMax M2.5 is likely the highest-ROI experiment you can run this quarter.


By TokenMix Research Lab · Updated 2026-04-22