MiniMax M2.5 Review: 80.2% SWE-Bench Verified at $0.28/M — The Speed-Per-Dollar King of 2026
MiniMax shipped M2.5 on February 12, 2026, and the numbers are hard to ignore. 80.2% on SWE-Bench Verified, 76.3% on BrowseComp, 192K context, and pricing at $0.28 input / $1.10 output per million tokens — for a model that completes agentic tasks 37% faster than its predecessor and matches Claude Opus 4.6 on raw generation speed.
This review covers what M2.5 actually does well in practice, where its long-context ceiling hurts, pricing comparisons against frontier peers, and the practical setup for teams looking to deploy it in production.
MiniMax M2.5 is the February 2026 successor to M2.1, positioned explicitly as a frontier-quality model engineered for economically valuable real-world tasks — coding, agentic tool use, search, and office workflows. The architecture is a 228.7 billion parameter mixture-of-experts model with a 192K token context window, trained with large-scale reinforcement learning across hundreds of thousands of simulated real-world environments.
Three engineering decisions define the model:
Process reward mechanism. To handle credit assignment in long agent rollouts, MiniMax introduced end-to-end monitoring of generation quality step-by-step, not just at the final answer. This is why M2.5 performs meaningfully better on multi-step agent benchmarks than on single-turn completions.
Speed optimization as a first-class goal. M2.5 completes SWE-Bench Verified evaluations 37% faster than M2.1 and matches Claude Opus 4.6's generation speed at roughly 100 tokens per second — while costing ~18× less.
Weights released openly. Unlike Claude or GPT, MiniMax M2.5 weights are available on Hugging Face under a permissive license, enabling self-hosting for teams that need air-gapped deployment.
Coding Performance: Breaking Down the 80.2%
SWE-Bench Verified (80.2%) puts MiniMax M2.5 above Claude Opus 4.6 (79.4%) and Qwen 3.6 Plus (78.8%), and within 5 percentage points of GPT-5.4 (85.0%). For a model at under 5% of Claude's output price, that's an unusually flat cost-quality curve.
Multi-SWE-Bench (51.3%) measures multi-language coding across Python, JavaScript, Go, Rust, and Java. M2.5 holds its own on non-Python tasks, where many open-weight competitors (including DeepSeek V3) tend to over-fit on Python-heavy training data.
BrowseComp (76.3%) is where M2.5 genuinely shines. This benchmark tests end-to-end browsing agent behavior — parsing search results, maintaining state across page transitions, extracting structured data. The 76.3% score (with proper context management) beats Claude Opus 4.6 by 5 points and puts M2.5 in the same league as GPT-5.4 on actual web agent workflows.
Where it still trails:
Pure SWE-Bench frontier — Claude Opus 4.7 (87.6%) and GPT-5.4 (85.0%) remain ahead on repository-scale refactors
Long-context reasoning — the 192K ceiling becomes a real constraint for codebase-wide operations where 1M-context models (Claude, Gemini, Qwen) have room to breathe
Tool-calling reliability on complex nested function schemas — Claude Opus 4.7 is still the safest choice here
Pricing: The Cost Chart That Matters
MiniMax M2.5 is priced aggressively across all major providers:
| Provider | Input ($/M) | Output ($/M) | Notes |
|---|---|---|---|
| MiniMax official | $0.30 | $1.20 | With $0.06/M cached input |
| Together AI | $0.40 | $1.60 | Hosted inference, no account required |
| TokenMix | $0.28 | $1.10 | OpenAI-compatible, unified gateway |
| DeepInfra | $0.30 | $1.20 | Competitive with MiniMax official |
Compared against frontier peers — the numbers that justify the switch:
vs Claude Opus 4.7 ($5/$25): M2.5 is 17.9× cheaper on input, 22.7× cheaper on output
vs GPT-5.4 ($2.38/$14.25): M2.5 is 8.5× cheaper on input, 13.0× cheaper on output
vs Qwen 3.6 Plus ($0.28/$1.66): M2.5 matches input price, 33% cheaper on output
For a typical agent workload running 100 tokens/second continuously:
MiniMax M2.5: roughly $1 per hour — this is the figure MiniMax leans on in their launch materials, and it holds up in independent testing
Claude Opus 4.7: roughly $18 per hour at the same throughput
GPT-5.4: roughly $10 per hour
For always-on agents, coding assistants, and high-volume inference pipelines, that 18× cost delta translates to the difference between a product that can scale and one that can't.
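The arithmetic behind those per-hour figures is worth making explicit. The sketch below assumes a steady 100 tok/s of generation plus roughly 500 tok/s of re-sent prompt context (an assumed input rate — agent loops resend history on every call, so input volume usually dominates), priced at the per-million rates quoted above:

```python
def hourly_cost(input_price, output_price, input_tok_s=500, output_tok_s=100):
    """Dollars per hour at a steady token rate; prices are $ per million tokens."""
    millions_per_hour = lambda rate: rate * 3600 / 1_000_000
    return (millions_per_hour(input_tok_s) * input_price
            + millions_per_hour(output_tok_s) * output_price)

for name, inp, out in [("MiniMax M2.5", 0.28, 1.10),
                       ("Claude Opus 4.7", 5.00, 25.00),
                       ("GPT-5.4", 2.38, 14.25)]:
    print(f"{name}: ${hourly_cost(inp, out):.2f}/hour")
    # → roughly $0.90, $18.00, and $9.41 per hour respectively
```

Change the assumed input rate and the absolute numbers move, but the ratio between providers — the part that matters for a scaling decision — stays put.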
The Context Window Problem
192K is large, but not best-in-class. Where this matters:
Where 192K is enough:
Single-file or multi-file code tasks up to ~100K tokens of context
Document analysis up to ~500 pages of plain text
Multi-turn conversations up to roughly 150 exchanges
Most agentic tool-use loops (where context is managed dynamically)
Where 192K isn't enough:
Repository-wide refactors on large monorepos (>300K tokens of code)
Long legal or technical document review (>500 pages)
Unbounded agent loops without active context pruning
Book-length content generation in a single pass
For teams whose workloads cleanly fit under 192K, M2.5 is the obvious pick. For workloads that routinely exceed it, Qwen 3.6 Plus (1M, $0.28/$1.66) or Claude Sonnet 4.6 (1M, $3/$15) remain the practical alternatives.
Real Use Cases Where M2.5 Wins
Coding assistants in IDEs. The combination of 80.2% SWE-Bench, 100 tok/s generation speed, and $1.10/M output pricing makes M2.5 the single best economic choice for a Cursor- or Windsurf-style in-editor assistant. The speed means completions arrive faster than the user can read; the price means the vendor can actually afford to serve them.
Autonomous coding agents. For agents that write, run, and debug code across multiple cycles, M2.5's combination of agentic training (via the process reward mechanism) and price makes it plausible to run long-horizon autonomous workflows without the bill ballooning.
Browse-and-extract pipelines. The 76.3% BrowseComp score with context management makes M2.5 particularly strong for building agents that navigate websites, extract structured data, and maintain state across page transitions — research agents, price monitors, competitive intelligence pipelines.
Background inference at scale. For teams processing millions of requests (summarization, classification, extraction), M2.5 at $0.28/$1.10 is 17× cheaper than Claude Opus — a difference that shows up in every monthly infra bill.
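At that volume the pipeline shape matters as much as the price. A minimal sketch of a high-throughput classification loop, with the model call stubbed out (the `classify` body is a placeholder for a chat-completions request, and the label logic is illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

def classify(text: str) -> str:
    """Placeholder for a chat-completions call; returns a document label."""
    return "invoice" if "total due" in text.lower() else "other"

def classify_batch(texts, max_workers=32):
    # A thread pool is enough here: the real work is network-bound API calls,
    # so throughput scales with concurrent in-flight requests, not CPU.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(classify, texts))

labels = classify_batch(["Total due: $42", "Meeting notes from Tuesday"])
# → ["invoice", "other"]
```

At cheap per-token rates, the cost ceiling on this kind of pipeline is set by rate limits and concurrency, not by the model bill.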
Code Example: Calling MiniMax M2.5 Through OpenAI-Compatible Endpoint
from openai import OpenAI

# Any OpenAI-compatible endpoint works; only api_key and base_url change.
client = OpenAI(
    api_key="sk-tm-xxxx",
    base_url="https://api.tokenmix.ai/v1",
)

response = client.chat.completions.create(
    model="minimax-m2.5",
    messages=[
        {"role": "system", "content": "You are a senior engineer reviewing a pull request."},
        {"role": "user", "content": "Analyze this diff for correctness, performance issues, and security concerns..."},
    ],
    max_tokens=8000,
    temperature=0.2,  # low temperature for deterministic review output
)

print(response.choices[0].message.content)
M2.5 supports function calling, JSON mode, and streaming through the standard OpenAI SDK interface. No custom client library is required.
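The function-calling path follows the same standard interface. A hedged sketch — the tool name, its schema, and the `TOKENMIX_API_KEY` variable are illustrative placeholders, not from MiniMax documentation:

```python
import os

# Hypothetical tool definition in the standard OpenAI function-calling format.
tools = [{
    "type": "function",
    "function": {
        "name": "run_tests",
        "description": "Run the project's test suite and return the summary.",
        "parameters": {
            "type": "object",
            "properties": {
                "path": {"type": "string", "description": "Test file or directory."},
            },
            "required": ["path"],
        },
    },
}]

# Only attempt the network call when a key is configured.
if os.environ.get("TOKENMIX_API_KEY"):
    from openai import OpenAI

    client = OpenAI(api_key=os.environ["TOKENMIX_API_KEY"],
                    base_url="https://api.tokenmix.ai/v1")
    resp = client.chat.completions.create(
        model="minimax-m2.5",
        messages=[{"role": "user", "content": "Run the tests under tests/ and summarize failures."}],
        tools=tools,
        tool_choice="auto",
    )
    # The model replies with either plain text or one or more tool_calls.
    print(resp.choices[0].message.tool_calls or resp.choices[0].message.content)
```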
Access: Why a Gateway Matters for International Teams
MiniMax is a Chinese AI company, and its official API requires an account on platform.minimax.io with payment options that often don't accept non-Chinese cards. For international teams, the practical access paths are:
Together AI — hosted inference with US-based infrastructure
DeepInfra — similar positioning, competitive pricing
Self-hosting from Hugging Face — practical only for teams with 4–8× H100-class GPU budgets
For production workloads where reliability and compliance matter, a gateway with multi-provider routing removes the single-provider risk and enables model-level failover without code changes.
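A minimal sketch of that failover logic (the provider list and request callable are placeholders; a production gateway adds health checks, rate-limit awareness, and per-provider billing on top):

```python
import time

def call_with_failover(make_request, providers, retries=2, backoff_s=0.5):
    """Try each provider in preference order; retry transient failures with backoff."""
    last_err = None
    for provider in providers:
        for attempt in range(retries):
            try:
                return make_request(provider)
            except Exception as err:  # in practice: catch API/timeout errors only
                last_err = err
                time.sleep(backoff_s * 2 ** attempt)
    raise RuntimeError(f"all providers failed: {last_err!r}")
```

Because every provider in the table exposes the OpenAI-compatible schema, `make_request` only needs to swap the base URL and key — the payload itself never changes.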
Who Should Use MiniMax M2.5
Good fit:
Coding assistants where latency and per-token cost both matter
Autonomous agents running long-horizon workflows within 192K context
High-throughput background inference at scale
Teams priced out of Claude Opus but needing more than DeepSeek V3 delivers
Regulated industries that need to self-host (weights are on Hugging Face)
Poor fit:
Highest-end SWE-Bench leaderboard chasing (GPT-5.4 and Claude Opus 4.7 still lead)
Multi-turn tool-use with deeply nested function schemas (Claude Opus 4.7 remains most reliable)
What's Next for MiniMax
MiniMax has telegraphed M2.7 (a self-evolving agent variant) and continued improvements to the core M-series. The March–April 2026 roadmap suggests the long-context ceiling is the next target — a 512K or 1M-context M3 would close the most visible gap to Claude and Qwen.
Bottom Line
MiniMax M2.5 is the most cost-effective model on the market that cracks 80% SWE-Bench Verified. It matches Claude Opus 4.6 on generation speed while costing 18× less, and it leads Claude on BrowseComp agentic browsing. The 192K context ceiling is the only structural limitation keeping it out of the "default choice" conversation — for any workload that fits under that ceiling, M2.5 is genuinely the strongest economic pick in 2026.
If your current coding workload runs on Claude or GPT and your monthly bill has four or more digits, a week of A/B testing the same workload on MiniMax M2.5 is likely the highest-ROI experiment you can run this quarter.