TokenMix Research Lab · 2026-04-22
MiniMax M2.5 Review: 80.2% SWE-Bench Verified at $0.28/M — The Speed-Per-Dollar King of 2026
Last Updated: 2026-04-29
Author: TokenMix Research Lab
MiniMax shipped M2.5 on February 12, 2026, and the numbers are hard to ignore. 80.2% on SWE-Bench Verified, 76.3% on BrowseComp, 192K context, and pricing at $0.28 input / $1.10 output per million tokens — for a model that completes agentic tasks 37% faster than its predecessor and matches Claude Opus 4.6 on raw generation speed.
This review covers what M2.5 actually does well in practice, where its long-context ceiling hurts, pricing comparisons against frontier peers, and the practical setup for teams looking to deploy it in production.
Table of Contents
- TL;DR — The Numbers That Matter
- What MiniMax M2.5 Is
- Coding Performance: Breaking Down the 80.2%
- Pricing: The Cost Chart That Matters
- The Context Window Problem
- Real Use Cases Where M2.5 Wins
- Code Example: Calling MiniMax M2.5 Through OpenAI-Compatible Endpoint
- Access: Why a Gateway Matters for International Teams
- Who Should Use MiniMax M2.5
- What's Next for MiniMax
- Bottom Line
TL;DR — The Numbers That Matter
| Metric | MiniMax M2.5 | Claude Opus 4.6 | GPT-5.4 | Qwen 3.6 Plus |
|---|---|---|---|---|
| SWE-Bench Verified | 80.2% | 79.4% | 85.0% | 78.8% |
| Multi-SWE-Bench | 51.3% | ~48% | ~54% | — |
| BrowseComp | 76.3% | 71.0% | 78.2% | — |
| Context window | 192K | 1M | 1.05M | 1M |
| Total parameters (MoE) | 228.7B | proprietary | proprietary | proprietary |
| Input price ($/M) | $0.28 | $5.00 | $2.38 | $0.28 |
| Output price ($/M) | $1.10 | $25.00 | $14.25 | $1.66 |
| Generation speed (tok/s) | ~100 | ~100 | ~90 | ~85 |
Sources: MiniMax release notes, independent analysis via Artificial Analysis, TokenMix model catalog.
What MiniMax M2.5 Is
MiniMax M2.5 is the February 2026 successor to M2.1, positioned explicitly as a frontier-quality model engineered for economically valuable real-world tasks — coding, agentic tool use, search, and office workflows. The architecture is a 228.7 billion parameter mixture-of-experts model with a 192K token context window, trained with large-scale reinforcement learning across hundreds of thousands of simulated real-world environments.
Three engineering decisions define the model:
- Process reward mechanism. To handle credit assignment in long agent rollouts, MiniMax introduced end-to-end monitoring of generation quality step-by-step, not just at the final answer. This is why M2.5 performs meaningfully better on multi-step agent benchmarks than on single-turn completions.
- Speed optimization as a first-class goal. M2.5 completes SWE-Bench Verified evaluations 37% faster than M2.1 and matches Claude Opus 4.6's generation speed at roughly 100 tokens per second — while costing ~18× less.
- Weights released openly. Unlike Claude or GPT, MiniMax M2.5 weights are available on Hugging Face under a permissive license, enabling self-hosting for teams that need air-gapped deployment.
Coding Performance: Breaking Down the 80.2%
SWE-Bench Verified (80.2%) puts MiniMax M2.5 above Claude Opus 4.6 (79.4%) and Qwen 3.6 Plus (78.8%), and within 5 percentage points of GPT-5.4 (85.0%). For a model at roughly 4.4% of Claude Opus 4.6's output price ($1.10 vs $25.00 per million tokens), that's an unusually flat cost-quality curve.
Multi-SWE-Bench (51.3%) measures multi-language coding across Python, JavaScript, Go, Rust, and Java. M2.5 holds its own on non-Python tasks, where many open-weight competitors (including DeepSeek V3) tend to over-fit on Python-heavy training data.
BrowseComp (76.3%) is where M2.5 genuinely shines. This benchmark tests end-to-end browsing agent behavior — parsing search results, maintaining state across page transitions, extracting structured data. The 76.3% score (with proper context management) beats Claude Opus 4.6 by 5 points and puts M2.5 in the same league as GPT-5.4 on actual web agent workflows.
Where it still trails:
- Pure SWE-Bench frontier — Claude Opus 4.7 (87.6%) and GPT-5.4 (85.0%) remain ahead on repository-scale refactors
- Long-context reasoning — the 192K ceiling becomes a real constraint for codebase-wide operations where 1M-context models (Claude, Gemini, Qwen) have room to breathe
- Tool-calling reliability on complex nested function schemas — Claude Opus 4.7 is still the safest choice here
Pricing: The Cost Chart That Matters
MiniMax M2.5 is priced aggressively across all major providers:
| Provider | Input ($/M) | Output ($/M) | Notes |
|---|---|---|---|
| MiniMax official | $0.30 | $1.20 | With $0.06/M cached input |
| Together AI | $0.40 | $1.60 | Hosted inference, no account required |
| TokenMix | $0.28 | $1.10 | OpenAI-compatible, unified gateway |
| DeepInfra | $0.30 | $1.20 | Competitive with MiniMax official |
Compared against frontier peers — the numbers that justify the switch:
- vs Claude Opus 4.7 ($5/$25): M2.5 is 17.9× cheaper on input, 22.7× cheaper on output
- vs GPT-5.4 ($2.38/$14.25): M2.5 is 8.5× cheaper on input, 13.0× cheaper on output
- vs Qwen 3.6 Plus ($0.28/$1.66): M2.5 matches input price, 33% cheaper on output
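The multiples above follow directly from the list prices quoted in this review. A minimal sketch that recomputes them; the dictionary layout and the `price_multiple` helper are illustrative names, not part of any vendor SDK:

```python
# Prices in USD per million tokens (input, output), as quoted in this review.
PRICES = {
    "MiniMax M2.5": (0.28, 1.10),
    "Claude Opus 4.7": (5.00, 25.00),
    "GPT-5.4": (2.38, 14.25),
    "Qwen 3.6 Plus": (0.28, 1.66),
}


def price_multiple(peer: str, baseline: str = "MiniMax M2.5") -> tuple[float, float]:
    """Return the (input, output) price of `peer` as a multiple of `baseline`."""
    peer_in, peer_out = PRICES[peer]
    base_in, base_out = PRICES[baseline]
    return round(peer_in / base_in, 1), round(peer_out / base_out, 1)


for name in PRICES:
    if name != "MiniMax M2.5":
        print(name, price_multiple(name))
```

Swapping in a different provider's M2.5 price (for example the $0.30/$1.20 official rate) only requires changing the baseline entry.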
For a typical agent workload running 100 tokens/second continuously:
- MiniMax M2.5: roughly $1 per hour — this is the figure MiniMax leans on in their launch materials, and it holds up in independent testing
- Claude Opus 4.7: roughly $18 per hour at the same throughput
- GPT-5.4: roughly $10 per hour
For always-on agents, coding assistants, and high-volume inference pipelines, that 18× cost delta translates to the difference between a product that can scale and one that can't.
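Output tokens alone come to about $0.40 per hour for M2.5 at 100 tokens/second (360,000 tokens at $1.10/M), so the hourly figures above presumably also count input tokens. The sketch below assumes roughly 2 million input tokens per hour. That value is an assumption chosen only because it lands near the quoted numbers; it is not a figure from MiniMax.

```python
def hourly_cost(input_price: float, output_price: float,
                output_tok_per_s: float = 100.0,
                input_tok_per_hour: float = 2_000_000) -> float:
    """USD per hour of continuous generation.

    Prices are USD per million tokens. `input_tok_per_hour` is an assumed
    workload parameter, not a vendor-published figure.
    """
    output_tok_per_hour = output_tok_per_s * 3600
    total = input_tok_per_hour * input_price + output_tok_per_hour * output_price
    return total / 1_000_000


for label, (p_in, p_out) in {
    "MiniMax M2.5": (0.28, 1.10),
    "Claude Opus 4.7": (5.00, 25.00),
    "GPT-5.4": (2.38, 14.25),
}.items():
    print(f"{label}: ${hourly_cost(p_in, p_out):.2f}/hour")
```

Under that assumption the function gives about $0.96/hour for M2.5, $19.00 for Claude Opus 4.7, and $9.89 for GPT-5.4. Workloads with a different input-to-output ratio will shift these numbers, so it is worth plugging in your own token counts.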
The Context Window Problem
192K is large, but not best-in-class. Whether that matters depends on the workload.
Where 192K is enough:
- Single-file or multi-file code tasks up to ~100K tokens of context
- Document analysis up to ~500 pages of plain text
- Multi-turn conversations up to roughly 150 exchanges
- Most agentic tool-use loops (where context is managed dynamically)
Where 192K isn't enough:
- Repository-wide refactors on large monorepos (>300K tokens of code)
- Long legal or technical document review (>500 pages)
- Unbounded agent loops without active context pruning
- Book-length content generation in a single pass
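"Active context pruning" can be as simple as dropping the oldest turns once the conversation nears the window. The sketch below is one possible approach, not MiniMax's own mechanism. It assumes a crude 4-characters-per-token estimate (a real tokenizer will differ), and the function names and the 150,000-token default budget are illustrative.

```python
def estimate_tokens(text: str) -> int:
    """Crude estimate: about 4 characters per token (an assumption, not a tokenizer)."""
    return max(1, len(text) // 4)


def prune_history(messages: list[dict], budget: int = 150_000) -> list[dict]:
    """Keep system messages plus the most recent turns that fit within `budget` tokens."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    used = sum(estimate_tokens(m["content"]) for m in system)
    kept: list[dict] = []
    for msg in reversed(rest):  # walk from newest to oldest
        cost = estimate_tokens(msg["content"])
        if used + cost > budget:
            break  # everything older than this turn is dropped
        kept.append(msg)
        used += cost
    return system + list(reversed(kept))
```

Setting the budget below 192K leaves headroom for the model's reply and for estimation error.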
For teams whose workloads cleanly fit under 192K, M2.5 is the obvious pick. For workloads that routinely exceed it, Qwen 3.6 Plus (1M, $0.28/$1.66) or Claude Sonnet 4.6 (1M, $3/$15) remain the practical alternatives.
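The selection rule in the paragraph above can be encoded as a small router. This is a sketch under stated assumptions: prompt size is estimated at 4 characters per token, and `long-context-fallback` is a hypothetical placeholder slug for whichever 1M-context model a team chooses.

```python
M25_WINDOW = 192_000  # M2.5 context window, per this review


def pick_model(prompt_chars: int, max_output_tokens: int = 8_000) -> str:
    """Choose a model slug from a rough prompt-size estimate."""
    est_prompt_tokens = prompt_chars // 4  # heuristic, not a tokenizer
    if est_prompt_tokens + max_output_tokens <= M25_WINDOW:
        return "minimax-m2.5"
    return "long-context-fallback"  # hypothetical slug for a 1M-context model
```

Reserving `max_output_tokens` in the check matters because the window has to hold both the prompt and the reply.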
Real Use Cases Where M2.5 Wins
Coding assistants in IDEs. The combination of 80.2% SWE-Bench, 100 tok/s generation speed, and $1.10/M output pricing makes M2.5 the single best economic choice for a Cursor- or Windsurf-style in-editor assistant. The speed means completions arrive faster than the user can read; the price means the vendor can actually afford to serve them.
Autonomous coding agents. For agents that write, run, and debug code across multiple cycles, M2.5's combination of agentic training (via the process reward mechanism) and price makes it plausible to run long-horizon autonomous workflows without the bill ballooning.
Browse-and-extract pipelines. The 76.3% BrowseComp score with context management makes M2.5 particularly strong for building agents that navigate websites, extract structured data, and maintain state across page transitions — research agents, price monitors, competitive intelligence pipelines.
Background inference at scale. For teams processing millions of requests (summarization, classification, extraction), M2.5 at $0.28/$1.10 is roughly 18× cheaper on input and 23× cheaper on output than Claude Opus ($5/$25), a difference that shows up in every monthly infra bill.
Code Example: Calling MiniMax M2.5 Through OpenAI-Compatible Endpoint
from openai import OpenAI

# Point the standard OpenAI client at the TokenMix gateway.
client = OpenAI(
    api_key="sk-tm-xxxx",  # placeholder; substitute your own key
    base_url="https://api.tokenmix.ai/v1",
)

response = client.chat.completions.create(
    model="minimax-m2.5",
    messages=[
        {"role": "system", "content": "You are a senior engineer reviewing a pull request."},
        {"role": "user", "content": "Analyze this diff for correctness, performance issues, and security concerns..."},
    ],
    max_tokens=8000,
    temperature=0.2,
)

print(response.choices[0].message.content)
M2.5 supports function calling, JSON mode, and streaming through the standard OpenAI SDK interface. No custom client library is required.
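As a rough illustration of streaming combined with function calling, the sketch below builds a request with one tool and accumulates the streamed reply. The `post_review_comment` tool and its schema are hypothetical, and whether a given gateway honors every parameter shown here is an assumption worth verifying against its documentation. The chunk fields used (`choices[0].delta.content`, `delta.tool_calls`) are those of the OpenAI Python SDK's streaming interface.

```python
def build_request(diff: str) -> dict:
    """Assemble keyword arguments for client.chat.completions.create()."""
    return {
        "model": "minimax-m2.5",
        "messages": [
            {"role": "system", "content": "You are a senior engineer reviewing a pull request."},
            {"role": "user", "content": f"Review this diff:\n{diff}"},
        ],
        "tools": [{
            "type": "function",
            "function": {
                "name": "post_review_comment",  # hypothetical tool
                "description": "Attach a review comment to a line of the diff.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "line": {"type": "integer"},
                        "comment": {"type": "string"},
                    },
                    "required": ["line", "comment"],
                },
            },
        }],
        "temperature": 0.2,
        "stream": True,
    }


def stream_review(client, diff: str) -> tuple[str, dict[int, str]]:
    """Stream the reply; return (text, tool-call argument JSON keyed by call index)."""
    text_parts: list[str] = []
    tool_args: dict[int, str] = {}
    for chunk in client.chat.completions.create(**build_request(diff)):
        if not chunk.choices:  # some chunks may carry no choices
            continue
        delta = chunk.choices[0].delta
        if delta.content:
            text_parts.append(delta.content)
        for call in delta.tool_calls or []:
            # Tool-call arguments arrive as string fragments spread across chunks.
            fragment = (call.function.arguments or "") if call.function else ""
            tool_args[call.index] = tool_args.get(call.index, "") + fragment
    return "".join(text_parts), tool_args
```

Pass in the `client` object constructed in the example above. Each value in the returned dictionary is a JSON string that can be parsed with `json.loads` once the stream has finished.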
Access: Why a Gateway Matters for International Teams
MiniMax is a Chinese AI company, and its official API requires an account on platform.minimax.io with payment options that often don't accept non-Chinese cards. For international teams, the practical access paths are:
- TokenMix: USD billing, OpenAI-compatible, crypto/Stripe/Alipay accepted (model slug: minimax-m2.5)
- Together AI: hosted inference with US-based infrastructure
- DeepInfra: similar positioning, competitive pricing
- Self-hosting from Hugging Face: practical only for teams with 4–8× H100-class GPU budgets
For production workloads where reliability and compliance matter, a gateway with multi-provider routing removes the single-provider risk and enables model-level failover without code changes.
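To make "model-level failover" concrete, here is a client-side sketch of the idea: try providers in order and return the first successful reply. A gateway would do this server-side; the code is an illustration and not a description of how TokenMix implements routing. The provider callables are hypothetical stand-ins for real API calls.

```python
from typing import Callable, Sequence


def call_with_failover(
    providers: Sequence[tuple[str, Callable[[str], str]]],
    prompt: str,
) -> tuple[str, str]:
    """Try each (name, call) pair in order; return (name, reply) from the first success."""
    errors: list[str] = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # in production, catch provider-specific errors
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

In practice each callable would wrap a `chat.completions.create` call against a different base URL, with timeouts set so a stalled provider fails over quickly.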
Who Should Use MiniMax M2.5
Good fit:
- Coding assistants where latency and per-token cost both matter
- Autonomous agents running long-horizon workflows within 192K context
- High-throughput background inference at scale
- Teams priced out of Claude Opus but needing more than DeepSeek V3 delivers
- Regulated industries that need to self-host (weights are on Hugging Face)
Poor fit:
- Repository-scale refactors requiring >200K context
- Highest-end SWE-Bench leaderboard chasing (GPT-5.4 and Claude Opus 4.7 still lead)
- Multi-turn tool-use with deeply nested function schemas (Claude Opus 4.7 remains most reliable)
What's Next for MiniMax
MiniMax has telegraphed M2.7 (a self-evolving agent variant) and continued improvements to the core M-series. The March–April 2026 roadmap suggests the long-context ceiling is the next target — a 512K or 1M-context M3 would close the most visible gap to Claude and Qwen.
Bottom Line
MiniMax M2.5 is the most cost-effective model on the market that cracks 80% SWE-Bench Verified. It matches Claude Opus 4.6 on generation speed while costing 18× less, and it leads Claude on BrowseComp agentic browsing. The 192K context ceiling is the only structural limitation keeping it out of the "default choice" conversation — for any workload that fits under that ceiling, M2.5 is genuinely the strongest economic pick in 2026.
If your current coding workload runs on Claude or GPT and your monthly bill has four or more digits, a week of A/B testing the same workload on MiniMax M2.5 is likely the highest-ROI experiment you can run this quarter.
Sources:
- MiniMax M2.5 official release
- MiniMax M2.5 on Hugging Face
- MiniMax M2.5 analysis — Artificial Analysis
- TokenMix MiniMax M2.5 listing
- Claude Opus 4.7 Review — TokenMix
- GLM-5.1 SWE-Bench Pro Review — TokenMix