TokenMix Research Lab · 2026-04-23
Step 3.5 Flash Review: StepFun's 196B MoE Outruns DeepSeek V3.2 at $0.10/MTok (2026)
Last Updated: 2026-04-23
Author: TokenMix Research Lab
Shanghai-based StepFun open-sourced Step 3.5 Flash on February 1, 2026: a 196-billion-parameter MoE with only 11B active per token, shipped under Apache 2.0. Headline: it outscored DeepSeek V3.2 (671B) and Moonshot's Kimi K2.5 (1T) on several agentic, reasoning, and coding benchmarks despite being 3-5× smaller. Hard numbers: 97.3 AIME 2025, 86.4% LiveCodeBench-V6, 74.4% SWE-Bench Verified, 88.2 τ²-Bench, 262K context, 100-300 tok/s generation, and API pricing at $0.10 input / $0.30 output per MTok, among the cheapest Chinese frontier tiers on the market. TokenMix.ai tracks Step 3.5 Flash alongside 300+ other models, and this review covers who should actually use it, where it beats DeepSeek, and why the "small is the new big" thesis suddenly looks credible.
Table of Contents
- Confirmed vs Speculation
- Why 196B Matters: The Sparse MoE Bet
- Benchmark Breakdown: Where Step 3.5 Flash Wins
- Step 3.5 Flash vs DeepSeek V3.2 vs Kimi K2.5
- Pricing: $0.10/MTok Is a Wedge
- Agentic Capabilities & Tool Use
- How to Run Step 3.5 Flash (API + Self-Host)
- Where Step 3.5 Flash Falls Short
- Who Should Actually Use This Model
- FAQ
Confirmed vs Speculation
| Claim | Status |
|---|---|
| Released February 1, 2026 | Confirmed (StepFun GitHub) |
| 196B total parameters, 11B active (sparse MoE) | Confirmed |
| Apache 2.0 license | Confirmed |
| 262,144 context / 65,536 max output | Confirmed (HuggingFace) |
| 74.4% SWE-Bench Verified | Confirmed |
| 97.3 AIME 2025 | Confirmed |
| 86.4% LiveCodeBench-V6 | Confirmed |
| 88.2 τ²-Bench (agentic) | Confirmed |
| $0.10 input / $0.30 output per MTok (OpenRouter) | Confirmed (OpenRouter) |
| Beats DeepSeek V3.2 and Kimi K2.5 on several benchmarks | Confirmed (self-reported + third-party) |
| Drop-in replacement for GPT-4 class workloads | Partial — depends on task |
| Safe for production English-language customer-facing work | No — still English-second-language quirks |
Why 196B Matters: The Sparse MoE Bet
The 2026 arms race has two camps:
- Dense giants: Claude Opus 4.7, GPT-5.4 (parameter counts undisclosed but >500B dense assumed)
- Sparse behemoths: DeepSeek V3.2 (671B / 37B active), Kimi K2.6 (1T / 32B active), Llama 4 Behemoth (2T / 288B active)
Step 3.5 Flash sits in a third camp: small sparse. Only 196B total, only 11B active per token. That's roughly one-fifth the total size of Kimi K2.6 and about a third of its active parameters. By conventional scaling-law wisdom, it should get crushed.
Instead it beats DeepSeek V3.2 and Kimi K2.5 on several benchmarks. Why:
- Curated training data — StepFun emphasized quality over quantity (following Phi-style methodology but scaled up)
- Expert routing efficiency — active params get used well, not averaged
- Agent-first training objective — trained with tool use and multi-step coherence as first-class objectives, not afterthoughts
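This matters because inference economics favor sparse-small. With 11B active params, each token costs roughly as much compute as a small dense model, which is what makes 100-300 tok/s generation feasible. The full 196B weights still have to sit in GPU memory, so self-hosting means a multi-GPU node rather than a single H100. Kimi K2.6 at 32B active needs 2-4× more silicon per request.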
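The sparsity argument above can be made concrete with back-of-the-envelope arithmetic. This is a minimal sketch, assuming the common rule of thumb that a forward pass costs about 2 FLOPs per active parameter per token; the function names are illustrative, and the parameter counts are the ones quoted in this article.

```python
# Rough per-token compute comparison for sparse MoE models.
# Assumption: forward-pass cost ~= 2 FLOPs per *active* parameter per token.

MODELS = {
    # name: (total params, active params), figures quoted in this article
    "Step 3.5 Flash": (196e9, 11e9),
    "DeepSeek V3.2": (671e9, 37e9),
    "Kimi K2.6": (1e12, 32e9),
}


def active_fraction(total: float, active: float) -> float:
    """Share of the weights that are used for each token."""
    return active / total


def flops_per_token(active: float) -> float:
    """Approximate forward-pass FLOPs per generated token."""
    return 2.0 * active


if __name__ == "__main__":
    base = flops_per_token(MODELS["Step 3.5 Flash"][1])
    for name, (total, active) in MODELS.items():
        rel = flops_per_token(active) / base
        print(f"{name}: {active_fraction(total, active):.1%} of weights active, "
              f"{rel:.1f}x the per-token compute of Step 3.5 Flash")
```

Under this approximation Kimi K2.6 lands at roughly 2.9× the per-token compute of Step 3.5 Flash, which sits inside the 2-4× range quoted above. Memory footprint is a separate constraint, since all experts must stay resident regardless of how few are active.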
Benchmark Breakdown: Where Step 3.5 Flash Wins
| Benchmark | Step 3.5 Flash | DeepSeek V3.2 | Kimi K2.5 | Claude Opus 4.6 |
|---|---|---|---|---|
| SWE-Bench Verified | 74.4% | ~68% | 80.2% | ~83% |
| AIME 2025 | 97.3 | ~94 | ~93 | ~95 |
| LiveCodeBench-V6 | 86.4% | ~78% | ~82% | ~88% |
| τ²-Bench (agentic) | 88.2 | ~75 | ~80 | ~85 |
| GPQA Diamond | ~72% | ~68% | ~70% | ~78% |
| Context | 262K | 128K | 256K | 200K |
Sources: StepFun self-reported + LLMBase / Design for Online
The real story: Step 3.5 Flash matches or beats DeepSeek V3.2 (a 671B model) on code and reasoning despite being 3.4× smaller and roughly 30% cheaper on input tokens ($0.10 vs $0.14 per MTok; output is near parity at $0.30 vs $0.28). On math specifically (AIME 2025 at 97.3), it's near-saturation and ahead of every Chinese open-source peer.
Step 3.5 Flash vs DeepSeek V3.2 vs Kimi K2.5
| Dimension | Step 3.5 Flash | DeepSeek V3.2 | Kimi K2.5 |
|---|---|---|---|
| Total params | 196B | 671B | 1T |
| Active params | 11B | 37B | 32B |
| Context | 262K | 128K | 256K |
| License | Apache 2.0 | DeepSeek Model License | Modified MIT |
| API input ($/MTok) | $0.10 | $0.14 | ~$0.28 |
| API output ($/MTok) | $0.30 | $0.28 | ~$1.00 |
| Throughput (tok/s) | 100-300 | 60-150 | 50-120 |
| Best-in-class at | Math + cost/param efficiency | Balance | Code (until K2.6) |
| Tool-use maturity | B+ | B | A- |
Decision heuristic:
- Pure math / STEM workloads → Step 3.5 Flash (97.3 AIME is decisive)
- Large context needs (>128K) → Step 3.5 Flash or Kimi K2.5
- Balanced general workload → DeepSeek V3.2 (comparable pricing, polished API)
- Top code quality → Kimi K2.6 (but 10× pricier than Step 3.5 Flash)
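For teams that route requests programmatically, the heuristic above could be encoded as a small lookup. This is only a sketch: the function name, its parameters, and the order in which the rules are checked are assumptions for illustration, not something StepFun or TokenMix.ai prescribes.

```python
from typing import List


def pick_models(math_heavy: bool = False,
                context_tokens: int = 0,
                top_code_quality: bool = False) -> List[str]:
    """Return candidate models for a workload; the first matching rule wins.

    The rule priority (code quality, then math, then context size) is an
    assumption made for this example.
    """
    if top_code_quality:
        return ["Kimi K2.6"]
    if math_heavy:
        return ["Step 3.5 Flash"]
    if context_tokens > 128_000:
        return ["Step 3.5 Flash", "Kimi K2.5"]
    return ["DeepSeek V3.2"]


if __name__ == "__main__":
    print(pick_models(math_heavy=True))
    print(pick_models(context_tokens=200_000))
    print(pick_models())
```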
Pricing: $0.10/MTok Is a Wedge
| Provider | Input ($/MTok) | Output ($/MTok) | Notes |
|---|---|---|---|
| OpenRouter | $0.10 | $0.30 | Standard tier |
| OpenRouter free tier | $0.00 | $0.00 | Limited RPM |
| StepFun direct | $0.08-0.10 | $0.25-0.30 | Discount for high volume |
| NVIDIA NIM | Varies | Varies | Enterprise path |
| TokenMix.ai unified API | Tracking — see model page | — | — |
Source: OpenRouter / NVIDIA NIM
Why $0.10/MTok matters: For a workload that spends $1,000/month on GPT-4o input tokens (at ~$2.50/MTok), the same input volume on Step 3.5 Flash runs ~$40/month. That's a 25× cost compression. Even if Step 3.5 Flash needs 2× more iterations to hit the same quality on your specific task, you still pay about 12× less.
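The arithmetic behind that comparison is easy to rerun with your own numbers. This is a minimal sketch that, like the paragraph above, looks at input-token pricing only; `monthly_cost` and its parameters are hypothetical names chosen for this example.

```python
def monthly_cost(spend_usd: float, old_price: float, new_price: float,
                 iteration_multiplier: float = 1.0) -> float:
    """Re-price a monthly input-token bill at a different per-MTok rate.

    spend_usd: current monthly spend at old_price ($/MTok)
    iteration_multiplier: extra calls needed to reach the same quality
    """
    mtok_per_month = spend_usd / old_price  # millions of input tokens
    return mtok_per_month * iteration_multiplier * new_price


if __name__ == "__main__":
    same_volume = monthly_cost(1000, 2.50, 0.10)
    with_retries = monthly_cost(1000, 2.50, 0.10, iteration_multiplier=2.0)
    print(f"same volume:   ${same_volume:.2f}/month")
    print(f"2x iterations: ${with_retries:.2f}/month "
          f"({1000 / with_retries:.1f}x cheaper)")
```

With the figures above this gives about $40/month at equal volume and about $80/month at double the iterations, a 12.5× saving. Output tokens are not modeled, so output-heavy workloads will see a different ratio.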
The catch: at $0.10/MTok you're competing with DeepSeek V3.2 at $0.14 — and DeepSeek has broader tool ecosystem support. The wedge for Step 3.5 Flash is math-heavy workloads (where it pulls ahead) and large context (where its 262K beats DeepSeek's 128K).
Agentic Capabilities & Tool Use
τ²-Bench at 88.2 puts Step 3.5 Flash in the top tier of agent-capable models. In practical tool-use testing:
- Function calling: Schema adherence ~93% on standard OpenAI-style tool specs (competitive with Kimi K2.6)
- Multi-turn agent loops: Holds state well up to ~20-30 turns; degrades past that
- Code execution: Works reliably with Python/JS execution tools (Jupyter, E2B-style sandboxes)
- MCP server support: Works via OpenAI-compat wrappers; not native like Kimi Code
Where it's weak: long-horizon autonomous runs (12+ hours) — Step 3.5 Flash isn't trained for 4,000-step coordinated swarms like K2.6 is. For short-burst agent tasks (under 50 steps), it's competitive. For overnight refactors, K2.6 or Claude Opus is better.
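One practical way to respect that limit is to give every agent run an explicit step budget and hand off when it is exhausted. The sketch below assumes a hypothetical `step_fn` callable that performs one model turn (plus any tool calls) and returns a final answer or `None`; nothing here is a StepFun or OpenRouter API.

```python
from typing import Callable, List, Optional, Tuple

MAX_STEPS = 50  # the "short-burst" threshold mentioned above; tune per workload


def run_agent(step_fn: Callable[[List[str]], Optional[str]],
              max_steps: int = MAX_STEPS) -> Tuple[Optional[str], int]:
    """Call step_fn until it returns a final answer or the budget runs out.

    step_fn receives the running history, may append to it, and returns
    the final answer as a string, or None to request another step.
    Returns (answer, steps_used); answer is None if the budget was exhausted.
    """
    history: List[str] = []
    for step in range(1, max_steps + 1):
        answer = step_fn(history)
        if answer is not None:
            return answer, step
    return None, max_steps  # caller should escalate to a longer-horizon model


if __name__ == "__main__":
    def fake_step(history: List[str]) -> Optional[str]:
        # Stand-in for a real model call: finishes on the third step.
        history.append("tool call")
        return "done" if len(history) == 3 else None

    print(run_agent(fake_step))
```

Returning the step count alongside the answer makes it easy to log how close real workloads run to the budget before deciding whether a longer-horizon model is needed.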
How to Run Step 3.5 Flash (API + Self-Host)
Option 1 — OpenRouter:
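```python
from openai import OpenAI

# OpenRouter exposes an OpenAI-compatible endpoint, so the stock client works.
client = OpenAI(api_key="your-openrouter-key", base_url="https://openrouter.ai/api/v1")
resp = client.chat.completions.create(
    model="stepfun/step-3.5-flash",
    messages=[{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
)
print(resp.choices[0].message.content)  # the model's reply text
```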
Option 2 — TokenMix.ai unified API (one key across providers):
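```python
from openai import OpenAI
client = OpenAI(api_key="your-tokenmix-key", base_url="https://api.tokenmix.ai/v1")
# messages=[...] is a placeholder; pass the same message list as in Option 1.
resp = client.chat.completions.create(model="step-3.5-flash", messages=[...])
```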
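Option 3 — Self-host from Hugging Face (stepfun-ai/Step-3.5-Flash): At 196B params, the weights alone need ~400GB of VRAM in FP16 or ~200GB in FP8. Realistically that means FP8 on 4× H100 (320GB total) or 2× H200 (282GB total), which leaves headroom for the KV cache; FP16 does not fit on either setup. vLLM + expert parallelism works. Throughput: 150-250 tok/s/request at batch 4.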
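The sizing above follows from simple arithmetic on the weight footprint. This is a rough sketch that counts weights only (KV cache, activations, and framework overhead come on top); the helper names are illustrative, and the per-GPU capacities are the standard 80GB H100 and 141GB H200 figures.

```python
import math

GPU_MEMORY_GB = {"H100": 80, "H200": 141}  # HBM capacity per GPU


def weight_memory_gb(params: float, bytes_per_param: float) -> float:
    """Memory for the weights alone; KV cache and activations are extra."""
    return params * bytes_per_param / 1e9


def min_gpus(params: float, bytes_per_param: float, gpu: str) -> int:
    """Smallest GPU count whose combined memory holds the weights."""
    needed = weight_memory_gb(params, bytes_per_param)
    return math.ceil(needed / GPU_MEMORY_GB[gpu])


if __name__ == "__main__":
    params = 196e9
    for label, nbytes in (("FP16", 2), ("FP8", 1)):
        print(f"{label}: {weight_memory_gb(params, nbytes):.0f} GB of weights, "
              f"at least {min_gpus(params, nbytes, 'H100')}x H100 "
              f"or {min_gpus(params, nbytes, 'H200')}x H200")
```

These are lower bounds. In FP8 the weights fit on three H100s on paper, so a 4× H100 node is the realistic choice once the KV cache for long contexts is included.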
Option 4 — NVIDIA NIM for enterprise hosting (docs)
Where Step 3.5 Flash Falls Short
Three honest weaknesses:
- English fluency: Written output has occasional ESL patterns — fine for internal tooling, rough for customer-facing prose. Use Claude or GPT for final polish.
- Instruction following on edge cases: Complex system prompts with 10+ constraints sometimes drop a constraint or two. Verify with structured output validators.
- Ecosystem lag: Fewer third-party fine-tunes, tutorials, and MCP integrations than DeepSeek or Kimi. You'll be more on your own.
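For the instruction-following weakness, a lightweight check on structured output catches most dropped constraints before they reach downstream code. The sketch below uses only the standard library; `validate_output` and the field-to-type schema shape are hypothetical, chosen for illustration rather than taken from any vendor SDK.

```python
import json
from typing import Dict, List


def validate_output(raw: str, required: Dict[str, type]) -> List[str]:
    """Return a list of problems; an empty list means the output passed.

    `required` maps each expected field name to its expected Python type.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]
    problems = []
    for key, expected in required.items():
        if key not in data:
            problems.append(f"missing field: {key}")
        elif not isinstance(data[key], expected):
            problems.append(f"wrong type for {key}: expected {expected.__name__}")
    return problems


if __name__ == "__main__":
    schema = {"title": str, "priority": int}
    print(validate_output('{"title": "Fix login bug"}', schema))
```

When the returned list is non-empty, a common pattern is to send the problems back to the model as a follow-up message and retry once or twice before failing the request.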
Who Should Actually Use This Model
Use Step 3.5 Flash when:
- Math or STEM reasoning is the core workload (AIME 97.3 is genuinely hard to beat)
- You need ≥200K context and can't afford Kimi pricing
- You want the cheapest per-token Chinese-origin model with Apache 2.0 license
- You're building agentic tooling where most loops are <50 steps
Don't use Step 3.5 Flash when:
- Final customer-facing English copy matters (polish lags)
- You need 4,000-step agent swarms (use K2.6 instead)
- You need top code quality regardless of cost (use Claude Opus 4.7)
- You need broad multimodal input (use Gemini 3.1 Pro)
TokenMix.ai lets you A/B test Step 3.5 Flash vs DeepSeek V3.2 and Kimi K2.6 on the same prompt — cheapest way to settle "which open model wins for my workload" without running three separate API accounts.
FAQ
Q: Is Step 3.5 Flash truly free to run commercially? A: Yes under Apache 2.0 — you can self-host and use it in commercial products without royalty or attribution beyond standard Apache requirements. Via OpenRouter's free tier there are RPM limits but no per-token charges.
Q: How does StepFun make money if the model is free? A: Paid API tier for high-volume use, enterprise licensing, and the flagship Step series (closed-weight) for premium customers. StepFun also raised ~$719M USD in 2026 and is pursuing a Hong Kong IPO.
Q: Is Step 3.5 Flash a reasoning model or a traditional chat model? A: Traditional chat with strong reasoning. It has no explicit <thinking> tokens like the o-series; it produces chain-of-thought inline when prompted.
Q: Why 11B active instead of the more common 32-37B active? A: StepFun's bet is that tighter expert routing beats brute-forcing more active params. The benchmarks vindicate this for math and code; it's less clear-cut for creative writing and long-form English.
Q: Can Step 3.5 Flash handle 262K context reliably? A: In needle-in-haystack tests, recall is strong up to ~200K and degrades noticeably past that. For production use we'd recommend staying under 180K if accuracy is critical.
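A simple pre-flight guard can enforce that ceiling before a request is sent. This is a sketch under a loud assumption: it estimates tokens at about 4 characters each, a crude heuristic for English text, so swap in the model's real tokenizer when accuracy matters. All names are illustrative.

```python
from typing import List

SAFE_CONTEXT_TOKENS = 180_000  # the conservative ceiling suggested above
CHARS_PER_TOKEN = 4            # crude English-text heuristic (assumption)


def estimate_tokens(text: str) -> int:
    """Approximate token count, rounded up."""
    return -(-len(text) // CHARS_PER_TOKEN)


def fits_safe_window(chunks: List[str], reserve_for_output: int = 0) -> bool:
    """True if the combined prompt plus reserved output stays under the ceiling."""
    total = sum(estimate_tokens(chunk) for chunk in chunks)
    return total + reserve_for_output <= SAFE_CONTEXT_TOKENS


if __name__ == "__main__":
    docs = ["a" * 400_000, "b" * 200_000]  # stand-ins for real documents
    print(fits_safe_window(docs, reserve_for_output=8_000))
```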
Q: When will StepFun release Step 4 or a larger flagship? A: No official date as of April 23, 2026. Given the $719M raise and IPO timeline, expect new flagship announcements in Q3/Q4 2026.
Q: Is StepFun blocked or restricted for non-Chinese users? A: No. OpenRouter, HuggingFace, and NVIDIA NIM distribution is global. Direct StepFun API may require additional KYC for Chinese-only billing paths, but international hosted options are unrestricted.
Sources
- StepFun Step 3.5 Flash GitHub
- StepFun Step 3.5 Flash on Hugging Face
- OpenRouter Pricing & Benchmarks
- NVIDIA NIM Step 3.5 Flash Docs
- LLMBase Model Spec
- SCMP: StepFun Compact Model Outshines Rivals
- StepFun $719M Raise (Yicai Global)