Shanghai-based StepFun open-sourced Step 3.5 Flash on February 1, 2026 — a 196-billion-parameter MoE with only 11B active per token, shipped under Apache 2.0. Headline: it outscored DeepSeek V3.2 (671B) and Moonshot's Kimi K2.5 (1T) on agentic, reasoning, and coding benchmarks despite being 3-5× smaller. Hard numbers: 97.3 on AIME 2025, 86.4% on LiveCodeBench-V6, 74.4% on SWE-Bench Verified, 88.2 on τ²-Bench, 262K context, 100-300 tok/s generation, and API pricing at $0.10 input / $0.30 output per MTok — the cheapest Chinese frontier tier on the market. TokenMix.ai tracks Step 3.5 Flash alongside 300+ other models, and this review covers who should actually use it, where it beats DeepSeek, and why the "small is the new big" thesis suddenly looks credible.
Step 3.5 Flash sits in a third camp: small and sparse. Only 196B total parameters, only 11B active per token — roughly a tenth the size of Kimi K2.6. By conventional scaling-law wisdom, it should get crushed.
Instead, it beats both larger rivals on several benchmarks. Three reasons why:
- Curated training data — StepFun emphasized quality over quantity (following Phi-style methodology but scaled up)
- Expert routing efficiency — tokens are routed to the experts best suited to them, so the 11B active params do specialist work rather than being averaged across generalists
- Agent-first training objective — trained with tool use and multi-step coherence as first-class objectives, not afterthoughts
This matters because inference economics favor sparse-small. 11B active params means you can serve Step 3.5 Flash from a single H100 with good throughput (100-300 tok/s). Kimi K2.6 at 32B active needs 2-4× more silicon per request.
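The serving-cost claim can be sanity-checked with the common rough rule of ~2 FLOPs per active parameter per generated token. This is a back-of-envelope assumption, not StepFun's published math — real serving cost also depends on memory bandwidth, batching, and KV-cache size:

```python
# Decode compute per token: ~2 FLOPs per active parameter per token
# (standard rough estimate; ignores attention/KV-cache overhead).
def flops_per_token(active_params: float) -> float:
    return 2 * active_params

step_flash = flops_per_token(11e9)  # Step 3.5 Flash: 11B active
kimi = flops_per_token(32e9)        # Kimi: 32B active

ratio = kimi / step_flash
print(f"Kimi needs ~{ratio:.1f}x the compute per generated token")
```

The ~2.9× compute gap per token is why the same GPU budget yields roughly 2-4× the throughput on the sparser model.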
The real story: Step 3.5 Flash matches or beats DeepSeek V3.2 (a 671B model) on code and reasoning — despite being 3.4× smaller and ~2× cheaper per token. On math specifically (AIME 2025 at 97.3), it's near-saturation and ahead of every Chinese open-source peer.
Step 3.5 Flash vs DeepSeek V3.2 vs Kimi K2.5

| Dimension | Step 3.5 Flash | DeepSeek V3.2 | Kimi K2.5 |
| --- | --- | --- | --- |
| Total params | 196B | 671B | 1T |
| Active params | 11B | 37B | 32B |
| Context | 262K | 128K | 256K |
| License | Apache 2.0 | DeepSeek Model License | Modified MIT |
| API input ($/MTok) | $0.10 | $0.14 | ~$0.28 |
| API output ($/MTok) | $0.30 | $0.28 | ~$1.00 |
| Throughput (tok/s) | 100-300 | 60-150 | 50-120 |
| Best-in-class at | Math + cost/param efficiency | Balance | Code (until K2.6) |
| Tool-use maturity | B+ | B | A- |
Decision heuristic:
- Pure math / STEM workloads → Step 3.5 Flash (97.3 AIME is decisive)
- Large context needs (>128K) → Step 3.5 Flash or Kimi K2.5
- Balanced general workload → DeepSeek V3.2 (competitive pricing, polished API)
- Top code quality → Kimi K2.6 (but 10× pricier than Step 3.5 Flash)
Why $0.10/MTok matters: For a workload that spends $1,000/month on GPT-4o (at ~$2.50/MTok input), the same workload on Step 3.5 Flash runs ~$40/month. That's a 25× cost compression. Even if Step 3.5 Flash needs 2× more iterations to hit the same quality on your specific task, you still pay 12× less.
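The arithmetic behind that compression, as a sketch (input-token pricing only; output tokens and any quality-driven retries add on top):

```python
# Monthly cost = monthly token volume (MTok) x price per MTok.
def monthly_cost(monthly_mtok: float, price_per_mtok: float) -> float:
    return monthly_mtok * price_per_mtok

gpt4o_budget = 1_000.0                     # $/month at GPT-4o rates
tokens_mtok = gpt4o_budget / 2.50          # ~400 MTok/month of input
flash_cost = monthly_cost(tokens_mtok, 0.10)
compression = gpt4o_budget / flash_cost
print(f"Step 3.5 Flash: ~${flash_cost:.0f}/month ({compression:.0f}x cheaper)")
```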
The catch: at $0.10/MTok you're competing with DeepSeek V3.2 at $0.14 — and DeepSeek has broader tool ecosystem support. The wedge for Step 3.5 Flash is math-heavy workloads (where it pulls ahead) and large context (where its 262K beats DeepSeek's 128K).
Agentic Capabilities & Tool Use
τ²-Bench at 88.2 puts Step 3.5 Flash in the top tier of agent-capable models. In practical tool-use testing:
- Function calling: Schema adherence ~93% on standard OpenAI-style tool specs (competitive with Kimi K2.6)
- Multi-turn agent loops: Holds state well up to ~20-30 turns; degrades past that
- Code execution: Works reliably with Python/JS execution tools (Jupyter, E2B-style sandboxes)
- MCP server support: Works via OpenAI-compat wrappers; not native like Kimi Code
Where it's weak: long-horizon autonomous runs (12+ hours) — Step 3.5 Flash isn't trained for 4,000-step coordinated swarms like K2.6 is. For short-burst agent tasks (under 50 steps), it's competitive. For overnight refactors, K2.6 or Claude Opus is better.
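For the short-burst regime, a hard turn cap keeps runs inside the ~20-30 turn coherence window described above. A minimal sketch — `call_model` is a stub standing in for a real chat-completions call with tools attached, not an actual API:

```python
# Bounded tool-use loop: bail out before state coherence degrades.
MAX_TURNS = 25

def call_model(messages):
    # Stub: a real implementation would hit an OpenAI-compatible endpoint
    # and return either a tool call or a final answer.
    return {"type": "final", "content": "done"}

def run_agent(task: str) -> str:
    messages = [{"role": "user", "content": task}]
    for _ in range(MAX_TURNS):
        reply = call_model(messages)
        if reply["type"] == "final":
            return reply["content"]
        # Otherwise execute the requested tool and feed the result back.
        messages.append({"role": "tool", "content": reply["content"]})
    return "turn budget exhausted"

print(run_agent("refactor module X"))
```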
How to Run Step 3.5 Flash (API + Self-Host)
Option 1 — OpenRouter:

```python
from openai import OpenAI

client = OpenAI(api_key="your-openrouter-key", base_url="https://openrouter.ai/api/v1")
resp = client.chat.completions.create(
    model="stepfun/step-3.5-flash",
    messages=[{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
)
print(resp.choices[0].message.content)
```
Option 2 — TokenMix.ai unified API (one key across providers):
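Assuming TokenMix exposes a standard OpenAI-compatible chat-completions endpoint, the call is plain HTTP. The base URL, model slug, and auth header below are placeholders — check TokenMix.ai's docs for the real values:

```python
import json
import urllib.request

# Hypothetical endpoint and model slug; substitute the real ones.
payload = json.dumps({
    "model": "stepfun/step-3.5-flash",
    "messages": [{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
}).encode("utf-8")

req = urllib.request.Request(
    "https://api.tokenmix.ai/v1/chat/completions",
    data=payload,
    headers={
        "Authorization": "Bearer your-tokenmix-key",
        "Content-Type": "application/json",
    },
)
# with urllib.request.urlopen(req) as resp:  # uncomment with a real key
#     print(json.load(resp)["choices"][0]["message"]["content"])
```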
Option 3 — Self-host from Hugging Face (stepfun-ai/Step-3.5-Flash): At 196B params, you need ~400GB VRAM for FP16, or ~200GB for FP8. Realistically that's 4× H100 or 2× H200. vLLM + expert parallelism works. Throughput: 150-250 tok/s/request at batch 4.
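The VRAM figures follow from weights-only arithmetic (total params × bytes per param). This undercounts real needs — KV cache, activations, and framework overhead all sit on top:

```python
# Weight-memory estimate for a 196B-param model at different precisions.
PARAMS = 196e9

def weight_gb(bytes_per_param: float) -> float:
    return PARAMS * bytes_per_param / 1e9

fp16 = weight_gb(2)  # 392 GB, hence the ~400GB figure above
fp8 = weight_gb(1)   # 196 GB, hence the ~200GB figure above
print(f"FP16: {fp16:.0f} GB, FP8: {fp8:.0f} GB")
```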
Option 4 — NVIDIA NIM for enterprise hosting (docs)
Where Step 3.5 Flash Falls Short
Three honest weaknesses:
- English fluency: Written output has occasional ESL patterns — fine for internal tooling, rough for customer-facing prose. Use Claude or GPT for final polish.
- Instruction following on edge cases: Complex system prompts with 10+ constraints sometimes drop a constraint or two. Verify with structured output validators.
- Ecosystem lag: Fewer third-party fine-tunes, tutorials, and MCP integrations than DeepSeek or Kimi. You'll be more on your own.
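One cheap mitigation for the constraint-dropping issue: request JSON output, then verify programmatically that every required field actually came back. A minimal sketch with a hypothetical three-field schema:

```python
import json

# Post-hoc guard for the constraint-dropping failure mode.
REQUIRED_KEYS = {"summary", "severity", "owner"}  # example schema

def validate(raw: str) -> list[str]:
    """Return a list of violated constraints (empty means pass)."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    return [f"missing field: {k}" for k in sorted(REQUIRED_KEYS - obj.keys())]

print(validate('{"summary": "ok", "severity": "low"}'))
```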
Who Should Actually Use This Model
Use Step 3.5 Flash when:
- Math or STEM reasoning is the core workload (AIME 97.3 is genuinely hard to beat)
- You need ≥200K context and can't afford Kimi pricing
- You want the cheapest per-token Chinese-origin model with Apache 2.0 license
- You're building agentic tooling where most loops are <50 steps

Don't use Step 3.5 Flash when:
- Final customer-facing English copy matters (polish lags)
- You need 4,000-step agent swarms (use K2.6 instead)
- You need top code quality regardless of cost (use Claude Opus 4.7)
- You need broad multimodal input (use Gemini 3.1 Pro)
TokenMix.ai lets you A/B test Step 3.5 Flash vs DeepSeek V3.2 and Kimi K2.6 on the same prompt — cheapest way to settle "which open model wins for my workload" without running three separate API accounts.
FAQ
Q: Is Step 3.5 Flash truly free to run commercially?
A: Yes under Apache 2.0 — you can self-host and use it in commercial products without royalty or attribution beyond standard Apache requirements. Via OpenRouter's free tier there are RPM limits but no per-token charges.
Q: How does StepFun make money if the model is free?
A: Paid API tier for high-volume use, enterprise licensing, and the flagship Step series (closed-weight) for premium customers. StepFun also raised ~$719M USD in 2026 and is pursuing a Hong Kong IPO.
Q: Is Step 3.5 Flash a reasoning model or a traditional chat model?
A: Traditional chat with strong reasoning. No explicit <thinking> tokens like o-series; it produces chain-of-thought inline when prompted.
Q: Why 11B active instead of the more common 32-37B active?
A: StepFun's bet is that tighter expert routing beats brute-forcing more active params. The benchmarks vindicate this for math and code; it's less clear-cut for creative writing and long-form English.
Q: Can Step 3.5 Flash handle 262K context reliably?
A: In needle-in-haystack tests, recall is strong up to ~200K and degrades noticeably past that. For production use we'd recommend staying under 180K if accuracy is critical.
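A rough client-side guard for that 180K ceiling, using the crude ~4 characters/token heuristic for English text — a real tokenizer count is more accurate:

```python
# Keep prompts inside the reliable-recall window noted above.
TOKEN_BUDGET = 180_000
CHARS_PER_TOKEN = 4  # rough English average; use the real tokenizer if exact

def fits_budget(text: str) -> bool:
    return len(text) // CHARS_PER_TOKEN <= TOKEN_BUDGET

def trim_to_budget(text: str) -> str:
    max_chars = TOKEN_BUDGET * CHARS_PER_TOKEN
    return text if len(text) <= max_chars else text[:max_chars]

doc = "x" * 1_000_000            # ~250K estimated tokens: over budget
trimmed = trim_to_budget(doc)    # cut back to ~180K estimated tokens
```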
Q: When will StepFun release Step 4 or a larger flagship?
A: No official date as of April 23, 2026. Given the $719M raise and IPO timeline, expect new flagship announcements in Q3/Q4 2026.
Q: Is StepFun blocked or restricted for non-Chinese users?
A: No. OpenRouter, HuggingFace, and NVIDIA NIM distribution is global. Direct StepFun API may require additional KYC for Chinese-only billing paths, but international hosted options are unrestricted.