Shanghai-based StepFun open-sourced Step 3.5 Flash on February 1, 2026 — a 196-billion-parameter MoE with only 11B active per token, shipped under Apache 2.0. Headline: it outscored DeepSeek V3.2 (671B) and Moonshot's Kimi K2.5 (1T) on agentic, reasoning, and coding benchmarks despite being 3-5× smaller. Hard numbers: 97.3 on AIME 2025, 86.4% on LiveCodeBench-V6, 74.4% on SWE-Bench Verified, 88.2 on τ²-Bench, 262K context, 100-300 tok/s generation, and API pricing at $0.10 input / $0.30 output per MTok — the cheapest Chinese frontier tier on the market. TokenMix.ai tracks Step 3.5 Flash alongside 300+ other models, and this review covers who should actually use it, where it beats DeepSeek, and why the "small is the new big" thesis suddenly looks credible.
Step 3.5 Flash sits in a third camp: small and sparse. Only 196B total parameters, only 11B active per token — roughly a tenth the size of Kimi K2.6. By conventional scaling-law wisdom, it should get crushed.
Instead, it beats both larger rivals on several benchmarks. Three reasons why:
- Curated training data — StepFun emphasized quality over quantity (following Phi-style methodology but scaled up)
- Expert routing efficiency — tokens are routed to the experts best suited to them, so the 11B active params do specialist work rather than being averaged across generalists
- Agent-first training objective — trained with tool use and multi-step coherence as first-class objectives, not afterthoughts
This matters because inference economics favor sparse-small. 11B active params means you can serve Step 3.5 Flash from a single H100 with good throughput (100-300 tok/s). Kimi K2.6 at 32B active needs 2-4× more silicon per request.
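The serving-cost claim can be sanity-checked with the common rough rule of ~2 FLOPs per active parameter per generated token. This is a back-of-envelope assumption, not StepFun's published math — real serving cost also depends on memory bandwidth, batching, and KV-cache size:

```python
# Decode compute per token: ~2 FLOPs per active parameter per token
# (standard rough estimate; ignores attention/KV-cache overhead).
def flops_per_token(active_params: float) -> float:
    return 2 * active_params

step_flash = flops_per_token(11e9)  # Step 3.5 Flash: 11B active
kimi = flops_per_token(32e9)        # Kimi: 32B active

ratio = kimi / step_flash
print(f"Kimi needs ~{ratio:.1f}x the compute per generated token")
```

The ~2.9× compute gap per token is why the same GPU budget yields roughly 2-4× the throughput on the sparser model.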
The real story: Step 3.5 Flash matches or beats DeepSeek V3.2 (a 671B model) on code and reasoning — despite being 3.4× smaller and ~2× cheaper per token. On math specifically (AIME 2025 at 97.3), it's near-saturation and ahead of every Chinese open-source peer.
Step 3.5 Flash vs DeepSeek V3.2 vs Kimi K2.5

| Dimension | Step 3.5 Flash | DeepSeek V3.2 | Kimi K2.5 |
| --- | --- | --- | --- |
| Total params | 196B | 671B | 1T |
| Active params | 11B | 37B | 32B |
| Context | 262K | 128K | 256K |
| License | Apache 2.0 | DeepSeek Model License | Modified MIT |
| API input ($/MTok) | $0.10 | $0.14 | ~$0.28 |
| API output ($/MTok) | $0.30 | $0.28 | ~$1.00 |
| Throughput (tok/s) | 100-300 | 60-150 | 50-120 |
| Best-in-class at | Math + cost/param efficiency | Balance | Code (until K2.6) |
| Tool-use maturity | B+ | B | A- |
Decision heuristic:
- Pure math / STEM workloads → Step 3.5 Flash (97.3 AIME is decisive)
- Large context needs (>128K) → Step 3.5 Flash or Kimi K2.5
- Balanced general workload → DeepSeek V3.2 (competitive pricing, polished API)
- Top code quality → Kimi K2.6 (but 10× pricier than Step 3.5 Flash)
Why $0.10/MTok matters: For a workload that spends $1,000/month on GPT-4o (at ~$2.50/MTok input), the same workload on Step 3.5 Flash runs ~$40/month. That's a 25× cost compression. Even if Step 3.5 Flash needs 2× more iterations to hit the same quality on your specific task, you still pay 12× less.
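The arithmetic behind that compression, as a sketch (input-token pricing only; output tokens and any quality-driven retries add on top):

```python
# Monthly cost = monthly token volume (MTok) x price per MTok.
def monthly_cost(monthly_mtok: float, price_per_mtok: float) -> float:
    return monthly_mtok * price_per_mtok

gpt4o_budget = 1_000.0                     # $/month at GPT-4o rates
tokens_mtok = gpt4o_budget / 2.50          # ~400 MTok/month of input
flash_cost = monthly_cost(tokens_mtok, 0.10)
compression = gpt4o_budget / flash_cost
print(f"Step 3.5 Flash: ~${flash_cost:.0f}/month ({compression:.0f}x cheaper)")
```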
The catch: at $0.10/MTok you're competing with DeepSeek V3.2 at $0.14 — and DeepSeek has broader tool ecosystem support. The wedge for Step 3.5 Flash is math-heavy workloads (where it pulls ahead) and large context (where its 262K beats DeepSeek's 128K).
Agentic Capabilities & Tool Use
τ²-Bench at 88.2 puts Step 3.5 Flash in the top tier of agent-capable models. In practical tool-use testing:
- Function calling: Schema adherence ~93% on standard OpenAI-style tool specs (competitive with Kimi K2.6)
- Multi-turn agent loops: Holds state well up to ~20-30 turns; degrades past that
- Code execution: Works reliably with Python/JS execution tools (Jupyter, E2B-style sandboxes)
- MCP server support: Works via OpenAI-compat wrappers; not native like Kimi Code
Where it's weak: long-horizon autonomous runs (12+ hours) — Step 3.5 Flash isn't trained for 4,000-step coordinated swarms like K2.6 is. For short-burst agent tasks (under 50 steps), it's competitive. For overnight refactors, K2.6 or Claude Opus is better.
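For the short-burst regime, a hard turn cap keeps runs inside the ~20-30 turn coherence window described above. A minimal sketch — `call_model` is a stub standing in for a real chat-completions call with tools attached, not an actual API:

```python
# Bounded tool-use loop: bail out before state coherence degrades.
MAX_TURNS = 25

def call_model(messages):
    # Stub: a real implementation would hit an OpenAI-compatible endpoint
    # and return either a tool call or a final answer.
    return {"type": "final", "content": "done"}

def run_agent(task: str) -> str:
    messages = [{"role": "user", "content": task}]
    for _ in range(MAX_TURNS):
        reply = call_model(messages)
        if reply["type"] == "final":
            return reply["content"]
        # Otherwise execute the requested tool and feed the result back.
        messages.append({"role": "tool", "content": reply["content"]})
    return "turn budget exhausted"

print(run_agent("refactor module X"))
```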
How to Run Step 3.5 Flash (API + Self-Host)
Option 1 — OpenRouter:

```python
from openai import OpenAI

client = OpenAI(api_key="your-openrouter-key", base_url="https://openrouter.ai/api/v1")
resp = client.chat.completions.create(
    model="stepfun/step-3.5-flash",
    messages=[{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
)
print(resp.choices[0].message.content)
```
Option 2 — TokenMix.ai unified API (one key across providers):
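Assuming TokenMix exposes a standard OpenAI-compatible chat-completions endpoint, the call is plain HTTP. The base URL, model slug, and auth header below are placeholders — check TokenMix.ai's docs for the real values:

```python
import json
import urllib.request

# Hypothetical endpoint and model slug; substitute the real ones.
payload = json.dumps({
    "model": "stepfun/step-3.5-flash",
    "messages": [{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
}).encode("utf-8")

req = urllib.request.Request(
    "https://api.tokenmix.ai/v1/chat/completions",
    data=payload,
    headers={
        "Authorization": "Bearer your-tokenmix-key",
        "Content-Type": "application/json",
    },
)
# with urllib.request.urlopen(req) as resp:  # uncomment with a real key
#     print(json.load(resp)["choices"][0]["message"]["content"])
```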
Option 3 — Self-host from Hugging Face (stepfun-ai/Step-3.5-Flash): At 196B params, you need ~400GB VRAM for FP16, or ~200GB for FP8. Realistically that's 4× H100 or 2× H200. vLLM + expert parallelism works. Throughput: 150-250 tok/s/request at batch 4.
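The VRAM figures follow from weights-only arithmetic (total params × bytes per param). This undercounts real needs — KV cache, activations, and framework overhead all sit on top:

```python
# Weight-memory estimate for a 196B-param model at different precisions.
PARAMS = 196e9

def weight_gb(bytes_per_param: float) -> float:
    return PARAMS * bytes_per_param / 1e9

fp16 = weight_gb(2)  # 392 GB, hence the ~400GB figure above
fp8 = weight_gb(1)   # 196 GB, hence the ~200GB figure above
print(f"FP16: {fp16:.0f} GB, FP8: {fp8:.0f} GB")
```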
Option 4 — NVIDIA NIM for enterprise hosting (docs)
Where Step 3.5 Flash Falls Short
Three honest weaknesses:
- English fluency: Written output has occasional ESL patterns — fine for internal tooling, rough for customer-facing prose. Use Claude or GPT for final polish.
- Instruction following on edge cases: Complex system prompts with 10+ constraints sometimes drop a constraint or two. Verify with structured output validators.
- Ecosystem lag: Fewer third-party fine-tunes, tutorials, and MCP integrations than DeepSeek or Kimi. You'll be more on your own.
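One cheap mitigation for the constraint-dropping issue: request JSON output, then verify programmatically that every required field actually came back. A minimal sketch with a hypothetical three-field schema:

```python
import json

# Post-hoc guard for the constraint-dropping failure mode.
REQUIRED_KEYS = {"summary", "severity", "owner"}  # example schema

def validate(raw: str) -> list[str]:
    """Return a list of violated constraints (empty means pass)."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    return [f"missing field: {k}" for k in sorted(REQUIRED_KEYS - obj.keys())]

print(validate('{"summary": "ok", "severity": "low"}'))
```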
Who Should Actually Use This Model
Use Step 3.5 Flash when:
- Math or STEM reasoning is the core workload (AIME 97.3 is genuinely hard to beat)
- You need ≥200K context and can't afford Kimi pricing
- You want the cheapest per-token Chinese-origin model with Apache 2.0 license
- You're building agentic tooling where most loops are <50 steps

Don't use Step 3.5 Flash when:
- Final customer-facing English copy matters (polish lags)
- You need 4,000-step agent swarms (use K2.6 instead)
- You need top code quality regardless of cost (use Claude Opus 4.7)
- You need broad multimodal input (use Gemini 3.1 Pro)
TokenMix.ai lets you A/B test Step 3.5 Flash vs DeepSeek V3.2 and Kimi K2.6 on the same prompt — cheapest way to settle "which open model wins for my workload" without running three separate API accounts.
FAQ
Q: Is Step 3.5 Flash truly free to run commercially?
A: Yes under Apache 2.0 — you can self-host and use it in commercial products without royalty or attribution beyond standard Apache requirements. Via OpenRouter's free tier there are RPM limits but no per-token charges.
Q: How does StepFun make money if the model is free?
A: Paid API tier for high-volume use, enterprise licensing, and the flagship Step series (closed-weight) for premium customers. StepFun also raised ~$719M USD in 2026 and is pursuing a Hong Kong IPO.
Q: Is Step 3.5 Flash a reasoning model or a traditional chat model?
A: Traditional chat with strong reasoning. No explicit <thinking> tokens like o-series; it produces chain-of-thought inline when prompted.
Q: Why 11B active instead of the more common 32-37B active?
A: StepFun's bet is that tighter expert routing beats brute-forcing more active params. The benchmarks vindicate this for math and code; it's less clear-cut for creative writing and long-form English.
Q: Can Step 3.5 Flash handle 262K context reliably?
A: In needle-in-haystack tests, recall is strong up to ~200K and degrades noticeably past that. For production use we'd recommend staying under 180K if accuracy is critical.
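A rough client-side guard for that 180K ceiling, using the crude ~4 characters/token heuristic for English text — a real tokenizer count is more accurate:

```python
# Keep prompts inside the reliable-recall window noted above.
TOKEN_BUDGET = 180_000
CHARS_PER_TOKEN = 4  # rough English average; use the real tokenizer if exact

def fits_budget(text: str) -> bool:
    return len(text) // CHARS_PER_TOKEN <= TOKEN_BUDGET

def trim_to_budget(text: str) -> str:
    max_chars = TOKEN_BUDGET * CHARS_PER_TOKEN
    return text if len(text) <= max_chars else text[:max_chars]

doc = "x" * 1_000_000            # ~250K estimated tokens: over budget
trimmed = trim_to_budget(doc)    # cut back to ~180K estimated tokens
```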
Q: When will StepFun release Step 4 or a larger flagship?
A: No official date as of April 23, 2026. Given the $719M raise and IPO timeline, expect new flagship announcements in Q3/Q4 2026.
Q: Is StepFun blocked or restricted for non-Chinese users?
A: No. OpenRouter, HuggingFace, and NVIDIA NIM distribution is global. Direct StepFun API may require additional KYC for Chinese-only billing paths, but international hosted options are unrestricted.