TokenMix Research Lab · 2026-05-25

Qwen 3.6 Tier Picker 2026: Max-Preview vs Plus vs Flash vs 35B

Last Updated: 2026-05-25 · Data Checked: 2026-05-25

Alibaba shipped four Qwen 3.6 SKUs in 30 days. Picking the wrong tier costs up to about 7x more per task or pays for benchmark headroom you never use. This is the workload-to-tier map, with verified pricing, context, and benchmark numbers from OpenRouter, Hugging Face, and Alibaba Cloud's own announcement pages.

The series spans $0.15/M (open-source 35B-A3B input) to $6.24/M (Max-Preview output), a 41x gap between the cheapest input rate and the priciest output rate; like for like, Max-Preview costs about 6.9x what 35B-A3B does on both input and output. SWE-Bench Verified ranges from 73.4 (35B-A3B) to 78.8 (Plus). Context tops out at 1M for Plus and Flash, and at 262K native for Max-Preview and 35B-A3B. The right pick depends on three things only: workload class, throughput needs, and whether you can self-host. Below is the decision tree.

Quick Verdict

Workload	Pick	Why	Verified
Repo-level agentic coding, long context	Qwen 3.6-Plus	1M context + 78.8 SWE-Bench Verified, $0.325/$1.95 OpenRouter	2026-05-25
Hardest coding/agent tasks, willing to pay	Qwen 3.6-Max-Preview	Tops 6 benchmarks incl SWE-Bench Pro 57.3 / Terminal-Bench 2.0 65.4	2026-05-25
High-volume routing, cost-sensitive	Qwen 3.6-Flash	$0.1875/$1.125 OpenRouter, 1M context	2026-05-25
Self-host, air-gapped, fine-tune	Qwen 3.6-35B-A3B	Apache-2.0, MoE 35B/3B active, 262K → 1M ctx	2026-05-25
Math/reasoning only, low budget	Qwen 3.6-35B-A3B	AIME26 92.7 / GPQA 86.0 at $0.15/$0.90	2026-05-25

Skip Qwen 3.6-Plus if your context never crosses 128K: Flash gives you the same family quality at about 58% of the cost. Skip Max-Preview unless your eval shows that its +8 to +14 point gain over 35B-A3B matters for your task class.

Pricing Reality: The 41x Input-to-Output Spread

Confirmed pricing (USD per million tokens, verified 2026-05-25 via OpenRouter and pricepertoken.com):

Model	Input	Output	Context	Max Output	Source
Qwen 3.6-Max-Preview	$1.04	$6.24	262K	not specified	OpenRouter
Qwen 3.6-Plus	$0.325	$1.95	1M	65,536	OpenRouter
Qwen 3.6-Flash	$0.1875	$1.125	1M	65,536	OpenRouter
Qwen 3.6-35B-A3B	$0.150	$0.900	262K (1M YaRN)	81,920	pricepertoken

Caveat — Confirmed: OpenRouter shows Plus, Flash, and Max-Preview with platform discounts (35%, 25%, 20% respectively). Direct DashScope pricing may differ; Alibaba Cloud's Model Studio pricing page (last updated 2026-04-01) does not yet list the 3.6 family. Treat the OpenRouter numbers as the going market rate as of this verification date.

Caveat — Likely: Plus and Flash both advertise 1M context, but tiered pricing reportedly kicks in above 256K. Below 256K you get the headline rate; above, costs scale per a separate sheet not yet harmonized across providers.
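
To sanity-check the spread, the sketch below recomputes tier-to-tier ratios from the pricing table. The `PRICES` dictionary and the two helper functions are names chosen for illustration, not part of any provider SDK; the rates are the OpenRouter and pricepertoken figures quoted above, and tiered pricing above 256K is ignored.

```python
# USD per million tokens, copied from the pricing table in this article.
PRICES = {
    "max-preview": {"input": 1.04, "output": 6.24},
    "plus": {"input": 0.325, "output": 1.95},
    "flash": {"input": 0.1875, "output": 1.125},
    "35b-a3b": {"input": 0.15, "output": 0.90},
}


def ratio(expensive: str, cheap: str, side: str) -> float:
    """How many times more `expensive` costs than `cheap` on one side."""
    return PRICES[expensive][side] / PRICES[cheap][side]


def output_to_input(tier: str) -> float:
    """Output rate divided by input rate for a single tier."""
    return PRICES[tier]["output"] / PRICES[tier]["input"]


if __name__ == "__main__":
    for tier in PRICES:
        print(f"{tier}: output is {output_to_input(tier):.2f}x input")
    print(f"Max-Preview vs 35B-A3B (output): {ratio('max-preview', '35b-a3b', 'output'):.2f}x")
    print(f"Plus vs Flash (input): {ratio('plus', 'flash', 'input'):.2f}x")
```

Every tier prices output at 6x input, so tier-to-tier cost ratios stay the same for any input/output mix: about 6.9x for Max-Preview over 35B-A3B and about 1.7x for Plus over Flash. The 41x figure only appears when the cheapest input rate is set against the priciest output rate.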

Benchmark Reality: Where Each Tier Earns Its Price

Verified scores from each model's release announcement and Hugging Face card:

Benchmark	Plus	Max-Preview	35B-A3B	Source
SWE-Bench Verified	78.8	—	73.4	OpenRouter / HF
SWE-Bench Pro	—	57.3	49.5	Alibaba blog / HF
Terminal-Bench 2.0	—	65.4	51.5	Alibaba blog / HF
AIME 2026	—	—	92.7	HF
MMLU-Pro	—	—	85.2	HF
GPQA	—	—	86.0	HF
LiveCodeBench v6	—	—	80.4	HF

The reading: Max-Preview's premium ($6.24/M output vs $0.90 for 35B-A3B) buys you ~+8 SWE-Bench Pro points and ~+14 Terminal-Bench 2.0 points over 35B-A3B. The table has no shared benchmark for a direct Max-Preview vs Plus comparison. Plus's value lies in the 1M context plus SWE-Bench Verified headroom: if your repo fits in 1M but you don't need bleeding-edge frontier benchmarks, Plus dominates.
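
The point deltas come straight from the table, and a small sketch makes the subtraction explicit. `SCORES` and `delta` are illustrative names; only benchmarks that both models report can be compared, so a missing score yields `None` rather than a guess.

```python
from typing import Dict, Optional

# Scores from the benchmark table; None marks a score not reported there.
SCORES: Dict[str, Dict[str, Optional[float]]] = {
    "SWE-Bench Verified": {"plus": 78.8, "max-preview": None, "35b-a3b": 73.4},
    "SWE-Bench Pro": {"plus": None, "max-preview": 57.3, "35b-a3b": 49.5},
    "Terminal-Bench 2.0": {"plus": None, "max-preview": 65.4, "35b-a3b": 51.5},
}


def delta(benchmark: str, a: str, b: str) -> Optional[float]:
    """Score of `a` minus score of `b`, or None if either is unreported."""
    row = SCORES[benchmark]
    if row[a] is None or row[b] is None:
        return None
    return round(row[a] - row[b], 1)


if __name__ == "__main__":
    print(delta("SWE-Bench Pro", "max-preview", "35b-a3b"))       # 7.8
    print(delta("Terminal-Bench 2.0", "max-preview", "35b-a3b"))  # 13.9
    print(delta("SWE-Bench Verified", "plus", "35b-a3b"))         # 5.4
    print(delta("SWE-Bench Pro", "max-preview", "plus"))          # None
```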

The 35B-A3B open-source variant is the dark horse. AIME26 92.7 beats most proprietary models, MMLU-Pro 85.2 is competitive with Claude Opus 4.7's published numbers, and self-hosting replaces per-token fees with hardware, power, and operations costs that shrink per token as utilization rises.

Workload Decision Tree

Branch 1 — Software engineering agents

Context > 200K (whole repos): Plus; drop to Flash if cost matters more than quality
Context < 200K, hardest tasks: Max-Preview
Context < 200K, normal tasks: Plus or 35B-A3B (self-hosted)
Budget-constrained, can tolerate SWE-Verified 73.4: 35B-A3B

Branch 2 — Document/long-context Q&A

Million-token PDFs/codebases: Plus or Flash
Cost-sensitive at scale: Flash ($0.1875 input is about 1.7x cheaper than Plus's $0.325)
Air-gapped: 35B-A3B with YaRN to 1M

Branch 3 — Math/scientific reasoning

35B-A3B is the default (AIME26 92.7 / GPQA 86.0 at $0.15/$0.90); no math scores are listed above for the other tiers
Max-Preview if you need text-only frontier reasoning with one API call

Branch 4 — High-volume classification, summarization, retrieval

Flash. Period. $0.1875/$1.125 with 1M context is the right tool.
35B-A3B if you have GPU capacity to spare.

Branch 5 — Multimodal (vision/video)

35B-A3B (the only open-source variant with vision; MMMU 81.7 / VideoMMMU 86.6)
Plus/Max-Preview are text-only at this writing.
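
The five branches can be sketched as a small routing function. This is a simplified reading of the tree, not a definitive router: the `Workload` fields, the `kind` labels, and `pick_tier` itself are hypothetical names, and the math branch always returns 35B-A3B even though the tree offers Max-Preview as an alternative.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    kind: str                     # "coding", "long_doc_qa", "math", "bulk", "multimodal"
    context_tokens: int = 0
    hardest_tasks: bool = False
    cost_sensitive: bool = False
    self_host: bool = False       # also covers air-gapped deployments


def pick_tier(w: Workload) -> str:
    """Return a tier name following the five branches (simplified)."""
    if w.kind == "multimodal":
        return "35B-A3B"          # Branch 5: Plus/Max-Preview are text-only
    if w.kind == "math":
        return "35B-A3B"          # Branch 3: Max-Preview is the stated alternative
    if w.kind == "bulk":
        return "35B-A3B" if w.self_host else "Flash"   # Branch 4
    if w.kind == "long_doc_qa":   # Branch 2
        if w.self_host:
            return "35B-A3B"
        return "Flash" if w.cost_sensitive else "Plus"
    if w.kind == "coding":        # Branch 1
        if w.context_tokens > 200_000:
            return "Flash" if w.cost_sensitive else "Plus"
        if w.hardest_tasks:
            return "Max-Preview"
        if w.cost_sensitive or w.self_host:
            return "35B-A3B"
        return "Plus"
    raise ValueError(f"unknown workload kind: {w.kind!r}")


if __name__ == "__main__":
    print(pick_tier(Workload("coding", context_tokens=500_000)))
    print(pick_tier(Workload("coding", context_tokens=50_000, hardest_tasks=True)))
    print(pick_tier(Workload("bulk")))
```

Encoding the tree this way makes the tie-breaks explicit: context size is checked before task difficulty, so a whole-repo job never routes to the 262K Max-Preview tier.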

Cost-Per-Task Math (Three Workloads)

Assume realistic token counts. All numbers in USD, verified pricing 2026-05-25.

Workload A — Repo-level code review, 100K in / 5K out, 100 tasks/day

Tier	Per-task	Daily (100 tasks)	Monthly
Max-Preview	$0.104 + $0.0312 = $0.135	$13.52	$405.60
Plus	$0.0325 + $0.00975 = $0.0423	$4.23	$126.75
Flash	$0.01875 + $0.005625 = $0.0244	$2.44	$73.13
35B-A3B (API)	$0.015 + $0.0045 = $0.0195	$1.95	$58.50

Workload B — Math tutor, 2K in / 8K out, 10K tasks/day

Tier	Per-task	Daily	Monthly
Max-Preview	$0.00208 + $0.0499 = $0.0520	$520.00	$15,600
Plus	$0.00065 + $0.0156 = $0.0163	$162.50	$4,875
Flash	$0.000375 + $0.009 = $0.00938	$93.75	$2,813
35B-A3B	$0.0003 + $0.0072 = $0.0075	$75.00	$2,250

Workload C — Long-PDF QA, 500K in / 2K out, 200 tasks/day

Tier	Per-task	Monthly
Plus (1M ctx)	$0.1625 + $0.0039 = $0.166	$998
Flash (1M ctx)	$0.0938 + $0.00225 = $0.0960	$576
Max-Preview (262K — won't fit!)	n/a	n/a
35B-A3B (262K native, YaRN)	$0.075 + $0.0018 = $0.0768	$461

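The pattern: Flash wins when the job is mostly input tokens at high volume, Plus wins when you need both context and benchmark quality, Max-Preview wins only on the hardest tasks where the extra 8 to 14 benchmark points pay for themselves, and 35B-A3B is the cheapest API tier in all three workloads and the self-host candidate at sustained volume. Workload C prices Plus and Flash at headline rates; at 500K input, the tiered pricing reported above 256K would push those figures higher.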
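
The per-task arithmetic in this section reduces to one formula: tokens times rate, divided by one million. The sketch below applies it with the headline rates from the pricing section. `cost_per_task` and `monthly_cost` are illustrative helpers; a 30-day month is assumed, and neither context limits nor tiered pricing above 256K are modeled.

```python
# (input, output) rates in USD per million tokens, from the pricing table.
PRICES = {
    "max-preview": (1.04, 6.24),
    "plus": (0.325, 1.95),
    "flash": (0.1875, 1.125),
    "35b-a3b": (0.15, 0.90),
}


def cost_per_task(tier: str, tokens_in: int, tokens_out: int) -> float:
    """USD cost of one request at headline rates."""
    rate_in, rate_out = PRICES[tier]
    return (tokens_in * rate_in + tokens_out * rate_out) / 1_000_000


def monthly_cost(tier: str, tokens_in: int, tokens_out: int,
                 tasks_per_day: int, days: int = 30) -> float:
    """USD cost of running the same request shape every day for `days` days."""
    return cost_per_task(tier, tokens_in, tokens_out) * tasks_per_day * days


if __name__ == "__main__":
    # Workload A: 100K in / 5K out, 100 tasks/day
    for tier in PRICES:
        print(f"A {tier}: ${monthly_cost(tier, 100_000, 5_000, 100):,.2f}/month")
    # Workload C: 500K in / 2K out, 200 tasks/day (Max-Preview omitted: 262K context)
    for tier in ("plus", "flash", "35b-a3b"):
        print(f"C {tier}: ${monthly_cost(tier, 500_000, 2_000, 200):,.2f}/month")
```

Run as a script, this gives about $73.13 a month for Workload A on Flash and about $998.40 for Workload C on Plus, before any long-context surcharge.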

Open-Source vs Proprietary — The 35B-A3B Question

Qwen 3.6-35B-A3B is the only Apache-2.0 model among the four tiers compared here. Configuration: 35B total parameters, 3B active per token (MoE), 256 total experts, 8 routed + 1 shared activated. Native context 262K, extensible to ~1M via YaRN/RoPE scaling. Vision encoder included.

Why this matters: 3B active parameters means per-token compute scales like a 3B dense model, not 35B, although all 35B parameters must still be held in memory. On a single H100, you can run real workloads at meaningful throughput. The benchmark scores (SWE-Verified 73.4, AIME26 92.7, MMLU-Pro 85.2, GPQA 86.0) are competitive with proprietary mid-tier offerings.
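
A rough way to see the split between compute and memory is to estimate the weight footprint separately from the active fraction. The sketch below is a back-of-the-envelope estimate only: it uses the rounded 35B and 3B figures, counts weights alone, and leaves out KV cache, activations, and runtime overhead. `weight_memory_gb` is an illustrative helper.

```python
TOTAL_PARAMS = 35e9    # rounded total parameter count
ACTIVE_PARAMS = 3e9    # rounded parameters active per token


def weight_memory_gb(total_params: float, bytes_per_param: float) -> float:
    """Approximate memory for model weights alone, in decimal gigabytes."""
    return total_params * bytes_per_param / 1e9


if __name__ == "__main__":
    print(f"16-bit weights: {weight_memory_gb(TOTAL_PARAMS, 2):.1f} GB")
    print(f"8-bit weights:  {weight_memory_gb(TOTAL_PARAMS, 1):.1f} GB")
    print(f"4-bit weights:  {weight_memory_gb(TOTAL_PARAMS, 0.5):.1f} GB")
    print(f"active fraction per token: {ACTIVE_PARAMS / TOTAL_PARAMS:.1%}")
```

At 16 bits the weights alone come to roughly 70 GB, which is tight on an 80 GB H100 once KV cache is added, so long-context serving on one card will likely need quantized weights or a second GPU.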

Compared to Max-Preview: 35B-A3B loses ~8 points on SWE-Bench Pro and ~14 on Terminal-Bench 2.0. If your task class doesn't need that delta, self-hosting Qwen 3.6-35B-A3B is the lowest TCO option at sustained volume. The break-even vs API depends on hardware utilization — at 50%+ utilization on owned hardware, 35B-A3B beats every API tier on cost.
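
The break-even point can be estimated with a few lines of arithmetic. The sketch below is a simplification under stated assumptions: the GPU hourly cost and peak throughput in the example are hypothetical placeholders, not measurements, and every token is compared against a single API rate instead of separate input and output rates.

```python
def self_host_cost_per_million(gpu_cost_per_hour: float,
                               peak_tokens_per_second: float,
                               utilization: float) -> float:
    """USD per million tokens on self-hosted hardware at a given utilization."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    tokens_per_hour = peak_tokens_per_second * 3600 * utilization
    return gpu_cost_per_hour / tokens_per_hour * 1_000_000


def break_even_utilization(gpu_cost_per_hour: float,
                           peak_tokens_per_second: float,
                           api_price_per_million: float) -> float:
    """Utilization at which self-hosting costs the same as the API rate."""
    tokens_per_hour_at_peak = peak_tokens_per_second * 3600
    return gpu_cost_per_hour * 1_000_000 / (tokens_per_hour_at_peak * api_price_per_million)


if __name__ == "__main__":
    GPU_COST = 2.50      # hypothetical all-in USD per GPU-hour
    PEAK_TPS = 2000.0    # hypothetical aggregate tokens/second with batching
    API_RATE = 0.90      # 35B-A3B output rate from the pricing table
    u = break_even_utilization(GPU_COST, PEAK_TPS, API_RATE)
    print(f"break-even utilization: {u:.1%}")
    print(f"cost at 50% utilization: ${self_host_cost_per_million(GPU_COST, PEAK_TPS, 0.5):.3f}/M")
```

Substitute your own measured throughput and amortized hardware cost; the result is very sensitive to both, and a break-even above 100% means the API is cheaper at any load.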

Caveat (Speculation): We have not independently verified Max-Preview's parameter count (reported "~1T"). Treat that as marketing characterization, not a confirmed spec.

Reliability and Update Cadence

Variant	Released	Status	Notes
Qwen 3.6-Plus	2026-04-02	GA	Primary flagship per Alibaba positioning
Qwen 3.6-35B-A3B	2026-04-16	GA, Apache-2.0	Open weights, full multimodal
Qwen 3.6-Max-Preview	2026-04-20	Preview	"Work in progress" per Alibaba
Qwen 3.6-27B	2026-04-22	GA	Smaller open variant
Qwen 3.6-Flash	2026-04	GA	Speed/cost tier

The "Preview" tag on Max-Preview is non-trivial. Alibaba's own press materials say further improvements are expected, which means production behavior could shift. Plus and Flash are stable; 35B-A3B is open-weights, so a downloaded checkpoint is frozen by definition.

If you're picking for a production system, Plus or 35B-A3B are the safest. Max-Preview is fine for evaluation and asymmetric high-value tasks, not for stable agent loops.

FAQ

Q: Which Qwen 3.6 tier matches Claude Opus 4.7 on coding?

A: Plus at 78.8 SWE-Bench Verified is in the same band as Opus 4.7's published number. Max-Preview's Terminal-Bench 2.0 score of 65.4 trails Opus 4.7's 69.4. Per Alibaba's own claims, Max-Preview reclaimed the top spot on SkillsBench, QwenClawBench, QwenWebBench, and SciCode; independent verification is ongoing.

Q: Why is Max-Preview text-only?

A: It launched as text-only per the announcement. Vision input is on the family roadmap. For multimodal today, 35B-A3B is the option.

Q: Can I use Qwen 3.6-Plus's 1M context without paying premium?

A: Up to 256K, you pay the headline rate ($0.325/$1.95). Above 256K, tiered pricing reportedly applies, with multipliers that vary by provider. The headline rate is confirmed on OpenRouter; DashScope direct pricing is not yet published for the 3.6 family.

Q: Is Qwen 3.6-35B-A3B actually competitive at 3B active params?

A: For its size class, exceptional. AIME26 92.7 and MMLU-Pro 85.2 are competitive with proprietary mid-tier. SWE-Bench Verified 73.4 trails Plus by ~5 points but beats most open-source coding models.

Q: What's the cache-hit pricing for Qwen 3.6?

A: Not consistently published across providers as of 2026-05-25. OpenRouter does not break out cache pricing for these variants. If you rely on cache discounts for cost modeling, validate with the specific endpoint before committing.

Q: When does the Max-Preview "Preview" tag come off?

A: No public timeline. Alibaba's release describes ongoing improvements. Assume Preview behavior could change weekly.

Q: Are there fine-tunes available for Qwen 3.6-35B-A3B?

A: As of 2026-05-25, community fine-tunes are appearing on Hugging Face. The Apache-2.0 license permits commercial use including fine-tunes.

Q: How does Qwen 3.6-Flash compare to DeepSeek V4-Pro on cost?

A: After its permanent price cut, V4-Pro is $0.435/$0.87 per million; Flash is $0.1875/$1.125. Flash wins on input cost (2.3x cheaper), V4-Pro wins on output cost (~23% cheaper). The crossover point depends on your input/output ratio.
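
The crossover can be computed directly by setting the two cost formulas equal and solving for the output-to-input token ratio. The sketch below uses only the four rates quoted in this answer; the function names are illustrative.

```python
# (input, output) rates in USD per million tokens, as quoted above.
FLASH = (0.1875, 1.125)
V4_PRO = (0.435, 0.87)


def crossover_output_per_input(a: tuple, b: tuple) -> float:
    """Output/input token ratio at which models `a` and `b` cost the same.

    Solves a_in*I + a_out*O = b_in*I + b_out*O for O/I.
    """
    (a_in, a_out), (b_in, b_out) = a, b
    return (b_in - a_in) / (a_out - b_out)


def cheaper(tokens_in: int, tokens_out: int) -> str:
    """Name of the cheaper model for one request shape."""
    flash = tokens_in * FLASH[0] + tokens_out * FLASH[1]
    v4_pro = tokens_in * V4_PRO[0] + tokens_out * V4_PRO[1]
    return "Flash" if flash < v4_pro else "V4-Pro"


if __name__ == "__main__":
    print(f"crossover at {crossover_output_per_input(FLASH, V4_PRO):.3f} output tokens per input token")
    print(cheaper(100_000, 5_000))  # input-heavy request
    print(cheaper(2_000, 8_000))    # output-heavy request
```

The ratio works out to about 0.97: Flash is cheaper whenever a request produces fewer output tokens than roughly 97% of its input tokens, and V4-Pro is cheaper for generation-heavy work.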

Q: Does Qwen 3.6-Plus support function calling and tool use?

A: Yes, native function calling and agentic workflows are supported across the family. 35B-A3B documents this explicitly.

Q: What's the max output token limit per tier?

A: Plus/Flash: 65,536 per OpenRouter spec. 35B-A3B: 32,768 general, 81,920 for math/coding per Qwen's recommendation. Max-Preview: not specified in available documentation.

Q: Can Qwen 3.6 models be deployed on Azure or AWS?

A: 35B-A3B (open weights) yes, via standard deployment paths. Plus/Max-Preview/Flash are accessible via DashScope, OpenRouter, and various API aggregators including TokenMix.ai. Direct availability of the proprietary tiers on AWS Bedrock or Azure is not confirmed as of 2026-05-25.

Q: What's the realistic throughput for Qwen 3.6-Plus in production?

A: OpenRouter and aggregator-reported throughput numbers vary 20-80 tok/s depending on routing and load. For SLA-bound workloads, run your own benchmark before committing.
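
A minimal harness for such a benchmark only needs to time a token stream. The sketch below is provider-agnostic by design: `tokens_per_second` accepts any iterable of tokens, and `fake_stream` is a stand-in for a real streaming response, which you would supply from your own client. Both names are hypothetical.

```python
import time
from typing import Callable, Iterable, Iterator


def tokens_per_second(stream: Iterable[str],
                      clock: Callable[[], float] = time.perf_counter) -> float:
    """Consume a token stream and return tokens per second of wall-clock time."""
    start = clock()
    count = 0
    for _ in stream:
        count += 1
    elapsed = clock() - start
    if elapsed <= 0:
        raise ValueError("elapsed time must be positive")
    return count / elapsed


def fake_stream(n_tokens: int, delay_s: float) -> Iterator[str]:
    """Stand-in for a streaming API response; sleeps between tokens."""
    for i in range(n_tokens):
        time.sleep(delay_s)
        yield f"tok{i}"


if __name__ == "__main__":
    rate = tokens_per_second(fake_stream(20, 0.01))
    print(f"measured {rate:.1f} tok/s on the fake stream")
```

Because the timer starts before the first token arrives, the figure includes time to first token; run it across several prompts and times of day before relying on the number for an SLA.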

TokenMix Take

Editorial note: TokenMix.ai routes traffic across 300+ models including all four Qwen 3.6 variants discussed above. The pricing above was checked independently against OpenRouter and pricepertoken.com; the benchmark scores come from the vendor's release announcements and Hugging Face model cards.

The Qwen 3.6 family is the sharpest tier ladder we've seen in 2026 so far. The 41x gap between 35B-A3B's input rate and Max-Preview's output rate (about 6.9x like for like) maps cleanly to four distinct workload classes, whereas most Chinese model families collapse to 2-3 tiers that overlap heavily. Alibaba's pricing discipline here suggests they read the same workload-routing playbook the cost-conscious operators have been writing.

Our routing recommendation for a fresh agent stack today: default to Plus for stateful coding tasks, fall back to Flash when context fits comfortably in 128K, escalate to Max-Preview only for hardest evals that demonstrably benefit, and reserve 35B-A3B for the self-host path when GPU economics work out. The math/reasoning advantage of 35B-A3B specifically (AIME26 92.7, GPQA 86.0) makes it the surprise pick for low-volume high-precision tasks where the API tax doesn't justify itself.

The biggest open question is what Alibaba does with the "Preview" tag on Max-Preview. If it stabilizes within Q3 2026, the four-tier ladder becomes the most coherent Chinese flagship lineup. If Max-Preview keeps shifting, production teams will gravitate to Plus and the open-source 35B-A3B for stability.

For workflow-by-workflow cost math against Western frontier models, our cheapest frontier LLM API analysis covers cross-vendor cost-per-task numbers including the DeepSeek V4-Pro post-cut shift that landed last week.