TokenMix Research Lab · 2026-06-22

LongCat-Flash Review 2026: Meituan's 560B Open MoE Tested

Last Updated: 2026-06-22 Author: TokenMix Research Lab Data verified: 2026-06-22 - LongCat-Flash technical report (arXiv 2509.01322), LongCat-Flash-Thinking report (arXiv 2509.18883), Hugging Face meituan-longcat model cards, OpenRouter listing, LongCat AI hub, secondary coverage

Meituan open-sourced LongCat-Flash-Chat on September 1, 2025, a 560-billion-parameter mixture-of-experts model that activates only about 27B parameters per token, released under a permissive MIT license (LongCat-Flash technical report). It scores 60.4 on SWE-bench Verified and 96.4 on MATH500, competitive with leading open models — but its throughput and cost-per-token figures are vendor-reported, the chat model is text-only with the Thinking and Omni variants split off separately, and TokenMix does not relay it (TokenMix models).

This review separates the technical report's confirmed architecture and the model cards' benchmark tables from vendor-reported deployment claims, tagging each item Confirmed, Likely, or vendor-reported. Benchmarks come from the official Hugging Face cards and arXiv reports; pricing comes from OpenRouter and the technical report, and relay prices drift, so re-check before committing volume.

Quick Verdict
The LongCat-Flash Family
Architecture
Benchmarks: LongCat-Flash-Chat
Benchmarks: LongCat-Flash-Thinking
Pricing and Access
Cost per Task
LongCat-Flash vs DeepSeek
Where LongCat-Flash Loses
Use Case Matrix
Final Recommendation
FAQ
About TokenMix
Sources
Related Articles

Quick Verdict

LongCat-Flash is one of the more interesting Chinese open MoE releases of late 2025: 560B total parameters, ~27B active, MIT-licensed, and engineered for high throughput. The catch is that the headline speed and cost numbers are vendor-reported, and serving 560B weights is serious infrastructure.

Claim	Status	Source
LongCat-Flash-Chat released Sept 1, 2025	Confirmed	arXiv 2509.01322
560B total parameters, MoE	Confirmed	HF model card
~27B average active parameters per token	Confirmed	arXiv 2509.01322
MIT license (weights and code)	Confirmed	HF model card
128K context window	Confirmed	HF model card
SWE-bench Verified 60.4	Confirmed (vendor card)	HF model card
Made by Meituan	Confirmed	HF model card
Over 100 tokens/sec at scale	Likely (vendor, H800-class)	arXiv 2509.01322
OpenRouter price ~$0.15 in / $0.75 out per 1M	Likely (re-verify)	OpenRouter
TokenMix serves LongCat-Flash	False	Not in TokenMix catalog (TokenMix models)

The short answer: LongCat-Flash is worth evaluating if you want a permissively licensed, high-throughput Chinese MoE for agentic and math-heavy work. Treat its speed and cost claims as vendor-reported until you benchmark them yourself.

The LongCat-Flash Family

LongCat-Flash is not one model but a family, and confusing the variants is the most common mistake. All share the same 560B MoE backbone but are tuned for different jobs (LongCat AI hub).

Model	Released	Role	License
LongCat-Flash-Chat	Sept 1, 2025	General instruct / chat	MIT
LongCat-Flash-Thinking	Sept 23, 2025	Reasoning, math, theorem proving, agents	MIT
LongCat-Flash-Omni	~Nov 2025	Multimodal variant	MIT
LongCat-Flash-Lite	later	Smaller (~4.5B active)	Likely MIT

The split that matters for most readers is Chat versus Thinking. Both use the same backbone, but Thinking adds extended chain-of-thought via a /think_on chat-template token and is tuned for math, formal theorem proving, and agentic tool use. Chat is the general-purpose instruct model. There is also an Omni multimodal variant and a smaller Lite, plus a teased LongCat-Next. For the broader Chinese-model landscape, see the best Chinese AI models comparison.

Architecture

LongCat-Flash's signature is dynamic computation: it spends more parameters on hard tokens and fewer on easy ones, averaging about 27B active out of 560B total. This is driven by "Zero-computation Experts" governed by a PID controller that keeps the active average near 27B (LongCat-Flash technical report).

Field	Value	Status
Total parameters	560B	Confirmed
Active parameters	18.6B-31.3B, ~27B average	Confirmed
Activation mechanism	Zero-computation Experts + PID controller	Confirmed
MoE design	Shortcut-connected MoE (ScMoE)	Confirmed
Context window	128K tokens	Confirmed
Training tokens	20T+	Confirmed
Training time	~30 days	Likely (vendor)
Training availability	98.48%	Likely (vendor)

The ScMoE design widens the overlap window between computation and communication, which is the engineering trick behind the throughput claims. The dynamic activation is genuinely novel: rather than activating a fixed number of experts, the model scales compute to token difficulty, which is what lets a 560B model run at the cost profile of a much smaller one. These architecture facts are from the technical report; the training-efficiency numbers are vendor self-reported.

Benchmarks: LongCat-Flash-Chat

On its model card numbers, LongCat-Flash-Chat is strong on math and agentic tool use, and mid-pack on raw coding. The standout rows are MATH500 and the agentic tau-2 benchmarks.

Benchmark	Score	Note
MMLU	89.71	general knowledge
MMLU-Pro	82.68	harder MMLU
MATH500	96.40	math
AIME 25	61.25	competition math (avg@10)
LiveCodeBench	48.02	coding (pass@1)
SWE-Bench-Verified	60.40	agentic coding
tau-2 Bench (telecom)	73.68	tool use
tau-2 Bench (airline)	58.00	tool use
VitaBench	24.30	Meituan's real-world agentic bench

Source: the official LongCat-Flash-Chat model card. The profile is a capable generalist that is especially good at math (96.4 MATH500) and tool-use agents (73.68 telecom tau-2), with a 60.4 SWE-bench Verified that is competitive but not class-leading. LiveCodeBench at 48.02 is the softer spot, which is why the Thinking variant exists for reasoning-heavy work.

Benchmarks: LongCat-Flash-Thinking

The Thinking variant pushes math and reasoning into frontier territory, and is the version to use for hard problem-solving. These numbers are distinct from the Chat model and come from its own card.

Benchmark	Score	Note
AIME 25	90.6	competition math (Mean@32)
LiveCodeBench	79.4	coding
GPQA-Diamond	81.5	graduate reasoning
MiniF2F	67.6	formal theorem proving (pass@1)

Source: the official LongCat-Flash-Thinking model card. The jump is dramatic: AIME 25 goes from 61.25 on Chat to 90.6 on Thinking, and LiveCodeBench from 48.02 to 79.4. If your workload is math, competitive coding, or formal reasoning, Thinking is the variant that matters; for general chat and tool use, Chat is the lighter default.

Pricing and Access

LongCat-Flash is free to download and cheap to rent, with open MIT weights and a low OpenRouter rate. The exact relay price should be re-checked, as the live page renders dynamically.

Access	Input / 1M	Output / 1M	Note
Hugging Face weights	$0 API	$0 API	MIT, plus FP8 quant
OpenRouter	~$0.15	~$0.75	Re-verify at write time
Official deployment (report)	n/a	~$0.70	Vendor economics claim
longcat.ai web demo	free	free	Vendor chat UI

Two caveats. First, the OpenRouter price of roughly $0.15 input and $0.75 output per 1M tokens came from search aggregation rather than a clean page fetch, so confirm it live before budgeting. Second, the $0.70 per 1M output figure is from the technical report describing Meituan's own large-scale deployment economics, not necessarily a public retail SKU. The weights are on Hugging Face with an FP8 quant for cheaper serving, and the model is also routed via Vercel AI Gateway and LangDB. For the routing pattern across providers, see the AI API gateway guide.

Cost per Task

At OpenRouter rates, LongCat-Flash-Chat is inexpensive for moderate workloads. Assume an agent consuming 10M input and 2M output tokens per month.

Path	Input cost	Output cost	Total
OpenRouter LongCat-Flash-Chat	$1.50	$1.50	$3.00
Self-host (MIT weights)	$0 API	$0 API	GPU cluster cost
Heavy agent (50M in / 10M out)	$7.50	$7.50	$15.00

A moderate agent costs about $3 a month on OpenRouter, scaling to roughly $15 for a heavier 50M/10M workload — cheap for a 560B-class model, which is the whole point of the dynamic-activation design. Self-hosting is free of API fees but requires a serious GPU cluster for 560B weights, so the hosted route is the realistic path for most teams. Model your own mix with the LLM API cost calculator.

LongCat-Flash vs DeepSeek

Against DeepSeek, the obvious Chinese-MoE rival, LongCat-Flash competes on throughput and permissive licensing rather than on having the deepest ecosystem. Both are large open MoE models; the choice is about availability and the specific benchmark profile.

Dimension	LongCat-Flash	DeepSeek (V-series)
Architecture	560B MoE, ~27B active, dynamic	Large MoE, fixed active set
License	MIT	Open (model-specific)
Ecosystem / hosting	Newer, fewer providers	Broad, widely hosted
Throughput claim	>100 TPS (vendor)	Provider-dependent
Best variant for reasoning	LongCat-Flash-Thinking	DeepSeek reasoning models

DeepSeek has the deeper ecosystem and far wider hosting, including on managed relays. LongCat-Flash's pitch is the MIT license, the novel dynamic-activation efficiency, and strong math/agentic scores. For most production teams DeepSeek is the safer default today; LongCat-Flash is the one to test when you want a permissively licensed alternative or its specific throughput profile. See the DeepSeek V4 review and DeepSeek API pricing for that side.

Where LongCat-Flash Loses

LongCat-Flash loses on ecosystem maturity, benchmark independence, and self-host practicality. These are typical for a recent large open model, not flaws in the model itself.

Weak spot	Evidence	Pick instead
Vendor-reported speed/cost	Report claims, no third-party replication	Benchmark on your own infra
Narrow hosting	Few relays vs DeepSeek/Qwen	DeepSeek for broad availability
560B self-host cost	Large MoE weights	Hosted API or smaller model
Chat is text-only	Variants split by modality	LongCat-Flash-Omni for multimodal
Coding mid-pack (Chat)	LiveCodeBench 48.02	LongCat-Flash-Thinking or a coding model
Limited track record	Released Sept 2025	Established model for risk-averse prod

The pattern: LongCat-Flash is a credible, well-engineered open model that is still early in its ecosystem. Where you need broad hosting, independent benchmarks, or a long production track record, a more established model wins. Where you want a permissively licensed, efficient Chinese MoE to evaluate, it earns the test.

Use Case Matrix

Point LongCat-Flash at math, agentic, and permissive-license workloads, and route broad-availability production needs to more established models.

Use case	LongCat-Flash fit	Better alternative	Why
Math / competition problems	Strong (Thinking)	DeepSeek reasoning	AIME 25 90.6 on Thinking
Agentic tool use	Strong	full frontier model	tau-2 73.68, SWE-bench 60.4
Permissive-license self-host	Strong	smaller MIT model if infra-limited	MIT on 560B weights
Formal theorem proving	Strong (Thinking)	specialized prover	MiniF2F 67.6
Broad production hosting	Medium	DeepSeek / Qwen	fewer relays today
Multimodal tasks	Medium	LongCat-Flash-Omni / a VLM	Chat is text-only
Cheapest small-scale jobs	Medium	a small dense model	560B is overkill for trivial tasks
Risk-averse enterprise	Medium	established model	short track record

If your real problem is routing across many Chinese and frontier models rather than picking one, pair this with the best Chinese AI models guide and cheapest LLM API.

Final Recommendation

Evaluate LongCat-Flash when you want a permissively licensed, high-throughput Chinese MoE: use LongCat-Flash-Thinking for math, competitive coding, and formal reasoning, LongCat-Flash-Chat for general instruct and tool-use agents, and the MIT weights or OpenRouter for deployment. Keep a more widely hosted model like DeepSeek as the production default until LongCat's ecosystem matures, and benchmark its vendor-reported speed and cost claims on your own infrastructure before relying on them.

FAQ

What is LongCat-Flash?

LongCat-Flash is a family of large open mixture-of-experts models from Meituan, released starting September 2025. The backbone has 560B total parameters but activates only about 27B per token, and it is MIT-licensed for weights and code.

Who made LongCat-Flash?

Meituan, the Chinese super-app company, through its LongCat AI lab. The models and technical reports are published on Hugging Face and arXiv.

Is LongCat-Flash open source?

Yes. Both LongCat-Flash-Chat and LongCat-Flash-Thinking are released under an MIT license, which is unusually permissive, covering both weights and code. An FP8 quantization is also available for cheaper serving.

LongCat-Flash-Chat vs Thinking: what's the difference?

They share the same 560B MoE backbone. Chat is the general instruct model; Thinking adds extended chain-of-thought reasoning and is tuned for math, formal theorem proving, and agentic tasks. AIME 25 rises from 61.25 on Chat to 90.6 on Thinking.

How much does LongCat-Flash cost?

The weights are free to download under MIT. On OpenRouter it lists around $0.15 per 1M input and $0.75 per 1M output, though that should be re-verified live. Meituan's technical report cites about $0.70 per 1M output tokens for its own large-scale deployment.

How fast is LongCat-Flash?

Meituan reports over 100 tokens per second at large-scale deployment, attributed in secondary coverage to H800-class hardware. This is a vendor-reported figure tied to the model's dynamic-activation and Shortcut-connected MoE design.

How does LongCat-Flash compare to DeepSeek?

DeepSeek has a broader ecosystem and far wider hosting. LongCat-Flash competes on its MIT license, dynamic-activation efficiency, and strong math and agentic scores. For broad production use DeepSeek is the safer default; LongCat-Flash is the permissive-license alternative to test.

Does TokenMix offer LongCat-Flash?

No. LongCat-Flash is not in the TokenMix catalog. To use it today, download the MIT weights from Hugging Face, call it via OpenRouter, or try the longcat.ai web demo.

About TokenMix

TokenMix.ai is an AI API relay that routes Claude, OpenAI, Gemini, DeepSeek, Qwen, and other large language models through a single OpenAI-compatible endpoint at https://api.tokenmix.ai/v1. Current model availability and per-token rates are listed on the pricing page and the model catalog. Integration uses the standard OpenAI SDK; details in the OpenAI compatibility reference. LongCat-Flash is not currently in the TokenMix catalog, so this review is published as independent model intelligence.

Sources

LongCat-Flash Technical Report (arXiv 2509.01322) - architecture, dynamic activation, training, deployment economics
Hugging Face - meituan-longcat/LongCat-Flash-Chat - model card, benchmarks, MIT license, 128K context
Hugging Face - meituan-longcat/LongCat-Flash-Thinking - reasoning variant benchmarks
LongCat-Flash-Thinking Technical Report (arXiv 2509.18883) - reasoning training methodology
OpenRouter - meituan/longcat-flash-chat - third-party hosting and pricing
aibase coverage - LongCat-Flash - 560B, throughput, H800 context
DigitalOcean - LongCat-Flash-Chat tutorial - independent explainer
LongCat AI hub - family overview (Lite, Omni, Next)