TokenMix Research Lab · 2026-06-22

LongCat-Flash Review 2026: Meituan's 560B Open MoE Tested
Last Updated: 2026-06-22 Author: TokenMix Research Lab Data verified: 2026-06-22 - LongCat-Flash technical report (arXiv 2509.01322), LongCat-Flash-Thinking report (arXiv 2509.18883), Hugging Face meituan-longcat model cards, OpenRouter listing, LongCat AI hub, secondary coverage
Meituan open-sourced LongCat-Flash-Chat on September 1, 2025, a 560-billion-parameter mixture-of-experts model that activates only about 27B parameters per token, released under a permissive MIT license (LongCat-Flash technical report). It scores 60.4 on SWE-bench Verified and 96.4 on MATH500, competitive with leading open models — but its throughput and cost-per-token figures are vendor-reported, the chat model is text-only with the Thinking and Omni variants split off separately, and TokenMix does not relay it (TokenMix models).
This review separates the technical report's confirmed architecture and the model cards' benchmark tables from vendor-reported deployment claims, tagging each item Confirmed, Likely, or vendor-reported. Benchmarks come from the official Hugging Face cards and arXiv reports; pricing comes from OpenRouter and the technical report, and relay prices drift, so re-check before committing volume.
Table of Contents
- Quick Verdict
- The LongCat-Flash Family
- Architecture
- Benchmarks: LongCat-Flash-Chat
- Benchmarks: LongCat-Flash-Thinking
- Pricing and Access
- Cost per Task
- LongCat-Flash vs DeepSeek
- Where LongCat-Flash Loses
- Use Case Matrix
- Final Recommendation
- FAQ
- About TokenMix
- Sources
- Related Articles
Quick Verdict
LongCat-Flash is one of the more interesting Chinese open MoE releases of late 2025: 560B total parameters, ~27B active, MIT-licensed, and engineered for high throughput. The catch is that the headline speed and cost numbers are vendor-reported, and serving 560B weights is serious infrastructure.
| Claim | Status | Source |
|---|---|---|
| LongCat-Flash-Chat released Sept 1, 2025 | Confirmed | arXiv 2509.01322 |
| 560B total parameters, MoE | Confirmed | HF model card |
| ~27B average active parameters per token | Confirmed | arXiv 2509.01322 |
| MIT license (weights and code) | Confirmed | HF model card |
| 128K context window | Confirmed | HF model card |
| SWE-bench Verified 60.4 | Confirmed (vendor card) | HF model card |
| Made by Meituan | Confirmed | HF model card |
| Over 100 tokens/sec at scale | Likely (vendor, H800-class) | arXiv 2509.01322 |
| OpenRouter price ~$0.15 in / $0.75 out per 1M | Likely (re-verify) | OpenRouter |
| TokenMix serves LongCat-Flash | False | Not in TokenMix catalog (TokenMix models) |
The short answer: LongCat-Flash is worth evaluating if you want a permissively licensed, high-throughput Chinese MoE for agentic and math-heavy work. Treat its speed and cost claims as vendor-reported until you benchmark them yourself.
The LongCat-Flash Family
LongCat-Flash is not one model but a family, and confusing the variants is the most common mistake. All share the same 560B MoE backbone but are tuned for different jobs (LongCat AI hub).
| Model | Released | Role | License |
|---|---|---|---|
| LongCat-Flash-Chat | Sept 1, 2025 | General instruct / chat | MIT |
| LongCat-Flash-Thinking | Sept 23, 2025 | Reasoning, math, theorem proving, agents | MIT |
| LongCat-Flash-Omni | ~Nov 2025 | Multimodal variant | MIT |
| LongCat-Flash-Lite | later | Smaller (~4.5B active) | Likely MIT |
The split that matters for most readers is Chat versus Thinking. Both use the same backbone, but Thinking adds extended chain-of-thought via a /think_on chat-template token and is tuned for math, formal theorem proving, and agentic tool use. Chat is the general-purpose instruct model. There is also an Omni multimodal variant and a smaller Lite, plus a teased LongCat-Next. For the broader Chinese-model landscape, see the best Chinese AI models comparison.
Architecture
LongCat-Flash's signature is dynamic computation: it spends more parameters on hard tokens and fewer on easy ones, averaging about 27B active out of 560B total. This is driven by "Zero-computation Experts" governed by a PID controller that keeps the active average near 27B (LongCat-Flash technical report).
| Field | Value | Status |
|---|---|---|
| Total parameters | 560B | Confirmed |
| Active parameters | 18.6B-31.3B, ~27B average | Confirmed |
| Activation mechanism | Zero-computation Experts + PID controller | Confirmed |
| MoE design | Shortcut-connected MoE (ScMoE) | Confirmed |
| Context window | 128K tokens | Confirmed |
| Training tokens | 20T+ | Confirmed |
| Training time | ~30 days | Likely (vendor) |
| Training availability | 98.48% | Likely (vendor) |
The ScMoE design widens the overlap window between computation and communication, which is the engineering trick behind the throughput claims. The dynamic activation is genuinely novel: rather than activating a fixed number of experts, the model scales compute to token difficulty, which is what lets a 560B model run at the cost profile of a much smaller one. These architecture facts are from the technical report; the training-efficiency numbers are vendor self-reported.
Benchmarks: LongCat-Flash-Chat
On its model card numbers, LongCat-Flash-Chat is strong on math and agentic tool use, and mid-pack on raw coding. The standout rows are MATH500 and the agentic tau-2 benchmarks.
| Benchmark | Score | Note |
|---|---|---|
| MMLU | 89.71 | general knowledge |
| MMLU-Pro | 82.68 | harder MMLU |
| MATH500 | 96.40 | math |
| AIME 25 | 61.25 | competition math (avg@10) |
| LiveCodeBench | 48.02 | coding (pass@1) |
| SWE-Bench-Verified | 60.40 | agentic coding |
| tau-2 Bench (telecom) | 73.68 | tool use |
| tau-2 Bench (airline) | 58.00 | tool use |
| VitaBench | 24.30 | Meituan's real-world agentic bench |
Source: the official LongCat-Flash-Chat model card. The profile is a capable generalist that is especially good at math (96.4 MATH500) and tool-use agents (73.68 telecom tau-2), with a 60.4 SWE-bench Verified that is competitive but not class-leading. LiveCodeBench at 48.02 is the softer spot, which is why the Thinking variant exists for reasoning-heavy work.
Benchmarks: LongCat-Flash-Thinking
The Thinking variant pushes math and reasoning into frontier territory, and is the version to use for hard problem-solving. These numbers are distinct from the Chat model and come from its own card.
| Benchmark | Score | Note |
|---|---|---|
| AIME 25 | 90.6 | competition math (Mean@32) |
| LiveCodeBench | 79.4 | coding |
| GPQA-Diamond | 81.5 | graduate reasoning |
| MiniF2F | 67.6 | formal theorem proving (pass@1) |
Source: the official LongCat-Flash-Thinking model card. The jump is dramatic: AIME 25 goes from 61.25 on Chat to 90.6 on Thinking, and LiveCodeBench from 48.02 to 79.4. If your workload is math, competitive coding, or formal reasoning, Thinking is the variant that matters; for general chat and tool use, Chat is the lighter default.
Pricing and Access
LongCat-Flash is free to download and cheap to rent, with open MIT weights and a low OpenRouter rate. The exact relay price should be re-checked, as the live page renders dynamically.
| Access | Input / 1M | Output / 1M | Note |
|---|---|---|---|
| Hugging Face weights | $0 API | $0 API | MIT, plus FP8 quant |
| OpenRouter | ~$0.15 | ~$0.75 | Re-verify at write time |
| Official deployment (report) | n/a | ~$0.70 | Vendor economics claim |
| longcat.ai web demo | free | free | Vendor chat UI |
Two caveats. First, the OpenRouter price of roughly $0.15 input and $0.75 output per 1M tokens came from search aggregation rather than a clean page fetch, so confirm it live before budgeting. Second, the $0.70 per 1M output figure is from the technical report describing Meituan's own large-scale deployment economics, not necessarily a public retail SKU. The weights are on Hugging Face with an FP8 quant for cheaper serving, and the model is also routed via Vercel AI Gateway and LangDB. For the routing pattern across providers, see the AI API gateway guide.
Cost per Task
At OpenRouter rates, LongCat-Flash-Chat is inexpensive for moderate workloads. Assume an agent consuming 10M input and 2M output tokens per month.
| Path | Input cost | Output cost | Total |
|---|---|---|---|
| OpenRouter LongCat-Flash-Chat | $1.50 | $1.50 | $3.00 |
| Self-host (MIT weights) | $0 API | $0 API | GPU cluster cost |
| Heavy agent (50M in / 10M out) | $7.50 | $7.50 | $15.00 |
A moderate agent costs about $3 a month on OpenRouter, scaling to roughly $15 for a heavier 50M/10M workload — cheap for a 560B-class model, which is the whole point of the dynamic-activation design. Self-hosting is free of API fees but requires a serious GPU cluster for 560B weights, so the hosted route is the realistic path for most teams. Model your own mix with the LLM API cost calculator.
LongCat-Flash vs DeepSeek
Against DeepSeek, the obvious Chinese-MoE rival, LongCat-Flash competes on throughput and permissive licensing rather than on having the deepest ecosystem. Both are large open MoE models; the choice is about availability and the specific benchmark profile.
| Dimension | LongCat-Flash | DeepSeek (V-series) |
|---|---|---|
| Architecture | 560B MoE, ~27B active, dynamic | Large MoE, fixed active set |
| License | MIT | Open (model-specific) |
| Ecosystem / hosting | Newer, fewer providers | Broad, widely hosted |
| Throughput claim | >100 TPS (vendor) | Provider-dependent |
| Best variant for reasoning | LongCat-Flash-Thinking | DeepSeek reasoning models |
DeepSeek has the deeper ecosystem and far wider hosting, including on managed relays. LongCat-Flash's pitch is the MIT license, the novel dynamic-activation efficiency, and strong math/agentic scores. For most production teams DeepSeek is the safer default today; LongCat-Flash is the one to test when you want a permissively licensed alternative or its specific throughput profile. See the DeepSeek V4 review and DeepSeek API pricing for that side.
Where LongCat-Flash Loses
LongCat-Flash loses on ecosystem maturity, benchmark independence, and self-host practicality. These are typical for a recent large open model, not flaws in the model itself.
| Weak spot | Evidence | Pick instead |
|---|---|---|
| Vendor-reported speed/cost | Report claims, no third-party replication | Benchmark on your own infra |
| Narrow hosting | Few relays vs DeepSeek/Qwen | DeepSeek for broad availability |
| 560B self-host cost | Large MoE weights | Hosted API or smaller model |
| Chat is text-only | Variants split by modality | LongCat-Flash-Omni for multimodal |
| Coding mid-pack (Chat) | LiveCodeBench 48.02 | LongCat-Flash-Thinking or a coding model |
| Limited track record | Released Sept 2025 | Established model for risk-averse prod |
The pattern: LongCat-Flash is a credible, well-engineered open model that is still early in its ecosystem. Where you need broad hosting, independent benchmarks, or a long production track record, a more established model wins. Where you want a permissively licensed, efficient Chinese MoE to evaluate, it earns the test.
Use Case Matrix
Point LongCat-Flash at math, agentic, and permissive-license workloads, and route broad-availability production needs to more established models.
| Use case | LongCat-Flash fit | Better alternative | Why |
|---|---|---|---|
| Math / competition problems | Strong (Thinking) | DeepSeek reasoning | AIME 25 90.6 on Thinking |
| Agentic tool use | Strong | full frontier model | tau-2 73.68, SWE-bench 60.4 |
| Permissive-license self-host | Strong | smaller MIT model if infra-limited | MIT on 560B weights |
| Formal theorem proving | Strong (Thinking) | specialized prover | MiniF2F 67.6 |
| Broad production hosting | Medium | DeepSeek / Qwen | fewer relays today |
| Multimodal tasks | Medium | LongCat-Flash-Omni / a VLM | Chat is text-only |
| Cheapest small-scale jobs | Medium | a small dense model | 560B is overkill for trivial tasks |
| Risk-averse enterprise | Medium | established model | short track record |
If your real problem is routing across many Chinese and frontier models rather than picking one, pair this with the best Chinese AI models guide and cheapest LLM API.
Final Recommendation
Evaluate LongCat-Flash when you want a permissively licensed, high-throughput Chinese MoE: use LongCat-Flash-Thinking for math, competitive coding, and formal reasoning, LongCat-Flash-Chat for general instruct and tool-use agents, and the MIT weights or OpenRouter for deployment. Keep a more widely hosted model like DeepSeek as the production default until LongCat's ecosystem matures, and benchmark its vendor-reported speed and cost claims on your own infrastructure before relying on them.
FAQ
What is LongCat-Flash?
LongCat-Flash is a family of large open mixture-of-experts models from Meituan, released starting September 2025. The backbone has 560B total parameters but activates only about 27B per token, and it is MIT-licensed for weights and code.
Who made LongCat-Flash?
Meituan, the Chinese super-app company, through its LongCat AI lab. The models and technical reports are published on Hugging Face and arXiv.
Is LongCat-Flash open source?
Yes. Both LongCat-Flash-Chat and LongCat-Flash-Thinking are released under an MIT license, which is unusually permissive, covering both weights and code. An FP8 quantization is also available for cheaper serving.
LongCat-Flash-Chat vs Thinking: what's the difference?
They share the same 560B MoE backbone. Chat is the general instruct model; Thinking adds extended chain-of-thought reasoning and is tuned for math, formal theorem proving, and agentic tasks. AIME 25 rises from 61.25 on Chat to 90.6 on Thinking.
How much does LongCat-Flash cost?
The weights are free to download under MIT. On OpenRouter it lists around $0.15 per 1M input and $0.75 per 1M output, though that should be re-verified live. Meituan's technical report cites about $0.70 per 1M output tokens for its own large-scale deployment.
How fast is LongCat-Flash?
Meituan reports over 100 tokens per second at large-scale deployment, attributed in secondary coverage to H800-class hardware. This is a vendor-reported figure tied to the model's dynamic-activation and Shortcut-connected MoE design.
How does LongCat-Flash compare to DeepSeek?
DeepSeek has a broader ecosystem and far wider hosting. LongCat-Flash competes on its MIT license, dynamic-activation efficiency, and strong math and agentic scores. For broad production use DeepSeek is the safer default; LongCat-Flash is the permissive-license alternative to test.
Does TokenMix offer LongCat-Flash?
No. LongCat-Flash is not in the TokenMix catalog. To use it today, download the MIT weights from Hugging Face, call it via OpenRouter, or try the longcat.ai web demo.
About TokenMix
TokenMix.ai is an AI API relay that routes Claude, OpenAI, Gemini, DeepSeek, Qwen, and other large language models through a single OpenAI-compatible endpoint at https://api.tokenmix.ai/v1. Current model availability and per-token rates are listed on the pricing page and the model catalog. Integration uses the standard OpenAI SDK; details in the OpenAI compatibility reference. LongCat-Flash is not currently in the TokenMix catalog, so this review is published as independent model intelligence.
Sources
- LongCat-Flash Technical Report (arXiv 2509.01322) - architecture, dynamic activation, training, deployment economics
- Hugging Face - meituan-longcat/LongCat-Flash-Chat - model card, benchmarks, MIT license, 128K context
- Hugging Face - meituan-longcat/LongCat-Flash-Thinking - reasoning variant benchmarks
- LongCat-Flash-Thinking Technical Report (arXiv 2509.18883) - reasoning training methodology
- OpenRouter - meituan/longcat-flash-chat - third-party hosting and pricing
- aibase coverage - LongCat-Flash - 560B, throughput, H800 context
- DigitalOcean - LongCat-Flash-Chat tutorial - independent explainer
- LongCat AI hub - family overview (Lite, Omni, Next)