TokenMix Research Lab · 2026-06-22

GLM-4.7-Flash Review 2026: Free 30B Coding Model, Benchmarks
Last Updated: 2026-06-22 Author: TokenMix Research Lab Data verified: 2026-06-22 - Z.ai release notes, Hugging Face zai-org/GLM-4.7-Flash model card, BigModel free-model docs, OpenRouter pricing page, GGUF quant repo, secondary coverage (MarkTechPost, Techloy)
Zhipu released GLM-4.7-Flash on January 19, 2026, a 31B-parameter mixture-of-experts model that activates only 3B parameters per token, ships under an MIT license, and is free to call on Zhipu's own platform (Z.ai release notes). It posts 59.2 on SWE-bench Verified, strong for its size class — but it is text-only, the free tier is capped near 1 request per second, and TokenMix does not relay it, serving the newer GLM-5 family instead (TokenMix models).
This review separates Zhipu's documented specs and the model card's own benchmark table from third-party performance claims, and tags each item Confirmed, Likely, or vendor-reported. Pricing is pulled from BigModel's free-model docs and OpenRouter; benchmark numbers come from the official Hugging Face model card, which compares GLM-4.7-Flash against Qwen3-30B-A3B and GPT-OSS-20B.
Table of Contents
- Quick Verdict
- What GLM-4.7-Flash Is
- Specifications
- Benchmarks
- Pricing
- Cost per Task
- GLM-4.7-Flash vs GLM-4.7 vs GLM-4.5-Flash
- Access and Self-Hosting
- Where GLM-4.7-Flash Loses
- Use Case Matrix
- Final Recommendation
- FAQ
- About TokenMix
- Sources
- Related Articles
Quick Verdict
GLM-4.7-Flash is the cheapest credible 30B-class coding and agent model of early 2026: free on Zhipu, MIT-licensed for self-hosting, and benchmark-strong for its size. The catch is the 1-QPS free cap, text-only scope, and the need to verify its vendor-card numbers yourself.
| Claim | Status | Source |
|---|---|---|
| Released January 19, 2026 | Confirmed | Z.ai release notes |
| 31B total / 3B active MoE ("30B-A3B") | Confirmed | HF model card |
| MIT license, open weights | Confirmed | HF model card |
| 200K context window | Confirmed | BigModel docs |
| Text-only (not multimodal) | Confirmed | BigModel docs |
| SWE-bench Verified 59.2 | Confirmed (vendor card) | HF model card |
| Free on Zhipu / BigModel | Confirmed | BigModel free model |
| OpenRouter price $0.06 in / $0.40 out per 1M | Confirmed | OpenRouter |
| Roughly 10x cheaper input than full GLM-4.7 | Likely | llm-stats GLM-4.7 |
| Runs locally on a single RTX 3090 | Likely (secondary) | MarkTechPost |
| TokenMix serves GLM-4.7-Flash | False | TokenMix lists GLM-5.2, not GLM-4.7-Flash (TokenMix models) |
The short answer: GLM-4.7-Flash is the model to reach for when you want a near-free, self-hostable coding agent and can live with 1-QPS on the free tier. For production throughput, OpenRouter or a managed GLM-5 route is the realistic path.
What GLM-4.7-Flash Is
GLM-4.7-Flash is the cheap, fast sibling of Zhipu's GLM-4.7 flagship, built for local coding and agent workloads rather than frontier benchmarks. Where the full GLM-4.7 is a 355B-class MoE released December 22, 2025, the Flash variant is a 31B/3B-active "Lite" MoE released January 19, 2026 and positioned as the strongest model in the 30B class (Z.ai release notes).
The model family timeline is the part buyers get wrong, so to be precise: GLM-4.6 (Sept 2025), then GLM-4.7 (Dec 2025), then GLM-4.7-Flash (Jan 2026), then GLM-5 (Feb 2026) and GLM-5.1/5.2 after. GLM-4.7-Flash is not a downgrade of GLM-5; it is the budget tier of the 4.7 generation, kept around because it is small enough to self-host. For the full flagship, see the GLM-4.7 review; for the broader lineup, the GLM models roundup.
Specifications
The spec that defines GLM-4.7-Flash is the 30B-A3B design: 31B total parameters, but only 3B active per token, which is what makes it cheap to run and self-hostable. Every figure is from the official model card or Zhipu docs.
| Field | Value | Status |
|---|---|---|
| Release date | 2026-01-19 | Confirmed |
| Total parameters | 31B | Confirmed |
| Active parameters | 3B per token | Confirmed |
| Architecture | MoE Lite, MLA attention | Confirmed |
| Context window | 200K (203K on OpenRouter) | Confirmed |
| Max output | 128K (16,384 on OpenRouter) | Confirmed |
| Modalities | Text-only | Confirmed |
| License | MIT | Confirmed |
| Local speed | 80+ tok/s on RTX 3090 / Apple Silicon | Likely (secondary) |
A note on the "FlashX" variant: third parties reference a faster paid GLM-4.7-FlashX tier, but it is not on Zhipu's official release-notes page, so treat FlashX as Likely-but-unconfirmed. The 200K context is the official figure; OpenRouter exposes a slightly different 203K and caps completion at 16,384 tokens, which matters if you plan long generations.
Benchmarks
On the model card's own numbers, GLM-4.7-Flash is a standout in the 30B class, led by a 59.2 on SWE-bench Verified that roughly triples Qwen3-30B-A3B. These are vendor-published comparisons, so weight them as such until independent evals land.
| Benchmark | GLM-4.7-Flash | Qwen3-30B-A3B | GPT-OSS-20B |
|---|---|---|---|
| SWE-bench Verified | 59.2 | 22.0 | 34.0 |
| tau-2 Bench (tool use) | 79.5 | 49.0 | 47.7 |
| BrowseComp (web agent) | 42.8 | 2.29 | 28.3 |
| AIME 25 (math) | 91.6 | 85.0 | 91.7 |
| GPQA (reasoning) | 75.2 | 73.4 | 71.5 |
| HLE (Humanity's Last Exam) | 14.4 | 9.8 | 10.9 |
| LiveCodeBench v6 | 64.0 | 66.0 | 61.0 |
Source for the table: the official GLM-4.7-Flash model card. The headline is agentic coding: a 59.2 SWE-bench Verified at 3B active parameters is the kind of result that makes a self-hostable model genuinely useful for coding agents. The one row where it slips is LiveCodeBench v6, where Qwen3-30B-A3B edges ahead at 66.0 to 64.0 — a useful reminder that "best in class" is not "best on every test."
Pricing
GLM-4.7-Flash is free on Zhipu's own platform and cheap on OpenRouter, and that price gap versus the full GLM-4.7 is the whole reason to care about it. The free tier requires real-name verification and is limited to roughly 1 request per second (BigModel free model docs).
| Access | Input / 1M | Output / 1M | Note |
|---|---|---|---|
| Zhipu / BigModel (free tier) | $0.00 | $0.00 | Real-name verify, ~1 QPS |
| OpenRouter | $0.06 | $0.40 | Production throughput |
| Self-host (MIT weights) | $0 API | $0 API | Pay your own GPU |
| Full GLM-4.7 (for contrast) | ~$0.60 | ~$2.20 | ~10x in, ~5.5x out |
The comparison that matters: GLM-4.7-Flash on OpenRouter is roughly 10x cheaper on input and 5.5x cheaper on output than the full GLM-4.7 (llm-stats), and free entirely if you stay inside Zhipu's 1-QPS tier. For teams already managing free-tier limits across vendors, see the GLM free API access tiers guide.
Cost per Task
Modeling a real coding-agent workload shows how cheap this model is in production. Assume a moderate agent that consumes 50M input and 10M output tokens per month.
| Path | Monthly input cost | Monthly output cost | Total |
|---|---|---|---|
| Zhipu free tier | $0.00 | $0.00 | $0.00 (1 QPS limit) |
| OpenRouter GLM-4.7-Flash | $3.00 | $4.00 | $7.00 |
| Full GLM-4.7 (OpenRouter-class) | $30.00 | $22.00 | $52.00 |
A coding agent that would cost $52 a month on full GLM-4.7 runs about $7 on GLM-4.7-Flash, or $0 on Zhipu's free tier if you can work within 1 request per second. For batch or off-peak jobs the free tier is effectively a free coding model; for interactive multi-user agents the $7 OpenRouter path buys real concurrency. Either way, the per-task economics are far below frontier models — use the LLM API cost calculator to model your own mix.
GLM-4.7-Flash vs GLM-4.7 vs GLM-4.5-Flash
Against its own family, GLM-4.7-Flash trades raw capability for cost and deployability, sitting below the full GLM-4.7 and above the older GLM-4.5-Flash. The choice is mostly task complexity versus budget.
| Model | Class | Strength | Pick when |
|---|---|---|---|
| GLM-4.7-Flash | 31B/3B MoE | Cheap, self-hostable, strong SWE-bench | Coding agents on a budget |
| GLM-4.7 (full) | 355B-class MoE | Higher ceiling on hard tasks | Complex work, quality first |
| GLM-4.5-Flash | prior-gen Flash | Older, still free-tier | Legacy compatibility |
| GLM-5.2 (newer) | 1M-context flagship | Long-context agents, managed on TokenMix | Production, long horizons |
The practical guidance: use GLM-4.7-Flash for routine coding and agent tasks where cost dominates, escalate to full GLM-4.7 for genuinely hard problems, and if you want a managed long-context option without self-hosting, GLM-5.2 is the newer family member available through a hosted relay. See the GLM-5.2 review for that comparison.
Access and Self-Hosting
Because the weights are MIT-licensed and the active footprint is only 3B, GLM-4.7-Flash is genuinely runnable on consumer hardware, which sets it apart from most strong coding models. There are four practical paths.
| Path | What you get | Best for | Caveat |
|---|---|---|---|
| Zhipu / Z.ai API | OpenAI-style endpoint, free tier | Fastest test | ~1 QPS free, real-name verify |
| OpenRouter | Hosted, pay-as-you-go | Production throughput | $0.06 / $0.40 per 1M |
| Hugging Face weights | MIT download, vLLM/SGLang | Self-host, research | Needs a capable GPU |
| GGUF quant | LM Studio / llama.cpp local | Laptop / single GPU | Quantization quality trade |
from openai import OpenAI
client = OpenAI(
api_key="your-ZAI-api-key",
base_url="https://api.z.ai/api/paas/v4/"
)
resp = client.chat.completions.create(
model="glm-4.7-flash",
messages=[{"role": "user", "content": "Refactor this function and add tests."}],
)
print(resp.choices[0].message.content)
For local deployment, the bartowski GGUF build runs in LM Studio or llama.cpp, and the full weights serve under vLLM or SGLang for higher throughput. The OpenAI-compatible endpoint means existing tooling needs only a base-URL and model-name change — the standard AI API gateway pattern.
Where GLM-4.7-Flash Loses
GLM-4.7-Flash loses on free-tier throughput, modality, and benchmark independence. None of these are surprising for a free 30B model, but they shape where it fits.
| Weak spot | Evidence | Pick instead |
|---|---|---|
| 1-QPS free-tier cap | BigModel free-model limit | OpenRouter for concurrency |
| Text-only | No vision/audio | GLM-4.1V or a VLM for multimodal |
| Vendor-only benchmarks | Numbers from model card | Run your own SWE/agent eval |
| Below full GLM-4.7 ceiling | Smaller active params | Full GLM-4.7 for hard tasks |
| Self-host needs a real GPU | 31B weights | Hosted API if no GPU |
| LiveCodeBench slightly trails Qwen3 | 64.0 vs 66.0 | Qwen3-30B for that specific profile |
The pattern: this is a budget and deployability play, not a frontier play. Where you need maximum quality, multimodality, or guaranteed throughput, a larger or managed model is the right call. Where you need a cheap, capable, self-hostable coding model, very little competes.
Use Case Matrix
Point GLM-4.7-Flash at cost-sensitive coding and agent work, and route harder or multimodal jobs elsewhere.
| Use case | GLM-4.7-Flash fit | Better alternative | Why |
|---|---|---|---|
| Budget coding agent | Strong | none on cost | SWE-bench 59.2 at ~$7/mo |
| Self-hosted local coding | Strong | none on cost | MIT, runs on one GPU |
| Off-peak / batch automation | Strong | none on cost | Free tier covers it |
| Tool-use / web agents | Strong | full GLM-4.7 if quality first | tau-2 79.5, BrowseComp 42.8 |
| High-concurrency production | Medium | OpenRouter or GLM-5.2 | 1-QPS free cap |
| Long-horizon, 1M context | Weak | GLM-5.2 | 200K vs 1M context |
| Multimodal (image/doc) | Weak | GLM-4.1V-Thinking | text-only |
| Maximum-quality hard tasks | Medium | full GLM-4.7 / frontier | smaller active params |
If your real problem is routing across cheap and frontier models rather than picking one, pair this with cheapest LLM API and the GLM-5.2 review.
Final Recommendation
Use GLM-4.7-Flash as a near-free coding and agent model: the Zhipu free tier for batch and off-peak work within 1 QPS, OpenRouter at $0.06/$0.40 for production concurrency, and the MIT weights for self-hosting on a single capable GPU. Escalate to full GLM-4.7 for genuinely hard problems, move to GLM-5.2 when you need 1M context or a managed endpoint, and validate the model card's benchmark numbers on your own tasks before standardizing on it.
FAQ
Is GLM-4.7-Flash free?
Yes, on Zhipu's own platform. The BigModel free tier requires real-name verification and is limited to roughly 1 request per second. For higher throughput, OpenRouter charges $0.06 per 1M input and $0.40 per 1M output tokens.
Is GLM-4.7-Flash open source?
Yes. The weights are released under an MIT license on Hugging Face, so you can self-host and use it commercially. GGUF quantizations exist for local runners like LM Studio and llama.cpp.
How good is GLM-4.7-Flash at coding?
Strong for its size. Its model card reports 59.2 on SWE-bench Verified, roughly triple Qwen3-30B-A3B, plus 79.5 on tau-2 Bench for tool use. These are vendor-published numbers, so confirm them on your own workload.
What is GLM-4.7-Flash's context window?
200K tokens per Zhipu's documentation, shown as about 203K on OpenRouter. Maximum output is 128K on Zhipu, though OpenRouter caps completion at 16,384 tokens.
GLM-4.7-Flash vs GLM-4.7: what's the difference?
GLM-4.7-Flash is the 31B/3B-active budget tier; full GLM-4.7 is a 355B-class flagship with a higher quality ceiling. Flash is roughly 10x cheaper on input and self-hostable; the full model is the pick for genuinely hard tasks.
Can I run GLM-4.7-Flash locally?
Yes. With only 3B active parameters and MIT weights, secondary coverage reports it running on a single RTX 3090 or Apple Silicon at 80+ tokens per second, especially via GGUF quantization. A capable GPU is still required for the full weights.
Is GLM-4.7-Flash multimodal?
No. It is text-only. For image, document, or video understanding in the GLM family, use GLM-4.1V-Thinking or a newer GLM vision model.
Does TokenMix offer GLM-4.7-Flash?
No. TokenMix lists GLM-5.2 from the GLM family, not GLM-4.7-Flash. If you want a managed, long-context GLM through one endpoint, GLM-5.2 is the available option; for GLM-4.7-Flash specifically, use Zhipu, OpenRouter, or self-host.
About TokenMix
TokenMix.ai is an AI API relay that routes Claude, OpenAI, Gemini, DeepSeek, Qwen, and other large language models through a single OpenAI-compatible endpoint at https://api.tokenmix.ai/v1. Current model availability and per-token rates are listed on the pricing page and the model catalog. Integration uses the standard OpenAI SDK; details in the OpenAI compatibility reference. TokenMix serves the GLM-5 family rather than GLM-4.7-Flash, so this review is published as independent model intelligence.
Sources
- Z.ai release notes - release date and positioning
- Hugging Face - zai-org/GLM-4.7-Flash model card - specs, benchmarks, MIT license
- BigModel - GLM-4.7-Flash free model docs - free tier, 200K/128K, text-only
- OpenRouter - z-ai/glm-4.7-flash - paid pricing and context
- llm-stats - GLM-4.7 - full GLM-4.7 pricing for comparison
- MarkTechPost - Zhipu releases GLM-4.7-Flash - local-deploy coverage
- Techloy - GLM-4.7-Flash for consumer hardware - secondary coverage
- Hugging Face - bartowski GLM-4.7-Flash GGUF - GGUF quant for local use