TokenMix Research Lab · 2026-06-22

GLM-4.7-Flash Review 2026: Free 30B Coding Model, Benchmarks

Last Updated: 2026-06-22 Author: TokenMix Research Lab Data verified: 2026-06-22 - Z.ai release notes, Hugging Face zai-org/GLM-4.7-Flash model card, BigModel free-model docs, OpenRouter pricing page, GGUF quant repo, secondary coverage (MarkTechPost, Techloy)

Zhipu released GLM-4.7-Flash on January 19, 2026, a 31B-parameter mixture-of-experts model that activates only 3B parameters per token, ships under an MIT license, and is free to call on Zhipu's own platform (Z.ai release notes). It posts 59.2 on SWE-bench Verified, strong for its size class — but it is text-only, the free tier is capped near 1 request per second, and TokenMix does not relay it, serving the newer GLM-5 family instead (TokenMix models).

This review separates Zhipu's documented specs and the model card's own benchmark table from third-party performance claims, and tags each item Confirmed, Likely, or vendor-reported. Pricing is pulled from BigModel's free-model docs and OpenRouter; benchmark numbers come from the official Hugging Face model card, which compares GLM-4.7-Flash against Qwen3-30B-A3B and GPT-OSS-20B.

Quick Verdict
What GLM-4.7-Flash Is
Specifications
Benchmarks
Pricing
Cost per Task
GLM-4.7-Flash vs GLM-4.7 vs GLM-4.5-Flash
Access and Self-Hosting
Where GLM-4.7-Flash Loses
Use Case Matrix
Final Recommendation
FAQ
About TokenMix
Sources
Related Articles

Quick Verdict

GLM-4.7-Flash is the cheapest credible 30B-class coding and agent model of early 2026: free on Zhipu, MIT-licensed for self-hosting, and benchmark-strong for its size. The catch is the 1-QPS free cap, text-only scope, and the need to verify its vendor-card numbers yourself.

Claim	Status	Source
Released January 19, 2026	Confirmed	Z.ai release notes
31B total / 3B active MoE ("30B-A3B")	Confirmed	HF model card
MIT license, open weights	Confirmed	HF model card
200K context window	Confirmed	BigModel docs
Text-only (not multimodal)	Confirmed	BigModel docs
SWE-bench Verified 59.2	Confirmed (vendor card)	HF model card
Free on Zhipu / BigModel	Confirmed	BigModel free model
OpenRouter price $0.06 in / $0.40 out per 1M	Confirmed	OpenRouter
Roughly 10x cheaper input than full GLM-4.7	Likely	llm-stats GLM-4.7
Runs locally on a single RTX 3090	Likely (secondary)	MarkTechPost
TokenMix serves GLM-4.7-Flash	False	TokenMix lists GLM-5.2, not GLM-4.7-Flash (TokenMix models)

The short answer: GLM-4.7-Flash is the model to reach for when you want a near-free, self-hostable coding agent and can live with 1-QPS on the free tier. For production throughput, OpenRouter or a managed GLM-5 route is the realistic path.

What GLM-4.7-Flash Is

GLM-4.7-Flash is the cheap, fast sibling of Zhipu's GLM-4.7 flagship, built for local coding and agent workloads rather than frontier benchmarks. Where the full GLM-4.7 is a 355B-class MoE released December 22, 2025, the Flash variant is a 31B/3B-active "Lite" MoE released January 19, 2026 and positioned as the strongest model in the 30B class (Z.ai release notes).

The model family timeline is the part buyers get wrong, so to be precise: GLM-4.6 (Sept 2025), then GLM-4.7 (Dec 2025), then GLM-4.7-Flash (Jan 2026), then GLM-5 (Feb 2026) and GLM-5.1/5.2 after. GLM-4.7-Flash is not a downgrade of GLM-5; it is the budget tier of the 4.7 generation, kept around because it is small enough to self-host. For the full flagship, see the GLM-4.7 review; for the broader lineup, the GLM models roundup.

Specifications

The spec that defines GLM-4.7-Flash is the 30B-A3B design: 31B total parameters, but only 3B active per token, which is what makes it cheap to run and self-hostable. Every figure is from the official model card or Zhipu docs.

Field	Value	Status
Release date	2026-01-19	Confirmed
Total parameters	31B	Confirmed
Active parameters	3B per token	Confirmed
Architecture	MoE Lite, MLA attention	Confirmed
Context window	200K (203K on OpenRouter)	Confirmed
Max output	128K (16,384 on OpenRouter)	Confirmed
Modalities	Text-only	Confirmed
License	MIT	Confirmed
Local speed	80+ tok/s on RTX 3090 / Apple Silicon	Likely (secondary)

A note on the "FlashX" variant: third parties reference a faster paid GLM-4.7-FlashX tier, but it is not on Zhipu's official release-notes page, so treat FlashX as Likely-but-unconfirmed. The 200K context is the official figure; OpenRouter exposes a slightly different 203K and caps completion at 16,384 tokens, which matters if you plan long generations.

Benchmarks

On the model card's own numbers, GLM-4.7-Flash is a standout in the 30B class, led by a 59.2 on SWE-bench Verified that roughly triples Qwen3-30B-A3B. These are vendor-published comparisons, so weight them as such until independent evals land.

Benchmark	GLM-4.7-Flash	Qwen3-30B-A3B	GPT-OSS-20B
SWE-bench Verified	59.2	22.0	34.0
tau-2 Bench (tool use)	79.5	49.0	47.7
BrowseComp (web agent)	42.8	2.29	28.3
AIME 25 (math)	91.6	85.0	91.7
GPQA (reasoning)	75.2	73.4	71.5
HLE (Humanity's Last Exam)	14.4	9.8	10.9
LiveCodeBench v6	64.0	66.0	61.0

Source for the table: the official GLM-4.7-Flash model card. The headline is agentic coding: a 59.2 SWE-bench Verified at 3B active parameters is the kind of result that makes a self-hostable model genuinely useful for coding agents. The one row where it slips is LiveCodeBench v6, where Qwen3-30B-A3B edges ahead at 66.0 to 64.0 — a useful reminder that "best in class" is not "best on every test."

Pricing

GLM-4.7-Flash is free on Zhipu's own platform and cheap on OpenRouter, and that price gap versus the full GLM-4.7 is the whole reason to care about it. The free tier requires real-name verification and is limited to roughly 1 request per second (BigModel free model docs).

Access	Input / 1M	Output / 1M	Note
Zhipu / BigModel (free tier)	$0.00	$0.00	Real-name verify, ~1 QPS
OpenRouter	$0.06	$0.40	Production throughput
Self-host (MIT weights)	$0 API	$0 API	Pay your own GPU
Full GLM-4.7 (for contrast)	~$0.60	~$2.20	~10x in, ~5.5x out

The comparison that matters: GLM-4.7-Flash on OpenRouter is roughly 10x cheaper on input and 5.5x cheaper on output than the full GLM-4.7 (llm-stats), and free entirely if you stay inside Zhipu's 1-QPS tier. For teams already managing free-tier limits across vendors, see the GLM free API access tiers guide.

Cost per Task

Modeling a real coding-agent workload shows how cheap this model is in production. Assume a moderate agent that consumes 50M input and 10M output tokens per month.

Path	Monthly input cost	Monthly output cost	Total
Zhipu free tier	$0.00	$0.00	$0.00 (1 QPS limit)
OpenRouter GLM-4.7-Flash	$3.00	$4.00	$7.00
Full GLM-4.7 (OpenRouter-class)	$30.00	$22.00	$52.00

A coding agent that would cost $52 a month on full GLM-4.7 runs about $7 on GLM-4.7-Flash, or $0 on Zhipu's free tier if you can work within 1 request per second. For batch or off-peak jobs the free tier is effectively a free coding model; for interactive multi-user agents the $7 OpenRouter path buys real concurrency. Either way, the per-task economics are far below frontier models — use the LLM API cost calculator to model your own mix.

GLM-4.7-Flash vs GLM-4.7 vs GLM-4.5-Flash

Against its own family, GLM-4.7-Flash trades raw capability for cost and deployability, sitting below the full GLM-4.7 and above the older GLM-4.5-Flash. The choice is mostly task complexity versus budget.

Model	Class	Strength	Pick when
GLM-4.7-Flash	31B/3B MoE	Cheap, self-hostable, strong SWE-bench	Coding agents on a budget
GLM-4.7 (full)	355B-class MoE	Higher ceiling on hard tasks	Complex work, quality first
GLM-4.5-Flash	prior-gen Flash	Older, still free-tier	Legacy compatibility
GLM-5.2 (newer)	1M-context flagship	Long-context agents, managed on TokenMix	Production, long horizons

The practical guidance: use GLM-4.7-Flash for routine coding and agent tasks where cost dominates, escalate to full GLM-4.7 for genuinely hard problems, and if you want a managed long-context option without self-hosting, GLM-5.2 is the newer family member available through a hosted relay. See the GLM-5.2 review for that comparison.

Access and Self-Hosting

Because the weights are MIT-licensed and the active footprint is only 3B, GLM-4.7-Flash is genuinely runnable on consumer hardware, which sets it apart from most strong coding models. There are four practical paths.

Path	What you get	Best for	Caveat
Zhipu / Z.ai API	OpenAI-style endpoint, free tier	Fastest test	~1 QPS free, real-name verify
OpenRouter	Hosted, pay-as-you-go	Production throughput	$0.06 / $0.40 per 1M
Hugging Face weights	MIT download, vLLM/SGLang	Self-host, research	Needs a capable GPU
GGUF quant	LM Studio / llama.cpp local	Laptop / single GPU	Quantization quality trade

from openai import OpenAI

client = OpenAI(
    api_key="your-ZAI-api-key",
    base_url="https://api.z.ai/api/paas/v4/"
)

resp = client.chat.completions.create(
    model="glm-4.7-flash",
    messages=[{"role": "user", "content": "Refactor this function and add tests."}],
)
print(resp.choices[0].message.content)

For local deployment, the bartowski GGUF build runs in LM Studio or llama.cpp, and the full weights serve under vLLM or SGLang for higher throughput. The OpenAI-compatible endpoint means existing tooling needs only a base-URL and model-name change — the standard AI API gateway pattern.

Where GLM-4.7-Flash Loses

GLM-4.7-Flash loses on free-tier throughput, modality, and benchmark independence. None of these are surprising for a free 30B model, but they shape where it fits.

Weak spot	Evidence	Pick instead
1-QPS free-tier cap	BigModel free-model limit	OpenRouter for concurrency
Text-only	No vision/audio	GLM-4.1V or a VLM for multimodal
Vendor-only benchmarks	Numbers from model card	Run your own SWE/agent eval
Below full GLM-4.7 ceiling	Smaller active params	Full GLM-4.7 for hard tasks
Self-host needs a real GPU	31B weights	Hosted API if no GPU
LiveCodeBench slightly trails Qwen3	64.0 vs 66.0	Qwen3-30B for that specific profile

The pattern: this is a budget and deployability play, not a frontier play. Where you need maximum quality, multimodality, or guaranteed throughput, a larger or managed model is the right call. Where you need a cheap, capable, self-hostable coding model, very little competes.

Use Case Matrix

Point GLM-4.7-Flash at cost-sensitive coding and agent work, and route harder or multimodal jobs elsewhere.

Use case	GLM-4.7-Flash fit	Better alternative	Why
Budget coding agent	Strong	none on cost	SWE-bench 59.2 at ~$7/mo
Self-hosted local coding	Strong	none on cost	MIT, runs on one GPU
Off-peak / batch automation	Strong	none on cost	Free tier covers it
Tool-use / web agents	Strong	full GLM-4.7 if quality first	tau-2 79.5, BrowseComp 42.8
High-concurrency production	Medium	OpenRouter or GLM-5.2	1-QPS free cap
Long-horizon, 1M context	Weak	GLM-5.2	200K vs 1M context
Multimodal (image/doc)	Weak	GLM-4.1V-Thinking	text-only
Maximum-quality hard tasks	Medium	full GLM-4.7 / frontier	smaller active params

If your real problem is routing across cheap and frontier models rather than picking one, pair this with cheapest LLM API and the GLM-5.2 review.

Final Recommendation

Use GLM-4.7-Flash as a near-free coding and agent model: the Zhipu free tier for batch and off-peak work within 1 QPS, OpenRouter at $0.06/$0.40 for production concurrency, and the MIT weights for self-hosting on a single capable GPU. Escalate to full GLM-4.7 for genuinely hard problems, move to GLM-5.2 when you need 1M context or a managed endpoint, and validate the model card's benchmark numbers on your own tasks before standardizing on it.

FAQ

Is GLM-4.7-Flash free?

Yes, on Zhipu's own platform. The BigModel free tier requires real-name verification and is limited to roughly 1 request per second. For higher throughput, OpenRouter charges $0.06 per 1M input and $0.40 per 1M output tokens.

Is GLM-4.7-Flash open source?

Yes. The weights are released under an MIT license on Hugging Face, so you can self-host and use it commercially. GGUF quantizations exist for local runners like LM Studio and llama.cpp.

How good is GLM-4.7-Flash at coding?

Strong for its size. Its model card reports 59.2 on SWE-bench Verified, roughly triple Qwen3-30B-A3B, plus 79.5 on tau-2 Bench for tool use. These are vendor-published numbers, so confirm them on your own workload.

What is GLM-4.7-Flash's context window?

200K tokens per Zhipu's documentation, shown as about 203K on OpenRouter. Maximum output is 128K on Zhipu, though OpenRouter caps completion at 16,384 tokens.

GLM-4.7-Flash vs GLM-4.7: what's the difference?

GLM-4.7-Flash is the 31B/3B-active budget tier; full GLM-4.7 is a 355B-class flagship with a higher quality ceiling. Flash is roughly 10x cheaper on input and self-hostable; the full model is the pick for genuinely hard tasks.

Can I run GLM-4.7-Flash locally?

Yes. With only 3B active parameters and MIT weights, secondary coverage reports it running on a single RTX 3090 or Apple Silicon at 80+ tokens per second, especially via GGUF quantization. A capable GPU is still required for the full weights.

Is GLM-4.7-Flash multimodal?

No. It is text-only. For image, document, or video understanding in the GLM family, use GLM-4.1V-Thinking or a newer GLM vision model.

Does TokenMix offer GLM-4.7-Flash?

No. TokenMix lists GLM-5.2 from the GLM family, not GLM-4.7-Flash. If you want a managed, long-context GLM through one endpoint, GLM-5.2 is the available option; for GLM-4.7-Flash specifically, use Zhipu, OpenRouter, or self-host.

About TokenMix

TokenMix.ai is an AI API relay that routes Claude, OpenAI, Gemini, DeepSeek, Qwen, and other large language models through a single OpenAI-compatible endpoint at https://api.tokenmix.ai/v1. Current model availability and per-token rates are listed on the pricing page and the model catalog. Integration uses the standard OpenAI SDK; details in the OpenAI compatibility reference. TokenMix serves the GLM-5 family rather than GLM-4.7-Flash, so this review is published as independent model intelligence.

Sources

Z.ai release notes - release date and positioning
Hugging Face - zai-org/GLM-4.7-Flash model card - specs, benchmarks, MIT license
BigModel - GLM-4.7-Flash free model docs - free tier, 200K/128K, text-only
OpenRouter - z-ai/glm-4.7-flash - paid pricing and context
llm-stats - GLM-4.7 - full GLM-4.7 pricing for comparison
MarkTechPost - Zhipu releases GLM-4.7-Flash - local-deploy coverage
Techloy - GLM-4.7-Flash for consumer hardware - secondary coverage
Hugging Face - bartowski GLM-4.7-Flash GGUF - GGUF quant for local use