TokenMix Research Lab · 2026-06-22

GLM-4.7-Flash Review 2026: Free 30B Coding Model, Benchmarks

Last Updated: 2026-06-22
Author: TokenMix Research Lab
Data verified: 2026-06-22 (Z.ai release notes, Hugging Face zai-org/GLM-4.7-Flash model card, BigModel free-model docs, OpenRouter pricing page, GGUF quant repo, secondary coverage from MarkTechPost and Techloy)

Zhipu released GLM-4.7-Flash on January 19, 2026. It is a 31B-parameter mixture-of-experts model that activates only 3B parameters per token, ships under an MIT license, and is free to call on Zhipu's own platform (Z.ai release notes). It posts 59.2 on SWE-bench Verified, strong for its size class, but it is text-only, the free tier is capped near 1 request per second, and TokenMix does not relay it, serving the newer GLM-5 family instead (TokenMix models).

This review separates Zhipu's documented specs and the model card's own benchmark table from third-party performance claims, and tags each item Confirmed, Likely, or False, noting where a figure is vendor-reported or comes from secondary coverage. Pricing is pulled from BigModel's free-model docs and OpenRouter; benchmark numbers come from the official Hugging Face model card, which compares GLM-4.7-Flash against Qwen3-30B-A3B and GPT-OSS-20B.

Quick Verdict

GLM-4.7-Flash is the cheapest credible 30B-class coding and agent model of early 2026: free on Zhipu, MIT-licensed for self-hosting, and benchmark-strong for its size. The catches are the 1-QPS free cap, the text-only scope, and the need to verify its vendor-card numbers yourself.

| Claim | Status | Source |
|---|---|---|
| Released January 19, 2026 | Confirmed | Z.ai release notes |
| 31B total / 3B active MoE ("30B-A3B") | Confirmed | HF model card |
| MIT license, open weights | Confirmed | HF model card |
| 200K context window | Confirmed | BigModel docs |
| Text-only (not multimodal) | Confirmed | BigModel docs |
| SWE-bench Verified 59.2 | Confirmed (vendor card) | HF model card |
| Free on Zhipu / BigModel | Confirmed | BigModel free model |
| OpenRouter price $0.06 in / $0.40 out per 1M | Confirmed | OpenRouter |
| Roughly 10x cheaper input than full GLM-4.7 | Likely | llm-stats GLM-4.7 |
| Runs locally on a single RTX 3090 | Likely (secondary) | MarkTechPost |
| TokenMix serves GLM-4.7-Flash | False | TokenMix lists GLM-5.2, not GLM-4.7-Flash (TokenMix models) |

The short answer: GLM-4.7-Flash is the model to reach for when you want a near-free, self-hostable coding agent and can live with 1-QPS on the free tier. For production throughput, OpenRouter or a managed GLM-5 route is the realistic path.

What GLM-4.7-Flash Is

GLM-4.7-Flash is the cheap, fast sibling of Zhipu's GLM-4.7 flagship, built for local coding and agent workloads rather than frontier benchmarks. Where the full GLM-4.7 is a 355B-class MoE released December 22, 2025, the Flash variant is a 31B/3B-active "Lite" MoE released January 19, 2026 and positioned as the strongest model in the 30B class (Z.ai release notes).

The model family timeline is the part buyers get wrong, so to be precise: GLM-4.6 (Sept 2025), then GLM-4.7 (Dec 2025), then GLM-4.7-Flash (Jan 2026), then GLM-5 (Feb 2026) and GLM-5.1/5.2 after. GLM-4.7-Flash is not a downgrade of GLM-5; it is the budget tier of the 4.7 generation, kept around because it is small enough to self-host. For the full flagship, see the GLM-4.7 review; for the broader lineup, the GLM models roundup.

Specifications

The spec that defines GLM-4.7-Flash is the 30B-A3B design: 31B total parameters, but only 3B active per token, which is what makes it cheap to run and self-hostable. Every figure except the local-speed row is from the official model card or Zhipu docs; local speed comes from secondary coverage.

| Field | Value | Status |
|---|---|---|
| Release date | 2026-01-19 | Confirmed |
| Total parameters | 31B | Confirmed |
| Active parameters | 3B per token | Confirmed |
| Architecture | MoE Lite, MLA attention | Confirmed |
| Context window | 200K (203K on OpenRouter) | Confirmed |
| Max output | 128K (16,384 on OpenRouter) | Confirmed |
| Modalities | Text-only | Confirmed |
| License | MIT | Confirmed |
| Local speed | 80+ tok/s on RTX 3090 / Apple Silicon | Likely (secondary) |

A note on the "FlashX" variant: third parties reference a faster paid GLM-4.7-FlashX tier, but it is not on Zhipu's official release-notes page, so treat FlashX as Likely-but-unconfirmed. The 200K context is the official figure; OpenRouter exposes a slightly different 203K and caps completion at 16,384 tokens, which matters if you plan long generations.
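To make the 31B total / 3B active split concrete, here is a back-of-envelope estimate of weight memory at a few precisions. It is a rough sketch, not a deployment guide: it assumes every expert has to be resident in memory (so total parameters, not active ones, set the floor), and it ignores KV cache and runtime overhead. The function name and the precision list are illustrative choices, not anything from the model card.

```python
# Back-of-envelope weight memory for a 31B-parameter MoE.
# Assumption: all experts stay resident, so TOTAL parameters set the memory
# floor; KV cache, activations, and runtime overhead are ignored.

TOTAL_PARAMS = 31e9   # from the spec table
ACTIVE_PARAMS = 3e9   # from the spec table


def weight_gb(params: float, bits_per_param: float) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes)."""
    return params * bits_per_param / 8 / 1e9


for label, bits in [("BF16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>5}: ~{weight_gb(TOTAL_PARAMS, bits):.1f} GB of weights")

# Share of parameters that do work on any single token.
print(f"Active share per token: {ACTIVE_PARAMS / TOTAL_PARAMS:.1%}")
```

Under these assumptions the weights come to roughly 62 GB at BF16 and about 15.5 GB at 4 bits, which is why a quantized build is the plausible route onto a 24 GB card such as the RTX 3090. Real quantization formats typically spend somewhat more than 4 bits per weight, so treat the last figure as a lower bound.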

Benchmarks

On the model card's own numbers, GLM-4.7-Flash is a standout in the 30B class, led by a 59.2 on SWE-bench Verified that is about 2.7x Qwen3-30B-A3B's 22.0. These are vendor-published comparisons, so weight them as such until independent evals land.

| Benchmark | GLM-4.7-Flash | Qwen3-30B-A3B | GPT-OSS-20B |
|---|---|---|---|
| SWE-bench Verified | 59.2 | 22.0 | 34.0 |
| tau-2 Bench (tool use) | 79.5 | 49.0 | 47.7 |
| BrowseComp (web agent) | 42.8 | 2.29 | 28.3 |
| AIME 25 (math) | 91.6 | 85.0 | 91.7 |
| GPQA (reasoning) | 75.2 | 73.4 | 71.5 |
| HLE (Humanity's Last Exam) | 14.4 | 9.8 | 10.9 |
| LiveCodeBench v6 | 64.0 | 66.0 | 61.0 |

Source for the table: the official GLM-4.7-Flash model card. The headline is agentic coding: a 59.2 SWE-bench Verified at 3B active parameters is the kind of result that makes a self-hostable model genuinely useful for coding agents. It slips on two rows: LiveCodeBench v6, where Qwen3-30B-A3B edges ahead at 66.0 to 64.0, and AIME 25, where GPT-OSS-20B leads by a tenth of a point at 91.7 to 91.6. Both are a useful reminder that "best in class" is not "best on every test."
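The benchmark table is small enough to check by hand, but a few lines of Python make the margins explicit. The sketch below copies the vendor-reported scores from the table, computes the SWE-bench ratio against Qwen3-30B-A3B, and lists any row where another model scores higher. The helper name `trailing_rows` is illustrative.

```python
# Scores copied from the model-card table above (vendor-reported).
SCORES = {
    "SWE-bench Verified": {"GLM-4.7-Flash": 59.2, "Qwen3-30B-A3B": 22.0, "GPT-OSS-20B": 34.0},
    "tau-2 Bench":        {"GLM-4.7-Flash": 79.5, "Qwen3-30B-A3B": 49.0, "GPT-OSS-20B": 47.7},
    "BrowseComp":         {"GLM-4.7-Flash": 42.8, "Qwen3-30B-A3B": 2.29, "GPT-OSS-20B": 28.3},
    "AIME 25":            {"GLM-4.7-Flash": 91.6, "Qwen3-30B-A3B": 85.0, "GPT-OSS-20B": 91.7},
    "GPQA":               {"GLM-4.7-Flash": 75.2, "Qwen3-30B-A3B": 73.4, "GPT-OSS-20B": 71.5},
    "HLE":                {"GLM-4.7-Flash": 14.4, "Qwen3-30B-A3B": 9.8,  "GPT-OSS-20B": 10.9},
    "LiveCodeBench v6":   {"GLM-4.7-Flash": 64.0, "Qwen3-30B-A3B": 66.0, "GPT-OSS-20B": 61.0},
}


def trailing_rows(scores, model="GLM-4.7-Flash"):
    """Return {benchmark: (leader, leader_score)} for rows `model` does not top."""
    out = {}
    for bench, row in scores.items():
        leader = max(row, key=row.get)
        if leader != model:
            out[bench] = (leader, row[leader])
    return out


swe = SCORES["SWE-bench Verified"]
swe_ratio = swe["GLM-4.7-Flash"] / swe["Qwen3-30B-A3B"]
print(f"SWE-bench Verified ratio vs Qwen3-30B-A3B: {swe_ratio:.2f}x")
print("Rows led by another model:", trailing_rows(SCORES))
```

On these numbers the SWE-bench ratio is about 2.69x, and two rows are led by another model: AIME 25 (GPT-OSS-20B, 91.7) and LiveCodeBench v6 (Qwen3-30B-A3B, 66.0).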

Pricing

GLM-4.7-Flash is free on Zhipu's own platform and cheap on OpenRouter, and that price gap versus the full GLM-4.7 is the whole reason to care about it. The free tier requires real-name verification and is limited to roughly 1 request per second (BigModel free model docs).

| Access | Input / 1M tokens | Output / 1M tokens | Note |
|---|---|---|---|
| Zhipu / BigModel (free tier) | $0.00 | $0.00 | Real-name verify, ~1 QPS |
| OpenRouter | $0.06 | $0.40 | Production throughput |
| Self-host (MIT weights) | $0 API | $0 API | Pay your own GPU |
| Full GLM-4.7 (for contrast) | ~$0.60 | ~$2.20 | ~10x in, ~5.5x out vs Flash on OpenRouter |

The comparison that matters: GLM-4.7-Flash on OpenRouter is roughly 10x cheaper on input and 5.5x cheaper on output than the full GLM-4.7 (llm-stats), and free entirely if you stay inside Zhipu's 1-QPS tier. For teams already managing free-tier limits across vendors, see the GLM free API access tiers guide.
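Staying inside a roughly 1-request-per-second cap is mostly a client-side pacing problem. The sketch below is one minimal way to space calls out; the `Throttle` class and `max_requests_per_day` helper are hypothetical names, and the exact way Zhipu measures or enforces the limit is an assumption here, so leave some headroom rather than pacing at exactly 1.0 s.

```python
import time


class Throttle:
    """Space calls at least `min_interval` seconds apart (1.0 s is about 1 QPS)."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # injectable so the pacing logic can be tested
        self._sleep = sleep
        self._next_allowed = None

    def wait(self):
        """Block until the next call is allowed; return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._next_allowed is not None:
            delay = max(self._next_allowed - now, 0.0)
        if delay > 0:
            self._sleep(delay)
            now = self._clock()
        self._next_allowed = now + self.min_interval
        return delay


def max_requests_per_day(qps=1.0):
    """Upper bound on requests in 24 hours at a steady rate."""
    return int(qps * 86_400)


print(max_requests_per_day())  # ceiling for a batch job paced at 1 QPS
```

Calling `throttle.wait()` immediately before each API request is enough for a single-process batch job. At a steady 1 QPS the ceiling is 86,400 requests per day, which is a quick way to judge whether a batch workload fits the free tier at all.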

Cost per Task

Modeling a real coding-agent workload shows how cheap this model is in production. Assume a moderate agent that consumes 50M input and 10M output tokens per month.

| Path | Monthly input cost | Monthly output cost | Total |
|---|---|---|---|
| Zhipu free tier | $0.00 | $0.00 | $0.00 (1 QPS limit) |
| OpenRouter GLM-4.7-Flash | $3.00 | $4.00 | $7.00 |
| Full GLM-4.7 (OpenRouter-class) | $30.00 | $22.00 | $52.00 |

A coding agent that would cost $52 a month on full GLM-4.7 runs about $7 on GLM-4.7-Flash, or $0 on Zhipu's free tier if you can work within 1 request per second. For batch or off-peak jobs the free tier is effectively a free coding model; for interactive multi-user agents the $7 OpenRouter path buys real concurrency. Either way, the per-task economics are far below frontier models. Use the LLM API cost calculator to model your own mix.
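The monthly figures above are simple per-million-token arithmetic, and it is easy to rerun them for a different token mix. The sketch below uses the prices from the pricing table and the 50M input / 10M output workload assumed in this section; the function and variable names are illustrative.

```python
# USD per 1M tokens (input, output), taken from the pricing table.
PRICES = {
    "Zhipu free tier": (0.00, 0.00),
    "OpenRouter GLM-4.7-Flash": (0.06, 0.40),
    "Full GLM-4.7": (0.60, 2.20),   # approximate figures
}


def monthly_cost(input_mtok, output_mtok, price_in, price_out):
    """Monthly cost in USD for token volumes given in millions."""
    return round(input_mtok * price_in + output_mtok * price_out, 2)


# Workload assumed in this section: 50M input and 10M output tokens per month.
for path, (p_in, p_out) in PRICES.items():
    print(f"{path}: ${monthly_cost(50, 10, p_in, p_out):.2f}")

flash_in, flash_out = PRICES["OpenRouter GLM-4.7-Flash"]
full_in, full_out = PRICES["Full GLM-4.7"]
print(f"Input ratio:  {round(full_in / flash_in, 1)}x")
print(f"Output ratio: {round(full_out / flash_out, 1)}x")
```

Changing the two volume arguments is all it takes to model a heavier or more output-skewed agent; output-heavy workloads narrow the gap, since the output price ratio (about 5.5x) is smaller than the input ratio (about 10x).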

GLM-4.7-Flash vs GLM-4.7 vs GLM-4.5-Flash

Against its own family, GLM-4.7-Flash trades raw capability for cost and deployability, sitting below the full GLM-4.7 and above the older GLM-4.5-Flash. The choice is mostly task complexity versus budget.

| Model | Class | Strength | Pick when |
|---|---|---|---|
| GLM-4.7-Flash | 31B/3B MoE | Cheap, self-hostable, strong SWE-bench | Coding agents on a budget |
| GLM-4.7 (full) | 355B-class MoE | Higher ceiling on hard tasks | Complex work, quality first |
| GLM-4.5-Flash | prior-gen Flash | Older, still free-tier | Legacy compatibility |
| GLM-5.2 (newer) | 1M-context flagship | Long-context agents, managed on TokenMix | Production, long horizons |

The practical guidance: use GLM-4.7-Flash for routine coding and agent tasks where cost dominates, escalate to full GLM-4.7 for genuinely hard problems, and if you want a managed long-context option without self-hosting, GLM-5.2 is the newer family member available through a hosted relay. See the GLM-5.2 review for that comparison.

Access and Self-Hosting

Because the weights are MIT-licensed and only 3B parameters are active per token, GLM-4.7-Flash is runnable on consumer hardware, which sets it apart from most strong coding models. The full 31B weights still have to fit in memory, so local setups typically rely on a quantized build. There are four practical paths.

| Path | What you get | Best for | Caveat |
|---|---|---|---|
| Zhipu / Z.ai API | OpenAI-style endpoint, free tier | Fastest test | ~1 QPS free, real-name verify |
| OpenRouter | Hosted, pay-as-you-go | Production throughput | $0.06 / $0.40 per 1M |
| Hugging Face weights | MIT download, vLLM/SGLang | Self-host, research | Needs a capable GPU |
| GGUF quant | LM Studio / llama.cpp local | Laptop / single GPU | Quantization quality trade |
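
```python
from openai import OpenAI

# Point the standard OpenAI SDK at Zhipu's OpenAI-compatible endpoint.
client = OpenAI(
    api_key="your-ZAI-api-key",  # placeholder; substitute your own key
    base_url="https://api.z.ai/api/paas/v4/",
)

# Single-turn chat completion against GLM-4.7-Flash.
resp = client.chat.completions.create(
    model="glm-4.7-flash",
    messages=[{"role": "user", "content": "Refactor this function and add tests."}],
)
print(resp.choices[0].message.content)
```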

For local deployment, the bartowski GGUF build runs in LM Studio or llama.cpp, and the full weights serve under vLLM or SGLang for higher throughput. The OpenAI-compatible endpoint means existing tooling needs only a base-URL and model-name change, which is the standard AI API gateway pattern.
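The "change the base URL and model name" pattern can be kept in one small table so that switching providers does not touch call sites. The sketch below is a hedged illustration: the Zhipu values come from the snippet above, the OpenRouter base URL is its standard OpenAI-compatible endpoint, but the OpenRouter model slug `z-ai/glm-4.7-flash` and the reading of "128K" as 128,000 tokens are assumptions to verify against each provider's docs. `Endpoint` and `request_kwargs` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    base_url: str
    model: str
    max_output_tokens: int   # provider-side completion cap


ENDPOINTS = {
    # Values from the Zhipu snippet above; 128K taken as 128,000 (assumption).
    "zai": Endpoint("https://api.z.ai/api/paas/v4/", "glm-4.7-flash", 128_000),
    # Model slug is an assumption; check OpenRouter's model page before use.
    "openrouter": Endpoint("https://openrouter.ai/api/v1", "z-ai/glm-4.7-flash", 16_384),
}


def request_kwargs(provider, prompt, max_tokens):
    """Build chat-completion arguments, clamping max_tokens to the provider cap."""
    ep = ENDPOINTS[provider]
    return {
        "model": ep.model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": min(max_tokens, ep.max_output_tokens),
    }


# A 50,000-token request is clamped on OpenRouter but passes through on Zhipu.
print(request_kwargs("openrouter", "Summarize this diff.", 50_000)["max_tokens"])
print(request_kwargs("zai", "Summarize this diff.", 50_000)["max_tokens"])
```

With the OpenAI SDK, the result would be used as `OpenAI(api_key=..., base_url=ENDPOINTS[name].base_url).chat.completions.create(**request_kwargs(name, prompt, n))`. Clamping on the client side makes the 16,384-token OpenRouter completion cap visible in code rather than surfacing later as a truncated response.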

Where GLM-4.7-Flash Loses

GLM-4.7-Flash loses on free-tier throughput, modality, and benchmark independence. None of these are surprising for a free 30B model, but they shape where it fits.

| Weak spot | Evidence | Pick instead |
|---|---|---|
| 1-QPS free-tier cap | BigModel free-model limit | OpenRouter for concurrency |
| Text-only | No vision/audio | GLM-4.1V or a VLM for multimodal |
| Vendor-only benchmarks | Numbers from model card | Run your own SWE/agent eval |
| Below full GLM-4.7 ceiling | Smaller active params | Full GLM-4.7 for hard tasks |
| Self-host needs a real GPU | 31B weights | Hosted API if no GPU |
| LiveCodeBench slightly trails Qwen3 | 64.0 vs 66.0 | Qwen3-30B for that specific profile |

The pattern: this is a budget and deployability play, not a frontier play. Where you need maximum quality, multimodality, or guaranteed throughput, a larger or managed model is the right call. Where you need a cheap, capable, self-hostable coding model, very little competes.

Use Case Matrix

Point GLM-4.7-Flash at cost-sensitive coding and agent work, and route harder or multimodal jobs elsewhere.

| Use case | GLM-4.7-Flash fit | Better alternative | Why |
|---|---|---|---|
| Budget coding agent | Strong | none on cost | SWE-bench 59.2 at ~$7/mo |
| Self-hosted local coding | Strong | none on cost | MIT, runs on one GPU |
| Off-peak / batch automation | Strong | none on cost | Free tier covers it |
| Tool-use / web agents | Strong | full GLM-4.7 if quality first | tau-2 79.5, BrowseComp 42.8 |
| High-concurrency production | Medium | OpenRouter or GLM-5.2 | 1-QPS free cap |
| Long-horizon, 1M context | Weak | GLM-5.2 | 200K vs 1M context |
| Multimodal (image/doc) | Weak | GLM-4.1V-Thinking | text-only |
| Maximum-quality hard tasks | Medium | full GLM-4.7 / frontier | smaller active params |

If your real problem is routing across cheap and frontier models rather than picking one, pair this with cheapest LLM API and the GLM-5.2 review.
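For readers who do want to route rather than pick one model, the use case matrix reduces to a handful of ordered checks. The sketch below is one illustrative encoding of it, not a recommended policy: the `Task` fields, the rule order, and the use of "more than one concurrent user" as a stand-in for exceeding the 1-QPS free cap are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Task:
    context_tokens: int = 8_000
    needs_vision: bool = False
    hard: bool = False            # quality-first, genuinely difficult work
    concurrent_users: int = 1     # rough proxy for exceeding ~1 QPS
    can_self_host: bool = False


def pick_route(task: Task) -> str:
    """Map a task to a route, following the use case matrix top to bottom."""
    if task.needs_vision:
        return "GLM-4.1V-Thinking"              # GLM-4.7-Flash is text-only
    if task.context_tokens > 200_000:
        return "GLM-5.2"                        # beyond the 200K window
    if task.hard:
        return "GLM-4.7 (full)"
    if task.can_self_host:
        return "GLM-4.7-Flash (self-hosted)"
    if task.concurrent_users > 1:
        return "GLM-4.7-Flash via OpenRouter"   # free tier is ~1 QPS
    return "GLM-4.7-Flash via Zhipu free tier"


print(pick_route(Task()))
print(pick_route(Task(context_tokens=600_000)))
print(pick_route(Task(concurrent_users=20)))
```

The ordering matters: hard constraints (modality, context length) are checked before cost preferences, so a cheap route is never chosen for a task the model cannot serve.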

Final Recommendation

Use GLM-4.7-Flash as a near-free coding and agent model: the Zhipu free tier for batch and off-peak work within 1 QPS, OpenRouter at $0.06/$0.40 for production concurrency, and the MIT weights for self-hosting on a single capable GPU. Escalate to full GLM-4.7 for genuinely hard problems, move to GLM-5.2 when you need 1M context or a managed endpoint, and validate the model card's benchmark numbers on your own tasks before standardizing on it.

FAQ

Is GLM-4.7-Flash free?

Yes, on Zhipu's own platform. The BigModel free tier requires real-name verification and is limited to roughly 1 request per second. For higher throughput, OpenRouter charges $0.06 per 1M input and $0.40 per 1M output tokens.

Is GLM-4.7-Flash open source?

Yes. The weights are released under an MIT license on Hugging Face, so you can self-host and use it commercially. GGUF quantizations exist for local runners like LM Studio and llama.cpp.

How good is GLM-4.7-Flash at coding?

Strong for its size. Its model card reports 59.2 on SWE-bench Verified, about 2.7x Qwen3-30B-A3B's 22.0, plus 79.5 on tau-2 Bench for tool use. These are vendor-published numbers, so confirm them on your own workload.

What is GLM-4.7-Flash's context window?

200K tokens per Zhipu's documentation, shown as about 203K on OpenRouter. Maximum output is 128K on Zhipu, though OpenRouter caps completion at 16,384 tokens.

GLM-4.7-Flash vs GLM-4.7: what's the difference?

GLM-4.7-Flash is the 31B/3B-active budget tier; full GLM-4.7 is a 355B-class flagship with a higher quality ceiling. Flash is roughly 10x cheaper on input and self-hostable; the full model is the pick for genuinely hard tasks.

Can I run GLM-4.7-Flash locally?

Yes. With only 3B active parameters and MIT weights, secondary coverage reports it running on a single RTX 3090 or Apple Silicon at 80+ tokens per second, especially via GGUF quantization. A capable GPU is still required for the full weights.

Is GLM-4.7-Flash multimodal?

No. It is text-only. For image, document, or video understanding in the GLM family, use GLM-4.1V-Thinking or a newer GLM vision model.

Does TokenMix offer GLM-4.7-Flash?

No. TokenMix lists GLM-5.2 from the GLM family, not GLM-4.7-Flash. If you want a managed, long-context GLM through one endpoint, GLM-5.2 is the available option; for GLM-4.7-Flash specifically, use Zhipu, OpenRouter, or self-host.

About TokenMix

TokenMix.ai is an AI API relay that routes Claude, OpenAI, Gemini, DeepSeek, Qwen, and other large language models through a single OpenAI-compatible endpoint at https://api.tokenmix.ai/v1. Current model availability and per-token rates are listed on the pricing page and the model catalog. Integration uses the standard OpenAI SDK; details in the OpenAI compatibility reference. TokenMix serves the GLM-5 family rather than GLM-4.7-Flash, so this review is published as independent model intelligence.
