TokenMix Research Lab · 2026-07-03

Same 20 Text-to-SQL Tasks, 12 Models: the $0.19/M Model Went 20/20, the Bill Spread Was 299x

I kept seeing "just use the big model for SQL" advice, so we measured it.

Setup: one SQLite database (4 tables, ~350 orders, deterministic seed), 20 questions from "count the orders" up to median-by-group, window functions, and market-basket pairs. Every model gets the exact same prompt, temperature 0, one attempt, no retries. We execute whatever SQL comes back against the database and compare result sets with the reference answer. Wrong columns, wrong rows, SQL error = fail. All 240 calls went through the same OpenAI-compatible gateway on the same day, so the costs below are actual billed amounts, not price-sheet math.

Results (sorted by what we actually paid for all 20 tasks)

Model	List $/M in/out	Score	Total bill	Median latency
grok-4.1-fast-reasoning	0.19/0.48	20/20	$0.0011	17.9s
deepseek-v4-flash	0.13/0.26	20/20	$0.0033	7.1s
gpt-5.4-mini	0.73/4.37	19/20*	$0.0102	2.4s
deepseek-v4-pro	0.42/0.83	20/20	$0.0114	8.3s
qwen3.5-flash	0.03/0.26	18/20	$0.0157	20.3s
qwen3.7-plus	0.29/1.17	20/20	$0.0170	15.8s
claude-sonnet-5	1.96/9.80	19/20*	$0.0222	5.3s
kimi-k2.7-code	0.91/3.77	19/20	$0.0939	6.2s
claude-opus-4.8	5.00/25.00	19/20*	$0.1119	4.9s
glm-5.2	1.12/3.91	20/20	$0.1640	20.9s
gpt-5.5	5.00/30.00	20/20	$0.1685	6.0s
gemini-3.1-pro	1.94/11.64	19/20	$0.3293	13.4s

* The three asterisked models all found the correct answer on the market-basket question but violated the output spec (wrong column order, or an extra count column). Graded strictly they lose the point; graded leniently they are 20/20. Full per-call data is public so you can re-grade however you like.

What actually surprised us

The two cheapest models in the lineup both went a strict 20/20. On this task there is nothing to buy at the top end: the bill spread across models was 299x, and accuracy did not move with price.
List price is not cost. glm-5.2 looks 43% cheaper than claude-sonnet-5 on the price sheet, but it wrote 28x more output tokens for the same questions, so its actual bill was 7x higher.
Reasoning tokens are where budget models sneak up on you. qwen3.5-flash has the lowest list price of all 12 but burned 64k thinking tokens across 20 questions and ended up mid-table on cost — and it is the only model that made typo-level errors (misspelled julianday as juliandate).
The only two hard SQL failures were the same bug. gemini-3.1-pro and kimi-k2.7-code both misused LAG() in a way SQLite rejects. A coding-specialized model and the most expensive model in the table, same mistake.
Fastest correct answers came from gpt-5.4-mini (2.4s median). For interactive text-to-SQL, latency may matter more than the last point of accuracy.

Reproduce it

Full methodology, all 20 questions with reference SQL, and grading rules: methodology page.

Every script, the exact database, and all 240 raw responses (extracted SQL, token usage, latency, pass/fail): github.com/TokenMixAi/text2sql-bench. Point bench_run.py at any OpenAI-compatible endpoint and re-run it.

All calls were made 2026-07-02 through the TokenMix gateway; bills are from our billing ledger for those exact requests. The benchmark is small and SQLite-specific — read the limitations section of the methodology before quoting.