TokenMix Research Lab · 2026-07-03

Published 2026-07-02. All 240 calls made on this date through api.tokenmix.ai. Companion post with the results table: https://tokenmix.ai/blog/text-to-sql-benchmark-12-models.
Why
Answer one question with receipts: for SQL generation, what does paying 10-100x more per token actually buy?
Database
SQLite, 4 tables (customers 60 / products 20 / orders 350 / order_items ~870), generated with a fixed seed (42) so anyone can rebuild it byte-for-byte. The generator script is in the repo linked below. 8 customers never place an order; order statuses are completed/cancelled/refunded (75/15/10); order dates span 2025.
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL, country TEXT NOT NULL, created_at TEXT NOT NULL);
CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT NOT NULL, category TEXT NOT NULL, price REAL NOT NULL);
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER NOT NULL REFERENCES customers(id), status TEXT NOT NULL, created_at TEXT NOT NULL);
CREATE TABLE order_items (order_id INTEGER NOT NULL REFERENCES orders(id), product_id INTEGER NOT NULL REFERENCES products(id), quantity INTEGER NOT NULL, unit_price REAL NOT NULL);
Protocol
- Identical prompt for every model: schema DDL + question + "Return only the SQL query, with no explanation and no markdown fences."
- temperature 0, streaming, one attempt per model per question. No retries on wrong answers. No system prompt.
- 12 models spanning list prices from $0.03/M to $5/M input, all called through the same OpenAI-compatible endpoint, 6-way concurrency.
- One extra warm-up call (deepseek-v4-flash, q01) is included in that model's bill; it adds about $0.00003.
Grading
Returned text -> strip markdown fences -> take the first SELECT/WITH statement -> execute read-only against the database -> normalize rows (floats rounded to 2 decimals, strings trimmed, row order ignored) -> exact match against the reference result. SQL error or wrong result set = 0 points, no partial credit.
Strict vs lenient. On q15, three models (claude-opus-4.8, claude-sonnet-5, gpt-5.4-mini) identified the correct product pair but violated the stated output spec — wrong column order or an extra count column. Strict scoring (the headline numbers) counts these as fails; lenient scoring treats them as passes and moves all three to 20/20. Every raw response is in the data dump, so you can re-grade with your own rules.
Costs
"Total bill" is the actual amount charged by our gateway's billing ledger for these calls, summed per model — not price-sheet arithmetic. Prices as of 2026-07-02. Provider-side accounting of reasoning/thinking tokens varies by model family and is reflected in the bills as charged on that date.
Limitations (read before quoting)
- 20 questions, one schema, SQLite dialect only. This measures analytics-style text-to-SQL, nothing else.
- Single attempt at temperature 0, no self-correction loop. Agentic setups with retries would look different.
- Questions specify output format precisely; a model that ignores the spec loses the point even when its logic is right (see strict vs lenient).
- 12 models, chosen to span the price tiers available on one gateway.
The 20 questions
q01
Question: How many orders are there in total? Return a single number.
Reference SQL:
SELECT COUNT(*) FROM orders
q02
Question: Which 3 countries have the most customers? Return only the country codes, ordered by customer count descending, ties broken alphabetically by country code.
Reference SQL:
SELECT country FROM customers GROUP BY country ORDER BY COUNT(*) DESC, country ASC LIMIT 3
q03
Question: What is the total revenue from completed orders? Revenue is quantity * unit_price summed over order items. Return a single number rounded to 2 decimals.
Reference SQL:
SELECT ROUND(SUM(oi.quantity*oi.unit_price),2) FROM order_items oi JOIN orders o ON o.id=oi.order_id WHERE o.status='completed'
q04
Question: Which product category generated the highest revenue across completed orders? Return only the category name.
Reference SQL:
SELECT p.category FROM order_items oi JOIN orders o ON o.id=oi.order_id JOIN products p ON p.id=oi.product_id WHERE o.status='completed' GROUP BY p.category ORDER BY SUM(oi.quantity*oi.unit_price) DESC LIMIT 1
q05
Question: Which customer has the highest total spend on completed orders? Return only the customer's name.
Reference SQL:
SELECT c.name FROM customers c JOIN orders o ON o.customer_id=c.id JOIN order_items oi ON oi.order_id=o.id WHERE o.status='completed' GROUP BY c.id ORDER BY SUM(oi.quantity*oi.unit_price) DESC LIMIT 1
q06
Question: Which calendar month of 2025 had the most orders (any status)? Return the month in YYYY-MM format. If tied, return the earliest such month.
Reference SQL:
SELECT strftime('%Y-%m',created_at) m FROM orders GROUP BY m ORDER BY COUNT(*) DESC, m ASC LIMIT 1
q07
Question: What is the average order value of completed orders? Order value is the sum of quantity * unit_price of the order's items. Return a single number rounded to 2 decimals.
Reference SQL:
SELECT ROUND(AVG(v),2) FROM (SELECT SUM(oi.quantity*oi.unit_price) v FROM orders o JOIN order_items oi ON oi.order_id=o.id WHERE o.status='completed' GROUP BY o.id)
q08
Question: How many customers have never placed any order? Return a single number.
Reference SQL:
SELECT COUNT(*) FROM customers c WHERE NOT EXISTS (SELECT 1 FROM orders o WHERE o.customer_id=c.id)
q09
Question: For each product category, find the best-selling product by total quantity sold across all orders. Return rows of (category, product name), ties broken alphabetically by product name.
Reference SQL:
SELECT category, name FROM (SELECT p.category category, p.name name, ROW_NUMBER() OVER (PARTITION BY p.category ORDER BY SUM(oi.quantity) DESC, p.name ASC) rn FROM products p JOIN order_items oi ON oi.product_id=p.id GROUP BY p.id) WHERE rn=1
q10
Question: Among customers who placed at least one completed order, what share placed two or more completed orders? Return a single number between 0 and 1 rounded to 2 decimals.
Reference SQL:
SELECT ROUND(AVG(CASE WHEN n>=2 THEN 1.0 ELSE 0.0 END),2) FROM (SELECT customer_id, COUNT(*) n FROM orders WHERE status='completed' GROUP BY customer_id)
q11
Question: What is the median order value of completed orders? Order value is the sum of quantity * unit_price of the order's items. Use the standard median (average of the two middle values if the count is even). Return a single number rounded to 2 decimals.
Reference SQL:
WITH v AS (SELECT SUM(oi.quantity*oi.unit_price) val FROM orders o JOIN order_items oi ON oi.order_id=o.id WHERE o.status='completed' GROUP BY o.id),
r AS (SELECT val, ROW_NUMBER() OVER (ORDER BY val) rn, COUNT(*) OVER () n FROM v)
SELECT ROUND(AVG(val),2) FROM r WHERE rn IN ((n+1)/2, (n+2)/2)
q12
Question: For each country that has at least one completed order, find the customer with the highest total completed-order spend. Return rows of (country, customer name), ties broken alphabetically by customer name.
Reference SQL:
SELECT country, name FROM (
SELECT c.country country, c.name name, ROW_NUMBER() OVER (PARTITION BY c.country ORDER BY SUM(oi.quantity*oi.unit_price) DESC, c.name ASC) rn
FROM customers c JOIN orders o ON o.customer_id=c.id JOIN order_items oi ON oi.order_id=o.id
WHERE o.status='completed' GROUP BY c.id) WHERE rn=1
q13
Question: Comparing consecutive calendar months of 2025, which month had the largest absolute increase in completed-order revenue versus the immediately preceding calendar month? Only consider February through December. Return the month in YYYY-MM format.
Reference SQL:
WITH mrev AS (SELECT strftime('%Y-%m',o.created_at) m, SUM(oi.quantity*oi.unit_price) rev
FROM orders o JOIN order_items oi ON oi.order_id=o.id WHERE o.status='completed' GROUP BY m)
SELECT m FROM (SELECT m, rev - LAG(rev) OVER (ORDER BY m) d FROM mrev) WHERE d IS NOT NULL ORDER BY d DESC, m ASC LIMIT 1
q14
Question: How many customers have placed at least one order and have ALL of their orders in 'completed' status? Return a single number.
Reference SQL:
SELECT COUNT(*) FROM (SELECT customer_id FROM orders GROUP BY customer_id HAVING SUM(CASE WHEN status<>'completed' THEN 1 ELSE 0 END)=0)
q15
Question: Which pair of distinct products appears together in the same order most frequently (counting each order at most once per pair, any order status)? Return one row with the two product names, the alphabetically smaller name first. Break ties by the alphabetically smallest pair.
Reference SQL:
WITH pairs AS (
SELECT DISTINCT a.order_id, MIN(pa.name,pb.name) n1, MAX(pa.name,pb.name) n2
FROM order_items a JOIN order_items b ON a.order_id=b.order_id AND a.product_id<b.product_id
JOIN products pa ON pa.id=a.product_id JOIN products pb ON pb.id=b.product_id)
SELECT n1, n2 FROM pairs GROUP BY n1,n2 ORDER BY COUNT(*) DESC, n1 ASC, n2 ASC LIMIT 1
q16
Question: How many days is the longest gap between two consecutive order dates, looking at all distinct order dates in the table (any status)? Return a single integer number of days.
Reference SQL:
WITH d AS (SELECT DISTINCT created_at FROM orders)
SELECT CAST(MAX(julianday(created_at)-julianday(prev)) AS INTEGER) FROM (SELECT created_at, LAG(created_at) OVER (ORDER BY created_at) prev FROM d) WHERE prev IS NOT NULL
q17
Question: Consider each customer's earliest completed order (break date ties by the smaller order id). What fraction of total completed-order revenue comes from these first orders? Return a single number between 0 and 1 rounded to 2 decimals.
Reference SQL:
WITH ov AS (SELECT o.id, o.customer_id, o.created_at, SUM(oi.quantity*oi.unit_price) val,
ROW_NUMBER() OVER (PARTITION BY o.customer_id ORDER BY o.created_at ASC, o.id ASC) rn
FROM orders o JOIN order_items oi ON oi.order_id=o.id WHERE o.status='completed' GROUP BY o.id)
SELECT ROUND(SUM(CASE WHEN rn=1 THEN val ELSE 0 END)*1.0/SUM(val),2) FROM ov
q18
Question: How many completed orders contain products from at least 3 distinct categories? Return a single number.
Reference SQL:
SELECT COUNT(*) FROM (SELECT o.id FROM orders o JOIN order_items oi ON oi.order_id=o.id JOIN products p ON p.id=oi.product_id WHERE o.status='completed' GROUP BY o.id HAVING COUNT(DISTINCT p.category)>=3)
q19
Question: Walking through the calendar months of 2025 in order and accumulating completed-order revenue, what is the first month where the cumulative revenue strictly exceeds 200000? Return the month in YYYY-MM format.
Reference SQL:
WITH mrev AS (SELECT strftime('%Y-%m',o.created_at) m, SUM(oi.quantity*oi.unit_price) rev
FROM orders o JOIN order_items oi ON oi.order_id=o.id WHERE o.status='completed' GROUP BY m)
SELECT m FROM (SELECT m, SUM(rev) OVER (ORDER BY m) c FROM mrev) WHERE c > 200000 ORDER BY m ASC LIMIT 1
q20
Question: On average, how many distinct product categories does a completed order contain? Return a single number rounded to 2 decimals.
Reference SQL:
SELECT ROUND(AVG(k),2) FROM (SELECT COUNT(DISTINCT p.category) k FROM orders o JOIN order_items oi ON oi.order_id=o.id JOIN products p ON p.id=oi.product_id WHERE o.status='completed' GROUP BY o.id)
Raw data & reproduction
Everything needed to reproduce or re-grade: data generator, runner, grader, all 240 raw responses with extracted SQL, token usage and latency — https://github.com/TokenMixAi/text2sql-bench.