TokenMix Research Lab · 2026-07-03

Text-to-SQL Across 12 Models: Methodology & Raw Data

Published 2026-07-02. All 240 calls made on this date through api.tokenmix.ai. Companion post with the results table: https://tokenmix.ai/blog/text-to-sql-benchmark-12-models.

Why

Answer one question with receipts: for SQL generation, what does paying 10-100x more per token actually buy?

Database

SQLite, 4 tables (customers 60 / products 20 / orders 350 / order_items ~870), generated with a fixed seed (42) so anyone can rebuild it byte-for-byte. The generator script is in the repo linked below. 8 customers never place an order; order statuses are completed/cancelled/refunded (75/15/10); order dates span 2025.

CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL, country TEXT NOT NULL, created_at TEXT NOT NULL);
CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT NOT NULL, category TEXT NOT NULL, price REAL NOT NULL);
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER NOT NULL REFERENCES customers(id), status TEXT NOT NULL, created_at TEXT NOT NULL);
CREATE TABLE order_items (order_id INTEGER NOT NULL REFERENCES orders(id), product_id INTEGER NOT NULL REFERENCES products(id), quantity INTEGER NOT NULL, unit_price REAL NOT NULL);

Protocol

Grading

Returned text -> strip markdown fences -> take the first SELECT/WITH statement -> execute read-only against the database -> normalize rows (floats rounded to 2 decimals, strings trimmed, row order ignored) -> exact match against the reference result. SQL error or wrong result set = 0 points, no partial credit.

Strict vs lenient. On q15, three models (claude-opus-4.8, claude-sonnet-5, gpt-5.4-mini) identified the correct product pair but violated the stated output spec — wrong column order or an extra count column. Strict scoring (the headline numbers) counts these as fails; lenient scoring treats them as passes and moves all three to 20/20. Every raw response is in the data dump, so you can re-grade with your own rules.

Costs

"Total bill" is the actual amount charged by our gateway's billing ledger for these calls, summed per model — not price-sheet arithmetic. Prices as of 2026-07-02. Provider-side accounting of reasoning/thinking tokens varies by model family and is reflected in the bills as charged on that date.

Limitations (read before quoting)

The 20 questions

q01

Question: How many orders are there in total? Return a single number.

Reference SQL:

SELECT COUNT(*) FROM orders

q02

Question: Which 3 countries have the most customers? Return only the country codes, ordered by customer count descending, ties broken alphabetically by country code.

Reference SQL:

SELECT country FROM customers GROUP BY country ORDER BY COUNT(*) DESC, country ASC LIMIT 3

q03

Question: What is the total revenue from completed orders? Revenue is quantity * unit_price summed over order items. Return a single number rounded to 2 decimals.

Reference SQL:

SELECT ROUND(SUM(oi.quantity*oi.unit_price),2) FROM order_items oi JOIN orders o ON o.id=oi.order_id WHERE o.status='completed'

q04

Question: Which product category generated the highest revenue across completed orders? Return only the category name.

Reference SQL:

SELECT p.category FROM order_items oi JOIN orders o ON o.id=oi.order_id JOIN products p ON p.id=oi.product_id WHERE o.status='completed' GROUP BY p.category ORDER BY SUM(oi.quantity*oi.unit_price) DESC LIMIT 1

q05

Question: Which customer has the highest total spend on completed orders? Return only the customer's name.

Reference SQL:

SELECT c.name FROM customers c JOIN orders o ON o.customer_id=c.id JOIN order_items oi ON oi.order_id=o.id WHERE o.status='completed' GROUP BY c.id ORDER BY SUM(oi.quantity*oi.unit_price) DESC LIMIT 1

q06

Question: Which calendar month of 2025 had the most orders (any status)? Return the month in YYYY-MM format. If tied, return the earliest such month.

Reference SQL:

SELECT strftime('%Y-%m',created_at) m FROM orders GROUP BY m ORDER BY COUNT(*) DESC, m ASC LIMIT 1

q07

Question: What is the average order value of completed orders? Order value is the sum of quantity * unit_price of the order's items. Return a single number rounded to 2 decimals.

Reference SQL:

SELECT ROUND(AVG(v),2) FROM (SELECT SUM(oi.quantity*oi.unit_price) v FROM orders o JOIN order_items oi ON oi.order_id=o.id WHERE o.status='completed' GROUP BY o.id)

q08

Question: How many customers have never placed any order? Return a single number.

Reference SQL:

SELECT COUNT(*) FROM customers c WHERE NOT EXISTS (SELECT 1 FROM orders o WHERE o.customer_id=c.id)

q09

Question: For each product category, find the best-selling product by total quantity sold across all orders. Return rows of (category, product name), ties broken alphabetically by product name.

Reference SQL:

SELECT category, name FROM (SELECT p.category category, p.name name, ROW_NUMBER() OVER (PARTITION BY p.category ORDER BY SUM(oi.quantity) DESC, p.name ASC) rn FROM products p JOIN order_items oi ON oi.product_id=p.id GROUP BY p.id) WHERE rn=1

q10

Question: Among customers who placed at least one completed order, what share placed two or more completed orders? Return a single number between 0 and 1 rounded to 2 decimals.

Reference SQL:

SELECT ROUND(AVG(CASE WHEN n>=2 THEN 1.0 ELSE 0.0 END),2) FROM (SELECT customer_id, COUNT(*) n FROM orders WHERE status='completed' GROUP BY customer_id)

q11

Question: What is the median order value of completed orders? Order value is the sum of quantity * unit_price of the order's items. Use the standard median (average of the two middle values if the count is even). Return a single number rounded to 2 decimals.

Reference SQL:

WITH v AS (SELECT SUM(oi.quantity*oi.unit_price) val FROM orders o JOIN order_items oi ON oi.order_id=o.id WHERE o.status='completed' GROUP BY o.id),
     r AS (SELECT val, ROW_NUMBER() OVER (ORDER BY val) rn, COUNT(*) OVER () n FROM v)
     SELECT ROUND(AVG(val),2) FROM r WHERE rn IN ((n+1)/2, (n+2)/2)

q12

Question: For each country that has at least one completed order, find the customer with the highest total completed-order spend. Return rows of (country, customer name), ties broken alphabetically by customer name.

Reference SQL:

SELECT country, name FROM (
      SELECT c.country country, c.name name, ROW_NUMBER() OVER (PARTITION BY c.country ORDER BY SUM(oi.quantity*oi.unit_price) DESC, c.name ASC) rn
      FROM customers c JOIN orders o ON o.customer_id=c.id JOIN order_items oi ON oi.order_id=o.id
      WHERE o.status='completed' GROUP BY c.id) WHERE rn=1

q13

Question: Comparing consecutive calendar months of 2025, which month had the largest absolute increase in completed-order revenue versus the immediately preceding calendar month? Only consider February through December. Return the month in YYYY-MM format.

Reference SQL:

WITH mrev AS (SELECT strftime('%Y-%m',o.created_at) m, SUM(oi.quantity*oi.unit_price) rev
      FROM orders o JOIN order_items oi ON oi.order_id=o.id WHERE o.status='completed' GROUP BY m)
     SELECT m FROM (SELECT m, rev - LAG(rev) OVER (ORDER BY m) d FROM mrev) WHERE d IS NOT NULL ORDER BY d DESC, m ASC LIMIT 1

q14

Question: How many customers have placed at least one order and have ALL of their orders in 'completed' status? Return a single number.

Reference SQL:

SELECT COUNT(*) FROM (SELECT customer_id FROM orders GROUP BY customer_id HAVING SUM(CASE WHEN status<>'completed' THEN 1 ELSE 0 END)=0)

q15

Question: Which pair of distinct products appears together in the same order most frequently (counting each order at most once per pair, any order status)? Return one row with the two product names, the alphabetically smaller name first. Break ties by the alphabetically smallest pair.

Reference SQL:

WITH pairs AS (
      SELECT DISTINCT a.order_id, MIN(pa.name,pb.name) n1, MAX(pa.name,pb.name) n2
      FROM order_items a JOIN order_items b ON a.order_id=b.order_id AND a.product_id<b.product_id
      JOIN products pa ON pa.id=a.product_id JOIN products pb ON pb.id=b.product_id)
     SELECT n1, n2 FROM pairs GROUP BY n1,n2 ORDER BY COUNT(*) DESC, n1 ASC, n2 ASC LIMIT 1

q16

Question: How many days is the longest gap between two consecutive order dates, looking at all distinct order dates in the table (any status)? Return a single integer number of days.

Reference SQL:

WITH d AS (SELECT DISTINCT created_at FROM orders)
     SELECT CAST(MAX(julianday(created_at)-julianday(prev)) AS INTEGER) FROM (SELECT created_at, LAG(created_at) OVER (ORDER BY created_at) prev FROM d) WHERE prev IS NOT NULL

q17

Question: Consider each customer's earliest completed order (break date ties by the smaller order id). What fraction of total completed-order revenue comes from these first orders? Return a single number between 0 and 1 rounded to 2 decimals.

Reference SQL:

WITH ov AS (SELECT o.id, o.customer_id, o.created_at, SUM(oi.quantity*oi.unit_price) val,
        ROW_NUMBER() OVER (PARTITION BY o.customer_id ORDER BY o.created_at ASC, o.id ASC) rn
      FROM orders o JOIN order_items oi ON oi.order_id=o.id WHERE o.status='completed' GROUP BY o.id)
     SELECT ROUND(SUM(CASE WHEN rn=1 THEN val ELSE 0 END)*1.0/SUM(val),2) FROM ov

q18

Question: How many completed orders contain products from at least 3 distinct categories? Return a single number.

Reference SQL:

SELECT COUNT(*) FROM (SELECT o.id FROM orders o JOIN order_items oi ON oi.order_id=o.id JOIN products p ON p.id=oi.product_id WHERE o.status='completed' GROUP BY o.id HAVING COUNT(DISTINCT p.category)>=3)

q19

Question: Walking through the calendar months of 2025 in order and accumulating completed-order revenue, what is the first month where the cumulative revenue strictly exceeds 200000? Return the month in YYYY-MM format.

Reference SQL:

WITH mrev AS (SELECT strftime('%Y-%m',o.created_at) m, SUM(oi.quantity*oi.unit_price) rev
      FROM orders o JOIN order_items oi ON oi.order_id=o.id WHERE o.status='completed' GROUP BY m)
     SELECT m FROM (SELECT m, SUM(rev) OVER (ORDER BY m) c FROM mrev) WHERE c > 200000 ORDER BY m ASC LIMIT 1

q20

Question: On average, how many distinct product categories does a completed order contain? Return a single number rounded to 2 decimals.

Reference SQL:

SELECT ROUND(AVG(k),2) FROM (SELECT COUNT(DISTINCT p.category) k FROM orders o JOIN order_items oi ON oi.order_id=o.id JOIN products p ON p.id=oi.product_id WHERE o.status='completed' GROUP BY o.id)

Raw data & reproduction

Everything needed to reproduce or re-grade: data generator, runner, grader, all 240 raw responses with extracted SQL, token usage and latency — https://github.com/TokenMixAi/text2sql-bench.