The benchmark grading every text-to-SQL model has wrong answers in its key

July 2026 · the sqlsure launch story

BIRD and Spider are the exams of the text-to-SQL world. Every model you've seen ranked — GPT, Claude, Gemini, every fine-tuned SQLCoder variant — gets its accuracy number by comparing its output against these benchmarks' gold queries: SQL written by experts, reviewed, and published as ground truth.

We pointed a deterministic semantic checker at the ground truth itself. It took about two seconds, and one of the answers in the key is provably wrong by a factor of eight.

The method (no AI, no labeling)

sqlsure validates SQL against declared facts — what one row means, which joins multiply rows, what's safe to sum. Benchmarks conveniently ship those facts: every BIRD and Spider database publishes its primary and foreign keys. So the audit is mechanical:

  1. Build the rulebook from each database's own PK/FK declarations.
  2. Parse all 2,568 gold queries (1,034 Spider dev + 1,534 BIRD dev) and extract every join.
  3. Flag joins that contradict the declared facts. Review every flag by hand — and, for BIRD, by executing the shipped databases.

Results

Spider devBIRD dev
gold queries parsed1,034 / 1,0341,534 / 1,534
joins observed5181,419
backed by declared keys93%96%
anomaly flags3015
flags confirmed real30/3014/15 (1 benign-but-fragile)
spurious flags00

Finding 1: a schema with 29 declared links that forgot the one that matters

All 30 Spider flags traced to one database (flight_2) whose declared "primary key" for flights is the airline — impossible, since an airline has many flights — and whose flights→airlines relationship was never declared at all.

BIRD has the mirror image: european_football_2 declares 29 foreign keys on its Match table — every one of the 22 player-slot columns — yet omits Match.league_id → League.id, the single most analytically important link in the database, which BIRD's own gold answers use 13 times. Root cause: the FK uses SQLite's implicit-PK form (REFERENCES League with no column), which the benchmark's schema-extraction script fails to resolve. We verified it against the shipped SQLite file and filed it upstream — the only two missing FKs in the entire dev set, both found.

Finding 2: the gold answer that's wrong by 8×

BIRD dev question #571 asks:

"For the user No.24, how many times is the number of his/her posts compared to his/her votes?"

We executed the benchmark's own database: user 24 has 3 posts and 8 votes. The correct answer is 3 ÷ 8 = 0.375. The gold query returns 3.0.

Why? The gold SQL joins votes to posts on UserId = OwnerUserId — two non-key columns — creating a 3 × 8 = 24-row cartesian product per user. The inflated count divided by the distinct count algebraically collapses to… the post count. The query mechanically computes the wrong quantity, and the fan-out factor (8) is exactly how wrong it is.

-- BIRD gold (returns 3.0):
SELECT CAST(COUNT(T2.Id) AS REAL) / COUNT(DISTINCT T1.Id)
FROM votes T1 JOIN posts T2 ON T1.UserId = T2.OwnerUserId
WHERE T1.UserId = 24

-- correct (returns 0.375):
WITH v AS (SELECT COUNT(*) c FROM votes WHERE UserId = 24),
     p AS (SELECT COUNT(*) c FROM posts WHERE OwnerUserId = 24)
SELECT CAST(p.c AS REAL) / v.c FROM p, v

Every model that answers this question correctly is marked wrong by the benchmark.

The twist: the experts agree

After our audit, we found the expert-corrected BIRD dataset from the VLDB'26 annotation-errors project. Ten of our fifteen BIRD flags were independently identified by their human review — and for #571, their expert-written correction computes exactly the 0.375 we computed. Their process: expert panels. Ours: dictionary lookups against declared keys, in two seconds, with zero spurious flags. The one thing our pass found that per-question corrections structurally can't: the schema-layer FK defect.

Why this matters beyond benchmarks

The field's standard yardstick — execution accuracy, "does the output match gold?" — is blind to the case where the generated SQL and the gold SQL share the same semantic bug. Expert authors fell into the fan-out trap while writing an answer key. Your analysts and your AI agents write the same joins every day, against warehouses with no answer key at all.

That's what sqlsure is: the deterministic judge between any SQL author — human or AI — and your warehouse. It reads the facts your team already declared (dbt tests, PK/FK), rejects queries that contradict them, and attaches a fix the author can apply mechanically. In our benchmark, AI agents repaired 10/10 rejected queries by following the fix text verbatim.

pip install sqlsure

Star the repo, run python -m sqlsure.scan on your own dbt project, and tell us what it finds. Full audit reports, methodology, and honest limitations: docs/.