The benchmark grading every text-to-SQL model has wrong answers in its key
We ran a deterministic semantic check over 2,568 gold queries from Spider and BIRD: zero false alarms, a schema defect filed upstream, and one gold answer that is off by a factor of 8, proven by running the gold query against the benchmark's own database.