Your guardrails are burning tokens — and token prices are about to get honest

July 2026 · on the economics of trusting AI agents

The AI industry has quietly converged on one answer to "how do we trust agent output?": ask another LLM. A judge model reviews the agent's action before it executes. A reflection subagent critiques the draft. A scorer grades the output. It's become the canonical agent-safety pattern, productized everywhere: Guardrails AI validators, Databricks MLflow scorers, and in text-to-SQL specifically, systems like Snowflake's Cortex Analyst employing LLM-based evaluation layers to judge whether generated SQL matches the question.

It works, partially (more on that below). But notice what the pattern costs: every check is an inference call. Tokens on every query. Tokens on every retry. Tokens per agent, per subagent, per reflection round. Your safety budget scales linearly with your usage — on the most expensive computing substrate ever sold at retail.
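To make "scales linearly" concrete, here is a back-of-envelope sketch. The function name and every input number are illustrative assumptions, not measurements or any vendor's real prices; the point is only the shape of the formula.

```python
def guardrail_token_cost(queries, checks_per_query, tokens_per_check, usd_per_million_tokens):
    """Monthly spend on LLM-based checks. Linear in every factor."""
    total_tokens = queries * checks_per_query * tokens_per_check
    return total_tokens * usd_per_million_tokens / 1_000_000


# Illustrative inputs only (assumptions, not measured values).
baseline = guardrail_token_cost(100_000, 3, 2_000, 5.0)
doubled_usage = guardrail_token_cost(200_000, 3, 2_000, 5.0)
doubled_price = guardrail_token_cost(100_000, 3, 2_000, 10.0)

print(baseline, doubled_usage, doubled_price)
```

Doubling query volume doubles the bill, and so does doubling the token price. Nothing in the formula amortizes.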

The subsidy question

All of this would be fine if tokens stayed cheap. Here's what's actually being reported. SemiAnalysis estimated that heavy users of $200/month AI coding plans consume thousands of dollars of tokens at API prices: by their math, up to ~$8,000/month for the heaviest Claude users, implying deeply negative gross margins on power users. A credible counter-analysis argues this confuses retail API prices with actual compute cost (perhaps ~10× lower), making the true subsidy far smaller.

We don't need to adjudicate that debate, because both sides agree on the observable facts: a small fraction of token-hungry users consume wildly disproportionate compute; vendors responded with weekly rate limits and usage caps; and every provider's pricing trajectory is converging toward compute reality. Whatever today's true margin is, the direction of travel is pricing that reflects cost. Architectures built on "tokens are basically free" assumptions are carrying repricing risk — and token-based guardrails sit squarely in that category, because they're pure overhead: tokens spent producing no user-visible output at all.

The variance problem (the part that isn't about money)

Even with free tokens, LLM judges have a documented flaw, evaluator variance: the same query pair receives inconsistent verdicts across calls. A guardrail that says "unsafe" on Tuesday and "fine" on Wednesday isn't a guardrail; it's a mood. And for SQL semantics specifically, the judge has a deeper problem: it doesn't know your join cardinalities. No amount of reflection tells a model that orders → order_items multiplies rows at your company. That fact lives in your dbt tests, not in its weights. We demonstrated where that ends: expert humans writing a benchmark answer key produced a query that's wrong by 8×, and review-by-intelligence didn't catch it. A cardinality lookup did, in milliseconds.
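For readers who haven't been bitten by fan-out yet, a tiny self-contained illustration using Python's built-in sqlite3 module. The two tables and their rows are made up for the demo (this is not the benchmark query mentioned above); they only show how a one-to-many join inflates a sum.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, total REAL);
    CREATE TABLE order_items (item_id INTEGER PRIMARY KEY, order_id INTEGER, sku TEXT);
    INSERT INTO orders VALUES (1, 100.0), (2, 50.0);
    -- order 1 has three items, order 2 has one
    INSERT INTO order_items VALUES (1, 1, 'a'), (2, 1, 'b'), (3, 1, 'c'), (4, 2, 'd');
""")

# Correct: sum order totals at the orders grain.
true_revenue = conn.execute("SELECT SUM(total) FROM orders").fetchone()[0]

# Fan-out: each order row is repeated once per item before the SUM runs.
fanned_out = conn.execute("""
    SELECT SUM(o.total)
    FROM orders o
    JOIN order_items i ON i.order_id = o.order_id
""").fetchone()[0]

print(true_revenue, fanned_out)
```

Both queries are syntactically valid and look plausible in review. Only the declared one-to-many relationship tells you the second one counts order 1 three times.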

The alternative: judge arithmetic with arithmetic

For a meaningful class of checks — grain, join fan-out, additivity, join keys, column policy — correctness is not a matter of judgment. It's a lookup against declared facts. That class of guardrail can be deterministic:

                          | LLM-as-judge subagent          | Deterministic constraint check
--------------------------+--------------------------------+-----------------------------------
cost per check            | an inference call (tokens + $) | ~$0
latency                   | 1–3 s                          | 0.1 ms
consistency               | variance across calls          | same input, same verdict, always
knows YOUR schema facts   | no, guesses from context       | yes, reads dbt tests / PK-FK / OSI
repricing exposure        | full                           | zero
handles genuine ambiguity | yes                            | no, and says so honestly

That last row matters: this is not an argument to delete your LLM judges. Intent ambiguity, tone, relevance — those genuinely need judgment. The argument is about division of labor:

Spend tokens on ambiguity, not on arithmetic. Let a deterministic gate handle everything that's checkable against declared facts, and reserve LLM judgment for what actually requires it.
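What "a lookup against declared facts" can look like, as a minimal sketch. Everything here is hypothetical: the DECLARED_JOINS dictionary stands in for facts you would derive from dbt tests or PK-FK metadata, and check_fan_out is an illustrative name, not sqlsure's actual API or implementation.

```python
from dataclasses import dataclass

# Hypothetical declared facts, e.g. derived from dbt uniqueness/relationship tests.
# Key: (table holding the measure, table being joined) -> declared cardinality.
DECLARED_JOINS = {
    ("orders", "order_items"): "one_to_many",
    ("orders", "customers"): "many_to_one",
}


@dataclass(frozen=True)
class Verdict:
    approved: bool
    reason: str


def check_fan_out(measure_table: str, joined_table: str) -> Verdict:
    """Reject summing measure_table columns across a join that multiplies its rows."""
    cardinality = DECLARED_JOINS.get((measure_table, joined_table))
    if cardinality is None:
        # Unknown relationships are rejected rather than guessed at.
        return Verdict(False, f"no declared relationship for {measure_table} -> {joined_table}")
    if cardinality in ("one_to_many", "many_to_many"):
        return Verdict(
            False,
            f"{measure_table} -> {joined_table} is {cardinality}: "
            f"aggregating {measure_table} columns after this join double-counts",
        )
    return Verdict(True, "join preserves the grain of the measure table")


print(check_fan_out("orders", "order_items"))
print(check_fan_out("orders", "customers"))
```

The check is a dictionary lookup and two comparisons: no inference call, and the same input yields the same verdict every time. It also refuses to answer when no fact is declared, which is the "says so honestly" behavior from the table.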

The economics compound in the repair loop, too. A reflection subagent critiques vaguely ("re-examine your joins"), so repairs take multiple rounds — each one a fresh generation. A deterministic gate rejects with a machine-actionable fix ("pre-aggregate orders to order_id before joining order_items"); in our benchmark, applying the fix verbatim produced a passing query 10 out of 10 times — one round. The gate doesn't just cost less per check; it reduces the generation tokens on the other side of the loop.

What this looks like in practice

agent drafts SQL
   → sqlsure check          # 0.1 ms, $0, deterministic
       approved → execute
       rejected → apply the fix → re-check → execute
   → (optional) LLM judge for what remains:
       does this answer the user's actual question?

The deterministic layer runs on every query at zero marginal cost. The token-based layer runs where judgment is genuinely needed. Total guardrail spend stops scaling with query volume — which is exactly the property you want in the year pricing gets honest.
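The same flow as a Python sketch. The gate and judge are passed in as plain callables because this is an assumption-laden illustration of the control flow, not sqlsure's real interface; guarded_sql, toy_gate, and toy_judge are all hypothetical names.

```python
from typing import Callable, Optional, Tuple

# (approved, suggested_fix): suggested_fix is a rewritten query, or None if no fix is known.
GateResult = Tuple[bool, Optional[str]]


def guarded_sql(
    draft: str,
    gate: Callable[[str], GateResult],
    judge: Optional[Callable[[str], bool]] = None,
    max_repairs: int = 1,
) -> Optional[str]:
    """Return SQL that is safe to execute, or None if it could not be approved."""
    sql = draft
    for _ in range(max_repairs + 1):
        approved, fix = gate(sql)      # deterministic, no tokens spent
        if approved:
            break
        if fix is None:
            return None                # rejected with nothing actionable
        sql = fix                      # apply the fix, then re-check
    else:
        return None                    # repair budget exhausted
    # Tokens are spent only here, and only on queries the gate already passed.
    if judge is not None and not judge(sql):
        return None
    return sql


# Toy stand-ins so the sketch runs end to end.
def toy_gate(sql: str) -> GateResult:
    if "pre_agg" in sql:
        return True, None
    return False, "SELECT SUM(total) FROM pre_agg"


judge_calls = []


def toy_judge(sql: str) -> bool:
    judge_calls.append(sql)
    return True


result = guarded_sql(
    "SELECT SUM(o.total) FROM orders o JOIN order_items i USING (order_id)",
    toy_gate,
    toy_judge,
)
print(result, len(judge_calls))
```

In this toy run the draft is rejected once, the suggested fix passes on re-check, and the judge is called a single time on the repaired query.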

sqlsure is open source (Apache-2.0): a semantic gate for SQL with a CLI, an MCP server your agents call before executing, and audited receipts — zero false alarms across 2,568 benchmark gold queries.

pip install sqlsure