Author Image

CTO & Co-founder of Visivo

Text-to-SQL vs the Semantic Layer: What the 2026 Benchmarks Show

Text-to-SQL hits 80% on benchmarks and falls apart on real business logic. The 2026 data is clear: a semantic layer is what gets AI answers to near-100% accuracy.

Text-to-SQL accuracy compared with a semantic-layer-backed approach on a query benchmark

Raw text-to-SQL looks dazzling on standardized benchmarks (the best models now clear 80 percent) and then collapses the moment it meets real enterprise logic. The 2026 data points to one fix: pair NL-to-SQL with a semantic layer, and AI answers climb to roughly 95 percent reliability and approach 100 percent accuracy on anything the layer covers.

That single sentence is the whole argument of this post, so if you only take one thing away, take that. The rest is the detail that matters when you are the engineer who has to make an AI data agent trustworthy enough to put in front of a finance team.

The promise: ask a question, get SQL

The pitch for text-to-SQL is irresistible. A non-technical user types "what was net revenue by region last quarter," a large language model translates it into a SQL query, the warehouse runs it, and a chart appears. No analyst in the loop, no ticket queue, no week-long wait. For a decade the bottleneck in self-serve analytics was that humans could not write SQL; the promise of NL-to-SQL is that they no longer have to.

And the benchmarks make it look solved. On standardized academic suites that test natural-language-to-SQL translation, the strongest 2026 models routinely score above 80 percent execution accuracy. If you read only the leaderboard, you would conclude this is a finished problem and start ripping out your BI tool.

The leaderboard is lying to you, or more precisely, it is answering a different question than the one your business is asking. Benchmark schemas are clean, small, and documented. Your warehouse is none of those things.

Why raw text-to-SQL breaks in production

The gap between an 80-percent benchmark and a production failure is not subtle once you see where the model is forced to guess. Three failure modes dominate, and all three come from the same root cause: the model has no idea what your business means by anything.

Multi-step logic. "What is our quarter-over-quarter net revenue retention for accounts that signed in 2024?" is not one query, it is a chain of derived steps with a precise order of operations. A model that nails a single-table aggregate on a benchmark routinely produces plausible-looking SQL here that computes the wrong thing, because each step has several defensible interpretations and the model picks one without telling you.

Ambiguous columns. Your warehouse has amount, amount_usd, gross_amount, and net_amount, plus a status column where 'closed' sometimes means won and sometimes means churned depending on the table. The model cannot know that "revenue" means SUM(net_amount) filtered to status = 'won' and excluding internal test accounts. It guesses from column names, and column names lie constantly.

Cross-table joins. The model has to infer the join path between orders, customers, and subscriptions. Pick the wrong key (or fan out a one-to-many join and double-count) and you get a number that is confidently, silently wrong. This is the most dangerous failure of all, because the query runs, returns rows, and renders a clean chart. Nothing looks broken.

The through-line: a benchmark rewards getting the syntax right on a known schema. Production demands getting the semantics right on an unknown one. An LLM is genuinely good at the first and structurally incapable of the second on its own, because the semantics live in your team's heads and your tribal knowledge, not in the schema.

What the 2026 benchmarks actually measured

The interesting 2026 work stopped benchmarking the model in isolation and started benchmarking the system around it. That is the shift worth understanding.

When researchers ran a realistic workload (on the order of 522 business questions against an enterprise-shaped schema) they compared two setups. The first was a strong model doing raw text-to-SQL. The second was the same model given a semantic layer and a context layer to look things up in before generating any SQL. The second setup delivered on the order of three times the accuracy of the first, landing around 95-plus percent reliability on the full workload.

Drill into the breakdown and the result gets sharper. For the subset of questions covered by a well-modeled semantic layer (where "net revenue," "active account," and the join path between orders and customers were already defined) accuracy approached 100 percent. The model was not being asked to invent business logic. It was being asked to assemble logic that already existed, and assembly is something LLMs do reliably.

The headline is not "AI is bad at SQL." The headline is that the accuracy of an AI data agent is bounded almost entirely by the quality of the context you hand it, and a semantic layer is the highest-leverage context you can provide. The model was never the bottleneck. The missing map of your business was.

How a semantic layer closes the accuracy gap

A semantic layer is the layer that maps human concepts to data. "Activated user," "churn rate," "net revenue," "trailing-twelve-month bookings": each of these is a named definition that resolves to an exact expression over exact tables with an exact join path. Define it once, and it is no longer something anyone (human or model) has to guess.

That reframes the agent's job completely. Without a semantic layer, the agent's task is "translate this English sentence into correct business SQL against a schema you have never seen." With one, the task becomes "this sentence maps to the metric net_revenue split by the dimension region over the last quarter; look those up and compose the query." The first task is open-ended generation, which is where models hallucinate. The second is constrained retrieval and assembly, which is where they excel.

The semantic layer also turns silent failures into impossible ones. If "revenue" can only resolve to one definition, the model cannot quietly pick the wrong column. If the join from orders to customers is declared once as a relation, the model cannot invent a different key and double-count. You are not making the model smarter. You are shrinking the space of wrong answers it is even able to produce, which is a far more reliable engineering move than hoping a bigger model guesses better. It is the same principle behind insisting that metrics live in a semantic layer rather than in every dashboard: one definition, one answer, for humans and for agents alike.

Designing metrics an agent can rely on

If the agent's accuracy is bounded by your semantic layer, then designing that layer well is the actual engineering work. A few principles that hold up under an agent's scrutiny:

Name metrics the way people ask for them. If the business says "net revenue," the metric is net_revenue, not nr_calc_v2. The agent matches natural language to your definitions, so definitions that read like natural language match more often and more correctly.

Encode the business rule in the expression, not in tribal knowledge. If revenue excludes refunds, discounts, and internal test accounts, that filtering belongs inside the metric definition where the agent will use it automatically.

models:
  - name: orders
    sql: "SELECT * FROM orders WHERE is_internal_test = false"
    metrics:
      - name: net_revenue
        expression: "SUM(amount - refunds - discounts)"
    dimensions:
      - name: region
        expression: "billing_region"
      - name: order_quarter
        expression: "DATE_TRUNC('quarter', created_at)"

Declare the join paths so the agent never invents one. A relation states, once, how two models connect. Every metric that composes across them inherits a correct, reviewed join instead of a guessed one.

relations:
  - name: orders_to_customers
    condition: ${ref(orders).customer_id} = ${ref(customers).id}

Make the layer testable. A named metric can carry a test (revenue is never negative, active-account counts reconcile to the source system). When a definition is testable, it is trustworthy, and an agent reasoning over trustworthy definitions produces trustworthy answers. This is exactly why testing dashboards like software matters even more once an agent is the consumer.

The discipline here is the same one good analytics engineers already practice. The new incentive is that a model, not just a teammate, now reads what you wrote, and the model has zero tolerance for the gaps a human would paper over with context.

Visivo's semantic layer as agent context

This is why Visivo is built around code-defined metrics, dimensions, and relations rather than charts that each re-derive their own logic. Those definitions are not just documentation for humans. They are precisely the structured context that makes NL-to-SQL safe.

When a metric like net_revenue is defined once on a model, an Insight that uses it references the definition rather than re-implementing it:

insights:
  - name: quarterly-revenue-by-region
    props:
      type: bar
      x: ?{ ${ref(orders).order_quarter} }
      y: ?{ ${ref(orders).net_revenue} }
    interactions:
      - split: ?{ ${ref(orders).region} }
      - sort: ?{ ${ref(orders).order_quarter} ASC }

Because the definitions live in version-controlled YAML, they are reviewable, testable, and unambiguous. That is the same property a human team wants from a single source of truth, and it happens to be the exact property an AI agent needs to answer reliably. The map of your business is written down, in code, in one place. A model pointed at that map looks things up instead of guessing, which is the entire difference between an 80-percent demo and a 95-plus-percent production agent.

The takeaway for anyone evaluating AI analytics in 2026: do not buy the model, build the layer. The model is a commodity that gets better every quarter on its own. The semantic layer is the durable, defensible asset that determines whether your AI answers are trustworthy, and it is the one piece only your team can build. You can see the model that makes this concrete in the Visivo examples gallery, and /get-started takes you from install to your first defined metric in a few minutes.

Previously in Visivo

This post extends the case made in the single source of truth for metrics: once you accept that a metric should have exactly one definition, the payoff is not only that humans stop arguing about revenue, it is that AI agents can finally answer about it correctly. For the foundational argument, start with what BI-as-code is and why it matters.

Install command copied