Nested subqueries in SQL

How a recent viewing of “Tenet” made me think of problematic SQL
Author

Johnny Breen

Published

September 26, 2023

Recently, I had the fortune (or misfortune depending on how you look at it) of rewatching Christopher Nolan’s 2020 sci-fi heist.. thriller? film starring John David Washington and Robert Pattinson1 amongst others.

If you haven’t watched Tenet yet, I’m not sure I’d recommend it. It’s one of those films that seems to get more confusing the more you watch it. Any payoff gained from finally ‘understanding’ something is usually small and most revelatory moments are structural rather than thematic - think “Christ, I guess that makes a bit of sense now (fistful of anger)?” rather than “Oh damn! It was him all along!”.

Anyway, in Tenet a central concept is that of inversion: the ability to ‘invert’ yourself or other objects so that they travel in the opposite direction of time. From the object’s perspective, everything goes backwards and from the perspective of an external subject, the object appears to travel backwards.

Confused yet? Unhelpfully there are (many) diagrams online that try to explain this concept but which also end up making everything 100x more confusing,

Sidenote: I don’t think inversion can be explained well with a static picture like this - you need to animate it to make the concept more understandable.

Anyway, I’m not here to talk about Tenet per se, but what this convoluted, almost paradoxical concept reminds me of is badly written SQL.

SQL itself is a bit weird

To start off with, the basic structure of a SQL query is itself slightly cyclical,

SELECT 
  x,
  y
FROM table t
WHERE t.x < 1
AND t.y > 10

Whilst the query starts with SELECT this is not the order in which the statements are executed. In fact, we start with FROM (where are we sourcing our data from?), followed by the WHERE clause (what filtering conditions do we impose on our source?) and only then do we ‘cycle back up’ to the SELECT clause.

It’s a weird concept to get used to at first and it often gets more complicated. Julia Evans talks about this in her excellent book on learning SQL. Here is how she conceptualises a SELECT query (folks, you really should invest in her book by the way),

Death by subqueries

Back to Tenet: every time a character inverts the timeline in Tenet, a new ‘copy’ of the original person is made and now coexists with its past self at the same time.

The more frequently a character inverts, the more confusing it gets because you - the observer - now have to try to juggle more than one timeline (and all the ensuing possibilities) in your head at the same time.

My analogy for this in SQL is nested subqueries because, fundamentally, they encourage the observer to keep everything inside one’s head, concurrently, which is not practical. Here’s a simple example of a subquery that goes one level deep,

SELECT 
  contract_id,
  total_claims
FROM
  (SELECT
    contract_id,
    SUM(claims) AS total_claims
  FROM contract_claims
  GROUP BY contract_id) claims
WHERE total_claims > 10000

The business problem here is to extract all contracts with claims that breach a threshold of $10,000.

This query isn’t necessarily ideal but it’s a feasible approach. Where things get really narly is when subqueries become nested.

Suppose we didn’t have ready access to the contract_claims table. Then,

SELECT 
  contract_id,
  total_claim_size
FROM
  (SELECT
    contract_id,
    SUM(claim_size) AS total_claim_size
  FROM (SELECT
          contract_id,
          claim_size
        FROM contract
        JOIN claims ON contract.contract_id = claims.contract_id) contract_claims
        GROUP BY contract_id) claims
WHERE total_claim_size > 10000

I’m already struggling to read this. The simple fact of the matter is that it’s not ‘natural’ to have to read this query from the ‘inside-out’ (which is precisely what must be done in this case).

Possible ‘Solution’ #1

A more natural approach would be to convert this expression into a series of CTEs like so,

WITH contract_claims AS (
  
  SELECT
    contract_id,
    claim_size
  FROM contract
  JOIN claims ON contract.contract_id = claims.contract_id

),
claim_sizes AS (

  SELECT
    contract_id,
    SUM(claim_size) AS total_claim_size
  FROM contract_claims
  GROUP BY contract_id

)
SELECT 
  contract_id,
  total_claim_size
FROM claim_sizes
WHERE total_claim_size > 10000

This approach is easier to read because - as a human being - you can now read this with some semblance of ‘order’!

  • First we get the claims for each contract
  • Second we aggregate the claims amount for each contract
  • Finally we filter out those contracts with total claims of less than $10,000

Possible ‘Solution’ #2

Another alternative solution is to encapsulate each ‘subquery’ into a temporary table. This has an advantage over pure CTEs in efficiency terms because CTEs - if repetitively scattered across a script - will be repeated in most underlying execution plans. Most SQL engines however can optimise temporary tables much easier.

In the above then, something like the following would be better (not quite sure what dialect of SQL I’m trying to write here; that’s not the point though!),

DROP TABLE IF EXISTS #contract_claims
DROP TABLE IF EXISTS #claim_sizes

CREATE TABLE #contract_claims (
  contract_id INT,
  claim_size FLOAT
)

CREATE TABLE #claim_sizes (
  contract_id INT,
  total_claim_size FLOAT
)

INSERT INTO #contract_claims (
  contract_id,
  claim_size
)
SELECT
  contract_id,
  claim_size
FROM contract
JOIN claims ON contract.contract_id = claims.contract_id

INSERT INTO #claim_sizes (
  contract_id,
  total_claim_size
)
SELECT
  contract_id,
  SUM(claim_size) AS total_claim_size
FROM contract_claims
GROUP BY contract_id

SELECT 
  contract_id,
  total_claim_size
FROM #claim_sizes
WHERE total_claim_size > 10000

Personal take

My personal take: avoid ‘nesting’ like the above whenever you can. It creates too heavy a cognitive load for anyone else reading your code.

The same thing applies to nested if statements in procedural code. As a general rule, if you’re going more than one level deep, you’re probably doing something wrong2.

Remember, you’re trying to write good code; not the next incomprehensible iteration of Tenet!

Footnotes

  1. And, most hilarious of all, Michael Caine as… Sir Michael↩︎

  2. Though there are occasionally reasons to do so↩︎