SQL Riddles to Test Your Wits#sql
Written by Matt Sosna on February 21, 2023
SQL is a deceptively simple language. Across its many dialects, users can query databases in a syntax similar to English. What you see is what you get… until you don’t.
Every now and then I come across a query that produces a result completely different from what I expected, teaching me little nuances about the language. I’ve compiled three recent head-scratchers in this post, and I’ve arranged them as riddles to make them more interesting. Try to figure out the answer before reading the end of each section!
I’ve also included quick common table expressions (CTEs) to generate the tables in each example, so you don’t need to try querying your company’s production tables! But to get really comfortable with SQL, I actually recommend creating your own database and tables to play with. Check out this post to learn how.
Note that all queries are in Postgres - you may get different results in a different dialect. Finally, an obligatory note that the actual data and topics in each query are just illustrative examples. 🙂
Riddle 1: Timestamp Specificity
Imagine we have a table called
purchases with purchase IDs, amounts, and times the purchase were made. Let’s say it looks like this:
As a CTE, this would look something like this. Note that we need to specify that the
dt column format is a timestamp so it isn’t interpreted as a string. We also only need to specify the data types for one of the rows; the rest are inferred.
1 2 3 4 5 6 7 8 WITH purchases(id, amount, dt) AS ( VALUES (1::bigint, 0.99::float, '2023-02-15 00:00:00 GMT'::timestamp), (2, 9.99, '2023-02-15 07:15:00 GMT'), (3, 15.99, '2023-02-15 23:01:15 GMT'), (4, 7.99, '2023-02-16 14:22:09 GMT') ) ...
Now let’s calculate the sum of purchases made on Feb 15. We can write a query like the one below:
1 2 3 4 5 6 ... SELECT SUM(amount) AS sum FROM purchases WHERE dt = '2023-02-15'
We mysteriously receive the following response.
What happened? There were three purchases made on Feb 15: IDs 1, 2, and 3. The sum should therefore be $26.97. Instead, only the first purchase was counted.
If you change the filter to
2023-02-16, no rows are returned.
dt column is a timestamp that includes both date and time. Our
WHERE filter only specifies the date. Rather than rejecting this query, Postgres automatically reformats the date string to
2023-02-15 00:00:00. This matches only the first transaction in the table, so we’re therefore taking only the sum of one row.
If we wanted to select all rows corresponding to Feb 15, we should first cast the timestamp to date.
1 2 3 4 5 SELECT SUM(amount) AS sum FROM purchases WHERE DATE(dt) = '2023-02-15'
We now get the expected result.
Riddle 2: Dependent vs. independent filters
Alright, next riddle. We have a table called
users, and our goal is to remove all rows that meet any one of three conditions. In the table below, for example, let’s say that we want to only return tenured and active users, i.e., ones who have logged in during the last 28 days, have posted before, and are not a new account.
In other words, we want our query to only user 8, who has False values for
Let’s start with the top of our query.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 WITH users(id, no_login_l28, has_never_posted, is_new_account) AS ( VALUES (1, True, True, True), (2, True, True, False), (3, True, False, True), (4, True, False, False), (5, False, True, True), (6, False, True, False), (7, False, False, True), (8, False, False, False) ) SELECT id FROM users WHERE ...
How should we structure the
WHERE clause of our query? Think for a minute – we need to be careful not to return rows where any of the columns is
When you’re ready, take a look at the options below. Two are correct and two are wrong.
Option 1: Multiple
1 2 3 4 WHERE NOT no_login_l28 AND NOT has_never_posted AND NOT is_new_account
Option 2: Multiple
1 2 3 4 WHERE NOT no_login_l28 OR NOT has_never_posted OR NOT is_new_account
NOT + grouped
1 2 3 4 5 6 WHERE NOT ( no_login_l28 OR has_never_posted OR is_new_account )
NOT + grouped
1 2 3 4 5 6 WHERE NOT ( no_login_l28 AND has_never_posted AND is_new_account )
When are conditions in a filter evaluated separately versus together? If they’re evaluated together, can we condense all conditions down to one
Option 1. This one tripped me up a bit. A data scientist on my team submitted a PR with this filter, which I was convinced would pull in rows 2-7 because the query would only remove users with
False values for all three columns. But to my surprise, Option 1 actually works because the three filters are evaluated independently. ✅
Option 2. This was the filter I initially thought was correct, since I didn’t realize the filters would be evaluated independently. But this filter will actually return users 2-8, since anyone who has at least one
is_new_account will be allowed through. ❌
Option 3. This was how I initially thought the filter needed to be worded. If a user has
True for any of
is_new_account, then lines 3-5 evaluate to
NOT flips this to
False, and those rows are ultimately excluded. This indeed works, and I find this easier to understand than Option 1, but both are valid. ✅
Option 4. This returns the same incorrect result as Option 2. Lines 3-5 evaluate to
True only for user 1, meaning that when we flip the boolean with
NOT, all remaining users are pulled through. ❌
Riddle 3: Left joins acting like inner joins
Take a look at the query below. We have two tables,
customers contains customer IDs and their lifetime dollars spent on the platform.
reviews contains information about reviews left by customers: the review ID, customer ID, and whether the review was reported as spam.
Here’s the subquery to generate the two CTEs:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 WITH customers(id, total_spend) AS ( VALUES (100, 1583.49), (200, 8739.03), (300, 431.00), (400, 1.00), (500, 22.27) ), reviews(id, customer_id, reported_as_spam) AS ( VALUES (1, 100, False), (2, 100, False), (3, 400, True), (4, 400, True), (5, 500, False) ) ...
Now let’s say we’re curious about the relationship between a customer’s total spend and the number of non-spam reviews they write. Since not each customer has left a review, we’ll want to left join
customers. We can structure our query like this:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 ... SELECT c.id, c.total_spend, COALESCE(COUNT(r.id), 0) AS n_reviews FROM customers c LEFT JOIN reviews r ON c.id = r.customer_id WHERE NOT r.reported_as_spam GROUP BY 1, 2 ORDER BY 1
Ready? Here’s what comes out.
Wait a minute. Where did users 200, 300, and 400 go? Why were they removed, and how can we bring them back in?
If you create a CTE for
reviews with spam reviews filtered out, then join on this CTE, do we get the same result?
Looking closely, we can see that users 200 and 300 have never left any reviews. 400 only has spam reviews, but they were completely removed as well. Since we did a left join, these users should still be in the table and have a 0 for
n_reviews. Instead, our left join behaved like an inner join.
The issue, it turns out, is that
WHERE clauses are evaluated after joins. Our left join brings in null values for
reported_as_spam for users 200 and 300. The
WHERE filter then removes all rows where
reported_as_spam is True, which removes user 400. However, this filter also removes null values, so users 200 and 300 are also removed.
To do this properly, we need to pre-filter
reviews before joining with
customers. As the hint states, we can create a CTE for
reviews and perform the filtering there. But more efficiently, let’s perform the filtering within the join.
We can do this by adding
AND NOT r.reported_as_spam to the
LEFT JOIN block. See below:
1 2 3 4 5 6 7 8 9 10 11 12 13 ... SELECT c.id, c.total_spend, COALESCE(COUNT(r.id), 0) AS n_reviews FROM customers c LEFT JOIN reviews r ON c.id = r.customer_id AND NOT r.reported_as_spam GROUP BY 1, 2 ORDER BY 1
Now we get the expected result.
This post shared three SQL wrinkles that can lead to unexpected results: timestamp specificity, dependent versus independent filters, and left joins acting like inner joins. I specifically provided simple examples to keep the focus on the syntax, but you’ll likely encounter SQL nuances like these nestled within large, complex queries.
These bugs can be incredibly challenging to identify, especially for queries with many components. Whenever I’m confused by a result, I try to break the query into its pieces and verify each component’s result. But when in doubt, write some simple CTEs with test data and confirm the results do what you expect.