SQL Riddles to Test Your Wits
#sqlWritten by Matt Sosna on February 21, 2023
SQL is a deceptively simple language. Across its many dialects, users can query databases in a syntax similar to English. What you see is what you get… until you don’t.
Every now and then I come across a query that produces a result completely different from what I expected, teaching me little nuances about the language. I’ve compiled three recent head-scratchers in this post, and I’ve arranged them as riddles to make them more interesting. Try to figure out the answer before reading the end of each section!
I’ve also included quick common table expressions (CTEs) to generate the tables in each example, so you don’t need to try querying your company’s production tables! But to get really comfortable with SQL, I actually recommend creating your own database and tables to play with. Check out this post to learn how.
Note that all queries are in Postgres - you may get different results in a different dialect. Finally, an obligatory note that the actual data and topics in each query are just illustrative examples. 🙂
Riddle 1: Timestamp Specificity
Imagine we have a table called purchases
with purchase IDs, amounts, and times the purchase were made. Let’s say it looks like this:
id | amount | dt |
---|---|---|
1 | 0.99 | 2023-02-15 00:00:00 |
2 | 9.99 | 2023-02-15 07:15:00 |
3 | 15.99 | 2023-02-15 23:01:15 |
4 | 7.99 | 2023-02-16 14:22:09 |
As a CTE, this would look something like this. Note that we need to specify that the dt
column format is a timestamp so it isn’t interpreted as a string. We also only need to specify the data types for one of the rows; the rest are inferred.
SQL |
1
2
3
4
5
6
7
8
WITH purchases(id, amount, dt) AS (
VALUES
(1::bigint, 0.99::float, '2023-02-15 00:00:00 GMT'::timestamp),
(2, 9.99, '2023-02-15 07:15:00 GMT'),
(3, 15.99, '2023-02-15 23:01:15 GMT'),
(4, 7.99, '2023-02-16 14:22:09 GMT')
)
...
Now let’s calculate the sum of purchases made on Feb 15. We can write a query like the one below:
SQL |
1
2
3
4
5
6
...
SELECT
SUM(amount) AS sum
FROM purchases
WHERE
dt = '2023-02-15'
We mysteriously receive the following response.
sum |
---|
0.99 |
What happened? There were three purchases made on Feb 15: IDs 1, 2, and 3. The sum should therefore be $26.97. Instead, only the first purchase was counted.
Hint
If you change the filter to 2023-02-16
, no rows are returned.
Answer
The dt
column is a timestamp that includes both date and time. Our WHERE
filter only specifies the date. Rather than rejecting this query, Postgres automatically reformats the date string to 2023-02-15 00:00:00
. This matches only the first transaction in the table, so we’re therefore taking only the sum of one row.
If we wanted to select all rows corresponding to Feb 15, we should first cast the timestamp to date.
SQL |
1
2
3
4
5
SELECT
SUM(amount) AS sum
FROM purchases
WHERE
DATE(dt) = '2023-02-15'
We now get the expected result.
sum |
---|
26.97 |
Riddle 2: Dependent vs. independent filters
Alright, next riddle. We have a table called users
, and our goal is to remove all rows that meet any one of three conditions. In the table below, for example, let’s say that we want to only return tenured and active users, i.e., ones who have logged in during the last 28 days, have posted before, and are not a new account.
id | no_login_l28 | has_never_posted | is_new_account |
---|---|---|---|
1 | True | True | True |
2 | True | True | False |
3 | True | False | True |
4 | True | False | False |
5 | False | True | True |
6 | False | True | False |
7 | False | False | True |
8 | False | False | False |
In other words, we want our query to only user 8, who has False values for no_login_l28
, has_never_posted
, and is_new_account
.
Let’s start with the top of our query.
SQL |
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
WITH users(id, no_login_l28, has_never_posted, is_new_account) AS (
VALUES
(1, True, True, True),
(2, True, True, False),
(3, True, False, True),
(4, True, False, False),
(5, False, True, True),
(6, False, True, False),
(7, False, False, True),
(8, False, False, False)
)
SELECT
id
FROM users
WHERE
...
How should we structure the WHERE
clause of our query? Think for a minute – we need to be careful not to return rows where any of the columns is False
.
When you’re ready, take a look at the options below. Two are correct and two are wrong.
Option 1: Multiple AND NOT
SQL |
1
2
3
4
WHERE
NOT no_login_l28
AND NOT has_never_posted
AND NOT is_new_account
Option 2: Multiple OR NOT
SQL |
1
2
3
4
WHERE
NOT no_login_l28
OR NOT has_never_posted
OR NOT is_new_account
Option 3: NOT
+ grouped OR
SQL |
1
2
3
4
5
6
WHERE
NOT (
no_login_l28
OR has_never_posted
OR is_new_account
)
Option 4: NOT
+ grouped AND
SQL |
1
2
3
4
5
6
WHERE
NOT (
no_login_l28
AND has_never_posted
AND is_new_account
)
Hint
When are conditions in a filter evaluated separately versus together? If they’re evaluated together, can we condense all conditions down to one True
or False
value?
Answer
Option 1. This one tripped me up a bit. A data scientist on my team submitted a PR with this filter, which I was convinced would pull in rows 2-7 because the query would only remove users with False
values for all three columns. But to my surprise, Option 1 actually works because the three filters are evaluated independently. ✅
Option 2. This was the filter I initially thought was correct, since I didn’t realize the filters would be evaluated independently. But this filter will actually return users 2-8, since anyone who has at least one True
for no_login_l28
, has_never_posted
, and is_new_account
will be allowed through. ❌
Option 3. This was how I initially thought the filter needed to be worded. If a user has True
for any of no_login_l28
, has_never_posted
, or is_new_account
, then lines 3-5 evaluate to True
, the NOT
flips this to False
, and those rows are ultimately excluded. This indeed works, and I find this easier to understand than Option 1, but both are valid. ✅
Option 4. This returns the same incorrect result as Option 2. Lines 3-5 evaluate to True
only for user 1, meaning that when we flip the boolean with NOT
, all remaining users are pulled through. ❌
Riddle 3: Left joins acting like inner joins
Take a look at the query below. We have two tables, customers
and reviews
. customers
contains customer IDs and their lifetime dollars spent on the platform.
id | total_spend |
---|---|
100 | 1583.49 |
200 | 8739.03 |
300 | 431.00 |
400 | 1.00 |
500 | 22.27 |
reviews
contains information about reviews left by customers: the review ID, customer ID, and whether the review was reported as spam.
id | customer_id | reported_as_spam |
---|---|---|
1 | 100 | False |
2 | 100 | False |
3 | 400 | True |
4 | 400 | True |
5 | 500 | False |
Here’s the subquery to generate the two CTEs:
SQL |
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
WITH customers(id, total_spend) AS (
VALUES
(100, 1583.49),
(200, 8739.03),
(300, 431.00),
(400, 1.00),
(500, 22.27)
),
reviews(id, customer_id, reported_as_spam) AS (
VALUES
(1, 100, False),
(2, 100, False),
(3, 400, True),
(4, 400, True),
(5, 500, False)
)
...
Now let’s say we’re curious about the relationship between a customer’s total spend and the number of non-spam reviews they write. Since not each customer has left a review, we’ll want to left join reviews
to customers
. We can structure our query like this:
SQL |
1
2
3
4
5
6
7
8
9
10
11
12
13
14
...
SELECT
c.id,
c.total_spend,
COALESCE(COUNT(r.id), 0) AS n_reviews
FROM customers c
LEFT JOIN reviews r
ON c.id = r.customer_id
WHERE
NOT r.reported_as_spam
GROUP BY
1, 2
ORDER BY
1
Ready? Here’s what comes out.
id | total_spend | n_reviews |
---|---|---|
100 | 1583.49 | 2 |
500 | 22.27 | 1 |
Wait a minute. Where did users 200, 300, and 400 go? Why were they removed, and how can we bring them back in?
Hint
If you create a CTE for reviews
with spam reviews filtered out, then join on this CTE, do we get the same result?
Answer
Looking closely, we can see that users 200 and 300 have never left any reviews. 400 only has spam reviews, but they were completely removed as well. Since we did a left join, these users should still be in the table and have a 0 for n_reviews
. Instead, our left join behaved like an inner join.
The issue, it turns out, is that WHERE
clauses are evaluated after joins. Our left join brings in null values for reported_as_spam
for users 200 and 300. The WHERE
filter then removes all rows where reported_as_spam
is True, which removes user 400. However, this filter also removes null values, so users 200 and 300 are also removed.
To do this properly, we need to pre-filter reviews
before joining with customers
. As the hint states, we can create a CTE for reviews
and perform the filtering there. But more efficiently, let’s perform the filtering within the join.
We can do this by adding AND NOT r.reported_as_spam
to the LEFT JOIN
block. See below:
SQL |
1
2
3
4
5
6
7
8
9
10
11
12
13
...
SELECT
c.id,
c.total_spend,
COALESCE(COUNT(r.id), 0) AS n_reviews
FROM customers c
LEFT JOIN reviews r
ON c.id = r.customer_id
AND NOT r.reported_as_spam
GROUP BY
1, 2
ORDER BY
1
Now we get the expected result.
id | total_spend | n_reviews |
---|---|---|
100 | 1583.49 | 2 |
200 | 8739.03 | 0 |
300 | 431.00 | 0 |
400 | 1.00 | 0 |
500 | 22.27 | 1 |
Conclusions
This post shared three SQL wrinkles that can lead to unexpected results: timestamp specificity, dependent versus independent filters, and left joins acting like inner joins. I specifically provided simple examples to keep the focus on the syntax, but you’ll likely encounter SQL nuances like these nestled within large, complex queries.
These bugs can be incredibly challenging to identify, especially for queries with many components. Whenever I’m confused by a result, I try to break the query into its pieces and verify each component’s result. But when in doubt, write some simple CTEs with test data and confirm the results do what you expect.
Happy querying!
Matt