In this puzzle, we’re going to learn about joins versus subqueries. In many queries you can substitute joins and subqueries. Yet, since each has their strengths, it isn’t wise to do so. Once you have worked through this puzzle you see some of the benefits and disadvantages of joins versus subqueries.
Solving puzzles is a great way to learn SQL. Nothing beats practicing what you’ve learned. Once you have figured out the puzzle, post you answer in the comments so we all can learn from one another. We also discuss puzzle and more in Essential SQL Learning Group on Facebook. Be sure to find us there!
SQL Puzzle Question – Joins versus Subqueries
The system admin would like a report of active users. Can you write a report that returns the name, phone number and most recent date for any user that has logged in the 30 days since March14th, 2017?
(you can tell a user has logged in if the action field in UserHistory is set to “Logged On”).
Every time a user logs in a new row is inserted into the UserHistory table with userID, current date and action (where action = “Logged On”).
Question #1 – Write a SQL query without using Subqueries.
Question #2 – Write the SQL using subqueries.
As you answer the questions think about the differences of joins versus subqueries. Which technique is better suited to solve the problem? Can you see advantages to either case?
To get you started, download this script which contains the table definitions and some test data.
Answer to Question #1 – INNER JOIN
Like most queries I write, I took this one step by step. Let me show you what I mean by walking you through my thought process.
Regardless of how this query turned out, I knew I would have to figure out how to get the logins 30 days since 2017-03-14. To do this I wrote a simple query that used the DATEDIFF function to calculate the number of days from the actionDate to the 14th. If the days were less than or equal to 30, I knew the entry was within the window.
SELECT *, DATEDIFF(DAY, actionDate, '2017-03-14') FROM @UserHistory WHERE action = 'Logged On' AND DATEDIFF(DAY, actionDate, '2017-03-14') <= 30;
You’ll see that I included the result of the DATEDIFF in query result. I didn’t this because I always get the parameter mixed up! By displaying the result, I could verify that I had the formula correct; therefore, being used in my WHERE clause as expected.
Here are is what I saw:
Now that I knew I could get actionDate within the window, I proceeded to join the UserHistory with User so I could include the PhoneNumber in my results. I joined the tables by userID.
SELECT U.Name, U.PhoneNumber, UH.actionDate AS RecentLogonDate FROM @User AS U INNER JOIN @UserHistory AS UH ON U.userID = UH.userID WHERE action = 'Logged On' AND DATEDIFF(DAY, actionDate, '2017-03-14') <= 30;
Given there are several UserHistory entries per user, I would expect to see duplicates. I’ve highlighted some of those below:
The last task which remains is to only show the most recent login date for each user. For Bob, this would be 2017-03-02.
Notice, that for each user, the most recent date is the maxim date for that user. This is a big hint, as it points us towards using aggregate functions such as MAX.
So, at this point we can use GROUP BY to calculation the maximum date for each user.
SELECT U.Name, U.PhoneNumber, MAX(UH.actionDate) AS RecentLogonDate FROM @User AS U INNER JOIN @UserHistory AS UH ON U.userID = UH.userID WHERE action = 'Logged On' AND DATEDIFF(DAY, actionDate, '2017-03-14') <= 30 GROUP BY U.Name, U.phoneNumber;
Whose result is
There are a couple of takeaways from this:
- First, here is no shame is doing a query to get intermediate results. When using formulas, such as DATEDIFF, make sure you’re getting the calculation you expect.
- Second, it can take more than one technique to solve a problem. In our case we had first join the items, and once we had the result use GROUP BY to get the most recent date.
- Last, keep it simple! At first I thought I would have to do another join so I could compare dates, but I realized that was going to get complicated. Also, I figured I could partition the data and use a window function, but that seemed like overkill. Once I realized I simply had to calculate the MAX, I realized GROUP BY would work.
I’m sure there are other ways to solve this puzzle. So, what answer did you get? Please share in the comments.
Answer to Question #2 – Subquery
Let’s look at the second piece of the puzzle when it comes to joins versus subqueries; the subquery.
This query is elegant in that IN operator naturally remove duplicates, but gets ugly as another sub query is needed to pull in the most recent date.
What you’ll see is that subqueries are really great when you need to return rows from one table based on the existence of one or more conditions in another, but not so swift at combining data from one table with another.
To show you what I mean, check out this query which shows the user and phone number for an actionDate within the windows:
SELECT U.Name, U.PhoneNumber FROM @User AS U WHERE U.UserID IN (SELECT UserID FROM @UserHistory WHERE action = 'Logged On' AND DATEDIFF(DAY, actionDate, '2017-03-14') <= 30);
In this case the we naturally get a unique list of name and phone numbers as we are listing users.
The uniqueness is guaranteed by the fact the table naturally contains unique value (yes, I know I didn’t define a Primary Key on the table, but hopefully you see it could be one).
The subquery in the WHERE clause compiles a list of userID’s whose actionDate falls within the 30-day window.
Of course, our query is only partially written, as it doesn’t display the latest actionDate.
If I could dream a bit (the dreamy bits are in blue) I would love to be able to write a query like:
SELECT U.Name, U.PhoneNumber, (SELECT Max(UH.actionDate) FROM @UserHistory AS UH WHERE action = 'Logged On' AND DATEDIFF(DAY, actionDate, '2017-03-14') <= 30 AND U.userID = UH.userID) AS RecentLogonDate FROM @User AS U WHERE DATEDIFF(DAY, RecentLogonDate, '2017-03-14') <= 30 AND RecentLogonDate IS NOT NULL
But this isn’t a properly written query, and SQL gets mad!
Msg 207, Level 16, State 1, Line 62 Invalid column name 'RecentLogonDate'. Msg 207, Level 16, State 1, Line 63 Invalid column name 'RecentLogonDate'.
What makes our subquery ugly is being forced to use two subqueries to retrieve the most recent actionDate. To do this you must essentially repeat the query, but now as a correlated sub query. The correlation is show in red. It just the match of the User table UserID to that in UserHistory.
SELECT U.Name, U.PhoneNumber, (SELECT Max(UH.actionDate) FROM @UserHistory AS UH WHERE action = 'Logged On' AND DATEDIFF(DAY, actionDate, '2017-03-14') <= 30 AND U.userID = UH.userID) AS RecentLogonDate FROM @User AS U WHERE U.UserID IN (SELECT UserID FROM @UserHistory WHERE action = 'Logged On' AND DATEDIFF(DAY, actionDate, '2017-03-14') <= 30);
Now that you’ve seen the differences of joins versus subqueries I’m hoping you have a better understanding on when to use one versus the other. To summarize, joins excel at combining data from two tables, subqueries are best when testing for the existence of a value from one table found in another.
What are your experiences with joins versus subqueries? Which do you prefer to use when? Please share in the comments!