There are some SQL patterns that, once you recognize them, you start seeing everywhere. The solutions to the puzzles I’ll show you today are actually very simple SQL queries, but understanding the concept behind them will surely unlock new solutions to the queries you write on a day-to-day basis.
These challenges are all based on real-world scenarios, as over the past few months I made a point of writing down every puzzle-like query that I had to build. I also encourage you to try them yourself before reading the solutions, since challenging yourself first will improve your learning!
All queries to generate the datasets are provided in a PostgreSQL- and DuckDB-friendly syntax, so that you can easily copy and play with them. At the end I will also give you a link to a GitHub repo containing all of the code, as well as the answer to the bonus challenge I’ll leave for you!
I organized these puzzles in order of increasing difficulty, so if you find the first ones too easy, at least take a look at the last one, which uses a technique that I truly believe you won’t have seen before.
Okay, let’s get started.
I love this puzzle because of how short and simple the final query is, even though it deals with many edge cases. The data for this challenge shows tickets moving between Kanban stages, and the objective is to find how long, on average, tickets stay in the Doing stage.
The data contains the ID of the ticket, the date the ticket was created, the date of the move, and the “from” and “to” stages of the move. The stages present are New, Doing, Evaluation, and Accomplished.
Some things you need to know (edge cases):
- Tickets can move backwards, meaning tickets can go back to the Doing stage.
- You shouldn’t include tickets that are still stuck in the Doing stage, as there is no way to know how long they will stay there.
- Tickets aren’t always created in the New stage.
CREATE TABLE ticket_moves (
    ticket_id INT NOT NULL,
    create_date DATE NOT NULL,
    move_date DATE NOT NULL,
    from_stage TEXT NOT NULL,
    to_stage TEXT NOT NULL
);
INSERT INTO ticket_moves (ticket_id, create_date, move_date, from_stage, to_stage)
VALUES
    -- Ticket 1: Created in "New", then moves to Doing, Evaluation, Accomplished.
    (1, '2024-09-01', '2024-09-03', 'New', 'Doing'),
    (1, '2024-09-01', '2024-09-07', 'Doing', 'Evaluation'),
    (1, '2024-09-01', '2024-09-10', 'Evaluation', 'Accomplished'),
    -- Ticket 2: Created in "New", then moves: New → Doing → Evaluation → Doing again → Evaluation.
    (2, '2024-09-05', '2024-09-08', 'New', 'Doing'),
    (2, '2024-09-05', '2024-09-12', 'Doing', 'Evaluation'),
    (2, '2024-09-05', '2024-09-15', 'Evaluation', 'Doing'),
    (2, '2024-09-05', '2024-09-20', 'Doing', 'Evaluation'),
    -- Ticket 3: Created in "New", then moves to Doing. (Edge case: no subsequent move out of Doing.)
    (3, '2024-09-10', '2024-09-16', 'New', 'Doing'),
    -- Ticket 4: Created already in "Doing", then moves to Evaluation.
    (4, '2024-09-15', '2024-09-22', 'Doing', 'Evaluation');
A summary of the data:
- Ticket 1: Created in the New stage, moves normally to Doing, then Evaluation, and then Accomplished.
- Ticket 2: Created in New, then moves: New → Doing → Evaluation → Doing again → Evaluation.
- Ticket 3: Created in New, moves to Doing, but it’s still stuck there.
- Ticket 4: Created in the Doing stage, moves to Evaluation afterward.
It might be a good idea to stop for a bit and think about how you would deal with this. Can you find out how long a ticket stays in a single stage?
Honestly, this sounds intimidating at first, and it looks like it will be a nightmare to deal with all the edge cases. Let me show you the full solution to the problem, and then I’ll explain what is happening afterward.
WITH stage_intervals AS (
    SELECT
        ticket_id,
        from_stage,
        move_date
        - COALESCE(
            LAG(move_date) OVER (
                PARTITION BY ticket_id
                ORDER BY move_date
            ),
            create_date
        ) AS days_in_stage
    FROM
        ticket_moves
)
SELECT
    SUM(days_in_stage) / COUNT(DISTINCT ticket_id) AS avg_days_in_doing
FROM
    stage_intervals
WHERE
    from_stage = 'Doing';

The first CTE uses the LAG function to find the previous move of the ticket, which is the time the ticket entered that stage. Calculating the duration is as simple as subtracting the previous date from the move date.
What you should notice is the use of COALESCE around the previous move date. If a ticket doesn’t have a previous move, it falls back to the creation date of the ticket. This takes care of tickets that were created directly in the Doing stage, as the query will still properly calculate the time it took them to leave the stage.
This is the result of the first CTE, showing the time spent in each stage. Notice how Ticket 2 has two entries, as it visited the Doing stage on two separate occasions.

With this done, it’s just a matter of getting the average as the SUM of total days spent in Doing, divided by the distinct number of tickets that ever left the stage. Doing it this way, instead of simply using AVG, makes sure that the two rows for Ticket 2 are properly accounted for as a single ticket.
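If you want to check the numbers by hand, here is a minimal sketch in plain Python that mirrors the query on the sample data. It is not part of the original solution; the function name `doing_intervals` and the tuple layout are my own assumptions for illustration.

```python
from datetime import date
from itertools import groupby

# Same rows as the INSERT above: (ticket_id, create_date, move_date, from_stage, to_stage)
moves = [
    (1, date(2024, 9, 1), date(2024, 9, 3), "New", "Doing"),
    (1, date(2024, 9, 1), date(2024, 9, 7), "Doing", "Evaluation"),
    (1, date(2024, 9, 1), date(2024, 9, 10), "Evaluation", "Accomplished"),
    (2, date(2024, 9, 5), date(2024, 9, 8), "New", "Doing"),
    (2, date(2024, 9, 5), date(2024, 9, 12), "Doing", "Evaluation"),
    (2, date(2024, 9, 5), date(2024, 9, 15), "Evaluation", "Doing"),
    (2, date(2024, 9, 5), date(2024, 9, 20), "Doing", "Evaluation"),
    (3, date(2024, 9, 10), date(2024, 9, 16), "New", "Doing"),
    (4, date(2024, 9, 15), date(2024, 9, 22), "Doing", "Evaluation"),
]

def doing_intervals(rows):
    """Return (ticket_id, days) for every completed stay in the Doing stage."""
    result = []
    ordered = sorted(rows, key=lambda r: (r[0], r[2]))
    for ticket_id, group in groupby(ordered, key=lambda r: r[0]):
        previous_move = None  # plays the role of LAG(move_date)
        for _, created, moved, from_stage, _to_stage in group:
            # COALESCE: fall back to the creation date on the first move
            entered = previous_move if previous_move is not None else created
            if from_stage == "Doing":
                result.append((ticket_id, (moved - entered).days))
            previous_move = moved
    return result

intervals = doing_intervals(moves)
total_days = sum(days for _, days in intervals)
distinct_tickets = {ticket_id for ticket_id, _ in intervals}

avg_days_in_doing = total_days / len(distinct_tickets)  # SUM / COUNT(DISTINCT)
naive_avg = total_days / len(intervals)                 # what a plain AVG would give
```

On this data the stays are 4, 4 + 5, and 7 days, so the per-ticket average is 20 / 3, while a plain AVG over the four rows would give 5. One thing to watch: in PostgreSQL, dividing two integers truncates, so the query above would return 6 there unless one side is cast to a decimal type; DuckDB's `/` operator does floating-point division, as far as I know.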
Not so bad, right?
The goal of this second challenge is to find the latest contract sequence of every employee. A break of sequence happens when two contracts have a gap of more than one day between them.
In this dataset, there are no contract overlaps, meaning that a contract for the same employee either has a gap or ends a day before the new one starts.
CREATE TABLE contracts (
    contract_id integer PRIMARY KEY,
    employee_id integer NOT NULL,
    start_date date NOT NULL,
    end_date date NOT NULL
);
INSERT INTO contracts (contract_id, employee_id, start_date, end_date)
VALUES
    -- Employee 1: Two continuous contracts
    (1, 1, '2024-01-01', '2024-03-31'),
    (2, 1, '2024-04-01', '2024-06-30'),
    -- Employee 2: One contract, then a gap of three days, then two contracts
    (3, 2, '2024-01-01', '2024-02-15'),
    (4, 2, '2024-02-19', '2024-04-30'),
    (5, 2, '2024-05-01', '2024-07-31'),
    -- Employee 3: One contract
    (6, 3, '2024-03-01', '2024-08-31');

As a summary of the data:
- Employee 1: Has two continuous contracts.
- Employee 2: One contract, then a gap of three days, then two contracts.
- Employee 3: One contract.
The expected result, given the dataset, is that all contracts should be included except for the first contract of Employee 2, which is the only one followed by a gap.
Before explaining the logic behind the solution, I would like you to think about what operation can be used to join the contracts that belong to the same sequence. Focus only on the second row of data: what information do you need in order to know whether this contract was a break or not?
I hope it’s clear that this is the perfect situation for window functions, again. They are incredibly useful for solving problems like this, and understanding when to use them helps a lot in finding clean solutions.
The first thing to do, then, is to get the end date of the previous contract for the same employee with the LAG function. With that, it’s simple to compare both dates and check whether there was a break of sequence.
WITH ordered_contracts AS (
    SELECT
        *,
        LAG(end_date) OVER (PARTITION BY employee_id ORDER BY start_date) AS previous_end_date
    FROM
        contracts
),
gapped_contracts AS (
    SELECT
        *,
        -- Deals with the case of the first contract, which won't have
        -- a previous end date. In this case, it's still the start of a new
        -- sequence.
        CASE WHEN previous_end_date IS NULL
            -- A gap of more than one day also starts a new sequence.
            OR previous_end_date < start_date - INTERVAL '1 day' THEN
            1
        ELSE
            0
        END AS is_new_sequence
    FROM
        ordered_contracts
)
SELECT * FROM gapped_contracts ORDER BY employee_id ASC;

An intuitive way to continue the query is to number the sequences of each employee. For example, an employee who has no gap will always be on their first sequence, but an employee who had five breaks in contracts will be on their sixth sequence, since the very first contract already counts as sequence one. Funnily enough, this is done by another window function.
--
-- Previous CTEs
--
sequences AS (
    SELECT
        *,
        SUM(is_new_sequence) OVER (PARTITION BY employee_id ORDER BY start_date) AS sequence_id
    FROM
        gapped_contracts
)
SELECT * FROM sequences ORDER BY employee_id ASC;

Notice how Employee 2 starts sequence #2 right after the first gap. To finish this query, I grouped the data by employee, got the value of their most recent sequence, and then did an inner join with the sequences to keep only the latest one.
--
-- Previous CTEs
--
max_sequence AS (
    SELECT
        employee_id,
        MAX(sequence_id) AS max_sequence_id
    FROM
        sequences
    GROUP BY
        employee_id
),
latest_contract_sequence AS (
    SELECT
        c.contract_id,
        c.employee_id,
        c.start_date,
        c.end_date
    FROM
        sequences c
        JOIN max_sequence m ON c.sequence_id = m.max_sequence_id
            AND c.employee_id = m.employee_id
    ORDER BY
        c.employee_id,
        c.start_date
)
SELECT
    *
FROM
    latest_contract_sequence;

As expected, our final result is basically our starting dataset, just with the first contract of Employee 2 missing!
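The whole pipeline (flag the breaks, number the sequences with a running sum, keep the highest sequence) can also be traced in a few lines of plain Python. This is only a sketch to cross-check the expected result; the function name `latest_sequence_contracts` is hypothetical and not taken from the original queries.

```python
from datetime import date
from itertools import accumulate, groupby

# Same rows as the INSERT above: (contract_id, employee_id, start_date, end_date)
contracts = [
    (1, 1, date(2024, 1, 1), date(2024, 3, 31)),
    (2, 1, date(2024, 4, 1), date(2024, 6, 30)),
    (3, 2, date(2024, 1, 1), date(2024, 2, 15)),
    (4, 2, date(2024, 2, 19), date(2024, 4, 30)),
    (5, 2, date(2024, 5, 1), date(2024, 7, 31)),
    (6, 3, date(2024, 3, 1), date(2024, 8, 31)),
]

def latest_sequence_contracts(rows):
    """Return the contract ids that belong to each employee's latest sequence."""
    kept = []
    ordered = sorted(rows, key=lambda r: (r[1], r[2]))
    for _employee_id, group in groupby(ordered, key=lambda r: r[1]):
        employee_rows = list(group)
        flags = []
        previous_end = None  # plays the role of LAG(end_date)
        for _cid, _emp, start, end in employee_rows:
            # New sequence on the first contract or after a gap of more than one day
            is_new = previous_end is None or (start - previous_end).days > 1
            flags.append(1 if is_new else 0)
            previous_end = end
        # Running SUM of the flags gives the sequence number of each contract
        sequence_ids = list(accumulate(flags))
        latest = sequence_ids[-1]  # ids never decrease, so the last one is the MAX
        kept.extend(
            row[0] for row, seq in zip(employee_rows, sequence_ids) if seq == latest
        )
    return kept

latest_ids = latest_sequence_contracts(contracts)
```

For Employee 2 the flags are 1, 1, 0, so the sequence ids are 1, 2, 2 and only contracts 4 and 5 survive, which matches the result described above.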
Finally, the last puzzle. I’m glad you made it this far.
For me, this is the most mind-blowing one, as when I first encountered this problem I thought of a completely different solution that would have been a mess to implement in SQL.
For this puzzle, I’ve changed the context from what I had to deal with at my job, as I think it will make it easier to explain.
Imagine you’re a data analyst at an event venue, and you’re analyzing the talks scheduled for an upcoming event. You want to find the time of day when there will be the highest number of talks happening at the same time.
This is what you should know about the schedules:
- Rooms are booked in increments of 30min, e.g. from 9h-10h30.
- The data is clean; there are no overbookings of meeting rooms.
- There can be back-to-back meetings in a single meeting room.

Meeting schedule visualized (this is the actual data).
CREATE TABLE conferences (
    room TEXT NOT NULL,
    start_time TIMESTAMP NOT NULL,
    end_time TIMESTAMP NOT NULL
);
INSERT INTO conferences (room, start_time, end_time) VALUES
    -- Room A meetings
    ('Room A', '2024-10-01 09:00', '2024-10-01 10:00'),
    ('Room A', '2024-10-01 10:00', '2024-10-01 11:00'),
    ('Room A', '2024-10-01 11:00', '2024-10-01 12:00'),
    -- Room B meetings
    ('Room B', '2024-10-01 09:30', '2024-10-01 11:30'),
    -- Room C meetings
    ('Room C', '2024-10-01 09:00', '2024-10-01 10:00'),
    ('Room C', '2024-10-01 11:30', '2024-10-01 12:00');

The way to solve this is to use what is called a Sweep Line Algorithm, also known as an event-based solution. This last name actually helps to understand what will be done: instead of dealing with intervals, which is what we have in the original data, we deal with events.
To do this, we need to transform every row into two separate events. The first event will be the start of the meeting, and the second event will be the end of the meeting.
WITH occasions AS (
    -- Create an event for the start of each meeting (+1)
    SELECT
        start_time AS event_time,
        1 AS delta
    FROM conferences
    UNION ALL
    -- Create an event for the end of each meeting (-1)
    SELECT
        -- Small trick to work with the back-to-back meetings (explained later)
        end_time - interval '1 minute' AS event_time,
        -1 AS delta
    FROM conferences
)
SELECT * FROM occasions;

Take the time to understand what is happening here. To create two events from a single row of data, we’re simply unioning the dataset with itself; the first half uses the start time as the timestamp, and the second half uses the end time.
You might already have noticed the delta column and see where this is going. When a meeting starts, we count it as +1; when it ends, we count it as -1. You might even be already thinking of another window function to solve this, and you’re actually right!
But before that, let me just explain the trick I used on the end dates. As I don’t want back-to-back meetings to count as two concurrent meetings, I’m subtracting a single minute from every end date. This way, if a meeting ends and another starts at 10h30, it won’t be assumed that two meetings are happening concurrently at 10h30.
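As a side note, subtracting a minute is not the only way to handle back-to-back meetings. A different technique, which is not the one used in this post, is to keep the real end time and break ties so that end events are processed before start events at the same timestamp. Here is a small Python sketch of that idea; `max_concurrent` is a hypothetical helper written for illustration only.

```python
from datetime import datetime

def max_concurrent(intervals):
    """Peak number of overlapping (start, end) intervals, ends processed first on ties."""
    events = []
    for start, end in intervals:
        events.append((start, 1))   # meeting starts: +1
        events.append((end, -1))    # meeting ends: -1
    # Tuples sort by time first, then by delta, so -1 comes before +1 at the same time
    events.sort()
    current = 0
    best = 0
    for _time, delta in events:
        current += delta
        best = max(best, current)
    return best

# Two meetings in a row: one ends at 10:00, the next starts at 10:00
back_to_back = [
    (datetime(2024, 10, 1, 9, 0), datetime(2024, 10, 1, 10, 0)),
    (datetime(2024, 10, 1, 10, 0), datetime(2024, 10, 1, 11, 0)),
]
# Two meetings that really do overlap between 9:30 and 10:00
overlapping = [
    (datetime(2024, 10, 1, 9, 0), datetime(2024, 10, 1, 10, 0)),
    (datetime(2024, 10, 1, 9, 30), datetime(2024, 10, 1, 10, 30)),
]

peak_back_to_back = max_concurrent(back_to_back)
peak_overlapping = max_concurrent(overlapping)
```

Both approaches give the same answer on data booked in 30-minute increments; the one-minute subtraction simply keeps the SQL ordering simpler.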
Okay, back to the query and yet another window function. This time, though, the function of choice is a rolling SUM.
--
-- Previous CTEs
--
ordered_events AS (
    SELECT
        event_time,
        delta,
        SUM(delta) OVER (ORDER BY event_time, delta DESC) AS concurrent_meetings
    FROM occasions
)
SELECT * FROM ordered_events ORDER BY event_time DESC;

The rolling SUM on the delta column is essentially walking down every record and finding how many meetings are active at that time. For example, at 9 am sharp, it sees two meetings starting, so it marks the number of concurrent meetings as two!
When the third meeting starts, at 9h30, the count goes up to three. But when it gets to 9h59 (standing in for 10 am), two meetings end, bringing the counter back to one. With this data, the only thing missing is to find when the highest value of concurrent meetings happens.
--
-- Previous CTEs
--
max_events AS (
    -- Find the maximum concurrent meetings value
    SELECT
        event_time,
        concurrent_meetings,
        RANK() OVER (ORDER BY concurrent_meetings DESC) AS rnk
    FROM ordered_events
)
SELECT event_time, concurrent_meetings
FROM max_events
WHERE rnk = 1;

That’s it! The interval of 9h30–10h is the one with the largest number of concurrent meetings, which checks out with the schedule visualization above!
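To see the sweep end to end outside the database, here is a minimal Python sketch that follows the same steps on the sample schedule: build the +1/-1 events, sort them, keep a running total, and pick the peak. The helper `at` and the variable names are assumptions made for this example.

```python
from datetime import datetime, timedelta
from itertools import accumulate

def at(hhmm):
    """Build a timestamp on the day of the event from an 'HH:MM' string."""
    return datetime.strptime("2024-10-01 " + hhmm, "%Y-%m-%d %H:%M")

# Same rows as the INSERT above: (room, start_time, end_time)
meetings = [
    ("Room A", at("09:00"), at("10:00")),
    ("Room A", at("10:00"), at("11:00")),
    ("Room A", at("11:00"), at("12:00")),
    ("Room B", at("09:30"), at("11:30")),
    ("Room C", at("09:00"), at("10:00")),
    ("Room C", at("11:30"), at("12:00")),
]

# One +1 event per start, one -1 event per end (minus a minute, as in the query)
events = [(start, 1) for _room, start, _end in meetings]
events += [(end - timedelta(minutes=1), -1) for _room, _start, end in meetings]

# Equivalent of ORDER BY event_time, delta DESC
events.sort(key=lambda event: (event[0], -event[1]))

# Row-by-row running total. The SQL window gives tied rows the same total, but
# the maximum is the same here because starts and shifted ends never coincide.
running = list(accumulate(delta for _time, delta in events))

peak = max(running)
peak_times = [time for (time, _delta), count in zip(events, running) if count == peak]
```

The running total on this data goes 1, 2, 3, 2, 1, 2, 1, 2, 1, 2, 1, 0, so the peak of three concurrent meetings starts at 9h30 and lasts until the two 10h endings are processed.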
This solution looks incredibly simple in my opinion, and it works for so many situations. Every time you are dealing with intervals from now on, you should consider whether the query wouldn’t be easier if you thought about it from the perspective of events.
But before you move on, and to really nail down this concept, I want to leave you with a bonus challenge, which is also a common application of the Sweep Line Algorithm. I hope you give it a try!
Bonus challenge
The context for this one is still the same as in the last puzzle, but now, instead of looking for the period with the most concurrent meetings, the objective is to find bad scheduling. It seems that there are overlaps in the meeting rooms, which need to be listed so they can be fixed ASAP.
How would you find out if the same meeting room has two or more meetings booked at the same time? Here are some tips on how to solve it:
- It’s still the same algorithm.
- This means you will still do the UNION, but it will look slightly different.
- You should think from the perspective of each meeting room.
You can use this data for the challenge:
CREATE TABLE meetings_overlap (
    room TEXT NOT NULL,
    start_time TIMESTAMP NOT NULL,
    end_time TIMESTAMP NOT NULL
);
INSERT INTO meetings_overlap (room, start_time, end_time) VALUES
    -- Room A meetings
    ('Room A', '2024-10-01 09:00', '2024-10-01 10:00'),
    ('Room A', '2024-10-01 10:00', '2024-10-01 11:00'),
    ('Room A', '2024-10-01 11:00', '2024-10-01 12:00'),
    -- Room B meetings
    ('Room B', '2024-10-01 09:30', '2024-10-01 11:30'),
    -- Room C meetings
    ('Room C', '2024-10-01 09:00', '2024-10-01 10:00'),
    -- Overlaps with the previous meeting.
    ('Room C', '2024-10-01 09:30', '2024-10-01 12:00');
If you’re interested in the solution to this puzzle, as well as the rest of the queries, check this GitHub repo.
The first takeaway from this blog post is that window functions are overpowered. Ever since I got more comfortable with using them, I feel that my queries have become much simpler and easier to read, and I hope the same happens to you.
If you’re interested in learning more about them, you’d probably enjoy reading this other blog post I’ve written, where I go over how you can understand and use them effectively.
The second takeaway is that the patterns used in these challenges really do show up in many other places. You might need to find sequences of subscriptions, measure customer retention, or find overlapping tasks. There are many situations in which you will need to use window functions in a very similar fashion to what was done in the puzzles.
The third thing I want you to remember is this solution of using events instead of dealing with intervals. I’ve looked back at some problems I solved a long time ago where this pattern would have made my life easier, and unfortunately, I didn’t know about it at the time.
I really do hope you enjoyed this post and gave the puzzles a shot yourself. And I’m sure that if you made it this far, you either learned something new about SQL or strengthened your knowledge of window functions!
Thank you so much for reading. If you have questions or just want to get in touch with me, don’t hesitate to contact me at mtrentz.com.
All images by the author unless stated otherwise.