There are some Sql patterns that, as soon as you understand them, you begin seeing them in every single place. The options to the puzzles that I’ll present you at present are literally quite simple SQL queries, however understanding the idea behind them will certainly unlock new options to the queries you write on a day-to-day foundation.
These challenges are all based mostly on real-world eventualities, as over the previous few months I made a degree of writing down each puzzle-like question that I needed to construct. I additionally encourage you to attempt them for your self, in an effort to problem your self first, which is able to enhance your studying!
All queries to generate the datasets might be offered in a PostgreSQL and DuckDB-friendly syntax, in an effort to simply copy and play with them. On the finish I may also present you a hyperlink to a GitHub repo containing all of the code, in addition to the reply to the bonus problem I’ll go away for you!
I organized these puzzles so as of accelerating issue, so, for those who discover the primary ones too simple, not less than check out the final one, which makes use of a method that I actually consider you received’t have seen earlier than.
Okay, let’s get began.
I like this puzzle due to how brief and easy the ultimate question is, although it offers with many edge circumstances. The information for this problem exhibits tickets transferring in between Kanban phases, and the target is to seek out how lengthy, on common, tickets keep within the Doing stage.
The information incorporates the ID of the ticket, the date the ticket was created, the date of the transfer, and the “from” and “to” phases of the transfer. The phases current are New, Doing, Assessment, and Carried out.
Some issues you’ll want to know (edge circumstances):
- Tickets can transfer backwards, that means tickets can return to the Doing stage.
- You shouldn’t embody tickets which can be nonetheless caught within the Doing stage, as there isn’t a strategy to know the way lengthy they’ll keep there for.
- Tickets are usually not all the time created within the New stage.
CREATE TABLE ticket_moves (
ticket_id INT NOT NULL,
create_date DATE NOT NULL,
move_date DATE NOT NULL,
from_stage TEXT NOT NULL,
to_stage TEXT NOT NULL
);
INSERT INTO ticket_moves (ticket_id, create_date, move_date, from_stage, to_stage)
VALUES
-- Ticket 1: Created in "New", then strikes to Doing, Assessment, Carried out.
(1, '2024-09-01', '2024-09-03', 'New', 'Doing'),
(1, '2024-09-01', '2024-09-07', 'Doing', 'Assessment'),
(1, '2024-09-01', '2024-09-10', 'Assessment', 'Carried out'),
-- Ticket 2: Created in "New", then strikes: New → Doing → Assessment → Doing once more → Assessment.
(2, '2024-09-05', '2024-09-08', 'New', 'Doing'),
(2, '2024-09-05', '2024-09-12', 'Doing', 'Assessment'),
(2, '2024-09-05', '2024-09-15', 'Assessment', 'Doing'),
(2, '2024-09-05', '2024-09-20', 'Doing', 'Assessment'),
-- Ticket 3: Created in "New", then strikes to Doing. (Edge case: no subsequent transfer from Doing.)
(3, '2024-09-10', '2024-09-16', 'New', 'Doing'),
-- Ticket 4: Created already in "Doing", then strikes to Assessment.
(4, '2024-09-15', '2024-09-22', 'Doing', 'Assessment');
A abstract of the info:
- Ticket 1: Created within the New stage, strikes usually to Doing, then Assessment, after which Carried out.
- Ticket 2: Created in New, then strikes: New → Doing → Assessment → Doing once more → Assessment.
- Ticket 3: Created in New, strikes to Doing, however it’s nonetheless caught there.
- Ticket 4: Created within the Doing stage, strikes to Assessment afterward.
It is likely to be a good suggestion to cease for a bit and suppose how you’ll cope with this. Are you able to learn how lengthy a ticket stays on a single stage?
Actually, this sounds intimidating at first, and it appears to be like like it is going to be a nightmare to cope with all the sting circumstances. Let me present you the total answer to the issue, after which I’ll clarify what is going on afterward.
WITH stage_intervals AS (
SELECT
ticket_id,
from_stage,
move_date
- COALESCE(
LAG(move_date) OVER (
PARTITION BY ticket_id
ORDER BY move_date
),
create_date
) AS days_in_stage
FROM
ticket_moves
)
SELECT
SUM(days_in_stage) / COUNT(DISTINCT ticket_id) as avg_days_in_doing
FROM
stage_intervals
WHERE
from_stage = 'Doing';

The primary CTE makes use of the LAG perform to seek out the earlier transfer of the ticket, which would be the time the ticket entered that stage. Calculating the period is so simple as subtracting the earlier date from the transfer date.
What you must discover is the usage of the COALESCE within the earlier transfer date. What that does is that if a ticket doesn’t have a earlier transfer, then it makes use of the date of creation of the ticket. This takes care of the circumstances of tickets being created instantly into the Doing stage, because it nonetheless will correctly calculate the time it took to go away the stage.
That is the results of the primary CTE, exhibiting the time spent in every stage. Discover how the Ticket 2 has two entries, because it visited the Doing stage in two separate events.

With this executed, it’s only a matter of getting the typical because the SUM of complete days spent in doing, divided by the distinct variety of tickets that ever left the stage. Doing it this manner, as an alternative of merely utilizing the AVG, makes positive that the 2 rows for Ticket 2 get correctly accounted for as a single ticket.
Not so dangerous, proper?
The purpose of this second problem is to discover the newest contract sequence of each worker. A break of sequence occurs when two contracts have a spot of greater than sooner or later between them.
On this dataset, there are not any contract overlaps, that means {that a} contract for a similar worker both has a spot or ends a day earlier than the brand new one begins.
CREATE TABLE contracts (
contract_id integer PRIMARY KEY,
employee_id integer NOT NULL,
start_date date NOT NULL,
end_date date NOT NULL
);
INSERT INTO contracts (contract_id, employee_id, start_date, end_date)
VALUES
-- Worker 1: Two steady contracts
(1, 1, '2024-01-01', '2024-03-31'),
(2, 1, '2024-04-01', '2024-06-30'),
-- Worker 2: One contract, then a spot of three days, then two contracts
(3, 2, '2024-01-01', '2024-02-15'),
(4, 2, '2024-02-19', '2024-04-30'),
(5, 2, '2024-05-01', '2024-07-31'),
-- Worker 3: One contract
(6, 3, '2024-03-01', '2024-08-31');

As a abstract of the info:
- Worker 1: Has two steady contracts.
- Worker 2: One contract, then a spot of three days, then two contracts.
- Worker 3: One contract.
The anticipated end result, given the dataset, is that each one contracts ought to be included aside from the primary contract of Worker 2, which is the one one which has a spot.
Earlier than explaining the logic behind the answer, I would love you to consider what operation can be utilized to hitch the contracts that belong to the identical sequence. Focus solely on the second row of information, what data do you’ll want to know if this contract was a break or not?
I hope it’s clear that that is the proper state of affairs for window features, once more. They’re extremely helpful for fixing issues like this, and understanding when to make use of them helps quite a bit to find clear options to issues.
Very first thing to do, then, is to get the tip date of the earlier contract for a similar worker with the LAG perform. Doing that, it’s easy to match each dates and examine if it was a break of sequence.
WITH ordered_contracts AS (
SELECT
*,
LAG(end_date) OVER (PARTITION BY employee_id ORDER BY start_date) AS previous_end_date
FROM
contracts
),
gapped_contracts AS (
SELECT
*,
-- Offers with the case of the primary contract, which will not have
-- a earlier finish date. On this case, it is nonetheless the beginning of a brand new
-- sequence.
CASE WHEN previous_end_date IS NULL
OR previous_end_date < start_date - INTERVAL '1 day' THEN
1
ELSE
0
END AS is_new_sequence
FROM
ordered_contracts
)
SELECT * FROM gapped_contracts ORDER BY employee_id ASC;

An intuitive strategy to proceed the question is to quantity the sequences of every worker. For instance, an worker who has no hole, will all the time be on his first sequence, however an worker who had 5 breaks in contracts might be on his fifth sequence. Funnily sufficient, that is executed by one other window perform.
--
-- Earlier CTEs
--
sequences AS (
SELECT
*,
SUM(is_new_sequence) OVER (PARTITION BY employee_id ORDER BY start_date) AS sequence_id
FROM
gapped_contracts
)
SELECT * FROM sequences ORDER BY employee_id ASC;

Discover how, for Worker 2, he begins his sequence #2 after the primary gapped worth. To complete this question, I grouped the info by worker, received the worth of their most up-to-date sequence, after which did an inside be a part of with the sequences to maintain solely the newest one.
--
-- Earlier CTEs
--
max_sequence AS (
SELECT
employee_id,
MAX(sequence_id) AS max_sequence_id
FROM
sequences
GROUP BY
employee_id
),
latest_contract_sequence AS (
SELECT
c.contract_id,
c.employee_id,
c.start_date,
c.end_date
FROM
sequences c
JOIN max_sequence m ON c.sequence_id = m.max_sequence_id
AND c.employee_id = m.employee_id
ORDER BY
c.employee_id,
c.start_date
)
SELECT
*
FROM
latest_contract_sequence;

As anticipated, our ultimate result’s principally our beginning question simply with the primary contract of Worker 2 lacking!
Lastly, the final puzzle — I’m glad you made it this far.
For me, that is probably the most mind-blowing one, as after I first encountered this drawback I considered a very completely different answer that will be a large number to implement in SQL.
For this puzzle, I’ve modified the context from what I needed to cope with for my job, as I feel it can make it simpler to elucidate.
Think about you’re a knowledge analyst at an occasion venue, and also you’re analyzing the talks scheduled for an upcoming occasion. You need to discover the time of day the place there would be the highest variety of talks occurring on the similar time.
That is what you must know concerning the schedules:
- Rooms are booked in increments of 30min, e.g. from 9h-10h30.
- The information is clear, there are not any overbookings of assembly rooms.
- There will be back-to-back conferences in a single assembly room.

Assembly schedule visualized (that is the precise knowledge).
CREATE TABLE conferences (
room TEXT NOT NULL,
start_time TIMESTAMP NOT NULL,
end_time TIMESTAMP NOT NULL
);
INSERT INTO conferences (room, start_time, end_time) VALUES
-- Room A conferences
('Room A', '2024-10-01 09:00', '2024-10-01 10:00'),
('Room A', '2024-10-01 10:00', '2024-10-01 11:00'),
('Room A', '2024-10-01 11:00', '2024-10-01 12:00'),
-- Room B conferences
('Room B', '2024-10-01 09:30', '2024-10-01 11:30'),
-- Room C conferences
('Room C', '2024-10-01 09:00', '2024-10-01 10:00'),
('Room C', '2024-10-01 11:30', '2024-10-01 12:00');

The way in which to resolve that is utilizing what known as a Sweep Line Algorithm, or often known as an event-based answer. This final identify truly helps to know what might be executed, as the thought is that as an alternative of coping with intervals, which is what we’ve within the authentic knowledge, we cope with occasions as an alternative.
To do that, we have to remodel each row into two separate occasions. The primary occasion would be the Begin of the assembly, and the second occasion would be the Finish of the assembly.
WITH occasions AS (
-- Create an occasion for the beginning of every assembly (+1)
SELECT
start_time AS event_time,
1 AS delta
FROM conferences
UNION ALL
-- Create an occasion for the tip of every assembly (-1)
SELECT
-- Small trick to work with the back-to-back conferences (defined later)
end_time - interval '1 minute' as end_time,
-1 AS delta
FROM conferences
)
SELECT * FROM occasions;

Take the time to know what is going on right here. To create two occasions from a single row of information, we’re merely unioning the dataset on itself; the primary half makes use of the beginning time because the timestamp, and the second half makes use of the tip time.
You would possibly already discover the delta column created and see the place that is going. When an occasion begins, we rely it as +1, when it ends, we rely it as -1. You would possibly even be already considering of one other window perform to resolve this, and also you’re truly proper!
However earlier than that, let me simply clarify the trick I used in the long run dates. As I don’t need back-to-back conferences to rely as two concurrent conferences, I’m subtracting a single minute of each finish date. This manner, if a gathering ends and one other begins at 10h30, it received’t be assumed that two conferences are concurrently occurring at 10h30.
Okay, again to the question and yet one more window perform. This time, although, the perform of alternative is a rolling SUM.
--
-- Earlier CTEs
--
ordered_events AS (
SELECT
event_time,
delta,
SUM(delta) OVER (ORDER BY event_time, delta DESC) AS concurrent_meetings
FROM occasions
)
SELECT * FROM ordered_events ORDER BY event_time DESC;

The rolling SUM on the Delta column is basically strolling down each file and discovering what number of occasions are lively at the moment. For instance, at 9 am sharp, it sees two occasions beginning, so it marks the variety of concurrent conferences as two!
When the third assembly begins, the rely goes as much as three. However when it will get to 9h59 (10 am), then two conferences finish, bringing the counter again to 1. With this knowledge, the one factor lacking is to seek out when the best worth of concurrent conferences occurs.
--
-- Earlier CTEs
--
max_events AS (
-- Discover the utmost concurrent conferences worth
SELECT
event_time,
concurrent_meetings,
RANK() OVER (ORDER BY concurrent_meetings DESC) AS rnk
FROM ordered_events
)
SELECT event_time, concurrent_meetings
FROM max_events
WHERE rnk = 1;

That’s it! The interval of 9h30–10h is the one with the most important variety of concurrent conferences, which checks out with the schedule visualization above!
This answer appears to be like extremely easy for my part, and it really works for thus many conditions. Each time you might be coping with intervals now, you must suppose if the question wouldn’t be simpler if you considered it within the perspective of occasions.
However earlier than you progress on, and to essentially nail down this idea, I need to go away you with a bonus problem, which can also be a standard utility of the Sweep Line Algorithm. I hope you give it a attempt!
Bonus problem
The context for this one remains to be the identical because the final puzzle, however now, as an alternative of looking for the interval when there are most concurrent conferences, the target is to seek out dangerous scheduling. Evidently there are overlaps within the assembly rooms, which should be listed so it may be mounted ASAP.
How would you discover out if the identical assembly room has two or extra conferences booked on the similar time? Listed here are some tips about remedy it:
- It’s nonetheless the identical algorithm.
- This implies you’ll nonetheless do the UNION, however it can look barely completely different.
- It is best to suppose within the perspective of every assembly room.
You should utilize this knowledge for the problem:
CREATE TABLE meetings_overlap (
room TEXT NOT NULL,
start_time TIMESTAMP NOT NULL,
end_time TIMESTAMP NOT NULL
);
INSERT INTO meetings_overlap (room, start_time, end_time) VALUES
-- Room A conferences
('Room A', '2024-10-01 09:00', '2024-10-01 10:00'),
('Room A', '2024-10-01 10:00', '2024-10-01 11:00'),
('Room A', '2024-10-01 11:00', '2024-10-01 12:00'),
-- Room B conferences
('Room B', '2024-10-01 09:30', '2024-10-01 11:30'),
-- Room C conferences
('Room C', '2024-10-01 09:00', '2024-10-01 10:00'),
-- Overlaps with earlier assembly.
('Room C', '2024-10-01 09:30', '2024-10-01 12:00');
Should you’re within the answer to this puzzle, in addition to the remainder of the queries, examine this GitHub repo.
The primary takeaway from this weblog put up is that window features are overpowered. Ever since I received extra comfy with utilizing them, I really feel that my queries have gotten a lot easier and simpler to learn, and I hope the identical occurs to you.
Should you’re considering studying extra about them, you’ll in all probability take pleasure in studying this other blog post I’ve written, the place I’m going over how one can perceive and use them successfully.
The second takeaway is that these patterns used within the challenges actually do occur in lots of different locations. You would possibly want to seek out sequences of subscriptions, buyer retention, otherwise you would possibly want to seek out overlap of duties. There are various conditions when you will want to make use of window features in a really comparable vogue to what was executed within the puzzles.
The third factor I need you to recollect is about this answer to utilizing occasions in addition to coping with intervals. I’ve checked out some issues I solved a very long time in the past that I may’ve used this sample on to make my life simpler, and sadly, I didn’t find out about it on the time.
I actually do hope you loved this put up and gave a shot to the puzzles your self. And I’m positive that for those who made it this far, you both realized one thing new about SQL or strengthened your information of window features!
Thanks a lot for studying. In case you have questions or simply need to get in contact with me, don’t hesitate to contact me at mtrentz.com.
All pictures by the writer except said in any other case.