552 CHAPTER 24: REGIONS, RUNS, GAPS, SEQUENCES, AND SERIES

       (1004, 'N'),
       (1003, 'Y'),
       (1002, 'Y'),
       (1001, 'Y'),
       (1000, 'N');

The results we want assign a grouping number to each run of on-time/late payments, thus:

    Results
    grping  payment_nbr  paid_on_time
    =================================
    1       1006         'Y'
    1       1005         'Y'
    2       1004         'N'
    3       1003         'Y'
    3       1002         'Y'
    3       1001         'Y'
    4       1000         'N'

A solution by Hugo Kornelis depends on the payments always being numbered consecutively.

    SELECT (SELECT COUNT(*)
              FROM PaymentHistory AS H2,
                   PaymentHistory AS H3
             WHERE H3.payment_nbr = H2.payment_nbr + 1
               AND H3.paid_on_time <> H2.paid_on_time
               AND H2.payment_nbr >= H1.payment_nbr) + 1 AS grping,
           payment_nbr, paid_on_time
      FROM PaymentHistory AS H1;

This can be modified for more types of behavior.

24.3 Finding Regions of Maximum Size

A query to find a region, rather than a subregion of a known size, of seats was presented in SQL Forum (Rozenshtein, Abramovich, and Birger 1993).

    SELECT T1.seat_nbr, ' thru ', T2.seat_nbr
      FROM Theater AS T1, Theater AS T2
     WHERE T1.seat_nbr < T2.seat_nbr
       AND NOT EXISTS
           (SELECT *
              FROM Theater AS T3
             WHERE (T3.seat_nbr BETWEEN T1.seat_nbr AND T2.seat_nbr
                    AND T3.occupancy_status <> 'A')
                OR (T3.seat_nbr = T2.seat_nbr + 1
                    AND T3.occupancy_status = 'A')
                OR (T3.seat_nbr = T1.seat_nbr - 1
                    AND T3.occupancy_status = 'A'));

The trick here is to look for the starting and ending seats in the region. The starting seat_nbr of a region is to the right of a sold seat_nbr, and the ending seat_nbr is to the left of a sold seat_nbr. No seat_nbr between the start and the end has been sold.

If you only keep the available seat_nbrs in a table, the solution is a bit easier.
It is also a more general problem that applies to any table of sequential, possibly noncontiguous, data:

    CREATE TABLE AvailableSeating
    (seat_nbr INTEGER NOT NULL
       CONSTRAINT valid_seat_nbr
       CHECK (seat_nbr BETWEEN 001 AND 999));

    INSERT INTO AvailableSeating
    VALUES (199), (200), (201), (202), (204),
           (210), (211), (212), (214), (218);

You need to create a result that will show the start and finish values of each sequence in the table, thus:

    Results
    start  finish
    =============
    199    202
    204    204
    210    212
    214    214
    218    218

This is a common way of finding the missing values in a sequence of tickets sold, unaccounted-for invoices, and so forth. Imagine a number line with closed dots for the numbers that are in the table and open dots for the numbers that are not. What do you see about a sequence? Well, we can start with a fact that anyone who has done inventory knows: the number of elements in a sequence is equal to the ending sequence number minus the starting sequence number plus one. This is a basic property of ordinal numbers:

    (finish - start + 1) = Length of open seats

This tells us that we need to have a self-JOIN with two copies of the table, one for the starting value and one for the ending value of each sequence. Once we have those two items, we can compute the length with our formula and see if it is equal to the count of the items between the start and finish.

    SELECT S1.seat_nbr, MAX(S2.seat_nbr)  -- start and rightmost item
      FROM AvailableSeating AS S1
           INNER JOIN
           AvailableSeating AS S2         -- self-join
           ON S1.seat_nbr <= S2.seat_nbr
              AND (S2.seat_nbr - S1.seat_nbr + 1)  -- formula for length
                  = (SELECT COUNT(*)               -- items in the sequence
                       FROM AvailableSeating AS S3
                      WHERE S3.seat_nbr BETWEEN S1.seat_nbr
                                            AND S2.seat_nbr)
              AND NOT EXISTS
                  (SELECT *
                     FROM AvailableSeating AS S4
                    WHERE S1.seat_nbr - 1 = S4.seat_nbr)
     GROUP BY S1.seat_nbr;

Finally, we need to be sure that we have the furthest item to the right as the end item.
Each sequence of n items has n subsequences that all start with the same item. So we finally do a GROUP BY on the starting item and use a MAX() to get the rightmost value.

However, there is a faster version with three tables. This solution is based on another property of the longest possible sequences. If you look to the right of the last item, you do not find anything. Likewise, if you look to the left of the first item, you do not find anything either. These missing items that are “just over the border” define a sequence by framing it. There also cannot be any “gaps” (missing items) inside those borders. That translates into SQL as:

    SELECT S1.seat_nbr, MIN(S2.seat_nbr)  -- start and leftmost border
      FROM AvailableSeating AS S1, AvailableSeating AS S2
     WHERE S1.seat_nbr <= S2.seat_nbr
       AND NOT EXISTS  -- border items of the sequence
           (SELECT *
              FROM AvailableSeating AS S3
             WHERE S3.seat_nbr NOT BETWEEN S1.seat_nbr AND S2.seat_nbr
               AND (S3.seat_nbr = S1.seat_nbr - 1
                    OR S3.seat_nbr = S2.seat_nbr + 1))
     GROUP BY S1.seat_nbr;

We do not have to worry about getting the rightmost item in the sequence, but we do have to worry about getting the leftmost border. Once we do a GROUP BY, we use a MIN() to get what we want.

Since the second approach uses only three copies of the original table, it should be a bit faster. Also, the EXISTS() predicates can often take advantage of indexing and thus run faster than subquery expressions, which require a table scan.

Michel Walsh came up with two novel ways of getting the range of seat numbers that have been used in the table. He saw that the difference between the value and its rank is a constant for all values in the same consecutive sequence, so we just have to group, and count, on the value minus its rank to get the various consecutive runs (or just keep the maximum).
It is so simple, an example will show everything:

    data = {1, 2, 5, 6, 7, 8, 9, 11, 12, 22}

    data  rank  (data - rank) AS absent
    ===================================
     1     1     0
     2     2     0
     5     3     2
     6     4     2
     7     5     2
     8     6     2
     9     7     2
    11     8     3
    12     9     3
    22    10    12

    absent  COUNT(*)
    ================
     0      2
     2      5
     3      2
    12      1

As you can see, the longest contiguous sequence has 5 values (the rows having (data - rank) = 2). Rank is defined as how many values are less than or equal to the actual value, with the assumption of a set of integers without repeated values. This is the query:

    SELECT X.absent, COUNT(*)
      FROM (SELECT my_data,
                   (SELECT COUNT(*)
                      FROM Foobar AS F2
                     WHERE F2.my_data <= F1.my_data),
                   F1.my_data
                   - (SELECT COUNT(*)
                        FROM Foobar AS F2
                       WHERE F2.my_data <= F1.my_data)
              FROM Foobar AS F1) AS X(my_data, data_rank, absent)
     GROUP BY X.absent;

Playing with this basic idea, Mr. Walsh came up with this second query.

    SELECT MIN(Z.seat_nbr), MAX(Z.seat_nbr)
      FROM (SELECT S1.seat_nbr,
                   S1.seat_nbr
                   - (SELECT COUNT(*)
                        FROM AvailableSeating AS S2
                       WHERE S2.seat_nbr <= S1.seat_nbr)
              FROM AvailableSeating AS S1) AS Z (seat_nbr, dif_rank)
     GROUP BY Z.dif_rank;

The derived table computes, for each seat_nbr, the difference between the seat number and its rank. That difference is the same for every seat in a block of consecutive seats, so it is used to form the groups.

24.4 Bound Queries

Another form of query asks whether there is an overall trend between two points in time bounded by a low value and a high value in the sequence of data. This is easier to show with an example. Let us assume that we have data on the selling prices of a stock in a table. We want to find periods of time when the price was generally increasing. Consider this data:

    MyStock
    sale_date     price
    ===================
    '2006-12-01'  10.00
    '2006-12-02'  15.00
    '2006-12-03'  13.00
    '2006-12-04'  12.00
    '2006-12-05'  20.00

The stock was generally increasing in all the periods that began on December 1 or ended on December 5; that is, it finished higher at the ends of those periods, in spite of the slump in the middle.
A query for this problem is:

    SELECT S1.sale_date AS start_date,
           S2.sale_date AS finish_date
      FROM MyStock AS S1, MyStock AS S2
     WHERE S1.sale_date < S2.sale_date
       AND NOT EXISTS
           (SELECT *
              FROM MyStock AS S3
             WHERE S3.sale_date BETWEEN S1.sale_date AND S2.sale_date
               AND S3.price NOT BETWEEN S1.price AND S2.price);

24.5 Run and Sequence Queries

Runs are informally defined as sequences with gaps. That is, we have a set of unique numbers whose order has some meaning, but the numbers are not all consecutive. Time series information in which the samples are taken at irregular intervals is an example of this sort of data. Runs can be constructed in the same manner as the sequences by making a minor change in the search condition. Let’s do these queries with an abstract table made up of a sequence number and a value:

    CREATE TABLE Runs
    (seq_nbr INTEGER NOT NULL PRIMARY KEY,
     val INTEGER NOT NULL);

    Runs
    seq_nbr  val
    ============
     1        6
     2       41
     3       12
     4       51
     5       21
     6       70
     7       79
     8       62
     9       30
    10       31
    11       32
    12       34
    13       35
    14       57
    15       19
    16       84
    17       80
    18       90
    19       63
    20       53
    21        3
    22       59
    23       69
    24       27
    25       33

One problem is that we do not want to get back all the runs and sequences of length one. Ideally, the length n of the run should be adjustable. This query will find runs of length n or greater; if you want runs of exactly n, change the “greater than” sign to an equal sign.

    SELECT R1.seq_nbr AS start_seq_nbr,
           R2.seq_nbr AS end_seq_nbr
      FROM Runs AS R1, Runs AS R2
     WHERE R1.seq_nbr < R2.seq_nbr                   -- start and end points
       AND (R2.seq_nbr - R1.seq_nbr) > (:n - 1)      -- length restrictions
       AND NOT EXISTS                                -- ordering within the end points
           (SELECT *
              FROM Runs AS R3, Runs AS R4
             WHERE R4.seq_nbr BETWEEN R1.seq_nbr AND R2.seq_nbr
               AND R3.seq_nbr BETWEEN R1.seq_nbr AND R2.seq_nbr
               AND R3.seq_nbr < R4.seq_nbr
               AND R3.val > R4.val);

This query sets up the R1 sequence number as the starting point and the R2 sequence number as the ending point of the run. The monster subquery in the NOT EXISTS() predicate is looking for a pair of rows inside the run that violates the ordering of the run. If there is none, the run is valid. The best way to understand what is happening is to draw a linear diagram. This shows that as the ordering (seq_nbr) increases, so must the corresponding values (val).

A sequence has the additional restriction that every value increases by one as you scan the run from left to right. This means that in a sequence, the highest value minus the lowest value, plus one, is the length of the sequence.

    SELECT R1.seq_nbr AS start_seq_nbr,
           R2.seq_nbr AS end_seq_nbr
      FROM Runs AS R1, Runs AS R2
     WHERE R1.seq_nbr < R2.seq_nbr
       AND (R2.seq_nbr - R1.seq_nbr) = (R2.val - R1.val)  -- order condition
       AND (R2.seq_nbr - R1.seq_nbr) > (:n - 1)           -- length restrictions
       AND NOT EXISTS
           (SELECT *
              FROM Runs AS R3
             WHERE R3.seq_nbr BETWEEN R1.seq_nbr AND R2.seq_nbr
               AND ((R3.seq_nbr - R1.seq_nbr) <> (R3.val - R1.val)
                    OR (R2.seq_nbr - R3.seq_nbr) <> (R2.val - R3.val)));

The subquery in the NOT EXISTS predicate says that there is no point in between the start and the end of the sequence that violates the ordering condition.
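
To see what the run and sequence queries return on the sample Runs data, they can be executed against an in-memory database. The following is a minimal sketch, assuming SQLite (through Python's standard sqlite3 module) as the SQL engine; the helper name find_spans and the ORDER BY clauses are additions for illustration and are not part of the queries in the text. Note that the length predicate compares the difference in seq_nbr with (n - 1), so with n = 3 the end point must be at least three positions past the start.

```python
import sqlite3

# Sample (seq_nbr, val) rows copied from the Runs table in the text.
RUNS = [(1, 6), (2, 41), (3, 12), (4, 51), (5, 21), (6, 70), (7, 79),
        (8, 62), (9, 30), (10, 31), (11, 32), (12, 34), (13, 35),
        (14, 57), (15, 19), (16, 84), (17, 80), (18, 90), (19, 63),
        (20, 53), (21, 3), (22, 59), (23, 69), (24, 27), (25, 33)]

# Run query: no pair of rows between the end points is out of order.
RUN_QUERY = """
SELECT R1.seq_nbr AS start_seq_nbr, R2.seq_nbr AS end_seq_nbr
  FROM Runs AS R1, Runs AS R2
 WHERE R1.seq_nbr < R2.seq_nbr
   AND (R2.seq_nbr - R1.seq_nbr) > (:n - 1)
   AND NOT EXISTS
       (SELECT *
          FROM Runs AS R3, Runs AS R4
         WHERE R4.seq_nbr BETWEEN R1.seq_nbr AND R2.seq_nbr
           AND R3.seq_nbr BETWEEN R1.seq_nbr AND R2.seq_nbr
           AND R3.seq_nbr < R4.seq_nbr
           AND R3.val > R4.val)
 ORDER BY start_seq_nbr, end_seq_nbr
"""

# Sequence query: every value must also increase by exactly one per step.
SEQUENCE_QUERY = """
SELECT R1.seq_nbr AS start_seq_nbr, R2.seq_nbr AS end_seq_nbr
  FROM Runs AS R1, Runs AS R2
 WHERE R1.seq_nbr < R2.seq_nbr
   AND (R2.seq_nbr - R1.seq_nbr) = (R2.val - R1.val)
   AND (R2.seq_nbr - R1.seq_nbr) > (:n - 1)
   AND NOT EXISTS
       (SELECT *
          FROM Runs AS R3
         WHERE R3.seq_nbr BETWEEN R1.seq_nbr AND R2.seq_nbr
           AND ((R3.seq_nbr - R1.seq_nbr) <> (R3.val - R1.val)
                OR (R2.seq_nbr - R3.seq_nbr) <> (R2.val - R3.val)))
 ORDER BY start_seq_nbr, end_seq_nbr
"""


def find_spans(query, n):
    """Load the sample data and return (start, end) pairs for parameter n."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Runs "
                 "(seq_nbr INTEGER NOT NULL PRIMARY KEY, "
                 " val INTEGER NOT NULL)")
    conn.executemany("INSERT INTO Runs VALUES (?, ?)", RUNS)
    rows = conn.execute(query, {"n": n}).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    print(find_spans(RUN_QUERY, 3))
    print(find_spans(SEQUENCE_QUERY, 2))
```

On this data the only stretch long enough for n = 3 is seq_nbr 9 through 14 (values 30, 31, 32, 34, 35, 57), so the run query reports that span together with its qualifying sub-spans. The sequence query with n = 2 reports only the span from 9 to 11.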

Obviously, any of these queries can be changed from increasing to decreasing, or from strictly increasing to simply increasing or simply decreasing, and so on, by changing the comparison predicates. You can also change the query for finding sequences in a table by altering the size of the step from 1 to k, by observing that the difference between the starting value and the ending value should be k times the difference between the starting position and the ending position.

24.5.1 Filling in Sequence Numbers

A fair number of SQL programmers want to reuse a sequence of numbers for keys. While I do not approve of the practice of generating a meaningless, unverifiable key after the creation of an entity, the problem of inserting missing numbers is interesting. The usual specifications are:

1. Begin the sequence with one, if it is missing or the table is empty.

2. Reuse the lowest missing number first.

3. Do not exceed some maximum value; if the sequence is full, then give us a warning or a NULL. Another option is to give us (MAX(seq_nbr) + 1), so we can add to the high end of the list.

This answer is a good example of thinking in terms of sets rather than doing row-at-a-time processing.

    SELECT MIN(new_seq_nbr)
      FROM (SELECT CASE
                   WHEN (seq_nbr + 1) NOT IN (SELECT seq_nbr FROM List)
                   THEN (seq_nbr + 1)
                   WHEN (seq_nbr - 1) NOT IN (SELECT seq_nbr FROM List)
                   THEN (seq_nbr - 1)
                   WHEN 1 NOT IN (SELECT seq_nbr FROM List)
                   THEN 1
                   ELSE NULL END
              FROM List
             WHERE seq_nbr BETWEEN 1
                               AND (SELECT MAX(seq_nbr) FROM List))
           AS P(new_seq_nbr);

The idea is to build a table expression of some of the missing values, then pick the minimum one. The starting value, one, is treated as an exception. Since an aggregate function cannot take a query expression as a parameter, we have to use a derived table.
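
The three specifications listed above can be checked with a small experiment. The following is a minimal sketch, not the query from the text: it assumes SQLite (through Python's standard sqlite3 module), assumes that all seq_nbr values are positive, adds a sentinel row of 0 to the derived table so that 1 becomes a candidate even when the table is empty, and uses a hypothetical host parameter :max_seq_nbr for the upper limit. The function name next_seq_nbr is likewise made up for illustration.

```python
import sqlite3

# Candidates are "one past" every existing number, plus one past the
# sentinel 0 (an assumption of this sketch, so that 1 can be returned).
LOWEST_MISSING = """
SELECT MIN(P.seq_nbr + 1)
  FROM (SELECT 0 AS seq_nbr          -- sentinel row, not in List
        UNION ALL
        SELECT seq_nbr FROM List) AS P
 WHERE (P.seq_nbr + 1) NOT IN (SELECT seq_nbr FROM List)
   AND (P.seq_nbr + 1) <= :max_seq_nbr
"""


def next_seq_nbr(existing, max_seq_nbr):
    """Return the lowest unused positive number, or None if the list is full."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE List (seq_nbr INTEGER NOT NULL PRIMARY KEY)")
    conn.executemany("INSERT INTO List VALUES (?)", [(s,) for s in existing])
    # MIN() over zero rows yields NULL, which sqlite3 maps to None.
    (result,) = conn.execute(LOWEST_MISSING,
                             {"max_seq_nbr": max_seq_nbr}).fetchone()
    conn.close()
    return result


if __name__ == "__main__":
    print(next_seq_nbr([], 10))            # rule 1: empty table
    print(next_seq_nbr([2, 3], 10))        # rule 1: one is missing
    print(next_seq_nbr([1, 2, 3, 5], 10))  # rule 2: lowest gap
    print(next_seq_nbr([1, 2, 3], 3))      # rule 3: sequence is full
```

The sentinel keeps the whole test inside one derived table and one MIN(), at the cost of assuming that zero never appears as a real seq_nbr.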

Along the same lines, we can use aggregate functions in a CASE expression:

    SELECT CASE
           WHEN MAX(seq_nbr) = COUNT(*)
           THEN CAST(NULL AS INTEGER)
           -- THEN MAX(seq_nbr) + 1 as other option
           WHEN MIN(seq_nbr) > 1
           THEN 1
           WHEN MAX(seq_nbr) <> COUNT(*)
           THEN (SELECT MIN(seq_nbr) + 1
                   FROM List
                  WHERE (seq_nbr) + 1
                        NOT IN (SELECT seq_nbr FROM List))
           ELSE NULL END
      FROM List;

The first WHEN clause sees whether the table is already full and returns a NULL; the NULL must be cast as an INTEGER to become an expression that can then be used in the THEN clause. However, you might want to increment the list by the next value.

The second WHEN clause looks to see whether the minimum sequence number is one or not. If it is not, it uses one as the next value.

The third WHEN clause handles the situation when there is a gap in the middle of the sequence. It picks the lowest missing number. The ELSE clause is in case of errors and should not be executed.

The order of execution in the CASE expression is important. It is a way of forcing an inspection of the table’s values from front to back. Simpler methods based on group characteristics would be:

    SELECT COALESCE(MIN(L1.seq_nbr) + 1, 1)
      FROM List AS L1
           LEFT OUTER JOIN
           List AS L2
           ON L1.seq_nbr = L2.seq_nbr - 1
     WHERE L2.seq_nbr IS NULL;
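
The outer-join query that uses COALESCE() and MIN() is short enough to try on a few small lists. This is a minimal sketch, assuming SQLite (through Python's standard sqlite3 module) as the engine; the helper name first_gap is hypothetical. The query keeps each row of List that has no successor, takes the smallest such seq_nbr plus one, and falls back to 1 only when no row qualifies, which happens when the table is empty.

```python
import sqlite3

# The outer-join query, with the SQL unchanged apart from layout.
FIRST_GAP = """
SELECT COALESCE(MIN(L1.seq_nbr) + 1, 1)
  FROM List AS L1
       LEFT OUTER JOIN
       List AS L2
       ON L1.seq_nbr = L2.seq_nbr - 1
 WHERE L2.seq_nbr IS NULL
"""


def first_gap(existing):
    """Run the outer-join query against a List table holding `existing`."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE List (seq_nbr INTEGER NOT NULL PRIMARY KEY)")
    conn.executemany("INSERT INTO List VALUES (?)", [(s,) for s in existing])
    (result,) = conn.execute(FIRST_GAP).fetchone()
    conn.close()
    return result


if __name__ == "__main__":
    print(first_gap([]))            # empty table: COALESCE supplies 1
    print(first_gap([1, 2, 3, 5]))  # gap at 4
    print(first_gap([1, 2, 3]))     # no gap: next number past the end
    print(first_gap([2, 3]))        # one is missing, but 4 is reported
```

The last call shows a design limit of this form: it only looks to the right of existing rows, so a non-empty list that does not start at one gets the first gap above its lowest run rather than one.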