CHAPTER 24: REGIONS, RUNS, GAPS, SEQUENCES, AND SERIES

24.10 Coverings

    AND I3.n > 1);

    SELECT DISTINCT C1.x, C1.y
      FROM Cover AS C1
     WHERE NOT EXISTS
           (SELECT *
              FROM Cover AS C2
             WHERE C2.x <= C1.x
               AND C2.y >= C1.y
               AND (C1.x <> C2.x OR C1.y <> C2.y))
     ORDER BY C1.x;

Finally, try this approach. Assume we have the usual Sequence auxiliary table. Now we find all the holes in the range of the intervals and put them in a VIEW or a WITH clause–derived table.

    CREATE VIEW Holes (hole)
    AS
    SELECT seq_nbr
      FROM Sequence
     WHERE seq_nbr <= (SELECT MAX(y) FROM Intervals)
       AND NOT EXISTS
           (SELECT *
              FROM Intervals
             WHERE seq_nbr BETWEEN x AND y)
    UNION VALUES (0) -- left sentinel value
    UNION (SELECT MAX(y) + 1 FROM Intervals); -- right sentinel value

The query picks start and end pairs that are on the edge of a hole and counts the number of holes inside that range. A covering has no holes inside its range.

    SELECT Starts.x, Ends.y
      FROM Intervals AS Starts,
           Intervals AS Ends,
           Sequence AS S -- usual auxiliary table
     WHERE S.seq_nbr BETWEEN Starts.x AND Ends.y -- restrict seq_nbr numbers
       AND S.seq_nbr < (SELECT MAX(hole) FROM Holes)
       AND S.seq_nbr NOT IN (SELECT hole FROM Holes) -- not a hole
       AND Starts.x - 1 IN (SELECT hole FROM Holes) -- on a left cusp
       AND Ends.y + 1 IN (SELECT hole FROM Holes) -- on a right cusp
     GROUP BY Starts.x, Ends.y
    HAVING COUNT(DISTINCT seq_nbr) = Ends.y - Starts.x + 1; -- no holes

CHAPTER 25
Arrays in SQL

Arrays cannot be represented directly in SQL-92, but they are a common vendor language extension that became part of SQL-99. Arrays violate the rules of First Normal Form (1NF) required for a relational database, which say that the tables have no repeating groups in any column. A repeating group is a data structure that is not scalar; examples of repeating groups include linked lists, arrays, records, and even tables within a column. The reason they are not allowed is that a repeating group would have to define a column like a data type.
There is no obvious way to JOIN a column that contains an array to other columns, since there are no comparison operators or conversion rules. There is no obvious way to display or transmit a column that contains an array as a result set. Different languages and different compilers for the same language store arrays in column-major or row-major order, so there is no standard. There is no obvious way to write constraints on nonscalar values.

The goal of SQL was to be a database language that would operate with a wide range of host languages. To meet that goal, the scalar data types are as varied as possible to match the host language data types, but as simple in structure as they can be to make the transfer of data to the host language as easy as possible. The extensions after SQL-92 ruin all of these advantages, so it is a good thing they are not widely implemented in products.

25.1 Arrays via Named Columns

An array in other programming languages has a name and subscripts by which the array elements are referenced. The array elements are all of the same data type, and the subscripts are all sequential integers. Some languages start numbering at zero, some start numbering at one, and some let the user set the upper and lower bounds. For example, a Pascal array declaration would look like this:

    foobar : ARRAY [1..5] OF INTEGER;

and would have integer elements foobar[1], foobar[2], foobar[3], foobar[4], and foobar[5]. The same structure is most often mapped into an SQL declaration as:

    CREATE TABLE Foobar1
    (element1 INTEGER NOT NULL,
     element2 INTEGER NOT NULL,
     element3 INTEGER NOT NULL,
     element4 INTEGER NOT NULL,
     element5 INTEGER NOT NULL);

The elements cannot be accessed by the use of a subscript in this table, as they can in a true array.
That is, to set the array elements equal to zero in Pascal takes one statement with a FOR loop in it:

    FOR i := 1 TO 5 DO foobar[i] := 0;

The same action in SQL would be performed with the following statement:

    UPDATE Foobar1
       SET element1 = 0,
           element2 = 0,
           element3 = 0,
           element4 = 0,
           element5 = 0;

This is because there is no subscript that can be iterated in a loop. Any access must be based on column names, not on subscripts. These pseudosubscripts lead to building column names on the fly in dynamic SQL, giving code that is both slow and dangerous. Even worse, some users will use the same approach in table names, and destroy their logical data model.

Let’s assume that we design an Employee table with separate columns for the names of four children, and we start with an empty table and then try to use it.

1. What happens if we hire a man with fewer than four children?
   We can fire him immediately or make him have more children. We can restructure the table to allow for fewer children. The usual, and less drastic, solution is to put NULLs in the columns for the nonexistent children. We then have all of the problems associated with NULLs to handle.

2. What happens if we hire a man with five children?
   We can fire him immediately or order him to kill one of his children. We can restructure the table to allow five children. We can add a second row to hold the information on children 5 through 8; however, this destroys the uniqueness of the emp_id, so it cannot be used as a key. We can overcome that problem by adding a new column for record number, which will form a two-column key with the emp_id. This leads to needless duplication in the table.

3. What happens if the employee dies?
   We will delete all his children’s data along with his, even if the company owes benefits to the survivors.

4. What happens if the child of an employee dies?
   We can fire him or order him to get another child immediately.
   We can restructure the table to allow only three children. We can overwrite the child’s data with NULLs and get all of the problems associated with NULL values. This one is the most common decision. But what if we had used the multiple-row trick and this employee had a fifth child: should that child be brought up into the vacant slot in the current row, and the second row of the set be deleted?

5. What happens if the employee replaces a dead child with a new one?
   Should the new child’s data overwrite the NULLs in the dead child’s data? Should the new child’s data be put in the next available slot and overwrite the NULLs in those columns?

Some of these choices involve rebuilding the database. Others are simply absurd attempts to restructure reality to fit the database. The real point is that each insertion or deletion of a child involves a different procedure, depending on the size of the group to which he belongs. File systems had variant records that could change the size of their repeating groups.

Consider, instead, a table of employees, and another table for their children:

    CREATE TABLE Employees
    (emp_id INTEGER NOT NULL PRIMARY KEY,
     emp_name CHAR(30) NOT NULL,
     ...);

    CREATE TABLE Children
    (emp_id INTEGER NOT NULL
       REFERENCES Employees(emp_id)
       ON UPDATE CASCADE,
     child_name CHAR(30) NOT NULL,
     PRIMARY KEY (emp_id, child_name),
     birthday DATE NOT NULL,
     sex CHAR(1) NOT NULL);

To add a child, you insert a row into Children. To remove a child, you delete a row from Children. There is nothing special about the fourth or fifth child that requires the database system to use special procedures. There are no NULLs in either table.

The trade-off is that the number of tables in the database schema increases, but the total amount of storage used will be smaller, because you will keep data only on children who exist, rather than using NULLs to hold space. The goal is to have data in the simplest possible format, so any host program can use it.
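The two-table design can be tried out with a minimal sketch. SQLite, driven through Python's standard sqlite3 module, is an assumption made only so the statements can be executed; the employee and child names, the birthdays, and the helper functions add_child, remove_child, and child_counts are hypothetical. SQLite expects table constraints after the column definitions, so the PRIMARY KEY clause is moved to the end of Children.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Employees
(emp_id INTEGER NOT NULL PRIMARY KEY,
 emp_name CHAR(30) NOT NULL);

-- PRIMARY KEY clause listed last because SQLite requires it there.
CREATE TABLE Children
(emp_id INTEGER NOT NULL
   REFERENCES Employees(emp_id)
   ON UPDATE CASCADE,
 child_name CHAR(30) NOT NULL,
 birthday DATE NOT NULL,
 sex CHAR(1) NOT NULL,
 PRIMARY KEY (emp_id, child_name));

-- Hypothetical sample employees.
INSERT INTO Employees (emp_id, emp_name)
VALUES (1, 'Alice'), (2, 'Bob');
""")


def add_child(emp_id, child_name, birthday, sex):
    """Adding any child, first or fifth, is the same one-row INSERT."""
    conn.execute(
        "INSERT INTO Children (emp_id, child_name, birthday, sex) "
        "VALUES (?, ?, ?, ?)",
        (emp_id, child_name, birthday, sex),
    )


def remove_child(emp_id, child_name):
    """Removing a child is a one-row DELETE; no slots are shuffled."""
    conn.execute(
        "DELETE FROM Children WHERE emp_id = ? AND child_name = ?",
        (emp_id, child_name),
    )


def child_counts():
    """Children per employee; employees with none show a count of 0."""
    return conn.execute("""
        SELECT E.emp_name, COUNT(C.child_name)
          FROM Employees AS E
               LEFT OUTER JOIN Children AS C
               ON C.emp_id = E.emp_id
         GROUP BY E.emp_id, E.emp_name
         ORDER BY E.emp_id
    """).fetchall()


# Five children for one employee, then one removal.
for name, birthday, sex in [
    ("Ann", "2010-01-01", "F"),
    ("Ben", "2011-02-02", "M"),
    ("Cal", "2012-03-03", "M"),
    ("Dee", "2013-04-04", "F"),
    ("Eve", "2014-05-05", "F"),
]:
    add_child(1, name, birthday, sex)
remove_child(1, "Cal")

print(child_counts())  # [('Alice', 4), ('Bob', 0)]
```

The fifth child needs no second row and no restructuring, and the employee with no children needs no NULL placeholders; the zero comes from the outer join at query time.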
Gabrielle Wiorkowski, in her excellent DB2 classes, uses an example of a table for tracking the sales made by salespersons during the past year. That table could be defined as

    CREATE TABLE AnnualSales1
    (salesman CHAR(15) NOT NULL PRIMARY KEY,
     jan DECIMAL(5, 2),
     feb DECIMAL(5, 2),
     mar DECIMAL(5, 2),
     apr DECIMAL(5, 2),
     may DECIMAL(5, 2),
     jun DECIMAL(5, 2),
     jul DECIMAL(5, 2),
     aug DECIMAL(5, 2),
     sep DECIMAL(5, 2),
     oct DECIMAL(5, 2),
     nov DECIMAL(5, 2),
     "dec" DECIMAL(5, 2) -- DEC[IMAL] is a reserved word
    );

We have to allow for NULLs in the monthly sales_amts in the first version of the table, but the table is actually quite a bit smaller than it would be if we were to declare it as:

    CREATE TABLE AnnualSales2
    (salesman CHAR(15) NOT NULL,
     sale_month CHAR(3) NOT NULL
       CONSTRAINT valid_month_abbrev
       CHECK (sale_month IN ('Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
                             'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec')),
     sales_amt DECIMAL(5, 2) NOT NULL,
     PRIMARY KEY (salesman, sale_month));

In Wiorkowski’s actual example in DB2, the break-even point for DASD storage was April; that is, the storage required for AnnualSales1 and AnnualSales2 is about the same in April of the given year. Queries that deal with individual salespersons will run much faster against the AnnualSales1 table than queries based on the AnnualSales2 table, because all the data is in one row in the AnnualSales1 table. Such queries may be a bit messy and they may require function calls to handle possible NULL values, but they are not very complex.

The only reason for using AnnualSales1 is that you have a data warehouse and all you want to see is summary information, grouped into years. This design is not acceptable in an OLTP system.
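To make the remark about function calls concrete, here is a small sketch that computes a yearly total against both designs. It assumes SQLite through Python's sqlite3 module purely so it can be run; the salesperson 'Smith' and the two sales amounts are hypothetical, and the names wide_total and tall_total are illustrative only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE AnnualSales1
(salesman CHAR(15) NOT NULL PRIMARY KEY,
 jan DECIMAL(5, 2), feb DECIMAL(5, 2), mar DECIMAL(5, 2),
 apr DECIMAL(5, 2), may DECIMAL(5, 2), jun DECIMAL(5, 2),
 jul DECIMAL(5, 2), aug DECIMAL(5, 2), sep DECIMAL(5, 2),
 oct DECIMAL(5, 2), nov DECIMAL(5, 2), "dec" DECIMAL(5, 2));

CREATE TABLE AnnualSales2
(salesman CHAR(15) NOT NULL,
 sale_month CHAR(3) NOT NULL
   CONSTRAINT valid_month_abbrev
   CHECK (sale_month IN ('Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
                         'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec')),
 sales_amt DECIMAL(5, 2) NOT NULL,
 PRIMARY KEY (salesman, sale_month));

-- Hypothetical data: Smith has sales in January and March only.
-- The one-row design leaves the other ten months NULL.
INSERT INTO AnnualSales1 (salesman, jan, mar)
VALUES ('Smith', 100.00, 250.50);

-- The one-row-per-month design stores only the months that exist.
INSERT INTO AnnualSales2 (salesman, sale_month, sales_amt)
VALUES ('Smith', 'Jan', 100.00), ('Smith', 'Mar', 250.50);
""")

# One row per salesperson: every month must be named and wrapped in
# COALESCE(), or a single NULL month would make the whole sum NULL.
wide_total = conn.execute("""
    SELECT salesman,
           COALESCE(jan, 0) + COALESCE(feb, 0) + COALESCE(mar, 0)
         + COALESCE(apr, 0) + COALESCE(may, 0) + COALESCE(jun, 0)
         + COALESCE(jul, 0) + COALESCE(aug, 0) + COALESCE(sep, 0)
         + COALESCE(oct, 0) + COALESCE(nov, 0) + COALESCE("dec", 0)
      FROM AnnualSales1
""").fetchall()

# One row per month: an ordinary aggregate, with no NULL handling.
tall_total = conn.execute("""
    SELECT salesman, SUM(sales_amt)
      FROM AnnualSales2
     GROUP BY salesman
""").fetchall()

print(wide_total)  # [('Smith', 350.5)]
print(tall_total)  # [('Smith', 350.5)]
```

Both queries return the same figure. The sketch says nothing about speed or storage, which depend on the product and the data; it only shows the shape of the SQL each design requires.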
25.2 Arrays via Subscript Columns

Another approach to faking a multidimensional array is to map arrays into a table with an integer column for each subscript, thus:

    CREATE TABLE Foobar
    (i INTEGER NOT NULL PRIMARY KEY
       CONSTRAINT valid_array_index
       CHECK (i BETWEEN 1 AND 5),
     element REAL NOT NULL);

This looks more complex than the first approach, but it is closer to what the original Pascal declaration was doing behind the scenes. Subscripts resolve to unique physical addresses, so it is not possible to have two values for foobar[i]; hence, i is a key. The Pascal compiler will check to see that the subscripts are within the declared range; hence the CHECK() clause.

The first advantage of this approach is that multidimensional arrays are easily handled by adding another column for each subscript. The Pascal declaration:

    ThreeD : ARRAY [1..3, 1..4, 1..5] OF REAL;

is mapped over to:

    CREATE TABLE ThreeD
    (i INTEGER NOT NULL
       CONSTRAINT valid_i
       CHECK (i BETWEEN 1 AND 3),
     j INTEGER NOT NULL
       CONSTRAINT valid_j
       CHECK (j BETWEEN 1 AND 4),
     k INTEGER NOT NULL
       CONSTRAINT valid_k
       CHECK (k BETWEEN 1 AND 5),
     element REAL NOT NULL,
     PRIMARY KEY (i, j, k));

Obviously, SELECT statements with GROUP BY clauses on the subscript columns will produce row and column totals, thus:

    SELECT i, j, SUM(element) -- sum across the k columns
      FROM ThreeD
     GROUP BY i, j;

    SELECT i, SUM(element) -- sum across the j and k columns
      FROM ThreeD
     GROUP BY i;

    SELECT SUM(element) -- sum the entire array
      FROM ThreeD;

If the original one element/one column approach were used, the table declaration would have 60 columns (3 × 4 × 5) named element_111 through element_345. There are too many names in this example to handle in any reasonable way; you would not be able to use the GROUP BY clauses for array projection, either.

Another advantage of this approach is that the subscripts can be data types other than integers.
DATE and TIME data types are often useful, but CHARACTER and approximate numerics have their uses too.

25.3 Matrix Operations in SQL

A matrix is not quite the same thing as an array. Matrices are mathematical structures with particular properties. We cannot take the time to discuss them here; you can find the necessary information in a college freshman algebra book. Though it is possible to do many matrix operations in SQL, it is not a good idea; such queries and operations will eat up resources and run much too long. SQL was never meant to be a language for calculations.

Let us assume that we have two-dimensional arrays that are declared as tables using two columns for subscripts, and that all columns are declared with a NOT NULL constraint. The presence of NULLs is not defined in linear algebra, and I have no desire to invent a three-valued linear algebra of my own. Another problem is that a matrix has rows and columns that are not the same as the rows and columns of an SQL table; as you read the rest of this section, be careful not to confuse the two.

    CREATE TABLE MyMatrix
    (element INTEGER NOT NULL, -- could be any numeric data type
     i INTEGER NOT NULL CHECK (i > 0),