CHAPTER 27: SUBSETS

-- fourth step: distinct missing ifcs that are also missing clubs
INSERT INTO Samples (member_id, club_name, ifc)
SELECT MIN(Memberships.member_id), MIN(Memberships.club_name), Memberships.ifc
  FROM Memberships, MissingClubs, MissingIfcs
 WHERE Memberships.club_name = MissingClubs.club_name
   AND Memberships.ifc = MissingIfcs.ifc
 GROUP BY Memberships.ifc;

-- fifth step: remaining missing ifcs
INSERT INTO Samples (member_id, club_name, ifc)
SELECT MIN(Memberships.member_id), MIN(Memberships.club_name), Memberships.ifc
  FROM Memberships, MissingIfcs
 WHERE Memberships.ifc = MissingIfcs.ifc
 GROUP BY Memberships.ifc;

-- sixth step: remaining missing clubs
INSERT INTO Samples (member_id, club_name, ifc)
SELECT MIN(Memberships.member_id), Memberships.club_name, MIN(Memberships.ifc)
  FROM Memberships, MissingClubs
 WHERE Memberships.club_name = MissingClubs.club_name
 GROUP BY Memberships.club_name;

To be sure, we can check the candidate rows for redundancy removal with the two views that were created earlier.

CHAPTER 28
Trees and Hierarchies in SQL

I have a separate book (Joe Celko’s Trees and Hierarchies in SQL for Smarties, 2004) devoted to this topic in great detail, so this chapter will be a very quick discussion of the three major approaches to modeling trees and hierarchies in SQL.

A tree is a special kind of directed graph. Graphs are data structures that are made up of nodes (usually shown as boxes) connected by edges (usually shown as lines with arrowheads). Each edge represents a one-way relationship between the two nodes it connects. In an organizational chart, the nodes are positions that can be filled by employees, and each edge is the “reports to” relationship. In a parts explosion (also called a bill of materials), the nodes are assembly units that eventually resolve down to individual parts from inventory, and each edge is the “is made of” relationship.

The top of the tree is called the root.
In an organizational chart, it is the highest authority; in a parts explosion, it is the final assembly. The number of edges coming out of a node is its outdegree, and the number of edges entering it is its indegree. A binary tree is one in which a parent can have at most two children; more generally, an n-ary tree is one in which a node can have at most outdegree n.

The nodes of the tree that have no subtrees beneath them are called the leaf nodes. In a parts explosion, they are the individual parts, which cannot be broken down any further. The descendants, or children, of a node (the parent) are every node in the subtree that has the parent node as its root.

There are several ways to define a tree: it is a graph with no cycles; it is a graph where all nodes except the root have indegree one and the root has indegree zero. Another defining property is that a path can be found from the root to any other node in the tree by following the edges in their natural direction.

The tree structure and the nodes are very different things and therefore should be modeled in separate tables. But I am going to violate that design rule and use an abstract tree in this chapter (see Figure 28.1). This little tree is small enough that you can remember what it looks like as you read the rest of this chapter. It will illustrate the various techniques discussed here. I will use the terms “child,” “parent,” and “node,” but you may see other terms used in various books on graphs.

28.1 Adjacency List Model

Most SQL databases use the adjacency list model for two reasons. The first reason is that Dr. Codd came up with it in the early days of the relational model, and nobody thought about it after that. The second reason is that the adjacency list is a way of “faking” pointer chains, the traditional programming method in procedural languages for handling trees.
It is a recording of the edges in a “boxes and arrows” diagram, something like this simple table:

CREATE TABLE AdjTree
(child CHAR(2) NOT NULL,
 parent CHAR(2), -- null is root
 PRIMARY KEY (child)); -- a key column cannot be NULL, and each node has one parent

Figure 28.1 An Abstract Tree Model.

AdjTree
child  parent
=============
'A'    NULL
'B'    'A'
'C'    'A'
'D'    'C'
'E'    'C'
'F'    'C'

The queries for the leaf nodes and root are obvious. The root has a NULL parent, and the leaf nodes have no subordinates. Each row models two nodes that share an adjacent edge in a directed graph. The adjacency list model is both the most common and the worst possible tree model. On the other hand, it is the best way to model any general graph.

28.1.1 Complex Constraints

The first problem is that the adjacency list model requires complex constraints to maintain any data integrity. In practice, the usual solution is to ignore the problems and hope that nothing bad happens to the structure. But if you care about data integrity, you need to be sure that:

1. There is only one root node.

CREATE TABLE AdjTree
(child CHAR(2) NOT NULL,
 parent CHAR(2), -- null is root
 PRIMARY KEY (child),
 CONSTRAINT one_root
 CHECK((SELECT COUNT(*)
          FROM AdjTree
         WHERE parent IS NULL) = 1));

2. There are no cycles. Unfortunately, this cannot be done without a trigger. The trigger code must trace all the paths looking for a cycle. The most obvious constraint to prohibit a single-node cycle in the graph would be:

CHECK (child <> parent) -- cannot be your own father!

But that does not detect cycles of two or more nodes (n >= 2). We know that a tree is a connected graph in which the number of edges is the number of nodes minus one. That constraint looks like this:

CHECK ((SELECT COUNT(*) FROM AdjTree) - 1 -- nodes minus one
       = (SELECT COUNT(parent) FROM AdjTree)) -- edges

The COUNT(parent) will drop the NULL in the root row.
That gives us the effect of having a constraint to check for one NULL:

CHECK((SELECT COUNT(*) FROM AdjTree WHERE parent IS NULL) = 1)

This is a necessary condition, but it is not a sufficient condition. Consider this data, in which ‘D’ and ‘E’ are both in a cycle, and that cycle is not in the tree structure.

Cycle
child  parent
=============
'A'    NULL
'B'    'A'
'C'    'A'
'D'    'E'
'E'    'D'

One approach would be to remove all the leaf nodes and repeat this procedure until the tree is reduced to an empty set. If the tree does not reduce to an empty set, then there is a disconnected cycle.

CREATE FUNCTION TreeTest() RETURNS CHAR(6)
LANGUAGE SQL
BEGIN ATOMIC
DECLARE row_count INTEGER;
SET row_count = (SELECT COUNT(DISTINCT parent) + 1 FROM AdjTree);
-- put a copy in a temporary table
INSERT INTO WorkTree
SELECT child, parent FROM AdjTree;
WHILE row_count > 0
DO DELETE FROM WorkTree -- prune leaf nodes
    WHERE WorkTree.child
          NOT IN (SELECT T2.parent
                    FROM WorkTree AS T2
                   WHERE T2.parent IS NOT NULL);
   SET row_count = row_count - 1;
END WHILE;
IF NOT EXISTS (SELECT * FROM WorkTree)
THEN RETURN ('Tree '); -- pruned everything
ELSE RETURN ('Cycles'); -- cycles were left
END IF;
END;

28.1.2 Procedural Traversal for Queries

The second problem is that the adjacency list model requires that you traverse from node to node to answer any interesting questions, such as “Does Mr. King have any authority over Mr. Jones?” or any other aggregations up and down the tree. The immediate parent of each node is easy to find with a self-join:

SELECT P1.child, ' parent to ', C1.child
  FROM AdjTree AS P1, AdjTree AS C1
 WHERE P1.child = C1.parent;

But something is missing here. This query gives only the immediate parent of the node. Your parent’s parent also has authority over you, and so forth, up the tree until we find someone who has no superior.
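The walk up the tree just described can be sketched procedurally. This is a minimal illustration, not code from the book: it assumes Python with the standard sqlite3 module and an in-memory copy of the sample AdjTree, and the helper names build_adj_tree, superiors, and has_authority are hypothetical.

```python
import sqlite3


def build_adj_tree():
    """Load the chapter's sample AdjTree into an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE AdjTree (child CHAR(2) NOT NULL PRIMARY KEY, parent CHAR(2))"
    )
    conn.executemany(
        "INSERT INTO AdjTree (child, parent) VALUES (?, ?)",
        [("A", None), ("B", "A"), ("C", "A"), ("D", "C"), ("E", "C"), ("F", "C")],
    )
    return conn


def superiors(conn, node):
    """Return the chain of ancestors of node, nearest first, one query per level."""
    chain = []
    seen = {node}
    current = node
    while True:
        row = conn.execute(
            "SELECT parent FROM AdjTree WHERE child = ?", (current,)
        ).fetchone()
        if row is None or row[0] is None:
            return chain  # unknown node, or we reached the root
        current = row[0]
        if current in seen:
            raise ValueError("cycle detected at node " + current)
        seen.add(current)
        chain.append(current)


def has_authority(conn, boss, subordinate):
    """True if boss appears anywhere above subordinate in the tree."""
    return boss in superiors(conn, subordinate)


conn = build_adj_tree()
print(superiors(conn, "D"))           # ['C', 'A']
print(has_authority(conn, "A", "F"))  # True
```

Each level of the tree costs one round trip to the database, which is the price this model pays for answering questions about indirect superiors.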
To go two levels deep in the tree, we need to do a more complex self-JOIN, thus:

SELECT B1.child, ' parent to ', E2.child
  FROM AdjTree AS B1, AdjTree AS E1, AdjTree AS E2
 WHERE B1.child = E1.parent
   AND E1.child = E2.parent;

Unfortunately, you have no idea just how deep the tree is, so you must keep extending this query until you get an empty set back as a result. The practical problem is that most SQL compilers will start having serious problems optimizing queries with a large number of tables.

The other method is to declare a CURSOR and traverse the tree with procedural code. This is usually painfully slow, but it will work for any depth of tree. It also defeats the purpose of using a nonprocedural language like SQL.

With Common Table Expressions in SQL-99, you can also write a query that recursively constructs the transitive closure of the table by hiding the traversal. This feature is not popular yet, and it is still slow compared to the nested sets model.

28.1.3 Altering the Table

Insertion of a new node is the only easy operation in the adjacency list model. You simply do an INSERT INTO statement and check to see that the parent already exists in the table.

Deleting an edge in the middle of a tree will cause the table to become a forest of separate trees. You need some rule for rearranging the structure. The two usual methods are to promote a subordinate to the vacancy (and cascade the vacancy downward) or to assign all the subordinates to their parent’s parent (the orphans go to live with grandparents).

Consider what has to happen when a middle-level node is changed. The change must occur in both the child and parent columns.
UPDATE AdjTree
   SET child
       = CASE WHEN child = 'C'
              THEN 'C1'
              ELSE child END,
       parent
       = CASE WHEN parent = 'C'
              THEN 'C1'
              ELSE parent END
 WHERE 'C' IN (parent, child);

28.2 The Path Enumeration Model

The next method for representing hierarchies in SQL was first discussed in detail by Stefan Gustafsson on an Internet site for SQL Server users. Later, Tom Moreau and Itzik Ben-Gan developed it in more detail in their book Advanced Transact-SQL for SQL Server 2000 (the first edition appeared in October 2000). This model stores the path from the root to each node as a string at that node.

Of course, we purists might object that this is a denormalized table, since the path is not a scalar value. The worst-case operation you can do in this representation is to alter the root of the tree. We then have to recalculate all the paths in the entire tree. But if the assumption is that structural modifications high in the tree are relatively uncommon, then this might not be a problem. The table for the simple tree we will use for this chapter looks like this:

CREATE TABLE PathTree
(node CHAR(2) NOT NULL PRIMARY KEY,
 path VARCHAR (900) NOT NULL);

The example tree would get the following representation:

node  path
===========
'A'   'a/'
'B'   'a/b/'
'C'   'a/c/'
'D'   'a/c/d/'
'E'   'a/c/e/'
'F'   'a/c/f/'

What we have done is concatenate the node names and separate them with a slash. All of the operations will depend on string manipulations, so we’d like to have short node identifiers to keep the paths short. We would prefer, but not require, identifiers of one length to make substrings easier.

You have probably recognized this because I used a slash separator; this is a version of the directory paths used in several operating systems such as the UNIX family and Windows.

28.2.1 Finding Subtrees and Nodes

The major trick in this model is the LIKE predicate. The subtree rooted at :my_node is found with this query.
SELECT node
  FROM PathTree
 WHERE path LIKE '%' || :my_node || '%';

Finding the root node is easy, since that is the substring of any path up to the first slash. However, the leaf nodes are harder. Because every path already ends with a slash, a node is a leaf when no other path extends its own path:

SELECT T1.node
  FROM PathTree AS T1
 WHERE NOT EXISTS
       (SELECT *
          FROM PathTree AS T2
         WHERE T2.path LIKE T1.path || '_%');

28.2.2 Finding Levels and Subordinates

The depth of a node is shown by the number of ‘/’ characters in the path string. If you have a REPLACE() that can remove the ‘/’ characters, the difference between the length of the path with and without those characters gives you the level.

CREATE VIEW DetailedTree (node, path, level)
AS SELECT node, path,
          CHAR_LENGTH (path)
          - CHAR_LENGTH (REPLACE (path, '/', ''))
     FROM PathTree;

The immediate descendants of a given node can be found with this query, if you know the length of the node identifiers. In this sample data, that length is one character:

SELECT :my_node, T2.node
  FROM PathTree AS T1, PathTree AS T2
 WHERE T1.node = :my_node
   AND T2.path LIKE T1.path || '_/';

This can be expanded with OR-ed LIKE predicates that cover the possible lengths of the node identifiers.

28.2.3 Deleting Nodes and Subtrees

This is a bit weird at first, because the removal of a node requires that you first update all the paths. Let us delete node ‘B’ in the sample tree:

BEGIN ATOMIC
UPDATE PathTree
   SET path = REPLACE (path, 'b/', '')
 WHERE POSITION ('b/' IN path) > 0;
DELETE FROM PathTree
 WHERE node = 'B';
END;

Deleting a subtree rooted at :my_node is actually simpler:

DELETE FROM PathTree
 WHERE path LIKE (SELECT path
                    FROM PathTree
                   WHERE node = :my_node) || '%';

28.2.4 Integrity Constraints

If a path has the same node in it twice, then there is a cycle in the graph. We can use a VIEW with just the node names in it to some advantage here.
CHECK (NOT EXISTS
       (SELECT *
          FROM NodeList AS D1, PathTree AS P1
         WHERE CHAR_LENGTH (REPLACE (P1.path, D1.node, ''))
               < (CHAR_LENGTH(P1.path) - CHAR_LENGTH(D1.node))
       ))

Removing every occurrence of a node name from a path shortens the path by more than one name length only when that name appears more than once. Unfortunately, a subquery in a constraint is not widely implemented yet.

28.3 Nested Set Model of Hierarchies

Since SQL is a set-oriented language, the nested set model is a better model for the approach discussed here. If you have used HTML, XML, or a language with a block structure, then you understand the basic idea of this model. The lft and rgt columns (their names are abbreviations for “left” and “right,” which are reserved words in Standard SQL) are the count of the “tags” in an XML representation of a tree.

Imagine circles inside circles without any of them overlapping, the way you would draw a markup language structure. This has some predictable results that we can use for building queries, as shown in Figures 28.2, 28.3, and 28.4.

If that mental model does not work for you, to convert the “boxes and arrows” graph into a nested set model, think of a little worm crawling along the tree. The worm starts at the top, the root, and makes a complete trip around the tree. When he comes to a node, he puts a number in the cell on the side that he is visiting and increments his counter. Each node will get two numbers, one for the right side and one for the left.
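The worm's trip around the tree can be sketched in a few lines. This is only an illustration under stated assumptions: it uses Python rather than SQL, takes the (child, parent) pairs of the sample adjacency list as input, and the function name nested_set_numbers is hypothetical. Children are visited in the order their rows appear.

```python
def nested_set_numbers(edges, root):
    """Assign (lft, rgt) to every node with one depth-first trip around the tree.

    edges is a list of (child, parent) pairs; the root row has parent None.
    """
    children = {}
    for child, parent in edges:
        if parent is not None:
            children.setdefault(parent, []).append(child)

    numbers = {}
    counter = 1

    def visit(node):
        nonlocal counter
        lft = counter          # number written on the way down (left side)
        counter += 1
        for kid in children.get(node, []):
            visit(kid)
        numbers[node] = (lft, counter)  # number written on the way back up
        counter += 1

    visit(root)
    return numbers


sample_edges = [("A", None), ("B", "A"), ("C", "A"),
                ("D", "C"), ("E", "C"), ("F", "C")]
for node, (lft, rgt) in sorted(nested_set_numbers(sample_edges, "A").items()):
    print(node, lft, rgt)
# A 1 12
# B 2 3
# C 4 11
# D 5 6
# E 7 8
# F 9 10
```

The recursion is fine for a tree this small; a very deep tree would call for an explicit stack instead.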