CHAPTER 30: GRAPHS IN SQL

The most common way to model a graph in SQL is with an adjacency list model. Each edge of the graph is shown as a pair of nodes in which the ordering matters, and then any values associated with that edge are shown in another column.

30.1 Basic Graph Characteristics

The following code is from John Gilson. This code uses an adjacency list model of the graph, with nodes in a separate table. This is the most common method for modeling graphs in SQL.

    CREATE TABLE Nodes
    (node_id INTEGER NOT NULL PRIMARY KEY);

    CREATE TABLE AdjacencyListGraph
    (begin_node_id INTEGER NOT NULL
       REFERENCES Nodes (node_id),
     end_node_id INTEGER NOT NULL
       REFERENCES Nodes (node_id),
     PRIMARY KEY (begin_node_id, end_node_id),
     CHECK (begin_node_id <> end_node_id));

It is also possible to load an acyclic directed graph into a nested set model by splitting the nodes.

    CREATE TABLE NestedSetsGraph
    (node_id INTEGER NOT NULL
       REFERENCES Nodes (node_id),
     lft INTEGER NOT NULL CHECK (lft >= 1) PRIMARY KEY,
     rgt INTEGER NOT NULL UNIQUE,
     CHECK (rgt > lft),
     UNIQUE (node_id, lft));

To split nodes, start at the sink nodes and move up the tree. When you come to a node with an indegree greater than one, replace it with that many copies of the node, one under each of its superiors. Continue to do this until you get to the root. The acyclic graph will become a tree, but with duplicated node values. There are advantages to this model; we will discuss them in Section 30.3.

30.1.1 All Nodes in the Graph

To view all nodes in the graph, use the following:

    CREATE VIEW GraphNodes (node_id)
    AS
    SELECT DISTINCT node_id
      FROM NestedSetsGraph;

30.1.2 Path Endpoints

A path through a graph is a traversal of consecutive nodes along a sequence of edges. Clearly, the node at the end of one edge in the sequence must also be the node at the beginning of the next edge in the sequence. The length of the path is the number of edges that are traversed along the path.
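As a sketch of how node splitting plays out, take a hypothetical four-node diamond graph with the edges (1, 2), (1, 3), (2, 4), and (3, 4). The node numbers and the lft and rgt values are made up for illustration and are not from the original example. Node 4 has an indegree of two, so it is copied under both of its superiors. Assuming the Nodes, AdjacencyListGraph, and NestedSetsGraph tables declared in Section 30.1 already exist, the data could be loaded like this:

```sql
-- Hypothetical sample data; assumes the tables from Section 30.1 exist.
INSERT INTO Nodes (node_id)
VALUES (1), (2), (3), (4);

-- The diamond as an adjacency list: one row per directed edge.
INSERT INTO AdjacencyListGraph (begin_node_id, end_node_id)
VALUES (1, 2), (1, 3), (2, 4), (3, 4);

-- The same graph after splitting node 4 into two copies,
-- numbered as the tree 1(2(4), 3(4)).
INSERT INTO NestedSetsGraph (node_id, lft, rgt)
VALUES (1, 1, 10),
       (2, 2, 5),
       (4, 3, 4),  -- copy of node 4 under node 2
       (3, 6, 9),
       (4, 7, 8);  -- copy of node 4 under node 3

-- Nodes that were split appear more than once.
SELECT node_id, COUNT(*) AS copy_cnt
  FROM NestedSetsGraph
 GROUP BY node_id
HAVING COUNT(*) > 1;
```

With this data, the final query returns a single row, for node 4 with two copies. The path through nodes 1, 2, and 4 traverses two edges, so its length is two.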
Path endpoints are the first and last nodes of each path in the graph. For a path of length zero, the path endpoints are the same node. If there is more than one path between two nodes, each path will be distinguished by its own distinct set of number pairs for the nested-set representation. If there is only one path, P, between two nodes, but P is a subpath of more than one distinct path, then the endpoints of P will have number pairs for each of these greater paths. As a canonical form, the least-numbered pairs are returned for these endpoints.

    CREATE VIEW PathEndpoints
    (begin_node_id, end_node_id,
     begin_lft, begin_rgt, end_lft, end_rgt)
    AS
    SELECT G1.node_id, G2.node_id,
           G1.lft, G1.rgt, G2.lft, G2.rgt
      FROM (SELECT node_id, MIN(lft), MIN(rgt)
              FROM NestedSetsGraph
             GROUP BY node_id) AS G1 (node_id, lft, rgt)
           INNER JOIN
           NestedSetsGraph AS G2
           ON G2.lft >= G1.lft
              AND G2.lft < G1.rgt;

30.1.3 Reachable Nodes

If a node is reachable from another node, then a path exists from the one node to the other. It is assumed that every node is reachable from itself.

    CREATE VIEW ReachableNodes (begin_node_id, end_node_id)
    AS
    SELECT DISTINCT begin_node_id, end_node_id
      FROM PathEndpoints;

30.1.4 Edges

Edges are pairs of adjacent connected nodes in the graph. If edge E is represented by the pair of nodes (n0, n1), then n1 is reachable from n0 in a single traversal.

    CREATE VIEW Edges (begin_node_id, end_node_id)
    AS
    SELECT begin_node_id, end_node_id
      FROM PathEndpoints AS PE
     WHERE begin_node_id <> end_node_id
       AND NOT EXISTS
           (SELECT *
              FROM NestedSetsGraph AS G
             WHERE G.lft > PE.begin_lft
               AND G.lft < PE.end_lft
               AND G.rgt > PE.end_rgt);

30.1.5 Indegree and Outdegree

The indegree of a node, n, is the number of distinct edges ending at n. Nodes that have an indegree of zero are returned with a count of zero, because the view starts from GraphNodes and uses a LEFT OUTER JOIN.
To determine the indegree of all nodes in the graph:

    CREATE VIEW Indegree (node_id, node_indegree)
    AS
    SELECT N.node_id, COUNT(E.begin_node_id)
      FROM GraphNodes AS N
           LEFT OUTER JOIN
           Edges AS E
           ON N.node_id = E.end_node_id
     GROUP BY N.node_id;

The outdegree of a node, n, is the number of distinct edges beginning at n. Nodes that have an outdegree of zero are returned with a count of zero, because the view starts from GraphNodes and uses a LEFT OUTER JOIN. To determine the outdegree of all nodes in the graph:

    CREATE VIEW Outdegree (node_id, node_outdegree)
    AS
    SELECT N.node_id, COUNT(E.end_node_id)
      FROM GraphNodes AS N
           LEFT OUTER JOIN
           Edges AS E
           ON N.node_id = E.begin_node_id
     GROUP BY N.node_id;

30.1.6 Source, Sink, Isolated, and Internal Nodes

A source node of a graph has a positive outdegree but an indegree of zero; that is, it has edges leading from, but not to, the node. This assumes there are no isolated nodes (nodes belonging to no edges).

    CREATE VIEW SourceNodes (node_id, lft, rgt)
    AS
    SELECT node_id, lft, rgt
      FROM NestedSetsGraph AS G1
     WHERE NOT EXISTS
           (SELECT *
              FROM NestedSetsGraph AS G2
             WHERE G1.lft > G2.lft
               AND G1.lft < G2.rgt);

Likewise, a sink node of a graph has a positive indegree but an outdegree of zero; that is, it has edges leading to, but not from, the node. This assumes there are no isolated nodes.

    CREATE VIEW SinkNodes (node_id)
    AS
    SELECT node_id
      FROM NestedSetsGraph AS G1
     WHERE lft = rgt - 1
       AND NOT EXISTS
           (SELECT *
              FROM NestedSetsGraph AS G2
             WHERE G1.node_id = G2.node_id
               AND G2.lft < G1.lft);

An isolated node belongs to no edges; i.e., it has zero indegree and zero outdegree.

    CREATE VIEW IsolatedNodes (node_id, lft, rgt)
    AS
    SELECT node_id, lft, rgt
      FROM NestedSetsGraph AS G1
     WHERE lft = rgt - 1
       AND NOT EXISTS
           (SELECT *
              FROM NestedSetsGraph AS G2
             WHERE G1.lft > G2.lft
               AND G1.lft < G2.rgt);

An internal node of a graph has an indegree greater than zero and an outdegree greater than zero; that is, it acts as both a source and a sink.
    CREATE VIEW InternalNodes (node_id)
    AS
    SELECT node_id
      FROM (SELECT node_id, MIN(lft) AS lft, MIN(rgt) AS rgt
              FROM NestedSetsGraph
             WHERE lft < rgt - 1
             GROUP BY node_id) AS G1
     WHERE EXISTS
           (SELECT *
              FROM NestedSetsGraph AS G2
             WHERE G1.lft > G2.lft
               AND G1.lft < G2.rgt);

30.2 Paths in a Graph

Finding a path in a graph is the most important commercial application of graphs. Graphs model transportation networks, electrical and cable systems, process control flow, and thousands of other things.

A path, P, of length L from a node n0 to a node nk in the graph is defined as a traversal of (L + 1) contiguous nodes along a sequence of edges, where the first node is node number 0 and the last is node number k (so k = L).

    CREATE VIEW Paths
    (begin_node_id, end_node_id, this_node_id, seq_nbr,
     begin_lft, begin_rgt, end_lft, end_rgt,
     this_lft, this_rgt)
    AS
    SELECT PE.begin_node_id, PE.end_node_id, G1.node_id,
           (SELECT COUNT(*)
              FROM NestedSetsGraph AS G2
             WHERE G2.lft > PE.begin_lft
               AND G2.lft <= G1.lft
               AND G2.rgt >= G1.rgt),
           PE.begin_lft, PE.begin_rgt,
           PE.end_lft, PE.end_rgt,
           G1.lft, G1.rgt
      FROM PathEndpoints AS PE
           INNER JOIN
           NestedSetsGraph AS G1
           ON G1.lft BETWEEN PE.begin_lft AND PE.end_lft
              AND G1.rgt >= PE.end_rgt;

30.2.1 Length of Paths

The length of a path is the number of edges that are traversed along the path. A path of n nodes has a length of (n − 1).

    CREATE VIEW PathLengths
    (begin_node_id, end_node_id, path_length,
     begin_lft, begin_rgt, end_lft, end_rgt)
    AS
    SELECT begin_node_id, end_node_id, MAX(seq_nbr),
           begin_lft, begin_rgt, end_lft, end_rgt
      FROM Paths
     GROUP BY begin_lft, end_lft, begin_rgt, end_rgt,
              begin_node_id, end_node_id;

30.2.2 Shortest Path

The following code gives the shortest path length between all nodes, but it does not tell you what the actual path is. There are other queries that use recursive common table expressions (CTEs), which we will discuss in Section 30.3.
    CREATE VIEW ShortestPathLengths
    (begin_node_id, end_node_id, path_length,
     begin_lft, begin_rgt, end_lft, end_rgt)
    AS
    SELECT PL.begin_node_id, PL.end_node_id, PL.path_length,
           PL.begin_lft, PL.begin_rgt, PL.end_lft, PL.end_rgt
      FROM (SELECT begin_node_id, end_node_id,
                   MIN(path_length) AS path_length
              FROM PathLengths
             GROUP BY begin_node_id, end_node_id) AS MPL
           INNER JOIN
           PathLengths AS PL
           ON MPL.begin_node_id = PL.begin_node_id
              AND MPL.end_node_id = PL.end_node_id
              AND MPL.path_length = PL.path_length;

30.2.3 Paths by Iteration

First, let’s build a graph that has a cost associated with each edge and put it into an adjacency list model.

    INSERT INTO Edges (out_node, in_node, cost)
    VALUES ('A', 'B', 50),
           ('A', 'C', 30),
           ('A', 'D', 100),
           ('A', 'E', 10),
           ('C', 'B', 5),
           ('D', 'B', 20),
           ('D', 'C', 50),
           ('E', 'D', 10);

To find the shortest paths from one node to the other nodes it can reach, we can write this recursive VIEW.

    CREATE VIEW ShortestPaths (out_node, in_node, path_length)
    AS
    WITH RECURSIVE Paths (out_node, in_node, path_length)
    AS
    (SELECT out_node, in_node, 1
       FROM Edges
     UNION ALL
     SELECT E1.out_node, P1.in_node, P1.path_length + 1
       FROM Edges AS E1, Paths AS P1
      WHERE E1.in_node = P1.out_node)
    SELECT out_node, in_node, MIN(path_length)
      FROM Paths
     GROUP BY out_node, in_node;

    out_node  in_node  path_length
    ==============================
    'A'       'B'      1
    'A'       'C'      1
    'A'       'D'      1
    'A'       'E'      1
    'C'       'B'      1
    'D'       'B'      1
    'D'       'C'      1
    'E'       'B'      2
    'E'       'C'      2
    'E'       'D'      1

To find the shortest paths without recursion, stay in a loop and add one edge at a time to the set of paths defined so far.
    CREATE PROCEDURE IteratePaths()
    LANGUAGE SQL
    MODIFIES SQL DATA
    BEGIN
    DECLARE old_path_tally INTEGER;
    SET old_path_tally = 0;
    DELETE FROM Paths; -- clean out working table
    INSERT INTO Paths
    SELECT out_node, in_node, 1
      FROM Edges; -- load the edges
    -- add one edge to each path
    WHILE old_path_tally < (SELECT COUNT(*) FROM Paths)
    DO SET old_path_tally = (SELECT COUNT(*) FROM Paths);
       INSERT INTO Paths (out_node, in_node, lgth)
       SELECT E1.out_node, P1.in_node, (1 + P1.lgth)
         FROM Edges AS E1, Paths AS P1
        WHERE E1.in_node = P1.out_node
          AND NOT EXISTS -- path is not here already
              (SELECT *
                 FROM Paths AS P2
                WHERE E1.out_node = P2.out_node
                  AND P1.in_node = P2.in_node);
    END WHILE;
    END;

The least cost path is basically the same algorithm, but instead of a constant of one for the path length, we use the actual costs of the edges.

    CREATE PROCEDURE IterateCheapPaths()
    LANGUAGE SQL
    MODIFIES SQL DATA
    BEGIN
    DECLARE old_path_cost INTEGER;
    SET old_path_cost = 0;
    DELETE FROM Paths; -- clean out working table
    INSERT INTO Paths
    SELECT out_node, in_node, cost
      FROM Edges; -- load the edges
    -- add one edge to each path
    WHILE old_path_cost < (SELECT COUNT(*) FROM Paths)
    DO SET old_path_cost = (SELECT COUNT(*) FROM Paths);
       INSERT INTO Paths (out_node, in_node, cost)
       SELECT E1.out_node, P1.in_node, (E1.cost + P1.cost)
         FROM Edges AS E1
              INNER JOIN
              (SELECT out_node, in_node, MIN(cost)
                 FROM Paths
                GROUP BY out_node, in_node)
              AS P1 (out_node, in_node, cost)
              ON E1.in_node = P1.out_node
                 AND NOT EXISTS
                     (SELECT *
                        FROM Paths AS P2
                       WHERE E1.out_node = P2.out_node
                         AND P1.in_node = P2.in_node
                         AND P2.cost <= E1.cost + P1.cost);
    END WHILE;
    END;

30.2.4 Listing the Paths

I took the data for this table from the book Introduction to Algorithms (Cormen, Leiserson, and Rivest 1990), page 518. This book was very popular in college courses in the United States.
I made one decision that will be important later: I added self-traversal edges (i.e., the node is both the out_node and the in_node of an edge) with weights of zero.

    INSERT INTO Edges VALUES ('s', 's', 0);
    INSERT INTO Edges VALUES ('s', 'u', 3);
    INSERT INTO Edges VALUES ('s', 'x', 5);
    INSERT INTO Edges VALUES ('u', 'u', 0);
    INSERT INTO Edges VALUES ('u', 'v', 6);
    INSERT INTO Edges VALUES ('u', 'x', 2);
    INSERT INTO Edges VALUES ('v', 'v', 0);
    INSERT INTO Edges VALUES ('v', 'y', 2);
    INSERT INTO Edges VALUES ('x', 'u', 1);
    INSERT INTO Edges VALUES ('x', 'v', 4);
    INSERT INTO Edges VALUES ('x', 'x', 0);
    INSERT INTO Edges VALUES ('x', 'y', 6);
    INSERT INTO Edges VALUES ('y', 's', 3);
    INSERT INTO Edges VALUES ('y', 'v', 7);
    INSERT INTO Edges VALUES ('y', 'y', 0);

I am not happy about this approach, because I have to decide the maximum number of edges in a path before I start looking for an answer. But this solution will work, and I know that a path will have no more than the total number of nodes in the graph.

Let’s create a table to hold the paths:

    CREATE TABLE Paths
    (step1 CHAR(2) NOT NULL,
     step2 CHAR(2) NOT NULL,
     step3 CHAR(2) NOT NULL,
     step4 CHAR(2) NOT NULL,
     step5 CHAR(2) NOT NULL,
     total_cost INTEGER NOT NULL,
     path_length INTEGER NOT NULL,
     PRIMARY KEY (step1, step2, step3, step4, step5));
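One way the Paths table might be loaded is with a four-way self-join on Edges, where each copy of Edges supplies one step of the walk. This is only a sketch under stated assumptions, not necessarily the method the text goes on to use. It assumes that Edges has the columns (out_node, in_node, cost) used in Section 30.2.3 and holds no duplicate rows. The start node 's' and end node 'y' are hard-wired for illustration. The path_length computation, which counts only the edges that are not self-traversals, is an assumption about what that column should hold.

```sql
-- Sketch: enumerate every five-step walk from 's' to 'y'.
-- Assumes Edges (out_node, in_node, cost) holds no duplicate rows.
INSERT INTO Paths
       (step1, step2, step3, step4, step5, total_cost, path_length)
SELECT E1.out_node, E2.out_node, E3.out_node, E4.out_node, E4.in_node,
       (E1.cost + E2.cost + E3.cost + E4.cost),
       -- count only real edges; zero-cost self-traversals are padding
       (CASE WHEN E1.out_node <> E1.in_node THEN 1 ELSE 0 END
        + CASE WHEN E2.out_node <> E2.in_node THEN 1 ELSE 0 END
        + CASE WHEN E3.out_node <> E3.in_node THEN 1 ELSE 0 END
        + CASE WHEN E4.out_node <> E4.in_node THEN 1 ELSE 0 END)
  FROM Edges AS E1, Edges AS E2, Edges AS E3, Edges AS E4
 WHERE E1.out_node = 's'          -- hypothetical start node
   AND E1.in_node = E2.out_node   -- each edge begins where the last ended
   AND E2.in_node = E3.out_node
   AND E3.in_node = E4.out_node
   AND E4.in_node = 'y';          -- hypothetical end node

-- List the cheapest walks that were found.
SELECT step1, step2, step3, step4, step5, total_cost, path_length
  FROM Paths
 WHERE total_cost = (SELECT MIN(total_cost) FROM Paths);
```

The zero-cost self-traversal edges are what let a route with fewer than four real edges fit into the five fixed columns. Because the padding can fall in several positions, the same underlying route can appear in more than one row.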