Document: Expert SQL Server 2008 Development - P9 (docx)

DOCUMENT INFORMATION

Basic information

Format
Pages	50
File size	833.71 KB

Content

CHAPTER 12 TREES, HIERARCHIES, AND GRAPHS

    AS
    (
        SELECT
            @Start AS theStart,
            IntersectionId_End AS theEnd
        FROM dbo.StreetSegments
        WHERE
            IntersectionId_Start = @Start

        UNION ALL

        SELECT
            p.theEnd,
            ss.IntersectionId_End
        FROM Paths p
        JOIN dbo.StreetSegments ss ON ss.IntersectionId_Start = p.theEnd
        WHERE p.theEnd <> @End
    )
    SELECT *
    FROM Paths;
    GO

The anchor part of the CTE finds all nodes to which the starting intersection is connected; in this case, given the data we've already input, there is only one. The recursive part uses the anchor's output as its input, finding all connected nodes from there, and continuing from a row only if that row's endpoint is not the destination intersection. The output for this query is as follows:

    theStart  theEnd
    1         2
    2         3
    3         4

While this output is correct and perfectly descriptive with only one path between the two points, it has some problems. First of all, the ordering of the output of a CTE (just like any other query) is not guaranteed without an ORDER BY clause. In this case, the order happens to coincide with the order of the path, but this is a very small data set, and the server on which I ran the query has only one processor. On a bigger set of data and/or with multiple processors, SQL Server could choose to process the data in a different order, thereby destroying the implicit output order.

The second issue is that in this case there is exactly one path between the start and endpoints. What if there were more than one path? Figure 12-6 shows the street map with a new street, a few new intersections, and more street segments added. The following T-SQL can be used to add the new data to the appropriate tables:
    --New street
    INSERT INTO Streets VALUES (6, 'Lexington');
    GO

    --New intersections
    INSERT INTO Intersections VALUES
        (5, 'E'),
        (6, 'F'),
        (7, 'G'),
        (8, 'H');
    GO

    --New intersection/street mappings
    INSERT INTO IntersectionStreets VALUES
        (5, 1), (5, 6),
        (6, 2), (6, 6),
        (7, 3), (7, 6),
        (8, 4), (8, 6);
    GO

    --North/South segments
    INSERT INTO StreetSegments VALUES
        (2, 6, 2),
        (4, 8, 4);
    GO

    --East/West segments
    INSERT INTO StreetSegments VALUES
        (8, 7, 6),
        (7, 6, 6),
        (6, 5, 6);
    GO

Note that although intersections E and G have been created, their corresponding north/south segments have not yet been inserted. This is on purpose, as I'm going to use those segments to illustrate yet another complication.

Figure 12-6. A slightly more complete version of the street map

Once the new data is inserted, we can try the same CTE as before, this time traveling from Madison and 1st Avenue to Lexington and 1st Avenue. To change the destination, modify the DECLARE statement that assigns the @Start and @End variables as follows:

    DECLARE
        @Start int = dbo.GetIntersectionId('Madison', '1st Ave'),
        @End int = dbo.GetIntersectionId('Lexington', '1st Ave');

Having made these changes, the output of the CTE query is now as follows:

    theStart  theEnd
    1         2
    2         3
    2         6
    6         5
    3         4
    4         8
    8         7
    7         6
    6         5

There are now two paths from the starting point to the ending point, but it's impossible to tell what they are; the intersections involved in each path are mixed up in the output. To solve this problem, the CTE will have to "remember" on each iteration where it's been on previous iterations. Since each iteration of a CTE can only access the data from the previous iteration (and not all data from all previous iterations), each row will have to keep its own records inline.
This can be done using a materialized path notation, where each previously visited node will be appended to a running list. This requires adding a new column, thePath, to the CTE, as in the following code listing:

    DECLARE
        @Start int = dbo.GetIntersectionId('Madison', '1st Ave'),
        @End int = dbo.GetIntersectionId('Lexington', '1st Ave');

    WITH Paths
    AS
    (
        SELECT
            @Start AS theStart,
            IntersectionId_End AS theEnd,
            CAST('/' +
                CAST(@Start AS varchar(255)) + '/' +
                CAST(IntersectionId_End AS varchar(255)) + '/'
                AS varchar(255)
            ) AS thePath
        FROM dbo.StreetSegments
        WHERE
            IntersectionId_Start = @Start

        UNION ALL

        SELECT
            p.theEnd,
            ss.IntersectionId_End,
            CAST(p.thePath +
                CAST(ss.IntersectionId_End AS varchar(255)) + '/'
                AS varchar(255)
            )
        FROM Paths p
        JOIN dbo.StreetSegments ss ON ss.IntersectionId_Start = p.theEnd
        WHERE p.theEnd <> @End
    )
    SELECT *
    FROM Paths;
    GO

This code will start to form a list of visited nodes. If node A (IntersectionId 1) is specified as the start point, the output for this column for the anchor member will be /1/2/, since node B (IntersectionId 2) is the only node that participates in a street segment starting at node A. As new nodes are visited, their IDs will be appended to the list, producing a "breadcrumb" trail of all visited nodes.

Note that the columns in both the anchor and recursive members are CAST to make sure their data types are identical. This is required because the varchar size changes due to concatenation, and all columns exposed by the anchor and recursive members must have identical types.
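For readers who want to experiment with the breadcrumb idea outside SQL Server, the following is a minimal Python sketch of the same level-by-level expansion. It is only an illustration, not the chapter's code: the function name enumerate_paths is hypothetical, and the segment list is an assumption reconstructed from the query output shown in this section (the data before the final two segments are added, so the graph has no cycles).

```python
from collections import defaultdict

# Assumed directed segments (start, end), reconstructed from the outputs in
# this section; the real StreetSegments table may hold more columns and rows.
SEGMENTS = [(1, 2), (2, 3), (3, 4), (2, 6), (4, 8), (8, 7), (7, 6), (6, 5)]


def enumerate_paths(segments, start, end):
    """Mimic the recursive CTE: every row carries its own breadcrumb string."""
    outgoing = defaultdict(list)
    for seg_start, seg_end in segments:
        outgoing[seg_start].append(seg_end)

    # Anchor member: one row per segment leaving the start node.
    current = [(start, nxt, f"/{start}/{nxt}/") for nxt in outgoing[start]]
    rows = []
    while current:
        rows.extend(current)
        # Recursive member: only rows that have not reached the destination
        # are extended, and each new row appends its node to the parent path.
        current = [
            (the_end, nxt, f"{path}{nxt}/")
            for (_, the_end, path) in current
            if the_end != end
            for nxt in outgoing[the_end]
        ]
    return rows


rows = enumerate_paths(SEGMENTS, start=1, end=5)
# Equivalent of filtering the outer query on theEnd = @End.
complete = sorted(path for (_, the_end, path) in rows if the_end == 5)
print(complete)
```

Because each row only ever sees its parent row's path, the sketch has the same limitation as the CTE: nothing here stops a path from revisiting a node if the data contains a cycle.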
The output of the CTE after making these modifications is as follows:

    theStart  theEnd  thePath
    1         2       /1/2/
    2         3       /1/2/3/
    2         6       /1/2/6/
    6         5       /1/2/6/5/
    3         4       /1/2/3/4/
    4         8       /1/2/3/4/8/
    8         7       /1/2/3/4/8/7/
    7         6       /1/2/3/4/8/7/6/
    6         5       /1/2/3/4/8/7/6/5/

The output now includes the complete paths to the endpoints, but it still includes all subpaths visited along the way. To finish, add the following to the outermost query:

    WHERE theEnd = @End

This will limit the results to only paths that actually end at the specified endpoint, in this case node E (IntersectionId 5). After making that addition, only the two paths that actually visit both the start and end nodes are shown.

The CTE still has one major problem as-is. Figure 12-7 shows a completed version of the map, with the final two street segments filled in. The following T-SQL can be used to populate the StreetSegments table with the new data:

    INSERT INTO StreetSegments VALUES
        (5, 1, 1),
        (7, 3, 3);
    GO

Figure 12-7. A version of the map with all segments filled in

Rerunning the CTE after introducing the new segments results in the following partial output (abbreviated):

    theStart  theEnd  thePath
    6         5       /1/2/6/5/
    6         5       /1/2/3/4/8/7/6/5/
    6         5       /1/2/3/4/8/7/3/4/8/7/6/5/
    6         5       /1/2/3/4/8/7/3/4/8/7/3/4/8/7/6/5/
    6         5       /1/2/3/4/8/7/3/4/8/7/3/4/8/7/3/4/8/7/6/5/
    6         5       /1/2/3/4/8/7/3/4/8/7/3/4/8/7/3/4/8/7/3/4/8/7/6/5/
    6         5       /1/2/3/4/8/7/3/4/8/7/3/4/8/7/3/4/8/7/3/4/8/7/3/4/8/7/6/5/
    6         5       /1/2/3/4/8/7/3/4/8/7/3/4/8/7/3/4/8/7/3/4/8/7/3/4/8/7/3/4/8/7/6/5/
    ...

along with the following error:

    Msg 530, Level 16, State 1, Line 9
    The statement terminated. The maximum recursion 100 has been exhausted before statement completion.

The issue is that these new segments create cycles in the graph.
The problem can be seen to start at the fourth line of the output, when the recursion first visits node G (IntersectionId 7). From there, one can go one of two ways: west to node F (IntersectionId 6) or north to node C (IntersectionId 3). Following the first route, the recursion eventually completes. But following the second route, the recursion will keep coming back to node G again and again, following the same two branches. Eventually, the default recursive limit of 100 is reached and execution ends with an error. Note that this default limit can be overridden using the OPTION (MAXRECURSION N) query hint, where N is the maximum recursive depth you'd like to use. In this case, 100 is a good limit because it quickly tells us that there is a major problem!

Fixing this issue, luckily, is quite simple: check the path to find out whether the next node has already been visited, and if so, do not visit it again. Since the path is a string, this can be accomplished using a LIKE predicate by adding the following argument to the recursive member's WHERE clause:

    AND p.thePath NOT LIKE '%/' + CONVERT(varchar, ss.IntersectionId_End) + '/%'

This predicate checks to make sure that the ending IntersectionId, delimited by / on both sides, does not yet appear in the path; in other words, that the node has not yet been visited. This will make it impossible for the recursion to fall into a cycle. Running the CTE after adding this fix eliminates the cycle issue.
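The effect of the visited-node check can also be seen in a small Python sketch. This is an illustration under stated assumptions rather than a translation of the chapter's T-SQL: the function name find_paths, the skip_visited flag, and the max_recursion guard are hypothetical, the guard is only loosely analogous to the MAXRECURSION limit, and the segment list is reconstructed from the outputs shown in this section, including the two cycle-forming segments.

```python
from collections import defaultdict

# Assumed directed segments (start, end), including the final two segments
# (5 -> 1 and 7 -> 3) that introduce cycles into the graph.
SEGMENTS = [
    (1, 2), (2, 3), (3, 4), (2, 6), (4, 8), (8, 7), (7, 6), (6, 5),
    (5, 1), (7, 3),
]


def find_paths(segments, start, end, skip_visited=True, max_recursion=100):
    """Return all breadcrumb paths from start to end, level by level."""
    outgoing = defaultdict(list)
    for seg_start, seg_end in segments:
        outgoing[seg_start].append(seg_end)

    current = [(nxt, f"/{start}/{nxt}/") for nxt in outgoing[start]]
    found = []
    depth = 0
    while current:
        found.extend(path for node, path in current if node == end)
        depth += 1
        if depth > max_recursion:
            # Stand-in for SQL Server's "maximum recursion exhausted" error.
            raise RuntimeError("maximum recursion exhausted")
        current = [
            (nxt, f"{path}{nxt}/")
            for node, path in current
            if node != end
            for nxt in outgoing[node]
            # Same idea as the NOT LIKE predicate: the delimited ID must not
            # already appear in the breadcrumb string.
            if not (skip_visited and f"/{nxt}/" in path)
        ]
    return sorted(found)


print(find_paths(SEGMENTS, start=1, end=5))
```

Calling find_paths with skip_visited=False on the same data keeps looping through nodes 3, 4, 8, and 7 until the guard raises, which mirrors the runaway behavior described above. Delimiting the ID with / on both sides matters in both versions: without the delimiters, a check for node 1 would also match node 10 or 11.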
The full code for the fixed CTE follows:

    DECLARE
        @Start int = dbo.GetIntersectionId('Madison', '1st Ave'),
        @End int = dbo.GetIntersectionId('Lexington', '1st Ave');

    WITH Paths
    AS
    (
        SELECT
            @Start AS theStart,
            IntersectionId_End AS theEnd,
            CAST('/' +
                CAST(@Start AS varchar(255)) + '/' +
                CAST(IntersectionId_End AS varchar(255)) + '/'
                AS varchar(255)
            ) AS thePath
        FROM dbo.StreetSegments
        WHERE
            IntersectionId_Start = @Start

        UNION ALL

        SELECT
            p.theEnd,
            ss.IntersectionId_End,
            CAST(p.thePath +
                CAST(ss.IntersectionId_End AS varchar(255)) + '/'
                AS varchar(255)
            )
        FROM Paths p
        JOIN dbo.StreetSegments ss ON ss.IntersectionId_Start = p.theEnd
        WHERE p.theEnd <> @End
            AND p.thePath NOT LIKE '%/' + CONVERT(varchar, ss.IntersectionId_End) + '/%'
    )
    SELECT *
    FROM Paths;
    GO

This concludes this chapter's coverage of general graphs. The remainder of the chapter deals with modeling and querying of hierarchies. Although hierarchies are much more specialized than graphs, they tend to be seen more often in software projects than general graphs, and developers must consider slightly different issues when modeling them.

Advanced Routing

The example shown in this section is highly simplified, and it is designed to teach the basics of querying graphs rather than serve as a complete routing solution. I have had the pleasure of working fairly extensively with a production system designed to traverse actual street routes and will briefly share some of the insights I have gained in case you are interested in these kinds of problems.

The first issue with the solution shown here is that of scalability. A big city has tens of thousands of street segments, and determining a route from one end of the city to another using this method will create a combinatorial explosion of possibilities. In order to reduce the number of combinations, a few things can be done.
First of all, each segment can be weighted, and a score tallied along the way as you recurse over the possible paths. If the score gets too high, you can terminate the recursion. For example, in the system I worked on, weighting was done based on distance traveled. The algorithm used was fairly complex, but essentially, if a destination was 2 miles away and the route went over 3 miles, recursion would be terminated for that branch. This scoring also lets the system determine the shortest possible routes.

Another method used to greatly decrease the number of combinations was an analysis of the input set of streets, and a determination made of major routes between certain locations. For instance, traveling from one end of the city to another is usually most direct on a freeway. If the system determines that a freeway route is appropriate, it breaks the routing problem down into two sections: first, find the shortest route from the starting point to a freeway on-ramp, and then find the shortest route from the endpoint to a freeway exit. Put these routes together, including the freeway travel, and you have an optimized path from the starting point to the ending point. Major routes, like freeways, can be underweighted in order to make them appear higher in the scoring rank.

If you'd like to try working with real street data, you can download US geographical shape files (including streets as well as various natural formations) for free from the US Census Bureau. The data, called TIGER/Line, is available from www.census.gov/geo/www/tiger/index.html. Be warned: this data is not easy to work with and requires a lot of cleanup to get it to the point where it can be easily queried.

Adjacency List Hierarchies

As mentioned previously, any kind of graph can be modeled using an adjacency list.
This of course includes hierarchies, which are nothing more than rooted, directed, acyclic graphs with exactly one path between any two nodes (irrespective of direction). Adjacency list hierarchies are very easy to model, visualize, and understand, but can be tricky or inefficient to query in some cases since they require iteration or recursion, as I'll discuss shortly.

Traversing an adjacency list hierarchy is virtually identical to traversing an adjacency list graph, but since hierarchies don't have cycles, you don't need to worry about them in your code. This is a nice feature, since it makes your code shorter, easier to understand, and more efficient. However, being able to make the assumption that your data really does follow a hierarchical structure (and not that of a general graph) takes a bit of work up front. See "Constraining the Hierarchy" later in this section for information on how to make sure that your hierarchies don't end up with cycles, multiple roots, or disconnected subtrees.

The most commonly recognizable example of an adjacency list hierarchy is a self-referential personnel table that models employees and their managers. Since it's such a common and easily understood example, this is the scenario that will be used for this section and the rest of this chapter.

To start, we'll create a simple adjacency list based on three columns of data from the HumanResources.Employee table of the AdventureWorks database. The columns used will be as follows:

• EmployeeID is the primary key for each row of the table. Most of the time, adjacency list hierarchies are modeled in a node-centric rather than edge-centric way; that is, the primary key of the hierarchy is the key for a given node, rather than a key representing an edge. This makes sense because each node in a hierarchy can only have one direct ancestor.

• ManagerID is the key for the employee that each row reports to in the same table.
If ManagerID is NULL, that employee is the root node in the tree (i.e., the head of the company). It's common when modeling adjacency list hierarchies to use either NULL or an identical key to the row's primary key to represent root nodes.

• Finally, the Title column, representing employees' job titles, will be used to make the output easier to read.

You can use the following T-SQL to create a table based on these columns:

    USE AdventureWorks;
    GO

    CREATE TABLE Employee_Temp
    (
        EmployeeID int NOT NULL
            CONSTRAINT PK_Employee PRIMARY KEY,
        ManagerID int NULL
            CONSTRAINT FK_Manager REFERENCES Employee_Temp (EmployeeID),
        Title nvarchar(100)
    );
    GO

    INSERT INTO Employee_Temp
    (
        EmployeeID,
        ManagerID,
        Title
    )
    SELECT
        EmployeeID,
        ManagerID,
        Title
    FROM HumanResources.Employee;
    GO

The types of questions generally posed against a hierarchy are somewhat different from the example graph traversal questions examined in the previous section. For adjacency lists as well as the other hierarchical models discussed in this chapter, we'll consider how to answer the following common questions:

• What are the direct descendants of a given node? In other words, who are the people who directly report to a given manager?

• What are all of the descendants of a given node? Which is to say, how many people all the way down the organizational hierarchy ultimately report up to a given manager? The challenge here is how to sort the output so that it makes sense with regard to the hierarchy.

• What is the path from a given child node back to the root node? In other words, following the management path up instead of down, who reports to whom?
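To make the three questions concrete, here is a minimal Python sketch that answers each of them against a tiny in-memory adjacency list. It is only an illustration of the traversals, not the chapter's T-SQL: the function names are hypothetical, the sample contains just a handful of rows, and the two deepest rows (EmployeeID 3 and 4) are illustrative assumptions rather than data quoted from this section.

```python
# EmployeeID -> (ManagerID, Title); a ManagerID of None marks the root node.
EMPLOYEES = {
    109: (None, "Chief Executive Officer"),
    6: (109, "Marketing Manager"),
    12: (109, "Vice President of Engineering"),
    3: (12, "Engineering Manager"),   # illustrative row (assumption)
    4: (3, "Senior Tool Designer"),   # illustrative row (assumption)
}


def direct_reports(manager_id):
    """Question 1: nodes whose parent is the given node."""
    return sorted(
        emp_id
        for emp_id, (mgr_id, _title) in EMPLOYEES.items()
        if mgr_id == manager_id
    )


def all_descendants(manager_id, depth=1):
    """Question 2: every node below the given node, in nested order."""
    result = []
    for emp_id in direct_reports(manager_id):
        result.append((emp_id, depth))
        result.extend(all_descendants(emp_id, depth + 1))
    return result


def path_to_root(employee_id):
    """Question 3: follow parent keys upward until the root is reached."""
    path = []
    current = employee_id
    while current is not None:
        path.append(current)
        current = EMPLOYEES[current][0]
    return path


print(direct_reports(109))
print(all_descendants(109))
print(path_to_root(4))
```

The recursion in all_descendants terminates only because the data is a true hierarchy. If a cycle were present, the same visited-node bookkeeping used for general graphs would be needed.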
I will also discuss the following data modification challenges:

• Inserting a new node into the hierarchy, as when a new employee is hired

• Relocating a subtree, such as might be necessary if a division gets moved under a new manager

• Deleting a node from the hierarchy, which might, for example, need to happen in an organizational hierarchy due to attrition

Each of the techniques discussed in this chapter has a slightly different level of difficulty when solving these problems, and I will make general suggestions on when to use each model.

Finding Direct Descendants

Finding the direct descendants of a given node is quite straightforward in an adjacency list hierarchy; it's the same as finding the available nodes to which you can traverse in a graph. Start by choosing the parent node for your query, and select all nodes for which that node is the parent. To find all employees that report directly to the CEO (EmployeeID 109), use the following T-SQL:

    SELECT *
    FROM Employee_Temp
    WHERE ManagerID = 109;

This query returns the results shown following, showing the six branches of AdventureWorks, represented by its upper management team, exactly the results we expected.

    EmployeeID  ManagerID  Title
    6           109        Marketing Manager
    12          109        Vice President of Engineering
    42          109        Information Services Manager
    140         109        Chief Financial Officer
    148         109        Vice President of Production
    273         109        Vice President of Sales

However, this query has a hidden problem: traversing from node to node in the Employee_Temp table means searching based on the ManagerID column. Considering that this column is not indexed, it should come as no surprise that the query plan for the preceding query involves a scan, as shown in Figure 12-8.

Figure 12-8. Querying on the ManagerID causes a table scan.
To eliminate this issue, an index on the ManagerID column must be created. However, choosing exactly how best to index a table such as this one can be difficult. In the case of this small example, a clustered index on ManagerID would yield the best overall mix of performance for both querying and data updates, by covering all queries that involve traversing the table. However, in an actual production system, there might be a much higher percentage of queries based on the EmployeeID (for instance, queries to get a single employee's data), and there would probably be a lot more columns in the table than the three used here for example purposes, meaning that clustered key lookups could be expensive. In such a case, it is important to test carefully which combination of indexes delivers the best balance of query and data modification performance for your particular workload.

In order to show the best possible performance in this case, change the primary key to use a nonclustered index and create a clustered index on ManagerID, as shown in the following T-SQL:

    ALTER TABLE Employee_Temp
    DROP CONSTRAINT FK_Manager, PK_Employee;

    CREATE CLUSTERED INDEX IX_Manager
    ON Employee_Temp (ManagerID);

    ALTER TABLE Employee_Temp
    ADD CONSTRAINT PK_Employee
    PRIMARY KEY NONCLUSTERED (EmployeeID);
    GO

[...]

Are CTEs the Best Choice?

While CTEs are possibly the most convenient way to traverse adjacency list hierarchies in SQL Server 2008, they do not necessarily deliver the best possible performance. Iterative methods involving temporary tables or table variables may well outperform recursive CTEs, especially as the hierarchy...
Datatype

In the preceding section, I discussed one of the limitations of the materialized path approach, namely that the string encoding makes it difficult to work with deep hierarchies. Fortunately, in SQL Server 2008, the hierarchyid datatype was introduced, which essentially stores hierarchical data using materialized paths that are serialized into a CLR datatype. While this doesn't provide much additional...

with zeros. This ensures that, for instance, the path 1/2/ does not sort higher than the path 1/10/. The numbers are padded to ten digits to support the full range of positive integer values supported by SQL Server's int data type. Note that siblings in this case are ordered based on their EmployeeID. Changing the ordering of siblings (for instance, to alphabetical order based on Title) requires a bit of manipulation...

    Employee_Temp.EmployeeID;
    GO

varchar(900) is important in this case because the materialized path will be used as an index key in order to allow it to be efficiently used to traverse the hierarchy. Index keys in SQL Server are limited to 900 bytes. This is also a bit of a limitation for persisted materialized paths; a path to navigate an especially deep hierarchy will not be indexable and therefore will not be usable...

included):

    CREATE NONCLUSTERED INDEX IX_Employee_Temp_Path
    ON Employee_Temp (thePath)
    INCLUDE (EmployeeID, Title);

Finding Subordinates

Since the materialized path is a string, we can take advantage of SQL Server's LIKE predicate to traverse down the hierarchy. The path for every given node N that is a subordinate of some node M starts with node M's path. Looking back at the results of the enumerated path...
manager's path. Finally, concatenate the employee's ID onto the end of the path. The most important thing to mention about this trigger is its limitation when it comes to multirow inserts. Due to the fact that SQL Server does not have any guarantees when it comes to update order, it is possible to create invalid paths by inserting two nodes at the same time. For instance, try disabling the row count check and inserting...

row number that represents the current ordered sibling. This can be done using SQL Server's ROW_NUMBER function, and is sometimes referred to as enumerating the path. The following modified version of the CTE enumerates the path:

    WITH n AS
    (
        SELECT
            EmployeeID,
            ManagerID,
            Title,
            CONVERT(varchar(900),...

all string representations of hierarchyid data, both begins and ends with an oblique stroke. This style of syntax is probably familiar to all developers who have previously used XQuery functionality in SQL Server or elsewhere.

• The hierarchyid values are returned in the result set in...

described in the section "Traversing the Graph," but without the need to be concerned with cycles. By ordering by the path, the output will follow the same nested order as the hierarchy itself. The following T-SQL shows how to accomplish this:

    WITH n AS
    (
        SELECT
            EmployeeID,
            ManagerID,
            Title,
            CONVERT(varchar(900),
                RIGHT(REPLICATE('0', 10) + CONVERT(varchar, EmployeeID), 10) + '/'
            ) AS thePath
        FROM Employee_Temp...
designed solely to determine those employees that report to a given manager, but it is not necessarily the best design for a general purpose employees table.

Once this change has been made, rerunning the T-SQL to find the CEO's direct reports produces a clustered index seek instead of a scan, a small improvement that will be magnified when performing queries against a table with a greater number of rows.

Traversing...

Posted: 24/12/2013, 02:18
