SQL Server MVP Deep Dives- P3

Table Testing functional dependencies for Qty
Table Testing functional dependencies for TotalPrice
Table Testing functional dependencies for OrderTotal

None of these examples were rejected by the domain expert, so I was able to conclude that there are no more single-column dependencies in this table. Note that I didn't produce these three examples at the same time. I created them one by one, for if there had been more functional dependencies, I could've further reduced the number of tests still needed. But because there turned out to be no more dependencies, I decided to combine them in this description, to save space and reduce the repetitiveness.

Second step: finding two-attribute dependencies
After following the preceding steps, I can now be sure that I've found all the cases where an attribute depends on one of the other attributes. But there can also be attributes that depend on two, or even more, attributes. In fact, I hope there are, because I'm still left with a few attributes that don't depend on any other attribute. If you ever run into this, it's a sure sign of one or more missing attributes on your shortlist, one of the hardest problems to overcome in data modeling.

The method for finding multiattribute dependencies is the same as that for single-attribute dependencies: for every possible combination, create a sample with two rows that duplicate the columns to test and don't duplicate any other column. If at this point I hadn't found any dependency yet, I'd be facing an awful lot of combinations to test. Fortunately, I've already found some dependencies (which you'll find is almost always the case if you start using
this method for your modeling), so I can rule out most of these combinations.

At this point, if you haven't already done so, you should remove attributes that don't depend on the candidate key, or that transitively depend on the primary key. You'll have noticed that I already did so. Not moving these attributes to their own tables now will make this step unnecessarily complex.

The key to reducing the number of possible combinations is to observe that at this point, you can only have three kinds of attributes in the table: a single-attribute candidate key (or more, in the case of a mutual dependency); one or more attributes that depend on the candidate key; and one or more attributes that don't depend on the candidate key, or on any other attribute (as we tested all single-attribute dependencies). Because we already moved attributes that depend on an attribute other than the candidate key, these are the only three kinds of attributes we have to deal with. And that means that there are six possible kinds of combinations to consider: a candidate key and a dependent attribute; a candidate key and an independent attribute; a dependent attribute and an independent attribute; two independent attributes; two dependent attributes; or two candidate keys. Because alternate keys always have a mutual dependency, the last category is a special case of the one before it, so I won't cover it explicitly. Each of the remaining five possibilities will be covered below.

CANDIDATE KEY AND DEPENDENT ATTRIBUTE
This combination (as well as the combination of two candidate keys, as I already mentioned) can be omitted completely. I won't bother you with the mathematical proof, but instead will try to explain it in language intended for mere mortals. Given three attributes (A, B, and C), if there's a dependency from the combination of A and B to C, that would imply that for each
possible combination of values for A and B, there can be at most one value of C. But if there's also a dependency of A to B, this means that for every value of A, there can be at most one value of B; in other words, there can be only one combination of A and B for every value of A, and hence only one value of C for every value of A. So it naturally follows that if B depends on A, then every attribute that depends on A will also depend on the combination of A and B, and every attribute that doesn't depend on A can't depend on the combination of A and B.

CANDIDATE KEY AND INDEPENDENT ATTRIBUTE
For this combination, some testing is required. In fact, I'll test this combination first, because it's the most common, and the sooner I find extra dependencies, the sooner I can start removing attributes from the table, cutting down on the number of other combinations to test. But, as before, it's not required to test all other attributes for dependency on a given combination of a candidate key and an independent attribute. Every attribute that depends on the candidate key will also appear to depend on any combination of the candidate key with any other attribute. This isn't a real dependency, so there's no need to test for it, or to conclude the existence of such a dependency.

This means that in my example, I need to test the combinations of OrderNo and Product, OrderNo and Qty, and OrderNo and TotalPrice. And when testing the first combination (OrderNo and Product), I can omit the attributes CustomerID and OrderTotal, but I need to test whether Qty or TotalPrice depend on the combination of OrderNo and Product, as shown in the next table. (Also note how in this case I was able to observe the previously discovered business rule that TotalPrice = Qty x Price: even though Price is no longer included in the table, it is still part of the total collection of
data, and still included in the domain expert's familiar notation.)

Table Testing functional dependencies for the combination of OrderNo and Product

OrderNo  CustomerID  Product  Qty  TotalPrice  OrderTotal
7001     12          Gizmo    10   125.00      225.00
7001     12          Gizmo    12   150.00      225.00

The domain expert rejected the sample order confirmation I based on this data. As reason for this rejection, she told me that obviously, the orders for 10 and 12 units of Gizmo should've been combined on a single line, as an order for 22 units of Gizmo, at a total price of $275.00. This proves that Qty and TotalPrice both depend on the combination of OrderNo and Product. Second normal form requires me to create a new table with the attributes OrderNo and Product as key attributes, and Qty and TotalPrice as dependent attributes. I'll have to continue testing in this new table for two-attribute dependencies for all remaining combinations of two attributes, but I don't have to repeat the single-attribute dependencies, because they've already been tested before the attributes were moved to their own table. For the orders table, I now have only the OrderNo, CustomerID, and OrderTotal as remaining attributes.

TWO DEPENDENT ATTRIBUTES
This is another combination that should be included in the tests. Just as with a single dependent attribute, you'll have to test the key attribute (which will be dependent on the combination in case of a mutual dependency, in which case the combination is an alternate key) and the other dependent attributes (which will be dependent on the combination in case of a transitive dependency). In the case of my sample Orders table, I only have two dependent attributes left (CustomerID and OrderTotal), so there's only one combination to test. And the only other attribute is OrderNo, the key.

Table 10 Testing functional dependencies for the combination of CustomerID and OrderTotal

OrderNo  CustomerID  OrderTotal
7001     12          125.00
7002     12          125.00

So I create the test population of table 10 to check for a possible
alternate key. The domain expert saw no reason to reject this example (after I populated the related tables with data that observes all rules discovered so far), so there's obviously no dependency from CustomerID and OrderTotal to OrderNo.

TWO INDEPENDENT ATTRIBUTES
Because the Orders table used in my example has no independent columns anymore, I can obviously skip this combination. But if there still were two or more independent columns left, then I'd have to test each combination for a possible dependency of a candidate key, or any other independent attribute, upon this combination.

DEPENDENT AND INDEPENDENT ATTRIBUTES
This last possible combination is probably the least common, but there are cases where an attribute turns out to depend on a combination of a dependent and an independent attribute. Attributes that depend on the key attribute can't also depend on a combination of a dependent and an independent column (see the sidebar a few pages back for an explanation), so only candidate keys and other independent attributes need to be tested.

Further steps: three-and-more-attribute dependencies
It won't come as a surprise that you'll also have to test for dependencies on three or more attributes. But these are increasingly rare as the number of attributes increases, so you should make a trade-off between the amount of work involved in testing all possible combinations on one hand, and the risk of missing a dependency on the other. The amount of work involved is often fairly limited, because in the previous steps you'll often already have changed the model from a single many-attribute relation to a collection of relations with only a limited number of attributes each, and hence with a limited number of possible three-or-more-attribute combinations. For space reasons, I can't cover all possible combinations of three or more attributes here. But the same
logic applies as for the two-attribute dependencies, so if you decide to go ahead and test all combinations, you should be able to figure out for yourself which combinations to test and which to skip.

What if I have some independent attributes left?
At the end of the procedure, you shouldn't have any independent attributes left, except when the original collection of attributes was incomplete. Let's for instance consider the order confirmation form used earlier, but this time, there may be multiple products with the same product name but a different product ID. In this case, unless we add the product ID to the table before starting the procedure, we'll end up with the attributes Product, Qty, and Price as completely independent columns in the final result (go ahead, try it for yourself; it's a great exercise!). So if you ever happen to finish the procedure with one or more independent columns left, you'll know that either you or the domain expert made a mistake when producing and assessing the collections of test sample data, or you've failed to identify at least one of the candidate key attributes.

Summary
I've shown you a method to find all functional dependencies between attributes. If you've just read this chapter, or if you've already tried the method once or twice, it may seem like a lot of work for little gain. But once you get used to it, you'll find that this is very useful, and that the amount of work is less than it appears at first sight. For starters, in a real situation, many dependencies will be immediately obvious if you know a bit about the subject matter, and it'll be equally obvious that there are no dependencies between many attributes. There's no need to verify those with the domain expert. (Though you should keep in mind that some companies may have a specific situation that deviates from the ordinary.)
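Whether a dependency is obvious or needs a sample for the domain expert, every test in this method boils down to the same mechanical check: X determines Y only if no two rows can agree on X while differing on Y. Here is a minimal Python sketch of that check; the function name and the sample values are illustrative only, not taken from this chapter's code.

```python
# Sketch of the check behind each two-row sample: X -> Y fails as soon as
# two rows share the same X values but different Y values.
# Names and sample data are illustrative only.

def fd_holds(rows, determinant, dependent):
    """Return True if the determinant attributes determine the dependent one."""
    seen = {}
    for row in rows:
        key = tuple(row[attr] for attr in determinant)
        if key in seen and seen[key] != row[dependent]:
            return False  # same determinant values, different dependent value
        seen[key] = row[dependent]
    return True

# A two-row sample that duplicates OrderNo but nothing else; if the domain
# expert accepts this as valid data, CustomerID can't depend on OrderNo.
sample = [
    {"OrderNo": 7001, "CustomerID": 12, "Product": "Gizmo"},
    {"OrderNo": 7001, "CustomerID": 13, "Product": "Dooble"},
]

print(fd_holds(sample, ("OrderNo",), "CustomerID"))            # False
print(fd_holds(sample, ("OrderNo", "Product"), "CustomerID"))  # True
```

Note the direction of the reasoning: the sample is deliberately built to violate the candidate dependency, so the expert accepting it as plausible data is what rules the dependency out.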
Second, you'll find that if you start by testing the dependencies you suspect to be there, you'll quickly be able to divide the data over multiple relations with relatively few attributes each, thereby limiting the number of combinations to be tested. And finally, by cleverly combining multiple tests into a single example, you can limit the number of examples you have to run by the domain expert. This may not reduce the amount of work you have to do, but it does reduce the number of examples your domain expert has to assess, and she'll love you for it! As a bonus, this method can be used to develop sample data for unit testing, which can improve the quality of the database schema and stored procedures.

A final note of warning: there are some situations where, depending on the order you choose to do your tests, you might miss a dependency. You can find them too, but they're beyond the scope of this chapter. Fortunately, this will only happen in cases where rare combinations of dependencies between attributes exist, so it's probably best not to worry too much about it.

About the author
Hugo is cofounder and R&D lead of perFact BV, a Dutch company that strives to improve analysis methods and to develop computer-aided tools that will generate completely functional applications from the analysis deliverable. The chosen platform for this development is SQL Server. In his spare time, Hugo likes to share and enhance his knowledge of SQL Server by frequenting newsgroups and forums, reading and writing books and blogs, and attending and speaking at conferences.

PART Database Development
Edited by Adam Machanic

It can be argued that database development, as an engineering discipline, was born along with the relational model in 1970. It has been almost 40 years (as I write these words), yet the field continues to grow and evolve, seemingly at a faster rate every year. This tremendous growth can
easily be seen in the many facets of the Microsoft database platform. SQL Server is no longer just a simple SQL database system; it has become an application platform, a vehicle for the creation of complex and multifaceted data solutions. Today's database developer is expected to understand not only the Transact-SQL dialect spoken by SQL Server, but also the intricacies of the many components that must be controlled in order to make the database system do their bidding. This variety can be seen in the many topics discussed in the pages ahead: indexing, full-text search, SQL CLR integration, XML, external interfaces such as ADO.NET, and even mobile device development are all subjects within the realm of database development. The sheer volume of knowledge both required and available for consumption can seem daunting, but giving up is not an option. The most important thing we can do is understand that while no one can know everything, we can strive to continually learn and enhance our skill sets, and that is where this book comes in. The chapters in this section, as well as those in the rest of the book, were written by some of the top minds in the SQL Server world, and whether you're just beginning your journey into the world of database development or have several years of experience, you will undoubtedly learn something new from these experts.

It has been a pleasure and an honor working on this unique project with such an amazing group of writers, and I sincerely hope that you will thoroughly enjoy the results of our labor. I wish you the best of luck in all of your database development endeavors. Here's to the next 40 years.

About the editor
Adam Machanic is a Boston-based independent database consultant, writer, and speaker. He has written for numerous websites and magazines, including SQLblog, Simple Talk, Search SQL Server, SQL Server Professional, CODE, and VSJ. He has also
contributed to several books on SQL Server, including SQL Server 2008 Internals (Microsoft Press, 2009) and Expert SQL Server 2005 Development (Apress, 2007). Adam regularly speaks at user groups, community events, and conferences on a variety of SQL Server and .NET-related topics. He is a Microsoft Most Valuable Professional (MVP) for SQL Server, a Microsoft Certified IT Professional (MCITP), and a member of the INETA North American Speakers Bureau.

Set-based iteration, the third alternative
Hugo Kornelis

When reading SQL Server newsgroups or blogs, you could easily get the impression that there are two ways to manipulate data: declarative (set-based) or iterative (cursor-based). And that iterative code is always bad, and should be avoided like the plague. Those impressions are both wrong. Iterative code isn't always bad (though, in all honesty, it usually is). And there's more to SQL Server than declarative or iterative; there are ways to combine them, adding their strengths and avoiding their weaknesses. This article is about one such method: set-based iteration.

The technique of set-based iteration can lead to efficient solutions for problems that don't lend themselves to declarative solutions, because those would result in an amount of work that grows exponentially with the amount of data. In those cases, the trick is to find a declarative query that solves a part of the problem (as much as feasible) and that doesn't have the exponential performance problem, and then repeat that query until all work has been done. So instead of attempting a single set-based leap, or taking millions of single-row-sized miniature steps in a cursor, set-based iteration arrives at the destination by taking a few seven-mile leaps.

In this chapter, I'll first explain the need for an extra alternative by discussing the weaknesses and limitations of purely iterative and purely declarative coding. I'll then
explain the technique of set-based iteration by presenting two examples: first a fairly simple one, and then a more advanced case.

The common methods and their shortcomings
Developing SQL Server code can be challenging. You have so many ways to achieve the same result that the challenge isn't coming up with working code, but picking the "best" working code from a bunch of alternatives. So what's the point of adding yet another technique, other than making an already tough choice even harder? The answer is that there are cases (admittedly, not many) where none of the existing options yield acceptable performance, and set-based iteration does.

Declarative (set-based) code
Declarative coding is, without any doubt, the most-used way to manipulate data in SQL Server. And for good reason, because in most cases it's the fastest possible code. The basic principle of declarative code is that you don't tell the computer how to process the data in order to create the required results, but instead declare the results you want and leave it to the DBMS to figure out how to get those results. Declarative code is also called set-based code because the declared required results aren't based on individual rows of data, but on the entire set of data. For example, if you need to find out which employees earn more than their manager, the declarative answer would involve one single query, specifying all the tables that hold the source data in its FROM clause, all the required output columns in its SELECT clause, and using a WHERE clause to filter out only those employees that meet the salary requirement.

BENEFITS
The main benefit of declarative coding is its raw performance. For one thing, SQL Server has been heavily optimized toward processing declarative code. But also, the query optimizer, the SQL Server component that selects how to process
each query, can use all the elements in your database (including indexes, constraints, and statistics on data distribution) to find the most efficient way to process your request, and even adapt the execution plan when indexes are added or statistics indicate a major change in data distribution. Another benefit is that declarative code is often much shorter and (once you get the hang of it) easier to read and maintain than iterative code. Shorter, easier-to-read code directly translates into a reduction of development cost, and an even larger reduction of future maintenance cost.

DRAWBACKS
Aside from the learning curve for people with a background in iterative coding, there's only one problem with the set-based approach. Because you have to declare the results in terms of the original input, you can't take shortcuts by specifying end results in terms of intermediate results. In some cases, this results in queries that are awkward to write and hard to read. In other cases, it may result in queries that force SQL Server to do more work than would otherwise be required.

Running totals is an example of this. There's no way to tell SQL Server to calculate the running total of each row as the total of the previous row plus the value of the current row, because the running total of the previous row isn't available in the input, and partial query results (even though SQL Server does know them) can't be specified in the language. The only way to calculate running totals in a set-based fashion is to specify each running total as the sum of the values in all preceding rows. That implies that a lot more summation is done than would be required if intermediate results were available. This results in performance that degrades exponentially with the amount of data, so even if you have no problems in your test environment, you will have problems in your
100-million-row production database!

Running totals in the OVER clause
The full ANSI standard specification of the OVER clause includes windowing extensions that allow for simple specification of running totals. This would result in short queries with probably very good performance, if SQL Server had implemented them. Unfortunately, these extensions aren't available in any current version of SQL Server, so we still have to code the running totals ourselves.

Iterative (cursor-based) code
The basic principle of iterative coding is to write T-SQL as if it were just another third-generation programming language, like C#, VB.NET, Cobol, and Pascal. In those languages, the only way to process a set of data (such as a sequentially organized file) is to iterate over the data, reading one record at a time, processing that record, and then moving to the next record until the end of the file has been reached. SQL Server has cursors as a built-in mechanism for this iteration, hence the term cursor-based code as an alternative to the more generic iterative code.

Most iterative code encountered "in the wild" is written for one of two reasons: either because the developer was used to this way of coding and didn't know how (or why!)
to write set-based code instead, or because the developer was unable to find a good-performing set-based approach and had to fall back to iterative code to get acceptable performance.

BENEFITS
A perceived benefit of iterative code might be that developers with a background in third-generation languages can start coding right away, instead of having to learn a radically different way to do their work. But that argument would be like someone from the last century suggesting that we hitch horses to our cars, so that drivers don't have to learn how to start the engine and operate the steering wheel. Iterative code also has a real benefit, but only in a few cases. Because the coder has to specify each step SQL Server has to take to get to the end result, it's easy to store an intermediate result and reuse it later. In some cases (such as the running totals already mentioned), this can result in faster-running code.

DRAWBACKS
By writing iterative code, you're crippling SQL Server's performance in two ways at the same time. You not only work around all the optimizations SQL Server has for fast set-based processing, you also effectively prevent the query optimizer from coming up with a faster way to achieve the same results. Tell SQL Server to read employees, and

Solutions to gaps problem

Table Desired result for islands problem

start_range  end_range
11           13
31           31
33           35
42           42

Run the following code to create the BigNumSeq table and populate it with sample data. Note that it may take a few minutes for this code to finish.

Listing Code creating and populating the BigNumSeq table

-- dbo.BigNumSeq (big numeric sequence with unique values, interval: 1)
IF OBJECT_ID('dbo.BigNumSeq', 'U') IS NOT NULL
  DROP TABLE dbo.BigNumSeq;

CREATE TABLE dbo.BigNumSeq
(
  seqval INT NOT NULL
    CONSTRAINT PK_BigNumSeq PRIMARY KEY
);

-- Populate table with values in the range 1 through 10,000,000,
-- with a gap every 1000 (total 10,000 gaps)
WITH
L0 AS(SELECT 1 AS c
UNION ALL SELECT 1),
L1 AS(SELECT 1 AS c FROM L0 AS A, L0 AS B),
L2 AS(SELECT 1 AS c FROM L1 AS A, L1 AS B),
L3 AS(SELECT 1 AS c FROM L2 AS A, L2 AS B),
L4 AS(SELECT 1 AS c FROM L3 AS A, L3 AS B),
L5 AS(SELECT 1 AS c FROM L4 AS A, L4 AS B),
Nums AS(SELECT ROW_NUMBER() OVER(ORDER BY (SELECT 0)) AS n FROM L5)
INSERT INTO dbo.BigNumSeq WITH(TABLOCK) (seqval)
  SELECT n
  FROM Nums
  WHERE n <= 10000000
    AND n % 1000 <> 0;

Listing Gaps—solution using subqueries

SELECT seqval + 1 AS start_range,
  (SELECT MIN(B.seqval)
   FROM dbo.NumSeq AS B
   WHERE B.seqval > A.seqval) - 1 AS end_range
FROM dbo.NumSeq AS A
WHERE NOT EXISTS
  (SELECT *
   FROM dbo.NumSeq AS B
   WHERE B.seqval = A.seqval + 1)
  AND seqval < (SELECT MAX(seqval) FROM dbo.NumSeq);

This solution is based on subqueries. In order to understand it, you should first focus on the filtering activity in the WHERE clause, and then proceed to the activity in the SELECT list. The purpose of the NOT EXISTS predicate in the WHERE clause is to filter only points that are a point before a gap. You can identify a point before a gap when you see that for such a point, the value plus 1 doesn't exist in the sequence. The purpose of the second predicate in the WHERE clause is to filter out the maximum value from the sequence, because it represents the point before infinity, which does not concern us. That filter left only points that are a point before a gap. What remains is to relate to each such point the next point that exists in the sequence. A subquery in the SELECT list is used to return for each point the minimum value that is greater than the current point. This is one way to implement the concept of next in SQL. Each pair, made of a point before a gap and the next point that exists in the sequence, represents a pair of values bounding a gap. To get the start and end points of the gap, add 1 to the point before the gap, and subtract 1 from the point after the gap.

This solution performs well, and I must say that I'm a bit surprised by the efficiency of the execution plan for this query. Against the BigNumSeq table, this solution ran for only a few seconds on my system, and incurred 62,262 logical reads. To filter points before gaps,
I expected that the optimizer would apply an index seek per each of the 10,000,000 rows, which could have meant over 30 million random reads. Instead, it performed two ordered scans of the index, costing a bit over 30,000 sequential reads, and applying a merge join between the two inputs for this purpose. For each of the 10,000 rows that remain (for the 10,000 points before gaps), an index seek is used to return the next existing point. Each such seek costs 3 random reads for the three levels in the index b-tree, amounting in total to about 30,000 random reads. The total number of logical reads is 62,262, as mentioned earlier. If the number of gaps is fairly small, as in our case, this solution performs well.

To apply the solution to a temporal sequence, instead of using +1 or -1 to add or subtract the interval from the integer sequence value, use the DATEADD function to add or subtract the applicable temporal interval from the temporal sequence.

To apply the solution to a sequence with duplicates, you have several options. One option is to substitute the reference to the table that is aliased as A with a derived table based on a query that removes duplicates: (SELECT DISTINCT seqval FROM dbo.TempSeq) AS A. Another option is to add a DISTINCT clause to the SELECT list.

Gaps—solution using subqueries
The following listing shows the second solution for the gaps problem.

Listing Gaps—solution using subqueries

SELECT cur + 1 AS start_range, nxt - 1 AS end_range
FROM (SELECT seqval AS cur,
        (SELECT MIN(B.seqval)
         FROM dbo.NumSeq AS B
         WHERE B.seqval > A.seqval) AS nxt
      FROM dbo.NumSeq AS A) AS D
WHERE nxt - cur > 1;

As with solution 1, this solution is also based on subqueries. The logic of this solution is straightforward. The query that defines the derived table D uses a subquery in the SELECT list to produce current-next pairs. That is, for each current value, the subquery
returns the minimum value that is greater than the current. The current value is aliased as cur, and the next value is aliased as nxt. The outer query filters the pairs in which the difference is greater than 1, because those pairs bound gaps. By adding 1 to cur and subtracting 1 from nxt, you get the gap starting and ending points. Note that the maximum value in the table (the point before infinity) will get a NULL back from the subquery, the difference between cur and nxt will yield a NULL, the predicate NULL > 1 will yield UNKNOWN, and the row will be filtered out. That's the behavior we want for the point before infinity. It is important to always think about three-valued logic and ensure that you get the desired behavior; and if you don't, you need to add logic to get the behavior you are after.

The performance measures for this solution are not as good as those of solution 1. The plan for this query shows that the index is fully scanned to retrieve the current values, amounting to about 16,000 sequential reads. Per each of the 10,000,000 current values, an index seek operation is used to return the next value to produce the current-next pairs. Those seeks are the main contributor to the cost of this plan, as they amount to about 30,000,000 random reads. This query ran for about 48 seconds on my system, and incurred 31,875,478 logical reads.

To apply the solution to a temporal sequence, use the DATEADD function instead of using +1 and -1, and the DATEDIFF function to calculate the difference between cur and nxt.

To apply the solution to a sequence with duplicates, use guidelines similar to those in the previous solution.

Gaps—solution using ranking functions
The following listing shows the third solution to the gaps problem.

Listing Gaps—solution using ranking functions

WITH C AS
(
  SELECT seqval, ROW_NUMBER() OVER(ORDER BY seqval) AS rownum
  FROM dbo.NumSeq
)
SELECT Cur.seqval + 1 AS
start_range, Nxt.seqval - 1 AS end_range
FROM C AS Cur
  JOIN C AS Nxt
    ON Nxt.rownum = Cur.rownum + 1
WHERE Nxt.seqval - Cur.seqval > 1;

This solution implements the same logic as the previous solution, but it uses a different technique to produce current-next pairs. This solution defines a common table expression (CTE) called C that calculates row numbers to position sequence values. The outer query joins two instances of the CTE, one representing current values, and the other representing next values. The join condition matches current-next values based on an offset of 1 between their row numbers. The filter in the outer query then keeps only those pairs with a difference greater than 1.

The plan for this solution shows that the optimizer does two ordered scans of the index to produce row numbers for current and next values. The optimizer then uses a merge join to match current-next values. The two ordered scans of the index are not expensive compared to the seek operations done in the previous solution. However, the merge join appears to be many-to-many and is a bit expensive. This solution ran for 24 seconds on my system, and incurred 32,246 logical reads.

To apply the solution to a temporal sequence, use the DATEADD function instead of using +1 and -1, and the DATEDIFF function to calculate the difference between Nxt.seqval and Cur.seqval.

To apply the solution to a sequence with duplicates, use the DENSE_RANK function instead of ROW_NUMBER, and add DISTINCT to the SELECT clause of the inner query.

Gaps—solution using cursors
The following listing shows the fourth solution to the gaps problem.

Listing Gaps—solution using cursors

SET NOCOUNT ON;
DECLARE @seqval AS INT, @prvseqval AS INT;
DECLARE @Gaps TABLE(start_range INT, end_range INT);

DECLARE C CURSOR FAST_FORWARD FOR
  SELECT seqval FROM dbo.NumSeq ORDER BY seqval;
OPEN C;
FETCH NEXT FROM C INTO @prvseqval;
IF
@@FETCH_STATUS = FETCH NEXT FROM C INTO @seqval; WHILE @@FETCH_STATUS = BEGIN IF @seqval - @prvseqval > INSERT INTO @Gaps(start_range, end_range) VALUES(@prvseqval + 1, @seqval - 1); SET @prvseqval = @seqval; FETCH NEXT FROM C INTO @seqval; END CLOSE C; DEALLOCATE C; SELECT start_range, end_range FROM @Gaps; This solution is based on cursors, which represent the ordered sequence values The code fetches the ordered sequence values from the cursor one at a time, and identifies a gap whenever the difference between the previous and current values is greater than one When a gap is found, the gap information is stored in a table variable When finished with the cursor, the code queries the table variable to return the gaps This solution is a good example of the overhead involved with using cursors The I/O work involved here is a single ordered scan of the index, amounting to 16,123 logical reads However, there’s overhead involved with each record manipulation that is multiplied by the 10,000,000 records involved This code ran for 250 seconds on my system and is the slowest of the solutions I tested for the gaps problem To apply the solution to a temporal sequence, use the DATEADD function instead of using +1 and -1 and the DATEDIFF function to calculate the difference between @seqval and @prvseqval Nothing must be added to handle a sequence with duplicates Performance summary for gaps solutions Table shows a summary of the performance measures I got for the four solutions presented Table Performance summary of solutions to gaps problem Solution Runtime in seconds Logical reads Solution 1—using subqueries 62,262 Solution 2—using subqueries 48 31,875,478 Solution 3—using ranking functions 24 32,246 250 16,123 Solution 4—using cursors Please purchase PDF Split-Merge on www.verypdf.com to remove this watermark Licensed to Kerri Ross 66 CHAPTER Gaps and islands As you can see, the first solution using subqueries is by far the fastest, whereas the fourth solution using cursors 
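Outside T-SQL, the current/next-pair logic shared by these gaps solutions reduces to a few lines. The following Python sketch (the helper name and sample data are illustrative, not from the chapter) shows the same rule: a difference greater than 1 between adjacent values bounds a gap.

```python
def find_gaps(seq):
    """Return (start, end) ranges missing from a sorted integer sequence."""
    gaps = []
    for cur, nxt in zip(seq, seq[1:]):  # pair each value with its successor
        if nxt - cur > 1:               # a difference > 1 bounds a gap
            gaps.append((cur + 1, nxt - 1))
    return gaps

print(find_gaps([2, 3, 11, 12, 13, 27, 33, 34, 35, 42]))
# → [(4, 10), (14, 26), (28, 32), (36, 41)]
```

As in the SQL versions, the maximum value pairs with nothing and therefore never opens a gap; here that falls out of `zip` simply running out of successors.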
Solutions to islands problem
I'll present four solutions to the islands problem: using subqueries and ranking calculations; using a group identifier based on subqueries; using a group identifier based on ranking calculations; and using cursors. I'll also present a variation on the islands problem, and then conclude this section with a performance summary.

Islands—solution using subqueries and ranking calculations
Listing 7 shows the first solution to the islands problem.

Listing 7  Islands—solution using subqueries and ranking calculations

WITH StartingPoints AS
(
  SELECT seqval, ROW_NUMBER() OVER(ORDER BY seqval) AS rownum
  FROM dbo.NumSeq AS A
  WHERE NOT EXISTS
    (SELECT * FROM dbo.NumSeq AS B
     WHERE B.seqval = A.seqval - 1)
),
EndingPoints AS
(
  SELECT seqval, ROW_NUMBER() OVER(ORDER BY seqval) AS rownum
  FROM dbo.NumSeq AS A
  WHERE NOT EXISTS
    (SELECT * FROM dbo.NumSeq AS B
     WHERE B.seqval = A.seqval + 1)
)
SELECT S.seqval AS start_range, E.seqval AS end_range
FROM StartingPoints AS S
  JOIN EndingPoints AS E
    ON E.rownum = S.rownum;

This solution defines two CTEs—one called StartingPoints, representing starting points of islands, and one called EndingPoints, representing ending points of islands. A point is identified as a starting point if the value minus 1 doesn't exist in the sequence. A point is identified as an ending point if the value plus 1 doesn't exist in the sequence. Each CTE also assigns row numbers to position the starting/ending points. The outer query joins the CTEs by matching starting and ending points based on equality between their row numbers.

This solution is straightforward, and also has reasonable performance when the sequence has a fairly small number of islands. The plan for this solution shows that the index is scanned four times in order—two ordered scans and a merge join are used to identify starting points and calculate their row numbers, and similar activity is used to identify ending points. A merge join is then used to match starting and ending points. This query ran for 17 seconds against the BigNumSeq table, and incurred 64,492 logical reads.

To apply the solution to a temporal sequence, use DATEADD to add or subtract the appropriate interval instead of using +1 and -1. As for a sequence with duplicates, the existing solution works as is; no changes are needed.

Islands—solution using group identifier based on subqueries
Listing 8 shows the second solution to the islands problem.

Listing 8  Islands—solution using group identifier based on subqueries

SELECT MIN(seqval) AS start_range, MAX(seqval) AS end_range
FROM (SELECT seqval,
        (SELECT MIN(B.seqval)
         FROM dbo.NumSeq AS B
         WHERE B.seqval >= A.seqval
           AND NOT EXISTS
             (SELECT * FROM dbo.NumSeq AS C
              WHERE C.seqval = B.seqval + 1)) AS grp
      FROM dbo.NumSeq AS A) AS D
GROUP BY grp;

This solution uses subqueries to produce, for each point, a group identifier (grp). The group identifier is a value that uniquely identifies the island. Such a value must be the same for all members of the island, and different than the value produced for other islands. When a group identifier is assigned to each member of the island, you can group the data by this identifier, and return for each group the minimum and maximum values.

The logic used to calculate the group identifier for each current point is this: return the next point (inclusive) that is also a point before a gap. Try to apply this logic and figure out the group identifier that will be produced for each point in the NumSeq table. For the values 2 and 3, the grp value produced will be 3, because for both values the next point (inclusive) before a gap is 3. For the values 11, 12, and 13, the next point before a gap is 13, and so on.

Recall the techniques used previously to identify the next point—the minimum value that is greater than the current one. In our case, the next point should be inclusive; therefore you should use the >= operator instead of the > operator. To identify a point before a gap, use a NOT EXISTS predicate that ensures that the value plus 1 doesn't exist. Now combine the two techniques and you get the next point before a gap. The query defining the derived table D does all the work of producing the group identifiers. The outer query against D is then left to group the data by grp, and return for each island the first and last sequence values.

If you think that the logic of this solution is complex, I'm afraid its performance will not comfort you. The plan for this query is horrible—it scans the 10,000,000 sequence values, and for each of those values it does expensive work that involves a merge join between two inputs that identifies the next point before a gap. I aborted the execution of this query after letting it run for about 10 minutes.

Islands—solution using group identifier based on ranking calculations
Listing 9 shows the third solution.

Listing 9  Islands—solution using group identifier based on ranking calculations

SELECT MIN(seqval) AS start_range, MAX(seqval) AS end_range
FROM (SELECT seqval,
        seqval - ROW_NUMBER() OVER(ORDER BY seqval) AS grp
      FROM dbo.NumSeq) AS D
GROUP BY grp;

This solution is similar to solution 2 in the sense that it also calculates a group identifier that uniquely identifies the island; however, this solution is dramatically simpler and more efficient. The group identifier is calculated as the difference between the sequence value and a row number that represents the position of the sequence value. Within an island, both the sequence values and the row numbers keep incrementing by the fixed interval of 1. Therefore, the difference between the two is constant. When moving on to the next island, the sequence value increases by more than 1, while the row number increases by 1. Therefore, the difference becomes higher than the one for the previous island.
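The difference just described can be sketched outside T-SQL as well. This Python fragment (illustrative only; the function name and sample data are mine) computes the value minus its row position and groups by that difference, mirroring the grouping in the listing above.

```python
from itertools import groupby

def find_islands(seq):
    """Group a sorted integer sequence into islands using value - position."""
    islands = []
    # enumerate() supplies the "row number"; within an island, value and
    # position both step by 1, so their difference is constant.
    for _, grp in groupby(enumerate(seq), key=lambda p: p[1] - p[0]):
        vals = [v for _, v in grp]
        islands.append((vals[0], vals[-1]))
    return islands

print(find_islands([2, 3, 11, 12, 13, 27, 33, 34, 35, 42]))
# → [(2, 3), (11, 13), (27, 27), (33, 35), (42, 42)]
```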
Each island will have a higher difference than the previous island. Note that the last two characteristics I described of this difference (constant within an island, and different for different islands) are the exact requirements we had for the group identifier. Hence, you can use this difference as the group identifier. As in solution 2, after the group identifier is calculated, you group the data by this identifier, and for each group, return the minimum and maximum values.

For performance, this solution is efficient because it requires a single ordered scan of the index to calculate the row numbers, and an aggregate operator for the grouping activity. It ran for 10 seconds against the BigNumSeq table on my system and incurred 16,123 logical reads.

To apply this solution to a temporal sequence, subtract from the sequence value as many temporal intervals as the row number representing the position of the sequence value. The logic is similar to the integer sequence, but instead of getting a unique integer per island, you will get a unique time stamp for each island. For example, if the interval of the temporal sequence is 4 hours, substitute the expression seqval - ROW_NUMBER() OVER(ORDER BY seqval) AS grp in listing 9 with the expression DATEADD(hour, -4 * ROW_NUMBER() OVER(ORDER BY seqval), seqval) AS grp.

For a numeric sequence with duplicates, the trick is to have the same rank for all duplicate occurrences. This can be achieved by using the DENSE_RANK function instead of ROW_NUMBER. Simply substitute the expression seqval - ROW_NUMBER() OVER(ORDER BY seqval) AS grp in listing 9 with the expression seqval - DENSE_RANK() OVER(ORDER BY seqval) AS grp.

Islands—solution using cursors
For kicks, and for the sake of completeness, listing 10 shows the fourth solution, which is based on cursors.

Listing 10  Islands—solution using cursors

SET NOCOUNT ON;
DECLARE @seqval AS INT, @prvseqval AS INT, @first AS INT;
DECLARE @Islands TABLE(start_range INT, end_range INT);

DECLARE C CURSOR FAST_FORWARD FOR
  SELECT seqval FROM dbo.NumSeq ORDER BY seqval;
OPEN C;
FETCH NEXT FROM C INTO @seqval;
SET @first = @seqval;
SET @prvseqval = @seqval;
WHILE @@FETCH_STATUS = 0
BEGIN
  IF @seqval - @prvseqval > 1
  BEGIN
    INSERT INTO @Islands(start_range, end_range)
      VALUES(@first, @prvseqval);
    SET @first = @seqval;
  END
  SET @prvseqval = @seqval;
  FETCH NEXT FROM C INTO @seqval;
END
IF @first IS NOT NULL
  INSERT INTO @Islands(start_range, end_range)
    VALUES(@first, @prvseqval);
CLOSE C;
DEALLOCATE C;

SELECT start_range, end_range FROM @Islands;

The logic of this solution is to scan the sequence values in order, and at each point, check whether the difference between the previous value and the current one is greater than 1. If it is, you know that the last island closed with the previous value, and a new one just started with the current value. As expected, the performance of this solution is not good—it ran for 217 seconds on my system.

Variation on the islands problem
Before I present a performance summary for the different solutions, I want to show a variation on the islands problem, and a solution based on the fast technique with the ranking calculation. I'll use a table called T1 with new sample data to discuss this problem. Run the code in listing 11 to create the table T1 and populate it with sample data.

Listing 11  Code creating and populating table T1

SET NOCOUNT ON;
USE tempdb;
IF OBJECT_ID('dbo.T1') IS NOT NULL
  DROP TABLE dbo.T1;

CREATE TABLE dbo.T1
(
  id  INT         NOT NULL PRIMARY KEY,
  val VARCHAR(10) NOT NULL
);
GO

INSERT INTO dbo.T1(id, val) VALUES(2, 'a');
INSERT INTO dbo.T1(id, val) VALUES(3, 'a');
INSERT INTO dbo.T1(id, val) VALUES(5, 'a');
INSERT INTO dbo.T1(id, val) VALUES(7, 'b');
INSERT INTO dbo.T1(id, val) VALUES(11, 'b');
INSERT INTO dbo.T1(id, val) VALUES(13, 'a');
INSERT INTO dbo.T1(id, val) VALUES(17, 'a');
INSERT INTO dbo.T1(id, val) VALUES(19, 'a');
INSERT INTO dbo.T1(id, val) VALUES(23, 'c');
INSERT INTO dbo.T1(id, val) VALUES(29, 'c');
INSERT INTO dbo.T1(id, val) VALUES(31, 'a');
INSERT INTO dbo.T1(id, val) VALUES(37, 'a');
INSERT INTO dbo.T1(id, val) VALUES(41, 'a');
INSERT INTO dbo.T1(id, val) VALUES(43, 'a');
INSERT INTO dbo.T1(id, val) VALUES(47, 'c');
INSERT INTO dbo.T1(id, val) VALUES(53, 'c');
INSERT INTO dbo.T1(id, val) VALUES(59, 'c');

This variation involves two attributes: one represents a sequence (id in our case), and the other represents a kind of status (val in our case). The task at hand is to identify the starting and ending sequence point (id) of each consecutive status (val) segment. The following table shows the desired output. This is a variation on the islands problem.

Desired result for variation on the islands problem

mn   mx   val
2    5    a
7    11   b
13   19   a
23   29   c
31   43   a
47   59   c

Listing 12 shows the solution to this problem.

Listing 12  Solution to variation on the islands problem

WITH C AS
(
  SELECT id, val,
    ROW_NUMBER() OVER(ORDER BY id)
      - ROW_NUMBER() OVER(ORDER BY val, id) AS grp
  FROM dbo.T1
)
SELECT MIN(id) AS mn, MAX(id) AS mx, val
FROM C
GROUP BY val, grp
ORDER BY mn;

The query defining the CTE C calculates the difference between a row number representing id ordering and a row number representing val, id ordering, and calls that difference grp. Within each consecutive status segment, the difference will be constant, and smaller than the difference that will be produced for the next consecutive status segment. In short, the combination of status (val) and that difference uniquely identifies a consecutive status segment. What's left for the outer query is to group the data by val and grp, and return for each group the status (val), and the minimum and maximum ids.
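In procedural terms, the variation reduces to grouping consecutive rows that share a status, which is what the difference of the two row numbers achieves declaratively. A Python sketch (function name and framing are mine, not from the chapter) produces the desired result from the same sample data:

```python
from itertools import groupby

def status_segments(rows):
    """rows: (id, val) pairs ordered by id.
    Return (mn, mx, val) for each consecutive run of the same val."""
    segments = []
    # groupby only merges adjacent equal keys, so each run of equal
    # statuses becomes one segment, exactly like the grp grouping.
    for val, grp in groupby(rows, key=lambda r: r[1]):
        ids = [i for i, _ in grp]
        segments.append((ids[0], ids[-1], val))
    return segments

rows = [(2,'a'),(3,'a'),(5,'a'),(7,'b'),(11,'b'),(13,'a'),(17,'a'),(19,'a'),
        (23,'c'),(29,'c'),(31,'a'),(37,'a'),(41,'a'),(43,'a'),
        (47,'c'),(53,'c'),(59,'c')]
print(status_segments(rows))
# → [(2, 5, 'a'), (7, 11, 'b'), (13, 19, 'a'), (23, 29, 'c'), (31, 43, 'a'), (47, 59, 'c')]
```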
Performance summary for islands solutions
The following table shows a summary of the performance measures I got for the four solutions presented.

Performance summary of solutions to the islands problem

Solution                                                          Runtime in seconds         Logical reads
Solution 1—using subqueries and ranking calculations              17                         64,492
Solution 2—using group identifier based on subqueries             Aborted after 10 minutes
Solution 3—using group identifier based on ranking calculations   10                         16,123
Solution 4—using cursors                                          217                        16,123

As you can see, solution 3, which uses the row number function to calculate a group identifier, is the fastest. Solution 4, using a cursor, is slow, but not the slowest. Solution 2, which uses subqueries to calculate a group identifier, is the slowest.

Summary
In this chapter I explained gaps and islands problems and provided different solutions to those problems. I compared the performance of the different solutions, and as you could see, the performance of the solutions varied widely. The common theme was that the cursors performed badly, and the solutions that were based on ranking calculations performed either reasonably or very well. Some of the solutions based on subqueries performed well, whereas some did not.

One of my goals in this chapter was to cover the specifics of the gaps and islands problems and identify good performing solutions. However, I also had an additional goal. It is common to find that for any given querying problem there are several possible solutions that will vary in complexity and performance. I wanted to emphasize the importance of producing multiple solutions, and analyzing and comparing their logic and performance.

About the author
Itzik Ben-Gan is a Mentor and Co-Founder of Solid Quality Mentors. A SQL Server Microsoft MVP (Most Valuable Professional) since 1999, Itzik has delivered numerous training events around the world focused on T-SQL Querying, Query Tuning, and Programming. Itzik is the author of several books, including Microsoft SQL Server 2008: T-SQL Fundamentals, Inside Microsoft SQL Server 2008: T-SQL Querying, and Inside Microsoft SQL Server 2008: T-SQL Programming. He has written many articles for SQL Server Magazine as well as articles and white papers for MSDN. Itzik's speaking activities include TechEd, DevWeek, SQLPASS, SQL Server Magazine Connections, various user groups around the world, and Solid Quality Mentors' events, to name a few. Itzik is the author of Solid Quality Mentors' Advanced T-SQL Querying, Programming and Tuning and T-SQL Fundamentals courses, along with being a primary resource within the company for their T-SQL related activities.

Error handling in SQL Server and applications
Bill Graziano

Prior to SQL Server 2005, error handling was limited to testing @@ERROR after each statement. This required users to write lots of similar error handling code. The introduction of TRY...CATCH blocks in Transact-SQL (T-SQL) gives the developer more options to detect and handle errors. It is now possible to completely consume errors on the server. Or, if you'd like, you can return specific errors back to the client.

The .NET development languages also have a series of classes for SQL Server error handling. These can be used to process errors and capture informational messages from SQL Server. The combination of TRY...CATCH on the server and the SQL Server–specific error handling on the client gives developers many options to capture errors and handle them appropriately.

Handling errors inside SQL Server
Consider the T-SQL statements in listing 1, which generate an error.

Listing 1  Error sent to SQL Server Management Studio

SELECT [First] = 1
SELECT [Second] = 1/0
SELECT [Third] = 3

This returns the following result to SQL Server Management Studio:

First
-----------
1

Second
-----------
Msg 8134, Level 16, State 1, Line 3
Divide by zero error encountered.

Third
-----------
3

The error message from the second SELECT statement is displayed, and then the third SELECT statement executes. This error would've been returned to a client application. Now consider the same three SELECT statements inside a TRY...CATCH block, as shown in listing 2.

Listing 2  T-SQL statements in a TRY...CATCH block

BEGIN TRY
  SELECT [First] = 1
  SELECT [Second] = 1/0
  SELECT [Third] = 3
END TRY
BEGIN CATCH
  PRINT 'An error occurred'
END CATCH

This will produce the following result:

First
-----------
1

Second
-----------

An error occurred

The T-SQL statements inside the TRY block will execute until an error is encountered, and then execution will transfer to the code inside the CATCH block. The first SELECT statement executes properly. The second statement generates an error. You can see that it returned the column header but not any result set. The third SELECT statement didn't execute because control had been passed to the CATCH block. Inside the CATCH block, the PRINT statement is executed.

The use of a TRY...CATCH block also prevents the error from being returned to a client. In effect, the CATCH block consumed the error, and no error was returned to a client application. A TRY...CATCH block has some limitations:

- A TRY...CATCH block can't span batches.
- Severity levels of 20 or higher will not be caught because the connection is closed. We will cover severity levels in more detail shortly.
- Compile errors will not be caught because the batch never executes.
- Statement recompilation errors will not be caught.

Returning information about the error
Each error that's captured in a CATCH block has properties that we can examine using system-provided functions. These functions only return values inside a CATCH block. If they're called outside a CATCH block, they return NULL. A stored procedure called from within a CATCH block will have access to the functions. These functions are illustrated in listing 3.

Listing 3  Outputting error properties with system-provided functions

BEGIN TRY
  SELECT [Second] = 1/0
END TRY
BEGIN CATCH
  SELECT
    [Error_Line] = ERROR_LINE(),
    [Error_Number] = ERROR_NUMBER(),
    [Error_Severity] = ERROR_SEVERITY(),
    [Error_State] = ERROR_STATE()
  SELECT [Error_Message] = ERROR_MESSAGE()
END CATCH

This returns the following result:

Second
-----------

Error_Line   Error_Number   Error_Severity   Error_State
----------   ------------   --------------   -----------
2            8134           16               1

Error_Message
--------------------------------------
Divide by zero error encountered.

NOTE  Any line number in this chapter is dependent on the amount of whitespace in the batch. Your results may vary.

The ERROR_NUMBER() function returns the error number that caused the code to jump into this CATCH block. This value will not change for the duration of the CATCH block. You can see a list of system messages and errors in sys.messages. The ERROR_MESSAGE() function returns the text of the error that caused the code to jump into the CATCH block. For system errors this value can also be found in sys.messages. The ERROR_MESSAGE() function will return the message with any parameters expanded.

You can return the severity of the error using the ERROR_SEVERITY() function. TRY...CATCH blocks behave differently depending on the severity of the error. Errors (or messages) with a severity of 10 or less don't cause the CATCH block to fire. These are typically informational messages. Error severities of 11 or higher will cause the CATCH block to fire unless the error terminates the connection. Error severities from 11 to 16 are typically user or code errors. Severity levels from 17 to 25 usually indicate a software or hardware problem, where processing may be unable to continue. Books Online has a detailed description of each severity under the heading "Database Engine Error Severities."
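Client languages surface the same details through exception objects. As a loose cross-language analogy only (Python shown; the function and dictionary keys are mine, not a SQL Server API), the handler sees the error details much as the ERROR_* functions expose them inside a CATCH block:

```python
def run():
    try:
        return 1 / 0
    except ZeroDivisionError as e:
        # The exception object plays the role of the ERROR_* functions:
        # details are available only inside the handler.
        return {"error_message": str(e), "error_type": type(e).__name__}

print(run())
# → {'error_message': 'division by zero', 'error_type': 'ZeroDivisionError'}
```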
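The severity bands just described can be summarized as a small lookup. The sketch below is purely a mnemonic in Python; the function name and bucket labels are my own, not part of SQL Server.

```python
def classify(severity: int) -> str:
    """Summarize the TRY...CATCH severity bands described above."""
    if severity <= 10:
        return "informational"               # does not fire the CATCH block
    if severity <= 16:
        return "user or code error"          # caught by TRY...CATCH
    if severity <= 25:
        return "software/hardware problem"   # 20+ closes the connection, so not caught
    return "out of documented range"

for s in (5, 14, 18, 21):
    print(s, classify(s))
```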
