Nielsen c13.tex V4 - 07/21/2009 12:48pm Page 322
Part II Manipulating Data With Select

    4          43877         2001-08-01 00:00:00.000
    5          43894         2001-08-01 00:00:00.000
    6          43895         2001-08-01 00:00:00.000
    7          43911         2001-08-01 00:00:00.000
    1          44109         2001-09-01 00:00:00.000
    1          44285         2001-10-01 00:00:00.000
    1          44483         2001-11-01 00:00:00.000
    2          44501         2001-11-01 00:00:00.000

As expected, the windowed sort (in this case, the RowNumber column) restarts with every new month.

Ranking Functions

The windowing capability (the OVER() clause) by itself doesn’t create any query output columns; that’s where the ranking functions come into play:

■ row_number
■ rank
■ dense_rank
■ ntile

Just to be explicit, the ranking functions all require the OVER() clause. All the normal aggregate functions (SUM(), MIN(), MAX(), COUNT(*), and so on) can also be used with the OVER() clause as windowed functions.

Row_number() function

The ROW_NUMBER() function generates an on-the-fly auto-incrementing integer according to the sort order of the OVER() clause. It’s similar to Oracle’s ROWNUM pseudocolumn.

The row number function simply numbers the rows in the query result; there’s absolutely no correlation with any physical address or absolute row number. This is important because in a relational database, row position, number, and order have no meaning. It also means that as rows are added to or deleted from the underlying data source, the row numbers for the query results will change. In addition, if there are sets of rows with the same values in all ordering columns, then their order is undefined, so their row numbers may change between two executions even if the underlying data does not change.

One common practical use of the ROW_NUMBER() function is to filter by the row number values for pagination. For example, a query that easily produces rows 21–40 would be useful for returning the second page of data for a web page. Just be aware that the rows in each page may change between executions; typically, a paging query like this reads its data from a temp table.
It would seem that the natural way to build a row number pagination query would be to simply add the OVER() clause and ROW_NUMBER() function to the WHERE clause:

    SELECT ROW_NUMBER() OVER(ORDER BY OrderDate, SalesOrderID) as RowNumber, SalesOrderID
      FROM Sales.SalesOrderHeader
      WHERE SalesPersonID = 280
        AND ROW_NUMBER() OVER(ORDER BY OrderDate, SalesOrderID) Between 21 AND 40
      ORDER BY RowNumber;

Result:

    Msg 4108, Level 15, State 1, Line 4
    Windowed functions can only appear in the SELECT or ORDER BY clauses.

Because the WHERE clause occurs very early in the query processing (often in the query operation that actually reads the data from the data source) and the OVER() clause occurs late in the query processing, the WHERE clause doesn’t yet know about the windowed sort of the data or the ranking function. The WHERE clause can’t possibly filter by the generated row number.

There is a simple solution: Embed the windowing and ranking functionality in a subquery or common table expression:

    SELECT RowNumber, SalesOrderID, OrderDate, SalesOrderNumber
      FROM (
        SELECT ROW_NUMBER() OVER(ORDER BY OrderDate, SalesOrderID) as RowNumber, *
          FROM Sales.SalesOrderHeader
          WHERE SalesPersonID = 280
      ) AS Q
      WHERE RowNumber BETWEEN 21 AND 40
      ORDER BY RowNumber;

Result:

    RowNumber  SalesOrderID  OrderDate                SalesOrderNumber
    21         45041         2002-01-01 00:00:00.000  SO45041
    22         45042         2002-01-01 00:00:00.000  SO45042
    23         45267         2002-02-01 00:00:00.000  SO45267
    24         45283         2002-02-01 00:00:00.000  SO45283
    25         45295         2002-02-01 00:00:00.000  SO45295
    26         45296         2002-02-01 00:00:00.000  SO45296
    27         45303         2002-02-01 00:00:00.000  SO45303
    28         45318         2002-02-01 00:00:00.000  SO45318
    29         45320         2002-02-01 00:00:00.000  SO45320
    30         45338         2002-02-01 00:00:00.000  SO45338
    31         45549         2002-03-01 00:00:00.000  SO45549
    32         45783         2002-04-01 00:00:00.000  SO45783
    33         46025         2002-05-01 00:00:00.000  SO46025
    34         46042         2002-05-01 00:00:00.000  SO46042
    35         46052         2002-05-01 00:00:00.000  SO46052
    36         46053         2002-05-01 00:00:00.000  SO46053
    37         46060         2002-05-01 00:00:00.000  SO46060
    38         46077         2002-05-01 00:00:00.000  SO46077
    39         46080         2002-05-01 00:00:00.000  SO46080
    40         46092         2002-05-01 00:00:00.000  SO46092

The second query in this chapter, in the “Partitioning within the Window” section, showed how grouping the sort order of the window generated row numbers that started over with every new partition.

Rank() and dense_rank() functions

The RANK() and DENSE_RANK() functions return values as if the rows were competing according to the windowed sort order. Any ties are grouped together with the same ranked value. For example, if Frank and Jim both tied for third place, then they would both receive a RANK() value of 3.

Using sales data from AdventureWorks2008, there are ties for least sold products, which makes it a good table to play with RANK() and DENSE_RANK().
ProductIDs 943 and 911 tie for third place, and ProductIDs 927 and 898 tie for fourth or fifth place, depending on how ties are counted:

    -- Least Sold Products:
    SELECT ProductID, COUNT(*) as [count]
      FROM Sales.SalesOrderDetail
      GROUP BY ProductID
      ORDER BY COUNT(*);

Result (abbreviated):

    ProductID  count
    897        2
    942        5
    943        6
    911        6
    927        9
    898        9
    744        13
    903        14

Examining the sales data using windowing and the RANK() function returns the ranking values:

    SELECT ProductID, SalesCount,
        RANK() OVER (ORDER BY SalesCount) as [Rank],
        DENSE_RANK() OVER (ORDER BY SalesCount) as [DenseRank]
      FROM (SELECT ProductID, COUNT(*) as SalesCount
              FROM Sales.SalesOrderDetail
              GROUP BY ProductID
           ) AS Q
      ORDER BY [Rank];

Result (abbreviated):

    ProductID  SalesCount  Rank  DenseRank
    897        2           1     1
    942        5           2     2
    943        6           3     3
    911        6           3     3
    927        9           5     4
    898        9           5     4
    744        13          7     5
    903        14          8     6

This example perfectly demonstrates the difference between RANK() and DENSE_RANK(). RANK() counts each tie as a ranked row. In this example, ProductIDs 943 and 911 both tie for third place but consume the third and fourth rows in the ranking, placing ProductID 927 in fifth place.

DENSE_RANK() handles ties differently. Tied rows only consume a single value in the ranking, so the next rank is the next place in the ranking order. No ranks are skipped. In the previous query, ProductID 927 is in fourth place using DENSE_RANK().

Just as with the ROW_NUMBER() function, RANK() and DENSE_RANK() can be used with a partitioned OVER() clause. The previous example could be partitioned by product category to rank product sales within each category.

Ntile() function

The fourth ranking function organizes the rows into n number of groups, called tiles, and returns the tile number. For example, if the result set has ten rows, then NTILE(5) would split the ten rows into five equally sized tiles with two rows in each tile in the order of the OVER() clause’s ORDER BY.
If the number of rows is not evenly divisible by the number of tiles, then the first tiles each get one extra row. For example, for 74 rows and 10 tiles, the first 4 tiles get 8 rows each, and tiles 5 through 10 get 7 rows each. This can skew the results for smaller data sets. For example, 15 rows into 10 tiles would place 10 rows in the lower five tiles and only five rows in the upper five tiles. But for larger data sets (splitting a few hundred rows into 100 tiles, for example) it works great.

This rule also applies if there are fewer rows than tiles. The rows are not spread across all tiles; instead, the tiles are filled until the rows are consumed. For example, if five rows are split using NTILE(10), the result set would not use tiles 1, 3, 5, 7, and 9, but instead show tiles 1, 2, 3, 4, and 5.

A common real-world example of NTILE() is the percentile scoring used in college entrance exams.

The following query first calculates the AdventureWorks2008 products’ sales count in the subquery. The outer query then uses the OVER() clause to sort by the sales count, and NTILE(100) to calculate the percentile according to the sales count:

    SELECT ProductID, SalesCount,
        NTILE(100) OVER (ORDER BY SalesCount) as Percentile
      FROM (SELECT ProductID, COUNT(*) as SalesCount
              FROM Sales.SalesOrderDetail
              GROUP BY ProductID
           ) AS Q
      ORDER BY Percentile DESC;

Result (abbreviated):

    ProductID  SalesCount  Percentile
    712        3382        100
    870        4688        100
    921        3095        99
    873        3354        99
    707        3083        98
    711        3090        98
    922        2376        97
    ...
    830        33          5
    888        39          5
    902        20          4
    950        28          4
    946        30          4
    744        13          3
    903        14          3
    919        16          3
    911        6           2
    927        9           2
    898        9           2
    897        2           1
    942        5           1
    943        6           1

Like the other three ranking functions, NTILE() can be used with a partitioned OVER() clause. Similar to the ranking example, the previous example could be partitioned by product category to generate percentiles within each category.
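The partitioning idea mentioned for RANK() and NTILE() can be sketched roughly as follows. This is only an illustration, not a query from the chapter: it assumes the AdventureWorks2008 tables Production.Product and Production.ProductSubcategory, partitions by subcategory rather than category, and uses NTILE(4) instead of NTILE(100) because a single subcategory holds only a handful of products. The aliases SubCat, Quartile, and RankInSubCat are made up for the example.

```sql
-- Sketch: quartile and rank of each product's sales count
-- within its own subcategory (assumes AdventureWorks2008).
SELECT SubCat, ProductID, SalesCount,
    -- tile and rank numbering restart for every SubCat partition
    NTILE(4) OVER (PARTITION BY SubCat ORDER BY SalesCount) AS Quartile,
    RANK()   OVER (PARTITION BY SubCat ORDER BY SalesCount) AS RankInSubCat
  FROM (SELECT PSC.[Name] AS SubCat, P.ProductID, COUNT(*) AS SalesCount
          FROM Sales.SalesOrderDetail AS SOD
            JOIN Production.Product AS P
              ON SOD.ProductID = P.ProductID
            JOIN Production.ProductSubcategory AS PSC
              ON P.ProductSubcategoryID = PSC.ProductSubcategoryID
          GROUP BY PSC.[Name], P.ProductID
       ) AS Q
  ORDER BY SubCat, Quartile, SalesCount;
```

Within each partition the uneven-split rule described earlier still applies: with R rows and N tiles, the first R modulo N tiles receive one extra row, so a subcategory with six products and NTILE(4) would yield tiles of 2, 2, 1, and 1 rows.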
Aggregate Functions

SQL query functions all fit together like a magnificent puzzle. A fine example is how windowing can use not only the four ranking functions (ROW_NUMBER(), RANK(), DENSE_RANK(), and NTILE()) but also the standard aggregate functions: COUNT(*), MIN(), MAX(), and so on, which were covered in the last chapter.

I won’t rehash the aggregate functions here, and usually the aggregate functions will fit well within a normal aggregate query, but here’s an example of using the SUM() aggregate function in a window to calculate the total sales order count for each product subcategory, and then, using that result from the window, calculate the percentage of sales orders for each product within its subcategory:

    SELECT ProductID, Product, SalesCount,
        NTILE(100) OVER (ORDER BY SalesCount) as Percentile,
        SubCat,
        CAST(CAST(SalesCount AS NUMERIC(9,2))
             / SUM(SalesCount) OVER(PARTITION BY SubCat) * 100
             AS NUMERIC(4,1)) AS PercOfSubCat
      FROM (SELECT P.ProductID, P.[Name] AS Product, PSC.NAME AS SubCat,
                   COUNT(*) as SalesCount
              FROM Sales.SalesOrderDetail AS SOD
                JOIN Production.Product AS P
                  ON SOD.ProductID = P.ProductID
                JOIN Production.ProductSubcategory PSC
                  ON P.ProductSubcategoryID = PSC.ProductSubcategoryID
              GROUP BY PSC.NAME, P.[Name], P.ProductID
           ) Q
      ORDER BY Percentile DESC

Result (abbreviated):

    ProductID  Product                  SalesCount  Percentile  SubCat             PercOfSubCat
    870        Water Bottle - 30 oz.    4688        100         Bottles and Cages  55.6
    712        AWC Logo Cap             3382        100         Caps               100.0
    921        Mountain Tire Tube       3095        99          Tires and Tubes    17.7
    873        Patch Kit/8 Patches      3354        99          Tires and Tubes    19.2
    707        Sport-100 Helmet, Red    3083        98          Helmets            33.6
    711        Sport-100 Helmet, Blue   3090        98          Helmets            33.7
    708        Sport-100 Helmet, Black  3007        97          Helmets            32.8
    922        Road Tire Tube           2376        97          Tires and Tubes    13.6
    878        Fender Set - Mountain    2121        96          Fenders            100.0
    871        Mountain Bottle Cage     2025        96          Bottles and Cages  24.0

Summary

Windowing, an extremely powerful technology that creates an independent sort of the query results, supplies the sort order for the ranking functions, which calculate row numbers, ranks, dense ranks, and n-tiles. When coding a complex query that makes the data twist and shout, creative use of windowing and ranking can be the difference between solving the problem in a single query or resorting to temp tables and code.

The key point to remember is that the OVER() clause generates the sort order for the ranking functions.

This chapter wraps up the set of chapters that explain how to query the data. The next chapters finish up the part on select by showing how to package queries into reusable views, and add insert, update, delete, and merge verbs to queries to modify data.

(In case you haven’t checked yet and still need to know: The hidden arrow in the FedEx logo is between the E and the X.)

Projecting Data Through Views

IN THIS CHAPTER
Planning views wisely
Creating views with Management Studio or DDL
Updating through views
Performance and views
Nesting views
Security through views
Synonyms

A view is the saved text of a SQL SELECT statement that may be referenced as a data source within a query, similar to how a subquery can be used as a data source: no more, no less. A view can’t be executed by itself; it must be used within a query.
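A minimal sketch of that idea, assuming the AdventureWorks2008 sample database: the view name dbo.vProductSalesCount is hypothetical, and the SELECT it saves is the sales-count subquery used in the ranking examples of the previous chapter. GO is the batch separator recognized by Management Studio and sqlcmd, needed here because CREATE VIEW must be the first statement in its batch.

```sql
-- Hypothetical view: saves the text of the sales-count SELECT.
CREATE VIEW dbo.vProductSalesCount
AS
SELECT ProductID, COUNT(*) AS SalesCount
  FROM Sales.SalesOrderDetail
  GROUP BY ProductID;
GO

-- The view is not executed on its own; it is referenced as a
-- data source, just as the equivalent subquery would be.
SELECT ProductID, SalesCount,
    DENSE_RANK() OVER (ORDER BY SalesCount) AS DenseRank
  FROM dbo.vProductSalesCount
  WHERE SalesCount < 20
  ORDER BY SalesCount;
```

The second query reads the same as the earlier subquery-based ranking query, with the view standing in for the derived table.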
Views are sometimes described as “virtual tables.” This isn’t an accurate description because views don’t store any data. Like any other SQL query, views merely refer to the data stored in tables.

With this in mind, it’s important to fully understand how views work, the pros and cons of using views, and the best place to use views within your project architecture.

Why Use Views?

While there are several opinions on the use of views, ranging from total abstinence to overuse, the Information Architecture Principle (from Chapter 2, “Smart Database Design”) serves as a guide for their most appropriate use. The principle states that “information must be made readily available in a usable format for daily operations and analysis by individuals, groups, and processes.” Presenting data in a more usable format is precisely what views do best.

Based on the premise that views are best used to increase data integrity and ease of writing ad hoc queries, and not as a central part of a production application, here are some ideas for building ad hoc query views:

■ Use views to denormalize or flatten complex joins and hide any surrogate keys used to link data within the database schema. A well-designed view invites the user to get right to the data of interest.
■ Save complex aggregate queries as views. Even power users will appreciate a well-crafted aggregate query saved as a view.

Best Practice

Views are an important part of the abstraction puzzle; I recommend being intentional in their use. Some developers are enamored with views and use them as the primary abstraction layer for their databases. They create layers of nested views, or stored procedures that refer to views. This practice serves no valid purpose, creates confusion, and requires needless overhead.
The best database abstraction layer is a single layer of stored procedures that directly refer to tables, or sometimes user-defined functions (see Chapter 28, “Building out the Data Abstraction Layer”). Instead, use views only to support ad hoc queries and reports. For queries that are run occasionally, views perform well even when compared with stored procedures.

Data within a normalized database is rarely organized in a readily available format. Building ad hoc queries that extract the correct information from a normalized database is a challenge for most end-users. A well-written view can hide the complexity and present the correct data to the user.

■ Use aliases to change cryptic column names to recognizable column names. Just as the SQL SELECT statement can use column or table aliases to modify the names of columns or tables, these features may be used within a view to present a more readable record set to the user.
■ Include only the columns of interest to the user. When columns that don’t concern users are left out of the view, the view is easier to query. The columns that are included in the view are called projected columns, meaning they project only the selected data from the entire underlying table.
■ Plan generic, dynamic views that will have long, useful lives. Single-purpose views quickly become obsolete and clutter the database. Build the view with the intention that it will be used with a WHERE clause to select a subset of data. The view should return all the rows if the user does not supply a WHERE restriction. For example, the vEventList view returns all the events; the user should use a WHERE clause to select the local events, or the events in a certain month.
■ If a view is needed to return a restricted set of data, such as the next month’s events, then the view should calculate the next month so that it will continue to function over time. Hard-coding values such as a month number or name would be poor practice.
■ If the view selects data from a range, then consider writing it as a user-defined function (see Chapter 25, “Building User-Defined Functions”), which can accept parameters.
■ Consolidate data from across a complex environment. Queries that need to collect data from across multiple servers are simplified by encapsulating the union of data from multiple servers within a view. This is one case where basing several reports, and even stored procedures, on a view improves the stability, integrity, and maintainability of the system.

Using Views for Column-Level Security

One of the basic relational operators is projection: the ability to expose specific columns. One primary advantage of views is their natural capacity to project a predefined set of columns. Here’s where theory becomes practical. A view can project columns on a need-to-know basis and hide columns that are sensitive (e.g., payroll and credit card data), irrelevant, or confusing for the purpose of the view.

SQL Server supports column-level security, and it’s a powerful feature. The problem is that ad hoc queries made by users who don’t understand the schema very well will often run into security errors. I recommend implementing SQL Server column-level security, and then also using views to shield users from ever encountering the security. Grant users read permission from only the views, and restrict access to the physical tables (see Chapter 50, “Authorizing Securables”).

I’ve seen databases that only use views for column-level security without any SQL Server–enforced security. This is woefully inadequate and will surely be penalized by any serious security audit.

The goal when developing views is two-fold: to enable users to get to the data easily and to protect the data from the users.
By building views that provide the correct data, you are preventing erroneous or inaccurate queries and misinterpretation.

There are other advanced forms of views.

Distributed partitioned views, or federated databases, divide very large tables across multiple smaller tables or separate servers to improve performance. The partitioned view then spans the multiple tables or servers, thus sharing the query load across more disk spindles. These are covered in Chapter 68, “Partitioning.”

Indexed views are a powerful feature that actually materializes the data, storing the results of the view in a clustered index on disk, so in this sense it’s not a pure view. Like any view, it can select data from multiple data sources. Think of the indexed view as a covering index but with greater control: you can include data from multiple data sources, and you don’t have to include the clustered index keys. The index may then be referenced when executing queries, regardless of whether the view is in the query, so the name is slightly confusing.

Because designing an indexed view is more like designing an indexing structure than creating a view, I’ve included indexed views in Chapter 64, “Indexing Strategies.”

The Basic View

Using SQL Server Management Studio, views may be created, modified, executed, and included within other queries, using either the Query Designer or the DDL code within the Query Editor.
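As a rough sketch of the DDL route, the following creates a view along the lines of the guidelines listed earlier: it flattens joins, hides the surrogate keys, aliases column names, projects only a few columns, and leaves row filtering to the caller’s WHERE clause. The view name dbo.vProductList and its column aliases are hypothetical; the tables and columns are assumed to be those of the AdventureWorks2008 sample database.

```sql
-- Hypothetical ad hoc query view (assumes AdventureWorks2008).
CREATE VIEW dbo.vProductList
AS
SELECT P.ProductNumber AS ProductCode,   -- friendlier column names
       P.[Name]        AS Product,
       PSC.[Name]      AS Subcategory,
       PC.[Name]       AS Category,
       P.ListPrice
  FROM Production.Product AS P
    -- inner joins: products without a subcategory are not returned
    JOIN Production.ProductSubcategory AS PSC
      ON P.ProductSubcategoryID = PSC.ProductSubcategoryID
    JOIN Production.ProductCategory AS PC
      ON PSC.ProductCategoryID = PC.ProductCategoryID;
GO

-- The view returns every row; the user narrows it with WHERE.
SELECT Product, Subcategory, ListPrice
  FROM dbo.vProductList
  WHERE Category = 'Bikes'
  ORDER BY ListPrice DESC;
```

The ID columns used for the joins never appear in the projected column list, so a user querying the view works only with readable names.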