SQL VISUAL QUICKSTART GUIDE- P46 pps

Chapter 15: SQL Tricks

Assigning Ranks

Ranking, which allocates the numbers 1, 2, 3, … to sorted values, is related to top-n queries and shares the problem of interpreting ties. The following queries calculate ranks for sales values in the table empsales from the preceding section.

Listings 15.30a to 15.30e rank employees by sales. The first two queries show the most commonly accepted ways to rank values. The other queries show variations on them. Figure 15.30 shows the result of each ranking method, a to e, combined for brevity and ease of comparison. These queries rank highest to lowest; to reverse the order, change > (or >=) to < (or <=) in the WHERE comparisons.

Listing 15.30a Rank employees by sales (method a). See Figure 15.30 for the result.

    SELECT e1.emp_id, e1.sales,
        (SELECT COUNT(sales)
           FROM empsales e2
          WHERE e2.sales >= e1.sales) AS ranking
      FROM empsales e1;

Listing 15.30b Rank employees by sales (method b). See Figure 15.30 for the result.

    SELECT e1.emp_id, e1.sales,
        (SELECT COUNT(sales)
           FROM empsales e2
          WHERE e2.sales > e1.sales) + 1 AS ranking
      FROM empsales e1;

Listing 15.30c Rank employees by sales (method c). See Figure 15.30 for the result.

    SELECT e1.emp_id, e1.sales,
        (SELECT COUNT(sales)
           FROM empsales e2
          WHERE e2.sales > e1.sales) AS ranking
      FROM empsales e1;

Listing 15.30d Rank employees by sales (method d). See Figure 15.30 for the result.

    SELECT e1.emp_id, e1.sales,
        (SELECT COUNT(DISTINCT sales)
           FROM empsales e2
          WHERE e2.sales >= e1.sales) AS ranking
      FROM empsales e1;

These queries use correlated subqueries and so run slowly. If you’re ranking a large number of items, you should use a built-in rank function or OLAP component, if available. The SQL:2003 standard introduced the functions RANK() and DENSE_RANK(), which Microsoft SQL Server 2005 and later, Oracle, and DB2 support. For Microsoft SQL Server 2000, look at the Analysis Services (OLAP) function RANK().
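As a rough illustration of the built-in functions mentioned above, the following sketch runs RANK() and DENSE_RANK() over the sales values shown in Figure 15.30. It uses SQLite through Python's sqlite3 module, which is an assumption made only for convenience: SQLite is not one of the DBMSs covered in this chapter, and its window functions need SQLite 3.25 or later. The table definition, the function name rank_with_window_functions, and the aliases rank_b and rank_d are illustrative, not taken from the book.

```python
import sqlite3

# (emp_id, sales) pairs copied from Figure 15.30.
ROWS = [
    ("E09", 900), ("E02", 800), ("E10", 700), ("E05", 700), ("E01", 600),
    ("E04", 500), ("E03", 500), ("E06", 500), ("E08", 400), ("E07", 300),
]


def rank_with_window_functions():
    """Rank employees by sales with RANK() and DENSE_RANK()."""
    conn = sqlite3.connect(":memory:")
    # Assumed minimal schema; the real empsales table may have more columns.
    conn.execute(
        "CREATE TABLE empsales (emp_id TEXT PRIMARY KEY, sales INTEGER)"
    )
    conn.executemany("INSERT INTO empsales VALUES (?, ?)", ROWS)
    result = conn.execute(
        """
        SELECT emp_id, sales,
               RANK()       OVER (ORDER BY sales DESC) AS rank_b,
               DENSE_RANK() OVER (ORDER BY sales DESC) AS rank_d
          FROM empsales
         ORDER BY sales DESC, emp_id
        """
    ).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in rank_with_window_functions():
        print(row)
```

On this data, RANK() should reproduce column b of Figure 15.30 (ties share a rank and the following rank is skipped), and DENSE_RANK() should reproduce column d (ties share a rank with no gaps), without a correlated subquery per row.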
Alternatively, you can use your DBMS’s SQL extensions to calculate ranks efficiently. The following MySQL script, for example, is equivalent to Listing 15.30b. The IF() test compares with = so that the first row, where @prev_val is still NULL, takes its row number as its rank. The aliases row_num and ranking avoid ROW and RANK, which are reserved words in recent MySQL versions.

    SET @rownum = 0;
    SET @rank = 0;
    SET @prev_val = NULL;
    SELECT
        @rownum := @rownum + 1 AS row_num,
        @rank := IF(@prev_val = sales, @rank, @rownum) AS ranking,
        @prev_val := sales AS sales
      FROM empsales
      ORDER BY sales DESC;

✔ Tips
■ You can add the clause ORDER BY ranking ASC to a query’s outer SELECT to sort the results by rank.
■ Microsoft Access doesn’t support COUNT(DISTINCT) and won’t run Listings 15.30d and 15.30e. For a workaround, see “Aggregating Distinct Values with DISTINCT” in Chapter 6.

Listing 15.30e Rank employees by sales (method e). See Figure 15.30 for the result.

    SELECT e1.emp_id, e1.sales,
        (SELECT COUNT(DISTINCT sales)
           FROM empsales e2
          WHERE e2.sales > e1.sales) AS ranking
      FROM empsales e1;

    emp_id  sales   a   b   c   d   e
    E09       900   1   1   0   1   0
    E02       800   2   2   1   2   1
    E10       700   4   3   2   3   2
    E05       700   4   3   2   3   2
    E01       600   5   5   4   4   3
    E04       500   8   6   5   5   4
    E03       500   8   6   5   5   4
    E06       500   8   6   5   5   4
    E08       400   9   9   8   6   5
    E07       300  10  10   9   7   6

Figure 15.30 Compilation of results of Listings 15.30a to 15.30e.

Calculating a Trimmed Mean

The trimmed mean is a robust order statistic that is the mean (average) of the data if the k smallest values and k largest values are discarded. The idea is to avoid the influence of extreme observations.

Listing 15.31 calculates the trimmed mean of book sales in the sample database by omitting the top three and bottom three sales figures. See Figure 15.31 for the result. For reference, the 12 sorted sales values are 566, 4095, 5000, 9566, 10467, 11320, 13001, 25667, 94123, 100001, 201440, and 1500200. This query discards 566, 4095, 5000, 100001, 201440, and 1500200 and calculates the mean in the usual way by using the remaining six middle values. Nulls are ignored. Duplicate values are either all removed or all retained.
(If all sales are the same, none of them will be trimmed no matter what k is, for example.)

Listing 15.32 is similar to Listing 15.31 but trims a fixed percentage of the extreme values rather than a fixed number. Trimming by 0.25 (25%), for example, discards the sales in the top and bottom quartiles and averages what’s left. See Figure 15.32 for the result.

✔ Tip
■ Microsoft SQL Server and DB2 return an integer for the trimmed mean because the column sales is defined as an INTEGER. To get a floating-point value, change AVG(sales) to AVG(CAST(sales AS FLOAT)). For more information, see “Converting Data Types with CAST()” in Chapter 5.

Listing 15.31 Calculate the trimmed mean for k = 3. See Figure 15.31 for the result.

    SELECT AVG(sales) AS TrimmedMean
      FROM titles t1
      WHERE
        (SELECT COUNT(*)
           FROM titles t2
          WHERE t2.sales <= t1.sales) > 3
        AND
        (SELECT COUNT(*)
           FROM titles t3
          WHERE t3.sales >= t1.sales) > 3;

    TrimmedMean
    27357.3333

Figure 15.31 Result of Listing 15.31.

Listing 15.32 Calculate the trimmed mean by discarding the lower and upper 25% of values. See Figure 15.32 for the result.

    SELECT AVG(sales) AS TrimmedMean
      FROM titles t1
      WHERE
        (SELECT COUNT(*)
           FROM titles t2
          WHERE t2.sales <= t1.sales) >=
            (SELECT 0.25 * COUNT(*) FROM titles)
        AND
        (SELECT COUNT(*)
           FROM titles t3
          WHERE t3.sales >= t1.sales) >=
            (SELECT 0.25 * COUNT(*) FROM titles);

    TrimmedMean
    27357.3333

Figure 15.32 Result of Listing 15.32.

Picking Random Rows

Some databases are so large, and queries on them so complex, that often it’s impractical (and unnecessary) to retrieve all the data relevant to a query. If you’re interested in finding an overall trend or pattern, for example, an approximate answer within some margin of error usually will do. One way to speed such queries is to select a random sample of rows. An efficient sample can improve performance by orders of magnitude yet still yield accurate results.
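To make the idea of an approximate answer concrete, here is a minimal sketch that compares an exact average with the average of a random sample of about 25 percent of the rows. Everything in it is an assumption for illustration: it uses SQLite through Python's sqlite3 module (not a DBMS covered in this chapter), a synthetic table named readings rather than the sample database, and a user-defined SQL function named rnd that wraps a seeded Python random-number generator so that runs are repeatable.

```python
import random
import sqlite3


def sample_average(fraction=0.25, n_rows=10_000, seed=42):
    """Return (exact_avg, sampled_avg, sampled_row_count) for synthetic data."""
    rng = random.Random(seed)
    conn = sqlite3.connect(":memory:")
    # rnd() is a hypothetical helper: a uniform random number in [0, 1).
    # SQLite calls it once per row because it is not marked deterministic.
    conn.create_function("rnd", 0, rng.random)
    conn.execute("CREATE TABLE readings (id INTEGER PRIMARY KEY, val INTEGER)")
    # Synthetic values 0..99 repeated; not data from the book.
    conn.executemany(
        "INSERT INTO readings VALUES (?, ?)",
        ((i, i % 100) for i in range(n_rows)),
    )
    exact = conn.execute("SELECT AVG(val) FROM readings").fetchone()[0]
    sampled_rows, approx = conn.execute(
        "SELECT COUNT(*), AVG(val) FROM readings WHERE rnd() < ?",
        (fraction,),
    ).fetchone()
    conn.close()
    return exact, approx, sampled_rows


if __name__ == "__main__":
    exact, approx, n = sample_average()
    print(f"exact={exact}, sampled={approx:.2f} from {n} rows")
```

The sampled average is computed from roughly a quarter of the rows and should land close to the exact value. On a table this small the saving is negligible; the sketch shows only the shape of the technique.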
Standard SQL’s TABLESAMPLE clause returns a random subset of rows. DB2 and SQL Server 2005 and later support TABLESAMPLE, and Oracle has something similar. For the other DBMSs, use a (nonstandard) function that returns a uniform random number between 0 and 1 (Table 15.1).

Listing 15.33a randomly picks about 25% (0.25) of the rows from the sample-database table titles. If necessary, change RAND() to the function that appears in Table 15.1 for your DBMS. For Oracle, use Listing 15.33b. For SQL Server 2005 and later and DB2, use Listing 15.33c.

Table 15.1 Randomization Features

    DBMS                   Clause or Function
    Access                 RND() function
    SQL Server 2000        RAND() function
    SQL Server 2005/2008   TABLESAMPLE clause
    Oracle                 SAMPLE clause or DBMS_RANDOM package
    DB2                    TABLESAMPLE clause
    MySQL                  RAND() function
    PostgreSQL             RANDOM() function

Listing 15.33a Select about 25 percent of the rows in the table titles at random. See Figure 15.33 for a possible result.

    SELECT title_id, type, sales
      FROM titles
      WHERE RAND() < 0.25;

Listing 15.33b Select about 25 percent of the rows in the table titles at random (Oracle only). See Figure 15.33 for a possible result.

    SELECT title_id, type, sales
      FROM titles
      SAMPLE (25);

Listing 15.33c Select about 25 percent of the rows in the table titles at random (SQL Server 2005 and later and DB2 only). See Figure 15.33 for a possible result.

    SELECT title_id, type, sales
      FROM titles
      TABLESAMPLE SYSTEM (25);

Figure 15.33 shows one possible result of a random selection. The rows and the number of rows returned will change each time you run the query. If you need an exact number of random rows, increase the sampling percentage and use one of the techniques described in “Limiting the Number of Rows Returned” earlier in this chapter.

✔ Tips
■ Randomizers take an optional seed argument or setting that sets the starting value for a random-number sequence. Identical seeds yield identical sequences (handy for testing).
By default, the DBMS sets the seed based on the system time to generate different sequences every time.
■ Listing 15.33a won’t run correctly on Microsoft Access or Microsoft SQL Server 2000 because the random-number function returns the same “random” number for each selected row. In Access, use Visual Basic or C# to pick random rows. For SQL Server 2000, search for the article “Returning Rows in Random Order” at www.sqlteam.com.

To use the NEWID() function to pick n random rows in Microsoft SQL Server:

    SELECT TOP n title_id, type, sales
      FROM titles
      ORDER BY NEWID();

To use the VALUE() function in the DBMS_RANDOM package to pick n random rows in Oracle:

    SELECT * FROM
      (SELECT title_id, type, sales
         FROM titles
        ORDER BY DBMS_RANDOM.VALUE())
     WHERE ROWNUM <= n;

    title_id  type        sales
    T03       computer    25667
    T04       psychology  13001
    T11       psychology  94123

Figure 15.33 One possible result of Listing 15.33a/b/c.

Selecting Every nth Row

Instead of picking random rows, you can pick every nth row by using a modulo expression:

◆ m MOD n (Microsoft Access)
◆ m % n (Microsoft SQL Server)
◆ MOD(m,n) (other DBMSs)

This expression returns the remainder of m divided by n. For example, MOD(20,6) is 2 because 20 equals (3 ✕ 6) + 2. MOD(a,2) is 0 if a is an even number.

The condition MOD(rownumber,n) = 0 picks every nth row, where rownumber is a column of consecutive integers or row identifiers. This Oracle query picks every third row in a table, for example:

    SELECT *
      FROM table
      WHERE (ROWID,0) IN
        (SELECT ROWID, MOD(ROWNUM,3)
           FROM table);

Note that rownumber imposes a row order that doesn’t exist implicitly in a relational-database table.

Handling Duplicates

Normally you use SQL’s PRIMARY KEY or UNIQUE constraints (see Chapter 11) to prevent duplicate rows from appearing in production tables.
But you need to know how to handle duplicates that appear when you accidentally import the same data twice or import data from a nonrelational environment, such as a spreadsheet or accounting package, where redundant information is rampant. This section describes how to detect, count, and remove duplicates.

Suppose that you import rows into a staging table to detect and eliminate any duplicates before inserting the data into a production table (Listing 15.34 and Figure 15.34). The column id is a unique row identifier that lets you identify and select rows that otherwise would be duplicates. If your imported rows don’t already have an identity column, you can add one yourself; see “Unique Identifiers” in Chapter 3 and “Generating Sequences” earlier in this chapter. It’s a good practice to add an identity column to even short-lived working tables, but in this case it also makes deleting duplicates easy.

The imported data might include other columns too, but you’ve decided that the combination of only book title, book type, and price determines whether a row is a duplicate, regardless of the values in any other columns. Before you identify or delete duplicates, you must define exactly what it means for two rows to be considered “duplicates” of each other.

Listing 15.35 lists only the duplicates by counting the number of occurrences of each unique combination of title_name, type, and price. See Figure 15.35 for the result. If this query returns an empty result, the table contains no duplicates. To list only the nonduplicates, change COUNT(*) > 1 to COUNT(*) = 1.

Listing 15.34 List the imported rows. See Figure 15.34 for the result.
    SELECT id, title_name, type, price
      FROM dups;

    id  title_name    type       price
    1   Book Title 5  children   15.00
    2   Book Title 3  biography   7.00
    3   Book Title 1  history    10.00
    4   Book Title 2  children   20.00
    5   Book Title 4  history    15.00
    6   Book Title 1  history    10.00
    7   Book Title 3  biography   7.00
    8   Book Title 1  history    10.00

Figure 15.34 Result of Listing 15.34.

Listing 15.35 List only duplicates. See Figure 15.35 for the result.

    SELECT title_name, type, price
      FROM dups
      GROUP BY title_name, type, price
      HAVING COUNT(*) > 1;

    title_name    type       price
    Book Title 1  history    10.00
    Book Title 3  biography   7.00

Figure 15.35 Result of Listing 15.35.

Listing 15.36 uses a similar technique to list each row and its duplicate count. See Figure 15.36 for the result. To list only the duplicates, change COUNT(*) >= 1 to COUNT(*) > 1.

Listing 15.37 deletes duplicate rows from dups in place. This statement uses the column id to leave exactly one occurrence (the one with the highest ID) of each duplicate. Figure 15.37 shows the table dups after running this statement. See also “Deleting Rows with DELETE” in Chapter 10.

Listing 15.36 List each row and its number of repetitions. See Figure 15.36 for the result.

    SELECT title_name, type, price,
        COUNT(*) AS NumDups
      FROM dups
      GROUP BY title_name, type, price
      HAVING COUNT(*) >= 1
      ORDER BY COUNT(*) DESC;

    title_name    type       price  NumDups
    Book Title 1  history    10.00  3
    Book Title 3  biography   7.00  2
    Book Title 4  history    15.00  1
    Book Title 2  children   20.00  1
    Book Title 5  children   15.00  1

Figure 15.36 Result of Listing 15.36.

Listing 15.37 Remove the redundant duplicates in place. See Figure 15.37 for the result.
    DELETE FROM dups
      WHERE id < (
        SELECT MAX(d.id)
          FROM dups d
          WHERE dups.title_name = d.title_name
            AND dups.type = d.type
            AND dups.price = d.price);

    id  title_name    type       price
    1   Book Title 5  children   15.00
    4   Book Title 2  children   20.00
    5   Book Title 4  history    15.00
    7   Book Title 3  biography   7.00
    8   Book Title 1  history    10.00

Figure 15.37 Result of Listing 15.37.

✔ Tips
■ If you define a duplicate to span every column in a row (not just a subset of columns), you can drop the column id and use SELECT DISTINCT * FROM table to delete duplicates. See “Eliminating Duplicate Rows with DISTINCT” in Chapter 4.
■ If your DBMS offers a built-in unique row identifier, you can drop the column id and still delete duplicates in place. In Oracle, for example, you can replace id with the ROWID pseudocolumn in Listing 15.37; change the outer WHERE clause to:
    WHERE ROWID < (SELECT MAX(d.ROWID)
■ To run Listing 15.36 in MySQL, change ORDER BY COUNT(*) DESC to ORDER BY NumDups DESC. You can’t use Listing 15.37 to do an in-place deletion because MySQL won’t let you use the same table for both the subquery’s FROM clause and the DELETE target.

Messy Data

Deleting duplicates gets harder as data get messier. It’s not unusual to buy a mailing list with entries that look like this:

    name         address1
    -----------  ------------------
    John Smith   123 Main St
    John Smith   123 Main St, Apt 1
    Jack Smiht   121 Main Rd
    John Symthe  123 Main St.
    Jon Smith    123 Mian Street

DBMSs offer nonstandard tools such as Soundex (phonetic) functions to suppress spelling variations, but creating an automated deletion program that works over thousands or millions of rows is a major project.

Creating a Telephone List

You can use the function COALESCE() with a left outer join to create a convenient telephone listing from a normalized table of phone numbers.
Suppose that the sample database has an extra table named telephones that stores the authors’ work and home telephone numbers:

    au_id  tel_type  tel_no
    A01    H         111-111-1111
    A01    W         222-222-2222
    A02    W         333-333-3333
    A04    H         444-444-4444
    A04    W         555-555-5555
    A05    H         666-666-6666

The table’s composite primary key is (au_id, tel_type), where tel_type indicates whether tel_no is a work (W) or home (H) number.

Listing 15.38 lists the authors’ names and numbers. If an author has only one number, that number is listed. If an author has both home and work numbers, only the work number is listed. Authors with no numbers aren’t listed. See Figure 15.38 for the result. The first left join picks out the work numbers, and the second picks out the home numbers. The WHERE clause filters out authors with no numbers. (You can extend this query to add cell-phone and other numbers.)

✔ Tips
■ For more information about COALESCE(), see “Checking for Nulls with COALESCE()” in Chapter 5. For left outer joins, see “Creating Outer Joins with OUTER JOIN” in Chapter 7.
■ Microsoft Access won’t run Listing 15.38 because of the restrictions Access puts on join expressions.

Listing 15.38 List the authors’ names and telephone numbers, favoring work numbers over home numbers. See Figure 15.38 for the result.

    SELECT
        a.au_id AS "ID",
        a.au_fname AS "FirstName",
        a.au_lname AS "LastName",
        COALESCE(twork.tel_no, thome.tel_no) AS "TelNo",
        COALESCE(twork.tel_type, thome.tel_type) AS "TelType"
      FROM authors a
      LEFT OUTER JOIN telephones twork
        ON a.au_id = twork.au_id
        AND twork.tel_type = 'W'
      LEFT OUTER JOIN telephones thome
        ON a.au_id = thome.au_id
        AND thome.tel_type = 'H'
      WHERE COALESCE(twork.tel_no, thome.tel_no) IS NOT NULL
      ORDER BY a.au_fname ASC, a.au_lname ASC;

    ID   FirstName  LastName   TelNo         TelType
    A05  Christian  Kells      666-666-6666  H
    A04  Klee       Hull       555-555-5555  W
    A01  Sarah      Buchman    222-222-2222  W
    A02  Wendy      Heydemark  333-333-3333  W

Figure 15.38 Result of Listing 15.38.
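The extension to cell-phone numbers mentioned above might look like the following sketch, which adds a third left join and a third COALESCE() argument. It is an illustration under several assumptions, not the book's listing: it runs in SQLite through Python's sqlite3 module, the tel_type code 'C' for cell numbers is invented here, and the two authors with IDs X01 and X02 (one with only a cell number, one with no numbers) are hypothetical rows added to exercise the new join and the WHERE filter. The other rows come from the telephones table and Figure 15.38.

```python
import sqlite3

AUTHORS = [
    ("A01", "Sarah", "Buchman"),
    ("A02", "Wendy", "Heydemark"),
    ("A04", "Klee", "Hull"),
    ("A05", "Christian", "Kells"),
    ("X01", "Example", "CellOnly"),  # hypothetical: has only a cell number
    ("X02", "Example", "NoPhone"),   # hypothetical: has no numbers at all
]
TELEPHONES = [
    ("A01", "H", "111-111-1111"), ("A01", "W", "222-222-2222"),
    ("A02", "W", "333-333-3333"),
    ("A04", "H", "444-444-4444"), ("A04", "W", "555-555-5555"),
    ("A05", "H", "666-666-6666"),
    ("X01", "C", "777-777-7777"),    # hypothetical cell number, type 'C'
]

# Preference order: work, then home, then cell.
QUERY = """
SELECT a.au_id, a.au_fname, a.au_lname,
       COALESCE(twork.tel_no, thome.tel_no, tcell.tel_no) AS tel_no,
       COALESCE(twork.tel_type, thome.tel_type, tcell.tel_type) AS tel_type
  FROM authors a
  LEFT OUTER JOIN telephones twork
    ON a.au_id = twork.au_id AND twork.tel_type = 'W'
  LEFT OUTER JOIN telephones thome
    ON a.au_id = thome.au_id AND thome.tel_type = 'H'
  LEFT OUTER JOIN telephones tcell
    ON a.au_id = tcell.au_id AND tcell.tel_type = 'C'
 WHERE COALESCE(twork.tel_no, thome.tel_no, tcell.tel_no) IS NOT NULL
 ORDER BY a.au_fname ASC, a.au_lname ASC
"""


def telephone_list():
    """Build the sample tables in memory and return the telephone listing."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE authors (au_id TEXT PRIMARY KEY, au_fname TEXT, au_lname TEXT)"
    )
    conn.execute(
        "CREATE TABLE telephones (au_id TEXT, tel_type TEXT, tel_no TEXT, "
        "PRIMARY KEY (au_id, tel_type))"
    )
    conn.executemany("INSERT INTO authors VALUES (?, ?, ?)", AUTHORS)
    conn.executemany("INSERT INTO telephones VALUES (?, ?, ?)", TELEPHONES)
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in telephone_list():
        print(row)
```

Because each join matches at most one row per author (the primary key is (au_id, tel_type)), adding another number type adds one join and one COALESCE() argument without multiplying result rows.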
Retrieving Metadata

Metadata are data about data. In DBMSs, metadata include information about schemas, databases, users, tables, columns, and so on. You already saw metadata in “Getting User Information” in Chapter 5 and “Displaying Table Definitions” in Chapter 10.

The first thing to do when meeting a new database is to inspect its metadata: What’s in the database? How big is it? How are the tables organized?

Metadata, like other data, are stored in tables and so can be accessed via SELECT queries. Metadata also can be accessed, often more conveniently, by using command-line and graphical tools. The following listings show DBMS-specific examples for viewing metadata. The DBMS itself maintains metadata: look, but don’t touch.

✔ Tip
■ The SQL standard calls a set of metadata a catalog and specifies that it be accessed through the schema INFORMATION_SCHEMA. Not all DBMSs implement this schema or use the same terms. In Microsoft SQL Server, for example, the equivalent term for a catalog is a database and for a schema, an owner. In Oracle, the repository of metadata is the data dictionary.
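As one more DBMS-specific illustration of reading metadata with ordinary queries, the sketch below lists tables and their columns in SQLite through Python's sqlite3 module. SQLite is an assumption made for convenience and is not one of the DBMSs covered in this chapter. It does not implement INFORMATION_SCHEMA; its catalog is the table sqlite_master, and column details come from PRAGMA table_info. The function name describe_database and the column types in the two CREATE TABLE statements are illustrative guesses, not definitions from the sample database.

```python
import sqlite3


def describe_database(conn):
    """Return {table_name: [(column_name, declared_type), ...]}."""
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    catalog = {}
    for table in tables:
        # Quote the identifier; PRAGMA arguments can't be bound as parameters.
        quoted = '"' + table.replace('"', '""') + '"'
        # Each row is (cid, name, type, notnull, dflt_value, pk).
        columns = conn.execute(f"PRAGMA table_info({quoted})").fetchall()
        catalog[table] = [(col[1], col[2]) for col in columns]
    return catalog


# Demo tables with assumed column types.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dups (id INTEGER PRIMARY KEY, title_name TEXT, "
    "type TEXT, price REAL)"
)
conn.execute(
    "CREATE TABLE telephones (au_id TEXT, tel_type TEXT, tel_no TEXT, "
    "PRIMARY KEY (au_id, tel_type))"
)

for name, cols in describe_database(conn).items():
    print(name, cols)
```

The code only reads the catalog and never writes to it, in keeping with the advice to look but not touch. In a DBMS that does implement the standard schema, the comparable information would typically be read from the INFORMATION_SCHEMA.TABLES and INFORMATION_SCHEMA.COLUMNS views instead.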
