Joe Celko's SQL for Smarties - Advanced SQL Programming P75 docx

712 CHAPTER 31: OLAP IN SQL

defined by an ordering clause (if one is specified), starting with one for the first row and continuing sequentially to the last row in the window. If an ordering clause, ORDER BY, isn't specified in the window, the row numbers are assigned to the rows in arbitrary order as returned by the subselect.

31.2.3 GROUPING Operators

OLAP functions add the ROLLUP and CUBE extensions to the GROUP BY clause. ROLLUP and CUBE are often referred to as supergroups. They can be written in older Standard SQL using GROUP BY and UNION operators.

GROUP BY GROUPING SETS

GROUPING SETS (<column list>) is shorthand in SQL-99 for a series of UNIONed queries that are common in reports. For example, to find the head counts by department and by job title:

  SELECT dept_name, CAST(NULL AS CHAR(10)) AS job_title, COUNT(*)
    FROM Personnel
   GROUP BY dept_name
  UNION ALL
  SELECT CAST(NULL AS CHAR(8)) AS dept_name, job_title, COUNT(*)
    FROM Personnel
   GROUP BY job_title;

The above can be rewritten like this:

  SELECT dept_name, job_title, COUNT(*)
    FROM Personnel
   GROUP BY GROUPING SETS (dept_name, job_title);

There is a problem with all of the OLAP grouping functions. They will generate NULLs for each dimension at the subtotal levels. How do you tell the difference between a real NULL and a generated NULL? This is a job for the GROUPING() function, which returns zero for a NULL in the original data and one for a generated NULL that indicates a subtotal.

  SELECT CASE GROUPING(dept_name)
         WHEN 1 THEN 'department total'
         ELSE dept_name END AS dept_name,
         CASE GROUPING(job_title)
         WHEN 1 THEN 'job total'
         ELSE job_title END AS job_title
    FROM Personnel
   GROUP BY GROUPING SETS (dept_name, job_title);

31.2 OLAP Functionality 713

The grouping set concept can be used to define other OLAP groupings.

ROLLUP

A ROLLUP group, an extension to the GROUP BY clause in SQL-99, produces a result set that contains subtotal rows in addition to the regular grouped rows.
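Returning to the grouping-set example and the real-versus-generated NULL problem described above, here is a rough sketch of both. It runs the UNION ALL form of the query in SQLite through Python's sqlite3 module. SQLite is only a stand-in engine chosen for convenience (it has no GROUPING SETS or GROUPING() support), the sample rows are invented, and the two flag columns are a hand-written imitation of what GROUPING() reports, not the function itself.

```python
import sqlite3

# Invented sample data: one row has a real NULL job title,
# another has a real NULL department.
SAMPLE_PERSONNEL = [("Sales", "Clerk"), ("Sales", None), (None, "Clerk")]


def grouping_set_counts(personnel_rows):
    """Emulate GROUP BY GROUPING SETS (dept_name, job_title) with UNION ALL.

    The grouping_dept / grouping_job columns are hypothetical stand-ins for
    GROUPING(): 1 marks a NULL generated by the query, 0 marks original data.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Personnel "
        "(emp_id INTEGER PRIMARY KEY, dept_name TEXT, job_title TEXT)"
    )
    conn.executemany(
        "INSERT INTO Personnel (dept_name, job_title) VALUES (?, ?)",
        personnel_rows,
    )
    rows = conn.execute(
        """
        SELECT dept_name, NULL AS job_title,
               0 AS grouping_dept, 1 AS grouping_job, COUNT(*) AS head_count
          FROM Personnel
         GROUP BY dept_name
        UNION ALL
        SELECT NULL, job_title, 1, 0, COUNT(*)
          FROM Personnel
         GROUP BY job_title
        """
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in grouping_set_counts(SAMPLE_PERSONNEL):
        print(row)
```

With these sample rows, each half of the UNION produces a row whose two grouping columns are both NULL. Only the flag columns show which NULL came from the data and which was generated as a subtotal placeholder.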
Subtotal rows are superaggregate rows that contain further aggregates whose values are derived by applying the same column functions that were used to obtain the grouped rows. A ROLLUP grouping is a series of grouping sets.

  GROUP BY ROLLUP (a, b, c)

is equivalent to:

  GROUP BY GROUPING SETS
    ((a, b, c),
     (a, b),
     (a),
     ())

Notice that the n elements of the ROLLUP translate to (n + 1) grouping sets. Another point to remember is that the order in which the grouping expressions are specified is significant for ROLLUP. The ROLLUP is basically a classic totals and subtotals report, presented as an SQL table.

CUBES

The CUBE supergroup, the other SQL-99 extension to the GROUP BY clause, produces a result set that contains all the subtotal rows of a ROLLUP aggregation and, in addition, contains cross-tabulation rows. Cross-tabulation rows are additional superaggregate rows. As the name implies, they are summaries across columns, representing the data as if it were a spreadsheet. Like ROLLUP, a CUBE group can also be thought of as a series of grouping sets. In the case of a CUBE, all combinations of the cubed grouping expressions are computed along with the grand total. Therefore, the n elements of a CUBE translate to 2^n grouping sets.

  GROUP BY CUBE (a, b, c)

is equivalent to:

  GROUP BY GROUPING SETS
    ((a, b, c),
     (a, b), (a, c), (b, c),
     (a), (b), (c),
     ())

Notice that the three elements of the CUBE translate to eight grouping sets. Unlike ROLLUP, the order of specification of elements doesn't matter for CUBE: CUBE (julian_day, sales_person) is the same as CUBE (sales_person, julian_day).

CUBE is an extension of the ROLLUP function. The CUBE function not only provides the column summaries we saw in ROLLUP, but also calculates the row summaries and grand totals for the various dimensions.

Aside from these functions, the ability to define a window is equally important to the OLAP functionality of SQL.
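The two expansions above can be checked mechanically. The following is a small sketch in plain Python (no database involved) that lists the grouping sets implied by a ROLLUP or a CUBE. The function names rollup_sets and cube_sets are made up for this illustration and do not correspond to anything in an SQL product.

```python
from itertools import combinations


def rollup_sets(columns):
    """Grouping sets for ROLLUP: every leading prefix, longest first."""
    return [tuple(columns[:k]) for k in range(len(columns), -1, -1)]


def cube_sets(columns):
    """Grouping sets for CUBE: every subset of the columns, largest first."""
    return [
        combo
        for k in range(len(columns), -1, -1)
        for combo in combinations(columns, k)
    ]


if __name__ == "__main__":
    print(rollup_sets(["a", "b", "c"]))  # n + 1 = 4 grouping sets
    print(cube_sets(["a", "b", "c"]))    # 2^n = 8 grouping sets
```

Swapping the column order changes the prefixes that rollup_sets returns, which is why order matters for ROLLUP. cube_sets returns the same collection of column subsets either way, which is why order does not matter for CUBE.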
You use windows to define a set of rows over which a function is applied and the sequence in which it occurs. Another way to view the concept of a window is to equate it with the concept of a slice. In other words, a window is simply a slice of the overall data domain.

Moreover, when you use an OLAP function with a column function, such as AVG(), SUM(), MIN(), or MAX(), the target rows can be further refined relative to the current row, either as a range or as a number of rows preceding and following the current row. The point is that you can call upon the entire SQL vocabulary to combine with any of your OLAP-centric SQL statements.

31.2.4 The Window Clause

The window clause has three subclauses: partitioning, ordering, and aggregation grouping. The general format is:

  <aggregate function> OVER
    (PARTITION BY <column list>
     ORDER BY <sort column list>
     [<aggregation grouping>])

A set of column names specifies the partitioning, which is applied to the rows that the preceding FROM, WHERE, GROUP BY, and HAVING clauses produced. If no partitioning is specified, the entire set of rows composes a single partition, and the aggregate function applies to the whole set each time. Though the partitioning looks like a GROUP BY, it is not the same thing. A GROUP BY collapses the rows in a partition into a single row. Partitioning within a window, though, simply organizes the rows into groups without collapsing them.

The ordering within the window clause is like the ORDER BY clause in a CURSOR. It includes a list of sort keys and indicates whether they should be sorted in ascending or descending order. The important thing to understand is that ordering is applied only within each partition.

The <aggregation grouping> defines a set of rows upon which the aggregate function operates for each row in the partition. Thus, in the example below, for each month, you specify the set including it and the two preceding rows.
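The difference between GROUP BY and window partitioning is easy to see on a tiny table. The sketch below again uses SQLite through Python's sqlite3 module as a stand-in engine, and it assumes a SQLite build with window-function support (version 3.25 or later). The three sample rows are invented.

```python
import sqlite3

# Invented sample rows: (region, month, sales).
SAMPLE_SALES = [("East", 1, 10), ("East", 2, 20), ("West", 1, 5)]


def group_by_versus_partition(sales_rows):
    """Return (grouped rows, windowed rows) for the same regional total."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE SalesHistory (region TEXT, month INTEGER, sales INTEGER)"
    )
    conn.executemany("INSERT INTO SalesHistory VALUES (?, ?, ?)", sales_rows)

    # GROUP BY collapses each region into a single row.
    grouped = conn.execute(
        """
        SELECT region, SUM(sales)
          FROM SalesHistory
         GROUP BY region
         ORDER BY region
        """
    ).fetchall()

    # PARTITION BY keeps every row and attaches the regional total to it.
    windowed = conn.execute(
        """
        SELECT region, month, sales,
               SUM(sales) OVER (PARTITION BY region) AS region_total
          FROM SalesHistory
         ORDER BY region, month
        """
    ).fetchall()
    conn.close()
    return grouped, windowed


if __name__ == "__main__":
    grouped, windowed = group_by_versus_partition(SAMPLE_SALES)
    print(grouped)
    print(windowed)
```

The grouped query returns one row per region, while the windowed query returns all three input rows, each carrying its region's total alongside the detail columns.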
This example is from an ANSI paper on the SQL-99 features.

  SELECT SH.region, SH.month, SH.sales,
         AVG(SH.sales)
         OVER (PARTITION BY SH.region
               ORDER BY SH.month ASC
               ROWS 2 PRECEDING) AS moving_average
    FROM SalesHistory AS SH
   ORDER BY SH.month ASC;

Here, AVG(SH.sales) OVER (PARTITION BY ...) is an OLAP function. The construct inside the OVER() clause defines the window of data to which the aggregate function, AVG() in this example, is applied. In this case, the window clause says to take the SalesHistory table and then apply the following operations to it:

1. Partition SalesHistory by region
2. Order the data by month within each region
3. Group each row with the two preceding rows in the same region
4. Compute the moving average on each grouping

The database engine is not required to perform the steps in the order described here, but it must produce the same result set as if they had been carried out that way.

There are two main types of aggregation groups: physical and logical. In physical grouping, you count a specified number of rows that are before or after the current row. The SalesHistory example used physical grouping. In logical grouping, you include all the data in a certain interval, defined in terms of a quantity that can be added to, or subtracted from, the current sort key. For instance, you can create the same group by defining it as the current month's row plus either:

1. The two preceding rows, as defined by the ORDER BY clause
2. Any row containing a month no less than two months earlier

Physical grouping works well for contiguous data and for programmers who think in terms of sequential files. Physical grouping works for a larger variety of data types than logical grouping, because it does not require operations on values. Logical grouping works better for data that has gaps or irregularities in the ordering and for programmers who think in SQL predicates.
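A small experiment shows how physical and logical grouping diverge when the data has gaps. The sketch below is hedged in several ways. It uses SQLite through Python's sqlite3 module as a stand-in engine (window functions need SQLite 3.25 or later). The sample rows are invented, with months 3 and 4 deliberately missing. The logical group is written as a correlated subquery with a BETWEEN predicate, which is an assumption made here for portability and not the window-clause syntax an SQL-99 product would use.

```python
import sqlite3

# Invented sample rows: (region, month, sales); months 3 and 4 are missing.
GAPPY_SALES = [("East", 1, 10), ("East", 2, 20), ("East", 5, 30), ("East", 6, 40)]


def moving_averages(sales_rows):
    """Return (physical, logical) moving averages as lists of (month, avg)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE SalesHistory (region TEXT, month INTEGER, sales INTEGER)"
    )
    conn.executemany("INSERT INTO SalesHistory VALUES (?, ?, ?)", sales_rows)

    # Physical grouping: the current row plus the two rows before it,
    # whatever months those rows happen to hold.
    physical = conn.execute(
        """
        SELECT SH.month,
               AVG(SH.sales) OVER (PARTITION BY SH.region
                                   ORDER BY SH.month ASC
                                   ROWS 2 PRECEDING)
          FROM SalesHistory AS SH
         ORDER BY SH.region, SH.month
        """
    ).fetchall()

    # Logical grouping: rows whose month is no less than two months earlier,
    # expressed as an ordinary predicate on the sort key.
    logical = conn.execute(
        """
        SELECT S1.month,
               (SELECT AVG(S2.sales)
                  FROM SalesHistory AS S2
                 WHERE S2.region = S1.region
                   AND S2.month BETWEEN S1.month - 2 AND S1.month)
          FROM SalesHistory AS S1
         ORDER BY S1.region, S1.month
        """
    ).fetchall()
    conn.close()
    return physical, logical


if __name__ == "__main__":
    physical, logical = moving_averages(GAPPY_SALES)
    print(physical)
    print(logical)
```

For month 5, the physical group reaches back across the gap to months 1 and 2, while the logical group contains only month 5 itself, so the two averages differ. On data with no missing months, both forms would give the same answer.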
Logical grouping works only if you can do arithmetic on the values (such as numeric quantities and dates).

31.2.5 OLAP Examples of SQL

The following example illustrates an advanced OLAP function used in combination with traditional SQL. The result is a valuable SQL statement that epitomizes the power and relevance of BI at the database engine level. In this example, we want to perform a ROLLUP function of sales by region and city.

  SELECT B.region_type, S.city_id, SUM(S.sales) AS total_sales
    FROM SalesFacts AS S, MarketLookup AS B
   WHERE EXTRACT (YEAR FROM trans_date) = 1999
     AND S.city_id = B.city_id
     AND B.region_type = 6
   GROUP BY ROLLUP(B.region_type, S.city_id)
   ORDER BY B.region_type, S.city_id;

The result set is reduced by explicitly querying region 6 and the year 1999. A sample result of the SQL is shown in Table 1, Yearly Sales by city and region. The result shows ROLLUP of two groupings (region, city), returning three totals: region, city, and grand total.

Table 1 Yearly Sales by city and region

  region_type_id  city_id  total_sales
  --------------  -------  -----------
               6        1        81655
               6        2       131512
               6        3        58384
               6       19        77113
               6       20        55520
               6       21        63647
               6       22         7166
               6       23        92230
               6       30         1733
               6       31         5058
               6                1190902
                                1190902

31.2.6 Enterprise-Wide Dimensional Layer

The traditional data warehouse architecture includes an atomic layer of granular data, often normalized, that serves as the only source of data for subsequent, subject-specific data marts. Generally, the data marts are implemented as Star Schemas, proprietary MOLAP cubes, or both. Establishing a layer of data marts provides an excellent foundation from which to serve up consistent, multidimensional data on an enterprise scale.

But when you couple the current notion of data marts with OLAP-centric SQL functions, it is important that BI architects confirm the value added from proprietary, OLAP-only technology, specifically proprietary multidimensional database servers.
This is especially true when you consider the entire scope of relational technology currently focused on multidimensional data management, including:

- Database kernel support optimized to address multidimensional queries
- RDBMS technology, such as Materialized Query/View Tables used to improve performance
- Metadata capture and management of multidimensional structures, for example, dimensions, supported in the relational environment
- Expanded OLAP-centric SQL vocabulary, standardized for consistent application

Database-resident OLAP functions, coupled with the multidimensional solutions described above, afford the possibility of a single point of truth and efficient management of enterprise-wide traditional relational and multidimensional data requirements. This does not, perhaps, completely eliminate OLAP-only technology, but it certainly minimizes the needed investment to only true value-add.

31.3 A Bit of History

IBM and Oracle jointly proposed these extensions in early 1999, and thanks to ANSI's uncommonly rapid (and praiseworthy) actions, they are part of the SQL-99 Standard. IBM implemented portions of the specifications in DB2 UDB 6.2, which was commercially available in some forms as early as mid-1999. Oracle 8i version 2 and DB2 UDB 7.1, both released in late 1999, contain beefed-up implementations.

Other vendors contributed, including database tool vendors Brio, MicroStrategy, and Cognos, and database vendor Informix, among others. A team led by Dr. Hamid Pirahesh of IBM's Almaden Research Laboratory played a particularly important role. After his team had researched the subject for about a year and come up with an approach to extending SQL in this area, he called Oracle. The companies then learned that each had independently done some significant work. With Andy Witkowski playing a pivotal role at Oracle, the two companies hammered out a joint standards proposal in about two months.
Red Brick was actually the first product to implement this functionality before the standard, but in a less complete form. You can find details in the ANSI document "Introduction to OLAP Functions" by Fred Zemke, Krishna Kulkarni, Andy Witkowski, and Bob Lyle.

CHAPTER 32
Transactions and Concurrency Control

IN THE OLD DAYS when we lived in caves and used mainframe computers with batch file systems, transaction processing was easy. You batched up the transactions to be made against the master file into a transaction file. The transaction file was sorted, edited, and ready to go when you ran it against the master file from a tape drive. The output of this process became the new master file, and the old master file and the transaction files were logged to magnetic tape in a huge closet in the basement of the company.

When disk drives, multiuser systems, and databases came along, things got complex, and SQL made them more so. Mercifully, the user does not have to see the details. Well, this chapter is the first layer of the details.

32.1 Sessions

The concept of a user session involves the user first connecting to the database. This is like dialing a phone number, but with a password, to get to the database. The Standard SQL syntax for this statement is:

  CONNECT TO <connection target>

  <connection target> ::=
      <SQL-server name> [AS <connection name>] [USER <user name>]
    | DEFAULT

However, you will find many differences in various vendors' SQL products and perhaps in operating-system-level login procedures that have to be followed.

Once the connection is established, the user has access to all the parts of the database to which he has been granted privileges. During this session, he can execute zero or more transactions. As one user inserts, updates, and deletes rows in the database, these changes are not made a permanent part of the database until that user issues a COMMIT WORK command for that transaction.
However, if the user does not want to make the changes permanent, then he can issue a ROLLBACK WORK command, and the database stays as it was before the transaction.

32.2 Transactions and ACID

There is a handy mnemonic for the four characteristics we want in a transaction: the ACID properties. The initials are short for four properties we have to have in a transaction processing system: atomicity, consistency, isolation, and durability.

32.2.1 Atomicity

Atomicity means that either the whole transaction becomes persistent in the database or nothing in the transaction becomes persistent. The data becomes persistent in Standard SQL when a COMMIT statement is successfully executed. A ROLLBACK statement removes the transaction, and the database is restored to the (consistent) state it was in before the transaction began. The COMMIT or ROLLBACK statement can be executed explicitly by the user, or implicitly by the database engine when it finds an error. Most SQL engines default to a ROLLBACK, unless they are configured to do otherwise.

Atomicity also means that if I were to try to insert one million rows into a table, and one row of that million violated a referential constraint, then the whole set of one million rows would be rejected and the database would do an automatic ROLLBACK WORK.

Here is the trade-off. If you do one long transaction, then you are in danger of being screwed by just one tiny little error. However, if you do several short transactions in a session, then other users can have access to the database between your transactions and they might change things, much to your surprise.

The solution has been to implement SAVEPOINT or CHECKPOINT options that act much like a bookmark. A transaction sets savepoints during its execution, which lets it perform a local rollback to the most recent savepoint. In our example, we might have been doing savepoints every 1,000 rows.
When the 999,999th row inserted has an error that would have caused a ROLLBACK, the database engine removes only the work done after the last savepoint was set, and the transaction is restored to the state of uncommitted work (i.e., rows 1 through 999,000) that existed before the last savepoint.

You will need to look at your particular product to see if it has something like this. The usual alternatives are to break the work into chunks that are run as transactions with a host program, or to use an ETL tool that scrubs the data completely before loading it into the database.

32.2.2 Consistency

When the transaction starts, the database is in a consistent state; and when it becomes persistent in the database, the database is in a consistent state. The phrase "consistent state" means that all of the data integrity constraints, relational integrity constraints, and any other constraints are true.

However, this does not mean that the database might not go through an inconsistent state during the transaction. Standard SQL has the ability to declare a constraint to be DEFERRABLE or NOT DEFERRABLE for finer control of a transaction. But the rule is that all constraints have to be true at the end of the transaction. This can be tricky when the transaction has multiple statements, or when it fires triggers that affect other tables.

32.2.3 Isolation

One transaction is isolated from all other transactions. Isolation is also called serializability, because it means that transactions act as if they were executed in isolation from each other. One way to guarantee isolation is to use serial execution, like we had to do in batch systems. In practice, this might not be a good idea, so the system must decide how to interleave the transactions to get the same effect.
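Returning to the savepoint mechanism of Section 32.2.1, here is a scaled-down sketch of the "savepoint every N rows" idea. It uses SQLite through Python's sqlite3 module purely as a convenient stand-in, since savepoint behavior differs between products. The Readings table, its CHECK constraint, the savepoint name chunk_start, and the ten-row load are all invented for the illustration. The connection is opened with isolation_level=None so that the BEGIN, SAVEPOINT, and COMMIT statements are issued explicitly and not by the driver.

```python
import sqlite3


def load_in_chunks(amounts, chunk_size):
    """Insert amounts in one transaction, setting a savepoint per chunk.

    On the first constraint violation, only the work done since the last
    savepoint is undone; earlier chunks are kept and then committed.
    Returns the reading_id values that ended up in the table.
    """
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute(
        "CREATE TABLE Readings "
        "(reading_id INTEGER PRIMARY KEY, "
        " amount INTEGER NOT NULL CHECK (amount >= 0))"
    )
    conn.execute("BEGIN")
    for start in range(0, len(amounts), chunk_size):
        conn.execute("SAVEPOINT chunk_start")
        try:
            for offset, amount in enumerate(amounts[start:start + chunk_size]):
                conn.execute(
                    "INSERT INTO Readings (reading_id, amount) VALUES (?, ?)",
                    (start + offset + 1, amount),
                )
        except sqlite3.IntegrityError:
            # Local rollback: discard only the rows of the current chunk.
            conn.execute("ROLLBACK TO SAVEPOINT chunk_start")
            break
        conn.execute("RELEASE SAVEPOINT chunk_start")
    conn.execute("COMMIT")
    kept = [
        row[0]
        for row in conn.execute(
            "SELECT reading_id FROM Readings ORDER BY reading_id"
        )
    ]
    conn.close()
    return kept


if __name__ == "__main__":
    # Row 9 violates the CHECK constraint; savepoints are set every 3 rows.
    print(load_in_chunks([10] * 8 + [-1] + [10], chunk_size=3))
```

In this run the bad row is the ninth, so rows 7 and 8 are discarded along with it and rows 1 through 6 survive. This sketch stops loading at the first error; a real loader might instead log the bad chunk and carry on, which is a design choice the text leaves to the product or the host program.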

Date posted: 06/07/2014, 09:20
