Joe Celko's SQL for Smarties - Advanced SQL Programming P74

CHAPTER 30: GRAPHS IN SQL

arrival_town  steps  total_distance
===================================
TOULOUSE      3      1015
TOULOUSE      2       795
TOULOUSE      3       995

The girl hiding in the cake is the ability to determine which towns we visit on each of these different routes:

-- The CTE is named Journey so that it does not collide with the base table Journeys.
WITH RECURSIVE Journey (arrival_town, steps, total_distance, way)
AS
(SELECT DISTINCT depart_town, 0, 0, CAST('PARIS' AS VARCHAR(255))
   FROM Journeys
  WHERE depart_town = 'PARIS'
 UNION ALL
 SELECT Arrivals.arrival_town, Departures.steps + 1,
        Departures.total_distance + Arrivals.jny_distance,
        Departures.way || ', ' || Arrivals.arrival_town
   FROM Journeys AS Arrivals, Journey AS Departures
  WHERE Departures.arrival_town = Arrivals.depart_town)
SELECT arrival_town, steps, total_distance, way
  FROM Journey
 WHERE arrival_town = 'TOULOUSE';

arrival_town  steps  total_distance  way
==================================================================================
TOULOUSE      3      1015            PARIS, LYON, MONTPELLIER, TOULOUSE
TOULOUSE      2       795            PARIS, CLERMONT-FERRAND, TOULOUSE
TOULOUSE      3       995            PARIS, CLERMONT-FERRAND, MONTPELLIER, TOULOUSE

And now, ladies and gentlemen, the recursive query is proud to present to you the means of solving a classic problem, the shortest path problem. (It is often confused with the traveling salesman problem, which asks for the shortest tour through every town.) This is one of the operations research problems for which Edsger Wybe Dijkstra found the first efficient algorithm. Dijkstra received the Turing Award in 1972.
-- The CTE is named Journey so that it does not collide with the base table Journeys.
WITH RECURSIVE Journey (arrival_town, steps, total_distance, way)
AS
(SELECT DISTINCT depart_town, 0, 0, CAST('PARIS' AS VARCHAR(255))
   FROM Journeys
  WHERE depart_town = 'PARIS'
 UNION ALL
 SELECT Arrivals.arrival_town, Departures.steps + 1,
        Departures.total_distance + Arrivals.jny_distance,
        Departures.way || ', ' || Arrivals.arrival_town
   FROM Journeys AS Arrivals, Journey AS Departures
  WHERE Departures.arrival_town = Arrivals.depart_town),
ShortestDistance (total_distance)
AS
(SELECT MIN(total_distance)
   FROM Journey
  WHERE arrival_town = 'TOULOUSE')
SELECT J.arrival_town, J.steps, J.total_distance, J.way
  FROM Journey AS J, ShortestDistance AS S
 WHERE J.total_distance = S.total_distance
   AND J.arrival_town = 'TOULOUSE';

30.4.1 Nonacyclic Graphs

One thing that limits our network of speedways is that every route runs in a single direction. In other words, we can go from Paris to Lyon, but we are not allowed to go from Lyon to Paris. To allow that, we need to add the reverse directions to the table, as shown:

depart_town  arrival_town  jny_distance
=======================================
LYON         PARIS         470

This can be done with a very simple query:

INSERT INTO Journeys (depart_town, arrival_town, jny_distance)
SELECT arrival_town, depart_town, jny_distance
  FROM Journeys;

The only problem is that the previous queries will no longer work properly:

WITH RECURSIVE Journey (arrival_town)
AS
(SELECT DISTINCT depart_town
   FROM Journeys
  WHERE depart_town = 'PARIS'
 UNION ALL
 SELECT Arrivals.arrival_town
   FROM Journeys AS Arrivals, Journey AS Departures
  WHERE Departures.arrival_town = Arrivals.depart_town)
SELECT arrival_town
  FROM Journey;

This query will give you an error message about the maximum depth of recursion being exceeded. What happened? The problem is simply that you are trying all routes, including cyclic routes like Paris, Lyon, Paris, Lyon, Paris . . . ad infinitum. Is there a way to avoid cyclic routes? Maybe.
In one of our previous queries, we have a column that gives the complete list of towns passed through. Why not use it to avoid cycling? The condition will be: do not pass through a town that is already in the way column. This can be written as:

-- The CTE is named Journey so that it does not collide with the base table Journeys.
WITH RECURSIVE Journey (arrival_town, steps, total_distance, way)
AS
(SELECT DISTINCT depart_town, 0, 0, CAST('PARIS' AS VARCHAR(255))
   FROM Journeys
  WHERE depart_town = 'PARIS'
 UNION ALL
 SELECT Arrivals.arrival_town, Departures.steps + 1,
        Departures.total_distance + Arrivals.jny_distance,
        Departures.way || ', ' || Arrivals.arrival_town
   FROM Journeys AS Arrivals, Journey AS Departures
  WHERE Departures.arrival_town = Arrivals.depart_town
    AND Departures.way NOT LIKE '%' || Arrivals.arrival_town || '%')
SELECT arrival_town, steps, total_distance, way
  FROM Journey
 WHERE arrival_town = 'TOULOUSE';

arrival_town  steps  total_distance  way
===================================================================================
TOULOUSE      3      1015            PARIS, LYON, MONTPELLIER, TOULOUSE
TOULOUSE      4      1485            PARIS, LYON, MONTPELLIER, CLERMONT-FERRAND, TOULOUSE
TOULOUSE      2       795            PARIS, CLERMONT-FERRAND, TOULOUSE
TOULOUSE      3       995            PARIS, CLERMONT-FERRAND, MONTPELLIER, TOULOUSE

As you see, a new route is determined. It’s the worst so far as distance is concerned, but it is perhaps the most beautiful!

A CTE can simplify the expression of complex queries. Recursive queries must be employed where recursion is needed. Trust your SQL product to terminate a bad query. There is usually an option to set the depth of recursion, either in the SQL engine or as an OPTION clause at the end of the query that uses the CTE.

30.5 Adjacency Matrix Model

An adjacency matrix is a square array whose rows are out-nodes and columns are in-nodes of a graph. A one in a cell means that there is an edge between the two nodes.
Using the graph in Figure 30.1, we would have an array like this:

    A B C D E F G H
   ================
A|  1 1 1 0 0 0 0 0
B|  0 1 0 1 0 0 0 0
C|  0 0 1 1 0 0 1 0
D|  0 0 0 1 1 1 0 0
E|  0 0 0 0 1 0 0 1
F|  0 0 0 0 0 1 0 0
G|  0 0 0 0 0 0 1 1
H|  0 0 0 0 0 0 0 1

Many graph algorithms are based on the adjacency matrix model and can be translated into SQL. Go back to Chapter 25 for details on modeling matrices in SQL; in particular, review Section 25.3.3, which deals with matrix multiplication in SQL. For example, the Floyd-Warshall algorithm for the shortest distances between each pair of nodes in a graph looks like this in pseudocode:

FOR k = 1 TO n
DO FOR i = 1 TO n
   DO FOR j = 1 TO n
      DO IF a[i,k] + a[k,j] < a[i,j]
         THEN a[i,j] = a[i,k] + a[k,j]
         END IF;
      END FOR;
   END FOR;
END FOR;

You need to be warned that for a graph of n nodes, the table will be of size n^2, and the algorithms often run in O(n^3) time. The advantage is that once you have completed the table, it can be used for look-ups rather than for recomputing distances over and over.

30.6 Points inside Polygons

Although polygons are not actually part of graph theory, this chapter seemed to be the reasonable place to put this section, since it is also related to spatial queries. A polygon can be described as a set of corner points in an (x, y) coordinate system. The usual query is to tell whether a given point is inside or outside of the polygon.

This algorithm is due to Darel R. Finley. Its main advantage is that it can be done in Standard SQL without trigonometry functions. Its disadvantage is that it does not work for concave polygons. The workaround is to dissect the concave polygons into convex polygons, then add a column for the name of the original area.
-- set up polygon, with any ordering of the corners
CREATE TABLE Polygon
(x FLOAT NOT NULL,
 y FLOAT NOT NULL,
 PRIMARY KEY (x, y));

INSERT INTO Polygon
VALUES (2.00, 2.00), (1.00, 4.00), (3.00, 6.00),
       (6.00, 4.00), (5.00, 2.00);

-- set up some sample points
CREATE TABLE Points
(xx FLOAT NOT NULL,
 yy FLOAT NOT NULL,
 location VARCHAR(10) NOT NULL, -- answer the question in advance!
 PRIMARY KEY (xx, yy));

INSERT INTO Points
VALUES (2.00, 2.00, 'corner'), (1.00, 5.00, 'outside'),
       (3.00, 3.00, 'inside'), (3.00, 4.00, 'inside'),
       (5.00, 1.00, 'outside'), (3.00, 2.00, 'side');

-- do the query
SELECT P1.xx, P1.yy, P1.location,
       SIGN(SUM(CASE WHEN (polyY.y < P1.yy AND polyY.x >= P1.yy
                        OR polyY.x < P1.yy AND polyY.y >= P1.yy)
                     THEN CASE WHEN polyX.y + (P1.yy - polyY.y)
                                    / (polyY.x - polyY.y)
                                    * (polyX.x - polyX.y) < P1.xx
                               THEN 1 ELSE 0 END
                     ELSE 0 END)) AS flag
  FROM Polygon AS polyY, Polygon AS polyX, Points AS P1
 GROUP BY P1.xx, P1.yy, P1.location;

When flag = 1, the point is inside; when flag = 0, it is outside.

xx   yy   location  flag
========================
1.0  5.0  outside   0
2.0  2.0  corner    0
3.0  3.0  inside    1
3.0  4.0  inside    1
5.0  1.0  outside   0
3.0  2.0  side      1

Sides are counted as inside. If you also want to count the corner points as inside, you should start the CASE expression with:

CASE WHEN EXISTS
          (SELECT *
             FROM Polygon
            WHERE x = P1.xx
              AND y = P1.yy)
     THEN 1

CHAPTER 31
OLAP in SQL

This material was provided by Michael L. Gonzales from his book, The IBM Data Warehouse (Gonzales 2003), as well as his article “The SQL Language of OLAP” (Gonzales 2004).

Most SQL programmers work with OLTP (Online Transaction Processing) databases and have had no exposure to Online Analytic Processing (OLAP) and data warehousing. OLAP is concerned with summarizing and reporting data, so the schema designs and common operations are very different from the usual SQL queries.

As a gross generalization, everything you knew in OLTP is reversed in a data warehouse:

1. OLTP changes data in short, frequent transactions. A data warehouse is bulk-loaded with static data on a schedule, and the data remains constant once it is in place.

2. An OLTP database wants to store only the data needed to do its current work. A data warehouse wants all the historical data it can hold. For example, as of 2005, Wal-Mart had a corporate data warehouse with more than half a petabyte of data online. The definition of a petabyte is 2^50 = 1,125,899,906,842,624 bytes = 1,024 terabytes, or roughly 10^15 bytes.

3. OLTP queries tend to be for simple facts. Data warehouse queries tend to be aggregate relationships that are more complex. For example, an OLTP query might ask, “How much chocolate did Joe Celko buy?” while a data warehouse might ask, “What is the correlation between chocolate purchases, geographic location, and wearing tweed?”

4. OLTP wants to run as fast as possible. A data warehouse is more concerned with the accuracy of computations, and it is willing to wait to get an answer to a complex query.

5. Properly designed OLTP databases are normalized. A data warehouse is usually a Star or Snowflake Schema, which is highly denormalized. The Star Schema is due to Ralph Kimball, and you can get more details about it in his books and articles.

31.1 Star Schema

The Star Schema is a violation of basic normalization rules. There is a large central fact table. This table contains all the facts about an event that you wish to report on, such as sales, in one place. In an OLTP database, the inventory would be in one table, the sales in another table, customers in a third table, and so forth. In the data warehouse, they are all in one huge table.

The dimensions of the values in the fact table are in smaller tables that allow you to pick a scale or unit of measurement on that dimension in the fact table. For example, the time dimension for the Sales fact table might be grouped by year, month within year, week within month.
Then a weight dimension could give you pounds, kilograms, or stock packaging sizes. A category dimension might classify the stock by department, and so forth. This arrangement lets me ask for my facts aggregated at any granularity of units I wish, perhaps dropping some of the dimensions.

Until recent changes in SQL, OLAP queries had to be done with special OLAP-centric languages, such as Microsoft’s Multidimensional Expressions (MDX). Be assured that the power of OLAP is not found in the wizards or GUIs presented in the vendor demos. The wizards and GUIs are often the glitter that lures the uninformed.

Many aspects of OLAP are already integrated with the relational database engine itself. This blending of technology blurs the distinction between an RDBMS and OLAP data management technology, effectively challenging the passive role often relegated to relational databases with regard to dimensional data. The more your RDBMS can address the needs of both traditional relational data and dimensional data, the more you can reduce the cost of OLAP-only technology and get more out of your investment in RDBMS technology, skills, and resources.

31.2 OLAP Functionality

While OLAP systems have the ability to answer “who” and “what” questions, it is their ability to answer “what if” that sets them apart from other BI (business intelligence) tools. Leading RDBMS products, such as DB2 and Oracle, currently offer core OLAP-centric SQL functions, including categories such as ranking, numbering, and grouping. In fact, DB2 added extensions to its optimizer to identify a Star Schema and build a special execution plan for it.

When specifying an OLAP function, a window defines the rows over which the function is applied, and in what order. When used with a column function, the applicable rows can be further refined, relative to the current row, as either a range or a number of rows preceding and following the current row.
For example, within a partition by month, an average can be calculated over the previous three-month period.

31.2.1 RANK and DENSE_RANK

So far, we have talked about extending the usual SQL aggregate functions. There are also special functions that can be used with the window construct.

RANK assigns a sequential rank to a row within a window. The RANK of a row is defined as one plus the number of rows that strictly precede the row. Rows that are not distinct within the ordering of the window are assigned equal ranks. If two or more rows are not distinct with respect to the ordering, then there will be one or more gaps in the sequential rank numbering. That is, the results of RANK may have gaps in the numbers resulting from duplicate values.

DENSE_RANK also assigns a sequential rank to a row in a window. However, a row’s DENSE_RANK is one plus the number of rows preceding it that are distinct with respect to the ordering. Therefore, there will be no gaps in the sequential rank numbering, with ties being assigned the same rank.

31.2.2 Row Numbering

ROW_NUMBER uniquely identifies rows in a result set. This function computes the sequential row number of the row within the window.
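To make the difference between RANK, DENSE_RANK, and ROW_NUMBER concrete, here is a minimal sketch that runs the three functions side by side. It is not from the text: the Sales table, its columns, and its four rows are hypothetical, and the SQL is driven from Python's standard sqlite3 module only so that the example is self-contained (window functions need SQLite 3.25 or later). Two salesmen tie on amount, which is exactly where the three functions part ways.

```python
import sqlite3

# Hypothetical sample data: the Sales table and its rows are illustrative
# assumptions, not taken from the text.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Sales "
    "(salesman VARCHAR(10) NOT NULL PRIMARY KEY, "
    " amount INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO Sales (salesman, amount) VALUES (?, ?)",
    [("Ann", 500), ("Bob", 400), ("Cat", 400), ("Dan", 300)],
)

# All three functions order the rows by descending amount.
# ROW_NUMBER gets salesman as a tie-breaker so its numbering is deterministic.
query = """
SELECT salesman, amount,
       RANK() OVER (ORDER BY amount DESC) AS rnk,
       DENSE_RANK() OVER (ORDER BY amount DESC) AS dense_rnk,
       ROW_NUMBER() OVER (ORDER BY amount DESC, salesman) AS row_num
  FROM Sales
 ORDER BY amount DESC, salesman;
"""

rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
# ('Ann', 500, 1, 1, 1)
# ('Bob', 400, 2, 2, 2)
# ('Cat', 400, 2, 2, 3)
# ('Dan', 300, 4, 3, 4)
```

Bob and Cat tie, so RANK gives both a 2 and then jumps to 4 for Dan, because three rows strictly precede him. DENSE_RANK gives both a 2 and continues with 3. ROW_NUMBER hands out 2 and 3 regardless of the tie.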
