
Joe Celko's SQL for Smarties: Advanced SQL Programming (Part 54)


      THEN RETURN (x1);
      END IF;
   END IF;
   -- try again with a new guess, using the two-point formula
   SET x2 = x1 + 1.0e-5;
   SET f2 = NPV (x2);
   IF f2 IS NULL -- no irr
   THEN RETURN (f2);
   END IF;
   IF ABS (f2 - f1) < 1.0e-5 -- check for divide by zero
   THEN RETURN (CAST (NULL AS FLOAT));
   END IF;
   SET guess = x1 - f1 * (x2 - x1) / (f2 - f1);
   SET maxtry = maxtry - 1;
END WHILE;
END;

-- test tables, hold results of the straight-line and bounded-box algorithms
CREATE TABLE Test_StraightLine
(rate DECIMAL(7,5) NOT NULL,
 npv FLOAT,
 irr DECIMAL(7,5));

CREATE TABLE Test_BoundedBox
(rate DECIMAL(7,5) NOT NULL,
 npv FLOAT,
 irr DECIMAL(7,5));

-- original scenario
-- try t = 0 cashflow of -391, irr undefined;
-- try t = 0 cashflow of -350, irr multivalued;
-- ... 0, irr single-valued (well-behaved)
DELETE FROM CashFlows;
INSERT INTO CashFlows
VALUES (0, -350), (1, 100), (2, 100), (3, 100), (4, 100),
       (5, 100), (6, 100), (7, 100), (8, 100), (9, 100),
       (10, 100), (11, 100), (12, 100), (13, 100), (14, 100),
       (15, -1500);

-- scenario 1a: single-valued irr
DELETE FROM CashFlows;
INSERT INTO CashFlows
VALUES (0, -800), (1, 100), (2, 100), (3, 100), (4, 100),
       (5, 100), (6, 100), (7, 100), (8, 100), (9, 100),
       (10, 100);

-- scenario 1b: single-valued irr, signs reversed
DELETE FROM CashFlows;
INSERT INTO CashFlows
VALUES (0, 800), (1, -100), (2, -100), (3, -100), (4, -100),
       (5, -100), (6, -100), (7, -100), (8, -100), (9, -100),
       (10, -100);

-- scenario 2: double-valued irr
DELETE FROM CashFlows;
INSERT INTO CashFlows
VALUES (0, -300), (1, 100), (2, 100), (3, 100), (4, 100),
       (5, 100), (6, 100), (7, 100), (8, 100), (9, 100),
       (10, -690);

-- scenario 3: double-valued irr with solutions very close together
DELETE FROM CashFlows;
INSERT INTO CashFlows
VALUES (0, -310), (1, 100), (2, 100), (3, 100), (4, 100),
       (5, 100), (6, 100), (7, 100), (8, 100), (9, 100),
       (10, -690);

-- scenario 4: undefined irr
DELETE FROM CashFlows;
INSERT INTO CashFlows
VALUES (0, -320), (1, 100), (2, 100), (3, 100), (4, 100),
       (5, 100), (6, 100), (7, 100), (8, 100), (9, 100),
       (10, -690);

-- run the test
DELETE FROM Test_StraightLine;
INSERT INTO Test_StraightLine (rate, npv, irr)
SELECT rate, NPV(rate), irr_sl(rate)
  FROM Rates;

DELETE FROM Test_BoundedBox;
INSERT INTO Test_BoundedBox (rate, npv, irr)
SELECT rate, NPV(rate), irr_bb(rate)
  FROM Rates;

-- view results of the test
SELECT SL.rate,
       SL.npv AS npv_sl, SL.irr AS irr_sl,
       BB.npv AS npv_bb, BB.irr AS irr_bb
  FROM Test_StraightLine AS SL, Test_BoundedBox AS BB
 WHERE BB.rate = SL.rate;

A computational version of the IRR, due to Richard Romley, returns approximations that become more and more accurate as you feed each estimate back into the formula.

CREATE FUNCTION IRR (IN my_project_id CHAR(15), IN my_i DECIMAL(12,8))
RETURNS DECIMAL(12,8)
LANGUAGE SQL
DETERMINISTIC
RETURN
(SELECT CASE WHEN ROUND(my_i, 4) = ROUND(T.i, 4)
             THEN 100 * (my_i - 1)
             ELSE IRR (my_project_id, T.i) END
   FROM (SELECT SUM((amount * (time_period + 1))
                    / POWER(my_i, time_period))
                / SUM((amount * time_period)
                      / POWER(my_i, time_period + 1))
           FROM CashFlows
          WHERE project_id = my_project_id) AS T(i));
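The test harness above also calls an NPV() function whose definition appears earlier in the chapter, on a page outside this excerpt. A minimal sketch of what it might look like, assuming the single-project CashFlows (time_period, amount) table used in the scenarios and a rate expressed as a plain fraction (0.10 for 10%):

-- a sketch only, not the book's definition; assumes CashFlows holds
-- (time_period, amount) rows for one project and my_rate is fractional
CREATE FUNCTION NPV (IN my_rate FLOAT)
RETURNS FLOAT
LANGUAGE SQL
DETERMINISTIC
RETURN
(SELECT SUM(amount / POWER(1.0 + my_rate, time_period))
   FROM CashFlows);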
22.3.2 Interpolation with Auxiliary Function Tables

SQL is not a functional programming language, so you often have to depend on vendor extensions to provide a good library, or on being able to write the functions with the limited power of standard SQL. However, SQL is good at handling tables, and when the range of the function is relatively small, you can set up auxiliary tables of the general form:

CREATE TABLE SomeFunction
(parameter <data type> NOT NULL,
 result <data type> NOT NULL);

Thus, the pseudocode expression:

SELECT SomeFunction(T1.x), ...
  FROM TableOne AS T1
 WHERE ...;

is replaced by:

SELECT F1.result, ...
  FROM TableOne AS T1, SomeFunction AS F1
 WHERE T1.x = F1.parameter
   AND ...;

However, if the function has a large range, the SomeFunction table can become huge or completely impractical. A technique that has fallen out of favor since the advent of cheap, fast computers is interpolation. It consists of using two known values, a and b, and their results under the function, f(a) and f(b), to find the result for a value x that lies between them.

Linear interpolation is the easiest method, and if the table has high precision, it will work quite well for most applications. It is based on the idea that a straight line drawn between two function values f(a) and f(b) approximates the function well enough that you can take a proportional increment of x relative to (a, b) and get a usable answer for f(x). The algebra looks like this:

f(x) = f(a) + (x - a) * ((f(b) - f(a)) / (b - a))

In this formula, (a <= x <= b), and x is not in the table. This can be translated into SQL as shown below (where x is :myparameter, F1 supplies a and f(a), and F2 supplies b and f(b)):

SELECT :myparameter AS my_input,
       (F1.answer
        + (:myparameter - F1.param)
          * ((F2.answer - F1.answer)
             / (CASE WHEN F1.param = F2.param
                     THEN 1.00
                     ELSE F2.param - F1.param END))) AS answer
  FROM SomeFunction AS F1, SomeFunction AS F2
 WHERE F1.param -- establish a and f(a)
       = (SELECT MAX(param)
            FROM SomeFunction
           WHERE param <= :myparameter)
   AND F2.param -- establish b and f(b)
       = (SELECT MIN(param)
            FROM SomeFunction
           WHERE param >= :myparameter);

The CASE expression in the divisor avoids division-by-zero errors when f(x) is already in the table. The rules for interpolation methods are always expressible in four-function arithmetic, which is good for standard SQL.

In the old days, the function tables gave an extra value with each parameter and result pair, called delta squared, which was based on finite differences. Delta squared was similar to a second derivative, and could be used in a formula to improve the accuracy of the approximation. This is not a book on numerical analysis, so you will have to go to a library to find the details, or ask an old engineer.
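To see the formula at work, suppose SomeFunction holds natural logarithms at whole-number arguments (hypothetical rows made up for this illustration):

-- hypothetical data for illustration only
INSERT INTO SomeFunction (param, answer)
VALUES (2.00, 0.69315), (3.00, 1.09861);

-- for :myparameter = 2.5, the query picks a = 2.00 and b = 3.00:
-- f(2.5) = 0.69315 + (2.5 - 2.0) * ((1.09861 - 0.69315) / (3.0 - 2.0))
--        = 0.89588
-- the true LN(2.5) is 0.91629, so a grid this coarse is off by roughly
-- 2%; a finer grid shrinks the error quickly
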
22.4 Global Constants Tables

When you configure a system, you might want a way to set and keep constants in the schema. One method for doing this is to have a one-row table that can be set with default values at the start and then updated only by someone with administrative privileges.

CREATE TABLE Constants
(lock CHAR(1) DEFAULT 'X' NOT NULL PRIMARY KEY
   CHECK (lock = 'X'),
 pi FLOAT DEFAULT 3.141592653 NOT NULL,
 e FLOAT DEFAULT 2.71828182 NOT NULL,
 phi FLOAT DEFAULT 1.6180339887 NOT NULL);

To initialize the row, execute this statement:

INSERT INTO Constants DEFAULT VALUES;

The lock column ensures that there is only one row, and the DEFAULT clauses load the initial values. These defaults can include the current user and current timestamp, as well as numeric and character values.

Another version of this idea, one that does not allow for any updates, is a VIEW defined with a table constructor:

CREATE VIEW Constants (pi, e, phi)
AS VALUES (3.141592653, 2.71828182, 1.6180339887);

The next step is to put in a formula for each constant, so that it can be computed on any platform to which this DDL is moved, using the local math library and hardware precision.
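A computed version of the same VIEW might look like this (a sketch, assuming the platform's math library supplies ATAN(), EXP(), and SQRT(); these are common vendor functions, but check your product):

-- a sketch: derives each constant from a formula instead of a literal
CREATE VIEW Constants (pi, e, phi)
AS VALUES (4.0 * ATAN(1.0),           -- pi = 4 * arctan(1)
           EXP(1.0),                  -- e = exp(1)
           (1.0 + SQRT(5.0)) / 2.0);  -- phi = (1 + sqrt(5)) / 2
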
CHAPTER 23
Statistics in SQL

SQL is not a statistical programming language. However, there are some tricks that will let you do simple descriptive statistics. Many vendors include other descriptive statistics in addition to the required ones, and other sections of this book give portable queries for computing some of the more common statistics. Before using any of these queries, you should check whether they already exist in your SQL product. Built-in functions will run far faster than these queries, so you should use them if portability is not vital. The most common extensions are the median, the mode, the standard deviation, and the variance.

If you need to do a detailed statistical analysis, you can extract data with SQL and pass it along to a statistical programming language, such as SAS or SPSS. However, you can build a lot of standard descriptive statistics with what you do have.

First, the basic analysis of a single column can start with this VIEW or query to get absolute and cumulative frequencies and percentages:

WITH F2 (x, abs_freq, abs_perc)
AS (SELECT F1.x, COUNT(F1.x),
           CAST(ROUND(100.0 * COUNT(F1.x)
                      / (SELECT COUNT(*) FROM Foobar), 0)
                AS INTEGER)
      FROM Foobar AS F1
     GROUP BY F1.x)
SELECT F2.x, F2.abs_freq,
       (SELECT SUM(F3.abs_freq)
          FROM F2 AS F3
         WHERE F3.x <= F2.x) AS cum_freq,
       F2.abs_perc,
       (SELECT SUM(F3.abs_perc)
          FROM F2 AS F3
         WHERE F3.x <= F2.x) AS cum_perc
  FROM F2;

23.1 The Mode

The mode is the most frequently occurring value in a set. If there are two such values in a set, statisticians call it a bimodal distribution; three such values make it trimodal; and so forth. Most SQL implementations do not have a mode function, since it is easy to calculate. A simple frequency table can be written as a single query in SQL-92. This version is from Shepard Towindo, and it will handle multiple modes.

SELECT salary, COUNT(*) AS frequency
  FROM Payroll
 GROUP BY salary
HAVING COUNT(*) >= ALL (SELECT COUNT(*)
                          FROM Payroll
                         GROUP BY salary);

As an exercise, here is a version that does not use a GROUP BY clause to compute the mode. However, the execution time might be longer than you would like.

SELECT DISTINCT salary
  FROM Personnel AS P1
 WHERE NOT EXISTS
       (SELECT *
          FROM Personnel AS P2
         WHERE (SELECT COUNT(*)
                  FROM Personnel AS P3
                 WHERE P1.salary = P3.salary)
             < (SELECT COUNT(*)
                  FROM Personnel AS P4
                 WHERE P2.salary = P4.salary));

For a fuller picture, you can allow for a 5% difference among frequencies that are near the mode, but would otherwise not technically qualify:

SELECT AVG(salary) AS mode
  FROM Payroll
 GROUP BY salary
HAVING COUNT(*) >= ALL (SELECT COUNT(*) * 0.95
                          FROM Payroll
                         GROUP BY salary);

The mode is a weak descriptive statistic, because it can be changed by small amounts of additional data. For example, if we have 100,000 cases where the value of the part_color variable is 'red' and 99,999 cases where the value is 'green', the mode is 'red'. However, when two more 'green' cases are added to the set, the mode switches to 'green'.

A better idea is to allow for some variation (k) in the values. In general, the best way to compute k is probably as a percentage of the total number of occurrences, though knowledge of the actual situation could change this. For k = 2%, the query would look like this:

SELECT var_col AS mode, occurs
  FROM ModeFinder
 WHERE occurs BETWEEN (SELECT MAX(occurs) - (0.02 * SUM(occurs))
                         FROM ModeFinder)
                  AND (SELECT MAX(occurs) + (0.02 * SUM(occurs))
                         FROM ModeFinder);
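The ModeFinder table this query reads from is defined earlier in the chapter, outside this excerpt. A minimal sketch of what it might be, assuming the Payroll example from the previous queries (the name and columns follow the query above; the body is an assumption):

-- a sketch: a frequency view over Payroll, one row per distinct salary
CREATE VIEW ModeFinder (var_col, occurs)
AS SELECT salary, COUNT(*)
     FROM Payroll
    GROUP BY salary;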
