CHECK (CASE WHEN code_type = 'DDC'
             AND code_value SIMILAR TO '[0-9][0-9][0-9].[0-9][0-9][0-9]'
            THEN 1
            WHEN code_type = 'ICD'
             AND code_value SIMILAR TO '[0-9][0-9][0-9].[0-9][0-9][0-9]'
            THEN 1
            WHEN code_type = 'ISO3166'
             AND code_value SIMILAR TO '[A-Z][A-Z]'
            THEN 1
            ELSE 0 END = 1),
 code_description VARCHAR(255) NOT NULL,
 PRIMARY KEY (code_value, code_type));

Since the typical application database can have dozens and dozens of codes in it, you just keep extending this pattern for as long as required. Not very pretty, is it? That is why most OTLT programmers do not bother with it, and thus destroy data integrity.

The next thing you notice about this table is that the columns are pretty wide VARCHAR(n), or even worse, that they use NVARCHAR(n). The value of (n) is most often the largest one allowed in that particular SQL product. Since you have no idea what is going to be shoved into the table, there is no way to predict and design with a safe, reasonable maximum size. The size constraint has to be put into the WHEN clause of that second CHECK() constraint, between code_type and code_value.

These large sizes tend to invite bad data. You give someone a VARCHAR(n) column, and you eventually get a string with a lot of white space and a small odd character sitting at the end of it. You give someone an NVARCHAR(255) column and eventually it will get a Buddhist sutra in Chinese Unicode. I am sure of this, because I load the Diamond or Heart Sutra when I get called to evaluate a database.

If you make an error in the code_type or code_description among codes with the same structure, it might not be detected. You can turn 500.000 from "Natural Sciences and Mathematics" in Dewey Decimal codes into "Coal Workers Pneumoconiosis" in ICD, and vice versa. This can be really difficult to find when one of the similarly structured schemes has unused codes in it.

Now let's consider the problems with actually using the OTLT in the DML. It is always necessary to add the code_type as well as the value that you are trying to look up.

SELECT P1.ssn, P1.lastname, ..., L1.code_description
  FROM Lookups AS L1, Personnel AS P1
 WHERE L1.code_type = 'ICD'
   AND L1.code_value = P1.sickness
   AND ...;

In this sample query, I need to know the code_type of the Personnel table sickness column and of every other encoded column in the table. If you get a code_type wrong, you can still get a result.

I also need to allow some overhead for type conversions. It would be much more natural to use DECIMAL(6,3) for Dewey Decimal codes instead of VARCHAR(n), so that is probably how it appears in the Personnel table. But why not use CHAR(7) for the code? If I had a separate table for each encoding scheme, then I would have used a FOREIGN KEY and matched the data types in the referenced and referencing tables. There is no definitive guide for data type choices in the OTLT approach.

When I go to execute a query, I have to pull in the entire lookup table, even if I use only one code. If one code is at the start of the physical storage and another is at the end, I can do a lot of paging. When I update the lookup table, I have to lock out everyone until I am finished. It is like having to carry an encyclopedia set with you when all you needed was a magazine article.

I am going to venture a guess that this idea came from OO programmers who think of it as some kind of polymorphism done in SQL. They say to themselves that a table is a class, which it is not, and therefore it ought to have polymorphic behaviors, which it does not. Maybe there are good reasons for the data-modeling principle that a well-designed table is a set of things of the same kind, instead of a pile of unrelated items.
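For contrast, here is a minimal sketch of the separate-table alternative mentioned above. The names (SicknessCodes, sickness_code) are hypothetical, but the point stands: each encoding scheme gets its natural data type, its own CHECK() pattern, and an ordinary FOREIGN KEY, so the code_type tag and the giant CASE expression disappear.

-- hypothetical single-purpose lookup table for ICD codes
CREATE TABLE SicknessCodes
(sickness_code CHAR(7) NOT NULL PRIMARY KEY
   CHECK (sickness_code SIMILAR TO '[0-9][0-9][0-9].[0-9][0-9][0-9]'),
 sickness_description VARCHAR(50) NOT NULL);

-- the referencing column matches the referenced data type exactly
CREATE TABLE Personnel
(ssn CHAR(9) NOT NULL PRIMARY KEY,
 lastname VARCHAR(35) NOT NULL,
 sickness CHAR(7) NOT NULL
   REFERENCES SicknessCodes (sickness_code));

The join back to the description is now a plain equi-join with no code_type predicate, and the optimizer works with a small table of one kind of thing.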
22.3 Auxiliary Function Tables

SQL is not a computational language like FORTRAN and the specialized math packages. It typically does not have the numerical analysis routines to compensate for floating-point rounding errors, or algebraic reductions in the optimizer. But it is good at joins.

Most auxiliary lookup tables are for simple decoding, but they can be used for more complex functions. Let's consider two financial calculations that you cannot do easily in SQL: the Net Present Value (NPV) and its related Internal Rate of Return (IRR). Let me stop and ask: how would you program the NPV and IRR in SQL? The answer posted in most newsgroup replies was to write a procedure directly from the equation in the vendor-specific 4GL language and then call it.

As a quick review, let's start with the net present value (NPV) calculation. Imagine that you just got an e-mail from some poor Nigerian civil servant who wants you to send him a small investment now on the promise that he will send you a series of payments over time from money he is stealing from a government bank account. Obviously, you would want the total of the cash flow to be at least equal to the initial investment, or the money is not worth lending. We can assume that you are making a little profit at the end of the investment. But is this a good investment? That is, if I took that cash flow and invested it at a given interest rate, what would the result be? That is called the net present value (NPV), and you will want to do at least as well as this value on your investment.

To make this more concrete, let's show a little code and data for your two investment options.

CREATE TABLE CashFlows
(project_id CHAR(15) NOT NULL,
 time_period INTEGER NOT NULL
   CHECK (time_period >= 0),
 amount DECIMAL(12,4) NOT NULL,
 PRIMARY KEY (project_id, time_period));

INSERT INTO CashFlows
VALUES ('Acme', 0, -1000.0000), ('Acme', 1, 500.0000),
       ('Acme', 2, 400.0000), ('Acme', 3, 200.0000),
       ('Acme', 4, 200.0000),
       ('Beta', 0, -1000.0000), ('Beta', 1, 100.0000),
       ('Beta', 2, 200.0000), ('Beta', 3, 200.0000),
       ('Beta', 4, 700.0000);

I invest $1,000 at the start of each project; the time period is zero and the amount is negative. Every year I get a different amount back on my investment, so that at the end of the fourth year, I have received a total of $1,300 from the Acme project, less my initial $1,000, for a profit of $300. Likewise, the Beta project returns a total of $1,200, for a profit of $200. On the raw totals, Acme looks like the slightly better investment, but the totals ignore when each payment arrives. Let's assume we can get a 10% return on an investment and that we put our cash flows into that investment. The Net Present Value function in pseudocode is:

FOR t FROM 0 TO n
DO SUM(a[t] / (1.00 + r)^t);
END FOR;

In this case, a[t] is the cash flow for time period (t), time period (t = 0) is the initial investment (it is always negative), and r is the interest rate. When we run them through the equation, we find that Acme has an NPV of $71.9896 and Beta is worth −$115.4293, so Acme is really the better project. We can get more out of the Acme cash flow than the Beta cash flow.
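Before building any auxiliary table, it is worth verifying those two figures. The NPV formula is a one-line aggregate query; this is just a sanity check at a fixed 10% rate, using the same POWER() function that appears in the VIEW later in this section:

SELECT project_id,
       SUM(amount / POWER(1.10, time_period)) AS npv
  FROM CashFlows
 GROUP BY project_id;

-- 'Acme'   71.9896
-- 'Beta' -115.4293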
22.3.1 Inverse Functions with Auxiliary Tables

The IRR depends on the NPV. It is the interest rate at which your investment would break even, as if the cash flows were invested back into the same project. Thus, if the project's IRR is better than the rate you can get elsewhere, this is a good investment. Let's build another table.

CREATE TABLE Rates
(rate DECIMAL(6,4) NOT NULL PRIMARY KEY);

Now let's populate it with some values. One trick for filling the Rates table is a CROSS JOIN that keeps the values inside a reasonable range.

CREATE TABLE Digits (digit DECIMAL(6,4) PRIMARY KEY);

INSERT INTO Digits
VALUES (0.0000), (0.0001), (0.0002), (0.0003), (0.0004),
       (0.0005), (0.0006), (0.0007), (0.0008), (0.0009);

INSERT INTO Rates (rate)
SELECT DISTINCT (D1.digit * 1000) + (D2.digit * 100)
              + (D3.digit * 10) + D4.digit
  FROM Digits AS D1, Digits AS D2, Digits AS D3, Digits AS D4
 WHERE ((D1.digit * 1000) + (D2.digit * 100)
      + (D3.digit * 10) + D4.digit)
       BETWEEN {{lower limit}} AND {{upper limit}}; -- pseudocode

DROP TABLE Digits;

We now have two choices. We can build a VIEW that uses the cash flow table, thus:

CREATE VIEW NPV_by_Rate (project_id, rate, npv)
AS SELECT CF.project_id, R1.rate,
          SUM(amount / POWER((1.00 + R1.rate), time_period))
     FROM CashFlows AS CF, Rates AS R1
    GROUP BY R1.rate, CF.project_id;

Alternately, we can set the amount in the formula to 1 and store the multiplier for each (rate, time_period) pair in another table:

INSERT INTO NPV_Multipliers (time_period, rate, npv_multiplier)
SELECT S.seq, R1.rate,
       SUM(1.00 / POWER((1.00 + R1.rate), S.seq))
  FROM Sequence AS S, Rates AS R1
 WHERE S.seq <= {{upper limit}} -- pseudocode
 GROUP BY S.seq, R1.rate;

The Sequence table contains the integers 1 to (n); it is a standard auxiliary table, used to avoid iteration. Assuming we use the VIEW, the IRR is now a single query:

SELECT 'Acme', rate AS irr, npv
  FROM NPV_by_Rate
 WHERE project_id = 'Acme'
   AND ABS(npv) = (SELECT MIN(ABS(npv))
                     FROM NPV_by_Rate
                    WHERE project_id = 'Acme');

In my sample data, I get an IRR of 13.99% at an NPV of −0.04965 for the Acme project. Assume you have hundreds of projects to consider; would you rather write one query or hundreds of procedure calls?

This Web site has a set of slides that deal with the use of interpolation to find the IRR: www.yorku.ca/adms3530/Interpolation.pdf. Using the method described there, we can write the interpolation for the Acme example as:

SELECT R1.rate + (R1.rate * (R1.npv / (R1.npv - R2.npv))) AS irr
  FROM NPV_by_Rate AS R1, NPV_by_Rate AS R2
 WHERE R1.project_id = 'Acme'
   AND R2.project_id = 'Acme'
   AND R1.rate = 0.1000
   AND R2.rate = 0.2100
   AND R1.npv > 0
   AND R2.npv < 0;

The important point is that the NPVs from R1 and R2 have to be on opposite sides of the zero point, so that you can do a linear interpolation between the two rates with which they are associated. The trade-off is speed for accuracy. The IRR function is slightly concave, not linear; that means that if you graph it, the curve buckles toward the origin. Picking good (R1.rate, R2.rate) pairs is important, but if you want to round off to the nearest whole percentage, you have a larger margin than you might think. The answer from the original table lookup method, 0.1399, rounds to 14%, as do all of the following interpolations:

R1       R2       IRR
==============================
0.1000   0.2100   0.140135
0.1000   0.2000   0.143537
0.0999   0.2000   0.143457
0.0999   0.1999   0.143492
0.0800   0.1700   0.135658
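If you really do have hundreds of projects, the same lookup can be done for all of them in one statement. Here is a sketch, my generalization of the single-project query above, which correlates the subquery on project_id:

SELECT N1.project_id, N1.rate AS irr, N1.npv
  FROM NPV_by_Rate AS N1
 WHERE ABS(N1.npv) = (SELECT MIN(ABS(N2.npv))
                        FROM NPV_by_Rate AS N2
                       WHERE N2.project_id = N1.project_id);

One pass over the VIEW returns the nearest-to-zero rate for every project at once.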
The advantages of using an auxiliary function table are:

1. All host programs will be using the same calculations.

2. The formula can be applied to hundreds or thousands of projects at one time, instead of just doing one project as you would with a spreadsheet or financial calculator.

Robert J. Hamilton (bobha@seanet.com) posted proprietary T-SQL functions for the NPV and IRR functions. The NPV function was straightforward, but he pointed out several problems with finding the IRR. By definition, IRR is the rate at which the NPV of the cash flows equals zero. When IRR is well behaved, the graph of NPV as a function of rate is a curve that crosses the x-axis once and only once. When IRR is not well behaved, the graph crosses the x-axis many times, which means the IRR is either multivalued or undefined.

At this point, we need to ask what the appropriate domain is for IRR. As it turns out, NPV is defined for all possible rates, both positive and negative, except at a rate of −100%, where NPV approaches an asymptote and the power function blows up. What does a negative rate mean when calculating NPV? What does it mean to have a negative IRR? Well, it depends on how you look at it. If you take a mathematical approach, a negative IRR is just another solution to the equation. If you take an economic approach, a negative IRR means you are losing money on the project. Perhaps if you live in a deflationary economy, then a negative cash flow might be profitable in terms of real money, but that is a very unusual situation, and we can dismiss negative IRRs as unreasonable.

This means that a table lookup approach to the IRR must have a very fine granularity and enough scope to cover a lot of situations for the general case. It also means that the table lookup is probably not the way to go. Expressing rates to 5 or 6 decimal places is common in home mortgage finance (i.e., APR 5.6725%), and this degree of precision using the set-based approach does not scale well. Moreover, this is exacerbated by the requirements of using IRR in hyperinflationary economies, where solutions of 200%, 300%, and higher are meaningful.
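The multiple-root case is easy to demonstrate with a cash flow whose sign changes more than once. This 'Gamma' project is a classic textbook example, not one of Mr. Hamilton's; add it to the earlier CashFlows table:

INSERT INTO CashFlows
VALUES ('Gamma', 0, -1000.0000),
       ('Gamma', 1,  2300.0000),
       ('Gamma', 2, -1320.0000);

-- NPV(r) = -1000 + 2300/(1+r) - 1320/(1+r)^2
-- is zero at both r = 0.10 and r = 0.20, so 'Gamma'
-- has two equally valid IRRs and no single answer.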
Here are Mr. Hamilton's functions written in SQL/PSM; one uses a straight-line algorithm, such as you find in Excel and other spreadsheets, and the other uses a bounding box algorithm. The bounding box algorithm has better domain integrity, but it can inadvertently "skip over" a solution when widening its search.

CREATE TABLE CashFlows
(t INTEGER NOT NULL
   CHECK (t >= 0),
 amount DECIMAL(12,4) NOT NULL);

CREATE TABLE Rates
(rate DECIMAL(7,5) NOT NULL);

CREATE TABLE Digits (digit DECIMAL(6,4));

INSERT INTO Digits
VALUES (0.0000), (0.0001), (0.0002), (0.0003), (0.0004),
       (0.0005), (0.0006), (0.0007), (0.0008), (0.0009);

INSERT INTO Rates
SELECT D1.digit * 1000 + D2.digit * 100 + D3.digit * 10 + D4.digit
  FROM Digits AS D1, Digits AS D2, Digits AS D3, Digits AS D4;

INSERT INTO Rates
SELECT rate - 1 FROM Rates WHERE rate >= 0;

INSERT INTO Rates
SELECT rate - 2 FROM Rates WHERE rate >= 0;

DROP TABLE Digits;

CREATE FUNCTION NPV (IN my_rate FLOAT)
RETURNS FLOAT
DETERMINISTIC
CONTAINS SQL
RETURN (CASE -- prevent divide by zero at rate = -100%
        WHEN ABS (1.0 + my_rate) >= 1.0e-5
        THEN (SELECT SUM (amount * POWER ((1.0 + my_rate), -t))
                FROM CashFlows)
        ELSE NULL END);

CREATE FUNCTION irr_bb (IN guess FLOAT)
RETURNS FLOAT
DETERMINISTIC
CONTAINS SQL
BEGIN
DECLARE maxtry INTEGER;
DECLARE x1 FLOAT;
DECLARE x2 FLOAT;
DECLARE f1 FLOAT;
DECLARE f2 FLOAT;
DECLARE x FLOAT;
DECLARE dx FLOAT;
DECLARE x_mid FLOAT;
DECLARE f_mid FLOAT;

-- initial bounding box around guess
SET x1 = guess - 0.005;
SET f1 = NPV (x1);
IF f1 IS NULL THEN RETURN (f1); END IF;
SET x2 = guess + 0.005;
SET f2 = NPV (x2);
IF f2 IS NULL THEN RETURN (f2); END IF;

-- expand bounding box to include a solution
SET maxtry = 50;
WHILE maxtry > 0 -- try until solution is bounded
  AND (SIGN(f1) * SIGN(f2)) <> -1
DO IF ABS (f1) < ABS (f2)
   THEN -- move lower bound
        SET x1 = x1 + 1.6 * (x1 - x2);
        SET f1 = NPV (x1);
        IF f1 IS NULL -- no irr
        THEN RETURN (f1);
        END IF;
   ELSE -- move upper bound
        SET x2 = x2 + 1.6 * (x2 - x1);
        SET f2 = NPV (x2);
        IF f2 IS NULL -- no irr
        THEN RETURN (f2);
        END IF;
   END IF;
   SET maxtry = maxtry - 1;
END WHILE;
IF (SIGN(f1) * SIGN(f2)) <> -1
THEN RETURN (CAST (NULL AS FLOAT));
END IF;

-- now find solution with binary search
SET x = CASE WHEN f1 < 0 THEN x1 ELSE x2 END;
SET dx = CASE WHEN f1 < 0 THEN (x2 - x1) ELSE (x1 - x2) END;
SET maxtry = 50;
WHILE maxtry > 0
DO SET dx = dx / 2.0; -- reduce steps by half
   SET x_mid = x + dx;
   SET f_mid = NPV (x_mid);
   IF f_mid IS NULL -- no irr
   THEN RETURN (f_mid);
   ELSE IF ABS (f_mid) < 1.0e-5 -- epsilon for problem
        THEN RETURN (x_mid); -- irr found
        END IF;
   END IF;
   IF f_mid < 0
   THEN SET x = x_mid;
   END IF;
   SET maxtry = maxtry - 1;
END WHILE;
RETURN (CAST (NULL AS FLOAT));
END;

If you prefer to compute the IRR as a straight line, you can use this function; the second half of the loop body is my sketch of the standard secant step, not Mr. Hamilton's exact code:

CREATE FUNCTION irr_sl (IN guess FLOAT)
RETURNS FLOAT
DETERMINISTIC
CONTAINS SQL
BEGIN
DECLARE maxtry INTEGER;
DECLARE x1 FLOAT;
DECLARE x2 FLOAT;
DECLARE f1 FLOAT;
DECLARE f2 FLOAT;
SET maxtry = 50; -- iterations
WHILE maxtry > 0
DO SET x1 = guess;
   SET f1 = NPV (x1);
   IF f1 IS NULL -- no irr
   THEN RETURN (f1);
   ELSE IF ABS (f1) < 1.0e-5 -- irr within epsilon range
        THEN RETURN (x1); -- irr found
        END IF;
   END IF;
   -- secant step (sketch): follow the straight line through
   -- (x1, f1) and (x2, f2) to where it crosses the x-axis
   SET x2 = x1 + 0.005;
   SET f2 = NPV (x2);
   IF f2 IS NULL -- no irr
   THEN RETURN (f2);
   END IF;
   IF f2 = f1 -- flat line; no crossing to follow
   THEN RETURN (CAST (NULL AS FLOAT));
   END IF;
   SET guess = x1 - f1 * ((x2 - x1) / (f2 - f1));
   SET maxtry = maxtry - 1;
END WHILE;
RETURN (CAST (NULL AS FLOAT));
END;
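As a usage sketch, assuming the functions above compile in your SQL/PSM dialect, load one project's flows into this simplified CashFlows table and call either function with a starting guess:

DELETE FROM CashFlows;
INSERT INTO CashFlows (t, amount) -- the Acme flows from earlier
VALUES (0, -1000.0000), (1, 500.0000), (2, 400.0000),
       (3, 200.0000), (4, 200.0000);

VALUES (irr_bb (0.10)); -- approximately 0.1399
VALUES (irr_sl (0.10)); -- should agree when the root is unique

Some products want SELECT irr_bb(0.10) FROM a one-row dummy table instead of the VALUES constructor. Either way, the procedural functions trade the bulk of the Rates table for a handful of NPV() evaluations per call.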