Joe Celko's SQL for Smarties - Advanced SQL Programming P52

CHAPTER 22: AUXILIARY TABLES

22.1 The Sequence Table

SELECT seq, MOD(((seq + (:n - 1)) / :n), :n)
  FROM Sequence;

As an example, consider the following problem, in which we want to display output with what is called “snaking” in a report. Each department ID has several job descriptions, and we want to see them in cycles of four (n = 4); when a department has more than four job descriptions, we want to start a new row with an incremented position for each subset of four or fewer job descriptions.

CREATE TABLE Company
(dept_nbr INTEGER NOT NULL,
 job_nbr INTEGER NOT NULL, -- sequence within department
 job_descr CHAR(6) NOT NULL,
 PRIMARY KEY (dept_nbr, job_nbr));

INSERT INTO Company
VALUES (1, 1, 'desc1'), (1, 2, 'desc2'), (1, 3, 'desc3');

INSERT INTO Company
VALUES (2, 1, 'desc4'), (2, 2, 'desc5'), (2, 3, 'desc6'),
       (2, 4, 'desc7'), (2, 5, 'desc8'), (2, 6, 'desc9');

INSERT INTO Company
VALUES (3, 1, 'desc10'), (3, 2, 'desc11'), (3, 3, 'desc12');

I am going to use a VIEW rather than a derived table to make the logic in the intermediate step easier to see.

CREATE VIEW Foo2 (dept_nbr, row_grp, d1, d2, d3, d4)
AS
SELECT dept_nbr,
       MOD((job_nbr + 3) / 4, 4),
       MAX(CASE WHEN MOD(job_nbr, 4) = 1
                THEN job_descr ELSE ' ' END) AS d1,
       MAX(CASE WHEN MOD(job_nbr, 4) = 2
                THEN job_descr ELSE ' ' END) AS d2,
       MAX(CASE WHEN MOD(job_nbr, 4) = 3
                THEN job_descr ELSE ' ' END) AS d3,
       MAX(CASE WHEN MOD(job_nbr, 4) = 0
                THEN job_descr ELSE ' ' END) AS d4
  FROM Company AS F1
 GROUP BY dept_nbr, job_nbr;

SELECT dept_nbr, row_grp,
       MAX(d1) AS d1, MAX(d2) AS d2, MAX(d3) AS d3, MAX(d4) AS d4
  FROM Foo2
 GROUP BY dept_nbr, row_grp
 ORDER BY dept_nbr, row_grp;

Results
dept_nbr  row_grp  d1      d2      d3      d4
===============================================
1         1        desc1   desc2   desc3
2         1        desc4   desc5   desc6   desc7
2         2        desc8   desc9
3         1        desc10  desc11  desc12

This is bad coding practice. Display is a function of the front end, and should not be done in the database.
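The snaking query above can be tried end to end with a short script. What follows is a minimal sketch, not the author's code: it assumes SQLite (through Python's standard sqlite3 module) as a stand-in for a Standard SQL engine, writes MOD(x, y) as x % y because that is the SQLite spelling, relies on SQLite doing integer division on integer operands, and folds the VIEW into a single grouped query for brevity. The names conn, snake_sql, and rows are illustrative only.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for the SQL engine.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Company
(dept_nbr INTEGER NOT NULL,
 job_nbr INTEGER NOT NULL,      -- sequence within department
 job_descr CHAR(6) NOT NULL,
 PRIMARY KEY (dept_nbr, job_nbr));

INSERT INTO Company VALUES
 (1, 1, 'desc1'), (1, 2, 'desc2'), (1, 3, 'desc3'),
 (2, 1, 'desc4'), (2, 2, 'desc5'), (2, 3, 'desc6'),
 (2, 4, 'desc7'), (2, 5, 'desc8'), (2, 6, 'desc9'),
 (3, 1, 'desc10'), (3, 2, 'desc11'), (3, 3, 'desc12');
""")

# row_grp picks the output row; job_nbr % 4 picks the column (1, 2, 3, 0).
snake_sql = """
SELECT dept_nbr,
       ((job_nbr + 3) / 4) % 4 AS row_grp,
       MAX(CASE WHEN job_nbr % 4 = 1 THEN job_descr ELSE ' ' END) AS d1,
       MAX(CASE WHEN job_nbr % 4 = 2 THEN job_descr ELSE ' ' END) AS d2,
       MAX(CASE WHEN job_nbr % 4 = 3 THEN job_descr ELSE ' ' END) AS d3,
       MAX(CASE WHEN job_nbr % 4 = 0 THEN job_descr ELSE ' ' END) AS d4
  FROM Company
 GROUP BY dept_nbr, ((job_nbr + 3) / 4) % 4
 ORDER BY dept_nbr, row_grp
"""

rows = conn.execute(snake_sql).fetchall()
for row in rows:
    print(row)
```

Empty cells come back as a single blank, which is the ELSE value of each CASE expression. Note that the outer modulus makes row_grp wrap around to 0 once a department reaches job numbers 13 through 16, so the formula as given only numbers rows cleanly for small groups.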
22.1.3 Replacing an Iterative Loop

While this is not recommended as a technique, and it will vary from SQL dialect to dialect, replacing an iterative loop is a good exercise in learning to think in sets. You are given a quoted string made up of integers separated by commas, and your goal is to break each of the integers out as a row in a table.

The obvious approach is to write procedural code that will loop over the input string and cut off all characters from the beginning up to, but not including, the first comma, cast the substring as an integer, and then iterate through the rest of the string.

CREATE PROCEDURE ParseList (IN inputstring VARCHAR(1000))
LANGUAGE SQL
BEGIN
DECLARE i INTEGER; -- iteration control variable
DECLARE outputstring VARCHAR(1000);
-- add sentinel comma to end of input string
SET inputstring = TRIM (BOTH ' ' FROM inputstring) || ',';
WHILE CHAR_LENGTH(inputstring) > 0
DO SET i = 1; -- restart the scan at the head of the remaining string
   WHILE SUBSTRING(inputstring FROM i FOR 1) <> ','
   DO SET i = i + 1;
   END WHILE;
   SET outputstring = SUBSTRING(inputstring FROM 1 FOR i - 1);
   INSERT INTO Outputs VALUES (CAST (outputstring AS INTEGER));
   SET inputstring = SUBSTRING(inputstring FROM i + 1);
END WHILE;
END;

Alternatively, you can do this with an auxiliary table of sequential numbers and this strange-looking query written in Core SQL-99.

CREATE PROCEDURE ParseList (IN inputstring VARCHAR(1000))
LANGUAGE SQL
INSERT INTO ParmList (parmeter_position, param)
SELECT S1.i,
       CAST (SUBSTRING ((',' || inputstring || ',')
                        FROM (S1.i + 1)
                        FOR (S2.i - S1.i - 1))
             AS INTEGER)
  FROM Sequence AS S1, Sequence AS S2
 WHERE SUBSTRING((',' || inputstring || ',') FROM S1.i FOR 1) = ','
   AND SUBSTRING((',' || inputstring || ',') FROM S2.i FOR 1) = ','
   AND S2.i = (SELECT MIN(S3.i)
                 FROM Sequence AS S3
                WHERE S1.i < S3.i
                  AND SUBSTRING((',' || inputstring || ',') FROM S3.i FOR 1) = ',')
   -- the padded string is CHAR_LENGTH(inputstring) + 2 characters long
   AND S1.i < CHAR_LENGTH(inputstring) + 2
   AND S2.i <= CHAR_LENGTH(inputstring) + 2;

The trick here is to concatenate commas on the left and right sides of the input string.
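The set-based version can be exercised outside a stored procedure. The following is a minimal sketch under stated assumptions rather than the book's procedure: it uses SQLite through Python's sqlite3 module, so SUBSTRING(x FROM a FOR b) becomes substr(x, a, b); the comma padding is done once in Python and passed in as the bound parameter :padded; and the Sequence table is loaded with the numbers 1 through 100, an arbitrary size chosen for the demonstration. The helper name parse_list is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sequence (i INTEGER NOT NULL PRIMARY KEY)")
# Assumption: 100 sequence numbers are enough for the strings tested here.
conn.executemany("INSERT INTO Sequence VALUES (?)",
                 [(n,) for n in range(1, 101)])

# S1.i and S2.i are the positions of two adjacent commas in the padded
# string; the integer is whatever sits between them.
PARSE_SQL = """
SELECT S1.i,
       CAST(substr(:padded, S1.i + 1, S2.i - S1.i - 1) AS INTEGER)
  FROM Sequence AS S1, Sequence AS S2
 WHERE substr(:padded, S1.i, 1) = ','
   AND substr(:padded, S2.i, 1) = ','
   AND S2.i = (SELECT MIN(S3.i)
                 FROM Sequence AS S3
                WHERE S1.i < S3.i
                  AND substr(:padded, S3.i, 1) = ',')
 ORDER BY S1.i
"""

def parse_list(input_string):
    """Split a comma-separated string of integers using the Sequence table."""
    # Strip blanks, then add a comma on each side of the string.
    padded = "," + input_string.replace(" ", "") + ","
    if len(padded) > 100:
        raise ValueError("Sequence table is too short for this string")
    return [value for _, value in conn.execute(PARSE_SQL, {"padded": padded})]

print(parse_list("12, 345, 6"))
```

Because substr() returns an empty string past the end of its argument, sequence numbers beyond the padded length never match a comma, which is why this sketch can leave out the two length predicates.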
To be honest, you would probably want to trim blanks and perhaps do other tests on the string, such as checking that LOWER(:instring) = UPPER(:instring) to rule out alphabetic characters, and so forth. That edited result string would be kept in a local variable and used in the INSERT INTO statement.

Each integer substring is located between the i-th and (i+1)-th commas. In effect, the sequence table replaces the loop counter. The sequence table has to have enough numbers to cover the entire string, but unless you really like to type in long parameter lists, this should not be a problem. The last two predicates are to avoid a Cartesian product with the sequence table.

22.2 Lookup Auxiliary Tables

In the old days, when engineers used slide rules, other people went to the back of their math and financial books to use printed tables of functions. Here you could find trigonometry, or compound interest, or statistical functions. Today, you would more likely calculate the function, because computing power is so cheap. Pocket calculators that sold for hundreds of dollars in the 1960s are now on spikes in the checkout line at office supply stores.

In the days of keypunch data entry, there would be loose-leaf notebooks of the encoding schemes to use sitting next to the incoming paper forms. Today, you would more likely see a WIMP (Windows, Icons, Menus, and Pulldowns) interface.

While the physical mechanisms have changed, the idea of building a table (in the nonrelational sense) is still valid. An auxiliary table holds a static or relatively static set of data. The users do not change the data. Updating one of these tables is a job for the DBA or the data repository administrator, if your shop is that sophisticated. One of the problems with even a simple lookup table change was that the existing data often had to be changed to the new encoding scheme, and this required administrative privileges.
The primary key of an auxiliary table is never an identifier; an identifier is unique in the schema and refers to one entity anywhere it appears. Lookup tables work with values and are not entities by definition. Monstrosities like “value_id” are absurd.

Here is a short list of postfixes that can be used as the name of the key column in auxiliary tables. There is a more complete list of postfixes in my book on SQL Programming Style (ISBN: 0-12-088797-5).

• “_nbr” or “_num”: This is a tag number, a string of digits that names something. Do not use “_no” because it looks like the Boolean yes/no value. I prefer “nbr” to “num,” since it is used as a common abbreviation in several European languages.

• “_name” or “_nm”: This is an alphabetic name and it explains itself. It is also called a nominal scale.

• “_code” or “_cd”: A code is a standard maintained by a trusted source, usually outside of the enterprise. For example, the ZIP code is maintained by the United States Postal Service. A code is well understood in its context, so you might not have to translate it for humans.

• “_cat”: Category is an encoding with an external source that has very distinct groups of entities. There should be strong formal criteria for establishing the category. The classification of Kingdom in biology is an example.

• “_class”: This is an internal encoding without an external source that reflects a subclassification of the entity. There should be strong formal criteria for the classification. The classification of plants in biology is an example.

• “_type”: This is an encoding that has a common meaning both internally and externally. A type is usually less formal than a class and might overlap. For example, a driver’s license might be motorcycle, automobile, taxi, truck, and so forth.

The differences among type, class, and category are an increasing strength of the algorithm for assigning the type, class, or category.
A category is very distinct; you will not often have to guess if something is “animal, vegetable, or mineral” to put it in one of those categories. A class is a set of things that have some commonality; there are rules for classifying an animal as a mammal or a reptile. You may have some cases where it is harder to apply the rules, such as the egg-laying mammal in Australia, but the exceptions tend to become their own classification (monotremes, in this example). A type is the weakest of the three, and it might call for a judgment. For example, in some states a three-wheeled motorcycle is licensed as a motorcycle. In other states, it is licensed as an automobile. And in some states, it is licensed as an automobile only if it has a reverse gear. The three terms are often mixed in actual usage. Stick with the industry standard, even if it violates the definitions given above.

• “_status”: An internal encoding that reflects a state of being, which can be the result of many factors. For example, “credit_status” might be computed from several sources.

• “_addr” or “_loc”: An address or location for an entity. There can be a subtle difference between an address and a location.

• “_date” or “_dt”: This represents the date or temporal dimension. More specifically, it is the date of something: employment, birth, termination, and so forth. There is no such column name as just a date by itself.

22.2.1 Simple Translation Auxiliary Tables

The most common form of lookup has two columns: one for the value to be looked up and one for the translation of that value into something the user needs. A simple example would be the two-letter ISO-3166 country codes in a table like this:

CREATE TABLE CountryCodes
(country_code CHAR(2) NOT NULL PRIMARY KEY, -- ISO-3166
 country_name VARCHAR(20) NOT NULL);

You can add a unique constraint on the descriptive column, but most programmers do not bother, because these tables do not change much.
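The translation lookup and the optional unique constraint just described can be sketched as follows. This is only an illustration under assumptions: SQLite through Python's sqlite3 module stands in for the database, the Customers table and its rows are hypothetical, and 'XX' is a made-up code used solely to provoke the constraint.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE CountryCodes
(country_code CHAR(2) NOT NULL PRIMARY KEY,   -- ISO-3166
 country_name VARCHAR(20) NOT NULL UNIQUE);   -- the optional constraint

-- Hypothetical table that stores only the code, never the name.
CREATE TABLE Customers
(customer_name VARCHAR(20) NOT NULL PRIMARY KEY,
 country_code CHAR(2) NOT NULL
   REFERENCES CountryCodes (country_code));

INSERT INTO CountryCodes VALUES
 ('US', 'United States'), ('FR', 'France'), ('JP', 'Japan');
INSERT INTO Customers VALUES ('Acme', 'US'), ('Dupont', 'FR');
""")

# The join translates each stored code into the text the user needs.
rows = conn.execute("""
SELECT C.customer_name, CC.country_name
  FROM Customers AS C
  JOIN CountryCodes AS CC
    ON CC.country_code = C.country_code
 ORDER BY C.customer_name
""").fetchall()

# With the UNIQUE constraint, a second code for the same name is rejected.
try:
    conn.execute("INSERT INTO CountryCodes VALUES ('XX', 'France')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(rows, duplicate_rejected)
```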
When they do change, it is done with data provided by a trusted source. This makes OLTP database programmers a bit uneasy, but data warehouse database programmers understand it.

22.2.2 Multiple Translation Auxiliary Tables

While we want the encoding value to stay the same, we often need to have multiple translations. There can be a short description, a long description, or just a different one, depending on who is looking at the data. For example, consider displaying error messages in various languages in a single table:

CREATE TABLE ErrorMessages
(err_msg_code CHAR(5) NOT NULL PRIMARY KEY,
 english_err_msg CHAR(25) NOT NULL,
 french_err_msg NCHAR(25) NOT NULL,
 esperanto NCHAR(25) NOT NULL);

Yes, adding a new language does require a structure change. However, since the data is static, the convenience of having all the related translations in one place is probably worth it. This inherently forces you to have all the languages for each error code, while a strict First Normal Form (1NF) table does not.

Your first thought is that an application using this table would be full of code such as:

SELECT CASE :my_language
       WHEN 'English' THEN english_err_msg
       WHEN 'French' THEN french_err_msg
       END AS err_msg
  FROM ErrorMessages
 WHERE err_msg_code = '42';

This is not usually the case. You have another table that finds the language preferences of the CURRENT_USER and presents a VIEW to him in the language he desires. You don’t invent or add languages very often. However, I do know of one product that was adding Klingon to its error messages. Seriously, it was for a demo at a trade show to show off the internationalization features. (“Unknown error = Die in ignorance!!” It was a user-surly interface instead of user-friendly.)

22.2.3 Multiple Parameter Auxiliary Tables

This type of auxiliary table has two or more parameters that it uses to seek a value.
The classic example from college freshman statistics courses is the Student’s t-Distribution for small samples. The value of (r) is the size of the sample minus one, and the percentages are the confidence levels. Loosely speaking, the Student’s t-Distribution is the best guess at the population distribution that we can make without knowing the standard deviation with a certain level of confidence.

r     90%      95%      97.5%    99.5%
==========================================
1     3.07766  6.31371  12.7062  63.65600
2     1.88562  2.91999  4.30265  9.92482
3     1.63774  2.35336  3.18243  5.84089
4     1.53321  2.13185  2.77644  4.60393
5     1.47588  2.01505  2.57058  4.03212
10    1.37218  1.81246  2.22814  3.16922
30    1.31042  1.69726  2.04227  2.74999
100   1.29007  1.66023  1.98397  2.62589
∞     1.28156  1.64487  1.95999  2.57584

William Gosset created this statistic in 1908. His employer, Guinness Breweries, required him to publish under a pseudonym, so he chose “Student” and that name stuck.

22.2.4 Range Auxiliary Tables

In a range auxiliary table, there is one parameter, but it must fall inside a range of values. The most common example would be reporting periods or ranges. There is no rule that prevents these ranges from overlapping. For example, “Swimsuit Season” and “BBQ Grill Sale” might have a large number of days in common at a department store. However, it is usually a good idea not to have disjoint ranges.

CREATE TABLE ReportPeriods
(period_name CHAR(15) NOT NULL,
 start_date DATE NOT NULL,
 end_date DATE NOT NULL,
 CHECK (start_date < end_date),
 PRIMARY KEY (start_date, end_date));

The searching is done with a BETWEEN predicate. A NULL can be useful as a marker for an open-ended range. The CHECK() constraint is not needed because of the static nature of the data, but it gives the optimizer extra information about the two columns and might help improve performance.

Consider a table for grades in a school.
CREATE TABLE LetterGrades
(letter_grade CHAR(1) NOT NULL PRIMARY KEY,
 low_score DECIMAL(6,3) NOT NULL,
 high_score DECIMAL(6,3));

INSERT INTO LetterGrades
VALUES ('F', 0.000, 60.000),
       ('D', 60.999, 70.000),
       ('C', 70.999, 80.000),
       ('B', 80.999, 90.000),
       ('A', 90.999, NULL);

If we had made the last range ('A', 90.999, 100.000), then a student who did extra work and got a total score over 100.000 would not have gotten a grade. The alternatives are to use a dummy value, such as ('A', 90.999, 999.999), or to use a NULL and add the predicate:

SELECT ...
  FROM ...
 WHERE Exams.score
       BETWEEN LetterGrades.low_score
           AND COALESCE (LetterGrades.high_score, Exams.score);

The choice of using a dummy value or a NULL will depend on the nature of the data.

22.2.5 Hierarchical Auxiliary Tables

In a hierarchical auxiliary table, there is one parameter, but it must fall inside one or more ranges of values, and those ranges must be nested inside each other. We want to get an entire path of categories back as a result.

A common example would be the Dewey Decimal Classification system, which we might encode as:

CREATE TABLE DeweyDecimalClassification
(category_name CHAR(35) NOT NULL,
 low_dewey INTEGER NOT NULL,
 high_dewey INTEGER NOT NULL,
 CHECK (low_dewey <= high_dewey),
 PRIMARY KEY (low_dewey, high_dewey));

INSERT INTO DeweyDecimalClassification
VALUES ('Natural Sciences & Mathematics', 500, 599),
       ('Mathematics', 510, 519),
       ('General Topics', 511, 511),
       ('Algebra & Number Theory', 512, 512),
       ...,
       ('Probabilities & Applied Mathematics', 519, 519);

Thus, a search on 511 returns three rows in the lookup table. The leaf nodes of the hierarchy always have (low_dewey = high_dewey), and the relative nesting level can be determined by (high_dewey - low_dewey) or by the range values themselves.
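Both range searches described above, the open-ended grade range and the nested Dewey ranges, can be checked with a short script. This is a minimal sketch under assumptions: SQLite through Python's sqlite3 module stands in for the database, the score or Dewey number is passed as a bound parameter instead of coming from an Exams table, only the Dewey rows actually printed in the text are loaded, and the helper names letter_grade and dewey_path are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE LetterGrades
(letter_grade CHAR(1) NOT NULL PRIMARY KEY,
 low_score DECIMAL(6,3) NOT NULL,
 high_score DECIMAL(6,3));            -- NULL marks an open-ended range

INSERT INTO LetterGrades VALUES
 ('F', 0.000, 60.000), ('D', 60.999, 70.000), ('C', 70.999, 80.000),
 ('B', 80.999, 90.000), ('A', 90.999, NULL);

CREATE TABLE DeweyDecimalClassification
(category_name CHAR(35) NOT NULL,
 low_dewey INTEGER NOT NULL,
 high_dewey INTEGER NOT NULL,
 CHECK (low_dewey <= high_dewey),
 PRIMARY KEY (low_dewey, high_dewey));

INSERT INTO DeweyDecimalClassification VALUES
 ('Natural Sciences & Mathematics', 500, 599),
 ('Mathematics', 510, 519),
 ('General Topics', 511, 511),
 ('Algebra & Number Theory', 512, 512),
 ('Probabilities & Applied Mathematics', 519, 519);
""")

def letter_grade(score):
    """Return the grade whose range holds the score, or None if none does."""
    row = conn.execute("""
        SELECT letter_grade
          FROM LetterGrades
         WHERE :score BETWEEN low_score
                          AND COALESCE(high_score, :score)
    """, {"score": score}).fetchone()
    return row[0] if row else None

def dewey_path(dewey):
    """Return every enclosing category, widest range first."""
    return [name for (name,) in conn.execute("""
        SELECT category_name
          FROM DeweyDecimalClassification
         WHERE :dewey BETWEEN low_dewey AND high_dewey
         ORDER BY (high_dewey - low_dewey) DESC
    """, {"dewey": dewey})]

print(letter_grade(105.5), letter_grade(85.0), dewey_path(511))
```

Ordering by (high_dewey - low_dewey) puts the broadest category first, so the result reads as a path from root to leaf. With the sample grade rows exactly as printed, a score such as 60.5 falls between one row's high_score and the next row's low_score and matches nothing, which is worth checking before loading a real table.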
You can add constraints to prevent overlapping ranges and other things, but these constraints are often left off because the table is “read-only” and the constraints can be verified at load time. This is not a good idea, because the CHECK() constraints and PRIMARY KEY declaration can pass along information to the optimizer to improve performance.

22.2.6 One True Lookup Table

I think that Paul Keister was the first person to coin the term OTLT (One True Lookup Table) for a common SQL programming technique popular with newbies. Later, D. C. Peterson called it a MUCK (Massively Unified Code-Key) table. The technique crops up time and time again, but I’ll give Keister credit as the first guy to give it a name.

Simply put, the idea is to have one table to do all of the code look-ups in the schema. It usually looks like this:

CREATE TABLE Lookups
(code_type CHAR(10) NOT NULL,
 code_value VARCHAR(255) NOT NULL,
 code_description VARCHAR(255) NOT NULL,
 PRIMARY KEY (code_value, code_type));

So if we have Dewey Decimal Classification (library codes), ICD (International Classification of Diseases), and two-letter ISO-3166 country codes in the schema, we have them all in one honking big table.

Let’s start with the problems in the DDL and then look at the awful queries you have to write (or hide in VIEWs). So we need to go back to the original DDL and add a CHECK() constraint on the code_type column. (Otherwise, we might “invent” a new encoding system by typographical error.)

The Dewey Decimal and ICD codes are numeric, and the ISO-3166 is alphabetic. Oops, we need another CHECK constraint that will look at the code_type and make sure that the string is in the right format. Now the table looks something like this, if anyone attempted to do it right, which is not usually the case:

CREATE TABLE Lookups
(code_type CHAR(10) NOT NULL
   CHECK (code_type IN ('DDC', 'ICD', 'ISO3166', ...)),
 code_value VARCHAR(255) NOT NULL,
 ...

Date posted: 06/07/2014, 09:20
