192 CHAPTER 6: NULLS: MISSING DATA IN SQL

4 5

Now insert a NULL and reexecute the same query:

    INSERT INTO Table1 (col1) VALUES (NULL);

    SELECT col1
      FROM Table2
     WHERE col1 NOT IN (SELECT col1 FROM Table1);

The result will be empty. This is counterintuitive, but correct. The NOT IN predicate is defined as:

    SELECT col1
      FROM Table2
     WHERE NOT (col1 IN (SELECT col1 FROM Table1));

The IN predicate is defined as:

    SELECT col1
      FROM Table2
     WHERE NOT (col1 = ANY (SELECT col1 FROM Table1));

This becomes:

    SELECT col1
      FROM Table2
     WHERE NOT ((col1 = 1)
                OR (col1 = 2)
                OR (col1 = 3)
                OR (col1 = 4)
                OR (col1 = 5)
                OR (col1 = NULL));

The last comparison, (col1 = NULL), is always UNKNOWN, so, applying DeMorgan’s laws, the query is really:

    SELECT col1
      FROM Table2
     WHERE ((col1 <> 1)
            AND (col1 <> 2)
            AND (col1 <> 3)
            AND (col1 <> 4)
            AND (col1 <> 5)
            AND UNKNOWN);

Look at the truth tables and you will see that this can never be TRUE. A conjunction that includes an UNKNOWN reduces to UNKNOWN, or to FALSE when one of the other comparisons is FALSE, and both FALSE and UNKNOWN are rejected in a search condition in a WHERE clause.

6.5.2 Standard SQL Solutions

SQL-92 solved some of the 3VL (three-valued logic) problems by adding a new predicate of the form:

    <search condition> IS [NOT] TRUE | FALSE | UNKNOWN

This predicate will let you map any combination of three-valued logic to two values. For example, ((age < 18) OR (gender = 'Female')) IS NOT FALSE will return TRUE if (age IS NULL) or (gender IS NULL), and the remaining condition does not matter, because an OR with an UNKNOWN operand can only be TRUE or UNKNOWN.

6.6 Math and NULLs

NULLs propagate when they appear in arithmetic expressions (+, −, *, /) and return NULL results. See Chapter 3 on numeric data types for more details.

6.7 Functions and NULLs

Most vendors propagate NULLs in the functions they offer as extensions of the standard ones required in SQL. For example, the cosine of a NULL will be NULL. There are two functions that convert between NULLs and values:

1. NULLIF (V1, V2) returns a NULL when the first parameter equals the second parameter.
   The function is equivalent to the following case specification:

       CASE WHEN (V1 = V2)
            THEN NULL
            ELSE V1 END

2. COALESCE (V1, V2, V3, ..., Vn) processes the list from left to right and returns the first parameter that is not NULL. If all the values are NULL, it returns a NULL.

6.8 NULLs and Host Languages

This book does not discuss using SQL statements embedded in any particular host language. For that information, you will need to pick up a book for your particular language. However, you should know how NULLs are handled when they have to be passed to a host program. No standard host language for which an embedding is defined supports NULLs, which is another good reason to avoid using them in your database schema.

Roughly speaking, the programmer mixes SQL statements bracketed by EXEC SQL and a language-specific terminator (the semicolon in Pascal and C, END-EXEC in COBOL, and so on) into the host program. This mixed-language program is run through an SQL preprocessor that converts the SQL into procedure calls the host language can compile; then the host program is compiled in the usual way.

There is an EXEC SQL BEGIN DECLARE SECTION, EXEC SQL END DECLARE SECTION pair that brackets declarations for the host parameter variables that will get values from the database via CURSORs. This is the “neutral territory,” where the host and the database pass information. SQL knows that it is dealing with a host variable, because these have a colon prefix added to them when they appear in an SQL statement.

A CURSOR is an SQL query statement that executes and creates a structure that looks like a sequential file. The records in the CURSOR are returned, one at a time, to the host variables declared in the BEGIN DECLARE section of the host program with the FETCH statement. This works around the impedance mismatch between record processing in the host language and SQL’s set orientation.
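The row-at-a-time pattern can be sketched outside embedded SQL as well. The example below is a minimal illustration, assuming Python's standard sqlite3 module as the host interface (a call-level interface, not the EXEC SQL preprocessor described above). The Parts table, its columns, and the function name are hypothetical. It fetches one row at a time and shows a NULL arriving in the host program as Python's None.

```python
import sqlite3

def fetch_rows_one_at_a_time():
    """Open a cursor and pull its rows into host variables one by one."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical table: part_wgt is allowed to be NULL.
    conn.execute(
        "CREATE TABLE Parts "
        "(part_nbr INTEGER NOT NULL PRIMARY KEY, part_wgt INTEGER)"
    )
    conn.executemany(
        "INSERT INTO Parts (part_nbr, part_wgt) VALUES (?, ?)",
        [(1, 12), (2, None), (3, 7)],
    )
    cur = conn.execute(
        "SELECT part_nbr, part_wgt FROM Parts ORDER BY part_nbr"
    )
    rows = []
    while True:
        row = cur.fetchone()      # plays the role of FETCH
        if row is None:           # no more rows in the cursor
            break
        rows.append(row)
    conn.close()
    return rows

rows = fetch_rows_one_at_a_time()
for part_nbr, part_wgt in rows:
    # The host program has to test for the missing value explicitly.
    print(part_nbr, "weight unknown" if part_wgt is None else part_wgt)
```

Python happens to have a None value that can stand in for a NULL; host languages without one need the separate indicator mechanism.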
NULLs are handled by declaring INDICATOR variables in the host language BEGIN DECLARE section, which are paired with the host variables. An INDICATOR is an exact numeric data type with a scale of zero; that is, some kind of integer in the host language.

The FETCH statement takes one row from the cursor, then converts each SQL data type into a host-language data type and puts that result into the appropriate host variable. If the SQL value was a NULL, the INDICATOR is set to minus one; if no indicator was specified, an exception condition is raised. As you can see, the host program must be sure to check the INDICATORs, because otherwise the value of the parameter will be garbage. If the parameter is passed to the host language without any problems, the INDICATOR is set to zero. If the value being passed to the host program is a non-NULL character string and has an indicator, the indicator is set to the length of the SQL string and can be used to detect string overflows or to set the length of the parameter.

Other SQL interfaces such as ODBC, JDBC, and so on have similar mechanisms for telling the host program about NULLs, even though they might not use cursors.

6.9 Design Advice for NULLs

It is a good idea to declare all your base tables with NOT NULL constraints on all columns whenever possible. NULLs confuse people who do not know SQL, and NULLs are expensive. NULLs are usually implemented with an extra bit somewhere in the row where the column appears, rather than in the column itself. They adversely affect storage requirements, indexing, and searching.

NULLs are not permitted in PRIMARY KEY columns. Think about what a PRIMARY KEY that was NULL (or partially NULL) would mean. A NULL in a key means that the data model does not know what makes the entities in that table distinct from one another. That in turn says that the DBMS cannot decide whether the PRIMARY KEY does or does not duplicate a key that is already in the table.
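The point about duplicate detection can be illustrated with a small sketch. It assumes Python's standard sqlite3 module; the KeyDemo table and the function name are hypothetical, and the column is deliberately declared without a key constraint so that the comparison itself is visible. Asking whether a candidate value already appears among the stored keys yields TRUE or FALSE for known values, but UNKNOWN when the candidate is NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table holding two existing key values.
conn.execute("CREATE TABLE KeyDemo (keycol INTEGER)")
conn.executemany("INSERT INTO KeyDemo (keycol) VALUES (?)", [(1,), (2,)])

def duplicates_existing_key(candidate):
    """Return 1 (TRUE), 0 (FALSE), or None (UNKNOWN)."""
    row = conn.execute(
        "SELECT ? IN (SELECT keycol FROM KeyDemo)", (candidate,)
    ).fetchone()
    return row[0]

print(duplicates_existing_key(1))     # known duplicate
print(duplicates_existing_key(3))     # known new value
print(duplicates_existing_key(None))  # NULL candidate: cannot be decided
```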
NULLs should be avoided in FOREIGN KEYs. SQL allows this “benefit of the doubt” relationship, but it can cause a loss of information in queries that involve joins. For example, given a part number code in Inventory that is referenced as a FOREIGN KEY by an Orders table, you will have problems getting a listing of the orders whose part number is NULL, because they drop out of an inner join with Inventory. This is a mandatory relationship; you cannot order a part that does not exist.

An example of an optional foreign key is a Personnel table having a foreign key to a ParoleOfficer table; obviously a NULL here means the person does not (currently) have a parole officer. The NULL can be avoided by forcing the separation of the foreign key into its own table, such that no row exists for a person who has no parole officer. However, this degree of normalization is not always possible, nor would it always be desirable to force the split. There is also the issue of what to return when a join of the two tables is required to show personnel information plus the parole officer, if any. Finally, there is the issue of whether, when multiple such splits have been made, the retrieval of consolidated information will result in extremely slow queries to produce all the joined data (and to substitute whatever indicator has been chosen to represent the “missing” data).

NULLs should not be allowed in encoding schemes that are known to be complete. For example, employees are people and people are either male or female. On the other hand, if you are recording the gender of lawful persons (humans, corporations, and other legal entities), you need the ISO sex codes, which use 0 = unknown, 1 = male, 2 = female, 9 = not applicable. No, you have not missed a new gender; code 9 is for legal persons, such as corporations. The use of all zeros and all nines for “Unknown” and “N/A” is quite common in numeric encoding schemes.
This convention is a leftover from the old punch card days, when a missing value was left as a field of blanks (i.e., no punches) that could be punched into the card later. Likewise, a field of all nines would sort to the end of the file, and it was easy to hold the “nine” key down when the keypunch machine was in numeric shift.

However, you have to use NULLs in date fields when a DEFAULT date does not make sense. For example, if you do not know someone’s birthdate, a default date does not make sense; if a warranty has no expiration date, then a NULL can act as an “eternity” symbol. Unfortunately, you often know relative times, but it is difficult to express them in a database. For example, a pay raise occurs some time after you have been hired, not before. A convict serving on death row should expect a release date resolved by an event: his termination by execution or by natural causes. This leads to extra columns to hold the status and to control the transition constraints.

There is a proprietary extension to date values in MySQL. If you know the year but not the month, you may enter '1949-00-00'. If you know the year and month, but not the day, you may enter '1949-09-00'. You cannot reliably use date arithmetic on these values, but they do help in some instances, such as sorting people’s birthdates or calculating their (approximate) age.

For people’s names, you are probably better off using a special dummy string for unknown values rather than the general NULL. In particular, you can build a list of 'John Doe #1', 'John Doe #2', and so forth to differentiate them; and you cannot do that with NULL.

Quantities have to use a NULL in some cases. There is a difference between an unknown quantity and a zero quantity; it is the difference between an empty gas tank and not having a car at all. Using negative numbers to represent missing quantities does not work, because it makes accurate calculations too complex.
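To see how a negative sentinel complicates calculations, consider the following minimal sketch. It assumes Python's standard sqlite3 module; the Stock table, its columns, and the choice of -1 as the sentinel are hypothetical and used only for illustration. The same three quantities are stored twice, once with a NULL for the unknown quantity and once with -1.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: the same data stored with a NULL and with a -1 sentinel.
conn.execute(
    "CREATE TABLE Stock "
    "(item_name TEXT NOT NULL PRIMARY KEY, "
    " qty_null INTEGER, "
    " qty_sentinel INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO Stock (item_name, qty_null, qty_sentinel) VALUES (?, ?, ?)",
    [("bolt", 10, 10), ("nut", 20, 20), ("washer", None, -1)],
)

# Aggregates skip the NULL, but treat -1 as a real quantity.
sum_null, avg_null, sum_sentinel, avg_sentinel = conn.execute(
    "SELECT SUM(qty_null), AVG(qty_null), "
    "       SUM(qty_sentinel), AVG(qty_sentinel) "
    "  FROM Stock"
).fetchone()

# Every calculation on the sentinel column has to undo the encoding first.
(avg_repaired,) = conn.execute(
    "SELECT AVG(NULLIF(qty_sentinel, -1)) FROM Stock"
).fetchone()

print(sum_null, avg_null)          # totals over the two known quantities
print(sum_sentinel, avg_sentinel)  # totals polluted by the sentinel
print(avg_repaired)                # matches avg_null once -1 is mapped to NULL
```

The NULLIF call in the last query is the extra work the sentinel imposes on every expression that touches the column.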
When programming languages had no DATE data types, an open-ended date could have been handled with a character string of '9999-99-99 23:59:59.999999' for “eternity” or “the end of time.” When 4GL products with a DATE data type came onto the market, programmers usually inserted the maximum possible date for “eternity.” But again, this will show up in calculations and in summary statistics. The best trick was to use two columns, one for the date and one for a flag. But this made for fairly complex code in the 4GL.

6.9.1 Avoiding NULLs from the Host Programs

You can avoid putting NULLs into the database from the host programs with some programming discipline.

1. Initialize in the host program: Initialize all the data elements and displays on the input screen of a client program before inserting data into the database. Exactly how you can make sure that all the programs use the same default values is another problem.

2. Use automatic defaults: The database is the final authority on the default values.

3. Deduce values: Infer the missing data from the given values. For example, patients reporting a pregnancy are female; patients reporting prostate cancer are male. This technique can also be used to limit choices to valid values for the user.

4. Track missing data: Data is tagged as missing, unknown, in error, out-of-date, or whatever other condition makes it missing. This will involve a companion column with special codes.

5. Determine impact of missing data on programming and reporting: Numeric columns with NULLs are a problem, because queries using aggregate functions can provide misleading results. Aggregate functions drop out the NULLs before doing the math, and the programmer has to trap the SQLSTATE code for this (the standard warning 01003, “null value eliminated in set function”) to make corrections.

6. Prevent missing data: Use a batch process to scan and validate data elements before they go into the database.
   In the early 2000s, there was a sudden concern for data quality as CEOs started going to jail for failing audits. This has led to a niche in the software trade for data quality tools.

7. Ensure consistency: The data types and their NULL-ability constraints have to be consistent across databases (e.g., the chart of accounts should be defined the same way in both the desktop and enterprise-level databases).

6.10 A Note on Multiple NULL Values

In a discussion on CompuServe in July 1996, Carl C. Federl came up with an interesting idea for multiple missing value tokens in a database.

If you program in embedded SQL, you are used to having to work with an INDICATOR variable. This variable is used to pass information to the host program, mostly about the NULL or NOT NULL status of the SQL column in the database. What the host program does with the information is up to the programmer. So why not extend this concept a bit and provide an indicator column in SQL? Let’s work out a simple example:

    CREATE TABLE Bob
    (keycol INTEGER NOT NULL PRIMARY KEY,
     valcol INTEGER NOT NULL,
     multi_indicator INTEGER NOT NULL
       CHECK (multi_indicator IN (0, -- Known value
                                  1, -- Not applicable value
                                  2, -- Missing value
                                  3  -- Approximate value
                                  )));

Let’s set up the rules: when all values are known, we do a regular total. If a value is “not applicable,” then the whole total is “not applicable.” If we have no “not applicable” values, then “missing value” dominates the total; if we have no “not applicable” and no “missing” values, then we give a warning about approximate values.
The general form of the queries will be:

    SELECT SUM (valcol),
           (CASE WHEN NOT EXISTS
                      (SELECT multi_indicator
                         FROM Bob
                        WHERE multi_indicator > 0)
                 THEN 0
                 WHEN EXISTS
                      (SELECT *
                         FROM Bob
                        WHERE multi_indicator = 1)
                 THEN 1
                 WHEN EXISTS
                      (SELECT *
                         FROM Bob
                        WHERE multi_indicator = 2)
                 THEN 2
                 WHEN EXISTS
                      (SELECT *
                         FROM Bob
                        WHERE multi_indicator = 3)
                 THEN 3
                 ELSE NULL END) AS totals_multi_indicator
      FROM Bob;

Why would I muck with the valcol total at all? The status is over in the multi_indicator column, just like it was in the original table. Here is an exercise for the reader:

1. Make up a set of rules for multiple missing values and write a query for the SUM(), AVG(), MAX(), MIN(), and COUNT() functions.

2. Set degrees of approximation (plus or minus five, plus or minus ten, etc.) in the multi_indicator. Assume the valcol is always in the middle. Make the multi_indicator handle the fuzziness of the situation.

    CREATE TABLE MultiNull
    (groupcol INTEGER NOT NULL,
     keycol INTEGER NOT NULL,
     valcol INTEGER NOT NULL CHECK (valcol >= 0),
     valcol_null INTEGER NOT NULL DEFAULT 0,
     CHECK (valcol_null IN (0, -- Known Value
                            1, -- Not applicable
                            2, -- Missing but applicable
                            3, -- Approximate within 1%
                            4, -- Approximate within 5%
                            5, -- Approximate within 25%
                            6  -- Approximate over 25% range
                            )),
     PRIMARY KEY (groupcol, keycol),
     -- valcol must be zero when the value is not applicable or missing
     CHECK (valcol = 0 OR valcol_null NOT IN (1,2)));

    CREATE VIEW Group_MultiNull
    (groupcol, valcol_sum, valcol_avg, valcol_max, valcol_min,
     row_cnt, notnull_cnt, na_cnt, missing_cnt, approximate_cnt,
     appr_1_cnt, approx_5_cnt, approx_25_cnt, approx_big_cnt)
    AS
    SELECT groupcol, SUM(valcol), AVG(valcol), MAX(valcol), MIN(valcol),
           COUNT(*),
           SUM (CASE WHEN valcol_null = 0 THEN 1 ELSE 0 END) AS notnull_cnt,
           SUM (CASE WHEN valcol_null = 1 THEN 1 ELSE 0 END) AS na_cnt,
           SUM (CASE WHEN valcol_null = 2 THEN 1 ELSE 0 END) AS missing_cnt,
           SUM (CASE WHEN valcol_null IN (3,4,5,6) THEN 1 ELSE 0 END) AS approximate_cnt,
           SUM (CASE WHEN
                     valcol_null = 3 THEN 1 ELSE 0 END) AS appr_1_cnt,
           SUM (CASE WHEN valcol_null = 4 THEN 1 ELSE 0 END) AS approx_5_cnt,
           SUM (CASE WHEN valcol_null = 5 THEN 1 ELSE 0 END) AS approx_25_cnt,
           SUM (CASE WHEN valcol_null = 6 THEN 1 ELSE 0 END) AS approx_big_cnt
      FROM MultiNull
     GROUP BY groupcol;

    SELECT groupcol, valcol_sum, valcol_avg, valcol_max, valcol_min,
           (CASE WHEN row_cnt = notnull_cnt
                 THEN 'All are known'
                 ELSE 'Not all are known' END) AS warning_message,
           row_cnt, notnull_cnt, na_cnt, missing_cnt, approximate_cnt,
           appr_1_cnt, approx_5_cnt, approx_25_cnt, approx_big_cnt
      FROM Group_MultiNull;

While this is a bit complex for the typical application, it is not a bad idea for a “staging area” database that attempts to scrub the data before it goes to a data warehouse.

CHAPTER 7

Multiple Column Data Elements

The concept of a data element being atomic or scalar is usually taken to mean that it is represented with a single column in a table. This is not always true. A data element is atomic when it cannot be decomposed into independent, meaningful parts. Decomposing it anyway would result in attribute splitting, a design flaw that we discussed in Section 1.1.11.

Consider an (x, y) coordinate system. A single x or y value identifies a line of points, while the pair has to be taken together to give you a location on the plane. It would be inconvenient to put both coordinates into one column, so we model them in two columns.

7.1 Distance Functions

Since geographical data is important, you might find it handy to locate places by their longitude and latitude, then calculate the distances between two points on the globe. This is not a standard function in any SQL, but it is handy to know.

Assume that we have values (Latitude1, Longitude1, Latitude2, Longitude2) that locate the two points, and that they are in radians, and we have trigonometry functions.
To convert decimal degrees to radians, multiply the number of degrees by pi/180 = 0.017453293 radians/degree, where pi is approximately 3.14159265358979:
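A minimal sketch of this conversion, assuming Python's standard math module. The names RADIANS_PER_DEGREE and degrees_to_radians are hypothetical and do not come from the text.

```python
import math

# Conversion factor quoted in the text: pi/180 radians per degree.
RADIANS_PER_DEGREE = math.pi / 180.0

def degrees_to_radians(degrees):
    """Convert decimal degrees to radians by multiplying by pi/180."""
    return degrees * RADIANS_PER_DEGREE

print(round(RADIANS_PER_DEGREE, 9))   # 0.017453293
print(degrees_to_radians(180.0))      # pi, to floating-point precision
```

The standard library's math.radians performs the same conversion and can serve as a cross-check.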