182 CHAPTER 5: CHARACTER DATA TYPES IN SQL

d. SCH => SSS
   PH => FF
e. H => If previous or next character is a consonant, use the previous character.
f. W => If previous character is a vowel, use the previous character.

Add the current character to the result if the current character is not equal to the last key character.

(5) If last character is S, remove it
(6) If last characters are AY, replace them with Y
(7) If last character is A, remove it

The stated reliability of NYSIIS is 98.72%, with a selectivity factor of .164% for a name inquiry. This was taken from Robert L. Taft, “Name Search Techniques,” New York State Identification and Intelligence System.

5.4 Cutter Tables

Another encoding scheme for names has been used in libraries for more than 100 years. The catalog number of a book often needs to reduce an author’s name to a simple fixed-length code. While the results of a Cutter table look much like those of a Soundex, their goal is different. They attempt to preserve the original alphabetical order of the names in the encodings.

But the librarian cannot just attach the author’s name to the classification code. Names are not the same length, nor are they unique within their first letters. For example, “Smith, John A.” and “Smith, John B.” are not unique until the last letter.

What librarians have done about this problem is to use Cutter tables. These tables map authors’ full names into letter-and-digit codes. There are several versions of the Cutter tables. The older tables tended to use a mix of letters (both upper- and lowercase) followed by digits. The three-figure version uses a single letter followed by three digits. For example, using that table:

"Adams, J" becomes "A214"
"Adams, M" becomes "A215"
"Arnold" becomes "A752"
"Dana" becomes "D168"
"Sherman" becomes "S553"
"Scanlon" becomes "S283"

The distribution of these numbers is based on the actual distribution of names of authors in English-speaking countries.
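A table lookup of this kind can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the table below is a hypothetical excerpt holding just the six codes quoted above (a real Cutter table is far larger), the function name `cutter_code` is invented here, and the rule "take the last entry that sorts at or before the name" is an assumed reading of how the table is used.

```python
from bisect import bisect_right

# Hypothetical excerpt: only the six codes quoted in the text, sorted by name.
CUTTER_EXCERPT = sorted([
    ("Adams, J", "A214"),
    ("Adams, M", "A215"),
    ("Arnold", "A752"),
    ("Dana", "D168"),
    ("Scanlon", "S283"),
    ("Sherman", "S553"),
])
NAMES = [name for name, _ in CUTTER_EXCERPT]


def cutter_code(author):
    """Return the code of the last table entry that sorts at or before author."""
    pos = bisect_right(NAMES, author)
    if pos == 0:
        raise LookupError(f"{author!r} sorts before every entry in the excerpt")
    code = CUTTER_EXCERPT[pos - 1][1]
    # The excerpt is sparse, so refuse to cross into another initial letter.
    if code[0] != author[0].upper():
        raise LookupError(f"no entry for the letter {author[0].upper()!r}")
    return code


print(cutter_code("Adams, K"))  # falls between "Adams, J" and "Adams, M"
```

Because the entries are kept in name order and the codes rise with them, sorting by code gives the same rough order as sorting by name, which is the first property the text attributes to Cutter tables.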
You simply scan down the table until you find the place where your name would fall and use that code.

Cutter tables have two important properties. The first is that they preserve the alphabetical ordering of the original name list, which means that you can do a rough sort on them. The second is that each grouping tends to be of approximately the same size as the set of names gets larger. These properties can be handy for building indexes in a database.

If you would like copies of the Cutter tables, you can find some of them on the Internet. Princeton University Library has posted its rules for names, locations, regions, and other things on its Web site, http://infoshare1.princeton.edu/katmandu/class/cutter.html.

You can also get hard copies from this publisher.

Hargrave House
7312 Firethorn
Littleton, CO 80125
Web site = www.cuttertables.com

CHAPTER 6
NULLs: Missing Data in SQL

A discussion of how missing data should be handled enters a sensitive area in relational database circles. Dr. E. F. Codd, creator of the relational model, favored two types of missing-value tokens in his book on the second version of the relational model: one for “unknown” (the eye color of a man wearing sunglasses) and one for “not applicable” (the eye color of an automobile). Chris Date, leading author on relational databases, advocates not using any general-purpose tokens for missing values at all. Standard SQL uses one token, based on Dr. Codd’s original relational model.

Perhaps Dr. Codd was right, again. In Standard SQL, adding ROLLUP and CUBE created a need for a function to test NULLs to see if they were in fact “real NULLs” (i.e., present in the data and therefore assumed to model a missing value) or “created NULLs” (i.e., created as placeholders for summary rows in the result set).

In their book A Guide to Sybase and SQL Server, David McGoveran and C. J.
Date said: “It is this writer’s opinion that NULLs, at least as currently defined and implemented in SQL, are far more trouble than they are worth and should be avoided; they display very strange and inconsistent behavior and can be a rich source of error and confusion. (Please note that these comments and criticisms apply to any system that supports SQL-style NULLs, not just to SQL Server specifically.)”

SQL takes the middle ground and has a single general-purpose NULL for missing values. Rules for NULLs in particular statements appear in the appropriate sections of this book. This section will discuss NULLs and missing values in general.

People have trouble with things that are not there. There is no concept of zero in Roman numerals and in other traditional numeral systems. It was centuries before Hindu-Arabic numerals became popular in Europe. In fact, many early Renaissance accounting firms advertised that they did not use the fancy, newfangled notation and kept records in well-understood Roman numerals instead.

Many of the conceptual problems with zero arose from not knowing the difference between ordinal and cardinal numbers. Ordinal numbers measure position; cardinal numbers measure quantity or magnitude. The argument against the zero was this: if there is no quantity or magnitude there, how can you count or measure it? What does it mean to multiply or divide a number by zero? There was considerable linguistic confusion over words that deal with the lack of something. As the Greek paradox says:

1. No cat has 12 tails.
2. A cat has one more tail than no cat.
3. Therefore, a cat has 13 tails.

Likewise, it was a long time before the idea of an empty set found its way into mathematics. The argument was that if there are no elements, how could you have a set of them? Is the empty set a subset of itself? Is the empty set a subset of all other sets?
Is there only one universal empty set or one empty set for each type of set?

Computer science now has its own problem with missing data. The Interim Report 75-02-08 to the ANSI X3 (SPARC Study Group 1975) identified 14 different kinds of incomplete data that could appear as the result of queries or as attribute values. These types included overflows, underflows, errors, and other problems in trying to represent the real world within the limits of a computer.

Instead of discussing the theory for the different models and approaches to missing data, I would rather explain why and how to use NULLs in SQL. In the rest of this book, I will be urging you not to use them, which may seem contradictory, but it is not. Think of a NULL as a drug; use it properly and it works for you, but abuse it and it can ruin everything. Your best policy is to avoid NULLs when you can and use them properly when you have to.

6.1 Empty and Missing Tables

An empty table or view is a different concept from a missing table. An empty table is one that is defined with columns and constraints, but that has zero rows in it. This can happen when a table or view is created for the first time, or when all the rows are deleted from the table. It is a perfectly good table. By definition, all of its constraints are TRUE.

A missing table has been removed from the database schema with a DROP TABLE statement, or it never existed at all (you probably typed the name wrong). A missing view is a bit different. It, too, can be absent because of a DROP VIEW statement or a typing error. But it can also be absent because a table or view from which it was built has been removed. This means that the view cannot be constructed at runtime, and the database reports a failure. If you used CASCADE behavior when you dropped a table, the view would also be gone; but we’ll explore that later.

The behavior of an empty TABLE or VIEW will vary with the way it is used.
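The difference between empty and missing tables can be seen in a small experiment. The sketch below uses SQLite through Python's sqlite3 module as a stand-in for a Standard SQL product, so details such as the exception class are SQLite-specific; the names `EmptyTable`, `Base`, `BaseView`, and `NoSuchTable` are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE EmptyTable (x INTEGER NOT NULL CHECK (x > 0))")


def one(sql):
    """Run a single-row, single-column query and return its value."""
    return conn.execute(sql).fetchone()[0]


# An empty table is a perfectly good table; its behavior depends on context.
count_rows = one("SELECT COUNT(*) FROM EmptyTable")           # counts an empty set
sum_rows = one("SELECT SUM(x) FROM EmptyTable")               # aggregate yields NULL
exists_rows = one("SELECT EXISTS (SELECT * FROM EmptyTable)")
scalar = one("SELECT (SELECT x FROM EmptyTable)")             # scalar subquery yields NULL
not_in = one("SELECT 1 NOT IN (SELECT x FROM EmptyTable)")    # nothing to match against

# A missing table is an error, not an empty result.
try:
    conn.execute("SELECT * FROM NoSuchTable")
    missing_error = None
except sqlite3.OperationalError as exc:
    missing_error = str(exc)

# A view whose base table was dropped fails when it is used.
conn.execute("CREATE TABLE Base (x INTEGER)")
conn.execute("CREATE VIEW BaseView AS SELECT x FROM Base")
conn.execute("DROP TABLE Base")
try:
    conn.execute("SELECT * FROM BaseView")
    view_failed = False
except sqlite3.Error:
    view_failed = True

print(count_rows, sum_rows, exists_rows, scalar, not_in, missing_error, view_failed)
```

SQLite reports the SQL NULL as Python's `None`, so `sum_rows` and `scalar` show the "treated as a NULL" cases while `count_rows`, `exists_rows`, and `not_in` show the "treated as an empty set" cases.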
The reader should look at sections of this book that deal with predicates that use a subquery. In general, an empty table can be treated either as a NULL or as an empty set, depending on context.

6.2 Missing Values in Columns

The usual description of NULLs is that they represent currently unknown values that may be replaced later with real values when we know something. Actually, the NULL covers a lot of territory, since it is the only way of showing any missing values. Going back to basics for a moment, we can define a row in a database as an entity, which has one or more attributes (columns), each of which is drawn from some domain. Let us use the notation E(A) = V to represent the idea that an entity, E, has an attribute, A, which has a value, V. For example, I could write “John(hair) = black” to say that John has black hair.

SQL’s general-purpose NULLs do not quite fit this model. If you have defined a domain for hair color and one for car color, then a hair color should not be comparable to a car color, because they are drawn from two different domains. You would need to make their domains comparable with an implicit or explicit casting function. This is now being done in Standard SQL, which has a CREATE DOMAIN statement, but most implementations do not have this feature yet. Trying to find out which employees drive cars that match their hair is a bit weird outside of Los Angeles, but in the case of NULLs, do we have a hit when a bald-headed man walks to work? Are no hair and no car somehow equal in color? In SQL, we would get an UNKNOWN result, rather than an error, if we compared these two NULLs directly.

The domain-specific NULLs are conceptually different from the general NULL, because we know what kind of thing is UNKNOWN. This could be shown in our notation as E(A) = NULL to mean that we know the entity, and we know the attribute, but we do not know the value.
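The hair-and-car comparison can be tried directly. The following is a minimal sketch using SQLite through Python's sqlite3 module, which has no CREATE DOMAIN, so both colors are plain character columns; the `Personnel` table and its two rows are invented for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Personnel
(emp_name VARCHAR(20) NOT NULL PRIMARY KEY,
 hair_color VARCHAR(10),   -- NULL for the bald man
 car_color VARCHAR(10));   -- NULL because he walks to work
INSERT INTO Personnel VALUES ('Rita', 'red', 'red');
INSERT INTO Personnel VALUES ('Bob', NULL, NULL);
""")

# Only rows where the comparison is TRUE pass the WHERE clause.
matches = [row[0] for row in conn.execute(
    "SELECT emp_name FROM Personnel WHERE hair_color = car_color")]

# Comparing the two NULLs raises no error; the result is UNKNOWN,
# which SQLite hands back as NULL (None in Python).
bob_result = conn.execute(
    "SELECT hair_color = car_color FROM Personnel WHERE emp_name = 'Bob'"
).fetchone()[0]

print(matches, bob_result)
```

The bald pedestrian is not a hit: his row is dropped because UNKNOWN is not TRUE, even though both of his colors are "missing" in the same way.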
Another flavor of NULL is “Not Applicable” (shown as N/A on forms and spreadsheets and called “I-marks” by Dr. E. F. Codd in his second version of the Relational Model), which we have been using on paper forms and in some spreadsheets for years. For example, a bald man’s hair-color attribute is a missing-value NULL drawn from the hair-color domain, but his feather-color attribute is a Not Applicable NULL. The attribute itself is missing, not just the value. This missing-attribute NULL could be written as E(NULL) = NULL in the formula notation.

How could an attribute not belonging to an entity show up in a table? Consolidate medical records and put everyone together for statistical purposes. You should not find any male pregnancies in the result table. The programmer has a choice as to how to handle pregnancies. He can have a column in the consolidated table for “number of pregnancies,” put a zero or a NULL in the rows where sex = ‘male’, and then add some CHECK() clauses to make sure that this integrity rule is enforced.

The other way is to have a column for “medical condition” and one for “number of occurrences” beside it. Another CHECK() clause would make sure male pregnancies do not appear. But what happens when the sex is unknown and all we have is a name like ‘Alex Morgan’, which could belong to either gender? Can we use the presence of one or more pregnancies to determine that Alex is a woman? What if Alex is a woman who has never borne children?

The case where we have NULL(A) = V is a bit strange. It means that we do not know the entity, but we are looking for a known attribute, A, which has a value of V. This is like asking “What things are colored red?”, a perfectly good question, but one that is very hard to ask in an SQL database. If you want to try writing such a query in SQL, you have to get to the system tables to get the table and column names, then JOIN them to the rows in the tables and come back with the PRIMARY KEY of that row.
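To give a feel for how awkward the "What things are colored red?" question is, here is a rough sketch against SQLite, driven from Python. It is not portable SQL: `sqlite_master` and `PRAGMA table_info` are SQLite's own catalog interfaces, and the tables `Cars` and `Birds` and the helper `find_value` are hypothetical. The host-language loop stands in for the JOIN against the system tables that the text describes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Cars (vin TEXT PRIMARY KEY, car_color TEXT);
CREATE TABLE Birds (species TEXT PRIMARY KEY, feather_color TEXT);
INSERT INTO Cars VALUES ('VIN001', 'red');
INSERT INTO Cars VALUES ('VIN002', 'blue');
INSERT INTO Birds VALUES ('cardinal', 'red');
""")


def find_value(conn, wanted):
    """Return (table, column, primary-key tuple) for every cell equal to wanted."""
    hits = []
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    for table in tables:
        # Each PRAGMA row is (cid, name, type, notnull, dflt_value, pk).
        cols = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        key_cols = [c[1] for c in sorted(cols, key=lambda c: c[5]) if c[5] > 0]
        if not key_cols:
            continue  # no declared key, so there is nothing to report back
        key_list = ", ".join(f'"{k}"' for k in key_cols)
        for col in (c[1] for c in cols):
            sql = f'SELECT {key_list} FROM "{table}" WHERE "{col}" = ?'
            for key in conn.execute(sql, (wanted,)):
                hits.append((table, col, tuple(key)))
    return hits


red_things = find_value(conn, "red")
print(red_things)
```

Every column of every table has to be probed, which is why the question is perfectly good English but a poor fit for a schema organized around known entities.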
For completeness, we could play with all eight possible combinations of known and unknown values in the basic E(A) = V formula. But such combinations are of little use or meaning. For example, NULL(NULL) = V would mean that we know a value, but not the entity or the attribute. This is like the running joke from The Hitchhiker’s Guide to the Galaxy (Adams 1979), in which the answer to the question, “What is the meaning of life, the universe, and everything?” is 42. Likewise, the “total ignorance” NULL, shown as NULL(NULL) = NULL, means that we have no information about the entity, even about its existence, its attributes, or their values.

6.3 Context and Missing Values

Create a domain called Tricolor that is limited to the values ‘Red’, ‘White’, and ‘Blue’, and a column in a table drawn from that domain with a UNIQUE constraint on it. If my table has a ‘Red’ and two NULL values in that column, I have some information about the two NULLs. I know they will be either (‘White’, ‘Blue’) or (‘Blue’, ‘White’) when their rows are resolved. This is what Chris Date calls a “distinguished NULL,” which means we have some information in it.

If my table has a ‘Red’, a ‘White’, and a NULL value in that column, can I change the last NULL to ‘Blue’ because it can only be ‘Blue’ under the rule? Or do I have to wait until I see an actual value for that row? There is no clear way to handle this in SQL. Multiple values cannot be put in a column, nor can the database automatically change values as part of the column declaration.

This idea can be carried farther with marked NULL values. For example, we are given a table of hotel rooms that has columns for check-in date and checkout date. We know the check-in date for each visitor, but we do not know his or her checkout dates. Instead, we know relationships among the NULLs. We can put them into groups: Mr. and Mrs.
X will check out on the same day, members of tour group Y will check out on the same day, and so forth. We can also add conditions on them: nobody checks out before his check-in date, tour group Y will leave after January 7, 2005, and so forth. Such rules can be put into SQL database schemas, but it is very hard to do. The usual method is to use procedural code in a host language to handle such things.

David McGoveran has proposed that each column that can have missing data should be paired with a column that encodes the reason for the absence of a value (McGoveran 1993, 1994 January, February, March). The cost is a bit of extra logic, but the extra column makes it easy to write queries that include or exclude values based on the semantics of the situation.

Finally, you might want to look at solutions statisticians have used for missing data. In many kinds of computations, the missing values are replaced by an average, median or other value constructed from the data set.

6.4 Comparing NULLs

A NULL cannot be compared to another NULL (equal, not equal, less than, greater than, and so forth). This is where we get SQL’s three-valued logic instead of two-valued logic. Most programmers do not easily think in three values. But think about it for a minute. Imagine that you are looking at brown paper bags and are asked to compare them without seeing inside of either of them. What can you say about the predicate “Bag A has more tuna fish than Bag B”? Is it TRUE or FALSE? You cannot say one way or the other, so you use a third logical value, UNKNOWN.

If I execute

  SELECT * FROM SomeTable WHERE SomeColumn = 2;

and then execute

  SELECT * FROM SomeTable WHERE SomeColumn <> 2;

I expect to see all the rows of SomeTable between these two queries. However, I also need to execute

  SELECT * FROM SomeTable WHERE SomeColumn IS NULL;

to do that. The IS [NOT] NULL predicate will return only TRUE or FALSE.
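The three queries are easy to try out. Here is a small sketch using SQLite through Python's sqlite3 module; the `row_id` key column and the three sample rows are assumptions added so that the results can be told apart.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE SomeTable
(row_id INTEGER NOT NULL PRIMARY KEY,
 SomeColumn INTEGER);
INSERT INTO SomeTable VALUES (1, 2);
INSERT INTO SomeTable VALUES (2, 3);
INSERT INTO SomeTable VALUES (3, NULL);
""")


def ids(predicate):
    """Return the row_id values of the rows for which predicate is TRUE."""
    sql = f"SELECT row_id FROM SomeTable WHERE {predicate} ORDER BY row_id"
    return [row[0] for row in conn.execute(sql)]


equal = ids("SomeColumn = 2")
not_equal = ids("SomeColumn <> 2")
is_null = ids("SomeColumn IS NULL")

print(equal, not_equal, is_null)
```

Row 3 appears in neither of the first two results, because both comparisons are UNKNOWN for it; only the IS NULL query picks it up, and the three results together cover the table.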
6.5 NULLs and Logic

George Boole developed two-valued logic and attached his name to Boolean algebra forever (Boole 1854). This is not the only possible system, but it is the one that works best with a binary (two-state) computer and with a lot of mathematics. SQL has three-valued logic: TRUE, FALSE, and UNKNOWN. The UNKNOWN value results from using NULLs in comparisons and other predicates, but UNKNOWN is a logical value and not the same as a NULL, which is a data value marker. That is why you have to say (x IS [NOT] NULL) in SQL and not use (x = NULL) instead. Table 6.1 shows the tables for the three operators that come with SQL.

Table 6.1 SQL’s Three Operators

x       NOT
==================
TRUE    FALSE
UNK     UNK
FALSE   TRUE

AND   | TRUE    UNK     FALSE
=============================
TRUE  | TRUE    UNK     FALSE
UNK   | UNK     UNK     FALSE
FALSE | FALSE   FALSE   FALSE

OR    | TRUE    UNK     FALSE
=============================
TRUE  | TRUE    TRUE    TRUE
UNK   | TRUE    UNK     UNK
FALSE | TRUE    UNK     FALSE

All other predicates in SQL resolve themselves to chains of these three operators. But that resolution is not immediately clear in all cases, since it is done at run time in the case of predicates that use subqueries.

6.5.1 NULLs in Subquery Predicates

People forget that a subquery often hides a comparison with a NULL. Consider these two tables:

  CREATE TABLE Table1 (col1 INTEGER);
  INSERT INTO Table1 (col1) VALUES (1);
  INSERT INTO Table1 (col1) VALUES (2);

  CREATE TABLE Table2 (col1 INTEGER);
  INSERT INTO Table2 (col1) VALUES (1);
  INSERT INTO Table2 (col1) VALUES (2);
  INSERT INTO Table2 (col1) VALUES (3);
  INSERT INTO Table2 (col1) VALUES (4);
  INSERT INTO Table2 (col1) VALUES (5);

Notice that the columns are NULL-able. Execute this query:

  SELECT col1
    FROM Table2
   WHERE col1 NOT IN (SELECT col1 FROM Table1);

  Result
  col1
  ======
  3
  ...
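The two tables and the NOT IN query can be run as written. The sketch below loads them into SQLite through Python's sqlite3 module and then, as an added step that is not part of the text above, inserts a NULL into Table1 to show the hidden comparison that the NULL-able column allows; the NOT EXISTS rewrite at the end is one common alternative, offered here only as an illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Table1 (col1 INTEGER);
INSERT INTO Table1 (col1) VALUES (1);
INSERT INTO Table1 (col1) VALUES (2);
CREATE TABLE Table2 (col1 INTEGER);
INSERT INTO Table2 (col1) VALUES (1);
INSERT INTO Table2 (col1) VALUES (2);
INSERT INTO Table2 (col1) VALUES (3);
INSERT INTO Table2 (col1) VALUES (4);
INSERT INTO Table2 (col1) VALUES (5);
""")

NOT_IN = """SELECT col1 FROM Table2
             WHERE col1 NOT IN (SELECT col1 FROM Table1)
             ORDER BY col1"""
NOT_EXISTS = """SELECT col1 FROM Table2
                 WHERE NOT EXISTS (SELECT * FROM Table1
                                    WHERE Table1.col1 = Table2.col1)
                 ORDER BY col1"""


def run(sql):
    return [row[0] for row in conn.execute(sql)]


before = run(NOT_IN)

# Added step: one NULL in the subquery's column changes everything.
conn.execute("INSERT INTO Table1 (col1) VALUES (NULL)")
after = run(NOT_IN)          # every test is now FALSE or UNKNOWN
rewritten = run(NOT_EXISTS)  # EXISTS only ever yields TRUE or FALSE

print(before, after, rewritten)
```

With the NULL present, `col1 NOT IN (1, 2, NULL)` expands to a chain of `<>` comparisons joined by AND, and the comparison with NULL is UNKNOWN, so no row can reach TRUE.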