CHAPTER 2: NORMALIZATION
2.9 Domain-Key Normal Form (DKNF)

Given:

1) (day, hour, gate) -> pilot
2) (day, hour, pilot) -> flight

prove that:

(day, hour, gate) -> flight.

3) (day, hour) -> (day, hour); Reflexive
4) (day, hour, gate) -> (day, hour); Augmentation on 3
5) (day, hour, gate) -> (day, hour, pilot); Union 1 & 4
6) (day, hour, gate) -> flight; Transitive 2 and 5
Q.E.D.

The answer is to start by attempting to derive each of the FDs from the rest of the set. What we get is several short proofs, each requiring different “given” FDs in order to get to the derived FD.

Here is a list of each of the proofs used to derive the ten fragmented FDs in the problem. With each derivation, we include every derivation step and the legal FD calculus operation that allows us to make that step. An additional operation that we include here, which was not included in the axioms we listed earlier, is left reduction. Left reduction says that if XX → Y then X → Y. The reason it was not included is that this is actually a theorem, and not one of the basic axioms (a side problem: can you derive left reduction?).

Prove: (day, hour, pilot) -> gate
a) day -> day; Reflexive
b) (day, hour, pilot) -> day; Augmentation (a)
c) (day, hour, pilot) -> (day, flight); Union (6, b)
d) (day, hour, pilot) -> gate; Transitive (c, 3)
Q.E.D.

Prove: (day, hour, gate) -> pilot
a) day -> day; Reflexive
b) (day, hour, gate) -> day; Augmentation (a)
c) (day, hour, gate) -> (day, flight); Union (9, b)
d) (day, hour, gate) -> pilot; Transitive (c, 4)
Q.E.D.

Prove: (day, flight) -> gate
a) (day, flight, pilot) -> gate; Pseudotransitivity (2, 5)
b) (day, flight, day, flight) -> gate; Pseudotransitivity (a, 4)
c) (day, flight) -> gate; Left reduction (b)
Q.E.D.

Prove: (day, flight) -> pilot
a) (day, flight, gate) -> pilot; Pseudotransitivity (2, 8)
b) (day, flight, day, flight) -> pilot; Pseudotransitivity (a, 3)
c) (day, flight) -> pilot; Left reduction (b)
Q.E.D.
Prove: (day, hour, gate) -> flight
a) (day, hour) -> (day, hour); Reflexivity
b) (day, hour, gate) -> (day, hour); Augmentation (a)
c) (day, hour, gate) -> (day, hour, pilot); Union (b, 8)
d) (day, hour, gate) -> flight; Transitivity (c, 6)
Q.E.D.

Prove: (day, hour, pilot) -> flight
a) (day, hour) -> (day, hour); Reflexivity
b) (day, hour, pilot) -> (day, hour); Augmentation (a)
c) (day, hour, pilot) -> (day, hour, gate); Union (b, 5)
d) (day, hour, pilot) -> flight; Transitivity (c, 9)
Q.E.D.

Prove: (day, hour, gate) -> destination
a) (day, hour, gate) -> destination; Transitivity (9, 1)
Q.E.D.

Prove: (day, hour, pilot) -> destination
a) (day, hour, pilot) -> destination; Transitivity (6, 1)
Q.E.D.

Now that we’ve shown you how to derive eight of the ten FDs from other FDs, you can try mixing and matching the FDs into sets so that each set meets the following criteria:

1. Each attribute must be represented on either the left or right side of at least one FD in the set.
2. If a given FD is included in the set, then all the FDs needed to derive it cannot also be included.
3. If a given FD is excluded from the set, then the FDs used to derive it must be included.

This produces a set of “nonredundant covers,” which can be found through trial and error and common sense. For example, if we exclude (day, hour, gate) → flight, we must then include (day, hour, gate) → pilot, and vice versa, because each is used in the other’s derivation. If you want to be sure your search was exhaustive, however, you may want to apply a more mechanical method, which is what the CASE tools do for you.

The algorithm for accomplishing this task is basically to generate all the combinations of sets of the FDs. (flight → destination) and (flight → hour) are excluded in the combination generation because they cannot be derived. This gives us (2^8), or 256, combinations of FDs. Each combination is then tested against the criteria.
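The eight derivations above can also be checked mechanically. The following is a minimal Python sketch, not the method a CASE tool necessarily uses: it computes the closure of a set of attributes under a list of FDs and confirms that each derived FD follows from the FDs its proof cites. The helper names (fd, closure, implies, DERIVATIONS) are made up for this illustration, and the “given” FDs are written out in full as read from the proof steps rather than by number.

```python
def fd(lhs, rhs):
    """Build an FD from two space-separated attribute lists (hypothetical helper)."""
    return frozenset(lhs.split()), frozenset(rhs.split())


def closure(attrs, fds):
    """Return every attribute determined by attrs under the given FDs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # If we already hold the whole left side, we gain the right side.
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def implies(givens, target):
    """True if the target FD can be derived from the given FDs."""
    lhs, rhs = target
    return rhs <= closure(lhs, givens)


# (FD to prove, FDs cited as "given" in the corresponding proof above)
DERIVATIONS = [
    (fd("day hour pilot", "gate"),
     [fd("day hour pilot", "flight"), fd("day flight", "gate")]),
    (fd("day hour gate", "pilot"),
     [fd("day hour gate", "flight"), fd("day flight", "pilot")]),
    (fd("day flight", "gate"),
     [fd("flight", "hour"), fd("day hour pilot", "gate"), fd("day flight", "pilot")]),
    (fd("day flight", "pilot"),
     [fd("flight", "hour"), fd("day hour gate", "pilot"), fd("day flight", "gate")]),
    (fd("day hour gate", "flight"),
     [fd("day hour gate", "pilot"), fd("day hour pilot", "flight")]),
    (fd("day hour pilot", "flight"),
     [fd("day hour pilot", "gate"), fd("day hour gate", "flight")]),
    (fd("day hour gate", "destination"),
     [fd("day hour gate", "flight"), fd("flight", "destination")]),
    (fd("day hour pilot", "destination"),
     [fd("day hour pilot", "flight"), fd("flight", "destination")]),
]


def all_proofs_hold():
    """Check every derivation against the FDs its proof relies on."""
    return all(implies(givens, target) for target, givens in DERIVATIONS)


# Each of the eight derivable FDs is either kept or dropped: 2^8 candidate sets.
candidate_sets = 2 ** len(DERIVATIONS)

if __name__ == "__main__":
    print("all derivations hold:", all_proofs_hold())
    print("candidate sets to test:", candidate_sets)
```

Closure answers only the question “does this FD follow from those FDs?” It does not by itself apply the three criteria listed above; those would still have to be tested for each candidate set.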
Fortunately, a simple spreadsheet does all the tedious work. In this problem, the first criterion eliminates only 15 sets. Then the second criterion eliminates 152 sets, and the third criterion drops another 67. This leaves us with 22 possible covers, 5 of which are the answers we are looking for (we will explain the other 17 later). These five nonredundant covers are:

Set I:
flight -> destination
flight -> hour
(day, hour, gate) -> flight
(day, hour, gate) -> pilot
(day, hour, pilot) -> gate

Set II:
flight -> destination
flight -> hour
(day, hour, gate) -> pilot
(day, hour, pilot) -> flight
(day, hour, pilot) -> gate

Set III:
flight -> destination
flight -> hour
(day, flight) -> gate
(day, flight) -> pilot
(day, hour, gate) -> flight

Set IV:
flight -> destination
flight -> hour
(day, flight) -> gate
(day, hour, gate) -> pilot
(day, hour, pilot) -> flight

Set V:
flight -> destination
flight -> hour
(day, flight) -> pilot
(day, hour, gate) -> flight
(day, hour, pilot) -> gate
(day, hour, pilot) -> flight

At this point, we perform unions on FDs with the same left-hand side and make tables for each grouping with the left-hand side as a key. We can also eliminate symmetrical FDs (defined as X → Y and Y → X, and written with a two-headed arrow, X ↔ Y) by collapsing them into the same table.

These possible schemas are at least in 3NF. They are given in shorthand SQL DDL (Data Definition Language) without data type declarations.
Solution 1:

CREATE TABLE R1
(flight, destination, hour,
 PRIMARY KEY (flight));

CREATE TABLE R2
(day, hour, gate, flight, pilot,
 PRIMARY KEY (day, hour, gate),
 UNIQUE (day, hour, pilot),
 UNIQUE (day, flight),
 UNIQUE (flight, hour));

Solution 2:

CREATE TABLE R1
(flight, destination, hour,
 PRIMARY KEY (flight));

CREATE TABLE R2
(day, flight, gate, pilot,
 PRIMARY KEY (day, flight));

CREATE TABLE R3
(day, hour, gate, flight,
 PRIMARY KEY (day, hour, gate),
 UNIQUE (day, flight),
 UNIQUE (flight, hour));

CREATE TABLE R4
(day, hour, pilot, flight,
 PRIMARY KEY (day, hour, pilot));

Solution 3:

CREATE TABLE R1
(flight, destination, hour,
 PRIMARY KEY (flight));

CREATE TABLE R2
(day, flight, gate,
 PRIMARY KEY (day, flight));

CREATE TABLE R3
(day, hour, gate, pilot,
 PRIMARY KEY (day, hour, gate),
 UNIQUE (day, hour, pilot),
 UNIQUE (day, hour, gate));

CREATE TABLE R4
(day, hour, pilot, flight,
 PRIMARY KEY (day, hour, pilot),
 UNIQUE (day, flight),
 UNIQUE (flight, hour));

Solution 4:

CREATE TABLE R1
(flight, destination, hour,
 PRIMARY KEY (flight));

CREATE TABLE R2
(day, flight, pilot,
 PRIMARY KEY (day, flight));

CREATE TABLE R3
(day, hour, gate, flight,
 PRIMARY KEY (day, hour, gate),
 UNIQUE (flight, hour));

CREATE TABLE R4
(day, hour, pilot, gate,
 PRIMARY KEY (day, hour, pilot));

These solutions are a mess, but they are a 3NF mess! Is there a better answer? Here is one in BCNF and only two tables, proposed by Chris Date (Date 1995, p. 224).
CREATE TABLE DailySchedules
(flight, destination, hour,
 PRIMARY KEY (flight));

CREATE TABLE PilotSchedules
(day, flight, gate, pilot,
 PRIMARY KEY (day, flight));

This is a workable schema, but we could expand the constraints to give us better performance and more precise error messages, since schedules are not likely to change:

CREATE TABLE DailySchedules
(flight, hour, destination,
 UNIQUE (flight, hour, destination),
 UNIQUE (flight, hour),
 UNIQUE (flight));

CREATE TABLE PilotSchedules
(day, flight, hour, gate, pilot,
 UNIQUE (day, flight, gate),
 UNIQUE (day, flight, pilot),
 UNIQUE (day, flight),
 FOREIGN KEY (flight, hour)
  REFERENCES DailySchedules (flight, hour));

2.10 Practical Hints for Normalization

CASE tools implement formal methods for doing normalization. In particular, E-R (entity-relationship) diagrams are very useful for this process. However, a few informal hints can help speed up the process and give you a good start.

Broadly speaking, tables represent either entities or relationships, which is why E-R diagrams work so well as a design tool. Tables that represent entities should have a simple, immediate name suggested by their contents: a table named Students has student data in it, not student data and bowling scores. It is also a good idea to use plural or collective nouns as the names of such tables to remind you that a table is a set of entities; the rows are the single instances of them.

Tables that represent many-to-many relationships should be named by their contents, and should be as minimal as possible. For example, Students are related to Classes by a third (relationship) table for their attendance. These tables might represent a pure relationship, or they might contain attributes that exist within the relationship, such as a grade for the class attended.
Since the only way to get a grade is to attend the class, the relationship is going to have a column for it, and will be named “ReportCards,” “Grades” or something similar. Avoid naming entities based on many-to-many relationships by combining the two table names. For example, Student_Course is a bad name for the Enrollment entity.

Avoid NULLs whenever possible. If a table has too many NULL-able columns, it is probably not normalized properly. Try to use a NULL only for a value that is missing now, but which will be resolved later. Even better, you can put missing values into the encoding schemes for that column, as discussed in Section 5.2, on encoding schemes, of SQL Programming Style (ISBN 0-12-088797-5).

A normalized database will tend to have a lot of tables with a small number of columns per table. Don’t panic when you see that happen. People who first worked with file systems (particularly on computers that used magnetic tape) tend to design one monster file for an application and do all the work against those records. This made sense in the old days, since there was no reasonable way to JOIN a number of small files together without having the computer operator mount and dismount lots of different tapes. The habit of designing this way carried over to disk systems, since the procedural programming languages were still the same.

The same nonkey attribute in more than one table is probably a normalization problem. This is not a certainty, just a guideline. The key that determines that attribute should be in only one table, and therefore the attribute should be with it.

As a practical matter, you are apt to see the same attribute under different names, and you will need to make the names uniform throughout the entire database. The columns date_of_birth, birthdate, birthday, and dob are very likely the same attribute for an employee.
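The Students and Classes example from earlier in this section can be sketched as three tables. This is only an illustration, run through Python's sqlite3 module so that it is executable; the column names, data types, sample rows, and the function names build_demo and grades_for are all assumptions, not part of the text. The relationship table is named Enrollment rather than Student_Course, and its grade column is NULL only while the grade has not yet been assigned.

```python
import sqlite3

# Hypothetical schema: every column name and type here is an assumption.
SCHEMA = """
CREATE TABLE Students
(student_id INTEGER NOT NULL PRIMARY KEY,
 student_name TEXT NOT NULL);

CREATE TABLE Classes
(class_id INTEGER NOT NULL PRIMARY KEY,
 class_title TEXT NOT NULL);

-- Relationship table: one row per (student, class) pair.
CREATE TABLE Enrollment
(student_id INTEGER NOT NULL REFERENCES Students (student_id),
 class_id INTEGER NOT NULL REFERENCES Classes (class_id),
 grade TEXT,  -- NULL only until the grade is known
 PRIMARY KEY (student_id, class_id));
"""


def build_demo():
    """Create an in-memory database with a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves this off by default
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO Students VALUES (?, ?)",
                     [(1, "Ann"), (2, "Bob")])
    conn.executemany("INSERT INTO Classes VALUES (?, ?)",
                     [(101, "Databases"), (102, "Logic")])
    conn.executemany("INSERT INTO Enrollment VALUES (?, ?, ?)",
                     [(1, 101, "A"), (1, 102, None), (2, 101, "B")])
    return conn


def grades_for(conn, student_name):
    """Return (class_title, grade) pairs for one student, ordered by title."""
    return conn.execute(
        """SELECT C.class_title, E.grade
             FROM Enrollment AS E
                  JOIN Students AS S ON S.student_id = E.student_id
                  JOIN Classes AS C ON C.class_id = E.class_id
            WHERE S.student_name = ?
            ORDER BY C.class_title""",
        (student_name,)).fetchall()


if __name__ == "__main__":
    demo = build_demo()
    for title, grade in grades_for(demo, "Ann"):
        print(title, grade)
```

The compound primary key on the relationship table keeps a student from being enrolled in the same class twice, and the grade lives only in the relationship, since it exists only because the student attended the class.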
2.11 Key Types

The logical and physical keys for a table can be classified by their behavior and their source. Table 2.1 is a quick table of my classification system.

Table 2.1 Classification System for Key Types

                             |                      Exposed
                             |                      Physical
                             | Natural  Artificial  Locator   Surrogate
========================================================================
Constructed from attributes  |
in the reality               |
of the data model            |    Y         N          N          Y
                             |
Verifiable in reality        |    Y         N          N          N
                             |
Verifiable in itself         |    Y         Y          N          N
                             |
Visible to the user          |    Y         Y          Y          N

2.11.1 Natural Keys

A natural key is a subset of attributes that occur in a table and act as a unique identifier. The user sees them. You can go to external reality and verify them. Examples of natural keys include the UPC codes on consumer goods (read the package barcode) and coordinates (get a GPS).

Newbies worry about a natural compound key becoming very long. My answer is, “So what?” This is the 21st century; we have much better computers than we did in the 1950s, when key size was a real physical issue. Replacing a natural two- or three-integer compound key with a huge GUID that no human being or other system can possibly understand, in the belief that it will be faster, only cripples the system and makes it more prone to errors. I know how to verify the (longitude, latitude) pair of a location; how do you verify the GUID assigned to it?

A long key is not always a bad thing for performance. For example, if I use (city, state) as my key, I get a free index on just (city) in many systems. I can also add extra columns to the key to make it a super-key, when such a super-key gives me a covering index (i.e., an index that contains all of the columns required for a query, so that the base table does not have to be accessed at all).

2.11.2 Artificial Keys

An artificial key is an extra attribute added to the table that is seen by the user. It does not exist in the external reality, but can be verified for syntax or check digits inside itself.
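To make “verified for syntax or check digits inside itself” concrete, here is a small Python sketch of the standard UPC-A check digit rule: the digits in odd positions are weighted by 3, the digits in even positions by 1, and the check digit brings the total up to a multiple of 10. The function names are invented for this illustration, and the sketch covers only 12-digit UPC-A codes, not the wider EAN or GTIN family.

```python
def upc_check_digit(first_eleven):
    """Compute the UPC-A check digit for the first 11 digits of a code."""
    if len(first_eleven) != 11 or not (first_eleven.isascii() and first_eleven.isdigit()):
        raise ValueError("expected exactly 11 decimal digits")
    digits = [int(ch) for ch in first_eleven]
    odd_sum = sum(digits[0::2])   # positions 1, 3, 5, 7, 9, 11
    even_sum = sum(digits[1::2])  # positions 2, 4, 6, 8, 10
    return (10 - (3 * odd_sum + even_sum) % 10) % 10


def is_valid_upc(code):
    """True if code is 12 decimal digits and its last digit is the right check digit."""
    if len(code) != 12 or not (code.isascii() and code.isdigit()):
        return False
    return upc_check_digit(code[:11]) == int(code[11])


if __name__ == "__main__":
    print(is_valid_upc("036000291452"))
    print(is_valid_upc("036000291453"))
```

A check of this kind tells you only that the code is well formed. It says nothing about whether the code is attached to the right item, which still has to be verified inside the enterprise.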
One example of an artificial key is the open codes in the UPC/EAN scheme that a user can assign to his own stuff. The check digits still work, but you have to verify them inside your own enterprise.

Experienced database designers tend toward keys they find in industry standard codes, such as UPC/EAN, VIN, GTIN, ISBN, etc. They know that they need to verify the data against the reality they are modeling. A trusted external source is a good thing to have. I know why this VIN is associated with this car, but why is an auto-number value of 42 associated with this car? Try to verify the relationship in the reality you are modeling. It makes as much sense as locating a car by its parking space number.

2.11.3 Exposed Physical Locators

An exposed physical locator is not based on attributes in the data model and is exposed to the user. There is no way to predict it or verify it. The system obtains a value through some physical process totally unrelated to the logical data model. The user cannot change the locators without destroying the relationships among the data elements.

Examples of exposed physical locators would be physical row locations encoded as a number, string or proprietary data type. If hashing tables were accessible in an SQL product they would qualify, but they are usually hidden from the user.

Many programmers object to putting IDENTITY and other auto-numbering devices into this category. To convert the number into a physical location requires a search rather than a hashing table lookup or positioning a read/write head on a disk drive, but the concept is the same. The hardware gives you a way to go to a physical location that has nothing to do with the logical data model, and that cannot be changed in the physical database or verified externally.

Most of the time, exposed physical locators are used for faking a sequential file’s positional record number, so I can reference the physical storage location: a 1960s ISAM file in SQL.
You lose all the advantages of an abstract data model and SQL set-oriented programming, because you carry extra data and destroy the portability of code.

The early SQLs were based on preexisting file systems. The data was kept in physically contiguous disk pages, in physically contiguous rows, made up of physically contiguous columns; in short, just like a deck of punch cards or a magnetic tape. Most programmers still carry that mental model, which is why I keep ranting about file versus table, row versus record and column versus field.

But physically contiguous storage is only one way of building a relational database, and it is not the best one. The basic idea of a relational database is that the user is not supposed to know how or where things are stored at all, much less write code that depends on the particular physical representation in a particular release of a particular product on particular hardware at a particular time. This is discussed further in Section 1.2.1, “IDENTITY Columns.”

Finally, an appeal to authority, with a quote from Dr. Codd: “Database users may cause the system to generate or delete a surrogate, but they have no control over its value, nor is its value ever displayed to them. . .”

This means that a surrogate ought to act like an index: created by the user, managed by the system, and never seen by a user. That means never used in code, DRI, or anything else that a user writes.

Codd also wrote the following:

“There are three difficulties in employing user-controlled keys as permanent surrogates for entities.

1. The actual values of user-controlled keys are determined by users and must therefore be subject to change by them (e.g., if two companies merge, the two employee databases might be combined, with the result that some or all of the serial numbers might be changed).

2. Two relations may have user-controlled keys defined on distinct domains (e.g., one uses Social Security numbers, while the other uses employee serial numbers), and yet the entities denoted are the same.

3. It may be necessary to carry information about an entity either before it has been assigned a user-controlled key value, or after it has ceased to have one (e.g., an applicant for a job and a retiree).”

These difficulties have the important consequence that an equi-join on common key values may not yield the same result as a join on common entities. One solution, proposed in Chapter 4 and more fully in Chapter 14, is to introduce entity domains, which contain system-assigned surrogates.

“Database users may cause the system to generate or delete a surrogate, but they have no control over its value, nor is its value ever displayed to them. . .” (Codd 1979).

2.11.4 Practical Hints for Denormalization

The subject of denormalization is a great way to get into religious wars. At one extreme, you will find relational purists who think that the idea of not carrying a database design to at least 3NF is a crime against nature. At the other extreme, you will find people who simply add and move columns all over the database with ALTER statements, never keeping the schema stable.

The reason given for denormalization is performance. A fully normalized database requires a lot of JOINs to construct common VIEWs of data from its components. JOINs used to be very costly in terms of time and computer resources, so “preconstructing” the JOIN in a denormalized table can save quite a bit. Today, only data warehouses should be denormalized, never a production OLTP system.