72 CHAPTER 2: NORMALIZATION section_id CHAR(1) NOT NULL, grade CHAR(1) NOT NULL); CREATE TABLE Students (student_name CHAR (25) NOT NULL PRIMARY KEY, major CHAR(10) NOT NULL); A common misunderstanding about relational theory is that 3NF tables have no transitive dependencies. As indicated above, if X → Y, X does not have to be a key if Y is part of a candidate key. We still have a transitive dependency in the example—(room_nbr, time_period) → (course_name, section_id)—but since the right side of the dependency is a key, it is technically in 3NF. This table structure still has unreasonable behavior, though: several courses can be assigned to the same room at the same time. Another form of transitive dependency is a computed column. For example: CREATE TABLE Stuff (width INTEGER NOT NULL, length INTEGER NOT NULL, height INTEGER NOT NULL, volume INTEGER NOT NULL CHECK (width * length * height = volume), PRIMARY KEY (width, length, height)); The volume column is determined by the other three columns, so any change to one of the three columns will require a change to the volume column. You can use a VIEW to create computed columns. 2.5 Elementary Key Normal Form (EKNF) Elementary Key Normal Form (EKNF) is a subtle enhancement on 3NF. By definition, EKNF tables are also in 3NF. This happens when there is more than one unique composite key and they overlap. Such cases can cause redundant information in the overlapping column(s). For example, in the following table, let’s assume that a subject title is also a unique identifier for a given subject in the following table: CREATE TABLE Enrollment (student_id INTEGER NOT NULL, 2.6 Boyce-Codd Normal Form (BCNF) 73 course_code CHAR(6) NOT NULL, course_name VARCHAR(15) NOT NULL, PRIMARY KEY (student_id, course_name) , UNIQUE (student_id, course_name) alternative key ); Enrollment student_id course_name course_name ======================================= 1 'CS-100' 'ER Diagrams' 1 'CS-114' 'Database Design' 2 'CS-114' 'Database Design' This table, although it is in 3NF, violates EKNF. The primary key of the above table is the combination of (student_id, course_name). However, we can also see an alternative key (student_id, course_name) as well. The above schema could result in update and deletion anomalies, because values of both course_name and course_name tend to be repeated for a given subject. The following schema is a decomposition of the above table to satisfy EKNF: CREATE TABLE Subjects (course_name CHAR(6) NOT NULL PRIMARY KEY, course_name VARCHAR(15) NOT NULL); CREATE TABLE Enrollment (student_id INTEGER NOT NULL, course_name CHAR(6) NOT NULL, PRIMARY KEY (student_id, course_name)); For reasons that will become obvious in the following section, most designers do not ensure a table is in EKNF, as they will move directly on to Boyce-Codd Normal Form after ensuring that a schema is in 3NF. Thus, EKNF is included here only for reasons of historical accuracy and completeness. 2.6 Boyce-Codd Normal Form (BCNF) A table is in BCNF when for all nontrivial FDs (X → A), X is a superkey for the whole schema. A superkey is a unique set of columns that 74 CHAPTER 2: NORMALIZATION identify each row in a table, but you can remove some columns from it and it will still be a key. Informally, a superkey is carrying extra weight. BCNF is the normal form that actually removes all transitive dependencies. A table is in BCNF if for all (X → Y), X is a key—period. We can go to this normal form just by adding another key with UNIQUE (room, time_period) constraint clause to the table Classes. There are some other interesting and useful “higher” normal forms, but they are outside of the scope of this discussion. In our example, we have removed all of the important anomalies with BCNF. 3NF was concerned with the relationship between key and nonkey columns. However, a column can often play both roles. Consider a table for computing salesmen’s bonus gifts, which has for each salesman his base salary, the number of sales points he has won in a contest, and the bonus gift awarded for that combination of salary range and points. For example, we might give a fountain pen to a beginning salesman with a base pay rate somewhere between $15,000 and $20,000 and 100 sales points, but give a car to a master salesman whose salary is between $30,000 and $60,000 and who has 200 points. The functional dependencies are, therefore: (paystep, points) -> gift gift -> points Let’s start with a table that has all the data in it and normalize it. Gifts Salary_amt points gift ========================== 15000 100 'Pencil' 17000 100 'Pen' 30000 200 'Car' 31000 200 'Car' 32000 200 'Car' This schema is in 3NF, but it has problems. You cannot insert a new gift into our offerings and points unless you have a salary to go with it. If you remove any sales points, you lose information about the gifts and salaries (e.g., only people in the $30,000 range can win a car). And, finally, a change in the gifts for a particular point score would have to affect all the rows within the same pay step. This table needs to be broken apart into two tables: 2.7 Fourth Normal Form (4NF) 75 PayGifts Salary_amt gift ================= 15000 'Pencil' 17000 'Pen' 30000 'Car' 31000 'Car' 32000 'Car' GiftsPoints gift points =============== 'Pencil' 100 'Pen' 100 'Car' 200 2.7 Fourth Normal Form (4NF) Fourth Normal Form (4NF) makes use of multivalued dependencies. The problem it solves is that the table has too many of them. For example, consider a table of departments, their projects, and the parts they stock. The MVDs in the table would be: dept_name ->> jobs dept_name ->> parts Assume that dept_name ‘d1’ works on jobs ‘j1’ and ‘j2’ with parts ‘p1’ and ‘p2’; that dept_name ‘d2’ works on jobs ‘j3’, ‘j4’, and ‘j5’ with parts ‘p2’ and ‘p4’; and that dept_name ‘d3’ works only on job ‘j2’ with parts ‘p5’ and ‘p6’. The table would look like this: dept job part ================= 'd1' 'j1' 'p1' 'd1' 'j1' 'p2' 'd1' 'j2' 'p1' 'd1' 'j2' 'p2' 'd2' 'j3' 'p2' 'd2' 'j3' 'p4' 'd2' 'j4' 'p2' 76 CHAPTER 2: NORMALIZATION 'd2' 'j4' 'p4' 'd2' 'j5' 'p2' 'd2' 'j5' 'p4' 'd3' 'j2' 'p5' 'd3' 'j2' 'p6' If you want to add a part to a dept_name, you must create more than one new row. Likewise, removing a part or a job from a row can destroy information. Updating a part or job name will also require multiple rows to be changed. The solution is to split this table into two tables, one with (dept_name, jobs) in it and one with (dept_name, parts) in it. The definition of 4NF is that we have no more than one MVD in a table. If a table is in 4NF, it is also in BCNF. 2.8 Fifth Normal Form (5NF) Fifth Normal Form (5NF), also called the Join-Projection Normal Form or the Projection-Join Normal Form, is based on the idea of a lossless JOIN or the lack of a join-projection anomaly. This problem occurs when you have an n-way relationship, where n > 2. A quick check for 5NF is to see if the table is in 3NF and all the candidate keys are single columns. As an example of the problems solved by 5NF, consider a table of house notes that records the buyer, the seller, and the lender: HouseNotes buyer seller lender ================================== 'Smith' 'Jones' 'National Bank' 'Smith' 'Wilson' 'Home Bank' 'Nelson' 'Jones' 'Home Bank' This table is a three-way relationship, but because many CASE tools allow only binary relationships, it might have to be expressed in an E-R (entity-relationship) diagram as three binary relationships, which would generate CREATE TABLE statements leading to these tables: 2.8 Fifth Normal Form (5NF) 77 BuyerLender buyer lender ============================= 'Smith' 'National Bank' 'Smith' 'Home Bank' 'Nelson' 'Home Bank' SellerLender seller lender ======================= 'Jones' 'National Bank' 'Wilson' 'Home Bank' 'Jones' 'Home Bank' BuyerSeller buyer seller ================ 'Smith' 'Jones' 'Smith' 'Wilson' 'Nelson' 'Jones' The trouble is that when you try to assemble the original information by joining pairs of these three tables together as follows: SELECT BS.buyer, SL.seller, BL.lender FROM BuyerLender AS BL, SellerLender AS SL, BuyerSeller AS BS WHERE BL.buyer = BS.buyer AND BL.lender = SL.lender AND SL.seller = BS.seller; you will recreate all the valid rows in the original table, such as (‘Smith’, ‘Jones’, ‘National Bank’), but there will also be false rows, such as (‘Smith’, ‘Jones’, ‘Home Bank’), which were not part of the original table. This is called a join-projection anomaly. There are also strong JPNFs and overstrong JPNFs, which make use of JOIN dependencies (JDs for short). Unfortunately, there is no systematic way to find a JPNF or 4NF schema, because the problem is known to be NP-complete. 78 CHAPTER 2: NORMALIZATION As an aside, 3NF is very popular with CASE tools, and most of them can generate a schema where all of the tables are in 3NF. They obtain the FDs from an E-R diagram or from a statistical analysis of the existing data, then put them together into tables and check for normal forms. The bad news is that it is often possible to derive more than one 3NF schema from a set of FDs. Most CASE tools that produce an E-R diagram will find only one of them, and go no further. However, if you use an ORM (Object Role Model) tool properly, the schema will be in 5NF. I suggest strongly that you get any of the books by Terry Halpin on this technique: Information Modeling and Relational Databases from Conceptual Analysis to Logical Design, ISBN 1-55860-672-6; Database Modeling with Microsoft® Visio for Enterprise Architects, ISBN 1-55860-919-9. 2.9 Domain-Key Normal Form (DKNF) Ronald Fagin defined Domain/Key Normal Form (DKNF) in 1981 as a schema that has all of the domain constraints and functional dependencies enforced. There is not yet a general algorithm that will always generate the DKNF solution, given a set of constraints. We can, however, determine DKNF in many special cases, and it is a good guide to writing DDL in the real world. Let’s back up a bit and look at the mathematical model under normalization. An FD has a defined system of axioms that can be used in normalization problems. These six axioms, known as Armstrong’s axioms, are given below: Reflexive: X → X Augmentation: If X → Y then XZ → Y Union: If (X → Y and X → Z) then X → YZ Decomposition: If X → Y and Z a subset of Y, then X → Z Transitivity: If (X → Y and Y → Z) then X → Z Pseudotransitivity: If (X → Y and YZ → W) then XZ → W They make good sense if you just look at them, which is something we like in a set of axioms. In the real world, the FDs are the business rules we are trying to model. 2.9 Domain-Key Normal Form (DKNF) 79 In the normalization algorithm for 3NF (developed by P. A. Berstein, 1976) we use the axioms to get rid of redundant FDs. For example, if we are given: A -> B A -> C B -> C DB -> E DAF -> E A → C is redundant, because it can be derived from A → B and B → C with transitivity. Also, DAF → E is redundant, because it can be derived from DB → E and A → B with transitivity (which gives us DA → E) and augmentation (which then allows DAF → E). What we would like to find is the smallest set of FDs from which we can generate all of the given rules. This is called a nonredundant cover. For the FDs above, one cover would be: A -> B B -> C DB -> E Once we do this, Berstein shows that we can just create a table for each of the FDs where A, B, and DB are the respective keys. We have taken it easy so far, but now it’s time for a challenge. As an example of a schema with multiple 3NF tables, here is a problem that was used in a demonstration by DBStar Corporation (now Evoke Software). The company used it as an example in a demonstration that comes with their CASE tool. We are given an imaginary and simplified airline that has a database for scheduling flights and pilots. Most of the relationships are obvious things. Flights have only one departure time and one destination. They can get a different pilot and can be assigned to a different gate each day of the week. The FDs for the database are given below: 1) flight -> destination 2) flight -> hour 3) (day, flight) -> gate 4) (day, flight) -> pilot 5) (day, hour, pilot) -> gate 6) (day, hour, pilot) -> flight 80 CHAPTER 2: NORMALIZATION 7) (day, hour, pilot) -> destination 8) (day, hour, gate) -> pilot 9) (day, hour, gate) -> flight 10) (day, hour, gate) -> destination A purist, looking at this collection of FDs, will be bothered by the redundancies in this list. But in the real world, when you interview people, they do not speak to you in a minimal set; they state things that they know to be true in their situation. In fact, they very often leave out relationships that they consider too obvious to mention. Your problem is to find 3NF or stronger database schemas in these FDs. You have to be careful! You have to have all of the columns, obviously, but your answer could be in 3NF and still ignore some of the FDs. For example, this will not work: CREATE TABLE PlannedSchedule (flight, destination, hour, PRIMARY KEY (flight)); CREATE TABLE ActualSchedule (day, flight, gate, pilot, PRIMARY KEY (day, flight)); If we apply the Union axiom to some of the FDs, we get: (day, hour, gate) -> (destination, flight, pilot) (day, hour, pilot) -> (destination, flight, gate) This example says that the user has required that if we are given a day, an hour, and a gate we should be able to determine a unique flight for that day, hour, and gate. We should also be able to determine a unique flight given a day, hour, and pilot. Given the PlannedSchedule and ActualSchedule tables, you cannot produce views where either of the two constraints we just mentioned is enforced. If the query “What flight does pilot X have on day Y and hour Z?” gives you more than one answer, it violates the FDs and common sense. Here is an example of a schema that is allowable in this proposed schema, but is undesirable given our constraints: PlannedSchedule flight hour destination ========================= 118 17:00 Dallas 2.9 Domain-Key Normal Form (DKNF) 81 123 13:00 Omaha 155 17:00 Los Angeles 171 13:00 New York 666 13:00 Dis ActualSchedule day flight pilot gate ======================== Wed 118 Tom 12A Wed 155 Tom 13B Wed 171 Tom 12A Thu 123 John 12A Thu 155 John 12A Thu 171 John 13B The constraints mean that we should be able to find a unique answer to each of the following questions and not lose any information when inserting and deleting data: 1. Which flight is leaving from gate 12A on Thursdays at 13:00? This question looks fine until you realize that you don’t know any further information about flight 666, which was not required to have anything about its day or pilot in the Actu- alSchedule table. Likewise, I can add a flight to the Actu- alSchedule table that has no information in the PlannedSchedule table. 2. Which pilot is assigned to the flight that leaves gate 12A on Thursdays at 13:00? This has the same problem as before. 3. What is the destination of the flight in query 1 and 2? This has the same problem as before. 4. What gate is John leaving from on Thursdays at 13:00? 5. Where is Tom flying to on Wednesdays at 17:00? 6. What flight is assigned to Tom on Wednesdays at 17:00? It might help if we gave an example showing how one of the FDs in the problem can be derived using the axioms of FD calculus, just like you would do a geometry proof: . part ================= 'd1' 'j1' 'p1' 'd1' 'j1' 'p2' 'd1' 'j2' 'p1' 'd1' 'j2' 'p2' 'd2'. lender ================================== 'Smith' 'Jones' 'National Bank' 'Smith' 'Wilson' 'Home Bank' 'Nelson' 'Jones' 'Home Bank' This table is. 'p4' 'd2' 'j5' 'p2' 'd2' 'j5' 'p4' 'd3' 'j2' 'p5' 'd3' 'j2' 'p6' If you