Joe Celko's SQL for Smarties - Advanced SQL Programming P7

CHAPTER 1: DATABASE DESIGN

they should be one table with a column for a sex code. I would have split a table on sex. This is very obvious, but it can also be subtler.

Consider a subscription database that has both organizational and individual subscribers. There are two tables with the same structure and a third table that holds the split attribute, subscription type.

    CREATE TABLE OrgSubscriptions
    (subscr_id INTEGER NOT NULL PRIMARY KEY
       REFERENCES SubscriptionTypes(subscr_id),
     org_name CHAR(35),
     last_name CHAR(15),
     first_name CHAR(15),
     address1 CHAR(35) NOT NULL,
     ...);

    CREATE TABLE IndSubscriptions
    (subscr_id INTEGER NOT NULL PRIMARY KEY
       REFERENCES SubscriptionTypes(subscr_id),
     org_name CHAR(35),
     last_name CHAR(15),
     first_name CHAR(15),
     address1 CHAR(35) NOT NULL,
     ...);

    CREATE TABLE SubscriptionTypes
    (subscr_id INTEGER NOT NULL PRIMARY KEY,
     subscr_type CHAR(1) DEFAULT 'I' NOT NULL
       CHECK (subscr_type IN ('I', 'O')));

An organizational subscription can go to just a person (last_name, first_name), or just the organization name (org_name), or both. If an individual subscription has no particular person, it is sent to an organization called {Current Resident} instead.

The original specifications enforce a condition that subscr_id be universally unique in the schema.

The first step is to replace the three tables with one table for all subscriptions and move the subscription type back into a column of its own, since it is an attribute of a subscription. Next, we need to add constraints to enforce the rules for each kind of subscription.
    CREATE TABLE Subscriptions
    (subscr_id INTEGER NOT NULL PRIMARY KEY,
     org_name CHAR(35) DEFAULT '{Current Resident}',
     last_name CHAR(15),
     first_name CHAR(15),
     subscr_type CHAR(1) DEFAULT 'I' NOT NULL
       CHECK (subscr_type IN ('I', 'O')),
     CONSTRAINT known_addressee
       CHECK (COALESCE (org_name, first_name, last_name) IS NOT NULL),
     CONSTRAINT junkmail
       CHECK (CASE WHEN subscr_type = 'I'
                        AND org_name = '{Current Resident}'
                   THEN 1
                   WHEN subscr_type = 'O'
                        AND org_name = '{Current Resident}'
                   THEN 0
                   ELSE 1 END = 1),
     address1 CHAR(35) NOT NULL,
     ...);

The known_addressee constraint means that we have to have a line with some addressee for this to be a valid subscription. The junkmail constraint ensures that anything not aimed at a known person is classified as an individual subscription.

Attribute Split Rows

Consider this table, which directly models a sign-in/sign-out sheet.

    CREATE TABLE RegisterBook
    (emp_name CHAR(35) NOT NULL,
     sign_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP NOT NULL,
     sign_action CHAR(3) DEFAULT 'IN' NOT NULL
       CHECK (sign_action IN ('IN', 'OUT')),
     PRIMARY KEY (emp_name, sign_time));

To answer any basic query, you need to use two rows in a self-join to get the sign-in and sign-out pairs for each employee. The correct design would have been:

    CREATE TABLE RegisterBook
    (emp_name CHAR(35) NOT NULL,
     sign_in_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP NOT NULL,
     sign_out_time TIMESTAMP, -- null means current
     PRIMARY KEY (emp_name, sign_in_time));

The single attribute, duration, has to be modeled as two columns in Standard SQL, but here it was split into rows identified by a code that tells which end of the duration each one represents. If this were longitude and latitude, you would immediately see the problem and put the two parts of the one attribute (geographical location) in the same row.
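To make the cost of the split-row design concrete, here is a minimal sketch of the same question (when did each visit start and end?) asked against both versions of the sign-in sheet. The table name RegisterBookSplit is hypothetical; it simply stands for the first, two-rows-per-visit design so that it can coexist with the corrected RegisterBook. The sample rows are invented for illustration, and the pairing logic assumes that sign-ins and sign-outs strictly alternate for each employee.

```sql
-- Hypothetical copy of the split-row design, renamed so both designs can coexist.
CREATE TABLE RegisterBookSplit
(emp_name CHAR(35) NOT NULL,
 sign_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP NOT NULL,
 sign_action CHAR(3) DEFAULT 'IN' NOT NULL
   CHECK (sign_action IN ('IN', 'OUT')),
 PRIMARY KEY (emp_name, sign_time));

-- Invented sample data: Alice has left, Bob is still in the building.
INSERT INTO RegisterBookSplit VALUES ('Alice', TIMESTAMP '2006-01-02 09:00:00', 'IN');
INSERT INTO RegisterBookSplit VALUES ('Alice', TIMESTAMP '2006-01-02 17:00:00', 'OUT');
INSERT INTO RegisterBookSplit VALUES ('Bob', TIMESTAMP '2006-01-02 09:30:00', 'IN');

-- Split design: a self-join pairs each 'IN' row with the earliest
-- later 'OUT' row for the same employee (NULL if there is none yet).
SELECT I.emp_name,
       I.sign_time AS sign_in_time,
       MIN(O.sign_time) AS sign_out_time
  FROM RegisterBookSplit AS I
       LEFT OUTER JOIN
       RegisterBookSplit AS O
       ON O.emp_name = I.emp_name
          AND O.sign_action = 'OUT'
          AND O.sign_time > I.sign_time
 WHERE I.sign_action = 'IN'
 GROUP BY I.emp_name, I.sign_time;

-- Corrected design (the RegisterBook table with sign_in_time and
-- sign_out_time, assumed to exist): the pair is already one row.
SELECT emp_name, sign_in_time, sign_out_time
  FROM RegisterBook;

-- Who is currently signed in? One simple predicate, no join.
SELECT emp_name, sign_in_time
  FROM RegisterBook
 WHERE sign_out_time IS NULL;
```

With the sample rows, the self-join returns one row for Alice (09:00 to 17:00) and one for Bob with a NULL sign_out_time, in no guaranteed order. The corrected table yields the same facts without any pairing logic, which is the point of keeping both ends of the duration in one row.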
1.1.11 Modeling Class Hierarchies in DDL

The classic scenario in an object-oriented (OO) model calls for a root class with all of the common attributes and then specialized subclasses under it. As an example, let’s take the class of Vehicles, identify each one by an industry standard identifier (the Vehicle Identification Number, or VIN), and add two mutually exclusive subclasses, sport utility vehicles and sedans ('SUV', 'SED').

    CREATE TABLE Vehicles
    (vin CHAR(17) NOT NULL PRIMARY KEY,
     vehicle_type CHAR(3) NOT NULL
       CHECK(vehicle_type IN ('SUV', 'SED')),
     UNIQUE (vin, vehicle_type),
     ...);

Notice the overlapping candidate keys. I then use a compound candidate key (vin, vehicle_type) and a constraint in each subclass table to ensure that the vehicle_type is locked and agrees with the Vehicles table. Add some DRI actions and you are done:

    CREATE TABLE SUV
    (vin CHAR(17) NOT NULL PRIMARY KEY,
     vehicle_type CHAR(3) DEFAULT 'SUV' NOT NULL
       CHECK(vehicle_type = 'SUV'),
     UNIQUE (vin, vehicle_type),
     FOREIGN KEY (vin, vehicle_type)
       REFERENCES Vehicles(vin, vehicle_type)
       ON UPDATE CASCADE
       ON DELETE CASCADE,
     ...);

    CREATE TABLE Sedans
    (vin CHAR(17) NOT NULL PRIMARY KEY,
     vehicle_type CHAR(3) DEFAULT 'SED' NOT NULL
       CHECK(vehicle_type = 'SED'),
     UNIQUE (vin, vehicle_type),
     FOREIGN KEY (vin, vehicle_type)
       REFERENCES Vehicles(vin, vehicle_type)
       ON UPDATE CASCADE
       ON DELETE CASCADE,
     ...);

I can continue to build a hierarchy like this.
For example, if I had a Sedans table that broke down into two-door and four-door sedans, I could build a schema like this:

    CREATE TABLE Sedans
    (vin CHAR(17) NOT NULL PRIMARY KEY,
     vehicle_type CHAR(3) DEFAULT 'SED' NOT NULL
       CHECK(vehicle_type IN ('2DR', '4DR', 'SED')),
     UNIQUE (vin, vehicle_type),
     FOREIGN KEY (vin, vehicle_type)
       REFERENCES Vehicles(vin, vehicle_type)
       ON UPDATE CASCADE
       ON DELETE CASCADE,
     ...);

    CREATE TABLE TwoDoor
    (vin CHAR(17) NOT NULL PRIMARY KEY,
     vehicle_type CHAR(3) DEFAULT '2DR' NOT NULL
       CHECK(vehicle_type = '2DR'),
     UNIQUE (vin, vehicle_type),
     FOREIGN KEY (vin, vehicle_type)
       REFERENCES Sedans(vin, vehicle_type)
       ON UPDATE CASCADE
       ON DELETE CASCADE,
     ...);

    CREATE TABLE FourDoor
    (vin CHAR(17) NOT NULL PRIMARY KEY,
     vehicle_type CHAR(3) DEFAULT '4DR' NOT NULL
       CHECK(vehicle_type = '4DR'),
     UNIQUE (vin, vehicle_type),
     FOREIGN KEY (vin, vehicle_type)
       REFERENCES Sedans(vin, vehicle_type)
       ON UPDATE CASCADE
       ON DELETE CASCADE,
     ...);

The idea is to build a chain of identifiers and types in a UNIQUE() constraint that goes up the tree when you use a REFERENCES constraint. Obviously, you can do variants of this trick to get different class structures.

If an entity doesn’t have to be exclusively one subtype, you play with the root of the class hierarchy:

    CREATE TABLE Vehicles
    (vin CHAR(17) NOT NULL,
     vehicle_type CHAR(3) NOT NULL
       CHECK(vehicle_type IN ('SUV', 'SED')),
     PRIMARY KEY (vin, vehicle_type),
     ...);

Now, start hiding all this stuff in VIEWs immediately and add an INSTEAD OF trigger to those VIEWs.

1.2 Generating Unique Sequential Numbers for Keys

One common vendor extension is some method of generating a sequence of integers to use as primary keys. These are very nonrelational extensions that are highly proprietary and have major disadvantages. They are all based on exposing part of the physical state of the machine during the insertion process, in violation of Dr. E. F.
Codd’s rules for defining a relational database (i.e., rule 8, physical data independence). Dr. Codd’s rules are discussed in Chapter 2.

Early SQL products were built on existing file systems. The data was kept in physically contiguous disk pages, in physically contiguous rows, made up of physically contiguous columns; in short, just like a deck of punch cards or a magnetic tape. Most of these sequence generators are an attempt to regain the physical sequence that SQL took out of its logical model, so we can pretend that we have physically contiguous storage.

But physically contiguous storage is only one way of building a relational database, and it is not always the best one. Aside from that, the whole idea of a relational database is that the user is not supposed to know how things are stored at all, much less write code that depends on the particular physical representation in a particular release of a particular product.

The exact method used to generate sequences of integers varies from product to product, but the results are all the same: their behavior is unpredictable.

Another major disadvantage of sequential numbers as keys is that they have no check digits, so there is no way to determine if they are valid or not (for a discussion of check digits, see Joe Celko’s Data and Databases: Concepts in Practice).

So why do people use them? System-generated values are a fast and easy answer to the problem of obtaining a unique primary key. They require no research and no real data modeling. Drug abuse is also a fast and easy answer to problems. I do not recommend either.

1.2.1 IDENTITY Columns

The Sybase/SQL Server family allows you to declare an exact numeric column with a special property attached to it: IDENTITY in Sybase and DB2, or AUTOINCREMENT in SQL Anywhere. These columns will autoincrement with every row that is inserted into the table.
The numbering is totally dependent on the order in which the rows were physically inserted into the table, even if they came into the table as a single statement (i.e., INSERT INTO Foobar SELECT ...;).

Since this “feature” is highly proprietary, you can get all kinds of implementations. For example, if the next value to be used causes an overflow, then you might get a wraparound to negative values. This occurs with numbers larger than (2^31 - 1) in SQL Anywhere, while Sybase allows the user to set a NUMERIC(p, 0) column to any desired size. Some products increment the internal counter before inserting a row, so a rollback can cause gaps in the sequence. To even consider this “feature” in production code, you have to know the current release of your product and never expect your code to port.

Let’s look at the logical problems. First, try to create a table with two columns and try to make them both IDENTITY columns. If you cannot declare more than one column to be of a certain data type, then that thing is not a data type at all, by definition.

Next, create a table with one column and make it an IDENTITY column. Now try to insert, update, and delete different numbers from it. If you cannot insert, update, and delete rows from a table, then it is not a table by definition.

Finally, create a simple table with one IDENTITY column and a few other columns. Use a few statements such as:

    INSERT INTO Foobar (a, b, c) VALUES ('a1', 'b1', 'c1');
    INSERT INTO Foobar (a, b, c) VALUES ('a2', 'b2', 'c2');
    INSERT INTO Foobar (a, b, c) VALUES ('a3', 'b3', 'c3');

These statements put a few rows into the table. Notice that the IDENTITY column sequentially numbered them in the order they were presented. If you delete a row, the gap in the sequence is not filled, and the sequence continues from the highest number that has ever been used in that column in that particular table.
But now use a statement with a query expression in it, like this:

    INSERT INTO Foobar (a, b, c)
    SELECT x, y, z
      FROM Floob;

Since a query result is a table, and a table is a set that has no ordering, what should the IDENTITY numbers be? The whole completed set is presented to Foobar all at once, not a row at a time. There are (n!) ways to number (n) rows, so which one do you pick? The answer has been to use whatever the physical order of the result set happened to be. There’s that nonrelational phrase “physical order” again.

But it is actually worse than that. If the same query is executed again, but with new statistics, or after an index has been dropped or added, the new execution plan could bring the result set back in a different physical order. Can you explain from a logical model why the same rows in the second query get different IDENTITY numbers? In the relational model, they should be treated the same if all the values of all the attributes are identical.

Think about trying to do replication on two databases that differ only by an index, or by cache size, or by something that occasionally gives them different execution plans for the same statements. Want to try to maintain such a system?

1.2.2 ROWID and Physical Disk Addresses

Oracle has the ability to expose the physical address of a row on the hard drive as a special variable called ROWID. This is the fastest way to locate a row in a table, since the read-write head is positioned to the row immediately. This exposure of the underlying physical storage at the logical level means that Oracle is committed to using contiguous storage for the rows of a table, which in turn means that Oracle cannot use hashing, distributed databases, dynamic bit vectors, or any of several newer techniques for VLDB (Very Large Databases). When the database is moved or reorganized for any reason, the ROWID is changed.
1.2.3 Sequential Numbering in Pure SQL

The proper way to do this operation is to insert one row at a time with this Standard SQL statement:

    INSERT INTO Foobar (keycol, ...)
    VALUES (COALESCE((SELECT MAX(keycol) FROM Foobar), 0) + 1, ...);

Notice the use of the COALESCE() function to handle the empty table and to get the numbering started with one. This approach generalizes from a row insertion to a table insertion:

    INSERT INTO Foobar (keycol, ...)
    VALUES (COALESCE((SELECT MAX(keycol) FROM Foobar), 0) + 1, ...),
           (COALESCE((SELECT MAX(keycol) FROM Foobar), 0) + 2, ...),
           ...
           (COALESCE((SELECT MAX(keycol) FROM Foobar), 0) + n, ...);

Another approach is to put a TRIGGER on the table. Here is the code for SQL-99 TRIGGERs; actual products may have a slightly different syntax:

    CREATE TRIGGER Autoincrement
    BEFORE INSERT ON Foobar
    REFERENCING NEW AS N1
    FOR EACH ROW
    BEGIN
    UPDATE N1
       SET keycol = (SELECT COALESCE(MAX(F1.keycol), 0) + 1
                       FROM Foobar AS F1);
    COMMIT; -- put each row into the table as it is processed
    END;

Notice the use of the COALESCE() function to handle the first row inserted into an empty table.

Umachandar Jayachandran (www.umachandar.com) suggested the following method for generating unique identifiers in SQL. His original note was for SQL Server, but it can be generalized to any product with a random number function.
The idea is to first split the counters into several distinct ranges:

    CREATE TABLE Counters
    (id_nbr_set INTEGER NOT NULL PRIMARY KEY,
     low_val INTEGER NOT NULL,
     high_val INTEGER NOT NULL,
     CHECK (low_val < high_val), -- properly ordered
     CHECK (NOT EXISTS -- no overlaps with any other row
            (SELECT *
               FROM Counters AS C1
              WHERE C1.id_nbr_set <> Counters.id_nbr_set
                AND (Counters.low_val BETWEEN C1.low_val AND C1.high_val
                     OR Counters.high_val BETWEEN C1.low_val AND C1.high_val)))
    );

    INSERT INTO Counters VALUES (0, 0000000, 0999999);
    INSERT INTO Counters VALUES (1, 1000000, 1999999);
    INSERT INTO Counters VALUES (2, 2000000, 2999999);
    INSERT INTO Counters VALUES (9, 9000000, 9999999);
    -- and so on

The ranges can be any size you wish. However, uniform sizes have the advantage of matching the uniform random number generator we will be using in the code. The important thing is that the ranges should not overlap each other. Here is a skeleton procedure body:

    CREATE PROCEDURE GenerateCounters()
    LANGUAGE SQL
    IF (SELECT SUM(high_val - low_val) FROM Counters) > 0
    THEN BEGIN
         DECLARE new_id_nbr INTEGER;
         DECLARE random_set INTEGER;
         SET new_id_nbr = NULL;
         WHILE (new_id_nbr IS NULL)
         DO SET random_set = FLOOR(RAND() * 10);
            -- This will randomly pick one row
            SET new_id_nbr
                = (SELECT low_val
                     FROM Counters
                    WHERE id_nbr_set = random_set
                      AND low_val < high_val);
            UPDATE Counters
               SET low_val = low_val + 1
             WHERE id_nbr_set = random_set
               AND low_val < high_val;
         END WHILE;
         -- code to create a check digit can go here
         END;
    ELSE BEGIN -- you are out of numbers
         -- You can reset the Counters table with an UPDATE.
         -- If you take no action, the new id number will be NULL
         END;
    END IF;

1.2.4 GUIDs

Global Unique Identifiers (GUIDs) are unique exposed physical locators generated by a combination of UTC time and the network address of the device creating them. Microsoft says that they should be unique for about a century.
According to Wikipedia (http://en.wikipedia.org/wiki/GUID): “The algorithm used for generating new GUIDs has been widely criticized. At one point, the user’s network card MAC address was used as a base for several GUID digits, which meant that, e.g., a document could be tracked back to the computer that created it. After this was discovered, Microsoft changed the algorithm so that it no longer contains the MAC address. This privacy hole was used when locating the creator of the Melissa worm.”
