Joe Celko's SQL for Smarties - Advanced SQL Programming P8

CHAPTER 1: DATABASE DESIGN

Besides the usual problems with exposed physical locators, each GUID requires 16 bytes of storage, while a simple INTEGER needs only 4 bytes on most machines. Indexes and PRIMARY KEYs built on GUIDs will have worse performance than shorter key columns. This applies to compound keys of less than 16 bytes, too. I mention this because many newbies justify a GUID key on the grounds that it will improve performance. Besides being false, that level of performance is not a real problem in modern hardware. Computers built on 64-bit hardware are becoming common, and so are faster and faster disk drives.

The real problem is that GUIDs are difficult to interpret, so it becomes difficult to work with them directly and trace them back to their source for validation. In fact, the GUID does not have any sorting sequence, so it is impossible to spot a missing value or to use GUIDs to order results. All you can do is use a CHECK() with a regular expression for a 36-character string of digits and the letters 'A' to 'F', broken apart by four dashes.

The GUID cannot participate in queries involving aggregate functions; you would first have to cast it as a CHAR(36) and use the string value. Your first thought might have been to make it into a longer INTEGER, but the two data types are not compatible. Other features of this data type are very proprietary and will not port out of a Microsoft environment.

1.2.5 Sequence Generator Functions

COUNTER(*), NUMBER(*), IDENTITY, and so on are proprietary features that return a new incremented value each time the function is used in an expression. This is a way to generate unique identifiers, and it can be either a function call or a column property, depending on the product. It is also a horrible, nonstandard, nonrelational proprietary extension that should be avoided whenever possible.
We will spend some time later on ways to get sequences and unique numbers inside Standard SQL without resorting to proprietary code or the use of exposed physical locators in the hardware.

1.2.6 Unique Value Generators

The most important property of any usable unique value generator is that it will never generate the same value twice. Sequential integers are the first approach vendors implemented in their products as a substitute for a proper key.

1.2 Generating Unique Sequential Numbers for Keys

In essence, these generators are a piece of code inside SQL that looks at the last allocated value and adds one to get the next value. Let’s start from scratch and build our own version of such a procedure. First create a table called GeneratorValues with one row and two columns:

    CREATE TABLE GeneratorValues
    (lock CHAR(1) DEFAULT 'X' NOT NULL PRIMARY KEY -- only one row
       CHECK (lock = 'X'),
     keyval INTEGER DEFAULT 1 NOT NULL -- positive numbers only
       CHECK (keyval > 0));

    -- let everyone use the table
    GRANT SELECT, UPDATE(keyval)
    ON TABLE GeneratorValues
    TO PUBLIC;

Now the table needs a function to get out a value and do the increment.

    CREATE FUNCTION Generator()
    RETURNS INTEGER
    LANGUAGE SQL
    NOT DETERMINISTIC -- each call returns a new value
    BEGIN
    SET ISOLATION = SERIALIZABLE;
    UPDATE GeneratorValues
       SET keyval = keyval + 1;
    RETURN (SELECT keyval FROM GeneratorValues);
    COMMIT;
    END;

This solution looks pretty good, but without isolation, multiple users running this code fragment could be allocated duplicate values. It is therefore important to isolate the execution of the code to one and only one user at a time by using SET ISOLATION = SERIALIZABLE. Various SQL products will have slightly different ways of achieving this effect, based on their concurrency control methods.

More bad news is that in pessimistic locking systems, you can get serious performance problems resulting from lock contention when a transaction is in serial isolation. The users are put in a single queue for access to the GeneratorValues table.
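To make the duplicate-allocation risk concrete, here is a minimal C sketch that models the two statements inside Generator() as separate steps and replays them for two sessions. It is only an in-memory analogy, not how any SQL product schedules transactions; the struct and the function names (update_step, select_step, run_interleaved, run_serialized) are invented for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the one-row GeneratorValues table. */
typedef struct {
    int keyval;
} GeneratorValues;

/* Models: UPDATE GeneratorValues SET keyval = keyval + 1; */
static void update_step(GeneratorValues *t)
{
    assert(t != NULL);
    t->keyval = t->keyval + 1;
}

/* Models: SELECT keyval FROM GeneratorValues; */
static int select_step(const GeneratorValues *t)
{
    assert(t != NULL);
    return t->keyval;
}

/* Two sessions whose statements interleave: both UPDATEs run
   before either SELECT, so both sessions read the same value. */
void run_interleaved(GeneratorValues *t, int *key_a, int *key_b)
{
    update_step(t);            /* session A */
    update_step(t);            /* session B */
    *key_a = select_step(t);   /* session A */
    *key_b = select_step(t);   /* session B */
}

/* The same two sessions, run strictly one after the other. */
void run_serialized(GeneratorValues *t, int *key_a, int *key_b)
{
    update_step(t);
    *key_a = select_step(t);   /* session A finishes first */
    update_step(t);
    *key_b = select_step(t);   /* then session B */
}
```

Starting from keyval = 1, run_interleaved hands the value 3 to both sessions and never hands out 2, while run_serialized hands out 2 and then 3. Serializing access is what keeps the UPDATE and the SELECT of one session together as a single unit.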
If the application demands gap-free numbering, then we not only have to guarantee that no two sessions ever get the same value, we must also guarantee that no value is ever wasted. Therefore, the lock on the GeneratorValues table must be held until the key value is actually used and the entire transaction is committed. Exactly how to handle this is implementation-defined, so I am not going to comment on it.

1.2.7 Preallocated Values

In the old days of paper forms, the company had a forms control officer whose job was to track the forms. A gap in the sequential numbers on a check, bond, stock certificate, or other form was a serious accounting problem. Paper forms were usually preprinted and issued in blocks of numbers as needed. You can imitate this procedure in a database with a little thought and a few simple stored procedures.

Broadly speaking, there were two types of allocation blocks. In one, the sequence is known (the most common example of this is a checkbook). Gaps in the sequence numbers are not allowed, and a destroyed or damaged check has to be explained with a “void” or other notation. The system needs to record which block went to which user, the date and time, and any other information relevant to the auditors.

    CREATE TABLE FormsControl
    (form_nbr CHAR(8) NOT NULL, -- wide enough for codes such as 'Flob-1/R'
     seq INTEGER NOT NULL CHECK(seq > 0),
     PRIMARY KEY (form_nbr, seq),
     recipient CHAR(25) DEFAULT CURRENT_USER NOT NULL,
     issue_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP NOT NULL);

The tables that use the form numbers need to have constraints to verify that the numbers were issued and appear in the FormsControl table. The next sequence number is easy to create, but you probably should restrict access to the base table with a stored procedure designed for one kind of form, along these lines.
    CREATE FUNCTION NextFlobSeq()
    RETURNS INTEGER
    LANGUAGE SQL
    NOT DETERMINISTIC -- each call issues a new number
    BEGIN
    -- recipient and issue_date take their declared defaults
    INSERT INTO FormsControl (form_nbr, seq)
    VALUES ('Flob-1/R',
            (SELECT MAX(seq)+1
               FROM FormsControl
              WHERE form_nbr = 'Flob-1/R'));
    -- hand back the number that was just issued
    RETURN (SELECT MAX(seq)
              FROM FormsControl
             WHERE form_nbr = 'Flob-1/R');
    END;

You can also use views on the FormsControl table to limit user access. If you might be dealing with an empty table, then use this scalar expression:

    (SELECT COALESCE(MAX(seq), 0)+1
       FROM FormsControl
      WHERE form_nbr = 'Flob-1/R')

When the table is empty, MAX(seq) is NULL, so the COALESCE() will return a zero, thus ensuring that the sequence starts with one.

1.2.8 Random Order Values

In many applications, we do not want to issue the sequence numbers in sequence. This pattern can give information that we do not wish to expose. Instead, we want to issue generated values in random order. Do not get confused; we want known values that are supplied in random order, not random numbers. Most random number generators can repeat values, which would defeat the purpose of this drill.

While I usually avoid mentioning physical implementations, one of the advantages of random-order keys is to improve the performance of tree indexes. Tree-structured indexes (such as a B-Tree) that have sequential insertions become unbalanced and have to be reorganized frequently. However, if the same set of keys is presented in a random order, the tree tends to stay balanced and you get much better performance.

The generator shown here is an implementation of the additive congruential method of generating values in pseudo-random order and is due to Roy Hann of Rational Commerce Limited, a CA-Ingres consulting firm. It is based on a shift-register and an XOR-gate, and it has its origins in cryptography. While there are other ways to do this, this code is nice because:

1. The algorithm can be written in C or another low-level language for speed. But the math is fairly simple, even in base ten.
2.
The algorithm tends to generate successive values that are (usually) “far apart,” which is handy for improving the performance of tree indexes. You will tend to put data on separate physical data pages in storage.
3. The algorithm does not cycle until it has generated every possible value, so we don’t have to worry about duplicates. Just count how many calls have been made to the generator.
4. The algorithm produces uniformly distributed values, which is a nice mathematical property to have. It also does not include zero.

Let’s walk through all the iterations of the four-bit generator illustrated in Figure 1.1. Initially, the shift register contains the value 0001. The two rightmost bits are XORed together, giving 1; the result is fed into the leftmost bit position, and the previous register contents shift one bit right. The iterations of the register are shown in this table, with their base-ten values:

    iteration 1:  0001 (1)
    iteration 2:  1000 (8)
    iteration 3:  0100 (4)
    iteration 4:  0010 (2)
    iteration 5:  1001 (9)
    iteration 6:  1100 (12)
    iteration 7:  0110 (6)
    iteration 8:  1011 (11)
    iteration 9:  0101 (5)
    iteration 10: 1010 (10)
    iteration 11: 1101 (13)
    iteration 12: 1110 (14)
    iteration 13: 1111 (15)
    iteration 14: 0111 (7)
    iteration 15: 0011 (3)
    iteration 16: 0001 (1) wrap-around!

Figure 1.1 Four-bit Generator for Random Order Values.

It might not be obvious that successive values are far apart when we are looking at a tiny four-bit register. But it is clear that the values are generated in no obvious order, all possible values except 0 are eventually produced, and the termination condition is clear: the generator cycles back to 1.

Generalizing the algorithm to arbitrary binary word sizes, and therefore longer number sequences, is not as easy as you might think. The tap positions, where bits are extracted for feedback, vary with the word size in an extremely obscure way.
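The four-bit walk-through, and the effect of the tap positions on the cycle, can be checked with a short C sketch. It assumes the same right-shifting register as the four-bit example, with the tap positions passed as a bit mask (bit 0 is the rightmost bit). The function names lfsr_step and lfsr_cycle_length are hypothetical and are not part of the original generator.

```c
#include <assert.h>
#include <stdint.h>

/* One step of a right-shifting feedback register that is `bits` wide.
   `taps` is a bit mask of the tap positions; bit 0 is the rightmost bit. */
uint32_t lfsr_step(uint32_t state, uint32_t taps, int bits)
{
    assert(bits >= 2 && bits <= 31);
    uint32_t tapped = state & taps;
    uint32_t feedback = 0;
    while (tapped != 0) {           /* XOR of all the tapped bits */
        feedback ^= tapped & 1u;
        tapped >>= 1;
    }
    /* Shift right and feed the result into the leftmost position. */
    return (state >> 1) | (feedback << (bits - 1));
}

/* Number of steps until the register, started at 1, returns to 1.
   Intended for small word sizes; it simply walks the whole cycle. */
uint32_t lfsr_cycle_length(uint32_t taps, int bits)
{
    /* With bit 0 tapped, every state has exactly one predecessor,
       so the walk that starts at 1 must come back to 1. */
    assert((taps & 1u) != 0);
    uint32_t state = 1;
    uint32_t steps = 0;
    do {
        state = lfsr_step(state, taps, bits);
        steps++;
    } while (state != 1);
    return steps;
}
```

With a four-bit word, the mask 0x3 (taps 0 and 1, the two rightmost bits) gives a cycle of 15 steps and reproduces the iteration table, while the mask 0x5 (taps 0 and 2) returns to 1 after only 6 steps, which shows how a poor choice of taps shortens the cycle.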
Choosing incorrect tap positions results in an incomplete and usually very short cycle, which is unusable. If you want the details and tap positions for words of one to one hundred bits, see E. J. Watson’s article in Mathematics of Computation, “Primitive Polynomials (Mod 2)” (Watson 1962, pp. 368-369).

Table 1.1 shows the tap positions for 8-, 16-, 31-, 32-, and 64-bit words. That should work with any computer hardware you have. The 31-bit word is the one that is probably the most useful, since it gives billions of numbers, uses only two tap positions to make the math easier, and matches most computer hardware. The 32-bit version is not easy to implement on a 32-bit machine, because it will usually generate an overflow error.

Table 1.1 Tap Positions for Words

    Word Length   Tap Positions
    8             {0, 2, 3, 4}
    16            {0, 2, 3, 5}
    31            {0, 3}
    32            {0, 1, 2, 3, 5, 7}
    64            {0, 1, 3, 4}

Using Table 1.1, we can see that we need to tap bits 0 and 3 to construct the 31-bit random-order value generator (which is the one most people would want to use in practice):

    UPDATE Generator31
       SET keyval = keyval/2
                  + MOD(MOD(keyval, 2) + MOD(keyval/8, 2), 2) * 1073741824; -- 1073741824 = 2^30; integer division assumed

Or if you prefer, the algorithm in C:

    int Generator31 ()
    {static int n = 1;
     n = (n >> 1) | (((n ^ (n >> 3)) & 1) << 30);
     return n;
    }

1.3 A Remark on Duplicate Rows

Neither of Dr. Codd’s relational models allows duplicate rows; both are based on a set-theoretical model. SQL has always allowed duplicate rows and is based on a multiset or bag model. When the question of duplicates came up in the SQL committee, we decided to leave them in the Standard. The example we used internally, and which Len Gallagher used in a reply letter to Database Programming & Design magazine and David Beech used in a letter to Datamation, was a cash register receipt with multiple occurrences of cans of cat food on it.
Because of this example, the literature now refers to this as the “cat food problem.”

The fundamental question is: What are you modeling in a table? Dr. Codd and Chris Date’s position is that a table is a collection of facts. The other position is that a table can represent an entity, a class, or a relationship among entities. With that approach, a duplicate row means more than one occurrence of an entity. This leads to a more object-oriented view of data, where I have to deal with different fundamental relationships among duplicates, such as:

Identity = “Clark Kent is Superman!” We really have only one entity, but multiple expressions of it. These expressions are not substitutable (Clark Kent does not fly until he changes into Superman).

Equality = “Two plus two is four.” We really have only one entity with multiple expressions that are always substitutable.

Equivalency = “You use only half as much Concentrated Sudso as your old detergent to get the same cleaning power!” We have two distinct entities, substitutable both ways under all conditions.

Substitutability = “We are out of gin, would you like a vodka martini?” We have two distinct entities, whose replacement for each other is not always in both directions or under all conditions. You might be willing to accept a glass of vodka when there is no wine, but you cannot make a wine sauce with a cup of vodka.

Dr. Codd later added a “degree of duplication” operator to his model as a way of handling duplicates when he realized that there is information in duplication that has to be handled. The degree of duplication is not exactly a COUNT(*) or a quantity column in the relation. It does not behave like a numeric column.
For example, let us look at table A (let dod mean the “degree of duplication” operator for each row):

    A
    x  y
    ====
    1  a
    2  b
    3  b

When I do a projection on y, I eliminate duplicate rows in Codd’s model, but I can reconstruct the original table from the dod function:

    A
    y  dod
    ======
    a  1
    b  2

See the difference? It is an operator, not a value.

Having said all of this, I try to use duplicate rows only for loading data into an SQL database from legacy sources. Duplicates are very frequent when you get data from the real world, like cash register tapes. Otherwise, I might leave duplicates in results because using a SELECT DISTINCT to remove them will cost too much sorting time, and the sort will force an ordering in the working table, which results in a bad performance hit later.

Dr. Codd mentions this example as “The Supermarket Checkout Problem” in his book, The Relational Model for Database Management: Version 2 (Codd 1990, pp. 378-379). He critiques the problem and credits it to David Beech in an article entitled “The Need for Duplicate Rows in Tables,” Datamation, January 1989.

1.4 Other Schema Objects

Let’s be picky about definitions. A database is the data that sits under the control of the database management system (DBMS). The DBMS has the schema, rules, and operators that apply to the database. The schema contains the definitions of the objects in the database. But we always say “the database,” as if it had no parts to it.

In the original SQL-89 language, the only data structure the user could access via SQL was the table, which could be permanent (base tables) or virtual (views). Standard SQL also allows the DBA to define other schema objects, but most of these new features are not yet available in SQL products, or the versions of them that are available are proprietary. Let’s take a quick look at these new features, but without spending much time on their details.
1.4.1 Schema Tables

An SQL engine usually keeps the information it needs about the schema by putting it in SQL tables. No two vendors agree on how the schema tables should be named or structured. The SQL Standard defines a set of standard schema information tables. While each vendor will probably keep their own internal schema information tables, many products now have VIEWs that provide standard schema information tables for data exchange and interfaces to external products.

Every SQL product allows users to query the schema tables. User groups have libraries of queries for getting useful information out of the schema tables; you should take the time to get copies of them.

Standard SQL also includes tables for supporting temporal functions, collations, character sets, and so forth, but they might be implemented differently in your actual products.

1.4.2 Temporary Tables

Tables in Standard SQL can be defined as persistent base tables, local temporary tables, or global temporary tables. The complete syntax is:

    <table definition> ::=
      CREATE [{GLOBAL | LOCAL} TEMPORARY] TABLE <table name>
        <table element list>
        [ON COMMIT {DELETE | PRESERVE} ROWS]

A local temporary table belongs to a single user, and a global temporary table is shared by more than one user. When the work in a session using a temporary table is COMMITed, the table can be either emptied (ON COMMIT DELETE ROWS) or preserved (ON COMMIT PRESERVE ROWS) for the next transaction in the user’s session. This is a way of giving the users working storage without giving them CREATE TABLE (and therefore DROP TABLE and ALTER TABLE) privileges.

This has been a serious problem in SQL products for some time. When a programmer can create temporary tables on the fly, the design of the programmer’s code quickly becomes a sequential file-processing program, with all the temporary working tapes replaced by temporary working tables. Because the temporary tables are actual tables, they take up physical storage space.
If a hundred users call the same procedure, it can allocate tables for a hundred copies of the same data and bring performance down to nothing.

1.4.3 CREATE DOMAIN Statement

The DOMAIN is a new schema element in Standard SQL. It enables you to declare an in-line macro that will allow you to put a commonly used column definition in one place in the schema. You should expect to see this feature in SQL products shortly, since it is easy to implement. The syntax is:

    <domain definition> ::=
      CREATE DOMAIN <domain name> [AS] <data type>
        [<default clause>]
        [<domain constraint>]
        [<collate clause>]
