222 CHAPTER 8: TABLE OPERATIONS

The DEFAULT VALUES clause is actually VALUES (DEFAULT, DEFAULT, ..., DEFAULT), so it is just shorthand for a particular single-row insertion.

The tabular constant insertion is a simple tool, mostly used in interactive sessions, to put in small amounts of data.

A query insertion executes the query and produces a working table, which is inserted into the target table all at once. In both cases, the optional list of columns in the target table has to be union-compatible with the columns in the query or with the values in the VALUES clause. Any column not in the list will be assigned NULL or its explicit DEFAULT value.

8.2.2 The Nature of Inserts

In theory, an insert using a query will place the rows from the query in the target table all at once. The set-oriented nature of an insertion means that a statement like this:

    INSERT INTO SomeTable (somekey, transaction_time)
    SELECT millions, CURRENT_TIMESTAMP
      FROM HugeTable;

will have one value for transaction_time in all the rows of the result, no matter how long it takes to load them into SomeTable.

Keeping things straight requires a lot of checking behind the scenes. The insertion can fail if just one row violates a constraint on the target table. The usual physical implementation is to put the rows into the target table, but mark the work as uncommitted until the whole transaction has been validated. Once the system knows that the insertion is to be committed, it must rebuild all the indexes. Rebuilding indexes will lock out other users and might require sorting the table, if the table had a clustered index.

If you have had experience with a file system, your first thought might be to drop the indexes, insert the new data, sort the table, and reindex it. The utility programs for index creation can actually benefit from having a known ordering. Unfortunately, this trick does not always work in SQL.
The indexes maintain the uniqueness and referential integrity constraints, and they cannot be easily dropped and restored. Files stand independently of each other; tables are part of a whole database.

8.2.3 Bulk Load and Unload Utilities

All versions of SQL have a language extension or utility program that will let you read data from an external file directly into a table. There is no standard for this tool, so they are all different. Most of these utilities require the name of the file and the format in which it is written. The simpler versions of the utility just read the file and put it into a single target table. At the other extreme, Oracle uses a miniature language that can do simple editing as each record is read.

If you use a simpler tool, it is a good idea to build a working table in which you stage the data for cleanup before loading it into the actual target table. You can apply edit routines, look for duplicates, and put the bad data into another working table for inspection.

The corresponding output utility, which converts a table into a file, usually offers a choice of format options; any computations and selection can be done in SQL. Some of these programs will accept a SELECT statement or a VIEW; some will only convert a base table. Most tools now have an option to output INSERT INTO statements along with the appropriate CREATE TABLE and CREATE INDEX statements.

8.3 The UPDATE Statement

The function of the UPDATE statement in SQL is to change the values in zero or more columns of zero or more rows of one table. SQL implementations will tell you how many rows were affected by an update operation or, at a minimum, return the SQLSTATE value for zero rows affected.

There are two forms of UPDATE statements: positioned and searched. The positioned UPDATE is done with cursors; the searched UPDATE uses a WHERE clause that resembles the search condition in a SELECT statement.
Positioned UPDATEs will not be mentioned in this book, for several reasons. First, cursors are used in a host programming language, and we are concerned with pure SQL whenever possible. Second, cursors in Standard SQL are different from cursors in SQL-89 and in current implementations, and are not completely available in any implementation at the time of this writing.

8.3.1 The UPDATE Clause

The syntax for a searched update statement is

    <update statement> ::=
        UPDATE <table name>
           SET <set clause list>
        [WHERE <search condition>]

    <set clause list> ::= <set clause> [{, <set clause>}...]

    <set clause> ::= <object column> = <update source>

    <update source> ::= <value expression> | NULL | DEFAULT

    <object column> ::= <column name>

The UPDATE clause simply gives the name of the base table or updatable view to be changed. Notice that no correlation name is allowed in the UPDATE clause. The SQL model for an alias table name is that the engine effectively creates a new table with that new name and populates it with rows identical to the base table or updatable view from which it was built. If you had a correlation name, you would be updating this system-created temporary table, and it would vanish at the end of the statement. The base table would never have been touched. Having said this, you will find SQL products that allow the use of a correlation name.

The SET clause lists the columns to be changed and their new values; the WHERE clause tells the statement which rows to use. For this discussion, we will assume the user doing the update has applicable UPDATE privileges for each <object column>.

Standard SQL allows a row constructor in the SET clause. The syntax looks like this:

    UPDATE Foobar
       SET (a, b, c) = (1, 2, 3)
     WHERE x < 12;

This is shorthand for the usual syntax, where the row constructor values are matched position for position with the SET clause column list.
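The position-for-position matching can be checked with a small sketch. It uses Python's sqlite3 module only as a convenient way to run the SQL; SQLite is one product that accepts the row constructor form (in reasonably recent versions). The Foobar column types and the sample rows are assumptions made up for the illustration.

```python
import sqlite3

ROW_CONSTRUCTOR = "UPDATE Foobar SET (a, b, c) = (1, 2, 3) WHERE x < 12"
LONGHAND = "UPDATE Foobar SET a = 1, b = 2, c = 3 WHERE x < 12"


def run_update(update_sql):
    """Apply one UPDATE to a fresh Foobar table; return (rows changed, table contents)."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical layout: the text only names the columns a, b, c, and x.
    conn.execute(
        "CREATE TABLE Foobar "
        "(x INTEGER NOT NULL, a INTEGER, b INTEGER, c INTEGER)"
    )
    conn.executemany(
        "INSERT INTO Foobar (x, a, b, c) VALUES (?, 0, 0, 0)",
        [(5,), (11,), (12,), (20,)],
    )
    cursor = conn.execute(update_sql)
    changed = cursor.rowcount  # number of rows the searched UPDATE touched
    rows = conn.execute("SELECT x, a, b, c FROM Foobar ORDER BY x").fetchall()
    conn.close()
    return changed, rows


if __name__ == "__main__":
    print(run_update(ROW_CONSTRUCTOR))
    print(run_update(LONGHAND))
```

Both statements should change the same two rows (those with x < 12) and leave the others untouched, which is what "shorthand" means here.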
8.3.2 The WHERE Clause

As mentioned, the most important thing to remember about the WHERE clause is that it is optional. If there is no WHERE clause, all rows in the table are changed. This is a common error; if you make it, immediately execute a ROLLBACK statement, or call the DBA for help.

All rows that test TRUE for the <search condition> are marked as a subset and not as individual rows. It is also possible that this subset will be empty. This subset is used to construct a new set of rows that will be inserted into the table when the subset is deleted from the table. Note that the empty subset is a valid update that will fire declarative referential actions and triggers.

8.3.3 The SET Clause

Each assignment in the <set clause list> is executed in parallel, and each SET clause changes all the qualified rows at once; at least, that is the theoretical model. In practice, implementations will first mark all of the qualified rows in the table in one pass, using the WHERE clause. If there are no problems, then the SQL engine makes a copy of each marked row in working storage. Each SET clause is executed based on the old row image, and the results are put in the new row image. Finally, the old rows are deleted and the new rows are inserted. If an error occurs during all of this, then the system does a ROLLBACK, the table is left unchanged, and the errors are reported. This parallelism is not like what you find in a traditional third-generation programming language, so it may be hard to learn. This feature lets you write a statement that will swap the values in two columns, thus:

    UPDATE MyTable
       SET a = b, b = a;

This is not the same thing as

    BEGIN ATOMIC
    UPDATE MyTable
       SET a = b;
    UPDATE MyTable
       SET b = a;
    END;

In the single UPDATE, columns a and b will swap values in each row. In the pair of UPDATEs, the first statement gives column a the value of column b in each row.
In the second UPDATE of the pair, a, which now has the same value as the original value of b, will be written back into column b, so b does not change at all.

There are some limits as to what the value expression can be. The same column cannot appear more than once in a <set clause list>, which makes sense given the parallel nature of the statement. Since both assignments would go into effect at the same time, you would not know which SET clause to use.

8.3.4 Updating with a Second Table

Most updating is done with simple expressions of the form SET <column name> = <constant value>, because UPDATEs are done via data-entry programs. It is also possible to have the <column name> on both sides of the equal sign! This will not change any values in the table, but it can be used as a way to trigger referential actions that have an ON UPDATE condition.

However, the <set clause list> does not have to contain only simple expressions. It is possible to use one table to post summary data to another. The scope of the <table name> is the entire <update statement>, so it can be referenced in the WHERE clause. This is easier to explain with an example. Assume we have the following tables:

    CREATE TABLE Customers
    (cust_nbr INTEGER NOT NULL PRIMARY KEY,
     acct_amt DECIMAL(8,2) NOT NULL);

    CREATE TABLE Payments
    (trans_nbr INTEGER NOT NULL PRIMARY KEY,
     cust_nbr INTEGER NOT NULL,
     trans_amt DECIMAL(8,2) NOT NULL);

The problem is to post all of the payment amounts to the balance in the Customers table, overwriting the old balance. Such a posting is usually a batch operation, so a searched UPDATE statement seems the logical approach.
SQL-92 and some, but not all, current implementations allow you to use the updated table's name in a subquery, thus:

    UPDATE Customers
       SET acct_amt
           = acct_amt
             - (SELECT SUM(trans_amt)
                  FROM Payments AS P1
                 WHERE Customers.cust_nbr = P1.cust_nbr)
     WHERE EXISTS
           (SELECT *
              FROM Payments AS P2
             WHERE Customers.cust_nbr = P2.cust_nbr);

When there is no payment, the scalar query will return an empty set. The SUM() of an empty set is always NULL. One of the most common programming errors made when using this trick is to write a query that could return more than one row. If you did not think about it, you might have written the last example as:

    UPDATE Customers
       SET acct_amt
           = acct_amt
             - (SELECT trans_amt
                  FROM Payments AS P1
                 WHERE Customers.cust_nbr = P1.cust_nbr)
     WHERE EXISTS
           (SELECT *
              FROM Payments AS P2
             WHERE Customers.cust_nbr = P2.cust_nbr);

But consider the case where a customer has made more than one payment and we have both of them in the Payments table; the whole transaction will fail. The UPDATE statement should return an error message and ROLLBACK the entire UPDATE statement. In the first example, however, we know that we will get a scalar result because there is only one SUM(trans_amt).

The second common programming error that is made with this kind of UPDATE is to use an aggregate function that does not return zero when it is applied to an empty table, such as AVG(). Suppose we wanted to post the average payment amount made by each customer into a payment column; we could not just replace SUM() with AVG() and acct_amt with that column in the above UPDATE.
Instead, we would have to add a WHERE clause to the UPDATE that gives us only those customers who made a payment, thus:

    UPDATE Customers
       SET payment
           = (SELECT AVG(P1.trans_amt)
                FROM Payments AS P1
               WHERE Customers.cust_nbr = P1.cust_nbr)
     WHERE EXISTS
           (SELECT *
              FROM Payments AS P1
             WHERE Customers.cust_nbr = P1.cust_nbr);

You can use the WHERE clause to avoid NULLs in cases where a NULL would propagate in a calculation.

Another solution is to use a COALESCE() function to take care of the empty subquery result problem. The general form of this statement is

    UPDATE T1
       SET c1 = COALESCE ((SELECT c1
                             FROM T2
                            WHERE T1.keycol = T2.keycol), T1.c1),
           c2 = COALESCE ((SELECT c2
                             FROM T2
                            WHERE T1.keycol = T2.keycol), T1.c2),
           ...
     WHERE ... ;

This will also leave the unmatched rows alone, but it will do a table scan on T1. Jeremy Rickard improved this by putting the COALESCE() inside the subquery SELECT list. This solution assumes that you have row constructors in your SQL product. For example:

    UPDATE T2
       SET (c1, c2, ...)
           = (SELECT COALESCE (T1.c1, T2.c1),
                     COALESCE (T1.c2, T2.c2),
                     ...
                FROM T1
               WHERE T1.keycol = T2.keycol)
     WHERE ... ;

8.3.5 Using the CASE Expression in UPDATEs

The CASE expression is very handy for updating a table. The first trick is to realize that you can write SET a = a to do nothing. The statement given above can be rewritten as:

    UPDATE Customers
       SET payment
           = CASE WHEN EXISTS
                       (SELECT *
                          FROM Payments AS P1
                         WHERE Customers.cust_nbr = P1.cust_nbr)
                  THEN (SELECT AVG(P1.trans_amt)
                          FROM Payments AS P1
                         WHERE Customers.cust_nbr = P1.cust_nbr)
                  ELSE payment END;

This statement will scan the entire table since there is no WHERE clause. That might be a bad thing in this example; I would guess that only a small number of customers make a payment on any given day. But very often you were going to do table scans anyway, and this version can be faster.
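The difference between the unguarded AVG() update, the WHERE EXISTS form, and the CASE form can be seen on a tiny data set. The following is only a sketch run through Python's sqlite3 module: the payment column on Customers, the use of trans_amt as the payment amount, and the three sample customers are assumptions chosen for the illustration, and customer 2 deliberately has no payments.

```python
import sqlite3

# Hypothetical sample data; customer 2 has made no payments.
SCHEMA = """
CREATE TABLE Customers
(cust_nbr INTEGER NOT NULL PRIMARY KEY,
 payment DECIMAL(8,2));
CREATE TABLE Payments
(trans_nbr INTEGER NOT NULL PRIMARY KEY,
 cust_nbr INTEGER NOT NULL,
 trans_amt DECIMAL(8,2) NOT NULL);
INSERT INTO Customers VALUES (1, 10.00), (2, 20.00), (3, 30.00);
INSERT INTO Payments VALUES (100, 1, 40.00), (101, 1, 60.00), (102, 3, 5.00);
"""

AVG_SUBQUERY = """(SELECT AVG(P1.trans_amt)
                     FROM Payments AS P1
                    WHERE Customers.cust_nbr = P1.cust_nbr)"""
HAS_PAYMENT = """EXISTS (SELECT *
                           FROM Payments AS P1
                          WHERE Customers.cust_nbr = P1.cust_nbr)"""

# No guard at all: customers without payments are overwritten with NULL.
UNGUARDED = f"UPDATE Customers SET payment = {AVG_SUBQUERY}"
# Guard in the WHERE clause: only customers with payments are touched.
WITH_WHERE = f"UPDATE Customers SET payment = {AVG_SUBQUERY} WHERE {HAS_PAYMENT}"
# Guard in a CASE expression: every row is visited, unmatched rows keep their value.
WITH_CASE = (
    f"UPDATE Customers SET payment = "
    f"CASE WHEN {HAS_PAYMENT} THEN {AVG_SUBQUERY} ELSE payment END"
)


def post_average(update_sql):
    """Run one UPDATE against fresh sample tables and return Customers."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute(update_sql)
    rows = conn.execute(
        "SELECT cust_nbr, payment FROM Customers ORDER BY cust_nbr"
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for name in ("UNGUARDED", "WITH_WHERE", "WITH_CASE"):
        print(name, post_average(globals()[name]))
```

With this data, the unguarded statement should leave customer 2 with a NULL payment, while the WHERE EXISTS and CASE forms should produce identical tables in which customer 2 keeps the old value. The two guarded forms differ only in how many rows the engine has to visit.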
But the real advantage of the CASE expression is the ability to combine several UPDATE statements into one statement. The execution time will be greatly improved, and you will be saved from having to write a lot of procedural code or really ugly SQL. Consider this example. We have an inventory of books, and we want to first, reduce the books priced $25.00 and up by 10%, and second, increase the price of the books under $25.00 by 15% to make up the difference. The immediate thought is to write:

    BEGIN ATOMIC  -- wrong!
    UPDATE Books
       SET price = price * 0.90
     WHERE price >= 25.00;
    UPDATE Books
       SET price = price * 1.15
     WHERE price < 25.00;
    END;

But this does not work. Consider a book priced at exactly $25.00. It goes through the first UPDATE and is repriced at $22.50; then it goes through the second UPDATE and is repriced at $25.88, which is not what we wanted. Flipping the two statements will produce the desired results for this book, but given a book priced at $24.95, we will get $28.69 and then $25.82 as a final price. The correct approach is a single UPDATE with a CASE expression:

    UPDATE Books
       SET price = CASE WHEN price < 25.00
                        THEN price * 1.15
                        ELSE price * 0.90 END;

This is not only faster, but also correct. However, you have to be careful and be sure that you did not really want a series of functions applied to the same columns in a particular order. If that is the case, then you need to try to make each assignment expression within the SET clause stand by itself as a complete function instead of one step in a process.
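The book repricing arithmetic above can be replayed with a short sketch, again using Python's sqlite3 module to run the SQL. To keep the rounding exact, this version stores prices as integer cents in a hypothetical price_cents column and rounds half up with integer arithmetic; that representation, the title column, and the two sample books are assumptions for the illustration, not part of the original Books table. The CASE statement returns the new price in each branch.

```python
import sqlite3

# Prices are integer cents; (n * pct + 50) / 100 rounds half up using
# SQLite integer division, which stands in for DECIMAL rounding here.
CUT_10 = "(price_cents * 90 + 50) / 100"
RAISE_15 = "(price_cents * 115 + 50) / 100"

CUT_THEN_RAISE = [
    f"UPDATE Books SET price_cents = {CUT_10} WHERE price_cents >= 2500",
    f"UPDATE Books SET price_cents = {RAISE_15} WHERE price_cents < 2500",
]
RAISE_THEN_CUT = list(reversed(CUT_THEN_RAISE))
SINGLE_CASE = [
    f"UPDATE Books SET price_cents = "
    f"CASE WHEN price_cents < 2500 THEN {RAISE_15} ELSE {CUT_10} END"
]


def reprice(statements):
    """Run the statements in order on fresh sample data; return {title: cents}."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Books "
        "(title TEXT NOT NULL PRIMARY KEY, price_cents INTEGER NOT NULL)"
    )
    # Hypothetical books priced at exactly $25.00 and at $24.95.
    conn.executemany(
        "INSERT INTO Books VALUES (?, ?)",
        [("at_25_00", 2500), ("at_24_95", 2495)],
    )
    for statement in statements:
        conn.execute(statement)
    prices = dict(conn.execute("SELECT title, price_cents FROM Books"))
    conn.close()
    return prices


if __name__ == "__main__":
    print("cut then raise:", reprice(CUT_THEN_RAISE))
    print("raise then cut:", reprice(RAISE_THEN_CUT))
    print("single CASE:   ", reprice(SINGLE_CASE))
```

Each two-statement ordering reprices one of the books twice, while the single CASE statement evaluates every row against its old price exactly once, which is the behavior the set-oriented model promises.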
Consider this example:

    BEGIN ATOMIC
    UPDATE Foobar
       SET a = x
     WHERE r = 1;
    UPDATE Foobar
       SET b = y
     WHERE s = 2;
    UPDATE Foobar
       SET c = z
     WHERE t = 3;
    UPDATE Foobar
       SET c = z + 1
     WHERE t = 4;
    END;

This can be replaced by:

    UPDATE Foobar
       SET a = CASE WHEN r = 1 THEN x ELSE a END,
           b = CASE WHEN s = 2 THEN y ELSE b END,
           c = CASE WHEN t = 3 THEN z
                    WHEN t = 4 THEN z + 1
                    ELSE c END
     WHERE r = 1
        OR s = 2
        OR t IN (3, 4);

The WHERE clause is optional, but might improve performance if the index is right and the candidate set is small. Notice that this approach is driven by the destination of the UPDATE: the columns appear only once in the SET clause. The traditional approach is driven by the source of the changes; first you make updates from one data source, then from the next, and so forth. Think about how you would do this with a set of magnetic tapes applied against a master file.

8.4 A Note on Flaws in a Common Vendor Extension

While I do not like to spend much time discussing nonstandard SQL-like languages, the T-SQL language from Sybase and Microsoft has a horrible flaw in it that users need to be warned about. They have a proprietary syntax that allows a FROM clause in the UPDATE statement.

If the base table being updated is represented more than once in the FROM clause, then its rows can be operated on multiple times, in a total violation of relational principles. The correct answer is that when you try to put more than one value into a column, you get a cardinality violation and the UPDATE fails. Here is a quick example:

    CREATE TABLE T1 (x INTEGER NOT NULL);
    INSERT INTO T1 VALUES (1), (2), (3), (4);

    CREATE TABLE T2 (x INTEGER NOT NULL);
    INSERT INTO T2 VALUES (1), (1), (1), (1);

Now try to update T1 by doubling all the rows that have a match in T2. The FROM clause in the original Sybase version gave you a CROSS JOIN.
    UPDATE T1
       SET T1.x = 2 * T1.x
      FROM T2
     WHERE T1.x = T2.x;

    T1
    x
    ====
    16
    2
    3
    4

This is a very simple example, as you can see, but you get the idea. Some of this problem has been fixed in the current version of Sybase, but the syntax is still not standard or portable. The Microsoft version solved the cardinality problem by simply grabbing one of the values based on the current physical arrangement of the rows in the table. This is a simple example from Adam Machanic: