CHAPTER 8: TABLE OPERATIONS

the searched deletion uses a WHERE clause like the search condition in a SELECT statement.

8.1.1 The DELETE FROM Clause

The syntax for a searched deletion statement is:

    <delete statement: searched> ::=
        DELETE FROM <table name>
        [WHERE <search condition>]

The DELETE FROM clause simply gives the name of the updatable table or view to be changed. Notice that no correlation name is allowed in the DELETE FROM clause. The SQL model for an alias table name is that the engine effectively creates a new table with that new name and populates it with rows identical to the base table or updatable view from which it was built. If you had a correlation name, you would be deleting from this system-created temporary table, and it would vanish at the end of the statement. The base table would never have been touched.

For this discussion, we will assume the user doing the deletion has applicable DELETE privileges for the table.

The positioned deletion removes the row in the base table that is the source of the current cursor row. The syntax is:

    <delete statement: positioned> ::=
        DELETE FROM <table name>
        WHERE CURRENT OF <cursor name>

Cursors in SQL are generally more expensive than nonprocedural code and, despite the existence of the Standard, they vary widely in current implementations. If you have a properly designed table with a key, you should be able to avoid them in a DELETE FROM statement.

8.1.2 The WHERE Clause

The most important thing to remember about the WHERE clause is that it is optional. If there is no WHERE clause, all rows in the table are deleted. The table structure still exists, but there are no rows. Most, but not all, interactive SQL tools will give the user a warning when he or she is about to do this and ask for confirmation.
Unless you want to clear out the table, immediately do a ROLLBACK to restore it; if you COMMIT or have set the tool to automatically commit the work, then the data is pretty much gone. The DBA will have to do something to save you. And don’t feel bad about doing it at least once while you are learning SQL.

Because we wish to remove a subset of rows all at once, we cannot simply scan the table one row at a time and remove each qualifying row as it is encountered. The way most SQL implementations do a deletion is with two passes on the table. The first pass marks all of the candidate rows that meet the WHERE clause condition. This is also when most products check to see if the deletion will violate any constraints. The most common violations involve trying to remove a value that is referenced by a foreign key (“Hey, we still have orders for those pink lawn flamingoes; you cannot drop them from inventory yet!”). But other constraints, such as CHECK() constraints and CREATE ASSERTION statements, can also cause a ROLLBACK.

After the subset is validated, the second pass removes it, either immediately or by marking the rows so that a housekeeping routine can later reclaim the storage space. Then any further housekeeping, such as updating indexes, is done last.

The important point is that while the rows are being marked, the entire table is still available for the WHERE condition to use. In many if not most cases, this two-pass method does not make any difference in the results. The WHERE clause is usually a fairly simple predicate that references constants or relationships among the columns of a row. For example, we could clear out some Personnel with this deletion:

    DELETE FROM Personnel
     WHERE iq <= 100; -- constant in simple predicate

or:

    DELETE FROM Personnel
     WHERE hat_size = iq; -- uses columns in the same row

A good optimizer could recognize that these predicates do not depend on the table as a whole, and would use a single scan for them.
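The two-pass scheme just described can be mimicked outside the database. The following Python sketch is only an illustration, not how any particular product implements DELETE FROM: the function names `two_pass_delete` and `one_pass_delete`, the row layout, and the sample rows are all invented for the example. The predicate receives both the candidate row and the table as it currently looks, so the only difference between the two functions is whether the table is frozen while rows are being marked.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]
# A predicate sees the candidate row and the table it is evaluated against.
Predicate = Callable[[Row, List[Row]], bool]


def two_pass_delete(table: List[Row], predicate: Predicate) -> List[Row]:
    """Mark every qualifying row against the unchanged table, then remove."""
    # Pass 1: mark candidates; the whole table is still visible.
    marked = [predicate(row, table) for row in table]
    # Pass 2: drop the marked rows.
    return [row for row, hit in zip(table, marked) if not hit]


def one_pass_delete(table: List[Row], predicate: Predicate) -> List[Row]:
    """Remove each qualifying row as soon as it is encountered."""
    remaining = list(table)
    for row in table:
        # The predicate sees the table as it looks after earlier removals.
        if predicate(row, remaining):
            remaining = [r for r in remaining if r is not row]
    return remaining


# Hypothetical sample rows, not taken from the text.
sample = [
    {"emp": "A", "iq": 95},
    {"emp": "B", "iq": 100},
    {"emp": "C", "iq": 130},
]


def iq_at_most_100(row: Row, table: List[Row]) -> bool:
    # Simple predicate: depends only on the row, never on the table.
    return row["iq"] <= 100


print(two_pass_delete(sample, iq_at_most_100))
print(one_pass_delete(sample, iq_at_most_100))
```

With a predicate that looks only at the row itself, both functions keep the same rows, which is why an optimizer may safely collapse the two passes into a single scan for such statements.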
The two passes make a difference when the table references itself. Let’s fire employees with IQs that are below average for their departments.

    DELETE FROM Personnel
     WHERE iq < (SELECT AVG(P1.iq)
                   FROM Personnel AS P1 -- must have correlation name
                  WHERE Personnel.dept_nbr = P1.dept_nbr);

We have the following data:

    Personnel
    emp_nbr    dept_nbr  iq
    =======================
    'Able'     'Acct'    101
    'Baker'    'Acct'    105
    'Charles'  'Acct'    106
    'Henry'    'Mkt'     101
    'Celko'    'Mkt'     170
    'Popkin'   'HR'      120

If this were done one row at a time, we would first go to Accounting and find the average IQ, (101 + 105 + 106)/3.0 = 104, and fire Able. Then we would move sequentially down the table, and again find the average IQ, (105 + 106)/2.0 = 105.5 and fire Baker. Only Charles would escape the downsizing.

Now sort the table a little differently, so that the rows are visited in reverse alphabetic order. We first read Charles’s IQ and compute the average for Accounting (101 + 105 + 106)/3.0 = 104, and retain Charles. Then we would move sequentially down the table, with the average IQ unchanged, so we also retain Baker. Able, however, is downsized when that row comes up.

It might be worth noting that early versions of DB2 would delete rows in the sequential order in which they appear in physical storage. Sybase’s SQL Anywhere (née WATCOM SQL) has an optional ORDER BY clause that sorts the table, then does a sequential deletion on the table. This feature can be used to force a sequential deletion in cases where order does not matter, thus optimizing the statement by saving a second pass over the table. But it also can give the desired results in situations where you would otherwise have to use a cursor and a host language.

Anders Altberg, Johannes Becher, and I tested different versions of a DELETE statement whose goal was to remove all but one row of a group. The column dup_cnt is a count of the duplicates of that row in the original table.
The three statements tested were:

D1:

    DELETE FROM Test
     WHERE EXISTS
           (SELECT *
              FROM Test AS T1
             WHERE T1.dup_id = Test.dup_id
               AND T1.dup_cnt < Test.dup_cnt);

D2:

    DELETE FROM Test
     WHERE dup_cnt >
           (SELECT MIN(T1.dup_cnt)
              FROM Test AS T1
             WHERE T1.dup_id = Test.dup_id);

D3:

    BEGIN ATOMIC
    INSERT INTO WorkingTable(dup_id, min_dup_cnt)
    SELECT dup_id, MIN(dup_cnt)
      FROM Test
     GROUP BY dup_id;
    DELETE FROM Test
     WHERE dup_cnt >
           (SELECT min_dup_cnt
              FROM WorkingTable
             WHERE WorkingTable.dup_id = Test.dup_id);
    END;

Their relative execution speeds in one SQL desktop product were:

    D1   3.20 seconds
    D2  31.22 seconds
    D3   0.17 seconds

Without seeing the execution plans, I would guess that statement D1 went to an index for the EXISTS() test and returned TRUE on the first item it found. On the other hand, D2 scanned each subset in the partitioning of Test by dup_id to find the MIN() over and over. Finally, the D3 version simply does a JOIN on simple scalar columns. With full SQL-92, you could write D3 as:

D3-2:

    DELETE FROM Test
     WHERE dup_cnt >
           (SELECT min_dup_cnt
              FROM (SELECT dup_id, MIN(dup_cnt)
                      FROM Test
                     GROUP BY dup_id)
                   AS WorkingTable(dup_id, min_dup_cnt)
             WHERE WorkingTable.dup_id = Test.dup_id);

Having said all of this, the fastest way to remove redundant duplicates is most often with a CURSOR that does a full table scan.

8.1.3 Deleting Based on Data in a Second Table

The WHERE clause can be as complex as you wish. This means you can have subqueries that use other tables. For example, to remove customers who have paid their bills from the Deadbeats table, you can use a correlated EXISTS predicate, thus:

    DELETE FROM Deadbeats
     WHERE EXISTS
           (SELECT *
              FROM Payments AS P1
             WHERE Deadbeats.cust_nbr = P1.cust_nbr
               AND P1.amtpaid >= Deadbeats.amtdue);

The scope rules from SELECT statements also apply to the WHERE clause of a DELETE FROM statement, but it is a good idea to qualify all of the column names.
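To watch the Deadbeats deletion run, here is a minimal sketch that drives the same statement through SQLite using Python's standard sqlite3 module. The choice of SQLite, the simplified column declarations, and the three sample customers are assumptions made for the demonstration; only the DELETE FROM statement itself comes from the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema and rows, just enough to exercise the statement.
conn.executescript("""
    CREATE TABLE Deadbeats
    (cust_nbr INTEGER NOT NULL PRIMARY KEY,
     amtdue DECIMAL(10,2) NOT NULL);

    CREATE TABLE Payments
    (cust_nbr INTEGER NOT NULL,
     amtpaid DECIMAL(10,2) NOT NULL);

    INSERT INTO Deadbeats VALUES (1, 100.00), (2, 250.00), (3, 75.00);
    INSERT INTO Payments VALUES (1, 100.00), (2, 50.00);
""")

# Customer 1 paid in full, customer 2 paid only part, customer 3 paid nothing.
conn.execute("""
    DELETE FROM Deadbeats
     WHERE EXISTS
           (SELECT *
              FROM Payments AS P1
             WHERE Deadbeats.cust_nbr = P1.cust_nbr
               AND P1.amtpaid >= Deadbeats.amtdue)
""")

remaining = [row[0] for row in
             conn.execute("SELECT cust_nbr FROM Deadbeats ORDER BY cust_nbr")]
print(remaining)  # customers still on the Deadbeats list
```

Only the customer with a payment covering the full amount due is removed; a partial payment or no payment at all leaves the row in place, because the EXISTS predicate finds no qualifying Payments row for it.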
8.1.4 Deleting within the Same Table

SQL allows a DELETE FROM statement to use columns, constants, and aggregate functions drawn from the table itself. For example, it is perfectly all right to remove everyone who is below average in a class with this statement:

    DELETE FROM Students
     WHERE grade < (SELECT AVG(grade) FROM Students);

But the DELETE FROM clause does not allow a correlation name on its table, so not all WHERE clauses that could be written as part of a SELECT statement will work in a DELETE FROM statement. For example, a self-join on the working table in a subquery is impossible.

    DELETE FROM Personnel AS B1 -- correlation name is INVALID SQL
     WHERE Personnel.boss_nbr = B1.emp_nbr
       AND Personnel.salary > B1.salary;

There are ways to work around this. One trick is to build a VIEW of the table and use the VIEW instead of a correlation name. Consider the problem of finding all employees who are now earning more than their boss and deleting them. The employee table being used has a column for the employee’s identification number, emp_nbr, and another column for the boss’s employee identification number, boss_nbr.

    CREATE VIEW Bosses
    AS SELECT emp_nbr, salary FROM Personnel;

    DELETE FROM Personnel
     WHERE EXISTS
           (SELECT *
              FROM Bosses AS B1
             WHERE Personnel.boss_nbr = B1.emp_nbr
               AND Personnel.salary > B1.salary);

Simply using the Personnel table in the subquery will not work. We need an outer reference in the WHERE clause to the Personnel table in the subquery, and we cannot get that if the Personnel table is in the subquery. Such views should be as small as possible, so that the SQL engine can materialize them in main storage.

Redundant Duplicates in a Table

Redundant duplicates are unneeded copies of a row in a table. You most often get them because you did not put a UNIQUE constraint on the table and then you inserted the same data twice.
Removing the extra copies from a table in SQL is much harder than you would think. In fact, if the rows are exact duplicates, you cannot do it with a simple DELETE FROM statement. Removing redundant duplicates involves saving one of them while deleting the other(s). But if SQL has no way to tell them apart, it will delete all rows that were qualified by the WHERE clause.

Another problem is that the deletion of a row from a base table can trigger referential actions, which can have unwanted side effects. For example, if there is a referential integrity constraint that says a deletion in Table1 will cascade and delete matching rows in Table2, removing redundant duplicates from Table1 can leave me with no matching rows in Table2. Yet I still have a referential integrity rule that says there must be at least one match in Table2 for the single row I preserved in Table1. SQL allows constraints to be deferrable or nondeferrable, so you might be able to suspend the referential actions that the transaction below would cause:

    BEGIN
    INSERT INTO WorkingTable -- use DISTINCT to kill duplicates
    SELECT DISTINCT * FROM MessedUpTable;
    DELETE FROM MessedUpTable; -- clean out messed-up table
    INSERT INTO MessedUpTable -- put working table into it
    SELECT * FROM WorkingTable;
    DROP TABLE WorkingTable; -- get rid of working table
    END;

Removal of Redundant Duplicates with ROWID

Leonard C. Medel came up with several interesting ways to delete redundant duplicate rows from a table in an Oracle database. Let’s assume that we have a table:

    CREATE TABLE Personnel
    (emp_id INTEGER NOT NULL,
     name CHAR(30) NOT NULL,
     ...);

The classic Oracle “delete dups” solution is the statement:

    DELETE FROM Personnel
     WHERE ROWID < (SELECT MAX(P1.ROWID)
                      FROM Personnel AS P1
                     WHERE P1.dup_id = Personnel.dup_id
                       AND P1.name = Personnel.name
                       AND ...);

The column, or more properly pseudo-column, ROWID is based on the physical location of a row in storage. It can change after a user session but not during the session.
It is the fastest possible physical access method into an Oracle table, because it goes directly to the physical address of the data. It is also a complete violation of Dr. Codd’s rules, which require that the physical representation of the data be hidden from the users.

Doing a quick test on a 100,000-row table, Mr. Medel achieved a nearly tenfold improvement with these two alternatives. In English, the first alternative is to find the highest ROWID for each group of one or more duplicate rows, and then delete every row except the one with the highest ROWID.

    DELETE FROM Personnel
     WHERE ROWID
           IN (SELECT P2.ROWID
                 FROM Personnel AS P2,
                      (SELECT P3.dup_id, P3.name, ...
                              MAX(P3.ROWID) AS max_rowid
                         FROM Personnel AS P3
                        GROUP BY P3.dup_id, P3.name, ...)
                      AS P4
                WHERE P2.ROWID <> P4.max_rowid
                  AND P2.dup_id = P4.dup_id
                  AND P2.name = P4.name);

Notice that the GROUP BY clause needs all the columns in the table.

The second approach is to notice that the set of all rows in the table minus the set of rows we want to keep defines the set of rows to delete. This gives us the following statement:

    DELETE FROM Personnel
     WHERE ROWID
           IN (SELECT P2.ROWID
                 FROM Personnel AS P2
               EXCEPT
               SELECT MAX(P3.ROWID)
                 FROM Personnel AS P3
                GROUP BY P3.dup_id, P3.name, ...);

Both of these approaches are faster than the short, classic version because they avoid a correlated subquery expression in the WHERE clause.

8.1.5 Deleting in Multiple Tables without Referential Integrity

There is no way to directly delete rows from more than one table in a single DELETE FROM statement. There are two approaches to removing related rows from multiple tables. One is to use a temporary table of the deletion values; the other is to use referential integrity actions. For the purposes of this section, let us assume that we have a database with an Orders table and an Inventory table. Our business rule is that when something is out of stock, we delete it from all the orders.
Assume that no referential integrity constraints have been declared at all. First, create a temporary table of the products to be deleted based on your search criteria, then use that table in a correlated subquery to remove rows from each table involved.

    CREATE MODULE Foobar
    CREATE LOCAL TEMPORARY TABLE Discontinue
    (part_nbr INTEGER NOT NULL UNIQUE)
    ON COMMIT DELETE ROWS;

    PROCEDURE CleanInventory( )
    BEGIN ATOMIC
    INSERT INTO Discontinue
    SELECT DISTINCT part_nbr -- pick out the items to be removed
      FROM ...
     WHERE ... ; -- using whatever criteria you require
    DELETE FROM Orders
     WHERE part_nbr IN (SELECT part_nbr FROM Discontinue);
    DELETE FROM Inventory
     WHERE part_nbr IN (SELECT part_nbr FROM Discontinue);
    COMMIT WORK;
    END;
    END MODULE;

In the Standard SQL model, the temporary table is persistent in the schema, but its content is not. TEMPORARY tables are always empty at the start of a session, and they always appear to belong only to the user of the session. The GLOBAL option means that each application gets one copy of the table for all the modules, while LOCAL would limit the scope to the module in which it is declared.

8.2 INSERT INTO Statement

The INSERT INTO statement is the only way to get new data into a base table. In practice, there are always other tools for loading large amounts of data into a table, but they are very vendor-dependent.
8.2.1 INSERT INTO Clause

The syntax for INSERT INTO is:

    <insert statement> ::=
        INSERT INTO <table name>
        <insert columns and source>

    <insert columns and source> ::=
        [(<insert column list>)] <query expression>
        | VALUES <table value constructor list>
        | DEFAULT VALUES

    <table value constructor list> ::=
        <row value constructor> [{<comma> <row value constructor>}...]

    <row value constructor> ::=
        <row value constructor element>
        | <left paren> <row value constructor list> <right paren>
        | <row subquery>

    <row value constructor list> ::=
        <row value constructor element>
        [{<comma> <row value constructor element>}...]

    <row value constructor element> ::=
        <value expression> | NULL | DEFAULT

The two basic forms of an INSERT INTO are a table constant (usually a single row) insertion and a query insertion. The table constant insertion is done with a VALUES() clause. The list of insert values usually consists of constants or explicit NULLs, but in theory they could be almost any expression, including scalar SELECT subqueries.
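The two basic forms of insertion, plus a scalar subquery used as an insert value, can be tried with the short sketch below. It runs against SQLite through Python's standard sqlite3 module, which is an assumption of the example rather than anything the text prescribes; the Archive table, the column names, and the sample rows are likewise hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical tables for the demonstration.
conn.executescript("""
    CREATE TABLE Personnel
    (emp_name CHAR(30) NOT NULL PRIMARY KEY,
     dept_nbr CHAR(5) NOT NULL,
     iq INTEGER);

    CREATE TABLE Archive
    (emp_name CHAR(30) NOT NULL PRIMARY KEY,
     iq INTEGER);
""")

# Table constant insertion: two rows of constants, one with an explicit NULL.
conn.execute("""
    INSERT INTO Personnel (emp_name, dept_nbr, iq)
    VALUES ('Able', 'Acct', 101),
           ('Baker', 'Acct', NULL)
""")

# Query insertion: the rows come from a query expression.
conn.execute("""
    INSERT INTO Archive (emp_name, iq)
    SELECT emp_name, iq
      FROM Personnel
     WHERE dept_nbr = 'Acct'
""")

# An insert value may also be a scalar SELECT subquery.
conn.execute("""
    INSERT INTO Personnel (emp_name, dept_nbr, iq)
    VALUES ('Charles', 'Acct', (SELECT MAX(iq) FROM Archive))
""")

personnel_rows = conn.execute(
    "SELECT emp_name, dept_nbr, iq FROM Personnel ORDER BY emp_name"
).fetchall()
archive_count = conn.execute("SELECT COUNT(*) FROM Archive").fetchone()[0]
print(personnel_rows)
print(archive_count)
```

The scalar subquery reads from Archive rather than from the table being inserted into, which keeps the example independent of how a given product orders the read and the write.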