Nielsen c15.tex V4 - 07/21/2009 12:51pm Page 362
Part II Manipulating Data With Select

      DatePosition DATE NOT NULL
      );

    INSERT dbo.Dept (DeptName, RaiseFactor)
      VALUES ('Engineering', 1.2),
             ('Sales', .8),
             ('IT', 2.5),
             ('Manufacturing', 1.0);

    INSERT dbo.Employee (DeptID, LastName, FirstName,
        Salary, PerformanceRating, DateHire, DatePosition)
      VALUES
        (1, 'Smith', 'Sam', 54000, 2.0, '19970101', '19970101'),
        (1, 'Nelson', 'Slim', 78000, 1.5, '19970101', '19970101'),
        (2, 'Ball', 'Sally', 45000, 3.5, '19990202', '19990202'),
        (2, 'Kelly', 'Jeff', 85000, 2.4, '20020625', '20020625'),
        (3, 'Guelzow', 'Jo', 120000, 4.0, '19991205', '19991205'),
        (3, 'Anderson', 'Missy', 95000, 1.8, '19980201', '19980201'),
        (4, 'Reagan', 'Sam', 75000, 2.9, '20051215', '20051215'),
        (4, 'Adams', 'Hank', 34000, 3.2, '20080501', '20080501');

When developing complex queries, I work from the inside out. The first step performs the date math; it selects the data required for the raise calculation, assuming June 25, 2009, is the effective date of the raise, and ensures that the performance rating counts only if it is at least 2:

    SELECT EmployeeID, Salary,
        CAST(CAST(DATEDIFF(d, DateHire, '20090625')
          AS DECIMAL(7, 2)) / 365.25 AS INT) AS YrsCo,
        CAST(CAST(DATEDIFF(d, DatePosition, '20090625')
          AS DECIMAL(7, 2)) / 365.25 * 12 AS INT) AS MoPos,
        CASE
          WHEN Employee.PerformanceRating >= 2
            THEN Employee.PerformanceRating
          ELSE 0
        END AS Perf,
        Dept.RaiseFactor
      FROM dbo.Employee
        JOIN dbo.Dept
          ON Employee.DeptID = Dept.DeptID

Result:

    EmployeeID  Salary     YrsCo  MoPos  Perf  RaiseFactor
    ----------  ---------  -----  -----  ----  -----------
    1           54000.00   12     149    2.00  1.20
    2           78000.00   12     149    0.00  1.20
    3           45000.00   10     124    3.50  0.80
    4           85000.00   7      84     2.40  0.80
    5           120000.00  9      114    4.00  2.50
    6           95000.00   11     136    0.00  2.50
    7           75000.00   4      42     2.90  1.00
    8           34000.00   1      13     3.20  1.00

The next step in developing this query is to add the raise calculation.
The simplest way to see the calculation is to pull the values already generated from a subquery:

    SELECT EmployeeID, Salary,
        (2 + ((YearsCompany * .1) + (MonthPosition * .02)
          + (Performance * .5)) * RaiseFactor) / 100 AS EmpRaise
      FROM (SELECT EmployeeID, FirstName, LastName, Salary,
              CAST(CAST(DATEDIFF(d, DateHire, '20090625')
                AS DECIMAL(7, 2)) / 365.25 AS INT) AS YearsCompany,
              CAST(CAST(DATEDIFF(d, DatePosition, '20090625')
                AS DECIMAL(7, 2)) / 365.25 * 12 AS INT) AS MonthPosition,
              CASE
                WHEN Employee.PerformanceRating >= 2
                  THEN Employee.PerformanceRating
                ELSE 0
              END AS Performance,
              Dept.RaiseFactor
            FROM dbo.Employee
              JOIN dbo.Dept
                ON Employee.DeptID = Dept.DeptID) AS SubQuery

Result:

    EmployeeID  Salary     EmpRaise
    ----------  ---------  -----------
    1           54000.00   0.082160000
    2           78000.00   0.070160000
    3           45000.00   0.061840000
    4           85000.00   0.048640000
    5           120000.00  0.149500000
    6           95000.00   0.115500000
    7           75000.00   0.046900000
    8           34000.00   0.039600000

The last query was relatively easy to read, but there's no logical reason for the subquery. The query could be rewritten by combining the date calculations and the CASE expression into the raise formula:

    SELECT EmployeeID, Salary,
        (2
          -- years with company
          + ((CAST(CAST(DATEDIFF(d, DateHire, '20090625')
                AS DECIMAL(7, 2)) / 365.25 AS INT) * .1)
          -- months in position
          + (CAST(CAST(DATEDIFF(d, DatePosition, '20090625')
                AS DECIMAL(7, 2)) / 365.25 * 12 AS INT) * .02)
          -- Performance Rating minimum
          + (CASE
               WHEN Employee.PerformanceRating >= 2
                 THEN Employee.PerformanceRating
               ELSE 0
             END * .5))
          -- Raise Factor
          * RaiseFactor) / 100 AS EmpRaise
      FROM dbo.Employee
        JOIN dbo.Dept
          ON Employee.DeptID = Dept.DeptID

It's easy to verify that this query gets the same result, but which is the better query? From a performance perspective, both queries generate the exact same query execution plan.
When considering maintenance and readability, I'd probably go with the second query, carefully formatted and commented.

The final step is to convert the query into an UPDATE command. The hard part is already done; it just needs the UPDATE verb at the front of the query:

    UPDATE dbo.Employee
      SET Salary = Salary * (1 + ((2
          -- years with company
          + ((CAST(CAST(DATEDIFF(d, DateHire, '20090625')
                AS DECIMAL(7, 2)) / 365.25 AS INT) * .1)
          -- months in position
          + (CAST(CAST(DATEDIFF(d, DatePosition, '20090625')
                AS DECIMAL(7, 2)) / 365.25 * 12 AS INT) * .02)
          -- Performance Rating minimum
          + (CASE
               WHEN Employee.PerformanceRating >= 2
                 THEN Employee.PerformanceRating
               ELSE 0
             END * .5))
          -- Raise Factor
          * RaiseFactor) / 100))
      FROM dbo.Employee
        JOIN dbo.Dept
          ON Employee.DeptID = Dept.DeptID

A quick check of the data confirms that the update was successful:

    SELECT FirstName, LastName, Salary
      FROM dbo.Employee

Result:

    FirstName  LastName  Salary
    ---------  --------  ---------
    Sam        Smith     58436.64
    Slim       Nelson    83472.48
    Sally      Ball      47782.80
    Jeff       Kelly     89134.40
    Jo         Guelzow   137940.00
    Missy      Anderson  105972.50
    Sam        Reagan    78517.50
    Hank       Adams     35346.40

The final step of the exercise is to clean up the sample tables:

    DROP TABLE dbo.Employee, dbo.Dept;

This sample code pulls together techniques from many of the previous chapters: creating and dropping tables, CASE expressions, joins, and date scalar functions, not to mention the inserts and updates from this chapter. The example is long because it demonstrates more than just the UPDATE statement. It also shows the typical process of developing a complex UPDATE, which includes the following:

1. Checking the available data: The first SELECT joins Employee and Dept, and lists all the columns required for the formula.
2. Testing the formula: The second SELECT is based on the initial SELECT and assembles the formula from the required columns.
   From this data, a couple of rows can be hand-tested against the specs, and the formula verified.
3. Performing the update: Once the formula is constructed and verified, the formula is edited into an UPDATE statement and executed.

The SQL UPDATE command is powerful. I have replaced terribly complex record sets and nested loops that were painfully slow and error-prone with UPDATE statements and creative joins that worked well, and I have seen execution times reduced from hours to a few seconds. I cannot overemphasize the importance of approaching the selection and updating of data in terms of data sets, rather than data rows.

Deleting Data

The DELETE command is dangerously simple. In its basic form, it deletes all the rows from a table. Because the DELETE command is a row-based operation, it doesn't require specifying any column names. The first FROM is optional, as are the second FROM and the WHERE conditions. However, although the WHERE clause is optional, it is the primary subject of concern when you're using the DELETE command. Here's an abbreviated syntax for the DELETE command:

    DELETE [FROM] schema.Table
      [FROM data sources]
      [WHERE condition(s)];

Notice that everything is optional except the actual DELETE command and the table name. The following command would delete all data from the Product table, with no questions asked and no second chances:

    DELETE FROM OBXKites.dbo.Product;

SQL Server has no inherent "undo" command. Once a transaction is committed, that's it. That's why the WHERE clause is so important when you're deleting.

By far, the most common use of the DELETE command is to delete a single row.
The primary key is usually the means of selecting the row:

    USE OBXKites;
    DELETE FROM dbo.Product
      WHERE ProductID = 'DB8D8D60-76F4-46C3-90E6-A8648F63C0F0';

Referencing multiple data sources while deleting

There are two techniques for referencing multiple data sources while deleting rows: the double FROM clause and subqueries.

The UPDATE command uses the FROM clause to join the updated table with other tables for more flexible row selection. The DELETE command can use the exact same technique. When using this method, the first optional FROM can make it look confusing. To improve readability and consistency, I recommend that you omit the first FROM in your code.

For example, the following DELETE statement omits the first FROM clause and uses the second FROM clause to join Product with ProductCategory so that the WHERE clause can filter the DELETE based on the ProductCategoryName. This query removes all videos from the Product table:

    DELETE dbo.Product
      FROM dbo.Product
        JOIN dbo.ProductCategory
          ON Product.ProductCategoryID
            = ProductCategory.ProductCategoryID
      WHERE ProductCategory.ProductCategoryName = 'Video';

The second method looks more complicated at first glance, but it's ANSI standard and the preferred method. A correlated subquery actually selects the rows to be deleted, and the DELETE command just picks up those rows for the delete operation. It's a very clean query:

    DELETE FROM dbo.Product
      WHERE EXISTS
        (SELECT *
           FROM dbo.ProductCategory AS pc
           WHERE pc.ProductCategoryID = Product.ProductCategoryID
             AND pc.ProductCategoryName = 'Video');

In terms of performance, both methods generate the exact same query execution plan.

As with the UPDATE command's FROM clause, the DELETE command's second FROM clause is not an ANSI SQL standard. If portability is important to your project, then use a subquery to reference additional tables.
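Because a committed DELETE cannot be undone, one cautious way to run either form is to wrap it in an explicit transaction and inspect what was removed before committing. The following is only a sketch of that habit, not a technique taken from this chapter's sample scripts; it assumes the OBXKites Product and ProductCategory tables used above and relies on the OUTPUT clause to list the deleted keys.

```sql
USE OBXKites;

BEGIN TRANSACTION;

-- Preview: count the rows the predicate will hit before deleting anything
SELECT COUNT(*) AS RowsToDelete
  FROM dbo.Product
  WHERE EXISTS
    (SELECT *
       FROM dbo.ProductCategory AS pc
       WHERE pc.ProductCategoryID = Product.ProductCategoryID
         AND pc.ProductCategoryName = 'Video');

-- Same predicate, now as the DELETE; OUTPUT returns the keys actually removed
DELETE FROM dbo.Product
  OUTPUT deleted.ProductID
  WHERE EXISTS
    (SELECT *
       FROM dbo.ProductCategory AS pc
       WHERE pc.ProductCategoryID = Product.ProductCategoryID
         AND pc.ProductCategoryName = 'Video');

-- If the deleted keys are not what was expected, undo the work...
ROLLBACK TRANSACTION;
-- ...otherwise run COMMIT TRANSACTION instead of the ROLLBACK.
```

The rollback is the default here on purpose: the statement has to be edited deliberately before the delete becomes permanent.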
Cascading deletes

Referential integrity (RI) refers to the idea that no secondary row foreign key should point to a primary row primary key unless that primary row does in fact exist. This means that an attempt to delete a primary row will fail if a foreign-key value somewhere points to that primary row.

For more information about referential integrity and when to use it, turn to Chapter 3, "Relational Database Design," and Chapter 20, "Creating the Physical Database Schema."

When implemented correctly, referential integrity will block any delete operation that would result in a foreign key value without a corresponding primary key value. The way around this is to first delete the secondary rows that point to the primary row, and then delete the primary row. This technique is called a cascading delete. In a complex database schema, the cascade might bounce down several levels before working its way back up to the original row being deleted.

There are two ways to implement a cascading delete: manually with triggers or automatically with declared referential integrity (DRI) via foreign keys.

Implementing cascading deletes manually is a lot of work. Triggers are significantly slower than foreign keys (which are checked as part of the query execution plan), and trigger-based cascading deletes usually also handle the foreign key checks. While this was commonplace a decade ago, today trigger-based cascading deletes are very rare and might only be needed with a very complex nonstandard foreign key design that includes business rules in the foreign key. If you're doing that, then you're either very new at this or very, very good.

Fortunately, SQL Server offers cascading deletes as a function of the foreign key. Cascading deletes may be enabled via Management Studio, in the Foreign Key Relationship dialog, or in SQL code.
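For a table that already exists, the SQL route might look like the following sketch, which adds a cascading foreign key with ALTER TABLE and then shows the cascade in action. The ParentDemo and ChildDemo tables and the constraint name are hypothetical, invented purely for illustration; they are not part of any sample database in this book.

```sql
USE tempdb;

-- Hypothetical parent and child tables, for illustration only
CREATE TABLE dbo.ParentDemo (
  ParentID INT NOT NULL PRIMARY KEY
  );

CREATE TABLE dbo.ChildDemo (
  ChildID  INT NOT NULL PRIMARY KEY,
  ParentID INT NOT NULL
  );

-- Declare the foreign key after the fact, with the cascade option
ALTER TABLE dbo.ChildDemo
  ADD CONSTRAINT FK_ChildDemo_ParentDemo
    FOREIGN KEY (ParentID)
    REFERENCES dbo.ParentDemo (ParentID)
    ON DELETE CASCADE;

INSERT dbo.ParentDemo (ParentID) VALUES (1);
INSERT dbo.ChildDemo (ChildID, ParentID) VALUES (10, 1), (11, 1);

-- Deleting the parent row also removes both child rows
DELETE FROM dbo.ParentDemo WHERE ParentID = 1;

SELECT COUNT(*) AS RemainingChildren  -- expected: 0
  FROM dbo.ChildDemo;

-- Clean up, child table first because of the foreign key
DROP TABLE dbo.ChildDemo, dbo.ParentDemo;
```

Without the ON DELETE CASCADE option, the same DELETE would be blocked by the foreign key until the child rows were removed first.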
The sample script that creates the Cape Hatteras Adventures version 2 database (CHA2_Create.sql) provides a good example of setting the cascade-delete option for referential integrity. In this case, if either the event or the guide is deleted, then the rows in the event-guide many-to-many table are also deleted. The ON DELETE CASCADE foreign-key option is what actually specifies the cascade action:

    CREATE TABLE dbo.Event_mm_Guide (
      EventGuideID INT IDENTITY NOT NULL PRIMARY KEY,
      EventID INT NOT NULL
        FOREIGN KEY REFERENCES dbo.Event ON DELETE CASCADE,
      GuideID INT NOT NULL
        FOREIGN KEY REFERENCES dbo.Guide ON DELETE CASCADE,
      LastName VARCHAR(50) NOT NULL
      )
      ON [PRIMARY];

As a caution, cascading deletes, or even referential integrity, are not suitable for every relationship. It depends on the permanence of the secondary row. If deleting the primary row makes the secondary row moot or meaningless, then cascading the delete makes good sense; but if the secondary row is still a valid row after the primary row is deleted, then referential integrity and cascading deletes would cause the database to break its representation of reality.

As an example of determining the usefulness of cascading deletes from the Cape Hatteras Adventures database, consider that if a tour is deleted, then all scheduled events for that tour become meaningless, as do the many-to-many schedule tables between event and customer, and between event and guide.

Conversely, a tour must have a base camp, so referential integrity is required on the Tour.BaseCampID foreign key. However, if a base camp is deleted, then the tours originating from that base camp might still be valid (if they can be rescheduled to another base camp), so cascading a base-camp delete down to the tour is not a reasonable action.
If RI is on and cascading deletes are off, then a base camp with tours cannot be deleted until all tours for that base camp are either manually deleted or reassigned to other base camps.

Alternatives to physically deleting data

Some database developers choose to completely avoid deleting data. Instead, they build systems to remove the data from the user's view while retaining the data for safekeeping (like dBase ][ did). This can be done in several different ways:

■ A logical-delete bit flag, or nullable MomentDeleted column, in the row can indicate that the row is deleted. This makes deleting or restoring a single row a straightforward matter of setting or clearing a bit. However, because a relational database involves multiple related tables, there's more work to it than that. All queries must check the logical-delete flag and filter out logically deleted rows. This means that a bit column (with extremely poor selectivity) is probably an important index for every query. While SQL Server 2008's new filtered indexes are a perfect fit, it's still a performance killer.
■ To make matters worse, because the rows still physically exist in SQL Server, and SQL Server's declarative referential integrity does not know about the logical-delete flag, custom referential integrity and cascading of logical delete flags are also required. Restoring, or undeleting, cascaded logical deletes can become a nightmare.
■ The cascading logical deletes method is complex to code and difficult to maintain. This is a case of complexity breeding complexity, and I no longer recommend this method.
■ Another alternative to physically deleting rows is to archive the deleted rows in an archive or audit table. This method is best implemented by an INSTEAD OF trigger that copies the data to the alternative location and then physically deletes the rows from the production database.
■ This method offers several advantages.
  Data is physically removed from the database, so there's no need to artificially modify SELECT queries or index on a bit column. Physically removing the data enables SQL Server referential integrity to remain in effect. In addition, the database is not burdened with unnecessary data. Retrieving archived data remains relatively straightforward and can be easily accomplished with a view that selects data from the archive location.

Chapter 53, "Data Audit Triggers," details how to automatically generate the audit system discussed here that stores, views, and recovers deleted rows.

Merging Data

An upsert operation is a logical combination of an insert and an update. If the data isn't already in the table, the upsert inserts the data; if the data is already in the table, then the upsert updates with the differences. Ignoring for a moment the new MERGE command in SQL Server 2008, there are a few ways to code an upsert operation with T-SQL:

■ The most common method is to attempt to locate the data with an IF EXISTS; if the row was found, UPDATE, otherwise INSERT.
■ If the most common use case is that the row exists and the UPDATE was needed, then the best method is to do the update, and if @@RowCount = 0, then the row was new and the insert should be performed.
■ If the overwhelming use case is that the row would be new to the database, then TRY to INSERT the new row; if a unique index blocked the INSERT and fired an error, then CATCH the error and UPDATE instead.

All three methods are potentially obsolete with the new MERGE command. The MERGE command is very well done by Microsoft; it solves a complex problem well with a clean syntax and good performance.

First, it's called "merge" because it does more than an upsert. Upsert only inserts or updates; merge can be directed to insert, update, and delete all in one command.
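For comparison with MERGE, the three pre-MERGE upsert patterns listed above might be sketched as follows. The ItemStock table, its columns, and the variable values are hypothetical and exist only for this illustration; the error numbers tested in the CATCH block (2627 and 2601) are SQL Server's duplicate-key errors for constraints and unique indexes.

```sql
USE tempdb;

-- Hypothetical table and values, for illustration only
CREATE TABLE dbo.ItemStock (
  ItemCode VARCHAR(10) NOT NULL PRIMARY KEY,
  Qty      INT NOT NULL
  );

DECLARE @ItemCode VARCHAR(10) = 'KITE01',
        @Qty      INT = 5;

-- Pattern 1: IF EXISTS, then UPDATE, otherwise INSERT
IF EXISTS (SELECT * FROM dbo.ItemStock WHERE ItemCode = @ItemCode)
  BEGIN
    UPDATE dbo.ItemStock SET Qty = @Qty WHERE ItemCode = @ItemCode;
  END
ELSE
  BEGIN
    INSERT dbo.ItemStock (ItemCode, Qty) VALUES (@ItemCode, @Qty);
  END;

-- Pattern 2: UPDATE first; if no row was touched, INSERT
UPDATE dbo.ItemStock SET Qty = @Qty WHERE ItemCode = @ItemCode;
IF @@ROWCOUNT = 0
  BEGIN
    INSERT dbo.ItemStock (ItemCode, Qty) VALUES (@ItemCode, @Qty);
  END;

-- Pattern 3: TRY the INSERT; on a duplicate-key error, UPDATE instead
BEGIN TRY
  INSERT dbo.ItemStock (ItemCode, Qty) VALUES (@ItemCode, @Qty);
END TRY
BEGIN CATCH
  IF ERROR_NUMBER() IN (2627, 2601)  -- duplicate key
    BEGIN
      UPDATE dbo.ItemStock SET Qty = @Qty WHERE ItemCode = @ItemCode;
    END
  ELSE
    BEGIN
      -- Any other error is re-raised rather than swallowed
      DECLARE @ErrMsg NVARCHAR(2048) = ERROR_MESSAGE();
      RAISERROR(N'%s', 16, 1, @ErrMsg);
    END;
END CATCH;

DROP TABLE dbo.ItemStock;
```

None of these patterns is safe under concurrency on its own: between the existence check and the write, another session can insert the same key, so production code would typically add a transaction with appropriate locking.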
In a nutshell, MERGE sets up a join between the source table and the target table, and can then perform operations based on matches between the two tables.

To walk through a merge scenario, the following example sets up an airline flight check-in scenario. The main work table is FlightPassengers, which holds data about reservations. It's updated as travelers check in, and by the time the flight takes off, it has the actual final passenger list and seat assignments. In the sample scenario, four passengers are scheduled to fly SQL Server Airlines flight 2008 (Denver to Seattle) on March 1, 2009. Poor Jerry, he has a middle seat on the last row of the plane, the row that doesn't recline:

    USE tempdb;

    -- Merge Target Table
    CREATE TABLE FlightPassengers (
      FlightID INT NOT NULL IDENTITY PRIMARY KEY,
      LastName VARCHAR(50) NOT NULL,
      FirstName VARCHAR(50) NOT NULL,
      FlightCode CHAR(6) NOT NULL,
      FlightDate DATE NOT NULL,
      Seat CHAR(3) NOT NULL
      );

    INSERT FlightPassengers
        (LastName, FirstName, FlightCode, FlightDate, Seat)
      VALUES ('Nielsen', 'Paul', 'SS2008', '20090301', '9F'),
             ('Jenkins', 'Sue', 'SS2008', '20090301', '7A'),
             ('Smith', 'Sam', 'SS2008', '20090301', '19A'),
             ('Nixon', 'Jerry', 'SS2008', '20090301', '29B');

The day of the flight, the check-in counter records all the passengers as they arrive, and their seat assignments, in the CheckIn table.
One passenger doesn't show, a new passenger buys a ticket, and Jerry decides today is a good day to burn an upgrade coupon:

    -- Merge Source table
    CREATE TABLE CheckIn (
      LastName VARCHAR(50),
      FirstName VARCHAR(50),
      FlightCode CHAR(6),
      FlightDate DATE,
      Seat CHAR(3)
      );

    INSERT CheckIn
        (LastName, FirstName, FlightCode, FlightDate, Seat)
      VALUES ('Nielsen', 'Paul', 'SS2008', '20090301', '9F'),
             ('Jenkins', 'Sue', 'SS2008', '20090301', '7A'),
             ('Nixon', 'Jerry', 'SS2008', '20090301', '2A'),
             ('Anderson', 'Missy', 'SS2008', '20090301', '4B');

Before the MERGE command is executed, the next three queries look for differences in the data.

The first set-difference query returns any no-show passengers. A LEFT OUTER JOIN between the FlightPassengers and CheckIn tables finds every passenger with a reservation joined with their CheckIn row if the row is available. If no CheckIn row is found, then the LEFT OUTER JOIN fills in the CheckIn columns with nulls. Filtering for the null returns only those passengers who made a reservation but didn't make the flight:

    -- NoShows
    SELECT F.FirstName + ' ' + F.LastName AS Passenger, F.Seat
      FROM FlightPassengers AS F
        LEFT OUTER JOIN CheckIn AS C
          ON C.LastName = F.LastName
            AND C.FirstName = F.FirstName
            AND C.FlightCode = F.FlightCode
            AND C.FlightDate = F.FlightDate
      WHERE C.LastName IS NULL

Result:

    Passenger  Seat
    ---------  ----
    Sam Smith  19A

The walk-up check-in query uses a LEFT OUTER JOIN and an IS NULL in the WHERE clause to locate any passengers who are in the CheckIn table but not in the FlightPassengers table:

    -- Walk Up CheckIn
    SELECT C.FirstName + ' ' + C.LastName AS Passenger, C.Seat
      FROM CheckIn AS C
        LEFT OUTER JOIN FlightPassengers AS F
          ON C.LastName = F.LastName
            AND C.FirstName = F.FirstName
            AND C.FlightCode = F.FlightCode
            AND C.FlightDate = F.FlightDate
      WHERE F.LastName IS NULL

Result:

    Passenger       Seat
    --------------  ----
    Missy Anderson  4B

The last difference query lists any seat changes,
including Jerry's upgrade to first class. This query uses an inner join because it's searching for passengers who both had previous seat assignments and now are boarding with a seat assignment. The query compares the seat columns from the FlightPassengers and CheckIn tables using a not-equal comparison, which finds any passengers with a different seat than previously assigned. Go Jerry!

    -- Seat Changes
    SELECT C.FirstName + ' ' + C.LastName AS Passenger,
        F.Seat AS 'previous seat', C.Seat AS 'final seat'
      FROM CheckIn AS C
        INNER JOIN FlightPassengers AS F
          ON C.LastName = F.LastName
            AND C.FirstName = F.FirstName
            AND C.FlightCode = F.FlightCode
            AND C.FlightDate = F.FlightDate
            AND C.Seat <> F.Seat
      WHERE F.Seat IS NOT NULL

Result:

    Passenger    previous seat  final seat
    -----------  -------------  ----------
    Jerry Nixon  29B            2A

For another explanation of set difference queries, flip over to Chapter 10, "Merging Data with Joins and Unions."

With the scenario's data in place and verified with set-difference queries, it's time to merge the check-in data into the FlightPassengers table.
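As a rough preview of the shape such a statement can take, here is a minimal sketch of a MERGE over the two sample tables. It is an illustration under assumptions rather than the chapter's own listing: it treats the four name and flight columns as the match key, exactly as the set-difference queries do, and it assumes FlightPassengers holds only this one flight, because the NOT MATCHED BY SOURCE branch deletes every target row that has no check-in row.

```sql
USE tempdb;

MERGE FlightPassengers AS F        -- target: the reservation list
  USING CheckIn AS C               -- source: the check-in counter's data
    ON  C.LastName   = F.LastName
    AND C.FirstName  = F.FirstName
    AND C.FlightCode = F.FlightCode
    AND C.FlightDate = F.FlightDate
  -- Passenger checked in with a different seat: record the seat change
  WHEN MATCHED AND F.Seat <> C.Seat
    THEN UPDATE SET Seat = C.Seat
  -- Walk-up passenger: in CheckIn but not yet in FlightPassengers
  WHEN NOT MATCHED BY TARGET
    THEN INSERT (LastName, FirstName, FlightCode, FlightDate, Seat)
         VALUES (C.LastName, C.FirstName, C.FlightCode,
                 C.FlightDate, C.Seat)
  -- No-show: reserved but never checked in
  -- (assumes the target table contains only this flight)
  WHEN NOT MATCHED BY SOURCE
    THEN DELETE
  -- Report what the merge did to each affected row
  OUTPUT $action AS MergeAction,
         deleted.Seat AS PreviousSeat,
         inserted.Seat AS FinalSeat;
```

Each WHEN branch corresponds to one of the three set-difference queries: the seat-change query to WHEN MATCHED, the walk-up query to NOT MATCHED BY TARGET, and the no-show query to NOT MATCHED BY SOURCE.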