CHAPTER 42 What’s New for Transact-SQL in SQL Server 2008

For typical ETL-type applications, querying for change data is an ongoing process: the application makes periodic requests for all the changes that have occurred since the last request and that need to be applied to the target. For these types of queries, you can use the sys.fn_cdc_increment_lsn function to determine the next lowest LSN boundary that is greater than the max LSN boundary of the previous query. To demonstrate this, let’s first execute some additional data modifications against the MyCustomer table:

    INSERT MyCustomer (PersonID, StoreID, TerritoryID, AccountNumber, rowguid, ModifiedDate)
    VALUES (20779, NULL, 12,
            'AW' + RIGHT('00000000' + CONVERT(varchar(8), IDENT_CURRENT('MyCustomer')), 8),
            NEWID(), GETDATE())

    DELETE MyCustomer WHERE CustomerID = 30119

The max LSN from the previous examples is 0x000000390000144C0004. We want to increment from this LSN to find the next set of changes. In Listing 42.24, you pass this value to the sys.fn_cdc_increment_lsn function to set the min LSN value that you’ll use as the lower bound with the cdc.fn_cdc_get_net_changes_dbo_MyCustomer function.

LISTING 42.24 Using sys.fn_cdc_increment_lsn to Return the Net Changes to the MyCustomer CDC Capture Table Since the Last Retrieval

    -- declare variables to represent beginning and ending lsn
    DECLARE @from_lsn BINARY(10), @to_lsn BINARY(10)

    -- get the next lowest LSN after the previous max LSN
    SELECT @from_lsn = sys.fn_cdc_increment_lsn(0x000000390000144C0004)

    -- get the last LSN for table changes
    SELECT @to_lsn = sys.fn_cdc_get_max_lsn()

    -- get all changes in the range using the 'all with merge' parameter
    SELECT *
    FROM cdc.fn_cdc_get_net_changes_dbo_MyCustomer (@from_lsn, @to_lsn, 'all with merge');
    GO

    __$start_lsn            __$operation  __$update_mask  CustomerID  PersonID  StoreID  TerritoryID
        AccountNumber  rowguid                               ModifiedDate
    0x00000039000017D30004  5             NULL            30120       20779     NULL     12
        AW00030120     CE8BBAA1-04C0-4A81-9A7E-85B4EDB5C36D  2010-04-27 23:52:36.477
    0x00000039000017E50004  1             NULL            30119       20778     NULL     3
        AW00030119     2385A86E-6FD2-4815-8BFE-B3F4DF4AEA74  2010-04-27 22:38:48.263

If you want to retrieve the changes captured during a specific time period, you can use the sys.fn_cdc_map_time_to_lsn function, as shown in Listing 42.25.

LISTING 42.25 Retrieving all Changes to MyCustomer During a Specific Time Period

    DECLARE @begin_time datetime,
            @end_time datetime,
            @begin_lsn binary(10),
            @end_lsn binary(10);

    SET @begin_time = '2010-04-27 22:38:48.250'
    SET @end_time = '2010-04-27 23:52:36.500'

    SELECT @begin_lsn = sys.fn_cdc_map_time_to_lsn('smallest greater than', @begin_time);
    SELECT @end_lsn = sys.fn_cdc_map_time_to_lsn('largest less than or equal', @end_time);

    SELECT *
    FROM cdc.fn_cdc_get_net_changes_dbo_MyCustomer (@begin_lsn, @end_lsn, 'all');
    GO

    __$start_lsn            __$operation  __$update_mask  CustomerID  PersonID  StoreID  TerritoryID
        AccountNumber  rowguid                               ModifiedDate
    0x000000390000144C0004  4             NULL            30119       20778     NULL     3
        AW00030119     2385A86E-6FD2-4815-8BFE-B3F4DF4AEA74  2010-04-27 22:38:48.263
    0x00000039000017D30004  2             NULL            30120       20779     NULL     12
        AW00030120     CE8BBAA1-04C0-4A81-9A7E-85B4EDB5C36D  2010-04-27 23:52:36.477

CDC and DDL Changes to Source Tables

One of the common challenges when capturing data changes from your source tables is how to handle DDL changes to the source tables. This can be an issue if the downstream consumer of the changes has not reflected the same DDL changes for its destination tables.

Enabling Change Data Capture on a source table in SQL Server 2008 does not prevent DDL changes from occurring. However, Change Data Capture does help to mitigate the effect on the downstream consumers by allowing the delivered result sets that are returned from the CDC capture tables to remain unchanged even as the column structure of the underlying source table changes. Essentially, the capture process responsible for populating the change table ignores any new columns not present when the source table was enabled for Change Data Capture. If a tracked column is dropped, NULL values are supplied for the column in the subsequent change entries.
However, if the data type of a tracked column is modified, the data type change is also propagated to the change table to ensure that the capture mechanism does not introduce data loss in tracked columns as a result of mismatched data types. When a column is modified, the capture process posts any detected changes to the cdc.ddl_history table. Downstream consumers of the change data that need to be alerted to the column changes (and make similar adjustments to the destination tables) can use the stored procedure sys.sp_cdc_get_ddl_history to identify any modifications to the source table columns.

So how do you modify the capture instance to recognize any added or dropped columns in the source table? Unfortunately, the only way to do this is to disable CDC on the table and re-enable it. However, in an active source environment where it’s not possible to suspend processing while CDC is being disabled and re-enabled, change data can be lost in the interval between the two operations.

Fortunately, CDC allows two capture instances to be associated with a single source table. This makes it possible to create a second capture instance for the table that reflects the new column structure. The capture process then captures changes to the same source table into two distinct change tables having two different column structures. While the original change table continues to feed current operational programs, the new change table feeds environments that have been modified to incorporate the new column data. Allowing the capture mechanism to populate both change tables in tandem provides a mechanism for smoothly transitioning from one table structure to the other without any loss of change data. When the transition to the new table structure has been fully effected, the obsolete capture instance can be removed.

Change Tracking

In addition to Change Data Capture, SQL Server 2008 also introduces Change Tracking.
Change Tracking is a lightweight solution that provides an efficient change tracking mechanism for applications. Although they are similar in name, the purposes of Change Tracking and Change Data Capture are different.

Change Data Capture is an asynchronous mechanism that uses the transaction log to record all the changes to a data row and store them in change tables. All intermediate versions of a row are available in the change tables. The information captured is stored in a relational format that can be queried by client applications such as ETL processes.

Change Tracking, in contrast, is a synchronous mechanism that tracks modifications to a table but stores only the fact that a row has been modified and when. It does not keep track of how many times the row has changed or the values of any of the intermediate changes. However, because the mechanism records that a row has changed, you can check whether data has changed and obtain the latest version of the row directly from the table itself rather than querying a change capture table.

NOTE
Unlike Change Data Capture, which is available only in the Enterprise, Datacenter, and Developer Editions of SQL Server, Change Tracking is available in all editions.

Change Tracking operates by using tracking tables that store a primary key and version number for each row in a table that has been enabled for Change Tracking. Applications can then check whether a row has changed by looking up the row in the tracking table by its primary key and seeing whether the version number is different from when the row was first retrieved.

One of the common uses of Change Tracking is for applications that have to synchronize data with SQL Server. Change Tracking can be used as a foundation for both one-way and two-way synchronization applications.

One-way synchronization applications, such as a client or mid-tier caching application, can be built to use Change Tracking.
The caching application, which requires data from a SQL Server database to be cached in other data stores, can use Change Tracking to determine when changes have been made to the database tables and then refresh the cache store by retrieving data from the modified rows only.

Two-way synchronization applications can also be built to use Change Tracking. A typical example of a two-way synchronization application is the occasionally connected application, such as a sales application that runs on a laptop and is disconnected from the central SQL Server database while the salesperson is out in the field. Initially, the client application queries and updates its local data store from the SQL Server database. When it reconnects with the database later, the application synchronizes with the database, and data changes flow from the laptop to the database and from the database to the laptop.

Because data changes happen in both locations while the client application is disconnected, the two-way synchronization application must be able to detect conflicts. A conflict occurs if the same data is changed in both data stores in the time between synchronizations. The client application can use Change Tracking to detect conflicts by identifying rows whose version number has changed since the last synchronization. The application can implement a mechanism to resolve the conflicts so that the data changes are not lost.

Implementing Change Tracking

To use Change Tracking, you must first enable it for the database and then enable it at the table level for any tables for which you want to track changes. Change Tracking can be enabled via T-SQL statements or through SQL Server Management Studio.

To enable Change Tracking for a database in SSMS, right-click the database in Object Explorer, bring up the Properties dialog, and select the Change Tracking page.
To enable Change Tracking, set the Change Tracking option to True (see Figure 42.6). Also on this page, you can configure the retention period for how long SQL Server retains the Change Tracking information for each data row and whether to automatically clean up the Change Tracking information when the retention period has been exceeded.

FIGURE 42.6 Enabling Change Tracking for a database.

Change Tracking can also be enabled with the ALTER DATABASE command:

    ALTER DATABASE AdventureWorks2008R2
    SET CHANGE_TRACKING = ON
    (CHANGE_RETENTION = 2 DAYS, AUTO_CLEANUP = ON)

After enabling Change Tracking at the database level, you can then enable Change Tracking for the tables for which you want to track changes. To enable Change Tracking for a table in SSMS, right-click the table in Object Explorer, bring up the Properties dialog, and select the Change Tracking page. Set the Change Tracking option to True to enable Change Tracking (see Figure 42.7). The TRACK_COLUMNS_UPDATED option specifies whether SQL Server should store in the internal Change Tracking table any extra information about which specific columns were updated. Column tracking allows an application to synchronize only when specific columns are updated. This capability can improve the efficiency and performance of the synchronization process, but at the cost of additional storage overhead. This option is set to OFF by default.

FIGURE 42.7 Enabling Change Tracking for a table.

Change Tracking can also be enabled via T-SQL with the ALTER TABLE command:

    USE [AdventureWorks2008R2]
    GO
    ALTER TABLE [dbo].[MyCustomer]
    ENABLE CHANGE_TRACKING
    WITH (TRACK_COLUMNS_UPDATED = ON)

TIP
To determine which tables and databases have Change Tracking enabled, you can use the sys.change_tracking_databases and sys.change_tracking_tables catalog views.

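As a minimal sketch of how the two catalog views named in the tip might be queried, the following statements list the Change Tracking settings for each enabled database and for each enabled table in the current database. The column aliases are chosen here for readability and are not part of the views.

```sql
-- Databases that have Change Tracking enabled, with their retention settings.
SELECT DB_NAME(database_id)        AS database_name,
       is_auto_cleanup_on,
       retention_period,
       retention_period_units_desc
FROM sys.change_tracking_databases;

-- Tables in the current database that have Change Tracking enabled.
SELECT OBJECT_SCHEMA_NAME(object_id) AS schema_name,
       OBJECT_NAME(object_id)        AS table_name,
       is_track_columns_updated_on,
       min_valid_version
FROM sys.change_tracking_tables;
```

A table that appears in the second result set with is_track_columns_updated_on set to 1 was enabled with TRACK_COLUMNS_UPDATED = ON.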
Identifying Tracked Changes

After Change Tracking is enabled for a table, any data modification statements that affect rows in the table cause Change Tracking information for each modified row to be recorded. To query for the rows that have changed and to obtain information about the changes, you can use the built-in Change Tracking functions.

Only the values of the primary key column are recorded with the change information to allow you to identify the rows that have been changed; if you enabled the TRACK_COLUMNS_UPDATED option, information about which columns were updated is recorded as well. To identify the changed rows, use the CHANGETABLE(CHANGES ...) Change Tracking function. The CHANGETABLE(CHANGES ...) function takes two parameters: the first is the table name, and the second is the last synchronization version number.

If you pass 0 for the last synchronization version parameter, you get a list of all the rows that have been modified since version 0, which means all the changes to the table since first enabling Change Tracking. Typically, however, you do not want all the rows that have changed from the beginning of Change Tracking, but only those rows that have changed since the last time you retrieved the changed rows. Rather than having to keep track of the version numbers yourself, you can use the CHANGE_TRACKING_CURRENT_VERSION() function to obtain the current version that will be used the next time you query for changes. The version returned represents the version of the last committed transaction.

Before an application can obtain changes for the first time, it must first execute a query to obtain the initial data from the table and a query to retrieve the initial synchronization version using the CHANGE_TRACKING_CURRENT_VERSION() function. The version number that is retrieved is passed to the CHANGETABLE(CHANGES ...) function the next time it is invoked.

The following example illustrates how to obtain the initial synchronization version and initial data set:

    USE AdventureWorks2008R2
    GO
    DECLARE @synchronization_version bigint

    -- Obtain the initial synchronization version.
    SET @synchronization_version = CHANGE_TRACKING_CURRENT_VERSION()
    SELECT change_tracking_version = @synchronization_version

    -- Obtain initial data set.
    SELECT CustomerID, TerritoryID
    FROM MyCustomer
    WHERE CustomerID <= 5
    GO

    change_tracking_version
    0

    CustomerID  TerritoryID
    1           1
    2           1
    3           4
    4           4
    5           4

As you can see, because no updates have been performed since Change Tracking was enabled, the initial version is 0.

Now let’s perform some updates on these rows to effect some changes:

    UPDATE MyCustomer SET TerritoryID = 5 WHERE CustomerID = 4
    UPDATE MyCustomer SET TerritoryID = 4 WHERE CustomerID = 5

Now you can use the CHANGETABLE(CHANGES ...) function to find the rows that have changed since the last version (0):

    DECLARE @last_synchronization_version bigint
    SET @last_synchronization_version = 0

    SELECT CT.CustomerID AS CustID,
           CT.SYS_CHANGE_OPERATION,
           CT.SYS_CHANGE_COLUMNS,
           CT.SYS_CHANGE_CONTEXT
    FROM CHANGETABLE(CHANGES MyCustomer, @last_synchronization_version) AS CT
    GO

    CustID  SYS_CHANGE_OPERATION  SYS_CHANGE_COLUMNS  SYS_CHANGE_CONTEXT
    4       U                     0x0000000004000000  NULL
    5       U                     0x0000000004000000  NULL

You can see in these results that this query returns the CustomerIDs of the two rows that were changed. However, most applications also want the data from these rows. To return the data, you can join the results from CHANGETABLE(CHANGES ...) with the data in the user table. For example, the following query joins with the MyCustomer table to obtain the values for the PersonID, StoreID, and TerritoryID columns. Note that the query uses an OUTER JOIN to make sure that the change information is returned for any rows that may have been deleted from the user table.
At the same time that you retrieve the data rows, you also want to retrieve the current version to use the next time the application comes back to retrieve the latest changes:

    DECLARE @last_synchronization_version bigint
    SET @last_synchronization_version = 0

    SELECT current_version = CHANGE_TRACKING_CURRENT_VERSION()

    SELECT CT.CustomerID AS CustID,
           C.PersonID,
           C.StoreID,
           C.TerritoryID,
           CT.SYS_CHANGE_OPERATION,
           CT.SYS_CHANGE_COLUMNS,
           CT.SYS_CHANGE_CONTEXT
    FROM MyCustomer C
    RIGHT OUTER JOIN CHANGETABLE(CHANGES MyCustomer, @last_synchronization_version) AS CT
        ON C.CustomerID = CT.CustomerID
    GO

    current_version
    2

    CustID  PersonID  StoreID  TerritoryID  SYS_CHANGE_OPERATION  SYS_CHANGE_COLUMNS  SYS_CHANGE_CONTEXT
    4       NULL      932      5            U                     0x0000000004000000  NULL
    5       NULL      1026     4            U                     0x0000000004000000  NULL

You can see in the output from this query that the current version is now 2. The next time the application issues a query to identify the rows that have been changed since this query, it will pass the value of 2 as the @last_synchronization_version to the CHANGETABLE(CHANGES ...) function.

CAUTION
The version number is NOT specific to a table or user session. The Change Tracking version number is maintained across the entire database for all users and change tracked tables. Whenever a data modification is performed by any user on any table that has Change Tracking enabled, the version number is incremented.
For example, immediately after running an update on change tracked table A in the current application and incrementing the version to 3, another application could run an update on change tracked table B and increment the version to 4, and so on. This is why you should always capture the current version number whenever you are retrieving the latest set of changes from the change tracked tables.

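Because the version number can move between two statements, one way to keep the captured version and the retrieved changes consistent is to read both inside a single snapshot transaction. The following is only a sketch under stated assumptions: it assumes snapshot isolation has been allowed on the database (ALLOW_SNAPSHOT_ISOLATION ON), which the examples in this chapter do not set up, and the variable @next_synchronization_version is a name introduced here for illustration.

```sql
-- Assumption: snapshot isolation has already been allowed on the database, e.g.
--   ALTER DATABASE AdventureWorks2008R2 SET ALLOW_SNAPSHOT_ISOLATION ON
DECLARE @last_synchronization_version bigint = 2;  -- value saved by the previous sync
DECLARE @next_synchronization_version bigint;      -- illustrative name, saved for the next sync

SET TRANSACTION ISOLATION LEVEL SNAPSHOT;
BEGIN TRANSACTION;

    -- Capture the version that the next synchronization should start from.
    SET @next_synchronization_version = CHANGE_TRACKING_CURRENT_VERSION();

    -- Retrieve the changes made since the previous synchronization.
    SELECT CT.CustomerID,
           C.TerritoryID,
           CT.SYS_CHANGE_OPERATION,
           CT.SYS_CHANGE_VERSION
    FROM CHANGETABLE(CHANGES MyCustomer, @last_synchronization_version) AS CT
    LEFT OUTER JOIN MyCustomer AS C
        ON C.CustomerID = CT.CustomerID;

COMMIT TRANSACTION;

-- The application stores this value and passes it in on the next call.
SELECT @next_synchronization_version AS next_synchronization_version;
```

Both reads see the same committed state of the database, so a modification made by another session during the batch shows up in the next synchronization rather than being missed.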
If an application has not synchronized with the database in a while, the stored version number may no longer be valid if the Change Tracking retention period has expired for any row modifications that have occurred since that version. To validate the version number, you can use the CHANGE_TRACKING_MIN_VALID_VERSION() function. This function returns the minimum valid version that a client can have and still obtain valid results from CHANGETABLE(). Your client applications should check the last synchronization version obtained against the value returned by this function. If the last synchronization version is less than the version returned by this function, that version is invalid, and the client application has to reinitialize all the data rows from the table. The following T-SQL code snippet can be used to validate the last_synchronization_version:

    -- Check individual table.
    IF (@last_synchronization_version <
        CHANGE_TRACKING_MIN_VALID_VERSION(OBJECT_ID('MyCustomer')))
    BEGIN
        -- Handle invalid version and do not enumerate changes.
        -- Client must be reinitialized.
        -- (The PRINT is a placeholder so that the block contains a statement.)
        PRINT 'Last synchronization version is no longer valid; reinitialize the client.'
    END

Identifying Changed Columns

In addition to information about which rows were changed and the operation that caused the change (insert, update, or delete, reported as I, U, or D in SYS_CHANGE_OPERATION), the CHANGETABLE(CHANGES ...) function also provides information on which columns were modified if you enabled the TRACK_COLUMNS_UPDATED option. You can use this information to determine whether any action is needed in your client application based on which columns changed.

To identify whether a specific column has changed, you can use the CHANGE_TRACKING_IS_COLUMN_IN_MASK (column_id, change_columns) function.
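As a minimal sketch of the CHANGE_TRACKING_IS_COLUMN_IN_MASK function, the following query reports, for each updated row in MyCustomer, whether the TerritoryID column was among the columns changed. It assumes the table was enabled with TRACK_COLUMNS_UPDATED = ON, and the choice of TerritoryID and the TerritoryID_changed alias are for illustration only. The column ID is looked up with COLUMNPROPERTY rather than being hard-coded.

```sql
DECLARE @last_synchronization_version bigint = 0;

SELECT CT.CustomerID,
       CT.SYS_CHANGE_OPERATION,
       -- 1 if TerritoryID is flagged in the change mask, otherwise 0
       CHANGE_TRACKING_IS_COLUMN_IN_MASK(
           COLUMNPROPERTY(OBJECT_ID('MyCustomer'), 'TerritoryID', 'ColumnId'),
           CT.SYS_CHANGE_COLUMNS) AS TerritoryID_changed
FROM CHANGETABLE(CHANGES MyCustomer, @last_synchronization_version) AS CT
WHERE CT.SYS_CHANGE_OPERATION = 'U';
```

The filter on SYS_CHANGE_OPERATION restricts the check to updates, because the column mask is meaningful only for rows reported with the U operation.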