Hands-On Microsoft SQL Server 2008 Integration Services

backups and reduce the size of the backup file. Because SQL Server backup compression compresses the backup, device I/O decreases due to the reduced size, but CPU usage increases due to the compression overhead. On balance, the reduced size and I/O usually make backups much faster. If you find that CPU usage has increased to the level that it starts affecting other applications, you can create a low-priority compressed backup in a session whose CPU usage is limited by the Resource Governor. The Resource Governor is another new component, released in SQL Server 2008 Enterprise Edition, that enables you to manage SQL Server workload and resources by specifying limits on resource consumption by incoming requests. Refer to Books Online for more details on the Resource Governor.

Though backup compression is an Enterprise Edition feature, you can restore compressed backups on any edition of SQL Server 2008 or later. There are some restrictions that you should be aware of. For instance, you cannot mix compressed and uncompressed backups in the same media set, and you don't get much compression when the database uses transparent data encryption, because encrypted data compresses poorly; hence the use of both together is not recommended.

You can use backup compression in large data warehouse implementations to achieve the following goals:

- Reduce the size of SQL Server backups by roughly 50 percent or more.
- Save hard disk space to keep backups online.
- Be able to keep more copies of backups online for the same storage space.
- Reduce the time required to back up or restore a database.

Backup compression configuration is specified at the server level but can be overridden for individual backups. The default setting is off for the server.
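As a minimal sketch of the server-level default and the per-backup override discussed in this section, the following T-SQL may help. The database name SalesDW and the file paths are hypothetical placeholders, not names from this book; adjust them to your environment.

```sql
-- Turn on backup compression as the server-wide default (it is off by default).
EXEC sp_configure 'backup compression default', 1;
RECONFIGURE;

-- One-off override: compress this backup regardless of the server default.
-- SalesDW and the path are hypothetical placeholders.
BACKUP DATABASE SalesDW
TO DISK = N'D:\Backups\SalesDW_compressed.bak'
WITH COMPRESSION, INIT;

-- One-off override in the other direction: do not compress this backup.
-- A separate file is used because compressed and uncompressed backups
-- cannot share a media set.
BACKUP DATABASE SalesDW
TO DISK = N'D:\Backups\SalesDW_uncompressed.bak'
WITH NO_COMPRESSION, INIT;
```

Writing the two backups to separate files keeps each media set either fully compressed or fully uncompressed, in line with the restriction noted above.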
You can change the server-level default in SSMS by checking the Compress Backup option on the Database Settings page of Server Properties, or by using the sp_configure stored procedure to set the backup compression default value and then executing the RECONFIGURE statement. You can override the server-level setting for one-off or single backups by using one of the following methods:

- Specify the Compress Backup or Do Not Compress Backup option in the Set Backup Compression field while using the Back Up Database task in an SSIS package. Refer to the bottom of Figure 5-30 in Chapter 5 to see this option.
- Specify the Compress Backup or Do Not Compress Backup option on the Options page of the Back Up Database dialog box while backing up a database in SSMS.
- Specify the WITH NO_COMPRESSION or WITH COMPRESSION switch in the BACKUP statement.

Chapter 12: Data Warehousing and SQL Server 2008 Enhancements

MERGE Statement

SQL Server 2008 includes this new T-SQL statement, which can perform multiple DML operations (INSERT, UPDATE, and DELETE) on a table or view in a single statement. Because the operations are applied in one statement, they execute as a single atomic operation. Using a MERGE statement, you join a source table with a target table or view and then perform multiple DML operations against the target table based on the results of that join. Without MERGE, you would typically apply these operations individually in a batch or a stored procedure, which means the source and the target tables are evaluated and processed multiple times, once for each operation. With a MERGE statement this evaluation happens only once for all the operations, which makes it more efficient than applying the DML operations individually. The MERGE statement specifies the target table in the MERGE INTO clause and the source table in the USING clause.
The statement then uses the ON clause to specify the join condition for the source and target tables. This condition determines which rows in the source table have a match in the target table, which do not, and which rows in the target table have no match in the source table. For better performance, it is recommended that you have at least unique indexes on the join columns in both the source and target tables. After the match cases are evaluated, you can use any or all of the three WHEN clauses to perform a specific DML action on a given row. The clauses are:

- WHEN MATCHED THEN: You can UPDATE or DELETE the given row in the target table for every row that exists in both the target table and the source table.
- WHEN NOT MATCHED [BY TARGET] THEN: You can INSERT the given row in the target table for every row that exists in the source table, but not in the target table.
- WHEN NOT MATCHED BY SOURCE THEN: You can UPDATE or DELETE the given row in the target table for every row that exists in the target table, but not in the source table.

You can also specify a search condition with each of the WHEN clauses to choose the rows to which the DML operation applies. The MERGE statement also supports the OUTPUT clause, enabling you to return attributes from the modified rows. The OUTPUT clause includes a virtual column called $action, which returns the action that modified the row: INSERT, UPDATE, or DELETE.

You can embed a MERGE statement inside the Execute SQL task to use it in your SSIS package. Typically, you would stage data to a staging table for change capture before loading it into the data warehouse with an Integration Services package. Such an SSIS package would include a Lookup transformation to identify whether a row is a new row or an update, a SQL Server Destination to INSERT new rows, and OLE DB Command transformations to perform UPDATE and DELETE operations.
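The three WHEN clauses and the OUTPUT clause described in this section can be sketched in one statement. This is an illustrative example only: the tables dbo.Customer (target) and dbo.CustomerStage (source), and their columns, are hypothetical, and the change test assumes the compared columns do not allow NULLs.

```sql
-- Hypothetical tables: dbo.Customer (target) and dbo.CustomerStage (source),
-- both keyed on CustomerID. Name and City are assumed to be NOT NULL.
MERGE INTO dbo.Customer AS TGT
USING dbo.CustomerStage AS SRC
    ON TGT.CustomerID = SRC.CustomerID
-- Row exists on both sides and something changed: update it.
WHEN MATCHED AND (TGT.Name <> SRC.Name OR TGT.City <> SRC.City) THEN
    UPDATE SET TGT.Name = SRC.Name,
               TGT.City = SRC.City
-- Row exists only in the source: insert it into the target.
WHEN NOT MATCHED BY TARGET THEN
    INSERT (CustomerID, Name, City)
    VALUES (SRC.CustomerID, SRC.Name, SRC.City)
-- Row exists only in the target: delete it.
WHEN NOT MATCHED BY SOURCE THEN
    DELETE
-- Report the action taken for each affected row.
OUTPUT $action             AS MergeAction,
       inserted.CustomerID AS NewCustomerID,
       deleted.CustomerID  AS OldCustomerID;
```

The search condition on WHEN MATCHED keeps unchanged rows out of the UPDATE, so the OUTPUT clause reports only rows that were actually modified.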
The Lookup transformation and the OLE DB Command transformation work on a row-by-row basis; thus they perform at a speed that can't match the set-based operation of a MERGE statement. If you stage data for change capture and use it as the source table for a MERGE statement, you will typically find that the data is loaded at a far better rate than with SSIS, with or without staging data. The gain is most pronounced on servers where the Lookup is working against large data sets and is running short of memory.

You can also replace a Slowly Changing Dimension (SCD) transformation with a MERGE statement in some cases. Again, because MERGE evaluates the source and the target data only once, it performs much better than the SCD transformation, which has been recognized as a performance pain point in SSIS.

In the following example code, DimProduct is updated with the changes received in the ProductChanges table. The changes might include price changes for some existing products that need to be updated and some new products that need to be added. Treating the changes as Type 2, the IsCurrent flag has to be reset to 0 for the updated products, and new rows for them need to be inserted along with the new products. Look at the inner query, where the MERGE statement sets the IsCurrent flag to 0 under the WHEN MATCHED clause and adds a new row under the WHEN NOT MATCHED clause for the new products. Finally, it outputs the affected rows with the action taken for each, and the outer query inserts the rows that have been updated back into the DimProduct table as new current rows.

    -- Outer INSERT: add a new current row for every product the MERGE updated.
    INSERT INTO DimProduct (ProductID, ProductName, Price, IsCurrent)
    SELECT ProductID, ProductName, Price, 1
    FROM (
        MERGE DimProduct AS TGT
        USING ProductChanges AS SRC
            ON (TGT.ProductID = SRC.ProductID AND TGT.IsCurrent = 1)
        -- Existing product: expire the current row.
        WHEN MATCHED THEN
            UPDATE SET TGT.IsCurrent = 0
        -- New product: insert it as the current row.
        WHEN NOT MATCHED THEN
            INSERT (ProductID, ProductName, Price, IsCurrent)
            VALUES (SRC.ProductID, SRC.ProductName, SRC.Price, 1)
        OUTPUT $action, SRC.ProductID, SRC.ProductName, SRC.Price
    ) AS Changes (action, ProductID, ProductName, Price)
    WHERE action = 'UPDATE';

Just as MERGE can replace some of the SSIS components for performance reasons, it can be used with change data capture (CDC) functionality to perform inserts and updates, thus replacing the parameterized OLE DB Command transformation and resulting in a considerable gain in performance. One such application could be moving staged data into production servers.

GROUP BY Extensions

The GROUP BY clause has been enhanced in SQL Server 2008 and now adds a new operator: GROUPING SETS. More precisely, the GROUP BY clause in SQL Server 2008 now supports the ROLLUP, CUBE, and GROUPING SETS operators in line with the ANSI SQL-2006 standard. So, an ISO-compliant syntax has been adopted, though a non-ISO-compliant syntax is still supported for backward compatibility. For example, the earlier operators supported in SQL Server 2005 (WITH CUBE and WITH ROLLUP) are still supported but will be removed from future versions.

As you know, a GROUP BY clause enables you to select one summary row for each group of rows created from a selected set of rows on the basis of the values of one or more columns or expressions. Adding these operators to a GROUP BY clause provides an enhanced result set. The ROLLUP operator generates simple aggregate groupings with expressions rolled up from right to left, and the number of groupings in the result set equals the number of expressions plus one. The CUBE operator generates groupings for every combination of the expressions, regardless of expression order, and generates 2^n groupings, where n is the number of expressions used with the CUBE operator.
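To make the grouping counts concrete, here is a hedged sketch against a hypothetical table dbo.Sales (SalesYear, Region, Amount); the table and column names are assumptions for illustration. With two grouping expressions, ROLLUP yields 2 + 1 = 3 groupings and CUBE yields 2^2 = 4, while GROUPING SETS returns only the groupings you list.

```sql
-- Hypothetical table: dbo.Sales (SalesYear, Region, Amount).

-- ROLLUP, rolled up from right to left, gives 3 groupings:
-- (SalesYear, Region), (SalesYear), and the grand total ().
SELECT SalesYear, Region, SUM(Amount) AS TotalAmount
FROM dbo.Sales
GROUP BY ROLLUP (SalesYear, Region);

-- CUBE gives every combination, 4 groupings:
-- (SalesYear, Region), (SalesYear), (Region), and ().
SELECT SalesYear, Region, SUM(Amount) AS TotalAmount
FROM dbo.Sales
GROUP BY CUBE (SalesYear, Region);

-- GROUPING SETS gives only the listed groupings:
-- totals by year, totals by region, and the grand total.
SELECT SalesYear, Region, SUM(Amount) AS TotalAmount
FROM dbo.Sales
GROUP BY GROUPING SETS ((SalesYear), (Region), ());
```

In each result set, a column that is not part of the current grouping is returned as NULL; the GROUPING() function can be used to tell such rows apart from genuine NULL values in the data.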
The GROUPING SETS operator allows you to produce multiple groupings of data, for only the specified groups, in a single result set. This is simpler and often better than the ROLLUP and CUBE operators, which generate the full set of aggregations. GROUPING SETS can specify groupings equivalent to those returned by ROLLUP or CUBE and can also work in conjunction with ROLLUP and CUBE. The result set thus generated is equivalent to a UNION ALL of the specified groups. So, using GROUPING SETS, you can generate only the required levels of groupings, and the information can be made available to reports or requesting applications in a ready-to-digest format. You might use a pivot table to display the results. The important thing to take away is that the GROUPING SETS operator allows you to retrieve aggregated information at the required levels of grouping in one single statement and, above all, efficiently. This could enhance your data analysis experience and improve reporting performance. If you are using an Aggregate transformation to produce aggregations for a report, you may find that GROUPING SETS performs better in some cases, especially where you are running a Data Flow task only to perform aggregations.

Star Join Query Processing Enhancement

This is one of those improvements in SQL Server 2008 Enterprise Edition that is applied automatically and does not require users to make any changes to their queries. If you are using a dimensional data model for your warehouse, you may realize this performance gain just by upgrading to SQL Server 2008, without making any change to your database structure or queries. Microsoft claims to have achieved a 15 to 30 percent improvement across the whole star join workload on the dimensional database server, whereas some individual queries can benefit by a factor of seven.
The star join optimization is provided in Microsoft SQL Server 2008 Enterprise Edition. In a dimensional data warehouse, most queries perform aggregations on columns of a large fact table, join it to one or many smaller dimension tables, apply filter conditions on the non-key dimension table columns, and group by one or many dimension table columns. Such queries are called star join queries. These queries follow a similar processing pattern, which you can see by looking at the execution plans of some of them. Typically, a range of the fact table is read by a seek on its clustered index, and the result is hash-joined with the result of a seek on one of the dimension tables. These hash joins are repeated as many times as there are dimensions involved in the query, and finally the results are sent to a hash aggregation.

If you check this plan on a SQL Server 2008 server, you will notice that some bitmap filters have also been applied. The bitmap filters, also called Bloom filters, are generated as a by-product of the hash joins. A Bloom filter is a data structure that can probabilistically test whether an element belongs to a set. Though this technique can allow some false positives, it does not allow any false negatives. So, once a fact table row fails the test, it is excluded from further processing. SQL Server 2008 can generate multiple Bloom filters, as against SQL Server 2005, which could generate only one. These filters are pushed down in the query execution path to the scan of the fact table to eliminate the nonqualifying rows. Also, the SQL Server 2008 query execution engine can change the sequence in which these filters are applied on the basis of their selectivity, placing the most selective filter first, then the next most selective filter, and so on, thus extracting maximum performance out of this enhancement.
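For reference, a query with the star join shape described above might look like the following sketch. The schema (dbo.FactSales, dbo.DimDate, dbo.DimProduct, dbo.DimStore and their columns) is hypothetical and serves only to illustrate the pattern.

```sql
-- Hypothetical star schema: one large fact table joined to three dimensions.
SELECT d.CalendarYear,
       p.Category,
       SUM(f.SalesAmount) AS TotalSales            -- aggregation on a fact column
FROM dbo.FactSales AS f
JOIN dbo.DimDate    AS d ON f.DateKey    = d.DateKey
JOIN dbo.DimProduct AS p ON f.ProductKey = p.ProductKey
JOIN dbo.DimStore   AS s ON f.StoreKey   = s.StoreKey
WHERE d.CalendarYear = 2008                        -- filters on non-key
  AND p.Category     = N'Bikes'                    -- dimension columns
  AND s.Country      = N'United Kingdom'
GROUP BY d.CalendarYear, p.Category;               -- grouping by dimension columns
```

No query hint is involved. Whether bitmap filters appear in the plan for such a query is decided by the optimizer, so viewing the actual execution plan is the way to confirm it on your own data.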
So, the bitmap filters enable SQL Server to eliminate nonqualifying fact table rows from further processing quite early during query evaluation. This avoids unnecessary processing of data rows that would have been dropped later in the query processing anyway and saves a considerable amount of CPU time.

Change Data Capture

When you are loading data into a data warehouse system from a source system, you need to know which rows in the source system have been changed, inserted, or deleted. Until now, the options were either to extract the complete source system data to a staging table and compare it against the data warehouse to determine the changes, or to use alternative techniques such as timestamp columns, triggers, or complex queries. Though these methods work, they have some issues. The first option is very inefficient because it does a lot of work to find the changes. The second option is also not a good choice, as timestamp columns require application changes, and triggers or complex queries are not efficient.

SQL Server 2008 Enterprise Edition (and Developer Edition) provides an efficient feature called Change Data Capture (CDC) to track data changes such as inserts, updates, and deletions. So, instead of comparing staged data with data warehouse data or using intrusive methods to find the changes in data, you can use CDC to identify changes and load the data warehouse in incremental steps.

The Change Data Capture feature is not enabled by default, so you have to enable CDC first in order to use it. The CDC feature is enabled at two levels: the database level and the table level. When you enable CDC on a database, it creates a new schema, cdc, and a user account, cdc, in the database. The cdc schema is used to store all the change tables and their metadata.
When you decide to track a table for changes, you enable CDC on that table. Enabling CDC on the first table creates the following:

- A change table to capture changes to data and metadata.
- Up to two query functions to allow you to retrieve data from the change tables. By default, all changes are captured and only one query function is created; however, if you want to track only the net changes over a period of time, you can do so. You can specify the parameter @supports_net_changes = 1 in the CDC enabling command to make net changes available. This setting creates the second query function, which allows you to return only the net changes. This feature can also potentially reduce the number of updates you perform on your data warehouse.
- A group of change data capture metadata tables to retain metadata configuration detail.
- Two SQL Server jobs: a capture job and a cleanup job. The SQL Server Agent service must be running when you enable CDC tracking on a table.

When changes are applied to the tracked source table, the database engine writes the changes into the transaction log as usual. The CDC capture job reads the log automatically and adds information about the changes to the associated change table. This table holds the captured columns from the source table to record the changed data, and five additional metadata columns to provide information relevant to the captured change. Two columns are of particular importance and worth mentioning here. The first column, __$start_lsn, records the commit log sequence number (LSN) that was assigned to the change. The commit LSN not only identifies changes that were committed within the same transaction but orders them as well. The second column is __$operation, which records the operation that is associated with the change: 1 = delete, 2 = insert, 3 = source record prior to update, and 4 = source record after update.
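The two-level enabling process and the query function described above can be sketched as follows. The database SalesDB and the table dbo.Product (with its columns) are hypothetical; the example also assumes the table has a primary key, which net change tracking requires, and that it runs from SSMS or sqlcmd, where GO separates batches.

```sql
-- Hypothetical database and table: SalesDB, dbo.Product (ProductID is the key).
USE SalesDB;
GO
-- Level 1: enable CDC on the database (creates the cdc schema and user).
EXEC sys.sp_cdc_enable_db;
GO
-- Level 2: enable CDC on the table. SQL Server Agent must be running.
EXEC sys.sp_cdc_enable_table
     @source_schema        = N'dbo',
     @source_name          = N'Product',
     @role_name            = NULL,  -- no gating role restricts access
     @supports_net_changes = 1;     -- also create the net changes function
GO
-- Read every captured change between the lowest and highest available LSNs.
-- dbo_Product is the default capture instance name (schema_table).
DECLARE @from_lsn binary(10), @to_lsn binary(10);
SET @from_lsn = sys.fn_cdc_get_min_lsn(N'dbo_Product');
SET @to_lsn   = sys.fn_cdc_get_max_lsn();

SELECT __$start_lsn, __$operation, ProductID, ProductName, Price
FROM cdc.fn_cdc_get_all_changes_dbo_Product(@from_lsn, @to_lsn, N'all update old')
ORDER BY __$start_lsn, __$seqval;
```

The row filter option 'all update old' returns both the before image (__$operation = 3) and the after image (__$operation = 4) of each update; with 'all', only the after image is returned.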
The __$operation column makes the ETL very efficient by eliminating the need to work out whether a row is an update, a delete, or an insert. Refer to Figure 12-5, which shows methods to load a data warehouse both with and without CDC. With CDC, instead of performing an expensive lookup operation, you just split data conditionally using a Conditional Split transformation. The split condition uses the __$operation column to divert the rows to the appropriate output.

Figure 12-5 Loading a data warehouse using Change Data Capture
[Two flows are shown. Loading a Data Warehouse without CDC: read data from source systems, stage data for change capture, read data from stage database, perform lookup operation against data warehouse, then update and insert into data warehouse. Loading a Data Warehouse with CDC: map the starting and ending LSNs for the incremental capture interval, prepare query, read data from CDC schema, perform conditional split operation, then update and insert into data warehouse.]

Partitioned Table Parallelism

In SQL Server 2005, when you run a query against large tables that have been partitioned across several disks, typically by date, the executor has been observed to allocate threads inconsistently. If a query touches only one partition, the executor allocates all the available threads to the query; however, if a query needs to touch more than one partition, it gets one thread allocated per partition, though additional threads might still be available on the server. Sometimes this has produced inconsistent performance for the same or similar queries, depending on when they were executed. In SQL Server 2008, the parallel query plans against partitioned tables have been improved to utilize more threads regardless of how many partitions a query touches.
The SQL Server query executor allocates all the available threads to partitioned table queries in a round-robin fashion, keeping CPUs fully utilized so that the performance of the same or similar queries over time is more comparable. The performance boost achieved with this round-robin thread allocation is most noticeable when more processor cores are available than the number of partitions a query touches. You don't need to configure anything in SQL Server, as this feature works by default.

Partition-Aligned Indexed Views

In SQL Server 2005, indexed views and partitioned tables can be used together. For instance, if you have a very large fact table that has been partitioned, say, on a yearly basis and you want to create summarized aggregates on this fact table, you can use indexed views to achieve this. So far, so good; however, there is an issue with the way people use fact table partitioning. The fact table is generally partitioned on the basis of a time period, which could be a month, quarter, or year, depending upon your business needs. As a time period completes, the oldest partition needs to be switched out and a new partition has to be switched in to keep data for a fixed period of time. For example, if you are required to keep data for five years and you are keeping partitions on a year-by-year basis, then at the beginning of a year you need to remove the oldest partition (six years old) and add a new partition for the new running year. The point to note here is that switching a partition in and out of the fact table is very efficient and is the best possible way to manage data in terms of retention. This way, your data warehouse always keeps the data for the defined period of time, and the retained data is partitioned for performance. This is known as a sliding window scenario.
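The sliding window maintenance described above can be sketched in T-SQL. Every object name here is hypothetical: the example assumes dbo.FactSales is partitioned on an integer year column through partition function pfSalesYear and partition scheme psSalesYear, that boundary value 2004 exists, and that dbo.FactSales_SwitchOut is an empty table with an identical structure on the same filegroup as the oldest partition.

```sql
-- 1. Switch the oldest partition out of the fact table. This is a
--    metadata-only operation, so no data is physically moved.
ALTER TABLE dbo.FactSales
SWITCH PARTITION 1 TO dbo.FactSales_SwitchOut;

-- 2. Remove the boundary of the now-empty oldest partition.
ALTER PARTITION FUNCTION pfSalesYear() MERGE RANGE (2004);

-- 3. Tell the partition scheme which filegroup the next partition uses,
--    then add a boundary for the new running year.
ALTER PARTITION SCHEME psSalesYear NEXT USED [PRIMARY];
ALTER PARTITION FUNCTION pfSalesYear() SPLIT RANGE (2010);
```

After the switch, the data in dbo.FactSales_SwitchOut can be archived or dropped without touching the fact table again.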
The issue with SQL Server 2005 is that you cannot switch a partition in or out of a partitioned table that has indexed views created on top of it. You end up dropping the indexed view, switching the partition in or out, and then recreating the indexed view. Creating an indexed view can be a very expensive process that delays data delivery to your end users. SQL Server 2008 aligns both these features and makes them work together. You don't need to drop and recreate indexed views when switching a partition in or out of the table. With partition-aligned indexed views, you save a lot of extra processing that would otherwise be required to rebuild aggregates on the entire partitioned table.

Summary

This chapter has covered the basics of data warehousing and some of the enhancements provided in the SQL Server database engine. The two new editions that have been introduced in the R2 release are interesting to watch, as they realize an approach that brings together software, hardware, and best practices to achieve the highest performance. This chapter is meant to get your feet wet, so if you are interested, go and take a deep dive into the area of data warehousing. For now, stay on, as we still have some interesting topics to cover: migration, package deployment, troubleshooting, and performance tuning.

Chapter 13

Deploying Integration Services Packages

In This Chapter

- Package Configurations
- Deployment Utility
- Deploying Integration Services Projects
- Custom Deployment
- Summary