648 Hands-On Microsoft SQL Server 2008 Integration Services

do not block data. It's a bit difficult to envisage downstream components working on a data set that is still to be passed by the upstream component. In reality, however, data doesn't flow: it stays in the same buffer, and each transformation works on its relevant columns. Because the data is not flowing and remains static in the memory buffer, one transformation can work on a column while another transformation operates on a different column. This is how multiple row-based transformations can work so fast, almost simultaneously, on a data buffer. These transformations fall under one execution tree. The following row transformations are examples of nonblocking synchronous row-based transformations:

- Audit transformation
- Character Map transformation
- Conditional Split transformation
- Copy Column transformation
- Data Conversion transformation
- Derived Column transformation
- Export Column transformation
- Import Column transformation
- Lookup transformation
- Multicast transformation
- OLE DB Command transformation
- Percentage Sampling transformation
- Row Count transformation
- Row Sampling transformation
- Script Component transformation configured as a nonblocking synchronous row-based transformation
- Slowly Changing Dimension transformation

Chapter 15: Troubleshooting and Performance Enhancements

Partially Blocking Asynchronous Row-Set-Based Transformations

Integration Services provides some transformations that essentially add new rows to the data flow and hold on to the data buffers before they can perform the transformation. Such transformations are described as partially blocking because they hold on to data for a while before they start releasing buffers. Because these transformations add new rows to the data flow, they are asynchronous in nature.
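To make the in-place idea concrete, here is a minimal sketch in plain Python, not actual SSIS internals: a buffer is modeled as a list of rows, and two synchronous transformations each modify their own column in place, without copying rows to a new buffer. All names here are illustrative.

```python
# Conceptual sketch only: SSIS buffers are not Python lists, but the idea is
# the same -- synchronous transformations modify columns in the existing
# buffer rather than copying rows into a new one.

buffer = [
    {"name": "alice", "amount": "10.5"},
    {"name": "bob", "amount": "20.0"},
]

def character_map_upper(rows):
    # Analogous to a Character Map transformation: changes one column in place.
    for row in rows:
        row["name"] = row["name"].upper()

def data_conversion_to_float(rows):
    # Analogous to a Data Conversion transformation: converts another column
    # in place; the rows never leave the original buffer.
    for row in rows:
        row["amount"] = float(row["amount"])

character_map_upper(buffer)
data_conversion_to_float(buffer)

print(buffer[0])  # {'name': 'ALICE', 'amount': 10.5}
```

Because each transformation touches a different column of the same rows, neither has to wait for the other to finish the whole data set, which is the essence of why these components run in a single execution tree.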
For example, a Merge transformation combines two sorted data sets that may contain the same data, but it essentially adds new rows to the data flow. The time this transformation takes before it starts releasing buffers depends on when it receives matching data from both inputs. The following is a list of partially blocking asynchronous row-set-based transformations:

- Data Mining Query transformation
- Merge Join transformation
- Merge transformation
- Pivot transformation
- Term Lookup transformation
- Unpivot transformation
- Union All transformation
- Script Component transformation configured as a partially blocking asynchronous row-set-based transformation

Blocking Asynchronous Full Row-Set-Based Transformations

These transformations require all the rows to be assembled before they can perform their operation. They can also change the number of rows in the data flow, and they have asynchronous outputs. For example, the Aggregate transformation needs to see each and every row to perform aggregations such as a summation or finding an average. Similarly, the Sort transformation needs to see all the rows to sort them in the proper order. This requirement of collecting all rows before performing the operation puts a heavy load on both the processor and the memory of the server. You can understand from the nature of the functions they perform that these transformations block data until they have seen all the rows, and they do not pass data to the downstream component until they have finished their operation. If you are working with a large data set and have an Aggregate or Sort transformation in your package, you can watch the memory usage ramp up steadily in Task Manager when the transformation starts executing. If the server doesn't have enough free memory to support this, you may also see disk activity as virtual memory is used, which degrades performance drastically.
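As a rough analogy in plain Python (not SSIS internals), merging two pre-sorted inputs can only emit a value once the current head of each input is known, which is why a Merge transformation holds buffers until it has received comparable data from both sides:

```python
import heapq

# Two inputs that are already sorted on the merge key.
input_a = [1, 4, 7]
input_b = [2, 3, 9]

# heapq.merge consumes both iterators lazily; it can emit the next value only
# after comparing the current head of each input, so progress depends on data
# arriving from both sides -- the "partially blocking" behavior in miniature.
merged = list(heapq.merge(input_a, input_b))
print(merged)  # [1, 2, 3, 4, 7, 9]
```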
When all the rows are loaded in memory (you can confirm this by watching the row count shown in the Designer) and the processing is done, you may see in Task Manager that all the built-up memory is released immediately. The following are blocking asynchronous full row-set-based transformations:

- Aggregate transformation
- Fuzzy Grouping transformation
- Fuzzy Lookup transformation
- Sort transformation
- Term Extraction transformation

Optimization Techniques

The techniques used for optimizing your Integration Services packages are quite simple and similar to those for any other application built around database operations. Here we will discuss the basic principles that help you optimize packages by applying common-sense techniques. For some of us, the easiest approach is to spend more money on the fastest 64-bit machines with multicore CPUs and several gigabytes of memory and consider optimization done. Throwing more hardware and more processing power at performance issues does help, but if you don't debug the underlying performance bottlenecks, they come back to haunt you a few months after you've upgraded to newer, beefier machines. From an architectural point of view, you also need to decide whether to go with Integration Services, another third-party application, or T-SQL scripts written directly in SQL Server to perform the work. A bit of analysis needs to be done before you make a decision. The choice of a particular application may be based on its architectural advantages. For example, using custom-built scripts directly on large, properly indexed databases may prove more efficient for batch transformations than Integration Services, which is better suited to complex row-by-row transformations.
Most of the transformations in Integration Services are row-based, and they are quite lightweight and efficient compared to row-set-based transformations. If you need to use many row-set-based transformations that move buffers around in memory, consider whether the same work can be done with direct SQL. This requires testing and analysis, which you must do to optimize your data transformation operations. On the other hand, if you are working with different data sources such as SQL Server, DB2, Oracle, flat files, and perhaps an XML feed, you may find it very difficult to write a query that joins these sources together, let alone optimize it. In this case, you may find SSIS simpler and better performing. Monitoring, measuring, testing, and improving are vital to any optimization plan, and you cannot avoid this process with Integration Services either. Once you've decided to use Integration Services and are building your package, you can apply the following simple considerations to make your package perform optimally.

Choose the Right Operation

You need a critic's eye when designing an Integration Services package. Whenever you choose a specific operation in your package, keep asking yourself the following:

- Why am I using this operation?
- Can I avoid using this operation?
- Can I share this operation?

It is easy to get carried away when considering what you want to achieve with the various requirements of data flow components. Not all the operations you perform in an SSIS package are functional requirements; some are forced on you by environmental conditions. For example, you have to convert the text strings coming out of a flat file into proper data types such as integer and datetime.
Though you cannot avoid such a conversion, you can still decide where to perform it: at the source, at the destination, or somewhere in between. Second, consider whether the package contains any redundant operations that can safely be removed. For example, you may be converting a text column to a datetime column when you extract data from a flat file, and later converting the same column to a date column because of data store requirements. You can avoid one operation in this case by converting the column directly to the data type required by the destination store. Similarly, if you are not using all the columns from the flat file, are you still parsing and converting those columns? When it comes to performance troubleshooting, remember that every little bit helps. Third, consider whether you can distribute processing across multiple machines. Some other specific and thought-provoking component-based examples are discussed in the following paragraphs. While designing your package, you should always think of better alternatives for performing a unit of work. This calls for testing different components for performance comparisons within Integration Services, or for using alternative options provided by SQL Server. For example, if you are loading data into the local SQL Server where your SSIS package is running, it is better to use a SQL Server destination than an OLE DB destination. A SQL Server destination uses memory interfaces to transfer data to SQL Server at a faster rate, but it requires that the package be run on the local server. Also keep in mind that T-SQL can import small sets of data faster than the Integration Services Bulk Insert task or the SQL Server destination, because of the extra effort of loading the Integration Services run-time environment.
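The redundant double conversion described above can be sketched in plain Python (SSIS would do this with Data Conversion or Derived Column components, not this code); the point is simply that parsing the text once, directly into the destination's final type, does the same work in one pass instead of two:

```python
from datetime import datetime, date

raw = "2008-03-15"  # text column as read from the flat file

# Redundant: text -> datetime -> date (two conversions of the same column).
dt = datetime.strptime(raw, "%Y-%m-%d")
d_twice = dt.date()

# Better: convert once, directly to the type the destination store requires.
d_once = date.fromisoformat(raw)

assert d_once == d_twice
print(d_once)  # 2008-03-15
```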
When you use a Lookup transformation to integrate data from two different sources, consider whether it can be replaced with a Merge Join transformation, which can sometimes perform better than the Lookup transformation on a large data set. Try each of them and test the performance difference. Before you go ahead with your test, read on, as there are issues with the way the Merge Join transformation can be configured, and you need to configure your package properly to extract the best performance. From the earlier discussion, you know that a Lookup transformation is a nonblocking synchronous component that doesn't require data to be moved to new buffers, whereas a Merge Join transformation is a partially blocking asynchronous row-set-based transformation that will move data to new buffers. Moving data to new buffers may call for a new execution tree, which may also get an additional processing thread and increase the number of CPUs used. Second, a Merge Join transformation requires its inputs to be sorted in the same order on the join columns. You may think that adding a Sort transformation to the pipeline would be expensive, as it is a blocking asynchronous full row-set-based transformation that has to cache all the data before sorting it. There is a better alternative to the Sort transformation: if your data is coming from an RDBMS where it is indexed on the join column, you can use an ORDER BY clause in the T-SQL of the data flow source adapter to sort the data in the RDBMS; or, if the data is being read from a flat file that is already sorted in the required order, you can avoid a Sort transformation by telling the Merge Join transformation that the incoming data is already sorted. The output of a data flow source has a property called IsSorted that you can use to specify that the data output by the component is in sorted order.
You need to set this property to True, as shown in Figure 15-4. After setting the IsSorted property, you still need to specify which columns are sorted and the order in which they are sorted. This works exactly like the ORDER BY clause, in which you specify column names and the sort direction using the ASC or DESC keywords. To set this, expand the Output Columns node, choose the column, and assign a value of 1 to the SortKeyPosition property to specify that the column is the first sort column and is sorted in ascending order (see Figure 15-5), or a value of -1 to specify that the first sort column is sorted in descending order. For the second column in the sort order, set that column's SortKeyPosition to either 2 or -2 to indicate ascending or descending order, respectively. By understanding how to use a Merge Join transformation with already-sorted data, you can determine whether you get better performance with a Merge Join transformation or with a Lookup transformation. The next thing is to keep the data buffers in memory. If you do not have enough memory available to SSIS, partition your data. Smaller batches do not put heavy demands on memory; if batch sizes are properly configured, the data can fit in the memory available on the server, and the process will complete in much less time. I experienced an incident you might find interesting. One of my clients asked me to optimize their DTS package, which was loading about 12 million rows into SQL Server. The package was simple but was taking about a day and a half to process all the files. I started by checking the usual things: that files were copied to a local server before the process, that the database had enough free space, and that the disks were configured in RAID 10. The package code was relatively simple.
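The SortKeyPosition convention (1, -1, 2, -2, and so on) can be mimicked in plain Python as an illustrative sketch, not SSIS code: the absolute value gives the column's position in the sort order, and the sign gives its direction.

```python
# Hypothetical sort specification in SortKeyPosition style:
# abs(value) = position in the sort order, sign = direction (+ asc, - desc).
sort_spec = {"region": 1, "amount": -2}  # like ORDER BY region ASC, amount DESC

rows = [
    {"region": "East", "amount": 10},
    {"region": "West", "amount": 5},
    {"region": "East", "amount": 30},
]

# Sort by each key from least significant to most significant; Python's sort
# is stable, so the most significant key (position 1) is applied last and wins.
for col, pos in sorted(sort_spec.items(), key=lambda kv: -abs(kv[1])):
    rows.sort(key=lambda r: r[col], reverse=pos < 0)

print(rows)
# [{'region': 'East', 'amount': 30}, {'region': 'East', 'amount': 10},
#  {'region': 'West', 'amount': 5}]
```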
I started to believe, against my gut feeling, that this was probably just the time the process should take. I asked for a baseline. The DBA who implemented the package had done a good job at deployment time by providing a baseline for 100,000 records. The DBA had moved on after implementing the package, and the data had grown massively since then. I quickly did the calculations and realized that something was wrong: the package should have finished more quickly. Further research uncovered the issue: the commit transaction setting had been left blank, which meant that all the rows were being committed together in a single batch. DTS was being required to keep 12 million rows in memory, so it was no surprise that the server was running out of memory, swapping data out to disk, and taking so long to complete. I reduced the batch size to 10,000 rows, and the package completed in less than 4 hours. So, keep the commit size smaller if you need not, or cannot, use bigger batches.

Figure 15-4 Indicating that the data is sorted using the IsSorted property

Figure 15-6 shows two properties of the OLE DB destination, Rows Per Batch and Maximum Insert Commit Size, which you can use to specify the number of rows in a batch and the maximum size that can be committed together. However, if you are more concerned about failures of this task and want to restart the failing package, you need to use these properties carefully, or consider other options that can help you design your package accordingly. Refer to Chapter 8 to learn more about restart options.

Do Only the Work You Need to Do

You should take this tip seriously, as it addresses a generally overlooked problem that causes performance issues. Always ask yourself these questions:

- Am I doing too much?
- Can I reduce the amount of work I'm doing?
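A back-of-the-envelope version of the calculation that exposed the problem: committing everything in one batch forces the engine to hold the full row set pending, while a bounded commit size caps the working set. This is plain Python arithmetic, and the 200-byte average row size is an assumed figure for illustration only:

```python
rows_total = 12_000_000
row_bytes = 200            # assumed average row size, for illustration
batch_size = 10_000

# Single commit: the whole load must be held until the commit succeeds.
single_commit_mb = rows_total * row_bytes / 1024 / 1024

# Batched commits: only one batch is pending at a time.
batched_mb = batch_size * row_bytes / 1024 / 1024

print(f"single commit: ~{single_commit_mb:,.0f} MB pending")
print(f"10,000-row batches: ~{batched_mb:,.1f} MB pending")
```

With these assumptions, a single commit keeps well over 2 GB pending, versus about 2 MB per batch, which is why the server stopped swapping once the batch size was reduced.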
Figure 15-5 Specifying sort order on the columns

One common mistake developers make is to start with a larger data set than required in order to build the needed functionality quicker, especially when developing a proof of concept (POC). Once the package is created, the data set is reduced by fine-tuning the data flow sources or other components. For example, you may have imported a complete table into the pipeline using a data flow source with the Table or View data access mode, when you wanted only some of the rows. That may be acceptable while building a proof of concept, but you should prefer the SQL Command data access mode in your data flow sources so that you can specify precisely the rows you want to import into the pipeline. Also remember not to use a Select * from TableName query with this option, which is equally bad. You can also reduce the number of columns in the data set. If you import more columns than are actually required, and you parse and convert those columns before deciding not to use them in the data flow, that is unnecessary processing that can be avoided by writing the SQL command to retrieve exactly the data you want to use in the pipeline. Consider whether you need the complete data set in your package, or whether a much-reduced data set, in rows or columns, consisting only of incremental data would be sufficient. Select only the required columns not just in data sources but in the Lookup transformation's reference data set as well. Write a SQL select statement with the minimum required data instead of using a table or view; keeping the reference data set that stays in memory small will perform better.

Figure 15-6 OLE DB destination can control the commit size of a batch
Run an SSIS Package at the Most Suitable Location

Choosing a location for running your package is also important when a data source or destination is remote. If you are accessing data over the network or writing to a remote destination, package performance will be affected. Determine whether you can transfer a file to a local server, or run the package locally on the machine where the data is stored. Avoid transferring data over a network if possible, as network connectivity issues and traffic conditions may unnecessarily affect the performance of your package. Among other factors, the choice of where to run your package may be guided by whether your transformations expand or reduce the data. For example, if you are doing aggregations and storing the results on a remote SQL Server, you may be better off running the package on, or near to, the data source computer. Alternatively, if you are expanding a data set by copying columns or performing data conversions, you may be better off getting the raw data from the data source and running the package on, or near to, the destination computer. The choice of data type can also affect the number of bytes the package handles. For example, if you use a two-byte integer data type for a number, it occupies 2 bytes in memory, whereas a real data type for the same number uses 4 bytes. When you are dealing with a large data set of, say, 10 million rows, that difference adds an extra 20 million bytes to your package, which further degrades performance when the data is transferred over the network.

Run an SSIS Package on the Most Suitable Machine

If your SQL Server is being used heavily, or scheduled jobs are running on the SQL Server at the same time you want to run your SSIS package, explore the possibility of using a server other than the SQL Server that may have more resources available at that time.
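The data type arithmetic above, verified as a quick Python check (comparing a 2-byte and a 4-byte numeric type over 10 million rows; the SSIS type names in the comments are the standard pipeline types):

```python
rows = 10_000_000

two_byte = rows * 2   # e.g., a DT_I2 (two-byte signed integer) column
four_byte = rows * 4  # e.g., a DT_R4 (four-byte real) column

extra = four_byte - two_byte
print(f"extra bytes moved: {extra:,}")  # extra bytes moved: 20,000,000
```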
This will ease the pressure of a narrow processing window; the downside, however, is that you will have to buy an extra SQL Server license for the server on which you run Integration Services. If you are planning to buy a new machine for SQL Server, or for wherever you will run Integration Services, consider buying a server with the fastest possible drives, as disk read and write operations are affected by disk performance. Configure RAID on your server with a clear understanding of the performance trade-offs of the different RAID levels. RAID 10 provides the best read and write performance but uses more disk real estate, whereas RAID 5 offers a balance between disk read speed and redundancy cost. Use RAID 10 if your budget can afford it.

Avoid Unnecessary Contention for Resources

When deploying your packages, think of the run-time conditions. Ask yourself these questions:

- Will other users be accessing the server at the same time?
- Will other processes or jobs, such as server maintenance plans or backup jobs, be running at that time?

Integration Services can happily coexist with other applications and run packages with minimal resource requirements. The amount of data to be processed, the transformations used, and the way you've configured your package to use resources all determine the impact on server performance as a whole. If a server is being used heavily, avoid running your packages on it, as the package may not be able to acquire locks quickly enough and will also affect users connected to other services on the server. Check the availability of memory on the server at the time the package will run. Memory is the most precious resource for any database application, and any contention for memory will negatively affect performance.
Archive Historical Data

You should consider archiving data that is no longer required for business analysis or reporting, as this type of data is an unnecessary drag on performance. For example, if the business needs only the last three years of data, then data older than three years will not only take up more hard disk space but will also slow down all the data access queries, data loading processes, and data manipulation processes. It is worth considering the archival process and the data retention requirements during the design phase of a project so that you can develop your processes accordingly. Keeping your databases lean, with only the needed data, will avoid unwanted performance hassles.

Discuss and Optimize

As with database development, discuss the data modeling with developers and other information analysts to agree on properly optimized yet resilient and compliant rules for database design, and adopt best practices for designs and techniques within the development team. You will find this is not only encouraging for the team,