428 Hands-On Microsoft SQL Server 2008 Integration Services

Figure 10-13 Configurations for the Unpivot transformation

Chapter 10: Data Flow Transformations

Aggregate Transformation

The Aggregate transformation is an asynchronous transformation that helps you perform aggregate operations such as SUM, AVERAGE, and COUNT. These operations require a complete data set, so the Aggregate transformation consumes all the input rows before applying any aggregation and extracting the transformed data. Because the transformation is asynchronous, the output data, which most likely has a new schema, is written to new memory buffers.

The Aggregate transformation can perform operations such as AVERAGE, COUNT, COUNT DISTINCT, GROUP BY, selecting a minimum or maximum from a group, and SUM on column values. The aggregated data is then extracted through new output columns. The output columns may also contain the input columns that form part of the groupings or aggregations. When you select a column in the Aggregate Transformation Editor and click in the Operation field, you will see a list of operations appropriate to the data type of the column you selected. This makes sense, as aggregate operations require appropriate column types; for example, SUM works on a numeric data type column but not on a string data type column. The operations are described in detail here:

- AVERAGE: This operation is available only for numeric data type columns and returns the average of the column values.
- COUNT: Counts the number of rows for the selected column. This operation does not count rows that have null values in the specified column. The Aggregate Transformation Editor adds a special column (*) that lets you perform a COUNT ALL operation, counting all the rows in a data set, including those with null values.
- COUNT DISTINCT: Counts the number of rows containing distinct non-null values in a group.
- GROUP BY: This operation can be performed on any data type column and returns the data set in groups of row sets.
- MAXIMUM: This operation can be performed on numeric, date, and time data type columns and returns the maximum value in a group.
- MINIMUM: This operation can be performed on numeric, date, and time data type columns and returns the minimum value in a group.
- SUM: This operation is available only for numeric data type columns and returns the sum of the values in a column.

The Aggregate transformation's user interface provides features and options to configure aggregations that we will cover in the following Hands-On exercise, as they will be easier to understand as you work with them. Before we dive in, consider the following:

- You can perform multiple aggregate operations on the same set of data. For example, you can perform SUM and AVERAGE operations on a column in the same transformation. As the results of these two aggregate operations will differ, you must direct them to different outputs. This is fully supported by the transformation, and you can add multiple outputs to it.
- Null values are handled as specified in the SQL-92 standard, that is, in the same way they are handled by T-SQL. The COUNT ALL operation counts all the rows, including those containing null values, whereas the COUNT and COUNT DISTINCT operations for a specific column count only rows with non-null values in that column. In addition, the GROUP BY operation puts null values in a separate group.
- When an output column requires special handling because it contains an oversized data value greater than four billion, or the data requires precision beyond a float data type, you can set the IsBig property of the output column to 1 so that the transformation uses the correct data type for storing the column value.
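The null-handling rules above can be sketched in a few lines of Python. This is only an illustration of the semantics, not SSIS code; the rows, column name, and helper function are all hypothetical:

```python
# Illustrative sketch of Aggregate transformation null-handling semantics.
# aggregate_counts() and the sample rows are hypothetical, not SSIS APIs.

def aggregate_counts(rows, column):
    """Mimic COUNT, COUNT ALL, and COUNT DISTINCT for one column."""
    values = [r[column] for r in rows]
    non_null = [v for v in values if v is not None]
    return {
        "count": len(non_null),                # COUNT skips null values
        "count_all": len(values),              # COUNT (*) includes null rows
        "count_distinct": len(set(non_null)),  # distinct non-null values only
    }

rows = [
    {"ProductName": "Chair"},
    {"ProductName": "Desk"},
    {"ProductName": None},   # a null value: counted by COUNT ALL only
    {"ProductName": "Chair"},
]

result = aggregate_counts(rows, "ProductName")
# COUNT = 3, COUNT ALL = 4, COUNT DISTINCT = 2
```

Note how the single null row widens the gap between COUNT and COUNT ALL, exactly as the SQL-92 rules describe.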
However, columns that are involved in a GROUP BY, MINIMUM, or MAXIMUM operation cannot take advantage of the IsBig property.

Hands-On: Aggregating SalesOrders

The SalesOrders.xls file has been extended with pricing information by adding unit price and total price columns, and the extended data has been saved as SalesOrdersExtended.xls. From this extended sales orders data, which lists a price for each product against the SalesOrderID, you are required to aggregate a total price for each order and to calculate the average price per product and the number of each product sold.

Before starting this exercise, open the SalesOrdersExtended.xls Excel file to verify that the file has only one worksheet, named SalesOrders. This exercise adds two more worksheets to it; if the file already has additional sheets, delete them before you start. However, if you are using the provided package code, leave the worksheets as is, because the Excel destinations used in the package look for these worksheets and you may otherwise get a validation error.

Method

As with the Pivot transformation Hands-On exercise, you will use an Excel file to access the data from a worksheet and create two new worksheets in the same Excel file to receive the processed data. You will configure a Data Flow task and an Aggregate transformation.

Exercise (Add Connection Manager and Data Flow Task to a New Package)

As in the previous exercise, you will start by adding a new package to the Data Flow transformations project and then adding an Excel Connection Manager to it.

1. Open the Data Flow transformations project in BIDS. Right-click SSIS Packages in Solution Explorer and choose New SSIS Package. This adds a new SSIS package named Package1.dtsx.
2. Rename Package1.dtsx as Aggregating SalesOrders.dtsx.
3. Add an Excel Connection Manager to connect to the Excel file C:\SSIS\RawFiles\SalesOrdersExtended.xls.
4.
Add a Data Flow task from the Toolbox and rename it Aggregating SalesOrders. Double-click it to open the Data Flow tab.

Exercise (Configure Aggregating SalesOrders)

The main focus of this part is learning to configure an Aggregate transformation. You will also configure an Excel source and Excel destinations to complete the data flow.

5. Add an Excel source to extract data from the SalesOrders$ worksheet in the SalesOrdersExtended.xls file. Rename this Excel source Sales Orders Data Source.
6. Drag and drop an Aggregate transformation from the Toolbox onto the Data Flow surface just below the Excel source. Connect the two components with a data flow path.
7. Double-click the Aggregate transformation to open the Aggregate Transformation Editor, which displays two tabs: Aggregations and Advanced. In the Aggregations tab, you select columns for aggregations and specify aggregation properties for them. This tab has two display types, basic and advanced; the Advanced button on the Aggregations tab converts the basic display into the advanced display, which allows you to perform multiple groupings or GROUP BY operations. Click Advanced to see the advanced display. Selecting multiple GROUP BY operations adds rows with multiple Aggregation Names in the advanced display section. This also means that you will be generating different types of output data sets that are sent to multiple outputs, so each Aggregation Name you add creates an additional output on the transformation.
8. Click the Advanced tab to see the properties you can apply at the Aggregate component level. As you can see, this editor requires configuration in more than one place. In fact, you can configure this transformation at three levels: the component level, the output level, and the column level.
The properties you define on the Advanced tab apply at the component level, the properties configured in the advanced display of the Aggregations tab apply at the output level, and the properties configured in the column list at the bottom of the Aggregations tab apply at the column level. The ability to specify properties at different levels enables you to configure the transformation for maximum performance benefit. The properties on the Advanced tab are described here:

- Key Scale: This optional property helps the transformation decide the initial cache size. By default, this property is not used; when set, it can have a Low, Medium, or High value. With the Low value, the aggregation can write approximately 500,000 keys; Medium enables it to write about 5 million keys; and High enables it to write approximately 25 million keys.
- Number Of Keys: This optional setting overrides the Key Scale value by specifying the exact number of keys that you expect this transformation to handle. Specifying the keys up front allows the transformation to manage its cache properly and avoid reorganizing the cache at run time, which enhances performance.
- Count Distinct Scale: This optional setting specifies an approximate number of distinct values that the transformation is expected to handle; it is unspecified by default. You can select Low, Medium, or High values. With the Low value, the aggregation can write approximately 500,000 distinct values; Medium enables it to write about 5 million distinct values; and High enables it to write approximately 25 million distinct values.
- Count Distinct Keys: Using this property, you can override the Count Distinct Scale value by specifying the exact number of distinct values that the transformation can write. This avoids reorganizing cached totals at run time and enhances performance.
- Auto Extend Factor: Using this property, you can specify a percentage value by which this transformation can extend its memory during run time. You can use a value between 1 and 100 percent; the default is 25 percent.

9. Go to the Aggregations tab and select SalesOrderID and TotalPrice from the Available Input Columns. As you select them, they are added to the columns list below, SalesOrderID with the GROUP BY operation and TotalPrice with the SUM operation.
10. Click Advanced to display the options for configuring aggregations for multiple outputs. You will see Aggregate Output 1 already configured with SalesOrderID as a GROUP BY column. Rename Aggregate Output 1 as Total Per Order. Next, click in the second row in the Aggregation Name field and type Products Sold and Average Price in the cell. Then select the ProductName, OrderQuantity, and UnitPrice columns from the Available Input Columns list. These columns will appear in the columns list with default operations applied to them. Change the operations as follows: apply GROUP BY to the ProductName column, SUM to the OrderQuantity column, and AVERAGE to the UnitPrice column, as shown in Figure 10-14. Note that you can specify a key scale and keys in the advanced display for each output, thus enhancing performance by specifying the number of keys the output is expected to contain. Similarly, you can specify Count Distinct Scale and Count Distinct Keys values for each column in the list to specify the number of distinct values the column is expected to contain.

Figure 10-14 Configuring Aggregate transformation for multiple outputs

11. You're done with the configuration of the Aggregate transformation. Click OK to close the editor. Before you start executing the package, check out one more thing.
Open the Advanced Editor for the Aggregate transformation and go to the Input and Output Properties tab. You'll see the two outputs you created earlier in the advanced display of the Aggregations tab of the custom editor. Expand them and view the different output columns. Also note that you can set the IsBig property on the output columns here, as shown in Figure 10-15. Click OK to return to the Designer.

Figure 10-15 Multiple outputs of Aggregate transformation

12. Let's direct the outputs to different worksheets in the same Excel file. Add an Excel destination on the left, just below the Aggregate transformation, and drag the green arrow from the Aggregate transformation to the Excel destination. As you drop it on the Excel destination, an Input Output Selection dialog box pops up, asking you to specify the output you want to connect to this destination. Select Total Per Order in the Output field and click OK to add the connector.
13. Double-click the Excel destination and click the New button next to the "Name of the Excel sheet" field to add a new worksheet to the Excel file. In the Create Table dialog box, change the name of the table from Excel Destination to TotalPerOrder and click OK to return to the Excel Destination Editor dialog box. Select the TotalPerOrder sheet in the field. Next, go to the Mappings page, where the mappings between the Available Input Columns and the Available Destination Columns are created for you by default. Click OK to close the editor. Rename this destination Total Per Order.

Figure 10-16 Aggregating the SalesOrders package

14. Much as you did in Steps 12 and 13, add another Excel destination below the Aggregate transformation on the right side (see Figure 10-16) and rename it Products Sold and Average Price. Connect the Aggregate transformation to the Products Sold and Average Price destination using the second green arrow.
Add a new worksheet named ProductsSoldandAveragePrice by clicking the New button next to the "Name of the Excel sheet" field. Go to the Mappings page to create the mappings. Click OK to close the editor.

Exercise (Run the Aggregations)

In the final part of this Hands-On, you will execute the package and see the results. If you wish, you can add data viewers where you would like to see the data grid.

15. Press F5 to run the package. The package completes execution almost immediately. Stop debugging by pressing SHIFT-F5. Save all files and close the project.
16. Browse to the C:\SSIS\RawFiles folder and open the SalesOrdersExtended.xls file. You will see two new worksheets created and populated with data. Check the data to validate the Aggregate transformation operations.

Review

You've performed some aggregations in this exercise and added multiple outputs to the Aggregate transformation. You've also learned that the Aggregate transformation can be configured at the component level by specifying properties on the Advanced tab, at the output level by specifying keys for each output, and at the column level by specifying the distinct values each column is expected to have. You can achieve high levels of performance using these configurations. However, if you find that the transformation is still suffering from a memory shortage, you can use the Auto Extend Factor to extend the memory usage of this component.

Audit Transformations

This category includes only two transformations in this release. The Audit transformation adds environmental data, such as system data or the login name, to the pipeline. The Row Count transformation counts the number of rows in the pipeline and stores the count in a variable.

Audit Transformation

One of the common requirements of data maintenance and data warehousing is to timestamp a record whenever it is added or updated.
Generally in data marts, you get a nightly feed of new and updated records, and you want to timestamp the records that are inserted or updated to maintain a history, which is also quite helpful in data analysis. The Audit transformation extends this ability by allowing environment values to be included in the data flow. With the Audit transformation, you can include not only the package execution start time but much more information, for example, the name of the operator, computer, or package, to indicate who changed the data and where the data came from. This is like using the Derived Column transformation in a specialized way.

To perform its functions, this transformation supports one input and one output. As it does not transform the input column data, but rather adds known environment values using system variables, errors are not expected in this transformation, and hence it does not support an error output. The Audit transformation provides access to nine system variables, briefly described here:

- ExecutionInstanceGUID: Each execution instance of the package is allocated a GUID, contained in this variable.
- PackageID: Represents the unique identifier of the package.
- PackageName: The package name, which can be added to the data flow.
- VersionID: Holds the version identifier of the package.
- ExecutionStartTime: Includes the time when the package started to run.
- MachineName: Provides the name of the computer on which the package runs.
- UserName: Adds the login name of the person who runs the package.
- TaskName: Holds the name of the Data Flow task to which this transformation belongs.
- TaskID: Holds the unique identifier of the Data Flow task to which this transformation belongs.
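Conceptually, the Audit transformation appends environment-derived values as new columns on every row flowing through it. The following Python sketch mimics that behavior under stated assumptions: the function name and row layout are hypothetical, and only a subset of the nine system variables is shown:

```python
# Hypothetical sketch of what the Audit transformation does conceptually:
# append environment-derived audit columns to each pipeline row.
# add_audit_columns() is illustrative, not an SSIS API.

import getpass
import socket
import uuid
from datetime import datetime


def add_audit_columns(rows, package_name, task_name):
    execution_guid = str(uuid.uuid4())  # like ExecutionInstanceGUID
    start_time = datetime.now()         # like ExecutionStartTime
    machine = socket.gethostname()      # like MachineName
    user = getpass.getuser()            # like UserName
    for row in rows:
        row.update({
            "ExecutionInstanceGUID": execution_guid,
            "PackageName": package_name,
            "ExecutionStartTime": start_time,
            "MachineName": machine,
            "UserName": user,
            "TaskName": task_name,
        })
    return rows


audited = add_audit_columns(
    [{"SalesOrderID": 1}], "Aggregating SalesOrders", "Data Flow Task"
)
```

Note that the GUID and start time are computed once per execution and stamped on every row, which matches the idea of an execution instance rather than per-row values.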
When you open the editor for this transformation after connecting an input, you can select a system variable from a drop-down list by clicking in the Audit Type column, as shown in Figure 10-17. When you select a system variable, the Output Column Name field shows the default name of the variable, which you can change. This output column is added to the transformation output as a new column.

Row Count Transformation

Using the Row Count transformation, you can count the rows that are passing through the transformation and store the final count in a variable that can be used by other components, such as a Script component, and in property expressions.
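The Row Count transformation's pass-through counting can be sketched as follows; the class and variable names are illustrative, not SSIS APIs. The key point is that the rows themselves are not modified, only the counter variable is updated:

```python
# Illustrative sketch of the Row Count transformation: rows pass through
# unchanged while a running total is stored in a variable that other
# components could read afterward. RowCount is hypothetical, not an SSIS API.

class RowCount:
    def __init__(self):
        self.row_count = 0   # stands in for the SSIS variable holding the count

    def process(self, rows):
        for row in rows:
            self.row_count += 1
            yield row        # pass-through: the data itself is not modified


counter = RowCount()
passed = list(counter.process([{"id": 1}, {"id": 2}, {"id": 3}]))
# counter.row_count is 3; passed contains the same three rows
```

In SSIS the count becomes available in the target variable only after the data flow finishes, which is why the count is typically consumed by downstream tasks or property expressions rather than within the same data flow.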