368 Hands-On Microsoft SQL Server 2008 Integration Services

did in the control flow: drag the output of a data flow component and drop it onto the input of the next component. The line that forms between the components on the Data Flow Designer surface is called the data flow path. A data flow path and a precedence constraint may look similar on the Designer, but they serve different purposes and differ in several important ways.

Note that the data flow path line is thinner than the precedence constraint line and can be either green or red, depending on whether it represents an Output Path or an Error Output Path. In the Control Flow, when you connect a task to another using a precedence constraint and click the task again, you will see another green arrow, indicating that you can configure multiple precedence constraints for the task. In the data flow, however, the data flow paths are limited to the number of outputs and error outputs available on the source component. There is another important difference: the data flow path simulates a pipe connecting the pipeline components (remember that the data flow is also known as the pipeline) through which data flows, whereas a precedence constraint specifies a condition under which the next task in the workflow can be executed.

When you click a component on the Data Flow Designer surface, for example an OLE DB source, you may see a combination of green and red arrows, depending on the outputs available from the component. Some components have both output and error output paths, some have only one, and some, such as destinations, have no output. Our example component, the OLE DB source, has both an output and an error output available and hence shows both green and red arrows.

After you connect a component to another component using a data flow path on the Data Flow Designer, you can configure the properties of the data flow path using the Data Flow Path Editor.
This editor can be opened by choosing the Edit command from the context menu or simply by double-clicking the path. Once in the Data Flow Path Editor, you can configure properties such as the name, description, and annotation of the path on the General page; see the metadata of the data columns flowing through the path on the Metadata page; and add data viewers on the Data Viewers page. We will configure the properties of the data flow path in the following Hands-On exercise.

Hands-On: An Introduction to the Data Flow Task

The purpose of this Hands-On exercise is to introduce you to the Data Flow task and to show how you can monitor the data flowing through the package, by exporting data from the [Person].[Contact] table of the AdventureWorks database to a flat file.

Method

We will not attempt anything elaborate in this package but will keep it simple, as this is just an introduction to data flow. You will drop a Data Flow task on the control flow and then go on to configure this Data Flow task. As you know by now, the Data Flow task has its own development and designer environment in BIDS, which opens when you double-click the Data Flow task or click the Data Flow tab.

Chapter 9: Data Flow Components 369

Exercise (Configure an OLE DB Connection Manager and Add a Data Flow Task)

To begin this exercise, create a new package, configure a connection manager for connecting to the AdventureWorks database, and then add a Data Flow task to the package.

1. Start BIDS. Create a New Project with the following details:
   Template    Integration Services Project
   Name        Introduction to Data Flow
   Location    C:\SSIS\Projects
2. When the blank solution is created, go to Solution Explorer and rename the Package.dtsx file to My First Data Flow.dtsx.
3. As we will be exporting data from the AdventureWorks database, we need a connection manager to establish a connection to the database.
Right-click anywhere in the Connection Managers area and choose New OLE DB Connection from the context menu. In the Configure OLE DB Connection Manager dialog box, click New to open the Connection Manager dialog box and specify the connection settings. Specify localhost or your computer name in the Server Name field and leave the Use Windows Authentication radio button selected. Choose the AdventureWorks database from the drop-down list in the Select Or Enter A Database Name field. Test the connection to the database using the Test Connection button, and then close the open windows by clicking OK twice.
4. Go to the Toolbox; drag and drop the Data Flow task onto the Control Flow Designer surface. Right-click the Data Flow task and choose Rename from the context menu. Rename the Data Flow task as Export PersonContact.
5. Double-click Export PersonContact and you will be taken to the Data Flow tab, where the Data Flow Task field displays the currently selected task: Export PersonContact. When your package has multiple Data Flow tasks, you can use the drop-down list in this field to select the one you want to work on.
6. Go to the Toolbox, and you will notice that the list of available components has changed. The Data Flow tab has a different set of Integration Services components that are designed to handle data flow operations and are divided into three sections: Data Flow sources, Data Flow transformations, and Data Flow destinations. Look through the list of components available under each section.

Exercise (Add an OLE DB Source and a Flat File Data Flow Destination)

Now you can build your first data flow using an OLE DB source and a Flat File destination.

7. From the Data Flow Sources section in the Toolbox, drag and drop the OLE DB Source onto the Data Flow Designer surface. Double-click the OLE DB source to open the OLE DB Source Editor.
You will see that the OLE DB Connection Manager field has automatically picked up the connection manager you configured earlier. Expand the Data Access Mode list to see the available options. Leave the Table Or View option selected.
8. When you click in the Name Of The Table Or The View field, the data flow source uses the connection manager settings to retrieve and display a list of tables and views. Select the [Person].[Contact] table from the list. Click Preview to see the first 200 rows from the selected table. Once done, close the preview window.
9. Click the Columns page in the left pane of the editor window. Note that all the external columns have been selected automatically. Uncheck the last five columns (PasswordHash, PasswordSalt, AdditionalContactInfo, rowguid, and ModifiedDate), as we do not want to output these columns. The Output Column shows the names given to the output columns of the OLE DB source; you can change these names if you wish (see Figure 9-6).
10. Go to the Error Output page and note that the default setting for any error or truncation in data for each column is to fail the component. This is fine for the time being. Click OK to close the OLE DB Source Editor.
11. Right-click the OLE DB source and choose the Show Advanced Editor command from the context menu. This opens the Advanced Editor dialog box for the OLE DB source, in which you can see its properties exposed in four different tabs. The Connection Managers tab shows the connection manager you configured in earlier steps. Click the Component Properties tab and specify the following properties:
   Name           Person_Contact
   Description    OLE DB source fetching data from [Person].[Contact] table of AdventureWorks database.
Go to the Column Mappings tab to see the mappings of the columns from Available External Columns to Available Output Columns. Go to the Input and Output Properties tab and expand the outputs listed there.
You will see the External Columns, Output Columns, and the Error Output Columns. If you click a column, you will see its properties on the right side. Depending upon the category of the column you are viewing, you will see different levels and types of properties, some of which you may be able to change. Click OK to close the Advanced Editor and you will see that the OLE DB source has been renamed. If you hover your mouse over the OLE DB source, you will see the description appear as a screen tip. Make a habit of clearly defining the name and description properties of Integration Services components. Doing so helps make the package self-documenting and goes a long way toward reminding you what a component does, especially when you open a package after several months to modify some details.

Figure 9-6 You can select the external columns and assign output names for them.

12. Go to the Toolbox and scroll down to the Data Flow destinations section. Drag and drop the Flat File destination from the Toolbox onto the Designer surface just below the Person_Contact OLE DB source. Click Person_Contact and you will see a green arrow and a red arrow emerging from the source. Drag the green arrow over to the Flat File destination to connect the components together.
13. Double-click the Flat File destination to invoke the Flat File Destination Editor. On the Connection Manager page, click the New button shown opposite the Flat File Connection Manager field. You will be asked to choose a format for the flat file to which you want to output data. Select the Delimited radio button if it is not already selected and click OK. This opens the Flat File Connection Manager Editor dialog box. Specify C:\SSIS\RawFiles\PersonContact.txt in the File Name field and check the box for the Column Names In The First Data Row option.
All other fields will be filled in automatically with default values. Click the Columns page in the left pane of the dialog box and see that all the columns you selected in the OLE DB source have been added. The Flat File destination's input takes this list of columns from the output columns of the OLE DB source. If you go to the Advanced page, you will see the available columns and their properties. Click OK to add this newly configured connection manager to the Flat File Connection Manager field of the Flat File Destination Editor dialog box. Leave the Overwrite Data In The File option checked.
14. Go to the Mappings page to review the mappings between Available Input Columns and Available Destination Columns. Click OK to close the Flat File Destination Editor. Rename the Flat File destination to PersonContact Flat File.

Exercise (Configure the Data Flow Path and Execute the Package)

In this part, you will configure the data flow path that you used to connect the two components in the last exercise, so that you can view the flow of data at run time.

15. Double-click the green line connecting the Person_Contact and PersonContact Flat File components to open the Data Flow Path Editor. On the General page of the editor, you can specify a unique Name for the path, type in a Description, and annotate the path. The PathAnnotation property provides four options for annotation: Never for disabling path annotation, AsNeeded for enabling annotation, SourceName to annotate using the value of the Source Name field, and PathName to annotate using the value specified in the Name field.
16. The Metadata page of the Data Flow Path Editor shows you the metadata of the data flowing through the path. You can see the name, data type, precision, scale, length, code page, sort key position, comparison flags, and source component of each column. The source component is the name of the component that generated the column. You can also copy this metadata to the clipboard if needed.
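Steps 7 through 14 amount to reading rows from a source, dropping the unwanted columns, and writing the rest to a delimited file with the column names in the first row. As a rough illustration only, the same idea might look like the following in Python. This is not how SSIS implements the data flow; the sample rows, the shortened column list, and the function name export_to_flat_file are all assumptions made for this sketch.

```python
import csv

# Columns the exercise unchecks in step 9.
EXCLUDED_COLUMNS = {"PasswordHash", "PasswordSalt", "AdditionalContactInfo",
                    "rowguid", "ModifiedDate"}

# Shortened, illustrative column list; the real table has more columns.
SOURCE_COLUMNS = ["ContactID", "FirstName", "LastName",
                  "PasswordHash", "PasswordSalt", "AdditionalContactInfo",
                  "rowguid", "ModifiedDate"]

# Made-up rows standing in for [Person].[Contact]; not real data.
SAMPLE_ROWS = [
    {"ContactID": 1, "FirstName": "Ann", "LastName": "Example",
     "PasswordHash": "x", "PasswordSalt": "y", "AdditionalContactInfo": "",
     "rowguid": "r1", "ModifiedDate": "2008-01-01"},
    {"ContactID": 2, "FirstName": "Bob", "LastName": "Sample",
     "PasswordHash": "x", "PasswordSalt": "y", "AdditionalContactInfo": "",
     "rowguid": "r2", "ModifiedDate": "2008-01-02"},
]


def export_to_flat_file(rows, source_columns, path, excluded=EXCLUDED_COLUMNS):
    """Write rows to a comma-delimited file, overwriting any existing file.

    The first line holds the column names, mirroring the
    'Column names in the first data row' option.
    Returns the list of columns actually written.
    """
    output_columns = [name for name in source_columns if name not in excluded]
    with open(path, "w", newline="", encoding="utf-8") as handle:
        # extrasaction="ignore" silently drops the excluded columns.
        writer = csv.DictWriter(handle, fieldnames=output_columns,
                                extrasaction="ignore")
        writer.writeheader()
        writer.writerows(rows)
    return output_columns
```

Calling export_to_flat_file(SAMPLE_ROWS, SOURCE_COLUMNS, "PersonContact.txt") would write a header line followed by one line per sample row. Opening the file in "w" mode plays the role of the Overwrite Data In The File option.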
17. On the Data Viewers page you can add data viewers to see the actual data that is flowing through the data flow path. This is an excellent debugging tool, especially when you are trying to find out what happened to the data. Let's add a data viewer. Click Add to configure a data viewer. On the General tab of the Configure Data Viewer dialog box, choose how you want to view the data by selecting one of the data viewer types: Grid, Histogram, Scatter Plot (x,y), or Column Chart. The second tab changes to match the data viewer type you choose.
   - Grid: Shows the data columns and rows in a grid. You can select the data columns to be included in the grid on the Grid tab.
   - Histogram: Select a numerical column on the Histogram tab to model the histogram when you choose this data viewer type.
   - Scatter Plot (x,y): Select this option and the second tab changes to Scatter Plot (x,y), in which you can select one numerical column each for the x-axis and the y-axis. The two columns that you select here will be plotted against each other to draw a point for each record on the scatter plot.
   - Column Chart: Visualizes the data as column charts of counts of distinct data values. For example, if you are dealing with persons and use city as a column in the data, the Column Chart can show the number of persons for each city, drawn as columns on the chart.
For our exercise, choose Grid as the data viewer type and leave all the columns selected on the Grid tab. Click OK to return to the Data Flow Path Editor dialog box, where you will see that a grid-type data viewer has been added to the Data Viewers list. Click OK to close this editor and you will see a Data Viewer icon alongside the data flow path on the Designer.
18. The package configuration is now complete, but before we execute the package, it is worth exploring two properties of the data flow engine that affect how data flows through it buffer by buffer.
Right-click anywhere on the Data Flow Designer surface and choose Properties. Scroll down to the Misc section in the Properties window and note the following two properties:
   DefaultBufferMaxRows    10000
   DefaultBufferSize       10485760
These properties define the default size of the buffer as 10MB and the maximum number of rows that a buffer can contain by default as 10,000. These settings give you control to optimize the flow of data through the pipeline.
19. Press the F5 key to execute the package. As the package starts executing, you will see a Grid Data Viewer window. As the package starts sending data down the pipeline, the data viewer attaches to the data flow and shows the data in the buffer flowing between the two components. The status bar at the bottom of the data viewer window shows the total number of rows that have passed through the data viewer, the total number of buffers, whether the viewer is attached or detached, and the number of rows displayed from the current buffer. At the top of the data viewer window, you can see three buttons. Copy Data copies the data currently shown in the data viewer to the Clipboard. Detach, which toggles to Attach when clicked, detaches the data viewer from the data flow and lets the data continue to flow through the path without being paused. The green arrow button moves data through the data flow buffer by buffer. When the package is executed, the data is moved in chunks (buffer by buffer) limited by the default buffer size and the default buffer maximum rows, 10MB and 10,000 by default. Clicking the green arrow button allows the data in the first buffer to pass through, and the data in the second buffer is then held up for you to view (see Figure 9-7). Click the green arrow to see the data in the next buffer.
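The way the two properties from step 18 interact can be sketched with a little arithmetic. The following Python is a deliberately simplified model that assumes a buffer is filled until either the byte limit or the row limit is reached; the function names and the estimated row widths are hypothetical, and the real data flow engine applies further sizing rules that this sketch ignores.

```python
DEFAULT_BUFFER_SIZE = 10_485_760    # bytes, i.e. 10 MB (value shown in step 18)
DEFAULT_BUFFER_MAX_ROWS = 10_000    # rows (value shown in step 18)


def rows_per_buffer(estimated_row_bytes,
                    buffer_size=DEFAULT_BUFFER_SIZE,
                    max_rows=DEFAULT_BUFFER_MAX_ROWS):
    """Simplified model: as many rows as fit in the buffer, capped at max_rows."""
    if estimated_row_bytes <= 0:
        raise ValueError("estimated_row_bytes must be positive")
    fit_by_size = buffer_size // estimated_row_bytes
    # Always allow at least one row, even if a single row exceeds the buffer.
    return max(1, min(max_rows, fit_by_size))


def buffers_needed(total_rows, per_buffer):
    """Number of buffers required to move total_rows (ceiling division)."""
    return -(-total_rows // per_buffer)
```

Under this model, a hypothetical 500-byte row is limited by the row cap (rows_per_buffer(500) gives 10,000), while a hypothetical 2,000-byte row is limited by the byte size (rows_per_buffer(2000) gives 5,242). This is why the number of rows per buffer varies with the width of the rows.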
Figure 9-7 Data Viewer showing the data flow in the grid

20. After a couple of seconds, you will see the data in the next buffer. This time the total rows will be shown as a little less than 20,000 and the rows displayed as a little less than 10,000. The total number of rows is also shown next to the data flow path on the Designer surface. The number of rows per buffer may vary for a different data flow, depending upon the width of the rows. Click Detach to complete the execution of the package.
21. Press Shift-F5 to stop debugging the package. Press Ctrl-Shift-S to save all the items in this project.

Review

In this exercise, you built a basic data flow for a package to extract data from a database table to a flat file. You also used the Data Viewer to see the data flowing past and saw the buffer settings that let you fine-tune how data moves through the data flow buffer by buffer.

Summary

You are now familiar with the components of the data flow and know how data is read from the external source by the data flow source, passes through the data flow transformations, and is then loaded into the data flow destinations. You have studied the data flow sources, data flow destinations, and the data flow path in detail in this chapter and have briefly learned about the data flow transformations. In the next chapter, you will study the data flow transformations in detail and work through Hands-On exercises using most of them.

Chapter 10: Data Flow Transformations

In This Chapter

- Row Transformations
- Split and Join Transformations
- Rowset Transformations
- Audit Transformations
- Business Intelligence Transformations
- Summary