338 Hands-On Microsoft SQL Server 2008 Integration Services data into the Integration Services data type. During design time, when you drop the data flow source onto the Data Flow Designer surface and configure it to connect to an external data source from which data will be read, the data flow source copies the metadata—i.e., the schema of the data—to its external columns metadata. At run time, when the data is being pulled in, the incoming data columns are parsed into the Integration Services data types using the external columns metadata. In the example, the names Steve and Sarah are parsed into the DT_STR data type, sex, which is indicated using 1 and 0, and have been assigned DT_BOOL; House_ Number has been assigned DT_I4; and the Moved_in data has been assigned the DT_ DBTIMESTAMP data type. Integration Services provides 29 different data types to cover various types of data, such as character data, numerical data, Boolean, dates, text, and image fields. Integration Services provides a wide range of data types and is particular about how they are used; hence, if data has a data type that does not match with the data types available in Integration Services, an error occurs. We will explore many data types along the way as we progress with our Hands-On exercises; however, to know more about each data type, refer to Microsoft SQL Server 2008 Books Online. The process that converts the source data into Integration Services data types is called data parsing. Data flow components can be configured to use either fast parsing or standard parsing. Fast parsing supports the most commonly used data types with a simple set of routines and can be configured at the column level. Because it uses simple routines and does not check for many other data types, it can achieve high levels of performance. However, it is not available for most of the data flow components Person Steve 1 250 07-11-2005 Sarah 0 130 25-21-2004 DT_STR DT_I4 DT_DBDATE DT_BOOL Steve 1 250 07-11-2005 Processing logic Error output Output Sarah 0 130 25-21-2004 46 -1071607694 External columns metadata External data source Parsing Data flow source Name Sex House_Number Moved_in varchar(50) bit int datetime Figure 9-1 A data flow source extracting and parsing data Chapter 9: Data Flow Components 339 and can parse only a narrow range of data types. For example, fast parsing does not support locale-specific parsing, special currency symbols, and date formats other than year-month-date. Also, fast parsing can be used only when using a Flat File source, a data conversion transformation or derived column transformation, and a Flat File destination, because these are the only components that convert data between string and binary data types. You may use fast parsing for performance reasons when your data meets these requirements, but for all other occasions, you will be using standard parsing. Before using fast parsing, check out the Books Online to find out the data types supported by the fast parsing. Standard parsing uses a rich set of parsing routines that are equivalent to OLE DB parsing APIs and supports all the data type conversions provided by the automation data type conversion APIs available in Oleaut32.dll and Ole2dsip.dll. For example, standard parsing provides support for locale-sensitive parsing and international data type conversions. Returning to the data flow, as the data rows arrive and are parsed, they are validated against the external columns metadata. The data that passes this validation check at run time is copied to the output columns; the data that doesn’t pass is treated slightly differently. You can define the action the component can take when a failure occurs. Typical errors you can expect at the validation stage include data type mismatches or a data length that exceeds the length defined for the column. Integration Services handles these as two different types of errors—data type errors and data length or data truncation errors. You can specify the action you want the data flow component to take for each type of error from the three options—fail the component, ignore the error, or redirect the failing row to error output fields. You can specify one of these actions on all columns or different actions for different columns. If you redirect the failing rows to the error output fields and link the error output to a data flow destination, the failing rows will be written to the data flow destination you specified. The error output contains all the output columns and two additional columns for the failing rows, ErrorCode and ErrorColumn, which indicate the type of error and the failing column. In Figure 9-1, note that the record holding data for Sarah has a wrong date specified and hence fails during extract process. As the source was configured to redirect rows, the failing row data is sent to the Error Output. Also, note that the two rightmost columns indicate the type of error and the column number. Every output field is assigned an ID automatically, and the number shown in the ErrorColumn is the ID number of the column failing the extract process. If two columns fail on the same row, the column that fails first is captured and reported. So, in case of multiple failures on a row, you might not know about them till the package fails again after you have fixed the first error. After pulling in the data rows, a data flow source passes the output rows to the next data flow component—generally a transformation—and the failing rows are passed to another data flow component—which could be a data flow destination, if configured 340 Hands-On Microsoft SQL Server 2008 Integration Services to redirect the failing rows. You can also redirect the failing rows to an alternate branch in the data flow via a transformation in which you can apply logic to correct the failing rows and bring them back into the main data flow after corrections. This is a highly useful capability that, if used properly, can reduce wastage and improve data quality. When the output of a data flow source is connected to the input of a data flow transformation, the data flows from the input of the transformation, through the processing logic of transformation, and then to its output. Based on the logic of transformation, some rows may fail the process and may be sent to the error output. The main difference to note in comparison to a data flow source is that the transformations do not have external columns and have input columns instead. Figure 9-2 shows the functional layout of a data flow transformation. Finally, after passing through the data flow source and the data flow transformations, the data will reach a data flow destination so that it can be written to an external data store. Like data flow sources, a data flow destination can also read the schema information from Processing logic Error outputOutput Input Data flow transformation ErrorCode ErrorColumn Name Name Sex Sex House number House_Number Moved_In Moved_In Name Sex House_Number Moved_In Figure 9-2 Data flow transformation showing Input, Output, and Error Output columns Chapter 9: Data Flow Components 341 the external data store and copy the metadata to the external columns. When the data flows through the data flow destination, it gets validated against this metadata before being written to the external data store. If configured to redirect the failing rows, the failing rows may be sent to the error output columns and the rest of the successful data is written to the external data store. Note that a data flow destination does not have output columns. Figure 9-3 shows the data flow through a data flow destination. At this point it is worth mentioning that a pipeline or a data flow path usually terminates in a data flow destination; however, that is not necessary. Sometimes, you may prefer to terminate a data flow path in a transformation. For example, while testing and debugging, you can break and terminate the data flow path by adding a Row Count transformation to see what happens at a particular point in the data flow path while ignoring rest of the pipeline. With the preceding description of how the data is extracted from a data source, and how it flows through the transformations and destinations before being written to the external data store, let’s study these components to learn more about them and to learn how many types of data sources or destinations are available in the data flow. Processing logic Error output Data flow destination 1 170 External columns metadata DT_STR DT_I4 DT_DBDATE DT_BOOL Steve 1 250 07-11-2005 Parsing Input Christopher 05-02-2005 ErrorCode ErrorColumn 1 170 250 07-11-2005 05-02-2005 Steve 1 Christopher Person External data source Name Sex House_Number Moved_in varchar(10) bit int datetime Figure 9-3 Data flow destination showing the flow of data to the external data store 342 Hands-On Microsoft SQL Server 2008 Integration Services Data Flow Sources While building a data flow for a package using BIDS, your first objective is to bring the data inside the Integration Services data flow so that you can then modify the data using data flow transformations. Data flow sources are designed to bring data from the external sources into the Integration Services data flow. A data flow source reads the external data source, such as a flat file or a table in relational database, and brings in the data to the SSIS data flow by passing this data through the output columns on to the downstream component, usually a data flow transformation. During design time, the data flow source keeps a snapshot of the metadata of external columns and can refer to it at run time. If the data in certain rows doesn’t match with this schema at run time, the data sources can be configured to redirect those rows to the error output columns that can be dealt with separately. Integration Services provides six preconfigured data flow sources to read data from a variety of data sources, plus a script component that can also be scripted as a data flow source. If existing data flow sources do not meet your requirements, you can always build yourself a custom data flow source using the Integration Services object model. Scripting a data flow source has been discussed in Chapter 11. The following two tables list the preconfigured data flow source adapters and the interfaces they have. Source Description Extracts data from .NET Framework data providers, such as SQL Server 2008 using ADO.NET Connection Managers. Extracts data from an Excel worksheet using Excel Connection Manager. Reads data from a text file using Flat File Connection Manager. Extracts data from OLE DB–compliant relational databases using OLE DB Connection Manager. Extracts data from a raw file using a direct connection. Hosts and runs a script that can be used to extract, transform, or load data. Though not shown under data flow sources in the Toolbox, the Script Component can also be used as a data flow source. This component is covered in Chapter 11. Reads data from an XML data source by specifying the location of the XML file or a variable. Chapter 9: Data Flow Components 343 Data Flow Source Input Output Error Output Custom UI Connection Manager ADO NET source No 1 1 Yes ADO.NET Connection Manager Excel source No 1 1 Yes Excel Connection Manager Flat File source No 1 1 Yes Flat File Connection Manager OLE DB source No 1 1 Yes OLE DB Connection Manager Raw File source No 1 No Yes Not required XML source No 1 1 Yes Not required From the preceding table, you can make out that not all data flow sources have Error Output and not all data flow sources require a connection manager to connect to the external data source; rather, some can directly connect to the data source such as XML Source. But the important thing to understand here is that data flow sources don’t have an input. They use the external columns interface to get the data and use output columns to pass the data to downstream components. We will study more about each of the data flow sources in the following topics. ADO NET Source When you need to access data from a .NET provider, you use an ADO.NET Connection Manager to connect to the data source and then can use the ADO NET source to bring the data inside the data flow pipeline. You can configure this source to use either a table or a view or use an SQL command to extract rows from the external .NET source. The ADO NET source has one regular output and one error output. The ADO NET source has a custom UI though you can also use advanced editor to configure some of the properties that are not exposed in the custom UI. The custom UI has three pages to configure—Connection Manager, Columns, and Error Output. Connection Manager c Specify the ADO.NET Connection Manager here that the ADO NET Source uses to connect to the data source. Select one of the ADO.NET Connection Managers already configured in the package from the drop-down list provided under the ADO.NET Connection Manager field, or you can use the New button to add a new ADO.NET Connection Manager. Columns c is page shows you the list of columns read from the external .NET data source and cached into the External Columns interface. It also shows you the corresponding Output Columns that, by default, have the same names as the cached schema in the External Columns. You can change the Output Column names here if you want to call them differently inside your pipeline. 344 Hands-On Microsoft SQL Server 2008 Integration Services Error Output c When outputting the columns read from the external data source, some rows may fail due to wrong data coming through. ese failures can be categorized as errors or truncations. Errors can be data conversion errors or expression evaluation errors. e data may be of wrong type—i.e., alphabetical characters arriving in an integer field—causing errors to be generated. Truncation failures may not be as critical as errors—in fact, sometimes they are desirable. Truncation lets the data through, but truncates the data characters for which the length becomes more than the specified length—for example, if you specify the city column as VARCHAR(10), then all the characters after first ten characters will be truncated when it exceeds the ten-character length. You can configure this component for data type errors or truncations of data in columns to fail, ignore the error, or redirect the failing row to error output. See Figure 9-4 for an explanation. Figure 9-4 Setting error conditions for data type mismatches and truncation errors Chapter 9: Data Flow Components 345 Fail Component c is is the default option for both errors and truncations and will fail the data flow component, ADO NET Source in this case, when an error or a truncation occurs. Ignore Failure c You can ignore the error or the truncation and carry on outputting the data from this component. Redirect Row c You can configure the ADO NET Source to redirect the failing row to the error output of the source adapter, which will be handled by the components capturing the error output rows. If you have many columns to configure for different settings, you may find using the Set This Value To Selected Cells field easier to apply a value to all of your selected cells. Excel Source When you need to work with data in Excel files, you can either use the OLE DB Source with Microsoft Jet 4.0 OLE DB Provider or simply use an Excel source to get the data into the pipeline. You will configure the Excel Connection Manager to connect to an Excel workbook and then use that connection manager inside an Excel source to extract the data from a worksheet and bring it in to the pipeline. You can treat an Excel workbook as a database and its worksheets as tables while configuring the Excel source. A range in the Excel workbook can also be treated as a table or a view on database. The Excel source adapter has one regular output and one error output. This component has its own user interface, though you can also use Advanced Editor to configure its properties. When you open the Excel Source Editor in the data flow designer you will see that this source adapter also has three pages to configure. Connection Manager c e Connection Manager page opens by default, where you can select the connection manager from the drop-down list in the OLE DB Connection Manager field. e Data Access Mode drop-down list provides four options. Depending upon your choice in the Data Access Mode field, the interface changes the fields to provide relevant information. Table or view c is option treats the Excel sheet as a table or a view and extracts data from the specified worksheet in the Name Of e Excel Sheet field. Table name or view name variable c When you select this option, the subsequent field changes to Variable Name field. is option works similar to the Table Or View option, except instead of expecting the name of the worksheet or Excel range to be specified directly, it lets you specify the name of a variable in the Variable Name field, from which it can read the name of the Excel range or the worksheet. 346 Hands-On Microsoft SQL Server 2008 Integration Services SQL Command c Selecting this option in the Data Access Mode field changes the interface to let you provide an SQL statement in the SQL Command Text field to access data from an Excel workbook. You can either type in SQL directly or use the provided Build Query button to build an SQL query. (I recommend you to use this query builder to access data from the Excel sheet, even if you know how to write complex SQL queries for SQL Server, because there are some lesser-known issues on accessing data from Excel workbooks using SQL.) You can also use a parameterized SQL query, for which you can specify parameter mappings using the Parameters button. When you click Parameters, you get an interface that lets you map a parameter to a variable. SQL Command From Variable c is option works as an SQL command, except it reads the SQL statement from a variable specified in the Variable Name field. Columns c You can map external columns to output columns in this page and can change the names of output columns. Error Output c As explained in the ADO NET Source, you can configure this page to fail the component, ignore the error, or redirect the row in case an error occurs in a data column. While the Excel Source Editor allows you to configure the properties for Excel source, you may need to use the Advanced Editor to configure the properties not exposed by the Excel Source Editor. These properties include assigning a name and description to the component, specifying a timeout value for the SQL query, or most important, changing the data type for a column. While working with the Advanced Editor, get acquainted with the various options available in its interface. Flat File Source The Flat File source lets your package read data from a text file and bring that data into the pipeline. You configure a Flat File Connection Manager to connect to a text file and specify how the file is formatted. Also, you will specify the data type and length of each column in the Flat File Connection Manager that will set guidelines for the Flat File source to handle it appropriately. The Flat File source can read a delimited, fixed width, or ragged right–formatted flat file. To know more about these file types, refer to Chapter 3. The Flat File source has a custom user interface that you can use to configure its properties. Also, as with the Excel source adapter, its properties can be configured using an Advanced Editor. When you open the Flat File Source Editor, the Connection Chapter 9: Data Flow Components 347 Manager page opens up by default. You can select the Connection Manager from the drop-down list provided in the Flat File Connection Manager field. The flat files contain nothing for the null values, and if you want to keep these null values, check the box for “Retain null values from the source as null values” in the data flow option. By default, this check box is unchecked, which means the Flat File source will not keep null values in the data but will replace null values with the appropriate default values for each column type—for example, empty strings for string columns and zero for numeric columns. Note that the file you are trying to access must be in a delimited format. This is because the fixed width and/or ragged right–format files do not contain blank spaces; you need to pad the fields with a padding character to the maximum width so that the data cannot be treated as null values by the Flat File source adapter. These format settings for the flat file are done in the Flat File Connection Manager. The Columns and Error Output pages can be configured as described in the Excel source adapter. You can use Columns page on the Flat File source to map external columns to the output columns and the Error Output page to configure the error and truncation behavior when the mismatched data comes along, using the three options of Fail Component, Redirect Row, or Ignore Failure. This source has two important custom properties—FastParse and UseBinaryFormat— that are exposed in the Advanced Editor. The configurations for both these properties are done in the Input and Output Properties tab of the Advanced Editor for the Flat File source. Depending on the data you are dealing with, you can set the FastParse option for each of the columns by selecting the column in the Output Columns section and then going to the Custom Properties category of the column properties. By default, the FastParse option is set to False (see Figure 9-5), which means the standard parsing technique will be used. (Remember that standard parsing is a rich set of algorithms that can provide extensive data type conversions, where as fast parsing is relatively simplified set of parsing routines that supports only the most commonly used data and time formats without any locale-specific data type conversions.) The second property, UseBinaryFormat (also shown in Figure 9-5), allows you to let the binary data in the input column pass through to the output column without parsing. Sometimes you have to deal with binary data, such as data with the packed decimal format, especially when you’re receiving data from a mainframe system or the data source is storing data in the COBOL binary format—for example, you might be dealing with IBM EBCDIC–formatted data. In such cases, you may not prefer the Flat File source to parse the data; rather, you would like to parse it separately using special rules that are built based on how it has been packed into the column in the first place. By default UseBinaryFormat is set to false, which means the data in the input column will be parsed using Flat File source parsing techniques. To use this property, set UseBinaryFormat to true and the data type on the output column to DT_BYTES, to let the binary data be passed on to the output column as is. . 346 Hands-On Microsoft SQL Server 2008 Integration Services SQL Command c Selecting this option in the Data Access Mode field changes the interface to let you provide an SQL statement in the SQL. 338 Hands-On Microsoft SQL Server 2008 Integration Services data into the Integration Services data type. During design time, when you drop the. store 342 Hands-On Microsoft SQL Server 2008 Integration Services Data Flow Sources While building a data flow for a package using BIDS, your first objective is to bring the data inside the Integration