Hands-On Microsoft SQL Server 2008 Integration Services

Summary

You created an Integration Services blank project in Chapter 1. In this chapter, you created packages using the SQL Server Import and Export Wizard and then added those packages to your blank project. You also created a package directly in BIDS, again using the SQL Server Import and Export Wizard. Above all, you explored those packages by opening component properties and configurations, and you should now better understand how an Integration Services package is put together. Last, you worked with the Data Profiling Task to identify quality issues with your data. In the next chapter, you will learn about the basic components, the nuts and bolts of Integration Services packages, before jumping in to make complex packages in Chapter 4 using various preconfigured components provided in BIDS.

Figure 2-14 Column Length Distribution Profiles

Chapter 3: Nuts and Bolts of the SSIS Workflow

In This Chapter
• Integration Services Objects
• Solutions and Projects
• File Formats
• Connection Managers
• Data Sources and Data Source Views
• SSIS Variables
• Precedence Constraints
• Integration Services Expressions
• Summary

So far, you have moved data using the SQL Server Import and Export Wizard and viewed the packages created by opening them in the Business Intelligence Development Studio (BIDS). In this chapter, you will extend your learning by understanding the nuts and bolts of Integration Services, such as the use of variables, connection managers, precedence constraints, and SSIS Expressions. If you have used Data Transformation Services (DTS 2000), you may grasp these topics quickly; however, much about them is new in Integration Services.
Usability and management of variables have been greatly enhanced, connectivity needs for packages are now satisfied by connection managers, enhanced precedence constraints give you total control over the package workflow, and, above all, the SSIS Expression language offers a powerful programming interface to let you generate values at run time.

Integration Services Objects

Integration Services performs its operations with the help of various objects and components such as connection managers, sources, tasks and transformations, containers, event handlers, and destinations. All these components are threaded together to achieve the desired functionality; that is, they work hand in hand, yet they can be configured separately.

A major enhancement Microsoft made in turning DTS 2000 into Integration Services is the separation of workflow from the data flow. SSIS provides two different designer surfaces, which are effectively different integrated development environments (IDEs) for developing packages. You can design and configure workflow in the Control Flow Designer surface and the data movement and transformations in the Data Flow Designer surface. Each designer environment provides its own components, and the Toolbox window is unique to each environment. The following objects are involved in an Integration Services package:

• Integration Services package: The top-level object in the SSIS component hierarchy. All the work performed by SSIS tasks occurs within the context of a package.
• Control flow: Helps to build the workflow in an ordered sequence using containers, tasks, and precedence constraints. Containers provide structure to the package and a looping facility, tasks provide functionality, and precedence constraints build an ordered workflow by connecting containers, tasks, and other executables in an orderly fashion.
• Data flow: Helps to build the data movement and transformations in a package using data adapters and transformations in ordered sequential paths.
• Connection managers: Handle all the connectivity needs.
• Integration Services variables: Help to reuse or pass values between objects and provide a facility to derive values dynamically at run time.
• Integration Services event handlers: Help extend package functionality using events occurring at run time.
• Integration Services log providers: Help in capturing information when log-enabled events occur at run time.

To enhance the learning experience while you are working with the SSIS components, you will first be introduced to the easier and more frequently used objects, and later be presented with the more complex configurations.

Solutions and Projects

Integration Services offers different environments for developing and managing your SSIS packages. SSIS packages are most likely designed and developed in the development environment using BIDS, while SQL Server Management Studio can be used to deploy, manage, and run packages, though there are other options to deploy and manage packages, as you will study in Chapter 13. Both environments have special features and toolsets to help you perform these jobs efficiently. While BIDS has the whole toolset to develop and deploy SSIS packages, SQL Server Management Studio cannot be used to edit or design Integration Services solutions or projects. However, in both environments, you use solutions and projects to organize and manage your files and code in a logical, hierarchical manner. A solution is a container that allows you to bring together scattered projects so that you can organize and manage them as one unit. In general, you will use a solution to focus on one area of the business, such as one solution for accounts and a separate solution for marketing.
However, complex business problems may require multiple solutions to achieve specific objectives. Figure 3-1 shows a solution that not only contains multiple projects but also includes projects of multiple types. This figure shows an Analysis Services project with a Sales cube, an Integration Services project with two SSIS packages, and a Reporting Services project with a Monthly Sales report, all in one solution.

Within a solution, one or more projects, along with related files for databases, connections, scripts, and miscellaneous files, can be saved together. Not only can multiple projects be stored under one solution, but multiple types of projects can be stored under one solution. For example, while working in BIDS, you can store a data transformation project as well as a data-mining project under the same solution. Grouping multiple projects in one solution has several benefits, such as reduced development time, code reusability, interdependencies management, settings management for all the projects at a single location, and the facility to save all the projects to Visual SourceSafe or Team Foundation Server in the same hierarchical manner as you have in the development environment.

Both SQL Server Management Studio and BIDS provide templates for working with different types of projects. These templates provide appropriate environments (such as designer surfaces, scripts, connections, and so on) for each project with which you are working. When you create a new project, Visual Studio tools automatically generate a solution for you while giving you an option to create a separate folder for the solution. If you don’t choose to create a directory for the solution, the solution file is created along with the other project files in the same folder; however, if you choose to create a directory for the solution, a solution folder is created with the project folder beneath it as a subfolder.
So, you get a hierarchical structure created for you to which you can then add various projects (data sources, data source views, SSIS packages, scripts, miscellaneous files) as and when required. Solution Explorer lists the projects and the files contained in them in a tree view that helps you to manage the projects and the files (as shown in Figure 3-1). The logical hierarchy reflected in the tree view of a solution does not necessarily relate to the physical storage of files and folders on the hard disk drive, however. Solution Explorer provides the facility to integrate with Visual SourceSafe or Team Foundation Server for version control, which is a great feature when you want to track changes or roll back code.

Figure 3-1 Solution Explorer showing a solution with different types of projects

File Formats

Whenever an ETL tool has to integrate with legacy systems, mainframes, or any other proprietary database systems, the easiest way to transfer data between the systems is to use flat files. Integration Services can deal with flat files that are fixed width, delimited, and ragged right format types. For the benefit of users who are new to the ETL world, these formats are explained next.

Fixed Width

If you have been working with mainframes or legacy systems, you may be familiar with this format. Fixed-width files use different widths for columns, but the chosen width per column stays fixed for all the rows, regardless of the contents of those columns. If you open such a file, you will likely see lots of blank spaces between the columns. As most of the data in a column with variable data tends to be smaller than the width provided, you’ll see a lot of wasted space. As a result, these types of files are more likely to be larger in size than the other formats.
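To make the fixed-width layout concrete, here is a minimal Python sketch, not part of Integration Services, that pads values out to their column widths and slices them back by position. The column names and widths (id, name, city) and both function names are hypothetical and chosen only for illustration.

```python
# Hypothetical column layout: (name, start offset, width).
# These names and widths are illustrative assumptions, not from any real file.
LAYOUT = [("id", 0, 5), ("name", 5, 12), ("city", 17, 10)]


def format_fixed_width(record, layout=LAYOUT):
    """Pad (or truncate) each value to its column width and join the columns."""
    return "".join(
        str(record[name])[:width].ljust(width) for name, _, width in layout
    )


def parse_fixed_width(line, layout=LAYOUT):
    """Slice one record by column position and trim the padding spaces."""
    return {
        name: line[start:start + width].strip() for name, start, width in layout
    }


row = format_fixed_width({"id": 42, "name": "Ann Lee", "city": "Leeds"})
parsed = parse_fixed_width(row)
```

Every row produced this way is 27 characters long regardless of how short the values are, which is the wasted space described above; the parser relies only on positions, never on a separator character.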
Delimited

This is the format most commonly used by systems to exchange data with foreign systems. Delimited files separate the columns using a delimiter such as a comma or tab and typically use a character combination (for example, carriage return plus linefeed characters, {CR}{LF}) to delimit rows/records. Generally, importing data using this format is quite easy, unless the delimiter used also appears in the data. For example, if users are allowed to enter free text in a notes field, some users may type a comma in their notes; this comma will be treated as a column delimiter and will distort the whole row format, so that data is imported into the wrong columns. Because of such potential conflicts, you need to pay particular attention to the quality of the data you are dealing with while choosing a delimiter. Delimited files are usually smaller in size compared to fixed-width files, as the free space is removed by the use of a delimiter.

Ragged Right

If you have a fixed-width file and one of the columns (the rightmost one) is a nonuniform column, and you want to save some space, you can add a delimiter (such as {CR}{LF}) at the end of the row and make it a ragged-right file. Ragged-right files are similar to fixed-width files except that they use a delimiter to mark the end of a row/record; that is, in ragged-right files, the last column is of variable size. This makes the file easier to work with when displayed in Notepad or imported into an application. Also, some vendors use this type of format when they want the flexibility to change the number of columns in the file. In such situations, they keep all the regular columns (the columns that always exist) in the first part of the file and the columns that may or may not exist combined as a single string of data at the end of the row.
Depending upon the columns that have been included, the length of the last column will vary. Applications generally use substring logic to separate out the columns from the last variable-length combined column.

Connection Managers

As data grows in random places, it’s the job of the information analyst to bring it all together to draw out pertinent information. The biggest problem in bringing such data sets together and merging them into a single storage location is how to handle different data sources, such as legacy mainframe systems, Oracle databases, flat files, Excel spreadsheets, Microsoft Access files, and so on. The connection managers provided in Integration Services come to the rescue. In Chapter 2, you saw how connection managers were used inside the package to import data.

The components defined inside an Integration Services package require that physical connections be made to data stores during run time. The source adapter reads data from the data source and then passes it on to the data flow for transformations, while the destination adapter loads the transformed data to the destination store. Not only do the extraction and loading components require connections, but connections are also required by some other components. For example, during a Lookup transformation, values are read from a reference table to perform transformations based on the values in the lookup table. Then there are logging and auditing requirements that also need connections to storage systems such as databases or text files.

A connection manager is a logical representation of a connection. You use a connection manager to describe the connection properties at design time, and these are interpreted by Integration Services to make a physical connection at run time. For example, at design time, you can set a connection string property within a connection manager, which is then read by the Integration Services run-time engine to make a physical connection.
A connection manager is stored in the package metadata and cannot be shared with other packages.

Connection managers enhance connection flexibility. Multiple connection managers of the same type can be created to meet the needs of Integration Services packages and enhance performance. For example, a package can use, say, five OLE DB connection managers, all built on the same data connection.

You can add connection managers to your package using one of the following methods in BIDS:

• Choose New Connection from the SSIS menu.
• Choose the New Connection command from the context menu that opens when you right-click the blank surface in the Connection Managers area.
• Add a connection manager from within the editor or advanced editor dialog boxes of some of the tasks, transformations, source adapters, and destination adapters that require connection to a data store.

The connection managers you add to the project at design time appear in the Connection Managers area in the BIDS designer surfaces, but they do not appear in the Connection Managers collection in Package Explorer until you run the package successfully for the first time. At run time, Integration Services resolves the settings of all the added connections, sets the connection manager properties to each of them, and then adds them to the Connection Managers collection in Package Explorer.

You will be using many of the connection managers in Hands-On exercises while you create solutions for business problems later on. For now, open BIDS, create a new blank project, and check out the properties of all the connection managers as you read through the following descriptions. Figure 3-2, which appears in the later section “Microsoft Connector 1.0 for SAP BI,” shows the list of all the connection managers provided in SQL Server 2008 Integration Services.

ADO Connection Manager

The ADO Connection Manager enables a package to connect to an ADO recordset.
This connection manager has been provided mainly for legacy support. You will most likely use it when you’re working with a legacy application that uses ActiveX Data Objects (ADO) to connect to its data sources. You might have to use this connection manager when developing a custom component where such a legacy application is used.

ADO.NET Connection Manager

The current model of software applications is very different from the earlier connected, tightly coupled client/server scenario, where a connection was held open for its lifetime. Now you have varied types of data stores, and these data stores are being hit with several hundred connections every minute. ADO.NET overcomes these shortcomings and provides disconnected data access, integration with XML, optimized interaction with databases, and the ability to combine data from numerous data sources. These features make ADO.NET connection managers quite reliable and flexible with lots of options; however, they might be a little slower than the customized or dedicated connection managers for a particular source. You can also have consistent access to data sources using ADO.NET providers.

The ADO.NET Connection Manager provides access to data sources, such as SQL Server or sources exposed through OLE DB or XML, using a .NET provider. You can choose from the .NET Framework Data Provider for SQL Server (SqlClient), the .NET Framework Data Provider for Oracle Server (OracleClient), the .NET Framework Data Provider for ODBC (Open Database Connectivity), and the .NET Framework Data Provider for OLE DB. The configuration options of the ADO.NET Connection Manager change, depending on the choice of .NET provider.

Cache Connection Manager

The Cache Connection Manager is primarily used for creating cache for the Lookup Transformation.
When you have to repeatedly run a Lookup Transformation in a package or have to share the reference (lookup) data set among multiple packages, you might prefer to persist this cache to a file to improve performance. You would then use a cache transformation, which in turn uses the Cache Connection Manager to write the cached information to a cache file (.caw). Later, in Chapter 10, “Data Flow Transformations,” when you work with the Lookup Transformation, you will use this connection manager to cache data to a file.

Excel Connection Manager

This connection manager provides access to a Microsoft Excel workbook file. It is used when you add an Excel Source or Excel Destination to your package. With the launch of Excel 2007, the data provider for Excel changed from the earlier Microsoft Jet OLE DB Provider to the OLE DB provider for the Microsoft Office 12.0 Access Database Engine. If you check the ConnectionString property of the Excel Connection Manager after adding it using the Microsoft Excel 97-2003 version, you will see the Provider listed as Microsoft.Jet.OLEDB.4.0, whereas this property will show the provider as Microsoft.ACE.OLEDB.12.0 when you add the Excel Connection Manager using the Microsoft Excel 2007 version. It is important to understand the connection string, as you may need to write it yourself in some packages, for example, if you’re getting the file path at run time and you want to create the connection string dynamically. Here is the connection string shown for both versions of the Excel driver:

Provider=Microsoft.Jet.OLEDB.4.0;Data Source=C:\SSIS\RawFiles\RawDataTxt.xls;Extended Properties="Excel 8.0;HDR=YES";

Provider=Microsoft.ACE.OLEDB.12.0;Data Source=C:\SSIS\RawFiles\RawDataTxt.xlsx;Extended Properties="Excel 12.0;HDR=YES";

Note the differences between the providers for the two versions, as explained earlier.
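As a rough illustration of creating such a string dynamically, the following Python sketch picks the provider and Excel version from the file extension and reproduces the two connection strings shown above. The function name and its has_header parameter are hypothetical; inside a package you would more typically build the same string with an SSIS expression or a Script Task, so treat this only as a sketch of the logic.

```python
import ntpath  # applies Windows path rules even when run on another OS


def excel_connection_string(path, has_header=True):
    """Build an Excel OLE DB connection string for a .xls or .xlsx file.

    Hypothetical helper: it mirrors the two strings shown in the text and
    is not an Integration Services API.
    """
    extension = ntpath.splitext(path)[1].lower()
    if extension == ".xls":
        # Excel 97-2003 workbooks use the Jet provider.
        provider, version = "Microsoft.Jet.OLEDB.4.0", "Excel 8.0"
    elif extension == ".xlsx":
        # Excel 2007 workbooks use the ACE provider.
        provider, version = "Microsoft.ACE.OLEDB.12.0", "Excel 12.0"
    else:
        raise ValueError(f"Unsupported Excel file extension: {extension!r}")
    hdr = "YES" if has_header else "NO"
    return (
        f"Provider={provider};Data Source={path};"
        f'Extended Properties="{version};HDR={hdr}";'
    )


xls_string = excel_connection_string(r"C:\SSIS\RawFiles\RawDataTxt.xls")
xlsx_string = excel_connection_string(r"C:\SSIS\RawFiles\RawDataTxt.xlsx")
```

Keeping the provider and version choice in one place means a file path that arrives at run time needs no further manual editing of the string.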
There are some additional properties that you need to specify in the extended properties section. First, you use Excel 8.0 in the extended properties for Excel versions 97, 2000, 2002, and 2003, while you use Excel 12.0 for the Excel 2007 version. Second, you use the HDR property to specify whether the first row contains column names. The default value is YES; that is, if you do not specify this property, the first row will be deemed to contain column names. Also, the Excel driver sometimes fails to pick up some values in columns where string and numeric values are mixed. By default, the Excel driver samples the first eight rows to determine the data type of a column and returns null values for values of other data types in the column. You can override this behavior by importing all the values as strings using the import mode setting IMEX=1 in the extended properties of the connection string.

If you will be deploying this connection manager to a 64-bit server, which is most likely the case these days, you will need to run the package in 32-bit mode, as both of these providers are available in 32-bit versions only. You will need to run the package using the 32-bit version of dtexec.exe, which is by default in the C:\Program Files (x86)\Microsoft SQL Server\100\DTS\Binn folder.

File Connection Manager

This connection manager enables you to reference a file or folder that already exists or is created at run time. While executing a package, Integration Services tasks and data flow components need input values for their properties to perform their functions. You can configure these input values directly within the component’s properties, or they can be read from external sources such as files or variables. When you configure a component to get this input information from a file, you use the File Connection Manager.
For example, the Execute SQL task executes an SQL statement, which can be directly input by you in the Execute SQL task, or this SQL statement can be read from a file. You can use an existing file or folder, or you can create a file or a folder by using the File Connection Manager. However, you can reference only one file or folder. If you want to reference multiple files or folders, you must use a Multiple Files Connection Manager, described a bit later. To configure this connection manager, choose from the four available options in the Usage Type field of the File Connection Manager Editor. Your choice in this field sets .