Nielsen c37.tex V4 - 07/21/2009 2:13pm Page 832 Part V Data Connectivity Unlike DTS, in which transformations happen in a single step as data is read from a source and written to a destination, Integration Services enables several transformations to be used between reading and writing data. Data flows can come from several sources, and they can be split and merged, and writ- ten to several destinations within the confines of a single Data Flow task. Because the transformations occur without reading and writing the database at every step, well-designed data flows can be surpris- ingly fast. FIGURE 37-2 The Data Flow tab of Integration Services’ design environment Connection managers A connection manager is a wrapper for the connection string and properties required to make a connec- tion at runtime. Once the connection is defined, it can be referenced by other elements in the package without duplicating the connection definition, thus simplifying the management of this information and configuration for alternate environments. Create a new connection manager by right-clicking in the Connection Managers pane or by choosing the New option when configuring a task that requires a connection manager. When right-clicking, notice that several of the more popular connection types are listed directly on the menu, but additional con- nection types are available by choosing the New Connection option. 832 www.getcoolebook.com Nielsen c37.tex V4 - 07/21/2009 2:13pm Page 833 Performing ETL with Integration Services 37 Each connection type has an editor dialog and properties that appear in the Properties pane, both of which vary according to the connection type. Each of the two lists may contain properties not available in the other. For example, the connection timeout can be set only in the OLE DB editor, while the delay validation property must be set in the Properties pane. Variables As with all proper programming environments, Integration Services provides variables to control exe- cution, pass around values, and so on. Right-click the design surface and choose Variables to show the Variables pane. Notice that along with Name, Data Type, and Value columns, the Scope column indi- cates at which level in the package hierarchy the variable is visible. Variables with package scope (scope equals the package name) are visible everywhere, whereas variables scoped to a task or event handler are visible only within that object. Variables scoped to a container are visible to the container and any objects it contains. By default, the Variables pane displays only variables whose scope matches the currently selected object or one of its parents. For e xample, clicking on the design surface will select the package object and dis- play only the variables scoped at the package level, but selecting a Control Flow task will show variables for both the selected task and the package (the task’s parent). Alternately, the full variable list can be displayed by selecting the Show All Variables button on the pane’s toolbar. Create a new variable by first selecting the object to provide the scope and then click the Variable pane’s Add Variable toolbar button. Once created, set the variable’s name, data type, and value. Note that you cannot change a variable’s scope without deleting and recreating it. In addition to scope, each variable has a namespace, which by default is either User or System. You can change the namespace for user-created variables, but there is very little that you can change (only the occasional value) for system namespace variables. The namespace is used to fully qualify a variable reference. For example, a variable called MyVar in the user namespace is referred to as @[User::MyVar]. Variable usage Variable values can be manually set via the Variables pane, but their values can also come from a num- ber of other sources, including the following: ■ Variable values can be provided at runtime via the /SET switch on the dtexec utility (or equivalent dialog of the dtexecui utility). ■ Variable values can be entered as expressions, which are evaluated at runtime. Enter the expression by clicking the Expression ellipses on the variable’s Properties pane, and then use the Expression Builder to enter the appropriate formula. Be sure to set the EvaluateAsExpression property to True to cause the contents of the variable to be evaluated as a formula. ■ For and Foreach container tasks can set a variable to contain a simple numeric sequence, each file in a directory on disk, each node in an XML document, and items from other lists and collections. ■ Query results can provide variable values, either as an individual value or an entire result set. ■ Scripts can read and/or set variable values. 833 www.getcoolebook.com Nielsen c37.tex V4 - 07/21/2009 2:13pm Page 834 Part V Data Connectivity Among the many places for variables to be used, property e xpressions are one of the most useful, as nearly any task property can be determined at runtime based on an expression. This enables variables to control everything from the text of a query to the enabling/disabling of a task. Expressions Expressions are used throughout Integration Services to calculate values used in looping, splitting data streams, setting variable values, and setting task properties. The language used to define an expression is a totally new syntax, resembling a cross between C# and Transact-SQL. Fortunately, an Expression Builder is available in many places where an expression can be entered. Some of the key themes include the following: ■ Variables are referred to by prefixing them with an @, and can be qualified by names- pace, making @[User::foo] the fully qualified reference to the user variable foo. Columns are referred to by their name, and can be qualified by their source name, mak- ing [RawFileSource].[Customer Name]the fully qualified reference to the Customer Name column read from the RawFileSource. Square brackets are optional for names with no embedded spaces or other special characters. ■ Operators are very C-like, including == (double equal signs) for equality tests, prefix of an exclamation mark for not (for example, !> and !=), && for logical AND, || for logical OR, and ? for conditional expressions (think IIf() function). For example, @[User::foo] == 17 && CustomerID < 100 returns true if the variable foo equals 17 AND the CustomerID column is less than 100. ■ String constants are enclosed in double quotes, and special characters are C-like backslash escape sequences, such as \n for new line and \t for tab. ■ The cast operator works by describing the target type in parentheses immediately before the value to be converted. For example, (DT_I4)"193" will convert the string ‘‘193’’ to a four-byte integer, whereas (DT_STR,10,1252)@[User::foo]converts the value of the user variable foo to a 10-character string using codepage 1252. The codepage has no default, so everyone will learn the number of their favorite codepage. ■ Functions mostly come from the Transact-SQL world, including the familiar date ( GETDATE(), DATEADD(), YEAR()), string (SUBSTRING(), REPLACE(), LEN()), and mathematical ( CEILING(), SIGN()) entries. Details do differ from standard T-SQL, however, so use the Expression Builder or Books Online to check availability and syntax. A codepage, not to be confused with a locale identifier, maps character representa- tions to their corresponding codes. Two good sources for codepage references are www.i18nguy.com/unicode/codepages.html and www.microsoft.com/typography/unicode/ cscp.htm . Configuring elements A large number of elements work together in a functioning Integration Services package, including Con- trol Flow tasks, task precedence, and data flow components. This section describes the concepts and settings common to each area. Later, this chapter describes the functions and unique properties for indi- vidual elements. 834 www.getcoolebook.com Nielsen c37.tex V4 - 07/21/2009 2:13pm Page 835 Performing ETL with Integration Services 37 Control flow Work flow for both the Control Flow and Event Handler tabs is configured by dragging control flow elements (tasks and/or containers) onto the design surface, configuring each element’s properties, and then setting execution order by connecting the items using precedence constraints. Each item can be configured using the overlapping sets of properties in the Properties pane and the Editor dialog. Right- click an item and choose Edit to invoke its Editor, which presents multiple pages (content varies accord- ing to the type of task). All editors include an Expressions page that enables many of the configurable properties to be specified by expressions, rather than static values. You can view and modify existing expression assignments directly on the page, or you can click the ellipses next to an expression to launch the Expression Builder. You can add additional expression assignments by clicking the e llipses in the top line of the expressions page, launching the Property Expressions Editor, shown in Figure 37-3. Choose the property to be set in the left column, and then enter the expression in the right column, pressing the ellipses to use the Expression Builder if desired. FIGURE 37-3 Property Expressions Editor While many of the properties available vary by item, several are available across all items, including packages, containers, and individual tasks. These common properties include the following: ■ DelayValidation: Normally, each task in a package is validated before beginning execution to avoid unnecessary partial runs (such as waiting 20 minutes to discover that the last step’s filename was mistyped). Set this property to true to delay validation until the task actually runs. This option is useful for tasks that reference objects that don’t exist when the package starts, but that will exist by the time the task executes. ■ Disable: When set to true, the task will not execute. This option is also available from the context menu’s Disable/Enable toggle. Note how disabled tasks display in a darker color. 835 www.getcoolebook.com Nielsen c37.tex V4 - 07/21/2009 2:13pm Page 836 Part V Data Connectivity ■ DisableEventHandler: This keeps event handlers from executing for the current task, although event handlers for parent objects (e.g., containers, packages) still execute. ■ Error handling properties are best considered as a group: ■ FailPackageOnFailure: When set to true, the entire package fails when the individual item fails. The default is false. ■ FailParentOnFailure: When set to true, the parent container fails when the individual task fails. If a task is not explicitly included in a container (e.g., For Loop, Foreach Loop , or Sequence), then it is implicitly wrapped in an invisible TaskHost container, which acts as the parent. The default i s false. ■ MaximumErrorCount: Maximum number of errors a task or container can see before failing itself. The default is 1, so the first error encountered will fail the task. Because of the default settings that apply at the package, container, and task levels, any task that fails will cause its container to fail, which in turn will fail the package, all based on the MaximumErrorCount. This is true regardless of any failure branches defined by precedence constraints. You can increase the MaximumErrorCount on a task to allow error branching to succeed. Given this behavior, where do the ‘‘FailOn’’ properties fit in? Consider a container with two tasks, one that is expected to fail in certain cases (call it ‘‘Try’’) and another that will recover from the expected failure but is not itself expected to fail (call it ‘‘Recover’’). The container’s MaximumErrorCount must be increased to allow the ‘‘Recover’’ to be reached when ‘‘Try’’ fails, but this has the side effect of ignoring failures in ‘‘Recover’’! Use the FailPackageOnFailure property on ‘‘Recover’’ to stop the entire package when the task fails, or FailParentOnFailure to take the failure precedence branch from the container when ‘‘Recover’’ fails. ■ LoggingMode: This property defaults to UseParentSetting so that logging can be defined for the entire package at once, but individual items can also be enabled or disabled. ■ Transactions can be used to ensure that a sequence of operations, such as changes to multi- ple tables, either succeed or fail together. The following properties control transactions in a package: ■ IsolationLevel: Specifies the isolation level of a transaction as one of the following: Unspecified, Chaos, ReadUncommitted, ReadCommitted, RepeatableRead, Serializable,orSnapshot . The default is Serializable. ■ TransactionOption: This property offers three options: NotSupported (the item will not participate in a transaction), Supported (if a parent container requires a transaction, then this item will participate), and Required (if a parent container has not started a transaction, then this container will start one). Once begun by a parent container, all child items can participate in that transaction by specifying a TransactionOption setting of either Supported or Required. Control flow precedence As described earlier, precedence constraints determine the order in which tasks will execute. Select any task or container to expose its precedence constraint arrow, and then drag that arrow to the task that should follow it, repeating until all items are appropriately related. Any unconstrained task will be run at the discretion of the runtime engine in an unpredictable and often parallel ordering. Each constraint 836 www.getcoolebook.com Nielsen c37.tex V4 - 07/21/2009 2:13pm Page 837 Performing ETL with Integration Services 37 defaults to an ‘‘On Success’’ constraint, which can be adjusted by double-clicking the constraint to reveal the Precedence Constraint Editor, shown in Figure 37-4. FIGURE 37-4 Precedence Constraint Editor The upper half of the editor, ‘‘Constraint options,’’ determines when the constraint should fire. It relies on two evaluation operation concepts: ■ Constraint: How the previous item completed — Success, Failure, or Completion (Comple- tion being any outcome, either success or failure) ■ Expression: The evaluation of the entered expression, which must resolve to either true or false These concepts are combined as four separate options — constraint, expression, expression and con- straint, expression or constraint — enabling very flexible constraint construction. For example, consider a task that processes a previously loaded table of data and counts the successfully processed rows. The processing task could have two outgoing paths: a success path indicating that the task was successful and that the processed rowcount matches the loaded rowcount, and a failure path indicating that either the task failed or the rowcounts don’t match. The lower half of the editor, labeled ‘‘Multiple constraints,’’ determines how the downstream tasks should deal with multiple incoming arrows. If logical AND is chosen (the default), then all the incoming constraints must fire before the task can execute. If logical OR is chosen, then any incoming constraint firing will cause the task to execute. Logical AND is the most frequently used behavior, but logical 837 www.getcoolebook.com Nielsen c37.tex V4 - 07/21/2009 2:13pm Page 838 Part V Data Connectivity OR is useful for work flows that split apart and then rejoin. For example, control can split when an upstream task has b oth success and failure branches, but the failure branch needs to rejoin the normal processing once the error has been resolved. Using a logical AND at the merge point would require both the success and the failure branches to execute before the next task could run, which cannot happen by definition. Logical AND constraints are presented visually as solid lines, whereas logical OR constraints are dotted lines. The arrows that represent precedence constraints provide other visual clues as to the type of constraint. Green arrows denote a success constraint, red a failure constraint, and blue a completion constraint. Constraints that use an expression include an f(x) icon. There is no visual queue to distinguish between Constraint AND expression versus Constraint OR expression, so it is best to double-check the Precedence Constraint Editor when an f(x) is displayed. For example, a green arrow with an f(x) displayed could fire even if the preceding task had failed, given the expression had been satisfied and the Constraint OR expression option was chosen. Data flow Unlike other tasks that can be configured in the control flow, a Data Flow task does not show an Edi- tor dialog i n response to an edit request. Instead, it switches to the Data Flow tab to view/configure the task details. Each component appearing on the design surface can in turn be configured in the Prop- erties pane, b y a component-specific editor dialog, and, for many components, by an advanced editor as well. Each data flow must begin with at least one Data Flow source, and generally ends with one or more Data Flow destinations, providing a source and sink for the data processed within the task. Between source and destination, any number of transformations may be configured to sort, convert, aggregate, or otherwise change the data. Out of each source or transformation, a green Data Flow path arrow is available to be connected to the next component. Place the next component on the design surface and connect it to the path before attempting to configure the new component, as the path provides necessary meta-data for configuration. Follow a similar process for the red error flow for any component that has been configured to redirect error rows. Use the Data Flow Path Editor to view/configure paths as necessary, double-clicking on a path to invoke its editor. The editor has three pages: ■ General: For name, description, and annotation options. While the default annotations are usually adequate, c onsider enabling additional annotations for more complex flows with intertwined paths. ■ Metadata: Displays metadata for each column in the Data Flow path, including data type and source component. This information is read-only, so adjust upstream components as necessary to make changes, or use a Data Conversion transformation to perform necessary conversions. ■ Data Viewers: Allows different types of Data Viewers to be attached to the path for testing and debugging. Because a data flow occurs within a single Control Flow task, any component that fails will cause the task to fail. 838 www.getcoolebook.com Nielsen c37.tex V4 - 07/21/2009 2:13pm Page 839 Performing ETL with Integration Services 37 Event handlers Event handlers can be defined for a long list of possible events for any Control Flow task or container. Use them for custom logging, error handling, common initialization code, and a variety of other tasks. If a handler is not defined for a given item when an event fires, then Integration Services will search par- ent containers up to the package level looking for a corresponding event handler to use instead. It is this ‘‘inheritance’’ that makes event handlers useful, enabling a single handler to be built once and then used repeatedly over many tasks and containers. To construct an event handler, switch to the Event Handlers tab and choose the Control Flow item (Executable) in the upper-left drop-down list and the event in the upper-right list. Then click the hotlink on the design surface to initialize the event. Build the logic within the handler as if it were just another control flow. Executing a package in development As portions of a package are completed, they can be tested by running the package within the develop- ment environment. Right-click a package in the Solution Explorer and c hoose Execute Package to start the package i n debug mode. Packages run in debug mode display progress within the designer environ- ment, with tasks and components changing from white (not yet run) to yellow (running) to green or red (completed with success or failure, respectively). There are other convenient methods for executing a package from within Business Intelli- gence Development Studio, but you must ensure that the correct object executes. Select- ing Start Debugging from the menu, keyboard (F5), or toolbar can be very convenient, but ensure that the package to be executed has been ‘‘Set as Startup Object’’ by right-clicking on that package in the Solution Explorer. In addition, solutions that contain more than one project may execute unexpected actions (such as deploying an Analysis Services database) regardless of startup object/project settings before beginning to debug the selected package. Even in development, inadvertently starting a six-hour data load or stepping on a cube definition can be quite painful. Once the debug run begins, an Execution Results tab appears displaying the execution trace, including detailed messages and timing for each element of the package. When the package completes, it remains in debug mode to enable variables and state information to be reviewed. To return to design mode, choose the Stop button on the Debug toolbar, or choose Stop Debugging from the Debug menu (Shift+F5). You can set breakpoints on any task, container, or the package by right-clicking on the object and selecting Edit Breakpoints. The Set Breakpoints dialog (see Figure 37-5) enables a breakpoint to be set on any event associated with that object. PreExecute and PostExecute events are common choices; select- ing an object and pressing F9 is a shortcut for toggling the PreExecute event breakpoint. Optionally, instead of breaking at every execution (Always), a breakpoint can be ignored until the nth execution (Hit count equals), any time at or after the nth execution (Hit count greater than or equal to), or ignored except for the nth, 2nth, etc., execution (Hit count multiple). While execution is suspended at a breakpoint, use the Locals window to view the current values of vari- ables. You can also check the Output window for useful messages and warnings, and the Progress tab for details on run history across all tasks. The analogue to the breakpoint for data flows is the Data Viewer. Double-click on a data path of interest to add a viewer. Then, during a debug run, package execution will be suspended when the Data 839 www.getcoolebook.com Nielsen c37.tex V4 - 07/21/2009 2:13pm Page 840 Part V Data Connectivity Viewer has been populated with data. Choose the Go or Detach buttons on the Data Viewer to resume execution. FIGURE 37-5 Set Breakpoints dialog Breakpoints can also be placed in the code of a Script task. Open the script, set a breakpoint on the line of interest, and Integration Services will stop in the script debugger at the appropriate place. Integration Services Package Elements This section describes in detail the individual elements that can be used in constructing an Integration Services package. For general concepts and common properties, review the earlier sections of this chapter. Connection managers A connection manager is a wrapper for the connection string and properties required to make a c onnec- tion at runtime. Once the connection is defined, it can be referenced by other elements in the package without duplicating the connection definition. This simplifies the management of this information and configuration for alternate environments. 840 www.getcoolebook.com Nielsen c37.tex V4 - 07/21/2009 2:13pm Page 841 Performing ETL with Integration Services 37 Database Defining database connections through one of the available connection managers requires setting a few key properties: ■ Provider: The driver to be used in accessing the database ■ Server: The server or filename containing the database to be accessed ■ Initial Catalog: The default database in a multi-database source ■ Security: Database authentication method and any username/password required The first choice for accessing databases is generally an OLE DB connection manager using one of the many native providers, including SQL Server, Oracle, Jet (Access), and a long list of other source types. Other database connection managers include the following: The key to most Integration Services packages is speed. ADO.NET has more capabilities, but in most cases that is not what you are after. Most developers prefer OLE DB for that reason. ■ ADO: Provides ADO abstractions (such as command, recordset) on top of the OLE DB provider. ADO is not used by Integration Services built-in elements, but it could be required by custom tasks written to the ADO i nterface. ■ ADO.NET: Provides ADO.NET abstractions (such as named parameters, data reader, data set) for the selected database connection. While not as fast as using OLE DB, an ADO.NET connection can execute complex parameterized scripts, provide an in-memory recordset to a Foreach loop, or support custom tasks written using C# or VB.NET. ■ ODBC: Allows a connection manager to be configured based on an ODBC DSN. This is useful when OLE DB or .NET providers are not available for a given source (e.g., Paradox). ■ OLE DB: The OLE DB connection manager is generally the preferred database connection due to its raw speed. It provides methods for basic parameter substitution but falls short of ADO.NET’s flexibility. ■ Analysis Services: When accessing an existing Analysis Services database, this connection manager is equivalent to an OLE DB connection using the Analysis Services 10.0 provider. Alternately, an Analysis Services database in the same solution can be referenced — a useful feature for packages being developed in support of a new database. If one of the older OLAP providers is needed for some reason, it can be accessed via the OLE DB connection manager. ■ SQL Server Mobile: Allows a connection to mobile database . SDF files As individual tasks execute, a connection described by the connection manager is opened and closed for each task. This default setting safely isolates tasks, keeping prior tasks from tweaking the connection of subsequent tasks. If you would like to keep the same connection between tasks, then set the RetainSameConnection property to True. With appropriate care, this allows a session to be shared between tasks for the manual control of transactions, the passing of temp tables, and so on. File Remember that every file or folder referenced needs to be available not only at design time, but after a package is deployed as well. Consider using Universal Naming Convention (UNC) paths for global information or package configurations (see ‘‘Maintainable Packages,’’ later in this chapter) 841 www.getcoolebook.com . options: NotSupported (the item will not participate in a transaction), Supported (if a parent container requires a transaction, then this item will participate), and Required (if a parent container. logical 837 www.getcoolebook.com Nielsen c37.tex V4 - 07/21/2009 2:13pm Page 838 Part V Data Connectivity OR is useful for work flows that split apart and then rejoin. For example, control can split when an upstream. setting a few key properties: ■ Provider: The driver to be used in accessing the database ■ Server: The server or filename containing the database to be accessed ■ Initial Catalog: The default database