        For i = 1 To 20
            Output0Buffer.AddRow()
            Output0Buffer.RandomInt = CInt(Rnd() * 100)
        Next
    End Sub

This example works for a single output with the default name Output 0 containing a single integer column RandomInt. Notice how each output is exposed as its name plus "Buffer", with embedded spaces removed from the name. New rows are added using the AddRow method, and columns are populated by referring to them as output properties. An additional property is exposed for each column with the suffix _IsNull (e.g., Output0Buffer.RandomInt_IsNull) to mark a value as NULL.

Reading data from an external source requires some additional steps, including identifying the connection managers that will be referenced within the script on the Connection Managers page of the editor. Then, in the script, additional methods must be overridden: AcquireConnections and ReleaseConnections to open and close any connections, and PreExecute and PostExecute to open and close any record sets, data readers, and so on (database sources only). Search for the topic "Extending the Data Flow with the Script Component" in SQL Server Books Online for full code samples and related information.

Destinations
Data Flow destinations provide a place to write the data transformed by the Data Flow task. Configuring destinations is similar to configuring sources, including both basic and advanced editors, and the three common steps:

■ Connection Manager: Specify the particular table, file(s), view, or query to which data will be written. Several destinations will accept a table name from a variable.
■ Columns: Map the columns from the data flow (input) to the appropriate destination columns.
■ Error Output: Specify what to do should a row fail to insert into the destination: ignore the row, cause the component to fail (the default), or redirect the problem row to error output.

The available destinations are as follows:

■ OLE DB: Writes rows to a table, view, or SQL command (ad hoc view) for which an OLE DB driver exists. Table/view names can be selected directly in the destination or read from a string variable, and each can be selected with or without fast load. Fast load can decrease runtime by an order of magnitude or more depending on the particular data set and selected options. Options for fast load are as follows:
  ■ Keep identity: When the target table contains an identity column, either this option must be chosen to allow the identity to be overwritten with inserted values (as with SET IDENTITY_INSERT ON), or the identity column must be excluded from the mapped columns so that new identity values can be generated by SQL Server.
  ■ Keep nulls: Choose this option to load null values instead of any column defaults that would normally apply.
  ■ Table lock: Keeps a table-level lock during execution.
  ■ Check constraints: Enables CHECK constraints (such as a valid range on an integer column) for inserted rows. Note that other types of constraints, including UNIQUE, PRIMARY KEY, FOREIGN KEY, and NOT NULL, cannot be disabled. Loading data with CHECK constraints disabled will result in those constraints being marked as "not trusted" by SQL Server.
  ■ Rows per batch: Specifying a batch size provides a hint for building the query plan, but it does not change the size of the transaction used to put rows in the destination table.
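These fast-load options correspond to switches found throughout SQL Server's bulk-load interfaces. Purely as an illustration of what the checkboxes mean (the OLE DB destination is not implemented this way, and the connection string, module name, and dbo.SalesStage table here are hypothetical), the equivalent settings on the .NET SqlBulkCopy class look like this:

    Imports System.Data
    Imports System.Data.SqlClient

    Public Module FastLoadSketch
        Public Sub LoadStagingRows(ByVal stagingRows As DataTable)
            Using conn As New SqlConnection( _
                "Data Source=.;Initial Catalog=Sales;Integrated Security=SSPI")
                conn.Open()
                'KeepIdentity ~ Keep identity, KeepNulls ~ Keep nulls,
                'TableLock ~ Table lock, CheckConstraints ~ Check constraints
                Dim options As SqlBulkCopyOptions = _
                    SqlBulkCopyOptions.KeepIdentity Or _
                    SqlBulkCopyOptions.KeepNulls Or _
                    SqlBulkCopyOptions.TableLock Or _
                    SqlBulkCopyOptions.CheckConstraints
                Using bulk As New SqlBulkCopy(conn, options, Nothing)
                    bulk.DestinationTableName = "dbo.SalesStage"
                    'Loosely analogous to the batch-size options discussed here
                    bulk.BatchSize = 10000
                    bulk.WriteToServer(stagingRows)
                End Using
            End Using
        End Sub
    End Module

In the destination itself these remain checkboxes and properties in the editor; the sketch above is only a mental model for what they control.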
■ Maximum insert commit size: Similar to the BatchSize property of the Bulk Insert task (see "Control flow tasks" earlier in the chapter), the maximum insert commit size is the largest number of rows included in a single transaction. The default value is very large (the maximum integer value), allowing most any load task to be committed in a single transaction.
■ SQL Server: This destination uses the same fast-loading mechanism as the Bulk Insert task but is restricted in that the package must execute on the SQL Server that contains the target table/view. Speed can exceed OLE DB fast loading in some circumstances.
■ ADO.NET: Uses an ADO.NET connection manager to write data to a selected table or view.
■ DataReader: Makes the data flow available via an ADO.NET DataReader, which can be opened by other applications, notably Reporting Services, to read the output from the package.
■ Flat File: Writes the data flow to a file specified by a Flat File connection manager. Because the file is described in the connection manager, limited options are available in the destination: Choose whether to overwrite any existing file and provide file header text if desired.
■ Excel: Sends rows from the data flow to a sheet or range in a workbook using an Excel connection manager. Note that versions of Excel prior to 2007 can handle at most 65,536 rows and 256 columns of data, the first row of which is consumed by header information. The Excel 2007 format supports 1,048,576 rows and 16,384 columns. Strings are required to be Unicode, so any DT_STR types need to be converted to DT_WSTR before reaching the Excel destination.
■ Raw: Writes rows from the data flow to an Integration Services format suitable for fast loads by a raw source component. It does not use a connection manager; instead, specify the AccessMode by choosing to supply a filename via direct input or a string variable. Set the WriteOption property to an appropriate value:
  ■ Append: Adds data to an existing file, assuming the new data matches the previously written format.
  ■ Create always: Always starts a new file.
  ■ Create once: Creates the file initially and then appends on subsequent writes. This is useful for loops that write to the same destination many times in the same package.
  ■ Truncate and append: Keeps the existing file's metadata, but replaces the data.
  Raw files cannot handle BLOB data, which excludes any of the large data types, including text, varchar(max), and varbinary(max).
■ Recordset: Writes the data flow to a variable. Stored as a recordset, the object variable is suitable for use as the source of a Foreach loop or other processing within the package.
■ SQL Server Compact: Writes rows from the data flow into a SQL Server Compact database table. Configure by identifying the SQL Server Compact connection manager that points to the appropriate .SDF file, and then enter the name of the table on the Component Properties tab.
■ Dimension Processing and Partition Processing: These destinations enable the population of Analysis Services cubes without first populating the underlying relational data source.
Identify the Analysis Services connection manager of interest, choose the desired dimension or partition, and then select a processing mode:
  ■ Add/Incremental: Minimal processing required to add new data
  ■ Full: Complete reprocess of structure and data
  ■ Update/Data-only: Replaces data without updating the structure
■ Data Mining Model Training: Provides training data to an existing data mining structure, thus preparing it for prediction queries. Specify the Analysis Services connection manager and the target mining structure in that database. Use the Columns tab to map the training data to the appropriate mining structure attributes.
■ Script: A script can also be used as a destination, using a similar process to that already described for using a script as a source. Use a script as a destination to format output in a manner not allowed by one of the standard destinations. For example, a file suitable for input to a COBOL program could be generated from a standard data flow. Start by dragging a Script component onto the design surface, choosing Destination from the pop-up Select Script Component Type dialog. Identify the input columns of interest and configure the script properties as described previously. After pressing the Edit Script button to access the code, the primary routine to be coded is named after the input name with a _ProcessInputRow suffix (e.g., Input0_ProcessInputRow). Note the row object passed as an argument to this routine, which provides the input column information for each row (e.g., Row.MyColumn and Row.MyColumn_IsNull). Connection configuration and preparation is the same as described in the source topic. Search for the topic "Extending the Data Flow with the Script Component" in SQL Server Books Online for full code samples and related information.

Transformations
Between the source and the destination, transformations provide functionality to change the data from what was read into what is needed. Each transformation requires one or more data flows as input and provides one or more data flows as output. Like sources and destinations, many transformations provide a way to configure error output for rows that fail the transformation. In addition, many transformations provide both a basic and an advanced editor to configure the component, with normal configurations offered by the basic editor when available.

The standard transformations available in the Data Flow task are as follows:

■ Aggregate: Functions rather like a GROUP BY query in SQL, generating Min, Max, Average, and so on, on the input data flow. Due to the nature of this operation, Aggregate does not pass through the data flow, but outputs only aggregated rows. Begin on the Aggregations tab by selecting the columns to include, adding the same column multiple times in the bottom pane if necessary. Then, for each column, specify the output column name (Output Alias), the operation to be performed (such as Group by or Count), and any comparison flags for determining value matches (e.g., Ignore case). For columns being distinct counted, performance hints can be supplied for the exact number (Distinct Count Keys) or an approximate number (Distinct Count Scale) of distinct values that the transform will encounter.
The scale ranges are as follows:
  ■ Low: Approximately 500,000 values
  ■ Medium: Approximately 5,000,000 values
  ■ High: Approximately 25,000,000 values
Likewise, performance hints can be specified for the Group By columns by expanding the Advanced section of the Aggregations tab and entering either an exact (Keys) or an approximate (Keys Scale) count of different values to be processed. Alternately, you can specify performance hints for the entire component, instead of individual columns, on the Advanced tab, along with the amount to expand memory when additional memory is required.
■ Audit: Adds execution context columns to the data flow, enabling data to be written with audit information about when it was written and where it came from. Available columns are ExecutionInstanceGUID, PackageID, PackageName, VersionID, ExecutionStartTime, MachineName, UserName, TaskName, and TaskID.
■ Cache: Places selected columns from a data flow into a cache for later use by a Lookup transform. Identify the Cache connection manager and then map the data flow columns into the cache columns as necessary. The cache is a write-once, read-many data store: All the data to be included in the cache must be written by a single Cache transform but can then be used by many Lookup transforms.
■ Character Map: Allows strings in the data flow to be transformed by a number of operations: Byte reversal, Full width, Half width, Hiragana, Katakana, Linguistic casing, Lowercase, Simplified Chinese, Traditional Chinese, and Uppercase. Within the editor, choose the columns to be transformed, adding a column multiple times in the lower pane if necessary. Each column can then be given a destination of a New column or In-place change (replaces the contents of a column). Then choose an operation and the name for the output column.
■ Conditional Split: Enables rows of a data flow to be split between different outputs depending on the contents of the row. Configure by entering output names and expressions in the editor. When the transform receives a row, each expression is evaluated in order, and the first one that evaluates to true receives that row of data. When none of the expressions evaluate to true, the default output (named at the bottom of the editor) receives the row. Once configured, as data flows are connected to downstream components, an Input Output Selection pop-up appears, and the appropriate output can be selected. Unmapped outputs are ignored and can result in data loss.
■ Copy Column: Adds a copy of an existing column to the data flow. Within the editor, choose the columns to be copied, adding a column multiple times in the lower pane if necessary. Each new column can then be given an appropriate name (Output Alias).
■ Data Conversion: Adds a copy of an existing column to the data flow, enabling data type conversions in the process. Within the editor, choose the columns to be converted, adding a column multiple times in the lower pane if necessary. Each new column can then be given an appropriate name (Output Alias) and data type. Conversions between code pages are not allowed. Use the advanced editor to enable locale-insensitive fast parsing algorithms by setting the FastParse property to true on each output column.
■ Data Mining Query: Runs a DMX query for each row of the data flow, enabling rows to be associated with predictions, such as the likelihood that a new customer will make a purchase or the probability that a transaction is fraudulent.
Configure by specifying an Analysis Services connection manager, choosing the mining structure, and highlighting the mining model to be queried. On the Query tab, click the Build New Query button and map columns in the data flow to the columns of the model (a default mapping is created based on column name). Then specify the columns to be added to the data flow in the lower half of the pane (usually a prediction function) and give the output an appropriate name (Alias).
■ Derived Column: Uses expressions to generate values that can either be added to the data flow or replace existing columns. Within the editor, construct Integration Services expressions to produce the desired value, using type casts to change data types as needed. Assign each expression to either replace an existing column or be added as a new column. Give new columns an appropriate name and data type.
■ Export Column: Writes large object data types (DT_TEXT, DT_NTEXT, or DT_IMAGE) to file(s) specified by a filename contained in the data flow. For example, large text objects could be extracted into different files for inclusion in a website or text index. Within the editor, specify two columns for each extract defined: a large object column and a column containing the target filename. A file can receive any number of objects. Set the Append/Truncate/Exists options to indicate the desired file create behavior.
■ Fuzzy Grouping: Identifies duplicate rows in the data flow using exact matching for any data type and/or fuzzy matching for string data types (DT_STR and DT_WSTR). Configure the task to examine the key columns within the data flow that identify a unique row. Several columns are added to the output as a result of this transform:
  ■ Input key (default name _key_in): A sequential number assigned to identify each input row.
  ■ Output key (default name _key_out): The input key of the row this row matches (or its own input key if it is not a duplicate). One way to cull the duplicate rows from the data flow is to define a downstream Conditional Split on the condition [_key_in] == [_key_out].
  ■ Similarity score (default name _score): A measure of the similarity of the entire row, on a scale of 0 to 1, to the first row of the set of duplicates.
  ■ Group Output (default name <column>_clean): For each key column selected, this is the value from the first row of the set of duplicates (that is, the value from the row indicated by _key_out).
  ■ Similarity Output (default name _Similarity_<column>): For each key column selected, this is the similarity score for that individual column versus the first row of the set of duplicates.
Within the editor, specify an OLE DB connection manager where the transform will have permissions to create a temporary table. Then configure each key column by setting its Output, Group Output, and Similarity Output names. In addition, set the following properties for each column:
  ■ Match Type: Choose between Fuzzy and Exact match types for each string column (non-string data types always match exactly).
  ■ Minimum Similarity: The smallest similarity score allowed for a match. Leaving fuzzy match columns at the default of 0 enables similarity to be controlled from the slider on the Advanced tab of the editor.
  ■ Numerals: Specify whether leading or trailing numerals are significant in making comparisons. The default of Neither specifies that leading and trailing numerals are not considered in matches.
  ■ Comparison Flags: Choose settings appropriate to the type of strings being compared.
■ Fuzzy Lookup: Similar to the Lookup transform, except that when an exact lookup fails, a fuzzy lookup is attempted for any string columns (DT_STR and DT_WSTR). Specify an OLE DB connection manager and table name where values will be looked up, and a new or existing index to be used to cache fuzzy lookup information. On the Columns tab, specify a join between the data flow and the reference table, and which columns from the reference table will be added to the data flow. On the Advanced tab, select the similarity required for finding a match: The lower the number, the more liberal the matches become. In addition to the specified columns added to the data flow, match metadata is added as follows:
  ■ _Similarity: Reports the similarity between all of the values compared.
  ■ _Confidence: Reports the confidence level that the chosen match was the correct one compared to other possible matches in the lookup table.
  ■ _Similarity_<column name>: Similarity for each individual column.
The advanced editor has settings of MinimumSimilarity and FuzzyComparisonFlags for each individual column.
■ Import Column: Reads large object data types (DT_TEXT, DT_NTEXT, or DT_IMAGE) from files specified by a filename contained in the data flow, adding the text or image objects as a new column in the data flow. Configure in the advanced editor by identifying each column that contains a filename to be read on the Input Columns tab. Then, on the Input and Output Properties tab, create a new output column for each filename column to contain the contents of the files as they are read, giving the new column an appropriate name and data type. In the output column properties, note the grayed-out ID property, and locate the properties for the corresponding input (filename) column. Set the input column's FileDataColumnID property to the output column's ID value to tie the filename and contents columns together. Set the ExpectBOM property to true for any DT_NTEXT data being read that has been written with byte-order marks.
■ Lookup: Finds rows in a database table or cache that match the data flow and includes selected columns in the data flow, much like a join between the data flow and a table or cache. For example, a product ID could be added to the data flow by looking up the product name in the master table. Note that all lookups are case sensitive regardless of the collation of the underlying database. Case can be effectively ignored by converting the associated text values to a single case before comparison (e.g., using the UPPER function in a Derived Column expression). The Lookup transform operates in three possible modes:
  ■ No cache: Runs a query against the source database for each lookup performed. No cache is kept in memory in order to minimize the number of database accesses, but each lookup reflects the latest value stored in the database.
  ■ Full cache: Populates an in-memory cache from either the database or a Cache connection manager (see the Cache transform and connection manager descriptions earlier in this chapter) and relies solely on that cache for lookups during execution.
This minimizes the disk accesses required, but it may exceed available memory for very large data sets, which can dramatically reduce performance. Because no error message appears as performance degrades, it is useful to monitor resource usage while processing sample data sets to determine whether the cache size will work for the range of data sizes expected in production use.
  ■ Partial cache: Populates an in-memory cache with a subset of the data available from the database, and then issues queries against the database for any values not found within the in-memory cache. This method provides a compromise between speed and available memory. Whenever possible, this mode should be used with a query that fills the cache with the most likely rows encountered. For example, many warehousing applications are more likely to access values recently added to the database.
Start the Lookup transform configuration process by selecting the cache mode and, for Full cache mode, the connection type. The most common handling of rows with no matching entries is to "Redirect rows to no match output" for further processing, but the context may require one of the other options. On the Connections page, choose the connection manager containing the reference data, and the table or query from which to retrieve that data (for database connections). Usually, the best choice is a query that returns only the columns used in the lookup, which avoids reading and storing unused columns. On the Columns tab, map the join columns between the data flow and the reference table by dragging and dropping lines between corresponding columns. Then check the reference table columns that should be added to the data flow, adjusting names as desired in the bottom pane. The Advanced tab provides an opportunity to optimize memory performance of the Lookup transform for Partial cache mode, and to modify the query used for row-by-row lookups. Set the size for in-memory caching based on the number of rows that will be loaded; these values often require testing to refine. "Enable cache for rows with no matching entries" allows row-by-row lookups that fail to be recorded in the in-memory cache along with the data originally read at the start of the transform, thus avoiding repeated database accesses for missing values. Review the custom query to ensure that the row-by-row lookup statement is properly built.
■ Merge: Combines the rows of two sorted data flows into a single data flow. For example, if some of the rows of a sorted data flow are split off by an error output or Conditional Split transform, they can be merged again. The upstream sort must have used the same key columns for both flows, and the data types of columns to be merged must be compatible. Configure by dragging two different inputs to the transform and mapping columns together in the editor. See the Union All description later in this list for the unsorted combination of flows.
■ Merge Join: Provides SQL join functionality between data flows sorted on the join columns. Configure by dragging the two flows to be joined to the transform, paying attention to which one is connected to the left input if a left outer join is desired. Within the editor, choose the join type, map the join columns, and choose which columns are to be included in the output.
■ Multicast: Copies every row of an input data flow to many different outputs.
Once an output has been connected to a downstream component, a new output will appear for connection to the next downstream component. Only the names of the outputs are configurable.
■ OLE DB Command: Executes a SQL statement (such as UPDATE or DELETE) for every row in a data flow. Configure by specifying an OLE DB connection manager to use when executing the command, and then switch to the Component Properties tab and enter the SQL statement using question marks for any parameters (e.g., UPDATE MyTable SET Col1 = ? WHERE Col2 = ?). On the Column Mappings tab, associate a data flow column with each parameter in the SQL statement.
■ Percentage Sampling: Splits a data flow by randomly sampling the rows for a given percentage. For example, this could be used to separate a data set into training and testing sets for data mining. Within the editor, specify the approximate percentage of rows to allocate to the selected output, while the remaining rows are sent to the unselected output. If a sampling seed is provided, the transform will always select the same rows from a given data set.
■ Pivot: Denormalizes a data flow, similar to the way an Excel pivot table operates, making attribute values into columns. For example, a data flow with three columns, Quarter, Region, and Revenue, could be transformed into a data flow with columns for Quarter, Western Region, and Eastern Region, thus pivoting on Region.
■ Row Count: Counts the number of rows in a data flow and places the result into a variable. Configure by populating the VariableName property.
■ Row Sampling: Nearly identical to the Percentage Sampling transform, except that the approximate number of rows to be sampled is entered, rather than the percentage of rows.
■ Script: Using a script as a transformation enables transformations with very complex logic to act on a data flow. Start by dragging a Script component onto the design surface, choosing Transformation from the pop-up Select Script Component Type dialog. Within the editor's Input Columns tab, mark the columns that will be available in the script, and indicate which will be ReadWrite versus ReadOnly. On the Inputs and Outputs tab, add any output columns that will be populated by the script above and beyond the input columns. On the Script page of the editor, list the read and read/write variables to be accessed within the script, separated by commas, in the ReadOnlyVariables and ReadWriteVariables properties, respectively. Click the Edit Script button to expose the code itself, and note that the primary method to be coded overrides <inputname>_ProcessInputRow, as shown in this simple example:

    Public Overrides Sub Input0_ProcessInputRow _
            (ByVal Row As Input0Buffer)

        'Source system indicates missing dates with old values,
        'replace those with NULLs. Also determine if given time
        'is during defined business hours.
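        'Note: with its default settings, VB's Weekday() returns 1 for Sunday
        'and 7 for Saturday, so the checks below treat Monday through Friday,
        '8:00 a.m. to 4:59 p.m., as prime time.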
        If Row.TransactionDate < #1/1/2000# Then
            Row.TransactionDate_IsNull = True
            Row.PrimeTimeFlag_IsNull = True
        Else
            'Set flag for prime time transactions
            If Weekday(Row.TransactionDate) > 1 _
                    And Weekday(Row.TransactionDate) < 7 _
                    And Row.TransactionDate.Hour > 7 _
                    And Row.TransactionDate.Hour < 17 Then
                Row.PrimeTimeFlag = True
            Else
                Row.PrimeTimeFlag = False
            End If
        End If
    End Sub

This example uses one ReadWrite input column (TransactionDate) and one output column (PrimeTimeFlag), with the input name left at the default of Input 0. Each column is exposed as a property of the Row object, as is the additional property with the suffix _IsNull used to test or set the column value as NULL. The routine is called once for each row in the data flow.
■ Slowly Changing Dimension: Compares the data in a data flow to a dimension table and, based on the roles assigned to particular columns, maintains the dimension. This component is unusual in that it does not have an editor; instead, a wizard guides the steps to define column roles and interactions with the dimension table. At the conclusion of the wizard, several components are placed on the design surface to accomplish the dimension maintenance task.
■ Sort: Sorts the rows in a data flow by selected columns. Configure by selecting the columns to sort by. Then, in the lower pane, choose the sort type, the sort order, and the comparison flags appropriate to the data being sorted.
■ Term Extraction: Builds a new data flow based on terms it finds in a Unicode text column (DT_WSTR or DT_NTEXT). This is the training part of text mining, whereby strings of a particular type are used to generate a list of commonly used terms, which is later used by the Term Lookup component to identify similar strings. For example, the text of saved RSS documents could be used to find similar documents in a large population. Configure by identifying the column containing the Unicode text to be analyzed. If a list of terms to be excluded has been built, then identify the table and column on the Exclusions tab. The Advanced tab controls the extraction algorithm, including whether terms are single words or phrases (articles, pronouns, etc., are never included), the scoring algorithm, the minimum frequency before extraction, and the maximum phrase length.
■ Term Lookup: Provides a "join" between a Unicode text column (DT_WSTR or DT_NTEXT) in the data flow and a reference table of terms built by the Term Extraction component. One row appears in the output data flow for each term matched. The output data flow also contains two columns in addition to the selected input columns: Term and Frequency. Term is the noun or noun phrase that was matched, and Frequency is the number of occurrences in the data flow column. Configure the transform by specifying the OLE DB connection manager and table that contain the list of terms. Use the Term Lookup tab to check the input columns that should be passed through to the output data flow, and then map the input Unicode text column to the Term column of the reference table by dragging and dropping between those columns in the upper pane.
■ Union All: Combines rows from multiple data flows into a single data flow, assuming the source columns are of compatible types. Configure by connecting as many data flows as needed to the component. Then, using the editor, ensure that the correct columns from each data flow are mapped to the appropriate output column.
■ Unpivot: Makes a data flow more normalized by turning columns into attribute values. For example, a data flow with one row for each quarter and a column for revenue by region could be turned into a three-column data flow: Quarter, Region, and Revenue.

Maintainable and Manageable Packages
Integration Services enables applications to be created with relatively little effort, which is a great advantage from a development perspective, but can be a problem if quickly developed systems are deployed without proper planning. Care is required to build maintainable and manageable applications regardless of the implementation. Fortunately, Integration Services is designed with many features that support long-term maintainability and manageability.

Designing before developing is especially important when first getting started with Integration Services, as practices established early are often reused in subsequent efforts, especially logging, auditing, and overall structure. Perhaps the key advantage to developing with Integration Services is the opportunity to centralize everything about a data processing task in a single place, with clear precedence between steps and opportunities to handle errors as they occur. Centralization greatly increases maintainability compared to the traditional "script here, program there, stored procedure somewhere else" approach. Other topics to consider during design include the following:

■ Identify repeating themes for possible package reuse. Many tasks that repeat the same activities on objects with the same metadata are good candidates for placing in reused subpackages.
■ Appropriate logging strategies are the key to operational success. When an error occurs, who will be responsible for noticing, and how will they know? For example, how will someone know whether a package was supposed to run but did not for some reason? What level of logging is appropriate? (More is not always better; too many irrelevant details mask true problems.) What kinds of environment and package state information will be required to understand why a failure has occurred after the fact? (For more information about logging, see the next section.)
■ Auditing concepts may be useful for both compliance and error-recovery operations. What type of information should be associated with data created by a package? If large quantities of information are required, then consider adding the details to an audit or lineage log, adding only an ID to affected records. Alternately, the Audit transform described earlier in this chapter can be used to put audit information on each row.
■ For packages that run on multiple servers or environments, what configuration details change for those environments? Which storage mode (registry, SQL, XML, etc.) will be most effective at distributing configuration data? (See the "Package configurations" section later in this chapter.)
■ Determine how to recover from a package failure. Will manual intervention be required before the package can run again? For example, a package that loads data may be able to use transactions to ensure that rerunning a package does not load duplicate rows.
■ Consider designing checkpoint-restartable logic for long-running packages. (See the "Checkpoint restart" section later in this chapter.)
■ Determine the most likely failure points in a package. What steps will realistically be taken to address a failure?
Add those steps to the package if possible, using error data flows and task constraints now to avoid labor costs later.

Good development practices help increase maintainability as well. Give packages, tasks, components, and other visible objects meaningful names. Liberal use of annotations to note non-obvious meanings and motivations will benefit future developers, too. Finally, use version-control software to maintain a history of package and related file versions.

Logging
Because many packages are destined for unattended operation, generating an execution log is an excellent method for tracking operations and collecting debug information. To configure logging for a package, right-click on the package design surface and choose Logging. On the Providers and Logs