import. We have to write custom code to parse the good rows of data into the correct columns. So, let's start configuring our first Script component.

1. Add a new Integration Services package in the Programming SSIS project and rename it Extending Data Flow with Script Component.dtsx.

2. Add a Flat File Connection Manager for the C:\SSIS\RawFiles\Sales.txt file with default settings and rename it Sales.

3. Add a Data Flow task in the newly created package and double-click it to go to the Data Flow tab. Drag and drop the Script component onto the Data Flow Designer surface. The Select Script Component Type dialog box will pop up on the screen (see Figure 11-10). Select the radio button to configure the component as a Source and click OK to place this task as a source on the Data Flow Designer surface.

4. Rename the component Script Source Component and double-click it to open the Script Transformation Editor. Change the ScriptLanguage to Microsoft Visual Basic 2008.

5. Rename Output 0 in the Inputs And Outputs page to SalesOutput. Note that you have only one output available in this tab and no inputs. Though you can add more outputs if you wish to create a multiple-output source, you can't add an input here, because sources have no inputs. You refer to outputs (and inputs) in the code by their names suffixed with Buffer. For example, SalesOutput will be referenced in the code as SalesOutputBuffer.

Figure 11-10  Selecting the Script component type in the Data Flow Designer

6. Expand the SalesOutput and click the Output Columns node. Now you can add columns to SalesOutput by clicking Add Column. Click this button four times to add four columns with the following details, as shown in Figure 11-11.

Column Name     Data Type                           Length
FirstName       string [DT_STR]                     20
LastName        string [DT_STR]                     20
Title           string [DT_STR]                     20
SalesAmount     four-byte signed integer [DT_I4]

Figure 11-11  Adding columns in the Script Source Component Outputs

Remember to choose the correct data types for the columns you define here; for instance, we used a length of 20 in this case instead of the default 50. The space you save by using a data type that is just big enough to fit the data means that more rows can fit into each data buffer, and hence SSIS can work faster. Think of it this way: you need to take fewer buckets out of the well if you fill each bucket to its full capacity.

7. Now go to the Connection Managers page and add the Sales Connection Manager. Change the name from Connection to Sales in the Name field. This name is exposed inside the script and makes accessing the package connection managers quite easy in your script.

8. Now click the Edit Script button in the Script page to open the scripting environment. As you can see, this environment is quite different from the one you've seen earlier in the Script task. So, let's spend some time here to understand the various parts of this auto-generated code. First of all, notice in the Project Explorer window that there are three project items in the ScriptComponent project: BufferWrapper, ComponentWrapper, and Main.

BufferWrapper: The classes in the BufferWrapper project item provide methods for working with the data flow buffers and typed properties for each column. Double-click the BufferWrapper.vb project item to see the auto-generated code. It contains a public class for each output buffer (in our case, only one, SalesOutputBuffer) and two typed write-only properties for each column: one with the column name, used to refer to the column in your code, and the other with the column name suffixed with _IsNull, used to set the column value to null. Scroll all the way down to see an AddRow method that is used to add an empty row to the output buffer, and a SetEndOfRowset method that determines, using the EndOfRowset function, that the current buffer is the last buffer of data and passes this information to the data flow engine. Note that the ScriptBuffer class serves as the base class for the classes representing the inputs and the outputs. You are not supposed to edit this auto-generated code, as it will be overwritten when you modify the Script component.
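Although you never edit these classes, you will call their members constantly from Main.vb. As a quick orientation (a usage sketch only; the literal value "Jane" is just sample data, not part of the exercise), the members generated for our SalesOutput are used like this:

SalesOutputBuffer.AddRow()               ' add an empty row to the output buffer
SalesOutputBuffer.FirstName = "Jane"     ' typed write-only property generated for the FirstName column
SalesOutputBuffer.Title_IsNull = True    ' companion _IsNull property sets the Title column to null
SalesOutputBuffer.SetEndOfRowset()       ' tell the data flow engine this is the last buffer of data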
ComponentWrapper: The classes in the ComponentWrapper project item provide methods and properties to process data and to interact with the package objects. Double-click ComponentWrapper.vb to see the auto-generated code. This project item creates a UserComponent class that is inherited from the ScriptComponent class. The ComponentWrapper has an overridden implementation of the PrimeOutput method that is called only once at run time. The PrimeOutput method prepares the outputs to accept new rows and is used for the outputs where you add new rows to the output buffers, such as in a source or in a transformation with an asynchronous output. This method then passes the processing to the CreateNewOutputRows method, the FinishOutputs method, and MarkOutputsAsFinished, which sets the end of rowset on the last output buffer. Note that the ComponentWrapper item also contains Connections and Variables collection classes. As with BufferWrapper, you are not supposed to edit this auto-generated code, as it will be overwritten when you modify the Script component.

Main: This project item contains the ScriptMain class, which inherits from the UserComponent class. In contrast to the other two project items, which you don't write code against directly, you will be writing your code here, using the methods and properties that those wrapper classes provide. Double-click Main.vb to see the auto-generated code for the configurations you've done earlier in the metadata design mode. Also, bear in mind that the auto-generated code in the Main.vb item is generated only when you click the Edit Script button for the first time. Later, when you make changes to the Script component (for example, when you add more outputs), the Main.vb code is not updated with those changes, and you have to add methods and properties manually to perform the additional functions. This behavior also means that the code you write in the Main.vb project item is persisted within the Script component and the auto-generation of code doesn't affect it.

Note that the Imports statements reference two wrappers. These are the Primary Interop Assemblies for their respective namespaces. The run-time wrapper provides the classes and interfaces used to access the Control Flow components at run time, while the pipeline wrapper provides the classes and interfaces used to create custom Data Flow components. This implies that the wrappers help your custom script access package objects such as variables and connection managers, in much the same way as the Dts global object provides access in the Script task.
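In the auto-generated code, these wrappers appear as namespace imports at the top of the project items; the two relevant lines look like this:

Imports Microsoft.SqlServer.Dts.Pipeline.Wrapper
Imports Microsoft.SqlServer.Dts.Runtime.Wrapper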
The ScriptMain class is the entry point class here, but there is no Main() subroutine such as you've seen in the Script task. In a Script Source component, most of the work is performed in the CreateNewOutputRows() subroutine. The CreateNewOutputRows method is used along with the AddRow method to add new rows to the output, and it is primarily used when you're writing code for a source or for an asynchronous output. You will study the synchronous and asynchronous outputs in the next part of this Hands-On exercise. There are two more subroutines, PreExecute() and PostExecute(). As the names indicate, these subroutines run any one-time tasks you need to perform before and after the Script component processes its inputs and outputs, such as initializing or closing connections.
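When you open Main.vb for this Source component for the first time, the skeleton you are given looks roughly like the following (a sketch with the generated attributes and boilerplate comments trimmed); the next steps fill these subroutines in:

Public Class ScriptMain
    Inherits UserComponent

    Public Overrides Sub PreExecute()
        MyBase.PreExecute()
        ' one-time setup before rows are produced
    End Sub

    Public Overrides Sub PostExecute()
        MyBase.PostExecute()
        ' one-time cleanup after all rows have been produced
    End Sub

    Public Overrides Sub CreateNewOutputRows()
        ' add rows to SalesOutputBuffer here
    End Sub
End Class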
9. As we are going to use a StreamReader to read our nonstandard text file, add the following namespace to the Imports statements:

Imports System.IO

Declare the following in the ScriptMain class:

Private textReader As StreamReader
Private SalesFile As String

10. Next, to use the package connection manager Sales in your script (the connection manager you specified in the Connection Managers page of the Script component GUI), you can use the AcquireConnections method along with the IDTSConnectionManager100 interface, which returns a reference to the required connection manager. You can override the AcquireConnections method to retrieve the connection information from the Sales Connection Manager. Add the following subroutine below the previous statements, as shown in Figure 11-12.

Public Overrides Sub AcquireConnections(ByVal Transaction As Object)
    Dim connMgr As IDTSConnectionManager100 = Me.Connections.Sales
    SalesFile = CType(connMgr.AcquireConnection(Nothing), String)
End Sub

11. As the connection has been set to the Sales.txt file using the SalesFile connection string in the previous step, you can now initialize the textReader to connect to the SalesFile. As this is a one-time operation that needs to be performed before the component starts reading the file, we will use the PreExecute method. Add the following lines in the PreExecute subroutine:

MyBase.PreExecute()
textReader = New StreamReader(SalesFile)

12. Having opened the text reader, you also need to close it after all the rows have been processed. As this is a one-time operation that needs to be performed after the processing of rows, you will use the PostExecute method for this. Add the following code in the PostExecute subroutine:

textReader.Close()

13. Finally, you can create new output rows as the text reader reads the text file. Add the following piece of code in the CreateNewOutputRows subroutine (refer to Figure 11-12).

Dim textLine As String
textLine = textReader.ReadLine
Do While textLine IsNot Nothing
    If textLine.StartsWith("BEGIN_RECORD") Then
        SalesOutputBuffer.AddRow()
    ElseIf textLine.StartsWith("FirstName: ") Then
        SalesOutputBuffer.FirstName = textLine.Remove(0, 11)
    ElseIf textLine.StartsWith("LastName: ") Then
        SalesOutputBuffer.LastName = textLine.Remove(0, 10)
    ElseIf textLine.StartsWith("Title: ") Then
        SalesOutputBuffer.Title = textLine.Remove(0, 7)
    ElseIf textLine.StartsWith("SalesAmount: ") Then
        SalesOutputBuffer.SalesAmount = textLine.Remove(0, 13)
    End If
    textLine = textReader.ReadLine
Loop

Figure 11-12  Code listing for the Script Source Component

This code reads a line of text using the textReader ReadLine function and stores the line in the textLine string variable. The read happens inside a While loop that continues until there is nothing left to read. Once a line has been read, it passes through the various conditions and, depending upon which condition evaluates to true, the code either adds a new row to the output buffer using the AddRow method or writes the data into the corresponding column. Last, it is a good practice to use the SetEndOfRowset method, which determines that the current buffer is the last buffer of data, so that the downstream components know that no more rows are expected. One example of SetEndOfRowset is shown later in the chapter (refer to Figure 11-17). Close the scripting environment and click the OK button on the Script Transformation Editor, as you've now created a Script Source component.

14. Though we will debug the package when it is ready, running the package now to test the script source will be a good idea and will confirm that the code we have built so far is working. To run the package and see how the file has been read, we will use a Row Count transformation, as it can consume pipeline rows without requiring a destination. We will also need a variable to configure this transformation.

15. Create a variable varRecords at the package scope of the Int32 data type with 0 as its value. Drop the Row Count transformation below the Script Source Component and join the two with the data flow path.

16. Add a grid-type data viewer to the data flow path to the Row Count transformation to see the records in a tabular format. To add a data viewer, double-click the data flow path, go to the Data Viewers page, click Add, and choose the Grid type data viewer.

17. Execute the package and you will see the SalesOutput Data Viewer showing the records that we wanted to read, in a tabular format, as shown in Figure 11-13. If you open the file at this point, you can appreciate that the Script component has read the data from a nonstandard text file and formatted it in a form that is much easier to read and operate on. It has also nicely ignored the comments in the text file that we didn't want to read anyway.

Figure 11-13  SalesOutput Data Viewer showing the data in a grid format

Script Component as a Transformation

In this section, you will learn to configure a Script component as a transformation. You will be writing code for a script transformation more often than for a script source or a script destination, so it is all the more important to understand the components involved in designing code for a transformation.

You have imported Sales data into the pipeline using a script source; now, in this part of the exercise, you will derive the bonuses paid out to the employees based on the title and the sales amount that each employee achieves. The business also wants to know the total sales amount and the total bonus disbursed, as separate reports. The business rules are defined in this way: Bonuses will be paid to all employees who achieve their targets. The target for a Sales Representative is ten thousand dollars, that for a Sales Manager is fifty thousand dollars, and that for a Vice President is one hundred thousand dollars. A bonus is paid at a fixed rate of 2 percent of the sales amount, and the business wants to know who has been paid a bonus and how much. Second, the business also wants to know the aggregated sales and bonus amounts.

From this description, you can see that you need to derive an indicator for bonuses paid and derive the bonus amount per employee if he or she has achieved the target. Second, you need to aggregate the sales amount and the bonus amount and write those values to a separate file. It is also evident that while the first requirement can be met using row-based operations, the second requirement is entirely a rowset-based operation. This leads us to a brief discussion of synchronous and asynchronous components. These components will be covered in detail in Chapter 15, but for now keep in mind that a synchronous component is one in which the output is synchronous to the input: the rows get processed as they come, and row-level operations are performed, such as deriving a column value based on other column values. The asynchronous components, on the other hand, are the ones that perform operations on the complete rowset instead of one row at a time, such as aggregation and sorting operations. These operations need all the rows before they can produce output. Moreover, the number of output rows can be different from (and is generally less than) the number of input rows, whereas a synchronous component outputs the same number of rows it receives at the input. One last but very important difference from the point of view of writing code: synchronous components work on the same buffers of data and do not create or write data to new buffers; they simply add or change data in the same buffer. Asynchronous components, on the other hand, block their inputs, collect all the input rows, perform the required operations, and write the output to new buffers, which means they work with more than one buffer at a time. This also explains why asynchronous components generally need more memory. So, we will create two outputs for our script transformation. They will be, as you can guess, synchronous and asynchronous, to meet both requirements.
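To see what this difference means in code terms, compare the following two fragments. This is a sketch only: the input, column, and output names used here (MyInput, Amount, DoubledAmount, TotalsOutput) are made up for illustration and are not part of this exercise.

' Synchronous output: each row is changed in place, in the buffer it arrived in.
Public Overrides Sub MyInput_ProcessInputRow(ByVal Row As MyInputBuffer)
    Row.DoubledAmount = Row.Amount * 2   ' row-level derivation, no new buffer
End Sub

' Asynchronous output: rows are accumulated and written to a separate output
' buffer, and nothing reaches the downstream components until the last input
' buffer has been seen.
Private totalAmount As Integer = 0

Public Overrides Sub MyInput_ProcessInput(ByVal Buffer As MyInputBuffer)
    While Buffer.NextRow()
        totalAmount += Buffer.Amount
    End While
    If Buffer.EndOfRowset() Then
        TotalsOutputBuffer.AddRow()              ' new row goes into a new buffer
        TotalsOutputBuffer.TotalAmount = totalAmount
        TotalsOutputBuffer.SetEndOfRowset()
    End If
End Sub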
Scripting a Synchronous Transformation

In this part you will create an IsBonusPaid indicator based on the targets and will derive the bonus using the sales amount. Though this kind of operation doesn't require scripting, as the Derived Column transformation is quite capable of performing such derivations, this example will show you how you can implement complex business requirements that otherwise can't be met using the preconfigured components in Integration Services.

18. When you are ready to proceed, create a variable for the bonus calculation at the package scope with the following details:

Name        varBonusMultiplier
Data Type   Int32
Value       2
19. Delete the data flow path between the Row Count and the Script Source Component. Drop a Script component in between the Script Source Component and the Row Count transformation. Choose Transformation and click OK in the Select Script Component Type dialog box. Rename this component Script Transformation Component.

20. Join this new transformation component with the source and the Row Count transformation. Double-click the newly added Script Transformation Component to open the editor. Change the ScriptLanguage to Microsoft Visual Basic 2008.

21. Just as with a Script task, you can make the package variables available to your script using the ReadOnlyVariables or ReadWriteVariables fields. Select the User::varBonusMultiplier variable in the ReadOnlyVariables field.

22. Note that you have one additional page this time, called Input Columns, in which you can select the columns you want to work with. In this page, when you select an input column, you can specify its Usage Type as ReadOnly or ReadWrite. This is quite a handy feature from a data security perspective, as it won't let the columns marked with the ReadOnly usage type get accidentally updated. For example, if you want to derive data using your custom code and output the derived data in a new output column, it is advisable to mark the input column's usage type as ReadOnly. You can assign an alias to the output column in the Output Alias column in case it is a ReadWrite column and you're going to update it. Also, keep in mind that you don't need to select all of the available fields in the input; rather, it is efficient to select only the fields you're going to work with. The unselected fields will be passed on to the downstream component as is, without any changes. Typically, a synchronous component works only on the columns used in the derivations in a buffer; the rest of the columns in the buffer stay untouched, and the buffer is passed on to the next component in the data flow.

23. Select the Title and SalesAmount columns and assign them the ReadOnly usage type.

24. Go to the Inputs and Outputs page. Note that you have two items here, Input 0 and Output 0. This is because a transformation has both inputs and outputs. Rename Input 0 to XfrInput and Output 0 to SynchXfrOutput. Expand the XfrInput to see the Title and SalesAmount fields that you selected in the Input Columns page. Note that you can add more outputs here, but not inputs. This is because a Script component can have only one input when used as a transformation or a destination; also, remember that a script source doesn't have any input. Click the SynchXfrOutput. There are two important properties here that need to be understood properly, as they can be quite useful on some occasions. The first property is the ExclusionGroup, which is set to 0 by default. This implies that all the input rows are sent to all the outputs. If you add more than one output to your Script component and you do not want all the rows to be sent to all the outputs (that is, you want to split the rows between outputs based on some criteria), you will need to set the ExclusionGroup value to be the same on all the outputs to indicate that you want to split the rows among them; this may be any arbitrary nonzero value. In this case you will use the DirectRowTo<Output Name>Buffer method in your code to decide where to direct the different types of rows.
For instance, along with the usual data split, you can actually use this method to direct error rows to one of the outputs and so create a made-up error output, which otherwise is not provided by the Script component. The second property is SynchronousInputID, which by default contains the ID of the input for the first output. This tells the Data Flow task to add rows from the input buffer to the output buffer of the component. When you add a second or subsequent output to your component, you need to set this ID yourself, as it is not set automatically. If you set the value of SynchronousInputID to None, the output becomes asynchronous, in which case you must add the rows to the output buffer yourself after applying the transformation logic to the input rows. You will use this attribute while adding an asynchronous output to this component in the next part of this Hands-On exercise.

25. With this component, you will be creating two more derived columns to meet the exercise requirements. So, add two output columns in the SynchXfrOutput with the following details:

Column Name     Data Type                  Length
IsBonusPaid     string [DT_STR]            1
BonusAmount     numeric [DT_NUMERIC]

As you won't be connecting to any outside data source, you don't need to add any connection manager in the Connection Managers page.
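To preview where these settings are taking us, the row-level logic for the synchronous output could be sketched roughly as follows. This is an illustrative sketch rather than the exercise's final code: the "Y"/"N" indicator values and the literal title strings are assumptions about the data in Sales.txt, and with the input named XfrInput the generated row-processing method and buffer follow the naming shown here.

Public Overrides Sub XfrInput_ProcessInputRow(ByVal Row As XfrInputBuffer)
    Dim target As Integer

    ' Sales targets from the business rules; the literal titles are assumed
    ' to match the values that appear in Sales.txt.
    Select Case Row.Title.Trim()
        Case "Sales Representative"
            target = 10000
        Case "Sales Manager"
            target = 50000
        Case "Vice President"
            target = 100000
        Case Else
            target = Integer.MaxValue   ' unknown titles never qualify for a bonus
    End Select

    If Row.SalesAmount >= target Then
        Row.IsBonusPaid = "Y"
        ' varBonusMultiplier holds the bonus percentage (2), exposed through ReadOnlyVariables
        Row.BonusAmount = CDec(Row.SalesAmount) * Me.Variables.varBonusMultiplier / 100
    Else
        Row.IsBonusPaid = "N"
        Row.BonusAmount = 0
    End If
End Sub

The aggregated sales and bonus totals required by the second business rule belong to the asynchronous output, which you will add in the next part of this Hands-On exercise.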