Hands-On Microsoft SQL Server 2008 Integration Services, Part 51

30. Drop an Excel destination just below the Splitting Fuzzy Grouped Duplicates component and join the two components using the green arrow. Select Fuzzy Grouped Matches in the Output field of the Input Output Selection dialog box and click OK. Rename this Excel destination Fuzzy Grouped Duplicates.

31. Double-click the Excel destination to configure it. Select DuplicateOwners Connection in the OLE DB Connection Manager field if it's not already selected.

32. Click the New button next to the "Name of the Excel sheet" field and verify in the CREATE TABLE statement that it is creating the Fuzzy Grouped Duplicates table (that is, a worksheet in Excel); then click OK to accept. Select Fuzzy_Grouped_Duplicates in this field.

33. Go to the Mappings page to create the necessary column mappings automatically. Note that in addition to the _key_in, _key_out, and _score columns, the Fuzzy Grouping transformation has added one _Similarity_ColumnName column for each column that participated in the fuzzy grouping. Click OK to close this editor.

34. Drop an OLE DB destination on the Data Flow surface below the Splitting Fuzzy Grouped Duplicates component and join the component to the OLE DB destination using the available green arrow. Select Canonical Row in the Output field of the Input Output Selection dialog box and click OK.

35. Double-click the OLE DB destination and select the [dbo].[Owner] table in the "Name of the table or the view" field. Go to the Mappings page and verify that the necessary mappings have been created automatically. Click OK to close this editor. Rename the OLE DB destination Owner. Press CTRL-SHIFT-S to save all the files in the project.

Exercise (Execute Removing Duplicates Package)

Finally, execute the package to see how the various transformations remove the duplicate data.

36. You can add data viewers after each transformation to see the records that have been removed from the pipeline. The Review section later explains how you can rerun the package to watch the various transformations work over and over. For the first run, though, just execute the package without any data viewer.

37. Press F5 to execute the package. When the package completes execution, note the number of records after each transformation (see Figure 10-32). Following is an explanation of the execution results:

- The Excel source brings 13 rows into the data flow.
- The Sort transformation removes two exact duplicate records: one for Johnathon Skinner and one for Kathrine Morris.
- The Lookup transformation then matches Johnathon Skinner in the reference table and diverts that record to the Excel destination; the remaining ten rows are passed on to the Fuzzy Lookup transformation.
- The Fuzzy Lookup transformation fuzzy-matches John, Jonothon, and Jonathon with the Johnathon in the reference table and marks all three records with a high similarity and confidence score. This high similarity and confidence score is then used by the Conditional Split transformation to filter out these three records to the Excel destination.
- The remaining seven records are then passed on to the Fuzzy Grouping transformation, which looks for records that are likely duplicates in the data flow and groups them together using the _key_in, _key_out, and _score column values.
- These values are then used by the Conditional Split transformation to filter out the records for Kathy, Kath, and Kathey that are fuzzy-grouped together. The remaining four unique records are sent to the OLE DB destination for loading into the Owner table.

Figure 10-32 Executing the Removing Duplicates package
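To make these grouping columns concrete, here is a hypothetical illustration (the values are invented for explanation and are not the actual output of this package). Fuzzy Grouping assigns every input row a _key_in value and writes into _key_out the _key_in of the row it picks as the canonical representative of that row's group, so a row where _key_in equals _key_out is a canonical row:

    _key_in  _key_out  FirstName  _score
    4        4         Kathy      1.00   <- canonical row (_key_in = _key_out)
    5        4         Kath       0.92
    6        4         Kathey     0.88

The Canonical Row output of the Splitting Fuzzy Grouped Duplicates component is therefore typically driven by a condition such as _key_in == _key_out (SSIS expression syntax), with the non-canonical rows going to the Fuzzy Grouped Matches output.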
Review

You've seen various types of duplicates and the methods for removing them in this package. This package should give you a kick start into real-life de-duplication problems. However, bear in mind that whenever you use the Fuzzy Lookup and Fuzzy Grouping transformations, you need to find the similarity threshold values by running the package on sample data that correctly represents the main data. The more effort you put into finding the similarity threshold value that works with your data, the more accurate your results will be.

If you want to rerun the package, execute the following SQL statement against the Campaign database in SQL Server Management Studio. It refreshes the Owner table so that you can run the package again with the same results.

    USE Campaign;
    DROP TABLE [dbo].[Owner];
    SELECT * INTO [dbo].[Owner] FROM Owner_original;

Summary

Having used lots of Data Flow transformations, you must by now feel a lot more confident and ready for real-life challenges. You have used various transformations to perform functions such as pivoting and sorting data; performing an exact lookup for de-duplication; standardizing data; using fuzzy lookups and fuzzy grouping to eliminate duplicates in a pipeline; aggregating data; and loading a slowly changing dimension table. You've also studied several other preconfigured transformations that are straightforward to use in your packages. Having come so far, you can now create control flow and data flow in your packages to perform workflow and transformation functions, store and manage your SSIS packages, and secure them as well. In the next chapter, you will study how to program and extend Integration Services.

Chapter 11
Programming Integration Services

In This Chapter
- The Two Engines of Integration Services
- Programming Options
- Extending Packages with Scripting
- The Legacy Scripting Task: ActiveX Script Task
- Script Task
- Script Component
- Script Task vs. Script Component
- Summary

By now you have worked with almost all the preconfigured tasks and components that Integration Services provides and can appreciate the power and ease it offers developers for building enterprise-wide solutions. However, businesses do so many different things that it is sometimes not easy, or even possible, to build a solution for every scenario that can exist in an enterprise using the preconfigured tasks and components. SSIS does provide a way to cover even those complex scenarios: you can extend SSIS by writing your own custom code. Not only can you extend SSIS using custom code, Microsoft has also tried to make it easier to extend your custom solution and hence provides various programming levels. SSIS provides a much-enhanced object model that can be programmed easily, with different options to choose from based on the problem you're trying to solve. You can choose to extend your packages using scripting, develop custom components that can be deployed into SSIS and used like preconfigured components, or program your packages from scratch. In this chapter you will learn more about these options and when to choose the one that is most appropriate for your particular scenario.
However, at this point I want to clarify that because programming Integration Services is a vast subject, it is very difficult to cover it completely, or even do it justice, in just one chapter; the subject probably requires a complete book in itself. So, in this chapter scripting SSIS is covered in detail, as I think it is the one area that will interest most readers, while the other options are covered only at an introductory level so that you can choose the best method to extend SSIS. Refer to Books Online for the other programming options.

The Two Engines of Integration Services

As you know, all of your packages contain Control Flow tasks and most of them also contain a Data Flow task, which is a special task with its own components such as sources, destinations, and transformations. You also understand that the work flow and the management of the tasks are designed in the Control Flow pane, while the data movement and the transformations are designed in the Data Flow pane. If you refer back to Figure 1-1, you will notice that the top half of the figure, the object model of the Integration Services run time, includes connection managers, event handlers, and log providers, along with tasks and containers. This represents the Integration Services run-time engine and, as you can see, provides the necessary infrastructure for package execution and management support such as execution order, logging, event handling, connections, breakpoints, and transactions.

The second engine of Integration Services, shown in the lower half of the architecture diagram, manages the data movement and the transformations. The Data Flow task that performs the actual work of data movement and transformation runs under the management of the Data Flow engine, also popularly called the pipeline. When you drop the first Data Flow task on the Control Flow Designer surface, you invoke the data flow engine. Though you will have only one Control Flow within a package, you can include multiple Data Flow tasks in the package, with each Data Flow task able to support multiple data sources, transformations, and destinations.

These two engines of Integration Services provide complete control over the execution of a package and the flexibility to deal with buffer-oriented data movement and transformations in a very efficient way. You may also notice in the architecture diagram that both engines provide scope for building custom objects: custom tasks, custom log providers, and custom connection managers for the run-time engine, and custom data flow components such as custom sources, custom transformations, and custom destinations for the pipeline. In fact, whenever you extend Integration Services programmatically, you will be working with these engines using the classes, methods, and properties they expose. Some of the tasks and components provided in Integration Services are written in managed code, whereas the run-time engine and the data flow engine are written in native code for enhanced performance; both, however, are exposed for development through the managed object model of Integration Services for ease of extension. The run-time engine is represented by the Microsoft.SqlServer.Dts.Runtime namespace, which contains the classes and interfaces to create packages, custom tasks, and other package control flow elements, and the data flow engine is represented by the Microsoft.SqlServer.Dts.Pipeline namespace, which contains the classes used to develop managed data flow components.
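As a quick orientation, these are the namespaces you would typically import at the top of a custom-extension project; a minimal Visual Basic sketch:

    ' Run-time engine: packages, tasks, containers, connection managers
    Imports Microsoft.SqlServer.Dts.Runtime
    ' Data flow engine (pipeline): custom sources, transformations, destinations
    Imports Microsoft.SqlServer.Dts.Pipeline
    Imports Microsoft.SqlServer.Dts.Pipeline.Wrapper   ' pipeline metadata interfaces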
Programming Options

The object model of Integration Services allows you to program any aspect of SSIS; for instance, you can extend the prebuilt functionality, manage the interaction of SSIS with other applications by building interfaces, or have your custom-built application create SSIS packages programmatically. This is possible because Integration Services fully supports the Microsoft .NET Framework and allows you to choose any of the .NET-compliant languages. The SSIS development team has done an excellent job of making SSIS easier to program by extending packages, yet more powerful by enabling the development of custom objects that are developed outside the package but, once deployed into the Integration Services object model, can be included in a package just like the prebuilt objects. You can use Microsoft Visual Studio, or any other development environment, with your preferred .NET-compliant language to write custom code. So, depending on your requirements and your ability to program, you can choose among simple one-off scripting, custom-developed SSIS objects, or building complete packages programmatically. Let's explore these options in detail.

Scripting

As mentioned earlier, when you have a need that can't be met using the prebuilt tasks or components, you'll need to create the required functionality yourself. If the need is a one-off (that is, the particular functionality is not going to be required in other packages) and you are looking for the least development effort, scripting SSIS should be your choice. The code developed using scripting is generally reused only within the development team working on the same project.

Developers who have worked with SQL Server Data Transformation Services (versions 7.0 and 2000) may have used the scripting option already. The ActiveX Script task was the only method to extend Data Transformation Services (DTS), so many developers used it extensively and some actually built quite complex scripting solutions that were deployed throughout the enterprise. It was later realized that the cost of maintaining such solutions is quite high, as the ActiveX Script task was not designed for creating enterprise solutions. Integration Services has overcome this limitation and includes flexible scripting options along with the possibility of developing reusable custom objects that are easy to maintain. Though the ActiveX Script task is still provided in Integration Services, you should refrain from using it for new development work; it is provided only for backward compatibility, as interim support while you migrate your packages to Integration Services.

The two scripting objects, the Script task and the Script component, replace the DTS scripting functionality with a much better and more powerful programming environment, Microsoft Visual Studio Tools for Applications (VSTA). This embedded scripting environment allows you to choose Microsoft Visual Basic 2008 or Microsoft Visual C# 2008 as your preferred scripting language. With the Script task you can create a custom task for use in the Control Flow, and with the Script component you can create a custom component, such as a source, a transformation, or a destination, for use in the Data Flow task.
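To give a flavor of what this looks like, here is a minimal Visual Basic sketch of a Script task's entry point; the variable name User::RowCount is invented for illustration and would have to be listed in the task's ReadOnlyVariables property:

    ' ScriptMain inside the Script task (VSTA generates the surrounding project)
    Public Sub Main()
        ' Read a package variable through the Dts object
        Dim rows As Integer = CType(Dts.Variables("User::RowCount").Value, Integer)
        ' Raise an informational event visible in the Progress tab and logs
        Dim fireAgain As Boolean = True
        Dts.Events.FireInformation(0, "Script Task", _
            "RowCount is " & rows.ToString(), String.Empty, 0, fireAgain)
        ' Report success or failure back to the run-time engine
        Dts.TaskResult = ScriptResults.Success
    End Sub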
When you use the VSTA environment to write scripts for either of these script objects, the scripting environment creates much of the infrastructure code for you and leaves you to focus on writing the code for the required functionality, which makes writing scripts much easier. There are several other benefits to using this powerful IDE, such as extensive debugging and testing of the written code. Due to .NET Framework support, you can use the .NET namespaces, take advantage of the class libraries, and also reference external .NET assemblies quite easily in your scripts. This is a very powerful feature that can save you many man-days of redevelopment effort for assemblies that already exist: you can simply reference the existing assemblies and use the already-developed business rules or functionality within your package.

All this power and ease of use doesn't come free, though the cost is minimal in this case. The code that you write in the Script task or Script component resides in the package and is not available to other packages. If you want to reuse the code, you will have to copy the script to other packages. To explain further: when you deploy a package that has been developed using only the prebuilt objects to a different server, the code for the prebuilt objects is not sent along with the code for your package. The prebuilt objects are available to the package as a compiled binary library within Integration Services, while your custom scripts obviously have not been published in this way and hence are not available for use by other packages. You can copy the code quite easily to script objects in other packages if you need to; however, that will increase the maintenance cost. Think about a script that needs to be modified and has been used in hundreds of packages all over the enterprise; updating it won't be a welcome task. The facility to script yourself out of a requirement should therefore be used carefully, where you know that the requirement is unique and will not be needed in many packages. If that's not the case, you'll be better off with custom-built extensions developed from scratch by deriving from the base classes provided by the Integration Services object model.

Developing Custom Objects from Scratch

If you do not want to use scripting to extend your packages, because the custom code might be used in multiple packages and you don't want the hassle of fixing several packages later on, you can build custom extensions in managed code from scratch. Using the managed object model of Integration Services, you can develop extensions such as control flow tasks, connection managers, log providers, enumerators, data flow sources, data flow transformations, and data flow destinations. To develop a custom object, you inherit from the appropriate base class provided for the functionality and build on that. For example, to develop a control flow task, you inherit from the Task base class, and for a data flow component you inherit from the PipelineComponent base class. The provision of a base class as a starting point makes it much easier to develop custom extensions. Once the object development is complete, you build and deploy the object assembly into the appropriate Integration Services folder and the global assembly cache (GAC). The object can then be added into the Toolbox within Visual Studio and used like any other prebuilt object.
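As an illustration of that starting point, here is a minimal Visual Basic skeleton of a custom control flow task; the class name and attribute values are invented for this sketch, and a real task would also override Validate and expose its own properties:

    Imports Microsoft.SqlServer.Dts.Runtime

    <DtsTask(DisplayName:="My Custom Task", _
             Description:="A hypothetical do-nothing task")> _
    Public Class MyCustomTask
        Inherits Task   ' base class provided by the run-time engine

        Public Overrides Function Execute(ByVal connections As Connections, _
                ByVal variableDispenser As VariableDispenser, _
                ByVal componentEvents As IDTSComponentEvents, _
                ByVal log As IDTSLogging, _
                ByVal transaction As Object) As DTSExecResult
            ' Custom work goes here
            Return DTSExecResult.Success
        End Function
    End Class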
You will need to deploy the custom extensions on all the servers where you want to use them. For example, suppose a developer builds a package using a custom component that has been installed on his computer and wants to share this package with another team member. The new team member cannot use the package until he or she installs the custom component locally, as the component is referenced locally on each computer. The availability of the custom extension in the SSIS designer makes it very easy to reuse. As mentioned earlier, the code for the prebuilt components does not get copied into the package; rather, the package only references the objects. This also applies to custom-built extensions, so you do not need to worry about deploying the custom objects within your packages. The custom-built extensions are not deployed with your packages; they are handled separately, which keeps your package deployments simple. It also means that if you need to make an enhancement or a change to a custom extension, you do not need to modify all your packages; you make the change only in the custom extension, and the packages automatically pick up the changed object at their next run.

Building Packages Programmatically

When you want to work with your packages programmatically, the object model allows you to create, configure, load, and execute packages. You can create packages dynamically, defining the sources, the transformations, the metadata of the selected columns, and the destinations. To illustrate, think of a CRM application that you may want to extend with ETL capabilities so that you can create a reporting data mart. If this CRM application is configured with different metadata for different clients, you can create SSIS packages programmatically, reading the metadata of the deployed application from your application interfaces, and avoid configuring SSIS packages manually for each client. Such extended applications can save you and the customer a lot of time and effort. Depending on your requirements, you can build a grand application that creates packages from scratch, including all the package objects; you can simplify your solution by loading a template package and making the relevant changes to it; or you can simply load and run an existing package.
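The simplest of these, loading and running an existing package, takes only a few lines. Here is a minimal Visual Basic sketch; the file path is invented for this example:

    Imports Microsoft.SqlServer.Dts.Runtime

    Module RunPackage
        Sub Main()
            Dim app As New Application()
            ' Load a package from the file system (no events listener)
            Dim pkg As Package = _
                app.LoadPackage("C:\SSIS\RemovingDuplicates.dtsx", Nothing)
            ' Execute it and inspect the result
            Dim result As DTSExecResult = pkg.Execute()
            Console.WriteLine("Package finished with result: " & result.ToString())
        End Sub
    End Module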
Extending Packages with Scripting

Now that you have an overview of the programming options, let's explore them in detail and try some Hands-On exercises along the way. I would like to clarify my approach in this chapter before we get deep into the exercises: the focus will be on demonstrating how you can implement your code in Integration Services rather than on how to write the code, and to keep things simple and within reason, only Visual Basic 2008 code will be listed.

The Legacy Scripting Task: ActiveX Script Task

If you have used DTS 2000, you might have used the ActiveX Script task to extend your DTS packages. This powerful task was provided in DTS 2000 and helped database developers build packages that otherwise wouldn't have been possible. Many database developers and information analysts exploited this task to customize data transformations; apply business logic in the DTS package; manage files and folders; dynamically set properties on tasks, connections, or global variables; and perform complex computations on the data.

To help smooth migration from DTS 2000 to SSIS, Microsoft provides the ActiveX Script task in SSIS to run those custom-built scripts until they can be upgraded to the more advanced scripting task, simply called the Script task in SSIS. The ActiveX Script task provided in SSIS is quite different in look and feel from the one provided in DTS 2000. Its basic purpose in SSIS is to let you run existing scripts, not develop new ones; in fact, this task will be removed from future releases of Integration Services. It is better not to use this task to develop new scripts, and to opt instead for the more advanced and efficient Script task for new development work. Here are some of the benefits of using the Script task over the ActiveX Script task:

- The Script task uses a much more powerful development environment, Visual Studio Tools for Applications, which provides an integrated development environment (IDE) rich in features such as IntelliSense, color-coded syntax highlighting, line-by-line debugging support, and online help.
- It is easier to develop scripts in the Script task using either Visual Basic 2008 or Visual C# 2008, both of which are fully capable of referencing external .NET assemblies in addition to .NET Framework classes and libraries.
- All the scripts developed in the Script task (and in the Script component) are precompiled and hence yield enhanced performance due to fast execution at run time.

If you have to use this task to run an existing ActiveX script, follow these steps (a sketch of what such a legacy script body looks like appears after the list):

1. Drop the ActiveX Script task on the Designer surface and double-click it to configure it.
2. Specify a Name and a Description for the task in the General page.
3. In the Language field drop-down list on the Script page, choose the scripting language that was used to write the ActiveX script. The default choices available are the VB Script Language and the JScript Language, though the ActiveX Script task can support other scripting languages, depending on the scripting engines installed on the local computer.
4. The Script field provides a simple interface where you can paste or type in your ActiveX script. If you have an ActiveX script saved to a file, you can click Browse and select the file, and your script will be read in by the task and shown in the Script field. Click Save to save the contents of the Script field to a file, and click Parse to parse the script.
5. The EntryMethod field specifies the name of the method that is called from the ActiveX Script task at run time.
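For context, a DTS-era ActiveX script written in VBScript typically defines a Main function (the usual EntryMethod) that returns a DTS result constant. Here is a minimal sketch, shown in legacy VBScript because that is what this task exists to host; the global variable name is invented, and note that the classic DTSGlobalVariables collection is only partially supported when such a script runs inside SSIS:

    ' Legacy VBScript pasted into the Script field
    Function Main()
        ' Hypothetical example: record a status in a global variable
        DTSGlobalVariables("LastRunStatus").Value = "OK"
        Main = DTSTaskExecResult_Success
    End Function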
