Tài liệu hạn chế xem trước, để xem đầy đủ mời bạn chọn Tải xuống
1
/ 628 trang
THÔNG TIN TÀI LIỆU
Thông tin cơ bản
Định dạng
Số trang
628
Dung lượng
24,31 MB
Nội dung
www.it-ebooks.info For your convenience Apress has placed some of the front matter material after the index. Please use the Bookmarks and Contents at a Glance links to access them. www.it-ebooks.info iv Contents at a Glance About the Authors xvii About the Technical Reviewer xviii Chapter 1: Introducing IntegrationServices 1 Chapter 2: BIDS and SSMS 11 Chapter 3: Hello World—Your First SSIS 2012 Package 43 Chapter 4: Connection Managers 83 Chapter 5: Control Flow Basics 107 Chapter 6: Advanced Control Flow Tasks 163 Chapter 7: Source and Destination Adapters 203 Chapter 8: Data Flow Transformations 245 Chapter 9: Variables, Parameters, and Expressions 325 Chapter 10: Scripting 361 Chapter 11: Events and Error Handling 405 Chapter 12: Data Profiling and Scrubbing 427 Chapter 13: Logging and Auditing 465 Chapter 14: Heterogeneous Sources and Destinations 487 Chapter 15: Data Flow Tuning and Optimization 511 Chapter 16: Parent-Child Design Pattern 525 Chapter 17: Dimensional Data ETL 543 Chapter 18: Building Robust Solutions 561 Chapter 19: Deployment Model 579 Index 605 www.it-ebooks.info C H A P T E R 1 1 Introducing IntegrationServices I’m the glue that holds everything together. —Singer Otis Williams Your business analysts have finished gathering business requirements. The database architect has designed and built a database that can be described only as a work of art. The BI architects are designing their OLAP cubes and the dimensional data marts that feed them. On the other hand, maybe you’re a one-man show and have designed and built everything yourself. Either way, the only piece that’s missing is a tool to bring it all together. Enter SQLServerIntegrationServices (SSIS). Like Otis Williams, cofounder of the Motown group the Temptations, SSIS is the glue that holds it all together. More than that, SSIS is the circulatory system of your data warehousing and BI solutions. SSIS breathes life into your technical solutions by moving data—the lifeblood of your organization—from disparate sources, along well-known paths, and injecting it directly into the heart of your system. Along the way, SSIS can validate, cleanse, manipulate, transform, and enrich your data for maximum effectiveness. In this book, we’ll take you on a tour of SSIS, from building your very first SSIS package to implementing complex multipackage design patterns seamlessly. This chapter introduces you to SSIS and the concepts behind extract, transform, and load (ETL) processes in general. We begin at the beginning, with a brief history of ETL, Microsoft-style. A Brief History of Microsoft ETL Before we dive headfirst into the details of Microsoft’s current-generation ETL processing solution, it’s important to understand just what ETL is. As we have stated, ETL is an acronym for extract, transform, and load, which is a very literal description of modern data manipulation and movement processes. When we talk about ETL, we are specifically talking about (1) extracting data from a source, such as a database or flat files; (2) transforming data, or manipulating and enriching it en route to its destination; and (3) loading data into its destination, often a database. Over the years, business requirements for data processing in nearly any industry you can point to have grown more complex, even as the amount of data that needs to be processed has increased exponentially. Unfortunately, the number of hours in a day has remained fairly constant over the same time period, meaning you’re stuck with the same limited processing window each day to transport and manipulate an ever-growing magnitude of data. ETL solutions have become increasingly sophisticated and robust in response to these increased data processing demands of performance, flexibility, and quality. www.it-ebooks.info CHAPTER 1 INTRODUCING INTEGRATIONSERVICES 2 So we’ll begin our journey into SSIS by looking at how ETL has evolved in the SQLServer world. Up to SQLServer 6.5, the bulk copy program (bcp) was the primary tool for loading data into SQLServer databases. A command-line utility, bcp made loading basic text files into database tables fairly simple. Unfortunately, the flip side of that simplicity was that you could use bcp only to load data from flat files, and you couldn’t perform additional validations or transformations on the data during the load. A common database-to-database ETL scenario with bcp might include extracting data from a database server to a delimited text file, importing the file into a SQLServer database, and finally using T-SQL to perform transformations on the data in the database. The bcp utility is still provided with all versions of SQL Server, and is still used for simple one-off data loads from flat files on occasion. In response to the increasing demands of ETL processing, Data Transformation Services (DTS) made its first appearance in SQLServer 7. With DTS, you could grab data from a variety of sources, transform it on the fly, and load it into the database. Although DTS was a much more sophisticated tool than bcp, it still lacked much of the functionality required to develop enterprise-class ETL solutions. With the release of SQLServer 2005, Microsoft replaced DTS with SQLServerIntegrationServices (SSIS). SSIS is a true enterprise ETL solution with several advancements over its predecessors, including built-in logging; support for a wide variety of complex transformation, data validation, and data cleansing components; separation of process control from data flow; support for several types of data sources and destinations; and the ability to create custom components, to name a few. SSIS in SQLServer 12 represents the first major enhancement to SSIS since its introduction way back in 2005. In this newest release, Microsoft has implemented major improvements in functionality and usability. Some of the new goodness includes the ability to move ETL packages seamlessly between environments, centralized storage and administration of SSIS packages, and a host of usability enhancements. In this book, you’ll explore the core functionality you need to get up and running with SSIS and the advanced functionality you need to implement the most complex ETL processing. ETL: THE LOST YEARS Although bcp is efficient, many developers and DBAs over the years found the need for solutions that can perform more-complex solutions. During the “lost years” of SQLServer ETL, a large number of home- grown ETL applications began to sprout up in shops all over the world. Many of these solutions were very inefficient, featuring hard-coded sources and destinations and inflexible transformations. Even in the 21st century, there are quite a few of these legacy home-brewed ETL applications running at some of the world’s largest corporations. Building and maintaining in-house ETL applications from scratch can be an interesting academic exercise, but it’s terribly inefficient. The extra time and money spent trying to maintain and administer the code base for these applications can take a significant chunk of the resources you could otherwise devote to designing, developing, and building out actual ETL solutions with an enterprise ETL platform. What Can SSIS Do for You? SSIS provides a wide array of out-of-the-box functionality to accomplish common ETL-related tasks. The major tasks you’ll encounter during most ETL processing include the following: www.it-ebooks.info CHAPTER 1 INTRODUCING INTEGRATIONSERVICES 3 Extracting data from a wide variety of sources including flat files, XML, the Internet, Microsoft Excel spreadsheets, and relational and nonrelational databases. If the stock source adapters don’t cover your needs, SSIS’s support for .NET gives you the ability to extract data from literally any data source that you have access to. Validating data according to predefined rules you specify as it moves through your ETL process. You can validate data by using a variety of methods such as ensuring that strings match patterns and that numeric values are within a given range. Performing Data cleansing, or the process of identifying invalid data values and removing them or modifying them to conform to your predefined constraints. Examples include changing negative numbers to zero or removing extra whitespace characters from strings. Deduplicating data, which is the elimination of data records that you consider to be duplicates. For a given process, you may consider entire records that are value- for-value matches to be duplicates; for other processes, you may determine that a value match on a single field (or set of fields), such as Telephone Number, identifies a duplicate record. Loading data into files, databases such as SQL Server, or other destinations. SSIS provides a wide range of stock destination adapters that allow you to output data to several well-defined destinations. As with data extraction, if you have a special destination in mind that’s not supported by the SSIS stock adapters, the built-in .NET support lets you output to nearly any destination you can access. Nearly any process that you can define in terms of ETL steps can be performed with SSIS. And it’s not just limited to databases (though that is our primary focus in this book). As an example, you can use Windows Management Instrumentation (WMI) to retrieve data about a computer system, format it to your liking, and store it in an Excel spreadsheet; or you can grab data from a comma-delimited file, transform it a bit, and write it back out to a new comma-delimited file. Not to put too fine a point on it, but you can perform just about any task that requires data movement and manipulation with SSIS. What Is Enterprise ETL? You’ve seen us refer to SSIS as an enterprise ETL solution in this chapter, and you may have asked yourself, “What is the difference between an enterprise ETL solution and any other ETL solution?” Don’t worry, it’s a common question that we asked once and that has since been asked of us several times. It has a very simple answer: enterprise ETL solutions have the ability to help you meet your nonfunctional requirements in addition to the standard functional requirements of extract, transform, and load. So what is a nonfunctional requirement and what does it have to do with ETL? If you’ve ever been on a development project for an application or business system, you’re probably familiar with the term. In the previous section, we discussed how SSIS helps you meet your ETL functional requirements—those requirements of a system that describe what it does. In the case of ETL, the functional requirements are generally pretty simple: (1) get data from one or more sources, (2) manipulate the data according to some predefined business logic, and (3) store the data somewhere. Nonfunctional requirements, on the other hand, deal more with the qualities of the system. These types of requirements deal in aspects such as robustness, performance and efficiency, maintainability, www.it-ebooks.info CHAPTER 1 INTRODUCING INTEGRATIONSERVICES 4 security, scalability, reliability, and overall quality of an ETL solution. We like to think of nonfunctional requirements as the aspects of the system that do not necessarily have a direct effect on the end result or output of the system; instead they work behind the scenes in support of the result generation. Here are some of the ways SSIS can help you meet your nonfunctional requirements: Robustness is provided in SSIS primarily through built-in error-handling to capture and deal with bad data and execution exceptions as they occur, transactions that ensure consistency of your data should a process enter an unrecoverable processing exception, and checkpoints that allow some ability to restart packages. Performance and efficiency are closely related, but not entirely synonymous, concepts. You can think of performance as the raw speed with which your ETL processes accomplish their tasks. Efficiency digs a bit deeper and includes minimizing resource (memory, CPU, and persistent storage) contention and usage. SSIS has many optimizations baked directly into its data flow components and data flow engine—for instance, to tweak the raw performance and resource efficiency of the data flow. Chapter 14 covers the things you can do to get the most out of the built-in optimizations. Maintainability can be boiled down to the ongoing cost of managing and administering your ETL solution after it’s in production. Maintainability is also one of the easiest items to measure, because you can ask questions such as, “How many hours each month do I have to spend fixing issues in ETL processes?” or “How many hours of manual intervention are required each week to deal with, or to avoid, errors in my ETL process?” SSIS provides a new project deployment model to make it easier to move ETL projects from one environment to the next; and BIDS provides built-in support for source control systems such as Team Foundation Server (TFS) to help minimize the maintenance costs of your solutions. Security is provided in SSIS through a variety of methods and interactions with other systems, including Windows NT File System (NTFS) and SQLServer 12. Package and project deployment to SQLServer is a powerful method of securing your packages. In this case, SQLServer uses its robust security model to control access to, and encryption of, SSIS package contents. Scalability can be defined as how well your ETL solution can handle increasing quantities of data. SSIS provides the ability to scale predictably with your increased demands, providing of course that you create your packages to maximize SSIS’s throughput. We discuss scalable ETL design patterns in Chapter 15. TIP: For in-depth coverage of SSIS design patterns, we highly recommend picking up a copy of SSIS Design Patterns by Andy Leonard, Matt Masson, Tim Mitchell, Jessica Moss, and Michelle Ufford (Apress, 2012). www.it-ebooks.info CHAPTER 1 INTRODUCING INTEGRATIONSERVICES 5 Reliability, put simply, can be defined as how resistant your ETL solution is to failure—and if failure does occur, how well your solution handles the situation. SSIS provides extensive logging capabilities that, when combined with BIDS’s built-in debugging capabilities, can help you quickly track down and fix the root cause of failure. SSIS can also notify you in the event of a failure situation. All these individual nonfunctional requirements, when taken together, help define the overall quality of your ETL solution. Although it’s relatively simple to put together a package or program that shuttles data from point A to point B, the nonfunctional requirements provide a layer on top of this basic functionality that allows you to meet your service-level agreements (SLAs) and other processing requirements. SSIS Architecture One of the major improvements that SSIS introduced over DTS was the separation of the concepts of control flow and data flow. The control flow manages the order of execution for packages and manages communication with support elements such as event handlers. The data flow engine is exposed as a component within the control flow and it provides high-performance data extraction, transformation, and loading services. As you can see in Figure 1-1, the relationship between control flows, data flows, and their respective components is straightforward. Figure 1-1. Relationship between control flow and data flow www.it-ebooks.info CHAPTER 1 INTRODUCING INTEGRATIONSERVICES 6 Simply speaking, a package contains the control flow. The control flow contains control flow tasks and containers, both of which are discussed in detail in Chapter 4. The Data Flow task is a special type of task that contains the data flow. The data flow contains data flow components, which move and manipulate your data. There are three types of data flow components: Sources can pull data from any of a variety of data stores, and feed that data into the data flow. Transformations allow you to manipulate and modify data within the data flow one row at a time. Destinations provide a means for the data flow to output and persist data after it moves through the final stage of the data flow. Although the simplified diagram in Figure 1-1 shows only a single data flow in the control flow, any given control flow can contain many data flows. As the diagram also illustrates, both control flows and data flows are found within the confines of SSIS packages that you can design, build, and test with Microsoft Business Intelligence Development Studio (BIDS). BIDS is a shell of the Visual Studio integrated development environment (IDE) that .NET programmers are familiar with. Figure 1-2 shows the data flow for a very simple SSIS package in the BIDS designer. Figure 1-2. Simple SSIS package data flow in BIDS Since the introduction of SSIS, Microsoft has made significant investments in the infrastructure required to support package execution and enterprise ETL management. In addition to data movement and manipulation, the SSIS infrastructure supports logging, event handling, connection management, and enumeration activities. Figure 1-3 is a simplified pyramid showing the major components of the SSIS infrastructure. www.it-ebooks.info CHAPTER 1 INTRODUCING INTEGRATIONSERVICES 7 NOTE: We introduce BIDS and discuss the new designer features in Chapter 2. Figure 1-3. SSIS architectural components (simplified) At the base of the pyramid lie the command-line utilities, custom applications, the SSIS designer, and wizards such as the import/export wizard that provide interaction with SSIS. These applications and utilities are developed in either managed or unmanaged code. The object model layer exposes interfaces that allow these utilities and applications to interact with the IntegrationServices runtime. The IntegrationServices runtime, in turn, executes packages and provides support for logging, breakpoints and debugging, connection management and configuration, and transaction control. At the very top of the pyramid is the SSIS package itself, which you design and build in the BIDS environment, to contain the control flow and data flows discussed earlier in this chapter. BYE-BYE, DATA TRANSFORMATION SERVICES Back in SQLServer 2005, Microsoft announced the deprecation of Data Transformation Services (DTS). DTS was supported as a legacy application in SQLServer 2005, 2008, and 2008 R2. But because SSIS is a full-fledged enterprise-class replacement for DTS, it should come as no surprise that DTS is no longer supported with this newest release of SQL Server. This means that the Execute DTS 2000 Package task, the DTS runtime and API, and Package Migration Wizard are all going away. Fortunately, the learning curve to convert DTS packages to SSIS is not too steep, and the process is relatively simple in most cases. If you have legacy DTS packages lying around, now’s the time to plan to migrate them to SSIS. Additionally, the ActiveX Script task, which was provided strictly for DTS support, will be removed. Many of the tasks that ActiveX Script tasks were used for can be handled with precedence constraints, while more- complex tasks can be rewritten as Script tasks. We explore precedence constraints and Script tasks in detail in Chapters 4 and 5. www.it-ebooks.info [...]... SSIS:Name="VersionMinor">0 0 2012- 02-14T22:44:55.5341796-05:00 SQL1 2 SQL1 2 0... {bf2a36bf-0b7c-471d-95c7-3ee9a0d74794} 19 www.it-ebooks.info CHAPTER 2 BIDS AND SSMS Integration Services Project1 1 Package 1 0 1 {9181C329-7E44-4B3D-B125-14D94639BF03} 6... BIDS AND SSMS using SQL Server Reporting Services, SSRS projects) Figure 2-1 shows the business intelligence projects that are available within Visual Studio Figure 2-1 Projects available with BIDS For SQLServer 12, the projects require the installation of NET Framework 3.5 Visual Studio solutions can maintain multiple projects that address the different disciplines within the SQLServer suite A few... appropriate relationships within the Analysis Services database Miscellaneousis a category includingall files that serve a support function but are not integral This includes configuration files for IntegrationServices Analysis Services Project The Analysis Services project is used to design, develop, and deploy a SQLServer 12 Analysis Services database More often than not, the purpose of ETL projects... This is one site you want to check out Professional Association for SQLServer (PASS) is a professional organization for all SQLServer developers, DBAs, and BI pros Membership is free, and the benefits include access to the SQL Server Standard magazine Visit PASS at www.sqlpass.org for more information 9 www.it-ebooks.info CHAPTER 1 INTRODUCING INTEGRATION SERVICES We highly recommend visiting these... Project Wizard will automatically generate all the items of an Integration Services project The wizard will ask for the source of an existing project (either the dtproj file or a package deployed on a SQLServer instance) This wizard will import all the objects from the existing project Report Model Project The Report Model project utilizes SQLServer databases to generate reports By definition, a report... design-time-name="Package\Data Flow Task"> . configuration files for Integration Services. Analysis Services Project The Analysis Services project is used to design, develop, and deploy a SQL Server 12 Analysis Services database. More. Professional Association for SQL Server (PASS) is a professional organization for all SQL Server developers, DBAs, and BI pros. Membership is free, and the benefits include access to the SQL. The Integration Services Project Wizard will automatically generate all the items of an Integration Services project. The wizard will ask for the source of an existing project (either the .dtproj