Your Visual Studio solution should now look like Figure 7-7. You need this solution for the next exercise, so leave it open for now.
In this exercise, you created a new SSIS project and added it to an existing solution. Next we configure our SSIS package. We begin by outlining the ETL process using SSIS sequence containers and tasks.
The Anatomy of an SSIS Package
Once we have made and then renamed the package, it is important to become familiar with the SSIS design environment. We have listed the most frequently used components and labeled them with a letter that corresponds to the letters shown in Figure 7-8:
A. Package Designer: A collection of design tabs that visually represent underlying XML code
B. Control Flow tab: A designer surface used to collect SSIS tasks and containers
C. Data Flow tab: A designer surface used to collect SSIS data sources, transformation tasks, and data destinations
D. Parameters tab: A collection of parameters used to configure your package
E. Event Handlers tab: A designer surface used to hold SSIS tasks and containers that are executed when an event occurs
G. Zoom control: A slider control that changes the display magnification
H. Designer surface: The location where various SSIS tasks and containers are placed
I. Connection Managers tray: A list view of SSIS connection objects
J. Properties window: A configuration tool allowing you to set the properties of individual SSIS objects
K. SSIS Toolbox: A list of SSIS tasks and containers
L. Visual Studio Toolbox: Used in previous versions of SSIS to hold tasks and containers
M. Solution Explorer: A list of files and projects
Figure 7-7. The PubsBIProjects Visual Studio solution at the end of Exercise 7-1
Figure 7-8. Components of the SSIS design environment
Let’s take a closer look at a few key items from our list.
The Control Flow Tab
The Control Flow tab lets you configure the basic tasks that make up your ETL process. To work with the Control Flow tab, add sequence containers and control flow tasks from the SSIS Toolbox to the designer surface. The SSIS Toolbox offers many items to choose from, but only a select few are used frequently (Figure 7-9).
The following list describes each of the most frequently used control flow items:
• Annotations: Text descriptions that can be added to your SSIS package to provide additional clarity. (To create an annotation in an SSIS package, right-click anywhere on the designer surface within the Control Flow tab and select Add Annotation from the context menu.)
• Data Flow Task: Handles the importing and exporting of data.
• Execute SQL Task: Allows SQL commands to be sent to a database server.
• Sequence container: Allows tasks to be grouped into a single unit of work.
Note
■ For more information on additional tasks, please see “What’s Next?” for recommended reading at the end of this chapter.
The Data Flow Tab
When working in SSIS, you soon discover that data flow tasks are the most common tasks used on the Control Flow tab.
Figure 7-9. Commonly used control flow items
Data flows are made up of other subtasks. In this way, data flows are similar to control flows in that they both contain various SSIS tasks; however, data flow tasks are specialized for transferring data from one location to another.
A data flow task can be edited either by double-clicking the task while working in the Control Flow tab or by navigating to the Data Flow tab at the top of the designer window.
Figure 7-10 illustrates the three categories of data flow tasks that are used to configure a data flow: Sources, Transformations, and Destinations.
Figure 7-10. Commonly used data flow tasks
In Figure 7-10, we have outlined four transformation tasks that you regularly see in SSIS data flow tasks.
These transformations are found under the Common collection in the SSIS Toolbox. Many other transformations exist, but these are commonly used in ETL processing. These four transformations perform the same actions as the SQL programming statements detailed in Chapter 6, just not as efficiently.
Each data flow needs one or more data sources, zero or more transformations, and one or more destinations.
Both a data source and a data destination are required for data to “flow” from one place to another, but SSIS transformations are not required. Let’s review each of these categories of tasks.
Data Sources
Data sources collect data from a variety of locations but usually pull data from a database table by executing a SQL SELECT statement. These SELECT statements are either generated for you by SSIS or manually added to the data flow source editor, as shown in Figure 7-11. For example, setting the data access mode (shown in Figure 7-11) to Table or View and choosing a table or view by name automatically creates a SQL statement for you. This is a simple way to configure a data source but is usually not the most efficient. Writing your own SQL code, as shown in Figure 7-11, is a much better practice.
Figure 7-11. Using SQL code in a data source
Data Transformations
Data transformations are a group of tasks that perform transformations such as concatenations, lookups, data conversions, sorting, and column aliasing, all of which can be performed using SQL code as well. We recommend using a combination of both SQL programming and SSIS tasks to perform ETL transformations. To do this, place SQL programming code that includes these transformations inside your data source tasks (Figure 7-11). All of the transformations are performed when your SQL query is executed. You are then able to send the transformed data directly to the data destinations.
This process may not always work. If, for example, your data source is pulling directly from a flat file, such as a comma-delimited data file (also known as a .csv file), you cannot run SQL statements directly against the file. Instead, you can use the SSIS data flow transformation tasks to accomplish the same thing (Figure 7-12). This is helpful when your business needs require that you pull data directly from a file.
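To make the idea of placing transformations inside the source query more concrete, here is a minimal sketch. It is not taken from the book's exercises: it uses Python's built-in sqlite3 module as a stand-in for SQL Server, and the table, columns, and sample rows are assumptions chosen for illustration. Note that SQLite concatenates with `||`, whereas SQL Server uses `+`.

```python
import sqlite3

# In-memory SQLite database standing in for the source database (an assumption;
# the chapter's examples target SQL Server).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE authors (au_id TEXT, au_fname TEXT, au_lname TEXT, contract INTEGER)"
)
# Illustrative sample rows only.
conn.executemany(
    "INSERT INTO authors VALUES (?, ?, ?, ?)",
    [
        ("172-32-1176", "Johnson", "White", 1),
        ("213-46-8915", "Marjorie", "Green", 0),
    ],
)

# The source query performs the transformations itself:
# column aliasing, concatenation, a data conversion, and sorting.
SOURCE_QUERY = """
    SELECT
        au_id                                      AS AuthorId,
        au_fname || ' ' || au_lname                AS AuthorName,
        CASE contract WHEN 1 THEN 'Y' ELSE 'N' END AS HasContract
    FROM authors
    ORDER BY au_lname
"""

# These rows are already transformed and could go straight to a destination.
transformed_rows = conn.execute(SOURCE_QUERY).fetchall()
for row in transformed_rows:
    print(row)
```

Because the query returns data in its final shape, no separate transformation step is needed between the source and the destination in this sketch.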
CONSIDER STAGING TABLES
If your data source is a file, you may find it advantageous to load the data from the file into a temporary staging table in the database. You can perform ETL transformations using SQL code from there.
This additional step may seem like extra work, but if there are enough transformations involved, you will find the process faster than performing transformations directly through SSIS. This is because the database engine can process transformations on large datasets quicker than SSIS can. While SSIS is fast, especially compared to its predecessors, it is still not as fast as a dedicated relational database engine for processing data.
Another advantage to this technique is that it is easier to find team members who understand SQL programming statements than it is to find SSIS experts. Using a combination of SQL statements and data flows provides a fast and easy way to include ETL programming code within a Visual Studio solution.
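The staging-table approach can be sketched in two steps: load the file as-is into a staging table, and then let the database transform it with SQL. The example below is only an illustration under stated assumptions: it uses Python's csv and sqlite3 modules in place of an SSIS data flow and SQL Server, and the file contents and the StagingTitles and DimTitles table names are hypothetical.

```python
import csv
import io
import sqlite3

# Stand-in for a comma-delimited source file (hypothetical contents).
CSV_TEXT = (
    "title_id,title,price\n"
    "BU1032, The Busy Executive ,19.99\n"
    "PS2091,Is Anger the Enemy?,10.95\n"
)

conn = sqlite3.connect(":memory:")
# The staging table holds the raw text exactly as it arrives from the file.
conn.execute("CREATE TABLE StagingTitles (title_id TEXT, title TEXT, price TEXT)")
# The destination table holds the cleaned, typed data.
conn.execute(
    "CREATE TABLE DimTitles (TitleId TEXT PRIMARY KEY, TitleName TEXT, Price REAL)"
)

# Step 1: load the file into the staging table without transforming it.
rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))
conn.executemany(
    "INSERT INTO StagingTitles VALUES (:title_id, :title, :price)", rows
)
staged_count = conn.execute("SELECT COUNT(*) FROM StagingTitles").fetchone()[0]

# Step 2: let the database engine perform the transformations with SQL.
conn.execute(
    """
    INSERT INTO DimTitles (TitleId, TitleName, Price)
    SELECT title_id, TRIM(title), CAST(price AS REAL)
    FROM StagingTitles
    """
)

dim_rows = conn.execute(
    "SELECT TitleId, TitleName, Price FROM DimTitles ORDER BY TitleId"
).fetchall()
print(staged_count, dim_rows)
```

In an SSIS package, step 1 would typically be a data flow from the file to the staging table, and step 2 could be the SQL statement of a later data source or Execute SQL task.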
Data Destinations
Data flow destination tasks allow you to place data in files, in tables, or even in an in-memory dataset. There are quite a few data flow destinations that come with SSIS at first install, and you can download others from the Microsoft website. The installed items are listed at the bottom of the SSIS Toolbox (Figure 7-13). If they are not enough to meet your needs, you can create custom destinations.
A data flow destination must have an input from another data flow component, and it must be connected before you start editing it. This is important because the input provides the metadata describing the list of source columns.
Figure 7-12. Using data transformation tasks
Data flow destinations can have an error output containing errors that occur when writing data to the destination data store. Errors can occur for a number of reasons, but one example is attempting to insert duplicate data into a primary key column.
Using Sequence Containers
Another common element of an SSIS package is the sequence container. Its main purpose is to group control flow tasks, but it also offers several helpful features:
• Expand and collapse arrows on the upper-right side of the sequence container allow you to show or hide tasks within it.
• Properties within a sequence container affect the same property setting on the tasks inside the container.
• Disabling a sequence container disables the tasks inside it.
• Configuring a sequence container as part of an SSIS transaction allows you to commit or roll back the tasks inside it as a unit.
Figure 7-13. The data destinations
Although both the Control Flow tab and the Data Flow tab look similar, sequence containers are available only on the Control Flow tab and not on the Data Flow tab. This means you can group control flow tasks only within them, but it is really not much of a limitation since data flow tasks are by their very nature a way of grouping SSIS tasks.
Figure 7-14 shows an example of a typical control flow design. Note that tasks with a common purpose, such as preparing the data warehouse tables to be filled, are grouped together in one sequence container. Other sequence containers may group different actions, such as filling up dimension tables or filling up fact tables.
Figure 7-14. Grouping control flow tasks with the sequence container
In cases where you are using the flush-and-fill technique to fill up your data warehouse tables, you need a task that clears the tables before they are refilled. In Chapter 6, we discussed using the SQL truncate command for this process. You must drop the foreign key constraints before you begin truncation. This is accomplished by adding the code that you created in Chapter 6 to an Execute SQL task.
While the code to drop the foreign key constraints and the code to truncate the tables could be placed in one control flow task, it provides a better visual to create two tasks—one to drop the foreign key constraints and one to do the truncation. Since both of these tasks are part of a single process, it is logical to group them. In SSIS, a sequence container is specifically designed for this job. In Figure 7-14 you can see that we have created two Execute SQL Task items and placed them inside a sequence container called Prepare ETL Process Sequence Container.
Note
■ It is important to give your data flows and your sequence containers unique names that identify their purpose. The naming convention we have chosen to give our SSIS tasks is the task description followed by the task type. For example, to fill up an authors table using a data flow task, we would name the task Fill DimAuthors Data Flow Task. This is a very long name, but it is self-explanatory, and you seldom, if ever, have to type it in code elsewhere.
Using Precedence Constraint Arrows
In some cases, you must perform a particular operation before another can begin. You can connect different control flow tasks using a precedence constraint arrow. These are the arrows shown in Figure 7-14.
Precedence constraint arrows control when a task runs in relationship to another task. When two tasks do not have a control flow arrow between them, the tasks run simultaneously. For example, looking closer, you see that the two data flow tasks in Figure 7-15 do not have a precedence constraint arrow between them. Therefore, they execute at the same time.
Figure 7-15. Control flow tasks without precedence constraint arrows
When you initially drag an SSIS task to the control flow surface, you may not see the precedence constraint arrow, just as you do not see one in Figure 7-15, but one magically appears once you click a task. When the arrow appears, you can click the arrow and then drag it to the other control flow task that you want to connect.
Note
■ Rules, rules, and more rules…! You cannot drag a precedence constraint arrow between tasks inside different sequence containers. Instead, you must connect the containers. Containers can connect to other containers, and containers can connect to individual tasks, but individual tasks can only connect to other individual tasks.
It is possible to create more than one precedence constraint arrow per task. After creating the first precedence constraint arrow, click the initial task again. This causes another arrow to appear that you can then drag to another task, as shown in Figure 7-16.
Notice that Data Flow Task 1 has two precedence constraints attached to it. Once this task completes, control passes to both Data Flow Task 2 and Data Flow Task 3. When either one of these finishes, control passes to Data Flow Task 4, but this can also be configured so that both tasks must complete before control is passed.
Precedence constraints can be configured to allow the execution control to flow between tasks based on success, failure, or completion. This configuration is important because it allows you to apply conditional logic to your design. For example, if we were to add a task that sends us an email when a portion of the package fails, we could do so by adding a Send Mail Task, renaming it, and connecting a precedence constraint arrow from the existing task container to the new send mail task (Figure 7-17). We could then configure it to execute the send mail task only upon the condition that the previous task failed.
Figure 7-16. Control flow tasks can have multiple precedence constraint arrows.
Figure 7-17. Setting the Constraint operation value
Note
■ Although it cannot be seen in the black-and-white images of this book, the precedence constraints are color-coded to indicate their status. The green arrows indicate success, red indicates failure, and blue indicates completion.
Precedence constraint arrows can be configured using the context menu, as shown in Figure 7-17. Or you can right-click the arrow and select Edit from the context menu to access the Precedence Constraint Editor (Figure 7-18).
Figure 7-18. The Precedence Constraint Editor dialog
In the Precedence Constraint Editor, you can configure expression values as well as constraint operation values, such as Success, Completion, and Failure.
Expression values always evaluate to a Boolean true or false, but they can be combined with the constraint values to give you rather elaborate logic. Figure 7-19 shows an example of using both expressions and constraints.
The Evaluation Operation dropdown box allows you to set the precedence constraint to consider the following combinations:
• Constraint
• Expression
• Constraint and Expression
• Constraint or Expression
These combinations let you fine-tune the constraint process. In Figure 7-19, if the expression evaluates the @RowCount variable as equal to zero or the constraint evaluates that the connected task has failed its execution, then at least one of these conditions is met, and SSIS passes control on to the next task for processing.
Figure 7-19 also shows the “Multiple constraints” area containing radio buttons that allow you to configure a constraint to work in conjunction with other constraints on the same task. When a control flow task contains two or more constraints, as in Figure 7-16, you can determine whether just one or all constraints are required to be true before moving to the next task.
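The decision logic described above can be modeled in a few lines of code. The following is only a sketch of the rules as this section describes them, not SSIS's actual implementation; the function names, the Outcome type, and the string values are assumptions made for illustration.

```python
from enum import Enum


class Outcome(Enum):
    """How the preceding task finished."""
    SUCCESS = "Success"
    FAILURE = "Failure"


def constraint_met(constraint_value: str, outcome: Outcome) -> bool:
    """Success and Failure must match the outcome; Completion accepts either."""
    if constraint_value == "Completion":
        return True
    return constraint_value == outcome.value


def precedence_allows(evaluation_operation: str, constraint_value: str,
                      outcome: Outcome, expression_result: bool) -> bool:
    """Combine the constraint and the expression per the evaluation operation."""
    met = constraint_met(constraint_value, outcome)
    if evaluation_operation == "Constraint":
        return met
    if evaluation_operation == "Expression":
        return expression_result
    if evaluation_operation == "Constraint and Expression":
        return met and expression_result
    if evaluation_operation == "Constraint or Expression":
        return met or expression_result
    raise ValueError(f"Unknown evaluation operation: {evaluation_operation}")


def task_may_start(incoming_results: list, require_all: bool = True) -> bool:
    """Model the 'Multiple constraints' choice: all constraints, or just one."""
    return all(incoming_results) if require_all else any(incoming_results)


# Scenario like the one described for Figure 7-19: the preceding task
# succeeded, but the row count is zero, so the expression alone is enough.
row_count = 0
passes = precedence_allows(
    "Constraint or Expression", "Failure", Outcome.SUCCESS, row_count == 0
)
print(passes)
```

With the same inputs, switching the evaluation operation to "Constraint and Expression" would block the next task, because the preceding task did not fail.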
Note
■ You may notice that the expression in Figure 7-19 uses syntax similar to C#. At other times, you may see syntax that is more like SQL or perhaps Visual Basic. This is not C# but rather SSIS’s own expression language.
Microsoft has published updates to this language for each of the last four versions of SQL Server.
You can find more about this language by searching the web using the keywords “SSIS Expression Reference.”
Here is an example page that was available at the time of this writing:
http://msdn.microsoft.com/en-us/library/ms141232.aspx.
SSIS Variables
In Figure 7-19 we use a custom SSIS variable called @RowCount. The intended purpose of this variable is to determine whether there are zero rows of data found in a given table. If this is true, a particular action is performed or avoided.
Figure 7-19. Configuring a precedence constraint to use an expression
An SSIS variable can be used to temporarily hold results or configuration data in random-access memory (RAM). Variables can be added to your package by using the Variables window. This window does not normally appear in a new package but can be opened using the Variables submenu item found in the SSIS menu, as shown in Figure 7-20.
Figure 7-20. Displaying the SSIS Variables window
Note
■ The SSIS menu options change depending on the element of the UI you have as your current focus. If you do not see the Variables menu item, try clicking the Control Flow designer surface and look under the SSIS menu once more.
SSIS has many premade variables available, but when you first open the Variables window, you will not see them. And although we do not use them in this chapter, you can display the premade system variables by clicking the fourth button on the Variables window’s toolbar, circled in Figure 7-21, and checking the “Show system variables” checkbox.
Additional variables can be added by clicking the first button on the Variables window toolbar. The other buttons on the toolbar control features such as deleting a variable or moving variables between scopes.
Note
■ The Variables submenu items do not show under Visual Studio’s SSIS menu if you are not currently focused on the SSIS package. You can change the focus to the package by clicking the package file in Solution Explorer or the package designer window.
When you create a new variable, you must give it a name, define its data type, and optionally set an initial value.
The scope of the variable identifies which tasks have access to the variable. If the variable is scoped at the package level, which is the default, then all the tasks in the package have access to it. If, on the other hand, it is scoped at an individual data flow task, then only that data flow task has access to it.
The scope of the variable defaults to the package, but the variable scope can be changed using the second button of the Variables window toolbar. Clicking this button displays a new dialog that includes a tree view of all the containers and tasks in your SSIS package. By selecting one of the tree-view items and clicking the dialog window’s OK button, you can change the scope of the variable.
Figure 7-21. The SSIS Variables window
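Be aware that the way you refer to a variable depends on where you use it. For example, the same variable is referenced as @RowCount from a Precedence Constraint Editor dialog window, but as User::RowCount from an SSIS Script task. That’s all part of the fun of working with SSIS!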
Outlining Your ETL Process
If you are new to creating SSIS projects, you may be overwhelmed with all the information we have shown you.
Rest assured that many developers feel this way when they first start working with SSIS but soon find that it is not as difficult as it appears.
The best way to keep from being overwhelmed is by outlining the activities you need to perform in your ETL process. You can outline these activities by adding SSIS tasks to the control flow surface, grouping them together with sequence containers, and connecting them with precedence constraints. Don’t forget that you can add annotations to the control flow surface to give more specific information about each task.
Outlining the tasks in this way can be very helpful, and we recommend doing so before configuring each task. You will get a feel for this as you perform the next exercise.