
Microsoft SQL Server 2000 Data Transformation Services - P3 (PDF document)


Pages: 50
File size: 778.07 KB

Part I: Getting Started with DTS

Multidimensional Database Management Systems (OLAP)

You can create a multidimensional database schema in a relational database system. There are also database systems that are specifically designed to hold multidimensional data. These systems are typically called OLAP servers. Microsoft Analysis Services is an example of an OLAP server.

The primary unit of data storage in a relational database system is a two-dimensional table. In an OLAP system, the primary unit of storage is a multidimensional cube. Each cell of a cube holds the data for the intersection of a particular value for each of the cube’s dimensions.

The actual data storage for an OLAP system can be in a relational database system. Microsoft Analysis Services gives three data storage options:

• MOLAP (Multidimensional OLAP): Data and calculated aggregations are stored in a multidimensional format.
• ROLAP (Relational OLAP): Data and calculated aggregations are stored in a relational database.
• HOLAP (Hybrid OLAP): Data is stored in a relational database and calculated aggregations are stored in a multidimensional format.

Conclusion

The importance of data transformation will continue to grow in the coming years as the usefulness of data becomes more apparent. DTS is a powerful and flexible tool for meeting your data transformation needs.

The next chapter, “Using DTS to Move Data into a Data Mart,” describes the particular challenge of transforming relational data into a multidimensional structure for business analysis and OLAP. The rest of the book gives you the details of how to use DTS.
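To make the cube idea from the OLAP section above more concrete, here is a minimal Python sketch. It is not from the book and says nothing about how Analysis Services stores data internally; it only pictures a cube as a mapping from one member per dimension to a cell value, built from the rows of a two-dimensional table. The sample rows, the three dimensions (product, store, month), and the helper names `build_cube` and `slice_total` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical rows from a two-dimensional relational table:
# (product, store, month, units_sold). All values are made up.
rows = [
    ("Widget", "Store A", "2000-01", 10),
    ("Widget", "Store A", "2000-01", 5),
    ("Widget", "Store B", "2000-01", 7),
    ("Gadget", "Store A", "2000-02", 3),
]


def build_cube(rows):
    """Sum the measure into one cell per (product, store, month) intersection."""
    cube = defaultdict(int)
    for product, store, month, units in rows:
        cube[(product, store, month)] += units
    return dict(cube)


def slice_total(cube, product=None, store=None, month=None):
    """Total the cells matching the given dimension members (None means all)."""
    wanted = (product, store, month)
    return sum(
        value
        for key, value in cube.items()
        if all(w is None or w == k for w, k in zip(wanted, key))
    )


cube = build_cube(rows)
print(cube[("Widget", "Store A", "2000-01")])  # one cell of the cube
print(slice_total(cube, product="Widget"))     # all cells for one product
```

Each key of `cube` plays the role of a cell address, one member per dimension. Leaving a dimension unspecified in `slice_total` corresponds to viewing the facts across every member of that dimension.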
CHAPTER 4
Using DTS to Move Data into a Data Mart

IN THIS CHAPTER
• Multidimensional Data Modeling
• The Fact Table
• The Dimension Tables
• Loading the Star Schema
• Avoiding Updates to Dimension Tables

With the introduction of OLAP Services in SQL Server 7.0, Microsoft brought OLAP tools to a mass audience. This process continued in SQL Server 2000 with the upgraded OLAP functionality and the new data mining tools in Analysis Services.

One of the most important uses for DTS is to prepare data to be used for OLAP and data mining.

It’s easy to open the Analysis Manager and make a cube from FoodMart 2000, the sample database that is installed with Analysis Services. It’s easy because FoodMart has a star schema design, the logical structure for OLAP.

It’s a lot harder when you have to use the Analysis Manager with data from a typical normalized database. The tables in a relational database present data in a two-dimensional view. These two-dimensional structures must be transformed into multidimensional structures. The star schema is the logical tool to use for this task.

The goal of this chapter is to give you an introduction to multidimensional modeling so that you can use DTS to get your data ready for OLAP and data mining.

NOTE
A full treatment of multidimensional data modeling is beyond the scope of this book. Most of what I wrote about the topic in Microsoft OLAP Unleashed (Sams, 1999) is still relevant. I also recommend The Data Warehouse Lifecycle Toolkit by Ralph Kimball, Laura Reeves, Margy Ross, and Warren Thornthwaite.

Multidimensional Data Modeling

The star schema receives its name from its appearance. It has several tables radiating out from a central core table, as shown in Figure 4.1. The fact table is at the core of the star schema.
This table stores the actual data that is analyzed in OLAP. Here are the kinds of facts you could put in a fact table:

• The total number of items sold
• The dollar amount of the sale
• The profit on the item sold
• The number of times a user clicked on an Internet ad
• The length of time it took to return a record from the database
• The number of minutes taken for an activity
• The account balance
• The number of days the item was on the shelf
• The number of units produced

FIGURE 4.1 The star schema of the Sales cube from the FoodMart 2000 sample database, as shown in the Analysis Manager’s Cube Editor.

The tables at the points of the star are called dimension tables. These tables provide all the different perspectives from which the facts are going to be viewed. Each dimension table will become one or more dimensions in the OLAP cube. Here are some possible dimension tables:

• Time
• Product
• Supplier
• Store Location
• Customer Identity
• Customer Age
• Customer Location
• Customer Demographic
• Household Identity
• Promotion
• Status
• Employee

Differences Between Relational Modeling and Multidimensional Modeling

There are several differences between data modeling as it’s normally applied in relational databases and the special multidimensional data modeling that prepares data for OLAP analysis.

Figure 4.2 shows a database diagram of the sample Northwind database, which has a typical relational normalized schema.

FIGURE 4.2 A typical relational normalized schema: the Northwind sample database.

Figure 4.3 shows a diagram of a database that has a star schema.
This star schema database was created by reorganizing the Northwind database. Both databases contain the same information.

FIGURE 4.3 A typical star schema, created by reorganizing the Northwind database.

Star schema modeling doesn’t follow the normal rules of data modeling. Here are some of the differences:

• Relational models can be very complex. The proper application of the rules of normalization can result in a schema with hundreds of tables that have long chains of relationships between them.
  Star schemas are very simple. In the basic star schema design, there are no chains of relationships. Each of the dimension tables has a direct relationship with the fact table (primary key to foreign key).
• The same data can be modeled in many different ways using relational modeling. Normal data modeling is quite flexible.
  The star schema has a rigid structure. It must be rigid because the tables, relationships, and fields in a star schema all have a particular mapping to the multidimensional structure of an OLAP cube.
• One of the goals of relational modeling is to conform to the rules of normalization. In a normalized database, each data value is stored only once.
  Star schemas are radically denormalized. The dimension tables have a high number of repeated values in their fields.
• Standard relational models are optimized for online transaction processing (OLTP). OLTP needs the ability to efficiently update data. This is provided in a normalized database that has each value stored only once.
  Star schemas are optimized for reporting, OLAP, and data mining. Efficient data retrieval requires a minimum number of joins.
  This is provided with the simple structure of relationships in a star schema, where each dimension table is only a single join away from the fact table.

The rules for multidimensional modeling are different because the goals are different. The goal of standard relational modeling is to provide a database that is optimized for efficient data modification. The goal of multidimensional modeling is to provide a database optimized for data retrieval.

The Fact Table

The fact table is the heart of the star schema. This one table usually accounts for 90% to 99.9% of the space used by the entire star because it holds the records of the individual events that are stored in the star schema.

New records are added to fact tables daily, weekly, or hourly. You might add a new record to the Sales Fact table for each line item of each sale during the previous day.

Fact table records are never updated unless a mistake is being corrected or a schema change is being made. Fact table records are never deleted except when old records are being archived.

A fact table has the following kinds of fields:

• Measures: The fields containing the facts in the fact table. These fields are nearly always numeric.
• Dimension Keys: Foreign keys to each of the dimension tables.
• Source System Identifier: Field that identifies the source system of the record when the fact table is loaded from multiple sources.
• Source System Key: The key value that identifies the fact table record in the source system.
• Data Lineage Fields: One or more fields that identify how and when this record was transformed and loaded into the fact table.

The fact table usually does not have a separate field for a primary key. The primary key is a composite of all the foreign keys.

Choosing the Measures

Some of the fields you choose as measures in your star schema are obvious.
If you want to build a star that examines sales data, you will want to include Sale Price as one of your measures, and this field will probably be evident in your source data.

After you have chosen the obvious measures for your star, you can look for others. Keep the following tips in mind for finding other fields to use as measures:

• Consider other numeric fields in the same table as the measures you have already found.
• Consider numeric fields in related tables.
• Look at combinations of numeric fields that could be used to calculate additional measures.
• Any field can be used to create a counted measure. Use the COUNT aggregate function and a GROUP BY clause in a SQL query.
• Date fields can be used as measures if they are used with MAX or MIN aggregation in your cube. Date fields can also be used to create calculated measures, such as the difference between two dates.
• Consider averages and other calculated values that are non-additive. Include all the values as facts that are needed to calculate these non-additive values.
• Consider including additional values so that semi-additive measures can be turned into calculated measures.

NOTE
I believe that the Source System Identifier and the Source System Key should be considered standard elements in a fact table. These fields make it possible for fact table records to be tied back to source system records. It’s important to do that for auditing purposes. It also makes it possible to use the new drillthrough feature in SQL Server 2000 Analysis Services.
I also believe that a typical fact table should have data lineage fields so that the transformation history of the record can be positively identified.

Choosing the Level of Summarization for the Measures

Measures can be used either with the same level of detail as in the source data or with some degree of summarization. Maintaining the greatest possible level of detail is critical in building a flexible OLAP system. Summarizing data is sometimes necessary to save storage space, but consider all the drawbacks:

• The users will not be able to drill down to the lowest level of the data.
• The connection between the star schema data and the source data is weakened. If one record in the star schema summarizes 15 records in the source data, it is almost impossible to make a direct connection back to those source records.
• The potential to browse from particular dimensions can be lost. If sales totals are aggregated in a star schema for a particular product per day, there will be no possibility of browsing along a customer dimension.
• Merging or joint querying of separate star schemas is much easier if the data is kept at the lowest level of detail. Summarized data is much more likely to lead to independent data marts that cannot be analyzed together.
• The possibilities for data mining are reduced.

Summarizing data in a star schema makes the most sense for historical data. After a few years, the detail level of data often becomes much less frequently used. Old unused data can interfere with efficient access to current data. Move the detailed historical data into an offline storage area, where it’s available for occasional use. Create a summarized form of the historical data for continued online use.

A cube created with summarized historical data can be joined together with cubes based on current data. You join cubes together by creating a virtual cube. As long as two or more cubes have common dimensions, they can be joined together even if they have a different degree of summarization.

The Dimension Tables

By themselves, the facts in a fact table have little value.
The dimension tables provide the variety of perspectives from which the facts become interesting.

Compared to the fact table, the dimension tables are nearly always very small. For example, there could be a Sales data mart with the following numbers of records in the tables:

• Store Dimension: One record for each store in this chain (14 records).
• Promotion Dimension: One record for each different type of promotion (45 records).
• Time Dimension: One record for each day over a two-year period (730 records).
• Employee Dimension: One record for each employee (300 records).
• Product Dimension: One record for each product (31,000 records).
• Customer Dimension: One record for each customer (125,000 records).
• Combined total for all of these dimension records: 157,089 records.
• Sales Fact Table: One record for each line item of each sale over a two-year period (60,000,000 records).

While the fact table always has more records being added to it, the dimension tables are relatively stable. Some of them, like the time dimension, are created and then rarely changed. Others, such as the employee and customer dimensions, are slowly growing.

One of the most important goals of star schema design is to minimize or eliminate the need for updating dimension tables.

Dimension tables have the following kinds of fields:

• Primary Key: The field that uniquely identifies each record and also joins the dimension table to the fact table.
• Level Members: Fields that hold the members for the levels of each of the hierarchies in the dimension.
• Attributes: Fields that contain business information about a record but are not used as levels in a hierarchy.
• Subordinate Dimension Keys: Foreign key fields to the current related record in subordinate dimension tables.
• Source System Identifier: Field that identifies the source system of the dimension record when the dimension table is loaded from multiple sources.
• Source System Key: The key value that identifies the dimension table record in the source system.
• Data Lineage Fields: One or more fields that identify how and when this record was transformed and loaded into the dimension table.

The Primary Key in a Dimension Table

The primary key of a dimension table should be a single field with an integer data type.

TIP
Smallint (2-byte signed) or tinyint (1-byte unsigned) fields are often adequate for the dimension table primary keys. Generally, you will not be concerned about the size of your dimension tables, but using these smaller values can significantly reduce the size of the fact tables, which can become very large. Smaller key fields also make indexes work more efficiently. But don’t use smallint or tinyint unless you are absolutely certain that it will be adequate now and in the future.

[...]

... with SQL Server 2000 for these data sources:

• Microsoft SQL Server
• Microsoft Access 2000
• Microsoft Excel 2000 worksheets
• HTML files
• Text files
• Oracle
• DB2
• dBase 5
• Paradox
• Other databases that have ODBC drivers available

Chapter 5: DTS Connections

Creating DTS Connections

Data connections are created in the DTS Designer with the following procedure:

1. Click or drag one of the data ...
... data mart. I like to change the data in just one step so that everything that happens to the data can be examined by looking at the one step.

You can use the following DTS tasks to load data into a staging area:

• The FTP task to retrieve remote text files
• The Bulk Insert task to load text files into SQL Server
• The Parallel Data Pump task for hierarchical rowsets ...

... strategy. I like to copy data directly into the staging area so that it is as similar to the source data as possible. If there is some question about the data values, I can examine the data in the staging area as if it’s the same as the source data.

• The Execute SQL task when the data is being moved from one relational database to another
• The Transform Data task, for situations ...

... Connections and the Data Transformation Tasks, Part II

This chapter discusses data connections. You can use DTS to connect to a variety of different database systems and file-based data storage systems. These connections are used as the source and the destination of transformations. They are needed for five of the built-in DTS tasks:

• Transform Data task
• Data Driven Query task
• Parallel Data Pump task
• ...

... that loads a data mart.

Loading Data into a Staging Area

A staging area is a set of tables that are used to store data temporarily during a data load. Staging areas are especially useful if the source data is spread out over diverse sources. After you have loaded the staging area, it’s easier to handle the data because it’s all in one place. Data often ...
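The staging approach described in these excerpts (copy the source data into the staging area unchanged, then change it in one step on the way into the data mart) can be sketched as follows. This is an illustration only, under loud assumptions: it uses Python’s built-in sqlite3 module as a stand-in for SQL Server and plain SQL in place of DTS tasks, and every table name, column name, sample row, and cleansing rule below is made up rather than taken from the book.

```python
import sqlite3

# In-memory SQLite database standing in for the source, staging, and mart
# databases. All table and column names here are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source_orders  (order_id INTEGER, customer TEXT, amount TEXT);
    CREATE TABLE staging_orders (order_id INTEGER, customer TEXT, amount TEXT);
    CREATE TABLE mart_sales     (order_id INTEGER, customer TEXT, amount REAL);
""")
source_rows = [(1, "  alice ", "10.50"), (2, "BOB", "7"), (3, "carol", None)]
conn.executemany("INSERT INTO source_orders VALUES (?, ?, ?)", source_rows)

# Step 1: copy the source rows into the staging area unchanged, so the
# staging data can be examined as if it were the source data.
conn.execute("INSERT INTO staging_orders SELECT * FROM source_orders")

# Step 2: do all the cleansing in a single statement as the rows move from
# the staging area into the data mart (trim and upper-case names, convert
# amounts to numbers, skip rows with no amount).
conn.execute("""
    INSERT INTO mart_sales (order_id, customer, amount)
    SELECT order_id, UPPER(TRIM(customer)), CAST(amount AS REAL)
    FROM staging_orders
    WHERE amount IS NOT NULL
""")
conn.commit()

staged = conn.execute(
    "SELECT * FROM staging_orders ORDER BY order_id").fetchall()
loaded = conn.execute(
    "SELECT * FROM mart_sales ORDER BY order_id").fetchall()
print(staged)
print(loaded)
```

Because every change happens in the single INSERT ... SELECT of step 2, a questionable value in `mart_sales` can be traced by reading that one statement and comparing against the untouched rows in `staging_orders`.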
... set without using a Data Link file. The UseDSL property is set to True whenever you select Microsoft Data Link in the Data Source box on the Connection Properties dialog. You can also change whether or not a connection is a Microsoft Data Link by editing this property in Disconnected Edit.

When you specify a value for the UDLPath property, new with the Connection2 object in SQL Server 2000, a persistent ...

... has an integer data type. The ID is assigned to the following properties of other DTS objects to associate the connection with those objects:

• The SourceConnectionID of the Transform Data task, Data Driven Query task, and Parallel Data Pump task
• The DestinationConnectionID of the Transform Data task, Data Driven Query ...

... following times:

• As the data is loaded into the staging area
• In the staging area, with the data being moved from one table to another, or with queries that directly update the data in a table
• As the data is moved from the staging area into the data mart

CAUTION
In complex data cleansing situations, I cleanse the data in steps in the staging area. I prefer to do all the data cleansing as the data is moved from ...

... from the Data Source list. The DTS Designer will drop the connection and re-create it with the new provider, while keeping the connection’s ID and all its other properties the same. You cannot change the ProviderID in Disconnected Edit, with the Dynamic Properties task, or with code.

Some examples of Provider IDs are

• SQLOLEDB: Microsoft OLE DB Provider for SQL Server
• Microsoft.Jet.OLEDB.4.0: Microsoft ...
... the UserID and Password are not needed.

• ...: The name used when making a connection
• Password: The password used when making a connection
• Catalog: The database name
• DataSource: The server name

NOTE
When you’re looking at the Connection Properties dialog, the Data Source is the label of the box that shows the Provider, and the Server box shows the Server. However, when you use the DataSource property in Disconnected ...

Date posted: 26/01/2014, 15:20
