Multidimensional Database Management Systems (OLAP)
You can create a multidimensional database schema in a relational database system. There are
also database systems that are specifically designed to hold multidimensional data. These
systems are typically called OLAP servers. Microsoft Analysis Server is an example of an OLAP
server.
The primary unit of data storage in a relational database system is a two-dimensional table. In
an OLAP system, the primary unit of storage is a multidimensional cube. Each cell of a cube
holds the data for the intersection of a particular value for each of the cube’s dimensions.
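As a rough illustration of the cell idea, a cube can be pictured as a mapping from one member per dimension to a value. This is only a sketch: the dimension names, members, and numbers below are hypothetical, and a real OLAP server stores cubes very differently.

```python
# Hypothetical three-dimensional cube: (year, product, store) -> units sold.
# Each key picks exactly one member from each dimension.
cube = {
    (2000, "Bread", "Store 1"): 120,
    (2000, "Bread", "Store 2"): 95,
    (2000, "Milk", "Store 1"): 210,
}


def cell(cube, year, product, store):
    """Return the value at the intersection of one member per dimension."""
    return cube.get((year, product, store))


def product_total(cube, product):
    """Sum every cell that shares one member of the product dimension."""
    return sum(value for (_, p, _), value in cube.items() if p == product)


print(cell(cube, 2000, "Milk", "Store 1"))
print(product_total(cube, "Bread"))
```

A two-dimensional table would need one row per combination; the cube addresses the same value directly by its coordinates.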
The actual data storage for an OLAP system can be in a relational database system. Microsoft
Analysis Services gives three data storage options:
• MOLAP—Multidimensional OLAP. Data and calculated aggregations stored in a
multidimensional format.
• ROLAP—Relational OLAP. Data and calculated aggregations stored in a relational
database.
• HOLAP—Hybrid OLAP. Data stored in a relational database and calculated aggregations
stored in multidimensional format.
Conclusion
The importance of data transformation will continue to grow in the coming years as the
usefulness of data becomes more apparent. DTS is a powerful and flexible tool for meeting your data
transformation needs.
The next chapter, “Using DTS to Move Data into a Data Mart,” describes the particular
challenge of transforming relational data into a multidimensional structure for business analysis
and OLAP. The rest of the book gives you the details of how to use DTS.
Getting Started with DTS
PART I
05 0672320118 CH03 11/13/00 4:57 PM Page 76
CHAPTER 4
Using DTS to Move Data into a Data Mart
IN THIS CHAPTER
• Multidimensional Data Modeling
• The Fact Table
• The Dimension Tables
• Loading the Star Schema
• Avoiding Updates to Dimension Tables
With the introduction of OLAP Services in SQL Server 7.0, Microsoft brought OLAP tools to
a mass audience. This process continued in SQL Server 2000 with the upgraded OLAP
functionality and the new data mining tools in Analysis Services.
One of the most important uses for DTS is to prepare data to be used for OLAP and data
mining.
It’s easy to open the Analysis Manager and make a cube from FoodMart 2000, the sample
database that is installed with Analysis Services. It’s easy because FoodMart has a star schema
design, the logical structure for OLAP.
It’s a lot harder when you have to use the Analysis Manager with data from a typical
normalized database. The tables in a relational database present data in a two-dimensional view. These
two-dimensional structures must be transformed into multidimensional structures. The star
schema is the logical tool to use for this task.
The goal of this chapter is to give you an introduction to multidimensional modeling so that
you can use DTS to get your data ready for OLAP and data mining.
NOTE
A full treatment of multidimensional data modeling is beyond the scope of this book.
Most of what I wrote about the topic in Microsoft OLAP Unleashed (Sams, 1999) is
still relevant. I also recommend The Data Warehouse Lifecycle Toolkit by Ralph
Kimball, Laura Reeves, Margy Ross, and Warren Thornthwaite.
Multidimensional Data Modeling
The star schema receives its name from its appearance. It has several tables radiating out from
a central core table, as shown in Figure 4.1.
The fact table is at the core of the star schema. This table stores the actual data that is analyzed
in OLAP. Here are the kinds of facts you could put in a fact table:
• The total number of items sold
• The dollar amount of the sale
• The profit on the item sold
• The number of times a user clicked on an Internet ad
• The length of time it took to return a record from the database
• The number of minutes taken for an activity
• The account balance
• The number of days the item was on the shelf
• The number of units produced
FIGURE 4.1
The star schema of the Sales cube from the FoodMart 2000 sample database, as shown in the Analysis Manager’s
Cube Editor.
The tables at the points of the star are called dimension tables. These tables provide all the
different perspectives from which the facts are going to be viewed. Each dimension table will
become one or more dimensions in the OLAP cube. Here are some possible dimension tables:
• Time
• Product
• Supplier
• Store Location
• Customer Identity
• Customer Age
• Customer Location
• Customer Demographic
• Household Identity
• Promotion
• Status
• Employee
Differences Between Relational Modeling and Multidimensional Modeling
There are several differences between data modeling as it’s normally applied in relational
databases and the special multidimensional data modeling that prepares data for OLAP analysis.
Figure 4.2 shows a database diagram of the sample Northwind database, which has a typical
relational normalized schema.
FIGURE 4.2
A typical relational normalized schema—the Northwind sample database.
Figure 4.3 shows a diagram of a database that has a star schema. This star schema database
was created by reorganizing the Northwind database. Both databases contain the same
information.
FIGURE 4.3
A typical star schema, created by reorganizing the Northwind database.
Star schema modeling doesn’t follow the normal rules of data modeling. Here are some of the
differences:
• Relational models can be very complex. The proper application of the rules of
normalization can result in a schema with hundreds of tables that have long chains of
relationships between them.
Star schemas are very simple. In the basic star schema design, there are no chains of
relationships. Each of the dimension tables has a direct relationship with the fact table
(primary key to foreign key).
• The same data can be modeled in many different ways using relational modeling.
Normal data modeling is quite flexible.
The star schema has a rigid structure. It must be rigid because the tables, relationships,
and fields in a star schema all have a particular mapping to the multidimensional
structure of an OLAP cube.
• One of the goals of relational modeling is to conform to the rules of normalization. In a
normalized database, each data value is stored only once.
Star schemas are radically denormalized. The dimension tables have a high number of
repeated values in their fields.
• Standard relational models are optimized for online transaction processing (OLTP). OLTP
needs the ability to update data efficiently. This is provided in a normalized database that
has each value stored only once.
Star schemas are optimized for reporting, OLAP, and data mining. Efficient data retrieval
requires a minimum number of joins. This is provided with the simple structure of
relationships in a star schema, where each dimension table is only a single join away from
the fact table.
The rules for multidimensional modeling are different because the goals are different.
The goal of standard relational modeling is to provide a database that is optimized for efficient
data modification. The goal of multidimensional modeling is to provide a database optimized
for data retrieval.
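The "single join away" property can be sketched with a tiny star schema. This is a minimal illustration that uses Python's built-in sqlite3 module in place of SQL Server; the table names, column names, and rows are hypothetical and are not taken from FoodMart or Northwind.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY,
                          product_name TEXT, category TEXT);
CREATE TABLE store_dim   (store_key INTEGER PRIMARY KEY, city TEXT);
-- The fact table points directly at every dimension table.
CREATE TABLE sales_fact (
    product_key INTEGER REFERENCES product_dim(product_key),
    store_key   INTEGER REFERENCES store_dim(store_key),
    units_sold  INTEGER,
    sale_amount REAL
);
INSERT INTO product_dim VALUES (1, 'Bread', 'Bakery'), (2, 'Milk', 'Dairy');
INSERT INTO store_dim   VALUES (1, 'Seattle'), (2, 'Portland');
INSERT INTO sales_fact  VALUES (1, 1, 3, 6.0), (1, 2, 2, 4.0),
                               (2, 1, 1, 1.5), (2, 1, 4, 6.0);
""")

# One join per dimension, no chains of relationships to walk.
rows = conn.execute("""
SELECT p.category, s.city, SUM(f.sale_amount) AS total_amount
FROM sales_fact AS f
JOIN product_dim AS p ON p.product_key = f.product_key
JOIN store_dim   AS s ON s.store_key   = f.store_key
GROUP BY p.category, s.city
ORDER BY p.category, s.city
""").fetchall()

for row in rows:
    print(row)
```

In a normalized schema, the same question might require walking from order details through orders, customers, and lookup tables before reaching the grouping columns.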
The Fact Table
The fact table is the heart of the star schema. This one table usually accounts for 90% to 99.9% of
the space used by the entire star because it holds the records of the individual events that are
stored in the star schema.
New records are added to fact tables daily, weekly, or hourly. You might add a new record to
the Sales Fact table for each line item of each sale during the previous day.
Fact table records are never updated unless a mistake is being corrected or a schema change is
being made. Fact table records are never deleted except when old records are being archived.
A fact table has the following kinds of fields:
• Measures—The fields containing the facts in the fact table. These fields are nearly
always numeric.
• Dimension Keys—Foreign keys to each of the dimension tables.
• Source System Identifier—Field that identifies the source system of the record when the
fact table is loaded from multiple sources.
• Source System Key—The key value that identifies the fact table record in the source
system.
• Data Lineage Fields—One or more fields that identify how and when this record was
transformed and loaded into the fact table.
The fact table usually does not have a separate field for a primary key. The primary key is a
composite of all the foreign keys.
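The kinds of fields listed above might be laid out as follows. This is a sketch only, written against SQLite through Python's sqlite3 module; every table and column name is an assumption made for illustration, and the data lineage is reduced to a single hypothetical load_batch_id field.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE sales_fact (
    time_key          INTEGER NOT NULL,  -- dimension key
    product_key       INTEGER NOT NULL,  -- dimension key
    store_key         INTEGER NOT NULL,  -- dimension key
    units_sold        INTEGER,           -- measure
    sale_amount       REAL,              -- measure
    source_system_id  INTEGER,           -- which source system supplied the row
    source_system_key TEXT,              -- the row's key in that source system
    load_batch_id     INTEGER,           -- data lineage (hypothetical)
    -- No separate key field: the primary key is the composite of the dimension keys.
    PRIMARY KEY (time_key, product_key, store_key)
)
""")


def try_insert(conn, row):
    """Insert one fact row; return False if the composite key already exists."""
    try:
        conn.execute("INSERT INTO sales_fact VALUES (?, ?, ?, ?, ?, ?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False


row = (20000103, 1, 1, 3, 6.0, 1, "INV-1001/1", 42)
first = try_insert(conn, row)
duplicate = try_insert(conn, row)
print(first, duplicate)
```

The second insert is rejected because the same combination of dimension keys is already present, which is what a composite primary key enforces.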
Choosing the Measures
Some of the fields you choose as measures in your star schema are obvious. If you want to
build a star that examines sales data, you will want to include Sale Price as one of your
measures, and this field will probably be evident in your source data.
After you have chosen the obvious measures for your star, you can look for others. Keep the
following tips in mind for finding other fields to use as measures:
• Consider other numeric fields in the same table as the measures you have already found.
• Consider numeric fields in related tables.
• Look at combinations of numeric fields that could be used to calculate additional
measures.
• Any field can be used to create a counted measure. Use the COUNT aggregate function and
a GROUP BY clause in a SQL query.
• Date fields can be used as measures if they are used with MAX or MIN aggregation in your
cube. Date fields can also be used to create calculated measures, such as the difference
between two dates.
• Consider averages and other calculated values that are non-additive. Include as facts all
the values that are needed to calculate these non-additive values.
• Consider including additional values so that semi-additive measures can be turned into
calculated measures.
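The counted-measure and date-measure tips can be sketched in one query. The example below is an illustration under assumptions: it runs on SQLite through Python's sqlite3 module, the orders table and its columns are hypothetical, and julianday is SQLite's date function (SQL Server would use a different date-difference function).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (customer_id INTEGER, order_date TEXT, ship_date TEXT);
INSERT INTO orders VALUES
    (1, '2000-01-02', '2000-01-05'),
    (1, '2000-01-10', '2000-01-11'),
    (2, '2000-01-03', '2000-01-03');
""")

measures = conn.execute("""
SELECT customer_id,
       COUNT(*)        AS order_count,      -- counted measure
       MAX(order_date) AS last_order_date,  -- date field used with MAX
       CAST(SUM(julianday(ship_date) - julianday(order_date)) AS INTEGER)
                       AS days_to_ship      -- measure calculated from two dates
FROM orders
GROUP BY customer_id
ORDER BY customer_id
""").fetchall()

for row in measures:
    print(row)
```

Storing the total days and the count, rather than a precomputed average, keeps both values additive so that an average can still be calculated at any level of aggregation.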
NOTE
I believe that the Source System Identifier and the Source System Key should be
considered standard elements in a fact table. These fields make it possible for fact table
records to be tied back to source system records. It’s important to do that for auditing
purposes. It also makes it possible to use the new drillthrough feature in SQL Server
2000 Analysis Services.
I also believe that a typical fact table should have data lineage fields so that the
transformation history of the record can be positively identified.
Choosing the Level of Summarization for the Measures
Measures can be used either with the same level of detail as in the source data or with some
degree of summarization. Maintaining the greatest possible level of detail is critical in building
a flexible OLAP system. Summarizing data is sometimes necessary to save storage space, but
consider all the drawbacks:
• The users will not be able to drill down to the lowest level of the data.
• The connection between the star schema data and the source data is weakened. If one
record in the star schema summarizes 15 records in the source data, it is almost
impossible to make a direct connection back to those source records.
• The potential to browse from particular dimensions can be lost. If sales totals are
aggregated in a star schema for a particular product per day, there will be no possibility of
browsing along a customer dimension.
• Merging or joint querying of separate star schemas is much easier if the data is kept at
the lowest level of detail. Summarized data is much more likely to lead to independent
data marts that cannot be analyzed together.
• The possibilities for data mining are reduced.
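The loss of a dimension through summarization can be seen in a few lines. This is a hedged sketch with hypothetical rows: the detail data carries a customer, but once it is aggregated per product per day the customer is simply no longer part of any key.

```python
from collections import defaultdict

# Hypothetical detail rows: (day, product, customer, amount).
detail = [
    ("2000-01-03", "Bread", "Alice", 4.0),
    ("2000-01-03", "Bread", "Bob", 2.0),
    ("2000-01-03", "Milk", "Alice", 1.5),
]


def summarize(rows, key_positions):
    """Total the amount (position 3) over the chosen key columns."""
    totals = defaultdict(float)
    for row in rows:
        key = tuple(row[i] for i in key_positions)
        totals[key] += row[3]
    return dict(totals)


# Summarized star: one record per product per day.
by_day_product = summarize(detail, (0, 1))

# Browsing by customer works only while the detail rows are still available.
by_customer = summarize(detail, (2,))

print(by_day_product)
print(by_customer)
```

Nothing in by_day_product can be split back out by customer, which is the drawback the third bullet describes.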
Summarizing data in a star schema makes the most sense for historical data. After a few years,
the detail level of data often becomes much less frequently used. Old unused data can interfere
with efficient access to current data. Move the detailed historical data into an offline storage
area, where it’s available for occasional use. Create a summarized form of the historical data
for continued online use.
A cube created with summarized historical data can be joined together with cubes based on
current data. You join cubes together by creating a virtual cube. As long as two or more cubes
have common dimensions, they can be joined together even if they have a different degree of
summarization.
The Dimension Tables
By themselves, the facts in a fact table have little value. The dimension tables provide the
variety of perspectives from which the facts become interesting.
Compared to the fact table, the dimension tables are nearly always very small. For example,
there could be a Sales data mart with the following numbers of records in the tables:
• Store Dimension—One record for each store in this chain—14 records.
• Promotion Dimension—One record for each different type of promotion—45 records.
• Time Dimension—One record for each day over a two-year period—730 records.
• Employee Dimension—One record for each employee—300 records.
• Product Dimension—One record for each product—31,000 records.
• Customer Dimension—One record for each customer—125,000 records.
• Combined total for all of these dimension records—157,089 records.
• Sales Fact Table—One record for each line item of each sale over a two-year period—
60,000,000 records.
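The totals in this example are easy to check. The short calculation below uses only the record counts given in the list; note that it compares record counts, which is not the same thing as storage space.

```python
# Record counts from the example Sales data mart.
dimension_rows = {
    "Store": 14,
    "Promotion": 45,
    "Time": 730,
    "Employee": 300,
    "Product": 31_000,
    "Customer": 125_000,
}
fact_rows = 60_000_000

dimension_total = sum(dimension_rows.values())
dimension_share = dimension_total / (dimension_total + fact_rows)

print(dimension_total)            # 157089
print(f"{dimension_share:.2%}")   # 0.26%
```

All six dimension tables together hold about a quarter of one percent of the records in the star.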
While the fact table always has more records being added to it, the dimension tables are
relatively stable. Some of them, like the time dimension, are created and then rarely changed.
Others, such as the employee and customer dimension, are slowly growing.
One of the most important goals of star schema design is to minimize or eliminate the need for
updating dimension tables.
Dimension tables have the following kinds of fields:
• Primary Key—The field that uniquely identifies each record and also joins the dimension
table to the fact table.
• Level Members—Fields that hold the members for the levels of each of the hierarchies in
the dimension.
• Attributes—Fields that contain business information about a record but are not used as
levels in a hierarchy.
• Subordinate Dimension Keys—Foreign key fields to the current related record in
subordinate dimension tables.
• Source System Identifier—Field that identifies the source system of the dimension
record when the dimension table is loaded from multiple sources.
• Source System Key—The key value that identifies the dimension table record in the
source system.
• Data Lineage Fields—One or more fields that identify how and when this record was
transformed and loaded into the dimension table.
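A dimension table with these kinds of fields might look like the sketch below. It is an illustration only, using SQLite through Python's sqlite3 module; the store_dim table, its columns, and its rows are hypothetical, and subordinate dimension keys are left out for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE store_dim (
    store_key         INTEGER PRIMARY KEY,  -- primary key, joins to the fact table
    country           TEXT,                 -- level member
    state             TEXT,                 -- level member
    city              TEXT,                 -- level member
    store_name        TEXT,                 -- level member (lowest level)
    floor_area_sqft   INTEGER,              -- attribute, not a hierarchy level
    source_system_id  INTEGER,              -- source system identifier
    source_system_key TEXT,                 -- key of the record in the source system
    load_batch_id     INTEGER               -- data lineage (hypothetical)
);
INSERT INTO store_dim VALUES
    (1, 'USA', 'WA', 'Seattle',  'Store 1', 20000, 1, 'S-001', 7),
    (2, 'USA', 'WA', 'Tacoma',   'Store 2', 15000, 1, 'S-002', 7),
    (3, 'USA', 'OR', 'Portland', 'Store 3', 18000, 1, 'S-003', 7);
""")

# Level members repeat from row to row, so a level can be read with no extra joins.
state_counts = conn.execute(
    "SELECT state, COUNT(*) FROM store_dim GROUP BY state ORDER BY state"
).fetchall()
distinct_countries = conn.execute(
    "SELECT COUNT(DISTINCT country) FROM store_dim"
).fetchone()[0]

print(state_counts)
print(distinct_countries)
```

The repeated country and state values are the denormalization that a star schema accepts in exchange for simpler retrieval.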
The Primary Key in a Dimension Table
The primary key of a dimension table should be a single field with an integer data type.
TIP
Smallint (2-byte signed) or tinyint (1-byte unsigned) fields are often adequate for the
dimension table primary keys. Generally, you will not be concerned about the size of
your dimension tables, but using these smaller values can significantly reduce the size
of the fact tables, which can become very large. Smaller key fields also make indexes
work more efficiently. But don’t use smallint or tinyint unless you are absolutely
certain that it will be adequate now and in the future.
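The effect of key size on the fact table can be estimated with simple arithmetic. The sketch below reuses the record counts from the example Sales data mart and the standard SQL Server integer sizes; it assumes keys are assigned as positive numbers starting at 1, and it looks only at current counts, so it says nothing about future growth.

```python
# (type name, bytes per value, largest positive value)
KEY_TYPES = [
    ("tinyint", 1, 255),
    ("smallint", 2, 32_767),
    ("int", 4, 2_147_483_647),
]


def smallest_key_type(row_count):
    """Return (name, bytes) of the smallest type that can number row_count rows."""
    for name, size, max_value in KEY_TYPES:
        if row_count <= max_value:
            return name, size
    raise ValueError("row count too large for an int key")


dimension_rows = {
    "Store": 14, "Promotion": 45, "Time": 730,
    "Employee": 300, "Product": 31_000, "Customer": 125_000,
}
fact_rows = 60_000_000

small_bytes = sum(smallest_key_type(n)[1] for n in dimension_rows.values())
int_bytes = 4 * len(dimension_rows)
bytes_saved = (int_bytes - small_bytes) * fact_rows

for name, n in dimension_rows.items():
    print(name, smallest_key_type(n)[0])
print(bytes_saved)
```

Under these assumptions the keys take 12 bytes per fact record instead of 24. The Product dimension, at 31,000 records, fits in a smallint only barely, which is exactly the situation the tip warns about.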
[...]
... with SQL Server 2000 for these data sources:
• Microsoft SQL Server
• Microsoft Access 2000
• Microsoft Excel 2000 worksheets
• HTML files
• Text files
• Oracle
• DB2
• Dbase 5
• Paradox
• Other databases that have ODBC drivers available
Creating DTS Connections
Data connections are created in the DTS Designer with the following procedure:
1. Click or drag one of the data...
... data mart. I like to change the data in just one step so that everything that happens to the data
can be examined by looking at the one step.
You can use the following DTS tasks to load data into a staging area:
• The FTP task to retrieve remote text files
• The Bulk Insert task to load text files into SQL Server
• The Parallel Data Pump task for hierarchical rowsets
• The Execute SQL task when the data is being moved from one relational database to another
• The Transform Data task, for situations...
... strategy. I like to copy data directly into the staging area so that it is as similar to the source
data as possible. If there is some question about the data values, I can examine the data in the
staging area as if it’s the same as the source data.
...
This chapter discusses data connections. You can use DTS to connect to a variety of different
database systems and file-based data storage systems. These connections are used as the source
and the destination of transformations. They are needed for five of the built-in DTS tasks:
• Transform Data task
• Data Driven Query task
• Parallel Data Pump task
• ...
... that loads a data mart.
Loading Data into a Staging Area
A staging area is a set of tables that are used to store data temporarily during a data load.
Staging areas are especially useful if the source data is spread out over diverse sources. After
you have loaded the staging area, it’s easier to handle the data because it’s all in one place.
Data often...
... set without using a Data Link file. The UseDSL property is set to True whenever you select
Microsoft Data Link in the Data Source box on the Connection Properties dialog. You can also
change whether or not a connection is a Microsoft Data Link by editing this property in
Disconnected Edit. When you specify a value for the UDLPath property, new with the Connection2
object in SQL Server 2000, a persistent...
... has an integer data type. The ID is assigned to the following properties of other DTS objects
to associate the connection with those objects:
• The SourceConnectionID of the Transform Data task, Data Driven Query task, and Parallel
Data Pump task
• The DestinationConnectionID of the Transform Data task, Data Driven Query...
... following times:
• As the data is loaded into the staging area
• In the staging area, with the data being moved from one table to another, or with queries that
directly update the data in a table
• As the data is moved from the staging area into the data mart
CAUTION
In complex data cleansing situations, I cleanse it in steps in the staging area. I prefer to do all
the data cleansing as the data is moved from...
... from the Data Source list. The DTS Designer will drop the connection and re-create it with the
new provider, while keeping the connection’s ID and all its other properties the same. You cannot
change the ProviderID in Disconnected Edit, with the Dynamic Properties task, or with code.
Some examples of Provider IDs are
• SQLOLEDB, Microsoft OLE DB Provider for SQL Server
• Microsoft.Jet.OLEDB.4.0, Microsoft...
... the UserID and Password are not needed.
• ... name used when making a connection
• Password—The password used when making a connection
• Catalog—The database name
• DataSource—The server name
NOTE
When you’re looking at the Connection Properties dialog, the Data Source is the label of the box
that shows the Provider, and the Server box shows the Server. However, when you use the
DataSource property in Disconnected...