• Product category by region by month
• Product department by region by month
• All products by region by month
• Product category by all stores by month
• Product department by all stores by month
• Product category by territory by quarter
• Product department by territory by quarter
• All products by territory by quarter
• Product category by region by quarter
• Product department by region by quarter
• All products by region by quarter
• Product category by all stores by quarter
• Product department by all stores by quarter
• Product category by territory by year
• Product department by territory by year
• All products by territory by year
• Product category by region by year
• Product department by region by year
• All products by region by year
• Product category by all stores by year
• Product department by all stores by year
• All products by all stores by year
Each of these aggregate fact tables is derived from a single base fact table. The derived aggregate fact tables are joined to one or more derived dimension tables. See Figure 11-15, which shows a derived aggregate fact table connected to a derived dimension table.
Effect of Sparsity on Aggregation. Consider the case of the grocery chain with 300 stores and 40,000 products in each store, of which only 4,000 sell in each store in a day. As discussed earlier, assuming that you keep records for 5 years or 1,825 days, the maximum number of base fact table rows is calculated as follows:
Product = 40,000
Store = 300
Time = 1825
Maximum number of base fact table rows = 22 billion
Because only 4,000 products sell in each store in a day, not all of these 22 billion rows are occupied. Because of this sparsity, only 10% of the rows are occupied. Therefore, the real estimate of the number of base table rows is 2 billion.
Now let us see what happens when you form aggregates. Scrutinize a one-way aggregate: brand totals by store by day. Calculate the maximum number of rows in this one-way aggregate:
Brand = 80
Store = 300
Time = 1825
Maximum number of aggregate table rows = 43,800,000
While creating the one-way aggregate, you will notice that the sparsity for this aggregate is not 10% as in the case of the base table. This is because when you aggregate by brand, more of the brand codes will participate in combinations with store and time codes. The sparsity of the one-way aggregate would be about 50%, resulting in a real estimate of 21,900,000 rows. If the sparsity had remained at the 10% applicable to the base table, the real estimate of the number of rows in the aggregate table would be much less.
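To make the arithmetic concrete, here is a minimal sketch in Python (not from the book; the 10% and 50% sparsity figures are the assumptions stated in the text) that reproduces the row estimates above and checks how many base rows each aggregate row summarizes.

# Row estimates for a base fact table and a one-way aggregate,
# using the sparsity percentages assumed in the text.
def real_rows(*cardinalities, sparsity):
    """Maximum rows = product of dimension cardinalities; real rows = occupied fraction."""
    max_rows = 1
    for c in cardinalities:
        max_rows *= c
    return max_rows, int(max_rows * sparsity)

# Base fact table: product by store by day over 5 years
base_max, base_real = real_rows(40_000, 300, 1_825, sparsity=0.10)

# One-way aggregate: brand totals by store by day (80 brands)
agg_max, agg_real = real_rows(80, 300, 1_825, sparsity=0.50)

print(f"Base table: max {base_max:,} rows, real estimate {base_real:,}")
print(f"One-way aggregate: max {agg_max:,} rows, real estimate {agg_real:,}")

# Rough check of the practitioners' guideline: each aggregate row should
# summarize at least 10 rows of the lower-level table.
print(f"Rows summarized per aggregate row: {base_real / agg_real:.0f}")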
When you go for higher levels of aggregates, the sparsity percentage moves up and may even reach 100%. Because sparsity fails to stay low at higher levels, you are faced with the question of whether aggregates really improve performance that much. Do they reduce the number of rows that dramatically?
Experienced data warehousing practitioners have a suggestion. When you form aggregates, make sure that each aggregate table row summarizes at least 10 rows in the lower-level table. If you increase this to 20 rows or more, even better.
Figure 11-15 Aggregate fact table and derived dimension table (the base SALES FACTS table, with unit sales and sales dollars, joins to the STORE, PRODUCT, and TIME dimensions; the one-way aggregate SALES FACTS table is keyed by category, store, and time and joins to a category-level dimension derived from PRODUCT).
By now you have seen the many different ways in which you may create aggregates. In the real world, the number of dimensions is not just three, but many more. Therefore, the number of possible aggregate tables escalates into the hundreds. Further, from the reference to the failure of sparsity in aggregate tables, you know that the aggregation process does not reduce the number of rows proportionally. In other words, if the sparsity of the base fact table is 10%, the sparsity of the higher-level aggregate tables does not remain at 10%. The sparsity percentage increases more and more as your aggregate tables climb higher and higher in levels of summarization.
Is aggregation that effective after all? What are some of the options? How do you decide what to aggregate? First, set a few goals for aggregation in your data warehouse environment.
Goals for Aggregation Strategy. Apart from the general overall goal of improving data warehouse performance, here are a few specific, practical goals:
• Do not get bogged down with too many aggregates. Remember, you have to create additional derived dimensions as well to support the aggregates.
• Try to cater to a wide range of user groups. In any case, provide for your power users.
• Go for aggregates that do not unduly increase the overall usage of storage. Look carefully into larger aggregates with low sparsity percentages.
• Keep the aggregates hidden from the end-users. That is, the aggregates must be transparent to the end-user query. The query tool must be the one that is aware of the aggregates and directs the queries for proper access.
• Attempt to keep the impact on the data staging process as light as possible.
Practical Suggestions. Before doing any calculations to determine the types of aggregates needed for your data warehouse environment, spend a good deal of time determining the nature of the common queries. How do your users normally report results? What are the reporting levels? By stores? By months? By product categories? Go through the dimensions, one by one, and review the levels of the hierarchies. Check if there are multiple hierarchies within the same dimension. If so, find out which of these multiple hierarchies are more important. In each dimension, ascertain which attributes are used for grouping the fact table metrics. The next step is to determine which of these attributes are used in combinations and what the most common combinations are.
Once you determine the attributes and their possible combinations, look at the number of values for each attribute. For example, in a hotel chain schema, assume that hotel is at the lowest level and city is at the next higher level in the hotel dimension. Let us say there are 25,000 values for hotel and 15,000 values for city. Clearly, there is no big advantage in aggregating by cities. On the other hand, if city has only 500 values, then city is a level at which you may consider aggregation. Examine each attribute in the hierarchies within a dimension. Check the values for each of the attributes. Compare the values of attributes at different levels of the same hierarchy and decide which ones are strong candidates to participate in aggregation.
Develop a list of attributes that are useful candidates for aggregation, then work out the combinations of these attributes to come up with your first set of multiway aggregate fact tables. Determine the derived dimension tables you need to support these aggregate fact tables. Go ahead and implement these aggregate fact tables as the initial set.
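As a rough illustration of this screening step, the following sketch (a hypothetical Python helper, not from the book; the state level and its count are made up) compares the number of distinct values at adjacent levels of a hierarchy and flags levels that summarize enough lower-level values to be worth aggregating. The 10:1 threshold echoes the practitioners' guideline mentioned earlier.

# Screen hierarchy levels as aggregation candidates by comparing cardinalities.
# A level is a candidate only if each of its values summarizes "enough"
# values of the level below it (the threshold is an assumption, per the 10:1 rule).
def aggregation_candidates(hierarchy, min_ratio=10):
    """hierarchy: list of (level_name, distinct_value_count), lowest level first."""
    candidates = []
    for (lower, lower_count), (higher, higher_count) in zip(hierarchy, hierarchy[1:]):
        ratio = lower_count / higher_count
        verdict = "candidate" if ratio >= min_ratio else "not worth aggregating"
        candidates.append((higher, round(ratio, 1), verdict))
    return candidates

# Hotel dimension from the example: 25,000 hotels, two scenarios for city
for city_count in (15_000, 500):
    levels = [("hotel", 25_000), ("city", city_count), ("state", 50)]
    print(f"city has {city_count} values -> {aggregation_candidates(levels)}")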
Bear in mind that aggregation is a performance tuning mechanism. Improved query performance drives the need to summarize, so do not be too concerned if your first set of aggregate tables does not perform perfectly. Your aggregates are meant to be monitored and revised as necessary. The nature of the bulk of the query requests is likely to change. As your users become more adept at using the data warehouse, they will devise new ways of grouping and analyzing data. So what is the practical advice? Do your preparatory work, start with a reasonable set of aggregate tables, and continue to make adjustments as necessary.
FAMILIES OF STARS
The fact tables of the STARS in a family share dimension tables. Usually, the time dimension is shared by most of the fact tables in the group. In the above example, all three fact tables are likely to share the time dimension.
Figure 11-16 Family of STARS (three fact tables sharing a set of dimension tables).
Going the other way, dimension tables from multiple STARS may share the fact table of one STAR.
If you are in a business like banking or telephone services, it makes sense to capture individual transactions as well as snapshots at specific intervals. You may then use families of STARS consisting of transaction and snapshot schemas. If you are in a manufacturing company or a similar production-type enterprise, your company needs to monitor the metrics along the value chain. Some other institutions are like a medical center, where value is added not in a chain but at different stations within the enterprise. For these enterprises, the family of STARS supports the value chain or the value circle. We will get into details in the next few sections.
Snapshot and Transaction Tables
Let us review some basic requirements of a telephone company. A number of individual transactions make up a telephone customer's account. Many of the transactions occur during the hours of 6 a.m. to 10 p.m. of the customer's day. More transactions happen during holidays and weekends for residential customers. Institutional customers use the phones on weekdays rather than over the weekends. A telephone company accumulates a very large collection of rich transaction data that can be used for many types of valuable analysis. The telephone company needs a schema capturing transaction data that supports strategic decision making for expansions, new service improvements, and so on. This transaction schema answers questions such as: how does the revenue of peak hours over weekends and holidays compare with that of peak hours over weekdays?
In addition, the telephone company needs to answer questions from the customers as to account balances. The customer service departments are constantly bombarded with questions on the status of individual customer accounts. At periodic intervals, the accounting department may be interested in the amounts expected to be received by the middle of next month. What are the outstanding balances for which bills will be sent this month-end? For these purposes, the telephone company needs a schema to capture snapshots at periodic intervals. Please see Figure 11-17 showing the snapshot and transaction fact tables for a telephone company. Make a note of the attributes in the two fact tables. One table tracks the individual phone transactions. The other table holds snapshots of individual accounts at specific intervals. Also, notice how dimension tables are shared between the two fact tables.
Snapshot and transaction tables are also common for banks. For example, an ATM transaction table stores individual ATM transactions. This fact table keeps track of individual transaction amounts for the customer accounts. The snapshot table holds the balance for each account at the end of each day. The two tables serve two distinct functions. From the transaction table, you can perform various types of analysis of the ATM transactions. The snapshot table provides total amounts held at periodic intervals, showing the shifting and movement of balances.
Financial data warehouses also require snapshot and transaction tables because of the nature of the analysis in these cases. The first set of questions for these warehouses relates to the transactions affecting given accounts over a certain period of time. The other set of questions centers around balances in individual accounts at specific intervals or totals of groups of accounts at the end of specific periods. The transaction table answers the questions of the first set; the snapshot table handles the questions of the second set.
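To illustrate the two grains, here is a minimal Python sketch (illustrative only; the table layouts, account numbers, and amounts are made up) that records individual ATM transactions and then derives an end-of-day snapshot row per account from them, which is essentially the relationship between the two kinds of fact tables described above.

from collections import defaultdict

# Transaction fact rows: one row per individual ATM transaction (the transaction grain).
transactions = [
    {"date": "2001-03-01", "account": "A1", "amount": -200.00},   # withdrawal
    {"date": "2001-03-01", "account": "A1", "amount": 500.00},    # deposit
    {"date": "2001-03-01", "account": "A2", "amount": -50.00},
]

opening_balances = {"A1": 1_000.00, "A2": 300.00}

# Snapshot fact rows: one row per account per day (the periodic snapshot grain),
# holding the ending balance and a transaction count.
def daily_snapshot(transactions, opening_balances, date):
    totals, counts = defaultdict(float), defaultdict(int)
    for t in transactions:
        if t["date"] == date:
            totals[t["account"]] += t["amount"]
            counts[t["account"]] += 1
    return [
        {"date": date, "account": acct,
         "transaction_count": counts[acct],
         "ending_balance": round(opening + totals[acct], 2)}
        for acct, opening in opening_balances.items()
    ]

print(daily_snapshot(transactions, opening_balances, "2001-03-01"))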
Core and Custom Tables
Consider two types of businesses that are apparently dissimilar. First take the case of a bank. A bank offers a large variety of services, all related to finance in one form or another. Most of the services are different from one another. The checking account service and the savings account service are similar in most ways. But the savings account service does not resemble the credit card service in any way. How do you track these dissimilar services?
Next, consider a manufacturing company producing a number of heterogeneous products. Although a few factors may be common to the various products, by and large the factors differ. What must you do to get information about heterogeneous products?
A different type of family of STARS satisfies the requirements of these companies. In this type of family, all products and services connect to a core fact table and each product or service relates to an individual custom table. In Figure 11-18, you will see the core and custom tables for a bank. Note how the core fact table holds the metrics that are common to all types of accounts. Each custom fact table contains the metrics specific to that line of service. Also note the shared dimensions and notice how the tables form a family of STARS.
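The following sketch (hypothetical Python, with made-up account records and metric names) shows the idea: every account contributes a row to the core fact table with the common metrics, while type-specific metrics go to a custom fact table for that line of service, all sharing the same account key.

# Split each account record into a core fact row (metrics common to all account
# types) plus a custom fact row holding the metrics specific to its type.
def build_core_and_custom(accounts):
    core, custom = [], {"checking": [], "savings": []}
    for a in accounts:
        core.append({"account": a["account"], "type": a["type"],
                     "balance": a["balance"], "fees_charged": a["fees_charged"]})
        if a["type"] == "checking":
            custom["checking"].append({"account": a["account"],
                                       "atm_trans": a["atm_trans"]})
        elif a["type"] == "savings":
            custom["savings"].append({"account": a["account"],
                                      "interest_earned": a["interest_earned"]})
    return core, custom

accounts = [
    {"account": "C100", "type": "checking", "balance": 2_400.0,
     "fees_charged": 12.0, "atm_trans": 7},
    {"account": "S200", "type": "savings", "balance": 9_800.0,
     "fees_charged": 0.0, "interest_earned": 41.5},
]
core_rows, custom_rows = build_core_and_custom(accounts)
print(core_rows)
print(custom_rows)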
Supporting Enterprise Value Chain or Value Circle
In a manufacturing business, a product travels through various steps, starting off as raw materials and ending as finished goods in the warehouse inventory. Usually, the steps include addition of ingredients, assembly of materials, process control, packaging, and shipping to the warehouse. From finished goods inventory, a product moves into shipment to distributor, distributor inventory, distributor shipment, retail inventory, and retail sales.
Figure 11-17 Snapshot and transaction fact tables for a telephone company (a TELEPHONE TRANSACTION FACT TABLE keyed by time, account, transaction, and district with transaction reference, account number, and amount, and a TELEPHONE SNAPSHOT FACT TABLE keyed by time, account, and status with transaction count and ending balance; the TIME, ACCOUNT, DISTRICT, and STATUS dimension tables are shared).
At each step, value is added to the product. Several operational systems support the flow through these steps. The whole flow forms the supply chain or the value chain. Similarly, in an insurance company, the value chain may include a number of steps from sales of insurance through issuance of policy and then finally claims processing. In this case, the value chain relates to the service.
If you are in one of these businesses, you need to track important metrics at different steps along the value chain. You create STAR schemas for the significant steps and the complete set of related schemas forms a family of STARS. You define a fact table and a set of corresponding dimensions for each important step in the chain. If your company has multiple value chains, then you have to support each chain with a separate family of STARS.
A supply chain or a value chain runs in a linear fashion, beginning with a certain step and ending at another step, with many steps in between. Again, at each step, value is added. In some other kinds of businesses where value gets added to services, similar linear movements do not exist. For example, consider a health care institution where value gets added to patient service from different units, almost as if they form a circle around the service. We perceive a value circle in such organizations. The value circle of a large health maintenance organization may include hospitals, clinics, doctors' offices, pharmacies, laboratories, government agencies, and insurance companies. Each of these units either provides patient treatments or measures patient treatments. Patient treatment by each unit may be measured in different metrics. But most of the units would analyze the metrics
Figure 11-18 Core and custom tables (a BANK CORE FACT TABLE keyed by time, account, branch, and household with balance, fees charged, and transactions, plus SAVINGS and CHECKING CUSTOM FACT TABLES with type-specific metrics such as deposits, withdrawals, and interest earned, sharing dimensions such as BRANCH, HOUSEHOLD, and ACCOUNT).
using the same set of conformed dimensions such as time, patient, health care provider, treatment, diagnosis, and payer. For a value circle, the family of STARS comprises multiple fact tables and a set of conformed dimensions.
Conforming Dimensions
The order and shipment fact tables share the conformed dimensions of product, date, customer, and salesperson. A conformed dimension is a comprehensive combination of attributes from the source systems after resolving all discrepancies and conflicts. For example, a conformed product dimension must truly reflect the master product list of the enterprise and must include all possible hierarchies. Each attribute must be of the correct data type and must have proper lengths and constraints.
Conforming dimensions is a basic requirement in a data warehouse. Pay special attention and take the necessary steps to conform all your dimensions. This is a major responsibility of the project team. Conformed dimensions allow rollups across data marts. User interfaces will be consistent irrespective of the type of query. Result sets of queries will be consistent across data marts. Of course, a single conformed dimension can be used against multiple fact tables.
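A minimal sketch of what "resolving discrepancies" can look like in practice (hypothetical Python; the source records, attribute names, and reconciliation rules are invented for illustration): two source systems describe the same product differently, and the conformed dimension row standardizes the name, category code, and data types before it is shared by the fact tables.

# Build one conformed product dimension row from two source-system records,
# resolving naming and coding discrepancies (rules here are illustrative).
CATEGORY_CODES = {"BEV": "Beverages", "02": "Beverages"}  # map both source codings

def conform_product(order_entry_rec, shipment_rec):
    return {
        "product_key": order_entry_rec["sku"],                     # surrogate key source
        "product_name": order_entry_rec["name"].strip().title(),   # one naming standard
        "category": CATEGORY_CODES[shipment_rec["cat_code"]],      # decode cryptic code
        "package_size_oz": float(order_entry_rec["size_oz"]),      # enforce data type
    }

conformed = conform_product(
    {"sku": "P-1001", "name": "  SPARKLING WATER 12OZ ", "size_oz": "12"},
    {"item_no": "1001", "cat_code": "02"},
)
print(conformed)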
Standardizing Facts
In addition to the task of conforming dimensions is the requirement to standardize facts. We have seen that fact tables work across STAR schemas. Review the following issues relating to the standardization of fact table attributes:
• Ensure same definitions and terminology across data marts.
• Resolve homonyms and synonyms.
• Types of facts to be standardized include revenue, price, cost, and profit margin.
• Guarantee that the same algorithms are used for any derived units in each fact table (a small sketch after this list illustrates this point).
• Make sure each fact uses the right unit of measurement.
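For instance, a derived fact such as profit margin should be computed by one shared routine rather than re-coded in every load job; the sketch below (hypothetical Python with invented column names, not from the book) is one way to enforce that.

# One shared algorithm for a derived fact, used by every fact-table load job,
# so "profit margin" means the same thing in every data mart.
def profit_margin(revenue, cost):
    """Margin as a fraction of revenue; defined as 0.0 when revenue is zero."""
    return 0.0 if revenue == 0 else round((revenue - cost) / revenue, 4)

# Both the orders mart and the shipments mart call the same function.
order_fact = {"revenue": 1_250.00, "cost": 940.00}
order_fact["profit_margin"] = profit_margin(order_fact["revenue"], order_fact["cost"])
print(order_fact["profit_margin"])   # 0.248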
Summary of Family of STARS
Let us end our discussion of the family of STARS with a comprehensive diagram showing a set of standardized fact tables and conformed dimension tables. Study Figure 11-20 carefully.
Figure 11-20 A comprehensive family of STARS (a BANK CORE TABLE and a CHECKING CUSTOM TABLE together with 1-WAY and 2-WAY AGGREGATE tables, all sharing conformed dimensions such as BRANCH, ACCOUNT, DATE, MONTH, and STATE).
Note the aggregate fact tables and the corresponding derived dimension tables. What types of aggregates are these? One-way or two-way? Which are the base fact tables? Notice the shared dimensions. Are these conformed dimensions? See how the various fact tables and the dimension tables are related.
CHAPTER SUMMARY
• Slowly changing dimensions may be classified into three different types based on the nature of the changes. Type 1 relates to corrections, Type 2 to preservation of history, and Type 3 to soft revisions. Applying each type of revision to the data warehouse is different.
• Large dimension tables such as customer or product need special considerations for applying optimizing techniques.
• "Snowflaking," or creating a snowflake schema, is a method of normalizing the STAR schema. Although some conditions justify the snowflake schema, it is generally not recommended.
• Miscellaneous flags and textual data are thrown together in one table called a junk dimension table.
• Aggregate or summary tables improve performance. Formulate a strategy for building aggregate tables.
• A set of related STAR schemas makes up a family of STARS. Examples are snapshot and transaction tables, core and custom tables, and tables supporting a value chain or a value circle. A family of STARS relies on conformed dimension tables and standardized fact tables.
REVIEW QUESTIONS
1. Describe slowly changing dimensions. What are the three types? Explain each type very briefly.
2. Compare and contrast Type 2 and Type 3 slowly changing dimensions.
3. Can you treat rapidly changing dimensions in the same way as Type 2 slowly changing dimensions? Discuss.
4. What are junk dimensions? Are they necessary in a data warehouse?
5. How does a snowflake schema differ from a STAR schema? Name two advantages and two disadvantages of the snowflake schema.
6. Differentiate between slowly and rapidly changing dimensions.
7. What are aggregate fact tables? Why are they needed? Give an example.
8. Describe with examples snapshot and transaction fact tables. How are they related?
9. Give an example of a value circle. Explain how a family of STARS can support a value circle.
10. What is meant by conforming the dimensions? Why is this important in a data warehouse?
EXERCISES
1. Indicate if true or false:
   A. Type 1 changes for slowly changing dimensions relate to correction of errors.
   B. To apply Type 3 changes of slowly changing dimensions, overwrite the attribute value in the dimension table row with the new value.
   C. Large dimensions usually have multiple hierarchies.
   D. The STAR schema is a normalized version of the snowflake schema.
   E. Aggregates are precalculated summaries.
   F. The percentage of sparsity of the base table tends to be higher than that of aggregate tables.
   G. The fact tables of the STARS in a family share dimension tables.
   H. Core and custom fact tables are useful for companies with several lines of service.
   I. Conforming dimensions is not absolutely necessary in a data warehouse.
   J. A value circle usually needs a family of STARS to support the business.
2. Assume you are in the insurance business. Find two examples of Type 2 slowly changing dimensions in that business. As an analyst on the project, write the specifications for applying the Type 2 changes to the data warehouse with regard to the two examples.
3. You are the data design specialist on the data warehouse project team for a retail company. Design a STAR schema to track the sales units and sales dollars with three dimension tables. Explain how you will decide to select and build four two-way aggregates.
4. As the data designer for an international bank, consider the possible types of snapshot and transaction tables. Complete the design with one set of snapshot and transaction tables.
5. For a manufacturing company, design a family of three STARS to support the value chain.
CHAPTER OBJECTIVES
• Examine the data extraction function, its challenges, its techniques, and learn how to evaluate and apply the techniques
• Discuss the wide range of tasks and types of the data transformation function
• Understand the meaning of data integration and consolidation
• Perceive the importance of the data load function and probe the major methods for applying data to the warehouse
• Gain a true insight into why ETL is crucial, time-consuming, and arduous
You may be convinced that the data in your organization's operational systems is totally inadequate for providing information for strategic decision making. As information technology professionals, we are fully aware of the futile attempts in the past two decades to provide strategic information from operational systems. These attempts did not work. Data warehousing can fulfill that pressing need for strategic information.
Mostly, the information contained in a warehouse flows from the same operational systems that could not be directly used to provide strategic information. What constitutes the difference between the data in the source operational systems and the information in the data warehouse? It is the set of functions that fall under the broad group of data extraction, transformation, and loading (ETL).
sys-ETL functions reshape the relevant data from the source systems into useful tion to be stored in the data warehouse Without these functions, there would be no strate-
informa-257
If the source data is not extracted correctly, cleansed, and integrated in the proper formats, query processing, the backbone of the data warehouse, could not happen.
In Chapter 2, when we discussed the building blocks of the data warehouse, we briefly looked at ETL functions as part of the data staging area. In Chapter 6 we revisited ETL functions and examined how the business requirements drive these functions as well. Further, in Chapter 8, we explored the hardware and software infrastructure options to support the data movement functions. Why, then, is additional review of ETL necessary?
ETL functions form the prerequisites for the data warehouse information content. ETL functions rightly deserve more consideration and discussion. In this chapter, we will delve deeper into issues relating to ETL functions. We will review many significant activities within ETL. In the next chapter, we need to continue the discussion by studying another important function that falls within the overall purview of ETL: data quality. Now, let us begin with a general overview of ETL.
ETL OVERVIEW
If you recall our discussion of the functions and services of the technical architecture of the data warehouse, you will see that we divided the environment into three functional areas. These areas are data acquisition, data storage, and information delivery. Data extraction, transformation, and loading encompass the areas of data acquisition and data storage. These are back-end processes that cover the extraction of data from the source systems. Next, they include all the functions and procedures for changing the source data into the exact formats and structures appropriate for storage in the data warehouse database. After the transformation of the data, these processes consist of all the functions for physically moving the data into the data warehouse repository.
Data extraction, of course, precedes all other functions. But what is the scope and extent of the data you will extract from the source systems? Do you not think that the users of your data warehouse are interested in all of the operational data for some type of query or analysis? So, why not extract all of the operational data and dump it into the data warehouse? This seems to be a straightforward approach. Nevertheless, data extraction should be driven by the user requirements. Your requirements definition should guide you as to what data you need to extract and from which source systems. Avoid creating a data junkhouse by dumping all the available data from the source systems and waiting to see what the users will do with it. Data extraction presupposes a selection process. Select the needed data based on the user requirements.
The extent and complexity of the back-end processes differ from one data warehouse to another. If your enterprise is supported by a large number of operational systems running on several computing platforms, the back-end processes in your case would be extensive and possibly complex as well. So, in your situation, data extraction becomes quite challenging. The data transformation and data loading functions may also be equally difficult. Moreover, if the quality of the source data is below standard, this condition further aggravates the back-end processes. In addition to these challenges, if only a few of the loading methods are feasible for your situation, then data loading could also be difficult. Let us get into specifics about the nature of the ETL functions.
Most Important and Most Challenging
Each of the ETL functions fulfills a significant purpose. When you want to convert data from the source systems into information stored in the data warehouse, each of these functions is essential. For changing data into information you first need to capture the data. After you capture the data, you cannot simply dump that data into the data warehouse and call it strategic information. You have to subject the extracted data to all manner of transformations so that the data will be fit to be converted into information. Once you have transformed the data, it is still not useful to the end-users until it is moved to the data warehouse repository. Data loading is an essential function. You must perform all three functions of ETL for successfully transforming data into information.
Take as an example an analysis your user wants to perform. The user wants to compare and analyze sales by store, by product, and by month. The sales figures are available in the several sales applications in your company. Also, you have a product master file. Further, each sales transaction refers to a specific store. All these are pieces of data in the source operational systems. For doing the analysis, you have to provide information about the sales in the data warehouse database. You have to provide the sales units and dollars in a fact table, the products in a product dimension table, the stores in a store dimension table, and the months in a time dimension table. How do you do this? Extract the data from each of the operational systems, reconcile the variations in data representations among the source systems, and transform all the sales of all the products. Then load the sales into the fact and dimension tables. Now, after completion of these three functions, the extracted data is sitting in the data warehouse, transformed into information, ready for analysis. Notice that it is important for each function to be performed, and performed in sequence.
ETL functions are challenging primarily because of the nature of the source systems. Most of the challenges in ETL arise from the disparities among the source operational systems. Please review the following list of reasons for the types of difficulties in ETL functions. Consider each carefully and relate it to your environment so that you may find proper resolutions.
• Source systems are very diverse and disparate.
• There is usually a need to deal with source systems on multiple platforms and different operating systems.
• Many source systems are older legacy applications running on obsolete database technologies.
• Generally, historical data on changes in values are not preserved in source operational systems. Historical information is critical in a data warehouse.
• Quality of data is dubious in many old source systems that have evolved over time.
• Source system structures keep changing over time because of new business conditions. ETL functions must also be modified accordingly.
• Gross lack of consistency among source systems is commonly prevalent. The same data is likely to be represented differently in the various source systems. For example, data on salary may be represented as monthly salary, weekly salary, and bimonthly salary in different source payroll systems.
• Even when inconsistent data is detected among disparate source systems, lack of a means for resolving mismatches escalates the problem of inconsistency.
• Most source systems do not represent data in types or formats that are meaningful to the users. Many representations are cryptic and ambiguous.
Time-Consuming and Arduous
When the project team designs the ETL functions, tests the various processes, and deploys them, you will find that these consume a very high percentage of the total project effort. It is not uncommon for a project team to spend as much as 50–70% of the project effort on ETL functions. You have already noted several factors that add to the complexity of the ETL functions.
Data extraction itself can be quite involved depending on the nature and complexity of the source systems. The metadata on the source systems must contain information on every database and every data structure that is needed from the source systems. You need very detailed information, including database size and volatility of the data. You have to know the time window during each day when you can extract data without impacting the usage of the operational systems. You also need to determine the mechanism for capturing the changes to data in each of the relevant source systems. These are strenuous and time-consuming activities.
Activities within the data transformation function can run the gamut of transformation methods. You have to reformat internal data structures, resequence data, apply various forms of conversion techniques, supply default values wherever values are missing, and you must design the whole set of aggregates that are needed for performance improvement. In many cases, you need to convert from EBCDIC to ASCII formats.
Now turn your attention to the data loading function. The sheer massive size of the initial loading can populate millions of rows in the data warehouse database. Creating and managing load images for such large numbers are not easy tasks. Even more difficult is the task of testing and applying the load images to actually populate the physical files in the data warehouse. Sometimes, it may take two or more weeks to complete the initial physical loading.
With regard to extracting and applying the ongoing incremental changes, there are several difficulties. Finding the proper extraction method for individual source datasets can be arduous. Once you settle on the extraction method, finding a time window to apply the changes to the data warehouse can be tricky if your data warehouse cannot suffer long downtimes.
ETL Requirements and Steps
Before we highlight some key issues relating to ETL, let us review the functional steps. For initial bulk refresh as well as for the incremental data loads, the sequence is simply as noted here: triggering for incremental changes, filtering for refreshes and incremental loads, data extraction, transformation, integration, cleansing, and applying to the data warehouse database.
What are the major steps in the ETL process? Please look at the list shown in Figure 12-1. Each of these major steps breaks down into a set of activities and tasks. Use this figure as a guide to come up with a list of steps for the ETL process of your data warehouse. The following list enumerates the types of activities and tasks that compose the ETL process. This list is by no means complete for every data warehouse, but it gives a good insight into what is involved to complete the ETL process.
• Combine several source data structures into a single row in the target database of the data warehouse.
• Split one source data structure into several structures to go into several rows of the target database.
• Read data from data dictionaries and catalogs of source systems.
• Read data from a variety of file structures including flat files, indexed files (VSAM), and legacy system databases (hierarchical/network).
• Load details for populating atomic fact tables.
• Aggregate for populating aggregate or summary fact tables.
• Transform data from one format in the source platform to another format in the target platform.
• Derive target values for input fields (example: age from date of birth); a small sketch after this list illustrates this and the next item.
• Change cryptic values to values meaningful to the users (example: 1 and 2 to male and female).
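Here is a minimal sketch (illustrative Python; field names and code values are invented) of two of the listed transformations: deriving a target value from an input field and decoding a cryptic value.

from datetime import date

# Decode a cryptic source code into a value meaningful to the users.
GENDER_CODES = {"1": "Male", "2": "Female"}

def derive_age(date_of_birth, as_of=None):
    """Derive a target value (age) from an input field (date of birth)."""
    as_of = as_of or date.today()
    years = as_of.year - date_of_birth.year
    # Subtract one year if the birthday has not yet occurred this year.
    if (as_of.month, as_of.day) < (date_of_birth.month, date_of_birth.day):
        years -= 1
    return years

source_row = {"cust_id": "C-42", "dob": date(1960, 7, 15), "gender_cd": "2"}
target_row = {
    "customer_key": source_row["cust_id"],
    "age": derive_age(source_row["dob"], as_of=date(2001, 3, 1)),
    "gender": GENDER_CODES.get(source_row["gender_cd"], "Unknown"),
}
print(target_row)   # {'customer_key': 'C-42', 'age': 40, 'gender': 'Female'}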
Key Factors
Before we move on, let us point out a couple of key factors. The first relates to the complexity of the data extraction and transformation functions. The second is about the data loading function.
Remember that the primary reason for the complexity of the data extraction and transformation functions is the tremendous diversity of the source systems. In a large enterprise, we could have a bewildering combination of computing platforms, operating systems, database management systems, network protocols, and source legacy systems. You need to pay special attention to the various sources and begin with a complete inventory of the source systems. With this inventory as a starting point, work out all the details of data extraction.
Figure 12-1 Major steps in the ETL process:
• Determine all the target data needed in the data warehouse.
• Determine all the data sources, both internal and external.
• Prepare data mapping for target data elements from sources.
• Establish comprehensive data extraction rules.
• Determine data transformation and cleansing rules.
• Plan for aggregate tables.
• Organize data staging area and test tools.
• Write procedures for all data loads.
• ETL for dimension tables.
• ETL for fact tables.
The difficulties encountered in the data transformation function also relate to the heterogeneity of the source systems.
Now, turning your attention to the data loading function, you have a couple of issues to be careful about. Usually, the mass refreshes, whether for the initial load or for periodic refreshes, cause difficulties, not so much because of complexities, but because these load jobs run too long. You will have to find the proper time to schedule these full refreshes. Incremental loads have some other types of difficulties. First, you have to determine the best method to capture the ongoing changes from each source system. Next, you have to execute the capture without impacting the source systems. After that, at the other end, you have to schedule the incremental loads without impacting the usage of the data warehouse by the users.
Pay special attention to these key issues while designing the ETL functions for your data warehouse. Now let us take each of the three ETL functions, one by one, and study the details.
DATA EXTRACTION
As an IT professional, you must have participated in data extractions and conversions when implementing operational systems. When you went from a VSAM file-oriented order entry system to a new order processing system using relational database technology, you may have written data extraction programs to capture data from the VSAM files to get the data ready for populating the relational database.
Two major factors differentiate the data extraction for a new operational system from the data extraction for a data warehouse. First, for a data warehouse, you have to extract data from many disparate sources. Next, for a data warehouse, you have to extract data on the changes for ongoing incremental loads as well as for a one-time initial full load. For operational systems, all you need is one-time extractions and data conversions.
These two factors increase the complexity of data extraction for a data warehouse and, therefore, warrant the use of third-party data extraction tools in addition to in-house programs or scripts. Third-party tools are generally more expensive than in-house programs, but they record their own metadata. On the other hand, in-house programs increase the cost of maintenance and are hard to maintain as source systems change. If your company is in an industry where frequent changes to business conditions are the norm, then you may want to minimize the use of in-house programs. Third-party tools usually provide built-in flexibility. All you have to do is change the input parameters for the third-party tool you are using.
Effective data extraction is a key to the success of your data warehouse. Therefore, you need to pay special attention to the issues and formulate a data extraction strategy for your data warehouse. Here is a list of data extraction issues:
• Source identification—identify source applications and source structures.
• Method of extraction—for each data source, define whether the extraction process is manual or tool-based.
• Extraction frequency—for each data source, establish how frequently the data extraction must be done—daily, weekly, quarterly, and so on.
• Time window—for each data source, denote the time window for the extraction process.
• Job sequencing—determine whether the beginning of one job in an extraction job stream has to wait until the previous job has finished successfully.
• Exception handling—determine how to handle input records that cannot be extracted.
Source Identification
Let us consider the first of the above issues, namely, source identification. We will deal with the rest of the issues later as we move through the remainder of this chapter. Source identification, of course, encompasses the identification of all the proper data sources. It does not stop with just the identification of the data sources. It goes beyond that to examine and verify that the identified sources will provide the necessary value to the data warehouse. Let us walk through the source identification process in some detail.
Assume that a part of your database, maybe one of your data marts, is designed to provide strategic information on the fulfillment of orders. For this purpose, you need to store historical information about the fulfilled and pending orders. If you ship orders through multiple delivery channels, you need to capture data about these channels. If your users are interested in analyzing the orders by the status of the orders as the orders go through the fulfillment process, then you need to extract data on the order statuses.
In the fact table for order fulfillment, you need attributes about the total order amount, discounts, commissions, expected delivery time, actual delivery time, and dates at different stages of the process. You need dimension tables for product, order disposition, delivery channel, and customer. First, you have to determine if you have source systems to provide you with the data needed for this data mart. Then, from the source systems, you have to establish the correct data source for each data element in the data mart. Further, you have to go through a verification process to ensure that the identified sources are really the right ones.
Figure 12-2 describes a stepwise approach to source identification for order fulfillment. Source identification is not as simple a process as it may sound. It is a critical first process in the data extraction function. You need to go through the source identification process for every piece of information you have to store in the data warehouse. As you might have already figured out, source identification needs thoroughness, lots of time, and exhaustive analysis.
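A simple way to keep the outcome of this process explicit is a source-to-target mapping. The sketch below (hypothetical Python, with invented system and field names) records, for each target data element, the chosen source and any consolidation or default rule, along the lines of the steps in Figure 12-2.

# Source-to-target mapping for a few order-fulfillment data elements.
# Each entry names the preferred source and the rule used when several
# sources or defaults are involved (all names here are illustrative).
mapping = {
    "order_amount":      {"source": "ORDERS.ORD_TOTAL",      "rule": "use as-is"},
    "delivery_channel":  {"source": "SHIPPING.CHANNEL_CD",   "rule": "decode via channel lookup"},
    "customer_name":     {"source": ["CRM.CUST_NAME", "ORDERS.CUST_NM"],
                          "rule": "prefer CRM; fall back to ORDERS when missing"},
    "expected_delivery": {"source": "ORDERS.PROMISE_DATE",   "rule": "default to order date + 7 days when null"},
}

for target, spec in mapping.items():
    print(f"{target:18} <- {spec['source']}  ({spec['rule']})")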
Data Extraction Techniques
Before examining the various data extraction techniques, you must clearly understand the nature of the source data you are extracting or capturing. Also, you need to get an insight into how the extracted data will be used. Source data is in a state of constant flux. Business transactions keep changing the data in the source systems. In most cases, the value of an attribute in a source system is the value of that attribute at the current time. If you look at every data structure in the source operational systems, the day-to-day business transactions constantly change the values of the attributes in these structures. When a customer moves to another state, the data about that customer changes in the customer table in the source system. When two additional package types are added to the way a product may be sold, the product data changes in the source system. When a correction is applied to the quantity ordered, the data about that order gets changed in the source system.
Data in the source systems are said to be time-dependent or temporal. This is because source data changes with time. The value of a single variable varies over time. Again, take the example of the change of address of a customer for a move from New York state to California. In the operational system, what is important is that the current address of the customer has CA as the state code. The actual change transaction itself, stating that the previous state code was NY and the revised state code is CA, need not be preserved. But think about how this change affects the information in the data warehouse. If the state code is used for analyzing some measurements such as sales, the sales to the customer prior to the change must be counted in New York state and those after the move must be counted in California. In other words, the history cannot be ignored in the data warehouse. This brings us to the question: how do you capture the history from the source systems? The answer depends on how exactly data is stored in the source systems. So let us examine and understand how data is stored in the source operational systems.
Data in Operational Systems. These source systems generally store data in two ways. Operational data in the source system may be thought of as falling into two broad categories. The type of data extraction technique you have to use depends on the nature of each of these two categories.
Current Value. Most of the attributes in the source systems fall into this category. Here the stored value of an attribute represents the value of the attribute at this moment of time. The values are transient or transitory. As business transactions happen, the values change. There is no way to predict how long the present value will stay or when it will get changed next.
Figure 12-2 Source identification: a stepwise approach. Target data for order fulfillment (order metrics plus product, time, disposition, delivery channel, and customer data) is matched to the sources as follows:
• List each data item of metrics or facts needed for analysis in fact tables.
• List each dimension attribute from all dimensions.
• For each target data item, find the source system and source data item.
• If there are multiple sources for one data element, choose the preferred source.
• Identify multiple source fields for a single target field and form consolidation rules.
• Identify a single source field for multiple target fields and establish splitting rules.
• Ascertain default values.
• Inspect source data for missing values.
Customer name and address, bank account balances, and outstanding amounts on individual orders are some examples of this category.
What is the implication of this category for data extraction? The value of an attribute remains constant only until a business transaction changes it. There is no telling when it will get changed. Data extraction for preserving the history of the changes in the data warehouse gets quite involved for this category of data.
Periodic Status. This category is not as common as the previous category. In this category, the value of the attribute is preserved as the status every time a change occurs. At each of these points in time, the status value is stored with reference to the time when the new value became effective. This category also includes events stored with reference to the time when each event occurred. Look at the way data about an insurance policy is usually recorded in the operational systems of an insurance company. The operational databases store the status data of the policy at each point of time when something in the policy changes. Similarly, for an insurance claim, each event, such as claim initiation, verification, appraisal, and settlement, is recorded with reference to the points in time.
For operational data in this category, the history of the changes is preserved in the source systems themselves. Therefore, data extraction for the purpose of keeping history in the data warehouse is relatively easier. Whether it is status data or data about an event, the source systems contain data at each point in time when any change occurred.
Please study Figure 12-3 and confirm your understanding of the two categories of data stored in the operational systems. Pay special attention to the examples.
Having reviewed the categories indicating how data is stored in the operational systems, we are now in a position to discuss the common techniques for data extraction.
Figure 12-3 Data in operational systems (examples contrasting an attribute that stores only the current value, such as a customer's state of residence changing from OH to CA to NY to NJ over several dates, with an attribute that stores periodic status, such as the dated statuses of a property consigned to an auction house for sale).
When you deploy your data warehouse, the initial data as of a certain time must be moved to the data warehouse to get it started. This is the initial load. After the initial load, your data warehouse must be kept updated so that the history of the changes and statuses is reflected in the data warehouse. Broadly, there are two major types of data extractions from the source operational systems: "as is" (static) data and data of revisions.
"As is" or static data is the capture of data at a given point in time. It is like taking a snapshot of the relevant source data at a certain point in time. For current or transient data, this capture would include all transient data identified for extraction. In addition, for data categorized as periodic, this data capture would include each status or event at each point in time as available in the source operational systems.
You will use static data capture primarily for the initial load of the data warehouse. Sometimes, you may want a full refresh of a dimension table. For example, assume that the product master of your source application is completely revamped. In this case, you may find it easier to do a full refresh of the product dimension table of the target data warehouse. So, for this purpose, you will perform a static data capture of the product data.
Data of revisions is also known as incremental data capture. Strictly, it is not incremental data but the revisions since the last time data was captured. If the source data is transient, the capture of the revisions is not easy. For periodic status data or periodic event data, the incremental data capture includes the values of attributes at specific times. Extract the statuses and events that have been recorded since the last date of extract.
Incremental data capture may be immediate or deferred. Within the group of immediate data capture there are three distinct options. Two separate options are available for deferred data capture.
Immediate Data Extraction. In this option, the data extraction is real-time. It occurs as the transactions happen at the source databases and files. Figure 12-4 shows the immediate data extraction options.
Now let us go into some details about the three options for immediate data extraction.
Capture through Transaction Logs. This option uses the transaction logs of the DBMSs maintained for recovery from possible failures. As each transaction adds, updates, or deletes a row from a database table, the DBMS immediately writes entries on the log file. This data extraction technique reads the transaction log and selects all the committed transactions. There is no extra overhead in the operational systems because logging is already part of the transaction processing.
You have to make sure that all transactions are extracted before the log file gets refreshed. As log files on disk storage get filled up, the contents are backed up on other media and the disk log files are reused. Ensure that all log transactions are extracted for data warehouse updates.
If all of your source systems are database applications, there is no problem with this technique. But if some of your source system data is on indexed and other flat files, this option will not work for these cases. There are no log files for these nondatabase applications. You will have to apply some other data extraction technique for these cases.
While we are on the topic of data capture through transaction logs, let us take a side excursion and look at the use of replication. Data replication is simply a method for creating copies of data in a distributed environment. Please refer to Figure 12-5, which illustrates how replication technology can be used to capture changes to source data.
Figure 12-4 Immediate data extraction: options (Option 1: capture through transaction logs; Option 2: capture through database triggers; Option 3: capture in source applications).
Figure 12-5 Data extraction: using replication technology (the DBMS transaction log is read by the replication server's log transaction manager, and the replicated log transactions are stored in the data staging area).
The appropriate transaction logs contain all the changes to the various source database tables. Here are the broad steps for using replication to capture changes to source data:
• Identify the source system DB table.
• Identify and define target files in the staging area.
• Create mapping between the source table and target files.
• Define the replication mode.
• Schedule the replication process.
• Capture the changes from the transaction logs.
• Transfer captured data from logs to target files.
• Verify transfer of data changes.
• Confirm success or failure of replication.
• In metadata, document the outcome of replication.
• Maintain definitions of sources, targets, and mappings.
Capture through Database Triggers. Again, this option is applicable to your source systems that are database applications. As you know, triggers are special stored procedures (programs) that are stored on the database and fired when certain predefined events occur. You can create trigger programs for all events for which you need data to be captured. The output of the trigger programs is written to a separate file that will be used to extract data for the data warehouse. For example, if you need to capture all changes to the records in the customer table, write a trigger program to capture all updates and deletes in that table.
Data capture through database triggers occurs right at the source and is therefore quite reliable. You can capture both before and after images. However, building and maintaining trigger programs puts an additional burden on the development effort. Also, execution of trigger procedures during transaction processing of the source systems puts additional overhead on the source systems. Further, this option is applicable only for source data in databases.
Capture in Source Applications. This technique is also referred to as application-assisted data capture. In other words, the source application is made to assist in the data capture for the data warehouse. You have to modify the relevant application programs that write to the source files and databases. You revise the programs to write all adds, updates, and deletes to the source files and database tables. Then other extract programs can use the separate file containing the changes to the source data.
Unlike the previous two cases, this technique may be used for all types of source data irrespective of whether it is in databases, indexed files, or other flat files. But you have to revise the programs in the source operational systems and keep them maintained. This could be a formidable task if the number of source system programs is large. Also, this technique may degrade the performance of the source applications because of the additional processing needed to capture the changes on separate files.
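As a rough sketch of application-assisted capture (hypothetical Python; the change file name and record layout are invented), the revised application routine below writes every add, update, or delete to a separate change file that the extract programs can later read.

import json
from datetime import datetime, timezone

CHANGE_FILE = "customer_changes.jsonl"   # separate file read later by extract programs

def save_customer(record, action, change_file=CHANGE_FILE):
    """Application routine revised to log every add/update/delete for the warehouse."""
    # ... the original logic that writes `record` to the source file or table goes here ...
    change_entry = {
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "action": action,              # "add", "update", or "delete"
        "record": record,
    }
    with open(change_file, "a", encoding="utf-8") as f:
        f.write(json.dumps(change_entry) + "\n")

save_customer({"cust_id": "C-42", "state": "CA"}, action="update")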
Deferred Data Extraction. In the cases discussed above, data capture takes place while the transactions occur in the source operational systems. The data capture is immediate or real-time. In contrast, the techniques under deferred data extraction do not capture the changes in real time. The capture happens later. Please see Figure 12-6 showing the deferred data extraction options.
Now let us discuss the two options for deferred data extraction.
Capture Based on Date and Time Stamp. Every time a source record is created or updated it may be marked with a stamp showing the date and time. The time stamp provides the basis for selecting records for data extraction. Here the data capture occurs at a later time, not while each source record is created or updated. If you run your data extraction program at midnight every day, each day you will extract only those records with the date and time stamp later than midnight of the previous day. This technique works well if the number of revised records is small.
Of course, this technique presupposes that all the relevant source records contain date and time stamps. Provided this is true, data capture based on date and time stamp can work for any type of source file. This technique captures the latest state of the source data. Any intermediary states between two data extraction runs are lost.
Deletion of source records presents a special problem. If a source record gets deleted in between two extract runs, the information about the delete is not detected. You can get around this by marking the source record for delete first, doing the extraction run, and then going ahead and physically deleting the record. This means you have to add more logic to the source applications.
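A minimal sketch of the selection logic (illustrative Python; the record layout and timestamps are invented): pick up only the records stamped after the previous extract run, including those flagged for deletion before they are physically removed.

from datetime import datetime

source_records = [
    {"id": 1, "last_updated": datetime(2001, 3, 1, 14, 30), "deleted_flag": False},
    {"id": 2, "last_updated": datetime(2001, 2, 27, 9, 0),  "deleted_flag": False},
    {"id": 3, "last_updated": datetime(2001, 3, 1, 23, 55), "deleted_flag": True},
]

def incremental_extract(records, last_extract_time):
    """Select records created or updated (or flagged for delete) since the last run."""
    return [r for r in records if r["last_updated"] > last_extract_time]

# The last extraction ran at midnight; tonight's run picks up ids 1 and 3 only.
changes = incremental_extract(source_records, datetime(2001, 3, 1, 0, 0))
print([r["id"] for r in changes])   # [1, 3]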
Capture by Comparing Files. If none of the above techniques is feasible for specific source files in your environment, then consider this technique as the last resort. This technique is also called the snapshot differential technique because it compares two snapshots of the source data. Let us see how this technique works.
Suppose you want to apply this technique to capture the changes to your product data.
Figure 12-6 Deferred data extraction: options (Option 1: capture based on date and time stamp; Option 2: capture by comparing files, with file comparison programs producing extract files for the data staging area).
While performing today's data extraction for changes to product data, you do a full file comparison between today's copy of the product data and yesterday's copy. You also compare the record keys to find the inserts and deletes. Then you capture any changes between the two copies.
This technique necessitates the keeping of prior copies of all the relevant source data. Though simple and straightforward, comparison of full rows in a large file can be very inefficient. However, this may be the only feasible option for some legacy data sources that do not have transaction logs or time stamps on source records.
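The sketch below (illustrative Python; the product records are invented) shows the snapshot differential idea: compare yesterday's and today's copies by record key to classify inserts, deletes, and updates.

# Snapshot differential: compare two copies of the product data keyed by product id.
yesterday = {
    "P1": {"name": "Sparkling Water", "package": "12 oz can"},
    "P2": {"name": "Orange Juice",    "package": "1 qt carton"},
}
today = {
    "P1": {"name": "Sparkling Water",  "package": "12 oz bottle"},  # changed
    "P3": {"name": "Cold Brew Coffee", "package": "10 oz bottle"},  # new
}

def snapshot_diff(old, new):
    inserts = {k: new[k] for k in new.keys() - old.keys()}
    deletes = {k: old[k] for k in old.keys() - new.keys()}
    updates = {k: new[k] for k in new.keys() & old.keys() if new[k] != old[k]}
    return inserts, deletes, updates

inserts, deletes, updates = snapshot_diff(yesterday, today)
print("inserts:", list(inserts), "deletes:", list(deletes), "updates:", list(updates))
# inserts: ['P3'] deletes: ['P2'] updates: ['P1']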
Evaluation of the Techniques
To summarize, the following options are available for data extraction:
• Capture of static data
• Capture through transaction logs
• Capture through database triggers
• Capture in source applications
• Capture based on date and time stamp
• Capture by comparing files
You are faced with some big questions. Which ones are applicable in your environment? Which techniques must you use? You will be using the static data capture technique at least in one situation: when you populate the data warehouse initially at the time of deployment. After that, you will usually find that you need a combination of a few of these techniques for your environment. If you have old legacy systems, you may even have the need for the file comparison method.
Figure 12-7 highlights the advantages and disadvantages of the different techniques. Please study it carefully and use it to determine the techniques you would need to use in your environment.
Let us make a few general comments. Which of the techniques are easy and inexpensive to implement? Consider the techniques of using transaction logs and database triggers. Both of these techniques are already available through the database products. Both are comparatively cheap and easy to implement. The technique based on transaction logs is perhaps the most inexpensive. There is no additional overhead on the source operational systems. In the case of database triggers, there is a need to create and maintain trigger programs. Even here, the maintenance effort and the additional overhead on the source operational systems are not that much compared to other techniques.
Data capture in source systems could be the most expensive in terms of development and maintenance. This technique needs substantial revisions to existing source systems. For many legacy source applications, finding the source code and modifying it may not be feasible at all. However, if the source data does not reside on database files and date and time stamps are not present in source records, this is one of the few available options.
What is the impact on the performance of the source operational systems? Certainly, the deferred data extraction methods have the least impact on the operational systems. Data extraction based on time stamps and data extraction based on file comparisons are performed outside the normal operation of the source systems. Therefore, these two are preferred options when minimizing the impact on operational systems is a priority. However, these deferred capture options suffer from some inadequacy. They track the changes from the state of the source data at the time of the current extraction as compared to its state at the time of the previous extraction. Any interim changes are not captured. Therefore, wherever you are dealing with transient source data, you can only come up with approximations of the history.
So what is the bottom line? Use the data capture technique in source systems sparingly because it involves too much development and maintenance work. For your source data on databases, capture through transaction logs and capture through database triggers are obvious first choices. Between these two, capture through transaction logs is a better choice because of better performance. Also, this technique is applicable to nonrelational databases. The file comparison method is the most time-consuming for data extraction. Use it only if all others cannot be applied.
DATA TRANSFORMATION
By making use of the several techniques discussed in the previous section, you design the data extraction function. Now the extracted data is raw data and it cannot be applied to the data warehouse. First, all the extracted data must be made usable in the data warehouse. Having information that is usable for strategic decision making is the underlying principle of the data warehouse. You know that the data in the operational systems is not usable for this purpose. Next, because operational data is extracted from many old legacy systems, the quality of the data in those systems is less likely to be good enough for the data warehouse.
Figure 12-7 Data capture techniques: advantages and disadvantages.
Capture of static data:
• Good flexibility for capture specifications
• Performance of source systems not affected
• No revisions to existing applications
• Can be used on legacy systems
• Can be used on file-oriented systems
• Vendor products are used; no internal costs
Capture through transaction logs:
• Not much flexibility for capture specifications
• Performance of source systems not affected
• No revisions to existing applications
• Can be used on most legacy systems
• Cannot be used on file-oriented systems
• Vendor products are used; no internal costs
Capture through database triggers:
• Not much flexibility for capture specifications
• Performance of source systems affected a bit
• No revisions to existing applications
• Cannot be used on most legacy systems
• Cannot be used on file-oriented systems
• Vendor products are used; no internal costs
Capture in source applications:
• Good flexibility for capture specifications
• Performance of source systems affected a bit
• Major revisions to existing applications
• Can be used on most legacy systems
• Can be used on file-oriented systems
• High internal costs because of in-house work
Capture based on date and time stamp:
• Good flexibility for capture specifications
• Performance of source systems not affected
• Major revisions to existing applications likely
• Cannot be used on most legacy systems
• Can be used on file-oriented systems
• Vendor products may be used
Capture by comparing files:
• Good flexibility for capture specifications
• Performance of source systems not affected
• No revisions to existing applications
• May be used on legacy systems
• May be used on file-oriented systems
• Vendor products are used; no internal costs