• Product category by region by month
• Product department by region by month
• All products by region by month
• Product category by all stores by month
• Product department by all stores by month
• Product category by territory by quarter
• Product department by territory by quarter
• All products by territory by quarter
• Product category by region by quarter
• Product department by region by quarter
• All products by region by quarter
• Product category by all stores by quarter
• Product department by all stores by quarter
• Product category by territory by year
• Product department by territory by year
• All products by territory by year
• Product category by region by year
• Product department by region by year
• All products by region by year
• Product category by all stores by year
• Product department by all stores by year
• All products by all stores by year
Each of these aggregate fact tables is derived from a single base fact table. The derived aggregate fact tables are joined to one or more derived dimension tables. See Figure 11-15, which shows a derived aggregate fact table connected to a derived dimension table.
Effect of Sparsity on Aggregation. Consider the case of the grocery chain with 300 stores and 40,000 products in each store, of which only 4,000 sell in each store in a day. As discussed earlier, assuming that you keep records for 5 years or 1,825 days, the maximum number of base fact table rows is calculated as follows:
Product = 40,000
Store = 300
Time = 1825
Maximum number of base fact table rows = 22 billion
Because only 4,000 products sell in each store in a day, not all of these 22 billion rows are occupied. Because of this sparsity, only 10% of the rows are occupied. Therefore, the real estimate of the number of base table rows is 2 billion.
Now let us see what happens when you form aggregates. Scrutinize a one-way aggregate: brand totals by store by day. Calculate the maximum number of rows in this one-way aggregate:
Brand = 80
Store = 300
Time = 1825
Maximum number of aggregate table rows = 43,800,000
While creating the one-way aggregate, you will notice that the sparsity for this aggregate is not 10% as in the case of the base table. This is because when you aggregate by brand, more of the brand codes will participate in combinations with store and time codes. The sparsity of the one-way aggregate would be about 50%, resulting in a real estimate of 21,900,000 rows. If the sparsity had remained at the 10% applicable to the base table, the real estimate of the number of rows in the aggregate table would be much less.
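To make the arithmetic concrete, here is a minimal sketch in Python (not from the book; the 10% and 50% sparsity figures are the assumptions stated in the text) that reproduces the row estimates above and checks how many base rows each aggregate row summarizes.

# Row estimates for a base fact table and a one-way aggregate,
# using the sparsity percentages assumed in the text.
def real_rows(*cardinalities, sparsity):
    """Maximum rows = product of dimension cardinalities; real rows = occupied fraction."""
    max_rows = 1
    for c in cardinalities:
        max_rows *= c
    return max_rows, int(max_rows * sparsity)

# Base fact table: product by store by day over 5 years
base_max, base_real = real_rows(40_000, 300, 1_825, sparsity=0.10)

# One-way aggregate: brand totals by store by day (80 brands)
agg_max, agg_real = real_rows(80, 300, 1_825, sparsity=0.50)

print(f"Base table: max {base_max:,} rows, real estimate {base_real:,}")
print(f"One-way aggregate: max {agg_max:,} rows, real estimate {agg_real:,}")

# Rough check of the practitioners' guideline: each aggregate row should
# summarize at least 10 rows of the lower-level table.
print(f"Rows summarized per aggregate row: {base_real / agg_real:.0f}")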
When you go for higher levels of aggregates, the sparsity percentage moves up and may even reach 100%. Because sparsity fails to stay low at higher levels, you are faced with the question of whether aggregates really improve performance that much. Do they reduce the number of rows that dramatically?
Experienced data warehousing practitioners have a suggestion. When you form aggregates, make sure that each aggregate table row summarizes at least 10 rows in the lower-level table. If you increase this to 20 rows or more, even better.
Figure 11-15 Aggregate fact table and derived dimension table (the base SALES FACTS table, with unit sales and sales dollars, joins to the STORE, PRODUCT, and TIME dimensions; the one-way aggregate SALES FACTS table is keyed by category, store, and time and joins to a category-level dimension derived from PRODUCT).
By now you have seen the many different ways in which you may create aggregates. In the real world, the number of dimensions is not just three, but many more. Therefore, the number of possible aggregate tables escalates into the hundreds. Further, from the reference to the failure of sparsity in aggregate tables, you know that the aggregation process does not reduce the number of rows proportionally. In other words, if the sparsity of the base fact table is 10%, the sparsity of the higher-level aggregate tables does not remain at 10%. The sparsity percentage increases more and more as your aggregate tables climb higher and higher in levels of summarization.
Is aggregation that effective after all? What are some of the options? How do you decide what to aggregate? First, set a few goals for aggregation in your data warehouse environment.
Goals for Aggregation Strategy. Apart from the general overall goal of improving data warehouse performance, here are a few specific, practical goals:
• Do not get bogged down with too many aggregates. Remember, you have to create additional derived dimensions as well to support the aggregates.
• Try to cater to a wide range of user groups. In any case, provide for your power users.
• Go for aggregates that do not unduly increase the overall usage of storage. Look carefully into larger aggregates with low sparsity percentages.
• Keep the aggregates hidden from the end-users. That is, the aggregates must be transparent to the end-user query. The query tool must be the one that is aware of the aggregates and directs the queries for proper access.
• Attempt to keep the impact on the data staging process as light as possible.
Practical Suggestions. Before doing any calculations to determine the types of aggregates needed for your data warehouse environment, spend a good deal of time determining the nature of the common queries. How do your users normally report results? What are the reporting levels? By stores? By months? By product categories? Go through the dimensions, one by one, and review the levels of the hierarchies. Check if there are multiple hierarchies within the same dimension. If so, find out which of these multiple hierarchies are more important. In each dimension, ascertain which attributes are used for grouping the fact table metrics. The next step is to determine which of these attributes are used in combinations and what the most common combinations are.
Once you determine the attributes and their possible combinations, look at the number of values for each attribute. For example, in a hotel chain schema, assume that hotel is at the lowest level and city is at the next higher level in the hotel dimension. Let us say there are 25,000 values for hotel and 15,000 values for city. Clearly, there is no big advantage in aggregating by cities. On the other hand, if city has only 500 values, then city is a level at which you may consider aggregation. Examine each attribute in the hierarchies within a dimension. Check the values for each of the attributes. Compare the values of attributes at different levels of the same hierarchy and decide which ones are strong candidates to participate in aggregation.
Develop a list of attributes that are useful candidates for aggregation, then work out the combinations of these attributes to come up with your first set of multiway aggregate fact tables. Determine the derived dimension tables you need to support these aggregate fact tables. Go ahead and implement these aggregate fact tables as the initial set.
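As a rough illustration of this screening step, the following sketch (a hypothetical Python helper, not from the book; the state level and its count are made up) compares the number of distinct values at adjacent levels of a hierarchy and flags levels that summarize enough lower-level values to be worth aggregating. The 10:1 threshold echoes the practitioners' guideline mentioned earlier.

# Screen hierarchy levels as aggregation candidates by comparing cardinalities.
# A level is a candidate only if each of its values summarizes "enough"
# values of the level below it (the threshold is an assumption, per the 10:1 rule).
def aggregation_candidates(hierarchy, min_ratio=10):
    """hierarchy: list of (level_name, distinct_value_count), lowest level first."""
    candidates = []
    for (lower, lower_count), (higher, higher_count) in zip(hierarchy, hierarchy[1:]):
        ratio = lower_count / higher_count
        verdict = "candidate" if ratio >= min_ratio else "not worth aggregating"
        candidates.append((higher, round(ratio, 1), verdict))
    return candidates

# Hotel dimension from the example: 25,000 hotels, two scenarios for city
for city_count in (15_000, 500):
    levels = [("hotel", 25_000), ("city", city_count), ("state", 50)]
    print(f"city has {city_count} values -> {aggregation_candidates(levels)}")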
Bear in mind that aggregation is a performance tuning mechanism. Improved query performance drives the need to summarize, so do not be too concerned if your first set of aggregate tables does not perform perfectly. Your aggregates are meant to be monitored and revised as necessary. The nature of the bulk of the query requests is likely to change. As your users become more adept at using the data warehouse, they will devise new ways of grouping and analyzing data. So what is the practical advice? Do your preparatory work, start with a reasonable set of aggregate tables, and continue to make adjustments as necessary.
FAMILIES OF STARS
The fact tables of the STARS in a family share dimension tables. Usually, the time dimension is shared by most of the fact tables in the group. In the above example, all three fact tables are likely to share the time dimension.
Figure 11-16 Family of STARS (three fact tables sharing a set of dimension tables).
Going the other way, dimension tables from multiple STARS may share the fact table of one STAR.
If you are in a business like banking or telephone services, it makes sense to capture individual transactions as well as snapshots at specific intervals. You may then use families of STARS consisting of transaction and snapshot schemas. If you are in a manufacturing company or a similar production-type enterprise, your company needs to monitor the metrics along the value chain. Some other institutions are like a medical center, where value is added not in a chain but at different stations within the enterprise. For these enterprises, the family of STARS supports the value chain or the value circle. We will get into details in the next few sections.
Snapshot and Transaction Tables
Let us review some basic requirements of a telephone company. A number of individual transactions make up a telephone customer's account. Many of the transactions occur during the hours of 6 a.m. to 10 p.m. of the customer's day. More transactions happen during holidays and weekends for residential customers. Institutional customers use the phones on weekdays rather than over the weekends. A telephone company accumulates a very large collection of rich transaction data that can be used for many types of valuable analysis. The telephone company needs a schema capturing transaction data that supports strategic decision making for expansions, new service improvements, and so on. This transaction schema answers questions such as: how does the revenue of peak hours over weekends and holidays compare with that of peak hours over weekdays?
In addition, the telephone company needs to answer questions from the customers as to account balances. The customer service departments are constantly bombarded with questions on the status of individual customer accounts. At periodic intervals, the accounting department may be interested in the amounts expected to be received by the middle of next month. What are the outstanding balances for which bills will be sent this month-end? For these purposes, the telephone company needs a schema to capture snapshots at periodic intervals. Please see Figure 11-17 showing the snapshot and transaction fact tables for a telephone company. Make a note of the attributes in the two fact tables. One table tracks the individual phone transactions. The other table holds snapshots of individual accounts at specific intervals. Also, notice how dimension tables are shared between the two fact tables.
Snapshot and transaction tables are also common for banks. For example, an ATM transaction table stores individual ATM transactions. This fact table keeps track of individual transaction amounts for the customer accounts. The snapshot table holds the balance for each account at the end of each day. The two tables serve two distinct functions. From the transaction table, you can perform various types of analysis of the ATM transactions. The snapshot table provides total amounts held at periodic intervals, showing the shifting and movement of balances.
Financial data warehouses also require snapshot and transaction tables because of the nature of the analysis in these cases. The first set of questions for these warehouses relates to the transactions affecting given accounts over a certain period of time. The other set of questions centers around balances in individual accounts at specific intervals or totals of groups of accounts at the end of specific periods. The transaction table answers the questions of the first set; the snapshot table handles the questions of the second set.
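To illustrate the two grains, here is a minimal Python sketch (illustrative only; the table layouts, account numbers, and amounts are made up) that records individual ATM transactions and then derives an end-of-day snapshot row per account from them, which is essentially the relationship between the two kinds of fact tables described above.

from collections import defaultdict

# Transaction fact rows: one row per individual ATM transaction (the transaction grain).
transactions = [
    {"date": "2001-03-01", "account": "A1", "amount": -200.00},   # withdrawal
    {"date": "2001-03-01", "account": "A1", "amount": 500.00},    # deposit
    {"date": "2001-03-01", "account": "A2", "amount": -50.00},
]

opening_balances = {"A1": 1_000.00, "A2": 300.00}

# Snapshot fact rows: one row per account per day (the periodic snapshot grain),
# holding the ending balance and a transaction count.
def daily_snapshot(transactions, opening_balances, date):
    totals, counts = defaultdict(float), defaultdict(int)
    for t in transactions:
        if t["date"] == date:
            totals[t["account"]] += t["amount"]
            counts[t["account"]] += 1
    return [
        {"date": date, "account": acct,
         "transaction_count": counts[acct],
         "ending_balance": round(opening + totals[acct], 2)}
        for acct, opening in opening_balances.items()
    ]

print(daily_snapshot(transactions, opening_balances, "2001-03-01"))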
Core and Custom Tables
Consider two types of businesses that are apparently dissimilar. First take the case of a bank. A bank offers a large variety of services, all related to finance in one form or another. Most of the services are different from one another. The checking account service and the savings account service are similar in most ways. But the savings account service does not resemble the credit card service in any way. How do you track these dissimilar services?
Next, consider a manufacturing company producing a number of heterogeneous products. Although a few factors may be common to the various products, by and large the factors differ. What must you do to get information about heterogeneous products?
A different type of family of STARS satisfies the requirements of these companies. In this type of family, all products and services connect to a core fact table and each product or service relates to an individual custom table. In Figure 11-18, you will see the core and custom tables for a bank. Note how the core fact table holds the metrics that are common to all types of accounts. Each custom fact table contains the metrics specific to that line of service. Also note the shared dimensions and notice how the tables form a family of STARS.
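The following sketch (hypothetical Python, with made-up account records and metric names) shows the idea: every account contributes a row to the core fact table with the common metrics, while type-specific metrics go to a custom fact table for that line of service, all sharing the same account key.

# Split each account record into a core fact row (metrics common to all account
# types) plus a custom fact row holding the metrics specific to its type.
def build_core_and_custom(accounts):
    core, custom = [], {"checking": [], "savings": []}
    for a in accounts:
        core.append({"account": a["account"], "type": a["type"],
                     "balance": a["balance"], "fees_charged": a["fees_charged"]})
        if a["type"] == "checking":
            custom["checking"].append({"account": a["account"],
                                       "atm_trans": a["atm_trans"]})
        elif a["type"] == "savings":
            custom["savings"].append({"account": a["account"],
                                      "interest_earned": a["interest_earned"]})
    return core, custom

accounts = [
    {"account": "C100", "type": "checking", "balance": 2_400.0,
     "fees_charged": 12.0, "atm_trans": 7},
    {"account": "S200", "type": "savings", "balance": 9_800.0,
     "fees_charged": 0.0, "interest_earned": 41.5},
]
core_rows, custom_rows = build_core_and_custom(accounts)
print(core_rows)
print(custom_rows)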
Supporting Enterprise Value Chain or Value Circle
In a manufacturing business, a product travels through various steps, starting off as raw materials and ending as finished goods in the warehouse inventory. Usually, the steps include addition of ingredients, assembly of materials, process control, packaging, and shipping to the warehouse. From finished goods inventory, a product moves into shipment to distributor, distributor inventory, distributor shipment, retail inventory, and retail sales.
Figure 11-17 Snapshot and transaction fact tables for a telephone company (a TELEPHONE TRANSACTION FACT TABLE keyed by time, account, transaction, and district with transaction reference, account number, and amount, and a TELEPHONE SNAPSHOT FACT TABLE keyed by time, account, and status with transaction count and ending balance; the TIME, ACCOUNT, DISTRICT, and STATUS dimension tables are shared).
At each step, value is added to the product. Several operational systems support the flow through these steps. The whole flow forms the supply chain or the value chain. Similarly, in an insurance company, the value chain may include a number of steps from sales of insurance through issuance of policy and then finally claims processing. In this case, the value chain relates to the service.
If you are in one of these businesses, you need to track important metrics at different steps along the value chain. You create STAR schemas for the significant steps and the complete set of related schemas forms a family of STARS. You define a fact table and a set of corresponding dimensions for each important step in the chain. If your company has multiple value chains, then you have to support each chain with a separate family of STARS.
A supply chain or a value chain runs in a linear fashion, beginning with a certain step and ending at another step, with many steps in between. Again, at each step, value is added. In some other kinds of businesses where value gets added to services, similar linear movements do not exist. For example, consider a health care institution where value gets added to patient service from different units, almost as if they form a circle around the service. We perceive a value circle in such organizations. The value circle of a large health maintenance organization may include hospitals, clinics, doctors' offices, pharmacies, laboratories, government agencies, and insurance companies. Each of these units either provides patient treatments or measures patient treatments. Patient treatment by each unit may be measured in different metrics. But most of the units would analyze the metrics
Figure 11-18 Core and custom tables (a BANK CORE FACT TABLE keyed by time, account, branch, and household with balance, fees charged, and transactions, plus SAVINGS and CHECKING CUSTOM FACT TABLES with type-specific metrics such as deposits, withdrawals, and interest earned, sharing dimensions such as BRANCH, HOUSEHOLD, and ACCOUNT).
using the same set of conformed dimensions such as time, patient, health care provider, treatment, diagnosis, and payer. For a value circle, the family of STARS comprises multiple fact tables and a set of conformed dimensions.
Conforming Dimensions
The order and shipment fact tables share the conformed dimensions of product, date, customer, and salesperson. A conformed dimension is a comprehensive combination of attributes from the source systems after resolving all discrepancies and conflicts. For example, a conformed product dimension must truly reflect the master product list of the enterprise and must include all possible hierarchies. Each attribute must be of the correct data type and must have proper lengths and constraints.
Conforming dimensions is a basic requirement in a data warehouse. Pay special attention and take the necessary steps to conform all your dimensions. This is a major responsibility of the project team. Conformed dimensions allow rollups across data marts. User interfaces will be consistent irrespective of the type of query. Result sets of queries will be consistent across data marts. Of course, a single conformed dimension can be used against multiple fact tables.
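A minimal sketch of what "resolving discrepancies" can look like in practice (hypothetical Python; the source records, attribute names, and reconciliation rules are invented for illustration): two source systems describe the same product differently, and the conformed dimension row standardizes the name, category code, and data types before it is shared by the fact tables.

# Build one conformed product dimension row from two source-system records,
# resolving naming and coding discrepancies (rules here are illustrative).
CATEGORY_CODES = {"BEV": "Beverages", "02": "Beverages"}  # map both source codings

def conform_product(order_entry_rec, shipment_rec):
    return {
        "product_key": order_entry_rec["sku"],                     # surrogate key source
        "product_name": order_entry_rec["name"].strip().title(),   # one naming standard
        "category": CATEGORY_CODES[shipment_rec["cat_code"]],      # decode cryptic code
        "package_size_oz": float(order_entry_rec["size_oz"]),      # enforce data type
    }

conformed = conform_product(
    {"sku": "P-1001", "name": "  SPARKLING WATER 12OZ ", "size_oz": "12"},
    {"item_no": "1001", "cat_code": "02"},
)
print(conformed)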
Standardizing Facts
In addition to the task of conforming dimensions is the requirement to standardize facts. We have seen that fact tables work across STAR schemas. Review the following issues relating to the standardization of fact table attributes:
• Ensure same definitions and terminology across data marts.
• Resolve homonyms and synonyms.
• Types of facts to be standardized include revenue, price, cost, and profit margin.
• Guarantee that the same algorithms are used for any derived units in each fact table (a small sketch after this list illustrates this point).
• Make sure each fact uses the right unit of measurement.
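For instance, a derived fact such as profit margin should be computed by one shared routine rather than re-coded in every load job; the sketch below (hypothetical Python with invented column names, not from the book) is one way to enforce that.

# One shared algorithm for a derived fact, used by every fact-table load job,
# so "profit margin" means the same thing in every data mart.
def profit_margin(revenue, cost):
    """Margin as a fraction of revenue; defined as 0.0 when revenue is zero."""
    return 0.0 if revenue == 0 else round((revenue - cost) / revenue, 4)

# Both the orders mart and the shipments mart call the same function.
order_fact = {"revenue": 1_250.00, "cost": 940.00}
order_fact["profit_margin"] = profit_margin(order_fact["revenue"], order_fact["cost"])
print(order_fact["profit_margin"])   # 0.248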
Summary of Family of STARS
Let us end our discussion of the family of STARS with a comprehensive diagram showing a set of standardized fact tables and conformed dimension tables. Study Figure 11-20 carefully.
Figure 11-20 A comprehensive family of STARS (a BANK CORE TABLE and a CHECKING CUSTOM TABLE together with 1-WAY and 2-WAY AGGREGATE tables, all sharing conformed dimensions such as BRANCH, ACCOUNT, DATE, MONTH, and STATE).
Note the aggregate fact tables and the corresponding derived dimension tables. What types of aggregates are these? One-way or two-way? Which are the base fact tables? Notice the shared dimensions. Are these conformed dimensions? See how the various fact tables and the dimension tables are related.
CHAPTER SUMMARY
• Slowly changing dimensions may be classified into three different types based on the nature of the changes. Type 1 relates to corrections, Type 2 to preservation of history, and Type 3 to soft revisions. Applying each type of revision to the data warehouse is different.
• Large dimension tables such as customer or product need special considerations for applying optimizing techniques.
• "Snowflaking," or creating a snowflake schema, is a method of normalizing the STAR schema. Although some conditions justify the snowflake schema, it is generally not recommended.
• Miscellaneous flags and textual data are thrown together in one table called a junk dimension table.
• Aggregate or summary tables improve performance. Formulate a strategy for building aggregate tables.
• A set of related STAR schemas makes up a family of STARS. Examples are snapshot and transaction tables, core and custom tables, and tables supporting a value chain or a value circle. A family of STARS relies on conformed dimension tables and standardized fact tables.
REVIEW QUESTIONS
1. Describe slowly changing dimensions. What are the three types? Explain each type very briefly.
2. Compare and contrast Type 2 and Type 3 slowly changing dimensions.
3. Can you treat rapidly changing dimensions in the same way as Type 2 slowly changing dimensions? Discuss.
4. What are junk dimensions? Are they necessary in a data warehouse?
5. How does a snowflake schema differ from a STAR schema? Name two advantages and two disadvantages of the snowflake schema.
6. Differentiate between slowly and rapidly changing dimensions.
7. What are aggregate fact tables? Why are they needed? Give an example.
8. Describe with examples snapshot and transaction fact tables. How are they related?
9. Give an example of a value circle. Explain how a family of STARS can support a value circle.
10. What is meant by conforming the dimensions? Why is this important in a data warehouse?
EXERCISES
1. Indicate if true or false:
   A. Type 1 changes for slowly changing dimensions relate to correction of errors.
   B. To apply Type 3 changes of slowly changing dimensions, overwrite the attribute value in the dimension table row with the new value.
   C. Large dimensions usually have multiple hierarchies.
   D. The STAR schema is a normalized version of the snowflake schema.
   E. Aggregates are precalculated summaries.
   F. The percentage of sparsity of the base table tends to be higher than that of aggregate tables.
   G. The fact tables of the STARS in a family share dimension tables.
   H. Core and custom fact tables are useful for companies with several lines of service.
   I. Conforming dimensions is not absolutely necessary in a data warehouse.
   J. A value circle usually needs a family of STARS to support the business.
2. Assume you are in the insurance business. Find two examples of Type 2 slowly changing dimensions in that business. As an analyst on the project, write the specifications for applying the Type 2 changes to the data warehouse with regard to the two examples.
3. You are the data design specialist on the data warehouse project team for a retail company. Design a STAR schema to track the sales units and sales dollars with three dimension tables. Explain how you will decide to select and build four two-way aggregates.
4. As the data designer for an international bank, consider the possible types of snapshot and transaction tables. Complete the design with one set of snapshot and transaction tables.
5. For a manufacturing company, design a family of three STARS to support the value chain.
CHAPTER OBJECTIVES
• Examine the data extraction function, its challenges, its techniques, and learn how to evaluate and apply the techniques
• Discuss the wide range of tasks and types of the data transformation function
• Understand the meaning of data integration and consolidation
• Perceive the importance of the data load function and probe the major methods for applying data to the warehouse
• Gain a true insight into why ETL is crucial, time-consuming, and arduous
You may be convinced that the data in your organization's operational systems is totally inadequate for providing information for strategic decision making. As information technology professionals, we are fully aware of the futile attempts in the past two decades to provide strategic information from operational systems. These attempts did not work. Data warehousing can fulfill that pressing need for strategic information.
Mostly, the information contained in a warehouse flows from the same operational systems that could not be directly used to provide strategic information. What constitutes the difference between the data in the source operational systems and the information in the data warehouse? It is the set of functions that fall under the broad group of data extraction, transformation, and loading (ETL).
sys-ETL functions reshape the relevant data from the source systems into useful tion to be stored in the data warehouse Without these functions, there would be no strate-
informa-257
If the source data is not extracted correctly, cleansed, and integrated in the proper formats, query processing, the backbone of the data warehouse, could not happen.
In Chapter 2, when we discussed the building blocks of the data warehouse, we briefly looked at ETL functions as part of the data staging area. In Chapter 6 we revisited ETL functions and examined how the business requirements drive these functions as well. Further, in Chapter 8, we explored the hardware and software infrastructure options to support the data movement functions. Why, then, is additional review of ETL necessary?
ETL functions form the prerequisites for the data warehouse information content. ETL functions rightly deserve more consideration and discussion. In this chapter, we will delve deeper into issues relating to ETL functions. We will review many significant activities within ETL. In the next chapter, we need to continue the discussion by studying another important function that falls within the overall purview of ETL: data quality. Now, let us begin with a general overview of ETL.
ETL OVERVIEW
If you recall our discussion of the functions and services of the technical architecture of the data warehouse, you will see that we divided the environment into three functional areas. These areas are data acquisition, data storage, and information delivery. Data extraction, transformation, and loading encompass the areas of data acquisition and data storage. These are back-end processes that cover the extraction of data from the source systems. Next, they include all the functions and procedures for changing the source data into the exact formats and structures appropriate for storage in the data warehouse database. After the transformation of the data, these processes consist of all the functions for physically moving the data into the data warehouse repository.
Data extraction, of course, precedes all other functions. But what is the scope and extent of the data you will extract from the source systems? Do you not think that the users of your data warehouse are interested in all of the operational data for some type of query or analysis? So, why not extract all of the operational data and dump it into the data warehouse? This seems to be a straightforward approach. Nevertheless, data extraction should be driven by the user requirements. Your requirements definition should guide you as to what data you need to extract and from which source systems. Avoid creating a data junkhouse by dumping all the available data from the source systems and waiting to see what the users will do with it. Data extraction presupposes a selection process. Select the needed data based on the user requirements.
The extent and complexity of the back-end processes differ from one data warehouse to another. If your enterprise is supported by a large number of operational systems running on several computing platforms, the back-end processes in your case would be extensive and possibly complex as well. So, in your situation, data extraction becomes quite challenging. The data transformation and data loading functions may also be equally difficult. Moreover, if the quality of the source data is below standard, this condition further aggravates the back-end processes. In addition to these challenges, if only a few of the loading methods are feasible for your situation, then data loading could also be difficult. Let us get into specifics about the nature of the ETL functions.
Most Important and Most Challenging
Each of the ETL functions fulfills a significant purpose. When you want to convert data from the source systems into information stored in the data warehouse, each of these functions is essential. For changing data into information you first need to capture the data. After you capture the data, you cannot simply dump that data into the data warehouse and call it strategic information. You have to subject the extracted data to all manner of transformations so that the data will be fit to be converted into information. Once you have transformed the data, it is still not useful to the end-users until it is moved to the data warehouse repository. Data loading is an essential function. You must perform all three functions of ETL for successfully transforming data into information.
Take as an example an analysis your user wants to perform. The user wants to compare and analyze sales by store, by product, and by month. The sales figures are available in the several sales applications in your company. Also, you have a product master file. Further, each sales transaction refers to a specific store. All these are pieces of data in the source operational systems. For doing the analysis, you have to provide information about the sales in the data warehouse database. You have to provide the sales units and dollars in a fact table, the products in a product dimension table, the stores in a store dimension table, and the months in a time dimension table. How do you do this? Extract the data from each of the operational systems, reconcile the variations in data representations among the source systems, and transform all the sales of all the products. Then load the sales into the fact and dimension tables. Now, after completion of these three functions, the extracted data is sitting in the data warehouse, transformed into information, ready for analysis. Notice that it is important for each function to be performed, and performed in sequence.
ETL functions are challenging primarily because of the nature of the source systems. Most of the challenges in ETL arise from the disparities among the source operational systems. Please review the following list of reasons for the types of difficulties in ETL functions. Consider each carefully and relate it to your environment so that you may find proper resolutions.
• Source systems are very diverse and disparate.
• There is usually a need to deal with source systems on multiple platforms and different operating systems.
• Many source systems are older legacy applications running on obsolete database technologies.
• Generally, historical data on changes in values are not preserved in source operational systems. Historical information is critical in a data warehouse.
• Quality of data is dubious in many old source systems that have evolved over time.
• Source system structures keep changing over time because of new business conditions. ETL functions must also be modified accordingly.
• Gross lack of consistency among source systems is commonly prevalent. The same data is likely to be represented differently in the various source systems. For example, data on salary may be represented as monthly salary, weekly salary, and bimonthly salary in different source payroll systems.
• Even when inconsistent data is detected among disparate source systems, lack of a means for resolving mismatches escalates the problem of inconsistency.
• Most source systems do not represent data in types or formats that are meaningful to the users. Many representations are cryptic and ambiguous.
Time-Consuming and Arduous
When the project team designs the ETL functions, tests the various processes, and deploys them, you will find that these consume a very high percentage of the total project effort. It is not uncommon for a project team to spend as much as 50–70% of the project effort on ETL functions. You have already noted several factors that add to the complexity of the ETL functions.
Data extraction itself can be quite involved depending on the nature and complexity of the source systems. The metadata on the source systems must contain information on every database and every data structure that is needed from the source systems. You need very detailed information, including database size and volatility of the data. You have to know the time window during each day when you can extract data without impacting the usage of the operational systems. You also need to determine the mechanism for capturing the changes to data in each of the relevant source systems. These are strenuous and time-consuming activities.
Activities within the data transformation function can run the gamut of transformation methods. You have to reformat internal data structures, resequence data, apply various forms of conversion techniques, supply default values wherever values are missing, and you must design the whole set of aggregates that are needed for performance improvement. In many cases, you need to convert from EBCDIC to ASCII formats.
Now turn your attention to the data loading function. The sheer massive size of the initial loading can populate millions of rows in the data warehouse database. Creating and managing load images for such large numbers are not easy tasks. Even more difficult is the task of testing and applying the load images to actually populate the physical files in the data warehouse. Sometimes, it may take two or more weeks to complete the initial physical loading.
With regard to extracting and applying the ongoing incremental changes, there are several difficulties. Finding the proper extraction method for individual source datasets can be arduous. Once you settle on the extraction method, finding a time window to apply the changes to the data warehouse can be tricky if your data warehouse cannot suffer long downtimes.
ETL Requirements and Steps
Before we highlight some key issues relating to ETL, let us review the functional steps. For initial bulk refresh as well as for the incremental data loads, the sequence is simply as noted here: triggering for incremental changes, filtering for refreshes and incremental loads, data extraction, transformation, integration, cleansing, and applying to the data warehouse database.
What are the major steps in the ETL process? Please look at the list shown in Figure 12-1. Each of these major steps breaks down into a set of activities and tasks. Use this figure as a guide to come up with a list of steps for the ETL process of your data warehouse. The following list enumerates the types of activities and tasks that compose the ETL process. This list is by no means complete for every data warehouse, but it gives a good insight into what is involved to complete the ETL process.
• Combine several source data structures into a single row in the target database of the data warehouse.
• Split one source data structure into several structures to go into several rows of the target database.
• Read data from data dictionaries and catalogs of source systems.
• Read data from a variety of file structures including flat files, indexed files (VSAM), and legacy system databases (hierarchical/network).
• Load details for populating atomic fact tables.
• Aggregate for populating aggregate or summary fact tables.
• Transform data from one format in the source platform to another format in the target platform.
• Derive target values for input fields (example: age from date of birth); a small sketch after this list illustrates this and the next item.
• Change cryptic values to values meaningful to the users (example: 1 and 2 to male and female).
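Here is a minimal sketch (illustrative Python; field names and code values are invented) of two of the listed transformations: deriving a target value from an input field and decoding a cryptic value.

from datetime import date

# Decode a cryptic source code into a value meaningful to the users.
GENDER_CODES = {"1": "Male", "2": "Female"}

def derive_age(date_of_birth, as_of=None):
    """Derive a target value (age) from an input field (date of birth)."""
    as_of = as_of or date.today()
    years = as_of.year - date_of_birth.year
    # Subtract one year if the birthday has not yet occurred this year.
    if (as_of.month, as_of.day) < (date_of_birth.month, date_of_birth.day):
        years -= 1
    return years

source_row = {"cust_id": "C-42", "dob": date(1960, 7, 15), "gender_cd": "2"}
target_row = {
    "customer_key": source_row["cust_id"],
    "age": derive_age(source_row["dob"], as_of=date(2001, 3, 1)),
    "gender": GENDER_CODES.get(source_row["gender_cd"], "Unknown"),
}
print(target_row)   # {'customer_key': 'C-42', 'age': 40, 'gender': 'Female'}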
Key Factors
Before we move on, let us point out a couple of key factors. The first relates to the complexity of the data extraction and transformation functions. The second is about the data loading function.
Remember that the primary reason for the complexity of the data extraction and transformation functions is the tremendous diversity of the source systems. In a large enterprise, we could have a bewildering combination of computing platforms, operating systems, database management systems, network protocols, and source legacy systems. You need to pay special attention to the various sources and begin with a complete inventory of the source systems. With this inventory as a starting point, work out all the details of data extraction.
Figure 12-1 Major steps in the ETL process:
• Determine all the target data needed in the data warehouse.
• Determine all the data sources, both internal and external.
• Prepare data mapping for target data elements from sources.
• Establish comprehensive data extraction rules.
• Determine data transformation and cleansing rules.
• Plan for aggregate tables.
• Organize data staging area and test tools.
• Write procedures for all data loads.
• ETL for dimension tables.
• ETL for fact tables.
The difficulties encountered in the data transformation function also relate to the heterogeneity of the source systems.
Now, turning your attention to the data loading function, you have a couple of issues to be careful about. Usually, the mass refreshes, whether for the initial load or for periodic refreshes, cause difficulties, not so much because of complexities, but because these load jobs run too long. You will have to find the proper time to schedule these full refreshes. Incremental loads have some other types of difficulties. First, you have to determine the best method to capture the ongoing changes from each source system. Next, you have to execute the capture without impacting the source systems. After that, at the other end, you have to schedule the incremental loads without impacting the usage of the data warehouse by the users.
Pay special attention to these key issues while designing the ETL functions for your data warehouse. Now let us take each of the three ETL functions, one by one, and study the details.
DATA EXTRACTION
As an IT professional, you must have participated in data extractions and conversions when implementing operational systems. When you went from a VSAM file-oriented order entry system to a new order processing system using relational database technology, you may have written data extraction programs to capture data from the VSAM files to get the data ready for populating the relational database.
Two major factors differentiate the data extraction for a new operational system from the data extraction for a data warehouse. First, for a data warehouse, you have to extract data from many disparate sources. Next, for a data warehouse, you have to extract data on the changes for ongoing incremental loads as well as for a one-time initial full load. For operational systems, all you need is one-time extractions and data conversions.
These two factors increase the complexity of data extraction for a data warehouse and, therefore, warrant the use of third-party data extraction tools in addition to in-house programs or scripts. Third-party tools are generally more expensive than in-house programs, but they record their own metadata. On the other hand, in-house programs increase the cost of maintenance and are hard to maintain as source systems change. If your company is in an industry where frequent changes to business conditions are the norm, then you may want to minimize the use of in-house programs. Third-party tools usually provide built-in flexibility. All you have to do is change the input parameters for the third-party tool you are using.
Effective data extraction is a key to the success of your data warehouse. Therefore, you need to pay special attention to the issues and formulate a data extraction strategy for your data warehouse. Here is a list of data extraction issues:
• Source identification—identify source applications and source structures.
• Method of extraction—for each data source, define whether the extraction process is manual or tool-based.
• Extraction frequency—for each data source, establish how frequently the data extraction must be done—daily, weekly, quarterly, and so on.
• Time window—for each data source, denote the time window for the extraction process.
• Job sequencing—determine whether the beginning of one job in an extraction job stream has to wait until the previous job has finished successfully.
• Exception handling—determine how to handle input records that cannot be extracted.
Source Identification
Let us consider the first of the above issues, namely, source identification. We will deal with the rest of the issues later as we move through the remainder of this chapter. Source identification, of course, encompasses the identification of all the proper data sources. It does not stop with just the identification of the data sources. It goes beyond that to examine and verify that the identified sources will provide the necessary value to the data warehouse. Let us walk through the source identification process in some detail.
Assume that a part of your database, maybe one of your data marts, is designed to provide strategic information on the fulfillment of orders. For this purpose, you need to store historical information about the fulfilled and pending orders. If you ship orders through multiple delivery channels, you need to capture data about these channels. If your users are interested in analyzing the orders by the status of the orders as the orders go through the fulfillment process, then you need to extract data on the order statuses.
In the fact table for order fulfillment, you need attributes about the total order amount, discounts, commissions, expected delivery time, actual delivery time, and dates at different stages of the process. You need dimension tables for product, order disposition, delivery channel, and customer. First, you have to determine if you have source systems to provide you with the data needed for this data mart. Then, from the source systems, you have to establish the correct data source for each data element in the data mart. Further, you have to go through a verification process to ensure that the identified sources are really the right ones.
Figure 12-2 describes a stepwise approach to source identification for order fulfillment. Source identification is not as simple a process as it may sound. It is a critical first process in the data extraction function. You need to go through the source identification process for every piece of information you have to store in the data warehouse. As you might have already figured out, source identification needs thoroughness, lots of time, and exhaustive analysis.
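A simple way to keep the outcome of this process explicit is a source-to-target mapping. The sketch below (hypothetical Python, with invented system and field names) records, for each target data element, the chosen source and any consolidation or default rule, along the lines of the steps in Figure 12-2.

# Source-to-target mapping for a few order-fulfillment data elements.
# Each entry names the preferred source and the rule used when several
# sources or defaults are involved (all names here are illustrative).
mapping = {
    "order_amount":      {"source": "ORDERS.ORD_TOTAL",      "rule": "use as-is"},
    "delivery_channel":  {"source": "SHIPPING.CHANNEL_CD",   "rule": "decode via channel lookup"},
    "customer_name":     {"source": ["CRM.CUST_NAME", "ORDERS.CUST_NM"],
                          "rule": "prefer CRM; fall back to ORDERS when missing"},
    "expected_delivery": {"source": "ORDERS.PROMISE_DATE",   "rule": "default to order date + 7 days when null"},
}

for target, spec in mapping.items():
    print(f"{target:18} <- {spec['source']}  ({spec['rule']})")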
Data Extraction Techniques
Before examining the various data extraction techniques, you must clearly understand the nature of the source data you are extracting or capturing. Also, you need to get an insight into how the extracted data will be used. Source data is in a state of constant flux. Business transactions keep changing the data in the source systems. In most cases, the value of an attribute in a source system is the value of that attribute at the current time. If you look at every data structure in the source operational systems, the day-to-day business transactions constantly change the values of the attributes in these structures. When a customer moves to another state, the data about that customer changes in the customer table in the source system. When two additional package types are added to the way a product may be sold, the product data changes in the source system. When a correction is applied to the quantity ordered, the data about that order gets changed in the source system.
Data in the source systems are said to be time-dependent or temporal. This is because source data changes with time. The value of a single variable varies over time. Again, take the example of the change of address of a customer for a move from New York state to California. In the operational system, what is important is that the current address of the customer has CA as the state code. The actual change transaction itself, stating that the previous state code was NY and the revised state code is CA, need not be preserved. But think about how this change affects the information in the data warehouse. If the state code is used for analyzing some measurements such as sales, the sales to the customer prior to the change must be counted in New York state and those after the move must be counted in California. In other words, the history cannot be ignored in the data warehouse. This brings us to the question: how do you capture the history from the source systems? The answer depends on how exactly data is stored in the source systems. So let us examine and understand how data is stored in the source operational systems.
Data in Operational Systems. These source systems generally store data in two ways. Operational data in the source system may be thought of as falling into two broad categories. The type of data extraction technique you have to use depends on the nature of each of these two categories.
Current Value. Most of the attributes in the source systems fall into this category. Here the stored value of an attribute represents the value of the attribute at this moment of time. The values are transient or transitory. As business transactions happen, the values change. There is no way to predict how long the present value will stay or when it will get changed next.
Figure 12-2 Source identification: a stepwise approach. Target data for order fulfillment (order metrics plus product, time, disposition, delivery channel, and customer data) is matched to the sources as follows:
• List each data item of metrics or facts needed for analysis in fact tables.
• List each dimension attribute from all dimensions.
• For each target data item, find the source system and source data item.
• If there are multiple sources for one data element, choose the preferred source.
• Identify multiple source fields for a single target field and form consolidation rules.
• Identify a single source field for multiple target fields and establish splitting rules.
• Ascertain default values.
• Inspect source data for missing values.
Customer name and address, bank account balances, and outstanding amounts on individual orders are some examples of this category.
What is the implication of this category for data extraction? The value of an attribute remains constant only until a business transaction changes it. There is no telling when it will get changed. Data extraction for preserving the history of the changes in the data warehouse gets quite involved for this category of data.
Periodic Status. This category is not as common as the previous category. In this category, the value of the attribute is preserved as the status every time a change occurs. At each of these points in time, the status value is stored with reference to the time when the new value became effective. This category also includes events stored with reference to the time when each event occurred. Look at the way data about an insurance policy is usually recorded in the operational systems of an insurance company. The operational databases store the status data of the policy at each point of time when something in the policy changes. Similarly, for an insurance claim, each event, such as claim initiation, verification, appraisal, and settlement, is recorded with reference to the points in time.
For operational data in this category, the history of the changes is preserved in the source systems themselves. Therefore, data extraction for the purpose of keeping history in the data warehouse is relatively easier. Whether it is status data or data about an event, the source systems contain data at each point in time when any change occurred.
Please study Figure 12-3 and confirm your understanding of the two categories of data stored in the operational systems. Pay special attention to the examples.
Having reviewed the categories indicating how data is stored in the operational systems, we are now in a position to discuss the common techniques for data extraction.
Figure 12-3 Data in operational systems (examples contrasting an attribute that stores only the current value, such as a customer's state of residence changing from OH to CA to NY to NJ over several dates, with an attribute that stores periodic status, such as the dated statuses of a property consigned to an auction house for sale).
When you deploy your data warehouse, the initial data as of a certain time must be moved to the data warehouse to get it started. This is the initial load. After the initial load, your data warehouse must be kept updated so that the history of the changes and statuses is reflected in the data warehouse. Broadly, there are two major types of data extractions from the source operational systems: "as is" (static) data and data of revisions.
"As is" or static data is the capture of data at a given point in time. It is like taking a snapshot of the relevant source data at a certain point in time. For current or transient data, this capture would include all transient data identified for extraction. In addition, for data categorized as periodic, this data capture would include each status or event at each point in time as available in the source operational systems.
You will use static data capture primarily for the initial load of the data warehouse. Sometimes, you may want a full refresh of a dimension table. For example, assume that the product master of your source application is completely revamped. In this case, you may find it easier to do a full refresh of the product dimension table of the target data warehouse. So, for this purpose, you will perform a static data capture of the product data.
Data of revisions is also known as incremental data capture. Strictly, it is not incremental data but the revisions since the last time data was captured. If the source data is transient, the capture of the revisions is not easy. For periodic status data or periodic event data, the incremental data capture includes the values of attributes at specific times. Extract the statuses and events that have been recorded since the last date of extract.
Incremental data capture may be immediate or deferred. Within the group of immediate data capture there are three distinct options. Two separate options are available for deferred data capture.
Immediate Data Extraction. In this option, the data extraction is real-time. It occurs as the transactions happen at the source databases and files. Figure 12-4 shows the immediate data extraction options.
Now let us go into some details about the three options for immediate data extraction.
Capture through Transaction Logs. This option uses the transaction logs of the DBMSs maintained for recovery from possible failures. As each transaction adds, updates, or deletes a row from a database table, the DBMS immediately writes entries on the log file. This data extraction technique reads the transaction log and selects all the committed transactions. There is no extra overhead in the operational systems because logging is already part of the transaction processing.
You have to make sure that all transactions are extracted before the log file gets refreshed. As log files on disk storage get filled up, the contents are backed up on other media and the disk log files are reused. Ensure that all log transactions are extracted for data warehouse updates.
If all of your source systems are database applications, there is no problem with this technique. But if some of your source system data is on indexed and other flat files, this option will not work for these cases. There are no log files for these nondatabase applications. You will have to apply some other data extraction technique for these cases.
While we are on the topic of data capture through transaction logs, let us take a side excursion and look at the use of replication. Data replication is simply a method for creating copies of data in a distributed environment. Please refer to Figure 12-5, which illustrates how replication technology can be used to capture changes to source data.
Figure 12-4 Immediate data extraction: options (Option 1: capture through transaction logs; Option 2: capture through database triggers; Option 3: capture in source applications).
Figure 12-5 Data extraction: using replication technology (the DBMS transaction log is read by the replication server's log transaction manager, and the replicated log transactions are stored in the data staging area).
The appropriate transaction logs contain all the changes to the various source database tables. Here are the broad steps for using replication to capture changes to source data:
• Identify the source system DB table.
• Identify and define target files in the staging area.
• Create mapping between the source table and target files.
• Define the replication mode.
• Schedule the replication process.
• Capture the changes from the transaction logs.
• Transfer captured data from logs to target files.
• Verify transfer of data changes.
• Confirm success or failure of replication.
• In metadata, document the outcome of replication.
• Maintain definitions of sources, targets, and mappings.
Capture through Database Triggers. Again, this option is applicable to your source systems that are database applications. As you know, triggers are special stored procedures (programs) that are stored on the database and fired when certain predefined events occur. You can create trigger programs for all events for which you need data to be captured. The output of the trigger programs is written to a separate file that will be used to extract data for the data warehouse. For example, if you need to capture all changes to the records in the customer table, write a trigger program to capture all updates and deletes in that table.
Data capture through database triggers occurs right at the source and is therefore quite reliable. You can capture both before and after images. However, building and maintaining trigger programs puts an additional burden on the development effort. Also, execution of trigger procedures during transaction processing of the source systems puts additional overhead on the source systems. Further, this option is applicable only for source data in databases.
Capture in Source Applications. This technique is also referred to as application-assisted data capture. In other words, the source application is made to assist in the data capture for the data warehouse. You have to modify the relevant application programs that write to the source files and databases. You revise the programs to write all adds, updates, and deletes to the source files and database tables. Then other extract programs can use the separate file containing the changes to the source data.
Unlike the previous two cases, this technique may be used for all types of source data irrespective of whether it is in databases, indexed files, or other flat files. But you have to revise the programs in the source operational systems and keep them maintained. This could be a formidable task if the number of source system programs is large. Also, this technique may degrade the performance of the source applications because of the additional processing needed to capture the changes on separate files.
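As a rough sketch of application-assisted capture (hypothetical Python; the change file name and record layout are invented), the revised application routine below writes every add, update, or delete to a separate change file that the extract programs can later read.

import json
from datetime import datetime, timezone

CHANGE_FILE = "customer_changes.jsonl"   # separate file read later by extract programs

def save_customer(record, action, change_file=CHANGE_FILE):
    """Application routine revised to log every add/update/delete for the warehouse."""
    # ... the original logic that writes `record` to the source file or table goes here ...
    change_entry = {
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "action": action,              # "add", "update", or "delete"
        "record": record,
    }
    with open(change_file, "a", encoding="utf-8") as f:
        f.write(json.dumps(change_entry) + "\n")

save_customer({"cust_id": "C-42", "state": "CA"}, action="update")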
Deferred Data Extraction. In the cases discussed above, data capture takes place while the transactions occur in the source operational systems. The data capture is immediate or real-time. In contrast, the techniques under deferred data extraction do not capture the changes in real time. The capture happens later. Please see Figure 12-6 showing the deferred data extraction options.
Now let us discuss the two options for deferred data extraction.
Capture Based on Date and Time Stamp. Every time a source record is created or updated it may be marked with a stamp showing the date and time. The time stamp provides the basis for selecting records for data extraction. Here the data capture occurs at a later time, not while each source record is created or updated. If you run your data extraction program at midnight every day, each day you will extract only those records with the date and time stamp later than midnight of the previous day. This technique works well if the number of revised records is small.
Of course, this technique presupposes that all the relevant source records contain date and time stamps. Provided this is true, data capture based on date and time stamp can work for any type of source file. This technique captures the latest state of the source data. Any intermediary states between two data extraction runs are lost.
Deletion of source records presents a special problem. If a source record gets deleted in between two extract runs, the information about the delete is not detected. You can get around this by marking the source record for delete first, doing the extraction run, and then going ahead and physically deleting the record. This means you have to add more logic to the source applications.
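A minimal sketch of the selection logic (illustrative Python; the record layout and timestamps are invented): pick up only the records stamped after the previous extract run, including those flagged for deletion before they are physically removed.

from datetime import datetime

source_records = [
    {"id": 1, "last_updated": datetime(2001, 3, 1, 14, 30), "deleted_flag": False},
    {"id": 2, "last_updated": datetime(2001, 2, 27, 9, 0),  "deleted_flag": False},
    {"id": 3, "last_updated": datetime(2001, 3, 1, 23, 55), "deleted_flag": True},
]

def incremental_extract(records, last_extract_time):
    """Select records created or updated (or flagged for delete) since the last run."""
    return [r for r in records if r["last_updated"] > last_extract_time]

# The last extraction ran at midnight; tonight's run picks up ids 1 and 3 only.
changes = incremental_extract(source_records, datetime(2001, 3, 1, 0, 0))
print([r["id"] for r in changes])   # [1, 3]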
Capture by Comparing Files. If none of the above techniques is feasible for specific source files in your environment, then consider this technique as the last resort. This technique is also called the snapshot differential technique because it compares two snapshots of the source data. Let us see how this technique works.
Suppose you want to apply this technique to capture the changes to your product data.
Figure 12-6 Deferred data extraction: options (Option 1: capture based on date and time stamp; Option 2: capture by comparing files, with file comparison programs producing extract files for the data staging area).
While performing today's data extraction for changes to product data, you do a full file comparison between today's copy of the product data and yesterday's copy. You also compare the record keys to find the inserts and deletes. Then you capture any changes between the two copies.
This technique necessitates the keeping of prior copies of all the relevant source data. Though simple and straightforward, comparison of full rows in a large file can be very inefficient. However, this may be the only feasible option for some legacy data sources that do not have transaction logs or time stamps on source records.
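The sketch below (illustrative Python; the product records are invented) shows the snapshot differential idea: compare yesterday's and today's copies by record key to classify inserts, deletes, and updates.

# Snapshot differential: compare two copies of the product data keyed by product id.
yesterday = {
    "P1": {"name": "Sparkling Water", "package": "12 oz can"},
    "P2": {"name": "Orange Juice",    "package": "1 qt carton"},
}
today = {
    "P1": {"name": "Sparkling Water",  "package": "12 oz bottle"},  # changed
    "P3": {"name": "Cold Brew Coffee", "package": "10 oz bottle"},  # new
}

def snapshot_diff(old, new):
    inserts = {k: new[k] for k in new.keys() - old.keys()}
    deletes = {k: old[k] for k in old.keys() - new.keys()}
    updates = {k: new[k] for k in new.keys() & old.keys() if new[k] != old[k]}
    return inserts, deletes, updates

inserts, deletes, updates = snapshot_diff(yesterday, today)
print("inserts:", list(inserts), "deletes:", list(deletes), "updates:", list(updates))
# inserts: ['P3'] deletes: ['P2'] updates: ['P1']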
Evaluation of the Techniques
To summarize, the following options are available for data extraction:
• Capture of static data
• Capture through transaction logs
• Capture through database triggers
• Capture in source applications
• Capture based on date and time stamp
• Capture by comparing files
You are faced with some big questions. Which ones are applicable in your environment? Which techniques must you use? You will be using the static data capture technique at least in one situation: when you populate the data warehouse initially at the time of deployment. After that, you will usually find that you need a combination of a few of these techniques for your environment. If you have old legacy systems, you may even have the need for the file comparison method.
Figure 12-7 highlights the advantages and disadvantages of the different techniques. Please study it carefully and use it to determine the techniques you would need to use in your environment.
Let us make a few general comments. Which of the techniques are easy and inexpensive to implement? Consider the techniques of using transaction logs and database triggers. Both of these techniques are already available through the database products. Both are comparatively cheap and easy to implement. The technique based on transaction logs is perhaps the most inexpensive. There is no additional overhead on the source operational systems. In the case of database triggers, there is a need to create and maintain trigger programs. Even here, the maintenance effort and the additional overhead on the source operational systems are not that much compared to other techniques.
Data capture in source systems could be the most expensive in terms of development and maintenance. This technique needs substantial revisions to existing source systems. For many legacy source applications, finding the source code and modifying it may not be feasible at all. However, if the source data does not reside on database files and date and time stamps are not present in source records, this is one of the few available options.
What is the impact on the performance of the source operational systems? Certainly, the deferred data extraction methods have the least impact on the operational systems. Data extraction based on time stamps and data extraction based on file comparisons are performed outside the normal operation of the source systems. Therefore, these two are preferred options when minimizing the impact on operational systems is a priority. However, these deferred capture options suffer from some inadequacy. They track the changes from the state of the source data at the time of the current extraction as compared to its state at the time of the previous extraction. Any interim changes are not captured. Therefore, wherever you are dealing with transient source data, you can only come up with approximations of the history.
So what is the bottom line? Use the data capture technique in source systems sparingly because it involves too much development and maintenance work. For your source data on databases, capture through transaction logs and capture through database triggers are obvious first choices. Between these two, capture through transaction logs is a better choice because of better performance. Also, this technique is applicable to nonrelational databases. The file comparison method is the most time-consuming for data extraction. Use it only if all others cannot be applied.
DATA TRANSFORMATION
By making use of the several techniques discussed in the previous section, you design the data extraction function. Now the extracted data is raw data and it cannot be applied to the data warehouse. First, all the extracted data must be made usable in the data warehouse. Having information that is usable for strategic decision making is the underlying principle of the data warehouse. You know that the data in the operational systems is not usable for this purpose. Next, because operational data is extracted from many old legacy systems, the quality of the data in those systems is less likely to be good enough for the data warehouse.
Figure 12-7 Data capture techniques: advantages and disadvantages.
Capture of static data:
• Good flexibility for capture specifications
• Performance of source systems not affected
• No revisions to existing applications
• Can be used on legacy systems
• Can be used on file-oriented systems
• Vendor products are used; no internal costs
Capture through transaction logs:
• Not much flexibility for capture specifications
• Performance of source systems not affected
• No revisions to existing applications
• Can be used on most legacy systems
• Cannot be used on file-oriented systems
• Vendor products are used; no internal costs
Capture through database triggers:
• Not much flexibility for capture specifications
• Performance of source systems affected a bit
• No revisions to existing applications
• Cannot be used on most legacy systems
• Cannot be used on file-oriented systems
• Vendor products are used; no internal costs
Capture in source applications:
• Good flexibility for capture specifications
• Performance of source systems affected a bit
• Major revisions to existing applications
• Can be used on most legacy systems
• Can be used on file-oriented systems
• High internal costs because of in-house work
Capture based on date and time stamp:
• Good flexibility for capture specifications
• Performance of source systems not affected
• Major revisions to existing applications likely
• Cannot be used on most legacy systems
• Can be used on file-oriented systems
• Vendor products may be used
Capture by comparing files:
• Good flexibility for capture specifications
• Performance of source systems not affected
• No revisions to existing applications
• May be used on legacy systems
• May be used on file-oriented systems
• Vendor products are used; no internal costs