• Product category by region by month
• Product department by region by month
• All products by region by month
• Product category by all stores by month
• Product department by all stores by month
• Product category by territory by quarter
• Product department by territory by quarter
• All products by territory by quarter
• Product category by region by quarter
• Product department by region by quarter
• All products by region by quarter
• Product category by all stores by quarter
• Product department by all stores by quarter
• Product category by territory by year
• Product department by territory by year
• All products by territory by year
• Product category by region by year
• Product department by region by year
• All products by region by year
• Product category by all stores by year
• Product department by all stores by year
• All products by all stores by year

Each of these aggregate fact tables is derived from a single base fact table. The derived aggregate fact tables are joined to one or more derived dimension tables. See Figure 11-15, which shows a derived aggregate fact table connected to a derived dimension table.

Effect of Sparsity on Aggregation. Consider the case of the grocery chain with 300 stores and 40,000 products in each store, of which only 4,000 sell in each store on a given day. As discussed earlier, assuming that you keep records for 5 years, or 1825 days, the maximum number of base fact table rows is calculated as follows:

Product = 40,000
Store = 300
Time = 1825
Maximum number of base fact table rows = 40,000 × 300 × 1825 = about 22 billion

Because only 4,000 of the 40,000 products sell in each store in a day, not all of these 22 billion rows are occupied. Only 10% of the rows are occupied; this occupied fraction is what the discussion here calls the sparsity percentage. Therefore, the real estimate of the number of base table rows is about 2 billion.

Now let us see what happens when you form aggregates. Scrutinize a one-way aggregate: brand totals by store by day. Calculate the maximum number of rows in this one-way aggregate.
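The base-table estimate above can be reproduced with a few lines of arithmetic. The sketch below is only an illustration: the helper name estimate_rows is hypothetical, and the same helper is applied to the one-way brand aggregate that the text works through next, using the 80 brands and roughly 50% occupancy quoted there.

```python
def estimate_rows(cardinalities, occupancy):
    """Return (maximum rows, estimated occupied rows) for a fact table.

    cardinalities: number of distinct values in each dimension.
    occupancy: fraction of possible rows actually present, which the
    text calls the sparsity percentage.
    """
    max_rows = 1
    for count in cardinalities:
        max_rows *= count
    return max_rows, round(max_rows * occupancy)


DAYS = 5 * 365  # 1825 days of history, as in the text

# Base fact table: 40,000 products x 300 stores x 1825 days, 10% occupied.
base_max, base_est = estimate_rows([40_000, 300, DAYS], 0.10)

# One-way aggregate by brand: 80 brands, about 50% occupied.
agg_max, agg_est = estimate_rows([80, 300, DAYS], 0.50)

print(base_max, base_est)  # 21900000000 2190000000
print(agg_max, agg_est)    # 43800000 21900000
print(base_est / agg_est)  # 100.0 base rows per aggregate row
```

Under these figures each brand-level aggregate row stands for about 100 base rows, even though the occupied fraction rises from 10% to 50%.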
DIMENSIONAL MODELING: ADVANCED TOPICS

Brand = 80
Store = 300
Time = 1825
Maximum number of aggregate table rows = 80 × 300 × 1825 = 43,800,000

While creating the one-way aggregate, you will notice that its sparsity is not 10%, as it was for the base table. This is because when you aggregate by brand, more of the brand codes participate in combinations with store and time codes. The sparsity of the one-way aggregate would be about 50%, resulting in a real estimate of 21,900,000 rows. If the sparsity had remained at the 10% applicable to the base table, the real estimate would have been much lower, at 4,380,000 rows.

When you go to higher levels of aggregation, the sparsity percentage moves up and can even reach 100%. Because sparsity fails to stay low, you are faced with the question of whether aggregates improve performance that much. Do they reduce the number of rows that dramatically?

Experienced data warehousing practitioners have a suggestion. When you form aggregates, make sure that each aggregate table row summarizes at least 10 rows in the lower-level table. If you can raise this to 20 rows or more, so much the better.

[Figure 11-15 Aggregate fact table and derived dimension table. Base table: SALES FACTS (Product Key, Time Key, Store Key, Unit Sales, Sales Dollars) with dimensions PRODUCT, STORE, and TIME. One-way aggregate: SALES FACTS (Category Key, Time Key, Store Key, Unit Sales, Sales Dollars) with dimension CATEGORY derived from PRODUCT.]

Aggregation Options

Going back to our discussion of one-way, two-way, and three-way aggregates for a basic STAR schema with just three dimensions, you could count more than 50 different ways you may create aggregates. In the real world, the number of dimensions is not just three, but many more.
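To see how the count passes 50 with only three dimensions, the combinations can be enumerated. This is a rough sketch, not the book's own tally: it assumes four levels per dimension (taken from the hierarchies named in the aggregate list and in Figure 11-15), and the names LEVELS, aggregate_combinations, and ways are hypothetical.

```python
from collections import Counter
from itertools import product

# Hierarchy levels per dimension, lowest level first (an assumption
# based on the levels named in the text and figure).
LEVELS = {
    "product": ["product", "category", "department", "all products"],
    "store": ["store", "territory", "region", "all stores"],
    "time": ["day", "month", "quarter", "year"],
}
BASE = tuple(names[0] for names in LEVELS.values())


def aggregate_combinations(levels):
    """Return every level combination except the base (lowest-level) one."""
    base = tuple(names[0] for names in levels.values())
    return [combo for combo in product(*levels.values()) if combo != base]


def ways(combo, base):
    """Count the dimensions rolled up above their lowest level."""
    return sum(level != lowest for level, lowest in zip(combo, base))


combos = aggregate_combinations(LEVELS)
by_ways = Counter(ways(combo, BASE) for combo in combos)

print(len(combos))              # 63
print(sorted(by_ways.items()))  # [(1, 9), (2, 27), (3, 27)]
```

With these assumed levels there are 9 one-way, 27 two-way, and 27 three-way aggregates. Each extra dimension multiplies the total by its number of levels.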
Therefore, the number of possible aggregate tables escalates into the hundreds. Further, from the earlier discussion of sparsity in aggregate tables, you know that the aggregation process does not reduce the number of rows proportionally. In other words, if the sparsity of the base fact table is 10%, the sparsity of the higher-level aggregate tables does not remain at 10%. The sparsity percentage keeps increasing as your aggregate tables climb to higher levels of summarization.

Is aggregation really that effective after all? What are some of the options? How do you decide what to aggregate? First, set a few goals for aggregation for your data warehouse environment.

Goals for Aggregation Strategy. Apart from the general overall goal of improving data warehouse performance, here are a few specific, practical goals:

• Do not get bogged down with too many aggregates. Remember, you have to create additional derived dimensions as well to support the aggregates.
• Try to cater to a wide range of user groups. In any case, provide for your power users.
• Go for aggregates that do not unduly increase the overall usage of storage. Look carefully into larger aggregates with low sparsity percentages.
• Keep the aggregates hidden from the end-users. That is, the aggregates must be transparent to the end-user query. The query tool must be the one that is aware of the aggregates and directs the queries to the proper tables.
• Attempt to keep the impact on the data staging process as light as possible.

Practical Suggestions. Before doing any calculations to determine the types of aggregates needed for your data warehouse environment, spend a good deal of time determining the nature of the common queries. How do your users normally report results? What are the reporting levels? By stores? By months? By product categories? Go through the dimensions, one by one, and review the levels of the hierarchies.
Check if there are multiple hierarchies within the same dimension. If so, find out which of these multiple hierarchies are more important. In each dimension, ascertain which attributes are used for grouping the fact table metrics. The next step is to determine which of these attributes are used in combinations and what the most common combinations are.

Once you determine the attributes and their possible combinations, look at the number of values for each attribute. For example, in a hotel chain schema, assume that hotel is at the lowest level and city is at the next higher level in the hotel dimension. Let us say there are 25,000 values for hotel and 15,000 values for city. Clearly, there is no big advantage in aggregating by city. On the other hand, if city has only 500 values, then city is a level at which you may consider aggregation. Examine each attribute in the hierarchies within a dimension. Check the number of values for each of the attributes. Compare the number of values of attributes at different levels of the same hierarchy and decide which ones are strong candidates to participate in aggregation.

Develop a list of attributes that are useful candidates for aggregation, then work out the combinations of these attributes to come up with your first set of multiway aggregate fact tables. Determine the derived dimension tables you need to support these aggregate fact tables. Go ahead and implement these aggregate fact tables as the initial set.

Bear in mind that aggregation is a performance tuning mechanism. Improved query performance drives the need to summarize, so do not be too concerned if your first set of aggregate tables does not perform perfectly. Your aggregates are meant to be monitored and revised as necessary. The nature of the bulk of the query requests is likely to change. As your users become more adept at using the data warehouse, they will devise new ways of grouping and analyzing data.
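The hotel and city comparison described above amounts to checking how many lower-level values roll up into each higher-level value. The following sketch makes that check explicit. Treating the 10-rows-per-aggregate-row guideline as a threshold on attribute counts is an assumption made here for illustration, and the function name aggregation_candidates is hypothetical.

```python
def aggregation_candidates(level_counts, min_ratio=10):
    """Return (lower, higher, ratio) for adjacent levels worth aggregating.

    level_counts: list of (level name, number of distinct values),
    ordered from the lowest level of the hierarchy upward.
    min_ratio: assumed minimum number of lower-level values per
    higher-level value for the higher level to be a candidate.
    """
    candidates = []
    for (low_name, low_n), (high_name, high_n) in zip(level_counts, level_counts[1:]):
        ratio = low_n / high_n
        if ratio >= min_ratio:
            candidates.append((low_name, high_name, ratio))
    return candidates


# 25,000 hotels in 15,000 cities: fewer than 2 hotels per city.
print(aggregation_candidates([("hotel", 25_000), ("city", 15_000)]))  # []

# 25,000 hotels in 500 cities: 50 hotels per city.
print(aggregation_candidates([("hotel", 25_000), ("city", 500)]))
# [('hotel', 'city', 50.0)]
```

The threshold is a tuning knob rather than a rule. Actual query patterns should still decide which candidate levels are built.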
So what is the practical advice? Do your preparatory work, start with a reasonable set of aggregate tables, and continue to make adjustments as necessary.

FAMILIES OF STARS

When you look at a single STAR schema with its fact table and the surrounding dimension tables, you know that is not the extent of a data warehouse. Almost all data warehouses contain multiple STAR schema structures. Each STAR serves a specific purpose to track the measures stored in the fact table. When you have a collection of related STAR schemas, you may call the collection a family of STARS. Families of STARS are formed for various reasons. You may form a family by just adding aggregate fact tables and the derived dimension tables to support the aggregates. Sometimes, you may create a core fact table containing facts interesting to most users and customized fact tables for specific user groups. Many factors lead to the existence of families of STARS. First, look at the example provided in Figure 11-16.

[Figure 11-16 Family of STARS: three fact tables and their dimension tables.]

The fact tables of the STARS in a family share dimension tables. Usually, the time dimension is shared by most of the fact tables in the group. In the example in Figure 11-16, all three fact tables are likely to share the time dimension. Going the other way, dimension tables from multiple STARS may share the fact table of one STAR.

If you are in a business like banking or telephone services, it makes sense to capture individual transactions as well as snapshots at specific intervals. You may then use families of STARS consisting of transaction and snapshot schemas. If you are in a manufacturing company or a similar production-type enterprise, your company needs to monitor the metrics along the value chain.
Some other institutions are like a medical center, where value is added not in a chain but at different stations within the enterprise. For these enterprises, the family of STARS supports the value chain or the value circle. We will get into details in the next few sections.

Snapshot and Transaction Tables

Let us review some basic requirements of a telephone company. A number of individual transactions make up a telephone customer's account. Many of the transactions occur during the hours of 6 a.m. to 10 p.m. of the customer's day. More transactions happen during the holidays and weekends for residential customers. Institutional customers use the phones on weekdays rather than over the weekends. A telephone company accumulates a very large collection of rich transaction data that can be used for many types of valuable analysis. The telephone company needs a schema capturing transaction data that supports strategic decision making for expansions, new service improvements, and so on. This transaction schema answers questions such as how the revenue from peak hours over weekends and holidays compares with the revenue from peak hours on weekdays.

In addition, the telephone company needs to answer questions from the customers about account balances. The customer service departments are constantly bombarded with questions on the status of individual customer accounts. At periodic intervals, the accounting department may be interested in the amounts expected to be received by the middle of next month. What are the outstanding balances for which bills will be sent this month-end? For these purposes, the telephone company needs a schema to capture snapshots at periodic intervals. Please see Figure 11-17, which shows the snapshot and transaction fact tables for a telephone company. Make a note of the attributes in the two fact tables. One table tracks the individual phone transactions. The other table holds snapshots of individual accounts at specific intervals.
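The relationship between the two kinds of fact table can be sketched in code. The example below is a simplified illustration with made-up rows. The tuple layout, the YYYYMMDD time keys, and the function names are assumptions, not the telephone company's actual schema. It derives one snapshot row per account per month, holding a transaction count and an ending balance, from individual transaction rows.

```python
from collections import defaultdict

# Hypothetical transaction fact rows: (time_key, account_key, amount).
transactions = [
    (20240103, "A1", 12.50),
    (20240117, "A1", 4.25),
    (20240120, "A2", 30.00),
    (20240205, "A1", 8.00),
]


def month_key(time_key):
    """Reduce a YYYYMMDD time key to a YYYYMM snapshot period."""
    return time_key // 100


def build_snapshots(rows):
    """Return {(period, account): (transaction count, ending balance)}."""
    balance = defaultdict(float)  # running balance per account
    snapshots = {}
    for time_key, account_key, amount in sorted(rows):
        balance[account_key] += amount
        key = (month_key(time_key), account_key)
        count = snapshots.get(key, (0, 0.0))[0]
        snapshots[key] = (count + 1, balance[account_key])
    return snapshots


snapshots = build_snapshots(transactions)
for key, row in sorted(snapshots.items()):
    print(key, row)
# (202401, 'A1') (2, 16.75)
# (202401, 'A2') (1, 30.0)
# (202402, 'A1') (1, 24.75)
```

This sketch only writes a snapshot row for periods in which an account has activity. A production snapshot table would normally also carry forward the balance of inactive accounts.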
Also, notice how dimension tables are shared between the two fact tables.

Snapshot and transaction tables are also common for banks. For example, an ATM transaction table stores individual ATM transactions. This fact table keeps track of individual transaction amounts for the customer accounts. The snapshot table holds the balance for each account at the end of each day. The two tables serve two distinct functions. From the transaction table, you can perform various types of analysis of the ATM transactions. The snapshot table provides total amounts held at periodic intervals, showing the shifting and movement of balances.

Financial data warehouses also require snapshot and transaction tables because of the nature of the analysis in these cases. The first set of questions for these warehouses relates to the transactions affecting given accounts over a certain period of time. The other set of questions centers around balances in individual accounts at specific intervals or totals of groups of accounts at the end of specific periods. The transaction table answers the questions of the first set; the snapshot table handles the questions of the second set.

Core and Custom Tables

Consider two types of businesses that are apparently dissimilar. First take the case of a bank. A bank offers a large variety of services, all related to finance in one form or another. Most of the services are different from one another. The checking account service and the savings account service are similar in most ways. But the savings account service does not resemble the credit card service in any way. How do you track these dissimilar services?

Next, consider a manufacturing company producing a number of heterogeneous products. Although a few factors may be common to the various products, by and large the factors differ. What must you do to get information about heterogeneous products?
A different type of family of STARS satisfies the requirements of these companies. In this type of family, all products and services connect to a core fact table and each product or service relates to individual custom tables. In Figure 11-18, you will see the core and custom tables for a bank. Note how the core fact table holds the metrics that are common to all types of accounts. Each custom fact table contains the metrics specific to that line of service. Also note the shared dimension and notice how the tables form a family of STARS.

[Figure 11-17 Snapshot and transaction tables. Telephone transaction fact table: Time Key, Account Key, Transaction Key, District Key, Trans Reference, Account Number, Amount. Telephone snapshot fact table: Time Key, Account Key, Status Key, Transaction Count, Ending Balance. Dimensions: TIME, ACCOUNT, TRANSACTION, DISTRICT, STATUS.]

Supporting Enterprise Value Chain or Value Circle

In a manufacturing business, a product travels through various steps, starting off as raw materials and ending as finished goods in the warehouse inventory. Usually, the steps include addition of ingredients, assembly of materials, process control, packaging, and shipping to the warehouse. From finished goods inventory, a product moves into shipment to distributor, distributor inventory, distributor shipment, retail inventory, and retail sales. At each step, value is added to the product. Several operational systems support the flow through these steps. The whole flow forms the supply chain or the value chain. Similarly, in an insurance company, the value chain may include a number of steps from sales of insurance through issuance of policy and then finally claims processing. In this case, the value chain relates to the service.

If you are in one of these businesses, you need to track important metrics at different steps along the value chain.
You create STAR schemas for the significant steps, and the complete set of related schemas forms a family of STARS. You define a fact table and a set of corresponding dimensions for each important step in the chain. If your company has multiple value chains, then you have to support each chain with a separate family of STARS.

A supply chain or a value chain runs in a linear fashion, beginning with a certain step and ending at another step with many steps in between. Again, at each step, value is added. In some other kinds of businesses where value gets added to services, similar linear movements do not exist. For example, consider a health care institution where value gets added to patient service from different units almost as if they form a circle around the service. We perceive a value circle in such organizations. The value circle of a large health maintenance organization may include hospitals, clinics, doctors' offices, pharmacies, laboratories, government agencies, and insurance companies. Each of these units either provides patient treatments or measures patient treatments. Patient treatment by each unit may be measured in different metrics. But most of the units would analyze the metrics using the same set of conformed dimensions such as time, patient, health care provider, treatment, diagnosis, and payer. For a value circle, the family of STARS comprises multiple fact tables and a set of conformed dimensions.

[Figure 11-18 Core and custom tables. Bank core fact table: Time Key, Account Key, Branch Key, Household Key, Balance, Fees Charged, Transactions. Savings custom fact table: Account Key, Deposits, Withdrawals, Interest Earned, Balance, Service Charges. Checking custom fact table: Account Key, ATM Trans., Drive-up Trans., Walk-in Trans., Deposits, Checks Paid, Overdraft. Dimensions: TIME, ACCOUNT, BRANCH, HOUSEHOLD.]
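As an illustration of why shared dimensions matter along a value chain, the sketch below lines up one metric from each of two steps on the same dimension keys. Everything in it is hypothetical: the two steps, the (time key, product key) layout, the numbers, and the function name combine_steps are chosen only to show the idea.

```python
# Hypothetical fact rows for two steps of a value chain, keyed by the
# shared (time_key, product_key) dimension keys.
shipments = {(202401, "P1"): 500, (202401, "P2"): 200}
retail_sales = {(202401, "P1"): 420, (202402, "P1"): 60}


def combine_steps(**step_facts):
    """Line up one metric per step on the shared dimension keys."""
    keys = sorted(set().union(*(facts.keys() for facts in step_facts.values())))
    return {
        key: {step: facts.get(key, 0) for step, facts in step_facts.items()}
        for key in keys
    }


report = combine_steps(shipped=shipments, sold=retail_sales)
for key, metrics in report.items():
    print(key, metrics)
# (202401, 'P1') {'shipped': 500, 'sold': 420}
# (202401, 'P2') {'shipped': 200, 'sold': 0}
# (202402, 'P1') {'shipped': 0, 'sold': 60}
```

The comparison only works because both fact tables use the same keys with the same meaning. If each step kept its own product codes or calendar, the rows could not be matched without a translation step.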
Conforming Dimensions

While exploring families of STARS, you will have noticed that dimensions are shared among fact tables. Dimensions form common links between STARS. For dimensions to be conformed, you have to deliberately make sure that common dimensions may be used between two or more STARS. If the product dimension is shared between two fact tables of sales and inventory, then the attributes of the product dimension must have the same meaning in relation to each of the two fact tables. Figure 11-19 shows a set of conformed dimensions.

[Figure 11-19 Conformed dimensions. ORDER fact table: Product Key, Time Key, Customer Key, Salesperson Key, Order Dollars, Cost Dollars, Margin Dollars, Sale Units. SHIPMENT fact table: Product Key, Time Key, Customer Key, Salesperson Key, Channel Key, Ship-to Key, Ship-from Key, Invoice Number, Order Number, Ship Date, Arrival Date. Conformed dimensions: PRODUCT, DATE, CUSTOMER, SALESPERSON. Other dimensions: CHANNEL, SHIP-TO, SHIP-FROM.]

The order and shipment fact tables share the conformed dimensions of product, date, customer, and salesperson. A conformed dimension is a comprehensive combination of attributes from the source systems after resolving all discrepancies and conflicts. For example, a conformed product dimension must truly reflect the master product list of the enterprise and must include all possible hierarchies. Each attribute must be of the correct data type and must have proper lengths and constraints.

Conforming dimensions is a basic requirement in a data warehouse. Pay special attention and take the necessary steps to conform all your dimensions. This is a major responsibility of the project team. Conformed dimensions allow rollups across data marts. User interfaces will be consistent irrespective of the type of query. Result sets of queries will be consistent across data marts.
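One practical way to find the discrepancies that must be resolved before a dimension can be conformed is to compare the attribute values that two source systems hold for the same key. The sketch below is a minimal illustration under assumed data: the two product lists, their attribute names, and the function find_conflicts are all hypothetical.

```python
def find_conflicts(dim_a, dim_b):
    """Return {key: {attribute: (value_a, value_b)}} for keys in both sources."""
    conflicts = {}
    for key in dim_a.keys() & dim_b.keys():
        diffs = {
            attr: (dim_a[key].get(attr), dim_b[key].get(attr))
            for attr in dim_a[key].keys() | dim_b[key].keys()
            if dim_a[key].get(attr) != dim_b[key].get(attr)
        }
        if diffs:
            conflicts[key] = diffs
    return conflicts


# Hypothetical product rows as seen by a sales system and an inventory system.
sales_products = {
    "P1": {"name": "Cola 12oz", "category": "Beverages"},
    "P2": {"name": "Rye Bread", "category": "Bakery"},
}
inventory_products = {
    "P1": {"name": "Cola 12oz", "category": "Soft Drinks"},
    "P2": {"name": "Rye Bread", "category": "Bakery"},
}

print(find_conflicts(sales_products, inventory_products))
# {'P1': {'category': ('Beverages', 'Soft Drinks')}}
```

A report like this only surfaces the conflicts. Deciding which value becomes the enterprise definition remains a task for the project team.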
Of course, a single conformed dimension can be used against multiple fact tables.

Standardizing Facts

In addition to the task of conforming dimensions is the requirement to standardize facts. We have seen that fact tables work across STAR schemas. Review the following issues relating to the standardization of fact table attributes:

• Ensure same definitions and terminology across data marts
• Resolve homonyms and synonyms
• Types of facts to be standardized include revenue, price, cost, and profit margin
• Guarantee that the same algorithms are used for any derived units in each fact table
• Make sure each fact uses the right unit of measurement

Summary of Family of STARS

Let us end our discussion of the family of STARS with a comprehensive diagram showing a set of standardized fact tables and conformed dimension tables. Study Figure 11-20 carefully. Note the aggregate fact tables and the corresponding derived dimension tables. What types of aggregates are these? One-way or two-way? Which are the base fact tables? Notice the shared dimensions. Are these conformed dimensions? See how the various fact tables and the dimension tables are related.

[Figure 11-20 A comprehensive family of STARS: bank core table, checking custom table, one-way and two-way aggregate fact tables, with dimension tables BRANCH, ACCOUNT, DATE, STATE, and MONTH.]

CHAPTER SUMMARY

• Slowly changing dimensions may be classified into three different types based on the nature of the changes. Type 1 relates to corrections, Type 2 to preservation of history, and Type 3 to soft revisions. Applying each type of revision to the data warehouse is different.
• Large dimension tables such as customer or product need special considerations for applying optimizing techniques.
• "Snowflaking" or creating a snowflake schema is a method of normalizing the STAR schema. Although some conditions justify the snowflake schema, it is generally not recommended.
• Miscellaneous flags and textual data are thrown together in one table called a junk dimension table.
• Aggregate or summary tables improve performance. Formulate a strategy for building aggregate tables.
• A set of related STAR schemas make up a family of STARS. Examples are snapshot and transaction tables, core and custom tables, and tables supporting a value chain or a value circle. A family of STARS relies on conformed dimension tables and standardized fact tables.

REVIEW QUESTIONS

1. Describe slowly changing dimensions. What are the three types? Explain each type very briefly.
2. Compare and contrast Type 2 and Type 3 slowly changing dimensions.
3. Can you treat rapidly changing dimensions in the same way as Type 2 slowly changing dimensions? Discuss.
4. What are junk dimensions? Are they necessary in a data warehouse?
5. How does a snowflake schema differ from a STAR schema? Name two advantages and two disadvantages of the snowflake schema.
6. Differentiate between slowly and rapidly changing dimensions.
7. What are aggregate fact tables? Why are they needed? Give an example.
8. Describe with examples snapshot and transaction fact tables. How are they related?
9. Give an example of a value circle. Explain how a family of STARS can support a value circle.
10. What is meant by conforming the dimensions? Why is this important in a data warehouse?

[...]
data extraction programs to capture data from the VSAM files to get the data ready for populating the relational database. Two major factors differentiate the data extraction for a new operational system from the data extraction for a data warehouse. First, for a data warehouse, you have to extract data from many disparate sources. Next, for a data warehouse, you have to extract data on the changes for ...

[Figure 12-11: panels Append, Destructive Merge, and Constructive Merge, each showing data staging and warehouse tables.]

... store and use metadata. Even if the in-house programs record the data transformation metadata initially, every time changes occur to transformation rules, the metadata has to be maintained. This puts an additional burden on the maintenance of the manually coded transformation programs.

DATA LOADING

It is generally agreed that transformation functions end as soon as load images are created. The next major ...

... delivery. Data extraction, transformation, and loading encompass the areas of data acquisition and data storage. These are back-end processes that cover the extraction of data from the source systems. Next, they include all the functions and procedures for changing the source data into the exact formats and structures appropriate for storage in the data warehouse database. After the transformation of the data, ...
loading, the load process simply applies the data from the incoming file.

Append. You may think of the append as an extension of the load. If data already exists in the table, the append process unconditionally adds the incoming data, preserving the ...

[Figure residue: Load panel showing data staging and warehouse tables before and after.]

... that take the prepared data, apply it to the data warehouse, and store it in the database there. You create load images to correspond to the target files to be loaded in the data warehouse database. The whole process of moving data into the data warehouse repository is referred to in several ways. You must have heard the phrases applying the data, loading the data, and refreshing the data. For the sake ...

... be loaded in the data warehouse instead of loading the most granular level of data. For example, for a credit card company to analyze sales patterns, it may not be necessary to store in the data warehouse every single transaction on each credit card. Instead, you may want to summarize the daily transactions for each credit card and store the summary data instead of storing the most granular data by individual ...

... static data is the capture of data at a given point in time. It is like taking a snapshot of the relevant source data at a certain point in time. For current or transient data, this capture would include all transient data identified for extraction. In addition, for data categorized as periodic, this data capture would include each status or event at each point in time as available in the source operational ...
combined data does not violate any business rules. Consider the data structures and data elements that you need in your data warehouse. Now think about all the relevant data to be extracted from the source systems. From the variety of source data formats, data values, and the condition of the data quality, you know that you have to perform several types of transformations to make the source data suitable for ...

... suitable for your data warehouse. Transformation of source data encompasses a wide variety of manipulations to change all the extracted source data into usable information to be stored in the data warehouse. Many companies underestimate the extent and complexity of the data transformation functions. They start out with a simple departmental data mart as the pilot project. Almost all of the data for this pilot ...

... comprehensive data extraction rules. Determine data transformation and cleansing rules. Plan for aggregate tables. Organize data staging area and test tools. Write procedures for all data loads. ETL ...

... technology, you may have written data extraction programs to capture data from the VSAM files to get the data ready for populating the relational database. Two major factors differentiate the data extraction ...