Delta Snapshot Interface

The delta snapshot is a commonly used interface for reference data, such as a customer master list. A basic delta snapshot contains a row or transaction that has changed since the last extraction. It contains the current state of all attributes, without information about what, in particular, has changed. This is the easiest of the delta interfaces to process in most cases. Since it contains both changed and unchanged attributes, creating time-variant snapshots does not require retrieval of the previous version of the row. It also does not require the process to examine every column to determine change, but only those columns where such an examination is necessary. And, when such examination is necessary, there are a number of techniques, discussed later in this chapter, that allow it to occur efficiently with minimal development effort.

Transaction Interface

A transaction interface is a special form of delta snapshot interface. A transaction interface is made up of three parts: an action that is to be performed, data that identifies the subject, and data that defines the magnitude of the change. A transaction interface is always complete and received once. This latter characteristic differentiates it from a delta snapshot. In a delta snapshot, the same instance may be received repeatedly over time as it is updated. Instances in a transaction interface are never updated.

The term should not be confused with a business transaction. While the characteristics are basically the same, the term as it is used here describes the interaction between systems. You may have an interface that provides business transactions, but such an interface may take the form of either a delta snapshot or a transaction interface. The ways that each interface is processed are significantly different.

Database Transaction Logs

Database transaction logs are another form of delta interface. They are discussed separately because the delta capture occurs outside the control of the application system. These transaction logs are maintained by the database system itself, at the physical database structure level, to provide restart and recovery capabilities. The content of these logs varies depending on the database system being used. They may take the form of any of the three delta structures discussed earlier. In the case of row snapshot logs, they may contain row images from before and after the update, depending on how the database logging options are set.

There are three main challenges when working with database logs. The first is reading the log itself. These logs use proprietary formats, and the database system may not have an API that allows direct access to these structures. Even if it did, the coding effort can be significant. Often it is necessary to use third-party interfaces to access the transaction logs.

The second challenge is applying a business context to the content of the logs. The database doesn't know about the application or business logic behind an update. A database restoration does not need to interpret the data; it simply gets the database back to the way it was prior to the failure. On the other hand, to load a data warehouse you need to apply this data in a manner that makes business sense. You are not simply replicating the operational system, but interpreting and transforming the data. To do this from a database log requires in-depth knowledge of the application system and its data structures.

The third challenge is dealing with software changes in both the application system and the database system. A new release of the database software may significantly change the format of the transaction logs. Even more difficult to deal with are updates to the application software. The vendor may implement back-end changes that they do not even mention in their release notes because the changes do not outwardly affect the way the system functions. However, the changes may have affected the schema or data content, which in turn affects the content of the database logs. Such logs can be an effective means to obtain change data. However, proceed with caution, and only if other avenues are not available to you.
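One common way to avoid column-by-column comparisons when processing a delta snapshot is to compare a single hash computed over just the columns that matter for history. The sketch below is a minimal illustration of that general idea, not the chapter's specific method; the column names and the notion of a previously stored hash per business key are assumptions.

```python
import hashlib
from typing import Optional

# Columns that drive time-variant history (hypothetical names for illustration).
TRACKED_COLUMNS = ["customer_name", "credit_limit", "sales_region"]

def change_hash(row: dict) -> str:
    """Build one hash over just the tracked columns of a delta snapshot row."""
    payload = "|".join(str(row.get(col, "")) for col in TRACKED_COLUMNS)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()

def has_relevant_change(new_row: dict, stored_hash: Optional[str]) -> bool:
    """True if the row is new, or if any tracked column differs from the stored version."""
    return stored_hash is None or change_hash(new_row) != stored_hash
```

Storing the hash alongside the current row lets the load decide whether to create a new time-variant version without re-reading or comparing every attribute of the previous image.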
Delivering Transaction Data

The primary purpose of the data warehouse is to serve as a central data repository from which data is delivered to external applications. Those applications may be data marts, data-mining systems, operational systems, or just about any other system. In general, these other systems expect to receive data in one of two ways: a point-in-time snapshot or the changes since the last delivery. Point-in-time snapshots come in two flavors: a current snapshot (the point in time is now) or the state of the data at a specified time in the past. The delivery may also be further qualified, for example, by limiting it to transactions processed during a specified period.

Since most of the work for a data warehouse is to deliver snapshots or changes, it makes sense that the data structures used to store the data be optimized to do just that. This means that the data warehouse load process should perform the work necessary to transform the data so it is in a form suitable for delivery. In the case studies in this chapter, we will provide different techniques and models to transform and store the data. No one process will be optimal for every avenue of delivery. However, depending on your timeframe and budget, you may wish to combine techniques to produce a comprehensive solution. Be careful not to overdesign the warehouse. If your deliveries require current snapshots or changes, and only rarely do you require a snapshot for a point in time in the past, then it makes sense to optimize the system for the first two requirements and take a processing hit when you need to address the third.

Updating Fact Tables

Fact tables in a data mart may be maintained in three ways: a complete refresh, updating rows, or inserting changes. In a complete refresh, the entire fact table is cleared and reloaded with new data. This type of process requires delivery of current information from the data warehouse, which is transformed and summarized before loading into the data mart. This technique is commonly used for smaller, highly summarized, snapshot-type fact tables.

Updating a fact table also requires delivery of current information that is transformed to conform to the grain of the fact table. The load process then updates or inserts rows as required with the new information. This technique minimizes the growth of the fact table at the cost of an inefficient load process. It is a particularly cumbersome method if the fact table uses bitmap indexes for its foreign keys and your database system does not update in place. Some database systems, such as Oracle, update rows by deleting the old ones and inserting new rows. The physical movement of a row to another location in the tablespace forces an update of all the indexes. While b-tree indexes are fairly well behaved during updates, bitmap indexes are not. During updating, bitmap structures can become fragmented and grow in size. This fragmentation reduces the efficiency of the index, causing an increase in query time. A DBA is required to monitor the indexes and rebuild them periodically to maintain optimal response times.

The third technique is to simply append the differences to the fact table. This requires the data warehouse to deliver the changes in values since the last delivery. This data is then transformed to match the granularity of the fact table and appended to the table. The approach works best when the measures are fully additive, but it may also be suitable for semiadditive measures. This method is, by far, the fastest way to get the data into the data mart. Row insertion can be performed using the database's bulk load utility, which can typically load very large numbers of rows in a short period of time. Some databases allow you to disable index maintenance during the load, making the load even faster. If you are using bitmap indexes, you should load with index maintenance disabled, then rebuild the indexes after the load. The result is fast load times and optimal indexes to support queries.
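As a rough sketch of the append technique, the change records delivered by the warehouse can be aggregated to the fact table's grain and written to a flat file for the database's bulk load utility. The field names and the three-part grain below are assumptions made for illustration; the actual bulk load and index rebuild are left to the database's own tools.

```python
import csv
from collections import defaultdict

def build_append_file(change_rows, out_path):
    """Aggregate change records to a hypothetical (date, customer, item) grain and
    write a flat file suitable for a bulk load append into the fact table."""
    totals = defaultdict(lambda: [0.0, 0.0])  # key -> [quantity delta, value delta]
    for row in change_rows:
        key = (row["order_date"], row["customer_id"], row["item_id"])
        totals[key][0] += row["quantity_delta"]
        totals[key][1] += row["value_delta"]

    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        for (order_date, customer_id, item_id), (qty, value) in totals.items():
            writer.writerow([order_date, customer_id, item_id, qty, value])
    # With bitmap indexes, run the bulk load with index maintenance disabled,
    # then rebuild the indexes once the append completes.
```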
Case Study: Sales Order Snapshots

In this case study, we examine how to model and process a snapshot data extract. We discuss typical transformations that occur prior to loading the data into the data warehouse. We also examine three different techniques for capturing and storing historical information.

Our packaged goods manufacturer receives sales orders for processing and fulfillment. When received by the company, an order goes through a number of administrative steps before it is approved and released for shipment. On average, an order will remain open for 7 to 10 business days before it is shipped. Its actual lifespan will depend on the size, available inventory, and delivery schedule requested by the customer. During that time, changes to the content or status of the order can occur. The order is received by the data warehouse in a delta snapshot interface. An order appears in the extract anytime something in the order changes. The order, when received, is a complete picture of the order at that point in time. An order transaction is made up of a number of parts (a simplified sketch of this structure follows the list):

■■ The order header contains customer-related information about the order. It identifies the sold-to, ship-to, and bill-to customers, the shipping address, the customer's PO information, and other characteristics of the order. While such an arrangement violates normalization rules, transaction data extracts are often received in a denormalized form. We will discuss this further in the next section.

■■ A child of the order header is one or more pricing segments. A pricing segment contains a pricing code, an amount, a quantity, and accounting information. Pricing segments at this level represent charges or credits applied to the total order. For example, shipping charges would appear here.

■■ Another child of the order header is one or more order lines. An order line contains a product ID (SKU), order quantity, confirmed quantity, unit price, unit of measure, weight, volume, status code, and requested delivery date, as well as other characteristics.

■■ A child of the order line is one or more line-pricing segments. These are in the same format as the order header-pricing segments, but contain data pertaining to the line. A segment exists for the base price as well as for the discounts or surcharges that make up the final price. The quantity in a pricing segment may differ from the quantity on the order line because some discounts or surcharges may be limited to a fixed maximum quantity or a portion of the order quantity. The sum of all line-pricing segments and all order header-pricing segments will equal the total order value.

■■ Another child of the order line is one or more schedule lines. A schedule line contains a planned shipping date and a quantity. The schedule will contain sufficient lines to meet the order quantity. However, based on business rules, the confirmed quantity of the order line is derived from the delivery schedule the customer is willing to accept. Therefore, only the earliest schedule lines that sum to the confirmed quantity represent the actual shipping schedule. The shipping schedule is used for reporting future expected revenue.
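For concreteness, the nested structure described above might be represented as follows. This is a deliberately simplified subset; the field names are assumptions for the sketch, not the extract's actual layout.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List

@dataclass
class PricingSegment:
    pricing_code: str
    value: float
    quantity: float

@dataclass
class ScheduleLine:
    planned_shipping_date: date
    planned_shipping_quantity: float

@dataclass
class OrderLine:
    line_id: str
    item_sku: str
    unit_of_measure: str
    order_quantity: float
    confirmed_quantity: float
    unit_price: float
    pricing: List[PricingSegment] = field(default_factory=list)   # line-pricing segments
    schedule: List[ScheduleLine] = field(default_factory=list)    # schedule lines

@dataclass
class OrderHeader:
    order_id: str
    sold_to: str
    ship_to: str
    bill_to: str
    order_date: date
    pricing: List[PricingSegment] = field(default_factory=list)   # header-level charges/credits
    lines: List[OrderLine] = field(default_factory=list)
```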
Figure 8.3 shows the transaction structure as it is received in the interface. During the life of the order, it is possible that some portions of the order will be deleted in the operational system. The operational system will not provide any explicit indication that line, schedule, or pricing information has been deleted. The data will simply be missing in the new snapshot. The process must be able to detect and act on such deletions.

Figure 8.3 Order transaction structure. (Diagram: Order Header with child entities Order Header Pricing and Order Line; Order Line with child entities Order Line Pricing and Order Line Schedule.)
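The deletion check itself can be as simple as comparing the identifiers present in the new snapshot with those already stored for the same order. A minimal sketch, using hypothetical line identifiers:

```python
def detect_deleted_lines(stored_line_ids: set, snapshot_line_ids: set) -> set:
    """Lines present in the warehouse but absent from the new snapshot were deleted
    in the operational system; the load must flag or logically delete them."""
    return stored_line_ids - snapshot_line_ids

# Example: line "003" was removed from the order since the last snapshot.
deleted = detect_deleted_lines({"001", "002", "003"}, {"001", "002"})
```

The same comparison applies at each level of the structure: order lines against the order, and pricing or schedule lines against their parent line.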
Transforming the Order

The order data extracted from the operational system is not purposely built for populating the data warehouse. It is used for a number of different purposes, providing order information to other operational systems. Thus, the data extract contains superfluous information. In addition, some of the data is not well suited for use in a data warehouse but can be used to derive more useful data.

Figure 8.4 shows the business model of how the order appears in the data warehouse. Its content is based on the business rules for the organization. This is not the final model. As you will see in subsequent sections of this case study, the final model varies depending on how you decide to collect order history. The model in Figure 8.4 represents an order at a moment in time. It is used in this discussion to identify the attributes that are maintained in the data warehouse.

Unit Price and Other Characteristics

When delivering data to a data mart, it is important that numeric values used to measure the business be delivered so that they are fully additive. When dealing with sales data, it is often the case that the sales line contains a unit price along with a quantity. However, unit price is not particularly useful as a quantitative measure of the business. It cannot be summed or averaged on its own. Instead, what is needed is the extended price of the line, which can be calculated by multiplying price by quantity. This value is fully additive and may serve as a business measure. Unit price, on the other hand, is a characteristic of the sale. It is most certainly useful in analysis, but in the role of a dimensional attribute rather than a measure. (The term "characteristic" is used here to refer to dimensional attributes as used in dimensional modeling. This avoids confusion with the relational modeling use of "attribute," which has a more generic meaning.)

Depending on your business, you may choose not to store unit price, but rather derive it from the extended value when necessary for analysis. In the retail business, this is not an issue since the unit price is always expressed in the selling unit. This is not the case with a packaged goods manufacturer, which may sell the same product in a variety of units (cases, pallets, and so on). In this case, any analysis of unit price needs to take into account the unit being sold. This analysis is simplified when the quantity and value are stored. The unit-dependent value, the sales quantity, is converted and stored in a standard unit, such as the base or inventory unit. Either the sales quantity or the standardized quantity can simply be divided into the value to derive the unit price.

A number of attributes are eliminated from the data model because they are redundant with information maintained elsewhere. Item weight and volume were removed from Order Line because those attributes are available from the Item UOM entity. The Delivery Address is removed from the Order Header because that information is carried by the Ship-To Customer role in the Customer entity. This presumes that the ship-to address cannot be overridden, which is the case in this instance. If such an address can be changed during order entry, you would need to retain that information with the order.

As mentioned earlier, the data received in such interfaces is often in a denormalized form. This normalization process should be a part of any interface analysis. Its purpose is not necessarily to change the content of the interface, but to identify what form the data warehouse model will take. Properly done, it can significantly reduce data storage requirements as well as improve the usability of the data warehouse.

Figure 8.4 Order business model. (Diagram: Order Header, Order Header Pricing, Order Line, Order Line Pricing, Order Line Schedule, Customer, Item, Item UOM, and Load Log entities.)
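Returning to the unit-price discussion in the sidebar above, a small sketch makes the derivations concrete. The field names and the conversion factor are assumptions for illustration only.

```python
def derive_line_measures(order_quantity: float, unit_price: float,
                         base_units_per_sales_unit: float) -> dict:
    """Derive additive measures from a sales line: the extended price and the
    quantity restated in the base (inventory) unit. Unit price can be recomputed
    later as extended_price / quantity, so it need not be stored as a measure."""
    extended_price = order_quantity * unit_price
    base_quantity = order_quantity * base_units_per_sales_unit
    return {"extended_price": extended_price, "base_quantity": base_quantity}

# Example: 3 pallets at $1,000 per pallet, with 40 cases (the base unit) per pallet.
measures = derive_line_measures(3, 1000.0, 40)
# {'extended_price': 3000.0, 'base_quantity': 120.0}
```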
Units of Measure in Manufacturing and Distribution

As retail customers, we usually deal with one unit of measure: the each. Whether we buy a gallon of milk, a six-pack of beer, or a jumbo bag of potato chips, it is still one item, an each. Manufacturing and distribution, on the other hand, have to deal with a multitude of units for the same item. The most common are the each, or consumer unit; the case; and the pallet, although there are many others, such as carton, barrel, layer, and so forth. When orders are received, the quantity may be expressed in a number of different ways. Customers may order cases, pallets, or eaches of the same item. Within inventory, an item is tracked by its SKU. The SKU number not only identifies the item, but also identifies the unit of measure used to inventory the item. This inventory unit of measure is often referred to as the base unit of measure.

In such situations, the data warehouse needs to provide mechanisms to accommodate different units of measure for the same item. Any quantity being stored needs to be tagged with the unit of measure in which it is expressed. It is not enough to simply convert everything into the base unit of measure, for a number of reasons. First, any such conversion creates a derived value. Changes in the conversion factor will affect the derivation. You should always store such quantities as they were entered to avoid discrepancies later. Second, you will be required to present those quantities in different units of measure, depending on the audience. Therefore, you cannot avoid unit conversions at query time.

For a particular item and unit of measure, the source system will often provide characteristics such as conversion factors, weight, dimensions, and volume. A challenge you will face is how to maintain those characteristics. To understand how the data warehouse should maintain the conversion factors and other physical characteristics, it is important to understand the SKU and its implications in inventory management.
The SKU represents the physical unit maintained and counted in inventory. Everything relating to the content and physical characteristics of an item is tied to the SKU. If there is any change to the item, such as making it bigger or smaller, standard inventory practice requires that the changed item be assigned a new SKU identifier. Therefore, any change to the physical information relating to the SKU can be considered a correction of erroneous data and not a new version of the truth. So, in general, this does not require maintaining a time-variant structure, since you would want error corrections to be applied to historical data as well.

This approach, however, only applies to units of measure that are the base unit or smaller. Larger units of measure can have physical changes that do not affect inventory and do not require a new SKU. For example, suppose an item is inventoried by the case, so the SKU represents a case of the product. A pallet of the product is made up of 40 cases, arranged in five layers with eight cases on a layer. Over time it has been discovered that, in a number of instances, cases on the bottom layer were being crushed by the weight above them. It is decided to reconfigure the pallet to four layers of eight cases, for 32 cases in all. This changes the weight, dimensions, volume, and conversion factor of the pallet, but it does not affect the SKU itself. The change does not affect how inventory is counted, so no new SKU is created. However, the old and new pallet configurations have significance in historical reporting, so it is necessary to retain time-variant information so that pallet counts, order weights, and volumes can be properly calculated. This necessitates a hybrid approach when applying changes to unit of measure data. Updates to base units and smaller units are applied in place, without history, while updates to units larger than the base unit should be maintained as time-based variants.
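A minimal sketch of the hybrid approach for the pallet example, assuming a simple effective-dated list of conversion factors; the dates, values, and structure are illustrative only.

```python
from datetime import date

# Effective-dated conversion factors for a unit larger than the base unit
# (cases per pallet for one item), ordered by effective date.
PALLET_FACTORS = [
    (date(2000, 1, 1), 40),   # five layers of eight cases
    (date(2003, 6, 1), 32),   # repacked to four layers after the crushing problems
]

def cases_per_pallet(as_of: date) -> int:
    """Return the conversion factor in effect on a given date, so historical pallet
    counts, order weights, and volumes use the factor of that time."""
    factor = PALLET_FACTORS[0][1]
    for effective_date, value in PALLET_FACTORS:
        if effective_date <= as_of:
            factor = value
    return factor

assert cases_per_pallet(date(2002, 12, 31)) == 40
assert cases_per_pallet(date(2004, 1, 1)) == 32
```

Factors for the base unit and smaller units, by contrast, would simply be overwritten in place when corrected.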
Another type of transformation creates new attributes to improve the usability of the information. For example, the data extract provides the Item Unit Price. This attribute is transformed into Item Extended Price by multiplying the unit price by the ordered quantity. The extended price is a more useful value for most applications since it can be summed and averaged directly, without further manipulation in a delivery query. In fact, because of the additional utility the value provides, and since no information is lost, it is common to replace the unit value with the extended value in the model. Also, since the unit price is often available in an item price table, its inclusion in the sales transaction information provides little additional value. Another transformation is the calculation of Order Line Value. In this case, it is the sum of the values received in Order Line Pricing for that line. There may be other calculations as well. There may be business rules to estimate the Gross and Net Proceeds of Sale from the Order Line Pricing information. Such calculations should take place during the load process and be placed into the data warehouse so they are readily available for delivery.

By performing such transformations up front in the load process, you eliminate the need to perform these calculations later when delivering data to the data marts or other external applications. This eliminates duplication of effort when enforcing these business rules, as well as the possibility of different results due to misinterpretation of the rules or errors in the implementation of the delivery process transformation logic. Making the effort to calculate and store these derivations up front goes a long way toward simplifying data delivery and ensuring consistency across multiple uses of the data.

The data warehouse is required to record the change history for the order lines and pricing segments. In the remainder of this case study, we present three techniques to maintain the current transaction state, detect deletions, and maintain a historical change log. We will evaluate each technique for its ability to accomplish these tasks as well as its utility for delivering data to downstream systems and data marts.

Technique 1: Complete Snapshot Capture

The model in Figure 8.2 shows an example of structures to support complete snapshot capture. In such a situation, a full image of the transaction (in this case, an order) is maintained for each point in time the order is received in the data warehouse. The Order Snapshot Date is part of the primary key and identifies the point in time at which that image is valid. Figure 8.5 shows the complete model as it applies to this case study.

Figure 8.5 Complete snapshot history. (Diagram: Order Header, Order Header Pricing, Order Line, Order Line Pricing, Order Line Schedule, Customer, Item, and Load Log entities, with Order Snapshot Date added to the keys of the order entities.)
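Under this technique, every load simply inserts a full image keyed by the snapshot date, and a point-in-time delivery picks, for each order, the latest snapshot taken on or before the requested date. A minimal sketch, using hypothetical in-memory structures in place of the warehouse tables:

```python
from datetime import date

# Order snapshots keyed by (order identifier, snapshot date); values hold the image.
snapshots = {
    ("A100", date(2003, 3, 1)): {"status": "OPEN",     "order_value": 500.0},
    ("A100", date(2003, 3, 6)): {"status": "RELEASED", "order_value": 480.0},
}

def order_as_of(order_id: str, as_of: date):
    """Return the order image from the latest snapshot taken on or before as_of."""
    candidates = [d for (oid, d) in snapshots if oid == order_id and d <= as_of]
    if not candidates:
        return None
    return snapshots[(order_id, max(candidates))]

assert order_as_of("A100", date(2003, 3, 3))["status"] == "OPEN"
assert order_as_of("A100", date(2003, 3, 7))["order_value"] == 480.0
```

The load logic is trivial at the cost of storage growth, since unchanged portions of the order are repeated with every snapshot.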