Mastering Data Warehouse Design: Relational and Dimensional Techniques (Part 8)


[...] performed by the database's optimizer, and as a result it would either look at all rows for the specific date and scan for customer, or look at all rows for the customer and scan for date. If, on the other hand, you defined a compound b-tree index using date and customer, it would use that index to locate the rows directly. The compound index would perform much better than either of the two simple indexes.

Figure 9.7 Simplified b-tree index structure.

B-Tree Index Advantages

B-tree indexes work best in a controlled environment. That is to say, you are able to anticipate how the tables will be accessed for both updates and queries. This is certainly attainable in the enterprise data warehouse, as both the update and delivery processes are controlled by the data warehouse development team. Careful design of the indexes provides optimal performance with minimal overhead.

B-tree indexes are low-maintenance indexes. Database vendors have gone to great lengths to optimize their index structures and algorithms to maintain balanced index trees at all times. This means that frequent updating of tables does not significantly degrade index performance. However, it is still a good idea to rebuild the indexes periodically as part of a normal maintenance cycle.

B-Tree Index Disadvantages

As mentioned earlier, b-tree indexes cannot be used in combination with each other. This means that you must create sufficient indexes to support the anticipated accesses to the table. This does not necessarily mean that you need to create a lot of indexes. For example, if you have a table that is queried by date and by customer and date, you need only create a single compound index using date and customer, in that order, to support both.

The significance of column order in a compound index is another disadvantage. You may be required to create multiple compound indexes or accept that some queries will require sequential scans after exhausting the usefulness of an existing index. Which way you go depends on the indexes you have and the nature of the data. If the existing index results in a scan of a few dozen rows for a particular query, it probably isn't worth the overhead to create a new index structure to overcome the scan. Keep in mind, the more index structures you create, the slower the update process becomes.

B-tree indexes tend to be large. In addition to the columns that make up the index, an index row also contains 16 to 24 additional bytes of pointer and other internal data used by the database system. You also need to add as much as 40 percent to the size as overhead to cover nonleaf nodes and dead space. Refer to your database system's documentation for its method of estimating index sizes.
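As a rough illustration of that arithmetic, the sketch below estimates the size of a b-tree index from the per-entry pointer overhead and the structural overhead quoted above. The 20-byte pointer and 40 percent figures are taken as representative values from this discussion, not from any particular product, and the 10-million-row compound index is a hypothetical example.

```python
def estimate_btree_index_size(row_count, key_bytes, pointer_bytes=20,
                              structural_overhead=0.40):
    """Rough b-tree index size estimate using the figures quoted above.

    key_bytes           -- total width of the indexed column(s) per row
    pointer_bytes       -- per-entry rowid/pointer overhead (16-24 bytes is typical)
    structural_overhead -- extra space for nonleaf nodes and dead space (~40%)
    """
    leaf_bytes = row_count * (key_bytes + pointer_bytes)
    return leaf_bytes * (1 + structural_overhead)

# Example: a compound (date, customer) index on a 10-million-row table,
# assuming 4 bytes for the date key and 8 bytes for the customer key.
size = estimate_btree_index_size(10_000_000, key_bytes=4 + 8)
print(f"{size / 2**20:.0f} MB")   # roughly 430 MB
```

The real formula is vendor-specific, but an estimate in this spirit is usually enough to decide whether an additional compound index is worth its storage and update cost.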
Bitmap Indexes

Bitmap indexes are almost never seen in OLTP-type databases, but they are the darlings of dimensional data marts. Bitmap indexes are best used in environments whose primary purpose is to support ad hoc queries. These indexes, however, are high-maintenance structures that do not handle updating very well. Let's examine the bitmap structure to see why.

Bitmap Structure

Figure 9.8 shows a bitmap index on a table containing information about cars. The index shown is for the Color column of the table. For this example, there are only three colors: red, white, and silver. A bitmap index structure contains a series of bit vectors. There is one vector for each unique value in the column, and each vector contains one bit for each row in the table. In the example, there are three vectors, one for each of the three possible colors. The red vector contains a 0 bit for a row if the color in that row is not red; if the color in the row is red, the bit in the red vector is set to 1.

Figure 9.8 A car color bitmap index.

If we were to query the table for red cars, the database would use the red color vector to locate the rows by finding all the 1 bits. This type of search is fairly fast, but it is not significantly different from, and possibly slower than, a b-tree index on the color column. The advantage of a bitmap index is that it can be used in combination with other bitmap indexes.

Let's expand the example to include a bitmap index on the type of car. Figure 9.9 includes the new index. In this case, there are two car types: sedans and coupes.

Figure 9.9 Adding a car type bitmap index.

Now, the user enters a query to select all cars that are coupes and are not white. With bitmap indexes, the database is able to resolve the query using the bitmap vectors and Boolean operations. It does not need to touch the data until it has isolated the rows it needs. Figure 9.10 shows how the database resolves the query. First, it takes the white vector and performs a Not operation. It takes that result and performs an And operation with the coupe vector. The result is a vector that identifies all rows containing red and silver coupes. Boolean operations against bit vectors are very fast for any computer. A database system can perform this selection much faster than if you had created a b-tree index on car type and color.

Figure 9.10 Query evaluation using bitmap indexes.

Since a query can use multiple bitmap indexes, you do not need to anticipate the combination of columns that will be used in a query. Instead, you simply create bitmap indexes on most, if not all, of the columns in the table. All bitmap indexes are simple, single-column indexes. You do not create, and most database systems will not allow you to create, compound bitmap indexes. Doing so would not make much sense, since using more than one column only increases the cardinality (the number of possible values) of the index, which leads to greater index sparsity. A separate bitmap index on each column is a more effective approach.
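To make the mechanics concrete, here is a small sketch that builds one bit vector per distinct column value and answers the "coupes that are not white" query with Boolean operations, using plain Python integers as uncompressed bit vectors. The row values are a loose reconstruction of the Cars example, and real database systems operate on compressed vectors, but the Not/And logic is exactly what is described above.

```python
# One bit vector per distinct value; bit i corresponds to row i of the table.
rows = [
    ("1DGS902", "Sedan", "White"), ("1HUE039", "Sedan", "Silver"),
    ("2UUE384", "Coupe", "Red"),   ("2ZUD923", "Coupe", "White"),
    ("3ABD038", "Sedan", "White"), ("3KES734", "Sedan", "White"),
    ("3IEK299", "Sedan", "Red"),   ("3JSU823", "Sedan", "Red"),
    ("3LOP929", "Coupe", "Silver"), ("3LMN347", "Coupe", "Silver"),
    ("3SDF293", "Coupe", "Silver"),
]

def build_bitmap(values):
    """Return {value: bit vector} with one vector per distinct value."""
    vectors = {}
    for i, value in enumerate(values):
        vectors[value] = vectors.get(value, 0) | (1 << i)
    return vectors

color_index = build_bitmap(row[2] for row in rows)
type_index = build_bitmap(row[1] for row in rows)

# "All cars that are coupes and are not white":
all_rows = (1 << len(rows)) - 1                       # mask covering every row
result = type_index["Coupe"] & (all_rows & ~color_index["White"])

matches = [rows[i][0] for i in range(len(rows)) if (result >> i) & 1]
print(matches)
```

The printed list contains exactly the red and silver coupes, which is the result described for Figure 9.10, and the data rows themselves are only touched at the very end, after the vectors have isolated the qualifying rows.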
Cardinality and Bitmap Size

Older texts not directly related to data warehousing warn about creating bitmap indexes on columns with high cardinality, that is to say, columns with a large number of possible values. Sometimes they will even give a number, say 100 values, as the upper limit for bitmap indexes. These warnings are related to two issues with bitmaps: their size and their maintenance overhead. In this section, we discuss bitmap index size.

The length of a bitmap vector is directly related to the size of the table. The vector needs 1 bit to represent each row, and a byte can store 8 bits. If the table contains 8 million rows, a bitmap vector will require 1 million bytes to store all the bits. If the column being indexed has a very high cardinality with 1,000 different possible values, then the size of the index, with 1,000 vectors, would be 1 billion bytes. One could then imagine that such a table with indexes on a dozen columns could have bitmap indexes that are many times bigger than the table itself. At least it would appear that way on paper.

In reality, these vectors are very sparse. With 1,000 possible values, a vector representing one value contains far more 0 bits than 1 bits. Knowing this, the database systems that implement bitmap indexes use data compression techniques to significantly reduce the size of these vectors. Data compression can have a dramatic effect on the actual space used to store these indexes. In actual use, a bitmap index on a 1-million-row table and a column with 30,000 different values requires only 4 MB to store the index. A comparable b-tree index requires 20 MB or more, depending on the size of the column and the overhead imposed by the database system. Compression also has a dramatic effect on the speed of these indexes. Since the compressed indexes are so small, evaluation of the indexes on even very large tables can occur entirely in memory.
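The arithmetic above is easy to reproduce. The sketch below computes the on-paper, uncompressed sizes; the compressed figures in the comments simply restate the numbers quoted in the text, since actual compression ratios depend on the product and the data.

```python
def uncompressed_bitmap_bytes(row_count, distinct_values):
    """Uncompressed bitmap index size: one bit per row for each distinct value."""
    bytes_per_vector = row_count / 8          # 8 bits per byte
    return bytes_per_vector * distinct_values

# 8 million rows and 1,000 distinct values -> about 1 billion bytes on paper.
print(uncompressed_bitmap_bytes(8_000_000, 1_000))    # 1000000000.0

# In practice the vectors are sparse and compress extremely well: the text
# cites a 1-million-row table with 30,000 distinct values needing only ~4 MB,
# versus 20 MB or more for a comparable b-tree index.
uncompressed = uncompressed_bitmap_bytes(1_000_000, 30_000)
print(f"{uncompressed / 2**20:,.0f} MB uncompressed vs ~4 MB compressed")
```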
Cardinality and Bitmap Maintenance

The biggest downside to bitmap indexes is that they require constant maintenance to remain compact and efficient. When a column value changes, the database must update two bitmap vectors. For the old value, it must change the 1 bit to a 0 bit. To do this it locates the vector segment containing the bit, decompresses the segment, changes the bit, and compresses the segment. Chances are that the size of the segment has changed, so the system must place the segment in a new location on disk and link it back to the rest of the vector. This process is repeated for the new value. If the new value does not have a vector, a new vector is created. This new vector will contain bits for every row in the table, although initially it will be very small due to compression.

The repeated changes and creations of new vectors severely fragment the bitmap vectors. As the vectors are split into smaller and smaller segments, the compression efficiency decreases. Size increases can be dramatic, with indexes growing to 10 or 20 times their normal size after updating 5 percent of a table. Furthermore, the database must piece together the segments, which are now spread across different areas of the disk, in order to examine a vector. These two problems, increase in size and fragmentation, work in concert to slow down such indexes. High-cardinality indexes make the problem worse because each vector is initially very small due to its sparsity, so any change to a vector causes it to split and fragment. The only way to resolve the problem is to rebuild the index after each data load. Fortunately, this is not a big problem in a data warehouse environment, and most database systems can perform this operation quickly.
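To make the fragmentation problem concrete, here is a toy run-length-encoded bit vector. Real products use more sophisticated, vendor-specific compression schemes, but the effect of scattered single-bit updates is similar: each update splits a long run, so the structure keeps growing until the index is rebuilt.

```python
# Toy run-length-encoded bit vector: a list of (bit, run_length) pairs.
# Real products use more elaborate byte- or word-aligned schemes.

def set_bit(rle, pos):
    """Return a new vector with the bit at `pos` set to 1."""
    out, offset = [], 0
    for bit, length in rle:
        if bit == 0 and offset <= pos < offset + length:
            # Split one run of zeros into up to three runs.
            left = pos - offset
            right = length - left - 1
            if left:
                out.append((0, left))
            out.append((1, 1))
            if right:
                out.append((0, right))
        else:
            out.append((bit, length))
        offset += length
    return out

# A sparse vector for a 1-million-row table: a single 1 bit -> only 3 runs.
vector = [(0, 500_000), (1, 1), (0, 499_999)]
for pos in (10, 20_000, 750_000):        # a few scattered updates
    vector = set_bit(vector, pos)

print(len(vector))   # 9 -- every scattered update adds runs, so the index grows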
Where to Use Bitmap Indexes

Without question, bitmap indexes should be used extensively in dimensional data marts. Each fact table foreign key and some number of the dimensional attributes should be indexed in this manner. In fact, you should avoid using b-tree indexes in combination with bitmap indexes in data marts. The reason for this is to prevent the database optimizer from making a choice. If you use bitmaps exclusively, queries perform in a consistent, predictable manner. If you introduce b-tree indexes as well, you invariably run into situations where, for whatever reason, the optimizer makes the wrong choice and the query runs for a long period of time.

Use of bitmap indexes in the data warehouse depends on two factors: the use of the table and the means used to update the table. In general, bitmap indexes are not used because of the update overhead and the fact that table access is known and controlled. However, bitmap indexes may be useful for staging, or delivery preparation, tables. If these tables are exposed for end-user or application access, bitmaps may provide better performance and utility than b-tree indexes.

Conclusion

We have presented a variety of techniques to organize the data in the physical database structure to optimize performance and data management. Data clustering and index-organized tables can reduce the I/O necessary to retrieve data, provided the access to the data is known and predictable. Each technique has a significant downside if access to the data occurs in an unintended manner. Fortunately, the loading and delivery processes are controlled by the data warehouse development team. Thus, access to the data warehouse is known and predictable. With this knowledge, you should be able to apply the most appropriate technique when necessary.

Table partitioning is primarily used in a data warehouse to improve the manageability of large tables. If the partitions are based on dates, they help reduce the size of incremental backups and simplify the archival process. Date-based partitions can also be used to implement a tiered storage strategy that can significantly reduce overall disk storage costs. Date-based partitions can also provide performance improvements for queries that cover a large time span, allowing for parallel access to multiple partitions. We also reviewed partitioning strategies designed specifically for performance enhancement by forcing parallel access to partitioned data. Such strategies are best applied to data mart tables, where query performance is of primary concern.

We also examined indexing techniques and structures. For partitioned tables, local indexes provide the best combination of performance and manageability. We looked at the two most common index structures, b-tree and bitmap indexes. B-tree indexes are better suited to the data warehouse due to the frequency of updating and the controlled query environment. Bitmap indexes, on the other hand, are the best choice for the ad hoc query environments supported by the data marts.

Optimizing the System Model

The title for this section is in some ways an oxymoron. The system model itself is purely a logical representation, while it is the technology model that represents the physical database implementation. How does one optimize a model that is never queried? What we address in this section are changes that can improve data storage utilization and performance, which affect the entity structure itself. The types of changes discussed here do not occur "under the hood," that is, just in the physical model, but also propagate back to the system model and require changes to the processes that load and deliver data in the data warehouse. Because of the side effects of such changes, these techniques are best applied during initial database design. Making such changes after the fact to an existing data warehouse may involve a significant amount of work.

Vertical Partitioning

Vertical partitioning is a technique in which a table with a large number of columns is split into two or more tables, each with an exclusive subset of the nonkey columns; a short sketch of such a split follows the list below. There are a number of reasons to perform such partitioning:

Performance. A smaller row takes less space. Updates and queries perform better because the database is able to buffer more rows at a time.

Change history. Some values change more frequently than others. By separating high-change-frequency and low-change-frequency columns, the storage requirements are reduced.

Large text. If the row contains large free-form text columns, you can gain significant storage and performance efficiencies by placing the large text columns in their own tables.
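As a toy illustration of the split itself, the sketch below carves a hypothetical wide order-line table into two vertical partitions that share the primary key, then reassembles a full row with a join. It uses pandas DataFrames purely for demonstration, and the column names are illustrative rather than the book's full Order Line layout.

```python
import pandas as pd

# Hypothetical order-line rows; the column names are illustrative only.
order_line = pd.DataFrame({
    "order_id":       [1, 1, 2],
    "order_line_id":  [1, 2, 1],
    "product_id":     ["P10", "P20", "P10"],
    "order_quantity": [5, 3, 7],                    # frequently delivered columns
    "ship_to_city":   ["Oslo", "Bergen", "Oslo"],   # less critical columns
    "terms_days_1":   [30, 30, 60],
})

key = ["order_id", "order_line_id"]

# Each vertical partition carries the key plus an exclusive subset of columns.
order_line_1 = order_line[key + ["product_id", "order_quantity"]]
order_line_2 = order_line[key + ["ship_to_city", "terms_days_1"]]

# A delivery process that needs columns from both partitions must join on the
# key -- the extra cost the technique accepts in exchange for smaller rows.
combined = order_line_1.merge(order_line_2, on=key)
print(combined.shape)   # (3, 6): same rows, columns reassembled
```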
We now examine each of these reasons and how they can be addressed using vertical partitioning.

Vertical Partitioning for Performance

The basic premise here is that a smaller row performs better than a larger row. There is simply less data for the database to handle, allowing it to buffer more rows and reduce the amount of physical I/O. But to achieve such efficiencies you must be able to identify those columns that are most frequently delivered exclusive of the other columns in the table. Let's examine a business scenario where vertical partitioning in this manner would prove to be a useful endeavor.

During the development of the data warehouse, it was discovered that planned service level agreements would not be met. The problem had to do with the large volume of order lines being processed and the need to deliver order line data to data marts in a timely manner. Analysis of the situation determined that the data marts most important to the company and most vulnerable to service level failure required only a small number of columns from the order line table. A decision was made to create vertical partitions of the Order Line table. Figure 9.11 shows the resulting model.

Figure 9.11 Vertical partitioning.

The original Order Line table contained all the columns in the Order Line 1 and Order Line 2 tables. In the physical implementation, only the Order Line 1 and Order Line 2 tables are created. The Order Line 1 table contains the data needed by the critical data marts. To expedite delivery to the critical marts, the update process was split so that the Order Line 1 table update occurs first. Updates to the Order Line 2 table occur later in the process schedule, removed from the critical path.

Notice that some columns appear in both tables. Of course, the primary key columns must be repeated, but there is no reason other columns should not be repeated if doing so helps achieve process efficiencies. It may be that some delivery processes can function faster by avoiding a join if the Order Line 2 table contains some values from Order Line 1. The level of redundancy depends on the needs of your application.

Because the Order Line 1 table's row size is smaller than a combined row, updates to this table run faster, and so do data delivery processes against it. However, the combined updating time for both parts of the table is longer than if it were a single table. Data delivery processes also take more time when they must perform a join that was not necessary with a single table. But this additional cost is acceptable if the solution enables delivery of the critical data marts within the planned service level agreements.

Vertical partitioning to improve performance is a drastic solution to a very specific problem. We recommend performance testing within your database environment to see if such a solution is of benefit to you. It is easier to quantify a vertical partitioning approach used to control change history tracking, which is discussed in the next section.

Vertical Partitioning of Change History

Given any table, there are most likely some columns that are updated more frequently than others. The most likely candidates for this approach are tables that meet the following criteria: [...]
[...] those changes, and how to field changes to the data warehouse. In the next section, we will examine how you can build flexibility into your data warehouse model so that it is more adaptable to future changes. Finally, we will look at two common and challenging business changes that affect a data warehouse: the integration of similar but disparate source systems and expanding the scope of the data warehouse. [...]

[...] data warehouse is data stewardship. Specific personnel within the company should be designated as the stewards of specific data within the data warehouse. The stewards of the data would be given overall responsibility for the content, definition, and access to the data. The responsibilities of the data steward include:

■■ Establish data element definitions, specifying valid values where applicable, and [...]

[...] to support the ongoing operation and future expansion of a data warehouse can significantly reduce the effort and resources required to keep the warehouse running smoothly. This chapter looks at the challenges faced by a data warehouse support group and presents modeling techniques to accommodate future change. This chapter will first look at how change affects the data warehouse. We will look at why changes [...]

[...] "everybody for anything" to requiring a review and approval by the data steward for requests for certain data elements.

■■ Approve use of the data. The data steward should review new data requests to validate how the data is to be used. This is different from controlling access. This responsibility ensures that the data requestor understands the data elements and is applying them in a manner consistent with [...]

[...] the data warehouse organization must overcome this perception and embrace change. After all, one of the significant values of a data warehouse is the ability it provides for the business to evaluate the effect of a change in their business. If the data warehouse is unable to change with the business, its value will diminish to the point where the data warehouse becomes irrelevant. How you, your team, and [...]
Figure 9.13 Subtype cluster model.

Summary

This chapter reviewed many techniques that can improve the performance of your data warehouse and its implementation. We made recommendations for altering or refining the system and technology data models. While we believe these recommendations are valid, we also [...]

[...] and versioning system, it should be integrated into your development environment. At a minimum, the creation of development and quality assurance database instances is required to support changes once the data warehouse goes into production. Figure 10.1 shows the minimal data warehouse landscape to properly support a production environment. [...]

The Changing Data Warehouse

[...] the success of the data warehouse. In this section, we examine data warehouse change at a high level. We look at why changes occur and their effect on the company and the data warehouse team. Later [...]
