The demands of the modern global economy and the Internet dictate that end-user operational applications be active 24/7, 365 days a year. There is no window for any type of batch activity, because when people are asleep in Europe, others are awake down under in Australia. The global economy requires instant and acceptable servicing of the needs of a global user population.

In reality, the most significant difference between OLTP databases and data warehouses extends all the way down to the hardware layer. OLTP databases need highly efficient sharing of critical resources such as onboard memory (RAM), and have very small I/O requirements per transaction. Data warehouses are the complete opposite. A data warehouse can consume large portions of RAM by transferring huge amounts of data between disk and memory, to the detriment of an OLTP database running on the same machine. Where OLTP databases need resource sharing, data warehouses need to hog those resources for extended periods of time. So, a data warehouse hogs machine resources, while an OLTP database attempts to share those same resources. Running both on the same machine is likely to produce unacceptable response times for both database types, because of a lack of basic I/O resources. The result, therefore, is a requirement for a complete separation between operational (OLTP) and decision-support (data warehouse) activity. This is why data warehouses exist!

The Relational Database Model and Data Warehouses

The traditional OLTP (transactional) type of relational database model does not cater for data warehouse requirements. The relational database model is too granular. "Granular" implies too many little pieces. Processing through all those little pieces, joining them all together, is too time consuming for large queries. Similar to the object database model, the relational database model removes duplication and creates granularity.
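The contrast between the two workloads can be sketched in a few lines. The following is not from this chapter; it is a minimal illustration using Python's built-in sqlite3 module, with an invented SALE table, contrasting an OLTP-style indexed point query with a warehouse-style full-table scan:

```python
import sqlite3

# In-memory database standing in for both workloads (illustration only).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sale (sale_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO sale (customer_id, amount) VALUES (?, ?)",
    [(i % 100, float(i % 7)) for i in range(10_000)],
)

# OLTP-style access: one small, indexed, highly concurrent point query.
row = conn.execute("SELECT amount FROM sale WHERE sale_id = ?", (42,)).fetchone()

# Warehouse-style access: a long-running scan over the whole table.
total, = conn.execute("SELECT SUM(amount) FROM sale").fetchone()
print(row, total)
```

The point query touches one row via the primary key; the aggregate reads every row, which is exactly the kind of I/O appetite the chapter attributes to data warehouses.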
This type of database model is efficient for front-end application performance, involving small amounts of data accessed frequently and concurrently by many users at once. This is what an OLTP database does. Data warehouses, on the other hand, need throughput of huge amounts of data by relatively few users. Data warehouses process large quantities of data at once, mainly for reporting and analytical processing. Data warehouses are regularly updated, but usually in large batch operations. OLTP databases need lightning-quick responses to many individual users; data warehouses perform enormous amounts of I/O activity over copious quantities of data. The needs of OLTP and data warehouse databases are therefore completely contrary to each other, down to the lowest layer of hardware resource usage.

Hardware resource usage is the most critical consideration. Software rests squarely on the shoulders of your hardware. Proper use of memory (RAM), disk storage, and CPU time is the critical layer for all activity, and OLTP and data warehouse database differences extend all the way down to this most critical of layers. OLTP databases require intensely sharable hardware structures (commonly known as concurrency), needing highly efficient use of memory and processor time allocations. Data warehouses need huge amounts of disk space, and processing power as well, but all dedicated to long-running programs (commonly known as batch operations, or throughput). A data warehouse simply cannot cope using a standard OLTP relational database model. Something else is needed for a data warehouse.

173 Understanding Data Warehouse Database Modeling 12_574906 ch07.qxd 11/4/05 10:48 AM Page 173

Surrogate Keys in a Data Warehouse

Surrogate keys, as you already know, are replacement key values. A surrogate key makes database access more efficient, usually.
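The idea of a replacement key value can be sketched quickly. The following is not the book's code; all names are invented. It shows one integer surrogate key standing in for whatever natural key each source system happens to use:

```python
# Sketch: assign one surrogate key per real-world entity, however each
# source system happens to identify it. All names here are invented.
surrogate_of = {}   # maps (source system, natural key) -> surrogate key
next_key = 1

def assign_surrogate(source, natural_key, same_as=None):
    """Register a natural key; reuse the surrogate of `same_as` if given."""
    global next_key
    if same_as is not None:
        # This record is known to describe the same entity as `same_as`.
        surrogate_of[(source, natural_key)] = surrogate_of[same_as]
    elif (source, natural_key) not in surrogate_of:
        surrogate_of[(source, natural_key)] = next_key
        next_key += 1
    return surrogate_of[(source, natural_key)]

# Three departments, three different identifiers, one customer.
k1 = assign_surrogate("sales", "Joe's Books")                  # by name
k2 = assign_surrogate("support", "555-1234",
                      same_as=("sales", "Joe's Books"))        # by phone
k3 = assign_surrogate("billing", "CUST0001",
                      same_as=("sales", "Joe's Books"))        # by code
print(k1, k2, k3)
```

All three lookups yield the same surrogate value, which is the "gluing" role surrogate keys play in a warehouse.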
In data warehouse databases, surrogate keys are possibly even more important, in terms of gluing together different data, even from different databases, perhaps even different database engines. Different databases can be keyed on different values, or even contain different key values, which in the non-computerized world are actually identical. For example, a customer in one department of a company could be uniquely identified by the customer's name. In a second department, within the same company, the same customer could be identified by the name of a contact, or perhaps the phone number of that customer. A third department could identify the same customer by a fixed-length character coding system. All three definitions identify exactly the same customer. If this single company is to have meaningful data across all departments, it must identify the three separate formats, all representing the same customer, as being the same customer in the data warehouse. A surrogate key is the perfect solution, using the same surrogate key value for each repetition of the same customer, across all departments. Surrogate key use is prominent in data warehouse database modeling.

Referential Integrity in a Data Warehouse

Data warehouse data modeling is essentially a form of relational database modeling, albeit a simplistic form. Referential integrity still applies to data warehouse databases; however, it is not essential to create primary keys, foreign keys, and their inter-table referential links. It is important to understand that a data warehouse database generally has two distinct activities. The first activity is updating, with large numbers of records added at once, sometimes also with large numbers of records changed. It is always best to only add or remove data in a data warehouse.
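The add-only, batch-oriented update activity can be sketched as follows. This is an illustration only (Python's sqlite3 with an invented fact table, not the book's schema): the key columns exist for joining, but no FOREIGN KEY constraints are declared, and the whole batch is loaded in one transaction:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Warehouse-style fact table: key columns exist for joining, but no
# FOREIGN KEY constraints are declared, so the bulk load skips per-row
# referential checks. Table and column names are invented.
conn.execute("""
    CREATE TABLE sale_fact (
        sale_id     INTEGER PRIMARY KEY,
        customer_id INTEGER,   -- join key only; not a declared foreign key
        sale_price  REAL,
        sale_date   TEXT
    )
""")

# The first warehouse activity: add large numbers of records at once.
batch = [(i, 1000 + i % 50, 9.99, "2005-11-04") for i in range(100_000)]
with conn:  # one transaction for the whole batch
    conn.executemany("INSERT INTO sale_fact VALUES (?, ?, ?, ?)", batch)

count, = conn.execute("SELECT COUNT(*) FROM sale_fact").fetchone()
print(count)
```

Records are only ever appended; nothing in the batch updates existing rows.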
Changing existing data warehouse table records can be extremely inefficient, simply because of the sheer size of data warehouses. Referential integrity is best implemented and enforced when updating tables. The second activity of a data warehouse is the reading of data. When data is read, referential integrity does not need to be verified, because no changes are occurring to records in tables. Nevertheless, because referential integrity implies creation of primary and foreign keys, and because the best database model designs make profligate use of primary and foreign key fields in SQL code, leave referential integrity intact for a data warehouse.

So, now we know the origin of data warehouses and why they were devised. What is the data warehouse dimensional database model?

The Dimensional Database Model

A standard, normalized, relational database model is completely inappropriate to the requirements of a data warehouse. Even a denormalized relational database model doesn't make the cut. An entirely different modeling technique, called a dimensional database model, is needed for data warehouses. A dimensional model contains what are called facts and dimensions. A fact table contains historical transactions, such as all invoices issued to all customers for the last five years. That could be a lot of records. Dimensions describe facts.

The easiest way to describe the dimensional model is to demonstrate by example. Figure 7-1 shows a relational table structure for both static book data and dynamic (transactional) book data. The grayed-out tables in Figure 7-1 are static data tables; the others are tables containing data in a constant state of change. Static tables are the equivalent of dimensions, describing facts (the equivalent of transactions). So, in Figure 7-1, the dimensions are grayed out and the facts are not.

Figure 7-1: The OLTP relational database model for books.
[Figure 7-1 shows these tables and their columns: Customer (customer_id, customer, address, phone, email, credit_card_type, credit_card#, credit_card_expiry), Shipper (shipper_id, shipper, address, phone, email), Sale (sale_id, ISBN (FK), shipper_id (FK), customer_id (FK), sale_price, sale_date), Edition (ISBN, publisher_id (FK), publication_id (FK), print_date, pages, list_price, format), Publisher (publisher_id, name), Publication (publication_id, subject_id (FK), author_id (FK), title), Author (author_id, name), Review (review_id, publication_id (FK), review_date, text), Subject (subject_id, parent_id, name), CoAuthor (coauthor_id (FK), publication_id (FK)), Rank (ISBN (FK), rank, ingram_units).]

What Is a Star Schema?

The most effective approach for a data warehouse database model (using dimensions and facts) is called a star schema. Figure 7-2 shows a simple star schema for the REVIEW fact table shown in Figure 7-1.

Figure 7-2: The REVIEW table fact-dimensional structure.

[Figure 7-2 shows the REVIEW fact table (review_id, customer_id (FK), publication_id (FK), author_id (FK), publisher_id (FK), review_date, text) surrounded by the CUSTOMER, AUTHOR, PUBLISHER, and PUBLICATION dimension tables.]

A more simplistic equivalent diagram to that of Figure 7-2 is shown by the star schema structure in Figure 7-3.

Figure 7-3: The REVIEW fact-dimensional structure is a star schema.

A star schema contains a single fact table plus a number of small dimensional tables. If there is more than one fact table, effectively there is more than one star schema. Fact tables contain transactional records, which over a period of time can come to contain very large numbers of records. Dimension tables, on the other hand, remain relatively constant in record numbers.
The objective is to enhance SQL query join performance, where joins are executed between a single fact table and multiple dimensions, all on a single hierarchical level. So, a star schema is a single, very large, very changeable fact table, connected directly to a single layer of multiple, static-sized dimensional tables.

[Figure 7-3 shows the REVIEW fact table at the center, with one-to-many relationships from the AUTHOR, BOOK, PUBLISHER, and CUSTOMER dimensions.]

What Is a Snowflake Schema?

A snowflake schema is shown in Figure 7-4. A snowflake schema is a normalized star schema, such that dimension entities are normalized (dimensions are separated into multiple tables). Normalized dimensions have all duplication removed from each dimension, such that the result is a single fact table connected directly to some of the dimensions. Not all of the dimensions are directly connected to the fact table. In Figure 7-4, the dimensions are grayed out in two shades of gray. The lighter shade of gray represents dimensions connected directly to the fact table (BOOK, AUTHOR, SUBJECT, SHIPPER, and CUSTOMER). The darker-shaded gray dimensional tables are normalized subset dimensional tables, not connected to the fact table directly (PUBLISHER, PUBLICATION, and CATEGORY).

Figure 7-4: The SALE table fact-dimensional structure.
[Figure 7-4 shows these tables and their columns: Customer (customer_id, customer, address, phone, email, credit_card_type, credit_card#, credit_card_expiry), Shipper (shipper_id, shipper, address, phone, email), Book (ISBN, publication_id (FK), publisher_id (FK), edition#, print_date, pages, list_price, format, rank, ingram_units), Sale (sale_id, ISBN (FK), author_id (FK), shipper_id (FK), customer_id (FK), subject_id (FK), sale_price, sale_date), Publisher (publisher_id, publisher), Author (author_id, author), Publication (publication_id, title), Subject (subject_id, category_id (FK), subject), Category (category_id, category).]

A more simplistic equivalent diagram to that of Figure 7-4 is shown by the snowflake schema in Figure 7-5.

Figure 7-5: The SALE fact-dimensional structure is a snowflake schema.

The problem with snowflake schemas isn't too many tables, but too many layers. Data warehouse fact tables can become incredibly large, growing to millions, billions, even trillions of records. The critical factor in creating star and snowflake schemas, instead of using standard "nth" Normal Form layers, is decreasing the number of tables in SQL query joins. The more tables in a join, the more complex the query, and the slower it will execute. When fact tables contain enormous record counts, reports can take hours or days, not minutes. Adding just one more table to a fact-dimensional query join at that level of database size could make the query run for weeks. That's no good!

[Figure 7-5 shows the SALE fact table at the center, with one-to-many relationships from the BOOK, CUSTOMER, SHIPPER, AUTHOR, and SUBJECT dimensions; PUBLICATION and PUBLISHER hang off BOOK, and CATEGORY hangs off SUBJECT.]

The solution is an obvious one. Convert (denormalize) a normalized snowflake schema into a star schema, as shown in Figure 7-6. In Figure 7-6, the PUBLISHER and PUBLICATION tables have been denormalized into the BOOK table, and the CATEGORY table has been denormalized into the SUBJECT table.

Figure 7-6: A denormalized SALE table fact-dimensional structure.
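The denormalization just described, folding PUBLISHER and PUBLICATION into BOOK, can be sketched as a simple table rebuild. This is an illustration only (Python's sqlite3 with invented miniature tables, not the book's data):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Snowflake: PUBLISHER and PUBLICATION hang off the BOOK dimension.
    CREATE TABLE publisher   (publisher_id INTEGER PRIMARY KEY, publisher TEXT);
    CREATE TABLE publication (publication_id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE book (
        ISBN           TEXT PRIMARY KEY,
        publisher_id   INTEGER,
        publication_id INTEGER,
        list_price     REAL
    );
    INSERT INTO publisher   VALUES (1, 'Wiley');
    INSERT INTO publication VALUES (10, '1984');
    INSERT INTO book        VALUES ('0451524934', 1, 10, 9.99);

    -- Star: denormalize the two outer-layer dimensions into BOOK itself,
    -- so the fact table would only ever join one level deep.
    CREATE TABLE book_star AS
    SELECT b.ISBN, p.publisher, t.title, b.list_price
    FROM book b
    JOIN publisher   p ON p.publisher_id   = b.publisher_id
    JOIN publication t ON t.publication_id = b.publication_id;
""")
star_rows = conn.execute("SELECT * FROM book_star").fetchall()
print(star_rows)
```

After the rebuild, publisher and title live directly on the denormalized BOOK row, removing two tables (and one layer) from every future join.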
A more simplistic equivalent diagram to that of Figure 7-6 is shown by the star schema in Figure 7-7.

[Figure 7-6 shows these tables and their columns: Sale (sale_id, ISBN (FK), author_id (FK), shipper_id (FK), customer_id (FK), subject_id (FK), sale_price, sale_date), Customer (customer_id, customer, address, phone, email, credit_card_type, credit_card#, credit_card_expiry), Shipper (shipper_id, shipper, address, phone, email), Author (author_id, author), Book (ISBN, publisher, title, edition#, print_date, pages, list_price, format, rank, ingram_units), Subject (subject_id, category, subject).]

Figure 7-7: The SALE fact-dimensional structure denormalized into a star schema.

What does all this prove? Not much, you might say. On the contrary, two things are achieved by using fact-dimensional structures and star schemas:

❑ Figure 7-1 shows a highly normalized table structure, useful for high-concurrency, precision record-searching databases (an OLTP database). Replacing this structure with a fact-dimensional structure (as shown in Figure 7-2, Figure 7-4, and Figure 7-6) reduces the number of tables. As you already know, reducing the number of tables is critical to SQL query performance. Data warehouses consist of large quantities of data, batch updates, and incredibly complex queries. The fewer the tables, the easier everything becomes, especially because there is so much data. The following code is a skeleton SQL join query for the snowflake schema, joining all nine tables shown in Figure 7-5 (the join conditions follow the foreign keys in Figure 7-4; the WHERE, GROUP BY, and ORDER BY clauses are deliberately left as placeholders):

    SELECT *
    FROM SALE SAL
      JOIN AUTHOR AUT ON (AUT.author_id = SAL.author_id)
      JOIN CUSTOMER CUS ON (CUS.customer_id = SAL.customer_id)
      JOIN SHIPPER SHP ON (SHP.shipper_id = SAL.shipper_id)
      JOIN SUBJECT SUB ON (SUB.subject_id = SAL.subject_id)
      JOIN CATEGORY CAT ON (CAT.category_id = SUB.category_id)
      JOIN BOOK BOO ON (BOO.ISBN = SAL.ISBN)
      JOIN PUBLISHER PBS ON (PBS.publisher_id = BOO.publisher_id)
      JOIN PUBLICATION PBL ON (PBL.publication_id = BOO.publication_id)
    WHERE ...
    GROUP BY ...
    ORDER BY ...;

❑ If the SALE fact table has 1 million (10^6) records, and each of the eight dimension tables contains 10 records, a Cartesian product would return 10^6 multiplied by 10^8 records. That makes for 10^14 records. That is a lot of records for any CPU to process.

[Figure 7-7 shows the SALE fact table at the center, with one-to-many relationships from the BOOK, CUSTOMER, SHIPPER, AUTHOR, and SUBJECT dimensions.]

A Cartesian product is a worst-case scenario.

❑ Now look at the next query, the equivalent skeleton for the star schema (again with placeholder trailing clauses):

    SELECT *
    FROM SALE SAL
      JOIN AUTHOR AUT ON (AUT.author_id = SAL.author_id)
      JOIN CUSTOMER CUS ON (CUS.customer_id = SAL.customer_id)
      JOIN SHIPPER SHP ON (SHP.shipper_id = SAL.shipper_id)
      JOIN SUBJECT SUB ON (SUB.subject_id = SAL.subject_id)
      JOIN BOOK BOO ON (BOO.ISBN = SAL.ISBN)
    WHERE ...
    GROUP BY ...
    ORDER BY ...;

❑ Using the star schema from Figure 7-7, assuming the same record counts, a join occurs between one fact table and five dimensional tables. That is a Cartesian product of 10^6 multiplied by 10^5, resulting in 10^11 records returned. The difference between 10^11 and 10^14 is three orders of magnitude. Three orders of magnitude is not just three zeroes and thus 1,000 records. The difference is actually 100,000,000,000,000 – 100,000,000,000 = 99,900,000,000,000 records, which is effectively just a little less than 10^14 itself. From the perspective of counting all those zeros, the difference between five dimensions and eight is enormous. Fewer dimensions make for faster queries. That's why it is so essential to denormalize snowflake schemas into star schemas.

❑ Take another quick glance at the snowflake schema in Figure 7-4 and Figure 7-5. Then examine the equivalent denormalized star schema in Figure 7-6 and Figure 7-7. Now put yourself into the shoes of a hustled, harried, and very busy executive trying to get a quick report. Think as an end-user, one interested only in results. Which diagram is easier to decipher as to content and meaning? The diagram in Figure 7-7 is far less complex than the diagram in Figure 7-5. After all, being an end-user, you are probably not too interested in understanding the complexities of how to build SQL join queries. You have bigger fish to fry. The point is this: the less complex the table structure, the easier it will be to use, because a star schema is more representative of the real world than a snowflake schema. Look at it this way.
A snowflake schema is more deeply normalized than a star schema, and, therefore, by definition, more mathematical. Something more mathematical is generally of more use to a mathematician than it is to an executive manager. The executive is trying to get a quick overall impression of whether his company will sell more cans of lima beans, or more cans of string beans, over the course of the next ten years. If you are a computer programmer, you will quite probably not agree with this analogy.

That covers the very basics of data warehouse database modeling. How can a data warehouse database model be constructed?

How to Build a Data Warehouse Database Model

Now you know how to build star schemas for data warehouse database models. As you can see, a star schema is quite different from a standard relational database model (Figure 7-1). The next step is to examine the process, or the steps, by which a data warehouse database model can be built.

[...] very time-consuming.

Types of Dimension Tables

The data warehouse database models for the REVIEW and SALE tables, shown previously in this chapter, are actually inadequate. Many data warehouse databases have standard requirements based on how end-users need to analyze data. Typical additions to data warehouse databases are dimensions such as dates, locations, and products. These extra dimensions [...] Dimensions typically depend on the data content of a data warehouse, and also on how the data warehouse is used for analysis.

Try It Out: Creating a Data Warehouse Database Model

Figure 7-12 shows the tables of a database containing details of bands, their released CDs, and the tracks on each of those CDs, followed by three tables containing royalty amounts earned by [...]
❑ [...] of data warehouses, especially why they were invented and how they are used

❑ The inadequacy of the transactional (OLTP) database model for producing forecasts and predictions from huge amounts of out-of-date data, because an OLTP database model has too much granularity

❑ Specialized database modeling techniques required by data warehouses, in the form of facts and dimensions

❑ Star schemas as single [...] dimensional layer

❑ How fact-dimensional or dimensional database modeling is either a star schema or a snowflake schema

❑ Data warehouses containing one or more fact tables, linked to the same dimensions, or a subset of the dimensions

❑ The various steps and processes used in the creation of a data warehouse database model

❑ The contents of a data warehouse database and model as both static and historical [...] continually added to, and able to grow enormously over time

This chapter has described the specialized database modeling techniques required for the building of data warehouses. The next chapter examines one of the most important factors of database modeling: performance. If a database underperforms, a company can lose customers and, in extreme situations, lose all its customers. Guess what [...]
[...] dimensions, such as with whom and what (customers, authors, shippers, books).

Fact tables can contain detailed transactional histories, summaries (aggregations), or both. Details and aggregations can be separated into different fact tables. Many relational databases allow use of materialized views to contain fact table aggregations, containing less detailed [...]

[...] like that shown in Figure 7-10. Some silly fields have been added just to give this conversation a little color. The equivalent star schema changes are shown in Figure 7-11.

[Figure 7-11 shows a TIME dimension (time_id, month, quarter, year) and a PUBLISHER dimension (publisher_id, publisher) added to the REVIEW fact table (review_id, time_id (FK), customer_id (FK), publication_id (FK), ...), with the note "Dates are replaced with TIME_ID foreign key to the TIME dimension".]

[...] processes of the day-to-day functions of a business. Transactional data is function [...] Granularity can be ignored in this case. What are the dimensions? Dimensions are the static tables in a data warehouse database model. Create the dimension tables, followed by the fact tables.

[Figure 7-12 shows, in part: Band (band_id, name, address, phone, email), CD (cd_id, band_id (FK), title, length, tracks), Radio (radio_id, track_id (FK), station, city, date, royalty_amount).]
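The dimensions-then-facts build order for the exercise can be sketched as follows. This is illustrative only: it uses Python's sqlite3 and only the BAND, CD, and RADIO names visible in Figure 7-12; any column detail or sample data beyond what the figure shows is an assumption:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Step 1: create the dimension tables first (the static data)...
    CREATE TABLE band (band_id INTEGER PRIMARY KEY,
                       name TEXT, address TEXT, phone TEXT, email TEXT);
    CREATE TABLE cd   (cd_id INTEGER PRIMARY KEY, band_id INTEGER,
                       title TEXT, length INTEGER, tracks INTEGER);

    -- Step 2: ...then a fact table. RADIO is one of the royalty tables
    -- visible in Figure 7-12 (transactional royalty history).
    CREATE TABLE radio (radio_id INTEGER PRIMARY KEY, track_id INTEGER,
                        station TEXT, city TEXT, date TEXT,
                        royalty_amount REAL);

    -- Sample rows (invented for illustration).
    INSERT INTO band  VALUES (1, 'The Examples', '1 Main St',
                              '555-0000', 'band@example.com');
    INSERT INTO cd    VALUES (1, 1, 'First CD', 3600, 12);
    INSERT INTO radio VALUES (1, 7, 'KXYZ', 'Austin', '2005-11-04', 0.25),
                             (2, 7, 'KXYZ', 'Austin', '2005-11-05', 0.25);
""")

# A typical warehouse read: aggregate the royalty facts.
total, = conn.execute("SELECT SUM(royalty_amount) FROM radio").fetchone()
print(total)
```

The static band and CD data plays the dimension role; the ever-growing royalty rows play the fact role.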