Building the Data Warehouse, Third Edition, part 3


Data Warehouse: The Standards Manual

The data warehouse is relevant to many people: managers, DSS analysts, developers, planners, and so forth. In most organizations, the data warehouse is new. Accordingly, there should be an official organizational explanation and description of what is in the data warehouse and how the data warehouse can be used.

Calling the explanation of what is inside the data warehouse a “standards manual” is probably deadly. Standards manuals have a dreary connotation and are famous for being ignored and gathering dust. Yet, some form of internal publication is a necessary and worthwhile endeavor.

The kinds of things the publication (whatever it is called!) should contain are the following:

■■ A description of what a data warehouse is
■■ A description of source systems feeding the warehouse
■■ How to use the data warehouse
■■ How to get help if there is a problem
■■ Who is responsible for what
■■ The migration plan for the warehouse
■■ How warehouse data relates to operational data
■■ How to use warehouse data for DSS
■■ When not to add data to the warehouse
■■ What kind of data is not in the warehouse
■■ A guide to the meta data that is available
■■ What the system of record is

Auditing and the Data Warehouse

An interesting issue that arises with data warehouses is whether auditing can be or should be done from them. Auditing can be done from the data warehouse. In the past there have been a few examples of detailed audits being performed there. But there are many reasons why auditing, even if it can be done from the data warehouse, should not be done from there. The primary reasons for not doing so are the following:

■■ Data that otherwise would not find its way into the warehouse suddenly has to be there.
■■ The timing of data entry into the warehouse changes dramatically when auditing capability is required.
■■ The backup and recovery restrictions for the data warehouse change drastically when auditing capability is required.
■■ Auditing data at the warehouse forces the granularity of data in the warehouse to be at the very lowest level.

In short, it is possible to audit from the data warehouse environment, but due to the complications involved, it makes much more sense to audit elsewhere.

Cost Justification

Cost justification for the data warehouse is normally not done on an a priori, return-on-investment (ROI) basis. To do such an analysis, the benefits must be known prior to building the data warehouse.

In most cases, the real benefits of the data warehouse are not known or even anticipated before construction begins because the warehouse is used differently than other data and systems built by information systems. Unlike most information processing, the data warehouse exists in a realm of “Give me what I say I want, then I can tell you what I really want.” The DSS analyst really cannot determine the possibilities and potentials of the data warehouse, nor how and why it will be used, until the first iteration of the data warehouse is available. The analyst operates in a mode of discovery, which cannot commence until the data warehouse is running in its first iteration. Only then can the DSS analyst start to unlock the potential of DSS processing.

For this reason, classical ROI techniques simply do not apply to the data warehouse environment. Fortunately, data warehouses are built incrementally. The first iteration can be done quickly and for a relatively small amount of money. Once the first portion of the data warehouse is built and populated, the analyst can start to explore the possibilities. It is at this point that the analyst can start to justify the development costs of the warehouse.

As a rule of thumb, the first iteration of the data warehouse should be small enough to be built and large enough to be meaningful.
Therefore, the data warehouse is best built a small iteration at a time. There should be a direct feedback loop between the warehouse developer and the DSS analyst, in which they are constantly modifying the existing warehouse data and adding other data to the warehouse. And the first iteration should be done quickly. It is said that the initial data warehouse design is a success if it is 50 percent accurate.

Typically, the initial data warehouse focuses on one of these functional areas:

■■ Finance
■■ Marketing
■■ Sales

Occasionally, the data warehouse’s first functional area will focus on one of these areas:

■■ Engineering/manufacturing
■■ Actuarial interests

Justifying Your Data Warehouse

There is no getting around the fact that data warehouses cost money. Data, processors, communications, software, tools, and so forth all cost money. In fact, the volumes of data that aggregate and collect in the data warehouse go well beyond anything the corporation has ever seen. The level of detail and the history of that detail all add up to a large amount of money.

In almost every other aspect of information technology, the major investment for a system lies in creating, installing, and establishing the system. The ongoing maintenance costs for a system are minuscule compared to the initial costs. However, establishing the initial infrastructure of the data warehouse is not the most significant cost; the ongoing maintenance costs far outweigh the initial infrastructure costs. There are several good reasons why the costs of a data warehouse are significantly different from the cost of a standard system:

■■ The truly enormous volume of data that enters the data warehouse.
■■ The cost of maintaining the interface between the data warehouse and the operational sources.
If the organization has chosen an extract/transform/load (ETL) tool, then these costs are mitigated over time; if an organization has chosen to build the interface manually, then the costs of maintenance skyrocket.
■■ The fact that a data warehouse is never done. Even after the initial few iterations of the data warehouse are successfully completed, adding more subject areas to the data warehouse is an ongoing need.

Cost of Running Reports

How does an organization justify the costs of a data warehouse before the data warehouse is built? There are many approaches. We will discuss one in depth here, but be advised that there are many other ways to justify a data warehouse.

We chose this approach because it is simple and because it applies to every organization. When the justification is presented properly, it is very difficult to deny the powerful cost justifications for a data warehouse. It is an argument that technicians and non-technicians alike can appreciate and understand.

Data warehousing lowers the cost of information by approximately two orders of magnitude. This means that with a data warehouse an organization can access a piece of information for $100; an organization that does not have a data warehouse can access the same unit of information for $10,000.

How do you show that data warehousing greatly lowers the cost of information? First, use a report. This doesn’t necessarily need to be an actual report. It can be a screen, a report, a spreadsheet, or some form of analytics that demonstrates the need for information in the corporation. Second, you should look at your legacy environment, which includes single or multiple applications, old and new applications. The applications may be Enterprise Resource Planning (ERP) applications, non-ERP applications, online applications, or offline applications.

Now consider two companies, company A and company B.
The companies are identical in respect to their legacy applications and their need for information. The only difference between the two is that company B has a data warehouse from which to do reporting and company A does not.

Company A looks to its legacy applications to gather information. This task includes the following:

■■ Finding the data needed for the report
■■ Accessing the data
■■ Integrating the data
■■ Merging the data
■■ Building the report

Finding the data can be no small task. In many cases, the legacy systems are not documented. There is a time-honored saying: Real programmers don’t do documentation. This will come back to haunt organizations, as there simply is no easy way to go back and find out what data is in the old legacy systems and what processing has occurred there.

Accessing the data is even more difficult. Some of the legacy data is in Information Management System (IMS), some in Model 204, some in Adabas. And there is no IMS, Model 204, and Adabas technology expertise around anymore. The technology that houses the legacy environment is a mystery. And even if the legacy environment can be accessed, the computer operations department stands in the way because it does not want anything in the way of the online window of processing.

If the data can be found and accessed, it then needs to be integrated. Reports typically need information from multiple sources. The problem is that those sources were never designed to be run together. A customer in one system is not a customer in another system, a transaction in one system is different from a transaction in another system, and so forth. A tremendous amount of conversion, reformatting, interpretation, and the like must go on in order to integrate data from multiple systems.

Merging the data is easy in some cases.
But in the case of large amounts of data or in the case of data coming from multiple sources, the merger of data can be quite an operation.

Finally, the report is built.

How long does this process take for company A? How much does it cost? Depending on the information that is needed and depending on the size and state of the legacy systems environment, getting the information may take considerable time and money. The typical cost ranges from $25,000 to $1 million. The typical length of time to access data is anywhere from 1 to 12 months.

Now suppose that company B has built a data warehouse. The typical cost here ranges from $500 to $10,000. The typical length of time to access data is one hour to a half day. We see that company B’s costs and time investment for retrieving information are much lower. The cost differential between company A and company B forms the basis of the cost justification for a data warehouse. Data warehousing greatly lowers the cost of information and accelerates the time required to get the information.

Cost of Building the Data Warehouse

The astute observer will ask, what about the cost of building the data warehouse? Figure 2.26 shows that in order to generate a single report for company B, it is still necessary to find, access, integrate, and merge the data. These are the same initial steps taken to build a single report for company A, so for a single report there are no real savings in building a data warehouse. Actually, building a data warehouse to run one report is a costly waste of time and money.

But no corporation in the world operates from a single report. Different divisions of even the simplest, smallest corporation look at data differently. Accounting looks at data one way; marketing looks at data another way; sales looks at data yet another way; and management looks at data in even another way. In this scenario, the cost of building the data warehouse is worthwhile.
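
The comparison between company A and company B can be made concrete with a small break-even calculation. The sketch below is only an illustration: the per-report costs are the low ends of the ranges quoted above ($25,000 without a warehouse, $500 with one), while the one-time build cost of $2,000,000 and all function names are assumptions chosen for the example, not figures or terms taken from the text.

```python
from math import ceil


def cost_without_warehouse(reports: int, cost_per_report: float) -> float:
    """Company A: every report repeats find/access/integrate/merge/build."""
    return reports * cost_per_report


def cost_with_warehouse(reports: int, build_cost: float, cost_per_report: float) -> float:
    """Company B: a one-time build cost plus a small cost per report."""
    return build_cost + reports * cost_per_report


def break_even_reports(build_cost: float, cost_a: float, cost_b: float) -> int:
    """Smallest number of reports at which company B's total cost
    is no higher than company A's."""
    if cost_a <= cost_b:
        raise ValueError("a warehouse never pays off if its reports are not cheaper")
    return ceil(build_cost / (cost_a - cost_b))


COST_A = 25_000         # low end of the range quoted for company A
COST_B = 500            # low end of the range quoted for company B
BUILD_COST = 2_000_000  # hypothetical one-time cost, assumed for illustration

n = break_even_reports(BUILD_COST, COST_A, COST_B)
print(f"break-even after {n} reports")
print("company A:", cost_without_warehouse(n, COST_A))
print("company B:", cost_with_warehouse(n, BUILD_COST, COST_B))
```

Under these assumed numbers a single report never justifies the warehouse, but the one-time cost is recovered once enough reports are drawn from it. Higher per-report costs for company A shorten the break-even point.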
It is a one-time cost that liberates the information found in the data warehouse. Whereas each report company A needs is both costly and time-consuming, company B uses the one-time cost of building the data warehouse to generate multiple reports (see Figure 2.27).

But that expense is a one-time expense, for the most part. (At least the initial establishment of the data warehouse is a one-time expense.) Figure 2.27 shows that indeed data warehousing greatly lowers the cost of information and greatly accelerates the rate at which information can be retrieved.

Would company A actually even pay to generate individual reports? Probably not. Perhaps it would pay the price for information the first few times. When it realizes that it cannot afford to pay the price for every report, it simply stops creating reports. The end user has the attitude, “I know the information is in my corporation, but I just can’t get to it.” The high cost of getting information and the length of time required leave end users frustrated and unhappy with their IT organization for not being able to deliver information.

Data Homogeneity/Heterogeneity

At first glance, it may appear that the data found in the data warehouse is homogeneous in the sense that all of the types of records are the same. In truth, data in the data warehouse is very heterogeneous. The data found in the data warehouse is divided into major subdivisions called subject areas. Figure 2.28 shows that a data warehouse has subject areas of product, customer, vendor, and transaction.

The first division of data inside a data warehouse is along the lines of the major subjects of the corporation. But with each subject area there are further subdivisions. Data within a subject area is divided into tables. Figure 2.29 shows this division of data into tables for the subject area product.
[Figure 2.26: Where the costs and the activities are when a data warehouse is built. Elements shown: legacy applications, data warehouse, report. Steps shown: 1 - find the data; 2 - access the data; 3 - integrate the data; 4 - merge the data; 5 - build the report.]

[Figure 2.27: Multiple reports make the cost of the data warehouse worthwhile. Panels: company A, company B. Dollar amounts shown: $1,000,000; $500,000; $2,000,000; $2,500,000; $1,000,000; $250; $10,000; $1,000; $2,000; $3,000; $2,000,000.]

[Figure 2.28: The data in the different parts of the data warehouse are grouped by subject area. Subject areas shown: product, customer, transaction, vendor.]

Figure 2.29 shows that there are five tables that make up the subject area inside the data warehouse. Each of the tables has its own data, and there is a common thread for each of the tables in the subject area. That common thread is the key/foreign key data element: product.

Within the physical tables that make up a subject area there are further subdivisions. These subdivisions are created by different occurrences of data values. For example, inside the product shipping table, there are January shipments, February shipments, March shipments, and so forth.

The data in the data warehouse then is subdivided by the following criteria:

■■ Subject area
■■ Table
■■ Occurrences of data within table

This organization of data within a data warehouse makes the data easily accessible and understandable for all the different components of the architecture that must build on the data found there. The result is that the data warehouse, with its granular data, serves as a basis for many different components, as seen in Figure 2.30.

The simple yet elegant organization of data within the data warehouse environment seen in Figure 2.30 makes data accessible in many different ways for many different purposes.
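
The three levels of subdivision (subject area, table, occurrences of data within a table) can be pictured with a small in-memory sketch. Everything in it is hypothetical: the table names, column names, and sample rows are assumptions made for illustration, and a real warehouse would hold this in database tables rather than Python dictionaries. The only point carried over from the text is that every table in the product subject area shares the product key, and that rows within a table fall into groups such as January and February shipments.

```python
from collections import defaultdict
from datetime import date

# Subject area -> table name -> rows. Names and values are hypothetical.
warehouse = {
    "product": {
        "product_shipping": [
            {"product": "P-100", "ship_date": date(2001, 1, 15), "ship_amount": 40},
            {"product": "P-100", "ship_date": date(2001, 2, 3), "ship_amount": 25},
            {"product": "P-200", "ship_date": date(2001, 1, 28), "ship_amount": 10},
        ],
        "product_bom": [
            {"product": "P-100", "bom_number": 7, "bom_description": "housing"},
        ],
    },
}


def rows_for_product(subject_area: dict, product: str) -> dict:
    """Follow the common key (product) across every table in a subject area."""
    return {
        table: [row for row in rows if row["product"] == product]
        for table, rows in subject_area.items()
    }


def occurrences_by_month(rows: list, date_field: str) -> dict:
    """Group rows of one table by (year, month) of the given date field."""
    groups = defaultdict(list)
    for row in rows:
        d = row[date_field]
        groups[(d.year, d.month)].append(row)
    return dict(groups)


by_product = rows_for_product(warehouse["product"], "P-100")
by_month = occurrences_by_month(
    warehouse["product"]["product_shipping"], "ship_date"
)
print(sorted(by_month))  # the (year, month) groups present in the table
```

Keeping the product key in every table is what lets a downstream component pull together all data about one product without knowing in advance how the subject area is split into tables.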
[Figure 2.29: Within the product subject area there are different types of tables, but each table has a common product identifier as part of the key. Labels shown: product; product, date, location; order, product, date; vendor, product, description; product, ship date, ship amount; product, bom number, bom description.]

[Figure 2.30: The data warehouse sits at the center of a large framework.]

Purging Warehouse Data

Data does not just eternally pour into a data warehouse. It has its own life cycle within the warehouse as well. At some point in time, data is purged from the warehouse. The issue of purging data is one of the fundamental design issues that must not escape the data warehouse designer.

In some senses, data is not purged from the warehouse at all. It is simply rolled up to higher levels of summary. There are several ways in which data is purged or the detail of data is transformed, including the following:

■■ Data is added to a rolling summary file where detail is lost.
■■ Data is transferred to a bulk storage medium from a high-performance medium such as DASD.
■■ Data is actually purged from the system.
■■ Data is transferred from one level of the architecture to another, such as from the operational level to the data warehouse level.

There are, then, a variety of ways in which data is purged or otherwise transformed inside the data warehouse environment. The life cycle of data, including its purge or final archival dissemination, should be an active part of the design process for the data warehouse.

Reporting and the Architected Environment

It is a temptation to say that once the data warehouse has been constructed all reporting and informational processing will be done from there. That is simply not the case. There is a legitimate class of report processing that rightfully belongs in the domain of operational systems. Figure 2.31 shows where the different styles of processing should be located.
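
Returning to the first purging option listed under Purging Warehouse Data, a rolling summary where detail is lost, the idea can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a description of any particular product: the function name roll_up_and_purge, the row layout, and the choice of a monthly total per product as the summary level are all hypothetical.

```python
from datetime import date


def roll_up_and_purge(detail: list, summary: dict, cutoff: date) -> list:
    """Fold detail rows older than `cutoff` into monthly totals per product
    and return only the detail rows that are kept.

    `summary` is updated in place; rolled-up detail is not retained.
    """
    kept = []
    for row in detail:
        if row["ship_date"] < cutoff:
            key = (row["product"], row["ship_date"].year, row["ship_date"].month)
            summary[key] = summary.get(key, 0) + row["ship_amount"]
        else:
            kept.append(row)
    return kept


# Hypothetical detail rows for illustration.
detail = [
    {"product": "P-100", "ship_date": date(2001, 1, 15), "ship_amount": 40},
    {"product": "P-100", "ship_date": date(2001, 1, 20), "ship_amount": 5},
    {"product": "P-200", "ship_date": date(2001, 1, 28), "ship_amount": 10},
    {"product": "P-100", "ship_date": date(2001, 2, 3), "ship_amount": 25},
]
summary = {}

kept = roll_up_and_purge(detail, summary, cutoff=date(2001, 2, 1))
print(len(kept), "detail rows kept")
print(summary)
```

After the call, the January shipments exist only as monthly totals, which is the sense in which the data is rolled up to a higher level of summary instead of simply disappearing.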
[Figure 2.31: The differences between the two types of reporting. Operational reporting: the line item is of the essence; the summary is of little or no importance once used; of interest to the clerical community. Data warehouse reporting: the line item is of little or no use once used; the summary or other calculation is of primary importance; of interest to the managerial community.]

[...]

... of the data warehouse is constructed, then another part of the warehouse is constructed. It is never appropriate to develop the data warehouse under the “big bang” approach. One reason is that the end user of the warehouse operates in a discovery mode, so only after the warehouse’s first iteration is built can the developer tell what is really needed in the warehouse. The granularity of the data ...

... incorrect data in the data warehouse. The first assumption is that incorrect data arrives in the data warehouse on an exception basis. If data is being incorrectly entered in the data warehouse on a wholesale basis, then it is incumbent on the architect to find the offending ETL program and make adjustments. Occasionally, even with the best of ETL processing, a few pieces of incorrect data enter the data warehouse ...

... changes are made to the corporate data model as it is applied to the data warehouse. First, data that is used purely in the ...

[Figure residue, caption not recovered. Labels shown: corporate model, data model, operational data model, data warehouse data model, oper, data warehouse. Notes shown: operational data model equals corporate data model; remove pure operational data; performance factors are added prior to database design ...]

... archival data found in the operational environment the “operational window” of data is not nearly as long. It can be anywhere from 1 week to 2 years. The time horizon of archival data in the operational environment is not the only difference between archival data in the data warehouse and in the operational environment. Unlike the data warehouse, the operational ...

... be done from a data warehouse, but auditing should not be done from a data warehouse. Instead, auditing is best done in the detailed operational transaction-oriented environment. When auditing is done in the data warehouse, data that would not otherwise be included is found there, the timing of the update into the data warehouse becomes an issue, and the level of ...

... assumptions do not hold for the data warehouse. Many development tools, such as CASE tools, have the same orientation and as such are not applicable to the data warehouse environment.

[Figure residue: the corporate model, the operational model, and the data warehouse model. Labels shown: data model, applies directly, oper, applies indirectly, data warehouse, dept, ind, data warehouse, applies directly ...]

... created in the data warehouse for account ABC. Then on August 15 an error is discovered. Instead of an entry for $5,000, the entry should have been for $750. How can the data in the data warehouse be corrected?

■■ Choice 1: Go back into the data warehouse for July 2 and find the offending entry. Then, using update capabilities, replace the value $5,000 with the ...

[Figure 3.7: How the different types of models apply to the architected environment.]

The Data Warehouse and Data Models

As shown in Figure 3.8, the data model is applicable to both the existing systems environment and the data warehouse environment. Here, an overall corporate data model has been constructed with no regard for a distinction between existing operational systems and the data warehouse. The corporate ...

... could be further from the truth. Merely pulling data out of the legacy environment and placing it in the data warehouse achieves very little of the potential of data warehousing. Figure 3.1 shows a simplification of how data is transferred from the existing legacy systems environment to the data warehouse. We see here that multiple applications contribute to the data warehouse. Figure 3.1 is overly simplistic ...

... data warehouse environment goes into the data warehouse cannot be updated. Instead, an element of time must be attached to it. A major shift in the modes of processing surrounding the data is necessary as it passes into the data warehouse from the operational environment. Yet another major consideration when passing data is the need to manage the volume of data that resides in and passes into the warehouse.

... and/or other data is added. Then another portion of the data warehouse is built, and so forth. This feedback loop continues throughout the entire life of the data warehouse. Therefore, data warehouses ...

Date posted: 08/08/2014, 22:20
