Data Warehousing Fundamentals: A Comprehensive Guide for IT Professionals, part 2


conversions of data into your internal formats and data types. You have to organize the data transmissions from the external sources. Some sources may provide information at regular, stipulated intervals. Others may give you the data on request. You need to accommodate the variations.

Data Staging Component

After you have extracted data from various operational systems and from external sources, you have to prepare the data for storing in the data warehouse. The extracted data coming from several disparate sources needs to be changed, converted, and made ready in a format that is suitable to be stored for querying and analysis.

Three major functions need to be performed to get the data ready. You have to extract the data, transform the data, and then load the data into the data warehouse storage. These three major functions of extraction, transformation, and preparation for loading take place in a staging area. The data staging component consists of a workbench for these functions. Data staging provides a place and an area with a set of functions to clean, change, combine, convert, deduplicate, and prepare source data for storage and use in the data warehouse.

Why do you need a separate place or component to perform the data preparation? Can you not move the data from the various sources into the data warehouse storage itself and then prepare the data? When we implement an operational system, we are likely to pick up data from different sources, move the data into the new operational system database, and run data conversions. Why can't this method work for a data warehouse? The essential difference here is this: in a data warehouse you pull in data from many source operational systems. Remember that data in a data warehouse is subject-oriented and cuts across operational applications. A separate staging area, therefore, is a necessity for preparing data for the data warehouse.
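
The three staging functions described above can be pictured as a small pipeline. The following Python sketch is only an illustration under stated assumptions: the two source systems, their field names, and the list standing in for warehouse storage are all hypothetical and do not come from the text.

```python
# Hypothetical source extracts (assumption); a real staging area would read
# from files, databases, or external feeds.
ORDERS_SYSTEM = [{"cust": "C1", "amount": "120.50"}, {"cust": "C2", "amount": "80.00"}]
BILLING_SYSTEM = [{"customer_id": "C1", "amt": 35.0}]


def extract():
    """Pull raw rows from each source into the staging area, tagged by origin."""
    staging = []
    for row in ORDERS_SYSTEM:
        staging.append({"source": "orders", **row})
    for row in BILLING_SYSTEM:
        staging.append({"source": "billing", **row})
    return staging


def transform(staging):
    """Convert every staged row to one common layout and data type."""
    unified = []
    for row in staging:
        if row["source"] == "orders":
            unified.append({"customer_id": row["cust"], "amount": float(row["amount"])})
        else:
            unified.append({"customer_id": row["customer_id"], "amount": float(row["amt"])})
    return unified


def load(rows, warehouse):
    """Append the prepared rows to the warehouse storage (here, a plain list)."""
    warehouse.extend(rows)
    return warehouse


# Data reaches the warehouse only after passing through the staging steps.
warehouse = load(transform(extract()), [])
```

The point of the sketch is the ordering: source-specific layouts are reconciled in the staging step, so the warehouse storage only ever receives rows in one agreed format.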
Now that we have clarified the need for a separate data staging component, let us understand what happens in data staging. We will now briefly discuss the three major functions that take place in the staging area.

Data Extraction. This function has to deal with numerous data sources. You have to employ the appropriate technique for each data source. Source data may come from different source machines in diverse data formats. Part of the source data may be in relational database systems. Some data may be in legacy systems built on network and hierarchical data models. Many data sources may still be in flat files. You may want to include data from spreadsheets and local departmental data sets. Data extraction may become quite complex.

Tools are available on the market for data extraction. You may want to consider using outside tools suitable for certain data sources. For the other data sources, you may want to develop in-house programs to do the data extraction. Purchasing outside tools may entail high initial costs. In-house programs, on the other hand, may mean ongoing costs for development and maintenance.

After you extract the data, where do you keep the data for further preparation? You may perform the extraction function in the legacy platform itself if that approach suits your framework. More frequently, data warehouse implementation teams extract the source into a separate physical environment from which moving the data into the data warehouse would be easier. In the separate environment, you may extract the source data into a group of flat files, or a data-staging relational database, or a combination of both.

Data Transformation. In every system implementation, data conversion is an important function. For example, when you implement an operational system such as a magazine subscription application, you have to initially populate your database with data from the prior system records.
You may be converting over from a manual system. Or, you may be moving from a file-oriented system to a modern system supported with relational database tables. In either case, you will convert the data from the prior systems. So, what is so different for a data warehouse? How is data transformation for a data warehouse more involved than for an operational system?

Again, as you know, data for a data warehouse comes from many disparate sources. If data extraction for a data warehouse poses great challenges, data transformation presents even greater challenges. Another factor in the data warehouse is that the data feed is not just an initial load. You will have to continue to pick up the ongoing changes from the source systems. Any transformation tasks you set up for the initial load will be adapted for the ongoing revisions as well.

You perform a number of individual tasks as part of data transformation. First, you clean the data extracted from each source. Cleaning may just be correction of misspellings, or may include resolution of conflicts between state codes and zip codes in the source data, or may deal with providing default values for missing data elements, or elimination of duplicates when you bring in the same data from multiple source systems.

Standardization of data elements forms a large part of data transformation. You standardize the data types and field lengths for the same data elements retrieved from the various sources. Semantic standardization is another major task. You resolve synonyms and homonyms. When two or more terms from different source systems mean the same thing, you resolve the synonyms. When a single term means many different things in different source systems, you resolve the homonym.

Data transformation involves many forms of combining pieces of data from the different sources. You combine data from a single source record or related data elements from many source records.
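
The cleaning and standardization tasks described above (resolving synonyms, supplying defaults for missing values, eliminating duplicates) can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the synonym table, the default value, and the sample records are assumptions invented for the example.

```python
# Hypothetical lookup tables (assumption); real rules come from the business.
SYNONYMS = {"client": "customer", "cust": "customer", "customer": "customer"}
DEFAULTS = {"state": "UNKNOWN"}


def standardize(record):
    """Return a cleaned copy: one field name per concept, trimmed text, defaults filled."""
    cleaned = {}
    for field, value in record.items():
        # Resolve synonyms: different source field names that mean the same thing.
        name = SYNONYMS.get(field.lower(), field.lower())
        if isinstance(value, str):
            value = value.strip()
        cleaned[name] = value
    # Provide default values for missing or empty data elements.
    for field, default in DEFAULTS.items():
        if not cleaned.get(field):
            cleaned[field] = default
    return cleaned


def deduplicate(records, key):
    """Keep the first record seen for each key value."""
    seen = set()
    unique = []
    for rec in records:
        if rec[key] not in seen:
            seen.add(rec[key])
            unique.append(rec)
    return unique


# The same customer arrives from two source systems under different field names.
source_a = [{"Client": " Acme Corp ", "state": "NJ"}]
source_b = [{"cust": "Acme Corp", "state": ""}, {"cust": "Bolt Ltd", "state": None}]

cleaned = [standardize(r) for r in source_a + source_b]
customers = deduplicate(cleaned, key="customer")
```

Keeping the first record per key is only one possible duplicate-resolution policy; a real staging area might instead merge the records or prefer the most trusted source.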
On the other hand, data transformation also involves purging source data that is not useful and separating out source records into new combinations. Sorting and merging of data takes place on a large scale in the data staging area.

In many cases, the keys chosen for the operational systems are field values with built-in meanings. For example, the product key value may be a combination of characters indicating the product category, the code of the warehouse where the product is stored, and some code to show the production batch. Primary keys in the data warehouse cannot have built-in meanings. We will discuss this further in Chapter 10. Data transformation also includes the assignment of surrogate keys derived from the source system primary keys.

A grocery chain point-of-sale operational system keeps the unit sales and revenue amounts by individual transactions at the check-out counter at each store. But in the data warehouse, it may not be necessary to keep the data at this detailed level. You may want to summarize the totals by product at each store for a given day and keep the summary totals of the sale units and revenue in the data warehouse storage. In such cases, the data transformation function would include appropriate summarization.

When the data transformation function ends, you have a collection of integrated data that is cleaned, standardized, and summarized. You now have data ready to load into each data set in your data warehouse.

Data Loading. Two distinct groups of tasks form the data loading function. When you complete the design and construction of the data warehouse and go live for the first time, you do the initial loading of the data into the data warehouse storage. The initial load moves large volumes of data and takes a substantial amount of time.
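
The surrogate-key assignment and the daily summarization described above under data transformation can be illustrated together. The sketch below rests on assumptions: the product codes, store identifiers, and amounts are made up, and handing out sequential integers through a lookup table is just one common way of assigning surrogate keys, not necessarily the approach the book develops in Chapter 10.

```python
from collections import defaultdict
from itertools import count

# Hypothetical point-of-sale transactions (assumption). The product codes carry
# built-in meaning (category, warehouse, batch), as operational keys often do.
transactions = [
    {"product": "GRC-W01-B17", "store": "S1", "date": "2001-03-01", "units": 2, "revenue": 5.00},
    {"product": "GRC-W01-B17", "store": "S1", "date": "2001-03-01", "units": 1, "revenue": 2.50},
    {"product": "DRY-W02-B03", "store": "S1", "date": "2001-03-01", "units": 4, "revenue": 12.00},
]

_next_key = count(1)
surrogate_keys = {}  # source system key -> meaningless integer key


def surrogate_key(source_key):
    """Assign a new integer the first time a source key is seen, then reuse it."""
    if source_key not in surrogate_keys:
        surrogate_keys[source_key] = next(_next_key)
    return surrogate_keys[source_key]


def summarize(rows):
    """Total units and revenue by (product surrogate key, store, day)."""
    totals = defaultdict(lambda: {"units": 0, "revenue": 0.0})
    for row in rows:
        group = (surrogate_key(row["product"]), row["store"], row["date"])
        totals[group]["units"] += row["units"]
        totals[group]["revenue"] += row["revenue"]
    return dict(totals)


daily_summary = summarize(transactions)
```

The lookup table keeps the link back to the source keys, so the warehouse key itself never has to encode category, warehouse, or batch.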
As the data warehouse starts functioning, you continue to extract the changes to the source data, transform the data revisions, and feed the incremental data revisions on an ongoing basis. Figure 2-7 illustrates the common types of data movements from the staging area to the data warehouse storage.

[Figure 2-7 Data movements to the data warehouse. From the data sources to the data warehouse: base data load, followed by daily, monthly, quarterly, and yearly refreshes. Notes: this function is time-consuming; the initial load moves very large volumes of data; the business conditions determine the refresh cycles.]

Data Storage Component

The data storage for the data warehouse is a separate repository. The operational systems of your enterprise support the day-to-day operations. These are online transaction processing applications. The data repositories for the operational systems typically contain only the current data. Also, these data repositories contain the data structured in highly normalized formats for fast and efficient processing. In contrast, in the data repository for a data warehouse, you need to keep large volumes of historical data for analysis. Further, you have to keep the data in the data warehouse in structures suitable for analysis, and not for quick retrieval of individual pieces of information. Therefore, the data storage for the data warehouse is kept separate from the data storage for operational systems.

In your databases supporting operational systems, the updates to data happen as transactions occur. These transactions hit the databases in a random fashion. How and when the transactions change the data in the databases is not completely within your control. The data in the operational databases could change from moment to moment. When your analysts use the data in the data warehouse for analysis, they need to know that the data is stable and that it represents snapshots at specified periods. As they are working with the data, the data storage must not be in a state of continual updating. For this reason, the data warehouses are "read-only" data repositories.

Generally, the database in your data warehouse must be open. Depending on your requirements, you are likely to use tools from multiple vendors. The data warehouse must be open to different tools. Most of the data warehouses employ relational database management systems.

Many of the data warehouses also employ multidimensional database management systems. Data extracted from the data warehouse storage is aggregated in many ways and the summary data is kept in the multidimensional databases (MDDBs). Such multidimensional database systems are usually proprietary products.

Information Delivery Component

Who are the users that need information from the data warehouse? The range is fairly comprehensive. The novice user comes to the data warehouse with no training and, therefore, needs prefabricated reports and preset queries. The casual user needs information once in a while, not regularly. This type of user also needs prepackaged information. The business analyst looks for the ability to do complex analysis using the information in the data warehouse. The power user wants to be able to navigate throughout the data warehouse, pick up interesting data, format his or her own queries, drill through the data layers, and create custom reports and ad hoc queries.

In order to provide information to the wide community of data warehouse users, the information delivery component includes different methods of information delivery. Figure 2-8 shows the different information delivery methods. Ad hoc reports are predefined reports primarily meant for novice and casual users. Provision for complex queries, multidimensional (MD) analysis, and statistical analysis caters to the needs of the business analysts and power users. Information fed into Executive Information Systems (EIS) is meant for senior executives and high-level managers.
Some data warehouses also provide data to data-mining applications. Data-mining applications are knowledge discovery systems where the mining algorithms help you discover trends and patterns from the usage of your data.

[Figure 2-8 Information delivery component. The data warehouse and data marts feed the information delivery component: ad hoc reports, complex queries, MD analysis, statistical analysis, EIS feed, and data mining, delivered online and through the intranet, the Internet, and e-mail.]

In your data warehouse, you may include several information delivery mechanisms. Most commonly, you provide for online queries and reports. The users will enter their requests online and will receive the results online. You may set up delivery of scheduled reports through e-mail or you may make adequate use of your organization's intranet for information delivery. Recently, information delivery over the Internet has been gaining ground.

Metadata Component

Metadata in a data warehouse is similar to the data dictionary or the data catalog in a database management system. In the data dictionary, you keep the information about the logical data structures, the information about the files and addresses, the information about the indexes, and so on. The data dictionary contains data about the data in the database.

Similarly, the metadata component is the data about the data in the data warehouse. This is a commonly used definition, but it needs elaboration. Metadata in a data warehouse is similar to a data dictionary, but much more than a data dictionary. Later, in a separate section in this chapter, we will devote more time to the discussion of metadata. Here, for the sake of completeness, we just want to list metadata as one of the components of the data warehouse architecture.

Management and Control Component

This component of the data warehouse architecture sits on top of all the other components. The management and control component coordinates the services and activities within the data warehouse. This component controls the data transformation and the data transfer into the data warehouse storage. It also moderates the information delivery to the users. It works with the database management systems and enables data to be properly stored in the repositories. It monitors the movement of data into the staging area and from there into the data warehouse storage itself.

The management and control component interacts with the metadata component to perform the management and control functions. As the metadata component contains information about the data warehouse itself, the metadata is the source of information for the management module.

METADATA IN THE DATA WAREHOUSE

Think of metadata as the Yellow Pages® of your town. Do you need information about the stores in your town, where they are, what their names are, and what products they specialize in? Go to the Yellow Pages. The Yellow Pages is a directory with data about the institutions in your town. Almost in the same manner, the metadata component serves as a directory of the contents of your data warehouse.

Because of the importance of metadata in a data warehouse, we have set apart all of Chapter 9 for this topic. At this stage, we just want to get an introduction to the topic and highlight that metadata is a key architectural component of the data warehouse.

Types of Metadata

Metadata in a data warehouse falls into three major categories:

• Operational Metadata
• Extraction and Transformation Metadata
• End-User Metadata

Operational Metadata. As you know, data for the data warehouse comes from several operational systems of the enterprise. These source systems contain different data structures. The data elements selected for the data warehouse have various field lengths and data types.
In selecting data from the source systems for the data warehouse, you split records, combine parts of records from different source files, and deal with multiple coding schemes and field lengths. When you deliver information to the end-users, you must be able to tie that back to the original source data sets. Operational metadata contains all of this information about the operational data sources.

Extraction and Transformation Metadata. Extraction and transformation metadata contains data about the extraction of data from the source systems, namely, the extraction frequencies, extraction methods, and business rules for the data extraction. Also, this category of metadata contains information about all the data transformations that take place in the data staging area.

End-User Metadata. The end-user metadata is the navigational map of the data warehouse. It enables the end-users to find information from the data warehouse. The end-user metadata allows the end-users to use their own business terminology and look for information in those ways in which they normally think of the business.

Special Significance

Why is metadata especially important in a data warehouse?

• First, it acts as the glue that connects all parts of the data warehouse.
• Next, it provides information about the contents and structures to the developers.
• Finally, it opens the door to the end-users and makes the contents recognizable in their own terms.

CHAPTER SUMMARY

• Defining features of the data warehouse are: separate, subject-oriented, integrated, time-variant, and nonvolatile.
• You may use a top-down approach and build a large, comprehensive, enterprise data warehouse; or, you may use a bottom-up approach and build small, independent, departmental data marts. In spite of some advantages, both approaches have serious shortcomings.
• A viable practical approach is to build conformed data marts, which together form the corporate data warehouse.
• Data warehouse building blocks or components are: source data, data staging, data storage, information delivery, metadata, and management and control.
• In a data warehouse, metadata is especially significant because it acts as the glue holding all the components together and serves as a roadmap for the end-users.

REVIEW QUESTIONS

1. Name at least six characteristics or features of a data warehouse.
2. Why is data integration required in a data warehouse, more so there than in an operational application?
3. Every data structure in the data warehouse contains the time element. Why?
4. Explain data granularity and how it is applicable to the data warehouse.
5. How are the top-down and bottom-up approaches for building a data warehouse different? Discuss the merits and disadvantages of each approach.
6. What are the various data sources for the data warehouse?
7. Why do you need a separate data staging component?
8. Under data transformation, list five different functions you can think of.
9. Name any six different methods for information delivery.
10. What are the three major types of metadata in a data warehouse? Briefly mention the purpose of each type.

EXERCISES

1. Match the columns:
   1. nonvolatile data          A. roadmap for users
   2. dual data granularity     B. subject-oriented
   3. dependent data mart       C. knowledge discovery
   4. disparate data            D. private spreadsheets
   5. decision support          E. application flavor
   6. data staging              F. because of multiple sources
   7. data mining               G. details and summary
   8. metadata                  H. read-only
   9. operational systems       I. workbench for data integration
   10. internal data            J. data from main data warehouse
2. A data warehouse is subject-oriented. What would be the major critical business subjects for the following companies?
   a. an international manufacturing company
   b. a local community bank
   c. a domestic hotel chain
3. You are the data analyst on the project team building a data warehouse for an insurance company.
   List the possible data sources from which you will bring the data into your data warehouse. State your assumptions.
4. For an airline company, identify three operational applications that would feed into the data warehouse. What would be the data load and refresh cycles?
5. Prepare a table showing all the potential users and information delivery methods for a data warehouse supporting a large national grocery chain.

CHAPTER 3
TRENDS IN DATA WAREHOUSING

CHAPTER OBJECTIVES

• Review the continued growth in data warehousing
• Learn how data warehousing is becoming mainstream
• Discuss several major trends, one by one
• Grasp the need for standards and review the progress
• Understand the Web-enabled data warehouse

In the previous chapters, we have seen why data warehousing is essential for enterprises of all sizes in all industries. We have reviewed how businesses are reaping major benefits from data warehousing. We have also discussed the building blocks of a data warehouse. You now have a fairly good idea of the features and functions of the basic components and a reasonable definition of data warehousing. You have understood that it is a fundamentally simple concept; at the same time, you know it is also a blend of many technologies. Several business and technological drivers have moved data warehousing forward in the past few years.

Before we proceed further, we are at the point where we want to ask some relevant questions. What is the current scenario and state of the market? What businesses have adopted data warehousing? What are the technological advances? In short, what are the significant trends?

Are you wondering if it is too early in our discussion of the subject to talk about trends? The usual practice is to include a chapter on future trends towards the end, almost as an afterthought. The reader typically glosses over the discussion on future trends.
This chapter is not so much like looking into the crystal ball for possible future happenings; we want to deal with the important current trends that are happening now.

It is important for you to keep the knowledge about the current trends as a backdrop in your mind as you continue the deeper study of the subject. When you gather the informational requirements for your data warehouse, you need to be aware of the current trends. When you get into the design phase, you need to be cognizant of the trends. When you implement your data warehouse, you need to ensure that your data warehouse is in line with the trends. Knowledge of the trends is important and necessary even at a fairly early stage of your study.

Data Warehousing Fundamentals: A Comprehensive Guide for IT Professionals. Paulraj Ponniah. Copyright © 2001 John Wiley & Sons, Inc. ISBNs: 0-471-41254-6 (Hardback); 0-471-22162-7 (Electronic)

In this chapter, we will touch upon most of the major trends. You will understand how and why data warehousing continues to grow and become more and more pervasive. We will discuss the trends in vendor solutions and products. We will relate data warehousing to other technological phenomena such as the Internet and the World Wide Web. Wherever more detailed discussions are necessary, we will revisit some of the trends in later chapters.

CONTINUED GROWTH IN DATA WAREHOUSING

Data warehousing is no longer a purely novel idea for study and experimentation. It is becoming mainstream. True, the data warehouse is not in every dentist's office yet, but neither is it confined only to high-end businesses. More than half of all U.S. companies have made a commitment to data warehousing. About 90% of multinational companies have data warehouses or are planning to implement data warehouses in the next 12 months.
In every industry across the board, from retail chain stores to financial institutions, from manufacturing enterprises to government departments, from airline companies to utility businesses, data warehousing is revolutionizing the way people perform business analysis and make strategic decisions. Every company that has a data warehouse is realizing enormous benefits that get translated into positive results at the bottom line. Many of these companies, now incorporating Web-based technologies, are enhancing the potential for greater and easier delivery of vital information.

Over the past five years, hundreds of vendors have flooded the market with numerous products. Vendor solutions and products run the gamut of data warehousing: data modeling, data acquisition, data quality, data analysis, metadata, and so on. The buyer's guide published by the Data Warehousing Institute features no fewer than 105 leading products. The market is already huge and continues to grow.

Data Warehousing is Becoming Mainstream

In the early stages, four significant factors drove many companies to move into data warehousing:

• Fierce competition
• Government deregulation
• Need to revamp internal processes
• Imperative for customized marketing

Telecommunications, banking, and retail were the first ones to adopt data warehousing. That was largely because of government deregulation in telecommunications and banking. Retail businesses moved into data warehousing because of fiercer competition. Utility companies joined the group as that sector was deregulated. The next wave of businesses to get into data warehousing consisted of companies in financial services, health care, insurance, manufacturing, pharmaceuticals, transportation, and distribution.

[...]
unstructured data than structured data in a data warehouse.
   C. Dynamic charts are themselves user interfaces.
   D. MPP is a shared-memory parallel hardware configuration.
   E. ERP systems may be substituted for data warehouses.
   F. Most of a corporation's knowledge base contains unstructured data.
   G. The traditional data transformation tools are quite adequate for a CRM-ready data warehouse.
   H. Metadata standards facilitate...

video
• Data visualization deals with displaying information in several types of visual forms: text, numerical arrays, spreadsheets, charts, graphs, and so on. Tremendous progress has been made in data visualization.
• Data warehouse performance may be improved by using parallel processing with appropriate hardware and software options.
• It is critical to adapt data warehousing to work with ERP packages,...

Syndicated Data. The value of the data content is derived not only from the internal operational systems, but from suitable external data as well. With the escalating growth of data warehouse implementations, the market for syndicated data is rapidly expanding. Examples of the traditional suppliers of syndicated data are A. C. Nielsen and Information Resources, Inc. for retail data and Dun & Bradstreet and...

separate bodies are working on the standards for metadata: the Meta Data Coalition and the Object Management Group.

Meta Data Coalition. Formed as a consortium of vendors and interested parties in October 1995 to launch a metadata standards initiative, the coalition has been working on a standard known as the Open Information Model (OIM). Microsoft joined the coalition in December 1998 and has been a staunch...
[Figure 3-7 ERP and data warehouse integration: options. Option 1: ERP data warehouse "as is". Option 2: custom-developed data warehouse fed by the ERP system, other operational systems, and external data. Option 3: hybrid, an ERP data warehouse enhanced with third-party tools.]

in a data warehouse?
8. Describe any one of the options available to integrate ERP with data warehousing.
9. What is CRM? How can you make your data warehouse CRM-ready?
10. What do we mean by a Web-enabled data warehouse? Describe three of its functional features.

EXERCISES

1. Indicate if true or false:
   A. Data warehousing helps in customized marketing.
   B. It is more important...

spatial data will greatly enhance the value of your data warehouse. Address, street block, city quadrant, county, state, and zone are examples of spatial data. Vendors have begun to address the need to include spatial data. Some database vendors are providing spatial extenders to their products using SQL extensions to bring spatial and business data together.

Data Visualization. When a user queries your data...

option and parallel query option. You may purchase each option separately. Depending on the provisions made by the database vendors, these options may be used with one or more of the parallel hardware configurations. The parallel server option allows each hardware node to have its own separate database instance, and enables all database instances to access a common set of underlying database files. The parallel...
sales data in detail but also details of every other type of encounter with each customer. In addition to summary data, you have to load every encounter with every customer in the data warehouse. Atomic or detailed data provides maximum flexibility for the CRM-ready data warehouse. Making your data warehouse CRM-ready will increase the data volumes tremendously. Fortunately, today's technology facilitates...

currently available functionality and await the enhancements. The downside to this approach is that you may be waiting forever for the enhancements. In Option 2, companies implement customized data warehouses and use third-party tools to extract data from the ERP datasets. Retrieving and loading data from the proprietary ERP datasets is not easy. Option 3 is a hybrid approach that combines the functionalities...

Date posted: 08/08/2014, 18:22
