Data Warehousing Fundamentals: A Comprehensive Guide for IT Professionals, part 4


If the data warehouse is an enterprise-wide data warehouse being built in a top-down fashion, then there could be movements of data from the enterprise-wide data warehouse repository to the repositories of the dependent data marts. Alternatively, if the data warehouse is a conglomeration of conformed data marts being built in a bottom-up manner, then the data movements stop with the appropriate conformed data marts.

Data Groups. Prepared data waiting in the data staging area fall into two groups. The first group is the set of files or tables containing data for a full refresh. This group of data is usually meant for the initial loading of the data warehouse. Occasionally, some data warehouse tables may be refreshed fully. The other group of data is the set of files or tables containing ongoing incremental loads. Most of these relate to nightly loads. Some incremental loads of dimension data may be performed at less frequent intervals.

The Data Repository. Almost all of today’s data warehouse databases are relational databases. All the power, flexibility, and ease of use capabilities of the RDBMS become available for the processing of data.

Functions and Services. The general list of functions and services given in this section is for your guidance. The list relates to the data storage area and covers the broad functions and services. This is a general list. It does not indicate the extent or complexity of each function or service. For the technical architecture of your data warehouse, you have to determine the content and complexity of each function or service.

[Figure 7-5 Data storage: technical architecture. Labels: Data Storage; Data Marts; Metadata; Management & Control; Relational DB (E-R Model); Relational DB (Dimensional Model); Full Refresh; Incremental Load; Backup/Recovery; Data Archival; Security.]
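The two data groups described above, full refresh and incremental load, can be illustrated with a minimal sketch. It assumes SQLite as a stand-in for the warehouse RDBMS; the table `sales_fact`, its columns, and the function names are hypothetical and do not come from the book.

```python
import sqlite3

def full_refresh(conn, rows):
    """Replace the whole table with the staged rows (initial load or rebuild)."""
    with conn:  # one transaction, so the table is never left half-loaded
        conn.execute("DELETE FROM sales_fact")
        conn.executemany(
            "INSERT INTO sales_fact (sale_id, amount) VALUES (?, ?)", rows)

def incremental_load(conn, rows):
    """Apply only new or changed staged rows, for example a nightly batch."""
    with conn:
        # SQLite-specific upsert: replaces a row that has the same primary key.
        conn.executemany(
            "INSERT OR REPLACE INTO sales_fact (sale_id, amount) VALUES (?, ?)",
            rows)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales_fact (sale_id INTEGER PRIMARY KEY, amount REAL)")

full_refresh(conn, [(1, 10.0), (2, 20.0)])       # initial load
incremental_load(conn, [(2, 25.0), (3, 30.0)])   # one changed row, one new row

rows = conn.execute(
    "SELECT sale_id, amount FROM sales_fact ORDER BY sale_id").fetchall()
print(rows)  # [(1, 10.0), (2, 25.0), (3, 30.0)]
```

A production load would normally use the bulk loader of the chosen DBMS rather than row-by-row inserts; the sketch only shows the difference in semantics between the two groups.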
List of Functions and Services

• Load data for full refreshes of data warehouse tables
• Perform incremental loads at regular prescribed intervals
• Support loading into multiple tables at the detailed and summarized levels
• Optimize the loading process
• Provide automated job control services for loading the data warehouse
• Provide backup and recovery for the data warehouse database
• Provide security
• Monitor and fine-tune the database
• Periodically archive data from the database according to preset conditions

Information Delivery

This area spans a broad spectrum of many different methods of making information available to users. For your users, the information delivery component is the data warehouse. They do not come into contact with the other components directly. For the users, the strength of your data warehouse architecture is mainly concentrated in the robustness and flexibility of the information delivery component.

The information delivery component makes it easy for the users to access the information either directly from the enterprise-wide data warehouse, from the dependent data marts, or from the set of conformed data marts. Most of the information access in a data warehouse is through online queries and interactive analysis sessions. Nevertheless, your data warehouse will also be producing regular and ad hoc reports.

Almost all modern data warehouses provide for online analytical processing (OLAP). In this case, the primary data warehouse feeds data to proprietary multidimensional databases (MDDBs) where summarized data is kept as multidimensional cubes of information. The users perform complex multidimensional analysis using the information cubes in the MDDBs. Refer to Figure 7-6 for a summarized view of the technical architecture for information delivery.

Data Flow

Flow. For information delivery, the data flow begins at the enterprise-wide data warehouse and the dependent data marts when the design is based on the top-down technique.
When the design follows the bottom-up method, the data flow starts at the set of conformed data marts. Generally, data transformed into information flows to the user desktops during query sessions. Also, information printed on regular or ad hoc reports reaches the users. Sometimes, the result sets from individual queries or reports are held in proprietary data stores of the query or reporting tool vendors. The stored information may be put to faster repeated use.

In many data warehouses, data also flows into specialized downstream decision support applications such as executive information systems (EIS) and data mining. The other more common flow of information is to proprietary multidimensional databases for OLAP.

Service Locations. In your information delivery component, you may provide query services from the user desktop, from an application server, or from the database itself. This will be one of the critical decisions for your architecture design.

For producing regular or ad hoc reports, you may want to include a comprehensive reporting service. This service will allow users to create and run their own reports. It will also provide for standard reports to be run at regular intervals.

Data Stores. For information delivery, you may consider the following intermediary data stores:

• Proprietary temporary stores to hold results of individual queries and reports for repeated use
• Data stores for standard reporting
• Proprietary multidimensional databases

Functions and Services. Please review the general list of functions and services given below and use it as a guide to establish the information delivery component of your data warehouse architecture. The list relates to information delivery and covers the broad functions and services. Again, this is a general list. It does not indicate the extent or complexity of each function or service.
For the technical architecture of your data warehouse, you have to determine the content and complexity of each function or service.

• Provide security to control information access
• Monitor user access to improve service and for future enhancements
• Allow users to browse data warehouse content
• Simplify access by hiding internal complexities of data storage from users
• Automatically reformat queries for optimal execution
• Enable queries to be aware of aggregate tables for faster results
• Govern queries and control runaway queries
• Provide self-service report generation for users, consisting of a variety of flexible options to create, schedule, and run reports
• Store result sets of queries and reports for future use
• Provide multiple levels of data granularity
• Provide event triggers to monitor data loading
• Make provision for the users to perform complex analysis through online analytical processing (OLAP)
• Enable data feeds to downstream, specialized decision support systems such as EIS and data mining

[Figure 7-6 Information delivery: technical architecture. Labels: Information Delivery; Report/Query; OLAP; Data Mining; Metadata; Management & Control; Multidimensional Database; Temporary Result Sets; Standard Reporting Data Stores; Query Optimization; Query Government; Content Browse; Security Control; Self-Service Report Generation.]

CHAPTER SUMMARY

• Architecture is the structure that brings all the components together.
• Data warehouse architecture consists of distinct components with the read-only data repository as the centerpiece.
• The architectural components support the functioning of the data warehouse in the three major areas of data acquisition, data storage, and information delivery.
• Data warehouse architecture is wide, complex, expansive, and has several distinguishing characteristics.
• The architectural framework enables the flow of data from the data sources at one end to the user’s desktop at the other.
• The technical architecture of a data warehouse is the complete set of functions and services provided within its components. It includes the procedures and rules needed to perform the functions and to provide the services. It encompasses the data stores needed for each component to provide the services.

REVIEW QUESTIONS

1. What is your understanding of data warehouse architecture? Describe in one or two paragraphs.
2. What are the three major areas in the data warehouse? Is this a logical division? If so, why do you think so? Relate the architectural components to the three major areas.
3. Name four distinguishing characteristics of data warehouse architecture. Describe each briefly.
4. Trace the flow of data through the data warehouse from beginning to end.
5. For information delivery, what is the difference between top-down and bottom-up approaches to data warehouse implementation?
6. Into which architectural component does OLAP fit? What is the function of OLAP?
7. Define technical architecture of the data warehouse. How does it relate to the individual architectural components?
8. List five major functions and services in the data storage area.
9. What are the types of storage repositories in the data staging area?
10. List four major functions and services for information delivery. Describe each briefly.

EXERCISES

1. Indicate if true or false:
   A. Data warehouse architecture is just an overall guideline. It is not a blueprint for the data warehouse.
   B. In a data warehouse, the metadata component is unique, with no truly matching component in operational systems.
   C. Normally, data flows from the data warehouse repository to the data staging area.
   D. The management and control component does not relate to all operations in a data warehouse.
   E. Technical architecture simply means the vendor tools.
   F. SQL-based languages are used to extract data from hierarchical databases.
   G. Sorts and merges of files are common in the staging area.
   H. MDDBs are generally relational databases.
   I. Sometimes, results of individual queries are held in temporary data stores for repeated use.
   J. Downstream specialized applications are fed directly from the source data component.
2. You have been recently promoted to administrator for the data warehouse of a nationwide automobile insurance company. You are asked to prepare a checklist for selecting a proper vendor tool to help you with the data warehouse administration. Make a list of the functions in the management and control component of your data warehouse architecture. Use this list to derive the tool-selection checklist.
3. As the senior analyst responsible for data staging, you are responsible for the design of the data staging area. If your data warehouse gets input from several legacy systems on multiple platforms, and also regular feeds from two external sources, how will you organize your data staging area? Describe the data repositories you will have for data staging.
4. You are the data warehouse architect for a leading national department store chain. The data warehouse has been up and running for nearly a year. Now the management has decided to provide the power users with OLAP facilities. How will you alter the information delivery component of your data warehouse architecture? Make realistic assumptions and proceed.
5. You recently joined as the data extraction specialist on the data warehouse project team developing a conformed data mart for a local but progressive pharmacy. Make a detailed list of functions and services for data extraction, data transformation, and data staging.
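Among the information delivery functions listed in this chapter is governing queries and controlling runaway queries. The sketch below is one possible illustration of that idea, assuming SQLite in place of the warehouse DBMS. The helper `run_governed` and the time budgets are hypothetical; `set_progress_handler` is the standard `sqlite3` hook used to interrupt a statement.

```python
import sqlite3
import time

def run_governed(conn, sql, max_seconds):
    """Run sql and cancel it if it runs longer than max_seconds."""
    deadline = time.monotonic() + max_seconds

    def over_budget():
        # A non-zero return value tells SQLite to interrupt the statement.
        return 1 if time.monotonic() > deadline else 0

    # Ask SQLite to call over_budget every 1000 virtual-machine steps.
    conn.set_progress_handler(over_budget, 1000)
    try:
        return conn.execute(sql).fetchall()
    except sqlite3.OperationalError:
        # Simplification: any operational error is treated as a cancellation.
        return None
    finally:
        conn.set_progress_handler(None, 0)  # remove the governor again

conn = sqlite3.connect(":memory:")

# A deliberately endless query standing in for a runaway user request.
RUNAWAY = ("WITH RECURSIVE c(x) AS (SELECT 1 UNION ALL SELECT x + 1 FROM c) "
           "SELECT count(*) FROM c")

quick = run_governed(conn, "SELECT 1 + 1", max_seconds=5.0)
cancelled = run_governed(conn, RUNAWAY, max_seconds=0.05)
print(quick, cancelled)
```

Commercial query tools and DBMS resource managers typically offer richer controls, such as row limits, cost estimates, and per-user quotas; the time budget here is only the simplest form of governing.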
CHAPTER 8

INFRASTRUCTURE AS THE FOUNDATION FOR DATA WAREHOUSING

Data Warehousing Fundamentals: A Comprehensive Guide for IT Professionals. Paulraj Ponniah. Copyright © 2001 John Wiley & Sons, Inc. ISBNs: 0-471-41254-6 (Hardback); 0-471-22162-7 (Electronic)

CHAPTER OBJECTIVES

• Understand the distinction between architecture and infrastructure
• Find out how the data warehouse infrastructure supports its architecture
• Gain an insight into the components of the physical infrastructure
• Review hardware and operating systems for the data warehouse
• Study parallel processing options as applicable to the data warehouse
• Discuss the server options in detail
• Learn how to select the DBMS
• Review the types of tools needed for the data warehouse

What is data warehouse infrastructure in relation to its architecture? What is the distinction between architecture and infrastructure? In what ways are they different? Why do we have to study the two separately?

In the previous chapter, we discussed data warehouse architecture in detail. We looked at the various architectural components and studied them by grouping them into the three major areas of the data warehouse, namely, data acquisition, data storage, and information delivery. You learned the elements that composed the technical architecture of each architectural component.

In this chapter, let us find out what infrastructure means and what it includes. We will discuss each part of the data warehouse infrastructure. You will understand the significance of infrastructure and master the techniques for creating the proper infrastructure for your data warehouse.

INFRASTRUCTURE SUPPORTING ARCHITECTURE

Consider the architectural components. For example, let us take the technical architecture of the data staging component. This part of the technical architecture for your data warehouse does a number of things. First of all, it indicates that there is a section of the architecture called data staging.
Then it notes that this section of the architecture contains an area where data is staged before it is loaded into the data warehouse repository. Next, it denotes that this section of the architecture performs certain functions and provides specific services in the data warehouse. Among others, the functions and services include data transformation and data cleansing.

Let us now ask a few questions. Where exactly is the data staging area? What are the specific files and databases? How do the functions get performed? What enables the services to be provided? What is the underlying base? What is the foundational structure? Infrastructure is the foundation supporting the architecture. Figure 8-1 expresses this fact in a simple manner.

What are the various elements needed to support the architecture? The foundational infrastructure includes many elements. First, it consists of the basic computing platform. The platform includes all the required hardware and the operating system. Next, the database management system (DBMS) is an important element of the infrastructure. All other types of software and tools are also part of the infrastructure. What about the people and the procedures that make the architecture come alive? Are these also part of the infrastructure? In a sense, they are.

Data warehouse infrastructure includes all the foundational elements that enable the architecture to be implemented. In summary, the infrastructure includes several elements such as server hardware, operating system, network software, database software, the LAN and WAN, vendor tools for every architectural component, people, procedures, and training.

The elements of the data warehouse infrastructure may be classified into two categories: operational infrastructure and physical infrastructure. This distinction is important because elements in each category are different in their nature and features compared to those in the other category.
First, we will go over the elements that may be grouped as operational infrastructure. The physical infrastructure is much wider and more fundamental. After gaining a basic understanding of the elements of the physical infrastructure, we will spend a large portion of this chapter examining specific elements in greater detail.

[Figure 8-1 Infrastructure supporting architecture. Labels: Data Warehouse Architecture; Data Acquisition; Data Storage; Information Access.]

Operational Infrastructure

To understand operational infrastructure, let us once again take the example of data staging. One part of foundational infrastructure refers to the computing hardware and the related software. You need the hardware and software to perform the data staging functions and render the appropriate services. You need software tools to perform data transformations. You need software to create the output files. You need disk hardware to place the data in the staging area files. But what about the people involved in performing these functions? What about the business rules and procedures for the data transformations? What about the management software to monitor and administer the data transformation tasks?

Operational infrastructure to support each architectural component consists of

• People
• Procedures
• Training
• Management software

These are not the people and procedures needed for developing the data warehouse. These are the ones needed to keep the data warehouse going. These elements are as essential as the hardware and software that keep the data warehouse running. They support the management of the data warehouse and maintain its efficiency.

Data warehouse developers pay a lot of attention to the hardware and system software elements of the infrastructure. It is right to do so. But operational infrastructure is often neglected.
Even though you may have the right hardware and software, your data warehouse needs the operational infrastructure in place for proper functioning. Without appropriate operational infrastructure, your data warehouse is likely to just limp along and cease to be effective. Pay attention to the details of your operational infrastructure.

Physical Infrastructure

Let us begin with a diagram. Figure 8-2 highlights the major elements of physical infrastructure. What do you see in the diagram? As you know, every system, including your data warehouse, must have an overall platform on which to reside. Essentially, the platform consists of the basic hardware components, the operating system with its utility software, the network, and the network software. Along with the overall platform is the set of tools that run on the selected platform to perform the various functions and services of individual architectural components.

We will examine the elements of physical infrastructure in the next few sections. Decisions about the hardware top the list of decisions you have to make about the infrastructure of your data warehouse. Hardware decisions are not easy. You have to consider many factors. You have to ensure that the selected hardware will support the entire data warehouse architecture.

Perhaps we can go back to our mainframe days and get some helpful hints. As newer models of the corporate mainframes were announced and as we ran out of steam on the current configuration, we stuck to two principles. First, we leveraged as much of the existing physical infrastructure as possible. Next, we kept the infrastructure as modular as possible. When needs arose and when newer versions became available at cheaper prices, we unplugged an existing component and plugged in the replacement.

In your data warehouse, try to adopt these two principles.
You already have the hardware and operating system components in your company supporting the current operations. How much of this can you use for your data warehouse? How much extra capacity is available? How much disk space can be spared for the data warehouse repository? Find answers to these questions.

Applying the modular approach, can you add more processors to the server hardware? Explore if you can accommodate the data warehouse by adding more disk units. Take an inventory of individual hardware components. Check which of these components need to be replaced with more potent versions. Also, make a list of the additional components that have to be procured and plugged in.

[Figure 8-2 Physical infrastructure. Labels: Computing Platform; Hardware; Operating System; Network Software; DBMS; Data Acquisition Tools; Data Staging Tools; Info. Delivery Tools.]

HARDWARE AND OPERATING SYSTEMS

Hardware and operating systems make up the computing environment for your data warehouse. All the data extraction, transformation, integration, and staging jobs run on the selected hardware under the chosen operating system. When you transport the consolidated and integrated data from the staging area to your data warehouse repository, you make use of the server hardware and the operating system software. When the queries are initiated from the client workstations, the server hardware, in conjunction with the database software, executes the queries and produces the results.

Here are some general guidelines for hardware selection, not entirely specific to hardware for the data warehouse.

Scalability. As your data warehouse grows in terms of the number of users, the number of queries, and the complexity of the queries, ensure that your selected hardware can be scaled up.

Support. Vendor support is crucial for hardware maintenance. Make sure that the support from the hardware vendor is at the highest possible level.
Vendor Reference. It is important to check vendor references with other sites using hardware from this vendor. You do not want to be caught with your data warehouse being down because of hardware malfunctions when the CEO wants some critical analysis to be completed.

Vendor Stability. Check on the stability and staying power of the vendor.

Next let us quickly consider a few general criteria for the selection of the operating system. First of all, the operating system must be compatible with the hardware. A list of criteria follows.

Scalability. Again, scalability is first on the list because this is one common feature of every data warehouse. Data warehouses grow, and they grow very fast. Along with the hardware and database software, the operating system must be able to support the increase in the number of users and applications.

Security. When multiple client workstations access the server, the operating system must be able to protect each client and associated resources. The operating system must provide each client with a secure environment.

Reliability. The operating system must be able to protect the environment from application malfunctions.

Availability. This is a corollary to reliability. The computing environment must continue to be available after abnormal application terminations.

Preemptive Multitasking. The server hardware must be able to balance the allocation of time and resources among the multiple tasks. Also, the operating system must be able to let a higher priority task preempt or interrupt another task as and when needed.

Multithreaded Approach. The operating system must be able to serve multiple requests concurrently by distributing threads to multiple processors in a multiprocessor hardware configuration. This feature is very important because multiprocessor configurations are architectures of choice in a data warehouse environment.

Memory Protection. Again, in a data warehouse environment, large numbers of queries are common.
That means that multiple queries will be executing concurrently. A memory protection feature in an operating system prevents one task from violating the memory space of another.

Having reviewed the requirements for hardware and operating systems in a data warehouse environment, let us try to narrow down the choices. What are the possible options? Please go through the following list of three common options.

Mainframes

• Leftover hardware from legacy applications
• Primarily designed for OLTP and not for decision support applications
• Not cost-effective for data warehousing
• Not easily scalable
• Rarely used for data warehousing when too much spare resources are available for smaller data marts

Open System Servers

• UNIX servers, the choice medium for most data warehouses
• Generally robust
• Adapted for parallel processing

[...]

[Figure 8-4 Platforms for data acquisition. Labels: Source Data Platforms (Mainframe, Mini, UNIX or Other); Staging Area Platform (UNIX). Steps shown: Data Extraction; Initial Reformatting/Merging; Preliminary Data Cleansing; Transformation/Consolidation; Validation/Quality Check; Load Image Creation.]

[Figure 8-5 Platform options for the staging area. Labels: Source Data Platforms (UNIX, Mini); Data Storage Platform; Separate Platform; Staging Area (Option 2, Option 3).]

On the Data Storage Platform. This is the platform on which the data warehouse DBMS runs and the database exists. When you keep your data staging area on this platform, you will realize all the advantages for applying the load images to the database. You may even be able...
[...] transformations. It is possible that you need to work through your data thoroughly to cleanse and prepare it for your data warehouse. In such circumstances, you need a separate platform to stage your data before loading to the database. Here are some distinct advantages of a separate platform for data staging:

• You can optimize the separate platform for complex data transformations and data cleansing. What do we mean...

[...] to share data by placing the common data on a shared disk. You may adapt this method to pass data from one step to another for data acquisition in your data warehouse. You have to designate a disk storage area and set it up so that each of the two platforms recognizes the disk storage area as its own.

Mass Data Transmission. In this case, transmission of data across platforms takes place through data ports...

[...] three major areas. Here is a quick summary recap:

Data Acquisition: data extraction, data transformation, data cleansing, data integration, and data staging
Data Storage: data loading, archiving, and data management
Information Delivery: report generation, query processing, and complex analysis

We will now discuss platform options in terms of the functions in these three areas. Where should each function...

[...] databases, and available built-in extraction and duplication facilities in the source systems

Data Transformation

• Transform extracted data into appropriate formats and data structures
• Provide default values as specified
• Major features include field splitting, consolidation, standardization, and deduplication

Data Loading

• Load transformed and consolidated data in the form of load images into the data...
[...]

• Understand who needs metadata and what types they need
• Review metadata types by the three functional areas
• Discuss business metadata and technical metadata in detail
• Examine all the requirements metadata must satisfy
• Understand the challenges for metadata management
• Study options for providing metadata

We discussed metadata briefly in earlier chapters. In Chapter 2, we considered metadata as one of the major...

[...] server hardware is a key decision. Invariably, the choice is one of the four parallel server architectures.

• Parallel processing options are critical in the DBMS. Current database software products are able to perform interquery and intraquery parallelization.
• Software tools are used in the data warehouse for data modeling, data extraction, data transformation, data loading, data quality assurance, queries...

[...] eliminate a few intermediary substeps and apply data directly to the database from some of the consolidated files in the staging area.

On a Separate Optimal Platform. You may review your data source platforms, examine the data warehouse storage platform, and then decide that none of these platforms are really suitable for your staging area. It is likely that your environment needs complex data transformations...

[...] is a strong candidate for the staging area. You will arrange for this step to happen on that platform.

Creation of Load Images. This step creates load images for individual database files of the data warehouse repository. This step almost always occurs in the staging area and, therefore, on the platform where the staging area resides.

Figure 8-4 summarizes the data acquisition steps and the associated platforms.

[...] separate platform for complex data transformations and data cleansing. What do we mean by this? You can gear up the neutral platform with all the necessary tools for data transformation, data...
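The data transformation features named in the excerpts above (default values, field splitting, standardization, and deduplication) can be sketched in a few lines. This is only an illustration under stated assumptions: the record layout, the field names `full_name`, `state`, and `segment`, and the function `transform` are all hypothetical, and consolidation across multiple sources is not shown.

```python
def transform(records, defaults):
    """Apply defaults, field splitting, standardization, and deduplication."""
    seen = set()
    cleaned = []
    for rec in records:
        # Default values: fill in fields that are missing or empty.
        present = {k: v for k, v in rec.items() if v not in (None, "")}
        row = {**defaults, **present}

        # Field splitting: break the combined name into two separate fields.
        first, _, last = row.pop("full_name").strip().partition(" ")
        row["first_name"] = first.title()
        row["last_name"] = last.strip().title()

        # Standardization: one canonical form for the state code.
        row["state"] = row["state"].strip().upper()

        # Deduplication: keep the first record seen for each business key.
        key = (row["first_name"], row["last_name"], row["state"])
        if key not in seen:
            seen.add(key)
            cleaned.append(row)
    return cleaned

staged = [
    {"full_name": "ann  lee", "state": "nj", "segment": None},
    {"full_name": "Ann Lee", "state": "NJ ", "segment": "retail"},
    {"full_name": "Raj Patel", "state": "ny", "segment": ""},
]
result = transform(staged, defaults={"segment": "unknown"})
for row in result:
    print(row)
```

Keeping the first record for each key is an arbitrary choice made for brevity; a real staging job would usually apply survivorship rules to decide which duplicate wins.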

Date posted: 08/08/2014, 18:22
