If the data warehouse is an enterprise-wide data warehouse being built in a top-down fashion, then there could be movements of data from the enterprise-wide data warehouse repository to the repositories of the dependent data marts. Alternatively, if the data warehouse is a conglomeration of conformed data marts being built in a bottom-up manner, then the data movements stop with the appropriate conformed data marts.
Data Groups. Prepared data waiting in the data staging area fall into two groups. The first group is the set of files or tables containing data for a full refresh. This group of data is usually meant for the initial loading of the data warehouse. Occasionally, some data warehouse tables may be refreshed fully.
The other group of data is the set of files or tables containing ongoing incremental loads. Most of these relate to nightly loads. Some incremental loads of dimension data may be performed at less frequent intervals.
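To make the two data groups concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The table `sales_fact`, its columns, and the two function names are assumptions made up for illustration; a real warehouse would normally use the bulk-load utilities of its own DBMS.

```python
import sqlite3

def full_refresh(conn, rows):
    """Replace the entire table contents (initial load or occasional full refresh)."""
    with conn:  # one transaction, so the table is never left half-loaded
        conn.execute("DELETE FROM sales_fact")
        conn.executemany(
            "INSERT INTO sales_fact (sale_id, amount) VALUES (?, ?)", rows
        )

def incremental_load(conn, rows):
    """Apply only new or changed rows (for example, a nightly load)."""
    with conn:
        conn.executemany(
            "INSERT OR REPLACE INTO sales_fact (sale_id, amount) VALUES (?, ?)", rows
        )

# Hypothetical in-memory database standing in for the warehouse repository.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_fact (sale_id INTEGER PRIMARY KEY, amount REAL)")

full_refresh(conn, [(1, 10.0), (2, 20.0)])        # initial load
incremental_load(conn, [(2, 25.0), (3, 30.0)])    # one changed row, one new row

result = conn.execute(
    "SELECT sale_id, amount FROM sales_fact ORDER BY sale_id"
).fetchall()
print(result)
```

The full refresh discards whatever was in the table, whereas the incremental load touches only the rows it is given. That difference is why the two groups of prepared data are kept apart in the staging area.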
The Data Repository. Almost all of today’s data warehouse databases are relational databases. All the power, flexibility, and ease-of-use capabilities of the RDBMS become available for the processing of data.
Functions and Services. The general list of functions and services given in this section is for your guidance. The list relates to the data storage area and covers the broad functions and services. This is a general list. It does not indicate the extent or complexity of each function or service. For the technical architecture of your data warehouse, you have to determine the content and complexity of each function or service.
Figure 7-5 Data storage: technical architecture (legible diagram labels: relational DB, dimensional model, incremental load, security)
List of Functions and Services
• Load data for full refreshes of data warehouse tables
• Perform incremental loads at regular prescribed intervals
• Support loading into multiple tables at the detailed and summarized levels
• Optimize the loading process
• Provide automated job control services for loading the data warehouse
• Provide backup and recovery for the data warehouse database
• Provide security
• Monitor and fine-tune the database
• Periodically archive data from the database according to preset conditions
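As an illustration of the last item in the list, the following sketch archives rows according to one preset condition, namely a cutoff date. It uses Python's `sqlite3` module; the tables `sales_fact` and `sales_archive`, their columns, and the function name are hypothetical, and the cutoff rule is only one of many conditions you might choose.

```python
import sqlite3

def archive_old_rows(conn, cutoff_date):
    """Move fact rows dated before cutoff_date into the archive table.

    Dates are stored as ISO strings (YYYY-MM-DD), so string comparison
    matches chronological order.
    """
    with conn:  # copy and delete succeed or fail together
        conn.execute(
            "INSERT INTO sales_archive "
            "SELECT * FROM sales_fact WHERE sale_date < ?",
            (cutoff_date,),
        )
        cur = conn.execute(
            "DELETE FROM sales_fact WHERE sale_date < ?", (cutoff_date,)
        )
    return cur.rowcount  # number of rows moved out of the live table

conn = sqlite3.connect(":memory:")
for table in ("sales_fact", "sales_archive"):
    conn.execute(
        f"CREATE TABLE {table} "
        "(sale_id INTEGER PRIMARY KEY, sale_date TEXT, amount REAL)"
    )
conn.executemany(
    "INSERT INTO sales_fact VALUES (?, ?, ?)",
    [(1, "1998-03-15", 10.0), (2, "2000-07-01", 20.0), (3, "2001-01-20", 30.0)],
)

moved = archive_old_rows(conn, "2000-01-01")
remaining = conn.execute("SELECT COUNT(*) FROM sales_fact").fetchone()[0]
archived = conn.execute("SELECT COUNT(*) FROM sales_archive").fetchone()[0]
print(moved, remaining, archived)
```

Running the copy and the delete inside one transaction means a failure part of the way through cannot leave a row in both tables or in neither.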
Information Delivery
This area spans a broad spectrum of many different methods of making information available to users. For your users, the information delivery component is the data warehouse. They do not come into contact with the other components directly. For the users, the strength of your data warehouse architecture is mainly concentrated in the robustness and flexibility of the information delivery component.
The information delivery component makes it easy for the users to access the information either directly from the enterprise-wide data warehouse, from the dependent data marts, or from the set of conformed data marts. Most of the information access in a data warehouse is through online queries and interactive analysis sessions. Nevertheless, your data warehouse will also be producing regular and ad hoc reports.
Almost all modern data warehouses provide for online analytical processing (OLAP). In this case, the primary data warehouse feeds data to proprietary multidimensional databases (MDDBs) where summarized data is kept as multidimensional cubes of information. The users perform complex multidimensional analysis using the information cubes in the MDDBs. Refer to Figure 7-6 for a summarized view of the technical architecture for information delivery.
Data Flow. For information delivery, the data flow begins at the enterprise-wide data warehouse and the dependent data marts when the design is based on the top-down technique. When the design follows the bottom-up method, the data flow starts at the set of conformed data marts. Generally, data transformed into information flows to the user desktops during query sessions. Also, information printed on regular or ad hoc reports reaches the users. Sometimes, the result sets from individual queries or reports are held in proprietary data stores of the query or reporting tool vendors. The stored information may be put to faster repeated use.
In many data warehouses, data also flows into specialized downstream decision support applications such as executive information systems (EIS) and data mining. The other more common flow of information is to proprietary multidimensional databases for OLAP.
Service Locations. In your information delivery component, you may provide query services from the user desktop, from an application server, or from the database itself. This will be one of the critical decisions for your architecture design.
For producing regular or ad hoc reports, you may want to include a comprehensive reporting service. This service will allow users to create and run their own reports. It will also provide for standard reports to be run at regular intervals.
Data Stores. For information delivery, you may consider the following intermediary data stores:
• Proprietary temporary stores to hold results of individual queries and reports for repeated use
• Data stores for standard reporting
• Proprietary multidimensional databases
Functions and Services. Please review the general list of functions and services given below and use it as a guide to establish the information delivery component of your data warehouse architecture. The list relates to information delivery and covers the broad functions and services. Again, this is a general list. It does not indicate the extent or complexity of each function or service. For the technical architecture of your data warehouse, you have to determine the content and complexity of each function or service.
• Provide security to control information access
• Monitor user access to improve service and for future enhancements
• Allow users to browse data warehouse content
• Simplify access by hiding internal complexities of data storage from users
• Automatically reformat queries for optimal execution
• Enable queries to be aware of aggregate tables for faster results
• Govern queries and control runaway queries
• Provide self-service report generation for users, consisting of a variety of flexible options to create, schedule, and run reports
• Store result sets of queries and reports for future use
• Provide multiple levels of data granularity
• Provide event triggers to monitor data loading
• Make provision for the users to perform complex analysis through online analytical processing (OLAP)
• Enable data feeds to downstream, specialized decision support systems such as EIS and data mining
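One item in the list above, making queries aware of aggregate tables, can be sketched in a few lines. The idea is to route a query to the smallest table that still carries every dimension the query asks for. The catalog below, its table names, and its row counts are invented for illustration; commercial query tools implement this navigation in their own proprietary ways.

```python
# Hypothetical catalog: table name -> (dimensions it carries, approximate rows)
AGGREGATES = {
    "sales_detail": ({"month", "store", "product"}, 50_000_000),
    "sales_by_month_store": ({"month", "store"}, 120_000),
    "sales_by_month": ({"month"}, 60),
}

def pick_table(requested_dims):
    """Return the smallest table that contains every requested dimension."""
    requested = set(requested_dims)
    candidates = [
        (rows, name)
        for name, (dims, rows) in AGGREGATES.items()
        if requested <= dims  # the table must cover all requested dimensions
    ]
    if not candidates:
        raise ValueError(f"no table covers dimensions {sorted(requested)}")
    return min(candidates)[1]  # fewest rows wins

print(pick_table({"month"}))
print(pick_table({"month", "store"}))
print(pick_table({"product"}))
```

A query by month alone is answered from the 60-row summary, while a query that needs the product dimension falls back to the detail table. The user never has to know which table was chosen, which is also how the component hides the internal complexities of data storage.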
CHAPTER SUMMARY
• Architecture is the structure that brings all the components together.
• Data warehouse architecture consists of distinct components with the read-only data repository as the centerpiece.
• The architectural components support the functioning of the data warehouse in the three major areas of data acquisition, data storage, and information delivery.
• Data warehouse architecture is wide, complex, and expansive, and has several distinguishing characteristics.
• The architectural framework enables the flow of data from the data sources at one end to the user’s desktop at the other.
• The technical architecture of a data warehouse is the complete set of functions and services provided within its components. It includes the procedures and rules needed to perform the functions and to provide the services. It encompasses the data stores needed for each component to provide the services.
REVIEW QUESTIONS
1. What is your understanding of data warehouse architecture? Describe in one or two paragraphs.
2. What are the three major areas in the data warehouse? Is this a logical division? If so, why do you think so? Relate the architectural components to the three major areas.
3. Name four distinguishing characteristics of data warehouse architecture. Describe each briefly.
4. Trace the flow of data through the data warehouse from beginning to end.
5. For information delivery, what is the difference between top-down and bottom-up approaches to data warehouse implementation?
6. In which architectural component does OLAP fit in? What is the function of OLAP?
7. Define technical architecture of the data warehouse. How does it relate to the individual architectural components?
8. List five major functions and services in the data storage area.
9. What are the types of storage repositories in the data staging area?
10. List four major functions and services for information delivery. Describe each briefly.
EXERCISES
1. Indicate if true or false:
A. Data warehouse architecture is just an overall guideline. It is not a blueprint for the data warehouse.
B. In a data warehouse, the metadata component is unique, with no truly matching component in operational systems.
C. Normally, data flows from the data warehouse repository to the data staging area.
D. The management and control component does not relate to all operations in a data warehouse.
E. Technical architecture simply means the vendor tools.
F. SQL-based languages are used to extract data from hierarchical databases.
G. Sorts and merges of files are common in the staging area.
H. MDDBs are generally relational databases.
I. Sometimes, results of individual queries are held in temporary data stores for repeated use.
J. Downstream specialized applications are fed directly from the source data component.
2. You have been recently promoted to administrator for the data warehouse of a nationwide automobile insurance company. You are asked to prepare a checklist for selecting a proper vendor tool to help you with the data warehouse administration. Make a list of the functions in the management and control component of your data warehouse architecture. Use this list to derive the tool-selection checklist.
3. As the senior analyst responsible for data staging, you are responsible for the design of the data staging area. If your data warehouse gets input from several legacy systems on multiple platforms, and also regular feeds from two external sources, how will you organize your data staging area? Describe the data repositories you will have for data staging.
4. You are the data warehouse architect for a leading national department store chain. The data warehouse has been up and running for nearly a year. Now the management has decided to provide the power users with OLAP facilities. How will you alter the information delivery component of your data warehouse architecture? Make realistic assumptions and proceed.
5. You recently joined as the data extraction specialist on the data warehouse project team developing a conformed data mart for a local but progressive pharmacy. Make a detailed list of functions and services for data extraction, data transformation, and data staging.
CHAPTER 8
INFRASTRUCTURE AS THE FOUNDATION FOR DATA WAREHOUSING
CHAPTER OBJECTIVES
• Understand the distinction between architecture and infrastructure
• Find out how the data warehouse infrastructure supports its architecture
• Gain an insight into the components of the physical infrastructure
• Review hardware and operating systems for the data warehouse
• Study parallel processing options as applicable to the data warehouse
• Discuss the server options in detail
• Learn how to select the DBMS
• Review the types of tools needed for the data warehouse
What is data warehouse infrastructure in relation to its architecture? What is the distinction between architecture and infrastructure? In what ways are they different? Why do we have to study the two separately?
In the previous chapter, we discussed data warehouse architecture in detail. We looked at the various architectural components and studied them by grouping them into the three major areas of the data warehouse, namely, data acquisition, data storage, and information delivery. You learned the elements that composed the technical architecture of each architectural component.
In this chapter, let us find out what infrastructure means and what it includes. We will discuss each part of the data warehouse infrastructure. You will understand the significance of infrastructure and master the techniques for creating the proper infrastructure for your data warehouse.
INFRASTRUCTURE SUPPORTING ARCHITECTURE
Consider the architectural components. For example, let us take the technical architecture of the data staging component. This part of the technical architecture for your data warehouse does a number of things. First of all, it indicates that there is a section of the architecture called data staging. Then it notes that this section of the architecture contains an area where data is staged before it is loaded into the data warehouse repository. Next, it denotes that this section of the architecture performs certain functions and provides specific services in the data warehouse. Among others, the functions and services include data transformation and data cleansing.
Copyright © 2001 John Wiley & Sons, Inc. ISBNs: 0-471-41254-6 (Hardback); 0-471-22162-7 (Electronic)
Let us now ask a few questions. Where exactly is the data staging area? What are the specific files and databases? How do the functions get performed? What enables the services to be provided? What is the underlying base? What is the foundational structure? Infrastructure is the foundation supporting the architecture. Figure 8-1 expresses this fact in a simple manner.
What are the various elements needed to support the architecture? The foundational infrastructure includes many elements. First, it consists of the basic computing platform. The platform includes all the required hardware and the operating system. Next, the database management system (DBMS) is an important element of the infrastructure. All other types of software and tools are also part of the infrastructure. What about the people and the procedures that make the architecture come alive? Are these also part of the infrastructure? In a sense, they are.
Data warehouse infrastructure includes all the foundational elements that enable the architecture to be implemented. In summary, the infrastructure includes several elements such as server hardware, operating system, network software, database software, the LAN and WAN, vendor tools for every architectural component, people, procedures, and training.
The elements of the data warehouse infrastructure may be classified into two categories: operational infrastructure and physical infrastructure. This distinction is important because elements in each category are different in their nature and features compared to those in the other category. First, we will go over the elements that may be grouped as operational infrastructure. The physical infrastructure is much wider and more fundamental.
Figure 8-1 Infrastructure supporting architecture (legible diagram label: data warehouse architecture)
After gaining a basic understanding of the elements of the physical infrastructure, we will spend a large portion of this chapter examining specific elements in greater detail.
Operational Infrastructure
To understand operational infrastructure, let us once again take the example of data staging. One part of foundational infrastructure refers to the computing hardware and the related software. You need the hardware and software to perform the data staging functions and render the appropriate services. You need software tools to perform data transformations. You need software to create the output files. You need disk hardware to place the data in the staging area files. But what about the people involved in performing these functions? What about the business rules and procedures for the data transformations? What about the management software to monitor and administer the data transformation tasks?
Operational infrastructure to support each architectural component consists of people, procedures, training, and management software.
Data warehouse developers pay a lot of attention to the hardware and system software elements of the infrastructure. It is right to do so. But operational infrastructure is often neglected. Even though you may have the right hardware and software, your data warehouse needs the operational infrastructure in place for proper functioning. Without appropriate operational infrastructure, your data warehouse is likely to just limp along and cease to be effective. Pay attention to the details of your operational infrastructure.
Physical Infrastructure
Let us begin with a diagram. Figure 8-2 highlights the major elements of physical infrastructure. What do you see in the diagram? As you know, every system, including your data warehouse, must have an overall platform on which to reside. Essentially, the platform consists of the basic hardware components, the operating system with its utility software, the network, and the network software. Along with the overall platform is the set of tools that run on the selected platform to perform the various functions and services of individual architectural components.
We will examine the elements of physical infrastructure in the next few sections. Decisions about the hardware top the list of decisions you have to make about the infrastructure of your data warehouse. Hardware decisions are not easy. You have to consider many factors. You have to ensure that the selected hardware will support the entire data warehouse architecture.
Perhaps we can go back to our mainframe days and get some helpful hints. As newer models of the corporate mainframes were announced and as we ran out of steam on the current configuration, we stuck to two principles. First, we leveraged as much of the existing physical infrastructure as possible. Next, we kept the infrastructure as modular as possible. When needs arose and when newer versions became available at cheaper prices, we unplugged an existing component and plugged in the replacement.
In your data warehouse, try to adopt these two principles. You already have the hardware and operating system components in your company supporting the current operations. How much of this can you use for your data warehouse? How much extra capacity is available? How much disk space can be spared for the data warehouse repository? Find answers to these questions.
Applying the modular approach, can you add more processors to the server hardware? Explore if you can accommodate the data warehouse by adding more disk units. Take an inventory of individual hardware components. Check which of these components need to be replaced with more potent versions. Also, make a list of the additional components that have to be procured and plugged in.
HARDWARE AND OPERATING SYSTEMS
Hardware and operating systems make up the computing environment for your data warehouse. All the data extraction, transformation, integration, and staging jobs run on the selected hardware under the chosen operating system. When you transport the consolidated and integrated data from the staging area to your data warehouse repository, you make use of the server hardware and the operating system software. When the queries are initiated from the client workstations, the server hardware, in conjunction with the database software, executes the queries and produces the results.
Here are some general guidelines for hardware selection, not entirely specific to hardware for the data warehouse.
Scalability. Ensure that your selected hardware can be scaled up when your data warehouse grows in terms of the number of users, the number of queries, and the complexity of the queries.
Support. Vendor support is crucial for hardware maintenance. Make sure that the support from the hardware vendor is at the highest possible level.
Figure 8-2 Physical infrastructure (legible diagram labels: computing platform, operating system, DBMS, software, data acquisition tools, data staging tools, information delivery tools)
Vendor Reference. It is important to check vendor references with other sites using hardware from this vendor. You do not want to be caught with your data warehouse being down because of hardware malfunctions when the CEO wants some critical analysis to be completed.
Vendor Stability. Check on the stability and staying power of the vendor.
Next let us quickly consider a few general criteria for the selection of the operating system. First of all, the operating system must be compatible with the hardware. A list of criteria follows.
Scalability. Again, scalability is first on the list because this is one common feature of every data warehouse. Data warehouses grow, and they grow very fast. Along with the hardware and database software, the operating system must be able to support the increase in the number of users and applications.
Security. When multiple client workstations access the server, the operating system must be able to protect each client and associated resources. The operating system must provide each client with a secure environment.
Reliability. The operating system must be able to protect the environment from application malfunctions.
Availability. This is a corollary to reliability. The computing environment must continue to be available after abnormal application terminations.
Preemptive Multitasking. The server hardware must be able to balance the allocation of time and resources among the multiple tasks. Also, the operating system must be able to let a higher priority task preempt or interrupt another task as and when needed.
Multithreaded Approach. The operating system must be able to serve multiple requests concurrently by distributing threads to multiple processors in a multiprocessor hardware configuration. This feature is very important because multiprocessor configurations are architectures of choice in a data warehouse environment.
Memory Protection. Again, in a data warehouse environment, large numbers of queries are common. That means that multiple queries will be executing concurrently. A memory protection feature in an operating system prevents one task from violating the memory space of another.
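The idea of serving multiple requests concurrently can be illustrated with a small thread pool, using Python's standard `concurrent.futures` module. This is only a sketch of the request-serving pattern: `run_query` is a hypothetical stand-in that sleeps instead of doing real work, and how the threads are actually scheduled across processors is decided by the operating system and the language runtime, not by this code.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_query(query_id):
    """Stand-in for a query: sleeps briefly to imitate waiting on disk I/O."""
    time.sleep(0.05)
    return f"result-{query_id}"

def serve(query_ids, workers=4):
    """Run the queries on a pool of worker threads and collect the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() hands each query to a free thread and keeps the input order
        return list(pool.map(run_query, query_ids))

results = serve(range(8))
print(results)
```

With four workers, the eight simulated queries finish in roughly two rounds of waiting rather than eight, which is the benefit the multithreaded criterion is after.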
Having reviewed the requirements for hardware and operating systems in a data warehouse environment, let us try to narrow down the choices. What are the possible options? Please go through the following list of three common options.
Mainframes
• Leftover hardware from legacy applications
• Primarily designed for OLTP and not for decision support applications
• Not cost-effective for data warehousing
• Not easily scalable
• Rarely used for data warehousing, except when ample spare resources are available for smaller data marts
Open System Servers
• UNIX servers, the choice medium for most data warehouses
• Generally robust
• Adapted for parallel processing
NT Servers
• Support medium-sized data warehouses
• Limited parallel processing capabilities
• Cost-effective for medium-sized and small data warehouses
Platform Options
Let us now turn our attention to the computing platforms that are needed to perform the several functions of the various components of the data warehouse architecture. A computing platform is the set of hardware components, the operating system, the network, and the network software. Whether it is a function of an OLTP system or a decision support system like the data warehouse, the function has to be performed on a computing platform.
Before we get into a deeper discussion of platform options, let us get back to the functions and services of the architectural components in the three major areas. Here is a quick recap:
Data Acquisition: data extraction, data transformation, data cleansing, data integration, and data staging.
Data Storage: data loading, archiving, and data management.
Information Delivery: report generation, query processing, and complex analysis.
We will now discuss platform options in terms of the functions in these three areas. Where should each function be performed? On which platforms? How could you optimize the functions?
Single Platform Option. This is the most straightforward and simplest option for implementing the data warehouse architecture. In this option, all functions from the back-end data extraction to the front-end query processing are performed on a single computing platform. This was perhaps the earliest approach, when developers were implementing data warehouses on existing mainframes, minicomputers, or a single UNIX-based server.
Because all operations in the data acquisition, data storage, and information delivery areas take place on the same platform, this option hardly ever encounters any compatibility or interface problems. The data flows smoothly from beginning to end without any platform-to-platform conversions. No middleware is needed. All tools work in a single computing environment.
In many companies, legacy systems are still running on mainframes or minis. Some of these companies have migrated to UNIX-based servers and others have moved over to ERP systems in client/server environments as part of the transition to address the Y2K challenge. In any case, most legacy systems still reside on mainframes, minis, or UNIX-based servers. What is the relationship of the legacy systems to the data warehouse? Remember, the legacy systems contribute the major part of the data warehouse data. If these companies wish to adopt a single-platform solution, that platform of choice has to be a mainframe, mini, or a UNIX-based server.
If the situation in your company warrants serious consideration of the single-platform option, then analyze the implications before making a decision. The single-platform solution appears to be an ideal option. If so, why are not many companies adopting this option now? Let us examine the reasons.
Legacy Platform Stretched to Capacity. In many companies, the existing legacy computing environment may have been around for a couple of decades and may already be fully stretched to capacity. The environment may be at a point where it can no longer be upgraded further to accommodate your data warehouse.
Nonavailability of Tools. Software tools form a large part of the data warehouse infrastructure. You will clearly grasp this fact from the last few subsections of this chapter. Most of the tools provided by the numerous data warehouse vendors do not support the mainframe or minicomputer environment. Without the appropriate tools in the infrastructure, your data warehouse will fall apart.
Multiple Legacy Platforms. Although we have surmised that the legacy mainframe or minicomputer environment may be extended to include data warehousing, the practical fact points to a different situation. In most corporations, a combination of a few mainframe systems, an assortment of minicomputer applications, and a smattering of the newer PC-based systems exist side by side. The path most companies have taken is from mainframes to minis and then to PCs. Figure 8-3 highlights the typical configuration.
If your corporation is one of the typical enterprises, what can you do about a single-platform solution? Not much. With such a conglomeration of disparate platforms, a single-platform option having your data warehouse alongside all the other applications is just not tenable.
Figure 8-3 Multiple platforms in a typical corporation (legible diagram labels: mainframe, mini, UNIX)
Company’s Migration Policy. This is another important consideration. You very well know the varied benefits of the client/server architecture for computing. You are also aware of the fact that every company is changing to embrace this new computing paradigm by moving the applications from the mainframe and minicomputer platforms. In most companies, the policy on the usage of information technology does not permit the perpetuation of the old platforms. If your company has a similar policy, then you will not be permitted to add another significant system such as your data warehouse on the old platforms.
Hybrid Option. After examining the legacy systems and the more modern applications in your corporation, it is most likely that you will decide that a single-platform approach is not workable for your data warehouse. This is the conclusion most companies come to. On the other hand, if your company falls in the category where the legacy platform will accommodate your data warehouse, then, by all means, take the approach of a single-platform solution. Again, the single-platform solution, if feasible, is an easier solution.
For the rest of us who are not that fortunate, we have to consider other options. Let us begin with data extraction, the first major operation, and follow the flow of data until it is consolidated into load images and waiting in the staging area. We will now step through the data flow and examine the platform options.
Data Extraction. In any data warehouse, it is best to perform the data extraction function from each source system on its own computing platform. If your telephone sales data resides in a minicomputer environment, create extract files on the minicomputer itself for telephone sales. If your mail order application executes on the mainframe using an IMS database, then create the extract files for mail orders on the mainframe platform. It is rarely prudent to copy all the mail order database files to another platform and then do the data extraction.
Initial Reformatting and Merging. After creating the raw data extracts from the various sources, the extracted files from each source are reformatted and merged into a smaller number of extract files. Verification of the extracted data against source system reports and reconciliation of input and output record counts take place in this step. Just like the extraction step, it is best to do this step of initial merging of each set of source extracts on the source platform itself.
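The reconciliation of input and output record counts mentioned in this step can be sketched as follows. The extract names and record layout are invented for illustration, and the merge is deliberately trivial; the point is only the count check that runs after it.

```python
def merge_extracts(extracts):
    """Merge several raw extracts (name -> list of records) into one list."""
    merged = []
    for records in extracts.values():
        merged.extend(records)
    return merged

def reconcile_counts(extracts, merged):
    """Confirm that no record was lost or duplicated during the merge."""
    expected = sum(len(records) for records in extracts.values())
    actual = len(merged)
    if expected != actual:
        raise ValueError(
            f"record count mismatch: {expected} read, {actual} written"
        )
    return actual

# Hypothetical raw extracts from two files of the same source system
extracts = {
    "mail_orders_east": [{"order": 1}, {"order": 2}],
    "mail_orders_west": [{"order": 3}],
}
merged = merge_extracts(extracts)
count = reconcile_counts(extracts, merged)
print(count)
```

A mismatch raises an error rather than returning a flag, so a faulty merge stops the job before bad extract files move on to the staging area.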
Preliminary Data Cleansing. In this step, you verify the extracted data from each data source for any missing values in individual fields, supply default values, and perform basic edits. This is another step for the computing platform of the source system itself. However, in some data warehouses, this type of data cleansing happens after the data from all sources are reconciled and consolidated. In either case, the features and conditions of data from your source systems dictate when and where this step must be performed for your data warehouse.
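A minimal sketch of this step follows: it checks for missing values, supplies defaults, and applies one basic edit. The field names, the default values, and the rule that a quantity must be a non-negative whole number are all assumptions chosen for illustration.

```python
# Hypothetical defaults for fields that may arrive empty from the source
DEFAULTS = {"region": "UNKNOWN", "quantity": 0}

def cleanse(record):
    """Return a cleaned copy of the record plus a list of issues found."""
    cleaned = dict(record)
    issues = []

    # Missing values: supply the default and note that we did so
    for field, default in DEFAULTS.items():
        value = cleaned.get(field)
        if value is None or value == "":
            cleaned[field] = default
            issues.append(f"{field}: missing, default supplied")

    # Basic edit: quantity must be a non-negative whole number
    try:
        quantity = int(cleaned["quantity"])
        if quantity < 0:
            raise ValueError("negative quantity")
        cleaned["quantity"] = quantity
    except (TypeError, ValueError):
        issues.append("quantity: failed basic edit")

    return cleaned, issues

print(cleanse({"region": "", "quantity": "5"}))
print(cleanse({"region": "East", "quantity": "abc"}))
```

Returning the issues alongside the cleaned record keeps a trail of what was changed, which helps whichever platform runs the step to report data quality back to the source system owners.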
Transformation and Consolidation. This step comprises all the major data transformation and integration functions. Usually, you will use transformation software tools for this purpose. Where is the best place to perform this step? Obviously, not in any individual legacy platform. You perform this step on the platform where your staging area resides.
Validation and Final Quality Check. This step of final validation and quality check is a strong candidate for the staging area. You will arrange for this step to happen on that platform.
Creation of Load Images. This step creates load images for individual database files of the data warehouse repository. This step almost always occurs in the staging area and, therefore, on the platform where the staging area resides.
Figure 8-4 summarizes the data acquisition steps and the associated platforms. You will notice the options for the steps. Relate this to your own corporate environment and determine where the data acquisition steps must take place.
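The placement of the data acquisition steps described above can be written down as a small lookup structure. The names `Platform`, `ACQUISITION_STEPS`, and `steps_on` are made up for this sketch; the step-to-platform assignments follow the discussion in the text, with preliminary data cleansing listed under both platforms because it may run in either place.

```python
from enum import Enum

class Platform(Enum):
    SOURCE = "source system platform"
    STAGING = "staging area platform"

# Each step with the platform or platforms where it may be performed
ACQUISITION_STEPS = [
    ("data extraction", [Platform.SOURCE]),
    ("initial reformatting and merging", [Platform.SOURCE]),
    ("preliminary data cleansing", [Platform.SOURCE, Platform.STAGING]),
    ("transformation and consolidation", [Platform.STAGING]),
    ("validation and final quality check", [Platform.STAGING]),
    ("creation of load images", [Platform.STAGING]),
]

def steps_on(platform):
    """List the steps that may run on the given platform, in flow order."""
    return [name for name, platforms in ACQUISITION_STEPS if platform in platforms]

for platform in Platform:
    print(platform.value, "->", steps_on(platform))
```

Writing the mapping out this way for your own environment is a quick check that every step has a home and that only the steps with a genuine choice appear under more than one platform.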
Options for the Staging Area. In the discussion of the data acquisition steps, we have highlighted the optimal computing platform for each step. You will notice that the key steps happen in the staging area. This is the place where all data for the data warehouse come together and get prepared. What is the ideal platform for the staging area? Let us repeat that the platform most suitable for your staging area depends on the status of your source platforms. Nevertheless, let us explore the options for placing the staging area and come up with general guidelines. These will help you decide. Figure 8-5 shows the different options for the staging area. Please study the figure and follow the amplification of the options given in the subsections below.
In One of the Legacy Platforms. If most of your legacy data sources are on the same platform and if extra capacity is readily available, then consider keeping your data staging area in that legacy platform. In this option, you will save time and effort in moving the data across platforms to the staging area.
Figure 8-4 Platforms for data acquisition (legible diagram labels: source data platforms of mainframe, mini, and UNIX; staging area platform of UNIX or other; steps of data extraction, initial reformatting/merging, preliminary data cleansing, transformation/consolidation, validation/quality check, and load image creation)
On the Data Storage Platform. This is the platform on which the data warehouse DBMS runs and the database exists. When you keep your data staging area on this platform, you will realize all the advantages for applying the load images to the database. You may even be able to eliminate a few intermediary substeps and apply data directly to the database from some of the consolidated files in the staging area.
On a Separate Optimal Platform. You may review your data source platforms, examine the data warehouse storage platform, and then decide that none of these platforms are really suitable for your staging area. It is likely that your environment needs complex data transformations. It is possible that you need to work through your data thoroughly to cleanse and prepare it for your data warehouse. In such circumstances, you need a separate platform to stage your data before loading to the database.
Here are some distinct advantages of a separate platform for data staging:
앫 You can optimize the separate platform for complex data transformations and data cleansing. What do we mean by this? You can gear up the neutral platform with all the necessary tools for data transformation, data cleansing, and data formatting.
앫 While the extracted data is being transformed and cleansed in the data staging area, you need to keep the entire data content and ensure that nothing is lost on the way. You may want to think of some tracking file or table to contain tracking entries. A separate environment is most conducive for managing the movement of data.
앫 We talked about the possibility of having specialized tools to manipulate the data in the staging area. If you have a separate computing environment for the staging area, you could easily have people specifically trained on these tools running the separate computing equipment.
Figure 8-5 Platform options for the staging area (Options 1, 2, and 3, shown against the source data platforms and the data storage platform)
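The tracking file or table mentioned in the list above can be sketched in a few lines. This is only an illustration, assuming an in-memory SQLite database; the table name `staging_tracking`, its columns, and the helper functions are hypothetical and are not prescribed by the text. The idea is to record how many rows enter and leave each staging step so that any rows lost on the way show up as a mismatch.

```python
import sqlite3

def open_tracking(conn):
    # Hypothetical tracking table: one entry per staging step per batch.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS staging_tracking ("
        "batch_id TEXT, step TEXT, rows_in INTEGER, "
        "rows_out INTEGER, rows_rejected INTEGER)"
    )

def record_step(conn, batch_id, step, rows_in, rows_out, rows_rejected=0):
    # Write one tracking entry after a staging step completes.
    conn.execute(
        "INSERT INTO staging_tracking VALUES (?, ?, ?, ?, ?)",
        (batch_id, step, rows_in, rows_out, rows_rejected),
    )

def unaccounted_steps(conn, batch_id):
    # A step is suspect when rows_in differs from rows_out + rows_rejected.
    cur = conn.execute(
        "SELECT step FROM staging_tracking "
        "WHERE batch_id = ? AND rows_in <> rows_out + rows_rejected "
        "ORDER BY rowid",
        (batch_id,),
    )
    return [row[0] for row in cur.fetchall()]

conn = sqlite3.connect(":memory:")
open_tracking(conn)
record_step(conn, "batch-001", "transformation", 1000, 1000)
record_step(conn, "batch-001", "cleansing", 1000, 990, rows_rejected=10)
record_step(conn, "batch-001", "load_image", 990, 985)  # 5 rows unaccounted for
print(unaccounted_steps(conn, "batch-001"))
```

In this made-up batch, only the load image step is reported, because its 5 missing rows were neither passed on nor explicitly rejected.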
Data Movement Considerations. On whichever computing platforms the individual steps of data acquisition and data storage happen, data has to move across platforms. Depending on the source platforms in your company and the choice of the platform for data staging and data storage, you have to provide for data transportation across different platforms.
Please review the following options. Figure 8-6 summarizes the standard options. You may find that a single approach alone is not sufficient. Do not hesitate to have a balanced combination of the different approaches. In each data movement across two computing platforms, choose the option that is most appropriate for that environment. Brief explanations of the standard options follow.
Shared Disk. This method goes back to the mainframe days. Applications running in different partitions or regions were allowed to share data by placing the common data on a shared disk. You may adapt this method to pass data from one step to another for data acquisition in your data warehouse. You have to designate a disk storage area and set it up so that each of the two platforms recognizes the disk storage area as its own.
Mass Data Transmission. In this case, transmission of data across platforms takes place through data ports. Data ports are simply interplatform devices that enable massive quantities of data to be transported from one platform to the other. Each platform must be configured to handle the transfers through the ports. This option calls for special hardware, software, and network components. There must also be sufficient network bandwidth to carry high data volumes.
Figure 8-6 Data movement options (Option 1: shared disk; Option 2: mass transmission; Option 3: real-time connection; Option 4: manual methods)
Real-Time Connection. In this option, two platforms establish connection in real time so that a program running on one platform may use the resources of the other platform. A program on one platform can write to the disk storage on the other. Also, jobs running on one platform can schedule jobs and events on the other. With the widespread adoption of TCP/IP, this option is very viable for your data warehouse.
Manual Methods. Perhaps these are the options of last resort. Nevertheless, these options are straightforward and simple. A program on one platform writes to an external medium such as tape or disk. Another program on the receiving platform reads the data from the external medium.
C/S Architecture for the Data Warehouse. Although mainframe and minicomputer platforms were utilized in the early implementations of data warehouses, by and large, today’s warehouses are built using the client/server architecture. Most of these are multitiered, second-generation client/server architectures. Figure 8-7 shows a typical client/server architecture for a data warehouse implementation.
Figure 8-7 Client/server architecture for the data warehouse (application servers: middleware, connectivity, control, metadata management, Web access, authentication, query and report management, OLAP; data server: DBMS, primary data repository)
The data warehouse DBMS executes on the data server component. The data repository of the data warehouse sits on this machine. This server component is a major component and we want to dedicate the next section for a detailed discussion of it.
As data warehousing technologies have grown substantially, you will now observe a proliferation of application server components in the middle tier. You will find application servers for a number of purposes. Here are the important ones:
앫 To run middleware and establish connectivity
앫 To execute management and control software
앫 To handle data access from the Web
앫 To manage metadata
앫 For authentication
앫 As front end
앫 For managing and running standard reports
앫 For sophisticated query management
앫 For OLAP applications
Generally, the client workstations still handle the presentation logic and provide the presentation services. Let us briefly address the significant considerations for the client workstations.
Considerations for Client Workstations. When you are ready to consider the configurations for the workstation machines, you will quickly come to realize that you need to cater to a variety of user types. We are only considering the needs at the workstation with regard to information delivery from the data warehouse. A casual user is perhaps satisfied with a machine that can run a Web browser to access HTML reports. A serious analyst, on the other hand, needs a larger and more powerful workstation machine. The other types of users between these two extremes need a variety of services.
Do you then come up with a unique configuration for each user? That will not be practical. It is better to determine a minimum configuration on an appropriate platform that would support a standard set of information delivery tools in your data warehouse. Apply this configuration for most of your users. Here and there, add a few more functions as necessary. For the power users, select another configuration that would support tools for complex analysis. Generally, this configuration for power users also supports OLAP.
The factors for consideration when selecting the configurations for your users’ workstations are similar to the ones for any operating environment. However, the main consideration for workstations accessing the data warehouse is the support for the selected set of tools. This is the primary reason for the preference of one platform over another.
Use this checklist while considering workstations:
앫 Workstation operating system
… decision making, you will find that the platform choices may have to be recast. Figure 8-8 shows you what to expect as your data warehouse matures.
Options in Practice. Before we leave this section, it may be worthwhile to take a look at the types of data sources and target platforms in use at different enterprises. An independent survey has produced some interesting findings. Figure 8-9 shows the approximate percentage distribution for the first part of the survey about the principal data sources. In Figure 8-10, you will see the distribution of the answers to the question about the platforms the respondents use for the data storage component of their data warehouses.
Figure 8-8 Platform options as the data warehouse matures (Stage 1, initial: desktop clients, application server, data warehouse/data staging. Stage 2, growing: desktop clients, application server, data warehouse/data mart, data staging/development. Stage 3, matured: desktop clients, application servers, data marts, data staging, development, data warehouse/data mart)
Figure 8-9 Principal data sources
The need to scale is driven by a few factors. As your data warehouse matures, you will see a steep increase in the number of users and in the number of queries. The load will simply shoot up. Typically, the number of active users doubles in six months. Again, as your data warehouse matures, you will be increasing the content by including more business subject areas and adding more data marts. Corporate data warehouses start at approximately 200 GB and some shoot up to a terabyte within 18–24 months.
Hardware options for scalability and complex query processing consist of four types of parallel architecture. Initially, parallel architecture makes the most sense. Shouldn’t a query complete faster if you increase the number of processors, each processor working on parts of the query simultaneously? Can you not subdivide a large query into separate tasks and spread the tasks among many processors? Parallel processing with multiple computing engines does provide a broad range of benefits, but no single architecture does everything right.
In Chapter 3, we reviewed parallel processing as one of the significant trends in data warehousing. We also briefly looked at three more common architectures. In this section, let us summarize the current parallel processing hardware options. You will gain sufficient insight into the features, benefits, and limitations of each of these options. By doing so, you will be able to contribute your understanding to your project team for selecting the proper server hardware.
SMP (Symmetric Multiprocessing). Refer to Figure 8-11.
Features:
앫 This is a shared-everything architecture, the simplest parallel processing machine
앫 Each processor has full access to the shared memory through a common bus
앫 Communication between processors occurs through common memory
앫 Disk controllers are accessible to all processors
Benefits:
앫 This is a proven technology that has been used since the early 1970s
앫 Provides high concurrency. You can run many concurrent queries.
앫 Balances workload very well
앫 Gives scalable performance. Simply add more processors to the system bus.
앫 Because the design is simple, you can administer the server easily
Figure 8-11 Server hardware option: SMP (processors attached to a common bus)
Limitations:
앫 Available memory may be limited
앫 May be limited by bandwidth for processor-to-processor communication, I/O, and bus communication
앫 Availability is limited; like a single computer with many processors
You may consider this option if the size of your data warehouse is expected to be around two or three hundred gigabytes and concurrency requirements are reasonable.
Clusters. Refer to Figure 8-12.
Features:
앫 Each node consists of one or more processors and associated memory
앫 Memory is not shared among the nodes; it is shared only within each node
앫 Communication occurs over a high-speed bus
앫 Each node has access to the common set of disks
앫 This architecture is a cluster of nodes
Benefits:
앫 This architecture provides high availability; all data is accessible even if one node fails
앫 Preserves the concept of one database
앫 This option is good for incremental growth
Figure 8-12 Server hardware option: cluster (nodes connected by a common high-speed bus, with shared disks)
Limitations:
앫 Bandwidth of the bus could limit the scalability of the system
앫 This option comes with a high operating system overhead
앫 Each node has a data cache; the architecture needs to maintain cache consistency for internode synchronization. A cache is a “work area” holding currently used data; main memory is like a big file cabinet stretching across the entire room.
You may consider this option if your data warehouse is expected to grow in well-defined increments.
MPP (Massively Parallel Processing). Refer to Figure 8-13.
Features:
앫 This is a shared-nothing architecture
앫 This architecture is more concerned with disk access than memory access
앫 Works well with an operating system that supports transparent disk access
앫 If a database table is located on a particular disk, access to that disk depends entirely on the processor that owns it
앫 Internode communication is by processor-to-processor connection
Benefits:
앫 This architecture is highly scalable
앫 The option provides fast access between nodes
앫 Any failure is local to the failed node; improves system availability
앫 Generally, the cost per node is low
Limitations:
앫 The architecture requires rigid data partitioning
앫 Data access is restricted
앫 Workload balancing is limited
앫 Cache consistency must be maintained
Figure 8-13 Server hardware option: MPP
Consider this option if you are building a medium-sized or large data warehouse in the range of 400–500 GB. For larger warehouses in the terabyte range, look for special architectural combinations.
ccNUMA or NUMA (Cache-coherent Nonuniform Memory Architecture). Refer to Figure 8-14.
Features:
앫 This is the newest architecture; was developed in the early 1990s
앫 The NUMA architecture is like a big SMP broken into smaller SMPs that are easier to build
앫 Hardware considers all memory units as one giant memory. The system has a single real memory address space over the entire machine; memory addresses begin with 1 on the first node and continue on the following nodes. Each node contains a directory of memory addresses within that node.
앫 In this architecture, the amount of time needed to retrieve a memory value varies because the first node may need the value that resides in the memory of the third node. That is why this architecture is called nonuniform memory access architecture.
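The single address space and the nonuniform access time described in the features above can be pictured with a toy model. This is a sketch under stated assumptions, not a description of any real machine: every node is assumed to hold the same number of memory words, and the `local_cost` and `remote_cost` values are arbitrary illustrative units. The class and method names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class NumaMachine:
    nodes: int
    words_per_node: int      # assumption: every node holds the same amount
    local_cost: int = 1      # arbitrary cost units, for illustration only
    remote_cost: int = 4     # arbitrary cost units, for illustration only

    def home_node(self, address: int) -> int:
        # Addresses begin with 1 on the first node and continue node by node.
        if not 1 <= address <= self.nodes * self.words_per_node:
            raise ValueError("address outside the single address space")
        return (address - 1) // self.words_per_node + 1

    def access_cost(self, requesting_node: int, address: int) -> int:
        # Access is cheap when the value lives on the requesting node,
        # more expensive when it lives in another node's memory.
        if self.home_node(address) == requesting_node:
            return self.local_cost
        return self.remote_cost

machine = NumaMachine(nodes=4, words_per_node=1000)
print(machine.home_node(2500))         # address 2500 lives on node 3
print(machine.access_cost(1, 500))     # node 1 reading its own memory
print(machine.access_cost(1, 2500))    # node 1 reading node 3's memory
```

The two `access_cost` calls return different values for the same requesting node, which is the nonuniformity the architecture is named for.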
Benefits:
앫 Provides maximum flexibility
앫 Overcomes the memory limitations of SMP
앫 Better scalability than SMP
앫 If you need to partition your data warehouse database and run these using a centralized approach, you may want to consider this architecture. You may also place your OLAP data on the same server.
Figure 8-14 Server hardware option: NUMA
Limitations:
앫 Programming NUMA architecture is more complex than even with MPP
앫 Software support for NUMA is fairly limited
앫 Technology is still maturing
This option is a more aggressive approach for you. You may decide on a NUMA machine consisting of one or two SMP nodes, but if your company is inexperienced in hardware technology, this option may not be for you.
DATABASE SOFTWARE
Examine the features of the leading commercial RDBMSs. As data warehousing becomes more prevalent, you would expect to see data warehouse features being included in the software products. That is exactly what the database vendors are doing. Data-warehouse-related add-ons are becoming part of the database offerings. The database software that started out for use in operational OLTP systems is being enhanced to cater to decision support systems. DBMSs have also been scaled up to support very large databases.
Some RDBMS products now include support for the data acquisition area of the data warehouse. Mass loading and retrieval of data from other database systems have become easier. Some vendors have paid special attention to the data transformation function. Replication features have been reinforced to assist in bulk refreshes and incremental loading of the data warehouse.
Bit-mapped indexes could be very effective in a data warehouse environment to index on fields that have a smaller number of distinct values. For example, in a database table containing geographic regions, the number of distinct region codes is few. But frequently, queries involve selection by regions. In this case, retrieval by a bit-mapped index on the region code values can be very fast. Vendors have strengthened this type of indexing. We will discuss bit-mapped indexing further in Chapter 18.
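To make the region code example concrete, here is a minimal sketch of the idea behind a bit-mapped index. It is an illustration only and does not reflect how any particular RDBMS stores its indexes; the region codes and function names are made up. Each distinct value gets one bitmap, with bit i set when row i holds that value, so a selection becomes a cheap bitwise operation.

```python
from collections import defaultdict

def build_bitmap_index(values):
    # One bitmap (a Python int used as a bit set) per distinct value;
    # bit i is set when row i holds that value.
    index = defaultdict(int)
    for row_id, value in enumerate(values):
        index[value] |= 1 << row_id
    return dict(index)

def rows_matching(bitmap):
    # Expand a bitmap back into a list of row numbers.
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

# Hypothetical region codes for eight rows of a table.
regions = ["EAST", "WEST", "EAST", "NORTH", "SOUTH", "WEST", "EAST", "NORTH"]
index = build_bitmap_index(regions)

# Selecting two regions is a single bitwise OR of their bitmaps.
east_or_north = index["EAST"] | index["NORTH"]
print(rows_matching(index["EAST"]))
print(rows_matching(east_or_north))
```

With only four distinct region codes, the whole index is four small bitmaps, which is why this kind of index suits fields with few distinct values.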
Apart from these enhancements, the more important ones relate to load balancing and query performance. These two features are critical in a data warehouse. Your data warehouse is query-centric. Everything that can be done to improve query performance is most desirable. The DBMS vendors are providing parallel processing features to improve query performance. Let us briefly review the parallel processing options within the DBMS that can take full advantage of parallel server hardware.
Parallel Processing Options
Parallel processing options in database software are intended only for machines with multiple processors. Most of the current database software can parallelize a large number of operations. These operations include the following: mass loading of data, full table scans, queries with exclusion conditions, queries with grouping, selection with distinct values, aggregation, sorting, creation of tables using subqueries, creating and rebuilding indexes, inserting rows into a table from other tables, enabling constraints, star transformation (an optimization technique when processing queries against a STAR schema), and so on. Notice that this is an impressive list of operations that the RDBMS can process in parallel.
Let us now examine what happens when a user initiates a query at the workstation. Each session accesses the database through a server process. The query is sent to the DBMS and data retrieval takes place from the database. Data is retrieved and the results are sent back, all under the control of the dedicated server process. The query dispatcher software is responsible for splitting the work, distributing the units to be performed among the pool of available query server processes, and balancing the load. Finally, the results of the query processes are assembled and returned as a single, consolidated result set.
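The dispatcher's role described above (split the work, distribute the units, assemble one result set) can be sketched as follows. This is a simplified illustration, not the internals of any DBMS: a thread pool from Python's standard library stands in for the pool of query server processes, and the function names and sample rows are hypothetical. Python threads show the structure of the dispatch rather than true processor-level parallelism.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_units(rows, unit_count):
    # Divide the rows into roughly equal contiguous units of work.
    size = max(1, -(-len(rows) // unit_count))  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]

def scan_unit(unit, predicate):
    # Work done by one query server on its unit.
    return [row for row in unit if predicate(row)]

def dispatch_query(rows, predicate, server_count=4):
    units = split_into_units(rows, server_count)
    with ThreadPoolExecutor(max_workers=server_count) as pool:
        partial_results = pool.map(scan_unit, units, [predicate] * len(units))
        # Assemble the partial results into one consolidated result set.
        return [row for part in partial_results for row in part]

sales = [{"region": r, "amount": a} for r, a in
         [("EAST", 10), ("WEST", 25), ("EAST", 40), ("NORTH", 5), ("EAST", 15)]]
result = dispatch_query(sales, lambda row: row["region"] == "EAST")
print([row["amount"] for row in result])
```

Because `pool.map` returns results in the order the units were submitted, the consolidated result keeps the original row order even though the units may finish at different times.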
Interquery Parallelization. In this method, several server processes handle multiple requests simultaneously. Multiple queries may be serviced based on your server configuration and the number of available processors. You may successfully take advantage of this feature of the DBMS on SMP systems, thereby increasing the throughput and supporting more concurrent users.
However, interquery parallelism is limited. Let us see what happens here. Multiple queries are processed concurrently, but each query is still being processed serially by a single server process. Suppose a query consists of index read, data read, sort, and join operations; these operations are carried out in this order. Each operation must finish before the next one can begin. Parts of the same query do not execute in parallel. To overcome this limitation, many DBMS vendors have come up with versions of their products to provide intraquery parallelization.
Intraquery Parallelization. We will use Figure 8-15 for our discussion of intraquery parallelization, so please take a quick look and follow along. This will greatly help you in matching up your choice of server hardware with your selection of RDBMS.
Let us say a query from one of your users consists of an index read, a data read, a data join, and a data sort from the data warehouse database. A serial processing DBMS will process this query in the sequence of these base operations and produce the result set. However, while this query is executing on one processor in the SMP system, other queries can execute in parallel. This method is the interquery parallelization discussed above. The first group of operations in Figure 8-15 illustrates this method of execution.
Using the intraquery parallelization technique, the DBMS splits the query into the lower-level operations of index read, data read, data join, and data sort. Then each one of these basic operations is executed in parallel on a single processor. The final result set is the consolidation of the intermediary results. Let us review three ways a DBMS can provide intraquery parallelization, that is, parallelization of parts of the operations within the same query itself.
Horizontal Parallelism. The data is partitioned across multiple disks. Parallel processing occurs within each single task in the query, for example, data read, which is performed on multiple processors concurrently on different sets of data to be read from multiple disks. After the first task is completed from all of the relevant parts of the partitioned data, the next task of that query is carried out, and then the next one after that task, and so on. The problem with this approach is the wait until all the needed data is read. Look at Case A in Figure 8-15.
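The pattern of horizontal parallelism can be sketched in a few lines. This is an illustration under stated assumptions, not a DBMS implementation: three in-memory lists stand in for data partitioned across disks, a standard-library thread pool stands in for the processors, and the task functions and sample rows are hypothetical. Each task runs on all partitions at once, and the next task starts only after every partition has finished the current one.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

def run_task_on_partitions(task, partitions, pool):
    # One task runs on every partition concurrently; list() waits for all
    # partitions to finish, which is the wait the text describes.
    return list(pool.map(task, partitions))

def data_read(partition):
    # First task: read the qualifying rows from one partition.
    return [row for row in partition if row["amount"] > 0]

def data_sort(rows):
    # Second task: sort the rows read from one partition.
    return sorted(rows, key=lambda row: row["amount"])

# Hypothetical table partitioned across three disks.
partitions = [
    [{"id": 1, "amount": 30}, {"id": 2, "amount": 0}],
    [{"id": 3, "amount": 10}],
    [{"id": 4, "amount": 20}, {"id": 5, "amount": 5}],
]

with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
    read_results = run_task_on_partitions(data_read, partitions, pool)
    sorted_parts = run_task_on_partitions(data_sort, read_results, pool)

# Consolidate the per-partition results into one ordered result set.
final = list(heapq.merge(*sorted_parts, key=lambda row: row["amount"]))
print([row["id"] for row in final])
```

The sort cannot begin on any partition until the slowest read has completed, so one slow disk holds up the whole query; that is the drawback noted above.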