
Copy of DATA WAREHOUSING FUNDAMENTALS FOR IT PROFESSIONALS, 2nd edition, 2010 (Bach Khoa document library)



DOCUMENT INFORMATION

Basic information

Pages: 602
File size: 3.81 MB

Contents

DATA WAREHOUSING FUNDAMENTALS FOR IT PROFESSIONALS
Second Edition
PAULRAJ PONNIAH

Copyright © 2010 by John Wiley & Sons, Inc. All rights reserved.

Published by John Wiley & Sons, Inc., Hoboken, New Jersey. Published simultaneously in Canada.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permission.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.

Library of Congress Cataloging-in-Publication Data:

Ponniah, Paulraj.
Data warehousing fundamentals for IT professionals / Paulraj Ponniah. -- 2nd ed.
p. cm.
Previous ed. published under title: Data warehousing fundamentals.
Includes bibliographical references and index.
ISBN 978-0-470-46207-2 (cloth)
Data warehousing. I. Ponniah, Paulraj. Data warehousing fundamentals. II. Title.
QA76.9.D37P66 2010
005.740 5--dc22 2009041789

Printed in the United States of America.

To Vimala, my loving wife and to Joseph, David, and Shobi, my dear children

CONTENTS

PREFACE

PART 1 OVERVIEW AND CONCEPTS

THE COMPELLING NEED FOR DATA WAREHOUSING
  CHAPTER OBJECTIVES
  ESCALATING NEED FOR STRATEGIC INFORMATION
    The Information Crisis
    Technology Trends
    Opportunities and Risks
  FAILURES OF PAST DECISION-SUPPORT SYSTEMS
    History of Decision-Support Systems
    Inability to Provide Information
  OPERATIONAL VERSUS DECISION-SUPPORT SYSTEMS
    Making the Wheels of Business Turn
    Watching the Wheels of Business Turn
    Different Scope, Different Purposes
  DATA WAREHOUSING: THE ONLY VIABLE SOLUTION
    A New Type of System Environment
    Processing Requirements in the New Environment
    Strategic Information from the Data Warehouse

GLOSSARY

Ad Hoc Query: A query that is not predefined or anticipated, usually run just once. Ad hoc queries are typical in a data warehouse environment.
ADO: ActiveX Data Objects. Microsoft’s software interface mechanism for distributed Web applications.
Agent Technology: A technology where specialized software modules act to produce desired results based on
specified events. The software is structurally transparent to users.
Agile Development: A term coined in 2001 with the formulation of the Agile Manifesto to refer to software development methodologies based on iterative development where requirements and solutions evolve through collaboration among self-organizing, cross-functional teams.
Alert System: A software system that notifies and alerts users when particular events take place, such as some business indicator exceeding a preset threshold value.
API: Application Program Interface. A functional interface, usually supplied by the operating system, that allows one application program to communicate with another program for receiving services. APIs are generally implemented through function calls.
ASCII: American Standard Code for Information Interchange. A standard code, consisting of a set of 7-bit coded characters, used for information exchange between computer and communication systems.
BAM: Business Activity Monitoring refers to the collection, aggregation, analysis, and presentation of real-time information about activities within an organization.
Bitmapped Indexing: A sophisticated and fast indexing technique using the binary values of individual bits to indicate values of attributes in relational database tables. This technique is very effective in a data warehouse for low-selectivity data, that is, for attributes that have only a few distinct values.
BLOB: Binary Large Object. Very large binary representation of multimedia objects that can be stored and used in some enhanced relational databases.
B-Tree Indexing: A hierarchical indexing technique based on an inverted tree of nodes containing ranges of indexed values. Going down the hierarchical levels, the nodes progressively contain smaller numbers of index values, so that any value may be searched for in a few trials by starting at the top.
Buffer: A region of computer memory that holds data being transferred from one area to another. Data from database tables are fetched into memory buffers. Memory buffer management is crucial for system performance.
Business Intelligence (BI): Generally used synonymously with the information available in an enterprise for making strategic decisions.
Cache: A method for improving system performance by creating a secondary memory area with access speeds closer to the processor speed. A disk cache is a section of main memory set apart to cache data from the disk. A memory cache is a section of high-speed memory to cache data from main memory.
CASE: Computer-Aided Software Engineering. CASE tools or programs help develop software applications. A set of tools may include code generators, data modeling tools, analysis and design tools, and tools for documenting and testing applications.
CGI: Common Gateway Interface. A standard program interface in Web servers to connect different computer applications. In Web technology, the interface is designed to connect client applications to remote relational databases.
CIO: Chief Information Officer. The executive who heads the information services division of an organization. The CIO, usually reporting directly to the Chief Executive Officer, has the responsibility for all the organization’s computing and data communications.
Clickstream: Recording of what the user clicks on while browsing the Web. Clickstream data provides insight into what a visitor is interested in at a Website.
Clustering: A method of keeping database files physically close to one another on the storage media for improving performance through sequential pre-fetch operations.
Composite Key: A key for a database table made up of more than one attribute or field.
Conformed Dimensions: Two sets of business dimensions represented in dimension tables are said to be conformed if both sets are identical in their attributes or if one set is an exact subset of the other. Conformed dimensions are fundamental in the bus architecture for a family of STARS.
CORBA: Common Object Request Broker Architecture. Developed by the Object Management Group (OMG) to provide interoperability and portability for objects over a network of heterogeneous, distributed systems in a multiplatform environment. The object request broker is a software mechanism for switching messages between objects.
CRM: Customer Relationship Management. Refers to the set of procedures and computer applications designed to manage and improve customer service in an enterprise. Data warehousing, with integrated data about each customer, is eminently suitable for CRM.
Crosstab: Refers to cross tabulation of data in tabular format with totals of columns and rows for summarization and analysis.
C/S (Client/Server) Computing: A distributed methodology for building applications where the server machines provide services to users at the client machines. The servers may manage communications, run applications, or provide database services.
Dashboard: A visual display mechanism to enable business users at every level to receive the information they need to make better decisions that improve business performance.
Database: A repository where an ordered and integrated collection of the enterprise data is stored for computer processing and information sharing.
Data Mart: A collection of related data from internal and external sources, transformed, integrated, and stored for the purpose of providing strategic information to a specific set of users in an enterprise.
Data Mining: A data-driven approach to analysis and prediction by applying sophisticated techniques and algorithms to discover knowledge.
Data Visualization: Technique for presentation and analysis of data through visual objects, such as graphs, charts, images, and specialized tabular formats.
Data Warehouse: A collection of transformed and integrated data, stored for the purpose of providing strategic information to the entire enterprise. In the top-down approach of implementation, it represents a centralized repository along with a set of dependent data marts. In the bottom-up approach of implementation, it may represent a set of independent data marts. In a practical approach built with conformed dimensions and facts, it represents a unified set of conformed data marts.
Data Warehouse Appliance: Consists of an integrated set of servers, storage, operating systems, DBMS, and software specifically pre-installed and pre-optimized for data warehousing.
DBMS: Database Management System. A software system to store, access, maintain, manage, and safeguard data in databases.
DD: Data Dictionary. A catalog or directory in a database management system that stores the data structures and relationships.
DDL: Data Definition Language. A component in a database management system used for defining data structures in the data dictionary.
Dimension Table: In the dimensional data model, each dimension table contains the attributes of a single business dimension. Product, store, salesperson, and promotional campaign are examples of business dimensions along which business measurements or facts are analyzed.
DOLAP: Desktop OLAP. A variation of ROLAP (relational online analytical processing). In the DOLAP model, multidimensional cubes are created and sent to the desktop machine where the DOLAP software exists to process the cubes.
Drill Down: Method of analysis for retrieving lower levels of detailed data starting from summary data.
DSS: Decision Support System. Application that enables users to make strategic decisions.
EAI: Enterprise Application Integration is a set of technologies and services that form a framework to enable integration of systems and applications across the enterprise.
EBCDIC: Extended Binary-Coded Decimal Interchange Code. A coded character set of 256 eight-bit characters commonly used in mainframe systems.
EII: Enterprise Information Integration is a process using data abstraction to provide a single interface for viewing all the data within an organization so that heterogeneous data sources may appear to a user as a single, homogeneous data source.
EIS: Executive Information Systems. Applications specially designed for senior executives to perform information look-up and trend analysis.
E-R Data Modeling: A popular data modeling technique used for representing business entities and the relationships among them.
ERP Systems: Enterprise Resource Planning Systems. Large packaged applications offered by leading vendors, such as SAP and PeopleSoft. ERP applications are built with proprietary software and they usually cover the entire range of a company’s business.
Extranet: Enterprise network using Web technologies for collaboration of internal users and selected external business partners.
Fact Table: In the dimensional data model, the middle table that contains the facts or metrics of the business as attributes in the table. Sales units, sales dollars, costs, and profit margin are examples of business metrics that are analyzed.
Fat Client: In the client/server architecture, a client workstation that can process both application logic and presentation services.
Fine-tuning: The application of software and procedures for performance improvement of a computing system.
Firewall: A computer system placed between the Internet and an internal subnet of an enterprise to prevent unauthorized outsiders from accessing internal data.
Foreign Key: An attribute in a relational table used for establishing the direct relationship with another table, known as the parent table. The values for the foreign key attribute are drawn from the primary key values of the parent table.
Forward Engineering: The process of transforming a logical data model into a physical schema of a target database. CASE tools used for data modeling have facilities for forward engineering.
4GL: Fourth Generation Language. High-level, nonprocedural language for data manipulation, generally used with relational databases.
Gateway: A generic term referring to a computer system that routes data or merges two dissimilar services together.
Granularity: Indicates the level or grain of data. Detailed data have low granularity.
GUI: Graphical User Interface. An intuitive user interface consisting of windows, pointing devices, pull-down menus, drag-and-drop facilities, and icons. GUI has replaced the earlier CUI (Character User Interface).
HOLAP: Hybrid Online Analytical Processing. An approach to analytical processing that combines the MOLAP and ROLAP techniques.
Homonyms: Two or more data elements having the same name but containing different data.
HTML: HyperText Markup Language. A standard for defining and creating Web documents.
HTTP: HyperText Transfer Protocol. A communications protocol of the Web that governs the exchange of HTML-coded documents between a Website and a browser.
IMS: Information Management System. IBM’s hierarchical database management system. Old applications in several companies are still supported by IMS databases.
Indexing: The method for speeding up database access by creating index files that point to data files.
Intranet: The enterprise network using Web technologies for collaboration of internal users.
I/O: Input/Output. Abbreviation used to indicate a database read/write operation. Excessive I/O degrades system performance.
IT: Information Technology. Covers all the computing and data communications in an enterprise. The CIO is responsible for IT operations of the enterprise.
JAD: Joint Application Development. A methodology for developing computer applications in which IT professionals and end-users cooperate and participate in the development.
Java: An object-oriented programming language that offers full interactivity with the Web.
JDBC: Java Database Connectivity. Java provides access to SQL through JDBC.
Join: A database operation used to merge data from two related tables that have common attributes.
KDD: Knowledge Discovery in Data. The process for discovering knowledge from prepared data by using data mining algorithms.
KM: Knowledge Management. A computing environment for accumulating, encoding, storing, and managing enterprise knowledge.
LAN: Local Area Network. The physical network links that connect computing devices located within a small area.
Legacy Systems: Old computer applications on disparate platforms, supported by outdated database systems that are still in use in many companies, as legacies from the past.
Load Image: Record layout in a flat file for loading a database table generally by using database utility programs. The record layout of the flat file is an exact match or image of the table.
Log: A file used by the database management system to record all database transactions. The log file is used for recovery of the database in case of failures.
Mainframe: A large computer system with extensive capabilities and resources, usually housed in a large computer center.
MDAPI: Multi-Dimensional Application Programmers Interface. The standard developed by the OLAP Council.
MDDB: Multidimensional Database. A proprietary database meant to store multidimensional data cubes. MDDBs use multidimensional arrays to store data.
MDM: Master Data Management comprises a set of processes and tools that consistently define and manage the nontransactional data entities of an organization such as Customers and Products.
Metadata: Data about data itself. For example, in a database, metadata refers to the data type, length, format, default values, and so on: all information or data about the data itself.
Middleware: The term refers to software services that are placed between applications and database servers to make the data interchange transparent and efficient.
Mission Critical System: A software application that is absolutely essential for the continued operation of an organization.
MOLAP: Multidimensional Online Analytical Processing. An analytical processing technique in which multidimensional data cubes are created and stored in separate proprietary databases.
MPP: Massively Parallel Processing. Shared-nothing architecture for parallel server hardware where memory and disks are not shared among the processor nodes.
NUMA: Nonuniform Memory Architecture. Recent architecture for parallel server hardware, which is like a big SMP broken down into smaller SMPs.
ODBC: Open Database Connectivity. A programming interface from Microsoft that provides a common language interface for Windows applications to access databases on a network.
OLAP: Online Analytical Processing. Covers a wide spectrum of complex multidimensional analysis involving intricate calculations and requiring fast response times.
OLTP: Online Transaction Processing. Processes in applications that collect data online during the execution of business transactions. Order processing is an OLTP application.
Operational System: An application that supports the day-to-day operations of a business.
Outlier: In analysis, an outlier is an observation or member of a sample that deviates significantly from the rest of the sample.
Partitioning: The method for dividing a database into manageable parts for the purpose of easier management and better performance.
Portability: Refers to the ability of a piece of software to be moved around and made to function on different computing platforms.
Predictive Analysis: Includes a variety of statistical and data mining techniques to analyze historical and current data to make predictions about the future.
Primary Key: One or more fields or attributes that uniquely identify each record in a database table.
Punch Card: A card medium that was used to store data in old computer systems. Each card had 80 columns and several rows to store data.
Query: A computing function that requests data from the database, stating the parameters and constraints for the request.
Query Governor: A mechanism, usually in the DBMS, used to monitor and intercept runaway queries that might bring down the database system.
RAID: Redundant Array of Inexpensive Disks. A system of disk storage where data is distributed across several drives for faster access and improved fault tolerance.
RDBMS: Relational Database Management System. A software system for relational databases.
Referential Integrity: Refers to two relational tables that are directly related. Referential integrity between related tables is established if non-null values in the foreign key field of the child table are primary key values in the parent table.
Replication: A method for creating copies of the database, either in real time or in a deferred mode.
Reverse Engineering: The process of transforming the physical schema of any particular database into a logical model. Data modeling CASE tools have facilities for reverse engineering.
ROA: Return on Assets. Measure of payback from a project for assets deployed.
ROI: Return on Investment. Measure of payback from a project for investment made.
ROLAP: Relational Online Analytical Processing. An analytical processing technique in which multidimensional data cubes are created on the fly by the relational database engine.
Roll Up: Method of analysis for retrieving higher levels of summary data starting from detailed data.
Scalability: The ability to support increasing numbers of users in cost-effective increments without adversely affecting business operations.
Schema: A collection of tables that forms a database.
Scorecard: Online, real time reporting to monitor performance against targets.
Slice and Dice: The term commonly used for a method of analysis where multidimensional data is presented in many ways, by rotating the presentation between columns, rows, and pages.
SMP: Symmetric Multiprocessing. Shared-everything architecture for parallel server hardware where memory and disks are shared among the processor nodes.
Snowflake Schema: A normalized version of the STAR schema in which dimension tables are partially or fully normalized. Not generally recommended because it compromises query performance and simplicity for understanding.
Sparsity: Indicates the condition in the data warehouse in which every fact table record in a dimensional model is not necessarily filled with data.
SQL: Structured Query Language. Has become the standard interface for relational databases.
Standardized Facts: Two sets of facts represented in fact tables in a family of STARS are said to be standardized if both sets are identical in their attributes or if one set is an exact subset of the other. Standardized facts are fundamental in the bus architecture for a family of STARS.
STAR Schema: The arrangement of the collection of fact and dimension tables in the dimensional data model, resembling a star formation, with the fact table placed in the middle surrounded by the dimension tables. Each dimension table is in a one-to-many relationship with the fact table.
Stored Procedure: A software program stored in the database itself to be executed on the server based on stipulated conditions.
Subschema: A collection of external user views of a database.
Supermart: The term commonly applied to a data mart with conformed dimensions and standardized facts.
Surrogate Key: An artificial key field, usually with system-assigned sequential numbers, used in the dimensional model to link a dimension table to the fact table. In a dimension table, the surrogate key is the primary key which becomes a foreign key in the fact table.
Syndicated Data: Data that can be purchased from outside commercial sources to augment the data in the enterprise data warehouse.
Synonyms: Two or more data elements containing the same data but having different names.
Table Space: Refers to an area on a physical medium where one or more relational database tables can exist.
TCP/IP: Transmission Control Protocol/Internet Protocol. Basic communication protocol with an applications layer (TCP) and a network layer (IP).
Thin Client: In the client/server architecture, a client workstation that manages only the graphical user interface.
Threads: A thread is a unit of task under the control of a single computing process that can be implemented within a
server process, or by means of an operating system service.
Time Stamping: In applications, the procedure of marking each database record with the date and time of the database operation such as insert, update, or designation to delete.
Trigger: A stored procedure that can be triggered and executed automatically when a database operation such as insert, update, or delete takes place.
UDF: User-Defined Functions in advanced database systems.
UDT: User-Defined Data Types in advanced database systems.
UNIX: A multiuser, multitasking robust operating system originally developed by Bell Laboratories.
Volatility: Data are said to be highly volatile if they are subject to frequent additions, updates or deletes.
VSAM: Virtual Storage Access Method. A powerful storage and data access method, very popular on mainframes before databases became prevalent.
WAN: Wide Area Network. The physical network links that connect computing devices spread out across a large area. In a global organization, a WAN connects the users across continents.
XML: eXtensible Markup Language. Was introduced to overcome the limitations of HTML. XML is extensible, portable, structured, and descriptive.

INDEX

Additive measures, 237
Agent technology, 59
Aggregates
  fact tables, 266–270
  need for, 266
Aggregation
  goals, 271
  options, 271–272
  suggestions for, 271–272
Alert systems, 188
Analysis, 359
Analytical processing, see OLAP
Applications, 359–360
Architecture, 141–160
  characteristics of, 143–146
  definitions for, 141–142
  framework, 146–148
  in major areas, 142–143
  technical, 148–156
  types of
    centralized, 32, 156
    data-mart bus, 32, 160
    federated, 33, 159
    hub-and-spoke, 33, 159
    independent data marts, 34, 156
Archived data, 36
Backup, see also Recovery
  reasons for, 505
  schedule, 506–507
  strategy, 505–506
Best practices, real-world examples of
  airlines, 549
  credit union, 552–553
  health care, 550
  home improvement retail, 552
  international shipping and delivery, 551
  life insurance, 553
  phone service, 552
  rail services, 551
  securities, 551
  specialty textiles, 550
  telecommunications, 553
  travel, 550
Bitmapped index, 481–482. See also Indexing
Bottom-up approach, 30–31
Browser, Web, 418–419
B-Tree index, 479–480. See also Indexing
Business data, dimensional nature of, 101–102
Business dimensions
  examples of, 102–103
  hierarchies, 106–107
Business intelligence
  evolution of, 18–19
  two environments, 19
Business metrics, 107
Business requirements, see Requirements
CASE tools, 232
Categories, in dimensions, 106–107
Checkpoint, for data loads, 493
Cleansing, see Data cleansing
Client/server architecture, 174–175
Cluster detection, in data mining, 440–443
Clustered indexes, 482
Clustering, 466
Clusters, server hardware, 179
Codd, E.F., OLAP guidelines, 380–382
Computing environment, 167–168
Conformed dimensions, 275–276
Costs, see Project costs
CRM (Customer Relationship Management), 63
Cubes, multidimensional data, 384, 386–387. See also OLAP
DDL (Data Definition Language), 469–470
Data
  archived, 36
  external, 36–37
  internal, 35–36
  spatial, 52
  unstructured, 51–52
Data acquisition
  data flow, 150–151
  functions and services, 151–152
Database management system, see DBMS
Database software, 181–184
Data cleansing
  decision, 229–230
  pollution discovery for, 330
  practical tips, 334
  purification process, 333–334
Data Design
  CASE tools for, use of, 232
  decisions, 226
  dimensional modeling in, 226–230
Data dictionary, 470
Data extraction
  overview, 37–38
  techniques
    deferred extraction, 292–293
    evaluation of, 294–295
    immediate extraction, 290–292
  tools, 186–188
Data granularity, 28, 236–237, 238–239
Data integration, 299–301
  other approaches
    enterprise application integration (EAI), 312
    enterprise information integration (EII), 311
Data loading
  applying data in, 303–305
  dimension tables, procedure for, 306–307
  fact tables, history and incremental, 307–308
  full refresh in, 305
  incremental loads, 305
  initial load, 303–304
  overview, 39
Data marts, 29–32
Data model, 123–125
Data movement, considerations for, 173–174
Data mining, 429–459
  applications, 454–459
    banking and finance, 459
    biotechnology, 457–459
    business areas, 452–453
    CRM (Customer Relationship Management), 454–455
    retail, 455–456
    telecommunications, 456–457
  aspects, 437
    association rules
    outlier analysis
    predictive analytics
  benefits of, 453–454
  and data warehouse, 438–439
  defined, 431–432
  knowledge discovery, 432–435
    phases, 433–434
  OLAP versus, 435–436
  techniques
    comparison of, 451
    cluster detection, 440–443
    decision trees, 443–444
    genetic algorithms, 448–450
    link analysis, 445–447
    memory-based reasoning, 444–445
    neural networks, 447–448
  tools, evaluation of, 451–452
Data pollution, 320–323. See also Data quality
  cryptic values, 322
  data aging, 324
  fields, multipurpose, 322
  identifiers, nonunique, 322
  source, 129, 323–324
Data quality
  benefits of, 319–320
  data accuracy, compared with, 317
  dimensions, 318–319
  explained, 316–317
  framework, 330–331
  initiative, 328–333
  MDM (Master Data Management)
    benefits, 335
    categories, 335
  names and addresses, validation of, 325
  problem types, 129, 320–323
  tools, features of, 187, 326–327
Data refresh, 305–306. See also Data loading
Data replication, 290–292
Data sources, 34–37
Data staging, 37–39
Data storage
  architecture, technical, 152–154
  component, 39–40
  sizing, 132–133, 477
Data transformation
  implementation, 301–302
  integration in, 299–301
  overview, 38
  problems
    entity identification, 300
    multiple sources, 300–301
  tasks, 296–297
  tools, 187
  types, 297–298
Data visualization, 52–54
Data warehouse
  components, 34–41
  data content, 144–145
  data granularity, 28
  definitions for, 15–17, 23–24
  development approaches for, 29–32
  integrated data in, 25–26
  nonvolatile data in, 27–28
  subject-oriented data in, 24–25
  time-variant data in, 26–27
Data warehousing
  environment, a new type of, 13–14
  market, 47–48
  movement, 17–18
    milestones, 17
  products, 48–50
  solution, only viable, 13
  uses, 3–4, 47–48
Data Webhouse, 422. See also World Wide Web
DBMS, selection of, 132, 184
Decision-support systems
  failures of, past, –10
  history of, 10
  progression of, 430–431
  scope and purpose, 12–13
Decision trees, in data mining, 443–444
Degenerate dimensions, 237–238
Deployment, 489–508
  activities
    desktop readiness, 493–494
    initial loads, 492–493
    initial user training, 494–495
    user acceptance, 491–492
  of backup, 505–507
  of pilot system, 497–502
  of recovery, 507
  of security, 502–504
  in stages, 495–497
  testing, 490
Dimensions
  conforming of, 275–276
  degenerate, 237–238
  junk, 258–259
  large, 255–256
  rapidly changing, 256–258
  slowly changing, 250–255
    type changes (error corrections), 251–252
    type changes (preserving history), 252–253
    type changes (soft changes), 253–255
Dimensional modeling
  design
    dimension table, 229–230
    fact table, 228–229
  entity-relationship modeling, compared with, 230–231
Dimension table, 229–230, 234–236
DOLAP (desktop ROLAP), 393
Drill-down analysis, 233–234, 390–392
DSS, see Decision-support systems
EAI (Enterprise Application Integration). See Data integration
EII (Enterprise Information Integration). See Data integration
EIS (Executive Information Systems), 10, 40, 360
End-users, see Users
Entity-relationship modeling versus dimensional modeling, 230–231
ETL (Data Extraction, Transformation, and Loading)
  considerations for, in requirements, 127–129
  management, 522–523
  metadata, 309–310
  overview, 37–39, 282–284
  steps, 284–285
  tool options, 308–309
External data, 36–37
Extranet, 409, 422
Facts, 228–229
Fact table, 228–229, 236–238, 264–265
  factless, 238
Fat client, in OLAP architecture, 398
Fault tolerance, in RAID technology, 476
Fine-tuning, ongoing, 524
Foreign keys, in STAR schema, 240–241
Genetic algorithms, in data mining, 448–450
Granularity, see Data granularity
Hardware
  selection, guidelines for, 166–168
  server options
    clusters, 179
    MPP, 180
    NUMA, 180–181
    SMP, 178–179
Helpdesk, in support structure, 520
Hierarchies, in dimension table, 235, 256–257
Hypercubes, 386–390. See also OLAP
Indexing, 477–483
  B-Tree Index, 479–480
  bitmapped index, 480–482
  for dimension table, 483
  for fact table, 482
Information crisis, technology growth, 6–
Information delivery
  architecture, technical
    data flow, 155–156
    functions and services, 155–156
  BAM (Business Activity Monitoring), 366–367
  in business areas, 345–346
  components, 40–41
  dashboards and scorecards, 367–371
  enhancement, 523–524
  methods, 40
    analysis, 359
    applications, 359–360
    queries, 357–358
    reports, 358–359
  operational systems, differences, 342–344
  real time, 135
  tools, selection of, 361–364
  user-information interface, 347–348
  users, to broad class of, 354–355
Information packages
  dimensional model, basis for, 225–230
  examples, 108
  purpose and contents, 104–105
  requirements document, essential part of, 118
Information potential, data warehouse
  business areas, for, 345–347
  plan-assess-execute loop, 344–345
Infrastructure, 163–190
  computing environment, 167–168
  database software, 181–184
  hardware, 166–167
  operational, 165
  physical, 165–166
  platforms, computing, 168–177. See also Platform options
  server options, 178–181
  tools, features of, 186–188
Inmon, Bill
  data warehouse, definition of, 23
  data warehouse or data mart, 29
Internal data, 35–36
Internet, 409, 422, 425
Interviews, 109. See also JAD
Intranet, 409, 410, 422
JAD (Joint Application Development), 113–115. See also Interviews
Junk dimensions, 258–259
Kelly, Sean
  data warehouse, definition of, 24
Keys, STAR schema, 239–241
Kimball, Ralph
  data Webhouse, 408, 422
KM (Knowledge Management), 61–63
Knowledge discovery systems, see Data mining
Loading, see Data loading
Logging, 290–291, 507–508
Logical model, and physical model, 463, 469–470
Mainframe, 167–168. See also Hardware
Managed query system, 357–358
Managed report system, 358–359
Management
  of data warehouse, 520–524
  tools, features of, 188
MDDB (Multidimensional Database), 394, 398
MDS (Multidimensional Domain Structure), 387–390. See also OLAP
Memory-based reasoning, in data mining, 444–445
Metadata, 193–220
  business, 207–209
  challenges, 215
  data acquisition, 204–205
  data storage, 205–206
  end-user, 198–199
  ETL, 310–311
  implementation options, 218–219
  information delivery, 206–207
  IT, 199–200
  repository, 215–217
  requirements, 212–213
  sources, 213–215
  standards, initiatives for, 65, 217
  tasks, driven by, 200–202
  technical, 209–212
Metadata Coalition, 65, 217
Middleware, 188
MOLAP (Multidimensional OLAP), 394–395
MPP (Massively Parallel Processing), 180
Multidimensional analysis, 374–375, 383–386. See also OLAP
Naming standards, 471–472
Neural networks, in data mining, 447–448
NUMA (Nonuniform Memory Architecture), 180–181
OLAP (Online Analytical Processing), 373–406
  calculations, powerful, 375–376
  definition of, 380
  drill-down and roll-up, 390–392
  guidelines, Codd, E.F., 380–382
  hypercubes, 386–390
  implementation, examples of, 404
  models
    MOLAP, 394–395
    ROLAP, 395–397
    ROLAP versus MOLAP, 397–398
  multidimensional analysis, 383–386
  options
    architectural, 394–397
    platforms, 402
  results, online display of, 384–385
  slice-and-dice, 392–393
  standards, initiatives for, 65–66
  tools, features of, 186–188, 403
OLAP Council, 65, 380
OLTP (Online Transaction Processing) systems, 19
OMG (Object Management Group), 65, 217
Operational systems, 12
Operating systems, 166–167
Parallel hardware, 178–181
Parallel processing
  database software in, 181–184
  implementation of, 54–56
  performance improvement by, 484–485
  of queries, 182–184
Partitioning, data, 483–484
Performance, improvement of
  data arrays, use of, 486
  data partitioning, 483–484
  DBMS, initialization of, 485–486
  indexing, 477–483
  parallel processing, 484–485
  referential integrity checks, suspension of, 485
  summarization, 485
Physical design, 463–483
  block usage, 475
  data partitioning, 483–
484 indexing, 477–483 objectives, 467–469 RAID, use of, 476 scalability, provision for, 468 steps, 464–467 storage, allocation of, 473–476 Physical model, 463, 469–471 Pilot system, 497–502 See also Deployment choices, 500–502 types, 498– 500 usefulness, 497–498 Planning, see Project plan Platform options client/server, 174–175 for data movement, 173–174 as data warehouse matures, 176 hybrid, 169–171 for OLAP, 402 570 INDEX Platform options (Continued ) in staging area, 171 –173 single platform, 168– 169 Post-deployment review, 512 Primary keys, in STAR schema, 239 –240 Project, differences from OLTP, 80 Project costs, 77–78 Project life cycle steps and checklists, 531 –534 Project management approach, practical, 94 –95 failure scenarios, 91 principles, guiding, 91 –92 success factors, 92 –94 warning signs, 92 –93 Project plan approach, top-down or bottom-up, 75 build or buy decisions, 75 plan outline, 78, 83 readiness assessment, 81 risk assessment, 74 –75 survey, preliminary, 76 –77 values and expectations, 74 vendor options, 75–76 Project sponsor, 77, 87 Project team challenges, 85 job titles, 86 responsibilities, 87 roles, 86 skills and experience levels, 87 –88 user participation, 88 –90 Quality, see Data quality Queries, 357 –358 RAID (Redundant Array of Independent Disks), 476 Recovery, 507 –508 See also Backup Referential integrity, 485 Refresh, data, 306 See also Data loading Replication, data, 291 –292 Reports, 358 –359 Requirements driving force for architectural plan, 125 –131 data design, 122 –125 data quality, 128– 129 DBMS selection, 132 information delivery, 133 –136 storage specifications, 131 –132 ETL, considerations for, 127 methods, for gathering documentation, existing, 115–116 interviews, 111– 112 JAD sessions, 113–115 questions arrangement of, 111 types of, 110 questionnaires, 110 nature of, 100, 104 survey, preliminary, 76 –77 ROLAP (Relational OLAP), 395–396 Roll-up analysis, 390–392 Scalability, 167, 468 Schema, see STAR schema See also 
Snowflake schema SDLC (System Development Life Cycle), 81 –83 Security passwords, 503–504 policy, 502 privileges, user, 502– 503 tools, 504 Semiadditive facts, 237 Server options, see Hardware Slice-and-dice analysis, 392–393 SMP (Symmetric Multiprocessing), 178–179 Snowflake schema advantages and disadvantages, 260– 261 dimension tables, normalization of, 259–260 guidelines, for using, 262 Source systems, 34 –37 Source-to-target mapping, 288 Sparsity, 269–270 Sponsor, see Project sponsor Spreadsheet analysis, 377–378 SQL (Structured Query Language), 378–379 Standardizing facts, 276 Standards lack of, 64 metadata, 65, 217–218 OLAP, 65 –66, 380 STARS, family of summary, 277 tables core and custom, 274 snapshot and transaction, 273– 274 value chain and value circle, supporting, 274– 275 STAR schema advantages, 241–244 examples, 244 INDEX auction company supermarket video rental wireless phone service formation of, 230 keys, 239–241 navigation, optimized with, 242 query processing, most suitable for, 243 –244 Steering committee, data quality, 322 Storage, see Data storage Strategic information, 4, 5, Striping, 476 Success, critical factors for, 535– 536 Summarization, 485 See also Aggregates Supergrowth, 414 –416 Supermart, 31–32 Surrogate keys, in STAR schema, 240 Syndicated data, 60 Tables aggregate, 266 –270 dimension, 228 –230, 234 –236, 250 –255 fact, 229–230, 236 –238, 273 –275 Technical architecture, 148 –156 See also Architecture Technical support, 519 –552 Thin client, in OLAP architecture, 398 Time dimension, criticality of, 375 Tools availability, by functions, 130 –131 data quality, for, 326 –327 features, 186 –188 information delivery, for, 360 –366 OLAP, for, 402 –403 options, for ETL, 308 –309 security, for, 504 Top-down approach, 29–30 Training, see User training Transformation, see Data transformation Trends, significant data warehousing active data warehousing, 64 agent technology, use of, 53 agile development, 63, 84–85 analytics, 59 browser tools, 
enhancement of, 57 CRM, integration with, 63 data integration, 58 data types, multiple and new, 50– 52 data visualization, 52–54 dashboards and scorecards, 54, 187, 367 –371 571 data warehouse appliances, 56 ERP, integration with, 60 –61 growth and expansion, 46– 48 KM, integration with, 61 –63 multidimensional analysis, provisions for, 59 parallel processing, 54 –56 query tools, enhancement of, 56–57 real-time data warehousing, 50 syndicated data, use of, 60 vendor solutions, maturity of, 48–49 Web-enabling, 66 –69 User acceptance, 491– 492 Users classification of usage, based on, 349–350 job functions, based on, 350 divisions, broad explorers, 353–354 farmers, 353 miners, 354 operators, 353 tourists, 352 User support, 519–520 User training content of, 516–518 delivery of, 518–519 preparation for, 516– 518 Value chain, fact tables supporting, 274–275 Value circle, fact tables supporting, 275 Vendor solutions, evaluation guidelines for, 537–538 Vendors and products, highlights of, 539–548 World Wide Web (WWW) browser technology, 418–419 data source, for the data warehouse, 412–413 data warehouse, adapting to, 67– 68, 411 information delivery, Web-based, 414–418 security, considerations of, 419 technology, converging with data warehousing, 411–412 Web-enabled data warehouse clickstream analysis, 413 configuration, 68– 69 implementation, considerations for, 423– 424 processing model, 423–424 Web-OLAP approaches, implementation, 420–421 engine design, 421 .. .DATA WAREHOUSING FUNDAMENTALS FOR IT PROFESSIONALS Second Edition PAULRAJ PONNIAH DATA WAREHOUSING FUNDAMENTALS FOR IT PROFESSIONALS DATA WAREHOUSING FUNDAMENTALS FOR IT PROFESSIONALS... 315 WHY IS DATA QUALITY CRITICAL? / 316 What Is Data Quality? / 316 Benefits of Improved Data Quality / 319 Types of Data Quality Problems / 320 DATA QUALITY CHALLENGES / 323 Sources of Data Pollution... 

Date posted: 09/11/2019, 09:43
