The Data Warehouse ETL Toolkit
Practical Techniques for Extracting, Cleaning, Conforming, and Delivering Data

Ralph Kimball
Joe Caserta

Wiley Publishing, Inc.

Published by Wiley Publishing, Inc., 10475 Crosspoint Boulevard, Indianapolis, IN 46256, www.wiley.com

Copyright © 2004 by Wiley Publishing, Inc. All rights reserved. Published simultaneously in Canada.

eISBN: 0-7645-7923-1

Printed in the United States of America.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except as permitted under Sections 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 646-8600. Requests to the Publisher for permission should be addressed to the Legal Department, Wiley Publishing, Inc., 10475 Crosspoint Blvd., Indianapolis, IN 46256, (317) 572-3447, fax (317) 572-4355, e-mail: brandreview@wiley.com.

Limit of Liability/Disclaimer of Warranty: The publisher and the author make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation warranties of fitness for a particular purpose. No warranty may be created or extended by sales or promotional materials. The advice and strategies contained herein may not be suitable for every situation. This work is sold with the understanding that the publisher is not engaged in rendering legal, accounting, or other professional services. If professional assistance is required, the services of a competent professional person should be sought. Neither the publisher nor the author shall be liable for damages arising herefrom. The fact that an organization
or Website is referred to in this work as a citation and/or a potential source of further information does not mean that the author or the publisher endorses the information the organization or Website may provide or recommendations it may make. Further, readers should be aware that Internet Websites listed in this work may have changed or disappeared between when this work was written and when it is read.

For general information on our other products and services please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.

Library of Congress Cataloging-in-Publication Data

Kimball, Ralph.
The data warehouse ETL toolkit : practical techniques for extracting, cleaning, conforming, and delivering data / Ralph Kimball, Joe Caserta.
p. cm.
Includes index.
eISBN 0-7645-7923-1
Data warehousing. Database design. I. Caserta, Joe, 1965- II. Title.
QA76.9.D37K53 2004
005.74—dc22
2004016909

Trademarks: Wiley, the Wiley Publishing logo, and related trade dress are trademarks or registered trademarks of John Wiley & Sons, Inc. and/or its affiliates. All other trademarks are the property of their respective owners. Wiley Publishing, Inc., is not associated with any product or vendor mentioned in this book.

Credits

Vice President and Executive Group Publisher: Richard Swadley
Vice President and Publisher: Joseph B. Wikert
Executive Editorial Director: Mary Bednarek
Executive Editor: Robert Elliot
Editorial Manager: Kathryn A. Malm
Development Editor: Adaobi Obi Tulton
Production Editor: Pamela Hanley
Media Development Specialist: Travis Silvers
Text Design & Composition: TechBooks Composition Services

Contents

Acknowledgments
About the Authors
Introduction

Part I Requirements, Realities, and Architecture

Chapter 1 Surrounding the Requirements
Requirements
Business Needs
Compliance Requirements
Data Profiling
Security Requirements
Data Integration
Data Latency
Archiving and Lineage
End User Delivery Interfaces
Available Skills
Legacy Licenses
Architecture
ETL Tool versus Hand Coding (Buy a Tool Suite or Roll Your Own?)
The Back Room – Preparing the Data
The Front Room – Data Access
The Mission of the Data Warehouse
What the Data Warehouse Is
What the Data Warehouse Is Not
Industry Terms Not Used Consistently
Resolving Architectural Conflict: A Hybrid Approach
How the Data Warehouse Is Changing
The Mission of the ETL Team

Chapter 2 ETL Data Structures
To Stage or Not to Stage
Designing the Staging Area
Data Structures in the ETL System
Flat Files
XML Data Sets
Relational Tables
Independent DBMS Working Tables
Third Normal Form Entity/Relation Models
Nonrelational Data Sources
Dimensional Data Models: The Handoff from the Back Room to the Front Room
Fact Tables
Dimension Tables
Atomic and Aggregate Fact Tables
Surrogate Key Mapping Tables
Planning and Design Standards
Impact Analysis
Metadata Capture
Naming Conventions
Auditing Data Transformation Steps
Summary

Part II Data Flow

Chapter 3 Extracting
Part 1: The Logical Data Map
Designing Logical Before Physical
Inside the Logical Data Map
Components of the Logical Data Map
Using Tools for the Logical Data Map
Building the Logical Data Map
Data Discovery Phase
Data Content Analysis
Collecting Business Rules in the ETL Process
Integrating Heterogeneous Data Sources
Part 2: The Challenge of Extracting from Disparate Platforms
Connecting to Diverse Sources through ODBC
Mainframe Sources
Working with COBOL Copybooks
EBCDIC Character Set
Converting EBCDIC to ASCII
Transferring Data between Platforms
Handling Mainframe Numeric Data
Using PICtures
Unpacking Packed Decimals
Working with Redefined Fields
Multiple OCCURS
Managing Multiple Mainframe Record Type Files
Handling Mainframe Variable Record Lengths
Flat Files
Processing Fixed Length Flat Files
Processing Delimited Flat Files
XML Sources
Character Sets
XML Meta Data
Web Log Sources
W3C Common and Extended Formats
Name Value Pairs in Web Logs
ERP System Sources
Part 3: Extracting Changed Data
Detecting Changes
Extraction Tips
Detecting Deleted or Overwritten Fact Records at the Source
Summary

Chapter 4 Cleaning and Conforming
Defining Data Quality
Assumptions
Part 1: Design Objectives
Understand Your Key Constituencies
Competing Factors
Balancing Conflicting Priorities
Formulate a Policy
Part 2: Cleaning Deliverables
Data Profiling Deliverable
Cleaning Deliverable #1: Error Event Table
Cleaning Deliverable #2: Audit Dimension
Audit Dimension Fine Points
Part 3: Screens and Their Measurements
Anomaly Detection Phase
Types of Enforcement
Column Property Enforcement
Structure Enforcement
Data and Value Rule Enforcement
Measurements Driving Screen Design
Overall Process Flow
The Show Must Go On—Usually
Screens

...

Metadata
Defining Metadata
Metadata—What Is It?
Source System Metadata
Data-Staging Metadata
DBMS Metadata
Front Room Metadata
Business Metadata
Business Definitions
Source System Information
Data...

... (Wiley, 1996), The Data Warehouse Lifecycle Toolkit (Wiley, 1998), The Data Webhouse Toolkit (Wiley, 2000), and The Data Warehouse Toolkit, Second Edition (Wiley, 2002). He also has written for Intelligent