Data Mining with SQL Server 2005
ZhaoHui Tang and Jamie MacLennan

Published by Wiley Publishing, Inc.
10475 Crosspoint Boulevard
Indianapolis, IN 46256
www.wiley.com

Copyright © 2005 by Wiley Publishing, Inc., Indianapolis, Indiana
Published simultaneously in Canada
ISBN-13: 978-0-471-46261-3
ISBN-10: 0-471-46261-6
Manufactured in the United States of America
10 1O/SR/QZ/QV/IN

No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except as permitted under Sections 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 646-8600. Requests to the Publisher for permission should be addressed to the Legal Department, Wiley Publishing, Inc., 10475 Crosspoint Blvd., Indianapolis, IN 46256, (317) 572-3447, fax (317) 572-4355, or online at http://www.wiley.com/go/permissions.

Limit of Liability/Disclaimer of Warranty: The publisher and the author make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation warranties of fitness for a particular purpose. No warranty may be created or extended by sales or promotional materials. The advice and strategies contained herein may not be suitable for every situation. This work is sold with the understanding that the publisher is not engaged in rendering legal, accounting, or other professional services. If professional assistance is required, the services of a competent professional person should be sought. Neither the publisher nor the author shall be liable for
damages arising herefrom. The fact that an organization or Website is referred to in this work as a citation and/or a potential source of further information does not mean that the author or the publisher endorses the information the organization or Website may provide or recommendations it may make. Further, readers should be aware that Internet Websites listed in this work may have changed or disappeared between when this work was written and when it is read.

For general information on our other products and services or to obtain technical support, please contact our Customer Care Department within the U.S. at (800) 762-2974, outside the U.S. at (317) 572-3993 or fax (317) 572-4002.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.

Trademarks: Wiley, the Wiley logo, and related trade dress are trademarks or registered trademarks of John Wiley & Sons, Inc. and/or its affiliates, in the United States and other countries, and may not be used without written permission. All other trademarks are the property of their respective owners. Wiley Publishing, Inc., is not associated with any product or vendor mentioned in this book.

To everyone in my extended family
—ZhaoHui Tang

To April, my kids, and my Mom and Dad
—Jamie MacLennan

About the Authors

ZhaoHui Tang is a Lead Program Manager in the Microsoft SQL Server Data Mining team. Since joining Microsoft in 1999, he has worked on designing the data mining features of SQL Server 2000 and SQL Server 2005. He has spoken at many academic and industrial conferences, including VLDB, KDD, TechED, and PASS. He has published a number of articles in database and data mining journals. Prior to Microsoft, he worked as a researcher at INRIA and Prism lab in Paris and led a team performing data-mining projects at Sema Group. He received his Ph.D. from the University of Versailles, France in 1996.

Jamie MacLennan is the Development Lead for the Data Mining Engine
in SQL Server. He has been designing and implementing data mining functionality in collaboration with Microsoft Research since he joined Microsoft in 1999. In addition to developing the product, he regularly speaks on data mining at conferences worldwide, writes papers and articles about SQL Server Data Mining, and maintains data mining community sites. Prior to joining Microsoft, Jamie worked at Landmark Graphics, Inc. (division of Halliburton) on oil & gas exploration software and at Micrografx, Inc. on flowcharting and presentation graphics software. He studied undergraduate computer science at Cornell University.

Credits

Acquisitions Editor: Robert Elliot
Development Editor: Sydney Jones
Production Editor: Pamela Hanley
Copy Editor: Foxxe Editorial
Editorial Manager: Mary Beth Wakefield
Vice President & Executive Group Publisher: Richard Swadley
Vice President and Publisher: Joseph B. Wikert
Project Coordinator: Ryan Steffen
Graphics and Production Specialists: Carrie A. Foster, Lauren Goddard, Jennifer Heleine, Stephanie D. Jumper
Quality Control Technician: Joe Niesen
Proofreading and Indexing: TECHBOOKS Production Services

Contents

About the Authors
Credits
Foreword
Chapter 1 Introduction to Data Mining
    What Is Data Mining
    Business Problems for Data Mining
    Data Mining Tasks
        Classification
        Clustering
        Association
        Regression
        Forecasting
        Sequence Analysis
        Deviation Analysis
    Data Mining Techniques
    Data Flow
    Data Mining Project Cycle
        Step 1: Data Collection
        Step 2: Data Cleaning and Transformation
        Step 3: Model Building
        Step 4: Model Assessment
        Step 5: Reporting
        Step 6: Prediction (Scoring)
        Step 7: Application Integration
        Step 8: Model Management
definition of data mining
■■ Determining which business problems can be solved with data mining
■■ Data mining tasks
■■ Using various data mining techniques
■■ Data mining flow
■■ The data mining project...

Microsoft Data Mining Resources
More on General Data Mining
Popular Data Mining Web Site
Popular Data Mining Conference
Appendix A Importing Datasets
Datasets
MovieClick Dataset
Voting Records Dataset