DESIGN AND IMPLEMENTATION OF DATA MINING TOOLS

M. Awad
Latifur Khan
Bhavani Thuraisingham
Lei Wang

Auerbach Publications
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742

© 2009 by Taylor & Francis Group, LLC
Auerbach is an imprint of Taylor & Francis Group, an Informa business

No claim to original U.S. Government works
Printed in the United States of America on acid-free paper

International Standard Book Number-13: 978-1-4200-4590-1 (Hardcover)

This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.

Trademark Notice:
Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.

Library of Congress Cataloging-in-Publication Data

Design and implementation of data mining tools / M. Awad [et al.].
p. cm.
Includes bibliographical references and index.
ISBN 978-1-4200-4590-1 (hardcover : alk. paper)
1. Data mining. I. Awad, M. (Mamoun)
QA76.9.D3D47145 2009
005.74--dc22
2009000519

Visit the Taylor & Francis Web site at http://www.taylorandfrancis.com
and the Auerbach Web site at http://www.auerbach-publications.com

Dedication

We dedicate this book to our respective families for their support that enabled us to write this book.

Contents

Preface
About the Authors
Acknowledgments

Chapter 1 Introduction
  1.1 Trends
  1.2 Data Mining Techniques and Applications
  1.3 Data Mining for Cyber Security: Intrusion Detection
  1.4 Data Mining for Web: Web Page Surfing Prediction
  1.5 Data Mining for Multimedia: Image Classification
  1.6 Organization of This Book
  1.7 Next Steps

Part I Data Mining Techniques and Applications

Introduction to Part I

Chapter 2 Data Mining Techniques
  2.1 Introduction
  2.2 Overview of Data Mining Tasks and Techniques
  2.3 Artificial Neural Networks
  2.4 Support Vector Machines
  2.5 Markov Model
  2.6 Association Rule Mining (ARM)
  2.7 Multiclass Problem
    2.7.1 One-vs-One
    2.7.2 One-vs-All
  2.8 Image Mining
    2.8.1 Feature Selection
    2.8.2 Automatic Image Annotation
    2.8.3 Image Classification
  2.9 Summary
  References

Chapter 3 Data Mining Applications
  3.1 Introduction
  3.2 Intrusion Detection
  3.3 Web Page Surfing Prediction
  3.4 Image Classification
  3.5 Summary
  References

Conclusion to Part I

Part II Data Mining Tool for Intrusion Detection

Introduction to Part II

Chapter 4 Data Mining for Security Applications
  4.1 Overview
  4.2 Data Mining for Cyber Security
    4.2.1 Overview
    4.2.2 Cyber Terrorism,
          Insider Threats, and External Attacks
    4.2.3 Malicious Intrusions
    4.2.4 Credit Card Fraud and Identity Theft
    4.2.5 Attacks on Critical Infrastructures
    4.2.6 Data Mining for Cyber Security
  4.3 Current Research and Development
  4.4 Summary and Directions
  References

Chapter 5 Dynamic Growing Self-Organizing Tree Algorithm
  5.1 Overview
  5.2 Our Approach
  5.3 DGSOT
    5.3.1 Vertical Growing
    5.3.2 Learning Process
    5.3.3 Horizontal Growing
    5.3.4 Stopping Rule for Horizontal Growing
    5.3.5 K-Level Up Distribution (KLD)
  5.4 Discussion
  5.5 Summary and Directions
  References

Chapter 6 Data Reduction Using Hierarchical Clustering and Rocchio Bundling
  6.1 Overview
  6.2 Our Approach
    6.2.1 Enhancing the Training Process of SVM
    6.2.2 Stopping Criteria
  6.3 Complexity and Analysis
  6.4 Rocchio Decision Boundary
  6.5 Rocchio Bundling Technique
  6.6 Summary and Directions
  References

Chapter 7 Intrusion Detection Results
  7.1 Overview
  7.2 Dataset
  7.3 Results
  7.4 Complexity Validation
  7.5 Discussion
  7.6 Summary and Directions
  References

Conclusion to Part II

Part III Data Mining Tool for Web Page Surfing Prediction

Introduction to Part III

Chapter 8 Web Data Management and Mining
  8.1 Overview
  8.2 Digital Libraries
    8.2.1 Overview
    8.2.2 Web Database Management

... trends in data mining include mining Web data, mining distributed and heterogeneous databases, and privacy-preserving data mining where one ensures that one can get useful results from mining and at ... malicious software. This book focuses on three applications of data mining: cyber security, Web, and multimedia. In particular, we will describe the design and implementation of systems and tools for