
Compression schemes for mining large datasets: A machine learning perspective


Advances in Computer Vision and Pattern Recognition

Compression Schemes for Mining Large Datasets

T Ravindra Babu · M Narasimha Murty · S.V Subrahmanya


T Ravindra Babu
Infosys Technologies Ltd
Bangalore, India

M Narasimha Murty
Indian Institute of Science
Bangalore, India

S.V Subrahmanya
Infosys Technologies Ltd
Bangalore, India

Series Editors

Prof. Sameer Singh
Rail Vision Europe Ltd
Castle Donington, Leicestershire, UK

Dr. Sing Bing Kang
Interactive Visual Media Group
Microsoft Research
Redmond, WA, USA

ISSN 2191-6586    ISSN 2191-6594 (electronic)
Advances in Computer Vision and Pattern Recognition

ISBN 978-1-4471-5606-2    ISBN 978-1-4471-5607-9 (eBook)
DOI 10.1007/978-1-4471-5607-9

Springer London Heidelberg New York Dordrecht

Library of Congress Control Number: 2013954523

© Springer-Verlag London 2013

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection with reviews or scholarly analysis or material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work. Duplication of this publication or parts thereof is permitted only under the provisions of the Copyright Law of the Publisher’s location, in its current version, and permission for use must always be obtained from Springer. Permissions for use may be obtained through RightsLink at the Copyright Clearance Center. Violations are liable to prosecution under the respective Copyright Law.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein.

Printed on acid-free paper


Preface

We come across a number of celebrated textbooks on data mining covering multiple aspects of the topic since its early development, such as those on databases, pattern recognition, soft computing, etc. We did not find any consolidated work on data mining in the compression domain, and the book took shape from this realization. Our work relates to this area of data mining with a focus on compaction. We present schemes that work in the compression domain and demonstrate their working on one or more practical datasets in each case. In this process, we cover important data mining paradigms. This is intended to provide a practitioner's view of compression schemes in data mining. The work presented is based on the authors' work on related areas over the last few years. We organized each chapter to contain context setting, background work as part of the discussion, the proposed algorithm and scheme, implementation intricacies, experimentation by implementing the scheme on a large dataset, and a discussion of results. At the end of each chapter, as part of the bibliographic notes, we discuss relevant literature and directions for further study.

Data mining focuses on efficient algorithms to generate abstraction from large datasets. The objective of these algorithms is to find interesting patterns for further use with the fewest visits of the entire dataset, the ideal being a single visit. Similarly, since the data sizes are large, effort is made in arriving at a much smaller subset of the original dataset that is representative of the entire data and contains attributes characterizing the data. The ability to generate an abstraction from a small representative set of patterns and features that is as accurate as that obtained with the entire dataset leads to efficiency in terms of both space and time. Important data mining paradigms include clustering, classification, association rule mining, etc. We present a discussion on data mining paradigms in Chap. 2.


In Chap. 3, we consider patterns that are suitably quantized to binary values. The chapter presents an algorithm that computes the dissimilarity in the compressed domain directly. Theoretical notes are provided for the work. We present applications of the scheme in multiple domains.

It is interesting to explore whether, when one is prepared to lose some part of the pattern representation, we obtain better generalization and compaction. We examine this aspect in Chap. 4. The work in the chapter exploits the concept of minimum feature or item support. The concept of support relates to the conventional association rule framework. We consider patterns as sequences, form subsequences of short length, and identify and eliminate repeating subsequences. We represent the pattern by those unique subsequences, leading to significant compaction. Such unique subsequences are further reduced by replacing less frequent unique subsequences by more frequent subsequences, thereby achieving further compaction. We demonstrate the working of the scheme on large handwritten digit data.

Pattern clustering can be construed as compaction of data. Feature selection also reduces dimensionality, thereby resulting in pattern compression. It is interesting to explore whether they can be achieved simultaneously. We examine this in Chap. 5. We consider an efficient clustering scheme that requires a single database visit to generate prototypes. We consider a lossy compression scheme for feature reduction. We also examine whether there is a preferred sequencing of prototype selection and feature selection in achieving compaction as well as good classification accuracy on unseen patterns. We examine multiple combinations of such sequencing. We demonstrate the working of the scheme on handwritten digit data and intrusion detection data.

Domain knowledge forms an important input for efficient compaction. Such knowledge could either be provided by a human expert or generated through an appropriate preliminary statistical analysis. In Chap. 6, we exploit domain knowledge obtained both by expert inference and through statistical analysis and classify 10-class data through a proposed decision tree of limited depth. We make use of 2-class classifiers, AdaBoost and Support Vector Machine, to demonstrate the working of such a scheme.

Dimensionality reduction leads to compaction. With algorithms such as run-length-encoded compression, it is instructive to study whether one can achieve efficiency in obtaining an optimal feature set that provides high classification accuracy. In Chap. 7, we discuss concepts and methods of feature selection and extraction. We propose an efficient implementation of simple genetic algorithms by integrating compressed data classification and frequent features. We provide an insightful discussion on the sensitivity of various genetic operators and frequent-item support on the final selection of the optimal feature set.


In Chap. 8, we turn to big data problems and propose schemes that exploit multiagent systems to solve them. We discuss concepts of big data, MapReduce, PageRank, agents, and multiagent systems before proposing multiagent systems to solve big data problems.

The authors would like to express their sincere gratitude to their respective families for their cooperation.

T Ravindra Babu and S.V Subrahmanya are grateful to Infosys Limited for providing an excellent research environment in the Education and Research Unit (E&R) that enabled them to carry out academic and applied research resulting in articles and books.

T Ravindra Babu would like to express his sincere thanks to his family members Padma, Ramya, Kishore, and Rahul for their encouragement and support. He dedicates his contribution to the work to the fond memory of his parents Butchiramaiah and Ramasitamma. M Narasimha Murty would like to acknowledge the support of his parents. S.V Subrahmanya would like to thank his wife D.R Sudha for her patient support. The authors would like to record their sincere appreciation of the Springer team, Wayne Wheeler and Simon Rees, for their support and encouragement.


Contents

1 Introduction

1.1 Data Mining and Data Compression

1.1.1 Data Mining Tasks

1.1.2 Data Compression

1.1.3 Compression Using Data Mining Tasks

1.2 Organization

1.2.1 Data Mining Tasks

1.2.2 Abstraction in Nonlossy Compression Domain

1.2.3 Lossy Compression Scheme and Dimensionality Reduction

1.2.4 Compaction Through Simultaneous Prototype and Feature Selection

1.2.5 Use of Domain Knowledge in Data Compaction

1.2.6 Compression Through Dimensionality Reduction

1.2.7 Big Data, Multiagent Systems, and Abstraction

1.3 Summary

1.4 Bibliographical Notes

References

2 Data Mining Paradigms 11

2.1 Introduction 11

2.2 Clustering 12

2.2.1 Clustering Algorithms 13

2.2.2 Single-Link Algorithm 14

2.2.3 k-Means Algorithm 15

2.3 Classification 17

2.4 Association Rule Mining 22

2.4.1 Frequent Itemsets 23

2.4.2 Association Rules 25

2.5 Mining Large Datasets 26


2.5.1 Possible Solutions 27

2.5.2 Clustering 28

2.5.3 Classification 34

2.5.4 Frequent Itemset Mining 39

2.6 Summary 42

2.7 Bibliographic Notes 43

References 44

3 Run-Length-Encoded Compression Scheme 47

3.1 Introduction 47

3.2 Compression Domain for Large Datasets 48

3.3 Run-Length-Encoded Compression Scheme 49

3.3.1 Discussion on Relevant Terms 49

3.3.2 Important Properties and Algorithm 50

3.4 Experimental Results 55

3.4.1 Application to Handwritten Digit Data 55

3.4.2 Application to Genetic Algorithms 57

3.4.3 Some Applicable Scenarios in Data Mining 59

3.5 Invariance of VC Dimension in the Original and the Compressed Forms 60

3.6 Minimum Description Length 63

3.7 Summary 65

3.8 Bibliographic Notes 65

References 66

4 Dimensionality Reduction by Subsequence Pruning 67

4.1 Introduction 67

4.2 Lossy Data Compression for Clustering and Classification 67

4.3 Background and Terminology 68

4.4 Preliminary Data Analysis 73

4.4.1 Huffman Coding and Lossy Compression 74

4.4.2 Analysis of Subsequences and Their Frequency in a Class 79

4.5 Proposed Scheme 81

4.5.1 Initialization 83

4.5.2 Frequent Item Generation 83

4.5.3 Generation of Coded Training Data 84

4.5.4 Subsequence Identification and Frequency Computation 84

4.5.5 Pruning of Subsequences 85

4.5.6 Generation of Encoded Test Data 85

4.5.7 Classification Using Dissimilarity Based on Rough Set Concept 86

4.5.8 Classification Using k-Nearest Neighbor Classifier 87

4.6 Implementation of the Proposed Scheme 87

4.6.1 Choice of Parameters 87


4.6.3 Compressed Data and Pruning of Subsequences 89

4.6.4 Generation of Compressed Training and Test Data 91

4.7 Experimental Results 91

4.8 Summary 92

4.9 Bibliographic Notes 93

References 94

5 Data Compaction Through Simultaneous Selection of Prototypes and Features 95

5.1 Introduction 95

5.2 Prototype Selection, Feature Selection, and Data Compaction 96

5.2.1 Data Compression Through Prototype and Feature Selection 99

5.3 Background Material 100

5.3.1 Computation of Frequent Features 103

5.3.2 Distinct Subsequences 104

5.3.3 Impact of Support on Distinct Subsequences 104

5.3.4 Computation of Leaders 105

5.3.5 Classification of Validation Data 105

5.4 Preliminary Analysis 105

5.5 Proposed Approaches 107

5.5.1 Patterns with Frequent Items Only 107

5.5.2 Cluster Representatives Only 108

5.5.3 Frequent Items Followed by Clustering 109

5.5.4 Clustering Followed by Frequent Items 109

5.6 Implementation and Experimentation 110

5.6.1 Handwritten Digit Data 110

5.6.2 Intrusion Detection Data 116

5.6.3 Simultaneous Selection of Patterns and Features 120

5.7 Summary 122

5.8 Bibliographic Notes 123

References 123

6 Domain Knowledge-Based Compaction 125

6.1 Introduction 125

6.2 Multicategory Classification 126

6.3 Support Vector Machine (SVM) 126

6.4 Adaptive Boosting 128

6.4.1 Adaptive Boosting on Prototypes for Data Mining Applications 129

6.5 Decision Trees 130

6.6 Preliminary Analysis Leading to Domain Knowledge 131

6.6.1 Analytical View 132

6.6.2 Numerical Analysis 133


6.7 Proposed Method 136

6.7.1 Knowledge-Based (KB) Tree 136

6.8 Experimentation and Results 137

6.8.1 Experiments Using SVM 138

6.8.2 Experiments Using AdaBoost 140

6.8.3 Results with AdaBoost on Benchmark Data 141

6.9 Summary 143

6.10 Bibliographic Notes 144

References 144

7 Optimal Dimensionality Reduction 147

7.1 Introduction 147

7.2 Feature Selection 149

7.2.1 Based on Feature Ranking 149

7.2.2 Ranking Features 150

7.3 Feature Extraction 152

7.3.1 Performance 154

7.4 Efficient Approaches to Large-Scale Feature Selection Using Genetic Algorithms 154

7.4.1 An Overview of Genetic Algorithms 155

7.4.2 Proposed Schemes 158

7.4.3 Preliminary Analysis 161

7.4.4 Experimental Results 163

7.4.5 Summary 170

7.5 Bibliographical Notes 171

References 171

8 Big Data Abstraction Through Multiagent Systems 173

8.1 Introduction 173

8.2 Big Data 173

8.3 Conventional Massive Data Systems 174

8.3.1 Map-Reduce 174

8.3.2 PageRank 176

8.4 Big Data and Data Mining 176

8.5 Multiagent Systems 177

8.5.1 Agent Mining Interaction 177

8.5.2 Big Data Analytics 178

8.6 Proposed Multiagent Systems 178

8.6.1 Multiagent System for Data Reduction 178

8.6.2 Multiagent System for Attribute Reduction 179

8.6.3 Multiagent System for Heterogeneous Data Access 180

8.6.4 Multiagent System for Agile Processing 181

8.7 Summary 182

8.8 Bibliographic Notes 182


Appendix Intrusion Detection Dataset—Binary Representation 185

A.1 Data Description and Preliminary Analysis 185

A.2 Bibliographic Notes 189

References 189

Glossary 191


Acronyms

AdaBoost Adaptive Boosting

BIRCH Balanced Iterative Reducing and Clustering using Hierarchies

CA Classification Accuracy

CART Classification and Regression Trees

CF Clustering Feature

CLARA CLustering LARge Applications

CLARANS Clustering Large Applications based on RANdomized Search

CNF Conjunctive Normal Form

CNN Condensed Nearest Neighbor

CS Compression Scheme

DFS Distributed File System

DNF Disjunctive Normal Form

DTC Decision Tree Classifier

EDW Enterprise Data Warehouse

ERM Expected Risk Minimization

FPTree Frequent Pattern Tree

FS Fisher Score

GA Genetic Algorithm

GFS Google File System

HDFS Hadoop Distributed File System

HW Handwritten

KB Knowledge-Based

KDD Knowledge Discovery from Databases

kNNC k-Nearest-Neighbor Classifier

MAD Magnetic, Agile, and Deep (analysis)

MDL Minimum Description Length

MI Mutual Information

ML Machine Learning

NNC Nearest-Neighbor Classifier

NMF Nonnegative Matrix Factorization

PAM Partition Around Medoids


PCA Principal Component Analysis

PCF Pure Conjunctive Form

RLE Run-Length Encoded

RP Random Projections

SA Simulated Annealing

SBS Sequential Backward Selection

SBFS Sequential Backward Floating Selection

SFFS Sequential Forward Floating Selection

SFS Sequential Forward Selection

SGA Simple Genetic Algorithm

SSGA Steady-State Genetic Algorithm

SVM Support Vector Machine

TS Taboo Search


Chapter 1

Introduction

In this book, we deal with data mining and compression; specifically, we deal with using several data mining tasks directly on the compressed data.

1.1 Data Mining and Data Compression

Data mining is concerned with generating an abstraction of the input dataset using a mining task.

1.1.1 Data Mining Tasks

Important data mining tasks are:

1. Clustering. Clustering is the process of grouping data points so that points in each group or cluster are more similar to each other than to points belonging to different clusters. Each resulting cluster is abstracted using one or more representative patterns. So, clustering is a kind of compression where details of the data are ignored and only cluster representatives are used in further processing or decision making.

2. Classification. In classification, a labeled training dataset is used to learn a model or classifier. This learnt model is used to label a test (unlabeled) pattern; this process is called classification.

3. Dimensionality Reduction. A majority of the classification and clustering algorithms fail to produce expected results when dealing with high-dimensional datasets. Also, computational requirements in the form of time and space can increase enormously with dimensionality. This prompts reduction of the dimensionality of the dataset; it is reduced either by using feature selection or feature extraction. In feature selection, an appropriate subset of features is selected, and in feature extraction, a subset in some transformed space is selected.

T Ravindra Babu et al., Compression Schemes for Mining Large Datasets, Advances in Computer Vision and Pattern Recognition,

DOI 10.1007/978-1-4471-5607-9_1, © Springer-Verlag London 2013


4. Regression or Function Prediction. Here a functional form for a variable y is learnt (where y = f(X)) from given pairs (X, y); the learnt function is used to predict the values of y for new values of X. This problem may be viewed as a generalization of the classification problem. In classification, the number of class labels is finite, whereas in the regression setting, y can take infinitely many values, typically y ∈ R.

5. Association Rule Mining. Even though it is of relatively recent origin, it is the earliest introduced task in data mining and is responsible for bringing visibility to the area of data mining. In association rule mining, we are interested in finding out how frequently two subsets of items are associated.
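To make the support notion underlying association rule mining concrete, here is a minimal sketch of support counting; the toy transactions, item names, and threshold are hypothetical, not from the book:

```python
from itertools import combinations

# Hypothetical toy transactions (sets of items bought together)
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread", "butter"},
    {"milk", "butter"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    s = set(itemset)
    return sum(s <= t for t in transactions) / len(transactions)

# Enumerate all 2-itemsets whose support meets a minimum threshold
items = sorted(set().union(*transactions))
frequent_pairs = [p for p in combinations(items, 2)
                  if support(p, transactions) >= 0.5]
print(frequent_pairs)
```

With the threshold of 0.5, every pair here qualifies; raising the threshold prunes the candidate itemsets, which is the lever the association rule framework turns.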

1.1.2 Data Compression

Another important topic in this book is data compression. A compression scheme CS may be viewed as a function from the set of patterns X to a set of compressed patterns X′. It may be viewed as

CS : X → X′.

Specifically, CS(x) = x′ for x ∈ X and x′ ∈ X′. In a more general setting, we may view CS as giving output x′ using x and some knowledge structure or a dictionary K. So, CS(x, K) = x′ for x ∈ X and x′ ∈ X′. Sometimes, a dictionary is used in compressing and uncompressing the data. Schemes for compressing data are the following:

Lossless Schemes. These schemes are such that CS(x) = x′ and there is an inverse CS−1 such that CS−1(x′) = x. For example, consider a binary string 00001111 (x) as input; the corresponding run-length-coded string is 44 (x′), where the first 4 corresponds to a run of zeros, and the second 4 corresponds to a run of ones. Also, from the run-length-coded string 44 we can get back the input string 00001111. Note that such a representation is lossless as we get x′ from x using run-length encoding and x from x′ using decoding.
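The 00001111 ↔ 44 example can be sketched as a small encoder/decoder pair; this is a minimal illustration assuming the convention that the first run always counts zeros (the function names are our own):

```python
def rle_encode(bits):
    """Encode a binary string as run lengths of alternating symbols.

    Convention: the first run counts '0's, so a string that starts
    with '1' gets a leading run of length 0.
    """
    runs, cur, count = [], "0", 0
    for b in bits:
        if b == cur:
            count += 1
        else:
            runs.append(count)
            cur, count = b, 1
    runs.append(count)
    return runs

def rle_decode(runs):
    """Invert rle_encode: expand run lengths back into a binary string."""
    out, sym = [], "0"
    for r in runs:
        out.append(sym * r)
        sym = "1" if sym == "0" else "0"
    return "".join(out)

print(rle_encode("00001111"))               # [4, 4], as in the text
print(rle_decode(rle_encode("00001111")))   # 00001111: the round trip is lossless
```

The round trip rle_decode(rle_encode(x)) == x is exactly the CS−1(CS(x)) = x property that defines a lossless scheme.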

Lossy Schemes. In a lossy compression scheme, it is not possible in general to get back the original data point x from the compressed pattern x′. Pattern recognition and data mining are areas in which there are plenty of examples where lossy compression schemes are used.

We show some example compression schemes in Fig. 1.1.

1.1.3 Compression Using Data Mining Tasks

Among the lossy compression schemes, we consider the data mining tasks. Each of them is a compression scheme:


Fig. 1.1 Compression schemes

Association rule mining is lossy because the rules are generated from the frequent itemsets. So, association rules in general cannot be used to obtain the original input data points.

Clustering is lossy because the output of clustering is a collection of cluster representatives. From the cluster representatives we cannot get back the original data points. For example, in K-means clustering, each cluster is represented by the centroid of the data points in it; it is not possible to get back the original data points from the centroids.

Classification is lossy as the models learnt from the training data cannot be used to reproduce the input data points. For example, in the case of Support Vector Machines, a subset of the training patterns, called support vectors, is used to get the classifier; it is not possible to generate the input data points from the support vectors.

Dimensionality reduction schemes can ignore some of the input features. So, they are lossy because it is not possible to get the training patterns back from the dimensionality-reduced ones.

So, each of the mining tasks is lossy in terms of its output obtained from the given data. In addition, in this book, we deal with data mining tasks working on compressed data, not the original data. We consider data compression schemes that could be either lossy or nonlossy. Some of the nonlossy data compression schemes are also shown in Fig. 1.1. These include run-length coding, Huffman coding, and the zip utility provided by operating systems.
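As an illustration of why clustering is lossy, the sketch below (hypothetical 1-D data and a bare-bones K-means, not the book's implementation) replaces each point by its nearest centroid; the centroids can be stored in place of the data, but the original values cannot be recovered from them:

```python
def kmeans_1d(points, centroids, iters=10):
    """A bare-bones 1-D K-means (Lloyd's iterations)."""
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for p in points:
            # assign each point to its nearest centroid
            j = min(range(len(centroids)), key=lambda j: abs(p - centroids[j]))
            clusters[j].append(p)
        # recompute each centroid as the mean of its cluster
        centroids = [sum(c) / len(c) for c in clusters if c]
    return centroids

points = [1.0, 1.2, 0.8, 9.0, 9.5, 8.5]
centroids = kmeans_1d(points, [0.0, 5.0])

# lossy "compressed" view: every point replaced by its nearest centroid
compressed = [min(centroids, key=lambda c: abs(p - c)) for p in points]
print(centroids)
print(compressed)
```

Here six values collapse onto two centroids near 1.0 and 9.0; the mapping from `points` to `compressed` has no inverse, which is the sense in which clustering compresses.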

1.2 Organization

Material in this book is organized as follows.

1.2.1 Data Mining Tasks


The data mining tasks considered are the following.

Clustering. Clustering algorithms generate either a hard or a soft partition of the input dataset. Hard clustering algorithms are either partitional or hierarchical. Partitional algorithms generate a single partition of the dataset. The number of all possible partitions of a set of n points into K clusters can be shown to be equal to

  (1/K!) Σ_{i=1}^{K} (−1)^{K−i} C(K, i) i^n.

So, exhaustive enumeration of all possible partitions of a dataset could be prohibitively expensive. For example, even for a small dataset of 19 patterns to be partitioned into four clusters, we may have to consider around 11,259,666,000 partitions. In order to reduce the computational load, each of the clustering algorithms restricts these possibilities by selecting an appropriate subset of the set of all possible K-partitions. In Chap. 2, we consider two partitional algorithms for clustering. One of them is the K-means algorithm, which is the most popular clustering algorithm; the other is the leader clustering algorithm, which is the simplest possible algorithm for partitional clustering.
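The quoted count can be cross-checked numerically; the sketch below evaluates the alternating-sum formula (this quantity is the Stirling number of the second kind):

```python
from math import comb, factorial

def num_partitions(n, k):
    """Number of partitions of n points into k clusters,
    via (1/k!) * sum_{i=1}^{k} (-1)^(k-i) C(k, i) i^n."""
    return sum((-1) ** (k - i) * comb(k, i) * i ** n
               for i in range(1, k + 1)) // factorial(k)

print(num_partitions(19, 4))  # 11259666950, i.e. around 11.26 billion
```

For 19 patterns and four clusters this gives 11,259,666,950 partitions, matching the "around 11,259,666,000" figure in the text and making the case against exhaustive enumeration concrete.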

A hierarchical clustering algorithm generates a hierarchy of partitions; partitions at different levels of the hierarchy are of different sizes. We describe the single-link algorithm, which has been classically used in a variety of areas including numerical taxonomy. Another hierarchical algorithm discussed is BIRCH, which is a very efficient hierarchical algorithm. Both leader and BIRCH are efficient as they need to scan the dataset only once to generate the clusters.

Classification. We describe two classifiers in Chap. 2. The nearest-neighbor classifier is the simplest classifier in terms of learning. In fact, it does not learn a model; it employs all the training data points to label a test pattern. Even though it has no training time requirement, it can take a long time to label a test pattern if the training dataset is large in size. Its performance deteriorates as the dimensionality of the data points increases; also, it is sensitive to noise in the training data. A popular variant is the K-nearest-neighbor classifier (KNNC), which labels a test pattern based on the labels of its K nearest neighbors. Even though the KNNC is robust to noise, it can fail to perform well in high-dimensional spaces. Also, it takes longer to classify a test pattern.
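The KNNC voting idea can be sketched in a few lines; the 1-D training data and labels below are hypothetical, and note how labeling a test point scans the whole training set, which is the cost the text refers to:

```python
from collections import Counter

def knn_label(test_point, train, k=3):
    """Majority vote among the k nearest training points (1-D features)."""
    nearest = sorted(train, key=lambda pl: abs(pl[0] - test_point))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train = [(1.0, "a"), (1.2, "a"), (0.9, "a"), (5.0, "b"), (5.3, "b")]
print(knn_label(1.1, train, k=3))  # "a": all three nearest neighbors are "a"
print(knn_label(5.1, train, k=3))  # "b": the vote is 2-1 over the nearest "a"
```

The sort over all of `train` for every query is why the KNNC is cheap to "train" but expensive to apply on large datasets.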

Another efficient, state-of-the-art classifier is based on Support Vector Machines (SVMs) and is popularly used in two-class problems. An SVM learns a subset of the set of training patterns, called the set of support vectors. These correspond to patterns falling on two parallel hyperplanes; these planes, called the support planes, are separated by a maximum margin. One can design the classifier using the support vectors. The decision boundary separating patterns of the two classes is located between the two support planes, one per class. It is commonly used in high-dimensional spaces, and it classifies a test pattern using a single dot product computation.


Association Rule Mining. The Apriori algorithm is a classic association rule mining algorithm; perhaps it is responsible for the emergence of the area of data mining itself. Even though it originated in market-basket analysis, it can also be used in other pattern classification and clustering applications. We use it in the classification of handwritten digits in this book. We describe the Apriori algorithm in Chap. 2.

Naturally, in data mining, we need to analyze large-scale datasets; in Chap. 2, we discuss three different schemes for dealing with large datasets. These include:

1. Incremental Mining: Here, we use the abstraction A_K and the (K+1)th point X_{K+1} to generate the abstraction A_{K+1}, where A_K is the abstraction generated after examining the first K points. It is useful in stream data mining; in big data analytics, it addresses the velocity aspect of the three-V model.

2. Divide-and-Conquer Approach: It is a popular scheme used in designing efficient algorithms. Also, the popular, state-of-the-art Map-Reduce scheme is based on this strategy. It is associated with the volume requirement in the three-V model.

3. Mining Based on an Intermediate Representation: Here an abstraction is learnt by accessing the dataset once or twice; this abstraction is an intermediate representation. Once an intermediate representation is available, mining is performed on this abstraction rather than on the dataset, which reduces the computational burden. This scheme is also associated with the volume feature of the three-V model.
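The incremental idea in item 1 can be sketched with the simplest possible abstraction, a running mean, updated from A_K and X_{K+1} alone without revisiting earlier points (the data values are hypothetical):

```python
def incremental_mean(stream):
    """Maintain the mean of the points seen so far, one point at a time."""
    mean, k = 0.0, 0
    for x in stream:
        k += 1
        mean += (x - mean) / k  # A_{K+1} computed from A_K and X_{K+1} only
    return mean

print(incremental_mean([2.0, 4.0, 6.0]))  # 4.0, same as the batch mean
```

Because each update needs only the previous abstraction and the new point, the scheme suits data streams where the earlier points are no longer available.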

1.2.2 Abstraction in Nonlossy Compression Domain

In Chap. 3, we provide a nonlossy compression scheme and the ability to cluster and classify data in the compressed domain without having to uncompress.

The scheme employs run-length coding of binary patterns. So, it is useful in dealing with either binary input patterns or even numerical vectors that can be viewed as binary sequences. Specifically, it considers handwritten digits that can be represented as binary patterns and compresses the strings using run-length coding. The compressed patterns are then input to a KNNC for classification. Using the KNNC on the compressed data requires a definition of the distance d between a pair of run-length-coded strings.

It is shown that the distance d(x, y) between two binary strings x and y and the modified distance d′(x′, y′) between the corresponding run-length-coded (compressed) strings x′ and y′ are equal; that is, d(x, y) = d′(x′, y′). It is shown that the KNNC using the modified distance on the compressed strings reduces the space and time requirements by a significant factor compared to the application of the KNNC on the given original (uncompressed) data.
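The book defines its own modified distance d′; as an illustrative stand-in, the sketch below computes the ordinary Hamming distance directly on run-length codes, without decoding, assuming the hypothetical convention that the first run of each code counts zeros (the helper name is ours, not the book's):

```python
def hamming_on_rle(r1, r2):
    """Hamming distance between two equal-length binary strings,
    computed from their run-length codes without decompressing.
    Both codes follow the convention that the first run counts '0's."""
    i = j = 0
    a, b = r1[0], r2[0]   # remaining length of the current run in each code
    s1 = s2 = 0           # current symbol (0/1) of each string
    dist = 0
    while i < len(r1) and j < len(r2):
        step = min(a, b)  # advance across the longest stretch with fixed symbols
        if s1 != s2:
            dist += step
        a -= step
        b -= step
        if a == 0:        # current run of r1 exhausted: move on, flip symbol
            i += 1
            if i < len(r1):
                a, s1 = r1[i], 1 - s1
        if b == 0:        # same for r2
            j += 1
            if j < len(r2):
                b, s2 = r2[j], 1 - s2
    return dist

# 00001111 -> [4, 4]; 00110011 -> [2, 2, 2, 2]; the strings differ in 4 positions
print(hamming_on_rle([4, 4], [2, 2, 2, 2]))  # 4
```

The loop touches each run once, so the cost is proportional to the number of runs rather than the string length, which is the source of the savings the text describes.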


Applying the scheme to the intrusion detection dataset does not affect the accuracy. In this chapter, we provide an application of the scheme to classification of handwritten digit data and compare the improvement obtained in size as well as computation time. The second application is related to efficient implementation of genetic algorithms. Genetic algorithms are robust methods to obtain near-optimal solutions. The compression scheme can be gainfully employed in situations where the evaluation function in genetic algorithms is the classification accuracy of the nearest-neighbor classifier (NNC). The NNC involves computation of dissimilarity a number of times depending on the size of the training data or prototype pattern set as well as the test data size. The method can be used for optimal prototype and feature selection. We discuss an indicative example. The Vapnik–Chervonenkis (VC) dimension characterizes the complexity of a class of classifiers. It is important to control the VC dimension to improve the performance of a classifier. Here, we show that the VC dimension is not affected by using the classifier on compressed data.

1.2.3 Lossy Compression Scheme and Dimensionality Reduction

We propose a lossy compression scheme in Chap. 4. Such compressed data can be used in both clustering and classification. The proposed scheme compresses the given data by using frequent items and then considering distinct subsequences. Once the training data is compressed using this scheme, it is also required to deal appropriately with test data; it is possible that some of the subsequences present in the test data are absent in the training data summary. One of the successful schemes employed to deal with this issue is based on replacing a subsequence in the test data by its nearest neighbor in the training data.

The pruning and transformation scheme employed in achieving compression reduces the dataset size significantly. However, the classification accuracy improves because of the possible generalization resulting from the compressed representation. It is possible to integrate rough set theory to put a threshold on the dissimilarity between a test pattern and a training pattern represented in the compressed form. If the distance is below a threshold, then the test pattern is assumed to be in the lower approximation (proper core region) of the class of the training data; otherwise, it is placed in the upper approximation (possible reject region).

1.2.4 Compaction Through Simultaneous Prototype and Feature Selection


In Chap. 5, we examine the following issues:

• The impact of compression based on frequent items and subsequences on prototype selection

• The representativeness of features selected using data represented in terms of frequent items with a high support value

• The role of clustering and frequent item generation in lossy data compression and how the classifier is affected by the representation; it is possible to use clustering followed by frequent itemset generation, or frequent itemset generation followed by clustering. Both schemes are explored in evaluating the resulting simultaneous prototype and feature selection. Here the leader clustering algorithm is used for prototype selection, and frequent itemset-based approaches are used for feature selection.

1.2.5 Use of Domain Knowledge in Data Compaction

Domain knowledge-based compaction is provided in Chap. 6. We make use of domain knowledge of the data under consideration to design efficient pattern classification schemes. We design a domain knowledge-based decision tree of limited depth that can classify 10-category data with high accuracy. Classification approaches based on support vector machines and AdaBoost are used.

We carry out preliminary analysis on the datasets and demonstrate deriving domain knowledge from the data and from a human expert. So that classification is carried out on representative patterns and not on the complete data, we make use of the condensed nearest-neighbor approach and the leader clustering algorithm. We demonstrate the working of the proposed schemes on large datasets and public domain machine learning datasets.

1.2.6 Compression Through Dimensionality Reduction

Optimal dimensionality reduction for lossy data compression is discussed in Chap. 7. Here both feature selection and feature extraction schemes are described. In feature selection, both sequential selection schemes and genetic algorithm (GA) based schemes are discussed. In sequential selection, features are selected one after the other based on some ranking scheme; here each of the remaining features is ranked based on its performance along with the already selected features using some validation data. These sequential schemes are greedy in nature and do not guarantee globally optimal selection. It is possible to show that the GA-based schemes are globally optimal under some conditions; however, most practical implementations may not be able to exploit this global optimality.


continuous values, whereas the MI-based scheme is the most successful for selecting features that are discrete or categorical; it has been used in selecting features in classification of documents where the given set of features is very large.

Another popular set of feature selection schemes employs the performance of classifiers on selected feature subsets. The most popularly used classifiers in such feature selection include the NNC, the SVM, and the decision tree classifier. Some of the popular feature extraction schemes are:

Principal Component Analysis (PCA). Here the extracted features are linear combinations of the given features. The signal processing community has successfully used PCA-based compression in image and speech data reconstruction. It has also been used by search engines for capturing semantic similarity between the query and the documents.

Nonnegative Matrix Factorization (NMF). Most of the data one typically uses are nonnegative. In such cases, it is possible to use NMF to reduce the dimensionality. This reduction in dimensionality is helpful in building effective classifiers that work on the reduced-dimensional data even though the given data is high-dimensional.

Random Projections (RP). This is another scheme that extracts features that are linear combinations of the given features; here the weights used in the linear combinations are random values.
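As an illustration of the random projection idea (this sketch is not from the book; the Gaussian weights, the fixed seed, and the chosen dimensions are arbitrary assumptions), each extracted feature is simply a random linear combination of the original ones:

```python
import random

def random_projection(x, k, seed=42):
    """Map a d-dimensional point to k dimensions: each extracted
    feature is a random linear combination of the given features."""
    rng = random.Random(seed)        # fixed seed so the random
    d = len(x)                       # projection matrix is reproducible
    weights = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(k)]
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

z = random_projection([1.0, 2.0, 3.0, 4.0, 5.0], k=2)
print(len(z))  # 2
```

In practice the projection matrix is generated once and applied to every point, so all points share the same extracted features.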

In this chapter, it is also shown how to exploit GAs in large-scale feature selection, and the proposed scheme is demonstrated using the handwritten digit data. A problem with a roughly 200-dimensional feature vector is considered for obtaining an optimal subset of features. The implementation integrates frequent features and genetic algorithms and brings out the sensitivity of the genetic operators in achieving an optimal set. It is practically shown how the choice of the probability of initialization of the population, which is not often discussed in the literature, impacts the size of the final set of features with the other control parameters remaining the same.

1.2.7 Big Data, Multiagent Systems, and Abstraction


1.3 Summary

In this chapter, we have provided a brief introduction to data compression and mining compressed data. It is possible to carry out all the data mining tasks on the compressed data directly. We have then described how the material is organized into different chapters. Most of the popular and state-of-the-art mining algorithms are covered in detail in the subsequent chapters. The various schemes considered and proposed are applied on two datasets: a handwritten digit dataset and a network intrusion detection dataset. Details of the intrusion detection dataset are provided in the Appendix.

1.4 Bibliographical Notes

A detailed description of the bibliography is presented at the end of each chapter, and notes on the bibliography are provided in the respective chapters. This book deals with data mining and data compression. There is no major effort so far in dealing with the application of data mining algorithms directly on compressed data. Some of the important books on compression are by Sayood (2000) and Salomon et al. (2009). An early book on data mining was by Hand et al. (2001). For a good introduction to data mining, a good source is the book by Tan et al. (2005). A detailed description of various data mining tasks is given by Han et al. (2011). The book by Witten et al. (2011) discusses various practical issues and shows how to use the Weka machine learning workbench developed by the authors. One of the recent books is by Rajaraman and Ullman (2011).

Some of the important journals on data mining are:

1. IEEE Transactions on Knowledge and Data Engineering (IEEE TKDE)
2. ACM Transactions on Knowledge Discovery from Data (ACM TKDD)
3. Data Mining and Knowledge Discovery (DMKD)

Some of the important conferences on this topic are:
1. Knowledge Discovery and Data Mining (KDD)
2. International Conference on Data Engineering (ICDE)
3. IEEE International Conference on Data Mining (ICDM)
4. SIAM International Conference on Data Mining (SDM)

References

J. Han, M. Kamber, J. Pei, Data Mining: Concepts and Techniques, 3rd edn. (Morgan Kaufmann, San Mateo, 2011)

D.J. Hand, H. Mannila, P. Smyth, Principles of Data Mining (MIT Press, Cambridge, 2001)
A. Rajaraman, J.D. Ullman, Mining Massive Datasets (Cambridge University Press, Cambridge, 2011)


D. Salomon, G. Motta, D. Bryant, Handbook of Data Compression (Springer, Berlin, 2009)
K. Sayood, Introduction to Data Compression, 2nd edn. (Morgan Kaufmann, San Mateo, 2000)
P.-N. Tan, M. Steinbach, V. Kumar, Introduction to Data Mining (Pearson, Upper Saddle River, 2005)
I.H. Witten, E. Frank, M.A. Hall, Data Mining: Practical Machine Learning Tools and Techniques, 3rd edn. (Morgan Kaufmann, San Mateo, 2011)


Chapter 2

Data Mining Paradigms

2.1 Introduction

In data mining, the size of the dataset involved is large. It is convenient to visualize such a dataset as a matrix of size n × d, where n is the number of data points and d is the number of features. Typically, it is possible that either n or d or both are large. In mining such datasets, the important issues are:

• The dataset cannot be accommodated in the main memory of the machine. So, we need to store the data on a secondary storage medium like a disk and transfer the data in parts to the main memory for processing; such an activity could be time-consuming. Because disk access can be much more expensive than accessing data in memory, the number of database scans is an important parameter. So, when we analyze data mining algorithms, it is important to consider the number of database scans required.

• The dimensionality of the data can be very large. In such a case, several of the conventional algorithms that use Euclidean-distance-like metrics to characterize proximity between a pair of patterns may not play a meaningful role in such high-dimensional spaces, where the data is sparsely distributed. So, different techniques to deal with such high-dimensional datasets become important.

• Three important data mining tasks are:

1. Clustering. Here a collection of patterns is partitioned into two or more clusters. Typically, clusters of patterns are represented using cluster representatives; the centroid of the points in a cluster is one of the most popularly used cluster representatives. Typically, a partition or a clustering is represented by k representatives, where k is the number of clusters; such a process leads to lossy data compression. Instead of dealing with all the n data points in the collection, one can just use the k cluster representatives (where k ≪ n in the data mining context) for further decision making.

2. Classification. In classification, a machine learning algorithm is used on a given collection of training data to obtain an appropriate abstraction of the dataset. Decision trees and probability distributions of points in various classes are examples of such abstractions. These abstractions are used to classify a test pattern.

T. Ravindra Babu et al., Compression Schemes for Mining Large Datasets, Advances in Computer Vision and Pattern Recognition, DOI 10.1007/978-1-4471-5607-9_2, © Springer-Verlag London 2013

Fig. 2.1 Clustering

3. Association Rule Mining. This activity has played a major role in giving a distinct status to the field of data mining itself. By convention, an association rule is an implication of the form A → B, where A and B are two disjoint itemsets. It was initiated in the context of market-basket analysis to characterize how frequently items in A are bought along with items in B. However, generically, it is possible to view classification and clustering rules also as association rules.

In order to run these tasks on large datasets, it is important to consider techniques that could lead to scalable mining algorithms. Before we examine these techniques, we briefly consider some of the popular algorithms for carrying out these data mining tasks.

2.2 Clustering

Clustering is the process of partitioning a set of patterns into cohesive groups or clusters. Such a process is carried out so that intra-cluster patterns are similar and inter-cluster patterns are dissimilar. This is illustrated using the set of two-dimensional points shown in Fig. 2.1. There are three clusters in this figure, and patterns are represented as two-dimensional points. The Euclidean distance between a pair of points belonging to the same cluster is smaller than that between any two points chosen from different clusters.

The Euclidean distance between two points X and Y in the p-dimensional space, where x_i and y_i are the ith components of X and Y, respectively, is given by

d(X, Y) = [ Σ_{i=1}^{p} (x_i − y_i)² ]^{1/2}
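The distance just defined can be computed directly; the following short plain-Python sketch (illustrative, not from the book) checks it on two small examples:

```python
import math

def euclidean(x, y):
    """Euclidean distance between two p-dimensional points."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# Two-dimensional example: d((1,1), (2,2)) = sqrt(2) ~ 1.4
print(round(euclidean((1, 1), (2, 2)), 1))          # 1.4
# Four-dimensional example: d((1,1,1,1), (2,2,2,2)) = 2.0
print(euclidean((1, 1, 1, 1), (2, 2, 2, 2)))        # 2.0
```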


Fig 2.2 Representing clusters

This notion characterizes similarity: the intra-cluster distance (similarity) is small (high), and the inter-cluster distance (similarity) is large (low). There could be other ways of characterizing similarity.

Clustering is useful in generating data abstraction. The process of data abstraction may be explained using Fig. 2.2. There are two dense clusters; the first has 22 points, and the second has 9 points. Further, there is a singleton cluster in the figure. Here, a cluster of points is represented by its centroid or its leader. The centroid stands for the sample mean of the points in the cluster, and it need not coincide with any one of the input points, as indicated in the figure. There is another point in the figure, which is far off from any of the other points, and it belongs to the third cluster. This could be an outlier. Typically, these outliers are ignored, and each of the remaining clusters is represented by one or more points, called the cluster representatives, to achieve the abstraction. The most popular cluster representative is the centroid.

Here, if each cluster is represented by its centroid, then there is a reduction in the dataset size. One can use only the two centroids for further decision making. For example, in order to classify a test pattern using the nearest-neighbor classifier, one requires 32 distance computations if all the data points are used. However, using the two centroids requires just two distance computations to find the nearest centroid of the test pattern. It is possible that classifiers using the cluster centroids can be optimal under some conditions. The above discussion illustrates the role of clustering in lossy data compression.

2.2.1 Clustering Algorithms

Typically, a grouping of patterns is meaningful when the within-group similarity is high and the between-group similarity is low. This may be illustrated using the groupings of the seven two-dimensional points shown in Fig. 2.3.


Fig 2.3 A clustering of the two-dimensional points

whereas a partitional algorithm generates a single partition of the set of patterns. This is illustrated using the two-dimensional dataset consisting of the points labeled A: (1, 1)^t, B: (2, 2)^t, C: (1, 2)^t, D: (6, 2)^t, E: (6.9, 2)^t, F: (6, 6)^t, and G: (6.8, 6)^t in Fig. 2.3. This figure depicts seven patterns in three clusters. Typically, a partitional algorithm would produce the three clusters shown in Fig. 2.3. A hierarchical algorithm would result in a dendrogram representing the nested grouping of patterns and the similarity levels at which groupings change.

In this section, we describe two popular clustering algorithms; one of them is hierarchical, and the other is partitional

2.2.2 Single-Link Algorithm

This algorithm is a bottom-up hierarchical algorithm starting with n singleton clusters when there are n data points to be clustered. It keeps merging smaller clusters to form bigger clusters based on the minimum distance between two clusters. The specific algorithm is described below.

Single-Link Algorithm

1. Input: n data points; Output: A dendrogram depicting the hierarchy.

2. Form the n × n proximity matrix by using the Euclidean distance between all pairs of points. Assign each point to a separate cluster; this step results in n singleton clusters.

3. Merge a pair of most similar clusters to form a bigger cluster. The distance between the two clusters C_i and C_j to be merged is given by

Distance(C_i, C_j) = min_{X ∈ C_i, Y ∈ C_j} d(X, Y).

4. Repeat step 3 till a partition of the required size is obtained; a k-partition is obtained if the number of clusters k is given; otherwise, merging continues till a single cluster of all the n points is obtained.
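The steps above can be sketched in plain Python (a minimal illustration, not the book's implementation); instead of building the full dendrogram, this version simply stops when k clusters remain:

```python
import math

def single_link(points, k):
    """Bottom-up single-link clustering sketch: start with singleton
    clusters and repeatedly merge the two clusters whose minimum
    pairwise point distance is smallest, until k clusters remain."""
    clusters = [[p] for p in points]          # n singleton clusters
    while len(clusters) > k:
        best = None                           # (distance, i, j) of closest pair
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(math.dist(x, y)
                        for x in clusters[i] for y in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters[j]            # merge the closest pair
        del clusters[j]
    return clusters

# Points A..G of Fig. 2.3
pts = [(1, 1), (2, 2), (1, 2), (6, 2), (6.9, 2), (6, 6), (6.8, 6)]
for c in single_link(pts, 3):
    print(sorted(c))
```

With k = 3 the merges happen at distances 0.8 ({F}, {G}), 0.9 ({D}, {E}), and 1.0, reproducing the three clusters of Fig. 2.3. Note the O(n²) pairwise-distance cost mentioned in the text.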


Table 2.1 Distance matrix

A B C D E F G

A 0.0 1.4 1.0 5.1 6.0 7.0 7.6

B 1.4 0.0 1.0 4.0 4.9 5.6 6.3

C 1.0 1.0 0.0 5.0 5.9 6.4 7.0

D 5.1 4.0 5.0 0.0 0.9 4.0 4.1

E 6.0 4.9 5.9 0.9 0.0 4.1 4.0

F 7.0 5.6 6.4 4.0 4.1 0.0 0.8

G 7.6 6.3 7.0 4.1 4.0 0.8 0.0

Fig 2.4 The dendrogram obtained using the single-link algorithm

in Table 2.1. A dendrogram of the seven points in Fig. 2.3 (obtained from the single-link algorithm) is shown in Fig. 2.4. Note that there are seven leaves, with each leaf corresponding to a singleton cluster in the tree structure. The smallest distance between a pair of such clusters is 0.8, which leads to merging {F} and {G} to form {F, G}. The next merger leads to {D, E} based on a distance of 0.9 units. This is followed by merging {B} and {C}, then {A} and {B, C}, at a distance of 1 unit each. At this point we have three clusters. By merging clusters further, we ultimately get a single cluster as shown in the figure. The dendrogram can be broken at different levels to yield different clusterings of the data. The partition of three clusters obtained using the dendrogram is the same as the partition shown in Fig. 2.3. A major issue with the hierarchical algorithm is that computation and storage of the proximity matrix requires O(n²) time and space.

2.2.3 k-Means Algorithm


Fig 2.5 An optimal clustering of the points

k-Means Algorithm

1 Selectkinitial centroids One possibility is to selectkout of the npoints ran-domly as the initial centroids Each of them represents a cluster

2 Assign each of the remainingnkpoints to one of thesekclusters; a pattern is assigned to a cluster if the centroid of the cluster is the nearest, among all thek centroids, to the pattern

3 Update the centroids of the clusters based on the assignment of the patterns Assign each of thenpatterns to the nearest cluster using the current set of

cen-troids

5 Repeat steps and till there is no change in the assignment of points in two successive iterations

An important feature of this algorithm is that it is sensitive to the selection of the initial centroids and may converge to a local minimum of the squared-error criterion function value if the initial partition is not properly chosen. The squared-error criterion function is given by

Σ_{i=1}^{k} Σ_{X ∈ C_i} ||X − centroid_i||².    (2.1)

We illustrate the k-means algorithm using the dataset shown in Fig. 2.3. If we consider A, D, and F as the initial centroids, then the resulting partition is shown in Fig. 2.5. For this optimal partition, the centroids of the three clusters are:

• centroid1: (1.33, 1.66)^t; centroid2: (6.45, 2)^t; centroid3: (6.4, 6)^t.

• The corresponding value of the squared error is around 2 units.

The popularity of the k-means algorithm may be attributed to its simplicity. It requires O(n) time as it computes nk distances in each pass, and the number of passes may be assumed to be a constant. Also, the number of clusters k is a constant. Further, it needs to store only the k centroids in memory. So, the space requirement is also small.
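The algorithm can be sketched as follows (a minimal plain-Python illustration, not the book's implementation); with A, D, and F of Fig. 2.3 as initial centroids it reproduces the optimal partition and a squared error close to 2:

```python
def squared_dist(p, q):
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q))

def kmeans(points, centroids, max_iter=100):
    """Lloyd-style k-means sketch: assign each point to its nearest
    centroid, recompute centroids, stop when assignments are stable."""
    centroids = [tuple(c) for c in centroids]
    assign = None
    for _ in range(max_iter):
        new_assign = [min(range(len(centroids)),
                          key=lambda j, p=p: squared_dist(p, centroids[j]))
                      for p in points]
        if new_assign == assign:
            break                        # converged: no change in assignment
        assign = new_assign
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:                  # keep old centroid if cluster empties
                dim = len(members[0])
                centroids[j] = tuple(sum(m[i] for m in members) / len(members)
                                     for i in range(dim))
    return centroids, assign

# Points A..G of Fig. 2.3, with A, D, F as the initial centroids
pts = [(1, 1), (2, 2), (1, 2), (6, 2), (6.9, 2), (6, 6), (6.8, 6)]
cents, assign = kmeans(pts, [(1, 1), (6, 2), (6, 6)])
err = sum(squared_dist(p, cents[a]) for p, a in zip(pts, assign))
print(round(err, 2))   # squared error of the optimal partition, 2.06
```

Starting instead from a poor initialization would converge to a partition with a much larger squared error, illustrating the sensitivity discussed above.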


Fig 2.6 A nonoptimal clustering of the two-dimensional points

• centroid1: (1, 1)^t; centroid2: (1.5, 2)^t; centroid3: (6.4, 4)^t.

• The corresponding squared-error value is around 17 units.

2.3 Classification

There are a variety of classifiers. Typically, a set of labeled patterns is used to classify an unlabeled test pattern. Classification involves labeling a test pattern; in the process, either the labeled training dataset is directly used, or an abstraction or model learnt from the training dataset is used. Typically, classifiers learnt from the training dataset are categorized as either generative or discriminative. The Bayes classifier is a well-known generative model where a test pattern X is classified or assigned to class C_i, based on the a posteriori probabilities P(C_j | X) for j = 1, ..., C, if

P(C_i | X) ≥ P(C_j | X) for all j.

These posterior probabilities are obtained using the Bayes rule from the prior probabilities and the probability distributions of patterns in each of the classes. It is possible to show that the Bayes classifier is optimal; it minimizes the average probability of error. The Support Vector Machine (SVM) is a popular discriminative classifier; it learns a weight vector W and a threshold b from the training patterns of two classes. It assigns the test pattern X to class C1 (the positive class) if W^t X + b ≥ 0; else it assigns X to class C2 (the negative class).

The Nearest-Neighbor Classifier (NNC) is the simplest and most popular classifier; it classifies the test pattern by using the training patterns directly. An important property of the NNC is that its error rate is less than twice the error rate of the Bayes classifier when the number of training patterns is asymptotically large. We briefly describe the NNC, which employs the nearest-neighbor rule for classification.

Nearest-Neighbor Classifier (NNC)

Input: A training set X = {(X1, C1), (X2, C2), ..., (Xn, Cn)} and a test pattern X. Note that Xi, i = 1, ..., n, and X are some p-dimensional patterns; further, Ci is the class label of Xi.


Table 2.2 Data matrix

Pattern ID feature1 feature2 feature3 feature4 Class label

X1 1.0 1.0 1.0 1.0 C1

X2 6.0 6.0 6.0 6.0 C2

X3 7.0 7.0 7.0 7.0 C2

X4 1.0 1.0 2.0 2.0 C1

X5 1.0 2.0 2.0 2.0 C1

X6 7.0 7.0 6.0 6.0 C2

X7 1.0 2.0 2.0 1.0 C1

X8 6.0 6.0 7.0 7.0 C2

Output: Class label for the test pattern X.

Decision: Assign X to class Ci if d(X, Xi) = min_j d(X, Xj).

We illustrate the NNC using the four-dimensional dataset shown in Table 2.2. There are eight patterns, X1, ..., X8, from two classes C1 and C2, with four patterns from each class. The patterns are four-dimensional, and the dimensions are characterized by feature1, feature2, feature3, and feature4, respectively. In addition to the four features, there is an additional column that provides the class label of each pattern.

Let the test pattern be X = (2.0, 2.0, 2.0, 2.0)^t. The Euclidean distances between X and each of the eight patterns are given by

d(X, X1) = 2.0; d(X, X2) = 8.0; d(X, X3) = 10.0; d(X, X4) = 1.41; d(X, X5) = 1.0; d(X, X6) = 9.05; d(X, X7) = 1.41; d(X, X8) = 9.05.

So, the Nearest Neighbor (NN) of X is X5 because d(X, X5) is the smallest (it is 1.0) among all the eight distances. So NN(X) = X5, and the class label assigned to X is the class label of X5, which is C1 here; this means that X is assigned to class C1. Note that the NNC requires eight distances to be calculated in this example. In general, if there are n training patterns, then the number of distances to be calculated to classify a test pattern is O(n).
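The NNC decision above can be reproduced with a short sketch (illustrative, not the book's code); `math.dist` is the standard-library Euclidean distance:

```python
import math

# Training data of Table 2.2 as (pattern, class label) pairs
train = [((1, 1, 1, 1), 'C1'), ((6, 6, 6, 6), 'C2'), ((7, 7, 7, 7), 'C2'),
         ((1, 1, 2, 2), 'C1'), ((1, 2, 2, 2), 'C1'), ((7, 7, 6, 6), 'C2'),
         ((1, 2, 2, 1), 'C1'), ((6, 6, 7, 7), 'C2')]

def nnc(train, x):
    """Nearest-neighbor rule: return the label of the training
    pattern closest to x in Euclidean distance."""
    pattern, label = min(train, key=lambda t: math.dist(t[0], x))
    return label

print(nnc(train, (2, 2, 2, 2)))  # C1 (the nearest neighbor is X5)
```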

The nearest-neighbor classifier is popular because:

1. It is easy to understand and implement.

2. There is no learning or training phase; it uses the whole training data to classify the test pattern.

3. Unlike the Bayes classifier, it does not require the probability structure of the classes.

4. It shows good performance: if the optimal accuracy is 99.99 %, then with a large training dataset it can give at least 99.98 % accuracy.


1. It is sensitive to noise; if NN(X) is erroneously labeled, then X will be misclassified.

2. It needs to store the entire training data; further, it needs to compute the distances between the test pattern and each of the training patterns. So, the computational requirements can be large.

3. The distance between a pair of points may not be meaningful in high-dimensional spaces. It is known that, as the dimensionality increases, the distance between a point X and its nearest neighbor tends toward the distance between X and its farthest neighbor. As a consequence, the NNC may perform poorly in high-dimensional spaces.

Some of the possible solutions to the above problems are:

1. In order to tolerate noise, a modification of the NNC is popularly used; it is called the k-Nearest-Neighbor Classifier (kNNC). Instead of deciding the class label of X using the class label of NN(X), X is labeled using the class labels of the k nearest neighbors of X. In the case of the kNNC, the class label of X is the label of the class that is the most frequent among the class labels of the k nearest neighbors. In other words, X is assigned to the class to which the majority of its k nearest neighbors belong; the value of k is to be fixed appropriately. In the example dataset shown in Table 2.2, the three nearest neighbors of X = (2.0, 2.0, 2.0, 2.0)^t are X5, X4, and X7. All the three neighbors are from class C1; so X is assigned to class C1.

2. The NNC requires O(n) time to compute the n distances, and it also requires O(n) space. It is possible to reduce this effort by compressing the training data. There are several algorithms for performing this compression; we consider here a scheme based on clustering. We cluster the n patterns into k clusters using the k-means algorithm and use the k resulting centroids instead of the n training patterns. Labeling the centroids is done by using the majority class label in each cluster.

By clustering the example dataset shown in Table 2.2 using the k-means algorithm, with a value of k = 2, we get the following clusters:

Cluster1: {X1, X4, X5, X7} – Centroid: (1.0, 1.5, 1.75, 1.5)^t

Cluster2: {X2, X3, X6, X8} – Centroid: (6.5, 6.5, 6.5, 6.5)^t

Note that Cluster1 contains the four patterns from C1, and Cluster2 has the four patterns from C2. So, by using these two representatives instead of the eight training patterns, the number of distance computations and the memory requirements will reduce. Specifically, the centroid of Cluster1 is nearer to X than the centroid of Cluster2. So, X is assigned to C1 using two distance computations.

3. In order to reduce the dimensionality, several feature selection/extraction techniques are used. We use a feature-set partitioning scheme that we explain in detail in the sequel.
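The kNNC of solution 1 above can be sketched as follows (an illustrative sketch, not the book's code); with k = 3 on the data of Table 2.2 it reproduces the majority vote among X5, X4, and X7:

```python
import math
from collections import Counter

# Training data of Table 2.2 as (pattern, class label) pairs
train = [((1, 1, 1, 1), 'C1'), ((6, 6, 6, 6), 'C2'), ((7, 7, 7, 7), 'C2'),
         ((1, 1, 2, 2), 'C1'), ((1, 2, 2, 2), 'C1'), ((7, 7, 6, 6), 'C2'),
         ((1, 2, 2, 1), 'C1'), ((6, 6, 7, 7), 'C2')]

def knnc(train, x, k):
    """kNNC: majority class label among the k nearest training patterns."""
    nearest = sorted(train, key=lambda t: math.dist(t[0], x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(knnc(train, (2, 2, 2, 2), k=3))  # C1: the neighbors are X5, X4, X7
```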


Support Vector Machine. The support vector machine (SVM) is a very popular classifier. Some of the important properties of SVM-based classification are:

• The SVM classifier is a discriminative classifier. It can be used to discriminate between two classes. Intrinsically, it supports binary classification.

• It obtains a linear discriminant function of the form W^t X + b from the training data. Here, W is called the weight vector and is of the same size as the data points, and b is a scalar. Learning the SVM classifier amounts to obtaining the values of W and b from the training data.

• It is ideally associated with a binary classification problem. Typically, one of the classes is called the negative class, and the other is called the positive class.

• If X is from the positive class, then W^t X + b > 0, and if X is from the negative class, then W^t X + b < 0.

• It finds the parameters W and b so that the margin between the two classes is maximized.

• It identifies a subset of the training patterns, called the support vectors. These support vectors lie on parallel hyperplanes; the negative and positive hyperplanes correspond respectively to the negative and positive classes. A point X on the negative hyperplane satisfies W^t X + b = −1, and similarly, a point X on the positive hyperplane satisfies W^t X + b = 1.

• The margin between the two support planes is maximized in the process of finding W and b. In other words, the normal distance between the support planes W^t X + b = −1 and W^t X + b = 1 is maximized. This distance is 2 / ||W||. It is maximized using the constraints that every pattern X from the positive class satisfies W^t X + b ≥ +1 and every pattern X from the negative class satisfies W^t X + b ≤ −1. Instead of maximizing the margin, we minimize its inverse. This may be viewed as a constrained optimization problem given by

Min_W (1/2) ||W||²
s.t. y_i (W^t X_i + b) ≥ 1, i = 1, 2, ..., n,

where y_i = 1 if X_i is in the positive class and y_i = −1 if X_i is in the negative class.

• The Lagrangian for the optimization problem is

L(W, b) = (1/2) ||W||² − Σ_{i=1}^{n} α_i [ y_i (W^t X_i + b) − 1 ].

In order to minimize the Lagrangian, we take the derivative with respect to b and the gradient with respect to W; equating them to 0, we get α_i's that satisfy

α_i ≥ 0 and Σ_{i=1}^{q} α_i y_i = 0,

where q is the number of support vectors, and W is given by

W = Σ_{i=1}^{q} α_i y_i X_i.

• It is possible to view the decision boundary as W^t X + b = 0, and W is orthogonal to the decision boundary.

We illustrate the working of the SVM using an example in the two-dimensional space. Let us consider two points: X1 = (2, 1)^t from the negative class and X2 = (6, 3)^t from the positive class. We have the following:

• Using α1 y1 + α2 y2 = 0 and observing that y1 = −1 and y2 = 1, we get α1 = α2. So, we use α instead of α1 or α2.

• As a consequence, W = −αX1 + αX2 = (4α, 2α)^t.

• We know that W^t X1 + b = −1 and W^t X2 + b = 1; substituting the values of W, X1, and X2, we get

8α + 2α + b = −1,
24α + 6α + b = 1.

By solving the above, we get 20α = 2, or α = 1/10, from which and from one of the above equations we get b = −2.

• From W = (4α, 2α)^t and α = 1/10 we get W = (2/5, 1/5)^t.

• In this simple example, we have started with two support vectors in the two-dimensional case. So, it was easy to solve for the α's. In general, there are efficient schemes for finding these values.

• If we consider a point X = (x1, x2)^t on the line x2 = −2x1 + 5, for example, the point (1, 3)^t, then W^t (1, 3)^t − 2 = −1 as W = (2/5, 1/5)^t. This line is the support line for the points in the negative class. In a higher-dimensional space, it is a hyperplane.

• In a similar manner, any point on the parallel line x2 = −2x1 + 15, for example, (5, 5)^t, satisfies the property that W^t (5, 5)^t − 2 = 1, and this parallel line is the support line for the positive class. Again, in a higher-dimensional space, it becomes a hyperplane parallel to the negative-class plane.

• Note that the decision boundary is given by

(2/5, 1/5) X − 2 = 0.

So, the decision boundary (2/5)x1 + (1/5)x2 − 2 = 0 lies exactly in the middle of the two support lines and is parallel to both. Note that (4, 2)^t is located on the decision boundary.

• A point (7, 6)^t is in the positive class as W^t (7, 6)^t − 2 = 2 > 0. Similarly, W^t (1, 1)^t − 2 = −1.4 < 0; so (1, 1)^t is in the negative class.


• If the classes are not linearly separable, then we map the points to a high-dimensional space with the hope of finding linear separability in the new space. Fortunately, one can implicitly make computations in the high-dimensional space without having to work explicitly in it. This is possible by using a class of kernel functions that characterize similarity between patterns.

• However, in large-scale applications involving high-dimensional data, as in text mining, linear SVMs are used by default for their simplicity in training.
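The two-support-vector worked example above can be checked numerically with a short sketch (illustrative only, not a general SVM trainer):

```python
# Two support vectors: X1 = (2, 1) (negative), X2 = (6, 3) (positive).
# From alpha1*y1 + alpha2*y2 = 0 with y1 = -1, y2 = 1, alpha1 = alpha2 = alpha,
# and W = alpha * (X2 - X1) = (4*alpha, 2*alpha).
x1, x2 = (2, 1), (6, 3)

# The constraints W.X1 + b = -1 and W.X2 + b = 1 give
# 10*alpha + b = -1 and 30*alpha + b = 1, so 20*alpha = 2.
alpha = 2 / 20                                            # 1/10
w = (alpha * (x2[0] - x1[0]), alpha * (x2[1] - x1[1]))    # (2/5, 1/5)
b = -1 - (w[0] * x1[0] + w[1] * x1[1])                    # -2

def f(x):
    """Decision function W.x + b; x is in the positive class iff f(x) > 0."""
    return w[0] * x[0] + w[1] * x[1] + b

print(round(w[0], 2), round(w[1], 2), round(b, 2))  # 0.4 0.2 -2.0
print(round(f((7, 6)), 2))   # 2.0  -> positive class
print(round(f((1, 1)), 2))   # -1.4 -> negative class
```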

2.4 Association Rule Mining

This is an activity that is conventionally not a part of either pattern recognition or machine learning. An association rule is an implication of the form A → B, where A and B are disjoint itemsets; A is called the antecedent, and B is called the consequent. Typically, this activity became popular in the context of market-basket analysis, where one is concerned with the set of items available in a supermarket and the transactions made by various customers. In such a context, an association rule provides information on the association between two sets of items that are frequently bought together; this facilitates strategic decisions that may have a positive commercial impact, such as displaying the related items on appropriate shelves to avoid congestion, or offering incentives to customers on some products/items.

Some of the features of the association rule mining activity are:

1. The rule A → B is not like the conventional implication used in a classical logic, for example, the propositional logic. Here, the rule does not guarantee the purchase of items in B in the same transaction where items in A are bought; it depicts a kind of frequent association between A and B in terms of buying patterns.

2. It is assumed that there is a global set of items I; in the case of market-basket analysis, I is the set of all items/product lines available for sale in a supermarket. Note that A and B are disjoint subsets of I. So, if the cardinality of I is d, then the number of all possible rules is O(3^d); this is because each of the d items in I can be a part of A, a part of B, or a part of neither. In order to reduce the mining effort, only a subset of the rules, based on frequently bought items, is examined.

3. Popularly, the quantity of an item bought is not used; what matters is whether an item is bought in a transaction or not. For example, if a customer buys 1.2 kilograms of Sugar, some loaves of Bread, and a tin of Jam in the same transaction, then the corresponding transaction is represented as {Sugar, Bread, Jam}. Such a representation helps in viewing a transaction as a subset of I.


Table 2.3 Transaction data

Transaction Itemset

t1 {a, c, d, e}

t2 {a, d, e}

t3 {b, d, e}

t4 {a, b, c}

t5 {a, b, c, d}

t6 {a, b, d}

t7 {a, d}

2.4.1 Frequent Itemsets

A transaction t is a subset of the set of items I. An itemset X is a subset of a transaction t if all the items in X have been bought in t. If T is a set of transactions, where T = {t1, t2, ..., tn}, then the support-set of X is given by

Support-set(X) = {t_i | X is a subset of t_i}.

The support of X is given by the cardinality of Support-set(X), that is, |Support-set(X)|. An itemset X is a frequent itemset if Support(X) ≥ Minsup, where Minsup is a user-provided threshold.

We explain the notion of a frequent itemset using the transaction data shown in Table 2.3.

Some of the itemsets, with their supports corresponding to the data in Table 2.3, are:

Support({a, b, c}) = 2; Support({a, d}) = 5;
Support({b, d}) = 3; Support({a, c}) = 3.
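These support values can be checked with a short sketch (illustrative, not the book's code), representing each transaction of Table 2.3 as a set of items:

```python
# Transactions t1..t7 of Table 2.3
T = [{'a', 'c', 'd', 'e'}, {'a', 'd', 'e'}, {'b', 'd', 'e'}, {'a', 'b', 'c'},
     {'a', 'b', 'c', 'd'}, {'a', 'b', 'd'}, {'a', 'd'}]

def support(itemset):
    """Number of transactions containing every item of the itemset."""
    return sum(1 for t in T if itemset <= t)

print(support({'a', 'b', 'c'}), support({'a', 'd'}),
      support({'b', 'd'}), support({'a', 'c'}))  # 2 5 3 3
```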

If we use a Minsup value of 4, then the itemset {a, d} is frequent. Further, {a, b, c} is not frequent; we call such itemsets infrequent. There is a systematic way of enumerating all the frequent itemsets; this is done by an algorithm called Apriori. This algorithm enumerates a relevant subset of the itemsets for examining whether they are frequent or not. It is based on the following observations.

1. Any subset of a frequent itemset is frequent. This is because if A and B are two itemsets such that A is a subset of B, then Support(A) ≥ Support(B), because Support-set(A) ⊇ Support-set(B). For example, knowing that the itemset {a, d} is frequent, we can infer that the itemsets {a} and {d} are frequent. Note that in the data shown in Table 2.3, Support({a}) = 6 and Support({d}) = 6, and both exceed the Minsup value.

2. Any superset of an infrequent itemset is infrequent. If A and B are two itemsets such that A is a superset of B, then Support(A) ≤ Support(B). In the example,


Table 2.4 Printed characters of 1

0 1 0
0 1 0
0 1 0

2.4.1.1 Apriori Algorithm

The Apriori algorithm iterates over two steps to generate all the frequent itemsets from a transaction dataset. Each iteration requires a database scan. These two steps are as follows.

Generating candidate itemsets of size k: These itemsets are obtained by looking at frequent itemsets of size k − 1.

Generating frequent itemsets of size k: This is achieved by scanning the transaction database once to check whether a candidate of size k is frequent or not.

It starts with the empty set (φ), which is frequent because the empty set is a subset of every transaction. So, Support(φ) = |T|, where T is the set of transactions. Note that φ is a size-0 itemset, as there are no items in it. The algorithm then generates candidate itemsets of size 1; we call such itemsets 1-itemsets. Note that every 1-itemset is a candidate. In the example data shown in Table 2.3, the candidate 1-itemsets are {a}, {b}, {c}, {d}, {e}. Now it scans the database once to obtain the supports of these 1-itemsets. The supports are:

Support({a}) = 6; Support({b}) = 4; Support({c}) = 3; Support({d}) = 6; Support({e}) = 3.

Using a Minsup value of 4, we can observe that the frequent 1-itemsets are {a}, {b}, and {d}. From these frequent 1-itemsets we generate candidate 2-itemsets. The candidates are {a, b}, {a, d}, and {b, d}. Note that the other 2-itemsets need not be considered as candidates because they are supersets of infrequent itemsets and hence cannot be frequent. For example, {a, c} is infrequent because {c} is infrequent. A second database scan is used to find the support values of these candidates. The supports are Support({a, b}) = 3, Support({a, d}) = 5, and Support({b, d}) = 3. So, only {a, d} is a frequent 2-itemset. Consequently, there cannot be any candidates of size 3. For example, {a, b, d} cannot be frequent because {a, b} is infrequent.
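The level-wise procedure can be sketched as follows (a minimal illustration, not the book's implementation); on the data of Table 2.3 with Minsup = 4 it finds exactly the frequent itemsets derived above:

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Level-wise frequent-itemset mining: candidates of size k are
    built from frequent itemsets of size k-1, then counted in one
    pass over the transactions."""
    items = sorted({i for t in transactions for i in t})
    freq = {}
    level = [frozenset([i]) for i in items]       # candidate 1-itemsets
    while level:
        counts = {c: sum(1 for t in transactions if c <= t) for c in level}
        current = {c: s for c, s in counts.items() if s >= minsup}
        freq.update(current)
        # Candidates of size k+1: unions of frequent k-itemsets whose
        # every k-subset is frequent (the Apriori pruning step).
        k = len(level[0]) + 1
        level = list({a | b for a, b in combinations(current, 2)
                      if len(a | b) == k and
                      all(frozenset(s) in current
                          for s in combinations(a | b, k - 1))})
    return freq

T = [{'a', 'c', 'd', 'e'}, {'a', 'd', 'e'}, {'b', 'd', 'e'}, {'a', 'b', 'c'},
     {'a', 'b', 'c', 'd'}, {'a', 'b', 'd'}, {'a', 'd'}]
print(apriori(T, 4))
```

With Minsup = 4 this returns {a}, {b}, {d}, and {a, d} with their supports; the loop makes one pass per level, matching the scan count discussed in the text.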


Table 2.5 Transactions for characters of 1 (TID, binary features, Class)

t1 0 0 0 Type1
t2 0 1 0 Type1
t3 0 0 1 Type1
t4 0 0 0 Type2
t5 0 0 1 Type2
t6 0 1 0 Type2

So, it is possible to represent data based on categorical features using transactions and mine them to obtain frequent patterns. For example, with a small amount of noise, we can have transaction data corresponding to these 1s as shown in Table 2.5. There are six transactions, each of them corresponding to a 1. By using a Minsup value of 3, we get the frequent itemset {1, 4, 7} for Type1 and the frequent itemset {3, 6, 9} for Type2. Naturally, subsets of these frequent itemsets are also frequent.

2.4.2 Association Rules

In association rule mining there are two important phases:

1 Generating Frequent Itemsets: This requires one or more dataset scans. Based on the discussion in the previous subsection, Apriori requires k + 1 dataset scans if the largest frequent itemset is of size k.

2 Obtaining Association Rules: This step generates association rules based on frequent itemsets. Once frequent itemsets are obtained from the transaction dataset, association rules can be obtained without any more dataset scans, provided that the support of each of the frequent itemsets is stored. So, this step is computationally simpler.

If X is a frequent itemset, then rules of the form A → B, where A ⊂ X and B = X − A, are considered. Such a rule is accepted if the confidence of the rule exceeds a user-specified confidence value called Minconf. The confidence of a rule A → B is defined as

Confidence(A → B) = Support(A ∪ B) / Support(A).

So, if the support values of all the frequent itemsets are stored, then it is possible to compute the confidence value of a rule without scanning the dataset.


1 {a} → {d}; its confidence is 5/6.
2 {d} → {a}; its confidence is 5/6.

So, if the Minconf value is 0.5, then both these rules satisfy the confidence threshold. In the case of the character data shown in Table 2.5, it is appropriate to consider rules of the form:

• {1, 4, 7} → Type1

• {3, 6, 9} → Type2

Typically, the antecedent of such an association rule or a classification rule is a disjunction of one or more maximally frequent itemsets. A frequent itemset A is maximal if there is no frequent itemset B such that A is a subset of B. This illustrates the role of frequent itemsets in classification.
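Once the supports of all the frequent itemsets are stored, the rule-generation phase needs no further dataset scans; a minimal sketch (function names are illustrative, not from the text):

```python
from itertools import combinations

def rules_from_itemset(X, support, minconf):
    """Enumerate rules A -> B with A a proper nonempty subset of X and
    B = X - A; keep a rule when Support(X)/Support(A) >= minconf.
    `support` maps each frequent itemset (a frozenset) to its count,
    so no extra dataset scan is needed."""
    X = frozenset(X)
    rules = []
    for r in range(1, len(X)):
        for A in map(frozenset, combinations(X, r)):
            conf = support[X] / support[A]
            if conf >= minconf:
                rules.append((A, X - A, conf))
    return rules

# Supports from the running example: Support({a}) = Support({d}) = 6,
# Support({a, d}) = 5.
support = {frozenset('a'): 6, frozenset('d'): 6, frozenset('ad'): 5}
rules = rules_from_itemset('ad', support, 0.5)
# both {a} -> {d} and {d} -> {a} survive, each with confidence 5/6
```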

2.5 Mining Large Datasets

There are several applications where the size of the pattern matrix is large. By large, we mean that the entire pattern matrix cannot be accommodated in the main memory of the computer. So, we store the input data on a secondary storage medium like the disk and transfer the data in parts to the main memory for processing. For example, a transaction database of a supermarket chain may consist of trillions of transactions, and each transaction is a sparse vector of a very high dimensionality; the dimensionality depends on the number of product-lines. Similarly, in a network intrusion detection application, the number of connections could be prohibitively large, and the number of packets to be analyzed or classified could be even larger. Another application is the clustering of click-streams; this forms an important part of web usage mining. Other applications include genome sequence mining, where the dimensionality could be running into millions, social network analysis, text mining, and biometrics.

An objective way of characterizing the largeness of a data set is by specifying bounds on the number of patterns and features present. For example, a data set having more than a billion patterns and/or more than a million features is large. However, such a characterization is not universally acceptable and is bound to change with developments in technology. For example, in the 1960s, “large” meant several hundreds of patterns. So, it is good to consider a more pragmatic characterization: large data sets are those that may not fit the main memory of the computer; so, the largeness of the data varies with technological developments. Such large data sets are typically stored on a disk, and each point in the set is accessed from the disk based on processing needs. Note that disk access can be several orders slower than memory access; this property remains intact even though memory and disk sizes at different points of time in the past have been different. So, characterizing largeness using this property could be more meaningful.


Here, we provide an exhaustive set of design techniques that are useful in this context. More specifically, we offer a unifying framework that is helpful in categorizing algorithms for mining large data sets; further, it provides scope for designing novel efficient mining algorithms.

2.5.1 Possible Solutions

It is important that the mining algorithms that work with large data sets scale up well. Algorithms having nonlinear time and space complexities are ruled out. Even algorithms requiring linear time and space may not be feasible if the number of dataset scans is large. Based on these observations, it is possible to list the following solutions for mining large data sets:

1 Incremental Mining: The basis of incremental mining is that the data is considered sequentially and the data points are processed step by step. In most of the incremental mining algorithms, a small dataset is used to generate an abstraction. New points are processed to update the abstraction currently available without examining the previously seen data points. Also, it is important that the abstraction generated is as small as possible in size. Such a scheme helps in mining very large-scale datasets.

We can characterize incremental mining formally as follows. Let

X = {(X1, θ1, t1), (X2, θ2, t2), ..., (Xn, θn, tn)}

be the set of n patterns, each represented as a triple, where Xi is the ith pattern, θi is the class label of Xi, and ti is the time-stamp associated with Xi so that ti < tj if i < j. In incremental mining, as the data is considered sequentially, in a particular order, we may attach time stamps t1, t2, ..., tn to the patterns X1, X2, ..., Xn. Let Ak represent the abstraction generated using the first k patterns, and An represent the abstraction obtained after all the n patterns are processed. Further, in incremental mining, Ak+1 is obtained using Ak and Xk+1 only.

2 Divide-and-Conquer Approach: Divide-and-conquer is a well-known algorithm design strategy. It has been used in designing several efficient algorithms, including efficient data mining algorithms. A notable development in this direction is the Map-Reduce framework, which is popular in a variety of data mining applications including text mining.

3 Mining based on an Intermediate Abstraction: The idea here is to use one or two database scans to obtain a compact representation of the dataset. Such a representation may fit into main memory. Further processing is based on this abstraction, and it does not require any more dataset scans. For example, as discussed in the previous section, once frequent itemsets are obtained using a small number of database scans, association rules can be obtained without any more database scans.


2.5.2 Clustering

2.5.2.1 Incremental Clustering

The basis of incremental clustering is that the data is considered sequentially and the patterns are processed step by step In most of the incremental clustering algorithms, one of the patterns in the data set (usually the first pattern) is selected to form an initial cluster Each of the remaining points is assigned to one of the existing clusters or may be used to form a new cluster based on some criterion Here, a new data item is assigned to a cluster without affecting the existing clusters significantly

The abstraction Ak varies from algorithm to algorithm, and it can take different forms. One of the popular schemes is when Ak is a set of prototypes or cluster representatives. The leader clustering algorithm is a well-known member of this category. It is described below.

Leader Clustering Algorithm

Input: The dataset to be clustered and a threshold value T provided by the user.
Output: A partition of the dataset such that the patterns in each cluster lie within a sphere of radius T.

1 Set k = 1. Assign the first data point X1 to cluster Ck. Set the leader of Ck to be Lk = X1.

2 Assign the next data point X to one of the existing clusters or to a new cluster. This assignment is done based on the similarity between the data point and the existing leaders. Specifically, assign the data point X to cluster Cj if d(X, Lj) < T; if there is more than one Cj satisfying the threshold requirement, then assign X to one of these clusters arbitrarily. If there is no Cj such that d(X, Lj) < T, then increment k, assign X to Ck, and set X to be Lk.

3 Repeat step 2 till all the data points are assigned to clusters.
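The steps above can be sketched as a single pass over the data; the points below are those of Table 2.2 (restated later in Table 2.8):

```python
import math

def leader(points, T):
    """Single-scan leader clustering: a point joins the first leader within
    distance T of it; otherwise it becomes a new leader."""
    leaders, clusters = [], []
    for X in points:
        for j, L in enumerate(leaders):
            if math.dist(X, L) < T:
                clusters[j].append(X)
                break
        else:                     # no leader within T: X starts a new cluster
            leaders.append(X)
            clusters.append([X])
    return leaders, clusters

# Table 2.2 data, X1..X8.
data = [(1.0,1.0,1.0,1.0), (6.0,6.0,6.0,6.0), (7.0,7.0,7.0,7.0),
        (1.0,1.0,2.0,2.0), (1.0,2.0,2.0,2.0), (7.0,7.0,6.0,6.0),
        (1.0,2.0,2.0,1.0), (6.0,6.0,7.0,7.0)]
leaders, clusters = leader(data, 3.0)
# with T = 3, X1 and X2 become the two leaders
```

Presenting the points in a different order can change both the leaders and the number of clusters; this is the order-dependence discussed later in this section.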

BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies

BIRCH may be viewed as a hierarchical version of the leader algorithm with some additional representational features to handle large-scale data. It constructs a data structure called the Cluster Feature tree (CF-tree), which represents each cluster compactly using a vector called the Cluster Feature (CF). We explain these notions using the dataset shown in Table 2.6.

Clustering Feature (CF): Let us consider the cluster of points {(1, 1)t, (2, 2)t}. The CF vector is three-dimensional and is ⟨2, (3, 3), (5, 5)⟩, where the three components of the vector are as follows:

1 The first component is the number of elements in the cluster, which is 2 here.
2 The second component is the linear sum of all the points (vectors) in the cluster, which is (3, 3) (= (1 + 2, 1 + 2)) in this example.
3 The third component is the squared sum of all the points, which is (5, 5) (= (1² + 2², 1² + 2²)) in this example.


Table 2.6 A two-dimensional dataset (Pattern number, feature1, feature2)

1 1
2
3 2
4
5
6 11
7 14
8 13

Merging Clusters: A major flexibility offered by representing clusters using CF vectors is that it is very easy to merge two or more clusters. For example, if ni is the number of elements in Ci, lsi is the linear sum, and ssi is the squared sum, so that the CF vector of cluster Ci is ⟨ni, lsi, ssi⟩ and the CF vector of cluster Cj is ⟨nj, lsj, ssj⟩, then the CF vector of the cluster obtained by merging Ci and Cj is

⟨ni + nj, lsi + lsj, ssi + ssj⟩.
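The CF representation and its merging property can be sketched as follows; the linear and squared sums are kept per dimension, matching the ⟨2, (3, 3), (5, 5)⟩ example above:

```python
def cf(points):
    """CF vector <n, ls, ss>: cluster size, per-dimension linear sum,
    per-dimension squared sum."""
    d = len(points[0])
    n = len(points)
    ls = tuple(sum(p[i] for p in points) for i in range(d))
    ss = tuple(sum(p[i] ** 2 for p in points) for i in range(d))
    return n, ls, ss

def merge_cf(a, b):
    """CF of the union of two clusters: componentwise addition."""
    return (a[0] + b[0],
            tuple(x + y for x, y in zip(a[1], b[1])),
            tuple(x + y for x, y in zip(a[2], b[2])))

c = cf([(1, 1), (2, 2)])                       # <2, (3, 3), (5, 5)>
merged = merge_cf(cf([(1, 1)]), cf([(2, 2)]))  # same CF via merging
```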

Computing Cluster Parameters: Another important property of the CF representation is that several statistics associated with the corresponding cluster can be obtained easily using it. A statistic is a function of the samples in the cluster. For example, if a cluster C = {X1, X2, ..., Xq}, then

Centroid of C = CentroidC = (1/q) Σ_{j=1}^{q} Xj = ls/q,

Radius of C = R = [ (1/q) Σ_{i=1}^{q} (Xi − CentroidC)² ]^{1/2} = [ (1/q)(ss − 2 ls²/q + ls²/q) ]^{1/2} = [ ss/q − (ls/q)² ]^{1/2}.

• At the leaf node level, each cluster is controlled by a user-provided threshold. If T is the threshold, then all the points in the cluster lie in a sphere of radius T. As the clusters are merged to form clusters at a previous level, one can use the merging property of the CF vectors.


Fig 2.7 Insertion of the first three patterns

By inserting all the eight patterns, we get the CF-tree shown in Fig. 2.8. Some of the important characteristics of the incremental algorithms are:

1 They require one database scan to generate the clustering of the data. Each pattern is examined only once in the process. In the case of the leader clustering algorithm, the clustering is represented by a set of leaders. If the threshold value is small, then a larger number of clusters is generated; similarly, if the threshold value is large, then the number of clusters is small.

2 BIRCH generates a CF-tree using a single database scan Such an abstraction captures clusters in a hierarchical manner Merging two smaller clusters to form a bigger cluster is very easy by using the merging property of the corresponding CF vectors.

3 The parameters controlling the size of the CF-tree are the number of clusters stored in each node of the tree and the threshold value used at the leaf nodes to fix the size of the clusters.

4 Order-independence is an important property of clustering algorithms. An algorithm is order-independent if it generates the same partition for any order in which the data is presented; otherwise, it is order-dependent. Unfortunately, incremental algorithms can be order-dependent. This may be illustrated using the example shown in Fig. 2.9.

Fig 2.9 Order-dependence of leader algorithm

By choosing the order in different ways, we get different partitions in terms of both the number and the size of the clusters. For example, by choosing the three points labelled X1, X2, X3 in that order as shown in the left part of the figure, we get four clusters irrespective of the order in which the points X4, X5, X6 are processed. Similarly, by selecting the centrally located points X4 and X5 as shown in the right part of the figure as the first two points in the order, we get two clusters irrespective of the order of the remaining four points.

2.5.2.2 Divide-and-Conquer Clustering

Conventionally, designers of clustering algorithms tacitly assume that the data sets fit the main memory. This assumption does not hold when the data sets are large. In such a situation, it makes sense to consider the data in parts, cluster each part independently, and obtain the corresponding clusters and their representatives. Once we have obtained the cluster representatives for each part, we can cluster these representatives appropriately and realize the clusters corresponding to the entire data set. If two or more representatives from different parts are assigned to some cluster C, then assign the patterns in the corresponding clusters (of these representatives) to C. Specifically, this may be achieved using a two-level clustering scheme depicted in Fig. 2.10. There are n patterns in the data set. All these patterns are stored on a disk. Each part or block of size n/p patterns is considered for clustering at a time. These n/p data points are clustered in the main memory into k clusters using some clustering algorithm. Clustering these p parts can be done either sequentially or in parallel; the number of these clusters corresponding to all the p blocks is pk as there are k clusters in each block. So, we will have pk cluster representatives. By clustering these pk cluster representatives using the same or a different clustering algorithm into k clusters, we can realize a clustering of the entire data set as stated earlier.


Fig 2.10 Divide-and-conquer approach to clustering

One-level Algorithm: It does not employ divide-and-conquer. It is the conventional single-link algorithm applied on n data points, which makes n(n − 1)/2 distance computations.

The Two-level Algorithm: It requires:

– In each block at the first level, there are n/p points. So, the number of distance computations in each block is (n/p)((n/p) − 1)/2.

– There are p blocks at the first level. So, the total number of distances at the first level is (n/2)((n/p) − 1).

– There are pk representatives at the second level. So, the number of distances computed at the second level is pk(pk − 1)/2.

– So, the total number of distances for the two-level divide-and-conquer algorithm is (n/2)((n/p) − 1) + pk(pk − 1)/2.

A Comparison: The numbers of distances computed by the conventional single-link algorithm and by the two-level algorithm are shown in Table 2.7 for different values of n, k, and p.


Table 2.7 Number of distances computed

No. of data points (n) | No. of blocks (p) | No. of clusters (k) | One-level algorithm | Two-level algorithm
100 | | | 4950 | 2495
500 | 20 | | 124,750 | 15,900
1000 | 20 | | 499,500 | 29,450
10,000 | 100 | | 49,995,000 | 619,750
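The two counts can be checked mechanically against the formulas above. The values p = 2 and k = 5 used for the n = 100 row below are inferred (those cells are garbled in the extracted table); with them, both table entries for n = 100 and n = 1000 are reproduced.

```python
def one_level(n):
    """Distances computed by single-link on all n points."""
    return n * (n - 1) // 2

def two_level(n, p, k):
    """p first-level blocks of n/p points each, then pk representatives."""
    first = p * ((n // p) * (n // p - 1) // 2)
    second = (p * k) * (p * k - 1) // 2
    return first + second

counts = (one_level(100), two_level(100, 2, 5),
          one_level(1000), two_level(1000, 20, 5))
# (4950, 2495, 499500, 29450), matching Table 2.7
```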

2.5.2.3 Clustering Based on an Intermediate Representation

The basic idea here is to generate an abstraction by scanning the dataset once or twice and then use the abstraction, not the original data, for further processing. In order to illustrate the working of this category of algorithms, we use the dataset shown in Table 2.5. We use a database scan to find the frequent 1-itemsets. Using a Minsup value of 3, we get the following frequent itemsets: {1}, {4}, {7}, {3}, {6}, and {9}; all these items have a support value of at least 3. We perform one more database scan to construct a tree using only the frequent items. First, we consider transaction t1 and insert it into a tree as shown in Fig. 2.11. Here, we consider only the frequent items present in t1; these are 1, 4, and 7. So, these are inserted into the tree by having one node for each item present. The item numbers are indicated inside the nodes; in addition, the count values are also indicated along with the item numbers. For example, in Fig. 2.11(a), 1:1, 4:1, and 7:1 indicate that items 1, 4, and 7 are present in the transaction. Next, we consider t2, which has the same items as t1, and so we simply increment the counts as shown in Fig. 2.11(b). After examining all the six transactions, we get the tree shown in Fig. 2.11(c). In the process, we need to create new branches and nodes appropriately as we encounter new transactions. For example, after considering t4, we have items 3, 6, and 9 present in it, which prompts us to start a new branch with nodes for the items 3, 6, and 9. At this point, the counts on the right branch of the tree for these items are 3:1, 6:1, and 9:1, respectively. It is possible to store the items in a transaction in any order, but we used the item numbers in increasing order.

Note that the two branches of the tree, which is called the Frequent-Pattern tree or FP-tree, correspond to two different clusters; here each cluster corresponds to a different class of 1s. Some of the important features of this class of algorithms are:

1 They require only two scans of the database. This is because each data item is examined only twice.

2 An abstraction is generated, and it is used for further processing. Centroids, leaders, and the FP-tree are some example abstractions.

3 The intermediate representation is useful in other important mining tasks like association rule mining, clustering, and classification. For example, the FP-tree has been successfully used in association rule mining, clustering, and classification.

4 Typically, the space required by the intermediate representation could be much smaller than that required by the original dataset.


Fig 2.11 A tree structure for the character patterns

There are several other types of intermediate representations. Some of them are:

• It is possible to reduce the computational requirements of clustering by using a random subset of the dataset.

• An important and not systematically pursued direction is to use a compression scheme to reduce the time and memory required to store the data. The compression scheme may be lossy or nonlossy; the compressed data is then used for further processing. This direction will be examined in great detail in the rest of the book.

2.5.3 Classification

It is also possible to exploit the three paradigms in classification. We discuss these directions next.

2.5.3.1 Incremental Classification

Most of the classifiers can be suitably altered to handle incremental classification. We can easily modify the NNC to perform incremental classification. This can be done by incrementally updating the nearest neighbor of the test pattern. The specific incremental algorithm for the NNC is:

1 Let Ak be the nearest neighbor of the test pattern X among the first k training patterns; to start with, set A1 = X1.

2 Next, when Xk+1 is encountered, we update the nearest neighbor of X to get Ak+1 using Ak and Xk+1.

3 Repeat step 2 till An is obtained.

We illustrate it using the dataset shown in Table 2.2. Consider the test pattern X = (2.0, 2.0, 2.0, 2.0)t and the patterns X1, X2, X3, X4. The nearest neighbor of X out of these four points is X4, which is at a distance of 1.414 units; so, A4 is X4. Now, if we encounter X5, then A5 gets updated, and it is X5 because d(X, X5) = 1.0, and it is smaller than d(X, A4), which is 1.414. Proceeding further in this manner, we note that A8 is X5; so, X is assigned to C1 as the class label of X5 is C1.

In a similar manner, it is possible to visualize an incremental version of the kNNC. For example, the three nearest neighbors of X after examining the first four patterns in Table 2.2 are X4, X1, and X2. Now if we encounter X5, then the three nearest neighbors are X5, X4, and X1. After examining all the eight patterns, we get the three nearest neighbors of X to be X5, X4, and X7. All the three neighbors are from class C1; so, we assign X to C1.
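A minimal sketch of the incremental NNC, on the Table 2.2 data: each step uses only the current abstraction Ak and the newly encountered pattern.

```python
import math

def incremental_nn(stream, X):
    """Maintain A_k, the nearest neighbor of test pattern X, one training
    pattern at a time; A_{k+1} depends only on A_k and X_{k+1}."""
    A, best_d, A_label = None, float('inf'), None
    for pattern, label in stream:
        d = math.dist(pattern, X)
        if d < best_d:
            A, best_d, A_label = pattern, d, label
    return A, A_label

# Table 2.2 data in the order X1..X8.
stream = [((1.0,1.0,1.0,1.0), 'C1'), ((6.0,6.0,6.0,6.0), 'C2'),
          ((7.0,7.0,7.0,7.0), 'C2'), ((1.0,1.0,2.0,2.0), 'C1'),
          ((1.0,2.0,2.0,2.0), 'C1'), ((7.0,7.0,6.0,6.0), 'C2'),
          ((1.0,2.0,2.0,1.0), 'C1'), ((6.0,6.0,7.0,7.0), 'C2')]
nn, label = incremental_nn(stream, (2.0, 2.0, 2.0, 2.0))
# A8 is X5 = (1, 2, 2, 2), so X is assigned to C1
```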

2.5.3.2 Divide-and-Conquer Classification

It is also possible to exploit the divide-and-conquer paradigm in classification. Even though it is possible to use it along with a variety of classifiers, we consider it in the context of the NNC. It is possible to use the division across either rows or columns of the data matrix.

Division across the rows: It may be described as follows:

1 Let the n rows of the data matrix be partitioned into p blocks, where there are n/p rows in each block.

2 Obtain the nearest neighbor of the test pattern X in each block using n/p distances. Let the nearest neighbor of X in the ith block be Xi, and its distance from X be di.

3 Let dj be the minimum of the values d1, d2, ..., dp. Then NN(X) = Xj. Ties may be arbitrarily broken.

Note that the computations in steps 2 and 3 can be parallelized to a large extent. We illustrate this algorithm using the data shown in Table 2.2. Let us consider two (p = 2) blocks such that:

– Block1 = {X1, X2, X3, X4};
– Block2 = {X5, X6, X7, X8}.

The nearest neighbor of X = (2.0, 2.0, 2.0, 2.0)t in Block1 is X4, at a distance of 1.414 units, and its nearest neighbor in Block2 is X5, at a distance of 1.0. So, NN(X) = X5, and X is assigned to C1, the class of X5.
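The row-division scheme can be sketched as below: per-block nearest neighbors (which could run in parallel) followed by a minimum over the block winners.

```python
import math

def nn_by_row_blocks(blocks, X):
    """Nearest neighbor of X: a per-block NN search, then a global minimum."""
    winners = [min(block, key=lambda P: math.dist(P[0], X))
               for block in blocks]
    return min(winners, key=lambda P: math.dist(P[0], X))

# Table 2.2 data split into two row blocks.
block1 = [((1.0,1.0,1.0,1.0), 'C1'), ((6.0,6.0,6.0,6.0), 'C2'),
          ((7.0,7.0,7.0,7.0), 'C2'), ((1.0,1.0,2.0,2.0), 'C1')]
block2 = [((1.0,2.0,2.0,2.0), 'C1'), ((7.0,7.0,6.0,6.0), 'C2'),
          ((1.0,2.0,2.0,1.0), 'C1'), ((6.0,6.0,7.0,7.0), 'C2')]
nn, label = nn_by_row_blocks([block1, block2], (2.0, 2.0, 2.0, 2.0))
# Block1 yields X4, Block2 yields X5; the overall NN is X5, giving class C1
```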

Division among the columns: An interesting situation emerges when the columns are grouped together: it can lead to novel pattern generation or pattern synthesis. The specific algorithm is given below:

1 Divide the number of features d into p blocks, where each block has d/p features. Consider the data corresponding to each of these blocks in each of the classes.

2 Divide the test pattern X into p blocks; let the corresponding subpatterns be X^1, X^2, ..., X^p, respectively.

3 Find the nearest neighbor of each X^i for i = 1, 2, ..., p from the corresponding ith block of each class.

4 Concatenate the nearest subpatterns obtained for each class separately. Among these concatenated patterns, obtain the one nearest to X; assign the class label of the nearest concatenated pattern to X.

We explain the working of this scheme using the example data shown in Table 2.2 and the test pattern X = (2.0, 2.0, 2.0, 2.0)t. Let p = 2. Let the two feature blocks be

– Block1 = {feature1, feature2};
– Block2 = {feature3, feature4}.

Correspondingly, the test pattern has two blocks, X^1 = (2.0, 2.0)t and X^2 = (2.0, 2.0)t. The training data after partitioning into two feature blocks and reorganizing so that all the patterns of a class are put together is shown in Table 2.8. Note that, for X^1, the nearest neighbor from C1 can be either the first subpattern of X5, which is denoted by X5^1, or X7^1; we resolve the tie in favor of X5^1 as X5 appears before X7 in the table. Further, the nearest subpattern from C2 for X^1 is X2^1. Similarly, for the second subpattern X^2 of X, the nearest neighbors from C1 and C2 respectively are X4^2 and X2^2. Now, concatenating the nearest subpatterns from the two classes, we have

– C1: X5^1 : X4^2, which is (1.0, 2.0, 2.0, 2.0)t;
– C2: X2^1 : X2^2, which is (6.0, 6.0, 6.0, 6.0)t.

Out of these two patterns, the pattern from C1 is nearer to X than the pattern from C2, the corresponding distances being 1.0 and 8.0, respectively. So, we assign X to C1.
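With the squared Euclidean distance, the column-division scheme needs only per-block minima (the decomposition is made explicit in point 3 below); a minimal sketch:

```python
def classify_by_column_blocks(class_data, X, blocks):
    """Squared Euclidean distance to the concatenated nearest subpatterns
    of each class equals the sum of per-block minimum squared distances."""
    best_label, best_d2 = None, float('inf')
    for label, patterns in class_data.items():
        d2 = 0.0
        for blk in blocks:            # feature blocks, e.g. (0, 1) and (2, 3)
            d2 += min(sum((P[i] - X[i]) ** 2 for i in blk) for P in patterns)
        if d2 < best_d2:
            best_label, best_d2 = label, d2
    return best_label, best_d2

# Table 2.2 / Table 2.8 data, grouped by class.
class_data = {
    'C1': [(1.0,1.0,1.0,1.0), (1.0,1.0,2.0,2.0),
           (1.0,2.0,2.0,2.0), (1.0,2.0,2.0,1.0)],
    'C2': [(6.0,6.0,6.0,6.0), (7.0,7.0,7.0,7.0),
           (7.0,7.0,6.0,6.0), (6.0,6.0,7.0,7.0)],
}
label, d2 = classify_by_column_blocks(class_data, (2.0, 2.0, 2.0, 2.0),
                                      [(0, 1), (2, 3)])
# d2 is 1.0 for C1 (= 1.0 + 0.0) and 64 for C2 (= 32 + 32), so X goes to C1
```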

There are some important points to be considered here:

1 In the above example, both the concatenated patterns are already present in the data. However, it is possible that novel patterns are generated by concatenating the nearest subpatterns. For example, consider the test pattern Y = (1.0, 2.0, 1.0, 1.0)t. In this case, the nearest subpatterns from C1 and C2 for Y^1 = (1.0, 2.0)t and Y^2 = (1.0, 1.0)t are given below:

– The nearest neighbors of Y^1 from C1 and C2 respectively are X5^1 and X2^1.

– The nearest neighbors of Y^2 from C1 and C2 respectively are X1^2 and X2^2.
– Concatenating the nearest subpatterns from C1, we get (1.0, 2.0, 1.0, 1.0)t.
– Concatenating the nearest subpatterns from C2, we get (6.0, 6.0, 6.0, 6.0)t.

So, Y is classified as belonging to C1 because the concatenated pattern (1.0, 2.0, 1.0, 1.0)t is closer to Y than the pattern from C2. Note that in this case, the concatenated pattern is the novel pattern (1.0, 2.0, 1.0, 1.0)t, which is not a part of the training data from C1. So, this scheme has the potential to generate novel patterns from each of the classes and use them in decision making. In general, if there are p blocks and ni patterns in class Ci, the size of the space of all possible concatenated patterns in the class is ni^p, which can be much larger than ni.

2 Even though the effective search space size or the number of patterns examined from the ith class is ni^p, the actual effort involved in finding the nearest concatenated pattern is of O(ni p), which is linear.

3 There is no need to compute the distance between X and the concatenated nearest subpatterns from each class separately if an appropriate distance function is used. For example, if we use the squared Euclidean distance, then the distance between the test pattern X and the concatenated subpatterns from a class is the sum of the distances between the corresponding subpatterns. Specifically,

d²(X, CNi(X)) = Σ_{j=1}^{p} d²(X^j, NNi(X^j)),

where CNi(X) is the concatenation of the nearest subpatterns of the X^j s from class Ci, and NNi(X^j) is the nearest subpattern of X^j from Ci. For example,

– The nearest subpattern of X^1 from C1 is X5^1, and that of X^2 is X4^2.

– The corresponding squared Euclidean distances are d²(X^1, X5^1) = 1.0 and d²(X^2, X4^2) = 0.0.

– So, the distance between X and the concatenated pattern (1.0, 2.0, 2.0, 2.0)t is 1.0 + 0.0 = 1.0.

– Similarly, for C2, the nearest subpatterns of X^1 and X^2 are X2^1 and X2^2, respectively.

– The corresponding distances are d²(X^1, X2^1) = 32 and d²(X^2, X2^2) = 32. So, d²(X, CN2(X)) = 32 + 32 = 64.

4 It is possible to extend this partition-based scheme to the kNNC.

2.5.3.3 Classification Based on Intermediate Abstraction

Here, an abstraction is generated from the training data, and a test pattern is classified using the abstraction instead of the entire training dataset. Some of the possible schemes are given below.

Table 2.8 Reorganized data matrix

Pattern ID feature1 feature2 feature3 feature4 Class label

X1 1.0 1.0 1.0 1.0 C1

X4 1.0 1.0 2.0 2.0 C1

X5 1.0 2.0 2.0 2.0 C1

X7 1.0 2.0 2.0 1.0 C1

X2 6.0 6.0 6.0 6.0 C2

X3 7.0 7.0 7.0 7.0 C2

X6 7.0 7.0 6.0 6.0 C2

X8 6.0 6.0 7.0 7.0 C2

1 Clustering-based: Cluster the training data and use the cluster representatives as the intermediate abstraction. Clustering could be carried out in each class separately. The resulting clusters may be interpreted as subclasses of the respective classes. For example, consider the two-class four-dimensional dataset shown in Table 2.2. By clustering the data in each class separately using the k-means algorithm with k = 2, we get the following centroids:

C1: By selecting X1 and X5 as the initial centroids, the clusters obtained using the k-means algorithm are C11 = {X1} and C12 = {X4, X5, X7}, and the centroids of these clusters are (1.0, 1.0, 1.0, 1.0)t and (1.0, 1.66, 2.0, 1.66)t, respectively. Here, C11 and C12 are the first and second clusters obtained by grouping the data in C1.

C2: By selecting X2 and X3 as the initial centroids, using the k-means algorithm, we get the clusters C21 = {X2, X6, X8} and C22 = {X3}, and the respective centroids are (6.33, 6.33, 6.33, 6.33)t and (7.0, 7.0, 7.0, 7.0)t.

Classification of X: Using the four centroids, two from each class, instead of all the eight training points, we classify the test pattern X = (2.0, 2.0, 2.0, 2.0)t. The distances between X and these four centroids are d(X, C11) = 2.0, d(X, C12) = 1.22, d(X, C21) = 8.66, and d(X, C22) = 10.0. So, X is closest to C12, which is a cluster (or a subclass) in C1; as a consequence, X is assigned to C1.
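The prototype-based decision rule can be sketched as below, using the four class-wise k-means centroids listed above:

```python
import math

def classify_by_prototypes(prototypes, X):
    """Assign X the class label of its nearest cluster centroid."""
    label, _ = min(prototypes, key=lambda c: math.dist(c[1], X))
    return label

# Centroids from running k-means (k = 2) within each class, as in the text.
prototypes = [('C1', (1.0, 1.0, 1.0, 1.0)),
              ('C1', (1.0, 1.66, 2.0, 1.66)),
              ('C2', (6.33, 6.33, 6.33, 6.33)),
              ('C2', (7.0, 7.0, 7.0, 7.0))]
label = classify_by_prototypes(prototypes, (2.0, 2.0, 2.0, 2.0))
# the nearest centroid is that of the subclass C12, so X is assigned to C1
```

Only four distance computations are needed here instead of eight, which is the point of the abstraction.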


Table 2.9 Transaction data for incremental mining

Transaction Itemset given Itemset in frequency order

t1 {a, c, d, e} {a, d}

t2 {a, d, e} {a, d}

t3 {b, d, e} {d, b}

t4 {a, c} {a}

t5 {a, b, c, d} {a, d, b}

t6 {a, b, d} {a, d, b}

t7 {a, b, d} {a, d, b}

2.5.4 Frequent Itemset Mining

In association rule mining, an important and time-consuming step is frequent itemset generation. So, we consider frequent itemset mining here.

2.5.4.1 Incremental Frequent Itemset Mining

There are incremental algorithms for frequent itemset mining. They do not follow the incremental mining definition given earlier; they may require an additional database scan. We discuss an incremental algorithm next.

1 Consider a block of m transactions, Block1, to find the frequent itemsets. Store the frequent itemsets along with their supports. If an itemset is infrequent, but all its subsets are frequent, then it is a border set. Obtain the set of such border sets. Let F1 and B1 be the frequent and border sets from Block1.

2 Now let the database be extended by adding a block, Block2, of transactions. Find the frequent itemsets and border sets in Block2. Let them be F2 and B2. We update the frequent itemsets as follows:

• If an itemset is present in both F1 and F2, then it is frequent.

• If an itemset is infrequent in both the blocks, then it is infrequent.

• If an itemset is frequent in F1 but not in F2, it can be eliminated by using the support values.

• Itemsets absent in F1 but frequent in the union of the two blocks can be obtained by using the notion of a promoted border. This happens when an itemset that is a border set in Block1 becomes frequent in the union of the two blocks. If such a thing happens, then additional candidates are generated and tested using another database scan.
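The per-block computation of frequent and border sets can be sketched by brute force (adequate at this toy scale; real implementations would use Apriori-style enumeration):

```python
from itertools import combinations

def frequent_and_border(transactions, minsup):
    """Supports of every itemset over a block; a border set is infrequent
    while all of its immediate subsets are frequent."""
    items = sorted({i for t in transactions for i in t})
    support = {}
    for r in range(len(items) + 1):          # include the empty itemset
        for c in map(frozenset, combinations(items, r)):
            support[c] = sum(1 for t in transactions if c <= t)
    frequent = {c: s for c, s in support.items() if s >= minsup}
    border = {c: s for c, s in support.items()
              if s < minsup and all(frozenset(b) in frequent
                                    for b in combinations(c, len(c) - 1))}
    return frequent, border

# Block1 = first four transactions of Table 2.9, per-block Minsup 2.
block1 = [set("acde"), set("ade"), set("bde"), set("ac")]
F1, B1 = frequent_and_border(block1, 2)
# B1 = { {b}:1, {c,d}:1, {c,e}:1 }, as in the example that follows
```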

We illustrate this algorithm using the data shown in Table 2.9. Let Block1 consist of the first four transactions; with an overall Minsup value of 4 for the seven transactions, the corresponding threshold used within each block here is 2. For Block1, the frequent and border sets are:

F1 = {a : 3}, {c : 2}, {d : 3}, {e : 3}, {a, c : 2}, {a, d : 2}, {a, e : 2}, {d, e : 3}, {a, d, e : 2}

B1 = {b : 1}, {c, d : 1}, {c, e : 1}

Now we encounter the incremental portion, or Block2, consisting of the remaining three transactions from Table 2.9. For this part, the sets F2 and B2 are:

F2 = {a : 3}, {b : 3}, {d : 3}, {a, b : 3}, {b, d : 3}, {a, d : 3}, {a, b, d : 3}

B2 = {c : 1}

Now we know from F1 and F2 that {a : 6}, {d : 6}, and {a, d : 5} are present in both F1 and F2. So, they are frequent. Further note that {b : 4}, a border set in Block1, gets promoted to become frequent. So, we add it to the frequent itemsets. We also need to consider {a, b}, {b, d}, and {a, b, d}, which may become frequent. However, {b, c} and {b, e} need not be considered because {c} and {e} are infrequent. Now we need to make a scan of the database to decide that {b, d : 4} is frequent, but neither {a, b : 3} nor {a, b, d : 3}.

2.5.4.2 Divide-and-Conquer Frequent Itemset Mining

The divide-and-conquer strategy has been used in mining frequent itemsets. The specific algorithm is as follows.

Input: Transaction data matrix and Minsup value.
Output: Frequent itemsets.

1 Divide the transaction data into p blocks so that each block has n/p transactions.

2 Obtain the frequent itemsets in each of the blocks. Let Fi be the set of frequent itemsets in the ith block.

3 Take the union of all the frequent itemsets; let it be F. That means F = ∪_{i=1}^{p} Fi.

4 Use one more database scan to find the supports of the itemsets in F. Those satisfying the Minsup threshold are the frequent itemsets. Collect them in Ffinal, which is the set of all the frequent itemsets.

Some of the features of this algorithm are as follows:

1 The most important feature is that if an itemset is infrequent in all the p blocks, then it cannot be frequent.

2 This is a two-level algorithm, and it considers only those itemsets that are frequent at the first level for the possibility of being frequent.

3 The worst-case scenario emerges when, at the end of the first level, all the itemsets are members of F. This can happen in the case of datasets where the transactions are dense or nonsparse. In such a case, an FP-tree that stores the itemsets in a compact manner can be used.

We explain this algorithm using the data shown in Table 2.9. Let us consider two blocks and a Minsup value of 4, which means a value of 2 in each block; the first block consists of the first four transactions, and the second block of the remaining three.


F1 = {a}, {c}, {d}, {e}, {a, c}, {a, d}, {a, e}, {d, e}, {a, d, e}

F2 = {a}, {b}, {d}, {a, b}, {b, d}, {a, d}, {a, b, d}

F = {a}, {b}, {c}, {d}, {e}, {a, b}, {a, c}, {a, d}, {a, e}, {b, d}, {d, e}, {a, d, e}, {a, b, d}

• We examine the elements of F in one more dataset scan to get Ffinal:
Ffinal = {a : 6}, {b : 4}, {d : 6}, {a, d : 5}, {b, d : 4}
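The two-level scheme can be sketched as follows; the per-block mining is brute force here, but any frequent itemset miner would do at the first level.

```python
from itertools import combinations

def local_frequents(block, minsup):
    """Exhaustively enumerate the frequent itemsets of one block."""
    items = sorted({i for t in block for i in t})
    return {c for r in range(1, len(items) + 1)
            for c in map(frozenset, combinations(items, r))
            if sum(1 for t in block if c <= t) >= minsup}

def partition_mine(blocks, block_minsup, minsup):
    """Level 1: mine each block locally; level 2: one full scan over the
    union of the local frequents keeps the globally frequent itemsets."""
    candidates = set()
    for block in blocks:
        candidates |= local_frequents(block, block_minsup)
    all_t = [t for block in blocks for t in block]
    return {c: s for c in candidates
            if (s := sum(1 for t in all_t if c <= t)) >= minsup}

# Table 2.9 transactions split into the first four and the remaining three.
blocks = [[set("acde"), set("ade"), set("bde"), set("ac")],
          [set("abcd"), set("abd"), set("abd")]]
final = partition_mine(blocks, 2, 4)
# globally frequent: {a}:6, {b}:4, {d}:6, {a,d}:5, {b,d}:4
```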

2.5.4.3 Intermediate Abstraction for Frequent Itemset Mining

It is possible to read the database once or twice, produce an abstraction, and use this abstraction for obtaining frequent itemsets. The most popular abstraction in this context is the Frequent-Pattern tree or FP-tree. It is constructed using two database scans. It has been used in clustering and classification; however, it was originally proposed for obtaining frequent itemsets. The detailed algorithm for constructing an FP-tree is given below:

Input: Transaction database and Minsup.
Output: FP-tree.

1 Scan the dataset once to get the frequent 1-itemsets using the Minsup value.

2 Scan the database once more; in each transaction, ignore the infrequent items and insert the remaining part of the transaction in decreasing order of support of the items. Also maintain the frequency counts along with the items so that if multiple transactions share the same subset of items, then they are inserted into the same branch of the tree, as shown in Fig. 2.11. The frequency counts of the items in the branch are updated appropriately instead of storing them in multiple branches.

Construction of the FP-tree was discussed using the data shown in Table 2.5 and Fig. 2.11. By examining the FP-tree shown in Fig. 2.11(c), it is possible to show that {1,4,7} and {3,6,9} are the two maximal frequent itemsets. Each of them corresponds to a type of the character 1, and each itemset shares a branch in the tree. Once the tree is obtained, frequent itemsets are found by going through the tree in a bottom-up manner, starting with a suffix based on the less frequent items present in the tree. This is done efficiently using an index structure.

We illustrate the frequent itemset mining using the data shown in Table 2.9. The corresponding FP-tree is shown in Fig. 2.12. Some of the details related to the construction of the tree and finding frequent itemsets are:

• The frequent 1-itemsets are {a:6}, {d:6}, and {b:4}, obtained by using a value of 4 for Minsup and the data in Table 2.9.

• We rewrite the transactions using the frequency order and Minsup information. Infrequent items are deleted, and frequent items are ordered in decreasing order of frequency. Ties are broken based on lexicographic order. The modified transactions are shown in the corresponding column of the table.


Fig 2.12 An example FP-tree

• In order to mine the frequent itemsets from the tree, we start with the least frequent among the frequent items, which is b in this case, and mine for all the itemsets from the tree with b as the suffix. For this, we consider the FP-tree above the item b, as shown by the curved line segment. Item d occurs in both the branches, with frequencies 5 and 1, respectively. However, in terms of co-occurrence along with b, which has a frequency of 3 in the left branch, we need to consider a frequency of 3 for d and a, because they co-occurred in only three transactions along with b. Similarly, from the right branch we know that b and d co-occurred once. From this we get the frequencies of {b, d} and {a, b} to be 3 from the left branch; in addition, from the right branch we get a frequency of 1 for {b, d}. This means that the cumulative frequency of {b, d} is 4, and so it is frequent, but not {a, b}, with a frequency of 3, using the value of 4 for Minsup.

• Next, we consider item d, which appears after b in the bottom-up order of frequency. Note that d has a frequency of 6, and by using it as the suffix we get the itemset {a, d}, which has a frequency of 5 from the left branch, and so it is frequent.

• Finally, we consider a, which has a frequency of 6, and so the itemset {a} is frequent.

• Based on the above-mentioned conditional mining, we get the following frequent itemsets: {a:6}, {d:6}, {a, d:5}, {b:4}, and {b, d:4}.
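The construction and suffix-based mining just described can be sketched as follows. The transactions are hypothetical (Table 2.9 is not reproduced here), chosen so that the filtered counts match those discussed in the text (a:6, d:6, b:4, {a,d}:5, {b,d}:4), and the miner is simplified to report only frequent 1- and 2-itemsets, which suffices for this example.

```python
from collections import Counter, defaultdict

class Node:
    def __init__(self, item, parent):
        self.item, self.parent, self.count, self.children = item, parent, 0, {}

def build_fptree(transactions, minsup):
    support = Counter(i for t in transactions for i in t)      # database scan 1
    order = {i: c for i, c in support.items() if c >= minsup}  # frequent 1-itemsets
    root, header = Node(None, None), defaultdict(list)
    for t in transactions:                                     # database scan 2
        # drop infrequent items; sort by decreasing support, ties lexicographic
        path = sorted((i for i in t if i in order), key=lambda i: (-order[i], i))
        node = root
        for item in path:
            if item not in node.children:
                node.children[item] = Node(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += 1                # shared branches accumulate counts
    return root, header, order

def mine_pairs(header, order, minsup):
    """Simplified conditional mining: each item as suffix, pairs only."""
    freq = {(i,): c for i, c in order.items()}
    for suffix in sorted(order, key=lambda i: order[i]):       # least frequent first
        conditional = Counter()
        for node in header[suffix]:        # walk each branch above the suffix
            parent = node.parent
            while parent.item is not None:
                conditional[parent.item] += node.count
                parent = parent.parent
        for item, count in conditional.items():
            if count >= minsup:
                freq[tuple(sorted((item, suffix)))] = count
    return freq

# Hypothetical transactions consistent with the branch counts discussed above.
transactions = [['a', 'd', 'b']] * 3 + [['a', 'd']] * 2 + [['a']] + [['d', 'b']]
_, header, order = build_fptree(transactions, minsup=4)
```

Running the miner on these transactions reproduces the itemsets listed above, including the elimination of {a, b} with a conditional count of only 3.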

2.6 Summary

It is important to have some scalable approaches. Specifically, schemes requiring a small number of database scans are important. In this chapter, conventional algorithms used for data mining were discussed first.

There are three different directions for dealing with large-scale datasets. These are based on incremental mining, divide-and-conquer approaches, and an intermediate representation. In an incremental algorithm, each data point is processed only once; so, a single database scan is required for mining. Divide-and-conquer is a well-known algorithm design strategy, and it can be exploited in the context of data mining tasks including clustering, classification, and frequent itemset mining. The third direction deals with generating an intermediate representation by scanning the database once or twice and uses the abstraction, instead of the data, for further processing. Tree structures like the CF-tree and FP-tree are good examples of intermediate representations. Such trees can be built from the data very efficiently.

2.7 Bibliographic Notes

Important data mining tools including clustering, classification, and association rule mining are discussed in Pujari (2001). A good discussion on clustering is provided in the books by Anderberg (1973) and Jain and Dubes (1988). They discuss the single-link algorithm and the k-means algorithm in a detailed manner. Analysis of these algorithms is provided in Jain et al (1999). The k-means algorithm was originally proposed by MacQueen (1967). Initial seed selection is an important step in the k-means algorithm. Babu and Murty (1993) use genetic algorithms for initial seed selection. Arthur and Vassilvitskii (2007) presented a probabilistic seed selection scheme. The single-link algorithm was proposed by Sneath (1957). An analysis of the convergence properties of the k-means algorithm is provided by Selim and Ismail (1984). Using the k-means step in genetic algorithm-based clustering, which converges to the global optimum, is proposed and discussed by Krishna and Murty (1999).

An authoritative treatment of classification is provided in the popular book by Duda et al (2000). A comprehensive treatment of the nearest-neighbor classifiers is provided by Dasarathy (1990). Prototype selection is important in reducing the computational effort of the nearest-neighbor classifier. Ravindra Babu and Murty (2001) study prototype selection using genetic algorithms. Jain and Chandrasekaran (1982) discuss the problems associated with dimensionality and sample size. Sun et al (2013) propose feature selection based on dynamic weights for classification. The problems associated with computing nearest neighbors in high-dimensional spaces are discussed by François et al (2007) and Radovanović et al (2009). Even though Vapnik (1998) is the proponent of SVMs, they were popularized by the tutorial paper by Burges (1998).


Ananthanarayana et al (2003) use a variant of the FP-tree, which can be built using one database scan. The role of frequent itemsets in clustering was examined by Fung (2002). Yin and Han (2003) use frequent itemsets in classification. The role of discriminative frequent patterns in classification is analyzed by Cheng et al (2007). The survey paper by Berkhin (2002) discusses a variety of clustering algorithms and approaches that can handle large datasets. Different paradigms for clustering large datasets were presented by Murty (2002). The book by Xu and Wunsch (2009) on clustering offers a good discussion of clustering large datasets. A major problem with distance-based clustering and classification algorithms is that discrimination becomes difficult in high-dimensional spaces. Clustering paradigms for high-dimensional data are discussed by Kriegel et al (2009). The Leader algorithm for incremental data clustering is described in Spath (1980). BIRCH is an incremental hierarchical platform for clustering, and it is proposed by Zhang (1997). Vijaya et al (2005) propose another efficient hierarchical clustering algorithm based on leaders. Efficient clustering using frequent itemsets was presented by Ananthanarayana et al (2001). Murty and Krishna (1980) propose a divide-and-conquer framework for efficient clustering. Guha et al (2003) proposed a divide-and-conquer algorithm for clustering stream data. Ng and Han (1994) propose two efficient randomized algorithms in the context of partitioning around medoids.

Viswanath et al (2004) use a divide-and-conquer strategy on the columns of the data matrix to improve the performance of the kNNC. Fan et al (2008) have developed a library, called LIBLINEAR, for dealing with large-scale classification using logistic regression and linear SVMs. Yu et al (2003) use CF-tree based clustering in training linear SVM classifiers efficiently. Asharaf et al (2006) use a modified version of the CF-tree for training kernel SVMs. Ravindra Babu et al (2007) have reported results on kNNC using run-length-coded data. Random forests, proposed by Breiman (2001), is one of the promising classifiers to deal with high-dimensional datasets.

The book by Han et al (2012) provides a wider and state-of-the-art coverage of several data mining tasks and applications. Topic analysis has become a popular activity after the proposal of latent Dirichlet allocation by Blei (2012). Yin et al (2012) combine community detection with topic modeling in analyzing latent communities. In text mining and information retrieval, Wikipedia is used (Hu et al (2009)) as an external knowledge source. Currently, there is a growing interest in analyzing Big Data (Russom (2011)) and the Map-Reduce (Pavlo et al (2009)) framework to deal with large datasets.

References

R Agrawal, R Srikant, Fast algorithms for mining association rules, in Proceedings of International Conference on VLDB (1994)

V.S Ananthanarayana, M.N Murty, D.K Subramanian, Efficient clustering of large data sets Pattern Recognit 34(12), 2561–2563 (2001)


M.R Anderberg, Cluster Analysis for Applications (Academic Press, New York, 1973)

D Arthur, S Vassilvitskii, K-means++: the advantages of careful seeding, in Proceedings of ACM-SODA (2007)

S Asharaf, S.K Shevade, M.N Murty, Scalable non-linear support vector machine using hierarchical clustering, in ICPR (2006), pp 908–911

G.P Babu, M.N Murty, A near-optimal initial seed value selection for k-means algorithm using genetic algorithm. Pattern Recognit Lett 14(10), 763–769 (1993)

P Berkhin, Survey of clustering data mining techniques Technical Report, Accrue Software, San Jose, CA (2002)

D.M Blei, Introduction to probabilistic topic models. Commun ACM 55(4), 77–84 (2012)
L Breiman, Random forests. Mach Learn 45(1), 5–32 (2001)

C.J.C Burges, A tutorial on support vector machines for pattern recognition Data Min Knowl Discov 2, 121–168 (1998)

H Cheng, X Yan, J Han, C.-W Hsu, Discriminative frequent pattern analysis for effective classification, in Proceedings of ICDE (2007)

B.V Dasarathy, Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques (IEEE Press, Los Alamitos, 1990)

R.O Duda, P.E Hart, D.G Stork, Pattern Classification (Wiley-Interscience, New York, 2000)
R.-E Fan, K.-W Chang, C.-J Hsieh, X.-R Wang, C.-J Lin, LIBLINEAR: a library for large linear classification. J Mach Learn Res 9, 1871–1874 (2008)

D François, V Wertz, M Verleysen, The concentration of fractional distances. IEEE Trans Knowl Data Eng 19(7), 873–885 (2007)

B.C.M Fung, Hierarchical document clustering using frequent itemsets M.Sc Thesis, Simon Fraser University (2002)

S Guha, A Meyerson, N Mishra, R Motwani, L O’Callaghan, Clustering data streams: theory and practice IEEE Trans Knowl Data Eng 15(3), 515–528 (2003)

J Han, J Pei, Y Yin, Mining frequent patterns without candidate generation, in Proc of ACM-SIGMOD (2000)

J Han, M Kamber, J Pei, Data Mining—Concepts and Techniques (Morgan Kaufmann, San Mateo, 2012)

X Hu, X Zhang, C Lu, E.K Park, X Zhou, Exploiting Wikipedia as external knowledge for document clustering, in ACM SIGKDD, KDD (2009)

A.K Jain, B Chandrasekaran, Dimensionality and sample size considerations in pattern recognition practice, in Handbook of Statistics, ed by P.R Krishnaiah, L Kanal (1982), pp 835–855
A.K Jain, R.C Dubes, Algorithms for Clustering Data (Prentice-Hall, Englewood Cliffs, 1988)
A.K Jain, M.N Murty, P.J Flynn, Data clustering: a review. ACM Comput Surv 31(3), 264–323 (1999)

H.-P Kriegel, P Kroeger, A Zimek, Clustering high-dimensional data: a survey on subspace clustering, pattern-based clustering and correlation clustering. ACM Trans Knowl Discov Data 3(1), 1–58 (2009)

K Krishna, M.N Murty, Genetic k-means algorithm. IEEE Trans Syst Man Cybern., Part B, Cybern 29(3), 433–439 (1999)

J MacQueen, Some methods for classification and analysis of multivariate observations, in Proceedings of the Fifth Berkeley Symposium (1967)

M.N Murty, Clustering large data sets, in Soft Computing Approach to Pattern Recognition and Image Processing, ed by A Ghosh, S.K Pal (World-Scientific, Singapore, 2002), pp 41–63

M.N Murty, G Krishna, A computationally efficient technique for data-clustering. Pattern Recognit 12(3), 153–158 (1980)

R.T Ng, J Han, Efficient and effective clustering methods for spatial data mining, in Proc of the VLDB Conference (1994)

A Pavlo, E Paulson, A Rasin, D.J Abadi, D.J Dewit, S Madden, M Stonebraker, A comparison of approaches to large-scale data analysis, in Proceedings of ACM SIGMOD (2009)


M Radovanović, A Nanopoulos, M Ivanović, Nearest neighbors in high-dimensional data: the emergence and influence of hubs, in Proceedings of ICML (2009)

T Ravindra Babu, M.N Murty, Comparison of genetic algorithm based prototype selection schemes Pattern Recognit 34(2), 523–525 (2001)

T Ravindra Babu, M.N Murty, V.K Agrawal, Classification of run-length encoded binary data Pattern Recognit 40(1), 321–323 (2007)

P Russom, Big data analytics TDWI Research Report (2011)

S.Z Selim, M.A Ismail, K-means-type algorithms: a generalized convergence theorem and characterization of local optimality. IEEE Trans Pattern Anal Mach Intell 6(1), 81–87 (1984)
P Sneath, The applications of computers to taxonomy. J Gen Microbiol 17(2), 201–226 (1957)
H Spath, Cluster Analysis Algorithms for Data Reduction and Classification (Ellis Horwood, Chichester, 1980)

X Sun, Y Liu, M Xu, H Chen, J Han, K Wang, Feature selection using dynamic weights for classification Knowl.-Based Syst 37, 541–549 (2013)

V.N Vapnik, Statistical Learning Theory (Wiley, New York, 1998)

P.A Vijaya, M.N Murty, D.K Subramanian, Leaders–subleaders: an efficient hierarchical clustering algorithm for large data sets. Pattern Recognit Lett 25(4), 505–513 (2005)

P Viswanath, M.N Murty, S Bhatnagar, Fusion of multiple approximate nearest neighbor classifiers for fast and efficient classification. Inf Fusion 5(4), 239–250 (2004)

D Xin, J Han, X Yan, H Cheng, Mining compressed frequent-pattern sets, in Proceedings of VLDB Conference (2005)

R Xu, D.C Wunsch II, Clustering (IEEE Press/Wiley, Los Alamitos/New York, 2009)

X Yin, J Han, CPAR: classification based on predictive association rules, in Proceedings of SDM (2003)

Z Yin, L Cao, Q Gu, J Han, Latent community topic analysis: integration of community discovery with topic modeling. ACM Trans Intell Syst Technol 3(4), 63:1–63:23 (2012)

H Yu, J Yang, J Han, Classifying large data sets using SVM with hierarchical clusters, in Proc of ACM SIGKDD (KDD) (2003)


Chapter 3

Run-Length-Encoded Compression Scheme

3.1 Introduction

Data Mining deals with a large number of patterns of high dimension. While dealing with such data, a number of factors become important, such as the size of the data, the dimensionality of each pattern, the number of scans of the database, storage of the entire data, storage of derived summary information, the computations involved on the entire data that lead to the summary information, etc. In the current chapter, we propose compression algorithms that work on patterns with binary-valued features. However, the algorithms are applicable to floating-point-valued features as well, provided they are appropriately quantized into a binary-valued feature set.

Conventional methods of data reduction include clustering, sampling, and the use of sufficient statistics or other information derived from the data. For clustering and classification of such large data, the computational effort and storage space required would be prohibitively large. In structural representation of patterns, string matching is carried out using an edit distance and the longest common subsequence. One possibility of dealing with such patterns is to represent them as runs. Efficient algorithms exist to compute approximate and exact edit distances of run-length-encoded strings. In the current chapter, we focus on a numerical similarity measure.

We propose a novel idea of compressing the binary data and carrying out clustering and classification directly on such compressed data. We use run-length encoding and demonstrate that such compression reduces both storage space and computation time. The work is directly applicable to mining of large-scale business transactions. The major contribution of the idea is in developing a scheme wherein a number of goals are achieved, such as reduced storage space, computation of the distance function on run-length-encoded binary patterns without having to decompress, preservation of the same classification accuracy that could be obtained using the original uncompressed data, and significantly reduced processing time.

We discuss theoretical foundations for such an idea, as well as practical implementation results.

T Ravindra Babu et al., Compression Schemes for Mining Large Datasets, Advances in Computer Vision and Pattern Recognition,

DOI10.1007/978-1-4471-5607-9_3, © Springer-Verlag London 2013


3.2 Compression Domain for Large Datasets

Data Mining deals with large datasets. The datasets can be formal databases, data warehouses, flat files, etc. From the Pattern Recognition perspective, a “Large Dataset” can be defined as a set of patterns that are not amenable to in-memory storage and operations.

Large Dataset Let n and d represent the number of patterns and the number of features, respectively. Largeness of data can be defined as one of the following:

1. n is small, and d is large
2. n is large, and d is small
3. n is large, and d is large

Algorithms needing multiple data scans result in high processing requirements. This motivates one to look for conventional and hybrid algorithms that are efficient in storage and computations.

Viewed from the Pattern Recognition perspective, a “large dataset” contains a large number of patterns, with each pattern being characterized by a large number of attributes or features. For example, in the case of large transaction data consisting of a large number of items, transactions are treated as patterns, with items serving as features. The largeness of datasets would make direct use of many conventional iterative Pattern Recognition algorithms for clustering and classification unwieldy.

Generation of data abstraction by means of clustering was earlier successfully carried out by various researchers. In order to scale up clustering algorithms, intelligent compression techniques were developed, which generate sufficient statistics in the form of clustering features from a large input dataset. Such statistics were further used for clustering. Work based on the notion of a scalable framework for clustering was carried out earlier, by identifying regions of the database that are required to be stored in memory, regions that are compressible, and regions that can be discarded. The literature contains work on a method known as “squashing,” which consists of three steps: grouping the input large data into mutually exclusive groups, computing low-order moments within each group, and generating pseudo-data. Such data can be further used for clustering. Another important research contribution was the use of a novel frequent pattern tree structure for storing compressed, crucial information about frequent patterns.

In this background, we attempt to represent the data in a compressed or compact way in a lossless manner and carry out clustering and classification directly on the compressed data. One such scheme that we consider in this chapter is a compact data representation, with clustering and classification carried out directly on that representation. The compact and original representations are in one-to-one correspondence. We call a compact data representation lossless when the uncompressed data generated from it matches the original data exactly, and lossy otherwise.


Handwritten Digit Dataset The handwritten digit data considered for illustration consists of 100,030 labeled 192-feature binary patterns. The data consists of 10 categories, viz., 0 to 9. Of this entire data, 66,700 patterns, equally divided into 10 categories, are considered as training patterns, and 33,330 as test patterns, with approximately 3330 patterns per class. We present some sample handwritten patterns. Each pattern of 192 features is represented as a 16×12 matrix. The patterns in the figure represent nonzero features. They are indicative of the zero and nonzero feature combinations, leading to varying run sequences of zeroes and ones.
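As a purely hypothetical illustration (the gray-level threshold below is assumed, not taken from the dataset description), a 16×12 gray-valued digit image could be quantized into the 192 binary features and rendered in the style of the sample patterns:

```python
THRESHOLD = 128          # assumed binarization threshold (illustrative only)

def binarize(gray_rows, threshold=THRESHOLD):
    """16 rows of 12 gray values -> flat list of 192 binary features."""
    return [1 if v >= threshold else 0 for row in gray_rows for v in row]

def render(features, rows=16, cols=12):
    """Show nonzero features as '#', as in the sample-pattern figures."""
    return "\n".join(
        "".join("#" if features[r * cols + c] else "." for c in range(cols))
        for r in range(rows))
```

The rendered matrix makes the runs visible: each printed row is a sequence of runs of '#' and '.', whose lengths vary from pattern to pattern even within a class.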

3.3 Run-Length-Encoded Compression Scheme

We discuss a scheme that compresses binary data as run lengths. A novel algorithm that computes dissimilarity in the compressed domain is presented. We begin by defining related terms, which in turn are used in describing the algorithm.

3.3.1 Discussion on Relevant Terms

1. Run. In any ordered sequence of elements of two kinds, each maximal subsequence of elements of like kind is called a run.
For example, the sequences 111 and 0000 are runs of 1s and 0s, respectively.
2. Run Length. The number of continuous elements of the same kind is defined as the run length.
For example, the sequence of binary numbers 1110000 has run lengths of 3 and 4 of 1s and 0s, respectively.
3. Run-String. The complete sequence of runs in a pattern is called a run-string.
For example, the run-string of the pattern 111000011000 is “3 4 2 3”.
4. Length of Run-String. The length of a run-string is defined as the sum of the runs in the string. For example, the length of the run-string “3 4 2 3” is 12 (3 + 4 + 2 + 3).
5. Run Dimension. The number of individual runs in a run-string is defined as the run dimension. For example, the pattern 111000011000 has the run-string “3 4 2 3”; its length is 12, and the run dimension is 4.
6. Compressed Data Representation (CDR). The run-string of a pattern written with the convention that the first entry is the number of leading 1s; when the pattern begins with 0s, the first entry is 0. For example, the CDR of 111000011000 is “3 4 2 3”, and the CDR of 010 is “0 1 1 1”.


Table 3.1 Illustrations of compressed data representation

Sl. No.  Pattern        Compressed data representation  Length of the  Run dimension
                        of the run-string               run-string
1        111100110011   4 2 2 2 2                       12             5
2        011111001110   0 1 5 2 3 1                     12             6
3        111111111111   12                              12             1
4        000000000000   0 12                            12             2
5        101010101010   1 1 1 1 1 1 1 1 1 1 1 1         12             12
6        010101010101   0 1 1 1 1 1 1 1 1 1 1 1 1       12             13

7. Decompression of Compressed Data Representation. Based on the definitions above, the Compressed Data Representation of a given pattern can be expanded to its original form. The decompressed representation of the CDR of a pattern consists of expanding the run-string form of the pattern into binary data.
For example, consider the pattern 001000. The CDR of the pattern is 0 2 1 3. Alternately, the CDR 0 2 1 3 implies that the original pattern starts with 0s; the number of starting 0s is 2, followed by one 1, followed by three 0s, i.e., 001000.
Below, we state and elaborate the properties that will be used in Algorithm 3.1. Subsequently, we prove lemmas that are based on the proposed representation. The definitions are explained through a few illustrations in Table 3.1.
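The definitions above can be sketched as a pair of routines (a minimal Python sketch; the function names are ours): compress produces the CDR, whose first entry counts the leading 1s, and decompress inverts it exactly, so the representation is lossless.

```python
def compress(bits):
    """Binary pattern -> Compressed Data Representation (CDR).
    Entries at odd positions count 1s and entries at even positions count 0s,
    so the first entry is 0 when the pattern starts with 0s."""
    cdr, expected = [], 1
    i = 0
    while i < len(bits):
        run = 0
        while i < len(bits) and bits[i] == expected:
            run += 1
            i += 1
        cdr.append(run)          # a 0 can be appended only at the very start
        expected = 1 - expected
    return cdr

def decompress(cdr):
    """CDR -> original binary pattern (exact inverse of compress)."""
    bits, value = [], 1
    for run in cdr:
        bits.extend([value] * run)
        value = 1 - value
    return bits
```

Note that len(compress(bits)) is the run dimension and sum(compress(bits)) equals the number of bits, which is Property 3.2 below.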

3.3.2 Important Properties and Algorithm

Property 3.1 For a constant length of input binary patterns, the run dimension across different patterns need not be the same.

Every given pattern under study consists of a constant number of features. The features in the current context are binary. The data to be classified consists of intra- and inter-pattern variations. The variations in the patterns in terms of shape, size, orientation, etc. result in varying lengths of runs of 0s followed by 1s and vice versa. Hence, the run dimension across different patterns need not be the same. This is illustrated by the following example. Consider two patterns with an equal number of features, 11001100 and 11110001. The corresponding run-strings are 2 2 2 2 and 4 3 1. The run dimensions of the two patterns are 4 and 3, respectively.

Property 3.2 The sum of runs of a pattern is equal to the total number of bits present in the binary pattern.

The following illustrates this fact. Consider a pattern having m binary features. Let the features consist of p continuous sequences of like kind, alternating between 1s and 0s and leading to a run sequence of q1, q2, ..., qp.

Bit string: b(m−1) ... b1 b0
Run-string: q1 q2 ... qp
Then q1 + q2 + · · · + qp = m.

Property 3.3 Counted from left to right in a CDR, the entries at the odd positions 1, 3, 5, ... represent the numbers of continuous 1s, and the entries at the even positions 2, 4, 6, ... represent the numbers of continuous 0s. This follows from definition 6 of Sect. 3.3.1.

Property 3.4 The run dimension of a pattern can, at most, be one more than the run-string length.

We provide Algorithm 3.1 for computation of the dissimilarity between two compressed patterns directly in the compressed domain with the help of run-strings. To start with, all the patterns are converted to Compressed Data Representation. Let C1[1..m1] and C2[1..m2] represent any two patterns in Compressed Data Representation form. We briefly discuss the algorithm. In Step 1 of the algorithm, we read the patterns in their compressed form, C1[1..m1] and C2[1..m2], with m1 and m2 being the lengths of the two compressed patterns considered. It should be noted that m1 and m2 need not be equal, which most often is the case, even when both the patterns belong to the same class. In Step 2, we initialize the runs corresponding to the compressed forms C1[·] and C2[·], viz., R1 and R2, to the first runs in each of the compressed patterns and set the counters runlencounter1 and runlencounter2 to 1. In Steps 3 to 7, we compute the difference between R1 and R2, iteratively, till one of them is reduced to zero. As soon as one of them is reduced to zero, the next element of C1[1..m1] or C2[1..m2] is considered, based on which of them is reduced to zero. The distance is incremented by the minimum of the current values of R1 and R2 in Step 6 whenever the difference between the counters runlencounter1 and runlencounter2 is odd. It should be noted that when |runlencounter1 − runlencounter2| is odd, the corresponding runs are of unlike kind, viz., 0s and 1s. In Step 7, R1 and R2 are appropriately reset. Step 9 returns the Manhattan distance between the two patterns. The while-loop is terminated when runlencounter1 exceeds m1 or runlencounter2 exceeds m2.

|runlencounter1−runlencounter2|is odd, the corresponding runs are of unlike kind, viz., 0s and 1s In Step 7, runlencounter1and runlencounter2are appropriately re-set Step returns the Manhattan distance between the two patterns The while-loop is terminated when runlencounter1exceedsm1or runlencounter2exceedsm2 Algorithm 3.1 (Computation of Distance between Compressed Patterns)

Step 1: Read Compressed Pattern-1 in array C1[1..m1] and Compressed Pattern-2 in array C2[1..m2]

Step 2: Initialize
(a) runlencounter1 and runlencounter2 to 1,
(b) R1 to C1[1] and R2 to C2[1],
(c) distance to 0

Step 3: WHILE-BEGIN (from Step 4 to Step 8)
Step 4: If R1 = 0,

(a) increment runlencounter1 by 1,
(b) if runlencounter1 > m1, go to Step 9 (BREAK),
(c) load C1[runlencounter1] in R1
Step 5: If R2 = 0,
(a) increment runlencounter2 by 1,
(b) if runlencounter2 > m2, go to Step 9 (BREAK),
(c) load C2[runlencounter2] in R2

Step 6: If |runlencounter1 − runlencounter2| is odd, increment distance by min(R1, R2)

Step 7: If R1 ≥ R2,
(a) Subtract R2 from R1,
(b) Set R2 = 0
Else
(a) Subtract R1 from R2,
(b) Set R1 = 0

Step 8: WHILE-END
Step 9: Return distance
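Algorithm 3.1 translates almost line for line into Python; the sketch below is ours, with the step numbers kept as comments. For the patterns [10110111] and [01101101], whose CDRs are [1 1 2 1 3] and [0 1 2 1 2 1 1], it returns 5, the Manhattan distance between the uncompressed patterns.

```python
def compressed_distance(c1, c2):
    """Manhattan distance between two binary patterns, computed directly on
    their Compressed Data Representations without decompression."""
    m1, m2 = len(c1), len(c2)
    cnt1 = cnt2 = 1                       # Step 2: runlencounter1, runlencounter2
    r1, r2 = c1[0], c2[0]                 # Step 2: first runs of each pattern
    distance = 0
    while True:                           # Steps 3-8
        if r1 == 0:                       # Step 4: advance within pattern 1
            cnt1 += 1
            if cnt1 > m1:
                break                     # Step 4(b): BREAK to Step 9
            r1 = c1[cnt1 - 1]
        if r2 == 0:                       # Step 5: advance within pattern 2
            cnt2 += 1
            if cnt2 > m2:
                break                     # Step 5(b): BREAK to Step 9
            r2 = c2[cnt2 - 1]
        if (cnt1 - cnt2) % 2 != 0:        # Step 6: runs are of unlike kind
            distance += min(r1, r2)
        if r1 >= r2:                      # Step 7: consume the shorter run
            r1, r2 = r1 - r2, 0
        else:
            r2, r1 = r2 - r1, 0
    return distance                       # Step 9
```

The two counters sweep the patterns in lockstep over bit positions, so each run is visited once and the cost is proportional to the two run dimensions rather than to the number of bits.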

The computation is illustrated through an example. Consider two patterns, [10110111] and [01101101]. The Manhattan distance between the two patterns in their original, uncompressed form is 5. By definition 6 of Sect. 3.3.1, the Compressed Data Representations of these patterns, respectively, are [1 1 2 1 3] and [0 1 2 1 2 1 1]. At Step 2, m1 = 5, m2 = 7, R1 = 1, R2 = 0, runlencounter1 = 1, runlencounter2 = 1. The following is the computation path, with values at the end of the various steps.

(a) Step 1: C1 = [1 1 2 1 3], C2 = [0 1 2 1 2 1 1]
Step 2: runlencounter1 = 1, runlencounter2 = 1, R1 = 1, R2 = 0, distance = 0
(b) Step 5: runlencounter1 = 1, runlencounter2 = 2, R1 = 1, R2 = 1

Step 6: counter-difference is odd, distance = 0 + 1 = 1
Step 7: R1 = 0, R2 = 0

(c) Step 4: runlencounter1 = 2, runlencounter2 = 2, R1 = 1, R2 = 0
Step 5: runlencounter1 = 2, runlencounter2 = 3, R1 = 1, R2 = 2
Step 6: counter-difference is odd, distance = 1 + 1 = 2
Step 7: R1 = 0, R2 = 1

(d) Step 4: runlencounter1 = 3, runlencounter2 = 3, R1 = 2, R2 = 1
Step 7: R1 = 1, R2 = 0
(e) Step 5: runlencounter1 = 3, runlencounter2 = 4, R1 = 1, R2 = 1
Step 6: counter-difference is odd, distance = 2 + 1 = 3
Step 7: R1 = 0, R2 = 0
(f) Step 4: runlencounter1 = 4, runlencounter2 = 4, R1 = 1, R2 = 0
Step 5: runlencounter1 = 4, runlencounter2 = 5, R1 = 1, R2 = 2
Step 6: counter-difference is odd, distance = 3 + 1 = 4

Step 7: R1 = 0, R2 = 1

(g) Step 4: runlencounter1 = 5, runlencounter2 = 5, R1 = 3, R2 = 1
Step 7: R1 = 2, R2 = 0
(h) Step 5: runlencounter1 = 5, runlencounter2 = 6, R1 = 2, R2 = 1
Step 6: counter-difference is odd, distance = 4 + 1 = 5
Step 7: R1 = 1, R2 = 0
(i) Step 5: runlencounter1 = 5, runlencounter2 = 7, R1 = 1, R2 = 1
Step 7: R1 = 0, R2 = 0

(j) Step 4: runlencounter1 = 6 > m1, so go to Step 9; STOP and return the distance as 5.

By definition, f is a function from χA into χB if for every element of χA, there is an assigned unique element of χB, where χA is the domain, and χB is the range of the function. The function is one-to-one if different elements of the domain χA have distinct images. The function is onto if each element of χB is the image of some element of χA. The function is bijective if it is one-to-one and onto. A function is invertible if and only if it is bijective.

Lemma 3.1 Let χA and χB represent the original and compressed data representations. Then f : χA −→ χB is a function, and it is invertible.

Proof For every element of the original data in χA, viz., every original pattern, there is a unique element in χB, viz., its compressed representation. Each of the images is distinct. Hence, the function is one-to-one and onto, and hence bijective. Alternately, consider a mapping from χB to χA. Every compressed data representation leads to a unique element of the original data. Hence, the function is invertible. Specifically, note that |χB| = |χA|.

Lemma 3.2 Let χA and χB represent the original and compressed data representations. Let (xa, ya) and (xb, yb) denote arbitrary pairs of patterns in the χA and χB representations, respectively. Let the length of the strings in χA be n. Then,

d(xa, ya) = d′(xb, yb),

where d represents the Manhattan distance function between original data points, and d′ represents the Manhattan distance computation procedure, based on Algorithm 3.1, between compressed data points.

Proof The proof is based on mathematical induction on n. For n = 1, each of xa and ya consists of either 0 or 1. The corresponding run-string, by definition 6, is either 0 1 or 1, respectively.

Case a: Each of xa and ya is equal to 0. Then d(xa, ya) = 0. The corresponding run-strings xb and yb are equal to 0 1. By Algorithm 3.1, d′(xb, yb) = 0.

Case b: Each of xa and ya is equal to 1. Then d(xa, ya) = 0. The corresponding run-strings xb and yb are equal to 1. By Algorithm 3.1, d′(xb, yb) = 0.

Case c: xa = 1 and ya = 0. Then d(xa, ya) = 1. The corresponding run-strings xb and yb are equal to 1 and 0 1. By Algorithm 3.1, d′(xb, yb) = 1.

Case d: xa = 0 and ya = 1. This is symmetric to Case c, so d(xa, ya) = d′(xb, yb) = 1.


Let the lemma be true for n = k, for some k ≥ 1.

For n = k + 1, the additional bit (feature) is either 0 or 1. With this additional bit, the d-function contributes an additional distance of 0 or 1 depending on whether the (k+1)st bits of the two patterns are alike or different. In the case of χB, if the (k+1)st bit of a pattern matches its kth bit, the last run is incremented by 1; otherwise, an additional run is created. With all previous bits in the case of χA and all previous runs in the case of χB remaining unchanged, this leads to the situation where either the run dimension is incremented by 1 or only the last run is incremented by 1, which reduces to the conditions of Case a to Case d as discussed for n = 1. Thus the lemma is proved.

The original and compressed data representations, χA and χB, provide stable ordering, i.e., distances with the same value appear in the same order in the representation χB as they do in the representation χA.

Corollary 3.1 The representations χA and χB provide stable ordering.

Proof Consider an arbitrary pattern xa in χA, and let the corresponding pattern in χB be xb. Let the ordered distances between xa and every other pattern of χA be (x1, x2, ..., xk). By Lemma 3.2, the distances d and d′ provide the same values for the equivalent patterns between χA and χB. Thus, the ordered distances between xb and the corresponding patterns of χB are also given by (x1, x2, ..., xk). Thus, the representations χA and χB provide stable ordering.

Corollary 3.2 Classification accuracy of kNNC for any valid k in both the schemes χA and χB is the same.

Proof By Corollary 3.1, it is clear that the representations χA and χB provide stable ordering. Thus, the classification accuracy based on kNNC computed using representation χA and representation χB is the same.

The Minkowski metric for d-dimensional patterns a and b is defined as

Lq(a, b) = (Σ i=1..d |ai − bi|^q)^(1/q).

This is also referred to as the Lq norm. The L1 and L2 norms are called the Manhattan and Euclidean distances, respectively. The Hamming distance is defined as the number of places where two vectors differ.

Lemma 3.3 The L1 norm and Hamming distances coincide for patterns with binary-valued features.

Lemma 3.4 The L1 norm computation is more expensive than that of the Hamming distance.

Proof Since the L1 norm requires computing the absolute value of the difference in each dimension, it is more expensive than the computation of the Hamming distance.

Lemma 3.5 Hamming distance is equal to the squared Euclidean distance in the case of patterns with binary-valued features.

In view of Lemmas 3.3, 3.4, and 3.5, we consider the Hamming distance as a dissimilarity measure for the HW data used in the current work.
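Lemmas 3.3 and 3.5 can be checked numerically; the following is a small sketch in which the two binary patterns are arbitrary examples:

```python
def l1_norm(a, b):
    """L1 (Manhattan) distance."""
    return sum(abs(x - y) for x, y in zip(a, b))

def hamming(a, b):
    """Number of positions where the two vectors differ."""
    return sum(x != y for x, y in zip(a, b))

def squared_euclidean(a, b):
    """Squared L2 (Euclidean) distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# For binary-valued features, all three measures coincide
a = [1, 0, 1, 1, 0, 0]
b = [0, 0, 1, 0, 1, 0]
assert l1_norm(a, b) == hamming(a, b) == squared_euclidean(a, b) == 3
```

The Hamming distance avoids the absolute-value and squaring operations, which is why it is preferred here as the dissimilarity measure.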

3.4 Experimental Results

We consider multiple scenarios where the proposed algorithm can be applied, such as classification of handwritten digit data, genetic algorithms, and artificial spacecraft health data.

3.4.1 Application to Handwritten Digit Data

The algorithm is applied to 10 % of the considered handwritten digit data. We carried out experiments in two stages in order to demonstrate (a) the nonlossy compression nature of the algorithm and (b) the savings in processing time. In stage 1, the data is compressed and decompressed. The decompressed data is found to match the original data exactly, both in content and in size. Table 3.2 provides statistics of class-wise runs. Columns 2 and 3 contain the arithmetic mean and standard deviation of the run dimension. The maximum run length of 1s or 0s in the class is given in Column 4. The range, a measure of dispersion, of a set of values is defined as the difference between the maximum and minimum of the values in the set. The range of the run dimension is given in Column 5. Column 2 contains the measure of central tendency, and Columns 3 and 5 contain the measures of dispersion. It can be seen from the table that the 3σ limits based on the sample statistics of any class are much less than the number of features of the original pattern. Figure 3.1 contains statistics of the number of runs for class label “0” for about 660 patterns. The figure indicates the variation in the number of runs for different patterns. It can be observed from the figure that even for patterns belonging to the same class, there is a significant variability in the number of runs. The patterns are randomly ordered, and hence the diagram does not demonstrate any secular trend among the patterns.
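The stage-1 check (compress, decompress, compare with the original) can be illustrated with a minimal round trip; representing runs as (value, length) pairs is an assumption for illustration, since the chapter's exact run-string format may differ:

```python
def rle_encode(bits):
    """Run-length encode a binary sequence as (value, length) pairs."""
    runs = []
    for b in bits:
        if runs and runs[-1][0] == b:
            runs[-1][1] += 1
        else:
            runs.append([b, 1])
    return runs

def rle_decode(runs):
    """Expand (value, length) pairs back into the original bit sequence."""
    return [v for v, length in runs for _ in range(length)]

pattern = [0, 0, 1, 1, 1, 0, 1, 0, 0, 0]
assert rle_decode(rle_encode(pattern)) == pattern   # lossless round trip
```

The round trip succeeding for every pattern is exactly what "nonlossy" means here: content and size are both preserved.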


Fig. 3.1 Run statistics of class label “0” of 10 % data

Table 3.2 Class-wise run statistics

Class label | Average class-wise run dimension | Standard deviation | Max run length in class | Range of run dimension
(1) | (2) | (3) | (4) | (5)
0 | 52.8 | 4.19 | 11 | 30
1 | 35.0 | 0.16 |    | 35
2 | 41.6 | 5.06 | 38 | 32
3 | 39.4 | 3.68 | 12 | 20
4 | 45.8 | 4.30 | 11 | 24
5 | 38.7 | 3.59 | 55 | 25
6 | 45.0 | 6.02 | 11 | 38
7 | 39.0 | 3.87 | 12 | 20
8 | 46.7 | 5.03 | 11 | 30
9 | 43.1 | 4.04 | 11 | 28

The CPU times taken on a single-processor computer are presented in Table 3.3. The CPU time provided in the table refers to the difference of the times obtained through system calls at the start and end of the execution of the program. With kNNC, the best accuracy of 92.47 % is obtained for k = 7.


Fig. 3.2 Classification accuracy with different values of k using kNNC

Table 3.3 Data size and processing times

Description of data | Training data | Test data | CPU time (sec) of kNNC
Original data as features | 2,574,620 | 1,286,538 | 527.37
Compressed data in terms of runs | 865,791 | 432,453 | 106.83

3.4.2 Application to Genetic Algorithms

We present an overview of genetic algorithms before providing an application of the proposed scheme to them.

3.4.2.1 Genetic Algorithms

Genetic algorithms are randomized search algorithms for finding an optimal solution to an objective function. They are inspired by natural evolution and natural genetics. The algorithm simultaneously explores a population of possible solutions through generations with the help of genetic operators. There exist many variants of genetic algorithms. We discuss the Simple Genetic Algorithm in the current subsection. The genetic operators used in the Simple Genetic Algorithm are the following.

• Selection

• Cross-over

• Mutation

An important step in finding a solution is to encode the given problem for which an optimal solution is required. The solution is found in the encoded space. A common method of encoding is to represent a candidate solution as a binary string of length l, with decimal-encoded mappings to the various parameters that optimize the objective function. Consider a population of p such strings. The value of the objective function, or fitness function, is computed as a function of these parameters and evaluated for each string. The following is an example of a population of strings with p = 4 and l = 20. The strings are initialized randomly.

1: 01010011010011101011
2: 01101010010001000100
3: 01011001110101001001
4: 11010110100010100101

Next, the following generation of the population is computed based on the above genetic operators. We first discuss selection. There are a number of approaches to select highly fit individuals from one generation to the next. One such selection method is proportionate selection, where, based on fitness value, more copies of highly fit individuals from the previous generation are carried forward to the next generation. This ensures survival of the fittest individuals. As a second step, they are subjected to cross-over. We briefly discuss a single-point cross-over. There are alternate approaches to cross-over, known as uniform cross-over, 2-point cross-over, etc. The cross-over operation is performed on a pair of individuals, choosing them based on the probability of cross-over. Consider two strings randomly. Choose a location of cross-over within the strings randomly between 1 and l − 1. In order to illustrate cross-over, let the location be 8. The genetic material beyond the cross-over location is interchanged between the two strings to generate two new strings in the following manner.

Strings before cross-over operation:
1: 01010011010011101011
3: 01011001110101001001

Strings after cross-over operation:
1: 01010011010101001001
3: 01011001110011101011

It can be noticed in the above schematic that the italicized part is exchanged between the chosen pair of strings. Cross-over helps in exploring newer solutions. Mutation refers to flipping a bit value between 0 and 1. The operation is performed based on the probability of mutation. The following is an example of the mutation operation performed at a randomly chosen location, say, 11.

Initial string: 01011001110011101011

String after Mutation operation: 01011001110111101011
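The two operators can be sketched as follows; the cross-over index used below (11, counting positions from 0) is the one that reproduces the printed pair of strings, and this indexing convention is an assumption for the sketch:

```python
def single_point_crossover(s1, s2, point):
    """Exchange the tails of two equal-length bit strings after `point`."""
    return s1[:point] + s2[point:], s2[:point] + s1[point:]

def mutate(s, pos):
    """Flip the bit of s at index pos (0-indexed)."""
    flipped = '1' if s[pos] == '0' else '0'
    return s[:pos] + flipped + s[pos + 1:]

s1 = "01010011010011101011"
s3 = "01011001110101001001"
c1, c3 = single_point_crossover(s1, s3, 11)   # reproduces the strings above
m = mutate("01011001110011101011", 11)        # reproduces the mutation example
```

Both operators are length-preserving, so every offspring remains a valid encoding of the same parameter space.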

The algorithm is terminated, for example, when there is no newer exploration. In summary, a genetic algorithm is characterized by the following set of key constituents.

• Encoding mechanism of solutions

• Probability of cross-over, Pc. Experimentally, it is chosen to be around 0.9.

• Probability of mutation, Pm. It is usually considered small, since a large value can result in a random walk of solutions.

• Probability of initialization, Pi, can be optionally chosen as a parameter. It dictates the solution space to be explored.

• Termination criterion for the convergence to a near-optimal solution

• Appropriate mechanism for selection, cross-over, and mutation
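The constituents above fit together as in the following sketch of a Simple Genetic Algorithm. The OneMax fitness (count of 1-bits) is a hypothetical stand-in for an expensive fitness such as classification accuracy, and the parameter values are illustrative:

```python
import random

def simple_ga(fitness, l=20, pop_size=20, pc=0.9, pm=0.01, gens=100, seed=1):
    """Simple Genetic Algorithm sketch: fitness-proportionate selection,
    single-point cross-over with probability pc, and bit-flip mutation
    with probability pm per bit."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(l)] for _ in range(pop_size)]
    best = max(pop, key=fitness)
    for _ in range(gens):
        fits = [fitness(ind) for ind in pop]
        weights = [f + 1e-9 for f in fits]          # guard against all-zero fitness
        # proportionate (roulette-wheel) selection
        pop = [list(rng.choices(pop, weights=weights)[0]) for _ in range(pop_size)]
        # single-point cross-over on consecutive pairs
        for i in range(0, pop_size - 1, 2):
            if rng.random() < pc:
                pt = rng.randrange(1, l)
                pop[i][pt:], pop[i + 1][pt:] = pop[i + 1][pt:], pop[i][pt:]
        # bit-flip mutation
        for ind in pop:
            for j in range(l):
                if rng.random() < pm:
                    ind[j] ^= 1
        gen_best = max(pop, key=fitness)
        if fitness(gen_best) > fitness(best):
            best = list(gen_best)
    return best

# Stand-in fitness: number of 1-bits (OneMax).  In the applications above it
# would instead be, e.g., the classification accuracy obtained with the
# feature subset or prototype set that the string encodes.
best = simple_ga(fitness=sum)
```

Tracking the best-ever string acts as a simple form of elitism, so the returned solution never degrades across generations.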

3.4.2.2 Application

Usually, it takes a large number of generations to converge to an optimal or a near-optimal solution with genetic algorithms. The computational expense is dominated by the evaluation of the fitness function. A large population size requires more time for the evaluation of each string at every generation. Consider a case where the fitness function is the classification accuracy of patterns involving a large set of high-dimensional patterns. The features either are binary-valued or are mapped to binary values. The algorithm is directly applicable to such a scenario, leading to significant savings in computation time in arriving at convergence. Some applications of the scheme are optimal feature selection, where the string represents a complete pattern with each bit representing the presence or absence of a feature, and optimal prototype selection, where the string encodes a parameter such as a distance threshold for a leader clustering algorithm that leads to an optimal number of clusters.

3.4.3 Some Applicable Scenarios in Data Mining

In scenarios dealing with large data, such as classification of transaction-type data or anomaly detection based on Spacecraft Health Keeping (HK) data, the proposed scheme provides significant improvement in (a) storage of the data in its compressed form and (b) classification of the compressed data directly. The HW data can be viewed as business transaction data consisting of transaction-wise item purchase status, as illustrated in the current chapter. It is clear from the presentation that the efficiency of the algorithm increases with the sparseness of the data. In the following subsection, a scheme is proposed for data summarization or anomaly detection of Spacecraft HK data.

3.4.3.1 A Model for Application to Storage of Spacecraft HK Data

When the data is compressed through conventional methods, further analysis or operations on the data require decompression, resulting in additional computational effort. It might also result in some loss of information in the case of lossy compression. The advantages of the proposed scheme can be summarized as below.

• The data compression through the scheme is lossless. Thus, data can be stored in compressed form, reducing storage requirements.

• Data analysis involving dissimilarity computation, such as clustering and classification, can make use of the proposed algorithm for dissimilarity computation directly between compressed patterns.

3.4.3.2 Application to Anomaly Detection

Consider a remote sensing spacecraft carrying an optical imaging payload. The time period during which a camera is switched on is called the duration of payload operation. In order to monitor a parameter, say, current variation (amp) during a payload operation, one strips out the relevant bytes from the digital HK data. The profile of the parameter is obtained by plotting the parameter against time. After appropriate preprocessing and normalization to fit a common pattern size, the choice of features can be either a set of sample statistics, such as moments, autocorrelation peaks, standard deviations, and spectral peaks, or a pattern formed for structural matching. In the case of structural matching, the profile can be digitized with appropriate quantization such that all points of inflexion are present. For example, a profile containing, say, k peaks can be digitized in m rows and n columns. The choice of m and n is problem dependent. Such a structure consists of binary values indicating the presence or absence of the profile in a given cell, similar to the HW data. However, it should be noted that with reducing values of m and n, the new form of the data becomes more and more lossy. Data-dependent analysis helps in arriving at optimal m and n. Thus, the data in real numbers is reduced to binary data. The data is compressed by the above scheme and stored for the mission life. Data summarization by means of clustering or anomaly detection by means of classification can make use of the above compressed data directly. This forms a direct application of the proposed scheme.
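The digitization of a parameter profile into an m × n binary grid can be sketched as below; the specific row/column quantization rule is an assumption, since the text leaves the exact mapping problem dependent:

```python
def digitize_profile(samples, m, n):
    """Quantize a real-valued parameter profile into an m x n binary grid:
    columns partition the time axis, rows partition the value range, and a
    cell is set to 1 if the profile passes through it."""
    lo, hi = min(samples), max(samples)
    span = (hi - lo) or 1.0                 # avoid division by zero for flat profiles
    grid = [[0] * n for _ in range(m)]
    for t, v in enumerate(samples):
        col = min(n - 1, t * n // len(samples))
        row = min(m - 1, int((v - lo) / span * m))
        grid[row][col] = 1
    return grid

# A toy two-peak profile sampled at 100 points (hypothetical data)
profile = [abs((t % 50) - 25) for t in range(100)]
grid = digitize_profile(profile, m=4, n=10)
```

Smaller m and n mean a coarser, lossier grid, which mirrors the trade-off described in the text.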

3.5 Invariance of VC Dimension in the Original and the Compressed Forms

Let f(X, ω) be a class of approximating functions indexed by an abstract parameter ω with respect to a finite training dataset X. ω can be a scalar, vector, or matrix belonging to a set of parameters Ω.

1. Risk Functional. Given a finite sample (xi, yi) of size n, L(y, f(x, ω)) is the loss or discrepancy between the output produced by the system and the learning machine for a given point x. The expected value of the loss or discrepancy is called the risk functional, R(ω), which presupposes knowledge of the probability density function of the population from which the above sample is drawn. R(ω) is the unknown “true risk functional.”

2. Empirical Risk and ERM. The empirical risk is the arithmetic average of the loss over the training data. Empirical Risk Minimization (ERM) is an inductive learning principle.

A general property necessary for any inductive principle is asymptotic consistency. It requires that the estimates provided by ERM converge to the true values as the training sample size grows large. Learning theory helps to formulate conditions for consistency of the ERM principle.

3. Consistency of the ERM principle. For bounded loss functions, the ERM principle is consistent iff the empirical risk converges uniformly to the true risk in the following sense:

lim_{n→∞} P{ sup_ω |R(ω) − Remp(ω)| > ε } = 0 for all ε > 0.

Here P indicates probability, Remp(ω) is the empirical risk for a sample of size n, and R(ω) is the true risk for the same parameter values ω. The condition indicates that any analysis of the ERM principle must be a “worst-case analysis.”

Consider a class of indicator functions Q(z, ω), ω ∈ Ω, and a given sample Zn = {zi, i = 1, 2, ..., n}. The diversity of a set of functions with respect to a given sample can be measured by the number of different dichotomies, N(Zn), that can be implemented on the sample using the functions Q(z, ω).

4. VC entropy. The random entropy is defined as H(Zn) = ln N(Zn), which is a random variable. Averaging the random entropy over all possible samples of size n generated from the distribution F(z) gives

H(n) = E[ln N(Zn)];

H(n) is the VC entropy of the set of indicator functions on a sample of size n. The VC entropy is a measure of the expected diversity of a set of indicator functions with respect to a sample of a given size, generated from an unknown distribution.

5. Growth function. The growth function is defined as the maximum number of dichotomies that can be induced on a sample of size n using the indicator functions Q(z, ω) from a given set:

G(n) = ln max_{Zn} N(Zn),

and provides an upper bound for the distribution-dependent entropy. A necessary and sufficient condition for consistency of the ERM principle is

lim_{n→∞} H(n)/n = 0.

However, this uses the notion of VC entropy defined in terms of the unknown distribution, and the convergence of the empirical risk to the true risk may be very slow. The asymptotic rate of convergence is called fast if for any m > m0 and c > 0, the following exponential bound holds:

P{ |R(ω) − Remp(ω)| > ε } < e^{−cmε²}.

Statistical learning theory provides a distribution-independent necessary and sufficient condition for consistency of ERM and fast convergence, viz.,

lim_{n→∞} G(n)/n = 0.

The growth function is either linear or bounded by a logarithmic function of the number of samples n. The VC dimension is that value of n (= h) at which the growth starts to slow down. When this value is finite, then for large samples, the growth function does not grow linearly; it is bounded by a logarithmic function, viz.,

G(n) ≤ h(1 + ln(n/h)).

If the bound is linear for every n, G(n) = n ln 2, then the VC dimension of the set of indicator functions is infinite, and hence no valid generalization is possible. The VC dimension is explained in terms of shattering. If h samples can be separated by a set of indicator functions in all 2^h possible ways, then this set of samples is said to be shattered by the set of functions, and there do not exist h + 1 samples that can be shattered by the set of functions. For binary partitions of a sample of size n, N(Zn) = 2^n and G(n) ≤ n ln 2. Let Xn denote the valuations on Bn, with |Xn| = 2^n and Xn identified with {0, 1}^n. A Boolean function on Bn is a mapping f: Xn → {0, 1}. Thus, a Boolean function assigns a label 0 or 1 to each assignment of truth values of the n Boolean variables. There exist 2^{2^n} Boolean functions on Bn. A Boolean formula is a legal string containing the 2n literals b1, ..., bn, ¬b1, ..., ¬bn, the connectives ∨ (or) and ∧ (and), and the parenthesis symbols.

6. Pure Conjunctive Form. An expression of the form b1 ∧ b2 ∧ ··· ∧ bn is called a Pure Conjunctive Form (PCF).


8. Conjunctive Normal Form. A conjunction of several “clauses”, each of which is a disjunction of some literals, is called a Conjunctive Normal Form (CNF). For example, (b1 ∨ b2 ∨ ¬b5) ∧ (¬b1 ∨ b6 ∨ b7) ∧ (b2 ∨ b5) is a CNF.

9. Disjunctive Normal Form. A disjunction of several “clauses”, each of which is a conjunction of some literals, is called a Disjunctive Normal Form (DNF). For example, (b1 ∧ ¬b3 ∧ b5) ∨ (¬b2 ∧ b4 ∧ b6 ∧ b7) ∨ (b4 ∧ b6 ∧ ¬b7) is a DNF. The HW data in its original form is represented as a DNF.

Theorem 3.1 Suppose that C is a class of concepts satisfying measurability conditions.

1. C is uniformly learnable if and only if the VC dimension of C is finite.
2. If the VC dimension of C is d < ∞, then
(a) For 0 < ε < 1 and δ > 0, the algorithm learns a given hypothesis if the sample size is at least

max( (4/ε) ln(2/δ), (8d/ε) ln(13/ε) ).

(b) For 0 < ε < 1/2 and δ > 0, the algorithm learns a given hypothesis only if the sample size is greater than

max( ((1 − ε)/ε) ln(1/δ), d(1 − 2(ε(1 − δ) + δ)) ).

Theorem 3.1 allows the computation of bounds on sample complexity. It also shows that one needs to compute limits on the VC dimension of a given learner to understand the sample complexity of the problem of learning from examples.

Theorem 3.2 The VC dimension in both Original and Run-Length-Encoded (RLE) forms of the given data is the same.

Proof The proposed scheme is a nonlossy compression scheme. It is shown in Sects. 3.3 and 3.4 that dissimilarity computation between any two patterns in both forms of the data provides the same value. Thus, learning through kNNC provides the same k-nearest neighbors and classification accuracy. The number of dichotomies generated, and thereby the VC dimension, is the same in either case.

3.6 Minimum Description Length

The Minimum Description Length (MDL) principle is related to inductive inference based on Kolmogorov's characterization of randomness. The MDL is the sum of the code length of the data based on the considered model, L(model), and an error term specifying how the actual data differ from the model prediction, of code length L(data/model). Hence, the total code length l of such a code for representing a binary string of the data output is

l = L(model) + L(data/model).

The coefficient of compression for this string is

K = l/n.

Applying the MDL principle in the current context, we represent L(model) as the number of bits required to store the pattern. Let the prediction error L(data/model) be e. With the original data,

l = 192 · k1 + e,

where k1 is the number of bits required to store a given feature value, and the corresponding compression for this string is

K(original model) = 192 · k1/n + e/n.

As demonstrated earlier in the sections, the run-length-encoded data provides compression. The compression can be seen in terms of the number of features. The maximum number of features in any of the patterns is 55 (Table 3.2), which occurred for class 5. It is clearly demonstrated that with the proposed Algorithm 3.1, the error in classification has remained the same. Thus, with the compressed data, considering k2 bits to store each feature value in the compressed form,

l ≤ 55 · k2 + e.

The corresponding compression for this string is

K(RLE model) ≤ 55 · k2/n + e/n,

which clearly shows a significant reduction in the coefficient of compression in the best case, while it is the same as for the original data in the worst case. The following theorem formalizes the concept.

Theorem 3.3 The MDL of the compressed data is less than or equal to the MDL of the original data.

Proof As shown in Sects. 3.3 and 3.4, the classification error remains the same. Thus, the second term of the MDL in either case does not change. The MDL of the compressed data is better than the MDL of the original data as long as k2 ≤ k1. In the worst case of alternating binary feature values, k2 = k1, making the MDLs of both sets of data equal.
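The comparison in Theorem 3.3 can be illustrated with hypothetical numbers. Only the feature counts 192 (original) and 55 (maximum run dimension, Table 3.2) come from the text; k1 = k2 = 8 bits, the error term e, and the string length n are assumed values for illustration:

```python
# Assumed values (hypothetical): bits per stored value and prediction error
k1 = k2 = 8
e = 1000.0
n = 4096.0                            # assumed length of the encoded output string

K_original = (192 * k1 + e) / n       # coefficient of compression, original data
K_rle = (55 * k2 + e) / n             # upper bound for the run-length-encoded data

# With k2 <= k1, the compressed MDL never exceeds the original MDL
assert K_rle <= K_original
```

Varying e shows the worst case as well: as the error term dominates both code lengths, the two coefficients approach each other, matching the k2 = k1 boundary case of the theorem.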

3.7 Summary

We consider patterns with binary-valued features. The data is compressed by means of runs. A novel method of computing dissimilarity in the compressed domain is proposed. This results in a significant reduction in space and time. The process of compression and decompression is invertible. The concept of computing dissimilarity is successfully applied to large-size handwritten digit data. Other application areas, such as finding solutions through genetic algorithms and conventional data mining approaches, are discussed. The classification of the data both in its original form and in its compressed form results in the same accuracy. The results demonstrate the advantage of the procedure, viz., an improvement of classification time by a factor of five. The algorithm has linear time complexity. The work will have a pragmatic impact on data mining applications, large data clustering, and related areas.

3.8 Bibliographic Notes

Approaches to data reduction include clustering (Jain et al. 1999) and sampling (Han et al. 2012). Some approaches to sufficient statistics or data-derived information are discussed by Tian et al. (1996), DuMouchel et al. (2002), Bradley et al. (1998), Breuing et al. (2000), Fung (2002), Mitra et al. (2000), and Girolami and He (2003). Algorithms to compute approximate and exact edit distances of run-length-encoded strings are discussed by Makinen et al. (2003). Marques de Sa (2001) and Duda and Hart (1973) provide detailed discussions on clustering, classification, and distance metrics that are referred to in the current chapter. The works by Hastie and Tibshirani (1998), Cherkassky and Mulier (1998), Vapnik (1999), Vidyasagar (1997), Vapnik and Chervonenkis (1991, 1968), Rissanen (1978), and Blumer et al. (1989) contain theoretical preliminaries on the VC dimension, minimum description length, etc. Discussions on the notion of algorithm complexity can be found in Kolmogorov (1965), Chaitin (1966), Cherkassky and Mulier (1998), and Vapnik (1999).


References

R. Agrawal, T. Imielinski, A. Swami, Mining association rules between sets of items in large databases, in Proc. 1993 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD'93) (1993), pp. 266–271

A. Blumer, A. Ehrenfeucht, D. Haussler, M.K. Warmuth, Learnability and the Vapnik–Chervonenkis dimension. J. Assoc. Comput. Mach. 36(4), 929–965 (1989)

P. Bradley, U.M. Fayyad, C. Reina, Scaling clustering algorithms to large databases, in Proceedings of 4th Intl. Conf. on Knowledge Discovery and Data Mining (AAAI Press, New York, 1998), pp. 9–15

M.M. Breuing, H.P. Kriegel, J. Sander, Fast hierarchical clustering based on compressed data and OPTICS, in Proc. 4th European Conf. on Principles and Practice of Knowledge Discovery in Databases (PKDD), vol. 1910 (2000)

G.J. Chaitin, On the length of programs for computing finite binary sequences. J. Assoc. Comput. Mach. 13, 547–569 (1966)

V. Cherkassky, F. Mulier, Learning from Data (Wiley, New York, 1998)

R.O. Duda, P.E. Hart, Pattern Classification and Scene Analysis (Wiley, New York, 1973)

W. DuMouchel, C. Volinksy, T. Johnson, C. Cortez, D. Pregibon, Squashing flat files flatter, in Proc. 5th Intl. Conf. on Knowledge Discovery and Data Mining, San Diego, CA (AAAI Press, New York, 2002)

B.C.M. Fung, Hierarchical document clustering using frequent itemsets. M.Sc. Thesis, Simon Fraser University (2002)

M. Girolami, C. He, Probability density estimation from optimally condensed data samples. IEEE Trans. Pattern Anal. Mach. Intell. 25(10), 1253–1264 (2003)

D.E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning (Addison-Wesley, Reading, 1989)

J. Han, M. Kamber, J. Pei, Data Mining—Concepts and Techniques (Morgan-Kauffman, New York, 2012)

T. Hastie, R. Tibshirani, Classification by pairwise coupling. Ann. Stat. 26(2) (1998)

A.K. Jain, M.N. Murty, P. Flynn, Data clustering: a review. ACM Comput. Surv. 32(3) (1999)

A.N. Kolmogorov, Three approaches to the quantitative definitions of information. Probl. Inf. Transm. 1(1), 1–7 (1965)

V. Makinen, G. Navarro, E. Ukkinen, Approximate matching of run-length compressed strings. Algorithmica 35(4), 347–369 (2003)

J.P. Marques de Sa, Pattern Recognition—Concepts, Methods and Applications (Springer, Berlin, 2001)

P. Mitra, C.A. Murthy, S.K. Pal, Data condensation in large databases by incremental learning with support vector machines, in Proc. 15th International Conference on Pattern Recognition (ICPR'00), vol. (2000)

T. Ravindra Babu, M. Narasimha Murty, V.K. Agrawal, Classification of run-length encoded binary strings. Pattern Recognit. 40(1), 321–323 (2007)

J. Rissanen, Modeling by shortest data description. Automatica 14, 465–471 (1978)

Z. Tian, R. Raghu, L. Micon, BIRCH: an efficient data clustering method for very large databases, in Proceedings of ACM SIGMOD International Conference on Management of Data (1996)

V. Vapnik, Statistical Learning Theory, 2nd edn. (Wiley, New York, 1999)

V. Vapnik, A.Ya. Chervonenkis, On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities. Dokl. Akad. Nauk, vol. 181 (1968) (Engl. Transl.: Sov. Math. Dokl.)

V. Vapnik, A.Ya. Chervonenkis, The necessary and sufficient conditions for the consistency of the method of empirical risk minimization. Pattern Recognit. Image Anal. 1, 284–305 (1991) (Engl. Transl.)


Chapter 4

Dimensionality Reduction by Subsequence Pruning

4.1 Introduction

In Chap. 3, we discussed one approach to dealing with large data: we compress the given large dataset and work in the compressed domain to generate an abstraction. This is essentially a nonlossy compression scheme.

In the present chapter, we explore the possibility of allowing some loss of data in the given large dataset while still generating an abstraction that is nearly as accurate as that obtained with the original dataset. In the proposed scheme, we make use of the concepts of frequent itemsets and support to compress the data and carry out classification of such compressed data. The compression, in the current work, is lossy. We demonstrate that such a scheme significantly reduces the storage requirement without any significant loss in classification accuracy.

In this chapter, we initially discuss the motivation for the current activity. Subsequently, we discuss the basic methodology in detail. Preliminary data analysis provides insights into the data and directions for proper quantization for a given dataset. We elaborate the proposed lossy compression scheme and the compression achieved at various levels. We demonstrate the working of the scheme on the handwritten digit dataset.

4.2 Lossy Data Compression for Clustering and Classification

Classification of high-dimensional, large datasets is a challenging task in Data Mining. Clustering and classification of large data have been the focus of research in recent years, especially in the context of data mining. The largeness of the data poses challenges such as minimizing the number of scans of the database that is stored in secondary memory and data summarization, apart from issues related to clustering and classification algorithms, viz., scalability, high dimensionality, speed, prediction accuracy, etc.

Data compression has been one of the enabling technologies for multimedia communication. Based on the requirements of reconstruction, data compression is divided into two broad classes, viz., lossless compression and lossy compression.

T Ravindra Babu et al., Compression Schemes for Mining Large Datasets, Advances in Computer Vision and Pattern Recognition, DOI 10.1007/978-1-4471-5607-9_4, © Springer-Verlag London 2013

For optimal performance, the nature of the data and the context influence the choice of the compression technique. In practice, many data compression techniques are in use. The Huffman coding is based on the frequency of input characters. Instead of attempting to compress a binary code as in Chap. 3, a fixed number of binary features of each pattern is grouped or blocked together. One of the objectives of the current work is to compress and then classify the data. Thus, it is necessary that the compressed form of the training and test data be tractable and require minimum storage space and fast processing time. Such a scheme should be amenable to further operations on the data, such as dissimilarity computation in its compressed form. We propose a lossy compression scheme that satisfies these requirements.

We make use of two concepts of association rule mining, viz., support and frequent itemsets. We show that the use of frequent items that exceed a given support avoids less frequent input features and provides better abstraction.

The proposed scheme summarizes the given data in a single scan, initially as frequent items and subsequently in the form of distinct subsequences. Less frequent subsequences are further pruned and replaced by their more frequent nearest neighbors. This leads to a compact or compressed representation of the data, resulting in significant compression of the input data. The test data requires additional mapping, since some of the subsequences found in the test data would have been pruned during subsequence generation in the training dataset. This could lead to inconsistency between the two encoded datasets during dissimilarity computation. We discuss the need and modalities of transforming the test data in Sect. 4.5.6. The test data is classified in its compressed form. The classification of the data directly in the compressed form provides a major advantage. The lossy compression leads to highly accurate classification of the test data because of a possible improvement in generalization.

Rough-set-based classification with reference to a dissimilarity limit of the test patterns is carried out. The data thus reduced requires significantly less storage as compared to rough-set-based schemes with similar classification accuracy.

4.3 Background and Terminology

Consider a training dataset consisting of n patterns. Let each pattern consist of d binary-valued features. Let ε be the minimum support for any feature of a pattern to be considered for the study. We formally discuss the terms used further in the current chapter.

1. Support. The support of a feature, in the current work, is defined as the actual number of patterns in which the feature is present. The minimum support is referred to as ε.
2. Sequence. Consider a set of integer numbers,

{S1, S2, ...}.

3. Subsequence. Let S = {Sn}, n = 1, 2, ..., ∞, be a sequence of integer numbers, and let S′ = {S′i}, i = 1, 2, ..., ∞, be a strictly increasing sequence of positive integers. The composite function S ∘ S′ is called a subsequence of S.

For example, for i ∈ J, we have S(i) = s_i, S ∘ S′(i) = S(S′(i)) = S_{s′_i} = s_{s′_i}, and hence S ∘ S′ = (s_{s′_i})_{i=1}^{∞}.

4. Length of a subsequence. Let S′ be a subsequence. The number of elements of the subsequence is referred to as the length of the subsequence, r.
5. Block, block length. We define a finite number of binary digits as a block. The number of such digits in a block, b, is called the block length.
6. Value of a block. The decimal equivalent of a block is the value of the block, v.
7. Minimum frequency for pruning. When subsequences are formed, in order to prune the number of subsequences, we aim to replace those less frequent subsequences that remain below a chosen frequency threshold. It is referred to as the minimum frequency, ψ.
8. Dissimilarity threshold for replacing subsequences. While replacing less frequent subsequences, we replace a subsequence by its neighbor that lies within a certain distance. It is referred to as the dissimilarity threshold, η. The parameter controls the fineness of a neighborhood for subsequence replacement.

Table 4.1 provides the list of parameters used in the current work. In the current implementation, all the parameters are integers.

We illustrate the above concepts through the following examples

Illustration 4.1 (Sequence, subsequence, and blocks) Consider a pattern with binary features {0,0,0,0,0,0,1,1,1,0,0,0}. The sequence is represented by 000000111000. 00000011 represents a subsequence of length 8. (0000), (0011), (1000) represent blocks of length 4, with the corresponding values of the blocks being 0, 3, and 8.

Illustration 4.2 (Frequent itemsets) Consider five patterns with six features each, counted from 1 to 6. The concepts of itemsets and support are presented in Table 4.2. In the table, each row consists of a pattern, and each column contains the presence (1) or absence (0) of a feature. The last row contains the column-wise sum, which indicates the support of the corresponding feature or item.

The support of each feature is obtained by counting the number of nonzero values. The feature-wise supports are {3, 2, 3, 1, 4, 3}. With the help of the support values, frequent features corresponding to different threshold values can be identified.
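The support computation can be sketched directly; the pattern rows below are five 6-feature patterns consistent with the feature-wise supports {3, 2, 3, 1, 4, 3} stated above:

```python
# Five 6-feature binary patterns (rows), as in Table 4.2
patterns = [
    [1, 1, 0, 0, 1, 0],
    [0, 1, 1, 0, 1, 1],
    [0, 0, 1, 0, 1, 1],
    [1, 0, 0, 0, 0, 1],
    [1, 0, 1, 1, 1, 0],
]

# Support of each feature: the column-wise sum over all patterns
support = [sum(col) for col in zip(*patterns)]

def frequent(eps):
    """Features (1-indexed) whose support meets the minimum support eps."""
    return [i + 1 for i, s in enumerate(support) if s >= eps]
```

Evaluating `frequent` at thresholds 2 through 5 reproduces the frequent itemsets of Table 4.3.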


Table 4.1 List of parameters used

Parameter  Description
n          Number of patterns or transactions
d          Number of features or items prior to identification of frequent items
b          Number of binary features that makes one block
q          Number of blocks in a pattern
v          Value of a block
r          Length of a subsequence
ε          Minimum support
ψ          Minimum frequency for pruning a subsequence
η          Dissimilarity threshold for identifying nearest neighbor to a subsequence

Table 4.2 Itemsets and support

Sl. No.  Itemset
1        1 1 0 0 1 0
2        0 1 1 0 1 1
3        0 0 1 0 1 1
4        1 0 0 0 0 1
5        1 0 1 1 1 0
Support  3 2 3 1 4 3

Table 4.3 Frequent itemsets

Minimum support  Itemset
2                {1,2,3,5,6}
3                {1,3,5,6}
4                {5}
5                –

Illustration 4.3 (Impact of minimum support on distinct subsequences) We illustrate the impact of the minimum support, ε, on the number of distinct subsequences using the patterns of Table 4.2; the results are summarized in Tables 4.4, 4.5, and 4.6.

Table 4.4 Distinct subsequences with minimum support, ε ≥ 0

Parameters: n = 5, d = 6, b = 2; No. of patterns = 5; No. of distinct subsequences = 5

Sl. No.  Patterns     Blocks of length 2  Values of blocks  Subsequences of length 3
(1)      (2)          (3)                 (4)               (5)
1        1 1 0 0 1 0  11,00,10            3,0,2             3,0,2
2        0 1 1 0 1 1  01,10,11            1,2,3             1,2,3
3        0 0 1 0 1 1  00,10,11            0,2,3             0,2,3
4        1 0 0 0 0 1  10,00,01            2,0,1             2,0,1
5        1 0 1 1 1 0  10,11,10            2,3,2             2,3,2

Table 4.5 Distinct subsequences with minimum support, ε ≥ 3

Parameters: n = 5, d = 6, b = 2; No. of patterns = 5; No. of distinct subsequences = 4

Sl. No.  Patterns     Blocks of length 2  Values of blocks  Subsequences of length 3
1        1 0 0 0 1 0  10,00,10            2,0,2             2,0,2
2        0 0 1 0 1 1  00,10,11            0,2,3             0,2,3
3        0 0 1 0 1 1  00,10,11            0,2,3             0,2,3
4        1 0 0 0 0 1  10,00,01            2,0,1             2,0,1
5        1 0 1 0 1 0  10,10,10            2,2,2             2,2,2

At this stage, the set of subsequences is (2,0,2), (0,2,3), (0,2,3), (2,0,1), and (2,2,2). Since (0,2,3) repeats twice, the set of distinct subsequences is (2,0,2), (0,2,3), (2,0,1), and (2,2,2). Consider Table 4.4. It consists of five patterns arranged row-wise. Column 2 consists of the list of features for each pattern. Column 3 contains blocks of length 2 considered from each pattern. Column 4 contains decimal equivalents of each of the blocks. Column 5 contains the values of blocks arranged as a subsequence of length 3. Observe the reduction in the number of distinct subsequences from 5 to 4 with the minimum support increasing from 0 to 3. Tables 4.4, 4.5, and 4.6 summarize the concepts discussed in the example.

It can be noticed from Table 4.5 that, since {0,2,3} repeats two times, the distinct subsequences are given below:

{2,0,2}, {0,2,3}, {2,0,1}, {2,2,2} for ε ≥ 3.

In the case of ε ≥ 4, as shown in Table 4.6, the distinct subsequence is {0,0,2} alone.
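The steps behind Tables 4.4 and 4.5 can be reproduced by a short routine that zeroes out infrequent items, forms blocks, and collects distinct subsequences (a minimal sketch; the names are ours, ε is the minimum support and b the block length):

```python
# Reproduce the distinct-subsequence counts of Tables 4.4 and 4.5:
# discard items below the minimum support eps (dimension d unchanged),
# form blocks of length b, and collect distinct value subsequences.
def distinct_subsequences(patterns, eps, b):
    support = [sum(col) for col in zip(*patterns)]
    filtered = [[x if support[j] >= eps else 0 for j, x in enumerate(p)]
                for p in patterns]
    subsequences = set()
    for p in filtered:
        bits = "".join(map(str, p))
        subsequences.add(tuple(int(bits[i:i + b], 2)
                               for i in range(0, len(bits), b)))
    return subsequences

patterns = [[1,1,0,0,1,0], [0,1,1,0,1,1], [0,0,1,0,1,1],
            [1,0,0,0,0,1], [1,0,1,1,1,0]]
print(len(distinct_subsequences(patterns, 0, 2)))  # 5, as in Table 4.4
print(len(distinct_subsequences(patterns, 3, 2)))  # 4, as in Table 4.5
```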

Illustration 4.4 (Distinct subsequences and dissimilarity table) Consider the distinct subsequences listed in Table 4.7 together with their supports, and the pairwise dissimilarities between them given in Table 4.8.

Table 4.6 Distinct subsequences with minimum support, ε ≥ 4

Parameters: n = 5, d = 6, b = 2; No. of patterns = 5; No. of distinct subsequences = 1

Sl. No.  Patterns     Blocks of length 2  Values of blocks  Subsequences of length 3
1        0 0 0 0 1 0  00,00,10            0,0,2             0,0,2
2        0 0 0 0 1 0  00,00,10            0,0,2             0,0,2
3        0 0 0 0 1 0  00,00,10            0,0,2             0,0,2
4        0 0 0 0 0 0  00,00,00            0,0,0             0,0,0
5        0 0 0 0 1 0  00,00,10            0,0,2             0,0,2

Table 4.7 Distinct subsequences and corresponding support

Sl. No.  Subsequence  No. of repetitions
1        2,0,2        1
2        0,2,3        2
3        2,0,1        1
4        2,2,2        1

Table 4.8 Dissimilarity table of distinct subsequences in terms of Euclidean distance

Sl. No.  1  2  3    4
1        –  3  1    2
2        –  –  √12  √5
3        –  –  –    √5
4        –  –  –    –

Illustration 4.5 (Pruning of distinct subsequences) Consider the data in Table 4.2. We notice in Sect. 4.3 that the number of distinct subsequences decreases with increasing threshold. In Tables 4.4, 4.5, and 4.6, we see the list of distinct subsequences as the minimum support increases from 0 to 4.

To illustrate the use of ψ and η, we consider subsequences of hypothetical data as given in Table 4.9. The table contains distinct subsequences and their corresponding frequencies. In order to prune the number of distinct subsequences, we propose to replace all the subsequences that occur with frequency less than or equal to 2, i.e., ψ = 2. From the table, {2,0,1} should be replaced by its nearest neighbor. Table 4.8 consists of subsequence numbers 1 to 4, shown in both rows and columns. Each cell indicates the dissimilarity between subsequences i and j, each ranging from 1 to 4. From Table 4.8, we notice that the nearest neighbor of {2,0,1} is {2,0,2}, which is at a distance of 1. Thus, here η = 1.

Illustration 4.6 (Nearest neighbors for mapping previously unseen subsequence)


Table 4.9 Hypothetical distinct subsequences and support

Sl No Subsequence No of repetitions

1 2,0,2

2 0,2,3

3 2,0,1

4 2,2,2 10

Consider subsequences generated from a test pattern. Since it is possible that a subsequence generated from a test pattern is not seen in the training patterns, we assign it to its nearest neighbor among the subsequences of the training patterns. This helps in assigning the same unique subsequence id to each such subsequence in the test pattern. It is experimentally seen that such an assignment does not adversely affect the classification accuracy.

Illustration 4.7 (Classifying a test pattern transformed to subsequences) For a chosen block size and length of subsequence, subsequences are formed from the training data. Only those subsequences that are frequent based on ψ are retained. Among the subsequences, unique subsequences are identified and numbered. Table 4.8 provides the distances between any two such unique subsequences. For each test pattern, subsequences are formed, and the corresponding unique ids are assigned. It is possible that some subsequences in a test pattern were not seen earlier; to each of these we assign its nearest neighbor from among the pruned subsequences of the training dataset. The dissimilarity between the test pattern and each training pattern is then quickly computed by accessing the values from the dissimilarity table.

With this background of parameter definition, we describe the proposed scheme in the following section

4.4 Preliminary Data Analysis

We implement the proposed method on a large handwritten digit dataset. The current section provides a brief description of the data and the preliminary analysis carried out.

We consider a large handwritten digit dataset, which consists of 10 classes, labeled 0 to 9. Each pattern consists of a 16×12 matrix, which is equal to 192 binary-valued features. The total number of patterns is 100,030, which are divided into 66,700 training and 33,330 test patterns. For demonstration of the proposed scheme, we consider 10 % of this dataset.


Table 4.10 Basic statistics on number of nonzero features in the training data

Class label  Mean  Standard deviation  Minimum number  Maximum number
(1)          (2)   (3)                 (4)             (5)

0 66.4 10.8 38 121

1 29.8 5.1 17 55

2 64.4 10.5 35 102

3 59.8 10.3 33 108

4 52.9 9.3 24 89

5 61.3 10.0 32 101

6 58.4 8.5 34 97

7 47.3 7.7 28 87

8 67.4 11.4 36 114

9 55.6 8.4 31 86

Fig. 4.1 A set of typical and atypical patterns of handwritten data

Column 4 corresponds to a pattern that contains the minimum number of features within a class. Similarly, Column 5 corresponds to a pattern that contains the maximum number among all the patterns within a class. The table brings out the complexity and variability in the given handwritten digit data. For example, for class label 0, column 4 indicates that there is at least one pattern of zero that contains just 38 nonzero features, and the corresponding column 5 indicates that there is at least one pattern that has 121 nonzero features. Apart from this, the orientation and shapes of different digit patterns also vary significantly. Although these statistics indicate that the number of nonzero features is less than 192, the features of individual patterns occupy different feature locations, making all 192 locations relevant for representing a pattern. Further, this results in a challenge in the classification of such patterns.

The data consists of 10 classes with an equal number of training patterns per class. Typical and atypical patterns of the given training data are shown in Fig. 4.1.

4.4.1 Huffman Coding and Lossy Compression


4.4.1.1 Analysis with 6-Bit Blocks

Consider the handwritten (HW) digit data consisting of 10 classes of 192-feature, labeled data. It is readable in matrix form, where each pattern is represented as a 16×12 matrix. In this matrix form, consider 6 bits as a block, taken contiguously. Thus, each pattern would consist of 32 blocks. Each block is typically represented by a character code, such as a, b, etc. In the current context, each block is labeled as 1, 2, ..., 32. With 6 bits, each block can assume values from 0 to 63. We propose the following scheme for analysis.

1. Consider 6-bit blocks of the data.

2. Compute the frequency of each of the values from 0 to 63 in the entire training data.

3. Present the item-values (0 to 63) and the corresponding frequencies for generating a complete binary tree, which eventually provides the Huffman codes.

4. Generate the Huffman code for each item-value of the training data.

5. Compare the size with that of the original data for possible improvement.

6. Find the dissimilarity in the compressed state itself.

7. Compare the CPU time and storage requirements.
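Steps 2 to 4 can be sketched with a standard heap-based Huffman construction. This is illustrative code with toy block values, not the book's frequency data:

```python
# Build Huffman codes from symbol frequencies using a min-heap
# (illustrative sketch; toy data, not the book's 6-bit block counts).
import heapq
from collections import Counter

def huffman_codes(freq):
    """Map each symbol to its Huffman code, given a symbol->frequency dict."""
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)   # two least frequent subtrees
        f2, i2, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (f1 + f2, i2, merged))
    return heap[0][2]

blocks = [0, 0, 0, 3, 3, 8, 0, 3, 8, 0]   # toy decimal block values
codes = huffman_codes(Counter(blocks))
print(codes[0])  # the most frequent value gets the shortest code
```

By construction the codes are prefix-free, so the encoded bit stream can be decoded unambiguously.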

Table 4.11 contains the Huffman codes for 6-bit blocks. In the table, column 1 corresponds to the decimal equivalent value, referred to as "Label". Column 2 consists of the number of occurrences of such a value, and column 3 provides the corresponding Huffman code. In order to accommodate all codes in a single table, we continue placing the data in columns 4 to 6.

The entire training data is reduced to the form of decimal equivalent values of 6-bit blocks. Table 4.12 provides such sample encoded data. From the table, similarity among the patterns of the same class can be noticed in terms of the values of 6-bit blocks. Table 4.13 contains the Huffman codes of a sample of training patterns.

The following are important observations from the exercise

1. The space savings of Huffman coding with 6-bit blocks is about 14 % only.
2. Both the original training data and the compressed training data are binary.

4.4.1.2 Analysis with 4-Bit Blocking

4-bit blocking results in 16 possibilities, viz., 0 to 15. Table 4.16 contains the results. Table 4.14 contains 4-bit block coded training data. Here again, one can notice similarity among the patterns of the same class in terms of values of blocks and also intra-pattern repetition of subsequences such as {0,3,8}.

The following are important observations


Table 4.11 Huffman coding of values to 64 for entire training data

Label Frequency Code (leaf-to-root) Label Frequency Code (leaf-to-root)

(1) (2) (3) (4) (5) (6)

0 40,063 0 32 21,011 1111

1 15,980 1101 33 23 1010001000101

2 2176 1110011 34 31 011000100010

3 17,435 1011 35 22 0010001000101

4 2676 110010 36 100 11001000101

5 378 011000101 38 82 11110100010

6 8067 00011 39 11010110100010

7 12,993 0001 40 375 101000101

8 3840 110101 41 10001011011100010

9 385 111000101 44 465 100110011

10 45 110001000101 45 1001011011100010

11 279 010100010 46 51 101001000101

12 9104 10111 47 01010110100010

13 1364 0000101 48 23,845 110

14 2915 100101 49 53 00000100010

15 8550 00111 50 17 0011011100010

16 3833 010101 51 48 001001000101

17 75 01110100010 52 82 01011100010

18 101011011100010 54 86 00001000101

19 67 00110100010 55 31 111000100010

20 13 0001000100010 56 13,122 1001

21 00001011011100010 57 54 10000100010

22 15 1001000100010 58 11 11011011100010

23 44 111011100010 59 37 110110100010

24 11,710 1010 60 4074 010011

25 346 111100010 61 15 0010110100010

26 28 101000100010 62 1047 10110011

27 261 100100010 63 433 000110011

28 2561 010010

29 162 0011100010

30 670 01100010

31 2262 000010

4.4.1.3 Huffman Coding and Run-Length Compression


Table 4.12 Sample of 6-bit block coded training data Label Coded pattern

0 56 56 56 56 60 60 60 60 12 28 28 12 24 12 24 13 48 13 48 15 0 56 56 56 56 12 12 12 12 12 12 12 12 12 12 12 24 56 56 48 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 16 16 48 48 32 32 32 32 3 2 6 6

2 6 15 32 15 32 15 32 15 32 32 32 3 31 50 31 50 31 62 31 62 24 15 15 29 29 27 27 7 14 62 12 62 12 59 28 59 28 51 48 51 48 32

Table 4.13 Huffman codes corresponding to 6-bit block sample patterns

Label Huffman code

0 0 0 1 0 0 1 0 0 1 0 0 1 0 0 0 0 0 1 0 1 0 1 1 1 0 1 0 0 0 1 0 1 1 1 1 1 1 0 0 1 0 0

0 1 0 0 1 0 0 0 1 0 1 0 0 0 0 1 0 0 1 1 1 0 1 1 1 0 1 0 1 1 0 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 1 1 0 1 1 0 0 1 0 1 0 1 0 1 1 0 0 0 1

1 0 1 1 0 1 1 0 1 1 0 1 1 0 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 1 1 0 1 1 0 1 1 0 1 1 0 1 1 0 1 0 1 0 0 0 0 1 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 1 0 1 0 1 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0

Run-length coding is applied to the Huffman-coded training data in its compressed form, which is subject to interpretation of the Huffman code. The following are the salient statistics.

1. No. of original input features: 6670·192 = 1,280,640
2. No. of features after Huffman coding (6-bit blocks): 948,401
3. No. of features after run-length coding: 473,231

4.4.1.4 Lossy Compression: Assigning Longer Huffman Codes to Nearest Neighbors

Table 4.14 Sample of 4-bit block coded training data

Label  Coded pattern

0 8 8 12 12 12 12 12 12 12 12 8 7 12

0 8 15 15 12 12 12 12 12 12 12 12 12 15 15 15

1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0

0 12 0 12 0 0 8 8

By assigning features having longer codes to Nearest-Neighbor (NN) features having shorter codes, the amount of storage required for Huffman coding would further come down. The current exercise is aimed at such a possible reduction.

The procedure can be summarized as below

• Generate a Huffman code for 4-bit row-wise blocks for both training patterns (6670) and test patterns (3333).

• Consider long patterns as shown in Table 4.16.

• Assign longer-length patterns to such NNs where the deviation is not large, by means of equivalence class mapping.

• Regenerate training and test data with 4-bit codes with the newly mapped codes.

• Classify the patterns at the decimal-valued level, by means of a table look-up matrix.

• Use the k-Nearest-Neighbor Classifier (kNNC) for k = 1 to 20.

• In the exercise, we experimented with different combinations of assignment, viz., assignment options 1 to 6, as provided in columns 5 to 10, respectively. For example, consider an assignment column in Table 4.16. The values in the column indicate by which of the label-codes the current label-code is replaced. For example, the code for label 4, viz., 1110101, is replaced by the shorter code corresponding to label 3, viz., 1101. Similarly, the code for label 13, viz., 10111111, is replaced by the code for label 12, viz., 1010, leading to a shorter code. The last two rows of the table indicate the classification accuracy with kNNC after making such assignments and the space savings for each such combination.

This provides a view of assigning a given pattern having a longer code to a pattern having a shorter code, thereby leading to lossy compression while achieving a good classification accuracy, such as 93.8 % in one of the options. It should also be noted that, in the case of option 5, which is provided in column 10, although the compression achieved is 32.8 %, the classification accuracy was reduced to 87 %.


4.4.2 Analysis of Subsequences and Their Frequency in a Class

The previous exercises focused on forming blocks of an appropriate number of bits and computing the frequency of such blocks across all the training patterns. The current exercise considers a sequence of decimal values of 4-bit blocks. We identify repetitions of such sequences across all the training patterns.

In the current subsection, we consider the patterns belonging to one class label and list out all possible subsequences and their frequencies. Based on the list, we bring out some important observations and the compression achieved. The following are some of the observations.

• The 192 features of a pattern make reading sense when arranged as a 16×12 matrix. Thus, by making 4-bit blocks, the 12 bits of a row lead to 3 block values.

• Identify all occurrences of the same combination across all 667×16 sequences. Table 4.15 contains the summary of results. The table lists all possible subsequences and the corresponding frequencies. From the table, one can notice repetition of some subsequences, with frequencies reaching 543, and many repeating to the extent of a frequency of 100 or more. This brings out an important aspect that

(a) although the patterns belong to the same class, there exist intra-class dissimilarities, and importantly,

(b) in spite of such intra-class variation at pattern-level, there exist significant subsequence-wise similarities

• The 10,672 (667×16) subsequences get reduced to 251 repeating subsequences, with frequencies ranging up to 543.

• It should be observed that Huffman coding for such combinations is not useful. Earlier, with 16 distinct values, the maximum code length of the least repeating combination was 8 bits. If all the 251 combinations are treated as separate codes, the corresponding Huffman codes would be very long. Alternatively, it should be noted that the current representation itself could be considered as a compression scheme that compresses 10,672 nondistinct combinations of codes into 251 subsequences.

• Under the second argument of the above point, the given training and test data are encoded into this combination of subsequences. The test data is classified at the coded level using a look-up table of dissimilarities.

4.4.2.1 Analysis of Repeating Subsequences for One Class

In the current exercise, the focus is on finding repeating sequences across rows. A maximum of three rows is considered. This brings out the correlation among successive rows. Here too, training data of a single class alone is considered. The following procedure is followed.


Table 4.15 Statistics of repeating subsequence and its frequency


Table 4.15 (Continued)

Subseq Freq Subseq Freq Subseq Freq Subseq Freq Subseq Freq {1,8,2} {0,15,6} {1,12,14} {6,3,12} {7,12,12} {0,12,6} {0,12,14} {7,11,12} {3,11,14} {7,15,14} {7,8,14} {0,0,12} {0,6,12} {1814} {7,8,6} {7,12,14} {15,1,12} {6,1,0} {2,8,8} {6,8,12} {6,6,0} {12,0,14} {11,0,12} {12,0,3} {14,7,8} {3,12,3} {15,15,14} {1,1,12} {7,1,14} {15,11,6} {3,3,12} {14,3,8} {12,1,14} {14,3,14} {15,15,6} {0,2,3} {0,14,2} {15,0,14} {14,0,2} {1,0,2} {3,8,2} {7,14,14} {7,1,0} {3,12,14} {7,7,12} {12,3,8} {12,7,0} {3,3,14} {3,14,12} {2,2,0}

• Identify three-consecutive repetitions

• Repeat the above to find (a) two-consecutive repetitions, (b) matching of the sequence with the same sequence at any other place, and (c) only a single occurrence of the combination.

• The combinations are tabulated

• The total number of occurrences is matched with the input subsequences. Thus, the implementation is validated. The following are the statistics.

1. No. of three-consecutive subsequences = 101 (3018 after multiplying with frequency)

2. No. of two-consecutive subsequences = 225 (6288)

3. No. of matches of a subsequence with that at any other place (not consecutive) = 115 (1260)

4. No. of single occurrences = 106 (106)

5. Observe that 3018 + 6288 + 1260 + 106 = 10,672 = 667×16.

6. Thus there are in all 101 + 225 + 115 + 106 = 547 combinations. Here the subsequences need not be distinct, since they include those repeating once, two times, and three times.

As part of the preliminary analysis, optimal feature selection using the Steady-State Genetic Algorithm, considering the entire training data together, is carried out. The number of optimal features is found to be 106 out of 192, providing a classification accuracy of 92 %. This leads to a total data reduction of about 45 %.

4.5 Proposed Scheme


Table 4.16 Huffman coding and neighboring pattern assignment

Label  Frequency  Huffman code  Binary representation  Assignment options 1 to 6
(1)    (2)        (3)           (4)                    (5) (6) (7) (8) (9) (10)

0 131,618 0000 0 0 0

1 32,874 0 0001 1 1 1

2 5374 1 1 0010 2 2

3 29,514 1 0011 3 3 3

4 3750 1 1 0100 4 4

5 246 0 1 1 1 0101 6 6

6 12,774 0 1 0110 6 6 6

7 10,175 1 1 0111 7 7

8 32,658 1 1 1000 8 8 8

9 3125 0 1 1 1001 9 8

10 345 1 1 1 1010 10 12 11 11 11 11

11 2806 1 1 1010 11 12 11 11 11 11

12 20,404 1 1100 12 12 12 12 12 12

13 2281 1 1 1 1101 13 12 13 12 12 12

14 10,495 1 1110 14 15 14 14 14 14

15 21,721 0 1 1111 15 15 15 15 15 15

Classification accuracy (%), kNNC with k = 5:  93.75  87  93.7  93.8  93.3  92.5
Savings in space (%):  25.9  26.1  26.8  27.5  28.4  32.8

Fig 4.2 Proposed scheme


• Initialization

• Frequent Item generation

• Generation of encoded training data

• Subsequence identification and frequency generation

• Pruning of subsequences

• Generation of encoded test data

• Classification using distance-based rough set concept

• Classification using kNNC

We elaborate each of the above steps in the following subsections

4.5.1 Initialization

In the given data, the number of training patterns, n, and the number of features, d, are known. Based on a priori domain knowledge of the data and through preliminary analysis on the training data, the following parameters are initialized to nonzero values:

• minimum support, ε, for frequent item generation,

• block length, b,

• minimum frequency for subsequence pruning, ψ, and

• dissimilarity threshold for identifying nearest neighbors to the pruned subsequences, η.

4.5.2 Frequent Item Generation

The input data encountered in practice, such as sales transaction data, contains features that are not frequent. They can equivalently be considered as noisy when the objective is robust pattern classification. Also, the number of nonzero features differs from pattern to pattern. While generating an abstraction of the entire data, it is necessary to smooth out the noisy behavior, which otherwise may lead to an improper abstraction. This can be visualized by considering data such as handwritten digit data or sales transaction data. With such datasets in focus, the support of each feature across the training data is computed. The items whose support is above a chosen value are considered for the study. It should be noted here that the feature dimension, d, is kept unchanged, even though a few features get eliminated across all the patterns with the chosen ε. For example, for n = 8 and d = 8, the following sample sets (A) and (B) represent patterns in their original form and the corresponding frequent items with ε = 3, respectively.


Set B: (a) 11011010 (b) 10110010 (c) 11001001 (d) 01011010 (e) 01110000 (f) 11101010 (g) 11011010 (h) 00101000

4.5.3 Generation of Coded Training Data

At this stage, the training data consists only of frequent items. Considering b binary items at a time, a decimal equivalent value is computed. The value of b is a result of preliminary analysis or is obtained from domain knowledge. The value of d is an integral multiple of b. The d binary features of each pattern are now represented as q decimal values, where d = q·b.

Consider Set B of the above example for illustrating coded data generation.

Set B: (a) 11011010 (b) 10110010 (c) 11001001 (d) 01011010 (e) 01110000 (f) 11101010 (g) 11011010 (h) 00101000

With b = 2 and q = 4, the corresponding coded training data is given below.

Coded Data: (a) 3122 (b) 2302 (c) 3021 (d) 1122 (e) 1300 (f) 3222 (g) 3122 (h) 0220
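The encoding step above can be rendered directly as follows (a minimal sketch of the Set B example; names are ours):

```python
# Encode each frequent-item pattern into q = d/b decimal block values
# (b = 2 here, as in the Set B example; illustrative sketch).
def encode(pattern, b):
    return [int(pattern[i:i + b], 2) for i in range(0, len(pattern), b)]

set_b = ["11011010", "10110010", "11001001", "01011010",
         "01110000", "11101010", "11011010", "00101000"]
coded = ["".join(str(v) for v in encode(p, 2)) for p in set_b]
print(coded)  # ['3122', '2302', '3021', '1122', '1300', '3222', '3122', '0220']
```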

4.5.4 Subsequence Identification and Frequency Computation

The sequence of decimal values corresponding to a pattern is in turn grouped into subsequences of decimal values. Some examples of such datasets are large sales transaction datasets, where such similarity among the patterns is possible. We arrive at the length of a subsequence based on preliminary analysis on the training data. The length of a subsequence is a trade-off between representativeness and compactness.

Every r decimal values form a subsequence. We compute the frequency of each unique subsequence. Not all subsequences identified are unique. The number of distinct subsequences depends on ε. With increasing ε, the number of distinct subsequences reduces.

Continuing with the example from the previous section, the following is the set of ordered subsequences with their corresponding frequencies for r = 2:

22:4 31:2 02:2 23:1 30:1 21:1 11:1 13:1 00:1 32:1 20:1

By mapping these subsequences to unique ids, we can rewrite the ordered subsequences and their frequencies as follows:

1:4 2:2 3:2 4:1 5:1 6:1 7:1 8:1 9:1 10:1 11:1

4.5.5 Pruning of Subsequences

Consider the subsequences generated and compute the number of occurrences of each unique subsequence. We term this the frequency of subsequences. Arrange the subsequences in descending order of frequency.

In order to retain frequent subsequences, all subsequences whose frequency is less than ψ are identified. Each less frequent subsequence is replaced by its nearest neighbor from the frequent subsequences. However, the nearest subsequence should remain within a prechosen dissimilarity threshold, η. For meaningful generalization, the value is chosen to be small.

We notice that the compression is achieved in two levels. First, by discarding the items that remain below a chosen support, ε, and generating distinct subsequences. Second, by reducing the number of distinct subsequences through the choice of ψ and η. A reduction in the number of features reduces the VC dimension.

Let the total number of distinct subsequences in the data be m1. The number of distinct subsequences that remain after this step depends on the values of ψ and ε. All the remaining distinct subsequences are numbered, say, from 1 to m2. It should be noted that m2 ≤ m1 for all ψ > 0. At the end of this step, the training data consists of just m2 unique ids, numbered from 1 to m2.

Continuing with the example discussed in Sect. 4.5.4, with ψ = 2, the given distinct subsequences are replaced by their nearest neighbors. The following list demonstrates the assignment of subsequences to their more frequent nearest neighbors, subject to ψ. The bold entries represent the nearest neighbors of a replaced subsequence for a chosen value of η. This leads to a lossy form of compression.

22 31 02 23:22 30:31 21:22 11:31 13:31 00:02 32:31 20:22
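The counting and replacement steps of Sects. 4.5.4 and 4.5.5 can be sketched on this running example as follows. This is a sketch under our own choices: we use the Hamming distance on the underlying bits and break ties by frequency order, so one or two tied assignments (e.g., for 32) may differ from the mapping shown above.

```python
# Count subsequence frequencies (r = 2) and replace infrequent ones
# (frequency < psi) by their nearest frequent neighbor (illustrative).
from collections import Counter

def hamming(s, t, bits=2):
    """Hamming distance between two subsequences of 2-bit block values."""
    to_bits = lambda u: "".join(format(int(v), "0%db" % bits) for v in u)
    return sum(x != y for x, y in zip(to_bits(s), to_bits(t)))

coded = ["3122", "2302", "3021", "1122", "1300", "3222", "3122", "0220"]
subs = [p[i:i + 2] for p in coded for i in range(0, len(p), 2)]  # r = 2
freq = Counter(subs)
psi = 2
frequent = [s for s, f in freq.most_common() if f >= psi]
nn = {s: min(frequent, key=lambda t: hamming(s, t))   # nearest frequent
      for s in freq if freq[s] < psi}
print(freq["22"], frequent)   # 4 ['22', '31', '02']
print(nn["23"], nn["30"])     # 22 31
```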

4.5.6 Generation of Encoded Test Data

The dataset under study is divided into mutually exclusive training and test datasets By the choice ofεandψ, the training data is reduced to a finite number of distinct subsequences

The test dataset and training dataset are mutually exclusive subsets of the original dataset. The test dataset too passes through a transformation before the classification of patterns.

Proceeding on similar lines as in Sect. 4.3, b-bit decimal codes are generated for the test data. This results in a set of subsequences. It should be noted here that the minimum support, ε, is not made use of explicitly.

However, it is likely that many of the subsequences in the test data are not present among the ordered subsequences of the training data.

Such a subsequence makes the dissimilarity computation between a training pattern and a test pattern, each represented in terms of subsequences, difficult to compute. In view of this, at the time of classification of the test data, each new subsequence found in a test pattern is replaced by its nearest neighbor from the set of m2 subsequences generated using the training set. However, in this case, η is computed as post facto information.

4.5.7 Classification Using Dissimilarity Based on Rough Set Concept

Rough set theory is used here for classification. A given class Ω is approximated, using rough set terminology, by two sets, viz., ΩL, the lower approximation of Ω, and ΩU, the upper approximation of Ω. ΩL consists of samples that certainly belong to Ω. ΩU consists of samples that cannot be described as not belonging to Ω. Here the decision rule is chosen based on a dissimilarity threshold. ΩU contains the training patterns that are neighbors by means of ordered distances without any limit on dissimilarity. ΩL contains the training patterns that are below the chosen dissimilarity threshold. We classify the patterns falling within the lower approximation unambiguously. We reject those patterns that fall between the lower and upper approximations as unclassifiable.

We discuss the procedure for dissimilarity computation between compressed patterns in the following section.

4.5.7.1 Dissimilarity Computation Between Compressed Patterns

Dissimilarity computation between compressed patterns, in both the current classification scheme and the method discussed in Sect. 4.5, is based on the unique identities of the subsequences.

For a chosen block length of 4 bits or 3 bits, all possible decimal codes are known a priori. Every subsequence would then consist of known decimal codes. Thus, in order to compute the dissimilarity between two subsequences, storing an upper triangular matrix containing distances between all possible pairs of decimal codes is sufficient. For example, in the case of 4-bit blocks, the range of decimal codes is 0 to 15. The size of the dissimilarity matrix is 16×16. Out of these 256 values, only the 136 values corresponding to the upper triangular matrix are sufficient to compute the dissimilarity between subsequences. In summary, the dissimilarity computation between a training and a test pattern is simplified in the following ways.

• First, with b-bit encoding, the pattern consists of q blocks, where q = d/b. Thus, it requires only q < d comparisons.


• Third, the dissimilarity between two subsequences is obtained by a simple table look-up.

Since the data is inherently binary, the Hamming distance is used for dissimilarity computation
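The table look-up described above can be sketched as follows. For simplicity, we store the full 16×16 table rather than only the 136 upper-triangular entries; the names are ours:

```python
# Precompute Hamming distances between all 4-bit block values once,
# then compare encoded patterns by table look-up (illustrative sketch).
B = 4
table = [[bin(i ^ j).count("1") for j in range(2 ** B)]
         for i in range(2 ** B)]

def dissimilarity(p, q):
    """Hamming distance between two patterns given as lists of block values."""
    return sum(table[a][b] for a, b in zip(p, q))

print(dissimilarity([0, 3, 8], [0, 3, 12]))  # 8 and 12 differ in one bit -> 1
```

Since the table is symmetric, storing only the upper triangle, as the text suggests, halves the storage at the cost of one index comparison per look-up.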

4.5.8 Classification Using k-Nearest Neighbor Classifier

In this approach, each of the compressed test patterns consists of pruned subsequences with reference to the values of ε and ψ considered for generating the compressed training data. The dissimilarity is computed for each test pattern with all training patterns. The first k neighbors are identified based on the dissimilarity values. Based on majority voting, a test pattern is assigned a class label. The classification accuracy depends on the value of k.
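A minimal kNNC over encoded patterns can be sketched as below. In the proposed scheme the distances would come from the subsequence look-up table; here plain Hamming distance on block values and toy data stand in for illustration:

```python
# k-Nearest-Neighbor classification with majority voting over encoded
# patterns (illustrative sketch; toy data and labels).
from collections import Counter

def hamming(p, q):
    return sum(bin(a ^ b).count("1") for a, b in zip(p, q))

def knn_classify(train, labels, test, k):
    """Assign the majority label among the k nearest training patterns."""
    nearest = sorted(range(len(train)),
                     key=lambda i: hamming(train[i], test))[:k]
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]

train = [[3, 1, 2, 2], [2, 3, 0, 2], [1, 3, 0, 0], [0, 2, 2, 0]]
labels = ["a", "a", "b", "b"]
print(knn_classify(train, labels, [3, 1, 2, 2], 3))  # prints "a"
```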

4.6 Implementation of the Proposed Scheme

The current section discusses each step of the proposed scheme in the context of the data considered.

4.6.1 Choice of Parameters

Parameters depend on the nature of the data. They are identified experimentally. The parameters considered are the minimum support for frequent item generation (ε), the minimum frequency for pruning of subsequences (ψ), and the maximum dissimilarity limit (η) for assigning nearest neighbors. For example, the value of η is identified in this manner. The experiments are conducted on a large random sample from the training data.

For the number of bits for forming blocks, b, two lengths, 3 and 4, are considered. The 4-bit block values of two typical training patterns of the classes with labels 0 and 1, respectively, are provided in Table 4.17. There are 48 decimal equivalent codes (block values) for each pattern. In the table, a space is left between successive subsequences of length 3, in order to indicate row-wise separation in the 16×12-bit pattern. It may be noted that the maximum value of a 4-bit block is 15. Also, the similarity among different rows of a pattern, in terms of subsequences, should be noted.

Table 4.18 contains a sample of 3-bit coded training data. There are 64 block values for each pattern. In the table, a space is left between successive sets of 4 block values, indicating row-wise separation. It can be seen that, in any given subsequence, the maximum value of a block is 7, which is indicative of 3-bit coding. The similarity among the subsequences should be taken note of.


Table 4.17 Sample of 4-bit coded training data

Label Data

0 8 8 12 12 12 12 12 12 12 12 8 7 12

1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Table 4.18 Sample of 3-bit coded training data

Label Data

0 0 0 0 0 0 0 7 4 6 4 6 0

1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

4.6.2 Frequent Items and Subsequences

The minimum support value, ε, is changed starting from 0. For example, from Table 4.17 and Table 4.18, observe the repeating subsequences. In Table 4.17, the first pattern contains the following unique subsequences of length 3 (r = 3), viz., (0,3,8), (0,3,12), (0,7,12), (0,12,12), (1,9,12), (3,1,8), (3,7,0), (3,12,0), with respective frequencies of repetition 4, 2, 2, 1, 2, 2, 2, 1. In Table 4.18, observe the subsequences of length 4 (r = 4), viz., (0,0,7,0), (0,0,7,4), (0,1,7,4), (0,3,1,4), (0,6,3,4), (1,4,3,0), (1,5,6,0), (1,7,0,0), with respective frequencies of repetition 4, 1, 2, 1, 2, 2, 2, 1. Increasing this value results in a smaller number of distinct subsequences.

The choice of the minimum support value,ε, influences the number of distinct subsequences Figure4.3depicts reduction in the number of distinct subsequences with increasing support value,ε With supportε=0, the number of distinct subse-quences is 690 Also, observe from the figure that atε=50, the number of distinct subsequences is 543 Compare this number of 4-bit encoded values of 543 distinct subsequences with the total number of such encoded values in the training data, viz.,

6670 · 192 / 4 = 6670 · 48 = 320,160.

With grouping into subsequences of length 3, the number of distinct subsequences in the original data is 690, which is further reduced by the choices of ε, η, and ψ.


Fig 4.3 Distinct subsequences as functions of support value (ε)

4.6.3 Compressed Data and Pruning of Subsequences

The distinct subsequences are numbered in descending order of their frequency. This forms the compressed data. Table 4.19 contains typical compressed training data for arbitrary patterns in classes 0–9.

The subsequences are pruned further by discarding infrequent subsequences. This is carried out by choosing the value of ψ. A larger ψ reduces the number of distinct subsequences. Figure 4.4 contains the distinct subsequences and their corresponding frequencies (ψ) for a minimum support value of 50, i.e., ε=50. Observe that the maximum subsequence number is 543; being last in the descending-frequency ordering, it has the smallest frequency.

Figure 4.5 shows the effect of the frequency limit on pruning after replacing pruned subsequences by their nearest neighbors (NN). The data is generated for a specific support value, ε=50.
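The pruning and nearest-neighbor replacement can be sketched as below. This is our own rendering: the dissimilarity between two subsequences is assumed here to be the number of differing block values, which the book does not specify explicitly.

```python
def diff(a, b):
    """Assumed dissimilarity: number of positions where the codes differ."""
    return sum(x != y for x, y in zip(a, b))

def prune_and_map(counts, psi, eta):
    """Drop subsequences with frequency < psi; map each dropped one to its
    nearest retained neighbour if within dissimilarity eta, else to None."""
    kept = [s for s, f in counts.items() if f >= psi]
    mapping = {}
    for s, f in counts.items():
        if f >= psi:
            mapping[s] = s
        else:
            best = min(kept, key=lambda k: diff(s, k))
            mapping[s] = best if diff(s, best) <= eta else None
    return mapping

# Toy frequencies: (0,3,12) is rare but close to (0,3,8); (7,7,7) is isolated.
counts = {(0, 3, 8): 40, (0, 3, 12): 3, (7, 7, 7): 1}
print(prune_and_map(counts, psi=5, eta=2))
```

Here (0,3,12) is mapped to (0,3,8) because they differ in one code only, while (7,7,7) is beyond η=2 from every retained subsequence and stays unmapped.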


Table 4.19 Sample training data in terms of subsequence numbers

Label Unique codes of a pattern

0 23 23 23 23 51 51 39 39 82 98 98 25 25 43 43 30

0 23 23 13 13 87 57 57 46 46 46 116 116 22 9

0 3 5 5 22 22 25 58 58 56 56 14 14 11

0 18 18 11 85 29 29 59 112 58 58 114 58 25 25 68

1 10 10 10 10 10 10 10 1 10 10 10 10 10 10

1 17 17 2 1 1 3 28 28 8 8

1 15 15 1 1 15 15 15 3 3 3

1 10 10 1 1 1 3 3 28 28 28

2 8 19 19 19 19 1 3 508 508 86 86 139

2 30 30 236 236 188 188 18 18 195 398 398 566 566 442 442

2 24 24 34 34 8 34 34 80 80 190 190 202 202 60 60

2 6 5 6 12 12 13 13 9 9 18 18

3 6 66 66 12 12 12 12 58 58 25 25 5 7

3 45 45 2 2 6 4 21 21 21 21 35

3 20 20 13 13 4 12 12 6 2 14 14 30 30

3 12 12 13 13 32 4 5 83 83 14 14 19

4 4 4 25 25 29 29 16 14 14 1 1 15

4 55 55 55 55 70 70 32 32 14 14 14 14 10 10 1

4 2 47 100 100 100 100 11 11 1 15 15

4 55 55 21 32 37 48 25 14 14 11 10 10 1

5 35 35 9 34 60 60 42 2 26 26 49 49 49 78

5 95 95 81 51 51 13 13 122 21 21 137 67 67 14 18

5 55 55 51 51 11 11 34 30 30 3 30 30 24

5 4 35 35 11 11 18 18 1 274 274 42 42 60 60

6 4 23 23 12 1 13 13 89 89 89 13 13 12

6 24 24 36 8 36 36 24 112 156 212 212 522 522 177 33

6 8 8 24 29 29 16 132 132 25 25 56 5

6 15 15 3 8 8 43 16 16 155 155 14 14 14

7 5 2 10 7 15 15 3 28 28

7 3 13 13 13 53 53 2 12 12 7

7 13 13 50 50 2 1 15 15 3 28 28 28 28

7 11 11 11 5 2 1 7 3 28 28

8 5 49 49 159 159 49 49 49 162 162 74 74 49 49 14

8 20 20 13 13 13 64 26 26 1 32 32 13

8 12 12 13 13 64 64 6 7 11 11 154 154 30

8 6 5 26 26 11 11 5 26 26 5

9 11 11 5 29 25 25 58 14 14 5 4 4

9 2 13 13 32 48 48 9 4 4 4

9 12 12 12 12 66 66 6 6 2 2 2


Fig 4.4 Distinct subsequences and their frequencies

4.6.4 Generation of Compressed Training and Test Data

Based on the pruned distinct-subsequence list, say numbered 1 to 106, the training data is regenerated by replacing distinct subsequences with these mapped numbers. As discussed in the previous subsection, each subsequence of the training data that is not available in the distinct-subsequence list is replaced with its nearest neighbor among the distinct subsequences, within the dissimilarity threshold η=2. It should be noted that in the considered datasets, η remained within 2 for both test and training data. After generating compressed training data in the above manner, compressed test data is generated. It should be noted that test data is represented as 4-bit blocks only; no other operations are carried out on test data.
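The regeneration step can be sketched as follows, under the same assumed code-level dissimilarity as before; the numbering dictionary and function name are ours.

```python
def compress_pattern(codes, numbering, r, eta=2):
    """Rewrite one pattern (a list of block values) as subsequence numbers.
    Unseen subsequences fall back to the nearest numbered subsequence,
    provided the dissimilarity (count of differing codes) is within eta."""
    diff = lambda a, b: sum(x != y for x, y in zip(a, b))
    out = []
    for i in range(0, len(codes), r):
        s = tuple(codes[i:i + r])
        if s in numbering:
            out.append(numbering[s])
        else:
            best = min(numbering, key=lambda t: diff(s, t))
            if diff(s, best) <= eta:
                out.append(numbering[best])
    return out

# Hypothetical numbering: the most frequent subsequence gets number 1.
numbering = {(0, 6, 0): 1, (0, 3, 0): 2}
print(compress_pattern([0, 6, 0, 0, 3, 0, 0, 7, 0], numbering, 3))  # -> [1, 2, 1]
```

The unseen subsequence (0,7,0) is within distance 1 of (0,6,0), so it takes that subsequence's number.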

4.7 Experimental Results

A number of case studies were carried out by varying the values of ε, ψ, and η. With the "rough set" approach, the best classification accuracy obtained is 94.13 %, by classifying 94.54 % of the test patterns and rejecting 182 out of 3333 test patterns as unclassified.


Fig 4.5 Distinct subsequences as functions of minimum frequency parameter (ψ)

made use of 452 distinct subsequences. However, it should be noted that the classification accuracy is not significantly different even for increasing ψ. For example, with 106 distinct subsequences, the classification accuracy obtained is 92.92 %.

4.8 Summary

Large handwritten digit data is compressed by a novel two-stage method: first by applying a limit on the support value and subsequently on the frequency of the generated subsequences. In terms of subsequences of 4-bit blocks, the method reduced the original 690 subsequences, obtained without constraints on support and frequency, to 106 subsequences. The classification accuracy improved as compared with the original data. Further, this can be seen as effective feature reduction.


Fig. 4.6 Classification accuracy for ε=50 as a function of ψ

It should be noted here that the classification accuracy obtained with lossy compression is higher than what is obtained with the original dataset, 92.47 %, using kNNC for k=7, as discussed in Chap. 3.

The parameter values of ε, ψ, and η are data dependent. With reduction in the number of patterns, the VC dimension reduces, provided that the NNC accuracy is not affected. Perhaps similar conclusions can be drawn under the Probably Approximately Correct (PAC) learning framework with the help of Disjunctive Normal Forms as defined in Chap. 3.

4.9 Bibliographic Notes


can be found in Ravindra Babu et al. (2004), and an extension of the same work can be found in Ravindra Babu et al. (2012). Mobahi et al. (2011) propose a method to segment images by texture and boundary compression based on the Minimum Description Length principle. Talu and Türkoğlu (2011) suggest a lossless compression algorithm that makes use of a novel encoding by means of characteristic vectors and a standard lossy compression algorithm. Definitions of a sequence and subsequence can be found in Goldberg (1978).

References

R. Agrawal, T. Imielinski, A. Swami, Mining association rules between sets of items in large databases, in Proc. 1993 ACM-SIGMOD International Conference on Management of Data (SIGMOD'93) (1993), pp. 266–271

J.S. Deogun, V.V. Raghavan, H. Sever, Rough set based classification methods for extended decision tables, in Proc. of Intl. Workshop on Rough Sets and Soft Computing (1994), pp. 302–309

J. Ghosh, Scalable clustering, in The Handbook of Data Mining, ed. by N. Ye (Lawrence Erlbaum Assoc., Mahwah, 2003), pp. 247–278. Chapter 10

R.R. Goldberg, Methods of Real Analysis, 1st edn. (Oxford and IBH, New Delhi, 1978)

R.M. Gray, D.L. Neuhoff, Quantization. IEEE Trans. Inf. Theory 44(6), 1–63 (1998)

J. Han, J. Pei, Y. Yin, Mining frequent patterns without candidate generation, in Proc. of ACM SIGMOD International Conference on Management of Data (SIGMOD'00), Dallas, Texas (2000), pp. 1–12

A.K. Jain, R.C. Dubes, Algorithms for Clustering Data (Prentice Hall, Englewood Cliffs, 1988)

A.K. Jain, M. Narasimha Murty, P.J. Flynn, Data clustering: a review. ACM Comput. Surv. 31(3), 264–323 (1999)

B. Karaçalı, H. Krim, Fast minimization of structural risk by nearest neighbor rule. IEEE Trans. Neural Netw. 14(1), 127–137 (2002)

D.A. Lelewer, D.S. Hirschberg, Data compression. ACM Comput. Surv. 19(3), 261–296 (1987)

H. Mobahi, S. Rao, A. Yang, S. Sastry, Y. Ma, Segmentation of natural images by texture and boundary compression. Int. J. Comput. Vis. 95, 86–98 (2011)

Z. Pawlak, J. Grzymala-Busse, R. Slowinski, W. Ziarko, Rough sets. Commun. ACM 38, 89–95 (1995)

T. Ravindra Babu, M. Narasimha Murty, V.K. Agrawal, Hybrid learning scheme for data mining applications, in Proc. of Fourth Intl. Conf. on Hybrid Intelligent Systems (IEEE Computer Society, 2004), pp. 266–271. doi:10.1109/ICHIS.2004.56

T. Ravindra Babu, M. Narasimha Murty, S.V. Subrahmanya, Quantization based sequence generation and subsequence pruning for data mining applications, in Pattern Discovery Using Sequence Data Mining: Applications and Studies, ed. by P. Kumar, P. Krishna, S. Raju (Information Science Reference, Hershey, 2012), pp. 94–110

K. Sayood, Introduction to Data Compression, 1st edn. (Morgan Kaufmann, San Mateo, 2000)

M.F. Talu, I. Türkoğlu, Hybrid lossless compression method for binary images. IU J. Electr. Electron. Eng. 11(2) (2011)


Chapter 5

Data Compaction Through Simultaneous Selection of Prototypes and Features

5.1 Introduction

In Chap. 4, we presented a novel scheme for data reduction by subsequence generation and pruning. In the current work, we extend the concept to select prototypes and features together from large data.

Given a large dataset, it is always interesting to explore whether one can generate an abstraction with a subset or representative set of patterns drawn from the original dataset that is at least as accurate as the original data Such a representative dataset forms prototypes Drawing a random sample and resorting to pattern clustering are some of the approaches to generate a set of prototypes from a large dataset

In the given dataset, when each pattern is represented by a large set of features, it is efficient to operate on a subset of features. Such a feature set should be representative. This forms the problem of feature selection. Optimal or near-ideal feature subsets can be selected through optimization methods. We explore the option of frequent-item support to generate a representative feature subset.

In this process, we examine the following aspects

• Effect of frequent items on prototype selection

• Effect of support-based frequent items on feature selection and evaluation of their representativeness

• Impact of sequencing of clustering and frequent item generation on classification

• Combining clustering and frequent item generation, resulting in simultaneous selection of patterns and features

The chapter is organized as follows. In Sect. 5.2, we provide a brief overview of prototype selection, feature selection, and the resultant data compaction. Section 5.3 contains background material necessary for appreciating the proposed methods. Section 5.4 contains preliminary data analysis that provides insights into prototype and feature selection. Section 5.5 contains a discussion on the approaches proposed in this work. Implementation of the proposed schemes and experimentation is discussed in Sect. 5.6. The work is summarized in Sect. 5.7. Section 5.8 contains bibliographic notes.

T. Ravindra Babu et al., Compression Schemes for Mining Large Datasets, Advances in Computer Vision and Pattern Recognition, DOI 10.1007/978-1-4471-5607-9_5, © Springer-Verlag London 2013


5.2 Prototype Selection, Feature Selection, and Data Compaction

In a broad sense, any method that incorporates information from training samples in the design of a classifier employs learning. When the dataset is large and each pattern is characterized by high dimensionality, abstraction or classification becomes an arduous task. With large dimensionality of the feature space, the number of samples needed grows exponentially. This limitation is known as the Curse of Dimensionality. For superior performance of a classifier, one aims to minimize the Generalization Error, which is defined as the error rate on unseen or test patterns.

While dealing with high-dimensional large data, for the sake of abstraction generation and scalability of algorithms, one resorts to dimensionality reduction, data reduction, or both. Approaches to data reduction include clustering, sampling, data squashing, etc. In clustering, this is achieved by considering cluster representatives in their original form, such as leaders, medoids, and centroids. Sampling schemes involve simple random sampling with and without replacement or stratified random sampling. Data squashing involves a form of lossy compression where pseudo data is generated from the original data through different steps. A number of other approaches, like BIRCH, generate a summary of the original data that is necessary for further use.

Prototype selection using Partitioning Around Medoids (PAM), Clustering LARge Applications (CLARA), and Leader was reported earlier. It was shown in the literature that the classification performance of the Leader clustering algorithm is better than that of PAM and CLARA. Further, the computation time for the Leader algorithm is much less when compared to PAM and CLARA for the same dataset. The computational complexities of PAM and CLARA are O(c(n−c)²) and O(cs² + c(n−c)), respectively, where n is the size of the data, s is the sample size, and c is the number of clusters. Leader has linear complexity. Although CLARA can handle larger data than PAM, its efficiency depends on the sample size and unbiasedness.

Medoids, PAM, CLARA, and CLARANS


multiple samples from the entire dataset, and it compares the average minimum dissimilarity of all objects in the entire dataset. The suggested sample size is 40+2k. However, in practice, CLARA requires a large amount of time when k is greater than 100. It is observed that CLARA does not always generate good prototypes, and it is computationally expensive for large datasets, with a complexity of O(kc² + k(n−k)), where c is the sample size. CLARANS (Clustering Large Applications based on RANdomized Search) combines the sampling technique with PAM. CLARANS replaces the build part of PAM by random selection of objects. Where CLARA considers a fixed sample size at every stage, CLARANS introduces randomness in sample selection. Once a set of objects is selected, a new object is selected when a preset value of local minimum and maximum neighbors is searched using the swap phase. CLARANS generates better cluster representatives than PAM and CLARA. The computational complexity of CLARANS is O(n²).

The other schemes for prototype selection include support vector machine (SVM) and Genetic Algorithm (GA) based schemes. SVMs are known to be expensive as they take O(n³) time. GA-based schemes need multiple scans of the dataset, which can be prohibitive when large datasets are processed.

As an illustration of prototype selection, we compare two algorithms that require a single database scan, viz., the Condensed Nearest-Neighbor (CNN) and Leader clustering algorithms.

The outline of CNN is provided in Algorithm 5.1. CNN starts with the first sample as a selected point (placed in BIN-2). Subsequently, the other patterns are classified using the selected pattern. The first incorrectly classified sample is included as an additional selected point. Likewise, with the selected patterns, all other patterns are classified to generate a final set of representative patterns.

Algorithm 5.1 (Condensed Nearest Neighbor rule)

Step 1: Set up two bins called BIN-1 and BIN-2. The first sample is placed in BIN-2.
Step 2: The second sample is classified by the NN rule, using the current contents of BIN-2 as the reference set. If the second sample is classified correctly, it is placed in BIN-1; otherwise, it is placed in BIN-2.
Step 3: Proceeding in this manner, the ith sample is classified by the current contents of BIN-2. If classified correctly, it is placed in BIN-1; otherwise, it is placed in BIN-2.
Step 4: After one pass through the original sample set, the procedure continues to loop through BIN-1 until termination in one of the following ways: (a) BIN-1 is exhausted, with all its members transferred to BIN-2, or (b) one complete pass is made through BIN-1 with no transfers to BIN-2.
Step 5: The final contents of BIN-2 are used as reference points for the NN rule;

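A minimal Python sketch of Algorithm 5.1 follows. This is our own rendering, using the Hamming distance for binary patterns; the function names are illustrative.

```python
import numpy as np

def _nn_label(X, y, refs, i):
    """Label of sample i's nearest reference (Hamming distance, binary data)."""
    d = [int(np.sum(X[r] != X[i])) for r in refs]
    return y[refs[int(np.argmin(d))]]

def cnn_prototypes(X, y):
    """Condensed Nearest Neighbor: BIN-2 holds selected prototypes, BIN-1
    holds correctly classified samples; loop over BIN-1 until no transfers."""
    bin2, bin1 = [0], []                    # first sample goes to BIN-2
    for i in range(1, len(X)):              # one pass over the data
        (bin1 if _nn_label(X, y, bin2, i) == y[i] else bin2).append(i)
    changed = True
    while bin1 and changed:                 # repeated passes over BIN-1
        changed = False
        for i in bin1[:]:
            if _nn_label(X, y, bin2, i) != y[i]:
                bin1.remove(i)
                bin2.append(i)
                changed = True
    return bin2

X = np.array([[0, 0], [0, 1], [1, 1], [1, 0]])
y = np.array([0, 0, 1, 1])
print(cnn_prototypes(X, y))  # -> [0, 2, 3]
```

On this toy data, sample 1 is correctly classified by the prototypes and ends up discarded, so BIN-2 retains three of the four samples.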

The Leader clustering algorithm is provided in Sect. 2.5.2.1. Discussions related to the Leader algorithm are provided in Sect. 5.3.4.

A comparative study is conducted between CNN and Leader by providing all the 6670 patterns as training data and 3333 patterns as test data for classifying them with the help of Nearest-Neighbor Classifier (NNC) Table5.1provides the results In the table, Classification Accuracy is represented as CA CPU Time refers to the process-ing time computed on Pentium III 500 MHz computer as time elapsed between the first and last computations The table provides a comparison between both methods It demonstrates the effect of threshold on the number of (a) prototypes selected, (b) CA, and (c) processing time A finite set of thresholds is chosen to demonstrate the effect of distance threshold It should be noted that for binary patterns, the Ham-ming and Euclidean distances provide equivalent information At the same time, it reduces computation time in terms of computation of squares of deviation and the square root Hence, we choose the Hamming distance as a dissimilarity measure

The exercises indicate that compared to Leader algorithm, CNN requires more time for obtaining the same classification accuracy But CNN provides fewer but a fixed set of prototypes corresponding to a chosen order of input data Leader algo-rithm offers a way of improving the classification accuracy by means of threshold value-based prototype selection and thus provides a greater flexibility to operate with In view of this and based on the earlier comparative study with PAM and CLARA, Leader is considered for prototype selection in this study We use the NNC as the classifier In order to achieve efficient classification, we use the set of proto-types obtained using the Leader algorithm Our scheme offers flexibility to select different sizes of prototype sets

Dimensionality reduction is achieved through either feature selection or feature extraction. Feature selection removes redundant features, with optimal feature subsets found by deterministic and random search algorithms. Some of the conventional algorithms include feature selection on an individual merit basis, the branch-and-bound algorithm, sequential forward and backward selection, the plus l–take away r algorithm, max–min feature selection, etc. Feature extraction methods utilize all the information contained in the feature space to obtain a transformed space of lower dimension.

Considering these philosophical and historical notes, in order to obtain generalization and regularization, we examine a large handwritten digit dataset in terms of feature selection and data reduction and ask whether there exists an equivalence between the two.

Four different approaches are presented, and the results of the exercises are provided to drive home the issues involved. We classify large handwritten digit data by combining dimensionality reduction and prototype selection. The compactness achieved by dimensionality reduction is indicated by means of the number of combinations of distinct subsequences. A detailed study of subsequence-based lossy compression is presented in Chap. 4. The concepts of frequent items and the Leader clustering algorithm are used in this work.


Table 5.1 Comparison between CNN and Leader

Distance threshold   No. of prototypes   CA (%)   CPU time (sec)

CNN
–                    1610                86.77     942.76

Leader
5                    6149                91.24    1171.81
10                   5581                91.27    1066.54
15                   4399                90.40     896.84
18                   3564                90.20     735.53
20                   3057                88.03     655.44
22                   2542                87.04     559.52
25                   1892                84.88     434.00
27                   1526                81.70     363.00

patterns are divided into training and test sets in the ratio of approximately 67 % and 33 %. A small portion of the total dataset, taken out of the training dataset, is used as validation data. Each pattern consists of 192 binary features. The number of patterns per class is nearly equal.

5.2.1 Data Compression Through Prototype and Feature Selection

In Chap. 4, we observed that increasing frequent-item support up to a certain value leads to data compaction without significant reduction in classification accuracy. We explore whether such compaction leads to the selection of better prototypes than selection without it. Similarly, we study whether an activity that leads to feature selection results in a more representative feature set. We propose to evaluate both these activities through classification of unseen data.

5.2.1.1 Feature Selection Through Frequent Features

In this chapter, we examine whether frequent-item support helps in arriving at such a discriminative feature set. We explore selecting such a feature set with varying support values. We evaluate each selected set through classifying unseen patterns.

5.2.1.2 Prototype Selection for Data Reduction


a single database scan. The leaders form cluster representatives. The clustering algorithm requires an optimal value of the dissimilarity threshold. Since such a value is data dependent, a random sample from the input data is used to arrive at the threshold. Each cluster is formed with reference to a leader. The leaders are retained, and the remaining patterns are discarded. The representativeness of the leaders is evaluated through classification of unseen patterns with the help of the set of leaders.
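The single-scan Leader algorithm can be sketched as below (our simplification). The sketch makes its linear complexity evident: each pattern is compared only against the current leaders.

```python
def leader_clustering(X, zeta, dist):
    """Single-scan Leader algorithm: the first pattern becomes a leader;
    each subsequent pattern joins a leader within distance zeta,
    otherwise it becomes a new leader itself."""
    leaders = [X[0]]
    for x in X[1:]:
        if min(dist(x, l) for l in leaders) > zeta:
            leaders.append(x)
    return leaders

# Toy binary data with Hamming distance and threshold zeta = 1.
hamming = lambda a, b: sum(p != q for p, q in zip(a, b))
X = [(0, 0, 0, 0), (0, 0, 0, 1), (1, 1, 1, 1), (1, 1, 1, 0)]
print(leader_clustering(X, 1, hamming))  # -> [(0, 0, 0, 0), (1, 1, 1, 1)]
```

Note the order dependence discussed in Illustration 5.1: presenting the patterns in a different order can yield different leaders for the same threshold.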

5.2.1.3 Sequencing Feature Selection and Prototype Selection for Generalization

Prototype selection and feature selection reduce the data size and dimensionality, respectively. It is educative to examine whether feature selection using frequent items followed by prototype selection, or vice versa, has any impact on classification accuracy. We experiment with both orderings to evaluate relative performance.

5.2.1.4 Class-Wise Data vs Entire Data

Given a multiclass labeled dataset, we examine the relative performance of considering the dataset class-wise versus as a single large set of multiclass data. We observe from Fig. 5.4 and Table 5.6 that patterns belonging to different class labels require different numbers of effective features to represent a pattern. Identifying a class-wise feature set or pattern set is likely to be a better representative of the class. On the contrary, it is interesting to examine whether there could be a common threshold for prototype selection and a common support threshold for selecting a feature set to represent the entire dataset.

5.3 Background Material

Consider training data containing n patterns, each having d features. Let ε and ζ be the minimum support for considering any feature for the study and the distance threshold for selecting prototypes, respectively. For continuity of notation, we follow the same terminology as provided in Table 4.1. Also, the terms defined in Sect. 4.3 are valid in the current work too. Additional terms are provided below.

1 Leader Leaders are cluster representatives obtained by using Leader Clustering algorithm

2 Distance Threshold for clustering (ζ) It is the threshold value of the distance used for computing leaders

Illustration 5.1 (Leaders, choice of first leader, and impact of threshold) In order


Table 5.2 Transaction and

items Transaction

No

Items

1

1 1 0

2 1 1

3 0 1

4 0 0

5 1 1

of leaders, we consider the UCI-ML dataset on iris. We demonstrate the concepts on the iris-versicolor data. We consider the petal length and width as the two features per pattern. In applying the Leader algorithm, we consider the Euclidean distance as the dissimilarity measure. To start with, we consider a distance threshold (ζ) of 1.4 cm and take the first pattern as the leader. The result is shown in Fig. 5.1. The figure contains two clusters, with the respective cluster members shown as different symbols. Leaders are shown with superscribed square symbols. In order to demonstrate the order dependence of leaders, we consider the same distance threshold and select pattern no. 16 as the first leader. As shown in Fig. 5.2, we still obtain two clusters, with a different location of the first leader and different numbers of cluster members. As a third example, we consider a distance threshold of 0.5 cm. We obtain seven clusters, as shown in Fig. 5.3. Note that the leaders are shown with a superscribed square. When we consider a large threshold of, say, 5.0 cm, we obtain a single cluster, which essentially is a scatter plot of all patterns.

3. Transaction. A transaction is represented using a set of items that could possibly be purchased. In any given transaction, all or a subset of the items could be purchased. Thus, a transaction indicates the presence or absence of purchased items. This is analogous to a pattern with presence or absence of binary-valued features.

Illustration 5.2 (Transaction or Binary-Valued Pattern) Consider a transaction

with six items. We represent an item bought as "1" and an item not bought as "0". We represent five transactions with the corresponding itemsets in Table 5.2. For example, in transaction 3, items 3, 5, and 6 are purchased.


Fig. 5.1 Leader clustering with a threshold of 1.4 cm on the Iris-versicolor dataset. The first leader is selected as pattern sl. no. 1


Fig. 5.3 Leader clustering with a threshold of 0.5 cm on the Iris-versicolor dataset. The first leader is selected as pattern sl. no. 16. The data is grouped into seven clusters

most often does not coincide with one of the input patterns. On the contrary, each leader is one of the input patterns.

5.3.1 Computation of Frequent Features

In the current section, we describe an experimental setup wherein we examine which of the features help discrimination, using frequent-item support. This is done by counting the number of occurrences of each feature in the training data. If the count is less than a given support threshold ε, the value of the feature is set to absent in all the patterns. After identifying the features that have support less than ε as infrequent, the training dataset is modified to contain "frequent features" only. As noted earlier, the value of ε is a trade-off between the minimal description of the data under study and the maximal compactness that could be achieved. The actual value depends on the size of the training data, such as class-wise data of 600 patterns each or full data of 6000 patterns.
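The frequent-feature computation amounts to zeroing out low-support columns of the binary pattern matrix. A sketch with NumPy; the names and toy data are ours:

```python
import numpy as np

def frequent_features(X, eps):
    """Zero out every binary feature whose column count (support)
    in the training data X is below the threshold eps."""
    support = X.sum(axis=0)       # occurrences of each feature
    mask = support >= eps         # True for frequent features
    return X * mask, mask

# Toy binary training data: feature 3 (index 2) occurs only once.
X = np.array([[1, 1, 0, 1],
              [1, 0, 1, 1],
              [1, 1, 0, 0]])
Xf, mask = frequent_features(X, 2)
print(mask)   # feature supports are [3, 2, 1, 2] -> [True True False True]
```

With ε=2, the lone occurrence of feature 3 is erased from all patterns while the remaining columns are untouched.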


Fig 5.4 Sample Patterns with frequent features having support 1, 100, 200, 300 and 400

5.3.2 Distinct Subsequences

The concepts of sequence, subsequence, and length of a subsequence are used in the context of demonstrating the compactness of a pattern, as discussed in Chap. 4. For example, consider the pattern containing binary features {011101101101000110110010 …}. Considering blocks of length 4, the pattern can be written as {0111 0110 1101 0001 1011 0010 …}. The corresponding block values, which are the decimal equivalents of the 4-bit blocks, are {7, 6, 13, 1, 11, 2, …}. When arranged as a 16×12 pattern matrix, each row of the matrix contains three blocks, each of 4 bits, as {(7, 6, 13), (1, 11, 2), …}. Let each set of three such codes form a subsequence, e.g., {(7, 6, 13)}. In the training set, all such distinct subsequences are counted. The original data of 6000 training patterns consists of 6000·192 features. When arranged as subsequences, the corresponding number of distinct subsequences is 690.

We count the frequency of subsequences, which is the number of occurrences of each subsequence. Subsequently, they are ordered in descending order of frequency and numbered sequentially for internal use. For example, the two most frequent distinct subsequences, {(0,6,0)} and {(0,3,0)}, are repeated 8642 times and 6447 times, respectively. As the minimum support value, ε, is increased, some of the binary feature values are set to zero. This leads to a reduction in the number of distinct subsequences, and we show later that it also provides better generalization.

5.3.3 Impact of Support on Distinct Subsequences

As discussed in Sect. 5.3.2, with increasing ε the number of distinct subsequences reduces. For example, consider the pattern {(1101 1010 1011 1100 1010 1011 …)}. The corresponding 4-bit block values are {(13,10,11), (12,10,11), …}. Suppose that, with the chosen support, feature number 4 in the considered pattern is absent. This would make the pattern {(110010101011 110010101011 …)}. Thus, the original distinct subsequences {(13,10,11), (12,10,11), …} reduce to {(12,10,11), …}.


5.3.4 Computation of Leaders

The Leader computation algorithm is described in Sect. 2.5.2 of Chap. 2. The leaders are considered as prototypes, and they alone are used further, either for classification or for computing frequent items, depending on the adopted approach. This forms data reduction.

5.3.5 Classification of Validation Data

The algorithm is tested against the validation data using the k-Nearest-Neighbor Classifier (kNNC). Each time, the prototypes alone are used to classify test patterns. Different approaches are followed to generate prototypes. Depending on the approach, the prototypes are either in their original form or in a new form with a reduced number of features. The schemes are discussed in the following section.

5.4 Preliminary Analysis

We carry out elaborate preliminary analysis in arriving at various parameters and also to study the sensitivity of such parameters

Table 5.3 contains the results of preliminary experiments considering the training dataset as 6670 patterns, combining training and validation data. The exercises provide insights on the choice of thresholds, the number of patterns per class, the reduced set of training patterns, and the classification accuracy with the test data. It can be noticed from the table that as the distance threshold (ζ) increases, the number of prototypes reduces, and that the classification reaches the best accuracy for an optimal set of thresholds, beyond which it starts reducing. The table consists of class-wise thresholds for different discrete choices. One such threshold set that provides the best classification accuracy is {3.5, 3.5, 3.5, 3.8, 3.5, 3.7, 3.5, 3.5, 3.5, 3.5}, and the accuracy is 94.0 %.

Table 5.4 provides the results on the impact of various class-wise support thresholds for a chosen set of leader distance threshold values. We consider a class-wise distance threshold of 3.5 for prototype selection using the Leader clustering algorithm. Column 1 of the table contains the class-wise support values, column 2 the total number of prototypes, which is the sum of the class-wise prototypes generated with a common distance threshold of 3.5, and column 3 the classification accuracies using kNNC. The table reveals an interesting aspect: when only the patterns with frequent features are selected, the number of representative patterns also reduces.


Table 5.3 Experiments with leader clustering with class-wise thresholds and prototypes

Class-wise distance thresholds

#Prototypes CA

0

(1) (2) (3) (4) (5) (6) (7) (8) (9) (10) (11) (12)

4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4207 92.53

(412) (21) (580) (545) (500) (606) (398) (243) (528) (374)

3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 5764 93.61

(609) (74) (663) (657) (649) (661) (686) (549) (642) (624)

3.2 3.2 3.2 3.2 3.2 3.2 3.2 3.2 3.2 3.2 5405 93.61

(561) (49) (647) (640) (629) (657) (577) (452) (628) (565)

3.4 3.4 3.4 3.4 3.4 3.4 3.4 3.4 3.4 3.4 5219 93.85

(542) (42) (641) (630) (612) (653) (548) (409) (606) (536)

3.6 3.6 3.6 3.6 3.6 3.6 3.6 3.6 3.6 3.6 4984 93.88

(506) (34) (628) (606) (593) (648) (516) (363) (593) (497)

3.7 3.7 3.7 3.7 3.7 3.7 3.7 3.7 3.7 3.7 4764 93.82

(418) (30) (616) (589) (567) (640) (489) (322) (570) (463)

Table 5.4 Experiments with support value for a common set of prototype selection

Support threshold

No of prototypes Classification accuracy (%)

(1) (2) (3)

5 4981 93.82

10 4974 93.82

15 4967 93.61

20 4962 93.58

25 4948 93.67

30 4935 93.61

35 4928 93.49

40 4915 93.7

45 4899 93.55

50 4887 93.79

55 4875 93.76


Table 5.5 Distinct subsequences and classification accuracy for varying support and constant set of input patterns

Support threshold

Distinct subsequences

Classification accuracy (%)

(1) (2) (3)

0 690 92.47

15 648 92.35

25 599 92.53

45 553 92.56

55 533 92.89

70 490 92.29

80 468 92.26

90 422 92.20

100 395 92.32

5.5 Proposed Approaches

With the background of the discussion provided in Sect. 5.2, we propose the following four approaches.

• Patterns with frequent features only, considered in both class-wise and combined sets of multi-class data

• Cluster representatives only in both class-wise and entire datasets

• Frequent item selection followed by prototype selection in both class-wise and entire datasets

• Prototype selection followed by frequent items in both class-wise and combined datasets

In the following subsections, we elaborate each approach

5.5.1 Patterns with Frequent Items Only

In this approach, we consider the entire training data. For a chosen ε, we form patterns containing frequent features only. With the training data containing frequent features, we classify validation patterns using kNNC. By varying ε, the Classification Accuracy (CA) is computed. The value of ε that provides the best CA is identified, and the results are tabulated. The entire exercise is repeated considering class-wise data and class-wise support, as well as the full data. It should be noted that the support value depends on the data size. The procedure can be summarized as follows.

Class-Wise Data

• Consider class-wise data of 600 patterns per class

• By changing support in small discrete steps, carry out the following steps

1 Compute frequent features

2 Consider class-wise training data with frequent features and combine them to form the full training dataset of 6000 patterns. It should be noted that the prototype set is not changed here

3 Classify validation patterns and record CA

4 Compute the number of distinct subsequences

Full Data

• Consider the full training dataset of 6000 patterns containing all 10 classes together

• By changing support in small discrete steps, carry out the following steps

1 Compute frequent features

2 Consider full training data with frequent features

3 Classify validation patterns and record CA

4 Compute the number of distinct subsequences
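The steps above can be sketched in a few lines. This is an illustrative outline rather than the book's implementation; the toy binary data, function names, and threshold values are assumptions.

```python
import numpy as np

def frequent_feature_mask(X, eps):
    """A feature is 'frequent' if at least eps training patterns have it set."""
    return X.sum(axis=0) >= eps

def knnc_label(X_train, y_train, x, k=3):
    """kNNC: majority label among the k nearest training patterns."""
    d = np.linalg.norm(X_train - x, axis=1)
    votes = y_train[np.argsort(d)[:k]]
    labels, counts = np.unique(votes, return_counts=True)
    return labels[np.argmax(counts)]

# Toy binary training data: 6 patterns, 5 features, 2 classes (an assumption)
X = np.array([[1, 1, 0, 0, 1],
              [1, 1, 0, 0, 0],
              [1, 0, 0, 0, 1],
              [0, 1, 1, 1, 0],
              [0, 0, 1, 1, 0],
              [0, 1, 1, 1, 1]])
y = np.array([0, 0, 0, 1, 1, 1])

mask = frequent_feature_mask(X, eps=4)   # keep features with support >= 4
Xf = X * mask                            # zero out infrequent features
query = np.array([1, 1, 0, 0, 0]) * mask
print(int(mask.sum()), knnc_label(Xf, y, query))   # prints: 1 0
```

Varying `eps` trades off compaction (fewer effective features, hence fewer distinct subsequences) against the classification accuracy obtained on validation data.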

5.5.2 Cluster Representatives Only

In this approach, we consider the training data and use the Leader clustering algorithm to identify leaders. The leaders form prototypes. We use the set of prototypes to classify validation data. For computing leaders, we change the distance threshold value, ζ, in small steps. The training data is considered class-wise and as a full dataset separately. The procedure can be summarized as follows.

Class-Wise Data

• Consider class-wise data of 600 patterns per class

• By changing the distance threshold in small discrete values, carry out the following steps

1 Compute class-wise leaders that form prototypes

2 Consider class-wise prototypes and combine them to form a full training dataset of prototypes

3 Classify validation patterns and record CA

Full Data

• Consider a full training dataset of 6000 patterns containing all 10 classes together

• By changing the threshold ζ in small discrete values, carry out the following steps

1 Compute leaders over the entire data

2 Classify validation patterns and record CA
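The Leader algorithm used above admits a compact single-pass sketch. This is illustrative; the toy two-dimensional data and the threshold value are assumptions. With ζ = 0, every distinct pattern becomes a leader; larger ζ yields fewer leaders.

```python
import numpy as np

def leaders(X, zeta):
    """Single-pass Leader clustering: a pattern becomes a new leader
    if it is farther than zeta from every existing leader."""
    idx = []
    for i, x in enumerate(X):
        if all(np.linalg.norm(x - X[j]) > zeta for j in idx):
            idx.append(i)
    return idx

# Toy 2-D data: two tight groups and one isolated point (an assumption)
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.2, 5.1], [9.0, 0.0]])
print(leaders(X, zeta=1.0))   # prints: [0, 2, 4]
print(len(leaders(X, zeta=0.0)))   # prints: 5 (every pattern is a leader)
```

The leader set depends on the order in which patterns are scanned, which is one reason the algorithm is attractive for large data: it needs only one pass and no pairwise distance matrix.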


5.5.3 Frequent Items Followed by Clustering

In the current approach, to start with, the frequent items are identified for different support values, ε. The training data at this stage contains only frequent features. The training data is then subjected to leader clustering to identify cluster representatives. The prototypes thus formed are used to classify validation data. The data is considered class-wise and as full data. The procedure is summarized as follows.

Class-Wise Data

• Consider class-wise data of 600 patterns per class

• By changing the support threshold, ε, in small discrete values, carry out the following steps

1 Compute class-wise frequent features

2 Combine class-wise patterns having frequent features to form the training dataset

3 Compute prototypes for different values of ζ to identify prototype patterns that contain frequent features

4 Classify validation data patterns

5 Compute the number of distinct subsequences

Full Data

• Consider a full training dataset of 6000 patterns containing all 10 classes together

• By changing the support threshold, ε, in small discrete values, carry out the following steps

1 Compute frequent features

2 Carry out clustering and identify prototypes for different values of ζ

3 Classify validation patterns and record CA

4 Compute the number of distinct subsequences

5.5.4 Clustering Followed by Frequent Items

In the current approach, the clustering is carried out for different distance threshold values, ζ, as the first step. The training data at this stage contains only prototypes. Frequent features in the prototypes are then identified. The prototypes thus formed, with a pruned set of features, are used to classify validation data. The data is considered class-wise and as full data. The procedure is summarized as follows.

Class-Wise Data

• Consider class-wise data of 600 patterns per class

• By changing the distance threshold, ζ, in small discrete values, carry out the following steps

1 Compute leaders

2 Combine all class data to form training dataset

3 Compute frequent items among all the leaders for different values of ζ

4 Use the training data so generated to classify validation data patterns

5 Compute the number of distinct subsequences

Full Data

• Consider a full training dataset of 6000 patterns containing all 10 classes together

• By changing the distance threshold, ζ, in small discrete values, carry out the following steps

1 Compute leaders

2 Compute frequent items among the leaders

3 Classify validation patterns and record CA

4 Compute the number of distinct subsequences
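To see that the order of the two operations matters, the following illustrative sketch runs both orderings of Sects. 5.5.3 and 5.5.4 on the same data. The toy binary patterns, thresholds, and helper names are assumptions.

```python
import numpy as np

def frequent_mask(X, eps):
    """Features present in at least eps patterns."""
    return X.sum(axis=0) >= eps

def leaders(X, zeta):
    """Single-pass Leader clustering; returns indices of leader patterns."""
    idx = []
    for i, x in enumerate(X):
        if all(np.linalg.norm(x - X[j]) > zeta for j in idx):
            idx.append(i)
    return idx

X = np.array([[1, 1, 0, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0],
              [0, 1, 1, 0],
              [1, 1, 1, 1]])

# Frequent items followed by clustering (Sect. 5.5.3)
Xf = X * frequent_mask(X, eps=3)
protos_a = Xf[leaders(Xf, zeta=0.5)]

# Clustering followed by frequent items (Sect. 5.5.4)
Xl = X[leaders(X, zeta=0.5)]
protos_b = Xl * frequent_mask(Xl, eps=3)

print(protos_a.shape, protos_b.shape)   # prints: (4, 4) (5, 4)
```

Here pruning features first merges patterns that differed only in an infrequent feature, so fewer leaders survive; clustering first keeps all five patterns as leaders. The same effect, at scale, is what produces the different prototype and subsequence counts reported in Table 5.7.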

5.6 Implementation and Experimentation

The proposed schemes are demonstrated on two types of datasets. In Sect. 5.6.1, we implement them on handwritten digit data that has binary-valued features. In Sect. 5.6.2, we consider intrusion detection data provided under the KDDCUP'99 challenge. That data consists of floating-point-valued features, each with a different range of values; it is described in the Appendix. We appropriately quantize the features to convert the considered data into binary data. We implement prototype selection and simultaneous prototype and feature selection. The second dataset is considered to demonstrate the applicability of the schemes on different types of data. We base further discussions and conclusions primarily on Sect. 5.6.1.

5.6.1 Handwritten Digit Data


Column 1 of the table contains the sequence number of the approach. Column 2 consists of approaches such as feature selection, prototype selection, and their relative sequencing. Column 3 indicates whether the approach is experimented on class-wise or on the entire dataset. Column 4 contains the support threshold used for the activity, and column 5 contains the distance threshold for leader clustering. Column 6 contains the number of prototypes, and column 7 contains the number of distinct subsequences. The number of distinct subsequences is derived based on the distinct features: as the number of distinct features reduces, the number of distinct subsequences reduces, leading to compaction. Columns 8 and 9 contain the classification accuracy corresponding to the validation and test datasets. It can be observed from the table that for approaches 1 and 2, the number of prototypes is unaffected since the focus is on feature selection. Further, for frequent item support-based feature selection, the number of distinct subsequences reduced to 361 when the full data is considered, compared to 507 for class-wise data.

We provide an approach-wise observation summary below

In Approach 1, we consider all patterns (ζ = 0). We initially consider class-wise patterns, and by varying the support value ε over a range up to 200, we identify frequent items (features). As part of another exploration, we consider the full dataset and vary the support ε over a range up to 2000. Thus, the number of effective items (or features) gets reduced per pattern in both cases, which in turn results in a reduction in the number of distinct subsequences. It should however be noted that in the case of class-wise data, the set of frequent features is different for different classes, whereas for full data, the frequent feature set is common for the entire dataset. On the validation dataset, the best classification accuracy is obtained with 507 out of 669 distinct subsequences for class-wise data and with 450 out of 669 distinct subsequences for the full dataset. The classification accuracies (CA) with test data for class-wise data and the entire dataset are 92.32 % and 92.05 %, respectively. Figure 5.5 shows the actual reduction in the number of distinct subsequences with increasing threshold on the entire data. However, beyond an ε value of 450, the CA deteriorates as the loss of feature information further affects the ability to discriminate patterns. Table 5.6 indicates the reduction obtained in the number of features; observe from the table that the reduction is significant. Column 1 of the table contains the class labels. Column 2 contains the number of nonzero features for each class. For a support threshold value of 160, the number of nonzero features reduces significantly, as shown in column 3. The percentage reduction in the number of features is provided in column 4. It can be observed that the maximum reduction in features of 57.3 % occurred for class label 1, and the least reduction occurred for class 8 with a value of 42.2 %.


Fig 5.5 Number of distinct subsequences vs increasing support on full data

Table 5.6 Results of feature selection using support of 160

Class   Number of nonzero features among      Frequent features   Reduction in
label   600 training patterns (max = 192)     with ε = 160        number of features
(1)     (2)                                   (3)                 (4)
0       176                                   95                  46.0 %
1       103                                   44                  57.3 %
2       186                                   101                 45.7 %
3       174                                   87                  50.0 %
4       170                                   85                  50.0 %
5       181                                   96                  47.0 %
6       171                                   90                  47.4 %
7       170                                   67                  60.6 %
8       173                                   100                 42.2 %
9       175                                   80                  54.3 %


Table 5.7 Results with each approach

Sl   Approach                         Description       Support   Leader    No. of       Distinct   CA with      CA with
No                                                      thr. (ε)  thr. (ζ)  prototypes   subseq.    valdn data   test data
(1)  (2)                              (3)               (4)       (5)       (6)          (7)        (8)          (9)
1    Feature Selection (FeaSel)       Class-wise data   160       0         6000         507        92.52 %      92.32 %
2    Feature Selection (FeaSel)       Full data         450       0         6000         361        92.09 %      92.05 %
3    Prototype Selection (ProtoSel)   Class-wise data   0         3.1       5064         669        93.14 %      93.31 %
4    Prototype Selection (ProtoSel)   Full data         0         3.1       5041         669        93.13 %      92.26 %
5    FeaSel followed by ProtoSel      Class-wise data   40        3.1       5027         542        93.43 %      93.52 %
6    FeaSel followed by ProtoSel      Full data         190       3.1       5059         433        93.58 %      93.34 %
7    ProtoSel followed by FeaSel      Class-wise data   180       3.1       5064         433        93.58 %      93.34 %
8    ProtoSel followed by FeaSel      Full data         300       3.1       5041         367        93.58 %      93.52 %

In the case of the largest distance threshold, there would be a single cluster, and in the case of a distance threshold of 0.0, the number of clusters equals the number of training patterns.

In Approach 3, frequent items are first computed on the original data, followed by clustering on the patterns containing frequent features only. Frequent features are computed by changing ε in steps over a range up to 200. As a next step, prototypes are computed for each case using the Leader clustering algorithm. For computing leaders, the distance threshold values (ζ) are varied up to 5.0. The data thus arrived at is tested on the validation dataset. The parameter set that provided the best classification on the validation dataset is used to compute the classification accuracy on the test dataset. The corresponding classification accuracies with class-wise data and the full dataset are 93.52 % and 93.34 %, respectively. It should be noted that the respective numbers of distinct subsequences for these two cases are 542 and 433. Figure 5.7 presents the classification accuracy with increasing support values in Approach 3.


Fig 5.6 Change in the number of leaders with increasing distance threshold

The classification accuracies corresponding to the best cases with validation data for class-wise and full datasets are 93.34 % and 93.52 %, respectively. Because of prototype selection, there is also a reduction in the number of prototypes from the original 6000 to 5064 and 5041 for these two cases. The numbers of distinct subsequences in these cases are 433 and 367, respectively.

The number of distinct subsequences indicates the compactness achieved. Further, even with a good amount of reduction in the number of distinct subsequences, there is no significant reduction in Classification Accuracy (CA). This can be observed from Fig. 5.8, corresponding to Approach 4 with full training data. The figure displays CA for various values of support considering the entire data and a distance threshold of 3.1. Observe that CA reaches its maximum at a support of 300.


Fig 5.7 Classification accuracy as a function of support value (ζ = 3.1)


Fig 5.9 Change in the number of leaders with increasing support

5.6.2 Intrusion Detection Data

The objective of this subsection is to illustrate the applicability of the schemes on different types of data. The Appendix contains a description of the intrusion detection dataset that is part of the KDDCUP'99 challenge. The data, which consists of floating-point-valued features, is quantized into binary data; the procedure is discussed in the Appendix.

5.6.2.1 Prototype Selection

The objective of the exercises in the current section is to identify a subset of the original data as data representatives, or prototypes. The Leader clustering algorithm is used to identify the prototypes. The input dataset consists of five classes with equal numbers of patterns. For each of the classes, we identify cluster representatives independently. We combine them to form training data that consists of prototypes only. With the help of this training dataset, we classify 411,029 test patterns. The experiments are conducted with different dissimilarity thresholds.


Fig 5.10 Results of prototype selection for the category “normal”

Table 5.8 Case study details

Case   normal           u2r           dos           r2l           probe
No     Thr   Pats       Thr   Pats    Thr   Pats    Thr   Pats    Thr   Pats
1      100   3751       100   33      -     7551    -     715     -     1000
2      50    10,850     10    48      -     7551    -     895     -     1331
3      40    14,891     100   48      2     7551    1     895     1     1331

(40, 100, 2, 1, 1). We further make use of the datasets mentioned in Table 5.8 in the subsequent sections.

Algorithm 5.2 (Prototype Selection Algorithm)

Step 1: Compute class-wise leaders in each of the classes (normal, u2r, dos, r2l, probe) (Figs. 5.10, 5.11, 5.12, 5.13, 5.14)
Step 2: Combine the class-wise leaders to form training data
Step 3: Classify test data
Step 4: Repeat the exercises with different distance thresholds. The Euclidean distance is used for both exercises


Fig 5.11 Results of prototype selection for category “u2r”


Fig 5.13 Results of prototype selection for category “r2l”


Table 5.9 Results with prototypes

Case No   Training data size   CA (%)   Cost
1         13,050               91.66    0.164046
2         20,660               91.91    0.159271
3         24,702               91.89    0.158952

5.6.3 Simultaneous Selection of Patterns and Features

We noted previously in Table A.8 that not all features in the data are frequent. We make use of this fact to examine whether, by considering only frequent features, we can classify the test patterns better. This is considered in two ways.

In the first method, we first find prototypes and then find frequent features within the prototypes, which we term "Leaders followed by Frequent features." In the second method, we consider frequent features in the entire data and then identify prototypes, which we term "Frequent features followed by Leaders." The overall algorithm is provided in Algorithm 5.3.

Algorithm 5.3 (Simultaneous Selection of Patterns and Features)

Step 1: Compute the support of each of the features across the given data

Step 2: For a given support threshold, identify features whose support exceeds the threshold. Term them frequent features

Step 3: Eliminate infrequent features from both training and test data by setting the corresponding feature values to 0.0. The number of patterns remains the same

Step 4: Classify test patterns and compute the cost

5.6.3.1 Leaders Followed by Frequent Features

The cost of assigning a wrong label to a pattern is not the same across different classes. For example, the cost of assigning a pattern from class "normal" to class "u2r" is 2, and from "normal" to "dos" is also 2. Since the cost matrix is not symmetric, the cost of assigning "u2r" to "normal" is not the same as that of "normal" to "u2r".
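Cost-sensitive evaluation with such an asymmetric matrix can be sketched as follows. Only the normal-to-u2r and normal-to-dos entries (both 2) come from the text above; the remaining entries and the toy label vectors are illustrative assumptions.

```python
import numpy as np

classes = ["normal", "u2r", "dos", "r2l", "probe"]
# cost[i][j] = cost of labelling a pattern of true class i as class j;
# note cost[0][1] != cost[1][0], i.e. the matrix is asymmetric.
cost = np.array([[0, 2, 2, 2, 1],
                 [3, 0, 2, 2, 2],
                 [2, 2, 0, 2, 1],
                 [4, 2, 2, 0, 2],
                 [1, 2, 2, 2, 0]])

def average_cost(y_true, y_pred):
    """Mean misclassification cost over a set of test patterns."""
    return cost[y_true, y_pred].mean()

y_true = np.array([0, 0, 2, 3, 4])   # true labels (indices into classes)
y_pred = np.array([0, 1, 2, 0, 4])   # predicted labels
print(average_cost(y_true, y_pred))  # (0 + 2 + 0 + 4 + 0) / 5 = 1.2
```

Minimizing this average cost, rather than the raw error rate, is what the "Cost" columns in Tables 5.9-5.11 report.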

The prototype set mentioned in Case 3 of Table 5.8 is considered for the study. Frequent features are obtained from the dataset with the help of minimum item support. All 38 features are considered with a support of 0 %. The number of effective features reduces from 38 to 21 and 18 with respective supports of 10 % and 20 %. Table 5.10 summarizes the results.


Table 5.10 Results with frequent item support on case 3 (247,072 patterns)

Support   No. of features   CA        Cost
 0 %      38                91.89 %   0.1589
10 %      21                91.95 %   0.1576
20 %      18                91.84 %   0.1602

Reduction in the number of patterns reduces storage space and computation time. The scenario also leads to an increase in classification accuracy and a reduction in assignment cost, as long as representative patterns are preserved.

5.6.3.2 Frequent Feature Identification Followed by Leaders

In this case, the entire training data is considered, and the support of each feature is computed. With a lower support threshold applied on the entire data, the number of features reduces to 22, and with a 10 % support, the number of features reduces to 17. Figure 5.15 summarizes the results.

The leader computation is restricted to data containing features with 10 % support. Table 5.11 contains the results. Observe from the table that the exercise corresponding to a distance threshold of 20.0 provided the least cost, with classification accuracy nearly unchanged for thresholds 5.0-20.0.


Table 5.11 Results on original data having features with 10 % support

Distance threshold   No. of leaders   CA        Cost
  5.0                17,508           91.83 %   0.1588
 10.0                15,749           91.85 %   0.1586
 20.0                15,023           91.83 %   0.1585
 50.0                 9669            84.60 %   0.2990
100.0                 3479            82.97 %   0.3300

5.7 Summary

With the objective of handling the large data classification problem efficiently, we examined the usefulness of prototype selection and feature selection, individually and in combination. During the process, a multicategory training dataset is considered both as a single multicategory dataset and class-wise. We also examined the effectiveness of the sequencing of prototype selection and feature selection. Feature selection using frequent item support has been studied. We consider kNNC for classifying the given large, high-dimensional handwritten data. Elaborate experimentation has been carried out, and the results are presented.

The contribution of the data compaction scheme is to show that through combinations of prototype selection and feature selection we obtain better Classification Accuracy than by considering these two activities independently. This amounts to simultaneous selection of patterns and features. Clustering "data containing frequent features" has provided a good amount of compactness for classification, from 669 distinct subsequences to 367 subsequences, viz., a 45 % reduction. Such compactness did not result in a reduction in classification accuracy.

It is clear that frequent feature selection leads to a reduction in the number of features. We have shown through experiments that such a reduction improves classification accuracy from 91.79 % to 92.52 % when the data is considered class-wise for feature selection. Prototype selection excludes redundant patterns, leading to an improvement of CA to 93.14 % when the data is considered class-wise. It should be noted that when the data is considered class-wise, the selection of ε or ζ is relevant to the training data within the class only. The combination scheme where features and prototypes are selected simultaneously provided the best classification accuracy. Further, the scheme based on Approach 4 provided both a compaction of 45 % and the best classification accuracy of 93.58 %.

Similar observations can be made with intrusion detection data

The following is a summary of the work. The numerical data corresponds to the experiments with handwritten digit data.


• Prototype selection leads to data reduction. The classification accuracy obtained is superior to using frequent features only. However, the number of distinct features is the same as in the original data, viz., 669

• The class-wise method of frequent feature selection followed by prototype selection provided the best classification accuracy of 93.52 %. Similarly, complete-data-based prototype selection followed by feature selection provides the same best accuracy. The corresponding compaction achieved in terms of the number of features is 35.3 %

• Clustering followed by frequent feature selection provides the best compaction of 45.1 % as compared to the original 669 distinct subsequences, while providing a classification accuracy of 93.58 %

5.8 Bibliographic Notes

Duda et al. (2000) provide a discussion on pattern and feature selection and on pattern classification approaches. Kittler (1986) provides a discussion on feature selection and extraction. The importance of a simple model is emphasized by Domingos (1998). A discussion on dimensionality and data reduction approaches can be found in Jain et al. (1999) and Pal and Mitra (2004). Data squashing is discussed by DuMouchel et al. (2002). Sampling schemes for prototype selection can be found in Pal and Mitra (2004) and Han et al. (2012). Zhang et al. (1996) provide the data summarization scheme known as BIRCH. Bradley et al. (1998) provide a discussion on scaling clustering algorithms to large data. Descriptions of PAM, CLARA, and CLARANS (Clustering Large Applications using RANdomized Search) can be found in Kaufman and Rousseeuw (1989). A discussion of the Leader algorithm is provided in Spath (1980). A comparison of prototype selection methods using genetic algorithms is provided in Ravindra et al. (2001). Burges (1998) provides a detailed discussion on support vector machines, which in turn can be used to select prototypes. Hart (1968) provides the Condensed Nearest-Neighbor algorithm. Agrawal and Srikant (1994) propose the concept of support in their seminal work on association rule mining. Ravindra et al. (2005) discuss simultaneous selection of prototypes and features. Lossy compression using the concepts of frequent item support and distinct subsequences is provided by Ravindra et al. (2004). The dataset used for illustrating the computation of leaders is taken from the UCI-ML repository (2013).

References

R. Agrawal, R. Srikant, Fast algorithms for mining association rules, in Proceedings of International Conference on VLDB (1994)

P. Bradley, U.M. Fayyad, C. Reina, Scaling clustering algorithms to large databases, in Proceedings of 4th Intl. Conf. on Knowledge Discovery and Data Mining (AAAI Press, New York, 1998)

C.J.C. Burges, A tutorial on support vector machines for pattern recognition. Data Min. Knowl. Discov. 2(2), 121-167 (1998)

P. Domingos, Occam's two razors: the sharp and the blunt, in Proc. of 4th Intl. Conference on Knowledge Discovery and Data Mining (KDD'98), ed. by R. Agrawal, P. Stolorz (AAAI Press, New York, 1998), pp. 37-43

W. DuMouchel, C. Volinsky, T. Johnson, C. Cortes, D. Pregibon, Squashing flat files flatter, in Proc. 5th Intl. Conf. on Knowledge Discovery and Data Mining, San Diego, CA (AAAI Press, New York, 2002)

R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification (Wiley-Interscience, New York, 2000)

J. Han, M. Kamber, J. Pei, Data Mining—Concepts and Techniques (Morgan Kaufmann, New York, 2012)

P.E. Hart, The condensed nearest neighbor rule. IEEE Trans. Inf. Theory IT-14, 515-516 (1968)

A.K. Jain, M.N. Murty, P. Flynn, Data clustering: a review. ACM Comput. Surv. 31(3) (1999)

J. Kittler, Feature selection and extraction, in Handbook of Pattern Recognition and Image Processing, ed. by T.Y. Young, K.S. Fu (Academic Press, San Diego, 1986), pp. 59-83

L. Kaufman, P.J. Rousseeuw, Finding Groups in Data—An Introduction to Cluster Analysis (Wiley, New York, 1989)

S.K. Pal, P. Mitra, Pattern Recognition Algorithms for Data Mining (Chapman & Hall/CRC, London/Boca Raton, 2004)

T. Ravindra Babu, M. Narasimha Murty, V.K. Agrawal, Hybrid learning scheme for data mining applications, in Proc. of Fourth Intl. Conf. on Hybrid Intelligent Systems (IEEE Computer Society, Los Alamitos, 2004), pp. 266-271. doi:10.1109/ICHIS.2004.56

T. Ravindra Babu, M. Narasimha Murty, V.K. Agrawal, On simultaneous selection of prototypes and features in large data, in Proceedings of the First International Conference on Pattern Recognition and Machine Intelligence. Lecture Notes in Computer Science, vol. 3776 (Springer, Berlin, 2005), pp. 595-600

T. Ravindra Babu, M. Narasimha Murty, Comparison of genetic algorithm based prototype selection schemes. Pattern Recognit. 34(2), 523-525 (2001)

H. Spath, Cluster Analysis Algorithms for Data Reduction and Classification (Ellis Horwood, Chichester, 1980)

T. Zhang, R. Ramakrishnan, M. Livny, BIRCH: an efficient data clustering method for very large databases, in Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'96) (1996), pp. 103-114


Chapter 6

Domain Knowledge-Based Compaction

6.1 Introduction

With large datasets, it is difficult to choose good structures in order to control the complexity of models unless there exists prior knowledge of the data. With the objective of classification of large datasets, one can achieve significant compaction, and thereby performance improvement, by integrating prior knowledge. The prior knowledge may relate to the nature of the data, the discrimination function, or the learning hypothesis.

The prior or domain knowledge is either provided by a domain expert or derived through rigorous data analysis, as we demonstrate in the current work. The No-Free-Lunch theorem emphasizes that, in the absence of assumptions, no learning algorithm is preferable to another. The assumptions or knowledge about the domain are thus important in designing classification algorithms. We stress the importance of deriving such domain knowledge through preliminary data analysis so as to automate the process of classification of multiclass data using binary classifiers. We exploit this aspect in the present work.

In the current chapter, we consider binary classifiers. Various approaches exist for multiclass classification, such as one-vs-one and one-vs-rest, with each of the approaches requiring many comparisons for determining the category of a pattern. We propose to exploit domain knowledge on the data in labeling the 10-category patterns through a novel decision tree of depth 4. We apply such a scheme to classify the patterns using support vector machines (SVM) and adaptive boosting (AdaBoost). The overall classification accuracy thus obtained is shown to be better than the previously reported values on the same data. The proposed method also integrates clustering-based reduction of the original large data.

Major contributions of the work are the following.

• Exploiting domain knowledge in devising a multicategory tree classifier with a depth of just 4

• Use of SVMs with appropriate kernels

T. Ravindra Babu et al., Compression Schemes for Mining Large Datasets, Advances in Computer Vision and Pattern Recognition, DOI 10.1007/978-1-4471-5607-9_6, © Springer-Verlag London 2013


• Use of AdaBoost

• Employing clustering methods and CNN to obtain representative patterns

• Integrating representative patterns, domain knowledge, and SVMs or AdaBoost in obtaining classification accuracy better than reported earlier, with less design and classification time as compared to applying the same scheme on the full data

The chapter is organized as follows. Section 6.2 provides a brief discussion on different schemes of multicategory data classification using a 2-class classifier. Section 6.3 contains an overview of SVM. A brief discussion on the AdaBoost method is provided in Sect. 6.4. An overview of decision trees is provided in Sect. 6.5. Section 6.6 contains preliminary analysis on the data that helps to extract the domain knowledge. The proposed method is provided in Sect. 6.7. Experimental results using support vector machines and AdaBoost are provided in Sect. 6.8. Section 6.9 contains the summary of the work. A discussion of material for further study and related literature is provided in Sect. 6.10.

6.2 Multicategory Classification

Two-class classification rules using SVMs or AdaBoost are easier to learn. In the case of multicategory data of c classes, the classification is done using multiple binary classification stages, and the results are combined to provide the overall classification accuracy. The approaches can be classified into one-vs-rest, one-vs-one, and error-correcting codes. In the case of a one-vs-rest decision, we consider training samples from each class as positive and the rest as negative; we assign to a test pattern the class that wins the most comparisons. In the case of one-vs-one, c(c − 1)/2 binary classifications are resorted to, considering two classes at a time; we assign to the test pattern the class that receives the largest number of votes. In the case of pairwise coupling, the output of a binary classifier is interpreted as the posterior probability of the positive class, and a test pattern is assigned the class that has the largest posterior probability. Error-correcting code methods adapt coding matrices. A multicategory problem is thus reduced to two-category problems in the literature through multiple approaches. Figures 6.1 and 6.2 contain examples of one-vs-one and one-vs-rest classification.
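A minimal sketch of one-vs-one voting follows. The pairwise decision rule used here (the nearer of two 1-D class centroids wins) is a stand-in assumption for any trained binary classifier.

```python
from itertools import combinations
from collections import Counter

def one_vs_one_predict(x, classes, decide_pair):
    """decide_pair(x, a, b) returns the winning class (a or b) for pattern x;
    the class with the most pairwise votes is assigned."""
    votes = Counter(decide_pair(x, a, b) for a, b in combinations(classes, 2))
    return votes.most_common(1)[0][0]

# Toy stand-in classifier: each class is a 1-D centroid (an assumption)
centroids = {0: 0.0, 1: 5.0, 2: 9.0}
decide = lambda x, a, b: a if abs(x - centroids[a]) <= abs(x - centroids[b]) else b

classes = list(centroids)
print(len(list(combinations(classes, 2))))        # c(c-1)/2 = 3 comparisons
print(one_vs_one_predict(6.2, classes, decide))   # class 1 wins the vote
```

For c = 10 digit classes, one-vs-one needs 45 binary classifiers per test pattern, which is precisely the comparison count the depth-4 decision tree of this chapter is designed to avoid.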

6.3 Support Vector Machine (SVM)

The current subsection provides a quick overview of the support vector machine-related material

Let (x_i, y_i), i = 1, 2, ..., m, represent the patterns to be classified, where x ∈ R^p and y is the category, +1 or −1. This leads to a learning machine that maps x → y, or x → f(x, α). A choice of α leads to a trained machine. The risk, or expectation of the test error, for a trained machine is

R(α) = ∫ (1/2) |y − f(x, α)| dP(x, y). (6.1)

Fig 6.1 One-vs-one classification. The first subfigure on the left-top contains five patterns that are under consideration

Fig 6.2 One-vs-rest classification

Suppose that the training patterns satisfy the following constraints:

w^T x_i + b ≥ +1 for y_i = +1, (6.2)
w^T x_i + b ≤ −1 for y_i = −1. (6.3)

Consider the points with equality in (6.2) and (6.3). These points lie on the hyperplanes w^T x + b = +1 and w^T x + b = −1. Such points are called Support Vectors. The hyperplanes are parallel, separated by a distance 2/‖w‖, called the margin.

The above problem can be restated as a convex quadratic programming problem for maximizing the margin, which is equivalent to the following minimization problem:

min_{w,b} φ(w) = (1/2) ‖w‖²
subject to y_i (w^T x_i + b) ≥ 1, i = 1, 2, ..., l. (6.4)

By introducing positive Lagrange multipliers λ_i, i = 1, 2, ..., m, the solution of the optimization problem provides the support vectors as those with λ_i > 0. The decision function is given by

f(x) = sign( Σ_{i=1}^{m} y_i λ_i x^T x_i + b ). (6.5)


In the case of nonlinear decision functions, we map the data to a higher-dimensional feature space and construct a separating hyperplane in that space. The mapping is given by

X → H, x → φ(x). (6.6)

The decision functions are given by

f(x) = sign( φ(x)^T w* + b ),
f(x) = sign( Σ_{i=1}^{m} y_i λ_i φ(x)^T φ(x_i) + b ). (6.7)

Consider a kernel function, Eq. (6.8), which constructs an optimal separating hyperplane in the space H without explicitly performing calculations in this space:

K(x, z) = φ(x)^T φ(z). (6.8)

With this, the decision function given by (6.7) can be rewritten as

f(x) = sign( Σ_{i=1}^{m} y_i λ_i K(x, x_i) + b ). (6.9)
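Equation (6.8) can be checked numerically for a simple case: for the homogeneous quadratic kernel K(x, z) = (x^T z)² in two dimensions, the explicit feature map φ(x) = (x1², √2 x1 x2, x2²) gives the same value, while the kernel never forms φ. The example vectors are assumptions.

```python
import numpy as np

def K(x, z):
    """Quadratic kernel: computes phi(x)^T phi(z) without forming phi."""
    return float(np.dot(x, z)) ** 2

def phi(x):
    """Explicit feature map for the 2-D quadratic kernel."""
    return np.array([x[0] ** 2, np.sqrt(2.0) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, 1.0])
print(K(x, z), float(np.dot(phi(x), phi(z))))   # both equal 25.0
```

This is the "kernel trick" of (6.9): the decision function only ever needs K(x, x_i), so even very high-dimensional (or infinite-dimensional) feature spaces stay computationally tractable.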

In the case of nonseparable data, we introduce a vector of slack variables (ξ_1, ξ_2, ..., ξ_m)^T that measures the amount of violation of the constraints. The problem is restated as follows:

min_{w,b} Φ(w, b, (ξ_1, ξ_2, ..., ξ_m)) = (1/2) ‖w‖²_2 + C Σ_{i=1}^{m} ξ_i^k
subject to y_i (w^T φ(x_i) + b) ≥ 1 − ξ_i, ξ_i ≥ 0, i = 1, 2, ..., m. (6.10)

6.4 Adaptive Boosting

Boosting is a general method for improving the accuracy of a learning algorithm. It makes use of a base learning algorithm with an accuracy of at least 50 %. It adds new component classifiers through multiple stages, forming an ensemble. A decision rule based on the ensemble provides a CA higher than that obtained by a single use of the base learning algorithm.


The weights of incorrectly classified examples are appropriately increased so that the weak learner at the next stage is forced to focus on the hard examples in the training set. The final classification is based on a weighted linear combination of the stage-wise assignments. It was theoretically shown that with a learning algorithm slightly better than random, the training error drops exponentially.

The AdaBoost algorithm in its original form is provided in Algorithm 6.1. An outline of implementation aspects of the AdaBoost algorithm is provided in Algorithm 6.2.

6.4.1 Adaptive Boosting on Prototypes for Data Mining Applications

When the data is large, repeated application of the algorithm to the entire data is expensive in terms of computation and storage. In view of this, we consider prototypes of the large data. In the current work, we propose a scheme that incorporates AdaBoost on prototypes generated by the Leader clustering algorithm on handwritten digit data consisting of 10 classes.

Algorithm 6.1 (AdaBoost Algorithm with leader prototypes)

Step 1: Consider n patterns (p) and corresponding labels (l) as (p_i, l_i), i = 1, 2, ..., n. The labels l_i take values 1 or −1. Initialize pattern-wise weights to W_1(i) = 1/n.

Step 2: For each iteration j = 1, ..., m, carry out the following. Train the weak learner using the distribution W_j: consider training patterns according to the weight distribution W_j and compute leaders.

Step 3: The weak learner finds a weak hypothesis, h_j, that maps input patterns to labels {−1, 1} for the given weight distribution W_j using leaders.

Step 4: The error ε_j in the weak hypothesis h_j is the probability of misclassifying a new pattern, as the patterns are chosen randomly according to the distribution W_j.

Step 5: Compute δ_j = 0.5 ln((1 − ε_j)/ε_j).

Step 6: Generate a new weight distribution, W_{j+1}(i) = (W_j(i)/S_j) · e^{−δ_j} if pattern p_i is correctly classified, and W_{j+1}(i) = (W_j(i)/S_j) · e^{δ_j} if pattern p_i is misclassified. Combining, W_{j+1}(i) = W_j(i) exp(−δ_j l_i h_j(p_i))/S_j, where S_j is the normalization factor such that Σ_i W_{j+1}(i) = 1.

Step 7: Output the final hypothesis, H = sign(Σ_{j=1}^{m} δ_j h_j).


Algorithm 6.2 (AdaBoost Implementation)

Step 1: Consider a 2-class training dataset with labels +1 and −1. To begin with, assign equal weights to each pattern. For each of the iterations i = 1, ..., n, carry out Steps 2 to 5.

Step 2: Classify the training data using the component classifier, viz., kNNC. This forms a weak hypothesis h_i that maps each pattern x to the labels +1 or −1. Compute the error ε_i in the ith iteration as the sum of the weights of the misclassified patterns.

Step 3: Compute the parameter α_i as 0.5 ln((1 − ε_i)/ε_i).

Step 4: With the help of α_i, update the weights of the patterns such that the weight of each misclassified pattern is increased, so that the subsequent iteration focuses on those patterns. After updating all the weights, normalize them so as to form a distribution.

Step 5: Select patterns according to the weight distribution.

Step 6: Compute the final hypothesis as a weighted majority of the n weak hypotheses, H = sign(Σ_{i=1}^{n} α_i h_i(x)).
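The loop of Algorithm 6.2 can be sketched as follows. For brevity, a one-dimensional decision stump stands in for the chapter's kNNC-on-leader-prototypes weak learner, and the ten-point dataset is invented; the α computation and weight update are exactly those of Steps 3 and 4.

```python
import math

def stump_error(X, y, w, t, pol):
    # Weighted error of the stump h(x) = pol if x >= t else -pol
    return sum(wi for wi, x, yi in zip(w, X, y)
               if (pol if x >= t else -pol) != yi)

def train_stump(X, y, w):
    # Weak learner: pick the threshold/polarity with minimum weighted error
    best = None
    for t in sorted(set(X)):
        for pol in (1, -1):
            err = stump_error(X, y, w, t, pol)
            if best is None or err < best[0]:
                best = (err, t, pol)
    return best

def adaboost(X, y, rounds=10):
    n = len(X)
    w = [1.0 / n] * n                 # Step 1: uniform initial weights
    ensemble = []
    for _ in range(rounds):
        err, t, pol = train_stump(X, y, w)       # Step 2: weak hypothesis
        if err >= 0.5:
            break                                # no better than random
        err = max(err, 1e-12)
        alpha = 0.5 * math.log((1 - err) / err)  # Step 3
        ensemble.append((alpha, t, pol))
        # Step 4: raise weights of misclassified patterns, then normalize
        w = [wi * math.exp(-alpha * yi * (pol if x >= t else -pol))
             for wi, x, yi in zip(w, X, y)]
        s = sum(w)
        w = [wi / s for wi in w]
    return ensemble

def predict(ensemble, x):
    # Step 6: weighted majority, H(x) = sign(sum_i alpha_i h_i(x))
    s = sum(a * (pol if x >= t else -pol) for a, t, pol in ensemble)
    return 1 if s >= 0 else -1

X = list(range(10))
y = [1, 1, 1, -1, -1, -1, -1, 1, 1, 1]   # no single stump can fit this
ens = adaboost(X, y)
acc = sum(predict(ens, x) == yi for x, yi in zip(X, y)) / len(X)
assert acc > 0.7  # boosting beats the best single stump (70 % here)
```

The best single stump misclassifies three of the ten points; the boosted ensemble combines three complementary stumps and fits the labeling, illustrating the exponential drop in training error mentioned above.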

From the above discussion it is clear that the choice of the base learning algorithm and the amount of training data provided for learning influence the efficiency. As the number of iterations increases, the processing time becomes significant.

While applying AdaBoost to the current data, one option is to use the entire data at each iteration, subject to the weight distribution. This would obviously consume a large amount of processing time. At this stage, the following three considerations emerge.

• Reduction of the training time would make the use of AdaBoost more efficient.

• Efficiency also depends on the nature of the base learning algorithm.

• While dealing with the data under consideration, whether inherent characteristics of the data would help in designing an efficient algorithm that brings the domain knowledge of the data into use.

These considerations lead to an efficient multistage algorithm.

6.5 Decision Trees


Fig. 6.3 Axis-parallel split. The first subfigure on the top-left contains the original dataset. The second and third subfigures contain the first axis-parallel split based on x1 and the second axis-parallel split based on x2, respectively. The decision tree is shown in the fourth subfigure.

Fig. 6.4 Oblique split. The first subfigure on the top-left contains the original dataset. The second subfigure indicates an oblique split based on a function of x1 and x2. The decision tree is shown in the third subfigure.

present in the data. When a splitting rule depends on a single attribute at each internal node, it is termed a univariate split. An example of a univariate split is the axis-parallel split. When a linear combination of attributes is used as a splitting rule, the split is termed linear or oblique. Finding an optimal oblique split is an active research area. Figures 6.3 and 6.4 contain examples of axis-parallel and oblique splits.

Some frequently used methods of creating decision trees are ID3, C4.5, and CART (Classification And Regression Trees). The disadvantages of decision tree methods are the possibility of over-fitting and the large design time.
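To see why oblique splits can be strictly more powerful than univariate (axis-parallel) ones, consider the small made-up dataset below, labeled by the sign of x1 − x2: no single axis-parallel test separates it, while one oblique test does.

```python
# Each point: (x1, x2, label); labels follow sign(x1 - x2)
pts = [(0, 1, -1), (1, 2, -1), (2, 3, -1), (1, 0, 1), (2, 1, 1), (3, 2, 1)]

def axis_split_errors(pts, feat, t):
    # Errors of the univariate rule "predict +1 iff x[feat] >= t",
    # taking the better of the two polarities
    pred = [1 if p[feat] >= t else -1 for p in pts]
    err = sum(pr != p[2] for pr, p in zip(pred, pts))
    return min(err, len(pts) - err)

best_axis = min(axis_split_errors(pts, feat, t)
                for feat in (0, 1)
                for t in sorted({p[feat] for p in pts}))

# One oblique split on the linear combination x1 - x2
oblique = sum((1 if p[0] - p[1] >= 0 else -1) != p[2] for p in pts)

assert best_axis > 0   # every single axis-parallel split errs somewhere
assert oblique == 0    # the oblique split x1 - x2 >= 0 is perfect
```

An axis-parallel tree can still classify this data, but only by stacking several univariate splits; the oblique split does it at a single node, at the cost of a harder split-search problem.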

In the current work, we propose a decision tree classifier where, at every nonleaf node, a decision is made by a binary classifier, viz., a support vector machine or AdaBoost.

6.6 Preliminary Analysis Leading to Domain Knowledge


Fig. 6.5 Cluster analysis of data. The data containing 10 classes of handwritten digits is clustered into two groups using k-means clustering. The subplots correspond to class-wise numbers of patterns belonging to clusters 1 and 2, respectively.

• An analytical view of the data to identify possible similarities.

• Numerical analysis, resorting to statistical analysis and clustering of the data to find out patterns belonging to different classes that could be grouped together.

• Classification of a large random sample of the given data and making observations from the confusion matrix.

6.6.1 Analytical View

The physical view of handwritten digits indicates that the digits {0,3,5,6,8} share significant similarity among them. Similarly, {1,2,4,7,9} are alike. Since the patterns are handwritten, some pairs of digits appear quite similar, most often when a loop of a digit is not completed while writing. Similar observations can be made for {0,6} (when the loop for zero is incomplete at the bottom), {8,3,5}, {3,5}, {4,9}, etc.


Fig. 6.6 Box-plot of nonzero features in the full pattern. The figure contains class-wise statistics of the nonzero features of patterns. The box-plot helps in finding similarity among the classes in terms of measures of central tendency and measures of dispersion.

6.6.2 Numerical Analysis

As part of the numerical analysis, we consider a training dataset of 6670 patterns consisting of an equal number of patterns from each class. We use the k-means clustering algorithm with k = 2 to find two distinct groups. We find that clustering results in two groups of patterns, where the numbers of patterns in the clusters are 3035 and 3635, respectively. We provide the class-wise number of patterns for each cluster in Fig. 6.5. The analysis segregates some digits, such as 0, 1, 3, and 6, more crisply; other digits, such as 4 and 5, show dominant membership in one of the two clusters; and some digits are divided almost equally between both clusters.
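The k = 2 grouping step above can be sketched with Lloyd's iterations in pure Python; the two-dimensional blobs below are invented stand-ins for the two groups of digit classes, and the deterministic initialization from the two extreme points is a simplification for reproducibility.

```python
def dist2(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans2(pts, iters=20):
    # k = 2; initialize centers from the first and last points
    centers = [pts[0], pts[-1]]
    labels = [0] * len(pts)
    for _ in range(iters):
        # Assignment step: nearest center wins
        labels = [0 if dist2(p, centers[0]) <= dist2(p, centers[1]) else 1
                  for p in pts]
        # Update step: recompute centroids
        for c in (0, 1):
            members = [p for p, l in zip(pts, labels) if l == c]
            if members:
                centers[c] = tuple(sum(x) / len(members)
                                   for x in zip(*members))
    return labels

# Two well-separated blobs standing in for two groups of classes
pts = [(0, 0), (1, 0), (0, 1), (1, 1), (5, 5), (6, 5), (5, 6), (6, 6)]
labels = kmeans2(pts)
assert labels == [0, 0, 0, 0, 1, 1, 1, 1]
```

With class labels attached to each pattern, tabulating `labels` per class yields exactly the kind of class-wise cluster membership counts shown in Fig. 6.5.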

Subsequently, we studied the nonzero features of the training patterns; Figure 6.6 contains class-wise box-plot statistics of these features.


Fig. 6.7 Box-plot of nonzero features in the top and bottom halves of patterns. Each pattern is divided into two halves, known as the top and bottom halves. Depending on the handwritten digit, there is similarity in the corresponding halves. This helps in grouping the classes for devising the knowledge-based tree. It also provides a view of the complexity in classifying such data.

Fig. 6.8 Box-plot of nonzero features in the left and right halves of patterns. In this case, the data is divided into left and right halves, and the statistics on such halves are computed and presented as box-plots. The plots help notice similarities among halves of distinct classes.

We can notice from the figures and inferences that the groups of classes are similar to those observed in Sect. 6.6.1.

6.6.3 Confusion Matrix

A confusion matrix provides a glimpse of the numbers of correctly and incorrectly classified patterns resulting from a classification experiment. In the matrix, each column indicates occurrences of the predicted class, and each row indicates occurrences of the actual class.
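The row/column convention just described can be captured in a few lines; the labels below are invented for illustration.

```python
def confusion_matrix(actual, predicted, num_classes):
    # Rows: actual class; columns: predicted class
    m = [[0] * num_classes for _ in range(num_classes)]
    for a, p in zip(actual, predicted):
        m[a][p] += 1
    return m

actual    = [0, 0, 1, 1, 2, 2]
predicted = [0, 1, 1, 1, 2, 0]
m = confusion_matrix(actual, predicted, 3)

assert m[0] == [1, 1, 0]                     # one '0' correct, one '0' called '1'
assert sum(m[i][i] for i in range(3)) == 4   # diagonal = correct classifications
accuracy = sum(m[i][i] for i in range(3)) / len(actual)
```

The classification accuracy is the trace of the matrix divided by the total number of patterns, which is exactly how the percentages quoted for Tables 6.1 and 6.2 relate to their diagonals.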


Table 6.1 Confusion matrix corresponding to classification accuracy of 76.06 % (10 × 10 matrix; rows: actual class 0–9, columns: predicted class 0–9). The diagonal (correctly classified) counts for classes 0–9 are 255, 333, 282, 273, 179, 265, 287, 246, 123, and 292.

Table 6.2 Confusion matrix corresponding to classification accuracy of 86.92 % (10 × 10 matrix; rows: actual class 0–9, columns: predicted class 0–9). The diagonal (correctly classified) counts for classes 0–9 are 317, 332, 268, 261, 301, 268, 312, 274, 271, and 293.

through the confusion matrix. In most cases, we notice that misclassification occurred between {1,7}, {7,9}, {3,5}, {3,8}, {5,8}, {0,6}, {4,9}, {1,2}, etc. Most of these misclassifications can be visualized from the nature of the handwritten patterns. This again leads to groups of classes similar to the ones detailed in the previous two subsections. Two sample confusion matrices that classified 3333 test patterns based on prototypes generated using genetic algorithms, chosen during the initial generations in pursuit of obtaining optimal prototypes, are provided in Tables 6.1 and 6.2. The chosen cases, which were part of the initial stages of the experiment before a best set was identified, have classification accuracies of 76.058 % and 86.92 %, respectively, using a nearest-neighbor classifier. To elaborate the contents of the matrix, consider the row corresponding to label "4". It can be seen that 179 of 333 patterns are labeled correctly as "4", 36 of them are classified as "1", a few patterns as "6" and "7", and 112 as "9". The column corresponding to class "4" contains the numbers of patterns of other classes labeled as "4".


Table 6.3 List of parameters used

  Parameter   Description
  n           Number of patterns or transactions
  d           Number of features or items
  l           Number of leaders
  s           Number of support vectors

of the data and may not necessarily provide a final grouping in a crisp manner in every such attempt. In summary, the analysis leads to a grouping of classes. This is discussed in detail in the following section on the proposed scheme.

6.7 Proposed Method

Consider a training dataset consisting of n patterns. Let each pattern consist of d binary features. In order to reduce the data size, two approaches are followed in combination to compute representative patterns, viz., leaders and condensed nearest neighbors. The Leader clustering algorithm (Spath 1980) is used to update pattern representatives in terms of leaders. The Condensed Nearest-Neighbor (CNN) algorithm refers to the set of representative patterns that classify every other pattern within the training dataset correctly. Computation of prototypes using CNN is discussed in detail in Sect. 5.2. The union of the CNN prototypes and the leaders is used for generating representative patterns for the application of SVM, and leaders alone for the application of AdaBoost. Each leader is considered a cluster representative. The number of leaders depends on the threshold value. Let l be the number of leaders. The number of support vectors is represented by s. Table 6.3 contains the summary of parameters.
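As a rough illustration of the CNN representative-selection idea referred to above (Hart's condensation rule), here is a minimal iterate-until-stable sketch on invented one-dimensional data; the chapter's actual prototypes come from binary digit patterns.

```python
def condensed_nn(patterns, labels, dist):
    # CNN (Hart 1968): grow a subset that classifies every training
    # pattern correctly under the 1-NN rule
    store = [0]                      # seed with the first pattern
    changed = True
    while changed:
        changed = False
        for i in range(len(patterns)):
            # classify pattern i with 1-NN over the current store
            j = min(store, key=lambda s: dist(patterns[i], patterns[s]))
            if labels[j] != labels[i]:
                store.append(i)      # absorb misclassified patterns
                changed = True
    return store

pts = [0.0, 0.2, 0.4, 5.0, 5.2, 5.4]
lab = [-1, -1, -1, 1, 1, 1]
store = condensed_nn(pts, lab, lambda a, b: abs(a - b))

# The condensed set classifies every training pattern correctly ...
for p, y in zip(pts, lab):
    j = min(store, key=lambda s: abs(p - pts[s]))
    assert lab[j] == y
# ... while being much smaller than the training set
assert len(store) < len(pts)
```

Here two stored patterns suffice for six; on real data the reduction is what makes the subsequent SVM training affordable.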

The algorithm is provided in Algorithm 6.3.

6.7.1 Knowledge-Based (KB) Tree

Based on the preliminary data analysis discussed in Sect. 6.6, we notice that the input data can be divided into two groups, Set 1: {0, 3, 5, 6, 8} and Set 2: {1, 2, 4, 7, 9}, at the first stage. Thus, given a test pattern, a decision is made at this stage to classify the given digit into Set 1 vs. Set 2. Subsequently, based on the same analysis, decisions are made between (0,6) vs. (3,5,8), 0 vs. 6, 8 vs. (3,5), and 3 vs. 5; and between (4,9) vs. (1,2,7), 4 vs. 9, 2 vs. (1,7), and 1 vs. 7. The corresponding decision tree is presented in Fig. 6.9.

Algorithm 6.3 (Classification algorithm based on Knowledge-Based Tree)


Fig 6.9 Knowledge-based multicategory tree classifier

Step 2: Compute the condensed nearest neighbors (CNN) of the data.

Step 3: Compute leaders with a prechosen distance threshold, using the Leader clustering algorithm.

Step 4: Combine the CNNs and the leaders to form their union. The new set forms the set of representatives.

Step 5: As a first step for classification, divide the 10-class training dataset into two classes, (0,3,5,6,8) as +1 and the remaining training patterns as −1. Compute support vectors for classifying these sets.

Step 6: Divide the patterns of each of the sets of classes into two sets successively, till each leaf node contains a single class; classify using a binary classifier and preserve the corresponding set of support vectors. Observe that, in all, this requires nine binary classifiers. At every stage, include only those patterns that are correctly classified.

Step 7: Given a test pattern, classify it at each stage into one of the two classes, +1 or −1, based on the respective sets of support vectors. Once it reaches a leaf, compare the label of the test pattern with that assigned by the classifier. Compute the classification accuracy.
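The routing logic of the KB tree in Fig. 6.9 can be sketched as follows. Each internal node holds the label group of its positive branch together with a binary classifier; here a perfect oracle stands in for the trained SVM/AdaBoost classifiers, just to show that any digit reaches its leaf in at most four binary decisions.

```python
# Internal node: (positive-label group, positive child, negative child);
# a leaf is just a digit. Structure mirrors Fig. 6.9.
KB_TREE = ({0, 3, 5, 6, 8},
           ({0, 6},
            ({0}, 0, 6),
            ({8}, 8, ({3}, 3, 5))),
           ({4, 9},
            ({4}, 4, 9),
            ({2}, 2, ({1}, 1, 7))))

def route(node, predict):
    comparisons = 0
    while isinstance(node, tuple):
        group, pos, neg = node
        node = pos if predict(group) else neg  # one binary classification
        comparisons += 1
    return node, comparisons

# Oracle stand-in for a trained binary classifier: answers membership
# of the pattern's true label in the node's positive group
results = {y: route(KB_TREE, lambda g, y=y: y in g) for y in range(10)}

assert all(leaf == y for y, (leaf, _) in results.items())
assert max(c for _, c in results.values()) == 4  # at most 4 comparisons
```

In the real classifier, `predict` at each node would evaluate Eq. (6.9) over that node's preserved support vectors, and routing errors at upper nodes would propagate, which is why stage-wise accuracies in Table 6.4 matter multiplicatively.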

6.8 Experimentation and Results


Preliminary analysis of the data, such as clustering and computation of measures of central tendency and dispersion, suggests similarity among the classes (1,2,4,7,9) and (0,3,5,6,8), and further among (1,2,7), (4,9), (0,6), (5,3,8), etc. These observations are further made use of in designing a multiclass classifier based on multiple 2-class classifiers.

We conduct experiments with SVM and AdaBoost classifiers using the KB Tree.

6.8.1 Experiments Using SVM

Based on the preliminary analysis and domain knowledge of the HW data, we combine training data belonging to different classes as shown in Fig. 6.9. Observe from the figure that the procedure involves at most four comparisons, where each leaf of the tree contains a single class.

SVMlight is used for computing support vectors.

1. Generation of representative patterns. Consider n training patterns.

   • Using the CNN approach, arrive at a set A of n1 representative patterns.

   • With class-wise distance thresholds, ε, compute the set of leaders, B. Let the number of leaders be n2; n1 < n and n2 < n.

   • The set of representative patterns is C = A ∪ B, with k (< n) patterns.

2. Multiple binary classification. Based on the preliminary analysis and domain knowledge, combine similar classes, thereby dividing the multicategory classification into multiple binary classification problems, as shown in Fig. 6.9. Observe that the different stages are marked as (1) to (5d). During training, at every binary branching, the set of considered labeled patterns is classified into two classes. For example, at (2a), the labeled patterns of classes (0,3,5,6,8) are classified into the mapped label +1, corresponding to labels (0,6) at (3a), and −1, corresponding to labels (3,5,8) at (3b). Similarly, patterns at (3a) are classified into +1, corresponding to label (0) at (4a), and −1, corresponding to label (6) at (4b).

Table 6.4 provides the results of experiments with SVM using different kernels. The support vectors and related parameters generated at each stage are preserved for classification.


Table 6.4 Experiments with SVM

  Case                          Kernel       Degree   CA (%)
  (0,3,5,6,8) vs (1,2,4,7,9)    Gaussian     –        98.20
  (0,6) vs (3,5,8)              Polynomial            98.84
  0 vs 6                        Polynomial            99.54
  8 vs (3,5)                    Polynomial            97.71
  3 vs 5                        Polynomial            96.07
  (4,9) vs (1,2,7)              Polynomial            98.78
  4 vs 9                        Polynomial            96.92
  2 vs (1,7)                    Polynomial            99.59
  1 vs 7                        Polynomial            99.70

After experimentation with different values of ε, a distance threshold value of 3.5 is chosen for all classes, except for the class with label 1, for which an ε of 2.5 is considered. Together, the number of prototypes generated by CNN and leaders is 4800.

Table 6.5 contains the CA obtained with test data at every stage with the corresponding mapped labels of +1 and −1. In the table, Sets 1 and 2 respectively correspond to (0,3,5,6,8) and (1,2,4,7,9), where a Gaussian kernel is used. "Degree" refers to the degree of the polynomial kernel. The overall CA with 4800 representative patterns is 94.75 %, which is better than the reported value on the same dataset. With the full training dataset of 6670 patterns, the overall CA is also the same as above, which indicates that the proposed procedure captures all support vectors required for classification. The CPU training times, computed in seconds on a PIII 500 MHz machine, for the proposed method with reduced data and with full data are 143 seconds and 288 seconds, respectively. The corresponding testing times are 113.02 seconds and 145.85 seconds. In other words, the proposed method requires only 50 % of the training time and 77 % of the testing time as compared to the case with full data. In summary, the proposed Knowledge-Based Multicategory Tree Classifier provides a higher CA than other schemes using the same data, requires a smaller number of comparisons than existing SVM-based multicategory classification methods, and requires less training and testing time than with the full data.

Table 6.5 Results (RBF kernel at level 1 and polynomial kernels at all other levels of the tree), for the cases Set 1 vs Set 2, (0,6) vs (3,5,8), 0 vs 6, 8 vs (3,5), 3 vs 5, (4,9) vs (1,2,7), 4 vs 9, 2 vs (1,7), and 1 vs 7; the degree is "–" for the RBF stage, with polynomial degrees between 2 and 4 at the other stages.


Table 6.6 List of parameters

  Parameter   Description
  n           Number of training patterns
  k           Number of features per pattern
  p_i         ith training pattern
  h_j         Hypothesis at iteration j
  H           Final hypothesis
  ε_j         Error after iteration j
  α_j         Parameter derived from ε_j, viz., 0.5 ln((1 − ε_j)/ε_j)
  W_j         Weight distribution at iteration j
  m           Maximum number of iterations
  ζ           Distance threshold for Leader clustering

6.8.2 Experiments Using AdaBoost

6.8.2.1 Prototype Selection

Prototypes are selected using the Leader clustering algorithm, as discussed in Sect. 2.5.2.1. The Leader clustering algorithm begins with an arbitrary pattern in the training data as the first leader. Subsequently, the patterns that lie within a prechosen distance threshold are considered part of the cluster represented by the given leader. As the comparison progresses, a new pattern that lies outside the distance threshold is considered the next leader. The algorithm continues till all the training patterns are examined. It is clear that a small threshold results in a large number of leaders, and a large threshold value leads to a single cluster. An optimal threshold is experimentally determined.
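The single-pass Leader procedure just described fits in a few lines; the one-dimensional data and thresholds below are invented, simply to show the threshold/leader-count trade-off.

```python
def leader_cluster(patterns, threshold, dist=lambda a, b: abs(a - b)):
    # One pass: join the first leader within the threshold,
    # otherwise become a new leader
    leaders, clusters = [], []
    for p in patterns:
        for i, q in enumerate(leaders):
            if dist(p, q) <= threshold:
                clusters[i].append(p)
                break
        else:
            leaders.append(p)       # p is the next leader
            clusters.append([p])
    return leaders, clusters

data = [0.0, 0.5, 5.0, 5.2, 10.0]
assert leader_cluster(data, 1.0)[0] == [0.0, 5.0, 10.0]  # small threshold: 3 leaders
assert len(leader_cluster(data, 100.0)[0]) == 1          # large threshold: 1 cluster
```

Because only the leaders are retained as prototypes, raising ζ directly shrinks the training set handed to each AdaBoost stage, which is the effect tabulated in Tables 6.8–6.10.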

6.8.2.2 Parameters Used in Adaptive Boosting

The list of parameters used in the algorithm is provided in Table 6.6.

Elaborate experimentation is carried out for different values of the distance threshold (ζ) for computing prototypes using the Leader clustering algorithm. kNNC is the classifier at each stage. The classification is carried out for different values of k. The classification at every stage is binary. From every stage, only the correctly classified test patterns at that stage are passed on to the subsequent stage as input test patterns. For example, consider a test pattern with label "5". The pattern, if classified correctly, passes through the following stages:

1. (0,3,5,6,8) vs (1,2,4,7,9),
2. (0,6) vs (3,5,8),
3. 8 vs (3,5),
4. 3 vs 5.


Table 6.7 Results with AdaBoost

  Case description     Value of k in kNNC   Distance threshold (ζ)   Classification accuracy (%)
  Set 1 vs Set 2                            3.2                      98.3
  (4,9) vs (1,2,7)                          3.0                      98.1
  4 vs 9                                    3.4                      96.6
  2 vs (1,7)                                3.0                      99.7
  1 vs 7                                    3.0                      99.5
  (0,6) vs (3,5,8)                          4.0                      99.0
  0 vs 6               10                   4.0                      99.4
  8 vs (3,5)                                2.5                      97.6
  3 vs 5                                    3.4                      96.0

The experiments are conducted on a validation dataset. The set of parameters (ζ and k) that leads to the best classification accuracy on the validation dataset is used for verification with the test dataset. Table 6.7 contains the results. In the table, Set 1 corresponds to (0,3,5,6,8), and Set 2 corresponds to (1,2,4,7,9). We notice that the overall classification accuracy depends on the number of misclassifications at each stage. In Fig. 6.9, the leaf nodes contain the class-wise correctly classified patterns; they are denoted in italics. The "overall CA" is 94.48 %, which is better than the previously reported value on the same data. For k = 1, kNNC, i.e., NNC, on the full training data of 6000 patterns against the test data provides a CA of 90.73 %. The best accuracy of 92.26 % is obtained for k = 5 of the kNNC. The "overall CA" obtained through the proposed scheme is better than the CA obtained with NNC and kNNC on the complete dataset. A decision tree of depth 4 is used to classify the 10-class patterns. This is a significant improvement over one-against-all and one-vs-one multiclass classification schemes.

With an increase of the distance threshold for leader clustering, viz., ζ, the number of prototypes reduces. For example, for ζ = 3.2, for the input dataset for the classification of Set 1 vs Set 2, the number of prototypes reduces by 20 % without adversely affecting the classification accuracy. Secondly, it is important to note that as we approach the leaf nodes, the number of training patterns reduces. For example, we start with 6000 patterns at the root of the decision tree, and at the stage of 0 vs 6, the number of patterns reduces to 1200. Another interesting observation is that the number of prototypes at the stage {0 vs 6} is 748 for the distance threshold ζ = 4.0. This is a reduction by 38 %.

Tables 6.8, 6.9, and 6.10 contain the results at some stages of the KB Tree classification.

6.8.3 Results with AdaBoost on Benchmark Data


Table 6.8 Results on AdaBoost for Set 1 vs Set 2

  Distance threshold   Avg. num. of prototypes per iteration   Avg. CA with training data (%)   CA with validation data (%)
  2.0                  1623                                    97.50                            97.60
  2.5                  1568                                    97.60                            97.60
  3.0                  1523                                    97.30                            97.68
  3.5                  1335                                    97.21                            97.38
  4.0                  1222                                    96.86                            97.12
  4.5                   861                                    95.95                            96.33
  5.0                   658                                    93.96                            95.36

Table 6.9 Results on AdaBoost for a single class vs the rest of its set

  Distance threshold   Avg. num. of prototypes per iteration   Avg. CA with training data (%)   CA with validation data (%)
  2.0                  1288                                    99.19                            98.58
  2.5                  1277                                    99.13                            98.50
  3.0                  1255                                    99.14                            98.43
  3.5                  1174                                    99.18                            98.43
  4.0                  1006                                    98.90                            98.28
  4.5                   706                                    97.88                            98.20
  5.0                   536                                    97.28                            98.28

Table 6.10 Results on AdaBoost for a single class vs the rest of its set

  Distance threshold   Avg. num. of prototypes per iteration   Avg. CA with training data (%)   CA with validation data (%)
  2.0                  1148                                    99.32                            98.43
  2.5                  1054                                    99.42                            98.65
  3.0                   966                                    99.39                            98.80
  3.5                   827                                    99.22                            98.80
  4.0                   645                                    98.25                            98.65
  4.5                   417                                    97.40                            98.05
  5.0                   317                                    91.73                            98.28

Table 6.11 contains details on each of the benchmark datasets and the CA (Classification Accuracy) obtained using the current method.


Table 6.11 Details on benchmark data

  Name of dataset   Training data size   Test data size   Number of features   Number of classes
  WINE               100                   78              13                  3
  THYROID           3772                 3428              21                  3
  SONAR              104                  104              60                  2

Table 6.12 Results with benchmark data

  Name of dataset   Case description   Dist. threshold   Average num. of leaders   CA (%)
  WINE              1 vs non-1         3.0                23                       98.72
                    2 vs non-2         1.5                43                       93.59
                    3 vs non-3         3.7                 9                       98.72
  THY               1 vs non-1         2.0               261                       98.83
                    2 vs non-2         3.7               104                       94.31
                    3 vs non-3         3.0               156                       93.84
  SONAR             0 vs non-0         4.0                65                       95.19

Euclidean distance is used as the dissimilarity measure; the classification accuracies (CA) of NNC with the entire WINE, THYROID, and SONAR datasets are 92.31 %, 93.26 %, and 95.19 %, respectively.

We notice from Table 6.12 that for the first two datasets, WINE and THYROID, the average CAs obtained, viz., 97.01 % and 95.66 %, respectively, are better than those of NNC with the entire dataset. This accuracy is obtained with prototypes numbering 25 and 174, as compared to the full data sizes of 100 and 3772, respectively. In the case of the third dataset, viz., SONAR, the CA obtained is the same as that of NNC, but with fewer patterns: the average number of prototypes, 65, is less than the original data size of 104 patterns.

6.9 Summary


polynomial degree as well as for different distance thresholds for AdaBoost. The best classification accuracy obtained using validation data is identified. Using the models thus obtained, the test data is classified. The CA obtained with the approach is 94.75 % with SVM, and it is 94.48 % with AdaBoost with kNNC as the component classifier. The obtained result is better than the reported result on the same data in the literature.

The scheme is a novel way of dealing with large data, where prototypes are identified, domain knowledge is used in identifying the number of comparisons required, and support vectors are computed to classify the given multicategory data with high accuracy.

6.10 Bibliographic Notes

A support vector machine is a widely used classification method. Foundations and detailed discussions of the method can be found in Vapnik (1999), Scholkopf and Smola (2002), Breuing and Buxton (2001), and Burges (1998). It is applied to a variety of problems that include face detection by Osuna et al. (1997), face recognition by Guo et al. (2000), and handwritten digit recognition by Scholkopf et al. (1995) and Decoste and Scholkopf (2002). The SVMlight software by Joachims (1999) helps in the application of support vector machines to various problems. Dong and Krzyzak (2005) emphasize prior knowledge of the data to choose good structures in order to control the complexity of models for large datasets. Fung and Mangasarian (2005), Platt et al. (2000), Tax and Duin (2002), Allwein et al. (2000), and Hastie and Tibshirani (1998) provide insights on multiclass classification. Rifkin and Klautau (2004) and Milgram et al. (2006) contain useful discussions of one-vs-one and one-vs-all approaches. Murthy (1998) provides a detailed discussion on decision trees. The Leader clustering algorithm is discussed in Spath (1980). Hart (1968) proposed the Condensed Nearest-Neighbor approach. Duda et al. (2000) provide insightful discussions on the No-Free-Lunch Theorem, decision trees, Adaptive Boosting (AdaBoost), and classification approaches. Freund and Schapire (1997, 1999) and Schapire (1990, 1999, 2002) discuss boosting and boosting C4.5, comparing C4.5 with boosting stumps, weak learnability, and the AdaBoost algorithm. The WINE, THYROID, and SONAR datasets are obtained from the UCI-ML database. Ravindra et al. (2004) carried out work on AdaBoost for classification of large handwritten digit data.

References

E.L. Allwein, R.E. Schapire, Y. Singer, Reducing multiclass to binary: a unifying approach to margin classifiers. J. Mach. Learn. Res. 1, 113–141 (2000)

R. Breuing, B. Buxton, An introduction to support vector machines for data mining, in Proc. 12th Conf. Young Operational Research, Nottingham, UK (2001), pp. 3–15

C.J.C. Burges, A tutorial on support vector machines for pattern recognition. Data Min. Knowl. Discov. 2(2), 121–167 (1998)

D. Decoste, B. Scholkopf, Training invariant support vector machines. Mach. Learn. 46(1–3), 161–190 (2002)

J.X. Dong, A. Krzyzak, Fast SVM training algorithm with decomposition on very large datasets. IEEE Trans. Pattern Anal. Mach. Intell. 27(4), 603–618 (2005)

R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, 2nd edn. (Wiley, New York, 2000)

Y. Freund, R.E. Schapire, A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci. 55(1), 119–139 (1997)

Y. Freund, R.E. Schapire, A short introduction to boosting. J. Jpn. Soc. Artif. Intell. 14(5), 771–780 (1999)

G.M. Fung, O.L. Mangasarian, Multicategory proximal support vector machine classifiers. Mach. Learn. 59, 77–97 (2005)

G. Guo, S.Z. Li, K. Chan, Face recognition by support vector machines, in Proc. of Fourth IEEE Intl. Conf. on Automatic Face and Gesture Recognition (IEEE Computer Society, Los Alamitos, 2000), pp. 196–201

P.E. Hart, The condensed nearest neighbor rule. IEEE Trans. Inf. Theory 14(3), 515–516 (1968)

T. Hastie, R. Tibshirani, Classification by pairwise coupling. Ann. Stat. 26(2), 451–471 (1998)

T. Joachims, Making large-scale SVM learning practical, in Advances in Kernel Methods—Support Vector Learning, ed. by B. Scholkopf, C.J.C. Burges, A. Smola (MIT Press, Cambridge, 1999)

J. Milgram, M. Cheriet, R. Sabourin, "One against one" or "one against all": which one is better for handwriting recognition with SVMs? in Tenth International Workshop on Frontiers in Handwriting Recognition (2006)

S.K. Murthy, Automatic construction of decision trees from data: a multi-disciplinary survey. Data Min. Knowl. Discov. 2, 345–389 (1998)

E. Osuna, R. Freund, F. Girosi, Training support vector machines: an application to face detection, in Proc. IEEE Conf. Computer Vision and Pattern Recognition (1997), pp. 130–136

J.C. Platt, N. Cristianini, J. Shawe-Taylor, Large margin DAGs for multiclass classification. Adv. Neural Inf. Process. Syst. 12, 547–553 (2000)

T. Ravindra Babu, M. Narasimha Murty, V.K. Agrawal, Adaptive boosting with leader based learners for classification of large handwritten digit data, in Proc. of Fourth IEEE Intl. Conf. on Hybrid Intelligent Systems, California (2004), pp. 326–331

R. Rifkin, A. Klautau, In defense of one-vs-all classification. J. Mach. Learn. Res. 5, 101–141 (2004)

R.E. Schapire, The strength of weak learnability. Mach. Learn. 5, 197–227 (1990)

R.E. Schapire, Theoretical views of boosting and applications, in Algorithmic Learning Theory. Lecture Notes in Computer Science, vol. 1720 (1999), pp. 13–25

R.E. Schapire, The boosting approach to machine learning: an overview, in MSRI Workshop on Nonlinear Estimation and Classification (2002)

B. Scholkopf, C.J.C. Burges, V.N. Vapnik, Extracting support data for a given task, in Proc. First Intl. Conf. Knowledge Discovery and Data Mining (1995), pp. 252–257

B. Scholkopf, A.J. Smola, Learning with Kernels (MIT Press, Cambridge, 2002)

H. Spath, Cluster Analysis Algorithms for Data Reduction and Classification (Ellis Horwood, Chichester, 1980)

D.M. Tax, R.P. Duin, Using two-class classifiers for multiclass classification, in Proc. of 16th Intl. Conf. on Pattern Recognition, vol. 2 (2002), pp. 124–127


Chapter 7

Optimal Dimensionality Reduction

7.1 Introduction

In data mining applications, one encounters a large number of high-dimensional patterns. Often, such large datasets are also characterized by a large number of features. It is observed that not all features contribute to generating an abstraction, and an optimal subset of features is sufficient both for representation and for classification of unseen patterns. Feature selection refers to the activity of identifying a subset of features that help in classifying unseen patterns well.

For large datasets, even repeated simple operations, such as computation of the distance between two binary-valued patterns, result in a significant amount of computation time. Reduction in the number of patterns by prototype selection based on large-data clustering approaches, optimal selection of prototypes, dimensionality reduction through optimal selection of feature subsets, and optimal feature extraction are some of the approaches that help in improving the efficiency.

In data mining, a dataset may be viewed as a matrix of size n × d, where n is the number of data points, and d is the number of features. In mining such datasets, the dimensionality of the data can be very large; the associated problems are the following.

• If the dimensionality d is large, then building classifiers and clustering algorithms on such datasets can be difficult. The reasons are as follows.

  1. If the dimensionality increases, then the computational resource requirement for mining also increases; dimensionality affects both time and space requirements.

  2. Typically, both classification and clustering algorithms that use Euclidean-distance-like metrics to characterize the similarity between a pair of patterns may overfit. Further, it becomes difficult to discriminate between patterns based on their distances in high-dimensional spaces, where the data is sparsely distributed. The specific issue is that, as the dimensionality increases, it is difficult to discriminate between the nearest and farthest neighbors of a point X based on their distances from X; these distances will be almost the same.

T. Ravindra Babu et al., Compression Schemes for Mining Large Datasets, Advances in Computer Vision and Pattern Recognition, DOI 10.1007/978-1-4471-5607-9_7, © Springer-Verlag London 2013

  3. There could be situations, specifically in areas like medical informatics, where the number of data points n is small relative to the number of features d. It is not uncommon to have hundreds of data points and millions of features in some applications. In such cases, it is again important to reduce the dimensionality.

• Dimensionality reduction is achieved using one of the following approaches.

  1. Feature Selection. Here:

     – We are given a set of features, F_D = {f_1, f_2, ..., f_D}, which characterize the patterns X_1, X_2, ..., X_n.

     – So each pattern is a D-dimensional vector. Feature selection involves selecting a set F of d features, where d < D, F ⊂ F_D, and F = {f_1, f_2, ..., f_d}. That is, each f_i is some f_j ∈ F_D.

     – Selecting such a subset F of F_D is done by using some heuristic or by optimizing a criterion function. Primarily, there are two different schemes for feature selection. These are the following.

       (a) Filter methods. These employ schemes to select features without using the classifiers directly in the process; for example, features may be ranked based on their correlation with the class labels.

       (b) Wrapper methods. Here, features are selected by using a classifier in the process. Classifiers based on the nearest-neighbor rule, decision trees, support vector machines, and the naïve Bayes rule are used in feature selection; here, features are selected based on the accuracy of the resulting classifier using the selected features.

2. Feature Extraction. It may be viewed as selection of features in a transformed space. Each feature extracted may be viewed as either a linear or a nonlinear combination of the original features. For example, if FD is the given set of D features, then the reduced set of features extracted is the set F = {f1, f2, ..., fd}, where d < D and fi = Σ_{j=1}^{D} αj fj with real numbers αj and fj ∈ FD; so the new features are linear combinations of the given features. Here we consider feature extraction based on linear combinations only; note that feature selection is a special case of feature extraction where all but one αj are zero. These feature extraction schemes may be characterized as follows:

(a) Deterministic. Here the αj are deterministic quantities obtained from the data.
(b) Stochastic. In these cases, the αj are randomly chosen real numbers.


7.2 Feature Selection

In feature selection, we need to rank either the features or subsets of features to ultimately select a subset of features. There are different schemes for ranking.

7.2.1 Based on Feature Ranking

Using this scheme, feature selection is achieved based on the following algorithm.

1. Rank individual features using some scheme. As mentioned earlier, let FD be the set of given features f1, f2, ..., fD. Let the features ranked using the scheme be f1 > f2 > ··· > fd > fd+1 > ··· > fD−1 > fD, which means that f1 is the best feature, followed by f2, then f3, and so on.

2. Select a subset of d features as follows.

(a) Consider the set of features {f1, f2, ..., fd}.

(b) Select the best feature f1; select the next feature to be fj, where fj is such that {f1, fj} > {f1, fi} for all i ≠ j and i = 2, ..., D. Repeat till the required number (d) of features is selected. Note that here, once we select a feature, for example f1, then we have it in the final set of selected features. Any feature to be included in the current set of selected features is ranked based on how it performs jointly with the already selected ones. This is called the Sequential Forward Selection (SFS) scheme. In this case, there is no way to delete a feature that is already selected.
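The SFS procedure above can be sketched in a few lines; the subset-scoring function is left abstract (in the text it would be, e.g., the validation accuracy of a classifier trained on the subset), and the toy criterion used in the example is purely illustrative.

```python
def sfs(features, score, d):
    """Sequential Forward Selection: greedily grow a subset of size d.

    features: iterable of candidate feature indices
    score:    function mapping a tuple of features to a quality value
    d:        number of features to select
    """
    selected = []
    remaining = list(features)
    while len(selected) < d:
        # rank each candidate jointly with the already selected features
        best = max(remaining, key=lambda f: score(tuple(selected) + (f,)))
        selected.append(best)      # once added, never removed (no backtracking)
        remaining.remove(best)
    return selected

# Toy criterion (illustrative only): prefer subsets with a large index sum
print(sfs(range(5), lambda s: sum(s), 3))
```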

In the above scheme, we ranked features for inclusion in the set. In a symmetric manner, we can also rank features for possible deletion. Here, the general scheme is as follows.

1. Consider the possibility of deleting fi, i = 1, 2, ..., D, from FD. Let the resulting set be Fi. Let the ranking of the resulting sets be F1 > F2 > ··· > FD, where for each Fi there is an fj such that Fi = FD \ {fj}.

2. We select F1, which has D − 1 features; then, recursively, we keep eliminating one feature at a time till we ultimately get a set of the required number (d) of features.

This scheme is called the Sequential Backward Selection (SBS) scheme. Some of the properties of SBS are the following.

1. Once a feature f is deleted from a set of size l (> d), this feature cannot figure later, that is, in sets of size less than l.

2. It is useful when d is close to D; for example, when selecting 90 features from a set of 100 features. It is inefficient to use this scheme for selecting 10 out of 100 features; SFS is better in such cases.

Floating selection schemes permit nonmonotonic behavior. For example, the Sequential Forward Floating Selection (SFFS) scheme permits us to delete features that have been added earlier. Similarly, the Sequential Backward Floating Selection (SBFS) scheme permits inclusion of features that were discarded earlier. We explain the working of the SFFS next:

1. Let FD be a given set of D features; let k = 0 and Fk = ∅.

2. Addition based on SFS. We use SFS to select a feature to be added. Let fi be the best feature along with the (already selected) features in Fk. Then update Fk to include fi, that is, set k = k + 1 and Fk = Fk−1 ∪ {fi}.

3. Conditional deletion. Delete a feature fj from Fk if Fk \ {fj} > Fk−1. Repeat Steps 2 and 3 to get a feature subset of size d.

SFFS permits deletion of a feature that was added earlier. Let F0 = ∅. Let fi be the best feature that is added to generate F1 = {fi}. Let the next best (along with fi) be fj; so, F2 = {fi, fj}. Note that deleting fi or fj from F2 will not be possible. Let the next feature to be added be fl, making F3 = {fi, fj, fl}. Now we can delete fi from F3 if {fj, fl} is the best subset of size 2. So, by adding and conditionally deleting, we could delete a feature (fi here) that was added before. Such a nonmonotonic behavior is also exhibited by SBFS; in SBFS, we conditionally add features that were deleted earlier.
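A compact sketch of the SFFS loop just described. The subset-scoring function is again abstract, and the bookkeeping dictionary of best-scores-per-size (used to decide whether a conditional deletion improves on an earlier subset of the same size) is an implementation assumption, not the book's code.

```python
def sffs(features, score, d):
    """Sequential Forward Floating Selection (sketch).

    Greedy forward additions (as in SFS) interleaved with conditional
    backward deletions: after each addition, drop a previously added
    feature whenever the reduced subset beats the best subset of the
    same size seen so far.
    """
    selected = []
    best_at_size = {0: -float("inf")}   # best score recorded per subset size
    while len(selected) < d:
        # SFS step: add the feature that works best with those selected
        rest = [f for f in features if f not in selected]
        f_add = max(rest, key=lambda f: score(tuple(selected) + (f,)))
        selected.append(f_add)
        k = len(selected)
        best_at_size[k] = max(best_at_size.get(k, -float("inf")),
                              score(tuple(selected)))
        # conditional deletion: remove a feature only if doing so improves
        # on the best subset of size k - 1 found earlier
        while len(selected) > 2:
            cands = [(score(tuple(s for s in selected if s != f)), f)
                     for f in selected]
            gain, f_del = max(cands)
            if gain > best_at_size[len(selected) - 1]:
                selected.remove(f_del)
                best_at_size[len(selected)] = gain
            else:
                break
    return selected
```

With a monotone toy criterion (e.g., the index sum used for SFS), no deletion ever triggers and SFFS reduces to SFS, which is a useful sanity check.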

7.2.2 Ranking Features

Ranking may be achieved by examining the association between feature values and the class label and/or classification accuracy. These may be realized as follows:

1. Filter Methods. These are schemes in which the features are ranked based on some function of the values assumed by the individual features. Two popular parameters used in this category are given below.

(a) Fisher's score. It is based on the separation between the means of two classes with respect to the sum of the variances for each feature. In a two-class situation, the Fisher score FS is

FS(fj) = (μ1(j) − μ2(j))² / (σ1(j)² + σ2(j)²),

where μi(j) is the sample mean value of feature fj in class i = 1, 2, and σi(j) is the sample standard deviation of fj for class i = 1, 2. For a multiclass problem, it is of the form

FS(fj) = Σ_{i=1}^{C} ni (μi(j) − μ(j))² / Σ_{i=1}^{C} ni σi(j)²,

where C is the number of classes, ni is the number of patterns in class i, and μ(j) is the overall mean of feature fj.


(b) Mutual Information. Mutual information (MI) measures the information that one random variable gives about another. MI is very popular in selecting important words/terms in classifying documents; here MI measures the amount of information the presence or absence of a term t provides about classifying a document to a class c. The MI of feature fi, MI(fi), is given by

MI(fi) = Σ_{i,j∈{0,1}} (nij / n) log2 ( (nij · n) / (Σ_{l∈{0,1}} nil · Σ_{l∈{0,1}} nlj) ),

where nij is the number of documents in which the feature takes value i and the class takes value j, and n is the total number of documents.
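The two filter scores can be computed directly from the formulas above. The following sketch uses biased sample variances and treats 0·log(·) terms as 0; the toy inputs in the examples are illustrative only.

```python
import math

def fisher_score(x1, x2):
    """Two-class Fisher score of a single feature from its per-class samples."""
    m1, m2 = sum(x1) / len(x1), sum(x2) / len(x2)
    v1 = sum((v - m1) ** 2 for v in x1) / len(x1)   # biased sample variance
    v2 = sum((v - m2) ** 2 for v in x2) / len(x2)
    return (m1 - m2) ** 2 / (v1 + v2)

def mutual_information(pairs):
    """MI (in bits) between a binary feature and a binary class label.

    pairs: list of (feature_value, class_value) with values in {0, 1}.
    """
    n = len(pairs)
    nij = {(i, j): 0 for i in (0, 1) for j in (0, 1)}
    for p in pairs:
        nij[p] += 1
    mi = 0.0
    for (i, j), c in nij.items():
        if c == 0:
            continue                         # 0 * log(...) taken as 0
        ni = nij[(i, 0)] + nij[(i, 1)]       # marginal count of feature value i
        nj = nij[(0, j)] + nij[(1, j)]       # marginal count of class value j
        mi += (c / n) * math.log2(c * n / (ni * nj))
    return mi

print(fisher_score([0, 0, 1], [3, 4, 5]))               # ≈ 15.125
print(mutual_information([(0, 0)] * 5 + [(1, 1)] * 5))  # 1 bit: feature determines class
```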

2. Wrapper Methods. Here, one may use the classification accuracy of a classifier to rank the features. One may use any of the standard classifiers based on a feature and compute the classification accuracy using training and validation datasets. A majority of the classifiers have been used in this manner. Some of them are the following.

• Nearest-Neighbor Classifier (NNC). Let FD be the set of features, and let STrain and SValidate be the sets of training and validation data, respectively. Then the ranking algorithm is as follows:

– For each feature fi ∈ FD, i = 1, 2, ..., D, compute the classification performance using NNC, that is, find out the number of correctly classified patterns from SValidate by obtaining the nearest neighbor, from STrain, of each validation pattern. Let ni be the number of correctly classified patterns using fi only.

– Rank the features based on the ranking of the ni's; fj > fk (fj is superior to fk) if nj > nk (nj is larger than nk). Resolve ties arbitrarily. It is possible to rank subsets of features also using this scheme.

• Decision Tree Classifier (DTC). Here the ranking is done by building a one-level decision tree classifier corresponding to each feature. The specific ranking algorithm is as follows:

– Build a decision tree classifier DTCi based on feature fi and the training dataset STrain for i = 1, 2, ..., D. Obtain the number of patterns correctly classified from SValidate using DTCi, i = 1, 2, ..., D; let the number of patterns obtained using DTCi be ni.

– Rank the features based on the ni's; fj is superior to fk if nj > nk.

• Support Vector Machine (SVM). Like NNC and DTC, in the case of SVM we also rank features or sets of features by training an SVM using the feature(s) on STrain. We can again rank features by using the classification accuracy on the validation dataset.


(a) Decision tree based. Decision tree classifier learning algorithms build a decision tree using the training data. The resulting tree structure inherently captures the relevant features. One can exploit this structure to rank features as follows:

• Build a decision tree DT using the training data STrain. Each node in the decision tree is associated with a feature.

• Use Breadth First Search (BFS) to order the features used in the decision tree. Let the output of the BFS be f1, f2, ..., fd. This ordering gives a ranking of the features in terms of their importance.

(b) Support Vector Machine (SVM) based. In a two-class scenario, learning an SVM from the training data involves obtaining a weight vector W and a threshold weight b such that WᵗXi + b < 0 if Xi is from the negative class and WᵗXi + b > 0 if Xi is from the positive class. Here, W and X are D-dimensional vectors. Let W = (w1, w2, ..., wD)ᵗ; naturally, each wi indicates the importance of the feature fi. It is possible to view the entries of W as weights of the corresponding features. This is achieved using the SVM. Specifically,

• Use an SVM learning algorithm on the training data STrain to obtain the weight vector W and threshold (or bias) b.

• Sort the elements of W based on their magnitude; if wi is negative, then fi contributes to the negative class, and if wi is positive, then fi contributes to the positive class. So, the importance of feature fi is characterized by the magnitude of wi.

• Now rank the features based on the sorted order: fj is superior to fk if |wj| > |wk|.

(c) Stochastic Search based. Here the candidate feature subsets are generated using a stochastic search technique like Genetic Algorithms (GAs), Tabu Search (TS), or Simulated Annealing (SA). These possible solutions are evaluated using different classifiers, using the classification accuracy on a validation set to rank the solutions. The best solution (feature subset) is chosen. NNC is one of the popular classifiers in this context.
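The NNC-based wrapper ranking described above can be sketched as follows; the tiny train/validation split and the use of the Manhattan distance on a single feature are illustrative assumptions.

```python
def rank_by_1nn(train, validate, num_features):
    """Rank features by 1-NN validation accuracy, one feature at a time.

    train, validate: lists of (pattern, label) pairs; a pattern is a list
    of feature values. Returns feature indices, best first.
    """
    def correct_using(f):
        hits = 0
        for x, y in validate:
            # nearest training pattern when only feature f is used
            nn = min(train, key=lambda t: abs(t[0][f] - x[f]))
            hits += (nn[1] == y)
        return hits

    n = [correct_using(f) for f in range(num_features)]
    # ties resolved arbitrarily (here: by feature index)
    return sorted(range(num_features), key=lambda f: -n[f])

# Feature 0 separates the classes; feature 1 is constant and uninformative
train = [([0, 5], 0), ([1, 5], 0), ([9, 5], 1), ([8, 5], 1)]
validate = [([0, 5], 0), ([9, 5], 1)]
print(rank_by_1nn(train, validate, 2))
```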

In Sect. 7.4, we provide a detailed case study of feature selection using GAs.

7.3 Feature Extraction

Feature extraction deals with obtaining new features that are linear combinations of the given features There are several well-known schemes; some of the popular ones are the following

1. Principal Component Analysis (PCA). The principal components are directions that maximize the variance of the projected data; it is possible to show that the resulting directions are eigenvectors of the covariance matrix of the data. PCA also corresponds to minimizing the deviation (error) between the original data in the D-dimensional space and the projected data (corresponding to the d principal components) in the d-dimensional space.

Let the data matrix of size n × D be A, where there are n data points, each a point in a D-dimensional space. If the data is assumed to be normalized to be zero-mean, then the covariance matrix may be viewed as E(AAᵗ); the sample covariance matrix is proportional to AAᵗ and is of size n × n. It is possible to show that AAᵗ is symmetric, so the eigenvalues are real, assuming that A has real entries. Further, it is possible to show that the eigenvalues of AAᵗ and AᵗA (of size D × D) are the same but for some extra zero eigenvalues, which are |n − D| in number.

The eigenvectors and eigenvalues of AAᵗ are characterized by

AAᵗ Xi = λi Xi.

Similarly, the corresponding eigenvectors and eigenvalues of AᵗA are given by

AᵗA (AᵗXi) = λi (AᵗXi).

Typically, the singular value decomposition (SVD) of the matrix A is used to compute the eigenvectors of the matrix AAᵗ. Then the top d eigenvectors (corresponding to the largest d eigenvalues) are used to represent the n patterns in the d-dimensional space.

2. Nonnegative Matrix Factorization (NMF). This is based on the assumption that the data matrix A is a nonnegative real matrix. We partition the n × D matrix into two nonnegative matrices B (n × K) and C (K × D). This is achieved using an optimization problem given by

min f(B, C) = (1/2) ‖A − BC‖²_F   such that   B ≥ 0 and C ≥ 0,

where the cost function is the square of the Frobenius norm (entry-wise difference) between A and BC. A difficulty associated with this approach is that when only A is known, but neither B nor C is known, the optimization problem is nonconvex and is not guaranteed to give the globally optimal solution.

However, once we get a decomposition of A into a product of B and C, then we have a dimensionality reduction as obtained in B (of size n × K); each of the n data points is represented using K features. Typically, K ≪ D, and so there is a dimensionality reduction.

3. Random Projections (RP). Both PCA and NMF may be viewed as deterministic schemes. However, it is possible to get linear combinations of features using random weights; a random projection scheme may be viewed as extracting new features using randomly weighted linear combinations of the given D features. This may be expressed as

B = AR,

where R is a D × K matrix with randomly chosen entries; the resulting n × K matrix B may be viewed as a lower-dimensional representation of A. An important property of RP is that, under some conditions, it is possible to show that the pairwise distances are preserved; if X and Y are points in the D-dimensional space and X′ and Y′ are the corresponding points in the K-dimensional space, then ‖X′ − Y′‖² approximates ‖X − Y‖². This means that the Euclidean distances between pairs of points are approximately preserved.
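The two projection schemes above can be sketched with numpy. The 1/√K scaling of the random matrix (one standard choice that makes squared distances preserved in expectation), the toy Gaussian data, and the large K used to make the distance check tight are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def pca_project(A, d):
    """Project data A (n x D) onto its top-d principal components via SVD."""
    A = A - A.mean(axis=0)                      # zero-mean normalization
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return A @ Vt[:d].T                         # n x d representation

def random_project(A, K):
    """Random projection B = AR; R is D x K with scaled Gaussian entries."""
    D = A.shape[1]
    R = rng.standard_normal((D, K)) / np.sqrt(K)
    return A @ R

A = rng.standard_normal((100, 50))
P = pca_project(A, 5)                           # deterministic extraction
B = random_project(A, 2000)                     # stochastic extraction
orig = np.linalg.norm(A[0] - A[1])
proj = np.linalg.norm(B[0] - B[1])
print(P.shape, abs(orig - proj) / orig)         # small relative distance error
```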

7.3.1 Performance

It is observed, based on experimental studies, that PCA performed better than RP on a variety of datasets. Specifically, on the OCR dataset, a PCA-based SVM classifier gave 92 % accuracy using the RBF kernel; the same classifier on the RP-based feature set of the OCR data gave an accuracy of 88.5 %.

In the following section, we discuss two efficient approaches to feature selection using genetic algorithms.

7.4 Efficient Approaches to Large-Scale Feature Selection Using Genetic Algorithms

On many practical datasets, it is observed that prototype patterns or representative feature subsets, or both together, provide better classification performance as compared to using the entire dataset and all the features. Such subsets also help in reducing the classification cost.

The pattern recognition literature is replete with feature selection approaches. In this section, we propose to obtain optimal dimensionality reduction using Genetic Algorithms, through efficient classification of OCR patterns. The efficiency is achieved by resorting to nonlossy compression of patterns and classifying them in the compressed domain itself. We further examine combining frequent-item-support-based feature reduction for possible improvement in classification accuracy.

Through experiments, we demonstrate that the proposed approaches result in an optimal feature subset that, in turn, results in improved classification accuracy and processing time as compared to conventional processing

In the present work, we propose algorithms that integrate the following aspects.

• Run-length compression of data and classification in the compressed domain

• Optimal feature selection using genetic algorithms

• Domain knowledge of data under consideration

• Identification of frequent features and their impact on classification accuracy combined with genetic algorithms


working of the algorithm. Experiments and results are discussed in Sect. 7.4.4. The work is summarized in Sect. 7.4.5.

7.4.1 An Overview of Genetic Algorithms

Genetic algorithms are search and optimization methods based on the mechanisms of natural genetics and evolution. Since these algorithms are motivated by competition and survival of the fittest in Nature, we find an analogy with them. GAs have advantages over conventional optimization methods in finding the global optimum solution, or a near-global optimal solution, while avoiding local optima. Over the years, their applications have rapidly spread to almost all engineering disciplines. Since their introduction, a number of developments and variants have been introduced and developed into mature topics such as multiobjective genetic algorithms, interactive genetic algorithms, etc. In the current section, we briefly discuss the basic concepts with a focus on the implementation of a simple genetic algorithm (SGA) and a few applications. A brief discussion on SGA can be found in Chap. The discussion provided in the present section forms the background to the subsequent material.

SGA is characterized by the following

• Population of chromosomes or binary strings of finite length

• Fitness function and problem encoding mechanism

• Selection of individual strings

• Genetic operators, viz., cross-over and mutation

• Termination and other control mechanisms

It should be noted here that each of these topics is studied in depth through research works. Since the current section is intended to provide completeness of the discussion with a focus on the implementation aspect, interested readers are directed to the references listed at the end of the section. We also intentionally avoid discussion of other evolutionary algorithms.

Objective Function SGA is intended to find an optimal set of parameters that optimizes a function; for example, find a set of parameters x1, x2, ..., xn that maximizes a function f(x1, x2, ..., xn).

Chromosomes A bit-string or chromosome consists of a finite number of bits, l, called the length of the chromosome. Bit-string encoding is a classical method adopted by researchers. The chromosomes are used to encode parameters that represent a solution to the optimization problem. Alternate encoding mechanisms include binary encoding, gray code, floating point, etc. SGA makes use of a population of chromosomes with a finite population size, C. Each bit of the bit-string is called an allele in genetic terms. Both terms are used interchangeably in the literature.

Fitness Function and Encoding A chromosome is evaluated using the fitness function: given the values of x1, x2, ..., xn, the fitness can be computed. We encode the chromosome to represent the set of parameters. This forms the key step of a GA. Encoding depends on the nature of the optimization problem. The following are two examples of encoding mechanisms. It should be noted that the mechanisms are problem dependent, and one can find novel ways of encoding a given problem.

Example 1 Suppose that we need to select a subset of features out of a group of features that represent a pattern. The chromosome length is considered equal to the total number of features in the pattern, and each bit of the chromosome represents whether the corresponding feature is considered. The fitness function in this case can be the classification accuracy based on the selected set of features.

Example 2 Suppose that we need to find values of two parameters that minimize (maximize) a given function and the parameters assume real values. The chromosome is divided into two parts representing the two parameters. The binary equivalents of the expected ranges of real values of the parameters determine the corresponding lengths, viz., l1 and l2. The length of the chromosome is given by l1 + l2.

Selection Mechanism Selection refers to carrying over individual chromosomes from the previous generation to the next generation of evolution while giving emphasis to highly fit individuals in the current generation. There are many selection schemes that are used in practice. For example, the Roulette wheel selection scheme assigns each chromosome a sector of a roulette wheel such that the angle subtended by the sector is proportional to its fitness. This ensures that more copies of highly fit individuals move on to the next generation. Many alternate approaches for selection are also used in practice.

Crossover Pairs of individuals, s1 and s2, are chosen at random from the population and are subjected to crossover. Crossover takes place when the prechosen probability of crossover, Pc, exceeds a generated random number in the range [0, 1]. In the "single-point crossover" scheme, a position k within the chromosome is chosen at random from the numbers 1, 2, ..., (l − 1) with equal probability. Crossover takes place at k, resulting in two new offspring: offspring 1 contains alleles 1 to k of s1 and (k + 1) to l of s2, and offspring 2 contains alleles 1 to k of s2 and (k + 1) to l of s1. The operation is depicted in Fig. 7.1. Other crossover schemes include two-point crossover, uniform crossover, etc.

Mutation Mutation of a bit consists of changing it from 0 to 1 or vice versa based on the probability of mutation, Pm. This provides better exploration of the solution space by restoring genetic material that could possibly be lost through generations. The activity consists of generating a random number in the range [0, 1]; if the random number is less than Pm, mutation is carried out. The bit position of mutation is determined by choosing a position in [1, l] at random. A higher value of Pm causes more frequent disruption. The operation is depicted in Fig. 7.2.
Termination Many criteria exist for termination of the algorithm. Some approaches


Fig 7.1 Crossover operation

Fig 7.2 Mutation operation

Control Parameters The choice of the population size C and the values of Pc and Pm affect the solution and the speed of convergence. Although a large population size assures convergence, it increases computation time. The choice of these parameters is problem dependent. We demonstrate the effect of their variability in Sect. 7.4.4. Adaptive schemes for choosing the values of Pc and Pm show improvement in the final fitness value.

SGA With the above background, we briefly discuss the working of a Simple Genetic Algorithm, as given below. After encoding the parameters of an optimization problem, consider C chromosomes, each of length l. Initialize the population with a probability of initialization, PI. With PI = 0, all the alleles are considered (set to 1) for each chromosome, and with PI = 1, none are considered. Thus, as the value of PI varies from 0 to 1, more alleles with value 0 are expected, thereby resulting in a smaller number of features getting selected for the chromosome. In Sect. 7.4.4, we demonstrate the effect of variation of PI and provide a discussion. As the next step, we evaluate the function to obtain the fitness value of each chromosome.


Simple Genetic Algorithm
{
  Step 1: Initialize population containing 'C' strings of length 'l', each with probability of initialization PI;
  Step 2: Compute fitness of each chromosome;
  while termination criterion not met
  {
    Step 3: Select population for the next generation;
    Step 4: Perform crossover based on Pc and mutation based on Pm;
    Step 5: Compute fitness of each updated chromosome;
  }
}
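The pseudocode above can be fleshed out as a minimal runnable sketch. The one-max fitness (count of 1-bits) and all parameter values are illustrative assumptions; in the chapter's setting, the fitness would instead be the validation accuracy of a classifier using the features the chromosome selects.

```python
import random

def sga(fitness, l, C=20, Pc=0.9, Pm=0.01, PI=0.5, generations=30, seed=7):
    """Minimal Simple Genetic Algorithm over binary chromosomes of length l."""
    rnd = random.Random(seed)
    # Step 1: initialize; a bit is set to 1 with probability (1 - PI)
    pop = [[int(rnd.random() > PI) for _ in range(l)] for _ in range(C)]
    for _ in range(generations):
        # Steps 2/5: compute fitness of each chromosome
        scores = [fitness(c) for c in pop]
        # Step 3: roulette-wheel (fitness-proportionate) selection
        weights = [s + 1e-9 for s in scores]
        pop = [rnd.choices(pop, weights=weights)[0][:] for _ in range(C)]
        # Step 4: single-point crossover on consecutive pairs ...
        for a in range(0, C - 1, 2):
            if rnd.random() < Pc:
                k = rnd.randint(1, l - 1)
                pop[a][k:], pop[a + 1][k:] = pop[a + 1][k:], pop[a][k:]
        # ... and bit-flip mutation
        for c in pop:
            for b in range(l):
                if rnd.random() < Pm:
                    c[b] = 1 - c[b]
    return max(pop, key=fitness)

# One-max fitness as a stand-in for classification accuracy
best = sga(fitness=sum, l=20)
print(sum(best))   # typically close to l for one-max
```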

7.4.1.1 Steady-State Genetic Algorithm (SSGA)

In the general framework of Genetic Algorithms, we choose the entire feature set of a pattern as a chromosome. Since the features are in binary form, they indicate the presence or absence of the corresponding feature in a pattern. The genetic operators of selection, cross-over, and mutation, with the corresponding probability of initialization (PI), probability of cross-over (Pc), and probability of mutation (Pm), are used. As in the case of SGA, the given dataset is divided into training, validation, and test data. Classification accuracy on validation data using NNC forms the fitness function. Table 7.1 contains the terminology used in this chapter.

In the case of SSGA, we retain a chosen percentage of highly fit individuals from generation to generation, thereby preventing the loss of such individuals during the generations; this retained fraction is related to the generation gap. Thus, SSGA permits larger Pm values as compared to SGA.

7.4.2 Proposed Schemes


a generation gap of 40 %. Algorithm 7.2 integrates the concept of frequent features in addition to GA-based optimal feature selection.

Algorithm 7.1 (Algorithm for Feature Selection using Compressed Data Classification and Genetic Algorithms)

Step 1: Consider a population of 'C' chromosomes, with each chromosome consisting of 'l' features. Initialize each chromosome by setting a feature to '1' (selected) with a given probability, PI.

Step 2: For each chromosome in the population,

(a) Consider those selected features in the chromosome.
(b) With the selected features in the training and validation datasets, compress the data.
(c) Compute the classification accuracy of the validation data directly using the compressed form. The classification accuracy forms the fitness function.
(d) Record the number of alleles and the classification accuracy for each chromosome, and the generation-wise average fitness value.

Step 3: In computing the next generation of chromosomes, carry out the following steps.

(a) Sort the chromosomes in the descending order of their fitness.
(b) Preserve 40 % of the highly fit individuals for the next generation.
(c) Obtain the remaining 60 % of the next population by subjecting randomly selected individuals from the current population to cross-over and mutation with respective probabilities Pc and Pm.

Step 4: Repeat Steps 2 and 3 till there is no significant change in the average fitness between successive generations.

In the framework of optimal feature selection using genetic algorithms, each chromosome is considered to represent the entire candidate feature set. The population containing C chromosomes is initialized in Step 1. Since the features are binary valued, the initialization is carried out by setting a feature to "1" with a given probability of initialization, PI. Based on the binary value 1 or 0 of an allele, the corresponding feature is considered either selected or not, respectively.

In Step 2, for each initialized chromosome, the original training and validation data is updated to contain only the selected features. The data is compressed using the run-length compression algorithm. The validation data is classified in its compressed form, and the average classification accuracy is recorded.

In Step 3, the subsequent population is generated. The best 40 % of the current population are preserved. The remaining 60 % are generated by subjecting the entire current population to the genetic operators of selection, single-point cross-over, and mutation with preselected probabilities. The terminating criterion is verified using the percentage change of fitness between two successive generations.


probabilities of selection, cross-over, mutation, etc. The nature of the exercises and the results are discussed in the following section.

Genetic Algorithms (GAs) are well studied for feature selection and feature extraction. We restrict our study to feature selection. Given a feature set of size C, the problem of dimensionality reduction can be defined as arriving at a subset of the original feature set of dimension d < C such that the best classification accuracy is obtained. The single dominant computation block is the evaluation of the fitness function. If this could be sped up, an overall speedup can be achieved. To achieve this, we propose to compress the training and validation data and compute the classification accuracy directly on the compressed data without having to uncompress it.

In the current section, before discussing the proposed procedure, we present compressed-data classification and the Steady-State Genetic Algorithm for feature selection in the following subsections.

7.4.2.1 Compressed Data Classification

We make use of the algorithm discussed in Chap. 3 to compress the input binary data and operate directly on the compressed data, without decompressing, for classification using runs. This forms a totally nonlossy compression–decompression scenario. It is possible to perform this when classification is achieved with the help of the Manhattan distance function. The distance function on the compressed data results in the same classification accuracy as that obtained on the original data, as shown in Chap. 3. The compression algorithm is applied on large data, and it is noticed to reduce the processing requirements significantly.
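A sketch of the idea, under the run convention described in Sect. 7.4.3.2 (the first run counts leading 1s and has length 0 when the pattern starts with 0): for binary patterns, the Manhattan distance equals the number of disagreeing positions, and it can be accumulated by walking the two run lists in parallel without decompressing.

```python
def to_runs(bits):
    """Run-length encode a binary vector; runs alternate 1s, 0s, 1s, ..."""
    runs, val, i, n = [], 1, 0, len(bits)
    while i < n:
        j = i
        while j < n and bits[j] == val:
            j += 1
        runs.append(j - i)   # 0-length first run if the vector starts with 0
        val = 1 - val
        i = j
    return runs

def manhattan_runs(r1, r2):
    """Manhattan (= Hamming) distance of two equal-length binary patterns,
    computed directly on their run-length encodings."""
    i = j = 0
    rem1 = r1[0] if r1 else 0
    rem2 = r2[0] if r2 else 0
    dist = 0
    while i < len(r1) and j < len(r2):
        step = min(rem1, rem2)
        if (i % 2) != (j % 2):   # one side in a 1s-run, the other in a 0s-run
            dist += step
        rem1 -= step
        rem2 -= step
        if rem1 == 0:
            i += 1
            rem1 = r1[i] if i < len(r1) else 0
        if rem2 == 0:
            j += 1
            rem2 = r2[j] if j < len(r2) else 0
    return dist

p, q = [1, 1, 0, 1], [0, 1, 1, 1]
print(to_runs(p), to_runs(q), manhattan_runs(to_runs(p), to_runs(q)))
# [2, 1, 1] [0, 1, 3] 2
```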

7.4.2.2 Frequent Features

Although Genetic Algorithms provide an optimal feature subset, it is interesting to explore whether the input set of features can be reduced by simpler means. The frequent pattern approach, as discussed in Sects. 2.4.1 and 4.5.2, provides frequently used features, which could possibly help discrimination too. A binary-valued pattern can be considered as a transaction, with each feature representing the presence or absence of the corresponding item. The support of an item can be defined as the percentage of transactions in the given database that contain the item. We make use of the concept of support in identifying the feature set that is frequent above a chosen threshold. This results in a reduction in the number of features that need to be explored for an optimal set. In Sect. 7.4.3, as part of the preliminary analysis of the considered data, we demonstrate this aspect. Figure 7.3 demonstrates the concept of support. The terms support and percentage-support are used equivalently in the present chapter.
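Computing per-feature support and keeping only the frequent features takes a few lines; the threshold value and the toy transactions below are illustrative assumptions.

```python
def frequent_features(patterns, min_support):
    """Return indices of features whose support (fraction of patterns
    in which the feature is 1) meets a minimum support threshold.

    patterns:    list of equal-length binary feature lists (transactions)
    min_support: threshold in [0, 1] (percentage-support / 100)
    """
    n = len(patterns)
    d = len(patterns[0])
    support = [sum(p[f] for p in patterns) / n for f in range(d)]
    return [f for f in range(d) if support[f] >= min_support]

# Three 4-featured transactions; feature 1 occurs in all, feature 3 in one
data = [[0, 1, 1, 0],
        [1, 1, 0, 0],
        [0, 1, 1, 1]]
print(frequent_features(data, 0.5))   # features present in at least half
```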

Algorithms 7.1 and 7.2 are studied in detail in the following sections.

Algorithm 7.2 (Optimal Feature Selection using Genetic Algorithms combined with frequent features)


Fig 7.3 The figure depicts the concepts of transaction, items, and support

Step 1: Identify the frequent features, viz., those whose support exceeds a chosen threshold, ε.
Step 2: Consider only those frequent features for further exploration.
Step 3: Carry out all steps of Algorithm 7.1.

We briefly elaborate each of the steps along with results of preliminary analysis

7.4.3 Preliminary Analysis

Preliminary analysis of the data brings out insights into the data and forms domain knowledge. The analysis primarily consists of the computation of measures of central tendency and dispersion, feature occupancy of patterns, class-wise variability, and inter-class similarities. The results of the analysis help in choosing appropriate parameters and forming the experimental setup.

We consider 10-class handwritten digit data consisting of 10,000 192-featured patterns. Each digit is formed as a 16×12 matrix with binary-valued features. The data is divided into three mutually exclusive sets for training, validation, and testing.


Fig 7.4 Statistics of features in the training dataset

Fig 7.5 The figure contains nine 3-featured patterns occupying different feature locations in a 3×3 pattern representation. It can be observed that all locations are occupied cumulatively by the end of the sample patterns

representation. The middle figure indicates the standard deviation of the number of nonzero features, indicating a comparatively larger dispersion for the digits 0, 2, 3, and 5. The third plot in the figure provides an interesting aspect of the occupancy of features within a digit. Considering the digit 0, although, on the average, 68 nonzero features suffice to represent the digit, the nonzero features occupied about 175 of the 192 feature locations by one training pattern or the other. Similar observations can be made for other digits too.


Fig 7.6 The figure contains patterns with frequent features excluded, with minimum support thresholds of 13, 21, 52, and 70. The excluded feature regions are depicted as gray and black portions; the remainder corresponds to the feature set retained for exploration

7.4.3.1 Redundancy of Features Vis-a-Vis Support

We make use of the concept of support, as discussed in Fig. 7.3 and Sect. 7.4.2.2, to identify the features that occur above a prechosen support threshold. We compute the empirical probability for each feature. We vary the support to find the set of frequent features. We will later examine experimentally whether such excluded features have an impact on feature selection. Figure 7.6 contains an image of a 192-featured pattern with excluded features corresponding to various support thresholds. The figure indicates features of low minimum support. It should be noted that, in this case of low minimum support, they occurred on the edges of the pattern. As the support is increased, the pattern representability will be affected.

7.4.3.2 Data Compression and Statistics

The considered patterns consist of binary-valued features. The data is compressed using the run-length compression scheme as discussed in Chap. 3. The scheme consists of the following steps.

• Consider each pattern

• Form runs of continuous occurrence of each feature value. For ease of dissimilarity computation, consider each pattern as starting with a feature value of 1, so that the first run corresponds to the number of 1s. In case the first feature of the pattern is 0, the corresponding run length would be 0.

The compression results in an unequal number of runs for various patterns, as shown in Fig. 7.7. The dissimilarity computation in the compressed domain is based on the work in Chap. 3.

7.4.4 Experimental Results

Experimentation is planned to explore each of the parameters of Table 7.1 in order to arrive at a minimal set of features that provides the best classification accuracy.


Fig. 7.7 Statistics of runs in compressed patterns. For each class label, the vertical bar indicates the range of the number of runs in the patterns. For example, for class label “0”, the compressed image length ranges from 34 to 67. The discontinuities indicate that there are no patterns with compressed lengths of 36 to 39. The figure provides the range of compressed pattern lengths corresponding to the original pattern length of 192 for all the patterns

Table 7.1 Terminology

Term  Description
C     Population size
t     No. of generations
l     Length of chromosome
PI    Probability of initialization
Pc    Probability of cross-over
Pm    Probability of mutation
ε     Support threshold

Having chosen appropriate values of these three probabilities, we proceed with feature selection. We also bring out a comparison of computation time with and without compression. All the exercises are carried out with run-length-encoded nonlossy compression, and classification is performed directly in the compressed domain.

7.4.4.1 Choice of Probabilities

Fig. 7.8 Result of genetic algorithms after 10 generations on sensitivity of the probabilities of initialization, cross-over, and mutation. The two plots in each case indicate the number of features for the best chromosome across 10 generations and the corresponding fitness value

The genetic algorithm is run for 10 generations. For these exercises, we consider the complete set containing 192 features. Figure 7.8 contains the results of these exercises. The objective of the study is to obtain a subset of features that provides a reasonable classification accuracy.

Choice of Probability of Initialization (PI) A feature is included when the


Fig. 7.9 The figure depicts the impact of the choice of PI. The X-axis represents the number of features, and the Y-axis represents the classification accuracy. Popsize (c) = 40, No. of generations (n) = 20, Pc = 0.99, Pm = 0.001, Gengap = 20 %. The figures are counted column-wise: for figures in column 1, the values of PI are 0.2 and 0.3; in column 2, 0.4 and 0.5; and in column 3, 0.6. kNNC is used for classification

view of the reduced number of features. Based on the above results, PI is chosen as 0.2.

We present a novel data visualization to demonstrate the trend of results with the changing value of various parameters, say, ε. Here we consider all the fitness values and plot them as a scatter plot. It forms a cloud of results. With varying parameter value, the cloud changes its position in both the X and Y directions. Figure 7.9 indicates the variation in the results with the changing value of PI while keeping the remaining parameters constant. It can be seen from the figure that, with an increase in the probability of initialization, the average classification accuracy changed from nearly 90 % to 82 % as the number of features varied from 160 to about 70. It should also be noted that the points disperse as the number of selected features reduces.

Probability of Cross-over (Pc) Pc is studied for values between 0.8 and 0.99. The recombination operator provides new offspring from two parent chromosomes. It is usually chosen to have a relatively high value above 0.8. It can be seen from Fig. 7.8 that as Pc increases, the classification accuracy improves. Interestingly, the corresponding number of features also reduces; Pc for the study is chosen as 0.99.


Fig. 7.10 The figure contains an optimal feature set represented in the pattern. The minimum support thresholds used for a priori feature exclusion are 13, 21, and 52. The corresponding best feature sets, as shown above, provided classification accuracies of 88.3 %, 88.5 %, and 88.8 %, respectively, with validation data and 88.03 %, 87.3 %, and 87.97 % with test data. This is an example of small feature sets providing relatively higher classification accuracy

increase of the classification accuracy is not assured. There is no consistent trend in the number of features either. For the current study, Pm is chosen as 0.001. However, SSGA ensures the retention of a few highly fit individuals across generations.
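The interplay of the three probabilities can be illustrated with a minimal steady-state GA sketch for feature selection. The chromosome length, population size, and the placeholder fitness (a fixed target mask standing in for kNNC validation accuracy) are illustrative assumptions, not the case-study setup.

```python
import random

random.seed(0)
L = 16                            # chromosome length (candidate features)
PI, PC, PM = 0.2, 0.99, 0.001     # probabilities as in Table 7.1

def init_chromosome():
    # Biased initialization: each feature is selected with probability PI.
    return [1 if random.random() < PI else 0 for _ in range(L)]

def fitness(ch):
    # Placeholder objective: in the case study this is kNNC accuracy on
    # validation data; here we reward agreement with a fixed target mask.
    target = [1, 0] * (L // 2)
    return sum(a == b for a, b in zip(ch, target)) / L

def crossover(p1, p2):
    # Single-point recombination, applied with probability PC.
    if random.random() < PC:
        cut = random.randrange(1, L)
        return p1[:cut] + p2[cut:]
    return p1[:]

def mutate(ch):
    # Rare bit-flip mutation with probability PM per gene.
    return [1 - g if random.random() < PM else g for g in ch]

pop = [init_chromosome() for _ in range(20)]
for _ in range(40):               # generations
    pop.sort(key=fitness, reverse=True)
    elite = pop[:4]               # SSGA flavor: best individuals survive
    pop = elite + [mutate(crossover(*random.sample(elite, 2)))
                   for _ in range(len(pop) - len(elite))]

best = max(pop, key=fitness)
```

The elitist replacement in the loop is what distinguishes the steady-state flavor: the fittest chromosomes are carried over unchanged across generations.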

7.4.4.2 Experiments with Complete Feature Set

The complete feature set of 192 features is considered for the experiments as the initial set for optimal feature selection. The number of generations for each run of SSGA is greater than 40. The best results of the exercises in terms of classification accuracy (CA) are summarized below.

With the complete dataset and 192 features, the CAs with the validation and test datasets are 80.80 % and 90.34 %, respectively. With 175 features, the best CA of 90.85 % is obtained with the validation dataset. The corresponding CA with test data is 90.40 % with NNC and 91.60 % with kNNC with k = 5. It can be observed that this result is better than the one obtained with the complete feature set. This emphasizes the fact that an optimal feature set that is a subset of the complete feature set can provide a higher CA. A similar observation can be made from Table 7.2 too.

7.4.4.3 Experiments with A Priori Excluded Features


Fig. 7.11 Popsize (c) = 60, No. of generations (n) = 40, Pc = 0.99, Pm = 0.001, Gengap = 30, PI = 0.1. The figures are counted column-wise. The support values and the corresponding numbers of features are shown in parentheses. Plots in column 1 correspond to 0 (0), 0 (0) with PI = 0.2, 0.001 % (23), and 0.003 % (38); in column 2, they are 0.004 % (47), 0.006 % (53), 0.01 % (60), and 0.011 % (64). kNNC is the classifier

Table 7.2 Feature selection using GAs with minimum support-based feature exclusion

Minimum   CA with           CA with     Optimal set   Feature
support   validation data   test data   of features   reduction
13        88.3 %            88.0 %      118           38.5 %
21        88.5 %            87.1 %      104           45.8 %
52        88.8 %            88.0 %      93            51.6 %

The corresponding classification accuracies with the validation dataset are 88.3 %, 88.5 %, and 88.8 %. The classification accuracies with the test datasets are 88.0 %, 87.13 %, and 88.0 %. The reduction in feature set sizes as compared to the original 192 features is significant: 38.5 %, 45.8 %, and 51.6 %. The results are summarized in Table 7.2.
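The reduction percentages follow directly from the optimal feature counts; a quick standalone check, using the counts from Table 7.2:

```python
# Feature-reduction percentages relative to the original 192 features,
# computed from the optimal feature-set sizes in Table 7.2.
original = 192
reductions = {optimal: round(100 * (original - optimal) / original, 1)
              for optimal in (118, 104, 93)}
# reductions == {118: 38.5, 104: 45.8, 93: 51.6}
```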


Fig. 7.12 Cases correspond to feature selection with the kNNC classifier. Popsize (c) = 60, No. of generations (n) = 40, Pc = 0.99, Pm = 0.001, Gengap = 30, PI = 0.1. The figures are counted column-wise. For figures in column 1, the support values and corresponding numbers of features are 0.017 % (71), 0.019 % (76), 0.032 % (81), 0.045 % (85), and in column 2, they are 0.055 % (91), 0.06 % (95), 0.064 % (102). kNNC is the classifier

are arranged column-wise. In Fig. 7.11, the first image in column 1 corresponds to all features. The following observations can be made from the figures.

In summary, from the above analysis, the following inferences can be drawn

• Increasing minimum support leads to increasing exclusion of the number of features.

• It can be noted from both figures that, with a reducing number of features, the cloud of results remains nearly invariant w.r.t. classification accuracy (along the Y-axis) up to a reduction of 85 features; the classification accuracy remains around 90 %. Subsequently, the classification accuracy is affected, although not significantly, till the reduction reaches 102 features. Beyond that, the reduction in accuracy is drastic.

• The number of optimal features that provide good classification accuracy demonstrates a significant reduction with increasing support value: it ranges from 155–180 with complete feature-set exploration down to 60–80 for 102 features.

• Interestingly, the results shown in the figures indicate that there are significant redundant features that do not really contribute to discrimination of patterns.

• Frequent patterns help feature reduction. Equivalently, by increasing support, we tend to exclude less discriminant features.


Table 7.3 Improvement in CPU time due to proposed algorithm on Intel Core-2 Duo processor

Nature of data           CPU time
With uncompressed data   11,428.849 sec
With compressed data     6495.940 sec

7.4.4.4 Impact of Compressed Pattern Classification

Data compression using run lengths, proposed in Chap. 3, is nonlossy, and this was shown theoretically too. The experiments are repeated, and it is found that the classification accuracy remains the same.

Another important aspect is the CPU time improvement. The CPU times taken by both compressed and original datasets after 16 generations of SSGA are compared. It is found that on an Intel Core-2 Duo processor, processing the compressed data improves the CPU time to the tune of 43 %. The times are provided in Table 7.3.
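The gain comes from never decompressing: a dissimilarity between two binary patterns can be computed by walking their run lists directly. The sketch below illustrates this idea under the convention that the first run counts leading 1s; it is an illustration of the principle, not the exact Chap. 3 algorithm.

```python
def rle_hamming(r1, r2):
    """Hamming (equivalently Manhattan, for binary data) distance between
    two equal-length binary patterns given only their run-length
    encodings, where the first run counts leading 1s (possibly 0)."""
    total = sum(r1)
    assert total == sum(r2), "patterns must have equal length"
    i = j = a = b = dist = done = 0
    bit1 = bit2 = 1
    while done < total:
        while a == 0:                     # fetch next nonempty run of r1
            a, bit1 = r1[i], 1 - (i % 2)  # even-indexed runs are runs of 1s
            i += 1
        while b == 0:                     # fetch next nonempty run of r2
            b, bit2 = r2[j], 1 - (j % 2)
            j += 1
        step = min(a, b)                  # overlap of the two current runs
        if bit1 != bit2:
            dist += step                  # bits differ over the whole overlap
        a, b, done = a - step, b - step, done + step
    return dist

# [1,1,0,0,1] vs [1,0,0,1,1] differ in two positions:
rle_hamming([2, 2, 1], [1, 2, 2])  # -> 2
```

The loop does work proportional to the number of runs rather than the number of features, which is the source of the CPU-time saving when patterns compress well.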

7.4.5 Summary

Feature selection aims to achieve a certain objective such as (a) optimizing an evaluation measure like classification accuracy on unseen patterns, (b) a certain restriction on the evaluation measure, or (c) the best compromise between the size of the feature set and the value of its evaluation measure. When the size of the initial feature set is more than 20, the process forms a large-scale feature selection problem. With the number of features d, the search space is of size 2^d. The problem is further complex when (a) the number of patterns is huge, (b) the data contains multicategory patterns, and (c) the number of features is much larger than 20. We provided an overview of feature selection and feature extraction methods. We presented a case study on optimal feature selection using genetic algorithms, along with a discussion on genetic algorithms. In the case study, we focused on efficient methods of large-scale feature selection that provide a significant improvement in the computation time while providing classification accuracy at least as good as that of the complete dataset. The proposed methods are applied in feature selection of large datasets and demonstrate that the computation time improves by almost 50 % as compared to the conventional approach.

We integrate the following aspects in the current work

• Feature selection of high-dimensional large dataset using Genetic Algorithms

• Domain knowledge of data under study obtained through preliminary analysis

• Run-length compression of data

• Classification of compressed data directly in the compressed domain

Further, from the discussions it is clear that:


• Mutual Information and Fisher’s score are important and popular in filter selection.

• PCA is superior to Random Projections; NMF can get stuck in a locally optimal solution

• Genetic Algorithms combined with frequent features lead to significant reduction in the number of features and also improve the computation time

7.5 Bibliographical Notes

Duda et al. (2001) provide an overview of feature selection. A good discussion on the design and applicability of distance functions in high-dimensional spaces can be found in Hsu and Chen (2009). Efficient and effective floating sequential schemes for feature selection are discussed in Pudil et al. (1994) and Somol et al. (1999). Various schemes, including the ones based on Fisher’s score and Mutual Information, are considered by Punya Murthy and Narasimha Murty (2012). A good introduction to NMF is provided by Lee and Seung (1999). An authoritative coverage of Random Projections is given by Menon (2007). A well-known reference on PCA is Jolliffe (1986). Cover and Van Camenhout (1977) contains a demonstration of the need for exhaustive search for optimal feature selection. Goldberg (1989), Davis and Mitchell (1991), and Man et al. (1996) provide a detailed account of genetic algorithms, including issues in implementation. Siedlecki and Sklansky (1989) demonstrate the superiority of a solution using Genetic Algorithms as compared to exhaustive search, sequential search, and branch-and-bound with the help of a 30-dimensional dataset. Several variants of Genetic Algorithms are used for feature selection on different types of data, such as the works by Siedlecki and Sklansky (1989), Yang and Honavar (1998), Kimura et al. (2009), Punch et al. (1993), Raymer et al. (1997), etc. Raymer et al. (2000) focus on feature extraction for dimensionality reduction using genetic algorithms. Oliveira et al. (2001) demonstrate feature selection using a simple genetic algorithm and an iterative genetic algorithm. Greenhagh and Marshall (2000) discuss convergence criteria for genetic algorithms. A comparison of genetic algorithm-based prototype selection schemes was provided by Ravindra Babu and Narasimha Murty (2001). Raymer et al. (1997) and Ravindra Babu et al. (2005) demonstrate simultaneous selection of features and prototypes. Run-length-encoded compression and dissimilarity computation in the compressed domain are provided in Ravindra Babu et al. (2007). A utility of frequent item support for feature selection was demonstrated in Ravindra Babu et al. (2005). Cheng et al. (2007) argue that frequent features help discrimination.

References

(185)

T.M Cover, J.M Van Camenhout, On the possible orderings in the measurement selection problem. IEEE Trans Syst Man Cybern 7(9), 657–661 (1977)

L.D Davis, M Mitchell, Handbook of Genetic Algorithms (Van Nostrand Reinhold, New York, 1991)

R.O Duda, P.E Hart, D.G Stork, Pattern Classification (Wiley, New York, 2001)

D.E Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning (Addison-Wesley, Reading, 1989)

D Greenhagh, S Marshall, Convergence criteria for genetic algorithms SIAM J Comput 3(1), 269–282 (2000)

C.-M Hsu, M.-S Chen, On the design and applicability of distance functions in high-dimensional data space IEEE Trans Knowl Data Eng 21, 523–536 (2009)

I.T Jolliffe, Principal Component Analysis (Springer, New York, 1986)

Y Kimura, A Suzuki, K Odaka, Feature selection for character recognition using genetic algorithm, in Fourth Intl Conf on Innovative Computing, Information and Control (ICICIC) (2009), pp 401–404

D.D Lee, H Seung, Learning the parts of objects by non-negative matrix factorization Nature 401, 788–791 (1999)

K.F Man, K.S Tang, S Kwong, Genetic algorithms: concepts IEEE Trans Ind Electron 43(5), 519–534 (1996)

A.K Menon, Random projections and applications to dimensionality reduction. B.Sc (Hons.) Thesis, School of Info Technologies, University of Sydney, Australia (2007)

L.S Oliveira, N Benahmed, R Sabourin, F Bortolozzi, C.Y Suen, Feature subset selection using genetic algorithms for handwritten digit recognition, in Computer Graphics and Image Processing (2001), pp 362–369

P Pudil, J Novovicova, J Kittler, Floating search methods in feature selection Pattern Recognit Lett 15, 1119–1125 (1994)

W.F Punch, E.D Goodman, M Pei, L.C Shun, P Hovland, R Enbody, Further research on feature selection and classification using genetic algorithms, in ICGA (1993), pp 557–564

C Punya Murthy, M Narasimha Murty, Discriminative feature selection for document classification, in Proceedings of ICONIP (2012)

T Ravindra Babu, M Narasimha Murty, V.K Agrawal, On simultaneous selection of prototypes and features in large data pattern recognition, in LNCS, vol 3776 (Springer, Berlin, 2005), pp. 595–600

T Ravindra Babu, M Narasimha Murty, Comparison of genetic algorithm based prototype selection schemes. Pattern Recognit 34(2), 523–525 (2001)

T Ravindra Babu, M Narasimha Murty, V.K Agrawal, Classification of run-length encoded binary data Pattern Recognit 40(1), 321–323 (2007)

M.L Raymer, W.F Punch, E.D Goodman, P.C Sanschagrin, L.A Kuhn, Simultaneous feature extraction and selection using a masking genetic algorithm, in Proc 7th Intl Conf on Genetic Algorithms (1997)

M.L Raymer, W.F Punch, E.D Goodman, L.A Kuhn, A.K Jain, Dimensionality reduction using genetic algorithms IEEE Trans Evol Comput 4(2), 164–171 (2000)

W Siedlecki, J Sklansky, A note on genetic algorithms for large-scale feature selection Pattern Recognit Lett 10, 335–347 (1989)

P Somol, P Pudil, J Novovicova, P Paclik, Adaptive floating search methods in feature selection Pattern Recognit Lett 20, 1157–1163 (1999)


Chapter 8

Big Data Abstraction Through Multiagent Systems

8.1 Introduction

Big Data is proving to be a new paradigm after data mining in large or massive data analytics. With the increasing ability to store large volumes of data at every second, the need for making sense of the data for summarization and business exploitation is steadily increasing. The data emanates from customer records, pervasive sensors, a sense of keeping every data item for potential subsequent analysis, security paranoia, etc. The Big Data theme is gaining importance especially because large volumes of data in a variety of formats are found related and need to be processed in conjunction with each other. Large databases, which are conventionally built on a predefined schema, are not directly usable. However, there are arguments in the literature for and against the use of the Map-Reduce algorithm as compared to massively parallel databases. Such databases are built by many commercial players.

Agent-mining interaction is gaining importance in the research community for solving massive data problems in a divide-and-conquer manner. The interaction is mutual, such as agents driving data mining and vice versa. We discuss these issues in more detail in the chapter.

We propose to solve Big Data analytics problems through multiagent systems. We propose a few problem-solving schemes. In Sect. 8.2, we provide an overview of Big Data and the challenges it offers to the research community. Section 8.3 discusses large data problems as solved by conventional systems. Section 8.4 contains a discussion on the overlap between Big Data and data mining. A discussion on multiagent systems is provided in Sect. 8.5. Section 8.6 contains proposed multiagent systems for abstraction generation with Big Data.

8.2 Big Data

Big data is marked by voluminous heterogeneous datasets that need to be accessed and processed in real time to generate abstraction. Such an abstraction is valuable



for scientific or business decisions depending on the nature of the data. These attributes are conventionally termed the three v’s: volume, velocity, and variety. Some experts add an additional v: value. Big data analytics has also led to a new interdisciplinary topic, called data science, which combines statistics, machine learning, natural language processing, visualization, and data mining. Terminologies associated with data science are data products and data services.

The need for Big Data analytics or abstraction arose due to the increasing ability to sense and store data, the omnipresence of data, and the ability to see the business potential of such datasets. Some examples are the trails of data that one leaves as one browses web pages, tweets his/her opinions, social media channels, visits to multiple stores to purchase varieties of items, scientific data such as genome sequencing, astronomy, oceanography, clinical data, and applications such as drug re-purposing. Researchers propose the MAD (Magnetic, Agile, and Deep) analysis practice for Big Data, and self-tuning systems such as Starfish with respect to a popular big data system. These scenarios lead to the demand for increasing agility in data accessing and processing and to the need of accepting multiple data sources and generating sophisticated analytics. The need for such analytics in turn seeks the development and use of more efficient machine learning algorithms and statistical analyses that integrate parallel processing of data. Some conventional pattern recognition algorithms or statistical methods need to be strengthened in these directions.

The Map-Reduce algorithm and its variants play a pivotal role in Big Data applications.

8.3 Conventional Massive Data Systems

Conventionally, an “Enterprise Data Warehouse (EDW)” is the source for large data analysis. Business intelligence software bases its analysis on this data and generates insights by querying the EDW. The EDW system is a centralized data resource for analytics. The EDW is marked by systematic data integration with a well-defined schema, permitting only predefined structures of data for storage and analysis. This should be contrasted with heterogeneous datasets such as unstructured data of clicks, text such as twitter messages, images, voice data, etc., semi-structured data such as xml or rss-feeds, and combinations of them. Parallel SQL database management systems (DBMS) provide a solution to large data systems. For the sake of completeness, to name a few, some commercial systems for parallel DBMS are TeraData, Asterdata, Netezza, Vertica, Oracle, etc.

8.3.1 Map-Reduce


Fig. 8.1 Map-reduce system. The figure depicts the broad stages of input task, phases of map, reduce, and output. The activity is under the control of a master controller. The user has overall control of the programming system

makes use of a divide-and-conquer approach in an abstract sense. The system consists of multiple simple computing elements, termed compute nodes, networked together through a gigabit network. The Distributed File System (DFS) is followed to suit cluster computing. Some examples of operational distributed file systems are the Google File System (GFS), the Hadoop Distributed File System (HDFS), and CloudStore. A conceptual Map-Reduce system is shown in Fig. 8.1. As depicted, a Map-Reduce system consists of an input task that is divided into multiple Map tasks. The tasks in turn lead to output tasks through an intermediate processing stage. The system is under the control of a master controller, which ensures optimal allocation and fault tolerance. The entire activity is under the control of a user program.
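The map, shuffle, and reduce stages just described can be sketched in a few lines of plain Python with the classic word-count example; this mimics the data flow only, without the distribution, fault tolerance, or master controller of a real framework.

```python
from collections import defaultdict

def map_phase(document):
    # Map task: emit (key, value) pairs -- here (word, 1) per word.
    for word in document.split():
        yield word, 1

def shuffle(pairs):
    # Intermediate stage: group all values by key, as the framework
    # does between the Map and Reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce task: combine the values for one key -- here, sum counts.
    return key, sum(values)

documents = ["big data big insights", "data mining"]
pairs = [kv for doc in documents for kv in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
# counts == {'big': 2, 'data': 2, 'insights': 1, 'mining': 1}
```

In a real deployment, the map and reduce calls run in parallel on different compute nodes, and the shuffle is performed by the framework over the network.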


8.3.2 PageRank

An epoch-making contribution to evaluating the relative importance of web pages based on a search query is the PageRank scheme. The PageRank is a real number between 0 and 1. The higher the value, the higher the relevance of the result to the query. PageRank is determined by simulated random web surfers who execute a random walk on the web pages, coming across certain nodes more often than others. Intuitively, PageRank considers the pages visited more often as more relevant. However, in order to circumvent deliberate attempts to spam with terms making PageRank invalid, the relevance of a page is judged not only by the content of the page, but also by the terms in the near-links directed to the page. Implementation of the PageRank scheme takes care of large-scale representation of the transition matrix of the web, efficient computation alternatives for matrix–vector multiplication including the use of Map-Reduce, and methods to take care of dead ends and spider traps through taxation. A dead end is a web page that has no outgoing links. It affects PageRank computation, with the value reaching zero for dead-end links and also for the few pages that link to dead ends. This is taken care of by removing the nodes that have dead ends recursively as the computation progresses. A spider trap is a condition where web pages link to each other within a finite set of nodes without outlinks, thus leading to PageRank computation based on those finite nodes only. This condition is taken care of by a procedure called taxation, whose parameter lies between 0 and 1 and provides a small probability for a random surfer to leave the web while an equivalent number of random surfers is included. Some of the relevant terminology includes computation of topic-sensitive PageRank, biased random walks, and spam farms. Topic-sensitive PageRank, essentially, is PageRank biased toward a set of web pages, known as a teleport set, to suit the user’s interest through biased random walks. Link spam refers to a deliberate unethical effort to increase the PageRank of certain pages. This is tackled by Trust Rank, designed to lower the rank of spam pages, and spam mass, which is a relative rank measure to identify possible spam pages.

Apart from the use of the PageRank algorithm, each search engine uses a proprietary set of parameters, including weighting parameters, to optimize its performance and query relevance.
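The core computation with taxation can be sketched as a power iteration on a tiny example web; the three-page link structure below is an illustrative assumption.

```python
import numpy as np

# Column-stochastic transition matrix of a tiny 3-page web: column j
# spreads page j's rank equally over the pages it links to.
M = np.array([[0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5],
              [0.5, 0.5, 0.0]])

beta = 0.85                  # taxation: follow a link with prob. beta,
n = M.shape[0]               # else teleport to a random page
v = np.full(n, 1.0 / n)      # start with a uniform rank vector
for _ in range(50):          # power iteration
    v = beta * (M @ v) + (1 - beta) / n

# v sums to 1; by symmetry all three pages get rank 1/3 here.
```

The taxation term `(1 - beta) / n` is what rescues the iteration from dead ends and spider traps: some rank always leaks back to every page.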

8.4 Big Data and Data Mining


Thus, Big Data offers newer challenges in the above terms to data mining approaches. The formal interaction between Big Data and Data Mining is beginning to develop into areas such as mining massive datasets.

8.5 Multiagent Systems

Agents refer to computational entities that are autonomous, understand the environment, interact with other agents or humans, and act reactively and proactively in achieving an objective. The agents are termed intelligent when they can achieve the objective by optimizing their own performance given the environment and objective. When more than one agent is involved in accomplishing a task with all the previously discussed attributes, we call such a system a Multiagent system.

Example 8.1 An example of agents is footballers playing on a field. With a common objective of scoring a goal against the opposition, each of the players acts autonomously to reach the objective, collaborates, and proactively and reactively tackles the ball to seize the initiative in achieving the objective.

Example 8.2 A face detection system can be designed as a multiagent system. Face detection can be defined as detecting a human face in a given image. Some of the challenges faced by the activity are background clutter, illumination variation, background matching with the skin color of a person, partial occlusion, pose, etc. A multiagent face detection system consists of agents, each capable of carrying out an activity autonomously and sharing its outcome with other agents. For example, an agent carrying out skin color detection shares the region containing skin and skin-color-like artifacts. A second agent may carry out detection of the size and rotation of the face about the axis coming out of the paper through ellipse fitting. A third agent carries out template matching of the face in the given region. A combiner agent combines the results to finally localize the face.

Data mining and Multiagent systems are both interdisciplinary. Multiagent systems encompass multiple disciplines such as artificial intelligence, sociology, and philosophy. With recent developments, they include many other disciplines, including data mining.

8.5.1 Agent Mining Interaction


much larger than dividing the dataset into p subsets and assigning each subset to an autonomous agent. In other words, O((n1 + n2 + ··· + np)^k) > O(n1^k + n2^k + ··· + np^k). This is a case of agents supporting data mining. Alternately, clustering of agents is an example of data mining supporting agents. The literature is replete with a number of examples on both these aspects of agent mining interaction.

The agent mining interaction can take place at many levels such as interface, performance, social, infrastructure, etc

8.5.2 Big Data Analytics

Analytics with Big Data is equivalently called Big Data analytics, Advanced Analytics, Exploratory Analytics, or Discovery Analytics. The business literature uses these terms synonymously.

The challenges in big data analytics are data sizes reaching exabytes; data availability in a distributed manner as against centralized data sources; semi-structured and unstructured datasets; streaming data; flat data schemes as compared to predefined models; complex schemas containing inter-relationships; near-real-time and batch processing requirements; less dependence on SQL; continuous data updates; etc. The analysis methods that are required to be suitably improved for massively parallel processing are Multiagent systems, data mining methods, statistical methods, large data visualization, natural language processing, text mining, graph methods, instantiation approaches to streaming data, etc. Data preprocessing challenges include integration of multiple data types, integrity checks, outlier handling, and missing data issues. Commercial implementation of big data analytics will have to integrate cloud services and the Map-Reduce paradigm.

8.6 Proposed Multiagent Systems

Multiagent systems are suitable for distributed data mining applications. We provide a divide-and-conquer approach to generate abstraction in big data. We provide a few examples of such systems for generating abstraction on large data. The proposed schemes relate to data reduction in terms of identifying representative patterns, reduction in the number of attributes/features, analytics in large datasets, heterogeneous dataset access and integration, and agile data processing.

The schemes are practical and were implemented earlier. We briefly discuss results for some schemes.

8.6.1 Multiagent System for Data Reduction


Fig. 8.2 Multiagent system for prototype selection in big data. In the figure, each clustering agent corresponds to a different clustering algorithm

not be uniform across the datasets. Some datasets could inherently form clusters of a hyper-spherical nature, some could be curvilinear in high dimensions, etc. A single clustering algorithm alone would not be able to capture representative patterns in each such case. For example, for dataset 1 we use partitional clustering method 1, for dataset 2 partitional clustering method 2, for dataset 3 the hierarchical clustering method, etc., as those methods are best suited to the nature of the datasets. Figure 8.2 contains a proposed scheme for a Multiagent system for data reduction. The proposed method addresses each of the three v’s, viz., volume, variety, and velocity of big data.

In the figure, we indicate different clustering algorithms to access the datasets. It should be noted here that, based on preliminary analysis of a sample of the dataset, an appropriate clustering algorithm is chosen for prototype selection for the corresponding dataset. The evaluation of selected prototypes is carried out by an evaluation agent for each combination of dataset and clustering algorithm. An example of an evaluation agent is classification of a test dataset, which is a subset independent of the training dataset.
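A minimal sketch of the arrangement follows, with each "clustering agent" reducing its own dataset to prototypes. The single-pass leader algorithm, the 1-D Gaussian datasets, and the distance threshold are illustrative assumptions; in the proposed system each agent would run the clustering algorithm best suited to its dataset, and an evaluation agent would validate the prototypes.

```python
import random

random.seed(1)

def leader_agent(data, threshold):
    # Clustering agent: the leader algorithm keeps a pattern as a new
    # prototype (leader) unless it lies within `threshold` of one
    # already chosen -- a single-pass method suited to large data.
    leaders = []
    for x in data:
        if all(abs(x - lead) > threshold for lead in leaders):
            leaders.append(x)
    return leaders

# Three 1-D datasets; in a full system each agent could run a different
# clustering algorithm chosen to suit its dataset.
datasets = [[random.gauss(mu, 0.1) for _ in range(100)] for mu in (0, 5, 10)]
prototypes = [p for data in datasets for p in leader_agent(data, 0.5)]
# A few hundred patterns collapse to a handful of representatives,
# which an evaluation agent would then validate (e.g., by classification).
```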

8.6.2 Multiagent System for Attribute Reduction


Fig. 8.3 KDD framework for data abstraction. Multiple activities encompass each box. The dotted line is further expanded separately

Alternatively, feature selection and reduction can be achieved sequentially by the addition of another set of agents at a layer below the clustering agents in the figure.

8.6.3 Multiagent System for Heterogeneous Data Access

One major objective of Big Data is the ability to access and process multiple types of data, such as text messages, numerical, categorical, images, audio messages, etc., and integrate them together for further use, such as generating business intelligence from them. It is an acknowledged fact that data access from different formats consumes a significant amount of time for an experimental researcher. For an operational system, it is always advantageous to have such a multiagent system in place. Given that each of these heterogeneous datasets relates to the same theme, the participating agents need to interact with each other and share information among them. Figure 8.3 contains data analytics in the conventional Knowledge Discovery from Databases (KDD) framework. The figure contains three broad stages of the KDD process. The first block contains the substages of data access, data selection, and generating target data for preprocessing, where preprocessing each data type includes cleansing. The second block corresponds to the substages of data transformation that make data amenable for further processing. The third block corresponds to the application of machine learning, statistics, and data mining algorithms that generate the final data abstraction.


Fig. 8.4 Multiagent system for data access and preprocessing. The objective is to provide a framework where different streams of data are accessed and preprocessed by autonomous agents, which also cooperate with fellow agents in generating integrated data. The data thus provided is further processed to make it amenable for application of data mining algorithms

other. Although in the figure the horizontal arrows indicate exchange of information between adjacent agents, the exchange happens among all the agents; they are depicted as shown for brevity. The preprocessed information is then aggregated by another agent, which makes it amenable for further processing.

8.6.4 Multiagent System for Agile Processing

The proposed system for agile processing is part of the Data Mining process of Fig. 8.3. The system corresponds to the velocity aspect of Big Data. Processing in big data can be real-time, near-real-time, or batch. The need for agility is emphasized in view of large volumes of data, where conventional schemes may not provide insights at the required speed. The following are some options for such processing

• Pattern clustering to reduce the dataset meaningfully through some validation, operating only on the reduced set to generate an abstraction of the entire data

• Focusing on important attributes by removing redundant features

• Compressing the data in some form and operating directly on the compressed data
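The first option above can be sketched with the single-pass leader algorithm, a common choice for large-data prototype selection. The distance threshold and the tiny dataset below are assumed for illustration only:

```python
import math

# Sketch of prototype-based data reduction via leader clustering:
# each pattern either joins an existing leader (within the threshold)
# or becomes a new prototype. Downstream mining then runs only on
# the prototypes instead of the full dataset.
def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def leader_clustering(patterns, threshold):
    leaders = []
    for p in patterns:
        if all(euclidean(p, q) > threshold for q in leaders):
            leaders.append(p)  # p becomes a new prototype
    return leaders

data = [(0.1, 0.2), (0.15, 0.22), (0.9, 0.8), (0.88, 0.79), (0.5, 0.5)]
prototypes = leader_clustering(data, threshold=0.2)
print(len(prototypes))  # reduced set representing the full data
```

The algorithm needs only one scan of the data, which is what makes it attractive when volume and velocity are both high.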


8.7 Summary

In the present chapter, we discussed the big data paradigm and its relationship with data mining. We discussed related terminology such as agents, multiagent systems, and massively parallel databases. We propose to solve big data problems using multiagent systems and provide a few indicative cases of such systems

8.8 Bibliographic Notes


References

A Abouzeid, K Bajda-Pawlikowski, D Abadi, A Silberschatz, A Rasin, HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads, in VLDB’09, France (2009)

A Agogino, K Tumer, Efficient agent-based clustering ensembles, in AAMAS’06 (2006), pp. 1079–1086

S Brin, L Page, The anatomy of a large-scale hypertextual Web search engine. Comput Netw ISDN Syst 30, 107–117 (1998)

L Cao, C Zhang, F-Trade: an agent-mining symbiont for financial services, in AAMAS’07, Hawaii, USA (2007)

J Cohen, B Dolan, M Dunlap, MAD skills: new analysis practices for big data, in VLDB’09, (2009), pp 1481–1492

J Dean, S Ghemawat, MapReduce: simplified data processing on large clusters, in OSDI’04: 6th Symposium on Operating Systems Design and Implementation (2004), pp 137–149

U.M Fayyad, G Piatetsky-Shapiro, P Smyth, R Uthurusamy, Advances in Knowledge Discovery and Data Mining (AAAI Press/MIT Press, Menlo Park/Cambridge, 1996)

J Ferber, Multi-agent Systems: An Introduction to Distributed Artificial Intelligence (Addison-Wesley, Reading, 1999)

S Gurruzzo, D Rosaci, Agent clustering based on semantic negotiation ACM Trans Auton Adapt Syst 3(2), 7:1–7:40 (2008)

G Halevi, Special Issue on Big Data Research Trends, vol 30 (Elsevier, Amsterdam, 2012)

H Herodotou, H Lim, G Luo, N Borisov, L Dong, F.B Cetin, S Babu, Starfish: a self-tuning system for big data analytics, in 5th Biennial Conference on Innovative Data Systems Research (CIDR’11) (USA, 2011), pp 261–272

M Loukides, What is data science, O’Reilly Media, Inc., CA (2011). http://radar.oreilly.com/r2/release-2-0-11.html/

C.D Manning, P Raghavan, H Schutze, Introduction to Information Retrieval (Cambridge University Press, Cambridge, 2008)

L Page, S Brin, R Motwani, T Winograd, The PageRank citation ranking: bringing order to the Web. Technical Report, Stanford InfoLab (1999)

J.J Patil, Data Jujitsu: the art of turning data into product, in O’Reilly Media (2012)

A Pavlo, E Paulson, A Rasin, D.J Abadi, D.J DeWitt, S Madden, M Stonebraker, A comparison of approaches to large-scale data analysis, in SIGMOD’09 (2009)

A Rajaraman, J.D Ullman, Mining of Massive Datasets (Cambridge University Press, Cambridge, 2012)

T Ravindra Babu, M Narasimha Murty, S.V Subrahmanya, Multiagent systems for large data clustering, in Data Mining and Multi-agent Integration, ed by L Cao (Springer, Berlin, 2007), pp 219–238 Chapter 15

T Ravindra Babu, M Narasimha Murty, S.V Subrahmanya, Multiagent based large data clustering scheme for data mining applications, in Active Media Technology, ed by A An et al., LNCS, vol 6335 (Springer, Berlin, 2010), pp 116–127

P Russom, Big data analytics. TDWI Best Practices Report, Fourth Quarter (2011)

J Tozicka, M Rovatsos, M Pechoucek, A framework for agent-based distributed machine learning and data mining, in Autonomous Agents and Multi-agent Systems (ACM Press, New York, 2007), Article No 96

G Weiss (ed.), Multiagent Systems: A Modern Approach to Distributed Artificial Intelligence (MIT Press, Cambridge, 2000)

M Wooldridge, N.R Jennings, Towards a theory of cooperative problem solving, in Proc of Workshop on Distributed Software Agents and Applications, Denmark (1994), pp 40–53

P.C Zikopoulos, C Eaton, D deRoos, T Deutsch, G Lapis, Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data (McGraw-Hill, New York, 2012)


Appendix

Intrusion Detection Dataset—Binary Representation

Network Intrusion Detection Data was used during the KDD-Cup99 contest. Even the 10 %-dataset can be considered large, as it consists of 805,049 patterns, each of which is characterized by 38 features. We use this dataset in the present study and hereafter refer to it as the “full dataset”. In this appendix, we apply the algorithms and methods developed so far to the said dataset and demonstrate their efficient working. With this, we aim to drive home the generality of the developed algorithms

The appendix contains data description and preliminary analysis

A.1 Data Description and Preliminary Analysis

The Intrusion Detection dataset (10 % data) that was used during the KDD-Cup99 contest is considered for the study. The data relates to access of a computer network by authorized and unauthorized users; access by unauthorized users is termed an intrusion. Different costs of misclassification are attached to assigning a pattern belonging to one class to any other class. The challenge lies in detecting intrusions belonging to different classes accurately while minimizing the cost of misclassification. Further, whereas the feature values in the data used in the earlier chapters were binary, the current dataset assumes floating point values.

The training data consists of 41 features. Three of the features are binary attributes, and the remaining are floating point numerical values. For effective use of these attributes along with other numerical features, the attributes need to be assigned proper weights based on domain knowledge; arbitrary weightages could adversely affect classification results. In view of this, only 38 features are considered for the study. On further analysis, it is observed that the values of two of the 38 features in the considered 10 %-dataset are always zero, effectively suggesting exclusion of these two features (features numbered 16 and 17, counting from feature 0). The training data consists of 311,029 patterns, and the test data consists of 494,020 patterns. They are tabulated in Table A.1.
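The exclusion of always-zero features mentioned above amounts to detecting columns with no variation. A minimal sketch, using a made-up toy dataset rather than the actual KDD-Cup99 data:

```python
# Detect features whose value is constant (here, zero) across every
# pattern; such features carry no information and can be dropped,
# as done for features 16 and 17 in the 10 %-dataset.
def constant_feature_indices(patterns):
    n_features = len(patterns[0])
    constant = []
    for j in range(n_features):
        column = [p[j] for p in patterns]
        if min(column) == max(column):
            constant.append(j)
    return constant

patterns = [
    [0.5, 0.0, 3.1, 0.0],
    [0.7, 0.0, 2.9, 0.0],
    [0.6, 0.0, 3.0, 0.0],
]
print(constant_feature_indices(patterns))  # indices of features to drop
```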

T Ravindra Babu et al., Compression Schemes for Mining Large Datasets, Advances in Computer Vision and Pattern Recognition, DOI 10.1007/978-1-4471-5607-9, © Springer-Verlag London 2013


Table A.1 Training and test data description

Description     No of patterns   No of attack types   No of features
Training data   311,029          23                   38
Test data       494,020          42                   38

Table A.2 Attack types in training data

Class    No of types   Attack types
normal   1             normal
dos      6             back, land, neptune, pod, smurf, teardrop
u2r      4             buffer-overflow, loadmodule, perl, rootkit
r2l      8             ftp-write, guess-password, imap, multihop, phf, spy, warezclient, warezmaster
probe    4             ipsweep, nmap, portsweep, satan

Table A.3 Additional attack types in test data

snmpgetattack, processtable, mailbomb, snmpguess, named, sendmail, httptunnel, apache2, worm, sqlattack, ps, saint, xterm, xlock, upstorm, mscan, xsnoop

Table A.4 Assignment of unknown attack types using domain knowledge

Class   Attack types
dos     processtable, mailbomb, apache2, upstorm
u2r     sqlattack, ps, xterm
r2l     snmpgetattack, snmpguess, named, sendmail, httptunnel, worm, xlock, xsnoop
probe   saint, mscan

Not all features are frequent, which is also brought out in the preliminary analysis. We make use of this fact during the experiments


Table A.5 Class-wise numbers of patterns in training data of 494,020 patterns

Class    No of patterns
normal   97,277
u2r      52
dos      391,458
r2l      1126
probe    4107

Table A.6 Class-wise distribution of test data based on domain knowledge

Class    No of patterns
normal   60,593
u2r      70
dos      229,853
r2l      16,347
probe    4166

Table A.7 Cost matrix

Class type   normal   u2r   dos   r2l   probe
normal       0        2     2     2     1
u2r          3        0     2     2     2
dos          2        2     0     2     1
r2l          4        2     2     0     2
probe        1        2     2     2     0

knowledge are considered, and test data is formed accordingly TableA.4contains assigned types based on domain knowledge One important observation that can be made from the mismatch between NN assignment and TableA.4is that the class boundaries overlap, which leads to difficulty in classification TableA.5contains the class-wise distribution of training data TableA.6provides the class-wise distri-bution of test data based on domain knowledge assignment

In classifying the data, each wrong pattern assignment is assigned a cost. The cost matrix is provided in Table A.7. Observe from the table that the cost of assigning a pattern to a wrong class is not uniform. For example, the cost of assigning a pattern belonging to class “u2r” to “normal” is 3, which is more than the cost of assigning a pattern from “u2r” to “dos”, say
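This cost-sensitive scoring can be sketched as follows. The cost values are those of the published KDD-Cup99 contest cost matrix; the confusion counts in the example are made up:

```python
# Score a classifier by average misclassification cost: each test
# pattern contributes cost[true class][predicted class].
cost = {  # cost[true][predicted], from the KDD-Cup99 contest
    "normal": {"normal": 0, "u2r": 2, "dos": 2, "r2l": 2, "probe": 1},
    "u2r":    {"normal": 3, "u2r": 0, "dos": 2, "r2l": 2, "probe": 2},
    "dos":    {"normal": 2, "u2r": 2, "dos": 0, "r2l": 2, "probe": 1},
    "r2l":    {"normal": 4, "u2r": 2, "dos": 2, "r2l": 0, "probe": 2},
    "probe":  {"normal": 1, "u2r": 2, "dos": 2, "r2l": 2, "probe": 0},
}

def average_cost(confusion):
    # confusion[true][predicted] -> number of patterns
    total = sum(sum(row.values()) for row in confusion.values())
    penalty = sum(cost[t][p] * n
                  for t, row in confusion.items() for p, n in row.items())
    return penalty / total

# Illustrative counts: 10 "u2r" patterns misclassified as "normal"
# (cost 3 each), 40 classified correctly.
confusion = {"u2r": {"normal": 10, "u2r": 40, "dos": 0, "r2l": 0, "probe": 0}}
print(average_cost(confusion))  # 30 / 50 -> 0.6
```

Minimizing this average cost, rather than raw error rate, is the evaluation objective of the contest.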

Feature-wise statistics of the training data are provided in Table A.8. The table contains a number of interesting statistics, which can be summarized as follows

• Ranges of mean values (Column 2) of different features are different

• Standard deviation (Column 3), which is a measure of dispersion, is different for different feature values
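The statistics tabulated below can be computed per feature as sketched here; treating "Suprt" as the count of patterns with a nonzero value is an assumption about that column, and the toy column of values is illustrative:

```python
import math

# Per-feature summary statistics as in Table A.8: mean, standard
# deviation (population form), min, max, and support taken as the
# number of nonzero values.
def feature_stats(column):
    n = len(column)
    mean = sum(column) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in column) / n)
    support = sum(1 for x in column if x != 0)
    return {"mean": mean, "sd": sd, "min": min(column),
            "max": max(column), "support": support}

stats = feature_stats([0.0, 2.0, 4.0, 0.0])
print(stats)  # mean 1.5, support 2, min 0.0, max 4.0
```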


Table A.8 Feature-wise statistics

Feature No   Mean value   SD   Min   Max   Bits (VQ)   Resoln   Suprt (of 494,021)
(1)          (2)          (3)  (4)   (5)   (6)         (7)      (8)

1 47.979302 707.745756 58,329 16 1.4e−5 12,350
2 3025.609608 988,217.066787 693,375,616 30 6.0e−10 378,679
3 868.529016 33,039.967815 5,155,468 23 7.32e−8 85,762

4 0.000045 0.006673 0.06 22

5 0.006433 0.134805 3.0 0.06 1238

6 0.000014 0.005510 3.0 0.06 3192

7 0.034519 0.782102 30.0 0.03 63

8 0.000152 0.015520 5.0 0.08 63

9 0.148245 0.355342 1.0 0.06 73,236

10 0.010212 1.798324 884.0 10 9.8e−4 2224

11 0.000111 0.010551 1.0 0.06 55

12 0.000036 0.007793 2.0 0.12 12

13 0.011352 2.012716 993.0 10 9.8e−4 585

14 0.001083 0.096416 28.0 3.2e−2 265

15 0.000109 0.011020 2.0 0.12 51

16 0.001008 0.036482 8.0 0.04 454

17 0.0 0.0 0.0 0

18 0.0 0.0 0.0 0

19 0.001387 0.037211 1.0 0.06 685

20 332.285690 213.147196 511.0 2.0e−3 494,019
21 292.906542 246.322585 511.0 2.0e−3 494,019

22 0.176687 0.380717 1.0 0.06 89,234

23 0.176609 0.381016 1.0 0.06 88,335

24 0.057433 0.231623 1.0 0.06 29,073

25 0.057719 0.232147 1.0 0.06 29,701

26 0.791547 0.388189 1.0 0.06 490,394

27 0.020982 0.082205 1.0 0.06 112,000

28 0.028998 0.142403 1.0 0.06 34,644

29 232.470786 64.745286 255.0 3.9e−3 494,019
30 188.666186 106.040032 255.0 3.9e−3 494,019

31 0.753782 0.410779 1.0 0.06 482,553

32 0.030906 0.109259 1.0 0.06 146,990

33 0.601937 0.481308 1.0 0.06 351,162

34 0.006684 0.042134 1.0 0.06 52,133

35 0.176754 0.380593 1.0 0.06 94,211

36 0.176443 0.380919 1.0 0.06 93,076

37 0.058118 0.230589 1.0 0.06 35,229

