A CONTRAST PATTERN BASED CLUSTERING ALGORITHM FOR CATEGORICAL DATA

A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science

By NEIL KOBERLEIN FORE
B.S., Rhodes College, 2003

2010
Wright State University

WRIGHT STATE UNIVERSITY
SCHOOL OF GRADUATE STUDIES

August 27, 2010

I HEREBY RECOMMEND THAT THE THESIS PREPARED UNDER MY SUPERVISION BY Neil Koberlein Fore ENTITLED A Contrast Pattern based Clustering Algorithm for Categorical Data BE ACCEPTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Science.

Guozhu Dong, Ph.D., Thesis Director
Mateen Rizki, Ph.D., Department Chair

Committee on Final Examination
Keke Chen, Ph.D.
Krishnaprasad Thirunarayan, Ph.D.

Andrew T. Hsu, Ph.D., Dean, School of Graduate Studies

ABSTRACT

Fore, Neil Koberlein. M.S., Department of Computer Science and Engineering, Wright State University, 2010. A Contrast Pattern based Clustering Algorithm for Categorical Data.

The data clustering problem has received much attention in the data mining, machine learning, and pattern recognition communities over a long period of time. Many previous approaches to solving this problem require the use of a distance function. However, since clustering is highly explorative and is usually performed on data which are rather new, it is debatable whether users can provide good distance functions for the data. This thesis proposes a Contrast Pattern based Clustering (CPC) algorithm to construct clusters without a distance function, by focusing on the quality and diversity/richness of contrast patterns that contrast the clusters in a clustering. Specifically, CPC attempts to maximize the Contrast Pattern based Clustering Quality (CPCQ) index, which can recognize that expert-determined classes are the best clusters for many datasets in the UCI Repository. Experiments using UCI datasets show that CPCQ scores are higher for clusterings produced by CPC than those by other, well-known clustering algorithms.
Furthermore, CPC is able to recover expert clusterings from these datasets with higher accuracy than those algorithms.

TABLE OF CONTENTS

1. INTRODUCTION AND PROBLEM DEFINITION
2. PRELIMINARIES
  2.1 Clustering, Datasets, Tuples, and Items
  2.2 Frequent Itemsets
  2.3 Terms for CPC
  2.4 Equivalence Classes
  2.5 F1 Score
  2.6 CPCQ
3. RATIONALE AND DESIGN OF ALGORITHM
  3.1 MPD and CPC Concepts
  3.2 MPD Rationale – Mutual Patterns in CP Groups
  3.3 Mutual Pattern Quality
  3.4 Pattern Volume
  3.5 Example
  3.6 MPD Definition
  3.7 The CPC Algorithm
    3.7.1 Step 1: Find Seed Patterns
    3.7.2 Step 2: Add Diversified Contrast Patterns to G1
    3.7.3 Step 3: Add Remaining Patterns Based on Tuple Overlap
    3.7.4 Step 4: Assign Tuples
4. EXPERIMENTAL EVALUATION
  4.1 Datasets and Clustering Algorithms
  4.2 CPC Parameters
  4.3 Experiment Settings
  4.4 SynD Dataset
  4.5 Mushroom Dataset
  4.6 SPECT Heart Dataset
  4.7 Molecular Biology (Splice Junction Gene Sequences) Dataset
  4.8 Molecular Biology (Promoter Gene Sequences) Dataset
  4.9 Effect of Pattern Limit on Execution Time and Memory Use
  4.10 Effect of Pattern Limit on Clustering Quality
  4.11 Effect of Pattern Volume on Clustering Quality
5. RELATED WORKS
6. DISCUSSION AND CONCLUSION
  6.1 Alternative Approaches to Cluster Construction
  6.2 Tuple Diversity
  6.3 Item Diversity
  6.4 Chain Connections through Mutual Patterns
  6.5 Discussion on MPD Values
  6.6 Conclusion and Future Work
REFERENCES

LIST OF FIGURES

1. Intra-Group Connection through a Mutual Pattern
2. Mutual Pattern Quality
3. CPC Algorithm Steps
4. CPC Step 1 Pseudocode
5. CPC Step 2 Pseudocode
6. CPC Step 3 Pseudocode
7. CPC Step 4 Pseudocode
8. Execution Time: Mushroom, minS=0.01
9. Memory Use: Mushroom, minS=0.01
10. Effect of maxP on F1 and CPCQ scores: SPECT Heart
11. Effect of maxP on F1 and CPCQ scores: Mushroom

LIST OF TABLES

1. SynD and its CPC Clustering
2. SynD clusterings and CPCQ Scores
3. Mushroom F1 and CPCQ Scores
4. SPECT Heart F1 and CPCQ Scores
5. Splice F1 and CPCQ Scores
6. Promoter F1 and CPCQ Scores
7. Effect of PV on F1 Score: Mushroom, Splice
8. Effect of PV on CPCQ Score: Mushroom

ACKNOWLEDGEMENTS

I would like to give my special thanks to Dr. Dong, for his kindness and patience in helping me to accomplish this work. Without his valuable guidance, this thesis would not have been possible. I would also like to thank Dr. Keke Chen and Dr. Krishnaprasad Thirunarayan for being a part of my thesis committee and giving me helpful comments and suggestions. Finally, I would like to thank my parents for their support and love throughout my studies at Wright State.

1. INTRODUCTION AND PROBLEM DEFINITION

Clustering is an important unsupervised learning problem with relevance in many applications, especially explorative data analysis, in which prior domain knowledge may be scarce. Traditional approaches to clustering often make use of a distance function to define the similarity between data points and guide the clustering process. Good distance functions are crucial to clustering quality, but they are domain-specific and can require more knowledge than is available to users.

This thesis proposes a novel Contrast Pattern based Clustering (CPC) algorithm for discovering high-quality clusters from categorical data without relying on prior knowledge of the dataset. Since clustering is highly explorative, such an algorithm may often be preferred over one requiring a user-provided distance function. Ideally, this algorithm should be scalable and able to produce clusters that correspond closely to the classes provided by domain experts for datasets having expert-provided classes. To accomplish this, CPC relies only on the frequent patterns of the given dataset.
Specifically, it is designed to maximize the Contrast Pattern based Clustering Quality (CPCQ) score. The CPCQ index has been demonstrated to recognize expert clusterings as superior to those created by well-known algorithms [1]. While the CPCQ index scores whole clusters based on the contrast patterns of those clusters, CPC constructs clusters bottom-up on the basis of frequent patterns only and hence does not have access to the whole clusters (and their associated contrast patterns) during the cluster-construction process. Therefore, the challenge here is to establish a relationship between individual patterns and use it to guide the clustering process. This is done using a [...]

... of all patterns {P_EC | mat(P_EC) = mat(P)}. Each EC can be concisely defined by a closed pattern and a set of minimal generator (MG) patterns. In any EC, no MG pattern is a proper superset of another pattern, and each pattern is a superset of at least one MG pattern and a subset of the closed pattern. For efficiency, CPC does not consider each pattern in an EC. Instead, the term "pattern" refers to an EC, and ...

... Although F1 scores for CPC clusterings are not always higher than those for the other algorithms, CPCQ scores are highest for five of the six CPC clusterings, excluding the expert clustering.

4.6 SPECT HEART DATASET

The SPECT Heart dataset is an example of image data that has been preprocessed into categorical attributes (the preprocessed data is available at UCI). Each of 267 cardiac SPECT images was ...

... {{a1}}) because |mat({a1}) ∩ mat({a2})| = 0 (i.e., diversity is high) and mat({a1})'s only overlapping pattern, {b1}, is a mutual pattern of mat({a1}) and mat({a2}), making MPD({a1}, {a2}) the highest MPD value for C1. Similarly, {a5} would be added to G1(C2), and so on. When completed, G1(C1) = {{a1}, {a2}, {a3}} and G1(C2) = {{a4}, {a5}, {a6}}, and tuples are assigned to clusters as shown in the table ...
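As an illustration of the equivalence-class (EC) definition quoted above, the following Python sketch groups patterns by their matching tuple sets and reports each EC's closed pattern and minimal generators. It is not taken from the thesis: the six rows are an assumed row-wise reading of the visible part of the SynD table, and the names `ROWS`, `mat`, `equivalence_classes`, and `closed_and_generators` are hypothetical. Brute-force enumeration is used only because the example is tiny.

```python
from itertools import combinations

# Assumption: the visible SynD rows, read row-wise as (A1, A2) value pairs.
ROWS = {
    1: frozenset({"a1", "b1"}),
    2: frozenset({"a1", "b1"}),
    3: frozenset({"a2", "b1"}),
    4: frozenset({"a2", "b2"}),
    5: frozenset({"a3", "b2"}),
    6: frozenset({"a3", "b2"}),
}


def mat(pattern, data):
    """Return the ids of all tuples whose itemset contains the pattern."""
    return frozenset(tid for tid, items in data.items() if pattern <= items)


def equivalence_classes(data):
    """Group every pattern with at least one match by its matching tuple set."""
    all_items = sorted(set().union(*data.values()))
    max_len = max(len(items) for items in data.values())
    classes = {}
    for size in range(1, max_len + 1):
        for combo in combinations(all_items, size):
            pattern = frozenset(combo)
            matched = mat(pattern, data)
            if matched:  # ignore patterns that match no tuple
                classes.setdefault(matched, []).append(pattern)
    return classes


def closed_and_generators(patterns):
    """Closed pattern = union of the EC's patterns; MGs = its minimal patterns."""
    closed = frozenset().union(*patterns)
    generators = [p for p in patterns if not any(q < p for q in patterns)]
    return closed, generators


if __name__ == "__main__":
    for matched, patterns in equivalence_classes(ROWS).items():
        closed, generators = closed_and_generators(patterns)
        print(sorted(matched), sorted(closed), [sorted(g) for g in generators])
```

On these rows, {a1} and {a1, b1} match the same tuples and so fall into one EC, with {a1, b1} as its closed pattern and {a1} as its only minimal generator.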
... tuples of a dataset. When a pattern's items are a subset of a tuple's itemset, we say that the tuple matches the pattern. When all of a pattern's matching tuples form a subset of a certain tuple set, we say the tuple set contains the pattern. The support of a pattern is the frequency with which it occurs in the dataset with respect to the total number of tuples in the dataset; this can be expressed as a percentage ...

... maxI=100 and M=1×10^-6, and for SKM, maxI=500 (WEKA's default value for each). Also, for EM, SKM, and FF, only results for seed values 1-4 are shown. This is done for space, but we found these configurations to represent these algorithms well for each dataset. Finally, CPCQ has two parameters: minimum support threshold (minS) and maximum number of CP groups (N). N=5 is used in all datasets, and a reasonable value ...

... clusters based on the clusters associated with their matching patterns. In list form, these steps are:
1. Find K seed patterns, one for each cluster.
2. Add diversified patterns based on MPD values, forming a CP group G1 for each cluster.
3. Add remaining patterns to the pattern sets of the clusters based on tuple overlap.
4. Assign tuples to clusters based on their matching patterns.
These steps are illustrated ...

... value for minS is used for each dataset, depending on its size.

4.4 SYND DATASET

We first show EM, SKM, and FF clusterings for the example dataset SynD used in chapter 3. Each of these algorithms creates a different clustering. These clusterings and their CPCQ scores are shown in Table 2.

Table 2. SynD clusterings and CPCQ scores

tuple ID   A1    A2
1          a1    b1
2          a1    b1
3          a2    b1
4          a2    b2
5          a3    b2
6          a3    b2
7          a4    ...
8-12       ...   ...
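The definitions of matching, containment, and support given earlier in this excerpt can be sketched in a few lines of Python. This is only an illustrative sketch: it assumes the visible SynD rows read row-wise as (A1, A2) pairs, it uses just the six complete rows (so the support values are relative to 6 tuples, not the full 12-tuple SynD), and the helper names `mat`, `support`, and `contains` are hypothetical.

```python
from fractions import Fraction

# Assumption: the six complete SynD rows visible in the excerpt, one itemset per tuple id.
ROWS = {
    1: frozenset({"a1", "b1"}),
    2: frozenset({"a1", "b1"}),
    3: frozenset({"a2", "b1"}),
    4: frozenset({"a2", "b2"}),
    5: frozenset({"a3", "b2"}),
    6: frozenset({"a3", "b2"}),
}


def mat(pattern, data):
    """A tuple matches a pattern when the pattern's items are a subset of its itemset."""
    return {tid for tid, items in data.items() if pattern <= items}


def support(pattern, data):
    """Fraction of tuples in the dataset that match the pattern."""
    return Fraction(len(mat(pattern, data)), len(data))


def contains(tuple_set, pattern, data):
    """A tuple set contains a pattern when all of its matching tuples lie in the set."""
    return mat(pattern, data) <= set(tuple_set)


if __name__ == "__main__":
    print(sorted(mat(frozenset({"a1"}), ROWS)))
    print(support(frozenset({"b1"}), ROWS))
    print(contains({1, 2, 3}, frozenset({"a1"}), ROWS))
```

Exact fractions are used instead of floats so that a support value can be compared against a threshold such as minS without rounding surprises; multiplying by 100 gives the percentage form mentioned in the text.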
For example, if many patterns overlap mat(P), then many mutual patterns may exist between P and each cluster since each overlapping pattern is potentially a mutual pattern, but that does not imply that P is a strong candidate for each cluster when adding patterns to G1. Therefore, MPD values are normalized by the pattern volume (PV) of each argument's matching tuple set. The PV of a tuple set TS is the ...

... contained in the same cluster (e.g., {a2} overlaps mat({b1}) and mat({b2}), {a5} overlaps mat({b3}) and mat({b4}), etc.), and no mutual patterns exist between C1 and C2. When constructing C1 and C2, the seed patterns could be any pair of patterns from separate clusters in this case because the MPD value would be zero for each pair. Suppose {a1} and {a6} are chosen as seeds. Then, {a2} would be added to G1(C1) ...

... ratio, high overlap with mat(P1), high overlap with mat(P2), and low overlap with (D – (mat(P1) ∪ mat(P2))). In addition, MPD values are higher if a larger portion of the patterns overlapping (mat(P1) ∪ mat(P2)) are mutual patterns.

MPD for a pattern P and pattern set PS must also be defined since patterns are to be scored with clusters, represented by pattern sets. MPD(P,PS) can be defined similarly.
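The three overlap quantities named above (overlap with mat(P1), with mat(P2), and with the remaining tuples D – (mat(P1) ∪ mat(P2))) can be counted directly for a candidate mutual pattern. The Python sketch below does only that counting. It does not compute MPD, whose formula is not part of this excerpt, and it assumes the six complete SynD rows read row-wise as (A1, A2) pairs; `overlap_counts` and its dictionary keys are hypothetical names.

```python
# Assumption: the six complete SynD rows visible in the excerpt, one itemset per tuple id.
ROWS = {
    1: frozenset({"a1", "b1"}),
    2: frozenset({"a1", "b1"}),
    3: frozenset({"a2", "b1"}),
    4: frozenset({"a2", "b2"}),
    5: frozenset({"a3", "b2"}),
    6: frozenset({"a3", "b2"}),
}


def mat(pattern, data):
    """Return the ids of all tuples whose itemset contains the pattern."""
    return {tid for tid, items in data.items() if pattern <= items}


def overlap_counts(candidate, p1, p2, data):
    """Count the candidate's matching tuples inside mat(p1), mat(p2), and the rest of D."""
    matched = mat(candidate, data)
    t1, t2 = mat(p1, data), mat(p2, data)
    rest = set(data) - (t1 | t2)  # D - (mat(P1) U mat(P2))
    return {
        "with_p1": len(matched & t1),
        "with_p2": len(matched & t2),
        "outside": len(matched & rest),
    }


if __name__ == "__main__":
    a1, a2, b1 = frozenset({"a1"}), frozenset({"a2"}), frozenset({"b1"})
    # {a1} and {a2} share no tuples, yet {b1} overlaps both matching tuple sets.
    print(len(mat(a1, ROWS) & mat(a2, ROWS)))
    print(overlap_counts(b1, a1, a2, ROWS))
```

For {b1} against {a1} and {a2}, all of its matching tuples fall inside the two matching tuple sets and none fall outside, which is the profile the text associates with a good mutual pattern.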