
Object Similarity through Correlated Third-Party Objects




DOCUMENT INFORMATION

Basic information

Format
Number of pages: 69
File size: 639.99 KB

Content

Object Similarity through Correlated Third-Party Objects

A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science

By Ting Sa
B.S., Shanghai University of Electric Power, China, 2005

2008
Wright State University

WRIGHT STATE UNIVERSITY
SCHOOL OF GRADUATE STUDIES

AUGUST 11, 2008

I HEREBY RECOMMEND THAT THE THESIS PREPARED UNDER MY SUPERVISION BY Ting Sa ENTITLED Object Similarity through Correlated Third-Party Objects BE ACCEPTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Science.

Guozhu Dong, Ph.D., Thesis Director
Thomas Sudkamp, Ph.D., Department Chair

Committee on Final Examination
Yong Pei, Ph.D.
Krishnaprasad Thirunarayan, Ph.D.

Joseph F. Thomas, Jr., Ph.D., Dean, School of Graduate Studies

Abstract

Sa, Ting. M.S., Department of Computer Science and Engineering, Wright State University, 2008. Object Similarity through Correlated Third-Party Objects.

Given a pair of objects, it is of interest to know how they are related to each other and the strength of their similarity. Many previous studies focused on two types of similarity measures: The first type is based on closeness of attribute values of two given objects, and the second type is based on how often the two objects co-occur in transactions/tuples. In this thesis we study a new “behavior-based” similarity measure, which evaluates similarity between two objects by considering how similar their correlated “third-party” object sets are. Behavior-based similarity can help us find pairs of objects that have similar external functions but do not have very similar attribute values or do not co-occur quite often. After introducing and formalizing behavior-based similarity, we give an algorithm to mine pairs of similar objects under this measure. We demonstrate the usefulness of our algorithm and this measure using experiments on several news and medical datasets.

TABLE OF CONTENTS

1. Introduction
2. Preliminaries and related work
  2.1 Transaction, itemset, and an example on correlation
  2.2 Support and confidence
  2.3 Common correlation measures
    2.3.1 Cosine measure
    2.3.2 All-confidence measure
    2.3.3 Coherence measure
    2.3.4 Cosine, all-confidence and coherence vs other correlation measures
    2.3.5 Comparison for the cosine, all-confidence and coherence
  2.4 Other similarity measures
3. Problem definition
  3.1 Feature-based/co-occurrence-based similarity vs behavior-based similarity
  3.2 Definitions of Sim3P
  3.3 Behavior-based similarity measure
  3.4 Behavior-based similarity measure vs correlation measures
4. Algorithm issues
  4.1 Overview of the algorithm
  4.2 Finding all the objects
  4.3 Finding correlated 3rd party objects
  4.4 Pruning
5. Experimental evaluation
  5.1 Testing data sets
    5.1.1 News data set
    5.1.2 Colon cancer data set
  5.2 Comparing Sim3P with Other Measures
    5.2.1 When other measure values are high, the Sim3P value is high
    5.2.2 High Sim3P does not imply high other measure values
  5.3 Efficiency testing results
6. Conclusions and future work
7. References

LIST OF FIGURES

Figure 1. Data sets for feature-based similarity measures
Figure 2. Data sets for behavior-based similarity measures
Figure 3. The meaning of (Corr(X) + Corr(Y) - Corr(X,Y))
Figure 4. The overview of the algorithm
Figure 5. Process of finding all the objects
Figure 6. A sample of a map
Figure 7. Bit set model
Figure 8. Process of finding the correlated 3rd party objects
Figure 9. Process of finding the shared correlated 3rd party objects
Figure 10. The structure of CorrMap
Figure 11. Identical objects map structure
Figure 12. Identical objects pruning steps
Figure 13. Format of Data Sets for Behavior-Based Similarity
Figure 14. News data set
Figure 15. Transformed news data set
Figure 16. List of 9 categories of news data set
Figure 17. Size of news data set
Figure 18. Original colon cancer data set
Figure 19. Binning steps
Figure 20. Transformed colon cancer data set
Figure 21. The running execution time for news data set

LIST OF TABLES

Table 1. Supermarket data set
Table 2. A 2 × 2 contingency table for two items
Table 3. Comparison of five correlation measures
Table 4. Sample data base with 9 items and 8 transactions
Table 5. The records that A occurs
Table 6. An example for extracting 3P-identical pairs
Table 7. An example for extracting 3P-inclusion pairs
Table 8. Objects distribution according to objects’ category
Table 9. Total Number of Objects-Pairs Distribution according to Objects’ Category
Table 10. Top 10 cosine pairs for colon cancer data set
Table 11. Top 10 all-confidence pairs for colon cancer data set
Table 12. Top 10 coherence pairs for colon cancer data set
Table 13. Top 10 cosine pairs for news data set
Table 14. Top 10 all-confidence pairs for news data set
Table 15. Top 10 coherence pairs for news data set
Table 16. Different results between Sim3P and cosine from the colon cancer data set
Table 17. Different results between Sim3P and all-confidence from the colon cancer data set
Table 18. Different results between Sim3P and coherence from the colon cancer data set
Table 19. Different results between Sim3P and cosine from the news data set
Table 20. Different results between Sim3P and all-confidence from the news data set
Table 21. Different results between Sim3P and coherence from the news data set
Table 22. Objects distribution according to objects’ category after optimization
Table 23. Total number of object-pairs distribution according to objects’ category after optimization
Table 24. The running results for colon cancer data set

Acknowledgement

I would like to give my special thanks to Dr. Dong, for his kindness and patience in guiding me to accomplish this work.
Without his valuable guidance this thesis would not have been possible. I would also like to thank Dr. Yong Pei and Dr. Krishnaprasad Thirunarayan for being a part of my thesis committee and giving me helpful comments and suggestions. Finally, I would like to thank my parents, my uncle and auntie for their support and love throughout my graduate studies at Wright State.

1. Introduction

Given a pair of objects, it is of interest to know how they are related to each other and the strength of their similarity. Similarity measures can be used in many data retrieval, data mining and analysis tasks. For example, we can group the objects of a given application into clusters based on their similarity values; clusters can provide a more efficient organization for retrieving information and can be used to segment patients into groups for improved treatment, and to segment companies or customers for improved business decision making, etc.

Many similarity measures have been proposed previously, which are often based on comparing the objects’ internal feature values or the objects’ co-occurrences [EJ+06, FK+03, HH01, TK+02]. For such measures, if the values of the internal attributes are close to each other or the objects often co-occur in transactions/tuples, then the objects are considered similar.

However, there exist many objects that may not have similar internal features or high co-occurrence frequencies, but are still quite similar to each other. For example, there can be a pair of genes (examples will be given in the experiment section) whose internal structures are not very similar and which seldom co-occur, but whose relationships with other genes are quite similar. It should be interesting to mine these gene pairs since they may provide useful information for biomedical research. We call this kind of similarity behavior-based similarity.
It measures the similarity between two objects by considering how similarly the two objects are related to other third-party objects. Given two objects X and Y, if the set of objects related to X is [...]

...

Definition 3.5 Given two objects X and Y, if X’s correlated 3rd party objects set and Y’s correlated 3rd party objects set have no shared 3rd party objects, then we say X and Y are 3P-dissimilar.

Example 4: For object H, its correlated 3rd party objects set is {H, G}; for object I, its correlated 3rd party objects set is {I, F}. H and I do not share any correlated 3rd party objects, so they are 3P-dissimilar.

...

Definition 3.3 Given two objects X and Y, if X’s correlated 3rd party objects set is the same as Y’s correlated 3rd party objects set, then we say X and Y are 3P-identical.

Example 2: Object A’s correlated 3rd party objects set is {A, B, C, D, E, F}, and object C’s correlated 3rd party objects set is {A, B, C, D, E, F}. Since the above two sets are identical, we say object A and object C are 3P-identical.

...

Definition 3.4 Given two objects X and Y, if X’s correlated 3rd party objects set is a superset (parent set) of Y’s correlated 3rd party objects set, then we say X is a 3P-parent of (3P-includes) Y.

Example 3: For object B, its correlated 3rd party objects set is {A, B, C, D, E, F, G}. Since B’s correlated 3rd party objects set is a superset (parent set) of A (C)’s correlated 3rd party objects set, so we ...

... interesting information for us.

[Figure 2. Data sets for behavior-based similarity measures: object A’s related objects (Object C, Object D, Object E, …) and object B’s related objects (Object C, Object D, Object E, …) are compared to see whether A and B are similar to each other or not.]

In summary, feature-based similarity can be used to find all the feature-based similar objects. It is applicable to ...
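
The set relations in Definitions 3.3 to 3.5 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the thesis's implementation: the function name `relation_3p` and the fallback label "partially overlapping" are hypothetical, and the sets are the ones quoted in Examples 2 to 4.

```python
def relation_3p(corr_x: set, corr_y: set) -> str:
    """Classify a pair by comparing their correlated 3rd party object sets.

    The name and the returned labels are illustrative choices; only the
    three named relations come from Definitions 3.3 to 3.5.
    """
    if corr_x == corr_y:
        return "3P-identical"            # Definition 3.3: same sets
    if corr_x > corr_y:
        return "X 3P-includes Y"         # Definition 3.4: X's set is a proper superset
    if corr_y > corr_x:
        return "Y 3P-includes X"         # Definition 3.4 with the roles swapped
    if not (corr_x & corr_y):
        return "3P-dissimilar"           # Definition 3.5: no shared objects
    return "partially overlapping"       # hypothetical label for all other cases


# Correlated 3rd party object sets as quoted in Examples 2 to 4.
corr = {
    "A": {"A", "B", "C", "D", "E", "F"},
    "B": {"A", "B", "C", "D", "E", "F", "G"},
    "C": {"A", "B", "C", "D", "E", "F"},
    "H": {"H", "G"},
    "I": {"I", "F"},
}

print(relation_3p(corr["A"], corr["C"]))
print(relation_3p(corr["B"], corr["A"]))
print(relation_3p(corr["H"], corr["I"]))
```

The checks are ordered from the strictest relation to the loosest, so an identical pair is never reported as an inclusion.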
Consider objects A and B. When using the behavior-based similarity to find their shared correlated 3rd party objects, we can see that for A, its correlated 3rd party objects set is {A, B, C, D}; for B, its correlated 3rd party objects set is also {A, B, C, D}. Since the two sets are the same, we see that objects A and B are 3P-identical objects. Then we check the correlation relationship between objects ...

... From Definition 3.6, we know how to check whether two objects are behavior-based similar or not. What we need to do is to determine how many correlated 3rd party objects the pair of objects share. If the total number of the shared correlated 3rd party objects is nearly the same as the total number of all the correlated 3rd party objects, then these two objects should be very 3P-similar to each other. Based ...

... behavior-based similarity for a pair of objects:

$$\mathrm{Sim3P}(X, Y) = \frac{\mathrm{Corr}(X, Y)}{\mathrm{Corr}(X) + \mathrm{Corr}(Y) - \mathrm{Corr}(X, Y)} \tag{3.1}$$

In formula 3.1, Corr(X,Y) denotes the total number of the correlated 3rd party objects that relate to both objects X and Y; Corr(X) denotes the total number of the correlated 3rd party objects that relate to object X; Corr(Y) means the total number of the correlated 3rd party objects ...

... those objects that are represented as numerical feature vectors. However, the above measures are all based on the internal features of the objects. None of them evaluates the objects’ similarity through other objects. The results gained from these measures do not include these behavior-based similar objects, and ignoring these behavior-based similar objects limits the usage of similarity ...
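
Formula 3.1 can be sketched as a small set computation: the numerator counts the shared correlated 3rd party objects and the denominator counts all distinct correlated 3rd party objects of the pair. The function name `sim3p` and the handling of two empty sets are assumptions made for illustration; the example sets are the ones quoted in the excerpts.

```python
def sim3p(corr_x: set, corr_y: set) -> float:
    """Sim3P(X, Y) = Corr(X,Y) / (Corr(X) + Corr(Y) - Corr(X,Y))."""
    shared = len(corr_x & corr_y)                # Corr(X, Y)
    total = len(corr_x) + len(corr_y) - shared   # Corr(X) + Corr(Y) - Corr(X, Y)
    # Returning 0.0 for two empty sets is an assumption; the excerpt does not
    # cover that case (each set contains at least the object itself).
    return shared / total if total else 0.0


# 3P-identical pair from the example above: both sets are {A, B, C, D}.
identical = sim3p({"A", "B", "C", "D"}, {"A", "B", "C", "D"})

# Inclusion pair: {C, E} against {A, B, C, E} shares 2 of 4 distinct objects.
inclusion = sim3p({"C", "E"}, {"A", "B", "C", "E"})

# 3P-dissimilar pair: {H, G} and {I, F} share nothing.
dissimilar = sim3p({"H", "G"}, {"I", "F"})

print(identical, inclusion, dissimilar)
```

Under this reading the measure ranges from 0 for 3P-dissimilar pairs to 1 for 3P-identical pairs.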
Definition 3.1 If an object X has a co-occurrence relationship with another object Y, then we say X is a correlated 3rd party object for Y. For a given object, its correlated 3rd party object set is the set that contains all its correlated 3rd party objects and this object itself.

Example 1: Object A appears in the first and third transactions and in none of the other transactions, so objects in these two ...

... illustrate the idea of the proof. Let us look again at the simple example above. For object E, its correlated 3rd party objects set is {C, E}; for object C, its correlated 3rd party objects set is {A, B, C, E}. Since C’s correlated 3rd party objects set is a superset (parent set) of E’s correlated 3rd party objects set, according to Definition 3.4, we see that C 3P-includes E. Then we check their ...
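
Definition 3.1 can be sketched as follows, assuming that "co-occurrence relationship" means appearing together in at least one transaction; the excerpt does not say whether a correlation threshold is also applied. The function name `correlated_sets` and the three transactions are hypothetical and are not the thesis's Table 4.

```python
from collections import defaultdict


def correlated_sets(transactions):
    """Map each object to its correlated 3rd party object set.

    An object's set is the union of every transaction it appears in, so it
    always contains the object itself, as Definition 3.1 requires.
    """
    corr = defaultdict(set)
    for transaction in transactions:
        for obj in transaction:
            corr[obj] |= set(transaction)
    return dict(corr)


# Hypothetical data: A appears only in the first and third transactions.
transactions = [
    {"A", "B", "C"},
    {"B", "G"},
    {"A", "C", "D"},
]

corr = correlated_sets(transactions)
for obj in sorted(corr):
    print(obj, sorted(corr[obj]))
```

With this toy data, A and C end up with the same set, which makes them 3P-identical in the sense of Definition 3.3.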

Date posted: 30/10/2014, 20:11
