Kurt Hornik and Walter Böhm: Hard and Soft Euclidean Consensus Partitions

Table 2. Formation of a third class in the Euclidean consensus partitions for the Gordon-Vichi macroeconomic ensemble as a function of the weight ratio w between 3- and 2-class partitions in the ensemble.

    w      Members of the third class
    1.5    India
    2.0    India, Sudan
    3.0    India, Sudan
    4.5    India, Sudan, Bolivia, Indonesia
    10.0   India, Sudan, Bolivia, Indonesia
    12.5   India, Sudan, Bolivia, Indonesia, Egypt
    ∞      India, Sudan, Bolivia, Indonesia, Egypt

these, 85 female undergraduates at Rutgers University were asked to sort 15 English terms into classes "on the basis of some aspect of meaning". There are at least three "axes" for classification: gender, generation, and direct versus indirect lineage. The Euclidean consensus partitions with Q = 3 classes put grandparents and grandchildren in one class and all indirect kin into another one. For Q = 4, {brother, sister} are separated from {father, mother, daughter, son}. Table 3 shows the memberships for a soft Euclidean consensus partition for Q = 5 based on 1000 replications of the AO algorithm.

Table 3. Memberships for the 5-class soft Euclidean consensus partition for the Rosenberg-Kim kinship terms data.

    Term            Class 1  Class 2  Class 3  Class 4  Class 5
    grandfather     0.000    0.024    0.012    0.965    0.000
    grandmother     0.005    0.134    0.016    0.840    0.005
    granddaughter   0.113    0.242    0.054    0.466    0.125
    grandson        0.134    0.111    0.052    0.581    0.122
    brother         0.612    0.282    0.024    0.082    0.000
    sister          0.579    0.391    0.026    0.002    0.002
    father          0.099    0.546    0.122    0.158    0.075
    mother          0.089    0.654    0.136    0.054    0.066
    daughter        0.000    1.000    0.000    0.000    0.000
    son             0.031    0.842    0.007    0.113    0.007
    nephew          0.012    0.047    0.424    0.071    0.447
    niece           0.000    0.129    0.435    0.000    0.435
    cousin          0.080    0.056    0.656    0.033    0.174
    aunt            0.000    0.071    0.929    0.000    0.000
    uncle           0.000    0.000    0.882    0.071    0.047

Figure 1 indicates the classes and margins for the 5-class solution. We see that the memberships of 'niece' are tied between columns 3 and 5, and that the margin of 'nephew' is only very small (0.02), suggesting the 4-class solution as the optimal Euclidean consensus representation of the ensemble.

Fig. 1. Classes (indicated by plot symbol and class id) and margins (differences between the largest and second largest membership values) for the 5-class soft Euclidean consensus partition for the Rosenberg-Kim kinship terms data.

Quite interestingly, none of these consensus partitions split according to gender, even though there are such partitions in the data. To take the natural heterogeneity in the data into account, one could try to partition them (perform clusterwise aggregation, Gaul and Schader (1988)), resulting in meta-partitions (Gordon and Vichi (1998)) of the underlying objects. Function cl_pclust in package clue provides an AO heuristic for soft prototype-based partitioning of classifications, making it possible in particular to obtain soft or hard meta-partitions with soft or hard Euclidean consensus partitions as prototypes.
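To make the margin computation behind Figure 1 concrete, the following short sketch recomputes the class assignments and margins from the Table 3 memberships. It is purely illustrative (the clue software referenced above is an R package; Python and NumPy are used here only for exposition):

import numpy as np

# Membership matrix from Table 3 (rows: kinship terms, columns: the 5 classes).
terms = ["grandfather", "grandmother", "granddaughter", "grandson", "brother",
         "sister", "father", "mother", "daughter", "son", "nephew", "niece",
         "cousin", "aunt", "uncle"]
M = np.array([
    [0.000, 0.024, 0.012, 0.965, 0.000],
    [0.005, 0.134, 0.016, 0.840, 0.005],
    [0.113, 0.242, 0.054, 0.466, 0.125],
    [0.134, 0.111, 0.052, 0.581, 0.122],
    [0.612, 0.282, 0.024, 0.082, 0.000],
    [0.579, 0.391, 0.026, 0.002, 0.002],
    [0.099, 0.546, 0.122, 0.158, 0.075],
    [0.089, 0.654, 0.136, 0.054, 0.066],
    [0.000, 1.000, 0.000, 0.000, 0.000],
    [0.031, 0.842, 0.007, 0.113, 0.007],
    [0.012, 0.047, 0.424, 0.071, 0.447],
    [0.000, 0.129, 0.435, 0.000, 0.435],
    [0.080, 0.056, 0.656, 0.033, 0.174],
    [0.000, 0.071, 0.929, 0.000, 0.000],
    [0.000, 0.000, 0.882, 0.071, 0.047],
])

for term, m in zip(terms, M):
    order = np.argsort(m)[::-1]          # classes sorted by decreasing membership
    margin = m[order[0]] - m[order[1]]   # largest minus second-largest membership
    print(f"{term:14s} class {order[0] + 1}  margin {margin:.3f}")

Running this reproduces, for example, the near-zero margin of 'nephew' (0.023) and the exact tie of 'niece' between classes 3 and 5.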
References

BARTHÉLEMY, J.P. and MONJARDET, B. (1981): The median procedure in cluster analysis and social choice theory. Mathematical Social Sciences, 1, 235–267.
BARTHÉLEMY, J.P. and MONJARDET, B. (1988): The median procedure in data analysis: new results and open problems. In: H. H. Bock, editor, Classification and Related Methods of Data Analysis. North-Holland, Amsterdam, 309–316.
BOORMAN, S. A. and ARABIE, P. (1972): Structural measures and the method of sorting. In: R. N. Shepard, A. K. Romney and S. B. Nerlove, editors, Multidimensional Scaling: Theory and Applications in the Behavioral Sciences, 1: Theory. Seminar Press, New York, 225–249.
CHARON, I., DENOEUD, L., GUENOCHE, A. and HUDRY, O. (2006): Maximum transfer distance between partitions. Journal of Classification, 23(1), 103–121.
DAY, W. H. E. (1981): The complexity of computing metric distances between partitions. Mathematical Social Sciences, 1, 269–287.
DIMITRIADOU, E., WEINGESSEL, A. and HORNIK, K. (2002): A combination scheme for fuzzy clustering. International Journal of Pattern Recognition and Artificial Intelligence, 16(7), 901–912.
GAUL, W. and SCHADER, M. (1988): Clusterwise aggregation of relations. Applied Stochastic Models and Data Analysis, 4, 273–282.
GORDON, A. D. and VICHI, M. (1998): Partitions of partitions. Journal of Classification, 15, 265–285.
GORDON, A. D. and VICHI, M. (2001): Fuzzy partition models for fitting a set of partitions. Psychometrika, 66(2), 229–248.
GUSFIELD, D. (2002): Partition-distance: A problem and class of perfect graphs arising in clustering. Information Processing Letters, 82, 159–164.
HORNIK, K. (2005a): A CLUE for CLUster Ensembles. Journal of Statistical Software, 14(12). URL http://www.jstatsoft.org/v14/i12/.
HORNIK, K. (2005b): Cluster ensembles. In: C. Weihs and W. Gaul, editors, Classification – The Ubiquitous Challenge. Proceedings of the 28th Annual Conference of the Gesellschaft für Klassifikation e.V., University of Dortmund, March 9–11, 2004. Springer-Verlag, Heidelberg, 65–72.
HORNIK, K. (2007a): clue: Cluster Ensembles. R package version 0.3-12.
HORNIK, K. (2007b): On maximal Euclidean partition dissimilarity. Under preparation.
HORNIK, K. and BÖHM, W. (2007): Alternating optimization algorithms for Euclidean and Manhattan consensus partitions. Under preparation.
MIRKIN, B.G. (1974): The problem of approximation in space of relations and qualitative data analysis. Automatika y Telemechanika, translated in: Information and Remote Control, 35, 1424–1438.
PAPADIMITRIOU, C. and STEIGLITZ, K. (1982): Combinatorial Optimization: Algorithms and Complexity. Prentice Hall, Englewood Cliffs.
ROSENBERG, S. (1982): The method of sorting in multivariate research with applications selected from cognitive psychology and person perception. In: N. Hirschberg and L. G. Humphreys, editors, Multivariate Applications in the Social Sciences. Erlbaum, Hillsdale, New Jersey, 117–142.
ROSENBERG, S. and KIM, M. P. (1975): The method of sorting as a data-gathering procedure in multivariate research. Multivariate Behavioral Research, 10, 489–502.
RUBIN, J. (1967): Optimal classification into groups: An approach for solving the taxonomy problem. Journal of Theoretical Biology, 15, 103–144.
WAKABAYASHI, Y. (1998): The complexity of computing median relations. Resenhas do Instituto de Matematica e Estatistica, Universidade de Sao Paulo, 3/3, 323–349.
ZHOU, D., LI, J. and ZHA, H. (2005): A new Mallows distance based metric for comparing clusterings. In: ICML '05: Proceedings of the 22nd International Conference on Machine Learning. ISBN 1-59593-180-5. ACM Press, New York, NY, USA, 1028–1035.

Information Integration of Partially Labeled Data

Steffen Rendle and Lars Schmidt-Thieme

Information Systems and Machine Learning Lab, University of Hildesheim
{srendle, schmidt-thieme}@ismll.uni-hildesheim.de

Abstract.
A central task when integrating data from different sources is to detect identical items. For example, price comparison websites have to identify offers for identical products. This task is known, among others, as record linkage, object identification, or duplicate detection. In this work, we examine problem settings where some relations between items are given in advance – for example by EAN article codes in an e-commerce scenario or by manually labeled parts. To represent and solve these problems we bring in ideas of semi-supervised and constrained clustering in terms of pairwise must-link and cannot-link constraints. We show that extending object identification by pairwise constraints results in an expressive framework that subsumes many variants of the integration problem like traditional object identification, matching, iterative problems or an active learning setting. For solving these integration tasks, we propose an extension to current object identification models that assures consistent solutions to problems with constraints. Our evaluation shows that additionally taking the labeled data into account dramatically increases the quality of state-of-the-art object identification systems.

1 Introduction

When information collected from many sources should be integrated, different objects may refer to the same underlying entity. Object identification aims at identifying such equivalent objects. A typical scenario is a price comparison system where offers from different shops are collected and identical products have to be found. Decisions about identities are based on noisy attributes like product names or brands. Moreover, often some parts of the data provide some kind of label that can additionally be used. For example, some offers might be labeled by a European Article Number (EAN) or an International Standard Book Number (ISBN). In this work we investigate problem settings where such information is provided on some parts of the data. We will present three different kinds of knowledge that restrict the set of consistent solutions. For solving these constrained object identification problems we extend the generic object identification model by a collective decision model that is guided by both constraints and similarities.

2 Related work

Object identification (e.g. Neiling 2005) is also known as record linkage (e.g. Winkler 1999) and duplicate detection (e.g. Bilenko and Mooney 2003). State-of-the-art methods use an adaptive approach and learn a similarity measure that is used for predicting the equivalence relation (e.g. Cohen and Richman 2002). In contrast, our approach also takes labels in terms of constraints into account. Using pairwise constraints for guiding decisions is studied in the community of semi-supervised or constrained clustering – e.g. Basu et al. (2004). However, the problem setting in object identification differs from this scenario because in semi-supervised clustering typically a small number of classes is considered, and often it is assumed that the number of classes is known in advance. Moreover, semi-supervised clustering does not use expensive pairwise models that are common in object identification.

3 Four problem classes

In the classical object identification problem C_classic, a set of objects X should be grouped into equivalence classes E_X. In an adaptive setting, a second set Y of objects is available where the perfect equivalence relation E_Y is known.
It is assumed that X and Y are disjoint and share no classes – i.e. E_X ∩ E_Y = ∅.

In real-world problems there is often no such clear separation between labeled and unlabeled data. Instead, only the objects of some subset Y of X are labeled. We call this problem setting the iterative problem C_iter, where (X, Y, E_Y) is given with X ⊇ Y and Y × Y ⊇ E_Y. Obviously, consistent solutions E_X have to satisfy E_X ∩ (Y × Y) = E_Y. Examples of applications for iterative problems are the integration of offers from different sources where some offers are labeled by a unique identifier like an EAN or ISBN, and iterative integration tasks where an already integrated set of objects is extended by new objects.

The third problem setting deals with integrating data from n sources, where each source is assumed to contain no duplicates at all. This is called the class of matching problems C_match. Here the problem is given by X = {X_1, ..., X_n} with X_i ∩ X_j = ∅ for i ≠ j, and the set of consistent equivalence relations is restricted to relations E on X with E ∩ (X_i × X_i) = {(x, x) | x ∈ X_i}. Traditional record linkage often deals with matching problems of two data sets (n = 2).

At last, there is the class of pairwise constrained problems C_constr. Here each problem is defined by (X, R_ml, R_cl), where the set of objects X is constrained by a must-link relation R_ml and a cannot-link relation R_cl. Consistent solutions are restricted to equivalence relations E with E ∩ R_cl = ∅ and E ⊇ R_ml. Obviously, R_cl is symmetric and irreflexive, whereas R_ml has to be an equivalence relation. In all, pairwise constrained problems differ from iterative problems by labeling relations instead of labeling objects. The constrained problem class can better describe local information such as "these two offers are the same/different". Such information can, for example, be provided by a human expert in an active learning setting.

Fig. 1. Relations between problem classes: C_classic ⊂ C_iter ⊂ C_constr and C_classic ⊂ C_match ⊂ C_constr.

We will show that the presented problem classes form a hierarchy C_classic ⊂ C_iter ⊂ C_constr and C_classic ⊂ C_match ⊂ C_constr, but neither C_match ⊆ C_iter nor C_iter ⊆ C_match (see Figure 1). First of all, it is easy to see that C_classic ⊆ C_iter because any problem X ∈ C_classic corresponds to an iterative problem without labeled data (Y = ∅). Also C_classic ⊆ C_match, because an arbitrary problem X ∈ C_classic can be transformed to a matching problem by considering each object as its own dataset: X_1 = {x_1}, ..., X_n = {x_n}. On the other hand, C_iter ⊈ C_classic and C_match ⊈ C_classic, because C_classic is not able to formulate any restriction on the set of possible solutions E as the other classes can do. This shows that:

    C_classic ⊂ C_match,   C_classic ⊂ C_iter    (1)

Next we will show that C_iter ⊂ C_constr. First of all, any iterative problem (X, Y, E_Y) can be transformed to a constrained problem (X, R_ml, R_cl) by setting R_ml ← {(y_1, y_2) | y_1 ≡_{E_Y} y_2} and R_cl ← {(y_1, y_2) | y_1 ≢_{E_Y} y_2}. On the other hand, there are problems (X, R_ml, R_cl) ∈ C_constr that cannot be expressed as an iterative problem, e.g.:

    X = {x_1, x_2, x_3, x_4},   R_ml = {(x_1, x_2), (x_3, x_4)},   R_cl = ∅

If one tries to express this as an iterative problem, one would assign to the pair (x_1, x_2) the label l_1 and to (x_3, x_4) the label l_2. But one has to decide whether or not l_1 = l_2. If l_1 = l_2, then the corresponding constrained problem would include the constraint (x_2, x_3) ∈ R_ml, which differs from the original problem. Otherwise, if l_1 ≠ l_2, this would imply (x_2, x_3) ∈ R_cl, which again is a different problem. Therefore:

    C_iter ⊂ C_constr    (2)
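The reduction from an iterative to a pairwise constrained problem used in this argument is straightforward to write down. The following sketch is purely illustrative (Python; it assumes the labeling E_Y is given as a dictionary mapping each labeled object to its class, and all identifier names are hypothetical):

from itertools import combinations

def iterative_to_constrained(X, labels):
    """Turn an iterative problem (X, Y, E_Y) into a constrained one (X, R_ml, R_cl).

    `labels` maps each object of the labeled subset Y to its class under E_Y;
    objects of X not in `labels` are unlabeled.  Two labeled objects are
    must-linked iff their labels agree, cannot-linked otherwise.
    """
    Y = list(labels)
    R_ml, R_cl = set(), set()
    for y1, y2 in combinations(Y, 2):
        if labels[y1] == labels[y2]:
            R_ml.add((y1, y2))
        else:
            R_cl.add((y1, y2))
    return set(X), R_ml, R_cl

# Example: offers x1 and x2 share an EAN code, x3 carries a different one.
X = ["x1", "x2", "x3", "x4"]
labels = {"x1": "EAN-A", "x2": "EAN-A", "x3": "EAN-B"}
print(iterative_to_constrained(X, labels))

Unlabeled objects (here x4) occur in neither relation, which is exactly the freedom an iterative problem leaves to the solution.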
Furthermore, C_match ⊆ C_constr, because any matching problem X_1, ..., X_n can be expressed as a constrained problem with:

    X = X_1 ∪ ... ∪ X_n,   R_cl = {(x, y) | x, y ∈ X_i for some i, x ≠ y},   R_ml = ∅

There are constrained problems that cannot be translated into a matching problem, e.g.:

    X = {x_1, x_2, x_3},   R_ml = {(x_1, x_2)},   R_cl = ∅

Thus:

    C_match ⊂ C_constr    (3)

At last, there are iterative problems that cannot be expressed as matching problems, e.g.:

    X = {x_1, x_2, x_3},   Y = {x_1, x_2},   x_1 ≡_{E_Y} x_2

And there are matching problems that have no corresponding iterative problem, e.g.:

    X_1 = {x_1, x_2},   X_2 = {y_1, y_2}

Therefore:

    C_match ⊈ C_iter,   C_iter ⊈ C_match    (4)

In all, we have shown that C_constr is the most expressive class and subsumes all the other classes.

4 Method

Object identification is generally done by three core components (Rendle and Schmidt-Thieme (2006)):

1. Pairwise Feature Extraction with a function f : X × X → R^n.
2. Probabilistic Pairwise Decision Model specifying probabilities for equivalences P[x ≡ y].
3. Collective Decision Model generating an equivalence relation E over X.

The task of feature extraction is to generate a feature vector from the attribute descriptions of any two objects. Mostly, heuristic similarity functions like TFIDF-Cosine-Similarity or the Levenshtein distance are used. The probabilistic pairwise decision model combines several of these heuristic functions into a single domain-specific similarity function (see Table 1). For this model, probabilistic classifiers like SVMs, decision trees, logistic regression, etc. can be used. By combining many heuristic functions over several attributes, no time-consuming function selection and fine-tuning has to be performed by a domain expert. Instead, the model automatically learns which similarity function is important for a specific problem. Cohen and Richman (2002) as well as Bilenko and Mooney (2003) have shown that this approach is successful. The collective decision model generates an equivalence relation over X by using sim(x, y) := P[x ≡ y] as learned similarity measure. Often, clustering is used for this task (e.g. Cohen and Richman (2002)).

Table 1. Example of feature extraction and prediction of pairwise equivalence P[x_i ≡ x_j] for three digital cameras.

    Object  Brand            Product Name                        Price
    x_1     Hewlett Packard  Photosmart 435 Digital Camera       118.99
    x_2     HP               HP Photosmart 435 16MB memory       110.00
    x_3     Canon            Canon EOS 300D black 18-55 Camera   786.00

    Object Pair  TFIDF-Cos. Sim.  FirstNumberEqual  Rel. Difference  Feature Vector   P[x_i ≡ x_j]
                 (Product Name)   (Product Name)    (Price)
    (x_1, x_2)   0.6              1                 0.076            (0.6, 1, 0.076)  0.8
    (x_1, x_3)   0.1              0                 0.849            (0.1, 0, 0.849)  0.2
    (x_2, x_3)   0.0              0                 0.860            (0.0, 0, 0.860)  0.1
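To make the pairwise feature extraction and decision model more concrete, here is a minimal sketch along the lines of Table 1. It is illustrative only: the plain token-count cosine is a crude stand-in for a learned TFIDF-cosine similarity (so its values will not reproduce the first column of Table 1 exactly), the three toy training labels are hypothetical, and scikit-learn's logistic regression merely stands in for whichever probabilistic classifier is actually trained:

import math
import re
from collections import Counter
from sklearn.linear_model import LogisticRegression

def cosine_sim(a, b):
    """Token-count cosine similarity (stand-in for TFIDF-cosine)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0

def first_number_equal(a, b):
    na, nb = re.search(r"\d+", a), re.search(r"\d+", b)
    return 1.0 if na and nb and na.group() == nb.group() else 0.0

def rel_difference(a, b):
    return abs(a - b) / max(a, b)

def features(x, y):
    return [cosine_sim(x["name"], y["name"]),
            first_number_equal(x["name"], y["name"]),
            rel_difference(x["price"], y["price"])]

offers = {
    "x1": {"name": "Photosmart 435 Digital Camera", "price": 118.99},
    "x2": {"name": "HP Photosmart 435 16MB memory", "price": 110.00},
    "x3": {"name": "Canon EOS 300D black 18-55 Camera", "price": 786.00},
}
pairs = [("x1", "x2"), ("x1", "x3"), ("x2", "x3")]
F = [features(offers[a], offers[b]) for a, b in pairs]

# Toy pairwise decision model; in practice it is trained on the labeled set Y.
model = LogisticRegression().fit(F, [1, 0, 0])
for (a, b), p in zip(pairs, model.predict_proba(F)[:, 1]):
    print(f"P[{a} equiv {b}] ~ {p:.2f}")

The price feature reproduces the "Rel. Difference" column of Table 1 (e.g. |118.99 - 110.00| / 118.99 = 0.076), which is how the three heuristic functions combine into one feature vector per pair.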
4.1 Collective decision model with constraints

The constrained problem easily fits into the generic model above by extending the collective decision model by constraints. As this stage might be solved by clustering algorithms in the classical problem, we propose to solve the constrained problem by a constraint-based clustering algorithm. To enforce constraint satisfaction we suggest a constrained hierarchical agglomerative clustering (HAC) algorithm. Instead of a dendrogram, the algorithm builds a partition where each cluster should contain equivalent objects. Because in an object identification task the number of equivalence classes is almost never known, we suggest model selection by a (learned) threshold T on the similarity of two clusters in order to stop the merging process. A simplified representation of our constrained HAC algorithm is shown in Algorithm 1. The algorithm initially creates a new cluster for each object (line 2) and afterwards merges clusters that contain objects constrained by a must-link (lines 3-7). Then the most similar clusters that are not constrained by a cannot-link are merged as long as their similarity is at least T.

From a theoretical point of view this task might be solved by an arbitrary probabilistic HAC algorithm using a special initialization of the similarity matrix and minor changes in the update step of the matrix. For satisfaction of the constraints R_ml and R_cl, one initializes the similarity matrix for X = {x_1, ..., x_n} in the following way:

    A^0_{j,k} = +∞              if (x_j, x_k) ∈ R_ml
                −∞              if (x_j, x_k) ∈ R_cl
                P[x_j ≡ x_k]    otherwise

As usual, in each iteration the two clusters with the highest similarity are merged. After merging cluster c_l with c_m, the dimension of the square matrix A reduces by one – both in columns and rows. For ensuring constraint satisfaction, the similarities between c_l ∪ c_m and all the other clusters have to be recomputed:

    A^{t+1}_{(l∪m),i} = +∞                  if A^t_{l,i} = +∞ ∨ A^t_{m,i} = +∞
                        −∞                  if A^t_{l,i} = −∞ ∨ A^t_{m,i} = −∞
                        sim(c_l ∪ c_m, c_i) otherwise

For calculating the similarity sim between clusters, standard linkage techniques like single-, complete- or average-linkage can be used.

Algorithm 1 Constrained HAC Algorithm
 1: procedure ClusterHAC(X, R_ml, R_cl)
 2:   P ← {{x} | x ∈ X}
 3:   for all (x, y) ∈ R_ml do
 4:     c_1 ← c where c ∈ P ∧ x ∈ c
 5:     c_2 ← c where c ∈ P ∧ y ∈ c
 6:     P ← (P \ {c_1, c_2}) ∪ {c_1 ∪ c_2}
 7:   end for
 8:   repeat
 9:     (c_1, c_2) ← argmax_{c_1, c_2 ∈ P ∧ (c_1 × c_2) ∩ R_cl = ∅} sim(c_1, c_2)
10:     if sim(c_1, c_2) ≥ T then
11:       P ← (P \ {c_1, c_2}) ∪ {c_1 ∪ c_2}
12:     end if
13:   until sim(c_1, c_2) < T
14:   return P
15: end procedure
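As an executable illustration of Algorithm 1, the following Python sketch is one possible reading of it (illustrative only: it recomputes an average-linkage cluster similarity directly from the pairwise values instead of maintaining the matrix A, and cannot-link constraints are enforced by assigning blocked cluster pairs a similarity of −∞; the parameters pair_sim and T are assumed to come from the trained pairwise model and from model selection):

from itertools import combinations

def constrained_hac(X, R_ml, R_cl, pair_sim, T):
    """Simplified constrained HAC, cf. Algorithm 1.

    pair_sim(x, y) is the learned pairwise similarity P[x equiv y];
    R_ml / R_cl are sets of must-link / cannot-link pairs.
    """
    cl = {frozenset(p) for p in R_cl}

    def sim(c1, c2):
        # -inf blocks cannot-linked clusters, otherwise average linkage.
        if any(frozenset((x, y)) in cl for x in c1 for y in c2):
            return float("-inf")
        return sum(pair_sim(x, y) for x in c1 for y in c2) / (len(c1) * len(c2))

    # Line 2: every object starts in its own cluster.
    P = {frozenset([x]) for x in X}

    # Lines 3-7: merge clusters joined by a must-link constraint.
    for x, y in R_ml:
        c1 = next(c for c in P if x in c)
        c2 = next(c for c in P if y in c)
        if c1 != c2:
            P = (P - {c1, c2}) | {c1 | c2}

    # Lines 8-13: greedy agglomeration until the best similarity drops below T.
    while len(P) > 1:
        c1, c2 = max(combinations(P, 2), key=lambda p: sim(*p))
        if sim(c1, c2) < T:
            break
        P = (P - {c1, c2}) | {c1 | c2}
    return P

Calling constrained_hac with pair_sim set to the learned measure sim(x, y) = P[x ≡ y] and a learned threshold T returns the final partition P of X.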
4.2 Algorithmic optimizations

Real-world object identification problems often have a huge number of objects, so an implementation of the proposed constrained HAC algorithm has to consider several optimization aspects. First of all, the cluster similarities should be computed by dynamic programming: the similarities between clusters have to be computed just once and can afterwards be inferred from the values already stored in the similarity matrix:

    sim_sl(c_1 ∪ c_2, c_3) = max{sim_sl(c_1, c_3), sim_sl(c_2, c_3)}                                   (single-linkage)
    sim_cl(c_1 ∪ c_2, c_3) = min{sim_cl(c_1, c_3), sim_cl(c_2, c_3)}                                   (complete-linkage)
    sim_al(c_1 ∪ c_2, c_3) = (|c_1| · sim_al(c_1, c_3) + |c_2| · sim_al(c_2, c_3)) / (|c_1| + |c_2|)   (average-linkage)

Second, a blocker should reduce the number of pairs that have to be taken into account for merging. Blockers like the canopy blocker (McCallum et al. (2000)) reduce the number of pairs very efficiently, so even large data sets can be handled. At last, pruning should be applied to eliminate cluster pairs with similarity below T_prune. These optimizations can be implemented by storing a list of cluster-distance pairs which is initialized with the pruned candidate pairs of the blocker.

5 Evaluation

In our evaluation study we examine whether additionally guiding the collective decision model by constraints improves the quality. Therefore we compare constrained and unconstrained versions of the same object identification model on different data sets. As data sets we use the bibliographic Cora dataset, which is provided by McCallum et al. (2000) and is widely used for evaluating object identification models (e.g. Cohen et al. (2002) and Bilenko et al. (2003)), and two product data sets of a price comparison system. We set up an iterative problem by labeling N% of the objects with their true class label.

For feature extraction of the Cora model we use TFIDF-Cosine-Similarity, Levenshtein distance and Jaccard distance for every attribute. The model for the product datasets uses TFIDF-Cosine-Similarity, the difference between prices and some domain-specific comparison functions. The pairwise decision model is chosen to be a Support Vector Machine. In the collective decision model we run our constrained HAC algorithm against an unconstrained ('classic') one. In each case, we run three different linkage methods: single-, complete- and average-linkage. We report the average F-measure quality of four runs for each of the linkage techniques and for constrained and unconstrained clustering. The F-measure quality is taken on all pairs that are unknown in advance – i.e. pairs that do not link two labeled objects.

    F-Measure = (2 · Recall · Precision) / (Recall + Precision),   Recall = TP / (TP + FN),   Precision = TP / (TP + FP)

Table 2. Comparison of F-measure quality of a constrained to a classical method with different linkage techniques. For each data set and each method the best linkage technique is marked bold.

    Data Set    Method               Single Linkage  Complete Linkage  Average Linkage
    Cora        classic/constrained  0.70/0.92       0.74/0.71         0.89/0.93
    DVD player  classic/constrained  0.87/0.94       0.79/0.73         0.86/0.95
    Camera      classic/constrained  0.65/0.86       0.60/0.45         0.67/0.81

Table 2 shows the results of the first experiment, where N = 25% of the objects for Cora and N = 50% for the product datasets provide labels. As one can see, the best constrained method always clearly outperforms the best classical method. When switching from the best classical to the best constrained method, the relative error reduces by 36% for Cora, 62% for DVD player and 58% for Camera. An informal [...]