Link Mining: Models, Algorithms, and Applications

Philip S. Yu · Jiawei Han · Christos Faloutsos
Editors

Editors

Philip S. Yu
Department of Computer Science
University of Illinois at Chicago
851 S Morgan St
Chicago, IL 60607-7053, USA
psyu@cs.uic.edu

Jiawei Han
Department of Computer Science
University of Illinois at Urbana-Champaign
201 N Goodwin Ave
Urbana, IL 61801, USA
hanj@cs.uiuc.edu

Christos Faloutsos
School of Computer Science
Carnegie Mellon University
5000 Forbes Ave
Pittsburgh, PA 15213, USA
christos@cs.cmu.edu

ISBN 978-1-4419-6514-1
e-ISBN 978-1-4419-6515-8
DOI 10.1007/978-1-4419-6515-8
Springer New York Dordrecht Heidelberg London
Library of Congress Control Number: 2010932880

© Springer Science+Business Media, LLC 2010

All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden.

The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.

Printed on acid-free paper

Springer is part of Springer Science+Business Media (www.springer.com)

Preface

With the recent flourishing research activities on Web search and mining, social network analysis, information network analysis, information retrieval, link analysis, and structural data mining, research on link mining has been rapidly growing, forming a new field of data mining.

Traditional data mining focuses on “flat” or “isolated” data in which each data object is represented as an independent attribute vector. However, many real-world data sets are inter-connected, much richer in structure, involving objects of heterogeneous types and complex links. Hence, the study of link mining will have a high impact on various important applications such as Web and text mining, social network analysis, collaborative filtering, and bioinformatics.

As an emerging research field, there are currently no books focusing on the theory and techniques as well as the related applications for link mining, especially from an interdisciplinary point of view. On the other hand, due to the high popularity of linkage data, extensive applications ranging from governmental organizations to commercial businesses to people’s daily life call for exploring the techniques of mining linkage data. Therefore, researchers and practitioners need a comprehensive book to systematically study, further develop, and apply the link mining techniques to these applications.

This book contains contributed chapters from a variety of prominent researchers in the field. While the chapters are written by different researchers, the topics and content are organized in such a way as to present the most important models, algorithms, and applications on link mining in a structured and concise way. Given the lack of structurally organized information on the topic of link mining, the book will provide insights which are not easily accessible otherwise. We hope that the book will provide a useful reference to not only researchers, professors, and advanced level students in computer science but also practitioners in industry.

We would like to convey our appreciation to all authors for their valuable contributions. We would also like to acknowledge that this work is supported by NSF through grants IIS-0905215, IIS-0914934, and DBI-0960443.

Philip S. Yu, Chicago, Illinois
Jiawei Han, Urbana-Champaign, Illinois
Christos Faloutsos, Pittsburgh, Pennsylvania

Contents

Part I Link-Based Clustering

1 Machine Learning Approaches to Link-Based Clustering
Zhongfei (Mark) Zhang, Bo Long, Zhen Guo, Tianbing Xu, and Philip S. Yu

2 Scalable Link-Based Similarity Computation and Clustering 45
Xiaoxin Yin, Jiawei Han, and Philip S. Yu

3 Community Evolution and Change Point Detection in Time-Evolving Graphs 73
Jimeng Sun, Spiros Papadimitriou, Philip S. Yu, and Christos Faloutsos

Part II Graph Mining and Community Analysis

4 A Survey of Link Mining Tasks for Analyzing Noisy and Incomplete Networks 107
Galileo Mark Namata, Hossam Sharara, and Lise Getoor

5 Markov Logic: A Language and Algorithms for Link Mining 135
Pedro Domingos, Daniel Lowd, Stanley Kok, Aniruddh Nath, Hoifung Poon, Matthew Richardson, and Parag Singla

6 Understanding Group Structures and Properties in Social Media 163
Lei Tang and Huan Liu

7 Time Sensitive Ranking with Application to Publication Search 187
Xin Li, Bing Liu, and Philip S. Yu

8 Proximity Tracking on Dynamic Bipartite Graphs: Problem Definitions and Fast Solutions 211
Hanghang Tong, Spiros Papadimitriou, Philip S. Yu, and Christos Faloutsos

9 Discriminative Frequent Pattern-Based Graph Classification 237
Hong Cheng, Xifeng Yan, and Jiawei Han

Part III Link Analysis for Data Cleaning and Information Integration

10 Information Integration for Graph Databases 265
Ee-Peng Lim, Aixin Sun, Anwitaman Datta, and Kuiyu Chang

11 Veracity Analysis and Object Distinction 283
Xiaoxin Yin, Jiawei Han, and Philip S. Yu

Part IV Social Network Analysis

12 Dynamic Community Identification 307
Tanya Berger-Wolf, Chayant Tantipathananandh, and David Kempe

13 Structure and Evolution of Online Social Networks 337
Ravi Kumar, Jasmine Novak, and Andrew Tomkins

14 Toward Identity Anonymization in Social Networks 359
Kenneth L. Clarkson, Kun Liu, and Evimaria Terzi

Part V Summarization and OLAP of Information Networks

15 Interactive Graph Summarization 389
Yuanyuan Tian and Jignesh M. Patel

16 InfoNetOLAP: OLAP and Mining of Information Networks 411
Chen Chen, Feida Zhu, Xifeng Yan, Jiawei Han, Philip Yu, and Raghu Ramakrishnan

17 Integrating Clustering with Ranking in Heterogeneous Information Networks Analysis 439
Yizhou Sun and Jiawei Han

18 Mining Large Information Networks by Graph Summarization 475
Chen Chen, Cindy Xide Lin, Matt Fredrikson, Mihai Christodorescu, Xifeng Yan, and Jiawei Han

Part VI Analysis of Biological Information Networks

19 Finding High-Order Correlations in High-Dimensional Biological Data 505
Xiang Zhang, Feng Pan, and Wei Wang

20 Functional Influence-Based Approach to Identify Overlapping Modules in Biological Networks 535
Young-Rae Cho and Aidong Zhang

21 Gene Reachability Using Page Ranking on Gene Co-expression Networks 557
Pinaki Sarder, Weixiong Zhang, J. Perren Cobb, and Arye Nehorai

Index 569

Index

Density-based clustering (cont.) seed growth, 542–543 Desiderata, 352 component structure, 352 giant component structure, 352 star structure, 352 DEVICE: evolving groups, 97 DEVICE data set, 94 DFS code, 485, 489 Difference matrix, 214–215 Differentiation-based method, 178 Digg.com, 164, 178 Dimensionality reduction, 5, 8, 508–510, 528 Directed acyclic graphs (DAG), 323 Directed time graph, 342 Direct mining strategies, 247 gboost, 247–250 GraphSig, 254–258 LEAP, 250–254 MbT (model-based search tree), 258–260 Dirichlet distribution, 37 Dirichlet process (DP), 33 Discovery-driven InfoNetOLAP, 427–430 Discriminative learning, 153 DISTINCT, 295–297, 299–302 Divisive algorithms, 125 DM, 228, 230 Document–category matrix, 10 Domain–domain interactions, 114 DP algorithm, 367 Dynamic affiliation network, 310, 312, 315 Dynamic bipartite graph, proximity tracking on, 211–212 dynamic proximity, applications, 224–225 track-centrality, 225–226 track-proximity, 226–227 dynamic proximity, computations BB_LIN on static graphs, 219–220 challenges for dynamic setting, 220–221 solutions, 221–224 dynamic proximity and centrality, 216–217 fixed degree matrix, 218–219 static setting, 216 updating the adjacency matrix,
217–218 experimental results, 227 data sets, 228–229 effectiveness, 229–230 efficiency, 232–234 on track-proximity, 230–232 problem definitions, 213–215 related work, 212 dynamic graph mining, 213 static graph mining, 212–213 Dynamic community identification (DCI), 307 algorithms for finding communities, 318–324 approximation via bipartite matching, 321–322 approximation via path cover, 322–323 group graph, 319 optimal individual coloring, 323–324 applications to real-world data sets, 327 data sets, 327–333 results, 332–333 experimental validation, 325 southern women, 325–327 Theseus’ Ship, 325–326 optimization, 314–316 overview, 310–312 problem, 312 complexity, 317–318 formulation, 313 notation and definitions, 312 optimization problem, 314–316 social costs, 316–317 static and dynamic networks, 308–310 Dynamic graph mining, 213 Dynamic programming, 102, 319, 323–324, 327, 330, 332, 366 Dynamic proximity applications, 224–225 track-centrality, 225–226 track-proximity, 226–227 and centrality, 216–217 fixed degree matrix, 218–219 static setting, 216 updating the adjacency matrix, 217–218 computations BB_LIN on static graphs, 219–220 challenges for dynamic setting, 220–221 solutions, 221–224 monotonicity property of, 218–219 Dynamic relational data clustering through graphical models, 31–32 experiments, 37 real data set, 39–42 synthetic data set, 37–39 infinite hierarchical Hidden Markov State model (iH2MS), 32–34 model formulation and algorithm, 34–37 Dynamic social networks, 310–311 See also Social networks E Eager algorithms, 147 Ebay (ebay.com), 179, 338 Ec.sports.hockey, Edge addition, 214, 366, 542 restriction to, 365, 368–370, 375–376 Edge deletion, 219, 362, 365–366 Efficient computation, 435–436 Ego-differentiation, 178 Eigenvalue, 195–196 Email communications, 167 graphs, 340–341 networks, 73–74 EM algorithm, 311 ENRON change point detection, 98 email data set, 74–75, 92–93 graph, 378 Entity identification, 267 Entity resolution, 118–122,
267–269, 273, 275, 278–279 approach, 119–121 definition, 119 for graphs, 269–272 attribute-based entity resolution, 269–270 collective relational entity resolution, 271–272 relational entity resolution, 270–271 issues, 121–122 problems of, 109 Epinions.com, 157 Erdős–Rényi random graphs, 341 eSocial networks affiliation networks, 309 clique-finding techniques, 124 communities in, 310 graph summarization, 475 heterogeneous, 272 identity anonymization in, 359–384 privacy preserving methods, 361 and Markov logic, 136 modeling, 312 and online networks, 108 ranking of, 557–558 SSnetViz, 273–278 structure and evolution of online, 337–355 Euclidean distance, 13–15 Evolution, 164–165 Exhaustive enumeration, 84 Expectation-maximization algorithm, 125 Expectation–maximization (EM) algorithm, 21 Exponential weighting, 218 ExpressDB, 558, 563 histogram, 566 yeast RNA expression data sets, reachability in, 565 Extracting communities, 183 Extraction of differential gene expression (EDGE), 565 F Facebook (facebook.com), 163, 173, 338, 362, 389, 478 Facebook’s “Beacon” service, 362 FacetNet, 311 False negatives, 477 False positives, 477 cause of, 480 verifying, 485–487 summary-guided isomorphism checking, 486–487 Fast-batch-update, 223–224, 232–234 Fast-single-update, 221–223, 234 Feature selection, 241–242, 510 on subgraph patterns, 244–247 Feature selection on subgraph patterns, 244–247 integrating CORK with gSpan to prune search space, 246–247 quality criterion CORK, 246 submodularity, 245–246 Federated database approaches, 267 Federated query processing, 267 Financial transaction data set, 95 Finite Markov chain, 194 First-order knowledge base (KB), 137 First-order logic, 136–140, 158 key property of, 148–149 Flickr (flickr.com), 163–164, 166, 168, 265, 337–340, 342–351, 354–355 Floor of Vectors, 256 FOIL, 154 Forest-fire graph model, 341 Forest fire model, 115 Formal language, 136 Frequent pattern-based graph classification, 237–238 direct mining strategies, 247 gboost,
247–250 GraphSig, 254–258 LEAP, 250–254 MbT (model-based search tree), 258–260 mining subgraph features for classification, 240–241 feature selection on subgraph patterns, 244–247 frequent subgraph features, 241–243 graph fragment features, 243–244 tree and cyclic pattern features, 243–244 problem formulation, 238–239 related work, 239–240 Frequent subgraph, 239, 255, 257–260, 390, 476, 478, 482, 484, 487, 489, 497–498 definition, 239, 478–479 features, 241–243 bug localization, 242–243 selection, 244–246 graph fragment features, 244 structure learning and clustering, 154 F-SimRank, 63 Functional flow, 548 Functional influence model, 536–537, 543–545, 554 Functional modules, 536–539, 548, 550, 552 Function symbols, 137 FVMine, 256 G Gaussian points, 37 Gboost, 247 boosting framework, 247–248 branch-and-bound search approach, 248–250 Gene appearance, 565 Gene clustering, 558–559, 561–563 Gene co-expression networks, 557–558 gene reachability using page ranking on, 557–567 Gene connections in CoE networks, 566–567 Gene expression, 558–559, 563, 565 Gene oncology annotations, 561 General relational clustering, 22–31, 33, 42, 52 through probabilistic generative model, 22–31 General relational clustering through probabilistic generative model, 22–24 experiments, 27 actor–movie data, case study, 31 bi-clustering and tri-clustering, 28–31 graph clustering, 27–28 model formulation and algorithms, 24–27 Gene ranking, 560–561 connectivity information in, 560 VAP, 565 Gene reachability using page ranking on gene co-expression networks, 557–558 application to real data, 563 data sets, 563–565 gene connections in CoE networks, 566–567 reachability in expressDB yeast RNA expression data sets, 565 VAP gene ranking, 565 method, 558 nearest-neighbor-based CoE networks, 558–559 page-rank method for ranking genes, 560–561 replacing α, 561 numerical examples, 561 clustering, 563 estimating αm, 561–562 Genome-wide biological networks,
534 GetQij, 220 Giant components k-cores, 350 Gibbs sampling approaches, 112 Girvan–Newman algorithm, 126 Global aggregation, 217, 317 Global formulations, 112–113 Google Page-Rank algorithm, 557–561 Google scholar, 191–192 Graph anonymity, 362 anonymization, 365–366, 384 classification, 238 clustering, 16, 23–24, 27–28, 436, 441, 454, 536–537 community structures, 390 clustering algorithms, 536 compression, 403–406 MDL representation of graphs, 405–406 S-node representation of web graph, 404–405 construction, 366, 370 basics on realizability of degree sequences, 370–372 Greedy_Swap algorithm, 372–375 restriction to edge additions, 375–376 encoding, 81–82 cost, 82 information integration framework, 268, 278–280 kernel, 241 mining, 76–77 partitioning, 7, 11–14, 23 pattern-based classification of, 239 schema matching, 268 stream, 78, 80 stream encoding, 84 stream segment, 80 Graph databases, information integration for, 265–267 entity resolution for graphs, 269 attribute-based entity resolution, 269–270 collective relational entity resolution, 271–272 relational entity resolution, 270–271 example applications, 272 D-Dupe, 272–273 SSnetViz, 273–278 framework, 267–269 Graphical models, 112, 136, 139, 158, 229 dynamic relational data clustering, 31–42 Graph information integration, 278–279 framework for, 267–269 Graph mining, 253, 257, 436, 476, 498 static, 212–213 Graph-modification algorithms, 360 Graph pattern, 249–250, 476–478, 480–482, 485, 492, 499 Graph pattern mining, 238, 241, 250, 475–477 GraphScope, 76–77, 90 compression objective, 81 graph encoding, 81–82 graph segment encoding, 82–84 graph stream encoding, 84 initialization, 91 partition identification, 84–89 time segmentation, 89–90 Graph segment encoding, 82–84 graph encoding cost, 83–84 partition encoding cost, 82–83 Graph segment partitions, 80 GraphSig, 254–258 application to graph classification, 257–258 calculating p-value of feature vector, 255–256 regions of interest, 256–257 sliding window
across graphs, 255 Graph structure complete coverage, 244 generation process, 244 precise representation, 244 topological complexity, 244 Graph summarization, 391–392, 477 by aggregation, 392 interactive, 389–407 by aggregation, 392 aggregation-based, 393–399 related topics to, 403–407 scalability of, 402 mining large information networks by, 475–500 Graph summarization, mining large information networks by, 475–478 bounding the false-negative rate, 482–485 experimental results, 491–492 real data set, 492–495 synthetic data set, 495–498 iterative SUMMARIZE-MINE, 487–491 preliminaries, 478–479 related work, 498–499 SUMMARIZE-MINE framework, 479 discarding false positives, 481 overall algorithm layout, 481–482 recovering false negatives, 480–481 verifying false positives, 485–487 summary-guided isomorphism checking, 486–487 Graph visualization, 407 Gray circles, 51 GREEDY algorithm, 406 Greedy_Swap_Additions, 378–379 Greedy_Swap algorithm, 372–373, 378 Probing scheme, 374–375 Grevy’s zebra (Equus grevyi) data set, 331 Ground-truth gene regulatory networks, 557, 559–560 Group(s), 312, 392 detection, 122–128, 183 problems of, 109 graph of interaction sequence, 321 profiling, 177–179, 183 relationships, 392 structure, 126 and properties in social media, 163–184 GSpan, 485 H Haggle data set, 331 HDP-HTM, 32 model, 35 Heterogeneity, 164 Heterogeneous information network, 430 integrating clustering in, analysis, 439–471 Heterogeneous networks, 183, 441–443, 446, 454–455, 471 community extraction in, 168–176 in social media, 165–168 Heterogeneous networks, 165–168, 183 motivations to study network heterogeneity, 168 Heterogeneous relational clustering, 42 through spectral analysis, 4–11 Heterogeneous relational clustering through spectral analysis, 4–5 experiments, 7–8 clustering on bi-type relational data, 8–9 clustering on tri-type relational data, 9–11 model formulation and algorithm, 5–7 Heterogeneous relational data (HRD), 5–9 structures of, Hidden Markov model
(HMM), 33, 312 Hierarchical classification, 10 Hierarchical clustering approaches, 50 bottom-up approaches, 539 techniques, 125–126 top-down approaches, 539–540 Hierarchical Dirichlet processes (HDP), 32 Hierarchical random graph, 116 Hierarchical transition matrix (HTM), 32, 34 High-dimensional biological data, 505–506 CARE algorithm, 517 choosing the subsets of points, 518–520 feature subsets selection, 518 challenges and contributions, 509–510 experiments, 525 real data, 531–532 synthetic data, 525–531 motivation, 506–509 problem formalization, 511–517 linear reducible subspace, 512–515 nonlinear reducible subspace, 515–517 REDUS algorithm, 520–525 finding overall reducible subspace, 521–522 intrinsic dimensionality estimator, 520–521 maximum reducible subspace, 522–525 related work feature selection, 510 feature transformation, 510 intrinsic dimensionality, 511 subspace clustering, 511 High dimensional data, 505–506, 509–511, 517 HITS, 187, 190–191, 440, 470 Homogeneous network, 440 Homogeneous relational clustering, 4, 11–19, 42 Homogeneous relational clustering through convex coding, 11–13, 42 experiments, 16 data sets and parameter setting, 16–17 results and discussion, 17–19 model formulation and algorithms, 13–15 Homogeneous relational data, 19–20 experiments, 21–22 model formulation and algorithm, 20–21 Homophily, 136, 177 Horn clauses, 138 Huffman coding, 81 Hyperedges, 115 Hypergraph, 115 I I-aggregated graph, 417 ICA, see Iterative classification algorithm (ICA) ICDE 2005, 415 Identity anonymization in social networks, 359–361 degree anonymization, 366–368 restriction to edge additions, 368–370 experiments, 376 data sets, 377–378 evaluating graph construction algorithms, 378–383 graph construction, 370 basics on realizability of degree sequences, 370–372 Greedy_Swap algorithm, 372–375 restriction to edge additions, 375–376 overview, 365–366 problem definition, 363–365 restriction to edge additions, 365 related work, 361–363 Identity disclosure,
360 I-divergence, 13, 15 IID assumption, 23 IMDB.com, 284 Independent and identically distributed (IID) assumption, 109 Inductive logic programming (ILP), 138, 154 Infinite hierarchical hidden Markov state model (iH2MS), 32–34 InfoNetOLAP, 409–411 aggregated graph, 411 constraints and partial materialization, 425 classification, 426 constraint pushing, 426–427 discovery-driven, 427–430 guiding principles, 428–428 network discovery for effective OLAP, 428–429 experiments real data set, 431–432 synthetic data sets, 432–435 framework, 412, 415–418 informational dimensions of, 416 measure classification, 418–420 I-algebraic, 419 I-distributive, 419 I-holistic, 419–420 optimization, 420–422 attenuation, 423–425 localization, 422–423 related work, 435–436 systematic study, 414–415 Informational dimensions of InfoNetOLAP, 416 Informational OLAP, 414, 416, 436 Information centrality, 425 Information dimension, 511 Information integration for graph databases, 265–267 entity resolution for graphs, 269 attribute-based entity resolution, 269–270 collective relational entity resolution, 271–272 relational entity resolution, 270–271 example applications, 272 D-Dupe, 272–273 SSnetViz, 273–278 framework, 267–269 Information loss, 159 Information network, 415, 440 Information-theoretic co-clustering (ITCC), 77 Ingenuity pathway analysis (IPA), 558, 565 Initialization, 91, 148, 225 Instance-level heterogeneity, 267 Instance-level integration, 267 Integrated social network subgraph, 280 Integrating clustering with ranking, 439–442 experiments, 463 NetClus, 466–469 RankClus, 464–466 NetClus, 454–456 algorithm summary and time complexity analysis, 463 framework, 456 posterior probability for target objects and attribute objects, 458–461 probabilistic generative model, 456–458 ranking distribution for attribute objects, 462–463 RankClus, 447 algorithm summarization, 451–453 cluster centers and distance measure, 451 extensions to arbitrary multi-typed information network,
453–454 mixture model of conditional rank distribution, 448–451 overview, 447–448 ranking functions, 442–444 alternative ranking functions, 446–447 authority ranking, 444–446 simple ranking, 444 related work, 469–471 Interacting relationships, 412 Interaction sequence, 312 Interactive graph summarization, 389–391 aggregation-based graph summarization, 393 k-SNAP operation, 395–398 SNAP operation, 393–395 top-down k-SNAP approach vs bottom-up k-SNAP approach, 399 application on coauthorship graphs, 399–402 discussion, 403 related topics graph compression, 403–406 graph visualization, 407 scalability of, 402 summary graphs, 392 Inter-entity similarity, 269–271 Interestingness function, 426 Interlacing eigenvalues theorem, 513 Internet Movie Database (IMDb), 266 Internet topology, 340 Inter-object relationships, 50 Interpretation, 137 Intranode graphs, 404 Inviters, 352–353 I-OLAP, 421 IPA, see Ingenuity pathway analysis (IPA) IP addresses, 108 Isolated communities, 339 ISOMAP, 510 Iteration, 482 Iterative algorithm, Iterative bound optimization, 13 Iterative centroid search, 550 Iterative classification algorithm (ICA), 111–112 Iterative SUMMARIZE-MINE, 487–491 J Jaccard coefficient, 64, 120, 270 Jaro-Winkler score, 120 K K-anonymous vector of integers, 363 KDD, 65, 231 explorations, 107 K-degree anonymous graph, 363 Knowledge base (KB), 137–138 K-SNAP operation, 393, 395–398 limitations, 395–396 measuring quality of k-SNAP summaries, 396–397 bottom-up k-SNAP approach, 397–398 top-down k-SNAP approach, 397–398 L Lanczos algorithm, 77 Laplacian eigenmaps, 510 Large information networks graph summarization, 475–500 Large-Scale Networks, 164 Latent Dirichlet allocation (LDA), 20, 121, 272 Laws of commutation, association, and distribution, 58 Lazy-A, 148 Lazy inference, 146–148 LazySAT, 148 Leaf nodes, 51–52 Leak detection, 115 LEAP, 250 mining discriminative patterns for graph classification, 252 mining discriminative subgraphs for bug localization, 253–254
structural leap search, 250–254 Levenshtein (edit distance), 120 Lexicographical order, 485 Lifted inference, 148–152 Lifted network, 150 construction algorithm, 151 Linear algebra, 225 Linear discriminant analysis (LDA), 506 Linear reducible subspace, 515 Linear regression, 198–199 Linguistic networks, 340 Linkage-based clustering, 45–51 empirical study, 63–64 DBLP database, 64–67 evaluation measures, 64 synthetic databases, 67–70 SimTree, 48–49, 51–53 SimTree, building, 53 aggregation-based similarity computation, 57–60 complexity analysis, 61–63 initializing, frequent pattern mining, 53–55 iterative adjustment of SimTrees, 61 refining similarity between nodes, 55–57 Linkage-based node similarity, 56 Linkage-based similarity, 56 Link analysis, 436, 469–470 Link-based clustering, 3–4, 155 deterministic approaches, heterogeneous relational clustering through spectral analysis, 4–11 homogeneous relational clustering through convex coding, 11–19 generative approaches, 19 dynamic relational data clustering through graphical models, 31–42 general relational clustering through probabilistic generative model, 22–31 homogeneous relational data, 19–22 LinkClus, 48–50, 52–57, 61–62 Link completion, 115 Link disclosure, 360 LinkedIn (linkedin.com), 389 Linkers, 352–353 Link information, Link matrix, 443 Link prediction, 108, 113–118, 136, 155, 213, 362 problems of, 109 Local conditional classifiers, 110–111 Local correlation, 506 Localization in InfoNet I-OLAP, 422 Localization in InfoNet T-OLAP, 423 Local linear embedding (LLE), 510 Log-linear models, 139 Loopy belief propagation (LBP), 113 Low-rank approximation, 312 Lp-norm, 77 M MAP/MPE inference, 143–144 Marginal and conditional probabilities, Markov logic, 144–146 Markov blanket, 142 Markov chain, 191, 195 Markov chain Monte Carlo (MCMC), 136, 142 Markov clustering (MCL), 538 Markov logic, 135–137, 139–142 Alchemy system, 157–158 applications, 155 collective classification, 155–156 viral marketing, 156–157
first-order logic, 137–139 inference, 142 MAP/MPE inference, 143–144 marginal and conditional probabilities, 144–146 Markov network inference, 142–143 scaling up inference, 145–152 learning, 152 discriminative weight learning, 153–154 generative weight learning, 152–153 Markov network learning, 152 structure learning and clustering, 154–155 Markov networks, 139 Markov logic decision networks (MLDN), 141–142, 157 Markov logic network (MLN), 140–141, 152 Markov models, 153 Markov networks, 136, 139, 156 inference, 142–143 MLN, 140 Markov process to model dynamic communities, 311 Markov random fields, 24, 139 Matrix inversion, 219, 223, 226–227, 232 Maximal marginal relevance (MMR), 244 Maximum a posteriori (MAP) inference, 143 Maximum reducible subspace, 517 intrinsic dimensionality-based method, 523–524 point distribution-based method, 524–525 MaxWalkSAT, 143–144, 147–148, 153 MCMC, 145 inference algorithm for MLN, 146 MC-SAT algorithm, 145 MDL representation, 406 Mean-field relaxation labeling, 113 MergeDist, 398 Metagroup, 310 approach, 310–311 METIS, 15–18, 76 “Middle band” activity versus “core” activity, 355 Minimality, 394 Minimum description length (MDL), 74, 76 Minimum path cover problem, 323 Mining case studies, 95 CELLPHONE: evolving groups, 97 DEVICE: evolving groups, 97 ENRON: change point detection, 98 NETWORK: interpretable groups, 95–97 TRANSACTION, 97–98 Mining dynamic graphs, 77 Mining large information networks by graph summarization, 475–478, 481 bounding the false-negative rate, 482–485 experimental results, 491–492 real data set, 492–495 synthetic data set, 495–498 iterative SUMMARIZE-MINE, 487–491 preliminaries, 478–479 related work, 498–499 SUMMARIZE-MINE framework, 479 discarding false positives, 481 overall algorithm layout, 481–482 recovering false negatives, 480–481 verifying false positives, 485–487 summary-guided isomorphism checking, 486–487 Mining optimal bug signature, 253 Mining quality, 91 Mining static graphs, 76–77 Mining
subgraph features for classification, 240–241 feature selection on subgraph patterns, 244–247 frequent subgraph features, 241–243 graph fragment features, 243–244 tree and cyclic pattern features, 243–244 Mining time-evolving graphs, 213 Mining Top-K Bug Signatures, 254 Mixed membership models, 25 Mixed membership relational clustering, 25 Mixed membership relational clustering (MMRC), 25–27, 31 Mixture model of conditional rank distribution parameter estimation using EM algorithm, 450–451 target object, 448–450 MMRFS, 244 Model-based search tree (MbT), 258–260 bound on number of returned features, 260 pattern enumeration scalability analysis, 259–260 Modeling and simulation of functional influence, 543 efficiency analysis, 546–547 functional influence model, 543–545 simulation of functional influence, 545–546 Modularity, 127 Modularity-based techniques, 126–127 Modularity maximization, 173 Modularization algorithm, 548 functional influence simulation, 549 post-process, 549–550 source selection, 548–549 Module detection approaches, survey, 537 density-based clustering, 540–543 hierarchical clustering, 539–540 partition-based clustering, 537–538 Modulo inference, 152 Molecular complex detection (MCODE), 540–541 Monotone, 426 Most probable explanation (MPE) inference, 143 Multi-dimensional networks, 166–168, 172–176, 183 multi-mode and, 176 Multi-dimensional scaling (MDS), 510 Multi-dimensional view, 431 Multi-level view, 412 Multi-mode and multi-dimensional networks, connections between, 176 Multi-mode networks, 165–166, 168–172, 176, 183 in academia, 166 notations, 169 in YouTube, 165–166 Multi-relational network, 167 Multi-resolution summaries, 393 Mutual Reinforcement K-means (MRK), 9, 11 Myspace (myspace.com), 163, 338, 389, 478 N Naïve relational entity resolution approach, 120, 271 Navigability in social networks, 341 NCSTRL, 192 NDLTD, 192 Negated atomic formula, 137 Negative edges, 114 Negative superedge graphs, 405 Neighbor tuples, 296–297 set
resemblance of, 298 NetClus, 454–456, 466–469, 470–471 accuracy study, 468–469 algorithm summary and time complexity analysis, 463 case study, 466 data set, 466 framework, 456 posterior probability for target objects and attribute objects, 458–461 probabilistic generative model, 456–458 ranking distribution for attribute objects, 462–463 study on parameters, 467–468 study on ranking functions, 466–467 Net-cluster, 442, 455, 456–458, 471 of database area, 442 probabilistic generative model for target objects, 456–458 ranking distributions in, 466–467 NetFlix data set, 215, 220, 229, 232–233 Netflix (netflix.com), 215, 219–220, 228–229, 232–234, 284, 390 Network evolution, rudimentary model of, 340 flow data set, 92 interpretable groups, 95–97, 100 packets, 74 viewer, 274 New pages, 189 NIPS, 228–229 Node attribute-based approaches, link prediction, 116–117 Noisy and incomplete networks, 107–108 collective classification, 109–113 approaches, 110–113 definition, 110 entity resolution, 118 approach, 119–121 definition, 119 issues, 121–122 group detection, 122 approaches, 123–127 definition, 122 issues, 127–128 link prediction, 113–114 approach, 116–117 definition, 114–115 issues, 117–118 terminology and notation, 109 Noisy data, 395, 540 Non-disjoint affiliations, 311 Non-giant component, 339, 345 Non-negative real-valued function, 139 Non-sibling nodes, 58 Normalized cut (NC), 16–17 spectral clustering, Normalized mutual information (NMI), 8, 17–18, 27 Null model, 126 O Object distinction, 294 veracity analysis and, 283–303 identical names, 294–303 Object distinction, veracity analysis and, 283–285 distinguishing objects with identical names, 294–296 authorship on DBLP, case study, 300–303 clustering references, 299–300 similarity between references, 296–298 supervised learning with automatically constructed training set, 298–299 information with multiple conflicting information providers on the web, 285–286 book authors, case study on, 291–294
computational model, 287–291 problem definitions, 286–287 OEM, 266 OLAP-style aggregation methods, 391 Old pages, 188–189 Onagers (Equus hemionus khur) data set, 331 One-pruned subgraph, 350 On-line analytical processing (OLAP), 411–415, 435–436, 476 Online friendship, 341 Online social networks, 337–340 future works, 354–355 measurements, 342 basic timegraph properties, 343–345 component properties, 345–346 data sets, 342–343 structure of giant component, 349–352 structure of middle region, 346–349 model, 340, 352 description, 352–353 desiderata, 352 simulations, 353–354 organization, 340 related work experimental studies, 340–341 mathematical models, 341 Optimal group interaction matrix, 170–171 OSB00, 329 Overlapping modules in biological networks, 535–537 modeling and simulation of functional influence, 543 efficiency analysis, 546–547 functional influence model, 543–545 simulation of functional influence, 545–546 modularization algorithm, 548 functional influence simulation, 549 post-process, 549–550 source selection, 548–549 module detection approaches, survey, 537 density-based clustering, 540–543 hierarchical clustering, 539–540 partition-based clustering, 537–538 protein interaction networks, application to, 550 data source, 550–552 identification of overlapping modules, 552–553 statistical assessment of modules, 553 P PageRank, 187, 189–191, 440, 557, 560 Pairwise Markov random field (pairwise MRF), 112–113 Partial feature coverage, 242 Partial materialization, 426 Partition-based clustering, 537–538 Partition change rate, 100–101 Partition encoding cost, 82 Partition function, 139 PartitionIdentification, 80, 84–89 cost computation for partition assignments, 87–89 determining the number of partitions, 86–87 finding the best partitions, 85–86 Passive users, 352 Path-based Node Similarity, 52–53 Path-based similarity, 52 PathCoverCommunities (PCC), 323, 327, 332–333 Pattern-based classification of graph, 239 classification based on associations (CBA),
239 classification based on multiple association rules (CMAR), 239 classification based on predictive association rules (CPAR), 239–340 HARMONY, 240 RCBT, 240 Pattern mining, discovery-driven InfoNetOLAP, 429 Pattern tree, 486 PCC, see PathCoverCommunities (PCC) Periodic non-affiliation network, 310 Phone call graphs, 340 Plains Zebra (Equus burchelli) data set, 331 Poisson distribution, 27 Political violence and terrorism research (PVTR) network, 274–275 PopRank, 191 Positive superedge graphs, 404 PostgreSQL database, 402 Potential co-referent pairs, 121 Potential edges, 114 Potential function, 139 Powergrid graph, 378 Predicate symbols, 137 Preferential attachment model, 115 Principal component analysis (PCA), 174, 506–511 Privacy, 361–362 data sets, 342–343 preserving data analysis, 360, 384 Privacy breach, 361 Privacy-preserving data analysis, 384 Probabilistic latent semantic indexing (PLSI), 19–20 Probabilistic model, 121, 141, 157, 272, 295 Program analysis data, 492–495 Projected clustering algorithms, 506–507 Prolific attribute, 400 Prolog programming language, 138 Proof, 424 Protection of links between individual graph entities, 362 Protein interaction networks, 535, 538, 543 application, 550–553 data source, 550–552 identification of overlapping modules, 552–553 statistical assessment of modules, 553 COD method, 542 MCODE, 541 Protein-protein interaction (PPI) networks, 108, 114, 116–118 Proximity tracking on dynamic bipartite graph, 211–212 dynamic proximity, applications, 224–225 track-centrality, 225–226 track-proximity, 226–227 dynamic proximity, computations BB_LIN on static graphs, 219–220 challenges for dynamic setting, 220–221 solutions, 221–224 dynamic proximity and centrality, 216–217 fixed degree matrix, 218–219 static setting, 216 updating the adjacency matrix, 217–218 experimental results, 227 data sets, 228–229 effectiveness, 229–230 efficiency, 232–234 on track-proximity, 230–232 problem
definitions, 213–215 related work, 212 dynamic graph mining, 213 static graph mining, 212–213 P-SimRank, 63, 65 PTrack, 215 Publication database (PubDb), 45–46, 50 Publication search, time sensitive ranking and, 187–190 empirical evaluation, 199 experimental results with all papers, 199–202 experimental settings, 199 results of top 10 papers, 202–203 results on new papers only, 203–204 sensitivity analysis, 204–206 linear regression, 198–199 related work, 190–192 TS-rank, proposed, 192–193 source evaluation, 196–197 trend factor, 197–198 TS-rank algorithm, 193–196 Publication search to Web search, 206–208 PubNum, 399, 402 Q Quadratic complexity, 49 Quality pages, 188 Quasi-Newton optimization methods, 152 R Random dynamic network, 309 Random graphs, 377 models, 115 Randomized summarization, 481, 489, 493 Randomizing technique, 477–478 Random walk probability, 298 Random walk with restart (RWR), 216, 255 RankClus, 442, 447, 464–466, 470 accuracy and efficiency study on synthetic data, 464–465 algorithm summarization, 451–453 cluster centers and distance measure, 451 DBLP data set, case study, 465–466 extensions to arbitrary multi-typed information network, 453–454 mixture model of conditional rank distribution, 448–451 overview, 447–448 RankClus framework, 430 Ranking distribution, 443 for attribute objects, 462–463 authority ranking, 462–463 simple ranking, 462 Ranking functions, 440, 442 alternative ranking functions, 446–447 authority ranking, 444–446 simple ranking, 444 Ranking of research papers/web pages content-based factors, 192 reputation-based factors, 192 Reactome project, 567 ReadVar(x), 147 Reality mining, 331 Realizable degree sequence, 370 Real-world data sets, 327 data sets, 327–333 results, 332–333 ReCom, 50, 63, 65 Rec.sports.baseball, Reducible subspace and core space, 516–517 Redundancy, 515–516 REDUS algorithm, 509–510, 520, 522, 527–531 finding overall reducible subspace, 521–522 intrinsic dimensionality estimator, 520–521 maximum
reducible subspace, 522–525 Reference reconciliation, 295 Refutation, 138 ReGroup, 85–86 Relational classifiers, 110 Relational clustering, 13 symmetric convex coding, 13 Relational data, 3–4 based on text data sets, 17 structures of, 24 Relational data based on text, 17 Relational entity resolution, 270–271 collective, 121, 271–272 naive, 120–121 Relational Markov network (RMN), 117 Relationship-based Data Cleaning (RelDC), 120 Relationship resolution, 268 Relationships homogeneity, 394 Relation summary network under generalized I-divergence (RSN-GI), 29 Representative pattern, 498 Resolution, 136 Res.sports, 10 Restricted neighborhood search clustering (RNSC), 537–538 Result combination, 482 Rissanen’s minimum description length (MDL) principle, 406 R-MAT model, 402 R5T1000C40S10, 68 R5T4000C40S10, 70 R5T5000C40S10, 68 Rule-based heuristics, discovery-driven InfoNetOLAP, 428 S Satisfiability, 136, 143, 145 Satisfiable formula, 138 Scalability, 158, 164, 391, 465, 491, 527, 530 ACPost, 228–229 graph summarization method, 402 pattern enumeration, analysis, 259–260 real-world data sets, 327 speed and, 100 synthetic databases, 67–70 test, 497 Scale-free degree distributions, 115 Scale-free graphs, 377 Scaling up inference, 145–152 SCC-ED algorithm, 15, 19 SCC-GI, 15, 17 Schema-level heterogeneity, 266 Schema-level integration, 267 SEARCHKL, 87 Seed growth, 542–543 Segment encoding cost, 83–84 (Semi-)distributiveness, 419 Semi-supervised clustering, 23 Sensitive link and relationship protection, 362 Set resemblance, 298 Sherman–Morrison lemma, 221 ShopZilla.com, 283 Sibling-pair based similarity computation, 60 SIGMOD, 50 SIGMOD 2004, 413, 415 Significance of overlap, 311 Significant pattern, 252 Simattrib(ei, ej), 270 Simple ranking, 444 SimRank, 46, 49–50, 63, 65, 271, 470 SimTree, 48–49, 51–53, 70 building, 53 aggregation-based similarity computation, 57–60 complexity analysis, 61–63 initializing, frequent pattern mining, 53–55 iterative adjustment of
SimTrees, 61 refining similarity between nodes, 55–57 restructuring, 62–63 Simweight, 58–59 Slice sampling MCMC algorithm, 145 Sliding window, 218 Small-world graphs, 377 Small-world phenomenon, 341 SNAP (summarization by grouping nodes on attributes and pairwise relationships), operation, 393–399 for DBLP DB coauthorship graph, 400 evaluation, 394–395 Social interaction, 317 Social media, 163–165 community extraction in heterogeneous networks, 168 connections between multi-mode and multi-dimensional networks, 176 multi-dimensional networks, 172–176 multi-mode networks, 168–172 heterogeneous networks in, 165–168 motivations to study network heterogeneity, 168 understanding groups, 176–177 group profiling, 177–179 topic taxonomy adaptation, 179–182 Social networks, 338, 340 giant component, 339, 345 middle region, 339, 346 singletons, 339, 345, 347 See also Online social networks Social network data, 360 content disclosure, 360 identity disclosure, 360 link disclosure, 360 Social networks, identity anonymization in, 359–361 degree anonymization, 366–368 restriction to edge additions, 368–370 experiments, 376 data sets, 377–378 evaluating graph construction algorithms, 378–383 graph construction, 370 basics on realizability of degree sequences, 370–372 Greedy_Swap algorithm, 372–375 restriction to edge additions, 375–376 overview, 365–366 problem definition, 363–365 restriction to edge additions, 365 related work, 361–363 Social network subgraph from PVTRNetwork, 277 from TKBNetwork, 276 Social security number (SSN), 270 Software behavior graphs, 242 Southern women, 325–327 Spectral clustering, 5, 7–8, 22, 37, 42, 125, 169, 173–174, 464, 470 Spectral graph partitioning (SGP), 27 Spectral relational clustering (SRC), 5, 11 algorithm, SPIRIT, 78 SSnetViz, 273–278 integration steps, 276 manual node merging module, 279 node merging module, 278 user interface design of, 275 Stars, 339, 351–352, 354 age of, 348 Flickr final graph, 348–349 isolated communities, 339
middle region, 347 network schema, 442, 454–455, 462, 470 non-trivial, 348–349 Star-structured data, Static bipartite graph, 216 Static graph mining, 212–213 Static network, community identification, 308 Stationary probability distribution, 194 Statistical analysis, discovery-driven InfoNetOLAP, 428–429 Statistical graphical model, 32, 42 StatStream, 78 Stochastic transition matrix, 193–194 Stream mining, 77–78 Strong correlation, 515 Strongly linear-correlated feature subset, 512 Structural cost, 364 Structural feature extraction, 174 Structural leap search, 252 Structural proximity, 251 Subgraph isomorphism, 238, 478 Submodular set function, 245 Subsets of points, 518–519 distance-based point deletion heuristic, 519–520 successive point deletion heuristic, 517 Summarization, 481–482 Summarized graph, 479 SUMMARIZE-MINE framework, 475, 479–482 discarding false positives, 481 overall algorithm layout, 481–482 recovering false negatives, 480–481 with verified ID-lists, 488, 490 Summary graphs, 390, 392, 405 Superedges, 392, 404 Superfeatures, 149–150 Supergraph, 365, 378 Supernodes, 149–150, 392 graph, 404 Support Vector Machine, 238 Support vector machines (SVM), 295, 299 Swap transformation, 373 Symmetric convex coding (SCC), 13, 15, 42 Symmetry, 136 Syn1, 2, 3, 4, 16 Synthetic databases, 67–70 Synthetic data sets generator description, 495–498 generator mechanism, 432–435 T T-aggregated graph, 418 Talk.politics, 10 Taobao (taobao.cn), 338 Taxonomy adaptation via classification learning, 181 Taxonomy structure of datasets, 10 Terrorism knowledge base (TKB) network, 274–275 Text document clustering, 5, 10, 19–21 Tf.idf, 7, 29 Theseus’ Ship, 325–326 Time-aggregate adjacency matrix, 214 Time evolving aspect, 101 Time-evolving graphs, 73–76 experiment evaluation, 91–92 additional observations, 100–101 compression evaluation, 99 data sets, 92–95 mining case studies, 95–98 speed and scalability, 100 GraphScope, 84 initialization, 91 partition identification, 84–89
time segmentation, 89–90 GraphScope compression objective, 81 graph encoding, 81–82 graph segment encoding, 82–84 graph stream encoding, 84 problem definition, 78 notation and definition, 78–80 problem formulation, 80–81 related work mining dynamic graphs, 77 mining static graphs, 76–77 stream mining, 77–78 TimeSegmentation, 80–81, 89–90 Time sensitive ranking and publication search, 187–190 empirical evaluation, 199 experimental results with all papers, 199–202 experimental settings, 199 results of top 10 papers, 202–203 results on new papers only, 203–204 sensitivity analysis, 204–206 linear regression, 198–199 related work, 190–192 TS-rank, proposed, 192–193 algorithm, 193–196 source evaluation, 196–197 trend factor, 197–198 Time-slice matrix, 214 Top-down k-SNAP approach vs bottom-up k-SNAP approach, 399 Topic taxonomy adaptation, 179–182 Top-K LEAP, 254 Topological dimensions of InfoNetOLAP, 416 Topological OLAP, 411, 414, 416, 435, 499 Topology-based approaches, 116 link prediction, 117 TPR, 190–191 Trace maximization, 14 Track-centrality, 212–213, 224–226, 234 Tracking, 352 communities, 311 framework for, 310 proximity, on dynamic bipartite graphs, 211–234 algorithm for, 225, 227 kinds of, 215 Track-proximity, 212–213, 224, 226–227, 234 Traffic trace, 92 TRANSACTION, 97–98 data set, 95 Transitivity, 136 TREC, 16, 27 Tri-partite relational data, 29 Trustworthiness of web sites, 286, 289 TRUTHFINDER, 283, 286–287, 291–294, 303 parameters of, 288 TS-Rank(AJEval), 206 Twitter (twitter.com), 163, 338 U Update destination partitions, 85 Update source partitions, 85 User–movie bipartite graph, 211 V Vague Gamma priors, 37 VAP, see Ventilator-associated pneumonia VAP gene ranking, 565 Variable symbols, 137 Ventilator-associated pneumonia, 558 expression profiles, 563–565 gene ranking, 565 variance of, 566 Veracity analysis and object distinction, 283–285 distinguishing objects with identical names, 294–296 authorship on DBLP, case study, 300–303 clustering
references, 299–300 similarity between references, 296–298 supervised learning with automatically constructed training set, 298–299 information with multiple conflicting information providers on the web, 285–286 book authors, case study on, 291–294 computational model, 287–291 problem definitions, 286–287 Veracity problem, 285 Verification, 482 Viral marketing, 156–157 VLDB, 50, 231 Voting, 291 W WalkSAT, 145 WebACE, 16, 27 Web graph, 308 WebKB, 155 Web mining, 12 “Web of trust,” 157 Web pages, 4, 6, 12, 111, 135, 142, 155–156, 167, 180, 187–190, 192, 206, 208, 265, 285, 390, 404, 557–558, 561 Web search engine, 6, 12 Web users, 4–6 WikiTerrorism, 274 Word-document data., Word–document matrix, 10 World Wide Web, 283, 340–341, 389–390 Wrapper models, 510 WriteVar(x, v), 148 X XML databases, 266 Y Yahoo!, 428 Yahoo! 360°, 339, 342, 344–351, 354–355 Yahoo! Answers, 265 Yahoo! Japan auctions (auctions.yahoo.co.jp), 338 Yeast mRNA expression, 563 Yeast RNA expression, 565 YouTube, 163–168, 265 multi-mode network in, 165–166