04 network construction, inference, and deconvolution

CS224W: Analysis of Networks
Jure Leskovec, Stanford University
http://cs224w.stanford.edu
10/4/18

Feature matrices, relationship tables, time series, document corpora, image datasets, etc.

Network construction and inference
- Today: How to construct and infer networks from raw data?
- Jonas Richiardi et al., Correlated gene expression supports synchronous activity in brain networks, Science 348:6240, 2015

1) Multimode Network Transformations:
   - K-partite and bipartite graphs
   - One-mode network projections/folding
   - Graph contractions
2) K-Nearest Neighbor Graph Construction
3) Network Deconvolution:
   - Direct and indirect effects in a network
   - Inferring networks by network deconvolution

- Most of the time, when we create a network, all nodes represent objects of the same type:
  - People in social nets, bus stops in route nets, genes in gene nets
- Multi-partite networks have multiple types of nodes, where edges exclusively go from one type to the other:
  - 2-partite student net: Students, Research projects
  - 3-partite movie net: Actors, Movies, Movie companies
- Figure (left): a social bipartite network in which blue squares stand for people and red circles represent organizations

- Example: Bipartite student-project network:
  - Edge: Student $i$ works on research project $j$
- Two network projections of the student-project network:
  - Student network: Students are linked if they work together on one or more projects
  - Project network: Research projects are linked if one or more students work on both projects
- In general: A K-partite network has K one-mode network projections

- Example: Projection of the bipartite student-project network onto the student mode (one-mode student projection)
- Consider three students, say 3, 4, and 5, connected in a triangle:
  - The triangle can be a result of:
    - Scenario #1: Each pair of students works on a different project
    - Scenario #2: All three students work on the same project
  - One-mode network projections discard some information:
    - Cannot distinguish between #1 and #2 just by looking at the projection

- One-mode projection onto the student mode:
  - #(projects) that students $i$ and $j$ work together on is equal to the number of paths of length 2 connecting $i$ and $j$ in the bipartite network
- Let $C$ be the incidence matrix of the student-project net:
  $C_{ij} = 1$ if student $i$ works on project $j$, and $C_{ij} = 0$ otherwise
- $C$ is an $n \times m$ binary non-symmetric matrix:
  - $n$ is #(students), $m$ is #(projects)

- Let's raise the adjacency matrix $G_{\text{dir}}$ to the second power:
  - The $(i, j)$-th entry of $G_{\text{dir}}^2$ is:
    $G_{\text{dir}}^2(i, j) = \sum_{k} G_{\text{dir}}(i, k)\, G_{\text{dir}}(k, j)$
  - This sum is only greater than zero if there exists a node $k$ for which $G_{\text{dir}}(i, k)$ and $G_{\text{dir}}(k, j)$ are both nonzero:
    - There exists a node $k$ that is connected to both nodes $i$ and $j$
    - The sum counts the number of neighbors that nodes $i$ and $j$ share
    - The sum counts the paths of length 2 between nodes $i$ and $j$
- This reasoning is valid for higher powers of $G_{\text{dir}}$:
  - $G_{\text{dir}}^3(i, j)$ counts the paths of length 3 between $i$ and $j$
  - $G_{\text{dir}}^4(i, j)$ counts the paths of length 4 between $i$ and $j$

- Idea: Model indirect flow as a power series of direct flow:
  $G_{\text{obs}} = G_{\text{dir}} + G_{\text{dir}}^2 + G_{\text{dir}}^3 + G_{\text{dir}}^4 + \cdots$
  - The first term is the direct network; the higher powers are the indirect effects
  - The sum is the transitive closure of $G_{\text{dir}}$, and it converges with correct scaling
- Note: Linear scaling of $G_{\text{dir}}$ so that the max absolute eigenvalue of $G_{\text{dir}}$ is < 1:
  - Indirect effects decay exponentially with path length
  - The infinite series converges

- The transitive closure of $G_{\text{dir}}$ can be expressed as an infinite sum of:
  - The true direct network, $G_{\text{dir}}$
  - All indirect effects along paths of increasing lengths, $G_{\text{dir}}^2, G_{\text{dir}}^3, G_{\text{dir}}^4, \ldots$
- Idea: The infinite series can be written in closed form using Taylor series expansions:
  $G_{\text{obs}} = G_{\text{dir}} + G_{\text{dir}}^2 + G_{\text{dir}}^3 + G_{\text{dir}}^4 + \cdots = G_{\text{dir}} (I + G_{\text{dir}} + G_{\text{dir}}^2 + G_{\text{dir}}^3 + \cdots) = G_{\text{dir}} (I - G_{\text{dir}})^{-1}$
- Note: Let $X$ be any square matrix with max absolute eigenvalue < 1. Then the following series converges:
  $I + X + X^2 + X^3 + \cdots$
  - The series converges to: $\sum_{k=0}^{\infty} X^k = (I - X)^{-1}$

- Using Taylor series expansions we get a closed-form expression for $G_{\text{obs}}$:
  $G_{\text{obs}} = G_{\text{dir}} (I - G_{\text{dir}})^{-1}$
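The path-counting argument and the closed form of the power series can be checked numerically. The sketch below is only an illustration under stated assumptions: the incidence matrix `C`, the helper `truncated_series`, and the scaling target of 0.5 for the largest absolute eigenvalue are hypothetical choices, not values from the lecture.

```python
import numpy as np

# Hypothetical incidence matrix C: 4 students (rows) x 3 projects (columns).
C = np.array([
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 1],
    [0, 1, 1],
])
n, m = C.shape

# Adjacency matrix of the bipartite network: students first, then projects.
A = np.block([
    [np.zeros((n, n), dtype=int), C],
    [C.T, np.zeros((m, m), dtype=int)],
])

# (A @ A)[i, j] counts paths of length 2. Between two students this is the
# number of projects they share, which is the projection weight (C @ C.T)[i, j].
# The diagonal holds each student's own project count, not a projection edge.
A2 = A @ A
shared_projects = C @ C.T
print(shared_projects[2, 3])  # students 2 and 3 share two projects: 2


def truncated_series(G, terms):
    """Partial sum G + G^2 + ... + G^terms."""
    total = np.zeros_like(G, dtype=float)
    power = np.eye(G.shape[0])
    for _ in range(terms):
        power = power @ G
        total += power
    return total


# Linear scaling so that the largest absolute eigenvalue is 0.5 (assumed target).
G_dir = A / (2.0 * np.max(np.abs(np.linalg.eigvals(A))))
closed_form = G_dir @ np.linalg.inv(np.eye(n + m) - G_dir)
series = truncated_series(G_dir, terms=60)
print(np.allclose(series, closed_form))
```

With the eigenvalues scaled below 1 in absolute value, each extra term of the series shrinks geometrically, which is why a modest number of terms already agrees with the closed form.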
- In network deconvolution:
  - Observed network $G_{\text{obs}}$ is known
  - True direct network $G_{\text{dir}}$ needs to be recovered
- Finally, we get a closed-form solution for $G_{\text{dir}}$:
  $G_{\text{dir}} = G_{\text{obs}} (I + G_{\text{obs}})^{-1}$

- Use the closed-form expression for $G_{\text{dir}}$ to recover the true direct network
[Figure: (a) The true network ($G_{\text{dir}}$, direct effects) is turned by transitive closure (TC) into the observed network ($G_{\text{obs}}$, direct and indirect effects); network deconvolution (ND) maps the observed network back to the true network. (b) Series closed form. Transitive closure: $G_{\text{obs}} = G_{\text{dir}} + G_{\text{dir}}^2 + G_{\text{dir}}^3 + \cdots = G_{\text{dir}} (I - G_{\text{dir}})^{-1}$. Network deconvolution: $G_{\text{dir}} = G_{\text{obs}} (I + G_{\text{obs}})^{-1}$.]

- The solution for $G_{\text{dir}}$ is: $G_{\text{dir}} = G_{\text{obs}} (I + G_{\text{obs}})^{-1}$
- How to efficiently calculate $G_{\text{dir}}$:
  - Without calculating the matrix inverse $(I + G_{\text{obs}})^{-1}$
- Approach:
  - Use the eigen-decomposition principle:
    1. Express $G_{\text{obs}}$ by decomposition into eigenvectors $U$ and eigenvalues $\Sigma_{\text{obs}}$: $G_{\text{obs}} = U \Sigma_{\text{obs}} U^{-1}$
    2. Express each eigenvalue $\lambda_{\text{dir},i}$ as a nonlinear function of a single corresponding eigenvalue $\lambda_{\text{obs},i}$: $\lambda_{\text{dir},i} = \lambda_{\text{obs},i} / (1 + \lambda_{\text{obs},i})$
    3. Form a diagonal matrix $\Sigma_{\text{dir}}$ such that $\Sigma_{\text{dir}}(i, i) = \lambda_{\text{dir},i}$
    4. Recover the true direct network as: $G_{\text{dir}} = U \Sigma_{\text{dir}} U^{-1}$

[Figure legend: direct interactions, correctly recovered (true positives); length-2 indirect interactions (false positives); length n > 2 indirect interactions (false positives); true interactions removed by ND (false negatives).]
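The two routes to $G_{\text{dir}}$ described above can be compared on a small example. This is a minimal sketch, not the authors' implementation: the function names and the toy chain network are hypothetical, and the eigen-decomposition route assumes a symmetric $G_{\text{obs}}$, so that $U^{-1} = U^T$ and no explicit inverse is needed.

```python
import numpy as np


def transitive_closure(G_dir):
    """G_obs = G_dir (I - G_dir)^-1; assumes all |eigenvalues| of G_dir are < 1."""
    identity = np.eye(G_dir.shape[0])
    return G_dir @ np.linalg.inv(identity - G_dir)


def deconvolve_closed_form(G_obs):
    """G_dir = G_obs (I + G_obs)^-1, using an explicit matrix inverse."""
    identity = np.eye(G_obs.shape[0])
    return G_obs @ np.linalg.inv(identity + G_obs)


def deconvolve_eig(G_obs):
    """Eigen-decomposition route; assumes G_obs is symmetric."""
    G_sym = (G_obs + G_obs.T) / 2.0          # remove floating-point asymmetry
    lam_obs, U = np.linalg.eigh(G_sym)       # G_obs = U diag(lam_obs) U^T
    lam_dir = lam_obs / (1.0 + lam_obs)      # map each eigenvalue separately
    return U @ np.diag(lam_dir) @ U.T


# Hypothetical true direct network: a chain 0 - 1 - 2 - 3 with edge weight 0.4.
G_dir = np.zeros((4, 4))
for i in range(3):
    G_dir[i, i + 1] = G_dir[i + 1, i] = 0.4

G_obs = transitive_closure(G_dir)
print(G_obs[0, 2] > 0)  # the observed network contains an indirect 0 - 2 edge

recovered_inverse = deconvolve_closed_form(G_obs)
recovered_eig = deconvolve_eig(G_obs)
print(np.allclose(recovered_inverse, G_dir), np.allclose(recovered_eig, G_dir))
```

Both routes return the chain and drive the indirect 0 - 2 weight back to zero. For a non-symmetric $G_{\text{obs}}$ the eigenvector matrix is not orthogonal, so $U^{-1}$ would have to be computed explicitly.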
- Goal: Distinguish strong and weak collaborations between scientists
- Collaboration tie strengths depend on publication details, such as:
  - #(papers) each pair of scientists has collaborated on
  - #(co-authors) on each of the papers
- Strength of ties is important for:
  - Recommending friends and colleagues
  - Recognizing conflicts of interest
  - Evaluating authors' contributions to teams

- Data: Unweighted network of scientists working in the field of network science:
  - Two authors are linked if they co-authored at least one paper
- Setup: Apply ND on the co-authorship network:
  - ND returns a weighted network whose:
    - Transitive closure most closely captures the input network
    - Weights represent the inferred strength of direct interactions
  - Output: Rank co-authorship edges by the ND-assigned weights
- Ground-truth data:
  - True collaboration strengths are computed by summing the number of co-authored papers and down-weighting each paper by the number of additional co-authors
  - Compute correlation between ND-assigned weights and true collaboration strengths

[Figure: (a) co-authorship network with ties marked as strong correct, strong incorrect, weak correct, and weak incorrect; (b) ND predicted collaboration strength (0 to 1.0) plotted against true collaboration strength (0 to 1.0).]
- Agreement between the rank obtained by the true collaboration strength and the rank provided by the ND weight: $R^2$ = 0.76
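The ground-truth tie strength can be sketched in a few lines. The slides do not give the exact down-weighting formula, so the code below assumes one common convention: a paper with k authors contributes 1 / (k - 1) to each pair of its authors. The function name and the toy paper list are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def collaboration_strengths(papers):
    """Sum co-authored papers per pair, each paper down-weighted by 1 / (k - 1),
    where k is the number of authors on that paper (assumed weighting)."""
    strength = defaultdict(float)
    for authors in papers:
        unique = sorted(set(authors))
        k = len(unique)
        if k < 2:
            continue  # single-author papers create no collaboration tie
        for a, b in combinations(unique, 2):
            strength[(a, b)] += 1.0 / (k - 1)
    return dict(strength)


# Hypothetical toy corpus: each paper is a list of author names.
papers = [
    ["ann", "bob"],
    ["ann", "bob"],
    ["ann", "bob", "cat", "dan"],
]
truth = collaboration_strengths(papers)

# Rank ties from strongest to weakest, as one would before comparing with ND ranks.
ranked = sorted(truth, key=truth.get, reverse=True)
print(ranked[0], truth[ranked[0]])
```

In the unweighted co-authorship network all six pairs here are plain edges; the ground truth separates the repeated two-author collaboration from ties that exist only through the four-author paper.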
- Conclusion: ND predicts collaboration tie strengths solely by using network topology (i.e., without using other publication details)

Figure: Application to co-authorship social network. (a) Use of network deconvolution to distinguish strong ties from weak ties in the largest connected component of a co-authorship network containing 379 authors. True collaboration strengths were computed by summing the number of co-authored papers and down-weighting each paper by the number of additional co-authors. Network deconvolution only had access to unweighted co-authorship edges, but exploited transitive relationships to weigh down weak ties, resulting in 77% accurate predictions (solid lines) and only 23% inaccurate predictions (dashed lines), demonstrating that this information lies within the network edges and that network deconvolution is well suited for discovering it. (b) Beyond the binary classification of strong and weak ties, we found a strong correlation ($R^2$ = 0.76) across all 2,742 edges connecting 1,589 authors, between the weights assigned by network deconvolution (ND) and the true collaboration strengths obtained using additional publication details.

- Goal: Infer a gene regulatory network from gene feature vectors describing gene activity:
  - Nodes represent genes
  - Edges represent regulatory relationships between regulators and their target genes
- Well-studied problem in bioinformatics:
  - A dataset is a gene-by-condition expression matrix
  - Expression matrix is noisy with many indirect measurements
- Datasets: Gene expression datasets from the bacterium E. coli,
  the yeast S. cerevisiae, and a simulated environment (in silico)
- Setup: Use ND to improve network inference methods by eliminating indirect edges in the inferred networks:
  1. Infer a gene regulatory network using a particular network inference method
  2. Apply ND to the inferred network to deconvolve the network
  3. Evaluate the deconvolved network against ground-truth data
- Ground-truth data:
  - True positive regulatory relationships (i.e., edges) are defined as a set of interactions experimentally validated in a laboratory

[Figure: (a) Overall score, in silico score, E. coli score, and S. cerevisiae score before ND and after ND for MI and correlation methods, other inference methods, and the community network (methods shown: CLR, ARACNE, MI, Pearson, Spearman, GENIE3, TIGRESS, Inferelator, ANOVerence, Community); average improvement ~22%. (b) Relative performance (AUROC) of inference methods for cascades (casc.) and feed-forward loops (FFL) before and after network deconvolution.]
[Figure, continued: a feed-forward loop (FFL) over nodes A, B, and C contains the feed-forward edge between A and C (shown in red); a cascade (casc.) lacks the feed-forward edge. Horizontal axis: network inference methods.]
Caption excerpt: [...] cascades and FFLs, for example, MI-based network inference tends to include feed-forward edges (red arrow), resulting in lower accuracy for cascades, whereas the opposite is true for the Inferelator and ANOVerence. The deconvolved networks (right side) show significantly higher accuracy (AUROC) for true cascade network motifs for all methods, and moderately [...] average, showing that network deconvolution successfully eliminates spurious indirect feed-forward edges for true [...] accuracy for true FFLs.

- ND improves the performance of top-performing network inference methods
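The three-step setup (infer, deconvolve, evaluate) can be sketched on simulated data. Everything here is an assumption made for illustration: the toy cascade A -> B -> C with an extra weak edge D -> E, absolute Pearson correlation as the inference method, and a hand-written AUROC. The sketch also applies the closed-form expression directly and skips the eigenvalue scaling step, so it shows the algebra, not the full published procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
n_conditions = 50_000


def child(parent, r):
    """Unit-variance gene whose correlation with `parent` is about r."""
    return r * parent + np.sqrt(1 - r**2) * rng.standard_normal(parent.shape)


# Hypothetical in silico data: cascade A -> B -> C and a weak direct edge D -> E.
A = rng.standard_normal(n_conditions)
B = child(A, 0.6)
C = child(B, 0.6)
D = rng.standard_normal(n_conditions)
E = child(D, 0.3)
X = np.vstack([A, B, C, D, E])  # gene-by-condition expression matrix

truth = np.zeros((5, 5), dtype=int)
for i, j in [(0, 1), (1, 2), (3, 4)]:
    truth[i, j] = truth[j, i] = 1


def auroc(scores, labels):
    """Fraction of (true edge, non-edge) pairs ranked correctly; ties count 0.5."""
    pos, neg = scores[labels == 1], scores[labels == 0]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))


# Step 1: infer a network with absolute Pearson correlation.
G_obs = np.abs(np.corrcoef(X))
np.fill_diagonal(G_obs, 0.0)

# Step 2: apply the closed-form deconvolution (no eigenvalue scaling here).
G_dir = G_obs @ np.linalg.inv(np.eye(5) + G_obs)

# Step 3: evaluate both networks against the ground-truth edges.
pairs = np.triu_indices(5, k=1)
auroc_before = auroc(G_obs[pairs], truth[pairs])
auroc_after = auroc(np.abs(G_dir)[pairs], truth[pairs])
print(auroc_before, auroc_after)
```

In this toy setting the indirect A - C correlation outranks the weak but real D - E edge before deconvolution; afterwards the A - C weight collapses toward zero and the ranking improves.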
- General approach to identify direct dependencies between objects in a network:
  - Remove spurious edges that are due to indirect effects
  - Decrease over-estimated edge weights
  - Rescale edge weights so that they correspond to direct dependencies between objects
- Other published methods (not covered today):
  - Partial correlations and random matrix theory
  - Graphical models, e.g., graphical lasso, Bayesian nets, Markov random fields
  - Causal inference models

1) Multimode Network Transformations:
   - K-partite and bipartite graphs
   - One-mode network projections/folding
   - Graph contractions
2) K-Nearest Neighbor Graph Construction
3) Network Deconvolution:
   - Direct and indirect effects in a network
   - Inferring networks by network deconvolution
