Ebook: Computational Network Science: An Algorithmic Approach, Part 2

DOCUMENT INFORMATION

Basic information

Format
Pages	69
File size	8 MB

Content

Until now, studies in network science have focused on particular relationships that require varied and sometimes incompatible datasets, which has kept it from being a truly universal discipline. This new approach would remove the need for tedious human-based analysis of different datasets and help researchers spend more time on the qualitative aspects of network science research.

CHAPTER 6 Diffusion and Contagion

This chapter explores the phenomena of rampant changes in networks, exemplified by (a) dissemination of preferences, (b) percolation, (c) epidemics (i.e., contagion) of disease, and (d) community composition. The rest of this chapter reviews these four categories in order.

6.1 POPULATION PREFERENCE SPREAD

Schelling’s (1971) residential neighborhood segregation model is the earliest formally studied report of rampant changes in networks. Schelling pointed out that even a small preference for one’s neighbors to be of the same ethnicity leads to widespread segregation emerging in the network. He demonstrated this theory with coins on a sheet of graph paper, placing pennies and nickels in different patterns on the cells. Coins were moved one by one if they were in an unsatisfactory composition: for every colored cell, if more than 33% of the adjacent cells were of a different color, the cell would move to another randomly selected cell. You can try the model out using Chris Cook’s online demonstration program or the NetLogo model. Further details and interesting emergent patterns are available from Hatna and Benenson (2012).

Let N be the population size and p be the uniform random probability that a tie exists between an individual and any other person in the network. Let us ignore repeated connections. With a small value of p and a large value of N, the product pN governs an individual’s indirect connections with others in the network, that is, the second-hand ties reached through the individual’s primary ties. The expected (i.e., average) number of individuals that can be reached is given by Equation 6.1:

$$1 + pN + (pN)^2 + (pN)^3 + \cdots = \frac{1}{1 - pN} \tag{6.1}$$

The closed form on the right holds only for pN < 1. With pN ≥ 1 the sum grows without bound and contagion begins, that is, an epidemic occurs. The rapid spread of a disease (e.g., obesity) through connections in a population is an epidemic or, when it crosses borders, a pandemic event (Hays, 2005). Topologies of networks may hinder or promote these events. Whereas high-distance networks inhibit epidemics, low-distance networks (e.g., a scale-free network) accelerate them.

If pN = 0.5, the right-hand side of Equation 6.1 yields the value 2 and no epidemic results. This value is below the threshold needed for epidemic onset. With p = 0.0006 and N = 2500, pN = 1.5, which is larger than 1.0 (i.e., above the epidemic threshold), and the epidemic follows the sigmoidal logistic curve shown in Figure 6.1.

A similar model that generates a logistic curve is Bass’s diffusion model (Bass, 1969). Bass’s model is based on a differential equation that describes how new products are adopted in a population. The rate of diffusion is computed with Equation 6.2, where F(t) is the fraction of agents in a society who have adopted a new product or behavior by time t, p is the rate of innovation, q is the rate of imitation, and χ(t) is a function of the percentage change in price and other variables.
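As a rough illustration of how Bass’s model produces an S-shaped adoption curve, the following Python sketch integrates the model numerically. It assumes the standard Bass form dF/dt = (p + q F(t)) (1 − F(t)) with χ(t) = 1 (no price or marketing effects). The function name `bass_adoption_curve`, the Euler step size, and the values p = 0.03 and q = 0.38 are illustrative assumptions, not values taken from the text.

```python
def bass_adoption_curve(p, q, steps, dt=1.0):
    """Euler-integrate dF/dt = (p + q*F) * (1 - F), starting from F(0) = 0.

    p: rate of innovation, q: rate of imitation (assumed constant).
    chi(t) is assumed to be 1 for every t in this sketch.
    Returns the list [F(0), F(dt), F(2*dt), ...] with steps + 1 entries.
    """
    F = 0.0
    curve = [F]
    for _ in range(steps):
        F += dt * (p + q * F) * (1.0 - F)
        F = min(F, 1.0)  # guard against overshoot when dt is large
        curve.append(F)
    return curve


# Illustrative parameter values (assumptions of this sketch).
curve = bass_adoption_curve(p=0.03, q=0.38, steps=40)
for t in range(0, 41, 5):
    print(f"t={t:2d}  adopted fraction={curve[t]:.3f}")
```

Adoption starts slowly, driven by innovators (p). It accelerates as imitation (q × F) takes over and flattens as the pool of non-adopters (1 − F) empties, which gives the logistic shape.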
Fig 6.1. The logistic curve

$$\bigl(p + q\,F(t)\bigr)\,\chi(t) = \frac{F'(t)}{1 - F(t)} \tag{6.2}$$

6.2 PERCOLATION MODEL

In physics and mathematics, percolation theory describes the behavior of clustered components in random networks (Grimmett, 1999). The common intuition is the movement and filtering of fluids through porous materials, for example, filtration of water through soil and permeable rocks. In a network, let each node be a cell through which a fluid-like substance may transit to other cells. A network (i.e., a grid) then is a sponge-like substance, and percolation is the determination of whether a substance introduced at one cell will reach the other side of the network (or grid). Percolation has many applications, such as the analysis of forest fires, bank failures, and rumor spread. You may explore an online implementation of the NetLogo percolation model. Commonly, a cell’s transmission rate is modeled as a probability value p. As shown in Figure 6.2, at a certain value of p, percolation is achieved.

Fig 6.2. Percolation model

6.3 DISEASE EPIDEMIC MODELS

Although the original inspiration is disease epidemics, we treat the topic as generically applicable to the epidemic spread of any phenomenon. The earliest model is the susceptible, infected, removed (SIR) model, documented in Kermack and McKendrick (1927). As shown in Equation 6.3, consider a fixed population N at time t that is divided into three camps: susceptible, infected, and removed. The susceptible camp, denoted by S(t), consists of individuals who are not yet infected at time t. The infected camp, denoted by I(t), consists of individuals who have already been infected and are capable of transmitting the epidemic to individuals in the susceptible category. The removed camp, denoted by R(t), consists of individuals with a previous infection who have been removed due to either immunization or termination (i.e., death). The removed individuals cannot be infected again or transmit the epidemic to others.

$$N(t) = S(t) + I(t) + R(t) \tag{6.3}$$

All individuals are considered to have an equal probability of contact (i.e., infection), denoted by β. At each time step t, each person may infect β × N others with equal probability. The fraction of contacts between an infected individual and a susceptible one is S(t) / N(t). Therefore, the new infection rate is β × N × [S(t) / N(t)] × I(t) = β × S(t) × I(t). The population leaving the susceptible camp is equal to the number entering the infected camp, so dS(t)/dt = −β × S(t) × I(t). Meanwhile, a fraction γ of the infected individuals leaves this class per unit time and enters the removed class (i.e., the law of mass action), where γ denotes the combined mean recovery and death rates. Infection and recovery are assumed to be much faster than the timescale of births and deaths, which are therefore omitted from the model. The rate of change of infection is thus dI(t)/dt = β × S(t) × I(t) − γ × I(t), and the rate of recovery is dR(t)/dt = γ × I(t).

There have been many variations on this model, including SIS and models that account for births and deaths in mathematical epidemiology (Brauer et al., 2008), with models available from Vynnycky and White (2010). Another good source of discussion of these models is Jackson (2008).

Whether epidemically grown or not, a community is a group of closely related and connected entities with similar attributes; it can also be a part of a large network with groups of communities. Entities in one community can interact with other communities. A good community will interact less with the entities of an outside community and more with those of the inside community. Parts of networks form groups called clusters. These clusters can be considered to be communities. Next, we outline processes for community detection.

6.4 COMMUNITY DETECTION

To motivate the idea of a community, let us consider a project team in a firm, where the team is a community and the team members have frequent interactions. There might be interfaces from one project to another. For example, finance and control project teams interact with material handling, sales, and distribution in a factory. Although projects are interconnected, they are clearly identifiable communities, as shown in Figure 6.3.

Fig 6.3. IT organization of a factory

We begin with a set of global strategies and then turn to local communities. We introduce graph partitioning, where we divide the graph into parts with a minimal number of links between them. The number of links running between two clusters is called the cut size. If we keep the network in one group, that is, undivided, the cut size is 0. In partitioning, it might be desirable to specify the least number of groups and the target group size, even though communities cannot be divided in an optimal way.

For example, consider a sports club with 12 players. Six of them play basketball and six of them play football. We would like to divide the entire sports network into communities (i.e., clusters) based on how closely players interact with one another. All players who play the same sport interact with one another closely and are friends. There are also a few footballers who interact with basketballers. Figure 6.4 illustrates the interactions among footballers and basketballers.

Fig 6.4. Two communities of players in sporting club

The best graph partitioning method is bipartition, which divides the graph into two clusters of equal size and minimal cut size. However, it is not possible to have a partition with a smaller cut size than […]. Therefore, this method is optimal for dividing clusters in our graph.

6.4.1 Spectral Clustering

The spectral graph clustering technique is used to determine the number of clusters in large networks. Spectral partitioning is based on the Laplacian matrix (L) and on finding its eigenvalues and eigenvectors. The adjacency matrix is A = [W_ij], where W_ij is the edge weight between vertices x_i and x_j: if there is a link between nodes i and j, W_ij = 1; otherwise, W_ij = 0. The Laplacian matrix is L = D − A, where D is the diagonal matrix of node degrees. We illustrate with the simple example shown in Figure 6.5.

Fig 6.5. The graph G(9, 15) to be analyzed for spectral partitioning

For each node, the value of D is computed based on how many edges are linked to that node. For example, for node 1, there are three edges connected from nodes 2, 3, and […]. Therefore, the degree of node 1 is 3.

$$L = D - A \tag{6.4}$$

[The entries of the 9 × 9 matrix in Equation 6.4 are not recoverable from the source.]

From the Laplacian matrix shown in Equation 6.4, we compute the Fiedler vector (S) based on eigenvalues and eigenvectors. The Fiedler vector has both positive and negative components, and their sum must be 0.

$$S = \begin{bmatrix}
0.33 & -0.38 \\
0.33 & -0.48 \\
0.33 & -0.38 \\
0.33 & -0.12 \\
0.33 & 0.16 \\
0.33 & 0.16 \\
0.33 & 0.30 \\
0.33 & 0.24 \\
0.33 & 0.51
\end{bmatrix} \tag{6.5}$$

From the matrix S shown in Equation 6.5, we identify two communities: in the second column (the first column is the constant eigenvector), all the positive values form one cluster and the negative values form another (Figure 6.6). In this example, the two communities (i.e., clusters) are {1, 2, 3, 4} and {5, 6, 7, 8, 9}.
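The spectral partitioning steps above (build L = D − A, obtain the Fiedler vector, split nodes by sign) can be sketched in plain Python. Because the edge list of the graph in Figure 6.5 is not reproduced in the text, the sketch uses a hypothetical six-node graph: two triangles joined by a single edge. The function names and the shifted power iteration used to approximate the Fiedler vector are assumptions of this sketch, chosen to avoid external libraries; a library eigen-solver would normally be used instead. The graph is assumed to be connected and unweighted.

```python
def fiedler_vector(n, edges, iterations=500):
    """Approximate the Fiedler vector of an undirected, connected graph.

    Runs power iteration on M = c*I - L while repeatedly removing the
    constant component, so the iterate converges to the eigenvector of
    the second-smallest eigenvalue of L.
    """
    # Laplacian L = D - A as a dense list of lists.
    L = [[0.0] * n for _ in range(n)]
    for i, j in edges:
        L[i][j] -= 1.0
        L[j][i] -= 1.0
        L[i][i] += 1.0
        L[j][j] += 1.0
    # Every eigenvalue of L is at most 2 * (maximum degree).
    c = 2.0 * max(L[i][i] for i in range(n))
    v = [float(i) for i in range(n)]  # arbitrary non-constant start
    for _ in range(iterations):
        mean = sum(v) / n
        v = [x - mean for x in v]  # project out the constant eigenvector
        w = [c * v[i] - sum(L[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v


def spectral_bipartition(n, edges):
    """Split nodes into two clusters by the sign of the Fiedler vector."""
    f = fiedler_vector(n, edges)
    positive = {i for i in range(n) if f[i] > 0}
    return positive, set(range(n)) - positive


# Hypothetical example graph: triangles {0, 1, 2} and {3, 4, 5}, bridged by 2-3.
EDGES = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
cluster_a, cluster_b = spectral_bipartition(6, EDGES)
print(sorted(cluster_a), sorted(cluster_b))
```

The single bridging edge is the cut, so the sign split recovers the two triangles. Which of the two clusters ends up with positive signs is arbitrary, since an eigenvector is only defined up to sign.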
Fig 6.6. Two communities derived from spectral partitioning technique

Fig 6.7. Algorithm for agglomerative hierarchical clustering

6.4.2 Hierarchical Clustering

Hierarchical clustering is the most popular and widely used method to analyze social network data. In this method, nodes are compared with one another based on their similarity. Larger groups are built by joining groups of nodes based on their similarity. A criterion is introduced to compare nodes based on their relationship. There are two types of hierarchical clustering approaches:

Agglomerative approach: This method is also called a bottom-up approach and is shown in Figure 6.7. Each node represents a single cluster at the beginning; nodes then merge based on their similarities until all nodes belong to the same cluster.

Divisive approach: This method is also called a top-down approach. Initially, all nodes belong to the same cluster; eventually, each node forms its own cluster. The divisive approach is less widely used than the agglomerative approach because of its complexity.

The final result for both approaches is represented as a dendrogram, as shown in Figure 6.8.

Fig 6.8. A dendrogram example for hierarchical clustering approach

Consider the distances between four Illinois towns, Carbondale, Peoria, Springfield, and Bloomington, shown in Figure 6.9. We can observe that Bloomington and Peoria are the two closest cities, and we join them using the hierarchical clustering algorithm. Figure 6.10 shows the four towns on the map. The distances among Peoria, Bloomington, and Springfield are short and nearly identical, at 73 and 71 miles. The final dendrogram is shown in Figure 6.11.

Fig 6.9. Distances between four Illinois towns

Fig 6.10. Hierarchical clustering of example towns shown on a map

Fig 6.11. The final dendrogram for the towns’ example

There is a problem with graph partitioning: we need to specify the number and the size of the desired clusters. If a network is new and large, we do not have any idea about the number of clusters and how big they must be. Hierarchical clustering also has a shortcoming. If we cut the hierarchical tree at any level, we produce a good partition, but we end up with n − 1 partitions. If the network has 1 million nodes (n), we get 1 million minus 1 partitions. Many partitions are recovered, from which we need to identify the best one. We cannot use 1 million partitions, so we must find additional criteria to decide which partition is the best one.

In Girvan and Newman (2002), an algorithm is offered to solve the problems with spectral methods. It is based on the divisive method and hierarchical clustering. The divisive method repeatedly identifies and removes edges connecting densely connected regions. It uses edge betweenness, the number of shortest paths passing through an edge, to identify the edges to remove. It also removes the links that connect clusters. The algorithm is shown in Figure 6.12.

Appendix

A.1 WHAT IS IT?
In a network, a “component” is a group of nodes that are all connected to each other, directly or indirectly. So if a network has a “giant component,” that means almost every node is reachable from almost every other. This model shows how quickly a giant component arises if you grow a random network.

A.2 HOW IT WORKS

Initially we have nodes but no connections (edges) between them. At each step, we pick two nodes at random that were not directly connected before and add an edge between them. All possible connections between them have exactly the same probability of occurring.

As the model runs, small chain-like “components” are formed, where the members in each component are either directly or indirectly connected to each other. If an edge is created between nodes from two different components, then those two components merge into one. The component with the most members at any given point in time is the “giant” component and it is colored red. (If there is a tie for largest, we pick a random component to color.)

A.3 HOW TO USE IT

The NUM-NODES slider controls the size of the network. Choose a size and press SETUP. Pressing the GO ONCE button adds one new edge to the network. To repeatedly add edges, press GO.

As the model runs, the nodes and edges try to position themselves in a layout that makes the structure of the network easy to see. Layout makes the model run slower, though. To get results faster, turn off the LAYOUT? switch. The REDO LAYOUT button runs the layout-step procedure continuously to improve the layout of the network.

A monitor shows the current size of the giant component, and the plot shows how the giant component’s size changes over time.

A.4 THINGS TO NOTICE

The y-axis of the plot shows the fraction of all nodes that are included in the giant component. The x-axis shows the average number of connections per node. The vertical line on the plot shows where the average number of connections per node equals 1. What happens to the rate of growth of the giant component at this point?

The model demonstrates one of the early proofs of random graph theory by the mathematicians Paul Erdős and Alfréd Rényi (1959). They showed that the largest connected component of a network formed by randomly connecting two existing nodes per time step grows rapidly once the average number of connections per node equals 1. In other words, the average number of connections has a “critical point” where the network undergoes a “phase transition” from a rather unconnected world of small, fragmented components to a world where most nodes belong to the same connected component.

A.5 THINGS TO TRY

Let the model run until the end. Does the “giant component” live up to its name?

Run the model again, this time slowly, a step at a time. Watch how the components grow. What is happening when the plot is steepest?

Run it with a small number of nodes (like 10) and watch the plot. How does it differ from the plot you get when you run it with a large number of nodes (like 300)?

If you do multiple runs with the same number of nodes, how much does the shape of the plot vary from run to run? You can turn off the LAYOUT? switch to get results faster.

A.6 EXTENDING THE MODEL

Right now the probability of any two nodes getting connected to each other is the same. Can you think of ways to make some nodes more attractive to connect to than others? How would that impact the formation of the giant component?
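The growth process of A.2 and the threshold noted in A.4 can also be explored outside NetLogo. The Python sketch below adds random edges between previously unconnected pairs and records the fraction of nodes in the largest component against the average number of connections per node. The function names, the seed handling, and the iterative depth-first search are assumptions of this sketch and are not the NetLogo model’s own code.

```python
import random


def component_size(neighbors, start):
    """Iterative depth-first search: count the nodes reachable from start."""
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in neighbors[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return len(seen)


def giant_component_growth(num_nodes, num_edges, seed=0):
    """Add random edges one at a time and track the largest component.

    Returns a list of (average connections per node, fraction of nodes in
    the largest component), one entry per added edge.
    num_edges must not exceed num_nodes * (num_nodes - 1) / 2.
    """
    rng = random.Random(seed)
    neighbors = [set() for _ in range(num_nodes)]
    largest = 1
    added = 0
    history = []
    while added < num_edges:
        a, b = rng.sample(range(num_nodes), 2)
        if b in neighbors[a]:
            continue  # already directly connected; pick another pair
        neighbors[a].add(b)
        neighbors[b].add(a)
        added += 1
        # Only the component containing the new edge can have grown.
        largest = max(largest, component_size(neighbors, a))
        history.append((2 * added / num_nodes, largest / num_nodes))
    return history


history = giant_component_growth(num_nodes=300, num_edges=600, seed=1)
for avg_degree, fraction in history[49::50]:
    print(f"avg connections={avg_degree:.2f}  giant fraction={fraction:.2f}")
```

Printing every 50th step is usually enough to see the giant fraction stay small while the average number of connections is well below 1 and climb steeply afterwards.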
A.7 NETWORK CONCEPTS

Identification of the connected components is done using a standard search algorithm called “depth first search.” “Depth first” means that the algorithm first goes deep into a branch of connections, tracing them out all the way to the end. For a given node, it explores its neighbor’s neighbors (and then their neighbors, etc.) before moving on to its own next neighbor. The algorithm is recursive, so eventually all reachable nodes from a particular starting node will be explored. Since we need to find every reachable node, and since it does not matter what order we find them in, another algorithm such as “breadth first search” would have worked equally well. We chose depth first search because it is the simplest to code.

The position of the nodes is determined by the “spring” method, which is further described in the Preferential Attachment model.

A.8 NetLogo FEATURES

Both nodes and edges are turtles. Edge turtles have the “line” shape. The edge turtle’s “size” variable is used to make the edge the right length.

Lists are used heavily in this model. Each node maintains a list of its neighboring nodes. Lists are also used in the procedure that identifies the components.

A.9 RELATED MODELS

See other models in the Networks section of the Models Library, such as Preferential Attachment. See also Network Example, in the Code Examples section.

A.10 CREDITS AND REFERENCES

Written by: Md Shapon Talukder, Venkata Saaketh Gummalla, Gautham Bhavaraju

... that is computationally intractable for large networks. Many algorithmic ...

Fig 6.20. Clique search algorithm. Adapted from Han and Liu (2010).

... Press

Girvan, M., Newman, M., 2002. Community structure in social and biological networks. Proc. Natl. Acad. Sci. U.S.A. 99 (12), 7821–7826.

Golbeck, ...

... 6.13. An example network illustrating Girvan and Newman’s algorithm

... neighborhood rivalry and substance abuse (Dodge et al., 2010; ...

Date posted: 16/05/2017, 16:43
