MINISTRY OF EDUCATION AND TRAINING
VIETNAM ACADEMY OF SCIENCE AND TECHNOLOGY
GRADUATE UNIVERSITY OF SCIENCE AND TECHNOLOGY

DUONG NGOC SON

SOME METHODS TO IMPROVE THE EFFICIENCY OF INFORMATION DIFFUSION FORECASTING ON SOCIAL NETWORKS

Major: Information System
Code: 48 01 04

SUMMARY OF COMPUTER SCIENCE DOCTORAL THESIS

Hanoi - 2022

The thesis was completed at the Graduate University of Science and Technology - Vietnam Academy of Science and Technology.

Supervisors: Dr Nguyen Nhu Son, Dr Nguyen Ngoc Cuong

Reviewer 1:
Reviewer 2:
Reviewer 3:

The thesis will be defended before the Academy-level PhD Thesis Evaluation Council, meeting at the Graduate University of Science and Technology - Vietnam Academy of Science and Technology at ... o'clock, ..., 20...

The thesis can be found at:
- The Library of the Graduate University of Science and Technology
- The National Library of Vietnam

INTRODUCTION

In the current information technology era, use of the Internet has become widespread. Statistics from Hootsuite and We Are Social show that, as of January 2020, the total number of Internet users worldwide reached 4.54 billion. As more people use the Internet, the need to use social networks also grows: the number of social network users worldwide has reached approximately 3.8 billion, accounting for 49% of the population. People spend a lot of time on the Internet in general and on social networks in particular, and with such heavy use, the amount of information on social networks is huge.

In recent years, many scientists and many studies have focused on analyzing information on social networks in order to exploit this big data source. Some of the main directions in research on social network information analysis include social network data mining (behavior analysis, hotspot detection, social counseling, ...); analyzing graph data models (graph theory, measures, computation on graphs, ...); community detection (analysis of community structure in social networks and of interaction relationships within communities); information security (securing information, detecting false information, ...); and the analysis and prediction of information diffusion.

Research on analyzing information on social networks also has practical implications. Currently, many agencies and businesses need to use social network information analysis systems for different purposes. For example, businesses need to analyze trends in consumers' choices of goods and users' preferences for products on the market. Press agencies are interested in the hot topics currently attracting attention so that they can focus on exploiting them. Service providers are interested in users' attitudes and satisfaction levels towards specific services. The Ministry of Information and Communications needs to ensure information security and manage the flows of information spread on social networks. The Ministry of Public Security also needs information analysis, detection of false information, identification of information sources, and forecasting of information spread in order to plan countermeasures.

Through his actual work, the PhD student has been exposed to and has directly used a number of social network information analysis systems of domestic and foreign units, organizations and enterprises.
Each system has its own features and characteristics; however, they basically perform some main tasks, such as collecting public information on social networks such as Facebook, Youtube and Twitter, including personal information (name, date of birth, interests), friends lists, posts, shares, interest in topics, attitudes and feelings about content, and information about the groups in which the user is a member. From there, the systems synthesize information; provide analyses of hot topics and content of interest with positive/negative attitudes; identify influential users; build relationship diagrams between users; and, most importantly, forecast the spread of information. Through the process of using these systems, many requirements for developing and improving them remain, among which two contents stand out: how to increase the speed of analysis, calculation and prediction of information propagation, because analyzing information on social networks takes a long time to execute even on high-performance computers, and how to increase the accuracy of predicting the spread of information on social networks.

Through the research process, the PhD student has identified the key issues to be solved for these two contents.

Firstly, social networks are often modeled as graphs with vertices and edges, and the analysis and prediction of information spread on social networks depends heavily on the calculation of measures on the graph. Since social networks are characterized by an increasing number of users and complex relationships, corresponding to a large number of vertices and edges in the graph, computation on the graph takes a lot of time. Increasing the speed of analysis and prediction of information propagation can be done at many stages, including increasing the speed of calculating the parameters used for propagation prediction. This is an issue of interest to many scientists, and in fact there have been many studies on improving this calculation, with some prominent methods such as graph reduction, approximation of measures, parallelization of measure calculations, or the use of analysis tools on supercomputers. During the research process, the PhD student had the opportunity to work with the research team of the University of Technology - Vietnam National University, Hanoi, and the plan proposed within the scope of the thesis is to combine two of the above methods: graph reduction and parallelization of the computation of a single measure, Betweenness Centrality. Betweenness centrality is an important measure for determining the importance of a person in a social network; Saraswathi (2020) used betweenness centrality to identify subjects that need to be isolated early in order to proactively prevent the spread of the SARS-CoV-2 virus. All testing was done on a high-performance computer of the University of Technology, and the research results, including the algorithm source code, are publicly posted on Github.

Second, increasing the accuracy of predicting the spread of information on social networks is a problem that is difficult to quantify. In essence, predicting information propagation means calculating the probabilities of information propagation and, from them, approximating the propagation (cascade) sizes. To increase the accuracy of propagation prediction, one must reduce the error of the estimated spread size when compared with the ground truth. Thus, from the outset, the calculation of the propagation probabilities should be as precise as possible. Most models in the world today are built on two input data sources: the influence between users (information about the interaction history) and the influence of user preferences.
However, based on these two sources alone, it is still not possible to guarantee the accuracy of the forecasting model. This stems from the nature of the predictive model, which is built on graph theory: unlike ordinary function graphs, the values attached to vertices and edges in a social network are random and subjective. For instance, whether a person chooses to read or not read an article, or to share or not share a piece of information, is affected by many factors, among which influence from outside the social network accounts for a considerable part. Researching this issue, the PhD student joined the research team of the Institute of Information Technology - Vietnam Academy of Science and Technology, and as a result a model for predicting information diffusion that combines factors from outside the social network has been built. The research results are also applied in the Academy-level project "Building a monitoring and forecasting system for information spread on social networks in Vietnam".

The objective of the thesis

The goal of the thesis is to research, develop and improve a number of methods to improve the efficiency of forecasting information spread on social networks, within the larger topic of improving the efficiency of social network analysis for research and for practical applications. The results of the thesis must solve two problems: improving the speed (or reducing the time) of computation and information analysis for forecasting information diffusion, and increasing the accuracy (or minimizing the error) of forecasting information spread on social networks. With these objectives, the thesis has obtained the following main results:

1) Proposing methods to improve the speed of calculation and analysis for forecasting information spread on social networks. This contribution is presented in Chapter 2 of the thesis.
2) Proposing methods to improve the accuracy of information propagation prediction. This contribution is presented in Chapter 3 of the thesis.

CHAPTER 1. OVERVIEW OF INFORMATION DIFFUSION ON SOCIAL NETWORKS

1.1 Social network concept

Social networks play an important role in spreading information on a large scale. Up to now, many studies have been done to understand this process, from data mining to detect topics of interest, detect hot spots and identify influential users in social networks, to the analysis and study of information diffusion models. Social networking allows billions of Internet users worldwide to connect, post and transmit content; users are both exposed to and a huge source of information. The diffusion of information creates powerful effects, for example the revolutionary wave on Facebook during the 2010 Arab Spring, or the impact on Twitter during the 2008 US presidential election. Due to the impact of social networks on real life, recent research has focused on community discovery and the extraction of valuable information from this huge amount of data. Events take place and develop very quickly on social networks, so capturing, understanding and predicting events is a matter of concern for many different audiences, from organizations and businesses to researchers.
It has also been shown that understanding the relationships in social network communities and the development of social networks can help regulate behavior and better predict future events, such as analyzing and maximizing business performance through social marketing campaigns; regulating the behavior of the user community through influential individuals; or analyzing protests to solve security issues such as preventing terrorist attacks and forecasting information sources that adversely affect society. Therefore, developing techniques and models for community detection, capturing the development of social networks, and modeling information diffusion in social networks have been topics of interest to researchers in recent years.

1.2 Research areas in social network analysis

- Social network data mining: applications in areas such as behavioral analysis, hotspot detection and social counseling.
- Graph data model analysis: applications in managing large-scale data such as social network data.
- Community detection: from given social networks, detect community structures and learn the relationships between individuals, thereby addressing how certain individuals or relationships affect the structure of the entire social network.
- Prediction of information spread: information diffusion is a process by which an innovation of information is communicated through certain channels over time among the members of a social network.
- Information security: solve the problem of information leakage, limit or neutralize false information.

1.3 The problem of improving the efficiency of information diffusion forecasting on social networks

Information diffusion is a process by which an innovation of information is communicated through certain channels over time among the members of a social network. There are three important factors in this process: membership in the social network, mutual interaction, and communication channels. The study of transmission processes in specific situations is the foundation for solving problems related to actual diffusion, such as the spread of diseases (in medicine and epidemiology), the spread of ideas among individuals in a society, the spread of viruses on a computer network, and the spread of information on social networks.

Information propagation is a typical social network analysis problem with many potential real-world applications. For example, it can be used to predict major social events such as the Arab Spring, to increase the feedback performance of products and services, or to maximize the effectiveness of advertising to users. However, it is a difficult and time-consuming problem to produce prediction results. One of the reasons it takes a long time is that social networks are getting bigger and bigger: the number of users is increasing and the relationships are complicated, so when the network is modeled as a graph, the very large number of vertices (users) and edges (relationships) makes the computation complicated. Many studies have therefore been launched to increase the speed of analysis and prediction of information diffusion. In addition, information diffusion is calculated based on probabilities, and on social networks these probabilities depend on many influencing factors, such as the relationship between the sender and the receiver of the information, the recipient's interest in the information, and influences from the external social environment. Therefore, increasing the accuracy of predicting the spread of information on social networks is of interest to many researchers.

1.4 Related research directions

1.4.1 Improving the speed of information diffusion forecasting

Graph reduction is a basic and effective method among studies on reducing the time of analysis and of calculating parameters for information propagation prediction.
The essence of graph reduction is to remove or replace unnecessary or less important vertices and edges to obtain a more compact graph while keeping the important vertices and the necessary properties of the graph. Feder (1995) and Adler (2001) mentioned reduction by means of graph compression: Feder offers a compression method using a partitioning algorithm for bipartite graphs, while Adler reduces the compression problem to the problem of finding a minimum spanning tree. Basically, graph compression results in a more compact graph; however, it usually serves the problem of storing graphs (data or structures). In addition, for social network analysis with continuously changing data, graph compression is not suitable, because continuous conversion between the original graph and the compressed graph is not feasible.

Gilbert (2004) introduced a number of graph reduction methods, with the KeepOne and KeepAll algorithms and the Redundant Vertex Elimination (RVE) method for removing redundant vertices. The KeepOne algorithm is similar to Adler's method, i.e. it finds a minimum spanning tree of the graph. This algorithm keeps the maximum number of significant vertices and the vertices between those important vertices; however, its biggest disadvantage is that it does not preserve the shortest path between two vertices. In contrast, KeepAll keeps the shortest paths between important vertices but can delete relatively many vertices between them. The two algorithms have their own strengths and are suitable in certain cases (for example, KeepOne suits network planning problems, KeepAll suits shortest-path problems in traffic); however, they are not suitable for social network analysis problems. RVE is the method closest to analysis and calculation on social networks: it removes vertices that share adjacent neighboring vertices. The method is often applied in reducing media networks by removing unnecessary redundant nodes; if applied in social network analysis, it can also remove unimportant vertices, but it should be noted that if important vertices are adjacent, one of them will also be removed. Dung's thesis (2019) offers a reduction method based on replacing equivalent vertices (hanging vertices and ridge vertices), then calculating the betweenness centrality on the reduced graph. The advantage of the method is that the resulting graph is more compact; however, because it changes the original graph too much, the method can only be applied to small graphs of 100-1000 vertices and 500-5000 edges, as in the experiments.

Parallelization of measure calculations is a method of interest to many researchers due to its effectiveness in reducing computation time on graphs. Hanh (2018) is the research closest to the PhD student's idea, parallelizing the calculation of the Closeness Centrality and restructuring the data. The method is clearly effective, but compared with the direction of the PhD student mentioned above, which combines parallelization with graph reduction and applies it to increasing the speed of information diffusion prediction, Hanh's method is not applicable.
Bernaschi (2015), Fan (2017) and McLaughlin (2018) have the same idea as the PhD student, offering solutions to improve the speed of calculating Betweenness Centrality by parallelizing the calculation process on GPU graphics processors using different methods. Using the GPU is highly effective in increasing the speed of computation; however, the GPU is suitable for computations on static, unchanging graphs. In addition, the GPU parallelization solutions all use Brandes' betweenness centrality algorithm, similar to the PhD student's idea, so in the future the PhD student also intends to combine GPU computation to apply his method.

For the approximation approach, Mahmoody (2016) and Riondato (2016) give the idea of quickly approximating the betweenness centrality based on sampling techniques. According to this method, a number of shortest paths are randomly sampled, from which an algorithm estimates the distances between vertices and approximates the betweenness centrality values. However, in social network graph analysis, random sampling is not always accurate and cannot be used as a basis for evaluation.

Wei (2016) suggests using the GraphLab and Apache Giraph toolkits on complex computing infrastructures such as cluster computing systems or high-performance supercomputers. These toolkits are designed to analyze and compute networks at very large scales, up to trillions of edges. However, they are not really effective for computational problems on real networks of not-too-large size, such as Facebook and Youtube, with numbers of vertices less than 2^32. In addition to the above methods, to improve computational performance on graphs, the parallelization methods in the NetworKit toolkit or the TeexGraph toolkit are often used for the analysis of large-scale social networks. These tools all use a shared-memory parallel model and the OpenMP library to parallelize the calculation of measures, including Betweenness Centrality. In the thesis, the PhD student uses these two toolkits as a baseline to evaluate the effectiveness of his method.

1.4.2 Improving the accuracy of information diffusion forecasting

In order to come up with a solution to the problem of increasing accuracy in information propagation forecasting, it is necessary to consider the different influences on information propagation. Kwak (2010) examines the influence of users on the spread of information by looking at the network structure and using a variation of the PageRank algorithm to rank influential users on the social network Twitter based on users' mutual follow relationships. Haveliwala (2002) uses a set of PageRank vectors to calculate query-specific importance scores for pages, thereby determining the influence of a particular content on the spread of information. Weng (2010) examines user influence and content matching using the TwitterRank algorithm, an extension of PageRank, to measure user influence on Twitter, taking into account the topical similarity of the content that users post. However, the problem with these methods is that the network structure is relatively static compared with user activity in the social network, and the methods are mainly tested on the social network Twitter, whose structure is dominated by one-way follow relationships.

Myers (2012) and Wu (2015) examine the effects of external trends: Myers studies users' reception of external communication and compares it with the information transmitted through user relationships to assess the impact on the dissemination of information, while Wu combines the influence of external trends with topic-based social descriptions to build a topic-based information dissemination model, which is then applied in propagation forecasting.
The consideration of external influences has been proven to affect users in social networks; however, this research direction considers users' reception of external information, which differs from the PhD student's direction, which considers how external influences affect the transmission of information between users.

Predicting the propagation (cascade) size in information propagation is a topic that many researchers approach in different directions. Cohen (2014), Kempe (2003) and Lucier (2015) estimate the spread size through sampling, i.e. using a fixed number of samples to estimate the expected cascade size: Cohen and Kempe use a greedy algorithm to approximate the cascade size, while Lucier uses sampling on the MapReduce distributed computing infrastructure. This research direction has specific applications such as estimating the total number of votes for an article, estimating the number of news phrases per hour, or estimating the total number of hash-tag uses per day. However, for social networks there is usually no single sample size that works for all networks. Bakshy (2009), Szabo (2010), Jenders (2013) and Kupavskii (2012) predict the spread size by defining some function that correlates with the dependent variable, be it the spread size or the activation probability; Bakshy and Szabo then use linear regression, and Jenders and Kupavskii use a classification algorithm, to estimate the value of that dependent variable. Basically, regression and classification are two methods of supervised learning; therefore, this is a different research direction from the PhD thesis, which uses the computed activation probabilities to estimate the propagation size and thereby increase prediction accuracy.

1.5 Proposed method

From the above analysis, the PhD student proposes in the thesis two methods to solve the two problems posed.

Firstly, use graph reduction and parallelize the calculation of the betweenness centrality to reduce the time of analysis and calculation for information propagation forecasting. Specifically, based on the idea of the Redundant Vertex Elimination (RVE) method, the PhD student proposes a reduction method more suitable for social network graph analysis, namely reducing the graph by replacing equivalent vertices of degree 1. That is, in the reduction process we only consider the hanging vertices with deg(v) = 1; during the graph traversal, if we can identify hanging vertices with the same adjacent set, we choose one vertex to represent the remaining vertices. As a result, a new, more compact graph reduces the computation time on the graph because unimportant vertices do not have to be considered. Based on the research of Zhang (2020) and Hinz (2011), which demonstrated the influence of the betweenness centrality of the vertices in the graph on the reception and propagation of information, the PhD student parallelizes the calculations in Brandes' betweenness centrality algorithm with a multi-threaded programming model on the CPU, using the CilkPlus library.
There are three caveats: first, reducing the size of the graph will change the calculated values of the measures, but it still preserves the "importance" of the central vertices; second, computing on a graph with a large number of vertices and edges, in which the calculation of the betweenness centrality is a difficult problem to parallelize, is handled in the thesis by a number of processing techniques; third, within the scope of the thesis, only undirected, connected, unweighted graphs are considered, as described more clearly in Chapter 2 of the thesis.

Secondly, propose a method to improve the accuracy of information diffusion prediction by providing a way to calculate a user's probability of accepting information (the propagation probability) on social networks under the Independent Cascade model, based on three parameters: the relationship between users, the user's interest in the content, and external influences. From the calculated probabilities, a "most likely" propagation tree for a particular content is constructed to estimate the propagation size during information propagation. Note that quantifying the probability that users accept a content on an online social network is a difficult task because it is highly subjective: the decision depends on many factors, and the mechanism that motivates users to take action cannot be precisely defined. Furthermore, user influence has an effect only for a certain period of time, which necessitates constant assessment of user behavior and viral content. Therefore, the method of the thesis considers two factors: one is to examine the user's interaction history when determining the influence of user relationships on the decision to accept content; the other is to quantify external influences on user acceptance of content.
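To make this second proposal concrete, the sketch below combines the three factors into a single acceptance probability and grows a "most likely" propagation tree from a seed user, from which a crude cascade-size estimate is read off. The combination rule p = 1 - (1 - p_rel)(1 - p_int)(1 - p_ext), the 0.2 size threshold and all identifiers are illustrative assumptions made for this summary, not the thesis's actual model.

// Illustrative sketch only: the multiplicative combination of the three
// factors and the threshold-based size estimate are assumptions, not the
// formulas defined in the thesis.
#include <cstdio>
#include <queue>
#include <utility>
#include <vector>
using namespace std;

struct Edge { int to; double p_rel, p_int, p_ext; };  // relationship, interest, external influence

// Assumed combination: the receiver accepts if at least one factor activates him/her.
double acceptProb(const Edge& e) {
    return 1.0 - (1.0 - e.p_rel) * (1.0 - e.p_int) * (1.0 - e.p_ext);
}

int main() {
    // Toy network: user 0 is the seed; adj[u] lists users that u can pass content to.
    vector<vector<Edge>> adj = {
        {{1, 0.30, 0.20, 0.10}, {2, 0.05, 0.40, 0.00}},   // 0 -> 1, 0 -> 2
        {{3, 0.50, 0.10, 0.20}},                          // 1 -> 3
        {{3, 0.10, 0.10, 0.05}},                          // 2 -> 3
        {}                                                // 3
    };
    int n = adj.size(), seed = 0;

    // "Most likely" propagation tree: for every user keep the highest-probability
    // path from the seed (a Dijkstra-style search maximizing the product of edge
    // probabilities); parent[] stores the resulting tree.
    vector<double> reach(n, 0.0);
    vector<int> parent(n, -1);
    priority_queue<pair<double, int>> pq;        // (probability, vertex), max-heap
    reach[seed] = 1.0;
    pq.push({1.0, seed});
    while (!pq.empty()) {
        auto [p, u] = pq.top(); pq.pop();
        if (p < reach[u]) continue;              // stale queue entry
        for (const Edge& e : adj[u]) {
            double q = p * acceptProb(e);        // probability of the path seed -> ... -> u -> e.to
            if (q > reach[e.to]) { reach[e.to] = q; parent[e.to] = u; pq.push({q, e.to}); }
        }
    }

    // Crude cascade-size estimate: count users whose best reach probability
    // exceeds a chosen threshold (0.2 here, an arbitrary illustrative value).
    int size = 0;
    for (int v = 0; v < n; ++v) if (reach[v] >= 0.2) ++size;
    printf("estimated cascade size: %d\n", size);
    for (int v = 0; v < n; ++v)
        printf("user %d: reach=%.3f parent=%d\n", v, reach[v], parent[v]);
    return 0;
}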
CHAPTER 2. IMPROVING THE SPEED OF INFORMATION DIFFUSION FORECASTING

2.1 Preamble

As described in Chapter 1, producing results on information propagation is a complicated and time-consuming computation. Speeding up the prediction of information diffusion can be done in many different stages and processes, including increasing the speed, or reducing the calculation time, of the parameters used for propagation prediction. Based on graph theory, graph reduction is a simple and effective method for reducing the time of calculating parameters for analysis in general and for propagation prediction in particular. Removing unimportant vertices and edges from the graph makes the computation simpler and lighter; what matters is to show that the reduction does not affect the overall model of the graph. In addition, acceleration can be done at the stage of calculating the parameters for analysis and forecasting. In graph analysis, one of the essential parameters to determine is centrality, for the purpose of identifying the most important (central) vertices in a graph. Determining centrality helps us identify the center of an epidemic area, the main node in the Internet, or an influencer in a social network. Kumar (2021) has demonstrated the influence and importance of calculating centrality in the problem of predicting disease spread as well as information diffusion. Among the centrality measures, besides Degree Centrality, Closeness Centrality and Eigenvector Centrality, Betweenness Centrality (BC) is an important and valuable measure: it captures whether a vertex acts as an intermediate bridge on the shortest paths between other vertices. Betweenness centrality was conceptualized by Freeman in 1977, who showed that vertices with a high probability of lying on the shortest path between two randomly selected vertices have a high betweenness centrality. Zhang (2020) and Hinz (2011) have demonstrated the influence of betweenness centrality on users' reception and spreading of information.

To calculate the betweenness centrality for all vertices in a graph G, we have to find the shortest paths between all pairs of vertices in G, that is, solve the All-Pairs Shortest Path (APSP) problem. Some methods to solve the APSP problem are the Floyd-Warshall algorithm, Johnson's algorithm and Brandes' algorithm. Compared with calculating APSP using the Floyd-Warshall algorithm (complexity O(|V|^3)) and Johnson's algorithm (complexity O(|V|^2 log(|V|) + |V||E|)), Brandes' algorithm, with time complexity O(|V||E|) on unweighted graphs and O(|V||E| + |V|^2 log(|V|)) on weighted graphs, is still the most effective solution today. Brandes' algorithm for unweighted graphs is illustrated in Algorithm 2.1 below:

Algorithm 2.1 Brandes' Algorithm to Compute the Betweenness Centrality
Input: G = (V, E) organized as a two-dimensional vector Edges[][]
Data: an empty queue Q; a stack S able to contain |V| vertices;
  dist[v]: the distance from the source vertex to v;
  Pred[v]: the list of predecessors of v on shortest paths from the source vertex;
  σ[v]: the number of shortest paths from the source vertex to v;
  δ[v]: the dependency of the source on v;
Output: BC[.] for every v ∈ V
1:  foreach source vertex s ∈ V
2:    foreach v ∈ V: Pred[v] ← empty list; dist[v] ← ∞; σ[v] ← 0;
3:    dist[s] ← 0; σ[s] ← 1; Q.push(s);
4:    while Q not empty
5:      v ← Q.pop(); S.push(v);
6:      foreach w ∈ Edges[v]
7:        if dist[w] == ∞ then dist[w] ← dist[v] + 1; Q.push(w);
8:        if dist[w] == dist[v] + 1 then σ[w] ← σ[w] + σ[v]; Pred[w].push_back(v);
9:      end
10:   end
11:   foreach v ∈ V: δ[v] ← 0;
12:   while S not empty
13:     w ← S.pop();
14:     for v ∈ Pred[w]: δ[v] ← δ[v] + (σ[v]/σ[w])·(1 + δ[w]);
15:     if w ≠ s then BC[w] ← BC[w] + δ[w];
16:   end
17: end
18: return BC[.];

Although Brandes' algorithm has the lowest time complexity of the three for calculating the betweenness centrality, with the characteristics of social networks, where the graph has a large number of members (vertices) and relations (edges), the calculation still takes a lot of time. In Chapter 2 of the thesis, the PhD student presents a method to improve the speed of information propagation prediction by integrating two ideas: a graph reduction technique based on replacing equivalent vertices of degree 1, and the parallelization of the betweenness centrality calculation in Brandes' algorithm with a multi-threaded programming model on the CPU, using the CilkPlus library.

2.2 Improving the speed of information diffusion forecasting

2.2.1 Graph reduction

The first step in the graph reduction process is to determine the equivalent vertices of degree 1. Degree 1 means the hanging vertices with deg(v) = 1, and equivalent means that they must have the same set of adjacent vertices Г(v). Since they are vertices of degree 1 (with a single adjacent vertex), what we need to do is find the hanging vertices that share the same adjacent vertex and keep one of them as the representative of the others.
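A minimal sketch of this degree-1 reduction on an adjacency-list graph follows; the function and variable names are illustrative and are not taken from the thesis's published source code.

// Remove equivalent hanging vertices: among the degree-1 neighbours of each
// vertex, keep one representative and delete the rest (illustrative sketch).
#include <cstdio>
#include <vector>
using namespace std;

using Graph = vector<vector<int>>;   // adjacency lists, vertices numbered 0..|V|-1

void reduceDegreeOne(Graph& adj) {
    int n = adj.size();
    vector<bool> removed(n, false);
    for (int v = 0; v < n; ++v) {
        if (removed[v]) continue;
        int kept = -1;                       // representative hanging neighbour of v
        vector<int> newNeighbours;
        for (int u : adj[v]) {
            if (!removed[u] && adj[u].size() == 1) {   // u is a hanging vertex attached to v
                if (kept == -1) { kept = u; newNeighbours.push_back(u); }
                else            { removed[u] = true; adj[u].clear(); }
            } else {
                newNeighbours.push_back(u);
            }
        }
        adj[v] = newNeighbours;
    }
}

int main() {
    // Toy graph: vertices 1, 2 and 3 all hang off vertex 0; 0-4 and 4-5 are normal edges.
    Graph adj = { {1, 2, 3, 4}, {0}, {0}, {0}, {0, 5}, {4} };
    reduceDegreeOne(adj);                    // keeps vertex 1, removes vertices 2 and 3
    for (size_t v = 0; v < adj.size(); ++v) {
        printf("%zu:", v);
        for (int u : adj[v]) printf(" %d", u);
        printf("\n");
    }
    return 0;
}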
           500-vertex graph        400-vertex graph
Run      Not reduced  Reduced    Not reduced  Reduced
1        1.30         1.25       1.28         1.20
2        1.27         1.23       1.26         1.20
3        1.28         1.25       1.25         1.19
4        1.32         1.24       1.26         1.20
5        1.28         1.22       1.25         1.18
6        1.29         1.24       1.28         1.21
7        1.29         1.23       1.24         1.18
8        1.32         1.25       1.24         1.19
9        1.31         1.25       1.25         1.18
10       1.30         1.24       1.26         1.20
Table. Comparison of diffusion time (s)

2.2.2 Parallelizing the calculation of Betweenness Centrality

First of all, to represent graphs there are three main methods: edge list, adjacency matrix and adjacency list. For a relatively large graph, the edge list method is quite simple, but operations on the graph such as insertion and deletion of vertices are difficult; the adjacency matrix method is also unusable due to memory size limitations. Therefore, the most suitable method is the adjacency list. For vertex data, each vertex of the graph G = (V, E) is assigned a value from 0 to |V| - 1; for edge data, the vertex vectors are arranged to represent the edges of the graph, i.e. the edge data is stored in a vector array.

Second, the calculation of the betweenness centrality BC according to Brandes' method depends largely on the BFS graph traversal. To reduce the queue size during traversal, every time we traverse a vertex u we use a Maps array in which the v-th bit records whether vertex v has been traversed. The queue Q is also organized to store the shortest distance from u to the traversed vertices in the queue. Because the queue and the Maps traversal-marker array are large (with a number of elements equal to |V|), memory allocation takes a long time; therefore, we pre-allocate the memory for these arrays, one set for each thread that can execute in parallel.

Thirdly, to exploit the performance of multi-core CPUs, the thesis's solution for parallelizing the calculation of the betweenness centrality BC is to execute the BC calculations for different source vertices in parallel, rather than parallelizing the traversal and the calculation of shortest paths from one vertex to all other vertices (SSSP). This approach allows SSSP traversal to be performed in dedicated threads, thereby improving cache access speed.

Fourth, regarding libraries for parallelization such as CilkPlus, OpenMP and Pthread, A. Leist and A. Gilman (2010) have experimented and shown that the CilkPlus library gives a better speed-up factor than OpenMP and Pthread. Accordingly, the CilkPlus library is used to implement the parallel computation.

Finally, it can be said that computation on a graph with a large number of vertices and edges, including the calculation of the betweenness centrality BC, is a relatively difficult problem to parallelize. The reason is that in Phase 2 of the algorithm, the accumulation process requires a concurrency-control technique to handle the data accumulated from parallel threads. During the research and testing process, the PhD student added a reducerBC[v] vector based on the CilkPlus library. Technically, the reducer creates a separate accumulator for each thread, and combining the threads' private accumulators produces the results in the correct order when the threads finish. That is, the reducerBC[v] vector allows concurrent updating of the BC value of vertex v when the algorithm is executed in parallel with the CilkPlus library.
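The sketch below illustrates the per-source parallel loop and the role of the reducer: each thread accumulates its BC contributions into a private array, and the private arrays are merged once after the parallel loop. OpenMP is used here only as a portable stand-in for the CilkPlus reducer described above, the per-thread work buffers are allocated inside the loop for brevity rather than pre-allocated as discussed, and the small test graph is an assumption; this is not the thesis's published code. Compile with -fopenmp.

// Parallel Brandes betweenness centrality with per-thread accumulators
// (illustrative sketch; OpenMP stands in for the CilkPlus reducer).
#include <cstdio>
#include <omp.h>
#include <queue>
#include <stack>
#include <vector>
using namespace std;

using Graph = vector<vector<int>>;   // adjacency lists, vertices 0..|V|-1

int main() {
    // Small undirected test graph (each edge listed in both directions).
    Graph adj = { {1, 2}, {0, 2, 3}, {0, 1, 3}, {1, 2, 4}, {3, 5}, {4} };
    int n = adj.size();

    int nthreads = omp_get_max_threads();
    // One private BC accumulator per thread: this plays the role of reducerBC[].
    vector<vector<double>> localBC(nthreads, vector<double>(n, 0.0));

    #pragma omp parallel for schedule(dynamic)
    for (int s = 0; s < n; ++s) {
        vector<double>& BC = localBC[omp_get_thread_num()];

        // Phase 1: BFS from s, counting shortest paths.
        vector<int> dist(n, -1), sigma(n, 0);
        vector<vector<int>> pred(n);
        vector<double> delta(n, 0.0);
        stack<int> S;
        queue<int> Q;
        dist[s] = 0; sigma[s] = 1; Q.push(s);
        while (!Q.empty()) {
            int v = Q.front(); Q.pop(); S.push(v);
            for (int w : adj[v]) {
                if (dist[w] < 0) { dist[w] = dist[v] + 1; Q.push(w); }
                if (dist[w] == dist[v] + 1) { sigma[w] += sigma[v]; pred[w].push_back(v); }
            }
        }

        // Phase 2: dependency accumulation in reverse BFS order.
        while (!S.empty()) {
            int w = S.top(); S.pop();
            for (int v : pred[w])
                delta[v] += (double)sigma[v] / sigma[w] * (1.0 + delta[w]);
            if (w != s) BC[w] += delta[w];   // thread-private update, no locking needed
        }
    }

    // Merge step: combine the private accumulators (the job of reducerBC.move_out).
    vector<double> BC(n, 0.0);
    for (int t = 0; t < nthreads; ++t)
        for (int v = 0; v < n; ++v) BC[v] += localBC[t][v];

    for (int v = 0; v < n; ++v)
        printf("BC[%d] = %.2f\n", v, BC[v] / 2.0);   // halve because the graph is undirected
    return 0;
}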
From there, the betweenness centrality BC is executed in parallel, as illustrated in the following algorithm combined with graph reduction:

Algorithm 2.3 Combined algorithm
Input: G = (V, E) organized as a two-dimensional vector Edges[][]
Data: an empty queue Q; a stack S able to contain |V| vertices;
  dist[v]: the distance from the source vertex to v;
  Pred[v]: the list of predecessors of v on shortest paths from the source vertex;
  σ[v]: the number of shortest paths from the source vertex to v;
  δ[v]: the dependency of the source on v;
  reducerBC[v]: vector containing the BC values of all vertices v, allowing concurrent updates in parallel with the CilkPlus library;
Output: BC[.] for every v ∈ V
/* Executed in parallel using the CilkPlus library */
1:  foreach source vertex s ∈ V
    /* Phase 1: graph traversal */
2:    foreach v ∈ V: Pred[v] ← empty list; dist[v] ← ∞; σ[v] ← 0;
3:    dist[s] ← 0; σ[s] ← 1; Q.push(s);
4:    while Q not empty
5:      v ← Q.pop(); S.push(v);
      /* Graph reduction */
6:      foreach w ∈ Edges[v] with Edges[w].size() = 1
7:        if u ∈ Edges[v] and u ≠ w and Edges[u].size() = 1 then
8:          Edges[v] ← Edges[v] \ {u}    /* delete u from the adjacency list of v */
9:          Edges[u] ← {}                /* delete u */
10:     end
11:     foreach w ∈ Edges[v]
12:       if dist[w] == ∞ then dist[w] ← dist[v] + 1; Q.push(w);
13:       if dist[w] == dist[v] + 1 then σ[w] ← σ[w] + σ[v]; Pred[w].push_back(v);
14:     end
      /* Phase 2: accumulation */
15:   foreach v ∈ V: δ[v] ← 0;
16:   while S not empty
17:     w ← S.pop();
18:     for v ∈ Pred[w]: δ[v] ← δ[v] + (σ[v]/σ[w])·(1 + δ[w]);
19:     if w ≠ s then reducerBC[w] ← reducerBC[w] + δ[w];
20:   end
21: end
22: reducerBC.move_out(BC);
23: return BC[.];

Algorithm 2.3 parallelizes the calculation of the betweenness centrality according to Brandes' method using the CilkPlus library, and is also combined with graph reduction (the conditions for this combination are presented in section 2.2.3 below). As can be seen, the time complexity of this algorithm is O(|V|·|E| / t). That is, if the algorithm is executed with t = 1 thread, it is equivalent to Brandes' basic algorithm with complexity O(|V|·|E|); if the algorithm is executed in parallel with t threads, the time complexity is reduced by a factor of t.
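As a rough illustration of this bound (the figures below are hypothetical, not measurements from the thesis): for a graph with |V| = 10,000 vertices and |E| = 100,000 edges, Brandes' sequential algorithm performs on the order of |V|·|E| = 10^9 traversal steps; running Algorithm 2.3 with t = 8 threads reduces the per-thread work to roughly 10^9 / 8 ≈ 1.25·10^8 steps, an ideal eight-fold speed-up before accounting for the reducer's merging overhead and for any additional gain from graph reduction.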
2.2.3 Method of combining the two techniques

The thesis's method combines the two techniques of graph reduction and parallelization of the betweenness centrality calculation by integrating the graph reduction process into the graph traversal phase of the betweenness centrality algorithm. However, to do this, we have to prove the two statements mentioned above: first, that the reduced vertices are unimportant vertices, so the reduction does not affect the general model of the graph; second, that the importance of the central vertices is preserved. To verify this, we test on a simple graph with 19 vertices and 28 edges, as shown in the figure below:

Figure. Graph before reduction

Applying the formulas for the betweenness centrality BC and the closeness centrality CC, we obtain the results in Table 2:

Table 2. BC and CC calculation results before reduction

Thus, H is the vertex with the highest betweenness centrality, BC[H] = 176 (followed by J and K), and J is the vertex with the highest closeness centrality, CC[J] = 0.45 (followed by H and K). With a simple graph of small size, we can see visually that vertices A and B are equivalent vertices of degree 1, and similarly vertices R and S are equivalent vertices of degree 1. Applying the proposed graph reduction method, we obtain the reduced graph shown in Figure 4:

Figure 4. Graph after reduction

Applying the formulas for the betweenness centrality BC and the closeness centrality CC to the reduced graph, we obtain the results in Table 3:

Table 3. BC and CC calculation results after reduction

After reduction, H is still the vertex with the highest betweenness centrality, BC[H] = 140 (followed by J and K), and J is still the vertex with the highest closeness centrality, CC[J] = 0.48 (followed by H and K). In addition, the PhD student also sets a threshold Δ = 30% for the allowed difference in centrality values. The results from Table 2 and Table 3 show that the difference in the centrality of the vertices after reduction is within this allowed threshold.
