13. Outbreak Detection in Networks

CS224W: Analysis of Networks
Jure Leskovec, Stanford University
http://cs224w.stanford.edu

• (1) New problem: Outbreak detection
• (2) Develop an approximation algorithm
  - It is a submodular optimization problem!
• (3) Speed up greedy hill-climbing
  - Valid for optimizing general submodular functions (i.e., also works for influence maximization)
• (4) Prove a new "data-dependent" bound on the solution quality
  - Valid for optimizing any submodular function (i.e., also works for influence maximization)

11/7/18, Jure Leskovec, Stanford CS224W: Analysis of Networks, http://cs224w.stanford.edu

• Given a real city water distribution network
• And data on how contaminants spread in the network
• Detect the contaminant as quickly as possible
Problem posed by the US Environmental Protection Agency

[Figure: posts by users/blogs connected by time-ordered hyperlinks, forming an information cascade]
Which users/news sites should one follow to detect cascades as effectively as possible?

Want to read things before others:
• Detect blue & yellow stories soon but miss the red story
• Detect all stories but late

• Both of these are instances of the same underlying problem!
• Given a dynamic process spreading over a network, we want to select a set of nodes to detect the process effectively
• Many other applications:
  - Epidemics
  - Influence propagation
  - Network security

• Utility of placing sensors:
  - Water flow dynamics, demands of households, …
• For each subset S ⊆ V compute utility f(S)
[Figure: two sensor placements (S1 to S4) over the set V of all network junctions, shown against high-, medium-, and low-impact contamination outbreaks. One placement has high sensing "quality" (e.g., f(S) = 0.9), the other low sensing "quality" (e.g., f(S) = 0.01).]
A sensor reduces impact through early detection!

Given:
• Graph G(V, E)
• Data on how outbreaks spread over G:
  - For each outbreak i we know the time T(u, i) when outbreak i contaminates node u

Water distribution setting:
• G is the water distribution network (physical pipes and junctions)
• A simulator of water consumption & flow (built by Mech Eng people)
• We simulate the contamination spread for every possible location

News media setting:
• G is the network of news media
• Traces of the information flow are used to identify influence sets
• Collect lots of articles and trace them to obtain data about information flow from a given news site

• Goal: Select a subset of nodes S that maximizes the expected reward:

  $$\max_{S \subseteq V} f(S) = \sum_i P(i)\, f_i(S) \quad \text{subject to } \mathrm{cost}(S) < B$$

  - The sum is the expected reward for detecting outbreak i, taken over all outbreaks
  - P(i): probability of outbreak i occurring
  - f_i(S): reward for detecting outbreak i using sensors S

• Real metropolitan area water network
  - V = 21,000 nodes
  - E = 25,000 pipes
• Use a cluster of 50 machines for a month
• Simulate 3.6 million epidemic scenarios (random locations, random days, random time of the day)

[Plot: solution quality F(A) (higher is better) versus number of sensors placed, comparing hill climbing, the "offline" (1-1/e) bound, and the data-dependent bound]
Data-dependent bound is much tighter (it gives a more accurate estimate of the algorithm's performance)

Battle of Water Sensor Networks competition [w/ Ostfeld et al., J of Water Resource Planning]

Author              Score
CELF                26
Sandia              21
U Exeter            20
Bentley systems     19
Technion (1)        14
Bordeaux            12
U Cyprus            11
U Guelph
U Michigan
Michigan Tech U
Malcolm
Proteo
Technion (2)

• Placement heuristics perform much worse

• Different objective functions give different sensor placements
[Figure: sensor placements optimized for population affected versus detection likelihood]

Here CELF is much faster than greedy hill-climbing!
  - (But there might be datasets/inputs where CELF will have the same running time as greedy hill-climbing)

• I have 10 minutes. Which news sites should I read to be most up to date?
• Who are the most influential news sites?
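
To make the objective from the problem statement concrete, here is a minimal sketch of f(S) = Σ_i P(i) f_i(S). It assumes a detection-likelihood reward: f_i(S) = 1 if some sensor in S is ever contaminated by outbreak i, and 0 otherwise. The function name `expected_reward`, the dictionary layout, and the toy outbreaks are all made up for illustration; they are not taken from the course material.

```python
import math

def expected_reward(sensors, detection_times, outbreak_prob):
    """Sketch of f(S) = sum_i P(i) * f_i(S).

    detection_times[i][u] plays the role of T(u, i); a node missing from
    detection_times[i] is treated as never contaminated (T = infinity).
    Assumption: f_i(S) is a 0/1 detection-likelihood reward.
    """
    total = 0.0
    for i, times in detection_times.items():
        # Outbreak i is detected if any chosen sensor has a finite T(u, i).
        detected = any(times.get(u, math.inf) < math.inf for u in sensors)
        total += outbreak_prob[i] * (1.0 if detected else 0.0)
    return total

# Hypothetical toy data: three outbreaks over nodes a, b, c.
detection_times = {
    "o1": {"a": 1.0, "b": 3.0},
    "o2": {"b": 2.0, "c": 5.0},
    "o3": {"c": 1.0},
}
outbreak_prob = {"o1": 0.5, "o2": 0.3, "o3": 0.2}

for candidate in ({"a"}, {"b"}, {"a", "c"}):
    print(sorted(candidate), expected_reward(candidate, detection_times, outbreak_prob))
```

Other rewards named in the slides (for example, population affected) would replace only the line that computes `detected`; the outer sum over outbreaks stays the same.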
Want to read things before others:
• Detect blue & yellow stories soon but miss the red story
• Detect all stories but late

• Crawled 45,000 blogs for a year
• Obtained 10 million news posts
• And identified 350,000 cascades
• Cost of a blog is the number of posts it has

• Online bound turns out to be much tighter!
  - Based on the plot: 87% instead of 32.5%
[Plot: old bound vs. our bound vs. CELF]

• Heuristics perform much worse!
• One really needs to perform the optimization

• CELF has sub-algorithms. Which wins?
• Unit cost:
  - CELF picks large popular blogs
• Cost-benefit:
  - Cost proportional to the number of posts
• We can do much better when considering costs

• Problem: Then CELF picks lots of small blogs that participate in few cascades
• We pick the best solution that interpolates between the costs
• We can get good solutions with few blogs and few posts
[Plot: each curve represents a set of solutions S with the same final reward f(S), shown for f(S) = 0.2, 0.3, and 0.4]

• We want to generalize well to future (unknown) cascades
• Limiting selection to bigger blogs improves generalization!
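
The unit-cost versus cost-benefit comparison above can be sketched as two greedy runs under the same budget. This is a hedged illustration, not the course's implementation: it assumes the "combination" means running both selection rules and keeping whichever solution scores higher, it uses a simple cascade-coverage reward, and the blog names, costs, and budget are invented. The budget check is strict because the slides state cost(S) < B.

```python
def greedy(nodes, cost, budget, reward, use_ratio):
    """Budgeted greedy selection.

    use_ratio=False: pick the largest marginal gain (unit-cost rule).
    use_ratio=True:  pick the largest gain per unit cost (cost-benefit rule).
    """
    chosen, spent = [], 0
    current = reward(chosen)
    remaining = set(nodes)
    while True:
        best, best_score, best_gain = None, 0.0, 0
        for u in sorted(remaining):           # sorted for deterministic ties
            if spent + cost[u] >= budget:     # keep cost(S) strictly below B
                continue
            gain = reward(chosen + [u]) - current
            score = gain / cost[u] if use_ratio else gain
            if score > best_score:
                best, best_score, best_gain = u, score, gain
        if best is None:                      # nothing affordable adds reward
            return chosen, current
        chosen.append(best)
        remaining.remove(best)
        spent += cost[best]
        current += best_gain

def best_of_both(nodes, cost, budget, reward):
    """Run both rules and keep the higher-reward solution (assumed combination)."""
    unit = greedy(nodes, cost, budget, reward, use_ratio=False)
    ratio = greedy(nodes, cost, budget, reward, use_ratio=True)
    return max(unit, ratio, key=lambda result: result[1])

# Hypothetical blogs: cascades each one participates in, and cost = number of posts.
coverage = {"big": {1, 2, 3, 4}, "s1": {1, 2}, "s2": {3, 4}, "s3": {5}}
cost = {"big": 10, "s1": 2, "s2": 2, "s3": 2}
budget = 11

def reward(chosen):
    """Number of distinct cascades covered by the chosen blogs."""
    return len(set().union(*(coverage[u] for u in chosen)))

blogs = list(coverage)
print("unit cost:   ", greedy(blogs, cost, budget, reward, use_ratio=False))
print("cost-benefit:", greedy(blogs, cost, budget, reward, use_ratio=True))
print("kept:        ", best_of_both(blogs, cost, budget, reward))
```

In this toy instance the unit-cost rule spends almost the whole budget on the one large blog, while the cost-benefit rule collects several cheap blogs. Neither rule is better on every input, which is why the sketch keeps both results and compares them.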
[Leskovec et al., KDD '07]
• CELF runs 700 times faster than the simple hill-climbing algorithm

• Outbreak detection problem in networks
• Different ways to formalize objective functions
  - All are submodular
• Lazy-Greedy algorithm for optimizing submodular functions
• CELF algorithm that combines versions of Lazy-Greedy
• Data-dependent bound on the solution quality
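
The summary names Lazy-Greedy and a data-dependent bound without spelling them out here, so the following is only a sketch of one common formulation. It assumes a monotone submodular reward, unit costs, and a limit of k nodes. Lazy evaluation relies on the fact that marginal gains of a submodular function can only shrink as the solution grows, so stale gains in a priority queue remain valid upper bounds. The bound function uses the standard inequality f(OPT) ≤ f(S) + (sum of the k largest marginal gains with respect to S), which is assumed to be the kind of bound the slides refer to. All names and the toy coverage data are hypothetical.

```python
import heapq

def lazy_greedy(nodes, reward, k):
    """Pick up to k nodes, re-evaluating a marginal gain only when it is stale."""
    chosen = []
    current = reward(chosen)
    # Max-heap via negated gains; each entry is (-gain, node, |S| when computed).
    heap = [(-(reward([u]) - current), u, 0) for u in nodes]
    heapq.heapify(heap)
    evaluations = len(heap)
    while heap and len(chosen) < k:
        neg_gain, u, stamp = heapq.heappop(heap)
        if stamp == len(chosen):
            # Gain is fresh for the current S; stale entries are upper bounds,
            # so nothing left in the heap can beat it.
            chosen.append(u)
            current += -neg_gain
        else:
            gain = reward(chosen + [u]) - current
            evaluations += 1
            heapq.heappush(heap, (-gain, u, len(chosen)))
    return chosen, current, evaluations

def data_dependent_bound(nodes, reward, chosen, k):
    """Upper bound on the best k-node reward, computed from the solution itself."""
    base = reward(chosen)
    gains = sorted(
        (reward(chosen + [u]) - base for u in nodes if u not in chosen),
        reverse=True,
    )
    return base + sum(gains[:k])

# Hypothetical coverage data: node -> set of outbreaks it detects.
coverage = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5}, "d": {1}}

def reward(chosen):
    """Number of distinct outbreaks detected by the chosen nodes."""
    return len(set().union(*(coverage[u] for u in chosen)))

nodes = list(coverage)
solution, value, evaluations = lazy_greedy(nodes, reward, k=2)
bound = data_dependent_bound(nodes, reward, solution, k=2)
print(solution, value, evaluations, bound)
```

When the bound equals the value of the greedy solution, as it can on small instances like this one, the solution is provably optimal for that k. On real data the gap between the two numbers shows how far from optimal the solution could be at worst.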
