Introduction to Graph Analytics
CS194-16 Introduction to Data Science
Joseph E Gonzalez
Post-doc, AMPLab
jegonzal@cs.berkeley.edu
*These slides are best viewed in PowerPoint with animations.

Outline
• Graph structured data
• Common properties of graph data
• Graph algorithms
• Systems for large-scale graph computation
• GraphX: Graph Computation in Spark

Graph structured data is everywhere …

Social Network
Vertices
• Users
• Posts / Images
Edges
• Social Relationships
  » Directed: Twitter
  » Undirected: Facebook
• Likes

Actual Social Graph
[Figure: Karate Club Network. Caption, originally labeled Figure 1.7: "From the social network of friendships in the karate club from Figure 1.1, we can find clues to the latent schism that eventually split the group into two separate clubs."]

Web Graphs
• Vertices: Web-pages
• Edges: Links (Directed)
• Generated Content: Click-streams
[Figure: Wikipedia restricted to 1000 climate change pages]
[Figure: 2004 Political Blogs]

Semantic Networks
Organize Knowledge
• Vertices: Subject, Object
• Edges: Predicates
Example: Google Knowledge Graph
• 570M Vertices
• 18B Edges
http://wiki.dbpedia.org

Transaction Networks
Supply Chain:
• Vertices: Suppliers/Consumers
• Edges: Exchange of Goods
Transaction Networks (e.g., Bitcoin):
• Vertices: Users
• Edges: Exchange of Currency
http://anonymity-in-bitcoin.blogspot.com/2011/07/bitcoin-is-not-

Incremental Updates for Iterative mrTriplets
[Figure: Vertex Table (RDD) with vertices A to F, some marked "Change"; Edge Table (RDD) partitions, each with a Mirror Cache; changed vertices are pushed to the mirror caches before the edge Scan.]

Aggregation for Iterative mrTriplets
[Figure: the same Vertex Table (RDD) and Edge Table (RDD); each edge partition scans its edges, computes a Local Aggregate next to its Mirror Cache, and sends the changes back to the Vertex Table.]

Performance Comparisons
Live-Journal: 69 Million Edges
Runtime (in seconds, PageRank for 10 iterations):
    Mahout/Hadoop   1340
    Naïve Spark      354
    Giraph           207
    GraphX            68
    GraphLab          22
GraphX is roughly 3x slower than GraphLab

GraphX scales to larger graphs
Twitter Graph: 1.5 Billion Edges
Runtime (in seconds, PageRank for 10 iterations):
    Giraph           749
    GraphX           451
    GraphLab         203
GraphX is roughly 2x slower than GraphLab
» Scala + Java overhead: Lambdas, GC time, …
» No shared memory parallelism: 2x increase in communication

PageRank is just one stage… What about a pipeline?

A Small Pipeline in GraphX
Raw Wikipedia (XML) → Hyperlinks → PageRank → Top 20 Pages
Stage labels: HDFS, Spark Preprocess, HDFS, Compute, Spark Post
[Bar chart: Total Runtime (in Seconds) for Spark, Giraph + Spark (605), GraphX, and GraphLab + Spark. Other values shown on the chart: 1492, 342, 275, 375.]
Timed end-to-end, GraphX is faster than …

Open Source Project
Alpha release since Spark 0.9
Contributors? Python Bindings?

Graph Processing Systems
• Apache Giraph: Java Pregel implementation
• GraphLab.org: C++ GraphLab implementation
• NetworkX: Python API for small graphs
• GraphLab Create: commercial GraphLab Python framework for large graphs and ML

Graph Database Technologies
Property graph data-model for storing and retrieving graph structured data
• Neo4j: popular commercial graph database
• Titan: open-source distributed graph database

Break!
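The benchmarks above all time PageRank for 10 iterations. As a rough illustration of what one such run computes, here is a minimal single-machine sketch in plain Scala collections. It does not use GraphX or any of the benchmarked systems. The toy edge list, the object name `PageRankSketch`, the damping factor of 0.85, and the unnormalized update rule (0.15 + 0.85 × sum of incoming contributions) are assumptions chosen for the example. Each iteration computes one message per edge and then sums the messages per destination vertex, which loosely resembles the map-then-aggregate shape suggested by the mrTriplets slides.

```scala
object PageRankSketch {
  // Hypothetical toy graph: directed edges as (source, destination) pairs.
  val edges: Seq[(String, String)] = Seq(
    ("A", "B"), ("A", "C"), ("B", "C"), ("C", "A"), ("D", "C")
  )

  def pageRank(edges: Seq[(String, String)],
               iterations: Int,
               damping: Double = 0.85): Map[String, Double] = {
    val vertices = edges.flatMap { case (src, dst) => Seq(src, dst) }.distinct
    // Number of outgoing edges per source vertex.
    val outDegree: Map[String, Int] =
      edges.groupBy(_._1).map { case (v, es) => (v, es.size) }
    // Every vertex starts with rank 1.0.
    var ranks: Map[String, Double] = vertices.map(v => (v, 1.0)).toMap

    for (_ <- 1 to iterations) {
      // "Map" step: each edge sends rank(src) / outDegree(src) to its destination.
      val messages = edges.map { case (src, dst) =>
        (dst, ranks(src) / outDegree(src))
      }
      // "Reduce" step: sum the messages arriving at each destination.
      val incoming: Map[String, Double] =
        messages.groupBy(_._1).map { case (v, ms) => (v, ms.map(_._2).sum) }
      // Vertices with no incoming edges keep only the (1 - damping) term.
      ranks = vertices.map { v =>
        (v, (1 - damping) + damping * incoming.getOrElse(v, 0.0))
      }.toMap
    }
    ranks
  }
}
```

Calling `PageRankSketch.pageRank(PageRankSketch.edges, iterations = 10)` returns one rank per vertex. Because every vertex in this toy graph has at least one outgoing edge, the ranks keep summing to the number of vertices. A graph with dangling vertices would need extra handling that this sketch leaves out.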
http://tinyurl.com/ampgraphx
jegonzal@eecs.berkeley.edu

About Scala
High-level language for the Java VM
» Object-oriented + functional programming
Statically typed
» Comparable in speed to Java
» But often no need to write types due to type inference
Interoperates with Java
» Can use any Java class, inherit from it, etc.; can also call Scala code from Java

Quick Tour
Declaring variables:
    var x: Int = 7
    var x = 7        // type inferred
    val y = "hi"     // read-only
Java equivalent:
    int x = 7;
    final String y = "hi";

Functions:
    def square(x: Int): Int = x*x
    def min(a: Int, b: Int): Int = {
      if (a < b) a else b
    }
    def announce(text: String): Unit = {
      println(text)
    }
Java equivalent:
    int square(int x) {
      return x*x;
    }
    void announce(String text) {
      System.out.println(text);
    }

Quick Tour
Generic types:
    var arr = new Array[Int](8)
    var lst = List(1, 2, 3)   // type of lst is List[Int]
Java equivalent:
    int[] arr = new int[8];
    List<Integer> lst = new ArrayList<Integer>();
    lst.add(1); lst.add(2); lst.add(3);

Indexing:
    arr(5) = 7
    println(lst(2))
Java equivalent:
    arr[5] = 7;
    System.out.println(lst.get(2));

Quick Tour
Processing collections with functional programming:
    val list = List(1, 2, 3)
    list.foreach(x => println(x))   // prints 1, 2, 3; "x => println(x)" is a function expression (closure)
    list.foreach(println)           // same
    list.map(x => x + 2)            // => List(3, 4, 5)
    list.map(_ + 2)                 // same, with placeholder notation
    list.filter(x => x % 2 == 1)    // => List(1, 3)
    list.filter(_ % 2 == 1)         // => List(1, 3)
    list.reduce((x, y) => x + y)    // => 6
    list.reduce(_ + _)              // => 6
All of these leave the list unchanged (List is immutable)

Other Collection Methods
Scala collections provide many other functional methods; for example, Google for "Scala Seq"

    Method on Seq[T]                        Explanation
    map(f: T => U): Seq[U]                  Pass each element through f
    flatMap(f: T => Seq[U]): Seq[U]         One-to-many map
    filter(f: T => Boolean): Seq[T]         Keep elements passing f
    exists(f: T => Boolean): Boolean        True if at least one element passes
    forall(f: T => Boolean): Boolean        True if all elements pass
    reduce(f: (T, T) => T): T               Merge elements using f
    groupBy(f: T => K): Map[K, Seq[T]]      Group elements by f(element)
    sortBy(f: T => K): Seq[T]               Sort elements by f(element)
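
To make the method table concrete, the following short sketch applies each listed method to one small sequence. The object name `SeqMethodsSketch` and the sample words are made up for the example. Everything else is standard Scala collections API.

```scala
object SeqMethodsSketch {
  // Hypothetical sample data.
  val words: Seq[String] = Seq("graph", "edge", "vertex", "rank")

  // map: pass each element through a function.
  val lengths: Seq[Int] = words.map(_.length)                       // Seq(5, 4, 6, 4)

  // flatMap: each element yields several outputs, flattened into one Seq.
  val firstAndLast: Seq[Char] = words.flatMap(w => Seq(w.head, w.last))

  // filter: keep only the elements that pass the predicate.
  val longWords: Seq[String] = words.filter(_.length > 4)           // Seq("graph", "vertex")

  // exists / forall: does at least one element pass, or do all of them pass?
  val anyStartsWithV: Boolean = words.exists(_.startsWith("v"))     // true
  val allLowercase: Boolean = words.forall(w => w == w.toLowerCase) // true

  // reduce: merge all elements pairwise.
  val totalLength: Int = lengths.reduce(_ + _)                      // 19

  // groupBy: build a Map from key to the elements that share that key.
  val byLength: Map[Int, Seq[String]] = words.groupBy(_.length)

  // sortBy: sort by a derived key; the sort is stable, so ties keep their order.
  val sortedByLength: Seq[String] = words.sortBy(_.length)          // Seq("edge", "rank", "graph", "vertex")
}
```

None of these calls modify `words`. Each returns a new value, which matches the note above that immutable collections are left unchanged.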