
Spark English lesson (slides)


DOCUMENT INFORMATION

Pages: 42
Size: 1.13 MB

Content

Spark: Fast, Interactive, Language-Integrated Cluster Computing
Wen Zhiguang (wzhg0508@163.com), 2012.11.20

Project Goals
Extend the MapReduce model to better support two common classes of analytics apps:
>> Iterative algorithms (machine learning, graphs)
>> Interactive data mining
Enhance programmability:
>> Integrate into the Scala programming language
>> Allow interactive use from the Scala interpreter

Background
Most current cluster programming models are based on directed acyclic data flow from stable storage to stable storage.
Benefits of data flow: the runtime can decide where to run tasks and can automatically recover from failures.

Acyclic data flow is inefficient for applications that repeatedly reuse a working set of data:
>> Iterative algorithms (machine learning, graphs)
>> Interactive data mining tools (R, Excel, Python)
With current frameworks, apps reload data from stable storage on each query.

Solution: Resilient Distributed Datasets (RDDs)
Allow apps to keep working sets in memory for efficient reuse.
Retain the attractive properties of MapReduce:
>> Fault tolerance, data locality, scalability
Support a wide range of applications.

Outline
• Introduction to Scala & functional programming
• What is Spark
• Resilient Distributed Datasets (RDDs)
• Implementation
• Demo
• Conclusion

About Scala
High-level language for the JVM:
>> Object-oriented + functional programming (FP)
Statically typed:
>> Comparable in speed to Java
>> No need to write out types, thanks to type inference
Interoperates with Java:
>> Can use any Java class, inherit from it, etc.
>> Can also call Scala code from Java

Quick Tour
All of these leave the list unchanged (List is immutable).

[...] be used in parallel operations.

Spark framework
>> Spark + Pregel
>> Spark + Hive

Run Spark
Spark runs as a library in your program (one instance per app). Runs tasks locally or on Mesos:
>> new SparkContext(masterUrl, jobname, [sparkhome], [jars])
>> MASTER=local[n] ./spark-shell
>> MASTER=HOST:PORT ./spark-shell

Outline
• Introduction to Scala & functional programming
• What is Spark
• Resilient Distributed Datasets (RDDs)
• Implementation
• Demo
• Conclusion

Spark Overview
Goal: work with distributed collections as you would with local ones.
Concept: resilient distributed datasets (RDDs)
>> Immutable collections of partitioned records
>> Can only be created by: (1) data in stable storage, (2) other RDDs
>> An RDD has enough information about how it was derived from other datasets (its lineage)
>> 1) Persistence (in RAM, for reuse); 2) Partitioning (hash, range, [])

RDD Types: parallelized collections
Created by calling SparkContext's parallelize method on an existing Scala collection (a Seq object).
Once created, the distributed dataset can be operated on in parallel.

RDD Types: Hadoop datasets
Spark supports text files, SequenceFiles, and any other Hadoop InputFormat.
Paths can be local, or hdfs://, s3n://, kfs://, ...
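To make the RDD slides concrete, here is a minimal sketch in the 2012-era API the deck uses (package spark, SparkContext constructor as on the Run Spark slide). It is not taken from the deck: the object name, the sample numbers, and the input file data.txt are invented for illustration.

    import spark.SparkContext
    import spark.SparkContext._  // implicit conversions enabling extra RDD operations

    object RDDExample {
      def main(args: Array[String]) {
        // One SparkContext per application; "local[2]" runs tasks locally
        // with two threads (a Mesos HOST:PORT URL would also work).
        val sc = new SparkContext("local[2]", "RDDExample")

        // Parallelized collection: distribute an existing Scala Seq.
        val nums = sc.parallelize(Seq(1, 2, 3, 4, 5))
        val squares = nums.map(x => x * x)  // transformation: defines a new RDD
        println(squares.reduce(_ + _))      // action: triggers the parallel computation

        // Hadoop dataset: a local path or an hdfs:// / s3n:// / kfs:// URI.
        val lines = sc.textFile("data.txt") // hypothetical input file
        println(lines.count())
      }
    }

The same operations can be entered interactively at the ./spark-shell prompt shown on the Run Spark slide.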
RDD Dependencies
Each box is an RDD, with partitions shown as shaded rectangles. (The dependency figure itself is not included in this preview.)

Outline
• Introduction to Scala & functional programming
• What is Spark
• Resilient Distributed Datasets (RDDs)
• Implementation
• Demo
• Conclusion

Implementation
Spark is implemented in about 14,000 lines of Scala.
Sketch of three of the technical parts of the system:
>> Job Scheduler
>> Fault Tolerance
>> Memory Management

Memory Management
Spark provides three options for persisting RDDs (a code sketch follows the final outline below):
(1) in-memory storage as deserialized Java objects
>> fastest; the JVM can access the RDD natively
(2) in-memory storage as serialized data
>> space-limited; choose [...]

[...] transparency)

Behavior if not enough RAM
Similar to existing data flow systems: poor performance (swapping?).

Outline
• Introduction to Scala & functional programming
• What is Spark
• Resilient Distributed Datasets (RDDs)
• Main technical parts of Spark
• Demo
• Conclusion
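The persistence options above can also be sketched in code. This is an assumption-laden example, not the deck's: the storage-level names follow spark.storage.StorageLevel as exposed in Spark releases of roughly this period (early versions named the levels slightly differently), and cache() is shorthand for the default in-memory, deserialized level.

    import spark.SparkContext
    import spark.storage.StorageLevel

    object PersistExample {
      def main(args: Array[String]) {
        val sc = new SparkContext("local", "PersistExample")
        val data = sc.parallelize(1 to 1000000)

        // Option (1): deserialized Java objects in RAM; fastest, since the
        // JVM accesses the cached records natively. Same as data.cache().
        val fast = data.persist(StorageLevel.MEMORY_ONLY)

        // Option (2): serialized bytes in RAM; more compact, but pays a
        // deserialization cost on each access.
        val compact = data.map(_ * 2).persist(StorageLevel.MEMORY_ONLY_SER)

        // The first action computes and caches each RDD; later actions reuse
        // the cached partitions instead of recomputing them from lineage.
        println(fast.reduce(_ + _))
        println(fast.count())     // served from cache
        println(compact.count())
      }
    }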

Posted: 24/10/2014, 13:47
