
SPLASH 2013: Indrajit Roy, R for Big Data (Presto: Big Data Analysis Beyond Hadoop)


R for Big Data
Indrajit Roy, HP Labs
October 2013
Team: Shivaram, Erik, Kyungyong, Alvin, Rob, Vanish
© Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.

A tale of three researchers
(Systems + PL) talk about data mining problems!
Systems, Data science, Programming languages

A Big Data story
Once upon a time, a customer in distress had…
… 2+ billion rows of financial data (TBs of data)
… wanted to model defaults on mortgage and credit cards
… by running regression analysis
Alas!
… traditional databases don't support regression analysis
… custom code can take from hours to days
Moral of the story: Customers need platform + programming model for complex analysis

Big Data has many facets
• Volume: >7 TB/day
• Variety: >1B user graph, >40B photos
• Velocity: >1M customer transactions/hr

Just "Big" is not the issue
• Volume: storage is not a problem; petabytes can be handled by DBs
• Volume + Variety: complex analytics at scale
• Volume + Velocity: event processing at scale (not today's talk)

Big data, complex algorithms
• PageRank (dominant eigenvector)
• Recommendations (matrix factorization)
• Anomaly detection (top-K eigenvalues)
• User importance (vertex centrality)
Machine learning + graph algorithms → iterative linear algebra operations

Example: PageRank using matrices
Simplified algorithm: repeat { p = M*p }
Linear algebra operations on sparse matrices
Power method → dominant eigenvector
M = web graph matrix
p = PageRank vector

Towards Distributed R
[Figure: Variety (efficiency) plotted against Volume (scalability). Labels: R/Matlab; RDBMS (col store); Hadoop; machine learning, images, graphs; SQL, database; search, sort; Scale + Complex Analytics.]
* very simplified view

Large scale analytics frameworks
• Data-parallel frameworks: MapReduce/Dryad (2004)
  Process each record in parallel
  Use case: computing sufficient statistics, analytics queries
• Graph-centric frameworks: Pregel/GraphLab (2010)
  Process each vertex in parallel
  Use case: graphical models
• Array-based frameworks: our approach*
  Process blocks of array in parallel
  Use case: linear algebra operations
* Presto: Distributed Machine Learning and Graph Processing with Sparse Matrices. S. Venkataraman, E. Bodzsar, I. Roy, A. AuYoung, R. Schreiber. EuroSys 2013.

Enter the world of R

Distributed PageRank
[Figure: M (N x N) split into row blocks of s rows; P, P_old, and Z split into partitions P1, P2, …, PN/s.]

    M <- darray(dim=c(N,N), blocks=(s,N))
    P <- darray(dim=c(N,1), blocks=(s,1))
    while (...) {
      foreach(i, 1:len,
        function(p=splits(P,i), m=splits(M,i),
                 x=splits(P_old), z=splits(Z,i)) {
          p <- (m*x) + z
          update(p)
        })
      P_old <- P
    }

foreach: execute function in a cluster
splits: pass array partitions

Distributed R for Big Data
• Challenges in scaling R
• Programming model
• Mechanisms
• Applications and results
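The row-block update on the Distributed PageRank slide (each task computes p = (m*x) + z from its own block of M and the whole previous vector) can be sketched in a single process. This is a plain Python illustration, not the Distributed R API. Several things here are assumptions that the slides do not state: the damping factor 0.85 folded into M, Z taken as the constant teleport vector (1 - d)/N, the fixed iteration count, the toy four-page graph, and every function name.

```python
# Single-process sketch of the row-block PageRank update p_i = m_i * x + z_i.
# Assumptions (not from the slides): damping d = 0.85 is folded into M,
# z is the constant teleport vector (1 - d) / N, and the graph is a toy example.

def build_matrix(links, n, d=0.85):
    """Column-stochastic link matrix scaled by d: M[dst][src] = d / outdeg(src)."""
    m = [[0.0] * n for _ in range(n)]
    for src, targets in links.items():
        for dst in targets:
            m[dst][src] = d / len(targets)
    return m

def split_rows(items, s):
    """Partition a list into consecutive blocks of s entries (row blocks)."""
    return [items[i:i + s] for i in range(0, len(items), s)]

def update_block(m_block, x, z_block):
    """One task: p_i = m_i * x + z_i for a single row block."""
    return [sum(a * b for a, b in zip(row, x)) + z
            for row, z in zip(m_block, z_block)]

def pagerank(links, n, s=2, d=0.85, iterations=50):
    m_blocks = split_rows(build_matrix(links, n, d), s)
    z_blocks = split_rows([(1.0 - d) / n] * n, s)
    p_old = [1.0 / n] * n
    for _ in range(iterations):
        # Block updates are independent of each other, so they could run in parallel.
        p_blocks = [update_block(m, p_old, z) for m, z in zip(m_blocks, z_blocks)]
        p_old = [v for block in p_blocks for v in block]  # re-assemble the vector
    return p_old

# Toy web graph: page -> pages it links to (every page has at least one out-link).
LINKS = {0: [1, 2], 1: [2], 2: [0], 3: [2]}
ranks = pagerank(LINKS, n=4)
```

Each call to `update_block` needs only its own rows of M and Z plus the full previous vector, which is why the slide passes `splits(P_old)` whole while M, P, and Z are passed per partition.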
Architecture
• Scheduler: performs I/O and task scheduling
• Worker: executes tasks and I/O operations
[Figure: a Master running the Scheduler, and two Workers, each with an I/O Engine, an Executor pool of R instances, DRAM, SSDs, and HDDs.]

Locality based computation: Part
    foreach(i, 1:4, function(p=splits(P,i)) {…})
Ship functions to data
[Figure: four tasks running over partitions P1 to P4 and M1 to M4.]

Locality based computation: Part
    foreach(i, 1:1, function(p=splits(P)) {…})
Re-assemble data, run task
[Figure: one task over the re-assembled partitions P1 to P4; M1 to M4 also shown.]

Efficiently sharing data
Goal: zero-copy sharing across cores
Immutable partitions → safe sharing
Versioned distributed arrays

Data sharing challenges
Problems with data sharing
• Garbage collection
• Header conflicts
[Figure: two R instances sharing one R object; the shared object header is corrupted while the data part is shared.]
Solution
• Override R's allocator
• Process-local header, mmap data part
[Figure: local R object header on one page, R object data part on the following pages, separated at page boundaries.]

Distributed R for Big Data
• Challenges in scaling R
• Programming model
• Mechanisms
• Applications and results

Applications in Distributed R
Application              Algorithm                 LOC
PageRank                 Eigenvector calculation    41
Triangle counting        Top-K eigenvalues         121
Netflix recommendation   Matrix factorization      130
Centrality measure       Graph algorithm           132
SSSP                     Graph algorithm            62
k-means                  Clustering                 71
Logistic regression      Data mining               120
Fewer than 140 lines of code
* LOC for core of the applications

Distributed R has good performance
Algorithm: PageRank (power method)
Dataset: ClueWeb graph, 100M vertices, 1.2B edges, 20GB
Setup: SL 390 servers, cores/server, 96GB RAM
* Shorter is better

Back to the Big Data story
Scalable and high performance
• Regression on multi-billion rows in minutes
• Graph algorithms on billions of vertices and edges in minutes
Ease of programming?
• Programming model familiar from R, easy for data scientists
• Distributed algorithms in hundreds of lines: clustering, classification, regression, graph algorithms, …

Related work
Non-R systems
• MapReduce, Spark (UC Berkeley), Piccolo (NYU), MadLINQ (Microsoft)
• Pregel (Google), GraphLab (CMU)
• Star-P (MIT), Julia
R extensions
• Multi-core packages such as doMC, snow
• Rmr: interface to Hadoop
• Bigmemory: mmap arrays

Summary
• Big data requires complex analysis: machine learning, graph processing, etc.
• Matrices and arrays are surprisingly handy data structures
• Distributed R: simplifies distributed analytics
• Still, much remains: formalism, single node performance, compiler improvements, package contributions, …

Thank you
HPL Team: Alvin AuYoung, Rob Schreiber, Vanish Talwar
Interns: Shivaram Venkataraman (UC Berkeley), Erik Bodzsar (U Chicago), Kyungyong Lee (UFL)
HP Vertica Developers
Collaborators: Prof. Andrew Chien (U Chicago), Prof. Renato Figueiredo (UFL)
http://www.hpl.hp.com/research/distributedr.htm
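As a supplementary note on the data-sharing mechanism from the Mechanisms slides (process-local header, mmap'd data part, immutable partitions), the idea can be illustrated with Python's standard `mmap` and `memoryview`. This is only an analogy under stated assumptions: it runs in one process rather than across R instances, and the `LocalObject` class and its header fields are hypothetical stand-ins, not R internals.

```python
import mmap
import struct

N = 4                        # number of float64 values in the partition
ITEM = struct.calcsize("d")  # size of one double in bytes

# Data part: one anonymous memory mapping, filled once and then treated as immutable.
data_part = mmap.mmap(-1, N * ITEM)
struct.pack_into(f"{N}d", data_part, 0, 1.0, 2.0, 3.0, 4.0)

class LocalObject:
    """Hypothetical stand-in for one instance's view of a shared partition."""

    def __init__(self, shared):
        # The header is private to this instance, so bookkeeping cannot collide.
        self.header = {"gc_mark": False, "refs": 1}
        # Read-only view over the same bytes, reinterpreted as doubles (no copy).
        self.values = memoryview(shared).cast("d").toreadonly()

a = LocalObject(data_part)
b = LocalObject(data_part)

a.header["gc_mark"] = True   # only a's header changes
total = sum(a.values) + sum(b.values)

try:
    b.values[0] = 99.0       # immutable partition: the write is rejected
    write_rejected = False
except TypeError:
    write_rejected = True
```

Keeping the mutable bookkeeping outside the shared bytes is what lets several readers use the same data part safely; making the shared view read-only mirrors the "immutable partitions" point on the slide.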

Date posted: 29/08/2022, 22:37
