
Presto: Big Data Analysis Beyond Hadoop (SPLASH 2013, Indrajit Roy, R for Big Data)




R for Big Data
Indrajit Roy, HP Labs
October 2013
Team: Shivaram, Erik, Kyungyong, Alvin, Rob, Vanish
© Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.

A tale of three
Researchers (Systems + PL) talk about data mining problems!
• Systems
• Data science
• Programming languages

A Big Data story
Once upon a time, a customer in distress had…
• … 2+ billion rows of financial data (TBs of data)
• … wanted to model defaults on mortgages and credit cards
• … by running regression analysis
Alas!
• … traditional databases don't support regression analysis
• … custom code can take from hours to days
Moral of the story: customers need a platform + programming model for complex analysis

Big Data has many facets
• Volume: >7 TB/day
• Variety: >1B user graph, >40B photos
• Velocity: >1M customer transactions/hr

Just "Big" is not the issue
• Volume: storage is not a problem; petabytes can be handled by DBs
• Volume + Variety: complex analytics at scale
• Volume + Velocity: event processing at scale (not today's talk)

Big data, complex algorithms
• PageRank (dominant eigenvector)
• Recommendations (matrix factorization)
• Anomaly detection (top-K eigenvalues)
• User importance (vertex centrality)
Machine learning + graph algorithms: iterative linear algebra operations

Example: PageRank using matrices
Simplified algorithm: repeat { p = M*p }
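The simplified algorithm above is the power method: multiplying repeatedly by M drives p toward the dominant eigenvector. A minimal sketch in plain Python (not Distributed R) follows. The sparse dictionary layout, the function name `power_iteration`, the renormalization step, and the three-page web graph are all assumptions made for illustration; none of them come from the slides.

```python
def power_iteration(M, p, iterations=50):
    """Repeat p = M*p, renormalizing so the entries of p sum to 1.

    M is a sparse matrix stored as {row: {column: value}}; p is a list.
    """
    n = len(p)
    for _ in range(iterations):
        new_p = [0.0] * n
        for row, cols in M.items():
            new_p[row] = sum(value * p[col] for col, value in cols.items())
        total = sum(new_p)
        p = [x / total for x in new_p]
    return p


# Hypothetical 3-page web graph: page 0 links to pages 1 and 2,
# page 1 links to page 2, page 2 links to page 0.
# M[i][j] = 1 / outdegree(j) if page j links to page i.
M = {
    0: {2: 1.0},
    1: {0: 0.5},
    2: {0: 0.5, 1: 1.0},
}
p = power_iteration(M, [1 / 3, 1 / 3, 1 / 3])
```

For this toy graph the iteration settles near (0.4, 0.2, 0.4): pages 0 and 2 rank equally and page 1 gets half as much.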
Linear algebra operations on sparse matrices
Power method: dominant eigenvector
M = web graph matrix
p = PageRank vector

Towards Distributed R
[Chart, very simplified view. Horizontal axis: Volume (Scalability). Vertical axis: Variety (Efficiency). Regions: R/Matlab (machine learning, images, graphs); RDBMS (col store) (SQL, database); Hadoop (search, sort). Target region: Scale + Complex Analytics]

Large scale analytics frameworks
• Data-parallel frameworks: MapReduce/Dryad (2004). Process each record in parallel. Use case: computing sufficient statistics, analytics queries
• Graph-centric frameworks: Pregel/GraphLab (2010). Process each vertex in parallel. Use case: graphical models
• Array-based frameworks (our approach*). Process blocks of array in parallel. Use case: linear algebra operations
*Presto: Distributed Machine Learning and Graph Processing with Sparse Matrices. S. Venkataraman, E. Bodzsar, I. Roy, A. AuYoung, R. Schreiber. EuroSys 2013.

Enter the world of R

Distributed PageRank
[Diagram: arrays M, P, P_old, and Z of size N, partitioned into blocks of s rows labeled P1, P2, …, PN/s]

    M <- darray(dim=c(N,N), blocks=c(s,N))
    P <- darray(dim=c(N,1), blocks=c(s,1))
    while (…) {
      # Execute function in a cluster; pass array partitions
      foreach(i, 1:len,
        function(p=splits(P,i), m=splits(M,i),
                 x=splits(P_old), z=splits(Z,i)) {
          p <- (m %*% x) + z
          update(p)
        })
      P_old <- P
    }

Distributed R for Big Data
• Challenges in scaling R
• Programming model
• Mechanisms
• Applications and results
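The Distributed PageRank slide splits M, P, and Z into blocks of s rows and has each task compute p = (m*x) + z from one block of M and Z plus the whole of P_old. The sketch below imitates that partitioning in plain Python on a single machine, with a list comprehension standing in for foreach. It is not the Distributed R API: `split_rows`, `block_task`, `blocked_pagerank`, the damping factor 0.85 folded into M, and the four-page graph are assumptions made for illustration.

```python
def split_rows(rows, s):
    """Split a list into consecutive blocks of at most s items."""
    return [rows[i:i + s] for i in range(0, len(rows), s)]


def block_task(m_block, x, z_block):
    """One task: p_i = (m_i * x) + z_i for a single block of rows.

    Each row of m_block is a sparse row stored as {column: value}.
    """
    return [
        sum(value * x[col] for col, value in row.items()) + z
        for row, z in zip(m_block, z_block)
    ]


def blocked_pagerank(m_rows, z, s, iterations=100):
    """Iterate P = M * P_old + Z with M and Z split into blocks of s rows."""
    n = len(m_rows)
    m_blocks = split_rows(m_rows, s)  # plays the role of splits(M, i)
    z_blocks = split_rows(z, s)       # plays the role of splits(Z, i)
    p_old = [1.0 / n] * n
    for _ in range(iterations):
        # Each task reads one block of M and Z but all of P_old.
        p_blocks = [block_task(m, p_old, zb)
                    for m, zb in zip(m_blocks, z_blocks)]
        p_old = [v for block in p_blocks for v in block]  # P_old <- P
    return p_old


D = 0.85  # assumed damping factor, already folded into M
# Hypothetical graph: 0 -> 1, 2; 1 -> 2; 2 -> 0; 3 -> 0, 2.
M_ROWS = [
    {2: D, 3: D / 2},
    {0: D / 2},
    {0: D / 2, 1: D, 3: D / 2},
    {},
]
Z = [(1 - D) / 4] * 4

p_two_blocks = blocked_pagerank(M_ROWS, Z, s=2)
p_one_block = blocked_pagerank(M_ROWS, Z, s=4)
```

Because every row is computed from the same inputs regardless of which block it lands in, the block size changes only how the work is divided, not the result.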
Architecture
• Scheduler: performs I/O and task scheduling
• Worker: executes tasks and I/O operations
[Diagram: a master running the scheduler; two workers, each with an I/O engine, an executor pool of R instances, DRAM, SSDs, and HDDs]

Locality based computation: Part
    foreach(i, 1:4, function(p=splits(P,i)) {…})
Ship functions to data
[Diagram: four tasks placed next to partitions P1 to P4 and M1 to M4]

Locality based computation: Part
    foreach(i, 1:1, function(p=splits(P)) {…})
Run task, re-assemble data
[Diagram: partitions P1 to P4 and M1 to M4]

Efficiently sharing data
Goal: zero-copy sharing across cores
• Immutable partitions → safe sharing
• Versioned distributed arrays

Data sharing challenges
Problems with data sharing
• Garbage collection
• Header conflicts
[Diagram: two R instances pointing at one R object; the shared object header is corrupted, the data part is shared]
Solution
• Override R's allocator
• Process-local header, mmap data part
[Diagram: local R object header and R object data part, separated at page boundaries]

Distributed R for Big Data
• Challenges in scaling R
• Programming model
• Mechanisms
• Applications and results

Applications in Distributed R
Application             Algorithm                LOC
PageRank                Eigenvector calculation  41
Triangle counting       Top-K eigenvalues        121
Netflix recommendation  Matrix factorization     130
Centrality measure      Graph algorithm          132
SSSP                    Graph algorithm          62
k-means                 Clustering               71
Logistic regression     Data mining              120
Fewer than 140 lines of code
*LOC for core of the applications

Distributed R has good performance
Algorithm: PageRank (power method)
Dataset: ClueWeb graph, 100M vertices, 1.2B edges, 20GB
Setup: SL 390 servers, cores/server, 96GB RAM
[Chart: *shorter is better]

Back to the Big Data story
Scalable and high performance
• Regression on multi-billion rows in minutes
• Graph algorithms on billions of vertices and edges in minutes
Ease of programming?
• Familiar model as R, easy for data scientists
• Distributed algorithms in hundreds of lines: clustering, classification, regression, graph algorithms, …

Related work
Non-R systems
• MapReduce, Spark (UC Berkeley), Piccolo (NYU), MadLINQ (Microsoft)
• Pregel (Google), GraphLab (CMU)
• Star-P (MIT), Julia
R extensions
• Multi-core packages such as doMC, snow
• Rmr: interface to Hadoop
• Bigmemory: mmap arrays

Summary
• Big data requires complex analysis: machine learning, graph processing, etc.
• Matrices and arrays are surprisingly handy data structures
• Distributed R: simplifies distributed analytics
• Still, much remains: formalism, single node performance, compiler improvements, package contributions, …

Thank you
HPL Team: Alvin AuYoung, Rob Schreiber, Vanish Talwar
Interns: Shivaram Venkataraman (UC Berkeley), Erik Bodzsar (U Chicago), Kyungyong Lee (UFL)
HP Vertica Developers
Collaborators: Prof. Andrew Chien (U Chicago), Prof. Renato Figueiredo (UFL)
http://www.hpl.hp.com/research/distributedr.htm
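The data sharing slides propose keeping each object's header local to a process while the data part is mapped with mmap, so that readers share one immutable copy. The sketch below illustrates that idea with Python's standard `mmap` module rather than R's allocator. The names `LocalHeader` and `SharedPartition`, the `refcount` field, and the use of two readers inside one process (standing in for two R instances) are assumptions made for illustration.

```python
import mmap
import os
import tempfile
from array import array
from dataclasses import dataclass

ITEM_SIZE = array("d").itemsize  # bytes per double on this platform


@dataclass
class LocalHeader:
    """Per-reader metadata, kept outside the shared mapping."""
    length: int
    refcount: int = 0


class SharedPartition:
    """Read-only, zero-copy view of a partition of doubles."""

    def __init__(self, path, length):
        self.header = LocalHeader(length=length)
        self._file = open(path, "rb")
        self._map = mmap.mmap(self._file.fileno(), length * ITEM_SIZE,
                              access=mmap.ACCESS_READ)
        self._bytes = memoryview(self._map)
        self.data = self._bytes.cast("d")  # no copy of the mapped pages

    def close(self):
        # Views must be released before the mapping can be closed.
        self.data.release()
        self._bytes.release()
        self._map.close()
        self._file.close()


values = [0.4, 0.2, 0.4]
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(array("d", values).tobytes())
    path = f.name

reader_a = SharedPartition(path, len(values))
reader_b = SharedPartition(path, len(values))

reader_a.header.refcount += 1  # local bookkeeping; reader_b is unaffected
seen_by_a = list(reader_a.data)
seen_by_b = list(reader_b.data)

try:
    reader_a.data[0] = 1.0  # the partition is immutable
    write_rejected = False
except TypeError:
    write_rejected = True

header_counts = (reader_a.header.refcount, reader_b.header.refcount)
reader_a.close()
reader_b.close()
os.remove(path)
```

Keeping mutable bookkeeping in the local header is what lets the data pages stay read-only: one reader's counter changes never touch memory that another reader sees.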

Posted: 29/08/2022, 22:37