Apache Spark 2.x for Java Developers
Explore data at scale using the Java APIs of Apache Spark 2.x
Sourav Gulati
Sumit Kumar

BIRMINGHAM - MUMBAI

Apache Spark 2.x for Java Developers
Copyright © 2017 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: July 2017
Production reference: 1250717

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.

ISBN 978-1-78712-649-7

www.packtpub.com

Credits

Authors: Sourav Gulati, Sumit Kumar
Reviewer: Prashant Verma
Commissioning Editor: Amey Varangaonkar
Acquisition Editor: Shweta Pant
Content Development Editor: Mayur Pawanikar
Technical Editor: Karan Thakkar
Copy Editor: Safis Editing
Project Coordinator: Nidhi Joshi
Proofreader: Safis Editing
Indexer: Tejal Daruwale Soni
Graphics: Tania Dutta
Production Coordinator: Arvindkumar Gupta

Foreword

Sumit Kumar and Sourav Gulati are technology evangelists with deep experience in envisioning and implementing solutions, as well as complex problems dealing with large and high-velocity data. Every time I talk to them about any complex problem statement, they have provided an innovative and scalable solution.

I have over 17 years of experience in the IT industry, specializing in envisioning, architecting, and implementing various enterprise solutions revolving around a variety of business domains, such as hospitality, healthcare, risk management, and insurance.

I have known Sumit and Sourav for years as developers/architects who have worked closely with me implementing various complex big data solutions. From their college days, they were inclined toward exploring/implementing distributed systems. As if implementing solutions around big data systems were not enough, they also started sharing their knowledge and experience with the big data community. They have actively contributed to various blogs and tech talks, and under no circumstances do they pass up on any opportunity to help their fellow technologists.

Knowing Sumit and Sourav, I am not surprised that they have started authoring a book on Spark and that I am writing the foreword for their book, Apache Spark 2.x for Java Developers. Their passion for technology has again resulted in the terrific book you now have in your hands.

This book is the product of Sumit's and Sourav's deep knowledge and extensive implementation experience in Spark for solving real problems that deal with large, fast, and diverse data. Several books on distributed systems exist, but Sumit's and Sourav's book...
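The GraphX examples that follow operate on a small property graph referred to simply as graph, whose construction is not part of this excerpt. As a minimal sketch, a graph with String vertex and edge properties could be assembled with the Java API as follows; the vertex names and relationships here are hypothetical placeholders:

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.graphx.Edge;
import org.apache.spark.graphx.Graph;
import org.apache.spark.storage.StorageLevel;

import scala.Tuple2;
import scala.reflect.ClassTag;
import scala.reflect.ClassTag$;

SparkConf conf = new SparkConf().setAppName("GraphOperations").setMaster("local[*]");
JavaSparkContext jsc = new JavaSparkContext(conf);

// Hypothetical vertices: (vertexId, name)
JavaRDD<Tuple2<Object, String>> vertices = jsc.parallelize(Arrays.asList(
    new Tuple2<Object, String>(1L, "James"),
    new Tuple2<Object, String>(2L, "Robert"),
    new Tuple2<Object, String>(3L, "Charlie")));

// Hypothetical edges: (srcId, dstId, relationship)
JavaRDD<Edge<String>> edges = jsc.parallelize(Arrays.asList(
    new Edge<String>(2L, 1L, "Friend"),
    new Edge<String>(3L, 1L, "Follower")));

ClassTag<String> stringTag = ClassTag$.MODULE$.apply(String.class);

// Graph.apply is the Scala factory method; invoked from Java, the default
// vertex attribute, storage levels, and ClassTags must be passed explicitly.
Graph<String, String> graph = Graph.apply(vertices.rdd(), edges.rdd(), "",
    StorageLevel.MEMORY_ONLY(), StorageLevel.MEMORY_ONLY(), stringTag, stringTag);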
reverse

The reverse transformation is used to reverse the direction of the edges in the graph. This operation simply swaps the source and destination indices of the edges, without updating any vertex or edge properties. It can be executed as follows:

Graph<String, String> reversedGraph = graph.reverse();

To verify, print the edge triplets:

reversedGraph.triplets().toJavaRDD().collect().forEach(System.out::println);

subgraph

As its name suggests, the subgraph operation returns a subpart of the graph: a graph containing only the vertices and edges that satisfy user-defined criteria. In this example, we will fetch the subpart of the graph where the edge property is Friend. The following is the signature of the subgraph method:

subgraph(scala.Function1<EdgeTriplet<VD, ED>, Object> epred, scala.Function2<Object, VD, Object> vpred)

The first parameter is used to filter the edges and the second parameter is used to filter the vertices, based on user-defined criteria. As we only want to filter the graph on the edge property Friend, we do not need to specify any filter condition in the second parameter. The following is the implementation of both parameters:

public class AbsFunc1 extends AbstractFunction1<EdgeTriplet<String, String>, Object> implements Serializable {
  @Override
  public Object apply(EdgeTriplet<String, String> arg0) {
    return arg0.attr().equals("Friend");
  }
}

public class AbsFunc2 extends AbstractFunction2<Object, String, Object> implements Serializable {
  @Override
  public Object apply(Object arg0, String arg1) {
    return true;
  }
}

Hence, the subgraph operation can be executed as follows:

Graph<String, String> subgraph = graph.subgraph(new AbsFunc1(), new AbsFunc2());
subgraph.triplets().toJavaRDD().collect().forEach(System.out::println);

The logical output of the preceding operation on the graph is as follows:

[Figure: Logical representation of the output of the subgraph operation]

aggregateMessages

Aggregating data about vertices is very common in graph-based computations: for example, finding the total number of friends a user has on Facebook, or the total number of followers a user has on Twitter. The aggregateMessages transformation is the primary aggregation function of GraphX. The following is the signature of the method:

aggregateMessages(scala.Function1<EdgeContext<VD, ED, A>, scala.runtime.BoxedUnit> sendMsg, scala.Function2<A, A, A> mergeMsg, TripletFields tripletFields)

aggregateMessages works like a MapReduce function. The sendMsg function can be considered a map function, which sends a message from the source to the destination vertex, or vice versa. It takes an EdgeContext object as a parameter, which exposes methods to send messages between the source and destination vertices. The mergeMsg function can be considered a reduce function, which aggregates the messages sent using the sendMsg function. The TripletFields parameter specifies which parts of the EdgeContext, such as the source attributes or the destination attributes, are used in the sendMsg function, which helps optimize the behavior; the default value of this parameter is TripletFields.All.

In this example, we will count how many directed edges each vertex is the destination of. This is similar to counting how many followers a user has on Twitter. The following is the definition of the sendMsg function:

public class AbsFunc4 extends AbstractFunction1<EdgeContext<String, String, Integer>, BoxedUnit> implements Serializable {
  @Override
  public BoxedUnit apply(EdgeContext<String, String, Integer> arg0) {
    arg0.sendToDst(1);
    return BoxedUnit.UNIT;
  }
}

Here we are sending a message with the value 1 from the source to the destination vertex. The mergeMsg function can be defined as follows:

public class AbsFunc5 extends scala.runtime.AbstractFunction2<Integer, Integer, Integer> implements Serializable {
  @Override
  public Integer apply(Integer i1, Integer i2) {
    return i1 + i2;
  }
}

In this function, we are performing a sum of all the messages received at a vertex. This can be visualized as a word count program in MapReduce. Using the preceding user-defined functions, the aggregateMessages operation can be executed as follows:

VertexRDD<Integer> aggregateMessages = graph.aggregateMessages(new AbsFunc4(), new AbsFunc5(), TripletFields.All, scala.reflect.ClassTag$.MODULE$.apply(Integer.class));
aggregateMessages.toJavaRDD().collect().forEach(System.out::println);

As shown in the preceding example, the operation returns a VertexRDD, which contains the vertex IDs along with the aggregated results. Vertices that are not the destination of any directed edge are not included in the resultant RDD.
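Incidentally, counting incoming edges is common enough that GraphX also exposes it directly through the GraphOps helper as inDegrees. A minimal sketch, assuming the same graph object as above, whose output should match the aggregateMessages result:

// Built-in in-degree computation via the GraphOps helper. Scala's Int
// erases to Object in the Java view of VertexRDD[Int].
VertexRDD<Object> inDegrees = graph.ops().inDegrees();
inDegrees.toJavaRDD().collect().forEach(System.out::println);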
outerJoinVertices

Joining datasets is a very popular operation in analytics systems. The outerJoinVertices operation is used to join graphs with external RDDs. Joining graphs to external datasets can be really useful when some external properties need to be merged into the graph. The following is the signature of the outerJoinVertices transformation:

outerJoinVertices(RDD<Tuple2<Object, U>> other, scala.Function3<Object, VD, Option<U>, VD2> mapFunc)
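As an illustration of how this transformation might be invoked, the following minimal sketch joins the follower counts computed earlier by aggregateMessages back into the graph, rewriting each vertex attribute as name:count. The MergeCount class is hypothetical, and the trailing ClassTag and equality-evidence arguments are required when calling this Scala API from Java (null is acceptable for the latter):

import scala.Option;
import scala.reflect.ClassTag;
import scala.reflect.ClassTag$;
import scala.runtime.AbstractFunction3;
import org.apache.spark.graphx.Graph;

// Hypothetical mapper: merges an optional external value into each vertex attribute.
public class MergeCount extends AbstractFunction3<Object, String, Option<Integer>, String> implements Serializable {
  @Override
  public String apply(Object vertexId, String name, Option<Integer> count) {
    // Vertices absent from the external RDD receive Option.empty().
    return name + ":" + (count.isDefined() ? count.get() : 0);
  }
}

ClassTag<Integer> intTag = ClassTag$.MODULE$.apply(Integer.class);
ClassTag<String> stringTag = ClassTag$.MODULE$.apply(String.class);

Graph<String, String> joined = graph.outerJoinVertices(aggregateMessages, new MergeCount(), intTag, stringTag, null);
joined.vertices().toJavaRDD().collect().forEach(System.out::println);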