Python Social Media Analytics
Analyze and visualize data from Twitter, YouTube, GitHub, and more

Siddhartha Chatterjee
Michal Krystyanczuk

BIRMINGHAM - MUMBAI

Copyright © 2017 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: July 2017
Production reference: 1260717

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK

ISBN 978-1-78712-148-5
www.packtpub.com

Credits

Authors: Siddhartha Chatterjee, Michal Krystyanczuk
Reviewer: Ruben Oliva Ramos
Commissioning Editor: Amey Varangaonkar
Acquisition Editor: Divya Poojari
Content Development Editor: Cheryl Dsa
Technical Editor: Vivek Arora
Copy Editor: Safis Editing
Project Coordinator: Nidhi Joshi
Proofreader: Safis Editing
Indexer: Tejal Daruwale Soni
Graphics: Tania Dutta
Production Coordinator: Arvindkumar Gupta

Spark on the Cloud – Amazon Elastic MapReduce

Now that you have learnt about Spark, let's finally look at potentially limitless scaling!
We will learn how to use cloud services to deploy Spark clusters. There are many big data and data analytics service providers, such as Google or IBM Bluemix, but we will concentrate on Amazon for this chapter. We will provide screenshots of the process because such platforms can sometimes get a little overwhelming.

The following are the steps for the process:

1. First, we need to create an Amazon Cloud account if you don't already have one. Go to https://aws.amazon.com and click on create a free account:
2. Provide your credentials and click on Create account.
3. Next, we have to create a Key Pair. Key Pairs are the basic authentication method on Amazon. First, we need to go to the EC2 services dashboard:
4. Then, click on Key Pairs in the side menu.
5. Click on Create Key Pair and name it test-spark.
6. Next, we need to give our user some special permissions. In the Header Menu, hover over your name, click on Security Credentials in the drop-down menu, and then click on Users in the side menu.
7. Next, click on your user, then click on Add permissions.
8. Choose the option Attach existing policies directly and search for AutoScalingFullAccess.
9. Finally, click on Next: Review and click on Add permissions. Your user permissions should look like this:
AutoScalingFullAccess will give your user the right to use services like Amazon Elastic MapReduce to automatically commission servers to form clusters.
10. Next, go back to the AWS Console home screen and choose the service EMR (Elastic MapReduce):
11. Click on Create cluster, which should land you on the following screen:
12. We will name the cluster Test Spark Cluster, and choose the number of nodes we desire in the cluster. For testing purposes, we will choose only two (one master and one slave). Finally, select the EC2 key pair that we created previously and click on Create cluster. The cluster will take about 10 minutes to be ready, but once it is ready you should see the following on your screen:
13. The cluster services are only accessible from the
master node, so we will SSH into the master node to get access to the cluster. To do so, we need to add our IP address to the Inbound Rules.
14. To do this, return to the AWS Console home screen and choose the service EC2. When it opens, click on Security Groups in the side menu, and you should see the following:
15. In the table, under Group Name, you will find ElasticMapReduce-master. Right-click on it and select Edit inbound rules:
16. Add a new rule, choose SSH as the type of rule and My IP for the address, and save the list of inbound rules:
17. Next, return to the cluster in the EMR services dashboard and, next to the Master public DNS, click on SSH. This will open a pop-up window with instructions on how to connect to the Master Node:
18. Copy and paste that command into your terminal, indicating the right location for the test-spark.pem file. The command should look like the following:
    >> ssh -i test-spark.pem hadoop@ec2-34-210-177-135.us-west-2.compute.amazonaws.com
19. Next, when logged in, simply open the PySpark shell:
    >> pyspark
20. Just like that, you are connected to your Spark cluster. Next, run a simple test to make sure everything works:
    >> sc.parallelize(range(10)).map(lambda x: x * x).collect()
That should return a list of integers as the result (the squares of 0 to 9).

Here we've created a cluster with just two nodes, but Amazon EMR allows you to scale up to as many nodes as you need. With a simple click, you could scale up to hundreds or even thousands of nodes.

Summary

Ten or twenty years ago, we did not need to scale up, except in very specific domains. Today, with the digital boom, data volume is increasing exponentially, and we need to be able to scale. Scaling brings more challenges than simple sequential programming, but its benefits largely outweigh them. Social media analytics also requires the processing and analysis of massive amounts of unstructured data, so the ability to scale our algorithms and analysis is indispensable. In
this chapter, we looked at the basic methods of speeding up programs, like multi-threading and multi-processing. These methods are great when we have a powerful machine and a moderately sized dataset. If we are working on a small machine with, for example, four to eight cores, then we will be limited in the extent to which we can parallelize our code. However, of course, if we only have a single machine with such resources, installing Spark on it is pointless. At the same time, let's say we have a single very powerful machine with 40-80 cores and our program is not very complicated. In that case, using Celery might be more beneficial than Spark, because Celery would require fewer major code adaptations and we could launch multiple Celery workers on the machine.

Big data analysis platforms like Spark are not optimal on small datasets, because the overheads of master-to-slave communication and data distribution might themselves consume more resources than a slower sequential program. The power of such platforms is seen when processing large datasets, which a sequential program might take days to compute, whereas Spark with adequate resources can do it in minutes!

The biggest decision when scaling up is choosing the right approach for your problem. In some situations, multi-processing can be a better choice than Celery or Spark; it all depends on the problem at hand. The complexity of the problem must be weighed carefully against the available resources, such as the budget, the processing power, and the number of machines, before coming to a decision.

This chapter is meant to be a beginner's guide to scaling up. There are of course many things left to learn, but we hope that this chapter has demonstrated the potential of what can be achieved when we master cluster computing for social media analytics.
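The overhead trade-off described in the summary can be observed on a single machine. The following is only a minimal sketch under stated assumptions, not code from the book: Python's standard multiprocessing.Pool stands in for a cluster, the function names square, sequential_squares, and parallel_squares are hypothetical, and the input is deliberately tiny, like the ten-element sanity check run on the EMR cluster. On an input this small, starting the worker processes will typically cost more than the computation itself, but the exact timings depend on the machine.

```python
import time
from multiprocessing import Pool


def square(x):
    """Worker function; kept at module level so worker processes can locate it."""
    return x * x


def sequential_squares(n):
    """Compute the squares of 0..n-1 in a single process."""
    return [square(x) for x in range(n)]


def parallel_squares(n, workers=2):
    """Compute the same squares with a pool of worker processes (illustrative helper)."""
    with Pool(processes=workers) as pool:
        return pool.map(square, range(n))


if __name__ == "__main__":
    n = 10  # deliberately tiny input (an assumption made for illustration)

    start = time.perf_counter()
    seq = sequential_squares(n)
    seq_time = time.perf_counter() - start

    start = time.perf_counter()
    par = parallel_squares(n)
    par_time = time.perf_counter() - start

    assert seq == par  # both approaches must produce the same list
    print(f"sequential: {seq_time:.6f}s, pool of 2 workers: {par_time:.6f}s")
```

Increasing n and making square more expensive is one way to see where the pool starts to pay off; the same reasoning, with larger overheads, applies when deciding between a single machine and a Spark cluster.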