Go to https://cloud.google.com/dataproc/, and click Try It Free.
Log in using your Google account credentials.
Sign up for a free Google Cloud Platform account. You need to provide debit/credit card details. (At the time of writing, Google is offering $300 of free credit for 60 days for new accounts.)
Install the gcloud command-line tool:
$ curl https://sdk.cloud.google.com | bash
$ exec -l $SHELL
$ gcloud init
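Once the SDK is initialized, it is worth confirming that the command line is tied to the right account and project before proceeding (the project ID below is a placeholder):
$ gcloud auth list
$ gcloud config set project <your-project-id>
$ gcloud config list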
Go to https://console.cloud.google.com/, and click the Products & Services tab (represented by three horizontal lines), as shown in Figure 10-1.
Figure 10-1. Accessing the Products & Services menu for Google Cloud Platform
Clicking the button displays a scrollable menu, as shown in Figure 10-2.
Scroll down to the Big Data part of the menu, and select Dataproc (Figure 10-3).
Selecting Dataproc takes you to the screen in Figure 10-4, which tells you to enable billing before you can use Dataproc. If you are using a preconfigured GCP project, you can skip the next two steps.
Click Enable Billing to go to the Billing Account setup screen (Figure 10-5).
Figure 10-3. Selecting Dataproc from the Big Data services provided by Google Cloud Platform
Figure 10-4. Billing needs to be enabled before using Dataproc
Go back to the Dataproc dashboard, and you should be able to create a cluster, as shown in Figure 10-6.
Clicking Create Cluster takes you to the screen in Figure 10-7, because the Compute Engine API needs to be enabled first. Go to the Compute Engine dashboard to enable it.
Figure 10-5. Enabling a billing account for your GCP project
Figure 10-6. Dataproc cluster-creation screen
Figure 10-7. The Compute Engine API needs to be enabled before you can use Dataproc
First Spark on Dataproc Application
The Yelp dataset also contains a number of features about businesses (contained in yelp_academic_dataset_business.json). For each business, these features include its location, full address, and the reviews it has received. For restaurants, the dataset also contains qualitative information, such as whether the location is child friendly and whether it accepts credit cards.
Listing 10-1 contains the JSON template for this dataset.
Listing 10-1. JSON Blueprint of the Business Dataset

{
  "business_id": "<anonymized_id>",
  "full_address": "<street_address>",
  "hours": {
    "<day_of_week>": {
      "close": "<HH:MM>",
      "open": "<HH:MM>"
    }
  },
  "open": <true/false>,
  "categories": [
    <list_of_categories_such_as_restaurant>
  ],
  "city": "<self_explanatory>",
  "review_count": <number_of_reviews>,
  "name": "<self_explanatory>",
  "neighborhoods": [
    <list_of_neighborhood_names>
  ],
  "longitude": <self_explanatory>,
  "state": "<self_explanatory>",
  "stars": <star_count>,
  "latitude": <self_explanatory>,
  "attributes": {
    <key_value_pairs_of_attributes>
  },
  "type": "business"
}
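Before wiring this schema into a streaming job, it may help to see how a single record is handled. The following is a minimal, self-contained json4s sketch that parses one business record shaped like the blueprint above and extracts a few fields; the sample record and field choices are illustrative assumptions, not part of the Yelp dataset.

import org.json4s.DefaultFormats
import org.json4s.jvalue2extractable
import org.json4s.jvalue2monadic
import org.json4s.native.JsonMethods.parse
import org.json4s.string2JsonInput

object BusinessRecordSketch {
  def main(args: Array[String]) {
    implicit val formats = DefaultFormats

    // Hypothetical record following the Listing 10-1 blueprint (heavily trimmed)
    val record = """{"business_id":"b-001","name":"Cafe Example","stars":4,"review_count":27,"type":"business"}"""

    val jvalue = parse(record)
    val name = (jvalue \ "name").extract[String]         // "Cafe Example"
    val stars = (jvalue \ "stars").extract[Int]          // 4
    val reviews = (jvalue \ "review_count").extract[Int] // 27
    println(s"$name has $stars stars across $reviews reviews")
  }
}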
One of the attributes is Wi-Fi, with three fairly obvious values: no, free, and paid. An interesting application would be to figure out whether there is any correlation between having WiFi access and the rating of an establishment. The code for this application is in Listing 10-2. It reads data from a socket (line 32) and converts each record to a JSON object (line 35). The application requires only two features, WiFi presence and star rating, so you confine yourself to records that contain the former (lines 37–39) and convert them into key-value pairs where the key is the WiFi type and the value is the star rating (lines 40–43).
To calculate the average rating per WiFi type, you use a custom combineByKey transform to get an overall sum and count per key, followed by a map operation to compute the actual average (recall Listing 3-18 in Chapter 3).
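If Listing 3-18 is not fresh in your mind, here is a minimal, self-contained sketch of the same sum-and-count averaging pattern on a plain RDD; the sample pairs and the local master are assumptions for illustration only.

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object AverageByKeySketch {
  def main(args: Array[String]) {
    val sc = new SparkContext(
      new SparkConf().setAppName("AverageByKeySketch").setMaster("local[2]"))

    // Hypothetical (wifiType, starRating) pairs
    val ratings = sc.parallelize(Seq(("free", 4), ("no", 3), ("free", 5), ("paid", 2)))

    val averages = ratings
      .combineByKey(
        (v: Int) => (v, 1),                                    // initialize a (sum, count) accumulator
        (acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1), // fold a value into the accumulator
        (a: (Int, Int), b: (Int, Int)) =>                      // merge per-partition accumulators
          (a._1 + b._1, a._2 + b._2))
      .map { case (k, (sum, count)) => (k, sum / count.toFloat) } // divide sum by count per key

    averages.collect().foreach(println) // e.g. (free,4.5), (no,3.0), (paid,2.0)

    sc.stop()
  }
}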
Listing 10-2. First Spark on Dataproc Application to Compare the Ratings of Restaurants With and Without Free WiFi
1. package org.apress.prospark
2.
3. import org.apache.spark.HashPartitioner
4. import org.apache.spark.SparkConf
5. import org.apache.spark.SparkContext
6. import org.apache.spark.streaming.Seconds
7. import org.apache.spark.streaming.StreamingContext
8. import org.apache.spark.streaming.dstream.DStream.toPairDStreamFunctions
9. import org.json4s.DefaultFormats
10. import org.json4s.JsonAST.JNothing
11. import org.json4s.jvalue2extractable
12. import org.json4s.jvalue2monadic
13. import org.json4s.native.JsonMethods.parse
14. import org.json4s.string2JsonInput
15.
16. object DataProcApp {
17.
18.   def main(args: Array[String]) {
19.     if (args.length != 4) {
20.       System.err.println(
21.         "Usage: DataProcApp <appname> <batchInterval> <hostname> <port>")
22.       System.exit(1)
23.     }
24.     val Seq(appName, batchInterval, hostname, port) = args.toSeq
25.
26.     val conf = new SparkConf()
27.       .setAppName(appName)
28.       .setJars(SparkContext.jarOfClass(this.getClass).toSeq)
29.
30.     val ssc = new StreamingContext(conf, Seconds(batchInterval.toInt))
31.
32.     ssc.socketTextStream(hostname, port.toInt)
33.       .map(r => {
34.         implicit val formats = DefaultFormats
35.         parse(r)
36.       })
37.       .filter(jvalue => {
38.         jvalue \ "attributes" \ "Wi-Fi" != JNothing
39.       })
40.       .map(jvalue => {
41.         implicit val formats = DefaultFormats
42.         ((jvalue \ "attributes" \ "Wi-Fi").extract[String], (jvalue \ "stars").extract[Int])
43.       })
44.       .combineByKey(
45.         (v) => (v, 1),
46.         (accValue: (Int, Int), v) => (accValue._1 + v, accValue._2 + 1),
47.         (accCombine1: (Int, Int), accCombine2: (Int, Int)) => (accCombine1._1 + accCombine2._1, accCombine1._2 + accCombine2._2),
48.         new HashPartitioner(ssc.sparkContext.defaultParallelism))
49.       .map({ case (k, v) => (k, v._1 / v._2.toFloat) })
50.       .print()
51.
52.     ssc.start()
53.     ssc.awaitTermination()
54.   }
55.
56. }
This application consumes the Yelp businesses JSON data via the SocketDriver. The only minor change you need to make is to replace line 1 in Listing 9-2 (AbstractDriver) with this:
else if (ext.equals("dat") || ext.equals("json"))
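Alternatively, for a quick local smoke test that skips the SocketDriver altogether, netcat can serve the file over a socket (the port is arbitrary, and depending on your netcat variant the listen syntax may be -l <port> or -l -p <port>):

$ nc -l 9999 < yelp_academic_dataset_business.json

You can then point the application at localhost and port 9999.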
To run the application on Dataproc, create a JAR from it using sbt assembly . To complete the story, let’s create a Dataproc cluster.
Figure 10-8. Creating a bare minimum cluster on Dataproc
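If you prefer the command line to the console walk-through in Figure 10-8, the gcloud tool can create the cluster and submit the job as well. The following is a sketch under assumed names: the cluster name, region, and assembly JAR path are placeholders, and the exact flags vary across gcloud releases.

$ gcloud dataproc clusters create prospark-cluster --region us-central1
$ gcloud dataproc jobs submit spark --cluster prospark-cluster --region us-central1 \
    --class org.apress.prospark.DataProcApp \
    --jars <path_to_assembly_jar> \
    -- DataProcApp 10 <socket_driver_host> 9999

The four trailing arguments match the usage string in Listing 10-2: app name, batch interval in seconds, and the hostname and port where the data is being served.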