Google Cloud Platform for Architects
Design and manage powerful cloud solutions
Vitthal Srinivasan
Janani Ravi
Judy Raj

BIRMINGHAM - MUMBAI

Google Cloud Platform for Architects
Copyright © 2018 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Commissioning Editor: Vijin Boricha
Acquisition Editor: Rohit Rajkumar
Content Development Editor: Abhishek Jadhav
Technical Editor: Mohd Riyan Khan
Copy Editors: Safis Editing, Dipti Mankame
Project Coordinator: Judie Jose
Proofreader: Safis Editing
Indexer: Priyanka Dhadke
Graphics: Tom Scaria
Production Coordinator: Shantanu Zagade

First published: June 2018
Production reference: 1220618

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK

ISBN 978-1-78883-430-8

www.packtpub.com

mapt.io
Mapt is an online digital library that gives you full access to over 5,000 books and videos, as well as industry-leading tools to help you plan your personal development and advance your career. For more information, please visit our website.

Why subscribe?
Spend less time learning and more time coding with practical eBooks and videos from over 4,000 industry professionals.
Improve your learning with Skill Plans built especially for you.
Get a free eBook or video every month.
Mapt is fully searchable.
Copy and paste, print, and bookmark content.

PacktPub.com
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available?
You can upgrade to the eBook version at www.PacktPub.com and, as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at service@packtpub.com for more details. At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.

Contributors

About the authors
Vitthal Srinivasan is a Google Cloud Platform Authorized Trainer and certified Google Cloud Architect and Data Engineer. Vitthal holds master's degrees in math and electrical engineering from Stanford and an MBA from INSEAD. He has worked at Google as well as at other large firms, such as Credit Suisse and Flipkart. He is currently at Loonycorn, a technical video content studio, of which he is a cofounder.

Janani Ravi is a certified Google Cloud Architect and Data Engineer. She earned her master's degree in electrical engineering from Stanford. She is currently at Loonycorn, a technical video content studio, of which she is a cofounder. Prior to co-founding Loonycorn, she worked at various leading companies, such as Google and Microsoft, for several years as a software engineer.
I would like to thank my family, dogs, colleagues at Loonycorn, and friends for making life so much fun!

Judy Raj is a Google Certified Professional Cloud Architect, and she has great experience with the three leading cloud platforms, namely AWS, Azure, and the GCP. She has also worked with a wide range of technologies in machine learning, data science, IoT, robotics, and mobile and web app development. She is currently a technical content engineer at Loonycorn. She holds a degree in computer science and engineering from Cochin University of Science and Technology. Being a driven engineer fascinated with technology, she is a passionate coder, an AI enthusiast, and a cloud aficionado.
I'd like to thank my coauthors and colleagues for all the support and encouragement I've received. I'd also like to thank God and my parents for everything that I am and everything I aspire to be.

Try to find reasons to use network peering
Remember that VPCs in the GCP world are quite different from networks in the physical world, or even on other cloud providers such as AWS. VPCs are more like Autonomous Systems (AS) because each VPC can include multiple disjoint IP address ranges. Resources that are in the same VPC can communicate using internal IP addresses, as well as using a project-internal DNS facility. This is true even if the resources are in different regions. For instance, consider two VMs, one in the US and the other in the UK. Provided these are in the same VPC, they will be able to communicate using internal IP addresses despite their physical distance. By contrast, if two resources are in different VPCs, even if they happen to be in the same region or even on the same underlying bare metal box (remember that GCP VMs are multi-tenanted), they will still have to communicate using external IP addresses, which implies that the network traffic between them will have to pass over the internet.

Communication on internal IP addresses has several advantages:
Cost: Remember that network egress traffic incurs charges, and communication over internal IP addresses avoids this.
Security: Google's internal networks are relatively invulnerable to intrusion and security attacks. After all, Google has been under siege from hackers for over a decade now. However, once traffic leaves Google's internal networks and touches the internet, all bets are off.
Latency: Google's internal networks are blazingly fast; this is partially a legacy of Google's investments in YouTube and in trying to get video served at acceptable latencies in all or most regions of the world. Internal traffic on the GCP is able to hitch a ride on these really fast internal links.

This presents us with a trade-off: if we have lots of small, modular VPCs, the organization of resources and firewall rules gets cleaner, but network traffic gets slower, costlier, and less secure. A great way to square this circle is to make use of the feature named VPC peering. This allows a 1:1 link between VPCs so that resources on the peered VPCs can communicate using internal IP addresses. Unlike on AWS, peering on GCP is cheaper in this respect since only standard network charges apply. So, look for every possible opportunity to use VPC peering.
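Setting up a peering is a small amount of work. What follows is a minimal sketch (not from the book) using the Compute Engine API's networks.addPeering method through the google-api-python-client library; the project and network names are hypothetical placeholders, and application-default credentials are assumed:

# Minimal sketch of VPC peering via the Compute Engine API (networks.addPeering).
# Project and network names below are hypothetical placeholders.
from googleapiclient import discovery

compute = discovery.build("compute", "v1")  # uses application-default credentials

def request_peering(project, network, peer_project, peer_network):
    """Ask `network` to peer with `peer_network`.

    Peering is not symmetric: it only becomes active once the mirror-image
    call has also been made from `peer_network` back to `network`.
    """
    body = {
        "name": "{}-to-{}".format(network, peer_network),
        "peerNetwork": "projects/{}/global/networks/{}".format(peer_project, peer_network),
        "autoCreateRoutes": True,
    }
    return compute.networks().addPeering(
        project=project, network=network, body=body).execute()

# One call in each direction, for example:
# request_peering("project-a", "vpc-a", "project-b", "vpc-b")
# request_peering("project-b", "vpc-b", "project-a", "vpc-a")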
Understand how sustained use discounts work
Kubernetes notwithstanding, GCE VM instances are likely to remain an integral part of your organization's cloud strategy for the foreseeable future. Because you are certain to be using a lot of VM instances, you should invest the time to understand how exactly discounts on their usage work. There are basically two major types of discounts currently available for VMs: sustained use and committed use discounts.

Committed use discounts require you to make upfront commitments about how much you will use your VM instances. So, if you know for certain that you will need some specific amount of compute power, by all means go ahead and make that commitment. Beware, always, of making such commitments, though: cloud providers offer such discounts in the hope that users will overestimate their needs and end up overpaying.

Sustained use discounts, on the other hand, are a lot more worthwhile. The basic idea here is that the GCP will combine all of your usage of VM instances of the same standard machine type, in the same project, and in the same zone, and apply a discount based on the total usage as if this were one giant VM instance. There is a lot of fine print in there that is worth understanding. The sustained use discount does not require any upfront commitment; it is automatically applied based on your actual usage. The discount applies for all VMs of the same standard type, in the same project, and in the same zone. This makes sense because VMs with these shared properties are, in a sense, fungible; treating them together helps GCP with capacity planning. Discount calculation for custom machine types is a bit different from that for standard machine types: there is a sustained use discount for custom machine types as well, but it is less likely to save you a ton of money.

Read the fine print on GCS pricing
Google Cloud Storage buckets are elastic, that is, you definitely pay for what you use, not what you allocate. That's great. However, there are a couple of points around their pricing that you should be certain to keep in mind:
Access charges on nearline and coldline buckets: Recall that the hot bucket types (regional and multiregional) have relatively high storage charges, but no access charges. On the other hand, the cool and cold bucket types (nearline and coldline) have access charges, and these can become quite substantial. Say you back up a laptop to a coldline bucket and then need to retrieve all of that data because the laptop crashes; you might find yourself paying access charges not that different from the cost of an old laptop. So, think through your use cases for the different bucket types very carefully. Again, it can't be emphasized enough: use nearline when access is roughly once a month or once every few months; use coldline when access is once a year or even less frequent.
Class A and Class B operations: GCS operations are categorized as Class A, Class B, and free. Class A and Class B operations are cheap on a per-operation basis but can become quite expensive if performed repeatedly. Take, for instance, Object Lifecycle Management, the GCS feature that allows you to change the storage class of objects based on age, freshness, and so on. You might want to keep in mind that such transitions are Class A operations, and they can end up costing you more than leaving the data item in a slightly suboptimal bucket type.
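As a concrete illustration of both points, here is a minimal sketch, assuming the google-cloud-storage Python client library; the bucket name, location, and the 365-day threshold are hypothetical placeholders:

# Minimal sketch: a nearline bucket plus a lifecycle rule that ages objects
# into coldline. Assumes google-cloud-storage and application-default credentials.
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("laptop-backups-example")   # hypothetical bucket name
bucket.storage_class = "NEARLINE"                  # accessed roughly once a month
client.create_bucket(bucket, location="us-central1")

# Each SetStorageClass transition performed by this rule is billed as a
# Class A operation, so weigh the per-operation cost against the storage saving.
bucket.add_lifecycle_set_storage_class_rule("COLDLINE", age=365)
bucket.patch()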
Use BigQuery unless you have a specific reason not to
There is an absolute candy store of data storage services available on the GCP. You probably have seen a taxonomy like the one in the following diagram, which tells you which service to use when. Now, you certainly should take the time to go through this diagram and understand each part of it, but the bottom line can be summed up as: unless you really have a strong, pressing reason, just use BigQuery.

One asterisk about the preceding taxonomy: we intentionally list Datastore in the OLAP part of the tree because, even though it does have a transactional mode, in the real world, if you need transactional support, you are far more likely to go with an RDBMS than with a document-oriented NoSQL database such as Datastore.

What might some of those strong, pressing reasons to not use BigQuery be? Well, you might need really ironclad transactional support, in which case Cloud SQL or Cloud Spanner are your best options. Perhaps you need to support a large volume of writes at very high throughput, in which case Bigtable is your best bet. However, in general, unless you have a strong explicit reason, just pick BigQuery.
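Part of BigQuery's appeal is how little code it takes to get an answer out of it. A minimal sketch, assuming the google-cloud-bigquery client library and querying Google's public Shakespeare sample table purely as an illustration:

# Minimal sketch of a BigQuery query using the google-cloud-bigquery library.
# Queries the public `bigquery-public-data.samples.shakespeare` sample table.
from google.cloud import bigquery

client = bigquery.Client()  # picks up project and credentials from the environment

query = """
    SELECT corpus, SUM(word_count) AS total_words
    FROM `bigquery-public-data.samples.shakespeare`
    GROUP BY corpus
    ORDER BY total_words DESC
    LIMIT 5
"""

for row in client.query(query).result():  # result() blocks until the job finishes
    print(row.corpus, row.total_words)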
Use pre-emptible instances in your Dataproc clusters
Hadoop (and Spark) jobs constitute perhaps the single most important use case for organizations moving to the cloud. So, getting your Hadoop strategy right is really important, and here, Dataproc is a fairly obvious way to get started. One important bit of cost optimization that you ought to perform is using as many pre-emptible instances as possible. Recall that pre-emptible instances are those that can be taken back by the platform at very short notice. So, if you are using a pre-emptible VM instance, you could have it snatched away at any point, with just about 30 seconds to execute a shutdown script and clean up your state. The flip side of this inconvenience is that pre-emptible VM instances are very cheap. On an apples-to-apples basis, you can expect a pre-emptible instance to cost about 60-80% less than a non-pre-emptible instance with the same specs.

And here's the kicker: Hadoop has fault tolerance built in, and it is the perfect setting in which to exploit the affordability of pre-emptible instances. Recall that Hadoop is the big daddy of distributed computing apps; it practically invented the idea of horizontal scaling, in which large clusters of generic hardware are assembled and managed by some central orchestration software. This use of generic hardware implies that Hadoop always expects bad things to happen to nodes in clusters: it has elaborate mechanisms for sharding, replication, and making sure that node failures are managed gracefully. In fact, within a Dataproc cluster, all of the pre-emptible VMs are collectively placed inside a Managed Instance Group, and the platform takes responsibility for clearing away the old pre-empted VMs so that they don't clog up your cluster.

There are some guidelines to keep in mind while allocating pre-emptible VMs to your Dataproc clusters. If your Hadoop jobs are skewed toward map-only work and do not rely on HDFS a whole lot, you can probably push the envelope and use even 80-90% pre-emptible VMs without seeing performance degradation. On the other hand, if your Hadoop jobs tend to have a lot of shuffling, then using more than 50% pre-emptible VMs might be a bad idea: the pre-emption of a lot of VMs can significantly slow down your job, and the additional processing for fault tolerance might even end up increasing the total cost.

Keep your Dataproc clusters stateless
Remember that Hadoop, in its pure, non-cloud form, maintains state in a distributed file system named HDFS. HDFS is on the same set of nodes where the Hadoop jobs actually run; for this reason, Hadoop is said to not separate compute and storage. The compute (Hadoop JARs) and storage (HDFS data) are on the same machines, and the JARs are actually shipped to where the data is. This was a fine pattern for the old days, but in the cloud world, if you kept your data in HDFS, you would run up an enormous bill. Why? Because in the world of elastic Hadoop clusters, such as Dataproc on the GCP or Elastic MapReduce on AWS, HDFS is going to exist on the persistent disks of the cloud VMs in the cluster. If you keep data in HDFS, you will need those disks to always exist; therefore, the cluster will always be up. You will pay a lot, use only a little, and basically negate the whole point of moving to the cloud. So, what you really ought to do is move your data from on-premises HDFS to GCS. Do not move from on-premises HDFS to cloud HDFS. That way, you can spin up clusters whenever you like, point them to data in GCS buckets, run your job, and kill the cluster. Such clusters are called stateless because they only reference state data from an external source (GCS buckets) rather than maintaining it internally in HDFS.

Understand the unified architecture for batch and stream
More and more big data applications rely on streaming data. There are many reasons for this, notably the increasing need for real-time insights, where a system must output analytics as new data comes in, on the fly. We will not spend a lot of time discussing the difference between batch and streaming data, but intuitively, batch data is at rest in a database or a file, whereas streaming data is, well, streaming from a source to a sink. There is a specific architecture that Google mentions a lot, which combines batch and stream processing into a single pipeline, and it is worth understanding this architecture, as follows: in the GCP world, the most common batch data source is GCS (that is, buckets), and the reliable messaging layer is Pub/Sub. Pub/Sub virtually always feeds into Dataflow, which is based on the Apache Beam APIs and combines batch and streaming logic into pipelines. The classic sink for summary analytics is BigQuery, and the best place to store granular, tick-by-tick processed data is Bigtable (why Bigtable? Because it supports fast writes and very large datasets, on the order of petabytes). So, penciling in all of those, we get the GCP version of the architecture.
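To make the shape of such a pipeline concrete, here is a minimal sketch, assuming the Apache Beam Python SDK (the apache-beam[gcp] package); the Pub/Sub topic, BigQuery table, and one-minute window are hypothetical placeholders, and the runner would be set to Dataflow via pipeline options:

# Minimal sketch of a streaming Beam pipeline: Pub/Sub -> fixed windows -> BigQuery.
# Topic, table, and window size are hypothetical; add --runner=DataflowRunner to
# execute on Dataflow rather than locally.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms import window

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (p
     | "Read" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/ticks")
     | "Decode" >> beam.Map(lambda msg: msg.decode("utf-8"))
     | "Window" >> beam.WindowInto(window.FixedWindows(60))   # one-minute windows
     | "Count" >> beam.combiners.Count.PerElement()
     | "Format" >> beam.Map(lambda kv: {"value": kv[0], "count": kv[1]})
     | "Write" >> beam.io.WriteToBigQuery(
         "my-project:analytics.tick_counts",
         schema="value:STRING,count:INTEGER",
         write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))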
Now, this is an important architectural set piece, and you really should commit it to memory. However, this does not mean that you should actually adopt it, at least not as of the time of this writing, in early 2018. The weak link in this architecture, right now, is Dataflow. In theory, Dataflow is an awesome technology: it unifies the batch and streaming layers, de-dupes and orders the streaming data coming in from Pub/Sub, and can perform complex event-time and processing-time operations, such as windowing and watermarking. The downside? You have to write code to do this; there is no UI currently available. What's more, the code, which can be in either Python or Java, is not all that easy to write. In time, and probably very soon, Dataflow will be an attractive proposition, though, so we should not be quick to dismiss it.

Understand the main choices for ML applications
We have not spent a lot of time discussing machine learning on the GCP in this book, but at a very high level, you have two choices:
TensorFlow and the Cloud ML Engine
SparkML and Dataproc
Both options are good. The Cloud ML Engine has support for distributed training and prediction and is tightly coupled with TensorFlow, which is a great technology for deep learning. So, this option is probably the better one, on balance. SparkML is a great option too, though: Spark is possibly the hottest big data technology today, and therefore there are a lot of existing Spark applications and a lot of talented Spark developers out there. If your organization uses a lot of Spark right now, you might find the SparkML on Dataproc option to be a better one, at least until TensorFlow and the ML Engine catch on in popularity in your firm.

Understand the differences between snapshots and images
Persistent disks can be backed up using either images or snapshots; the two services seem similar but differ in some subtle ways, so let's be sure we understand the differences:
Snapshots are best for data backups; they are cheaper, and incremental snapshots are possible too.
Images are best for infrastructure reuse, such as exporting a VM image for use in a different project or as the basis for a Managed Instance Group.
As mentioned previously, only images can be used as the basis for an instance template, which in turn is used to create Managed Instance Groups; snapshots can't be used for this purpose.
Images can be shared across projects, assigned versions, organized into families, and marked with metadata such as deprecated and obsolete.
A fresh VM can be spun up using either a snapshot or an image, but with one difference: an image can be directly imported into a VM, whereas a snapshot will first need to be instantiated into a persistent disk, and that persistent disk can then be used to spin up the VM.
Snapshots are global, whereas persistent disks are zonal; so, if you'd like to move a persistent disk from one region to another, snapshots are the way to go.
Neither images nor snapshots will work with local SSDs; they both work only with persistent disks.
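To make the operational difference concrete, here is a minimal sketch (not from the book) of the two backup paths using the Compute Engine API via the google-api-python-client library; the project, zone, disk, and resource names are hypothetical placeholders:

# Minimal sketch: snapshot of a persistent disk vs image created from it,
# via the Compute Engine API. All names and the zone are hypothetical.
from googleapiclient import discovery

compute = discovery.build("compute", "v1")
project, zone, disk = "my-project", "us-central1-a", "web-server-disk"

# Snapshot: cheap, incremental, best for data backups; snapshots are global resources.
compute.disks().createSnapshot(
    project=project, zone=zone, disk=disk,
    body={"name": "web-server-backup-1"}).execute()

# Image: best for infrastructure reuse, e.g. as the basis for an instance template
# and a Managed Instance Group; can be organized into a family and deprecated later.
compute.images().insert(
    project=project,
    body={
        "name": "web-server-image-v1",
        "family": "web-server",
        "sourceDisk": "zones/{}/disks/{}".format(zone, disk),
    }).execute()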
Don't be Milton!
There's a line that goes, "To a man with a hammer, everything looks like a nail." That's a serious risk in the world of technology these days because of how fast new tools and technologies come and go. We all have a favorite hammer, a technology or a language that we mastered a decade or two ago, and that got us our first job or a big promotion. Well, that's great, but we have got to learn to move on. There are new hammers being devised every day, and in today's world of technology, the surest way to fall behind is to keep clinging to your old favorite hammer. Think of Milton, the stapler guy from the cult classic Office Space. Never mind if you have not seen the movie or even heard of it. The idea is simple: to flourish, don't just accept new technologies grudgingly; run out and embrace them.

Summary
In this chapter, you learned that containers and Kubernetes are a great compute option for the future. Dataproc can be a serious game-changer, particularly if you use it right. Pre-emptible VMs, images, snapshots, and buckets all have fine features to love and fine print to be aware of.

Other Books You May Enjoy
If you enjoyed this book, you may be interested in these other books by Packt:

Cloud Analytics with Google Cloud Platform
Sanket Thodge
ISBN: 978-1-78883-968-6
Explore the basics of cloud analytics and the major cloud solutions
Learn how organizations are using cloud analytics to improve the ROI
Explore the design considerations while adopting cloud services
Work with the ingestion and storage tools of GCP such as Cloud Pub/Sub
Process your data with tools such as Cloud Dataproc, BigQuery, and more
Use over 70 GCP tools to build an analytics engine for cloud analytics
Implement machine learning and other AI techniques on GCP

Google Cloud Platform Cookbook
Legorie Rajan PS
ISBN: 978-1-78829-199-6
Host a Python application on Google Compute Engine
Host an application using Google Cloud Functions
Migrate a MySQL DB to Cloud Spanner
Configure a network for a highly available application on GCP
Learn simple image processing using Storage and Cloud Functions
Automate security checks using Policy Scanner
Understand tools for monitoring a production environment in GCP
Learn to manage multiple projects using service accounts

Leave a review - let other readers know what you think
Please share your thoughts on this book with others by leaving a review on the site that you bought it from. If you purchased the book from Amazon, please leave us an honest review on this book's Amazon page. This is vital so that other potential readers can see and use your unbiased opinion to make purchasing decisions, we can understand what our customers think about our products, and our authors can see your feedback on the title that they have worked with Packt to create. It will only take a few minutes of your time, but is valuable to other potential customers, our authors, and Packt. Thank you!