Performance Optimizations in a Cloud-Centric World

Performance Optimizations in a Cloud-Centric World
by Andy Still

Copyright © 2015 O’Reilly Media, Inc. All rights reserved.

Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://safaribooksonline.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Editor: Brian Anderson
Copyeditor: Holly Bauer
Proofreader: Nicole Shelby
Cover Designer: Randy Comer

July 2015: First Edition

Revision History for the First Edition
2015-07-19: First Release
2015-09-02: Second Release

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Performance Optimizations in a Cloud-Centric World, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

Cover image courtesy of Vera & Jean-Christophe, from Flickr. The original image (“Heavy Traffic”) was in color.

978-1-491-93137-0
[LSI]

For Candance, who insists that all poor performance on the Internet is my fault.

Introduction

Back in the day, it was simple. Content was served from your server, over
your network, and then to client machines that you controlled. Even when that moved out from a LAN to a WAN, the connectivity came from a single provider; it was all under your control.

Then came the Internet…

Now content was being served across the public Internet to end-user machines; you lost control of the location, type of machine, and type of connectivity.

Then came the cloud…

The cloud brought with it a new way of thinking about web system hosting. Hosting shifted from being a hand-crafted service to a commodity service providing throwaway systems. You moved from being a hardware owner to being a service consumer. With this change came the increasing loss of control over your system. Nowadays your application is often the only element that you control directly, and even that can be dependent on consuming third-party services.

This is not a bad thing, but you need to be aware of the issues that can arise as a result of this shift to the cloud. Whether you’ve already moved systems to the cloud or are thinking of doing so, this book will point out some of the risks to your site’s performance created by this loss of control and put forth some methods to identify and then mitigate those risks.

In no way, though, does this book set out to deter you from moving into the cloud. This author has long been a cloud advocate and works almost exclusively on cloud-based systems.

Terminology

For simplicity, I’ve used the term “website” throughout to refer to any system that distributes data across the Internet, including browser-based applications, mobile apps, etc.

Chapter Losing Control

So, here we are in the world of the cloud, with ever-expanding elements of our websites being placed in the hands of others.

Try Before You Buy

Before using any service, you need to put it through its paces and ensure that it is behaving as expected and performing as advertised. The nature of the cloud makes these kinds of proof-of-concept tests much more viable than they are with non-cloud offerings. They can be
undertaken with minimal upfront costs and long-term commitment and can be thrown away if they fail.

While performing this testing, it’s good to get as many monitoring systems as possible going to ensure that you’re not just focusing on functional correctness; other metrics such as availability, reachability, and performance should be considered. For example, the IPM data should be used to determine the network impact of using this service from different locations. All tests should include a reasonable amount of load to understand the end-to-end performance of the system under normal and high traffic.

Optimize Your Systems for the Cloud

It’s easy to use cloud services in a sub-optimal way, because they’re relatively new systems, have a high velocity of change, and because developers are usually self-taught. Furthermore, developers often apply on-premise thinking and practices to the cloud, not realizing that cloud systems are built with a slightly different paradigm in mind. For example, the cloud-based database-as-a-service offerings are better suited to a few larger queries than to many small queries, meaning that any systems that are very “chatty” with the database will likely perform considerably worse in the cloud than on premise with a direct database connection. Monitoring data should be used to confirm that the performance of these services is as expected and required.

Understand the Configuration Options

Cloud services are usually aimed at delivering complex pieces of functionality in a simple way through a GUI or API. Therefore, you can usually get up and running with them fairly quickly. However, the out-of-the-box configuration options may not be the most resilient or performant. You should be proactive in understanding which options are available as well as being reactive to issues identified by monitoring and testing.

Understand the SLAs

Most cloud providers will provide SLAs; however, it’s important to understand the terms of the SLA that they provide and ensure
that you have implemented your service correctly to take advantage of it. For example, Microsoft Azure provides an uptime SLA for cloud services, but only if you’re running two or more instances.

Apply the Same Good Practice to the Cloud as You Would to Any Other System

The same good practices that you would apply to on-premise solutions should be applied to cloud-based solutions. A standard risk assessment process should be followed. For example, the cloud-based database-as-a-service systems provide multiple levels of resilience around data (multiple copies in multiple places) but still involve a single point of failure (SPOF) if there’s a system failure that causes data corruption. Good practice in this case would dictate that a separate backup be taken and stored remotely (in traditional terms, an “offsite backup”). This backup should ideally be stored with another cloud provider (or elsewhere).

Ensure You Can Handle Any Failure

When you’re dependent on services that are out of your control, you have to be conscious of two things:

- They may stop working at any point.
- You will have no control whatsoever over when they will start working again.

Therefore, you have to architect your systems to handle this failure gracefully.

NOTE
Failure is not just failure; it’s also poor performance. You should be monitoring third-party services to ensure they’re responding in a timely manner.

Avoid “Death by Retry”

Once a failure state is known, share that knowledge across any elements of your system that depend on that service and put in place a measured policy for attempting retries. Do not create a death-by-retry situation where your system is brought down by constant attempts to connect to an unavailable system. A good architectural practice is to route all requests through a central point of connection.

Have a Backup Plan

If the functionality provided by the third-party system is key, then consider having a replacement system in place and automatically failing over to it. Another option is to capture all the details of the
request for processing offline when the system returns. This is valid for systems such as those for payment processing or appointment bookings.

Provide Ability to Turn Functionality Off

Your system should be built to provide the ability to remove elements of functionality by a simple configuration or application change, often referred to as feature toggles. This allows you much more granular control over the impact of elements of your system. If they’re starting to cause issues, then remove them.

FEATURE TOGGLES
Feature toggles are a development methodology where software features are built into systems with the ability to turn them on and off without redeploying the application. This approach is often used as a way of pushing new features into production ahead of the time that they need to be made active, allowing the wider business to activate the feature at an appropriate time with minimal assistance needed from the IT team.

More intelligent feature toggle systems will allow gradual roll-out of new features to subgroups of users. This allows the company to validate aspects such as functional correctness, performance, and popularity of features before rolling them out completely.

Feature toggles are designed to be short term, then to be removed after the feature is fully rolled out into production, as there is overhead in running and maintaining them. Longer-term feature toggles should only be considered for specific pieces of functionality that are part of a set plan to remove on demand.

An example of this may be predictive search that is triggered on every key press. If this uses a search system that’s out of your control, then when you start to see problems with that service, you can change the toggle to remove predictive search. If the service starts to struggle even more, then you can change the toggle to remove search functionality completely.

Fail Gracefully

If there’s no way to proactively handle the failure and prevent any impact on the user, then you need to ensure that
your system will fail gracefully. The user should see a properly designed and presented page with a helpful error message that explains what has happened. If the failure happens within a data transaction, the user should be notified of the current state of the transaction (e.g., has their order been placed successfully?).

Create a “Flight Manual”

A “flight manual” should be created with mitigation plans associated with each type of failure. This should include the nature of the change that can be made and the circumstances under which it is acceptable to make that change. Having this sort of manual empowers people on the ground to make decisions and changes without having to go through a complex decision-making process with management.

Chapter Takeaways

There are six important lessons to take from this book:

Don’t fear loss of control: embrace the cloud
Introducing cloud systems will lead to further loss of control over your website, but the advantages of using these systems outweigh the disadvantages. For most people, the services offered by cloud providers will be faster and easier to implement and manage, as well as more resilient, technologically advanced, and cost effective to run than anything they could implement themselves.

Ensure you have sufficient monitoring in place
You can’t control what’s going on, so make sure you’re gathering data and can determine what users are seeing (across the full range of your audience) and carry out some root-cause analysis on any issues raised. This should include the following types of monitors: RUM/EUM, IPM, and APM.

Stay in control: maintain an independent DNS provider
Keeping your DNS independent and flexible allows you to implement a “right tool for the right job” strategy, combining multiple cloud providers/CDNs for different sections of your audience based on the data returned from your monitoring.

Offload the load: use caching and a CDN
Make sure you’re caching data as close to the user as possible. Implement a CDN to optimize
responses and minimize latency. Use your monitoring to determine the best CDN or combination of CDNs to use.

Understand the difference between cloud and on-premise
Cloud providers offer many advantages over on-premise systems, and it’s important to understand the differences between them. Research, investigate, and try new systems to ensure that you’re taking advantage of their features and understanding their weaknesses.

Failure will happen: build systems and processes to handle it
As good as they are, when it comes down to it, you have no control over the systems and services you’re using, so your website must be able to handle the failure or poor performance, and you must have a process to be able to handle it.

About the Author

Andy Still has worked in the web industry since 1998, leading development on some of the highest-traffic sites in the UK. He co-founded Intechnica, a vendor-independent IT performance consultancy, to focus on helping companies improve the performance of their IT systems, particularly websites. He is also the creator of TrafficDefender, a cloud-based traffic-management tool. Andy is one of the organizers of the Web Performance Group North UK and Amazon Web Services NW UK User Group.

Acknowledgments

As usual, I have to pay tribute to all my fellow Performance Architects at Intechnica for sharing their knowledge across the spectrum of performance issues. Books like this wouldn’t be possible without them. Thanks also to Samir Jafferali for taking the time out to review the content and provide feedback and invaluable comments.

... functionality, including database (Amazon RDS or DynamoDB), file storage (Amazon S3), message queuing (Amazon SQS), data analysis (Amazon EMR), email sending (Amazon SES), authentication (AWS Directory...

... networking, and a reasonable amount of time and ongoing maintenance. Services such as Amazon RDS make this achievable within an hour, and at a reasonable hourly rate.

Flexibility and the ability...
from your ISP

Latency
Latency is based on the distance that the data has to travel to get from end to end and any other associated delays involved in establishing and maintaining a connection. Which
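
The advice under Avoid “Death by Retry” (share a known failure state, apply a measured retry policy, and route all requests through a central point of connection) can be illustrated with a small circuit-breaker-style sketch. This is not code from the book: the names `ServiceGateway` and `ServiceUnavailable`, the parameters, and the default thresholds are hypothetical and chosen only for illustration.

```python
import time


class ServiceUnavailable(Exception):
    """Raised when the gateway refuses to call a service known to be failing."""


class ServiceGateway:
    """Hypothetical central point of connection for one third-party service.

    After `max_failures` consecutive failures the gateway "opens" and rejects
    calls immediately for `cooldown` seconds instead of retrying constantly.
    """

    def __init__(self, call, max_failures=3, cooldown=30.0, clock=time.monotonic):
        self._call = call
        self._max_failures = max_failures
        self._cooldown = cooldown
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the service is believed healthy

    def request(self, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._cooldown:
                # Known failure state: fail fast without touching the service.
                raise ServiceUnavailable("service marked as down; not retrying yet")
            # Cooldown elapsed: let this single call through as a probe.
        try:
            result = self._call(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self._max_failures:
                self._opened_at = self._clock()  # open (or re-open) the gateway
            raise
        # Success: clear the shared failure state.
        self._failures = 0
        self._opened_at = None
        return result
```

Because every caller goes through the same gateway object, the failure state is shared automatically, and callers that receive `ServiceUnavailable` can fall back to a backup plan or a graceful error page rather than adding load to a struggling service.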
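
The predictive-search example from the feature toggles discussion (remove predictive search first, then remove search entirely as a third-party search service degrades) can be sketched as follows. This is an illustrative sketch rather than anything from the book: the `SearchMode` levels, the `search_mode` toggle name, and the in-memory `ToggleStore` are assumptions, and a real system would read toggles from configuration that can be changed without redeploying.

```python
from enum import Enum


class SearchMode(Enum):
    """Hypothetical degradation levels for a search feature."""
    FULL = "full"    # predictive search on every key press
    BASIC = "basic"  # search only when the user submits the form
    OFF = "off"      # search removed completely


class ToggleStore:
    """In-memory stand-in for a runtime configuration source (an assumption)."""

    def __init__(self, **initial):
        self._values = dict(initial)

    def get(self, name, default=None):
        return self._values.get(name, default)

    def set(self, name, value):
        # In production this would be a configuration change, not a redeploy.
        self._values[name] = value


def handle_keypress(toggles, query, search_backend):
    """Return suggestions for a key press, or [] when predictive search is off."""
    if toggles.get("search_mode", SearchMode.FULL) is not SearchMode.FULL:
        return []  # predictive search toggled off; the backend is not called
    return search_backend(query)


def handle_submit(toggles, query, search_backend):
    """Return results for a submitted search, or None when search is disabled."""
    if toggles.get("search_mode", SearchMode.FULL) is SearchMode.OFF:
        return None  # caller should hide the search box or show a notice
    return search_backend(query)
```

Moving the toggle from `FULL` to `BASIC` removes the per-keystroke traffic to the search service while keeping explicit searches working; moving it to `OFF` removes the dependency entirely until the service recovers.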
