Chaos Engineering Observability
Bringing Chaos Experiments into System Observability

Russ Miles

Beijing, Boston, Farnham, Sebastopol, Tokyo

Chaos Engineering Observability
by Russ Miles

Copyright © 2019 O’Reilly Media, Inc. All rights reserved.

Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com/). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Development Editors: Virginia Wilson and Nikki McDonald
Production Editor: Katherine Tozer
Copyeditor: Amanda Kersey
Proofreader: Zachary Corleissen
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest

February 2019: First Edition

Revision History for the First Edition
2019-02-19: First Release

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Chaos Engineering Observability, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

The views expressed in this work are those of the author, and do not represent the publisher’s views. While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

This work is part of a collaboration between O’Reilly and Humio.
Please see our statement of editorial independence.

978-1-492-05101-5
[LSI]

Table of Contents

Preface
1. Observability and Chaos
    The Value of Observability
    The Value of Chaos Engineering
    Chaos Engineering Encourages and Contributes to Observability
    Summary
2. Chaos Experiment Signals
    Coarse-Grained Signals Through Notifications
    Fine-Grained Signals Through Chaos Controls
    Summary
3. Logging Chaos Experiments
    From Signals to Centralized Logging
    Centralized Chaos Logging in Action
    Summary
4. Tracing Chaos Experiments
    Open Tracing
    The Open Tracing Control
    Summary
5. Conclusion

For Mali, Mum, Dad, Sylvain, Aurore!
For Geeta and everyone at Humio!
Finally, for the free and open source Chaos Toolkit and Open Chaos community! You’re all awesome!

Preface

If you’re considering running chaos experiments to find system weaknesses, especially in production, then observability will be on your mind. This book is for everyone who is adopting automated chaos engineering in their teams and wants to execute those experiments as safely as possible by bringing them into the overall system observability picture.

This book introduces the concept of chaos observability: how to run chaos experiments and bring that work into your overall system observability picture. You will see how chaos engineering experiments both leverage a system’s observability and contribute to it. This all begins by introducing the key concept of Chaos Experiment Observability Signals, covered in Chapter 2.

Throughout this book, high-level samples are shown using the free and open source Chaos Toolkit. Although only the Chaos Toolkit’s observability capabilities are shown, the hope is that this book will prompt the need for observability across other chaos engineering implementations, possibly resulting in a set of open standard concepts and guidelines for chaos observability.

This book provides code samples of integrations with OpenTracing, visualized using Jaeger, for
tracing chaos experiments alongside distributed system traces, and centralized logging using Humio. These samples show concrete implementations of how any system could be integrated with the available chaos experiment observability signals.

For more on chaos engineering, see Chaos Engineering by Ali Basiri, Nora Jones, Aaron Blohowiak, Lorin Hochstein, and Casey Rosenthal (O’Reilly). For an introduction to observability, see Distributed Systems Observability by Cindy Sridharan (O’Reilly).

Conventions Used in This Book

The following typographical conventions are used in this book:

Italic
    Indicates new terms, URLs, email addresses, filenames, and file extensions.

Constant width
    Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.

Constant width bold
    Shows commands or other text that should be typed literally by the user.

This element signifies a general note.

With those caveats and limitations, and ideally a copy of the aforementioned books on hand, I hope you enjoy this book and I wish you: “Happy (Observable) Chaos Engineering!”

Figure 2-1. The notifications available from a chaos experiment’s execution

Chaos experiment notifications are coarse-grained because they are only triggered at the highest level of an experiment’s execution.

The Slack extension for the Chaos Toolkit uses chaos notifications to surface those signals in a specific Slack channel. Once the Slack extension is installed, add this block to the Chaos Toolkit’s ~/.chaostoolkit/settings.yml file to turn on the notifications:

    notifications:
      -
        type: plugin
        module: chaosslack.notification
        token: xop-1235
        channel: general

The token specified here is a Slack API token. The channel is the Slack channel in which you’d like to surface your chaos experiment notifications.

With the Slack extension installed, you can surface notifications to your own
Slack channels, as depicted in Figure 2-2.

Chapter 2: Chaos Experiment Signals

Figure 2-2. Chaos notifications surfacing in Slack

The Slack extension is implemented in Python, but if you’d rather not write Python code to hook into chaos notifications, it’s also possible to send notifications to an HTTP endpoint using an entry in your ~/.chaostoolkit/settings.yml:

    notifications:
      -
        type: http
        url: https://mystuff.com/api
        verify_tls: false
        headers:
          Authorization: "Bearer 1234"

Here you’re specifying that you’d like to make an HTTPS call to the specified API, passing the indicated headers, whenever a notification is triggered.

Fine-Grained Signals Through Chaos Controls

As well as high-level chaos notifications, the Chaos Toolkit also supports a second, more fine-grained set of signals that can be hooked into to send valuable data to your observability systems. This approach is referred to as a Chaos Toolkit Control.

A Chaos Control can do much more than simply send data to your observability systems. A control can also interrupt or manipulate a running experiment.

A Chaos Toolkit Control is implemented in Python and provides the functions needed for observability. The functions available to implement are the following:

    from typing import Any, Dict, List

    from chaoslib.types import (Activity, Configuration, Experiment,
                                Hypothesis, Journal, Run, Secrets)


    def configure_control(config: Configuration, secrets: Secrets):
        """Triggered before an experiment's execution.

        Useful for initialization code for the control.
        """


    def cleanup_control():
        """Triggered at the end of an experiment's run.

        Useful for cleanup code for the control.
        """


    def before_experiment_control(context: Experiment,
                                  configuration: Configuration = None,
                                  secrets: Secrets = None, **kwargs):
        """Triggered before an experiment's execution."""


    def after_experiment_control(context: Experiment, state: Journal,
                                 configuration: Configuration = None,
                                 secrets: Secrets = None, **kwargs):
        """Triggered after an experiment's execution."""


    def before_hypothesis_control(context: Hypothesis,
                                  configuration: Configuration = None,
                                  secrets: Secrets = None, **kwargs):
        """Triggered before a hypothesis is analyzed."""


    def after_hypothesis_control(context: Hypothesis, state: Dict[str, Any],
                                 configuration: Configuration = None,
                                 secrets: Secrets = None, **kwargs):
        """Triggered after a hypothesis is analyzed."""


    def before_method_control(context: Experiment,
                              configuration: Configuration = None,
                              secrets: Secrets = None, **kwargs):
        """Triggered before an experiment's method is executed."""


    def after_method_control(context: Experiment, state: List[Run],
                             configuration: Configuration = None,
                             secrets: Secrets = None, **kwargs):
        """Triggered after an experiment's method is executed."""


    def before_rollback_control(context: Experiment,
                                configuration: Configuration = None,
                                secrets: Secrets = None, **kwargs):
        """Triggered before an experiment's rollbacks block is executed."""


    def after_rollback_control(context: Experiment, state: List[Run],
                               configuration: Configuration = None,
                               secrets: Secrets = None, **kwargs):
        """Triggered after an experiment's rollbacks block is executed."""


    def before_activity_control(context: Activity,
                                configuration: Configuration = None,
                                secrets: Secrets = None, **kwargs):
        """Triggered before any experiment's activity (probes, actions) is executed."""


    def after_activity_control(context: Activity, state: Run,
                               configuration: Configuration = None,
                               secrets: Secrets = None, **kwargs):
        """Triggered after any experiment's activity (probes, actions) is executed."""

Compared to chaos notifications, a chaos control provides many more signals that can be tapped into. Each of those chaos control signals is triggered during the execution of a chaos experiment, as Figure 2-3 shows.

Figure 2-3. Chaos Control signals

Summary

In this chapter you saw, using the Chaos Toolkit as a reference implementation, some of the potential observability signals that a chaos experiment can produce. In the next two chapters we’ll dig deeper into how these
extension points, in particular the Chaos Toolkit’s Control API, can be built upon to push these signals to destinations useful for observability.

CHAPTER 3
Logging Chaos Experiments

“Come, Watson, come!” he cried. “The game is afoot. Not a word! Into your clothes and come!”
-- Sherlock Holmes, from “The Return of Sherlock Holmes” by Sir Arthur Conan Doyle

In Chapter 2 you were introduced to the sorts of observability signals that a chaos experiment’s execution may provide. Now it’s time to look at how those signals can be turned into something useful to your own observability picture.

Centralized logging systems are widely recognized as a foundational part of any system’s observability toolkit. By bringing all of the log events of a system together in one place, you are able to interrogate, inspect, correlate, and begin to comprehend what happened, and when, across your system. Now you’re going to see how you can convert raw signals from your running chaos experiments and send them as valuable log events to a centralized logging system.

From Signals to Centralized Logging

The Chaos Toolkit open source community has created an implementation of a Control (see “Fine-Grained Signals Through Chaos Controls”) that bridges from a running chaos experiment to a centralized logging system. The following code sample, taken from a full Logging Control, shows how you can implement a Chaos Toolkit Control function to hook into the lifecycle of a running chaos experiment:

    from chaoslib.types import Experiment, Secrets

    # Excerpt: with_logging and push_to_humio are provided by the rest
    # of the full Logging Control and are not shown here.


    def before_experiment_control(context: Experiment, secrets: Secrets):
        # Send the experiment, unless logging has been turned off
        if not with_logging.enabled:
            return

        event = {
            "name": "before-experiment",
            "context": context,
        }
        push_to_humio(event=event, secrets=secrets)

With the Humio extension installed, you can now add a Control configuration block to each experiment that, when you execute it, will send logging events to your logging system:

    {
        "secrets": {
            "humio": {
                "token": {
                    "type": "env",
                    "key": "HUMIO_INGEST_TOKEN"
                },
                "dataspace": {
                    "type": "env",
                    "key": "HUMIO_DATASPACE"
                }
            }
        },
        "controls": [
            {
                "name": "humio-logger",
                "provider": {
                    "type": "python",
                    "module": "chaoshumio.control",
                    "secrets": ["humio"]
                }
            }
        ]
    }

Centralized Chaos Logging in Action

Once the logging extension is installed and configured, you will see logging events from your experiments arriving in your Humio dashboard, as shown in Figure 3-1.

Figure 3-1. Chaos experiment execution log messages

Your chaos experiment executions are now a part of your overall observable system logging. Those events are now ready for manipulation through querying and exploring (see Figure 3-2), just as you would with any other logging events.

Figure 3-2. Querying chaos experiment executions

Summary

In this chapter you’ve taken the raw signals from an executing chaos experiment and pushed them into your centralized logging to add chaos experiment execution to your observability!
In the next chapter you’re going to go a step further and use those same signals as the basis for distributed tracing visualization as well.

CHAPTER 4
Tracing Chaos Experiments

“Nothing clears up a case so much as stating it to another person.”
-- Sherlock Holmes, from “Silver Blaze” by Sir Arthur Conan Doyle

Distributed tracing is critical to comprehending how an interaction with a running system propagates across the system. By enriching your logging messages with trace information, you can piece together the crucial answers to questions such as what happened, in what order, and who instigated the whole thing. To understand how chaos experiments affect a whole system, you need to add your chaos experiments to the tracing observability picture.

In this chapter you’re going to see how you can use the raw observability signals from the Chaos Toolkit (Chapter 2) to enable a new type of Control that pushes trace information into distributed tracing dashboards, so that you can view your chaos experiment traces alongside your regular system interaction traces.

Open Tracing

Open Tracing is a helpful open standard for adding and communicating distributed tracing information about a system.

The Chaos Toolkit comes with an Open Tracing extension that provides an Open Tracing Control, and it’s this control that you are going to use and see in action in this chapter.

The Open Tracing Control

After you have installed the Open Tracing Chaos Toolkit extension, your experiments can be configured to use the Open Tracing control by specifying a configuration block:

    {
        "configuration": {
            "tracing_provider": "jaeger",
            "tracing_host": "127.0.0.1",
            "tracing_port": 6831
        },
        "controls": [
            {
                "name": "opentracing",
                "provider": {
                    "type": "python",
                    "module": "chaostracing.control"
                }
            }
        ]
    }

This configuration turns on the control and points the Open Tracing feed at a destination. The destination in this case is a Jaeger tracing
visualization dashboard, but it can be any tool that supports receiving an Open Tracing feed.

The preceding configuration tells the Chaos Toolkit to send an experiment execution’s traces to the Jaeger dashboard, where those traces can be displayed alongside all the other traces in your runtime system, as shown in Figure 4-1.

Figure 4-1. Application and chaos traces in the Jaeger dashboard

Summary

Chaos experiment traces give you a way of correlating your chaos experiment execution with the potential effects and traces occurring elsewhere in your systems. You can observe when your chaos was executing and even begin to dive into observable impacts on other system traces at the same time.

Incorporating chaos experiments into your centralized logging and then adding their execution traces to your distributed tracing picture are two foundational steps toward making chaos engineering observable. In the next chapter we’ll conclude by looking at how this foundation can be extended into new observability areas and systems to fit your specific needs.

CHAPTER 5
Conclusion

Observability and chaos engineering go hand in hand. As you explore chaos engineering, the observability of your system will have to improve as you ask important questions about how you make sense of your running system. Any chaos introduced into a system also needs to participate in your system’s observability picture by contributing a set of chaos signals.

The lifecycle of a chaos experiment offers a number of signals that can be channeled into your various observability systems. In this book you’ve seen how chaos experiments in the free and open source Chaos Toolkit can be integrated with logging systems, such as Humio, and with Open Tracing to build the foundations of this chaos observability, but this is just the starting point!
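As one illustration of where that starting point could lead, the following is a minimal sketch of a custom control that measures how long each activity takes. It rests on assumptions that are not confirmed by this book: activities are treated as plain dictionaries with "name" and "type" keys, the run state as a dictionary with a "status" key, and the timings list and _started map are hypothetical in-memory stores standing in for whatever metrics or tracing backend you actually use. Only the before_activity_control and after_activity_control function names are taken from the control functions listed in Chapter 2.

```python
import time
from typing import Any, Dict, List

# Hypothetical in-memory stores; a real control would forward these
# records to a metrics or tracing backend instead.
timings: List[Dict[str, Any]] = []
_started: Dict[str, float] = {}


def before_activity_control(context: Dict[str, Any], **kwargs) -> None:
    """Remember when an activity (probe or action) starts."""
    # Assumption: every activity dict carries a unique "name".
    _started[context["name"]] = time.monotonic()


def after_activity_control(context: Dict[str, Any],
                           state: Dict[str, Any], **kwargs) -> None:
    """Record how long the activity took and how it ended."""
    started = _started.pop(context["name"], None)
    if started is None:
        return  # no matching "before" signal was seen for this activity
    timings.append({
        "activity": context["name"],
        "type": context.get("type"),
        "status": state.get("status"),
        "duration_s": time.monotonic() - started,
    })


if __name__ == "__main__":
    activity = {"name": "kill-instance", "type": "action"}
    before_activity_control(activity)
    time.sleep(0.01)  # stands in for the activity actually running
    after_activity_control(activity, state={"status": "succeeded"})
    print(timings)
```

The sketch uses time.monotonic() rather than wall-clock time so that the measured duration is not affected by system clock adjustments during an experiment.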
The Notification and Control extension APIs in the Chaos Toolkit exist so that you can integrate your own chaos experiments into your own observability toolsets. More implementations are already planned for the Chaos Toolkit Incubator, and of course you can also create your own.

With observability added to your own chaos engineering experiments, they can contribute, like good system citizens, to your observability picture. Chaos experiments themselves should never be a surprise, although their findings sometimes can be. Having good system and chaos observability should ensure that the experiments never are.

About the Author

Russ Miles has been working as a chaos engineer at various companies (both startups and enterprises) for the past three years. He is CEO of ChaosIQ, a company dedicated to helping its customers build and run reliable and resilient systems through the ChaosIQ toolset. Russ has been teaching technical topics, as well as consulting, worldwide for the past 15 years. His current courses include a popular three-day course open to the public on chaos engineering that has most recently been run in London. He also speaks internationally. He founded and continues to build a community around the free and open source Chaos Toolkit and Platform projects.

... and open source Chaos Toolkit. Although only the Chaos Toolkit’s observability capabilities are shown, the hope is that this book will prompt the need for observability across other chaos engineering... relies on observability but also, as a good citizen in your systems, needs to participate in your overall system observability picture.

The Value of Observability

Observability is a key characteristic... debugging and using the feedback to iterate on and improve the product,” Cindy Sridharan writes. Observability helps you effectively debug a running system without having to modify the system in