Putting Services First: A New Scorecard for Observability
Ben Sigelman, LightStep Co-Founder and CEO

Observing software used to be straightforward. So much so that we didn’t even need a word for it. Today, of course, where distributed systems are commonplace, we have observability: a long mouthful of a word to describe how we understand a system’s inner workings by its outputs.

In the previous post in this series (Three Pillars with Zero Answers), we presented a critique of the conventional wisdom about “the three pillars of observability.” We argued that “metrics, logs, and traces” are really just three pipes that carry three partially overlapping streams of telemetry data. None of the three is an actual value proposition or a solution to any particular problem, and as such, taken on their own, they do not constitute a coherent approach to observability. We now present part two: an alternative, value-driven approach to independently measuring and grading an observability practice — a new scorecard for observability.

Microservices Broke APM

It’s worth remembering why we got ourselves into this mess with microservices: we needed our per-service teams to build and release software with higher velocity and independence. In a sense, we were too successful in that effort. Service teams are now so thoroughly decoupled that they often have no idea how their own services depend upon — or affect — the others. And how could they? The previous generation of tools wasn’t built to understand or represent these deeply layered architectures.

We transitioned to microservices to reduce the coupling between service teams, but when things break — and they often do — we can’t expect to find a solution by guessing-and-checking across the entire system. Yes, a service team’s narrow scope of understanding enables faster, more independent development, but during a slowdown, it can be quite problematic.

[Figure: service dependency diagram in which Services A, B, C, and D all depend on Service E]

In the example above, Service E is having an issue, and Services A, B, C, and D all depend on E to function properly. If you depend on Service E but only have tools to observe Service B and its neighbors, then you have no way to discover what went wrong. It’s virtually impossible to re-instrument on the fly: you can’t redeploy Service C or Service D to understand what they are doing. (Organizations designed to “ship their org chart” are at an even greater disadvantage, as reduced communication across teams leads to a reduced understanding of the services for which a given team is not directly responsible.)
Conventional APM is unable to provide an answer to the problem in the example above. Yet, if this were a monolith and not a microservices architecture, all it would take is a simple stack trace from Service E, and it would be fairly obvious how you depended on it.

Tracing Is More Than Traces

Distributed traces are the only way to understand the relationships between distant actors in a multilayered microservices architecture, as only distributed traces tell stories that cross microservice boundaries. A single trace shows the activity for a single distributed transaction within the entire application — all the way from the browser or mobile device down through to the database and back.

Individual distributed traces are fascinating and incredibly rich, recursive data structures. That said, they are absolutely enormous, especially when compared with individual structured logging statements or metric increments. The sheer size and complexity of distributed traces lead to two problems:

- Our brains are not powerful enough to effectively process them without help from machines, real statistics, or ML.
- When we consider the firehose of all of the individual distributed traces, we are unable to justify the ROI of centralizing and storing them. Hence the proliferation of sampling strategies (related: why your ELK and/or Splunk bills are so unreasonable after a move to microservices). It’s simply too costly to store all of this data for the long term without some form of — hopefully intelligent — summarization.

That’s where distributed tracing comes into play. It’s the science and the art of making distributed traces valuable. By aggregating, analyzing, correlating, and visualizing traces, we can understand these patterns: service dependencies; areas of high latency, error rate, and throughput; and even the critical path.

A New Scorecard for Observability

Before we can improve the performance and reliability of any given service — and take advantage of the insights offered by distributed tracing — we must first define performance and reliability for that service. As such, the single most important concept in a service-centric observability practice is the Service Level Indicator, or SLI.

Service Level Indicator (SLI): a measurement of a service’s health that the service’s consumers or customers would care about.

Most services have only a small number of SLIs that really matter and are worth measuring; these often take the form of latency, throughput, or error rate. For example, an SLI could be the length of time it takes a message to get in and out of a Kafka queue for a particular topic: if the queue crosses a key latency threshold, then the services that depend on it would be significantly impacted. Average CPU usage across the microservice’s instances, by contrast, would not be an appropriate SLI, as it’s an implementation detail; nor would the health of any particular downstream dependency (for the same reason).
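To make the idea concrete, below is a minimal sketch of how a queue-latency SLI like the Kafka example above might be computed from raw measurements. The sample data, the threshold value, and the helper names are invented for illustration; they are not from any particular product or instrumentation library.

```python
# Minimal sketch: computing a latency SLI from raw measurements.
# All numbers, names, and thresholds here are hypothetical.
import random

def p99(samples):
    """Return the 99th-percentile value from a list of latency samples."""
    ordered = sorted(samples)
    return ordered[min(len(ordered) - 1, int(0.99 * len(ordered)))]

# Pretend these are end-to-end latencies (in ms) for messages moving
# through one topic's queue, collected over a short window.
window_latencies_ms = [random.gauss(40, 8) for _ in range(5_000)]

SLI_THRESHOLD_MS = 250  # a hypothetical "key latency threshold"

sli_value = p99(window_latencies_ms)
print(f"p99 queue latency: {sli_value:.1f} ms")
if sli_value > SLI_THRESHOLD_MS:
    print("SLI breached: services that depend on this topic are likely impacted")
```

The important property is that the measurement reflects what consumers of the topic actually experience (end-to-end latency), rather than an implementation detail like CPU usage.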
Observability: Two Fundamental Goals

Service-centric observability is structured around two fundamental goals:

1. Gradually improving an SLI (i.e., “optimization”)
2. Rapidly restoring an SLI (i.e., “firefighting”)

For a mature system, improving the baseline for an SLI often involves heavy lifting: adding new caches, batching requests, splitting services, merging services, and so on. The list of possible techniques is very long. “Gradually improving an SLI” can take days, weeks, or months, and it requires focus over a period of time. (If you are working on an optimization but don’t know which SLI it is supposed to improve, you can be fairly confident that you are working on the wrong thing.)

By contrast, “rapidly restoring an SLI” is invariably a high-stakes, high-stress scramble where seconds count. Most of the time our goal is to figure out what changed — often far away in the system and the organization — and un-change it ASAP. If we’re unlucky, it’s not that simple: for instance, organic traffic may have taken a queue past its breaking point, leading to pushback and the associated catastrophes up and down the microservice stack. Regardless, time is of the essence, and we are in a bad place if we suddenly realize that we need to recompile and deploy new code as part of the restoration process.

Observability: Two Fundamental Activities

In pursuit of our two fundamental goals, practicing observability consists of two fundamental activities:

- Detection: measuring SLIs precisely
- Refinement: reducing the search space for plausible explanations

So, how do we model and assess these two activities? We prefer a rubric that presupposes nothing about our “observability implementation.” This reduces our risk of over-fitting to our current tech stack, especially during a larger replatforming effort to move to microservices or serverless. We also need to measure outcomes and benefits, not features — and certainly not the bits, bytes, and UI conventions of “the three pillars” per se.

Modeling and Assessing SLI Detection

Great SLI detection boils down to our ability to capture arbitrarily specific signals with high fidelity, all in real time. As such, we can assess SLI detection by its level of specificity, fidelity, and freshness.

Specificity, at its core, is a function of stack coverage and cardinality.

Stack coverage is an assessment of how far up and down the stack you can make measurements. Can you measure mobile and web clients in the same way you measure microservices? (If your goal is, let’s say, lower end-user latency, then this would be nearly mandatory.) Can you look below app code and into open-source dependencies and managed services to understand how failures at that level are propagating up into the application layers? Can you understand off-the-shelf OSS infrastructure like Kafka or Cassandra? In effect, stack coverage is your ability to observe any layer of your system and to understand the connections between them.

Cardinality refers to the number of values for a particular metric tag; it’s an expression of the granularity with which you can view your data. Since there is often a literal dollar cost associated with cardinality (a single trace can have hundreds of millions of tags), it’s important to understand your cardinality needs when structuring your metrics strategy. How fine-grained should the criteria be for reviewing the performance of a host, user, geography, release version, specific customer, etc.?

Fidelity represents access to accurate, high-frequency statistics. Accurate statistics may seem like a given, but unfortunately that is often not the case. Many solutions, even quite expensive ones, measure something as fundamental as p99 incorrectly. For example, it’s commonplace for p99 to be averaged across different hosts or shards of a monitoring service, rendering the data effectively worthless. (If you haven’t been storing histograms or meaningful summaries, there is no way to compute the p99 globally.)
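As a small, self-contained illustration of that failure mode (the host count and latency distributions below are synthetic), averaging per-host p99 values hides a single pathological host, while the p99 computed over the pooled samples, which is roughly what merging histograms would give you, tells the real story.

```python
# Sketch of why averaging per-host p99s is misleading. Synthetic data only.
import random
random.seed(7)

def p99(samples):
    ordered = sorted(samples)
    return ordered[min(len(ordered) - 1, int(0.99 * len(ordered)))]

# Nine healthy hosts plus one pathologically slow one.
hosts = [[random.gauss(50, 5) for _ in range(10_000)] for _ in range(9)]
hosts.append([random.gauss(900, 50) for _ in range(10_000)])

avg_of_host_p99s = sum(p99(h) for h in hosts) / len(hosts)
global_p99 = p99([latency for host in hosts for latency in host])

print(f"average of per-host p99s: {avg_of_host_p99s:.0f} ms")  # roughly 160 ms
print(f"p99 over pooled samples:  {global_p99:.0f} ms")        # well over 900 ms
```

Only the pooled (or histogram-merged) computation reflects what the slowest host is actually doing to the tail.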
[Figure: the same p99 latency data plotted at 1-second, 5-second, 10-second, and 25-second granularity]

But fidelity isn’t simply an assessment of a calculation’s accuracy. For detection to work well, you need to be able to detect the difference between intermittent and steady-state failures, and that’s only made possible through high-frequency data. The images above show the exact same p99 data from an internal system; the only difference is the smoothing interval we used to compute these percentiles. As the smoothing interval lengthens, your ability to detect outliers diminishes. Conversely, as the smoothing interval shortens, you’re able to detect patterns of failure you wouldn’t otherwise have seen. For example, at 10-second granularity, it’s difficult to tell whether a failure is steady state or intermittent, but at 1-second granularity, it becomes clear that it is indeed intermittent.

Freshness is an expression of how long you have to wait to access your SLIs. An SLI is only useful during an emergency if it can be accessed immediately. This is especially true for our “SLI restoration” goal (and firefighting use cases), though we should never have to wait more than a few seconds to see whether a change made a difference. The less “fresh” our data, the less relevant and helpful it becomes, regardless of its accuracy.

In Our Next Installment

Now that we have a framework for measuring SLIs precisely, we can better understand the severity of an issue — but that doesn’t necessarily help us understand where it may be. In our next and final post, we’ll cover SLI refinement: the process of reducing the search space for a plausible explanation to resolve an issue. Root cause analysis can be particularly difficult to automate in a microservices architecture, simply because there are so many possible root causes. But in a refined search space, it becomes much easier to identify the root cause — and to automate systems to continuously and systematically reduce MTTR.

About LightStep

LightStep’s mission is to deliver confidence at scale for those who develop, operate, and rely on today’s powerful software applications. Its products leverage distributed tracing technology – initially developed by a LightStep co-founder at Google – to offer best-of-breed observability to organizations adopting microservices or serverless at scale. LightStep is backed by Redpoint, Sequoia, Altimeter Capital, Cowboy Ventures, and Harrison Metal, and is headquartered in San Francisco, CA. For more information, visit https://lightstep.com or follow @LightStepHQ.

Try It Now

Start a free trial of LightStep Tracing today.