The TwinSLO Proposal

Comments/Insights/Contributions from

  • Niall Murphy
  • Toby Burress
  • Štěpán Davidovič
  • Sal Furino
  • (Note that when I say "we" below, I don't specifically intend to speak for these fine people, I'm just using the academic "we". -Niall)

Introduction

If you don’t already know about SLOs, we can recommend Alex Hidalgo’s book, the original SRE book chapter about them, the SRE Workbook elaborations, or the antecedents those borrow from, particularly in the telecommunications industry.

The rest of this document assumes a familiarity with SLOs.

Problem Statement

One of the weaknesses of SLO availability reporting is that a single number cannot capture the detail that matters at high cardinality - that is, once user populations or request volumes get large.

Once your user population goes above a certain threshold, perfect adherence to a 3- or 4-nines SLO doesn't tell you anything meaningful about who or what specifically is being affected by outages. Trivially, once you have a large enough user population or underlying service volume, a significant number of customers or queries can be affected by an outage without triggering an SLO violation. Indeed, whether a violation counts as "widespread" depends on the size of the user population, not on the SLO.
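
To make the arithmetic concrete, here is a minimal sketch (the traffic figures are invented purely for illustration):

    # Minimal sketch with made-up numbers: even a 99.99% SLO leaves a large
    # absolute error budget once request volume is high.
    slo_target = 0.9999                # "4 nines" availability target
    monthly_requests = 1_000_000_000   # hypothetical monthly request volume

    error_budget_requests = monthly_requests * (1 - slo_target)
    print(f"Failed requests allowed while meeting the SLO: {error_budget_requests:,.0f}")
    # -> 100,000 requests, all of which could land on the same small cohort of users.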

More importantly, one-number SLO reporting doesn't allow you to understand the distribution of the failures. If some narrow subset of your user population is affected by an outage, perhaps even repeatedly, one-number SLO reporting gives you no way to discover this. And a cohort that is systematically penalized in this way is far more likely to churn.

On the other side of the coin, you could argue that successful SLO reporting is conditional on an environment where failures are evenly distributed - or, to put it another way, if one-number SLO reporting is not working for you, SLOs are not the right tool for you precisely because your failures are unevenly distributed. Yet if part of the idea of SLOs is to help you understand what you can safely ignore, it is clear that the larger and more complex your production environment gets, the larger and more complex the problems you ignore (because the SLO tells you that you can) also get.

Proposal

Summary: we need a measure of the concentration of failures that can be displayed in a simple numerical format alongside a standard one-number SLO compliance figure, allowing the viewer to see at a glance how concentrated the failures are.

By way of analogy, consider standard deviation, a statistical metric that captures the degree of spread around the mean of a distribution. Without retreading well-understood material in excessive detail, an example might help: suppose the average height of men in Europe is 178cm, with a standard deviation of 7cm. For a normally distributed population, this means that roughly 68% of men's heights fall between 171cm and 185cm, and roughly 95% between 164cm and 192cm. The ability to say precisely this flows from the specific properties of the normal distribution. In this example, if the standard deviation were a much larger number, you would expect to see much more variation in heights; conversely, if it were smaller, you would expect to see less. In some respects, it is therefore a measure of concentration.
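
As a quick check of those intervals (purely illustrative; the height figures belong to the example, not a measured dataset):

    # Illustrative check of the normal-distribution intervals quoted above.
    from scipy.stats import norm

    mean_cm, sd_cm = 178, 7
    for k in (1, 2):
        lo, hi = mean_cm - k * sd_cm, mean_cm + k * sd_cm
        coverage = norm.cdf(hi, mean_cm, sd_cm) - norm.cdf(lo, mean_cm, sd_cm)
        print(f"{lo}cm - {hi}cm covers {coverage:.1%} of the population")
    # -> roughly 68.3% within one standard deviation, 95.4% within two.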

Now we turn back to SLOs. To really understand what's going on, we need some measure of concentration accompanying a one-number SLO report. This measure should reflect how concentrated the SLO failures are, so we can understand whether or not intervention is required.

How could we achieve this? Of course, we need a way of calculating a summary metric, but it's actually more important to have labelled source data. A variety of metric calculation methods exist (we talk about one of them later), but it's really the fact that labelled data allows the creation of cohorts that matters most. At root, the SLO model is very simple and just counts successes and failures. To properly compare distributions we need to measure more than just a boolean, so we'll need access to an underlying dataset with full details - we can't compute concentration from the SLO success/failure counts alone.
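
As a sketch of what "labelled source data" means here (the field names below are hypothetical, not a required schema), each request-level event carries its labels alongside the success/failure outcome:

    # Illustrative event record; the SLO only needs the `success` boolean,
    # but concentration analysis needs the labels as well.
    from dataclasses import dataclass

    @dataclass
    class RequestEvent:
        success: bool          # what a plain SLO counts
        latency_ms: float      # underlying metric with high variation
        region: str            # labels that let us form cohorts
        client_language: str
        customer_id: str

    events = [
        RequestEvent(True, 42.0, "eu-west-1", "en", "cust-001"),
        RequestEvent(False, 950.0, "eu-west-1", "cs", "cust-007"),
        # ... one record per request, rather than two aggregate counters
    ]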

In other words, when we're measuring concentration in this context, we really mean concentration across some set of categories. Using a Prometheus-based example, let's say we have a metric with labels attached - e.g. browser-requested language, source IP address, AWS region, and so on. Naively, we could calculate concentration by looking at how evenly the failures are distributed across each label's values, but this will fail for low-cardinality data (in the example above, if you only support one or two languages, or all your traffic appears to originate from a restricted set of load-balancers). Instead, we need labels with high cardinality, and ideally an underlying metric with (potentially) high variation - for example, latency. This allows you to detect concentration more effectively because you have a higher number of potential partitions in which to detect it.
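
One naive version of this check, sketched below with invented label data, scores how evenly failures spread across a single label's values using normalized entropy (1.0 means evenly spread; values near 0 mean concentrated). This is exactly the approach that breaks down when the label only has a handful of values:

    # Naive concentration check: normalized entropy of failures across one label.
    # 1.0 = failures spread evenly over the label's values; near 0 = concentrated.
    import math
    from collections import Counter

    def failure_concentration(failure_label_values):
        counts = Counter(failure_label_values)
        total = sum(counts.values())
        if len(counts) < 2:
            return 0.0  # a single value tells us nothing about spread
        entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
        return entropy / math.log2(len(counts))  # normalize to [0, 1]

    # Hypothetical failures: almost all land in one AWS region.
    regions_of_failed_requests = ["eu-west-1"] * 90 + ["us-east-1"] * 5 + ["ap-south-1"] * 5
    print(failure_concentration(regions_of_failed_requests))  # low value -> concentrated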

Measuring concentration

We mentioned above that there are multiple ways to establish a concentration metric. Variance and standard deviation are two such measures, and they remain applicable even though most Internet-associated distributions are not normal but long-tailed.

Our proposed method is to use entropy loss. This is illustrated in detail in the twinslo repository, but in summary we use scikit-learn's DecisionTreeClassifier to compare the observed distribution of failures against what we would expect if they were evenly spread (technically, it measures log loss, but the overall effect is the same). This can also be used to stack-rank the various contributors to the loss, as well as to surface any single one that stands out. The repository shows an example of randomly generated data containing subsets that are deliberately "punished", which the method can extract automatically while the rest of the data varies normally.
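
A minimal sketch of the general idea follows (this is not the repository's exact code; the feature names and data are invented): fit a shallow decision tree with an entropy criterion on labelled request data, with success/failure as the target, and read off which features the splits rely on:

    # Minimal sketch (not the repository's exact code): use a shallow decision tree
    # to find which labels best separate failures from successes.
    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier, export_text

    # Hypothetical labelled request data; in practice this comes from your events/metrics.
    df = pd.DataFrame({
        "region":   ["eu-west-1", "eu-west-1", "us-east-1", "ap-south-1"] * 250,
        "language": ["en", "cs", "en", "en"] * 250,
        "failed":   [0, 1, 0, 0] * 250,   # the "cs" cohort is being punished
    })

    X = pd.get_dummies(df[["region", "language"]])   # one-hot encode categorical labels
    y = df["failed"]

    # "entropy" is equivalent to log loss in scikit-learn's tree criteria.
    tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
    tree.fit(X, y)

    # Stack-rank the labels contributing most to the reduction in loss.
    ranked = sorted(zip(X.columns, tree.feature_importances_), key=lambda p: -p[1])
    print(ranked[:3])                                # e.g. language_cs should dominate
    print(export_text(tree, feature_names=list(X.columns)))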

To summarise: assume we have a collection of labelled data with categories. A classifier takes new, unlabelled data and tries to predict the category (a regressor would try to predict a continuous value). As per classic ML/data-science approaches, the tree-building algorithm partitions the training data into two sets: set A where f1 <= x, and set B where f1 > x. It chooses the feature f1 and the threshold x in a couple of ways, usually by minimizing either entropy (for classification) or a squared-loss function (for regression). If failures are spread evenly, no partition of your data set should produce a large decrease in entropy. If one does, that means one of the features is behaving very differently from the others and is therefore appropriate for surfacing.
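
To make the "decrease in entropy" criterion concrete, here is a small sketch (with a hypothetical feature f1 and threshold x) of the information gain computed for a single candidate split:

    # Information gain of one candidate split (feature f1, threshold x):
    # gain = H(parent) - weighted average of H(children). A large gain means
    # the split separates failures from successes unusually well.
    import math

    def entropy(labels):
        total = len(labels)
        probs = [labels.count(c) / total for c in set(labels)]
        return -sum(p * math.log2(p) for p in probs if p > 0)

    def information_gain(values, labels, threshold):
        left  = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        return entropy(labels) - weighted

    # Hypothetical data: failures cluster above f1 = 0.8.
    f1      = [0.1, 0.2, 0.3, 0.7, 0.85, 0.9, 0.95, 0.99]
    outcome = ["ok", "ok", "ok", "ok", "fail", "fail", "fail", "fail"]
    print(information_gain(f1, outcome, threshold=0.8))   # = 1.0 bit, a large gain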

Future Work

As-is, the scikit-learn DecisionTreeClassifier approach is not very inspectable and is not something you can lift-and-shift into an existing monitoring setup. But it seems a promising avenue to investigate.

We've suggested that reporting two numbers would provide a significant benefit over reporting just one. Obviously this is most immediately relevant to the debugging case, though it may also be useful in less immediate reporting contexts; that remains to be investigated.

Exposing other things simultaneously in some UI - say, distribution shapes via sparklines or equivalent - might give you more information at approximately the same cost.

This technique relies on having labelled data in order to discriminate between subsets. In classic "SLO-only" deployments, such labels are not available, and we currently believe it is not possible to discriminate successfully as a result, since you are throwing away too much data. Perhaps we're wrong.

Related Work

  • Going from 30 to 30 million SLOs
  • Honeycomb Bubbleup