Implicit SLOs and their dangers

This is a topic of intermediate complexity in SLOs. If you are coming to this cold, we recommend you read a few other pieces about SLOs first; this one will then make a fair bit more sense to you.

SLOs, as you may know, have a dual nature: they have both social components and technical components. We spend a lot of time talking about the technical aspects, of course, but just as much as you use failure counts and worry about error budgets and exhaustion rates and so on, you are also using SLOs to signal to various audiences about your service.

Like approximately all human communication, these signals can be confused or misunderstood. I suspect a number of you will be familiar with a certain kind of signalling that happens in SLO program implementation meetings, where teams (mis)understand that a demanding SLO means a service is important, and because their service is important (of course), their service must therefore have a demanding SLO.

But there’s one kind of signalling and one kind of audience where the communication is actually very hard to misinterpret. Of course, that communication is the service level you’ve been offering already, and the audience is the existing users of the service.

It’s a subtle point in some ways, but a very straightforward reality in others.

When you use something - an API, a webservice, an app would all qualify - you build in your mind an idea of how useful and reliable this service is, and you modify your behaviour accordingly. Never fails? Use it every day. Really janky? Only when you have to. (Sometimes “have to” means you use it quite often. Sometimes, you implement a feature by relying on someone else’s implementation of it, and it turns out that implementation is only accidentally in production, or a side-effect of something else.)

The historical performance of your service is what we call the implicit SLO. People, over time, come to expect that what you’ve been doing previously will continue into the future. The danger comes when you change your behaviour and violate people’s expectations.

A key point to make is that if you implement SLOs for something previously unblighted by same, and your implicit and explicit SLOs end up at the same level, no-one gets upset. Your users are receiving the same level of service; maybe the actual performance turns up in some org-wide report whereas it didn’t previously, but otherwise the universe is mostly unperturbed.
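To make the comparison concrete, here is a minimal sketch of estimating an implicit SLO from historical request counts and setting it against an explicit target. The names (`Window`, `implicit_slo`, `expectation_gap`) and the sample numbers are illustrative assumptions, not anything from a particular monitoring system, and a simple good/total ratio is only one of several ways you might summarise past performance.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Window:
    """One historical measurement window (hypothetical structure)."""
    good: int   # requests that met the success criterion
    total: int  # all requests served in the window


def implicit_slo(windows: Iterable[Window]) -> float:
    """Fraction of good requests across all observed windows.

    This is the level of service users have actually experienced,
    whatever the published target says.
    """
    windows = list(windows)
    good = sum(w.good for w in windows)
    total = sum(w.total for w in windows)
    if total == 0:
        raise ValueError("no traffic observed, so no implicit SLO to estimate")
    return good / total


def expectation_gap(explicit_target: float, windows: Iterable[Window]) -> float:
    """Implicit minus explicit.

    Positive: you have been outperforming the stated target, so users may
    be relying on more than you promised. Near zero: the two agree.
    """
    return implicit_slo(windows) - explicit_target


if __name__ == "__main__":
    history = [Window(good=9990, total=10000), Window(good=9995, total=10000)]
    print(f"implicit SLO: {implicit_slo(history):.5f}")
    print(f"gap vs 99.9% target: {expectation_gap(0.999, history):+.5f}")
```

A gap close to zero is the "universe mostly unperturbed" case; a large gap in either direction is where the surprises described next tend to come from.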

It’s when things change, in either direction, that life can get tricky.

Take the example where you’ve been delivering two nines and all of a sudden it’s decided we’ll now do five nines. That’s an inherently implausible volte-face in how the service is managed, but let’s say you accomplish it (at least in the short-term, probably involving some burn-out).

Certainly some people will be happy as a result, but you also stand a decent chance of annoying others. Why? Well, let’s say you have some service consumers who made the ‘two nines’ work by building a set of infrastructure around that service, in order to make it work for their use case - think caching, fancy error handling, lots of retries, a close relationship with customer service, and so on. Now all that gets used about 1000x less because of the new standard (a 1% failure rate has become 0.001%) - what does that do to the cost/usage profile of maintaining that infrastructure? It is even worse if, as sometimes happens, this transition happens without every user of your service being notified. Then the cost profile changes, either gradually or sharply, and in either case the user can legitimately feel hard done by. Conversely, if your users did literally nothing to accommodate that weird two nines service which they are nonetheless compelled to use, there might still be a cost implication to getting better - your downstream users may well decide that your increased reliability means they’ll consume way more of your service than they did previously. It absolutely could have a destabilizing effect.

However, it’s the converse case that really annoys people, which is a significant relaxation of the observed performance. You were at five nines, we had some cost-cutting or reprioritisation, now we’re (say) three nines. Even if we smoothly manage the transition - unlikely - if you fail 100x more often than you previously did (0.1% of requests rather than 0.001%), people are going to notice, quite a lot. They might feel they have to build the infrastructure to make using your new SLO levels tolerable, and that you’re effectively turning your cost saving into their need to spend. (They might be right.)
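The 1000x and 100x figures above fall straight out of the arithmetic of nines. The sketch below spells it out; the function names are made up for illustration, and it assumes "N nines" means a success fraction of 1 - 10^-N.

```python
def failure_fraction(nines: int) -> float:
    """Fraction of requests allowed to fail at N nines (2 -> 1%, 5 -> 0.001%)."""
    return 10.0 ** -nines


def failure_multiplier(old_nines: int, new_nines: int) -> float:
    """How many times more often (>1) or less often (<1) users see failures
    after moving from old_nines to new_nines."""
    return 10.0 ** (old_nines - new_nines)


if __name__ == "__main__":
    # Two nines to five nines: failures become about a thousandth as frequent,
    # so retry and caching layers are exercised roughly 1000x less.
    print(failure_multiplier(2, 5))
    # Five nines to three nines: failures become about 100x more frequent.
    print(failure_multiplier(5, 3))
```

Each additional nine divides the failure rate by ten, which is why a change of even two or three nines completely alters what consumers have to build around you.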

Either way, this question of violated expectations is key to understanding the shock, and in some cases outrage, that accompanied the publication of the Chubby story in the SRE book. Chubby was a very important service and had a high SLO, which was generally exceeded. But SRE deliberately implemented controlled outages in order to keep the reliability "at SLO" rather than exceeding it. The reason is the one we’ve discussed: if you give people an implicit SLO of X, it doesn’t really matter if you say it’s not X. It’s behaving like it’s X, and overwhelmingly that’s what matters in determining consumers’ behaviour.
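As a rough illustration of the bookkeeping behind keeping a service "at SLO", the sketch below works out how much downtime a target permits over a period and how much of that is still unspent. This is not a description of how the Chubby team actually did it; the function names, the 90-day quarter and the sample figures are all assumptions made for the example.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def allowed_downtime_seconds(target: float, period_days: int) -> float:
    """Downtime the SLO target permits over the period."""
    if not 0.0 < target < 1.0:
        raise ValueError("target must be a fraction strictly between 0 and 1")
    return (1.0 - target) * period_days * SECONDS_PER_DAY


def unspent_budget_seconds(target: float, period_days: int,
                           observed_downtime_seconds: float) -> float:
    """Downtime still available in the period; zero if the budget is used up.

    A positive result means users are currently experiencing something
    better than the stated SLO, which is the implicit-SLO risk.
    """
    allowed = allowed_downtime_seconds(target, period_days)
    return max(0.0, allowed - observed_downtime_seconds)


if __name__ == "__main__":
    # Hypothetical: a 99.99% target over a 90-day quarter, 120s of real downtime.
    remaining = unspent_budget_seconds(0.9999, 90, 120.0)
    print(f"unspent downtime budget: {remaining:.1f} seconds")
```

Whether you actually spend that remainder on a controlled outage is a judgement call; the point is only that the gap between the stated and the experienced level is measurable.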

Another way to think of implicit SLOs relates to an ex-colleague, Hyrum Wright, and his eponymous law.

Here’s the direct quote, to save you a click-through:

With a sufficient number of users of an API, it does not matter what you promise in the contract: all observable behaviors of your system will be depended on by somebody.

Hyrum has good reason for saying this, and folks with long-standing experience in large-scale systems would be well-advised to take account of it, but I’d like to discuss a corollary for a moment.

Hidalgo's Corollary to Hyrum’s Law: it turns out that the reliability level is also an observable behaviour of the system.

So over time, people come to depend on your implicit SLO and the more implicit you are about it, the worse it is when you have to change it. Be careful about this. Over-communicate, particularly during transitions, and pay regular attention to how actual performance and the SLO are being perceived, and you’re likely to get through without too much difficulty.(1) Do the opposite, and life might be interesting for a while.

Good luck!

(1) John Looney said that he checked in with his service’s consumers every 1-2 quarters to ask questions like "Are you steadily getting annoyed with the SLO?", and also to make the occasional remark to the effect of "You do realise that we are frequently better than the stated SLO - don't get too comfortable".


Niall Murphy
