[This post was originally published in USENIX's ;login: magazine, and is based on my keynote at SRECon EMEA 2021. For those already familiar with either, nothing here is meaningfully different. Readers not interested in the minutiae of SRE, and the overall project of making production systems better, may wish to skip this one.]
It’s natural, in the swirling chaos of the past few years, to take a step back and wonder just where everything is going. Though I’ve certainly been doing my fair share of that, I’ve also been thinking about defining precisely where things are. Answering that question for Site Reliability Engineering - SRE - is not easy.
On the one hand, SRE is inarguably incredibly successful: it’s a discipline, a job role, and a set of practices that have revolutionised a good portion of the tech industry. Its practitioners are strongly in demand in every sector from financial services to pizza delivery and beyond. The whole community has played a part in changing how the industry thinks about managing online services, developing dependable software, and helping businesses give their customers what they want more quickly, more cheaply, and more reliably.
On the other hand, I’ve grown increasingly unsettled as I’ve moved from being enmeshed in the day-to-day minutiae of executing our work, to thinking about what that work is actually based on: and why — or if — it works. Unfortunately, I’ve come to the conclusion we have much less underpinning our ideals than we had assumed. In the words of Benoit Blanc in the 2019 movie Knives Out, SRE is a doughnut hole in a doughnut hole, and the hole is not at the periphery, but at the centre.
Today, I believe we cannot successfully answer several key questions about SRE. Let's start with the most important one.
How can we thoroughly understand what kind of reliability customers want and need?
Can we provide a model for customer behaviour during and after an outage, give upper and lower bounds on return or abandonment rates, or otherwise provide a pseudo-mechanistic model for loss? Is there any way we can compare reliability, or lack thereof, between services? Can we say, for example, that Roblox going offline for a long weekend is better or worse than AWS going partially offline for two hours?
How do we value reliability work, especially with respect to other conflicting priorities?
Specifically, are there any useful guidelines to help us understand when reliability work should be preferred over other work, or a basis to understand what proportion of continual effort should be spent on it? What do we get for this work, as opposed to other work?
Do we fundamentally misunderstand outages and incident response? Is there any prospect of improving our models for them?
I don’t just mean “can a team or individual get better at incident response” — to which the answer is of course yes - but the larger question of whether or not we are understanding incident response the right way.
For example, incident responders and leadership often have a very different understanding of what’s going on. One of the main differences is the question of whether or not incidents represent exceptional behaviour. As best we know, right now, for a sufficiently complicated distributed system undergoing change, incident occurrence is unbounded. This is not widely understood at leadership level, and reconciling these views is critical. Yet a larger question beckons: though modelling unknown-unknowns is well understood to be impossible in principle, is there any intermediate result of use? For example, even if we can’t model exceptional behaviour, there are approaches (such as Charles Perrow’s Normal Accidents theory [1]) which can help us understand why it occurs. Is this area entirely resistant to analysis in general, or to modelling with numbers?
Are SLOs really the right model for every system management challenge?
SLOs (Service Level Objectives) — clearly articulated numeric targets for the reliability of production systems — are described in many articles and books, but most definitively in Alex Hidalgo’s book Implementing Service Level Objectives [2]. SLOs have quickly moved to the heart of how the profession does day-to-day work, since it is hard to make something reliable to degree X until you decide what degree X is, and how to represent it. However, the underlying conceptual model copes poorly with situations where success cannot be adequately represented as a boolean, and where there is no business reason to prefer one “slow burn” alerting threshold over another. Yet SLOs remain critical as the basis for many SRE interventions. How can SLOs be improved? Or is there a more fundamental problem underneath them?
How can we best move beyond the 'SRE book'?
Site Reliability Engineering: How Google Runs Production Systems was published in 2016 (edited by this author along with Betsy Beyer, Chris Jones and Jennifer Petoff) [3]. The 'SRE book' has provided many helpful conceptual frameworks for practitioners and leaders to aspire towards. But more than five years on from the publication of the original volume, it’s time to re-evaluate what it says and recommends in the light of experience with those models in other environments and other contexts. How can we best do that, in a profession which is notoriously practical and focused on pragmatic action, not research?
If you do nothing other than keep your mind open about the necessity of tackling the above, I will have achieved my goal. But if you need more persuasion, or want to know where best to help, read on.
The Character and Value of Reliability
Though reliability is in the very name of the profession, we have an unforgivable lack of analytic rigour about what it is, why it matters, and specifically how much it matters.
At the very foundations, we confuse availability with reliability — or more accurately, we simply don’t define what they are. The narrowest and most common definition, that of a correctly responding HTTP request/response service, has some uncertainty itself — for example, whether 4xx-series response codes represent errors. In one view, the service is correctly responding with a 404 for, say, a particular image file not being present, since a user was just guessing a URL. In another view, the URL may have been generated incorrectly by another part of the system, and the correct one should have been used, therefore the response is an error. This is related to how we can know whether or not a server we’re dealing with is working correctly — and somewhat unbelievably, this is not a solved problem, even for protocols invented three decades ago.
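To make that ambiguity concrete, here is a minimal sketch (in Python; the record structure and the policy flag are hypothetical, not drawn from any real monitoring system) of an availability SLI classifier in which the treatment of 4xx responses is an explicit policy decision rather than something the protocol settles for you:

```python
# Minimal, illustrative sketch: classifying HTTP responses as "good" or "bad"
# for an availability SLI. The names and the `client_errors_are_failures`
# policy flag are hypothetical; the point is that 4xx handling is a policy
# decision, not something the status code decides for you.

from dataclasses import dataclass

@dataclass
class Response:
    status: int                  # HTTP status code, e.g. 200, 404, 503
    server_generated_url: bool   # did our own system construct the request URL?

def is_good(resp: Response, client_errors_are_failures: bool = False) -> bool:
    """Return True if this response counts as a success for the SLI."""
    if 200 <= resp.status < 400:
        return True
    if 500 <= resp.status < 600:
        return False
    # 4xx: ambiguous. A 404 for a user-guessed URL is arguably correct
    # behaviour; a 404 for a URL our own system generated is arguably a bug.
    if resp.server_generated_url:
        return False
    return not client_errors_are_failures

responses = [
    Response(200, False),
    Response(404, False),   # user guessed a URL: correct 404?
    Response(404, True),    # we generated the broken URL: an error?
    Response(503, False),
]

good = sum(is_good(r) for r in responses)
print(f"availability SLI: {good}/{len(responses)} = {good / len(responses):.2%}")
```

Two teams running the same classifier with different policy choices will report different availability for identical traffic, which is precisely the definitional gap described above.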
Wider definitions of reliability often reference latency, cost of computation/performance, freshness, seasonal performance comparisons, and so on, but no complete list of such attributes exists. Most teams treat this as an exercise in assessing the behaviour of the business logic from first principles instead. Since we don’t have a solid framework that relates those attributes, everyone’s SLO implementation projects (and, therefore, what their SRE teams do) are going to vary, with consequent cost, and little benefit for that cost.
As well as lacking good definitions, we also lack a good understanding of, and ability to communicate, the value of reliability. Why do I believe this? Well, next time you audit talks, read articles, or just go looking around for answers to questions about how much reliability matters, count the number of bald assertions that you see, as opposed to discussions of models. I see a lot of statements like “reliability is the most important feature”. I do not see anything that looks like a model — either quantitative, mechanistic, or qualitative. I ask the reader to consider the following question: if reliability were in fact the single most important thing, would it not win every prioritisation discussion? You, as I do, presumably operate in an environment where this does not happen.
The problem is that such assertions are both wrong and excessively defensive. Declaring that “X is the most important feature”, by its nature prevents us from developing our understanding of what actual trade-offs exist, no matter what X is. It turns us away from developing a more sophisticated understanding. It is anti-model.
I claim that because the value of reliability is so difficult to calculate formally, it has become primarily socially constructed. We need to move the question of the value of reliability out of the realm of the incalculable and into something which isn't entirely constructed that way. If we can’t explain the fundamentals behind the rationale for the profession outside of solely the social context, we are in serious trouble.
Sawtooth Reliability
The default model of the importance of reliability today is what I’ll call a “sawtooth” or “boolean” model: it is the most important thing there is, except when we have it — then it doesn’t matter at all. Those coming from the ops side of the house know precisely what I’m talking about. No outage ongoing, therefore resources are hard to come by, since nothing bad is happening, and who wants to fund a Department of Bad for nothing bad happening? Big outage, and the floodgates are opened — everyone will help, until the outage is over, or the outage is not being resolved quickly enough. Then back to baseline zero we go. We never converge on an appropriate steady-state level of investment in reliability, a stable balance between prevention and cure.
As frustrating as this refusal to consistently fund reliability is, there are at least two interpretations in which this is rational. The first is that execs are using outages as a signal to tell them what to spend on (though the spending is generally of people’s time rather than money). The second is that there is no clear competing model to tell us that reliability is valued outside of the context of a total outage, and even then its value is ephemeral. Execs can point to a long series of outages by world-famous companies who do not seem to particularly suffer as a result, and say that the market does not appear to value reliability. Without a competing model, it’s hard to oppose that view. In that narrow sense, companies with extremely large and public outages do us all a particular disservice when they neglect or refuse to publish their postmortems, and in particular their impact assessments, since we are left with nothing to support comparisons between reliability work and feature work. Or to put it another way, as is generally true, a lack of transparency costs the industry and benefits the individual company.
The sawtooth model of reliability’s value leads to a sawtooth model of investment. In general, such wild gyrations are not associated with stability.
The Parable of the Sticky Users
Part of the reason I’ve been thinking about this question over the past while was a moment, about four years ago, when I spoke to an engineering leader for a world-wide online brand. He was in town to investigate the possibility of growing an engineering organisation locally, so I asked him about the reliability story for the company in question. In my memory, he looked at me calmly and said:
“We have no reliability story. We don’t focus on it. We believe that in the business we’re in, our users are so sticky that they have no choice but to come back.”
My initial reaction was ... sceptical. However, as I considered the conversation further, it occurred to me that actually, maybe he was correct! As far as I know, the company suffered no business-disabling outages, is still around — indeed, is still a household name — and, generally speaking, seems to have gotten its engineering tradeoffs right. Does this mean reliability has no value? Probably not. But does it mean it is effective and convincing, in that environment, to assert it is the most important feature? Also probably not.
To muddy the waters further, I was later told by another source that reliability work was taking place even if the engineering leader I spoke to wasn’t aware of it — it’s just that the work wasn’t strongly prioritised from the top. Given that I keep coming back to this conversation and its many conflicting implications, I suspect those of you who have had similar ones also do so. Is the null model of investment actually tractable? Is the shadow sawtooth model of investment the actual, real default, even if leadership think it isn’t? Could this be better, and at what cost?
The Default Model of Reliability
What we do today, when we attempt to characterise and value reliability, is based on a kind of heuristic that is strongly related to experience with request/response systems like HTTP or RPC servers. There is a broad acceptance that there are upper and lower bounds of reliability that matter; few would find a service available less than 90% of the time useful. A corresponding observation applies for 99.999% of the time — few users of consumer-grade technology would notice. So most folks cluster around some definition of “good enough” between 2 and 4.5 9s, and even within that the bulk of teams probably converge somewhere between 3 and 4 9s. When it comes to valuing reliability, if your business context is, say, an e-commerce shop, you’ll care about dropped queries because of the direct costs of loss, with similar arguments being made for payment flows, etc. However, if you’re running something else, where the direct connection is weaker, it’s hard to motivate such detailed approaches.
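The arithmetic behind that clustering is worth making explicit. The sketch below assumes a 30-day window, which is itself a convention rather than anything principled (using an average calendar month of roughly 43,830 minutes is what yields the 4.38-minute figure that appears later in this piece):

```python
# Illustrative arithmetic only: allowed downtime per 30-day window for common
# availability targets. The targets listed are conventional examples, not a
# recommendation; an "average month" of ~43,830 minutes shifts the four-9s
# figure from 4.32 to 4.38 minutes.

MINUTES_PER_WINDOW = 30 * 24 * 60   # 43,200 minutes in a 30-day window

for target in (0.90, 0.99, 0.999, 0.9999, 0.99999):
    budget_minutes = (1 - target) * MINUTES_PER_WINDOW
    print(f"{target:.3%} availability -> {budget_minutes:8.2f} minutes of downtime per window")
```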
Though it is very widespread, this intuitive model has a number of problems. Probably the most important one is that the model itself is applied to all sorts of circumstances where it doesn’t match. Non-request/response services, for a start — e.g. data pipelines, today increasing greatly in importance because of their involvement in machine learning. Services for organizations with business models other than e-commerce. Services which only produce one report a day, but it’s for the CFO and they really care about it. Services where 5 9s might well not be enough for one customer, but it’s a multi-tenant infrastructure and everyone else is happy with it — none of those map well to the default model.
We don’t have a good way of modifying the default model of reliability in line with business context, and we need to.
Incident Management
There remain fundamental problems with how we conduct incidents, understand them, and model them. These problems are not just in the conduct of incident resolution, where we might reasonably expect practice to vary from team to team, but in understanding the very nature of incidents, and how to think about them.
This is most obvious in the domain of numeric approaches to incident management, where a strong dichotomy exists between executives and practitioners. In brief, though executives understand that incidents will occur, being generally business-facing, they are mostly interested in how long it takes to restore service. The main metric is Mean Time To Restore (or repair; also known as MTTR). Execs naturally expect to be able to treat the incident situation like they treat anything else in business – pay attention to a small number of top-line metrics, delegate to the people who seem to know what they’re doing, and reallocate resources when excrement hits the fan.
Unfortunately, Štěpán Davidovič has presented strong evidence to suggest that MTTR not only doesn’t work in the way execs expect, it cannot [4]. Practitioners have internalised that every metric execs care about is either inaccurate, irrelevant, actively harmful, or reliant on the wrong model of the world, and generally believe that the “everything is a metric” methodology is less applicable than execs think it is, and specifically less applicable in the case of incidents.
This dichotomy is in and of itself a serious issue, since either the exec contention that incident management can be performed numerically is wrong, and we the practitioners have a phenomenal job of persuasion to do, or the execs are right after all, and we just haven’t found the right metrics or model. Neither of those outcomes presents much opportunity for progress: it has long been acknowledged that the market can stay irrational longer than an individual can stay solvent, and I expect that execs as a class will remain committed to the numerical management model I outlined above in the absence of something clearly and verifiably better. Qualitative analysis, which is often used in the social sciences to study complex phenomena, holds out some promise of identifying mechanisms and fruitfully exploring some of the socio- component of socio-technical systems, but this is precisely the kind of long-term, non-teleological work that almost every institution hiring SREs is allergic to.
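To illustrate the shape of the problem (this is a toy simulation of my own, not a reproduction of Davidovič's analysis), consider what happens to a quarterly MTTR figure when incident durations are heavily skewed and incidents are relatively rare:

```python
# Toy illustration only: with heavily skewed incident durations and a few
# dozen incidents per quarter, quarter-to-quarter noise in MTTR can dwarf a
# genuine improvement. The distribution and parameters are assumptions chosen
# to be plausible, not fitted to any real incident data.

import random
import statistics

random.seed(0)

def quarter_mttr(median_minutes: float, incidents: int = 25) -> float:
    # Lognormal durations: most incidents are short, a few are very long.
    durations = [random.lognormvariate(mu=0, sigma=1.5) * median_minutes
                 for _ in range(incidents)]
    return statistics.mean(durations)

baseline = [quarter_mttr(median_minutes=30) for _ in range(1000)]
improved = [quarter_mttr(median_minutes=30 * 0.9) for _ in range(1000)]  # a real 10% improvement

print(f"baseline MTTR: mean {statistics.mean(baseline):.0f} min, "
      f"stdev {statistics.stdev(baseline):.0f} min")
print(f"improved MTTR: mean {statistics.mean(improved):.0f} min, "
      f"stdev {statistics.stdev(improved):.0f} min")
```

Under these assumed distributions, the quarter-to-quarter spread in MTTR is of the same order as, or larger than, a genuine ten percent improvement, so a single quarter's figure tells leadership very little about whether things actually got better.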
I spend a lot of time talking about models, but this is not academic. Incidents are real, with real effects, and understanding them better is likely to yield significant benefits. The perception that “it’s just software and doesn’t really matter”, widespread in the industry in general, is both a) not true, and b) a significant cause of friction to progress. Point a) is trivially demonstrated for cloud providers — for example, Microsoft set up a program [5] to look after key health and physical security industries — but the perception is also untrue, to varying degrees, for other service providers, and it will get worse over time as more and more things become more closely coupled to computing, the cloud, and the Internet in general. Though I accept a number of readers will disagree with me here, I believe the industry’s unwillingness to directly connect software failures with life-ending or life-jeopardising events has prevented investment in industry-wide progress, and we are more exposed every day.
Ultimately, though, how corporate hierarchies assign and reward value affects how SRE is perceived and how its members get on in their jobs; in the industry at large, many SRE teams are understood to have value primarily as a function of their incident response capabilities. How that value is understood is crucial to the future of the profession.
Service Level Objectives and Cloud 9
If you ask the proverbial person-on-the-street about SRE, and they don’t say “no idea”, they’ll probably say something about SLOs. In my opinion, the major benefit we get from SLOs is the idea that there are things we can successfully ignore — or, to put it more formally, that an organization as a whole should be intentional about what reliability / performance it wants from its systems, other than the completely reactive (and traditional) 100%. It provides a framework for introducing this idea to the business, and reflecting it in a structured way for engineering. It also defangs (or can defang) the relationship between development and operations teams if they’re in a hard split model. I can’t emphasize enough how much I like it compared to the previous default.
However, there are some important weaknesses in how it works, and in what it assumes:
- How the modelling is done.
The underlying architecture for SLOs presumes that successes and failures in a service can be meaningfully mapped to a boolean, and that these booleans can be put in a ratio (Narayan Desai and Brent Bryan explore some of this in more detail in their SRECon Americas 2022 talk 'Principled Performance Analytics' [6]). Services outside of the classic request/response paradigm are not handled well by this model, though of course there are compensating techniques.
- Error budgets.
Another rephrasing of SLO-based service management is that you pick a level and stick to it; well, what happens if you don’t stick to that level? In the original idea, exceeding the “budget” leads to an agreement by the product development and SRE team to work primarily on stabilisation work, until the appropriate service level is restored. But what happens if — as I've seen happen — you blow your error budget for the next twenty years? How to correctly respond in these circumstances is not well understood. As a result, error budget implementation should arguably be decoupled from SLO implementation, at least until this is better understood.
- Alert construction and threshold selection.
One upside of the SLO-based alerting approach is the ability to treat certain kinds of error as if they don’t matter, which is central to being able to avoid just reacting all the time (which in itself is central to being able to afford project work in the production context at all). But there are two issues here: the first and subtlest is that errors which the SLO says you can ignore might, of course, play a bigger role in a future outage, and prioritising those is not within the scope of the framework.
More immediately, though, if you are writing alerts in an SLO framework, you typically divide your alerting into fast-burn and slow-burn buckets: one of them designed to capture total (or very close to total) outages, and the other designed to detect slow erosion of the customer experience. The problem here is that we have nothing that allows us to confidently say that the particular line we have chosen for slow-burn alerting is correct. Furthermore, and this is perhaps the worst problem, we don’t have a good way of distinguishing between 1x100m and 100x1m outages. Spreading this threshold selection problem across two alerts rather than one is not an improvement. (The sketch after this list works through the arithmetic for a hypothetical SLO.)
- Alert automation.
Only a comparatively few “settings” in the conventional 9s setup allow for human response (basically, <= 99.95%). If SLO selection were in fact evenly distributed around that boundary, we would see a lot more automation of alert response, and it would be much more widespread and important than it is. So, why don’t we see it more often? Is it because we are systematically biased towards human response? And if that is true, is it because automating responses is hard, or because designing systems to avoid the alert conditions in the first place is hard — or both? Either way, there is a gap that warrants investigation.
- Off Cloud 9.
Speaking of 9s, it almost seems ridiculous to say this, considering how widespread this particular model is within the community - but is counting reliability in 9s an effective method of understanding the user experience? Charity Majors famously said “9s don’t matter if the users aren’t happy”, but I’m talking less about the question of effectively mapping user happiness to metrics or SLOs - the complexity of which anyone might struggle with - and more about the model as a whole. Why did we settle on a powers-of-ten relationship for capturing and modelling this behaviour? (It clearly wasn’t sufficient on its own, because if it were, why did we introduce 5s?) Do we have any confidence this structure maps to a natural behaviour or pattern that matters? Or is it all entirely arbitrary? Facetiously-but-not-really, why not use 8s?
I’m speaking for myself, of course, but every time I look at those charts of allowed unavailability that divide periods of time by fractions of 9s and 5s, I wonder to myself if a human is ever in a position where 4.37 minutes of unreachability won’t matter to them, but 4.39 will. There are other concepts that we could work with - for example, suppose that natively, without particular effort, a cloud provider’s infrastructure will deliver 98.62% availability - in such a world, having one behaviour on one side of that line and another on the other would make sense because it marks the difference between what the provider will give you without effort, and what you have to work to achieve. I have seen no data supporting the contention that 4.38 minutes of unavailability is a critical boundary.
Is a better model available here? One that was easier to understand, mapped better to real-world experiences, and required less division by 9 would be a good start.
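As promised above, here is a sketch of the arithmetic for a hypothetical 99.9% SLO over 30 days. The 14.4x and 6x burn-rate thresholds are common rules of thumb rather than values derived from any business model, which is exactly the gap being described:

```python
# Illustrative burn-rate arithmetic for a hypothetical 99.9% availability SLO
# over a 30-day window. The 14.4x / 6x thresholds are common rule-of-thumb
# choices, not values derived from any business model.

SLO = 0.999
WINDOW_MINUTES = 30 * 24 * 60                        # 43,200 minutes
ERROR_BUDGET_MINUTES = (1 - SLO) * WINDOW_MINUTES    # 43.2 minutes of full outage

def budget_consumed(outage_minutes: float) -> float:
    """Fraction of the 30-day error budget consumed by total-outage minutes."""
    return outage_minutes / ERROR_BUDGET_MINUTES

# One 100-minute outage and one hundred 1-minute outages consume exactly the
# same budget, even though users (and alerting) experience them very differently.
print(budget_consumed(100))        # ~2.31: budget blown either way
print(budget_consumed(1) * 100)    # ~2.31

# A "fast burn" alert at 14.4x burn rate fires when, sustained, the whole
# budget would be gone in ~2 days; a "slow burn" alert at 6x corresponds to ~5 days.
for burn_rate in (14.4, 6.0):
    days_to_exhaust = 30 / burn_rate
    print(f"burn rate {burn_rate:>4}x -> budget exhausted in {days_to_exhaust:.1f} days")
```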
Getting to SRE 2.0
Turning the conversation back to models for a moment, the obvious question is: what might a better model for reliability look like? I hope I’m wrong, but I’m not aware of any serious, holistic work in this area. Instead, what I see is partial hints: edges of the elephant we are blindly and sporadically exploring.
One hint, for example, is the series of studies from many online services (such as Google [7] and Zalando [8]) which have shown a relationship between experienced latency and user satisfaction, purchases, and similar. The numbers might vary a little from study to study, but the key result has been reproduced many times: the higher the latency, the lower the user satisfaction, the fewer the purchases or user journeys, and so on. There should be public work that explores similar relationships for reliability, but across many different services, user populations, job-to-be-done contexts, and so on. Assembling such data would help fill out the picture more widely. Yes, it is empirical; in my opinion, we can’t afford to turn down inspiration from anywhere, and it can all help to prime our intuition.
Another hint is Nicole Forsgren, Jez Humble, and Gene Kim's Accelerate [9], which presents research showing that we have a reason, based on data, to believe that stability and rapidity of release to production can in fact go hand in hand. This is counter-intuitive to some, but is very much a real effect, related to small batch sizes stabilising change. Of course, some of the risk from changing systems flows not just from the content of the change itself, but from the very act of making a change, and how long it has been since your last one. This result doesn’t necessarily cover every potential mutation of production (migrations, network topology, schema changes, etc.), but it’s suggestive of a way forward.
When I see work like those two examples above, I see work that explores a relationship between things. A relationship is an equation, an area for exploration, and an invitation to examine exciting edge-cases. A number of these relationships, integrated with a strong theoretical vision — underpinned by empiricism — could provide a lot of insight into how we should do systems and software management. It would be the beginning of a kind of biology of systems development and management, or a special case of systems science in the production domain.
I initially wrote physics, rather than biology, but changed my mind; I want to reassure the reader I have no great plan for a fully determined framework. Even the impulse for one is often misguided: those familiar with the history of mathematics may recall the story of David Hilbert, the extremely influential 19th- and 20th-century mathematician who proposed a research program to show, once and for all, that the foundations of logic were on solid ground, and that everything can and must be known. Gödel and his famous incompleteness theorem blew an unfillable hole in that shortly afterwards, and limits to knowledge are an important part of how we perceive the world today.
I’ve seen too much of the world to propose such a program - I’m just looking for better models. But even if I was, one of the bigger problems is that SRE itself is a very practically oriented profession. Neither the profession nor its practitioners tend to have much patience with, or time for, indefinitely long cross-company projects — never mind the complicated and hugely necessary questions of ethics that come with being the stewards of the machinery of society in an era seemingly dominated by polarisation, deceit, and unsustainability in every sense of the word. Digging ourselves out of this hole is therefore only likely to make progress when doing so aligns with one company’s need to solve particular problems relevant to them, or if motivated individuals take it upon themselves to do the work anyway.
This is a pity, because from where I stand, I see across the profession way more questions than answers, and the answers we have right now are insufficient. The SRE books provided a look behind the curtain at a different way of thinking about service and software management - one which proceeded from different assumptions than most other organisations, yet was still worthwhile. The next revolution - SRE 2.0, as some have started calling it - is just as urgently necessary as 1.0 was, but if anything further away.
SRE could be - should be - much more than it is today. Please help.
Acknowledgements
At the end of SRECon EMEA 2019, a number of folks got together to discuss their impressions of the conference, the future of the profession, discussions about reliability, software development, and systems thinking generally. Unhappily, the pandemic happened shortly thereafter, but the SREfarers group (as it became known) was a source of great comfort during those times, and the input from the members was incredibly influential in my thinking for this piece and others. I would like to thank Narayan Desai, Nicole Forsgren, Jez Humble, Laura Nolan, John Looney, Murali Suriar, Emil Stolarsky, and Lorin Hochstein for their many and varied contributions to my insights over the past few years. It has given me an abiding respect for intimate and respectful cross-industry, cross-role discussions that might be an interesting model for development in the future. I also want to specifically thank Laura Nolan, Cian Synnott, and Tiarnán de Burca for comments on an earlier version of this article.
References:
[1] Charles Perrow, Normal Accidents (Basic Books, 1984).
[2] Alex Hidalgo, Implementing Service Level Objectives (O'Reilly, 2020).
[3] Betsy Beyer, Chris Jones, Niall Richard Murphy, and Jennifer Petoff (eds), Site Reliability Engineering: How Google Runs Production Systems (O'Reilly, 2016).
[4] Štěpán Davidovič, Incident Metrics in SRE: Critically Evaluating MTTR and Friends (O'Reilly, 2021).
[5] 'Life and Safety: Scaling Up Azure Resources to Safeguard Society in a Pandemic', Microsoft. https://www.microsoft.com/en-ie/engineering/lifeandsafety
[6] Narayan Desai and Brent Bryan, 'Principled Performance Analytics', USENIX SREcon (2022). https://www.usenix.org/conference/srecon22americas/presentation/desai
[7] Jake Brutlag, 'Speed Matters for Google Web Search', Google (2009). https://services.google.com/fh/files/blogs/google_delayexp.pdf
[8] Christoph Luetke Schuelhowe, 'Loading Time Matters', Zalando Engineering Blog (2018). https://engineering.zalando.com/posts/2018/06/loading-time-matters.html
[9] Nicole Forsgren, Jez Humble, and Gene Kim, Accelerate: The Science of Lean Software and DevOps (IT Revolution Press, 2018).