We used to have a difficulty in our community - thankfully less prevalent now - with rootless questions of identity. Of course, it's not wrong to ask who we are, what we're here for, and what we should be doing: every profession benefits from regular reflection. But too much of it, and you never converge, and moving forward becomes impossible.
Though, as I say, I think things are settling and existential questions are much less urgent than they were, the profession continues to grow. Many new folks are still joining with similar questions about our purpose, how we achieve it, and so on. How, then, should we best address this?
Rather than pointing these new joiners at the fixed list of responsibilities present in the original SRE book, I thought it might be better to try a new approach: defining what SRE is by looking at what it's not. Or to put it another way, what can you remove from SRE and have it still be SRE?
Here are my suggestions.
| If you don't have this... | ... are you doing SRE? |
|---|---|
| Access to internal source code | No |
| Ability to change internal source code/system design | No |
| Ability to cap operational work | No |
| Organizational cross-cutting ability | No |
| Ability to write code in the first place | No, but temporarily is okay |
| Operational responsibilities | No, but temporarily is okay |
| SLOs | Yes, but it's better if you have them |
| A large-scale system to manage | Yes, but it's better if you have one |
| Ability to avoid vendor kit | Yes |
| SRE job title | Irrelevant |
Access to internal source code. An SRE team without access to source code, either for their products/services, or infrastructure, can still do some useful things. They can make distributed systems designs, trace issues back to an endpoint or contributing factor of some kind, and write useful tools.
But this lack of access affects MTTR, destroys parity of esteem, undermines building a closer relationship between the SRE and product engineering teams, prevents deep engineering contributions to the supported systems, and sharply circumscribes SRE team possibilities. Unlike a restriction on changing the code, an SRE team that can't even be trusted to see the code indicates a deeper relationship pathology that would be hard to recover from.
Ability to change internal source code/system design. The good news is that an SRE team with read-only access to source code can perform more accurate problem resolution, and can understand the systems more deeply. As a result, they can come up with ideas for deep engineering contributions - but they aren't allowed to do them. That’s almost worse than the previous situation!
However, there's a reasonable situation where this makes sense, and that's where individuals in the SRE team have to satisfy the product engineering team of their competence with the code. Though this might come across as condescending in your individual situation, I actually don't judge here - folks are going to be cautious about source code, and legitimately so. But to my mind, this could only ever be a temporary (albeit perhaps somewhat long-lived) situation for an individual; a permanent barrier would mean an SRE relationship was impossible.
A similar discussion applies to system design as well. If an SRE team can't influence the design phase of the SDLC for the systems they're minding, that's not engineering. Of course, it might take quite a while to demonstrate competence at doing so: that's fine.
Ability to cap operational work. (We might also call this "SRE team having autonomy over how their time is spent", for reasons which will become clear.)
The key reason for having a cap on operational work is that operational work, almost by definition, requires some element of immediate attention or task-focus. If you can’t do anything other than respond to issues, you can’t put together the project time required to solve a general class of problems with software. If in turn you can’t solve classes of problems with software, then in a general environment of growing services (which ~all cloud services are) you either have more work spread over the same number of people, which is bad, or linearly growing numbers of people, which is also bad. So I feel that, as a function of autonomy, of organizational back pressure, and of just getting the job done, an SRE team needs to be able to do this.
Whether the cap is 50% or some other number probably doesn't matter in the short term, but if you don't give a team at least as much time to reflect, clean up, and engineer as they spend purely reacting, they are not SREs.
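As a rough sketch of that rule of thumb, assuming a team logs its hours into just two buckets: the `TimeLog` type, its field names, and the `within_cap` helper are all hypothetical, and the 0.5 default is only the illustrative figure mentioned above, not a prescribed number.

```python
from dataclasses import dataclass


@dataclass
class TimeLog:
    """Hours a team logged over some period (hypothetical categories)."""
    operational_hours: float  # reactive work: pages, tickets, interrupts
    engineering_hours: float  # time to reflect, clean up, and engineer


def operational_fraction(log: TimeLog) -> float:
    """Share of logged time that went to purely reactive work."""
    total = log.operational_hours + log.engineering_hours
    if total <= 0:
        raise ValueError("no hours logged")
    return log.operational_hours / total


def within_cap(log: TimeLog, cap: float = 0.5) -> bool:
    """True when reactive work takes up no more than `cap` of logged time."""
    return operational_fraction(log) <= cap
```

With these assumptions, a team logging 30 reactive hours against 50 engineering hours sits inside the cap, while one logging 60 against 20 does not. The exact threshold matters less than whether anyone is in a position to measure and enforce it.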
Organizational cross-cutting ability. One of the cultural aspects of SRE that is often misunderstood is what it implies to be the guardians of the user experience. Reliability is a holistic thing; it's not an attribute or property in the gift of any one silo, so guarding that experience necessarily requires moving outside your own team, to assemble the end-to-end picture. This leads to a situation where SRE often acts as "horizontal glue between vertical silos", to coin a phrase.
Well, what happens when you can't do that? In very hierarchical environments, where you are literally not allowed to talk to other teams without going up and down a chain of command; in environments where work items are processed primarily via context-free agents in ticket queues; or in environments where information about the customer experience is hidden, protected, or otherwise gate-kept, SRE struggles to work.
To be clear, there's (potentially) a huge difference between declared policy and actual practice here; if "leadership says" you have to go through the chain of command, but in practice people just talk to each other and help out as they would anyway, that's SRE-compatible for sure.
But if the organization is structured so as to prevent this kind of work, then it's not SRE compatible. (Indeed, it might be incompatible with many other things too.)
Ability to author code. Some idealised SRE team whose members can't write software, and can’t get into a position to do so within some agreed timeframe, is an operations team with distributed systems expertise. While this provides value in and of itself, not having the ability to write code loses the SRE team one of the key ways it can contribute meaningfully to scaling, reliability, monitoring, and so on. To my mind, this is not an SRE team. There may well be a path to being an SRE team, particularly if the individuals are competent in a related domain, and are willing to acquire knowledge in the other. An SRE team should, of course, have a spectrum of experience and inclination within it, including both systems expertise and software expertise.
Without operational responsibilities. An SRE team without operational responsibilities can still do useful work, particularly if the products/services are in the process of launching (design & pre-launch can be an incredibly valuable period for SRE contributions) but a permanent removal of operational responsibilities breaks one of the main feedback loops utilised by SRE to improve the product. As a result, I think a complete withdrawal of operational responsibilities must be temporary (perhaps extended, but temporary), or I don't think this is an SRE team.
Do please be aware that despite various assumptions, on-call is not the primary value SRE provides, and neither do operational responsibilities have to be provided primarily as on-call.
SLOs. There are many great practices that flow from having SLOs for your services, and much value to be gained from being able to trade off priorities in services, but I have been in a number of teams that either had no SLOs or took a long time (about a year) to settle on them, and even one team where a relatively relaxed SLO was chosen by fiat and it was explicitly forbidden to spend more time finding a better one. So I am forced to conclude you don't need them to be doing SRE, though whether or not you can continue to do that indefinitely is very much another question.
FWIW, having SLOs unlocks disciplined tradeoffs between services, deciding on appropriate work, whether issues are important enough to care about, and a lot of organizational goodness - and also is a great way to be objective about what the user experience should be.
A large-scale system. An SRE team without a large-scale system to look after is still a perfectly valid thing, provided either that the system will grow significantly at some point in the future (in which case preparation in advance is useful), or that the time cannot be spent more usefully on something else. “Large” is also a subjective definition, depending not just on the sizes of the systems in question but also on the scoped competence of the individuals in question. “Business importance” can stand in for “large” too, of course; it is just that much of SRE expertise is most fruitfully applied across a large number of systems, a large amount of data, or a high number of users.
In short, you don't need it, though having it helps apply expertise in a high-leverage way.
Supporting vendor kit. In a way, this is a special case of source code not being available. Note that source code not being directly available does not necessarily prevent SRE improving manageability or scalability of a piece of kit; devices often offer some kind of management API even if they don’t expose their full range of capabilities, or their source generally. Sometimes they can be automatically managed even if they don’t provide an explicit API: for example, Traffic team in Google supported Netscalers with a collection of Perl scripts that SSH’d into the machines and redefined VIPs on the fly. (Not joking, sadly.) Traffic team was a completely legitimate SRE team despite having to work with this for some years, before a self-developed system called Maglev replaced the Netscalers.
So supporting vendor kit, even kit for which you don’t have code, does not necessarily mean you can’t do SRE: if the majority of what you support is entirely proprietary or not automatically manageable, then yes, it is a major problem, but as a subcomponent, it’s not.
Mono-repos. A mono-repo, while convenient in the general case, is not required for SRE to be SRE. The main benefit of it in the general case is the ability to track down a software path, derive what log messages actually mean, figure out appropriate people to talk to, and so on. If there is delayed access to segments of the code, that may have an effect on MTTR, but does not represent a conclusive blocker. (Access to source code as a general point is covered above.)
Job title. SRE work does not require the SRE job title to perform. Conversely, having the SRE job title but not doing SRE work creates confusion and dismay.
Though I've given you a lot of separate headings above, the summary is probably this: SRE is an engineering role. (The clue is in the name, I suppose!) The headings above talk about ability to write code, influence design, and so on - fundamentally, these are all proxies for the ability to do engineering. There are some practices that are perhaps more central or more peripheral than others, and there are some situations which might be temporarily bearable, particularly in startup mode - but fundamentally, if you can't do engineering, it's not SRE.
It is occasionally useful to spell out the specifics though, so if you find yourself in a position where you feel you're not doing SRE, have a look at the above list, see what's missing, and maybe you can start good conversations about fixing that. Maybe even point your leadership at this page, which could start opening the doors for actual engineering. Or perhaps it means those doors are more thoroughly locked, in which case many other companies await, with pleasure, the arrival of motivated SREs who want to improve their engineering abilities.
Review from David Blank-Edelman, Liz Fong-Jones, and the SREfarers crew: Narayan Desai, Laura Nolan, Emil Stolarsky, Nicole Forsgren, Jez Humble, Murali Suriar, and John Looney.