SRE in the Real World

(This is a repost of a document living here, but I am putting it here for backup's sake. Originally a joint effort with Murali Suriar, with input from Matt Brown, Liz Fong-Jones, and many others. The intended audience of this doc is the recently laid-off, or those who suspect they are shortly to be, though a number of others have found it useful outside of that context.)

Intro

We’re very sorry to hear you’ve been affected, or think you might be, by this regrettable action on the part of your previous/current employer. Despite this, there is good news, which we’ll talk about shortly. There’s also bad news. (It wouldn’t be the 2020s without it.) There’s also some unexpected news! We’ll provide some ways to think about further opportunities, and some resources to help orient you. Of course, all of these are our opinions, and your mileage may vary, yadda yadda yadda.

The main takeaways, if you read nothing else, are:

  • Working outside of Google SRE is more of a culture shift than a tooling shift, though it is both.
  • The cultural shift is harder.
  • The tooling shift is hard, but tractable, and you can use some approaches you’re already familiar with to make progress.
  • What SRE is called and what it means in the outside world varies a lot. You can’t just read the text of a job advertisement and expect to understand what’s required. In some ways, the closest analogue in the real world might be staff engineer.

The Good

Ok, so. Some good news. There are things out there in the world that you would recognise as being SRE. Niall has a story of seeing folks in a Microsoft SRE team arguing over the precise semantics of an HTTP return code in a PR (==CL), and knowing that this was SRE-nature. The team had operational struggles for sure, but worked steadily towards solving them with software approaches, and made progress in doing so.

It is also true that SRE has a massively positive reputation in many circles, with commensurate pay. It may or may not be FAANG-level pay, but it will be a net increase on many jobs otherwise classed as operational. The SRE book continues to sell, and it is part of an industry-wide conversation. As a result, it’s a healthy profession, in terms of people engaging with it, trying to push it forward, and it has people entering it and leaving it (both of which are necessary for a healthy profession).

Net-net: there are jobs available and you have a reasonable prospect of exchanging labour for money.

The Bad

You cannot take all of your habits of work and expect to successfully transplant them to another company unchanged. Google is still a cultural outlier in many respects, even within valley companies (though that is changing as we write): it cares about reliability more than most places, and measures it. Both of those are not default behaviours elsewhere.

Another large difference is the attitude towards shared infrastructure. We’ve not seen anything like Google’s commitment to shared infra and company-wide engineering solutions in other companies. Googlers mutter darkly about a long tail of narrowly scoped solutions to particular problems (particularly prevalent in the Ads space, for example) and the consequent cognitive load, but ignore the fact that Borg, Colossus, Chubby, etc etc all have either almost completely uniform adoption for the general use case, or are “market leaders” (70%+) within their segment. But this isn’t an observation about technical challenges, design problems, etc, all of which are relevant - it is an observation that in other companies, internal infrastructure stops at the business unit boundary. Niall has a vivid memory of something one VP is supposed to have said about another when the prospect of adopting a system across their orgs arose: “Why would I take a dependency on him?”

But don’t view that exclusively through the lens of “it is not sufficient I succeed - others must fail”. You will find that the rest of the world values autonomy at the team, org, and business unit level much more strongly. In Google, the conviction of success was strongly tied to using Google software to solve the complicated, large problems Google suffers from; elsewhere, they don’t care how it gets solved, only that it gets solved. To that end, it is viewed as a positive thing that teams get the autonomy to have whatever implementation they want behind the scenes.

“Googlers pick up after themselves”. That was a saying back in the day. Whether or not it remains true, the interpretation we remember was that Googlers would take the time to try to fix something they touched, even if it wasn’t theirs. Whether it was a shared commitment to quality, or the unspoken idea that everyone had to perform being slightly better in front of others, it’s hard to be sure. But most other business cultures are hurry cultures, and don’t have time for this kind of thing.

“Toil is the job” – as Narayan says, Googlers often have difficulty in understanding that in other companies, toil is the job - and there are a wide variety of situations in which that’s actually okay. In essence, be careful before you instantly dismiss the current situation.

And finally, tools and tooling. But more on that shortly.

The unexpected

It is true that there are obvious things to be missed, and you will miss them. Perhaps the largest will be the entire devinfra ecosystem (Piper, Blaze, Forge, TAP, Rapid, MPM, …). There are partial replacements for most if not all of these, but the huge value brought by coherent integration is lost, and that will be continual friction - the PR process on GH is not an effective substitute. That said, there can also be a huge benefit to not living within a giant monorepo, where changes that affect you, but that you have nothing to do with, go flying past.

Conversely, there will be things you miss that will surprise you. We found the largest one of these was the entire AAA suite: LOAS/Ganpati/LsAclRep/RpcSecurityPolicy et cetera. It is unsurprising they are missing in the outside world, since the combination of homogeneity of environment and NIH-spirit doesn’t really apply anywhere else. But we strongly miss the ability to look at what access the team-mate beside you has by looking at a small set of tools, duplicating that, and getting on with your day. Or even providing patches to a tool to compare what you have versus what you should have. There’s just no equivalent that we’re aware of, and the cloud provider IAM systems are all gigantic tire fires no-one is coming to put out. Prepare for a future where finding out what you have access to, why, and why not, is an exercise in effort and determination.

A crucial nuance here, which also feeds into the cultural discussion, is that many tools (and hence related or adjacent processes) do not support delegation. The system is inherently designed to centralise power, or be administered by some nominated subset. Whether this is about power, trust, or process inertia, it is a real and observable effect. One of us has a tale related to the MS “stack” - a particular office which used this stack had over a thousand people in it, but there was no way to mail everyone in it easily, since they all had different off-site reporting chains, and the way Active Directory does mailing lists meant that only the root of the tree could be emailed. Both a self-selected email list which you could just subscribe to, and a set of internal tools which would allow you to deterministically select the right people to email, were unknown science. In a very real way, not only does Conway’s law mean your software reflects your organisation, but how the software does communication affects how the org works too. You will miss Buganizer’s component hierarchies.

A related point is that enterprises typically arrange workstreams around ticket queues, and those queues often have inexplicable configuration or behaviour. (Damon Edwards has many great talks on this, with recommendations on what to do, of which the main is: work horizontally and in classic SRE fashion tip-toe around the impediment until you get the work done - then surface it in the systems of record nonetheless, because that’s the right thing to do.)

Things you’ll have to wrestle with - identity

What “SRE” is used for varies throughout the industry. Hell, it even varied within Google, but it particularly varies outside. So if your identity is tightly coupled with the term, and what you believe it means, there will be a conflict at some point. (There are many well-documented reasons to work on decoupling your identity and your professional existence, though the totalizing environment of Google would never have made that easy.)

Here are a few things that the term SRE means in the outside world, in our personal experience:

  • Expensive and good at on-call
  • Distributed systems consultant
  • Platform engineer
  • Rebranded ops group member

As we’ve stated elsewhere, you’re not necessarily going to be able to determine what is involved in a particular role from the text of the job advertisement, or even from the interview process (although you may get some clues). A key component is obviously what they’re actually hiring you to do, but that might not even be clear to them, never mind you. (We’ll explore this, and the question of providing value both generally, and in the context of an SRE effort, in the next section.)

But it’s important to make this next point, since it is so related to questions of identity: Google culture presumes value from engineering activities first, and customer or product oriented ones afterwards. SRE in particular was extremely engineering-led. That’s not necessarily bad (though it does create some interesting dynamics), but it does turn out that other companies have different cultures: sales-led, product-led, customer-led, and so on. Part of handling the change in identity for yourself is realising that the company you’re in has a different identity too, particularly around how they understand value.

A special note about on-call

Though many places equate SRE with ops and/or on-call, do not expect on-call compensation without negotiation, and possibly not even then. (As always, your point of maximum leverage is before you sign a contract.) We have not come across anywhere of a significant size in the industry that has as consistent and as fair on-call compensation policies as Google does, and the default is very much that it is perceived as part of the role.

Things you’ll have to wrestle with - providing value

To develop that theme of understanding and providing value, we note that SRE in the outside world is more usefully treated as a toolbox, rather than a doctrine. Which is to say that there are various bits of it which are tools that can be applied to solve problems, and some of the tools are appropriate to the situation and some of them are not. The major question is, which of those tools and practices will work in your new environment? What are the most urgent problems? What are the most perceived-to-be-valuable problems? (Note the careful distinction between those two; they may of course be different, but it is particularly necessary in non-engineering cultures to find out what people think about their problems, since solving a problem that no-one believes is a problem is a clear social signal you think you’re different, and can ignore everyone else. Much better to engage in persuasion first: not being doctrinaire about things is precisely the behaviour you yourself would want to see in any new team members.)

Furthermore, how you provide value is of course partially dependent on what you’ve been hired as. There are a fair number of companies still experimenting with this SRE thing, and still interested in hiring someone with deep background to help drive change in the company. Experience suggests that it is a two-year journey to implement SRE. You might be hired as the vanguard or the hindmost, but the key point is that the benefits don’t tend to be felt until 18-24 months in. Many organisations struggle with things which are hard to do and don’t provide any value for a long while, so you need to prepare for that. Even if you notionally have support from senior management and executive leadership, you will need to convince people at the coal face of the benefits of change, one by one, so getting acquainted with how people do that at your new company is key. Some cultures are deck-driven, some doc-driven, some person-driven. Figure out which (possibly by trying all approaches) and then work with that until it’s safe to do otherwise.

A suggestion for where to start

If you have been hired as SRE, and maybe there’s a complex landscape in which you’re not sure how to provide value, our suggestion is to start by learning how to help people beyond the traditional “SRE” approach to things. In particular, devinfra / releng is not a bad place to start: no one ever complains about faster builds/less flaky tests, and improving time-to-prod-from-change is generally a metric which is understood and valued, even if it’s not maintained. It has the further advantage of keeping you firmly on the software side of the house rather than the pure operational side, and can make you a lot of friends amongst the product development community. The one downside is that it can be seen as helping others out a little too much, but it’s usually possible to excuse that early on (if you happen to be in a culture which cares about that).
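
If time-to-prod-from-change isn’t being tracked at all, a first measurement can be very small. The following is a minimal sketch, assuming you can export a merge timestamp and a deploy timestamp per change from whatever systems your new company uses; the record shape and the function name are hypothetical, not taken from any particular tool.

```python
from datetime import datetime, timedelta
from statistics import median
from typing import Iterable, Optional, Tuple

# Hypothetical record shape: (merged_at, deployed_at). deployed_at is None
# while a change is still waiting to reach prod.
Change = Tuple[datetime, Optional[datetime]]


def median_time_to_prod(changes: Iterable[Change]) -> timedelta:
    """Median delay between a change merging and that change reaching prod."""
    delays = [
        (deployed_at - merged_at).total_seconds()
        for merged_at, deployed_at in changes
        if deployed_at is not None  # skip changes that have not shipped yet
    ]
    if not delays:
        raise ValueError("no deployed changes to measure")
    return timedelta(seconds=median(delays))
```

We use the median here rather than the mean so that one change stuck behind a freeze doesn’t dominate the number; whether that is the right choice depends on what people at your company actually care about.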

You may be struggling with more fundamental questions than that, though.

A note about economic tradeoffs and the time to do something right

This is particularly true for startups but also true for non-FAANG companies: in any company without billions of dollars in the bank, the economic circumstances constrain the acceptable solution space. Such constraints are actually valid. Your Google time has taught you that you should do the high-quality thing, even if it takes longer, because the complexity of the internal environment means 80/20 solutions are hard to find, and everyone’s pretty used to OKRs getting bumped quarter after quarter. This is different elsewhere. Often there are 80/20 solutions available, and often they seem pretty janky to your engineering “taste”, but they’re perfectly functional and get it done. Finding where that balance is, and where it can be for your new environment, is a necessary step. Do not use your accustomed pace of development/deployment in Google as a guideline for your new company.

The above goes double for a startup. A commenter says: “I was asked to build something that’d let us run long-running jobs in a better way than having someone SSH into a machine in prod, and running something from the command line. We had people doing database migrations that failed, because after an hour, they closed their laptop.

“I initially specced a month to get a K8s cluster up & running, with a simple deployment interface, that’d let people submit a slug to be deployed & run for a bounded time. People laughed at me. Two more design iterations, and I’d a Ruby on Rails URL that’d take a command to run, and ran it under TMUX with logging to stdout, so if you hit that URL later, you got the logs. Took three hours to implement, 3 days to design.”

(From the SRE perspective, the difference the company gets between you implementing something and a dev doing it would be to leave monitoring & inspectability in whatever you’re doing. There are very many solutions that don’t, and the amount of time that’s wasted as a result is tremendous.)
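
To make the inspectability point concrete, here is a rough sketch of the same idea in Python rather than Rails and tmux: a long-running command is started detached from whoever asked for it, its output goes to a log file, and its status and logs can be queried later. This is not the commenter’s implementation; the `JobRunner` class and its method names are hypothetical, and a real version would sit behind a server process and need authentication.

```python
import subprocess
import time
import uuid
from pathlib import Path
from typing import Optional, Sequence


class JobRunner:
    """Runs commands in the background and keeps them inspectable (sketch)."""

    def __init__(self, log_dir: str) -> None:
        self.log_dir = Path(log_dir)
        self.log_dir.mkdir(parents=True, exist_ok=True)
        self._jobs = {}  # job_id -> (process, log_path, start_time)

    def start(self, command: Sequence[str]) -> str:
        job_id = uuid.uuid4().hex[:8]
        log_path = self.log_dir / f"{job_id}.log"
        with open(log_path, "wb") as log:
            proc = subprocess.Popen(
                list(command),
                stdout=log,
                stderr=subprocess.STDOUT,  # one log stream per job
                stdin=subprocess.DEVNULL,
                start_new_session=True,  # POSIX: detach from the caller's terminal
            )
        self._jobs[job_id] = (proc, log_path, time.time())
        return job_id

    def status(self, job_id: str) -> dict:
        proc, _, started = self._jobs[job_id]
        exit_code = proc.poll()  # None while the job is still running
        return {
            "id": job_id,
            "state": "running" if exit_code is None else "finished",
            "exit_code": exit_code,
            "age_seconds": round(time.time() - started, 1),
        }

    def logs(self, job_id: str) -> str:
        return self._jobs[job_id][1].read_text(errors="replace")

    def wait(self, job_id: str, timeout: Optional[float] = None) -> int:
        return self._jobs[job_id][0].wait(timeout=timeout)
```

The part an SRE would insist on is `status` and `logs`: the job outlives the laptop that started it, and anyone can find out afterwards whether it finished and what it printed.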

How to provide value?

One common question we see is, “I’ve only done GCL, gmon & GSLB for five years… what do I do?”. The good news is, there’s actually a fairly large set of tools which have at least some mapping into the Google world, and there’s more coming for sure. The number one analogue today would be Prometheus/Borgmon, but there are others. Whether you have a direct analogue or not, expect to do a lot of learning, quickly. (External tools generally feel less sophisticated, but this is usually because of the integrated nature of the environment (as discussed previously). They are also less resilient on average.)

“I’ve only ever done X in the Google way!” That’s fine. Outcome equivalents often exist. Where they don’t, again you are in that special place where you are trying to figure out what people care about and how to make it happen. There’s nothing about that work which admits of shortcuts.

How to get value for yourself?

You will appreciate that we spend a lot of time talking about the downsides of working in the external world, and it’s true that the historical eng practices & supportive culture of Google are, or were, fantastic, but you do yourself a great disservice (and are also being somewhat patronising) if you think of this as only being a downgrade. There were many things that G was terrible at. Outside observers would point at product execution generally, for which the Chat app debacle is a constant and searing reminder of just how bad it has been for years, and at understanding users more broadly. Insiders with a little experience of other places would also say that G had many troubles innovating at great speed in situations where there weren’t clear numbers. There are plenty of things which are done better outside Google, and one facet of providing value to a company is to understand what they already do well, and learn that, so you can be at least as good as the others.

How to help people

Erase from your vocabulary the phrase: “At Google we …”. Nothing more effectively signals a separate tribe, and nothing makes it easier to ignore you. Instead, listen to people, understand their problems and challenges, and use first principles explanations to demonstrate you’ve listened and understood. Then and only then suggest solutions, and use those solutions to teach different approaches, where appropriate. If you must use a phrase to directly reference a Google approach, try “In the past, when trying to achieve … I’ve found … to work well.”

Be humble. Humility is endless, relatively cheap, and makes friends.

Building a new normal elsewhere

Many things you just assume to be universally true aren’t. You will have to spend time and effort making them so by modelling good behaviour.

Some key things to look at:

  • Everything should be in version control, code reviewed, and have tests, and those tests should always be passing. This seems basic, but isn’t. (In fairness, most places have some of this, but not all of it.)
  • Here’s how to respond to a page, how to manage an incident, how to react afterwards. There are a number of free frameworks around (for example, the Jeli post-incident guide) or you can reinvent what you know.
  • Depending on what stack you use and how things get done: try to move to intent-based “stuff” as soon as possible, since that aligns so well with techniques we know are good for reliability. Don’t make people click buttons to provision new things.

The above are likely the most important things you can do, long term, in terms of enabling other people and supporting reliability, but… such work may not be directly rewarded by management, who might be more concerned by surviving till Tuesday than fixing things next quarter. (They also might be right.) So we recommend you have that chat first, before you invest too much time and effort.
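
For the intent-based point in the list above, the core idea is that people declare what should exist, and a tool works out the changes needed to get there, rather than someone clicking buttons. A minimal sketch of that diffing step follows, assuming resources can be described as plain dictionaries keyed by name; `Plan`, `plan_changes` and `apply_plan` are hypothetical names for illustration, not the API of any real provisioning tool.

```python
from dataclasses import dataclass
from typing import Dict, List

Resources = Dict[str, dict]  # resource name -> its declared settings


@dataclass(frozen=True)
class Plan:
    create: Resources
    update: Resources
    delete: List[str]


def plan_changes(desired: Resources, actual: Resources) -> Plan:
    """Work out what must change so that `actual` matches `desired`."""
    create = {k: v for k, v in desired.items() if k not in actual}
    update = {
        k: v for k, v in desired.items() if k in actual and actual[k] != v
    }
    delete = sorted(k for k in actual if k not in desired)
    return Plan(create=create, update=update, delete=delete)


def apply_plan(plan: Plan, actual: Resources) -> Resources:
    """Return the new state; a real tool would call provider APIs here."""
    new_state = dict(actual)
    new_state.update(plan.create)
    new_state.update(plan.update)
    for name in plan.delete:
        del new_state[name]
    return new_state
```

Because the plan is computed from declared intent, it can be reviewed like any other change before it is applied, and applying it a second time against an already-correct state is a no-op.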

Other things to note

Staff Engineers

When we look at the array of responsibilities, work, and skills of a moderately experienced Google SRE, we find ourselves thinking that this kind of cross-cutting, whole-system vision work has a much closer analogue in the outside world, one that’s a world away from what the term SRE means there: Staff Engineer. If you find yourself interviewing in a bunch of places and getting work that doesn’t sound like it makes full use of your skills, try going for that kind of role instead and see if it changes anything.

Interviewing in the outside world generally

Some people have asked us if interviewing is different externally. Yes, though the general tech model of “N” interview slots in different focus areas, often concentrating on demonstrating a set of technical skills, is very widespread. There are two major ways it differs. First, the rest of the world is a lot more about “tell me about a time when” rather than “do this on the whiteboard right now”. So preparing the answers to those kinds of questions (“what was your most complex problem”, “tell me about a time you failed to achieve something that mattered”, yadda yadda yadda) is important, otherwise you’re improvising in the interview, which can go awry. Secondly, G SRE folks are often at a loss to describe how tooling in the outside world works, and so you might find yourself saying that you’re sorry, but you don’t know that particular tool well. If that happens, pivot the conversation to a tool or approach you do know, or start talking about the underlying, more abstract problems (and potential solutions), in order to give the interviewer a signal that you do understand the general space, even if you happen to have been using mpm for the past while, rather than npm.

In terms of things to learn, if you’re literally starting from a homogeneous Google environment, consider starting with the following:

  • Kubernetes
  • Auth0 or Okta
  • GitHub (& GitHub Actions)
  • Prometheus

It is not that you are likely to get internals questions about how any of the above work. Instead, learning those will give you a plausible real-world answer you can supply to any of the typical “design a system that” questions, and help to kickstart your own development process.

Good luck!

Resources