Reflections on SREcon EMEA 2022

[Note to the reader: chairs have an obvious obligation to remain as neutral as possible, which I take very seriously. If I mention specific speakers or talks below, it is definitely not to the exclusion of others.]

Though I’ve been involved with SREcon EMEA many times before, this time was unique for two reasons: firstly and most importantly, we hadn't been physically together since late 2019, and secondly, I was program co-chair (alongside Daria Barteneva) - a new experience for me. With the freshness of the experience still in mind, I thought it was the perfect moment to write up my reflections.

The role of SREcon (EMEA) in the community

SREcon comes in three varieties: EMEA, which I know best, Americas, which I’ve been to a few times but certainly not every year, and APAC, which I’ve never been to. Most of my remarks are scoped to the EMEA instance, but if I specifically mean the others, I’ll say so.

It’s known that SREcon as a whole is the conference specifically intended for the community or profession to talk to itself. Another origin story is that it’s also a historic out-growth of LISA, the Large Installation Systems Administration conference, about which you can read more in Tom Limoncelli’s excellent ;login: piece. For some (not all), SREcon has effectively replaced it, though it is not a 1:1 replacement.

For what it’s worth, my personal feeling is that SREcon EMEA is highly distinct from mass-market cons for mass-market audiences. It’s not a KubeCon or a re:Invent. It isn’t oriented towards a specific technology, and despite having a vibrant sponsorship/commercial component, it avoids some of the downsides of having an implicit and continual commercial underpinning. It’s got more organizational and human awareness than (say) a language-focused conference, and my impression is it’s a little more intimate than some of the really large DevOps conferences you get - mostly, as I say, because the community is smaller. (There’s good and bad to that last point, of course.)

My own highly personal, biased view is that SREcon EMEA is the center of an SRE community of some kind. It’s by no means completely confined to EMEA, but mostly located there. It definitely has a European sensibility, though I won’t work too hard at defining that other than by saying that it’s in a position peripheral enough to be interesting, but important enough to have just the right amount going on. This time around, something like 40% of attendees were first-timers, so you know the community is growing, and folks are excited to talk to each other - particularly in person.

Being a chair

Being a chair is certainly a different experience from being an attendee, but it’s also different from being on the program committee (PC), a speaker, or a vendor. You don’t decide the content of any individual talk, unlike a speaker, and you participate in talk reviews like other PC members. But the theme is your choice, the text of the Call For Papers/Participation (CFP) is of course written by you, and the ultimate composition of the program is your decision. Most folks focus on the last bit, which is of course a genuine power, but actually there’s something a bit subtler which is also very powerful: you have the authority to ask people to submit talks.

That might not sound like much - after all, you could easily just ask people even if you weren’t chair - but it turns out to be surprisingly useful. In particular, if you do have an idea, and you know people who can speak about that in some way, you can create a program by not just relying on the presentations that come in as a function of individual actions, but by explicitly building it from invited talks. That gives coherence to what might otherwise be a rather scattered experience, and affords a chair the opportunity to give things a public airing that the chair feels are worthy of it. (As it turns out, we did in fact do that.)

So in a way, your influence as chair is very broad, but arguably a little shallow: there are some puzzle pieces that you assemble that are entirely other people’s, some that you make yourself, and all that combined with the logistics of the conference venue are the frame in which you have to fit the pieces that assemble the beautiful object. You have far from complete control, but you have a kind of Spotify playlist style of artistic direction, which, to stretch the metaphor a little awkwardly, includes the ability to commission songs from your friends in cool bands.

It’s also a lot of work. Leaving aside on-site fire-fighting, my co-chair and I started a half-hour meeting every week from January, for a conference scheduled in October. While most meetings didn’t have urgent deadlines to respond to, some of them did, and all of those meetings were vital for keeping us aligned. The main milestones were the CFP, due in March, the talk proposals, ostensibly due in June, their review and processing in July, various confirmations of attendance in August, and the full program ready for registration in late August/early September. While there’s definitely a whole village involved in keeping everything going, and support from the éminences grises of Kurt Andersen and Laura Nolan on the USENIX board was at every moment superb and necessary, there’s still a lot to keep track of and try and guide - especially given the three-year gap since the last in-person meeting.

Warming to the conference theme

Those of you who’ve followed SREcon, and maybe even SRE topics in general, will know I believe there are some gaps requiring closing in how we engage with production management and engineering today. Setting the theme of the CFP to “what could SRE be” and explicitly mentioning questions asked in those previous pieces was the perfect opportunity to see if anyone else out there was thinking about those questions. To mention a few: how do we value reliability? How can we better understand and model outages and incident response? What are the strengths and weaknesses of SLOs? And many others. But the CFP was both broader and more specific than the questions I asked in my piece, and also asked for input on new ways of displaying monitoring, whether or not there was anything from cognitive science that might be relevant to incident response, and specifically asked about whether or not there were overlaps with SRE in other domains that might be illuminating.

Of course, there’s a difference between intention and what ends up happening. In our particular case, we did quite well on a combination of voluntary and invited talks on some of the broader topics, but talks that responded directly to the specific questions asked were less numerous. That’s fine; in my opinion, we still ended up with a sufficiently diverse set of talks for a great conference. One of the things Daria and I did internally was to group the talks into a number of tracks - which alas could not be reflected in the program groupings directly - we called ‘future’, ‘outage’, ‘SRE in practice’, ‘Systems engineering’, and ‘overlap’. This ensured that we couldn’t entirely starve any of those focus areas, and made it easier to balance them. In fact, looking back, those tracks do a reasonable job of summarising key areas of concern for SRE, and are possibly the beginnings of an answer to some of the earlier questions.

Conversely, we did poorly on attempting to get a set of overlap talks from more diverse fields than usual, such as military history, and even on some fields considerably closer to what we usually do, such as air traffic control. Succeeding here is usually a function of strength of personal connection and how compelling the audience is for the speaker, but this time around we mostly didn’t manage to pull it off, with perhaps the most notable exception being Mario Platt’s talk on GRC and SRE - though very definitely in the tech domain, of course.

Detailed reflections

One of the things which is simultaneously very interesting and also mind-numbingly frustrating about SRE and the community’s parsing of it is how very quickly questions of identity can come to dominate the discussion. This was perhaps a bit more pressing in the early days of what we were doing, but is still somewhat extant today. To my mind there are a number of contributing factors to this tendency to self-reflect, at least one of which will never entirely go away: the lower status that all non-revenue generating roles suffer from.

We aimed to separate the question of identity, which might never be satisfactorily answered since so much of that depends on highly local attributes of culture, economic situation, etc, and turn that energy into thinking about the future, and how to structure and understand our work. This was driven by our belief that we get a better conversation if we talk about “what can we do?” rather than “who are we anyway?”

My view is that we were partially but not completely successful. Opening with “Knowledge & Power: a sociotechnical systems discussion on the future of SRE” was an implicit declaration that the emerging STS/safety science/resilience engineering/etc “subflavour” of SRE is going to be more important in the future. For me, the questions it raised about team level dynamics and questions about knowledge management - and how to value that knowledge - are incredibly relevant to where we are, and where we’re going. The tone was surprisingly hard-nosed too, but given Laura Maguire and Lorin Hochstein’s long years of experience, perhaps I was wrong to be surprised. Andrew Clay Shafer’s “SRE as she is spoke” was more in the way of a “traditional” uplifting/inspirational keynote - to my way of thinking, he effectively addressed some of the nervousness that animates our discussions of identity by drawing parallels between DevOps and SRE, and the act of learning a new language. Laura Nolan - and by the way, what a wonderful experience it was to attend a conference where the number of Lauras on-stage outnumbered the Daves - also addressed the “what could we be” point head-on by staking a claim to “SRE is Systems Reliability Engineering”. To my mind this is a highly plausible future direction for development, and is indeed a current reality too. Alex Hidalgo also contributed to this discussion with his “Diamonds with Flaws”, arguing passionately that whatever you wanted SRE to be, that’s what it should be for you. Bonus points for crowd-sourcing a (non-complete) set of responsibilities for SREs through Twitter.

Though for obvious reasons I’m extremely biased, I suspect that Todd Underwood’s “SRE for ML: Do you need to care?” was a great close-out talk, and perhaps the most on-topic talk in terms of what SREs will need to care about in the future that they currently don’t. Todd gave a great demonstration on stage of actually asking a large language model to write his talk, which went poorly, but not quite as poorly as you’d expect. I would advise anyone in the profession to adopt his recommendation to learn more about ML as soon as practical.

The outage track was very well received by the community, a number of whom I heard walking past saying how pleased and grateful they were that companies were willing to be vulnerable about their failings on stage. (We are all of us in the gutter, but some of us are looking at the stars, as another Irishman once put it.) As our first go, I was happy with what we managed to get on stage: the larger companies have in many ways a naturally more compelling story as an introduction, but I think the distribution of large versus small would need to be looked at for the next time, if there is one. Emily Ruppe’s closer of “The Repeat Incident Fallacy: What Jurassic Park Can Teach Us About Incidents” was highly enjoyable, though it can’t be avoided that her post-talk feedback experience was not, and we should reflect on what this says about us.

Finally, I’d like to call out Peter Sperl’s “Caching Entire Systems without Invalidation”, which was essentially how you do systems engineering as if you could use C++’s const keyword on a subsystem level. There’s a lot of promise in this approach which I think is currently underexploited, and I suspect that Bloomberg’s highly customised environment might make certain approaches like this both more intuitive to try, and more practical to enact, than other environments. However, it’s a hugely promising direction and I suspect will reward your viewing.

What might be coming

It seems clear that human factors and cognitive systems engineering are going to grow in importance for what we do. That’s been true for quite a few years, but I sense momentum building, though what form it might ultimately take is unclear. In a world where teams are decimated with virtually no notice, it’s ever more true that the humans, providing adaptive capacity, need to be supported. (We certainly believe that at Stanza, but there are many other places for which this is an emerging theme.)

I think we’re also going to see increasing pressure for regulation in the broadest sense. Whether it takes the form of cloud-based reliability concerns from financial regulators, as per Andrew Ellam’s talk, or likely increasing constraints on the environmental impact of compute (as per Namrata and Bill Johnson’s talks), we’ve been past the threshold where at-scale cloud work affects very serious things in the real world for many years now. I expect governments to move faster than they have been here, though it is as yet unclear whether we will just be involved in meeting requirements, or potentially writing them.

Another trend which I have sadly not been able to follow as much as I want, but I think is quite promising, is the explosion of tools for modelling, constructing, and reasoning about distributed systems. Formal systems generally, TLA+, and so on, are being joined by halfway-house style systems where reasoning about states can be done at some level that is not as low as, say, assembly language, but can still provide useful insights from a reliability point of view.

This is closely connected to the final point - the arrival of machine learning. The past year has seen an explosion in the public consumption of AI, in terms of Stable Diffusion, large language models, and so on. The boundaries of what is possible need to be redefined on an almost daily basis, and part of why this is so exciting is that the accessibility of tools doing this is slowly beginning to improve. It would be hubris indeed to think our particular area will remain untouched. Our best approach is to get ahead of it.

In our opening, we alluded to what I think most people would agree has been a general deterioration in circumstances over the past few years. The traditional four horse-people of the apocalypse have been joined by a promising new entrant, >2% annual inflation, as well as a wave of layoffs in the tech industry, not to mention the environment, general societal cohesion, and so on. If you were to say there were more important things to care about than our narrow corner of the world, then I wouldn’t disagree with you; yet, it is also possible to read the infinity of the world out there in our particular grain of sand. The outstanding question remains how to understand and communicate the value of reliability, often to an audience mostly uninterested in it, and happy to assume continuity by default. Yet the lesson of our profession and perhaps the past couple of years is that continuity does not emerge by default. It requires active work to sustain and support across both social and technical spheres. More ways of explaining that value, and justifying it, are urgently required, and this need will not end soon.

[I am CEO of a company called Stanza developing tools in the SRE & ML space. Please drop me a line if you would like to discuss anything related to these!]


Niall Murphy
