LLMs won't save us

The AI wave is passing over us: what of genuine value will be left behind? asks Niall Murphy

As a long-time observer of the SRE/DevOps tooling market, I look at the tsunami of AI-powered and LLM-enabled tooling currently engulfing our industry like most great wave observers would: half in genuine wonder, and half in fear.

Every AI cliché you can think of is true. It has amazing abilities coupled with astonishing errors. Transformational effects on more or less everything; incredible - literally unbelievable - costs. Phenomenal cosmic powers; itty bitty living space.

There are a lot of companies trying to take this cosmic power and apply it everywhere. For a ton of consumer applications and a lot of business ones too, this combination of brilliant and broken works well, or certainly well enough. But the infrastructure engineering and tooling market has some attributes that present a lot of problems for the successful adoption of AI. At this stage it’s not clear whether these attributes represent a fundamental obstacle to success, or merely a temporary speedbump, but there’s enough money, effort, and drive involved here to make for a fascinating attempt to find out.

Regardless, production engineering folks look at this push to apply LLMs to their domain - particularly incident management - and are, in the main, skeptical. (To be fair, we have to note the evident self-interest accompanying this skepticism, of which more later.) In essence, AI people are betting that the production engineering people are wrong to be skeptical. Given the superhuman performance of AI systems in other domains, I’m not so foolish as to predict failure indefinitely on the part of AI. There are so many resources flooding the zone that success of some kind is almost inevitable.

But I thought it was appropriate, as someone who straddles the two domains of AI and production engineering/SRE/DevOps/whatever we’re calling it today, to go into why I think this technology, as important, as useful, and as cosmically powerful as it is, today has fundamental limits to success, that (particularly in the domain of incident management) LLMs are dangerous to use, and - as the title implies - that LLMs won’t save us. In other words, when we need them the most is when they are most likely to abandon us.

The rest of this article outlines the difference between these two world views, applies a critical lens to their positions, and tries to present a synthesis.

AI and Infrastructure Engineering

Let’s start off with some background. I’ll draw a distinction - weak in theory, stronger in practice I think - between “classical” AI/ML modelling (i.e. statistics) and Generative AI/LLMs. Leaving aside the mechanics of gigantic training runs, transformers, etc, the main practical distinction we observe today is the ability of LLMs to synthesize plausible responses in natural language to human questions, as opposed to special-case prediction of (to pick arbitrary examples) return rates on toys, decline rates on insurance, and so on.

Mostly these days the excitement is around LLMs rather than classical models. (That’s a pity for the infra market in some sense, since there’s a lot in the metrics world that works quite well with classical models.) I won’t spend a lot of time on characterising LLMs here - there’s a ton of relevant material out there and you’re probably sick of them, quite honestly - but I will lay out my propositions about them so you understand where I’m coming from when the oven of my hot take reaches 220C:

  1. LLMs are not (yet?) AGI. Reasoning ability is still developing.
  2. They’re also not just autocomplete. That paradigm fails to capture important components of their behaviour.
  3. We don’t understand a significant amount about how they work.
  4. They are (incidentally or necessarily) excellent search engines, since exact syntactic/token matches are no longer necessary to retrieve.
  5. They are most reliable when they retrieve tokens which are already in the training data, or are clearly derivable from an existing distribution. Unknown datasets give less reliable performance.
  6. Hallucination is not an accidental or incidental property of how they work.

You can agree or disagree or have a nuanced view about all those propositions, but they’re the background informing what I say.

Features of the infrastructure tooling market

It’s a fact that almost everything is infrastructure for someone, especially given “infrastructure” often means “the stuff I need but don’t understand”. But the tooling market for production engineering, infrastructure, and the like is pretty specialised. For the purposes of this article, it’s software that isn’t directly consumer-facing, but is usually trying to help developers/operators take care of a piece of code, a system, or even occasionally a human more efficiently, effectively, or reliably than was happening otherwise. Kubernetes would be a classic example; a giant system for managing systems more reliably. There are many others. Some of them are all-encompassing frameworks (Heroku), some are narrower solutions (OpenTelemetry). But they are generally united in the task of helping to manage other systems, rather than (e.g.) fulfill the functional requirements of an e-commerce shopping cart.

Let’s look at the attributes I have in mind.

Build-vs-buy. The classic buy-versus-build distinction matters a lot in infra tooling. Almost everyone believes their situation is unique or is strongly characterized by some special feature that no-one else has written reusable software for. Maybe they are even correct in this belief, but regardless, there are few engineers in this space who believe they can buy something off the shelf and have it work. Also, software engineers like to write software (indeed, sometimes we even hire them to do this...). As a result, infra software folks are generally inclined to try to make a go of fixing their problem themselves, except where there’s either some specific domain knowledge which turns out to be problematic, or the tool provides insane levels of convenience. Or in other words, production-minded people are more inclined to build.

Conversely, AI folks believe that their approaches have been successful, indeed, Nobel-prize-worthy, across domains as varied as radiography, protein folding, Go (the game), making music, images, and writing poetry. Is there any real reason to regard production management as sufficiently different from those that you couldn’t do something good? AI folks reckon they can, so they are more oriented towards buy (really in this context, sell).

Predictability mindset. Folks in the infrastructure world are often emotionally and practically aligned with lower rates of change, and care deeply about accuracy and, in particular, predictability. This is not to say that they are uncomfortable with probability, or the idea of things failing - both the framework of risk management and probabilistic approaches to distributed system management are huge components of how production engineering is actually done. But it does mean that people care deeply about the behaviour of their tools, and reliability/predictability of those tools has effects on the perceived value of solutions, and the difficulty of selling them.

It’s important to note this is particularly true for “large” production environments: a program that applies some change correctly to 93% of backends, but leaves 7% of them persistently misconfigured/unaddressed is in many cases worse than no automation at all, especially if that 7% changes arbitrarily, and if the 7% left behind is tricky to handle in any way.
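To make that arithmetic concrete, here is a minimal Python sketch. It is a toy simulation rather than a model of any real rollout tool: the fleet size of 10,000, the `stragglers` helper, and the seeded random draw standing in for "arbitrary" failures are all assumptions made purely for illustration.

```python
import random


def stragglers(backends, success_rate, seed):
    """Return the backends a hypothetical rollout failed to configure.

    Each backend independently succeeds with probability `success_rate`;
    the seed stands in for whatever makes one run differ from the next.
    """
    rng = random.Random(seed)
    return {b for b in backends if rng.random() >= success_rate}


# Assumed fleet: 10,000 backends with made-up names.
backends = [f"backend-{i:05d}" for i in range(10_000)]

# Two runs of the same 93%-successful automation.
run_a = stragglers(backends, 0.93, seed=1)
run_b = stragglers(backends, 0.93, seed=2)

print("left behind in run A:", len(run_a))
print("left behind in run B:", len(run_b))
print("left behind in both: ", len(run_a & run_b))
```

At this scale the 7% comes to roughly 700 machines per run, and under these assumptions the set of stragglers is mostly different each time, so a fixed exception list does not help: someone has to go and find them after every run.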

Folks in the AI world are much less upset by a lack of predictability, particularly given their jobs are often contingent on increasing the lack of predictability in a software system. At worst, this means the systems provided by AI folks fundamentally don’t - and potentially never will - meet the customer expectations for how they should work. At the very least this may make it harder for them to put themselves in the customer’s position, and model what’s important to them. 

Focus on Incident Management

Now let’s talk about incidents, and the incident tooling market. I’ve picked this market because I think the issue is particularly well illustrated if you use it as a lens to understand what’s going on. I think my argument still holds for other use cases too, but I accept the applicability might vary.

Starting with the basics, incident management is one of the domains in the industry where practitioners often have a very different view of reality than leadership. (I wrote about this in both Against On-Call and Rice’s Theorem and Software Failures - many others have written more and better elsewhere.) So from a sales perspective one challenge here is that the people who are paying for the software don’t necessarily have the same opinion about the problems for which the software is ostensibly a solution, as the people who are obliged to use that software.

The tooling market for incidents is, in general, not for preventing them (I often wish that approach was better served). It’s a “how can we react better” market. From my perspective that’s entirely fair enough - incidents are hard enough to cope with without having terrible or in many cases non-existent tools, and it’s not like incidents are a non-renewable resource, so the problem can never be “solved”. Software to help people work together better in a more structured way has a long and successful history and, mostly, both producer and consumer benefit.

But there’s another feature of incidents which is absolutely key here: not only are they effectively infinite/unsolvable in the general case, but how they manifest, how they’re understood, and how they are “fixed” are potentially novel every time. In some sense, a sufficiently complex distributed system is akin to a generative grammar for incidents - novel sentences issue from the grammar of the system, and it is only by extreme effort of parsing that we make sense of them each time. If you think back to the famous known/unknown quadrant popularised during the 2003-2011 Iraq war, you’ll get a handle on what I mean: any team on-call for a sufficiently complex system will encounter an infinite series of incidents over time, and these fall broadly into the four buckets of known-known, unknown-known, known-unknown, and the dreaded end-of-level boss, unknown-unknown. It’s precisely in this categorisation where I think there are some serious questions to ask about the application of AI to this domain.

In summary, the job of incident management is supported well by coordination tooling, but has the fundamental issue that - pending AGI - a necessary subset of the incidents encountered are not addressable by anything other than helping the humans cooperate better. The measure of your tool is how well you accommodate that fact.

LLMs won’t save us

Note my careful wording. LLMs definitely have a role to play. But they won’t save us. I’ll talk more about what that means shortly, but for now, let’s go back to those general features of the market to explain some of the friction here.

Build-versus-buy. AI-flavoured build-versus-buy discussions are even worse than the normal build-versus-buy, because not only do you have the question of surpassing the convenience of internal tooling, but any LLM implementation that’s three wrappers on OpenAI/Anthropic in a trenchcoat is seriously affected by behaviour and training drift, which is bad for the customer, and vulnerable to pricing shock and data nontransparency, which is bad for the provider.

Local model implementation is also hard for a provider to build a business on, for similar reasons, as well as being vulnerable to the customer deciding they can do it just as well - and since those models are freely (caveats apply) downloadable, they might well be able to do that.

Predictability. The fundamental difficulty. A lot of the excitement happening around LLMs is precisely their ability to produce conversations which are compellingly like a human conversation, able to accommodate previous context and reference occasionally quite sophisticated nuance across a wide range of information. In return for that flexibility, we have (necessarily?) surrendered the idea that we can get a predictable output. But there are many use-cases in the infrastructure domain for which we precisely and utterly want a predictable output. This is true in, say, the rollout case, where we would want some kind of reasonable model to safely oversee the operation of something which we expect will generally work most of the time, and clean up simple problems if they happen. An LLM is currently not well positioned to fill that role.
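For contrast, the kind of predictability I have in mind can be sketched as a plain decision function for one stage of a rollout. This is an illustrative sketch only: `StageResult`, `decide`, and the 1% failure threshold are hypothetical names and values, not taken from any particular tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageResult:
    """Outcome of one rollout stage (hypothetical shape)."""
    attempted: int
    failed: int


def decide(result: StageResult, max_failure_rate: float = 0.01) -> str:
    """Return "proceed" or "halt" for the next stage.

    A pure function: the same stage result always yields the same
    decision, which is the property operators tend to want here.
    """
    if result.attempted == 0:
        # Nothing was attempted, so there is no evidence it is safe to go on.
        return "halt"
    failure_rate = result.failed / result.attempted
    return "proceed" if failure_rate <= max_failure_rate else "halt"


print(decide(StageResult(attempted=200, failed=0)))
print(decide(StageResult(attempted=200, failed=10)))
```

The point is not that this logic is clever (it isn’t), but that an operator can read it, predict what it will do on any input, and trust that it will do the same thing tomorrow.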

Slightly counter-intuitively, it’s the incident domain where creativity and randomness would be most welcome - part of the diagnosis and resolution process is figuring out where your model of the system and the actual system diverge, and that usually requires creative thinking. But entirely novel unknown-unknown failures have no precedent in training data and, as currently constituted, LLMs would struggle to suggest something useful. This leads us to one key question: do you want your AI to perform superhumanly accurate remediations 70% of the time, and then delay the resolution of, or increase the impact of, your worst outages by doing things that make them worse and could never work?

This leads into my next point.

Jobs-to-be-done. What is the job that an LLM is going to do for you in the infrastructure/incident domain? Is it, as many folks are working on, training over post-mortems or incident reports in order to produce recommendations for how to handle things? Some kind of real-time assistant? But such jobs stand over the corpus of human activity created and/or written by humans doing these jobs - if we delete the requirement for them to do that work, the corpus goes away, or less dramatically, loses relevance. Does the role of assistant not cannibalise itself, or even worse, kick away the ladder that humans use to climb up to the level of being useful in these domains? If the response is that LLMs will write the reports that they will then train on, as I currently understand it this causes the LLM to drift from reality, never mind that for a sufficiently large or quickly updated system, the reality itself changes continually. If you did somehow have an Oracular IM LLM to which humans gleefully outsourced resolution, would that in itself not remove any incentive to write operable code, and as a result lead to a code quality decrease? Or to put it less dramatically, if the known-known problem is going to be papered over by computers successfully almost every time, why bother doing anything about it?

A large body of work in the resilience engineering community talks about the problems of automation, and how (in aviation for example) the problem of transitioning between an autopilot and human control, called “the bumpy transfer of control” (Dr. Woods, et al), often leads to worse outcomes than if the automation had never been there at all. (Of course automation might still be beneficial overall, but certain specific scenarios can definitely be worsened.) In that sense, an assistant model as the heart of the product is a valuable first step, but the provider might think twice before enabling completely unattended operation.

The value of operations. There is another contributing factor to the decision-making currently motivating AI approaches, one that is not often said out loud but is every bit as real as everything else, sometimes more so: to wit, the belief that operations work should not be staffed, and that any half-tolerable AI solution will allow large or complete destaffing of the effort, if not the teams themselves. If you genuinely believe operations work has no value, then this is an obvious goal, but if you’re one of those who believe that adaptive capacity is, in all its messy unmeasurability, currently the only thing that stands between us and cascading failure of our complex and fragile systems, then you might rightly be critical of that goal.

In a strange way, the fact that we have happened upon this pseudo-alien technology, which can reproduce a certain amount of the range of human behaviour in language, but only on foot of a gigantic corpus of precisely that behaviour, is a demonstration of the centrality of humans to this process. By trying to get rid of them, it's entirely plausible that we’ll discover we only need them more.

Conclusion

I wrote above that LLMs won’t save us, meaning that in the narrow role of incident management, we don’t get to surrender our responsibilities and expect them to pick up the slack. But I do want to be clear that it’s entirely plausible they have useful ways to contribute. We could even manage some of this risk, if we wanted to adopt these technologies intelligently; for example, by analogy with the SRE/SWE case, perhaps instead of surrendering on-call responsibilities to the machine completely, we make them shared, so humans keep doing some proportion of that work, and retain those skills and capabilities for when they are most needed.

Indeed, to my mind the risks of not doing so seem quite high, because I’ve spent most of my career on the right-hand side of the probability distribution, where million-to-one chances happen several times a day, and nothing about the direction the world is going in at the time of writing seems to suggest increasing stability.

However, I fully acknowledge that there are other points of view. It might still be a perfectly fine product to have a thing that sits there and makes reasonable suggestions some percentage of the time, and is particularly good about known-knowns. That would represent a meaningful step up for a large number of organisations. (Indeed, under the heading of conserving attention, it’s arguably an improvement for everyone. The product would, however, need to somehow escape that character of LLMs that causes them to confidently assert untruths.) Many would also welcome automated assistance with writing post-mortems, though as I’ve said elsewhere, if such documents are going to be written by machines and only consumed by machines, pending AGI we are removing a vital quality gate both in their generation and their consumption. Ultimately if we rely on it too much, the autopilot will disengage when we are too close to the mountain to course correct.

If the risks are manageable, the product is thoughtful, and the adoption is measured, there’s a lot of value to be potentially gained. (Though, in my experience, given how notoriously difficult it is to get people to care about infrastructure, expecting those things is a high bar.) As a result, the problem from the point of view of the supplier probably won’t be getting people to try something like this. Instead, the issue will be keeping it safe. Safety and predictability often go hand-in-hand, and I fear that in the rush to destaff unfashionable things, we will sacrifice predictability in the expectation of safety, and receive neither.

Acknowledgements

Thanks to Todd Underwood, Yvonne Lam, Steve McGhee, Serhiy Masyutin, Willem Pienaar, and others for providing feedback on earlier drafts of this essay.