Waste versus slack in production engineering

“The waste remains, The waste remains and kills” – Missing Dates, William Empson

Economics teaches us that one person’s income is another person’s expenditure. But we don’t think like that when we think of waste. Instead, our usual intuition is that waste is something we expended that we didn’t have to, or something left behind after we did what we needed to.

In today’s distributed systems and cloud-first environment, however, that perfectly fine concept is a little trickier. Many apparently obvious examples of it become a bit less obvious when you look at them more closely. It’s also hard to decide what to do about that “waste”, even when you’ve decided what it is.

But fundamentally, there is one kind of waste which is ongoing and pernicious, which nearly every organisation underestimates, and which should be the real target of our efforts. Of which, more later.

Assumptions

For the purposes of this piece, we are not going to talk about the waste of physical hardware (sometimes known as e-waste), but about less immediately tangible quantities like CPU time, RAM and disk usage, efficiency of algorithmic operation, tradeoffs between time and space, and so on. Also crucial, though harder to measure on a second-by-second basis, is human time. Finally, a concept closely related to waste is slack. This is unused capacity - in some sense waste, in some sense not.

All of those are in the mix when it comes to understanding what waste means.

Computing resource wastage

Let’s start with the simplest case. You bought a machine with 8 cores to do X. It uses at most 1 core to do X. Therefore, you waste at least 7 cores. Or it has 128 GB of RAM, and habitually uses 12 GB. Wasted: 116 GB. A 1 TB SSD, of which you are currently using 24 GB? The rest is wasted!

While this is broadly true, the simplicity of the story disintegrates a little when you poke at it. For example, in an era of long-running applications that require logging, ML applications that get better the more data you have, and so on, it’s not very useful to regard the free space you have at the current moment as “waste” - in general, of course, the need for persistent storage grows over time. Instead of waste, the resources you’ve bought but aren't yet using are your envelope. If you consume 50 MB a day in (say) logs, an excess of storage by default trades space for time: the free space divided by the daily growth is how long you can keep operating before you have to buy more.
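To make the envelope idea concrete, here is a minimal sketch of that runway arithmetic in Python. The function name and the figures are illustrative assumptions (decimal units, constant daily growth), not measurements from any real system.

```python
def runway_days(capacity_gb: float, used_gb: float, growth_gb_per_day: float) -> float:
    """Days until the volume fills, assuming constant daily growth (a simplification)."""
    if growth_gb_per_day <= 0:
        raise ValueError("growth must be positive to compute a runway")
    # Free space can't be negative; an over-full volume has no runway left.
    free_gb = max(capacity_gb - used_gb, 0.0)
    return free_gb / growth_gb_per_day


# Hypothetical figures: a 100 GB volume with 91 GB used, logs growing 0.05 GB (50 MB) a day.
days = runway_days(capacity_gb=100, used_gb=91, growth_gb_per_day=0.05)
print(f"about {days:.0f} days of runway")
```

Real growth is rarely constant, so a number like this is better treated as a prompt to re-check your estimates than as a forecast.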

I say "by default". You can do other things than the default. For example, if your envelope is sized substantially above what you need, you can use the excess for other things. In an important sense, then, the actual waste is failing to use that capacity for other things.

Now we see waste as a different thing entirely. It's not that we have more than we should. (The definition of "should" can be very fluid in fast-moving systems.) Instead, it's us getting our estimates wrong! So in the universe where you both track your usage accurately and buy capacity in advance to prevent outages - which of course you should - the excess can be thought of not as capital you spent unnecessarily, but as further insurance against poor estimates, or as the ability to defer future expenditure for a bit longer.

A further nuance: if money is likely to be more expensive in the future (for example, if inflation is high) or commodity prices are likely to increase, it’s economically rational to apparently “overspend” now rather than later, provided you have reasonable confidence that you’re going to grow or will need more envelope.

So even in this simple context, waste is hard to define.

Algorithmic and design efficiency

Two key areas in which systems can be wasteful - which we will here define as doing more work to achieve a particular aim than is strictly necessary - are algorithmic efficiency and system design.

Algorithmic efficiency gets a lot of attention. In an era with several large tech companies with billions of customers each, there is a real and meaningful difference in time, money, resource usage, etc, between an O(1) and O(n) algorithm in those contexts. Even for much smaller companies, a misplaced O(n^2) algorithm can profoundly affect customer experience. So it is right and proper for us to spend time on getting those details right.
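As a small illustration of that gap, here is a sketch in Python of the same task (detecting a duplicate) done quadratically and linearly. The function names and the sample data are hypothetical, chosen only for the example.

```python
from typing import Hashable, Sequence


def has_duplicates_quadratic(items: Sequence[Hashable]) -> bool:
    """Compare every pair of items: O(n^2) comparisons in the worst case."""
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if items[i] == items[j]:
                return True
    return False


def has_duplicates_linear(items: Sequence[Hashable]) -> bool:
    """Remember seen items in a set: O(n) expected time, at the cost of O(n) extra memory."""
    seen = set()
    for item in items:
        if item in seen:
            return True
        seen.add(item)
    return False


# Hypothetical data: 2,000 distinct IDs plus one repeat at the very end (the worst case).
user_ids = list(range(2_000)) + [42]
assert has_duplicates_quadratic(user_ids) == has_duplicates_linear(user_ids)
```

Both functions give the same answer; the linear version simply trades a little memory for a lot of time, which is the kind of time-versus-space tradeoff mentioned earlier.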

However, the flip-side is that it’s possible to spend too much time on it, and I think most experienced software engineers will have a memory of doing so. Who amongst us has not leaned against a whiteboard, pen in hand, thinking about a complex way to handle some tricky case, when talking to some customers instead would have saved the month’s worth of engineering time you spent implementing that diagram?

A related point is the efficiency of the design of the overall system. It is entirely possible, and indeed happens quite often, that each individual component of a system is "racetrack-optimised" for execution speed, while the overall system is just not a very efficient way of accomplishing the higher-level goal. It’s perhaps a facetious example, but some decades back a friend and I used to argue about the comparative merits of Linux versus Windows. At some point in a discussion about filesystems, my friend triumphantly declared “See? Windows has a filesystem defragmenter, and Linux doesn’t have that!” (My counter-argument, that the Linux filesystem of the time, ext2, didn’t need one because better fragmentation decisions were made at file creation/layout time, did not, alas, carry the day.) Regardless of those opinions, though, the effect is real. Software is indeed one of those weird fields where you find out that actually, you’ve been doing the thing 10x stupidly wrong for years, and even more weirdly, sometimes it’s even possible to fix it.

Waste in this context, then, we can see as not doing the right thing when you “should have” - though in some sense, there are very real limits to realising what the right thing is, and doing so may require your engineers, your teams, or your org to get a lot better at understanding what it is they’re doing.

Waste in cloud environments

Today, of course, many organisations effectively rent capacity rather than purchase hardware outright, so some of the above cases don’t apply quite as directly. Rental of capacity changes the economics a little - now, if you use 1 CPU and only pay for 1 CPU, you have saved some money in opex, sure, but the real saving is not needing the capex for the fleet you require.
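One rough way to see the rent-versus-buy tradeoff is a break-even calculation. The sketch below, in Python, is deliberately simplistic: it ignores discounting, depreciation, and price changes, and every name and figure in it is a hypothetical assumption for illustration.

```python
import math
from typing import Optional


def break_even_month(purchase_cost: float,
                     monthly_ownership_cost: float,
                     monthly_rent: float) -> Optional[int]:
    """First month in which cumulative rent exceeds the cumulative cost of owning.

    Returns None if renting never becomes more expensive than owning.
    """
    monthly_saving = monthly_rent - monthly_ownership_cost
    if monthly_saving <= 0:
        return None
    # Rent overtakes ownership once month * monthly_saving > purchase_cost.
    return math.floor(purchase_cost / monthly_saving) + 1


# Hypothetical figures: 12,000 up front plus 200 a month to run, versus 700 a month to rent.
month = break_even_month(purchase_cost=12_000, monthly_ownership_cost=200, monthly_rent=700)
print(f"renting becomes the more expensive option in month {month}")
```

Before that month, renting has cost less in total and has never required the up-front sum; after it, the comparison starts to favour ownership, if you could have afforded it in the first place.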

Yet this is not quite the overwhelming win it’s sometimes portrayed as. One way in which the hyper-scalers resemble the mobile telcos is in their billing models, which in the cloud providers' case are sometimes closer to an actual business model than a billing model.

What do I mean? Well, in the mobile telco world, you’re undoubtedly aware that as a consumer you are generally presented with a choice of plans when you buy service from a company. These plans are sold to you as a chance for you to optimise your expenditure - picking the "text-a-lot" plan if you text a lot, and so on.

But just as much as they do that, they are also an opportunity for the telco in question to control user behaviour. The plans offered will incentivise behaviour in keeping with the capacity of the telco network, and even more importantly, they change a lot over time. So it’s not uncommon to find, if you haven’t renewed your plan in a few years, that others are paying quite a lot less than you are for the same thing. In this way, both telcos and cloud providers materially benefit from having as complicated a consumption model as they possibly can. Whether this results from inadvertent behaviour or not is unlikely ever to be fully clarified, but rest assured that organisations like the Duckbill Group and projects like Jez Humble’s cloud resource harvester wouldn’t exist if there wasn’t a real and genuine need for them.

In some sense, then, the cloud providers capitalise not just on indefinite rent for compute resources continuing into the future, but also on a rapidly changing environment, which makes it organisationally easier to keep spending that money to service current needs than to spend the time to understand what’s going on and maybe reduce that spend.

Waste, then, becomes a lot subtler in this context. You, the customer, might not have had the resources for the capital expenditure on your production fleet, but you can afford the opex. This is hardly waste - indeed, arguably it’s the opposite, since being able to trade a large amount of money up front for continual smaller amounts on a regular schedule gives you the chance to run a business you previously couldn’t. Exactly like mortgages! But now it’s much harder for you to figure out whether or not a better deal exists elsewhere, precisely because of the complexity of the billing model and how your production now works. Is that waste? Well, it is, if you could pay less for the same thing somewhere else, but for good reasons and for bad, it’s not trivial to compare those possibilities.

Probably, then, the most accurate thing we can say is that there may or may not be waste in your current cloud expenditure (though for any sufficiently large fleet, you almost certainly have waste somewhere) but it absolutely has reduced your optionality: i.e., increased opportunity cost. The waste is therefore the other things you could have done with that money.

Slack

Now we come to the question of slack - unused capacity in your system, which could be used, but on average isn’t.

There is a key difference between waste and slack. Both of them represent states of the world in which resources are unused, but we define slack as when the “waste” is there deliberately - to cover a situation where those resources will be used, or have a high probability of being used. The difference is one of intent: waste which gets accidentally used to compensate for a situation is still waste, if you didn't plan for it. But when that happens - when you use waste for slack - you’ve discovered something about your system, congratulations! (Though you'd better build that capacity into your plans from now on.)

Ultimately though, slack results from a deliberate act taken to compensate, or partially so, for events that would otherwise overwhelm your system. It is capacity, but to smooth out variations, not for immediately assigned work.

I once worked on a system that had a failover procedure between two datacentres. It was business critical; there was no possibility for it to be hard down for any extended period of time. The resources that were used in both datacentres were large and (obviously) the same size, since they had to fit the system in both cases. We were continually questioned by folks asking why we kept this huge set of resources unused - surely this was waste? It wasn’t, of course, but we had that conversation continually. That questioning, in fact, is really the issue when it comes to slack in the modern workplace: our contemporary culture of efficiency, regardless of what tradeoffs exist, eliminates the benefits of slack for short-term gain, and therefore invites long-term destruction.

And this is the difficulty that we face having a conversation about slack in a modern production system, hosted on the cloud. The cost-consumption model means that idle resources cost money; why would we ever have a situation where we are paying for something that might not be used, even if we get a discounted rate? Or in other words, why would you ever want slack?

There are a couple of different ways of answering that question.

  • It’s been well understood since at least the time Operations Research started in WW2 that slack is hugely important in ensuring stable system behaviour, especially when you don’t control the demand side. One compelling example comes from John Cook’s blog, where he shows that according to queueing theory, an additional server in a simple bank teller system can, under certain intuitive conditions, reduce the average queue wait time by 93x. That kind of number is often persuasive in and of itself.
  • Slack can also be thought of as insurance - money you spend in anticipation of the eventual occurrence of bad outcomes. Business leaders understand insurance quite well, and even though many of them chafe at having to pay for it, the argument is not unfamiliar to them. If you can frame and price it like that - and even better, show multiple occasions on which the pseudo-insurance did its job - that should help.
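The queueing claim in the first point can be checked with a few lines of Python. This is a sketch using the standard Erlang C formula for an M/M/c queue (Poisson arrivals, exponential service times). The arrival and service rates below are assumptions chosen to represent a heavily loaded single teller, not figures taken from the text above, and the function name is hypothetical.

```python
from math import factorial


def mean_queue_wait(arrival_rate: float, service_rate: float, servers: int) -> float:
    """Mean time a customer waits in the queue of an M/M/c system (Erlang C).

    Both rates are per the same time unit; the result is in that unit.
    """
    offered_load = arrival_rate / service_rate   # a = lambda / mu
    utilisation = offered_load / servers         # rho = a / c
    if utilisation >= 1:
        raise ValueError("unstable queue: arrivals outpace total service capacity")
    tail = offered_load**servers / factorial(servers) / (1 - utilisation)
    head = sum(offered_load**k / factorial(k) for k in range(servers))
    prob_wait = tail / (head + tail)             # probability an arrival has to queue
    return prob_wait / (servers * service_rate - arrival_rate)


# Assumed inputs: 5.8 arrivals per hour; 10-minute mean service, i.e. 6 per hour per teller.
one_teller = mean_queue_wait(5.8, 6.0, servers=1)
two_tellers = mean_queue_wait(5.8, 6.0, servers=2)
print(f"1 teller: {one_teller * 60:.0f} min, 2 tellers: {two_tellers * 60:.1f} min, "
      f"ratio {one_teller / two_tellers:.0f}x")
```

Under these assumed inputs the average wait falls from hours to a few minutes, an improvement in the region of 90-100x and so of the same order as the figure quoted above; the exact ratio depends on the inputs and on rounding. The second teller is idle most of the time, which is exactly the point.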

Here, then, we see that once again waste is a subtler concept than we might have thought. Folks in the resilience engineering community will often reference how a resilient team has some slack in it, and this slack enables you to respond more effectively during exceptional circumstances, bad outages, or macro surprises of all kinds.

This also reminds me of one of the arguments for the 50/50 split between operations and project work that we see in SRE: if you’re just responding to today’s needs all the time, and those needs grow over time, eventually you reach a point where you don’t have the time to make tomorrow better than today, and that is generally when things are primed for catastrophe or collapse. Or both.

Human Time

There’s one key angle we’ve left till last, and arguably it’s the most important, though given how some organisations run things, you’d be hard pressed to realise it. That angle is the question of human time, and its use (and misuse) in production systems.

Many years ago I went to the premiere of a certain Star Wars film with some colleagues, as part of a company event. I didn’t particularly enjoy it, and as I left I said brightly “Well, there’s two hours of my life I won’t get back!” to my manager. In my memory, he raised his eyebrow perhaps 2 cm and dryly quipped “Just like all the other ones.” (He was a great manager.)

This is the central realisation at the heart of the human condition. You don’t get the time back, no matter what. You can, however, spend it on different things - better things.

How we habitually arrange staff time in production engineering today is not, in general, respectful. For example, we expend time on answering pages, or on doing toil (again, an SRE reference) that keeps things going but otherwise delivers no enduring business value. We often do so deliberately, because we judge actually fixing the real problem (the source of the page, or the un-automated procedure) to be too hard. The cost is opportunity cost, but it cuts hard - feeding the machines with human blood does indeed feed the machines, but the humans rarely enjoy it. More importantly, it's even rarer that they get better at it.

Waste, then, can be most properly understood as expending time that you won’t get back - just like all the other time - on things you could fix, but don’t. It’s a waste of human opportunity and development potential, and a refusal to increase the capabilities of your organisation. The ultimate result of this kind of waste is the waste of human potential. Unfortunately, precisely because this is hugely difficult to value, it is sometimes valued at zero.

Conclusion

One person’s income is another’s expenditure, as they say: waste in your production environment is a cloud provider’s margin, and spending money today might well give you more margin tomorrow. But in contemporary engineering, giving the maximum capabilities to your staff - bulking up your resilience and your slack - seems likely to produce the maximum optionality for your organisation, or in other words, the largest field of choice for change.

In an era of huge and rapid change, those who can adapt the fastest will survive the best, and that adaptation happens best when your waste is deliberately chosen and primarily about money, not human time.

Acknowledgements

This paper benefitted from contributions and review by Cian Synnott of Datadog and Anurag Gupta of Shoreline.