Guest: Jon Berger
In the second episode of Asynchronous and Unreliable, amongst many other subjects, Anne & networking expert (and Anne's husband) Jon Berger discuss the high end of code efficiency
Watch on YouTube
Listen on Spotify
Listen on Apple Podcasts
Read shownotes & transcript below
The High Stakes of Code Efficiency in Networking and Beyond
In this episode of Asynchronous and Unreliable, host Anne Currie is joined by long-time tech veteran Jon Berger to explore the critical importance of code efficiency, especially in the networking software that underpins the internet’s performance and resilience. They delve into practical comparisons between operational and code efficiency, the impact of AI on high-performance software, and how scale influences software optimization strategies.
Key Topics:
The necessity of ultra-efficient code in networking software managing billions of packets per second
Differences between operational efficiency, systems design, and code efficiency
The exponential scale of code efficiency impacts versus the linear scale of operational efficiency
How high-frequency trading and networking code both push the limits of performance, often in assembler and machine language
The influence of AI and automation on future high-performance software development
Trade-offs between hardware reliance for speed versus software optimization
The impact of scale: Large companies like Google vs small startups in operational and code efficiency
Challenges of maintaining operational efficiency at scale in different sized businesses
The role of human expertise versus AI in optimizing software for resilience, energy, and security
The importance of aligning software performance strategies with business goals like growth or cost reduction
Resources & Links:
Building Green Software by Anne Currie, Sara Bergman and Sarah Hsu (Book reference for operational efficiency strategies)
Flash Boys by Michael Lewis – about high-frequency trading and performance
Linux Kernel in Rust – AI-created Rust code for Linux kernel
Connect with Jon Berger on LinkedIn
Anne Currie (00:01)
Hello and welcome to Asynchronous and Unreliable, a new weekly podcast where we discuss the most interesting ideas and concepts in tech. I'm your host, Anne Currie, co-author of Building Green Software and The Cloud Native Attitude, and author of the science fiction Panopticon series. And today we're going to be talking about the endlessly interesting topic of code efficiency. And for my guest, who could be better than long-term tech veteran and expert on code efficiency, Jon Berger, who as my husband of many decades happens to be usefully cached locally. So, Jon, you've been in tech for well over 30 years. You've been involved in creating, growing and leading organizations that are all about building mission-critical software: high-performance and high-resilience software. And a lot of the perspective that we're going to be talking about today comes from delivering code which has to be at the highest possible end of efficiency, specifically networking code. So that means code that's running on routers and switches and things. So tell us a little bit, maybe about yourself, but also about why that code has to be so efficient.
Jon Berger (01:15)
So that's a good question. In the area of both voice and data networking, the scale of the modern internet is just absolutely incredible, and it's enabled by the network. And that means the network has to do an awful lot of very hard work in the data plane. That is the bit of the network that is processing, moving your data from one place to another. We're talking about processing billions of packets per second. That leaves very little time to process each packet. And if you couldn't operate at that level, then the amount of equipment and the size of the network you'd need would be absolutely ludicrous and prohibitive. And then there's the control plane, which is the bit of the network you don't really see, but is the bit that controls the data plane and allows the different pieces of networking equipment to negotiate between each other about how to get from A to B. There you've got to build up a logical picture of the entire network, or some decent proportion of it, and manipulate it in real time as it changes in order to do your job. So this is very high-performance code, because it's billions of things per second. And that's not billions of things per second across some distributed system in a cloud; that's billions of things per second on one single piece of hardware.
I think that was the answer to the question.
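To make that per-packet budget concrete, here is a quick back-of-envelope sketch. The clock speed and packet rate below are illustrative assumptions, not figures from the episode:

```python
# Back-of-envelope: time and cycle budget per packet in the data plane.
# Assumed figures for illustration: a 3 GHz processing core handling
# one billion packets per second on a single piece of hardware.
clock_hz = 3e9
packets_per_sec = 1e9

ns_per_packet = 1e9 / packets_per_sec           # nanoseconds available per packet
cycles_per_packet = clock_hz / packets_per_sec  # CPU cycles available per packet

print(f"{ns_per_packet:.1f} ns per packet, ~{cycles_per_packet:.0f} cycles each")
```

With a budget of only a few cycles per packet, even one extra instruction is a meaningful fraction of the total, which is why this kind of code has historically been hand-tuned down to the instruction level.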
Anne Currie (03:03)
I will say it's interesting.
To give our listeners or viewers a bit of background here, you were very involved in the writing of Building Green Software. You were our primary consultant on the code efficiency chapter and the networking chapter, because you and all of your many friends in the code efficiency field were invaluable when it came to helping Sara, Sarah and me write that. Because really, I think you could argue that, effectively, networking code is the fastest code on the planet, isn't it?
Jon Berger (03:47)
It's hard to know for sure. It's certainly some of the highest-performing code on the planet. I'm not an expert on high-frequency trading, but I understand that's an area where every femtosecond matters. But networking is an area where historically people have looked into writing code in assembler, because just shaving one instruction off processing a packet was meaningfully useful. So I'm not necessarily going to say it's the number one, but, you know, it's a competitor.
Anne Currie (04:40)
Yeah, it is, as you say, incredibly important. And not only does it have to be incredibly fast and performant, it has to be very resilient as well. Because, oddly enough, you're running this ridiculously high-performance code over often quite flaky underpinnings. Historically, very flaky underpinnings. And it still has to keep going, and so performance and resilience are really built into the fabric, literally the fabric, of the internet.
Jon Berger (05:17)
Yeah, and there are always, I guess, two ways of looking at that resilience. One is ensuring that your code is of reasonable quality, such that it doesn't fall over very often. And the other is that if something goes wrong, there's some way of recovering: failing over, switching over to something else, without losing data or losing time performance. So, you know, one of those is more about design; the other is more about the execution.
Anne Currie (05:56)
Which is really interesting. So, Sara Bergman and I recorded an episode a couple of episodes back where we talked a bit about code efficiency versus operational efficiency, and how, when we were writing Building Green Software, a lot of people were asking us to write mostly about code efficiency, because everybody knows that's where the big improvements in overall efficiency are hidden. We had to fight very hard to say, well, it might be amazingly good what you can achieve with code efficiency, but it's very hard to achieve. And so for almost everybody in the world, it's not where you start. You start with operational efficiency, because no matter how good your code efficiency is, if your operational efficiency isn't fantastic, all that effort you're putting into code efficiency will be wasted. It's an interesting one: code efficiency is amazing in networking, but operational efficiency is just as important, or more important, isn't it really?
Jon Berger (07:08)
Yeah, I mean, operational efficiency is always going to be the low-hanging fruit, and it brings many, many other benefits in terms of being able to move fast, to deploy more often, to upgrade to new features more easily and more safely. They both require expertise, but with operational efficiency you can probably get away with a relatively small number of people who have good ideas and know what they're doing. Whereas code efficiency, at least before AI wrote all the code for you, was something that required expertise in breadth and depth. You needed a lot of people who could do that. And that's really expensive, as in there aren't that many of them and therefore they command higher salaries, but it's also just really difficult to do and to deploy those engineers at scale.
Anne Currie (08:13)
Yeah, it's incredibly expensive. It's incredibly difficult. Even the big players don't do anywhere near as much of it as you would hope, because it's so expensive and difficult to do. The other thing I'll say, this is something we discuss, being a married couple, over the breakfast table: we discuss code efficiency versus operational efficiency all the time. That's the exciting kind of married life we lead. Because we met on a graduate training program 30-odd years ago, working on the kind of software that you went on to write for your whole career: really high-performance software, distributed systems, that kind of thing. And I left after 10 years, but you stayed there your entire career, until finally the company sold to Microsoft a few years back.
But yes, one of the difficult things about operational efficiency versus code efficiency is that it's meaningful to say that one system is more operationally efficient than another. Between an inefficient system and a really high-end operationally efficient system, there's maybe a 10x difference.
So there's a 10x difference between an enterprise that's just doing the bare minimum, maybe not even following best practice, but still running a successful business, which is quite common, and Google, who are the operational efficiency kings and queens of the world. It's only a 10x-style difference; it's a straight-line difference in operational efficiency.
But code efficiency is more of an exponential scale. It's a 100,000 times, a million times difference, even more than that, between somebody writing the kind of ordinary stuff that's running every day and somebody writing really, really hyper-efficient software. And so better or worse becomes somewhat meaningless: it's a logarithmic scale as opposed to an ordinary linear scale. Operational efficiency is on an ordinary scale; code efficiency is on a logarithmic scale. Do you think that's the right way to think about it? Or is it a useful way to think about it?
Jon Berger (11:03)
I don't know. I think you're saying that for operational efficiency, because obviously people can write or design infinitely inefficient software, normal usage tends to be maybe 10 times worse than Google; that's the sort of range you might see. I expect it's bigger than that, but you're right, it's probably not a million times. In code efficiency terms, then, yeah, a million x wouldn't be unusual. And, you know, there are a number of reasons for that. As an industry, we have spent the last three decades basically creating platforms which allow software engineers to be more efficient, to write code more quickly, almost all of which come at the expense of performance. We've traded off wins on software engineer performance against losses on the actual software performance. All high-level languages are less efficient than just writing machine code. But writing machine code is really, really, really slow, and it would probably be quite hard to employ a lot of people who really wanted to do it.
Writing in Python or Java is much, much quicker, but the code is going to execute way slower. So that's one of the reasons the question often comes up: which language is more efficient? But actually, even within those languages, there are several orders of magnitude of difference in the other choices you make. There might be design choices about how you glue your components together, how microservices talk to each other, or whether you're even using microservices; how you deploy your software; and which algorithms to use and how you use them. And most engineers don't really have a good understanding, and don't need a good understanding, of those choices or how to make them, because actually it doesn't matter for most software. It could run 10 times slower, 100 times slower, a thousand times slower, and it doesn't make any difference, because you're not doing billions of things a second, or you're below the limit of where your user could notice whether you've done something or not. If your website responsiveness goes from attoseconds to femtoseconds, it doesn't matter.
It needs to get into decent numbers of milliseconds before anyone even cares. So there's a huge range of just the system's ability to absorb inefficient software because the hardware now is so incredibly good. That didn't used to be true.
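As a tiny illustration of the orders-of-magnitude gaps hiding in everyday choices (a hypothetical micro-benchmark, not from the episode): the same membership test in Python, against a list versus a set, differs by orders of magnitude purely because of the data structure chosen.

```python
import timeit

# The same logical question ("is this value present?") asked of two
# data structures: a list scans every element (O(n)), a set hashes (O(1)).
n = 100_000
as_list = list(range(n))
as_set = set(as_list)

# Look up the worst-case element (the last one) 200 times each way.
t_list = timeit.timeit(lambda: (n - 1) in as_list, number=200)
t_set = timeit.timeit(lambda: (n - 1) in as_set, number=200)

print(f"list: {t_list:.4f}s  set: {t_set:.6f}s  (~{t_list / t_set:.0f}x faster)")
```

Most code never notices this kind of gap, exactly as described above; it only matters once you're doing enough lookups per second for the difference to cross a threshold that a user, or a bill, can see.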
I think it was... what was the book about the high-frequency traders? Flash Boys, where Michael Lewis talks about the engineers. This is 30 years ago now, I guess. The engineers that were in really high demand to build these systems were Russian, because the Russian software engineers had grown up without the superfast modern (as it was then) hardware that existed in the West. And so they'd had to work really, really hard to wring every percent of performance out of the relatively weak hardware they had, and so they were used to writing really, really, really high-performance code, whereas Western engineers, even 30 years ago, were already lazier. And that's good laziness. That's not a criticism. Sometimes lazy is good. Don't do things you don't need to do. If you can relatively inexpensively buy very high-performance hardware, and that means you need to spend less of your constrained resource, which is your expensive software engineers, that's a great trade-off. That's exactly what you should be doing.
That is what most of the world has been doing for some time now.
Anne Currie (16:22)
Which kind of gets me onto a subject that's a constant breakfast-table discussion for us, something that you comment on, which is that the reason why code is kind of 10,000, 100,000, a million times less efficient than it could be is because it's written by and maintained by humans. It is human.
Jon Berger (17:07)
At the time of recording, that is mostly true.
Anne Currie (17:10)
Indeed. And in fact, if this doesn't go out for a couple of weeks, maybe it'll be different by then. Because that is the area where there is an astonishing amount of change potentially coming from AI. One example of this is the paper published recently by OpenAI, saying that they had written a new C compiler, written not in C but in Rust, that could compile the Linux kernel. And they'd written it using AI in a totally hands-off fashion: no human wrote a single line of that Rust. And that is a sign that high-performance code is going to change.
Jon Berger (18:13)
Enormously. I mean, all code is going to change. How quickly this will happen, we don't know, but it will happen. The expertise that is required to build high-performance code is a really tough discipline to learn as a software engineer.
And it's not just a discipline where, once you have that skill, all code you write is high performance. An awful lot of the skill in writing high-performance code is a healthy attitude to trial and error. It requires having a go at something and seeing what the effect is, because there are so many different variables that can affect the performance of the code.
You can look at an individual line of code and ask: does that look high performance or not? But that's very much at the easy end of the scale.
Then you're looking at how much code you're writing, and therefore how much code is in the cache at any one time, because if you have to load more code into the cache, that's a massive hit. If you have to load more data, that's a massive hit. So is your data handled helpfully? Do you have the ability to fetch the data in advance? Things like that. There are so many different dimensions. You can have a good guess, and more experienced engineers will guess better than less experienced engineers about what might help or what is worth trying, but until you've really tried it, you won't really know. And that's really expensive in human time.
But when human time is no longer a factor, then you can do those things much more efficiently. The other thing that you can do, when your cost to write code is free, is have lots of different variants of a particular codebase or a particular algorithm or a particular product, each optimized for specific things. That would be totally mad if you had to write all that code by hand, because the maintainability nightmare would make your costs instantly prohibitive. But if your code is written for free, then sure, have as many different variants as you need, and suddenly the performance opportunities available from AI-written code are really quite incredible.
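For a rough sense of why the cache loads Jon mentions are "a massive hit", here is some back-of-envelope arithmetic using illustrative latency figures (typical orders of magnitude, assumed for the sketch, not measured values from any specific chip):

```python
# Assumed, illustrative latencies in CPU cycles.
l1_hit = 4        # data already in the L1 cache
dram_miss = 200   # data fetched from main memory instead

# Suppose a hot loop takes ~20 cycles per item when everything is cached.
cached_cost = 20
# Replace just one L1 hit with a main-memory miss and the per-item cost balloons.
miss_cost = cached_cost - l1_hit + dram_miss

print(f"cached: {cached_cost} cycles, with one DRAM miss: {miss_cost} cycles "
      f"(~{miss_cost / cached_cost:.0f}x)")
```

This is also part of why guessing is hard: whether a given change keeps the working set in cache depends on data layout, access order and what else is running, which is exactly the trial-and-error described above.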
Anne Currie (21:07)
Yeah, astonishing.
Jon Berger (21:07)
I mean, that's further off. That's further off, obviously, because it requires a whole load of other, more complex reasoning about how the code is deployed. I would guess that once you get beyond just AI writing any code, the next easiest thing to do would be to make it more efficient, because you're saying: take this thing that's already got well-defined inputs and outputs, and just rewrite it
Anne Currie (21:10)
Yeah.
Jon Berger (21:37)
to go quicker. That's a very, very well-defined job, and very testable, or at least relatively testable, compared to "redeploy this whole system and make good guesses as to when you need to rewrite a specialized bit of code to go faster".
Anne Currie (21:59)
So the interesting thing about this discussion is you might view it as the opposite of the discussion that I had very recently in the first episode of this podcast with Sara Bergman, where we were talking about code efficiency versus operational efficiency. From her work at Microsoft, she was talking about the benefits of running code that wasn't that tuned for a particular CPU, which has enormous operational efficiency benefits, because it means you can keep hardware for longer. You can move things around from place to place. You can move things that you need to run to a different data center or a different machine or a different location. You get all of those operational efficiency improvements.
This is an amazing example of the trade-off between operational efficiency, where often you don't want to be too tuned for a particular chipset, and code efficiency, where to get to the really high end you want to be incredibly tuned for a particular chipset. We are in a moment where everybody's having to balance both those things, and it causes quite a lot of confusion, actually.
We talk about efficiency, but code efficiency and operational efficiency are quite different, and you use quite different techniques for them.
Jon Berger (23:31)
Yeah. Yeah, very different. And I guess so much of operational efficiency is related to the system design that it's probably less something AI will be doing for folks in the short term. But yes, building software in such a way that makes it easier to operate efficiently is something that, up until now, humans have had to worry about.
Anne Currie (24:10)
Yeah, yeah, it is interesting. I do think that there's this huge opportunity for AI to build green software far better, code-wise at least. Interestingly, in Building Green Software we really concentrated on operational efficiency as being the more human skill. It's more useful and transferable; it's a skill that every enterprise should have, because it has all kinds of knock-on effects on your cost and your resilience and your security and your performance. And you have to get it right first, because if you've got that wrong, you can do as much code efficiency as you like, but if you don't have operational efficiency nailed, it's wasted. And it was incredibly expensive; you cannot afford to waste it.
So you've got to have operational efficiency nailed first. And that is innately a more human skill. At this point, I'm guessing that operational efficiency will last as a human job longer than code efficiency. But who knows? It's very hard to say.
Jon Berger (25:26)
Yeah, I go along with that; that sounds likely. Although more and more of the actual operating will be done by AI, operational efficiency will stay a human job, because it will just be too utterly terrifying for people to hand that over to AI, for a little while at least.
Anne Currie (25:54)
But then in some ways that's similar to, you know, Kubernetes. That's just automating, without the need for AI, and it was terrifying at first. And in fact, most businesses have not yet moved over to a totally automated data center.
I wrote a book about it, The Cloud Native Attitude, 10 years ago now, and at that point I was thinking: this is going to be here in two minutes. But it really hasn't spread much more in the past 10 years than it had done in the first two or three years. People are still a long way from basic automation of operations.
Jon Berger (26:38)
Yeah, it's true.
We started by talking about the million-times difference between the most efficient and least efficient code. Obviously, it can be much more than that, but even in normal deployments it's many times. And the same is true of how efficiently people can operate software and build it, because the same is true of the scale at which people operate.
The very largest companies operate at a scale that is really, really difficult to imagine if you're an individual shop, or just a business that's serving a local area. The fact is that if you're Google, Microsoft or Amazon, everything you do is billions, tens of billions, at a minimum. That's the minimum scale of thing that you care about.
But for the vast majority, for 99% of businesses, that's beyond anything they could ever imagine. And that's fine. A lot of people will correctly say, well, let's worry about getting a product that works and that anyone cares about; you know, zero to one is hard. And then scaling that up is a different job. And it is a different job that requires different expertise.
And if you're one of those, if you're a shop or a small business somewhere that has half a dozen employees and a hundred customers, well, you're just not someone who's ever going to care about billions. You're not somebody who should ever spend time worrying about massive levels of automation, or shaving 1% off this or the other, because that will never be the thing that is actually useful to your business. It might be interesting for you to do personally, but it won't be what your CEO wants you to be doing. And if you are at the other end of the scale, operating at a hyperscaler, then it's not the job of everyone in the business to worry about that scale, but it is the job of a lot more people to make sure that things operate safely at scale. And you can afford to spend a lot more on efficiency; not just afford to, in fact, because if you don't, you don't have a business.
Anne Currie (29:21)
Yeah, in the Green Software Maturity Matrix, which you were also involved in, we do suggest that for most businesses there's a kind of basic operational competence, which is making sure you know what's running in your data centre. And that will halve your costs and halve your carbon emissions. And you've got to have it. It is non-negotiable, because if you don't, then those machines that you're not watching are the ones that get hacked.
It's a kind of basic operational competence required for all businesses: don't have machines sitting there running when you don't know what they're doing anymore. But beyond that, for most businesses, it doesn't really pay off that much. Even basic autoscaling and things like that: they're lovely to have, but, you know, it's not the end of the world if you don't.
Jon Berger (30:23)
Based on that definition of "do you know what is running in your data center?", most businesses do not have basic operational competence. In fact, probably not many have it. But is that a problem?
Yes; for anyone who's doing something that they don't know they're doing, or wouldn't choose to be doing, you could argue that's a problem. Is the problem worth addressing? Probably, for most of those businesses, it isn't. If your entire business runs off one PC in the back room and 50% of what's running on that PC is stuff that you don't need to be running, that's a problem, but it might not be one that you ever decide to bother doing anything about, if you actually only need 5% of it. So you don't care about the inefficiency. Yes, it's a security hole. Yes, you're wasting energy that you don't need to be wasting. Yes, when you make changes, they might be harder than they need to be. But still, none of those might be in your top 10 worries.
Anne Currie (31:35)
Yeah, it's true. It really depends on whether or not you really need that data that you're holding to be secure, which is not necessarily the case for all businesses. Sometimes it just doesn't matter.
Something you talk a lot about, which was a big part of your job, was strategy for performance, strategy for security, strategy for resilience.
You need to know your business and know what matters and what doesn't matter, so that you know where it's really important to put your attention, and where it actually doesn't matter so much.
Jon Berger (32:17)
Yeah. Yeah. I mean, just at a basic level: are you more obsessed with growing your top line, or is it the bottom line that matters? Are you trying to just add more customers, because at the end of the day you're in a growth phase, you're in a growth business, and the only important way that success is being measured is more customers, or customers spending more money, or more top-line revenue? In which case you're probably not going to be worrying about efficiency, unless you're in an area where that efficiency is what drives customers. Or are you in a relatively stable environment where you've got limited ability to drive that top line, and becoming more efficient can significantly increase your margins or lower your risks? In which case that might be a very sensible thing for you to be doing.
Anne Currie (33:28)
Now, on that point, we have now been talking for 40 minutes and it's been quite a dense discussion. So I feel that at this point we should, for the time being, draw this conversation to a close and let people have a think about what we said. As I say, Jon is very conveniently cached for this podcast and therefore will be on it quite often, so we can continue this conversation.
So I hope you've enjoyed the episode today. Thank you very much, Jon, for being a guest on Asynchronous and Unreliable. And I will be seeing you, and all you listeners, again, hopefully very soon on a future episode. Thank you very much and goodbye.