Guest: Niki Manoledaki
In the third episode of Asynchronous and Unreliable, Anne and Niki Manoledaki of Grafana Labs discuss the highest end of operational efficiency, GreenOps, and FinOps, using purely open-source solutions including Kubernetes, Kepler, and the Vertical Pod Autoscaler. What can you do with them, and what vital technology do you need in place before you attempt it? Real-life stories!
Watch on YouTube
Listen on Spotify
Listen on Apple Podcasts
Read shownotes & transcript below
Join us for a deep dive into the cutting-edge practices of sustainable software and operational efficiency with Niki Manoledaki, senior software engineer at Grafana Labs. Discover how sophisticated data center automation, Kubernetes best practices, and innovative tools can dramatically cut costs and reduce carbon footprints. This episode offers valuable insights for tech leaders, DevOps engineers, and sustainability advocates aiming to align technology with green principles.
Main Topics:
The concept of GreenOps: integrating FinOps with GitOps for automated resource management
Practical techniques for improving machine utilization and auto-scaling in cloud-native environments
How Kubernetes features like the Vertical Pod Autoscaler and in-place pod resizing enhance resource efficiency
Challenges and solutions for large-scale cost attribution and monitoring with open source tools
Building a culture of openness, experimentation, and ownership to foster sustainable practices
Anne Currie (00:00) Hello and welcome to Asynchronous and Unreliable, a new weekly podcast where we discuss the most interesting ideas in tech. I'm your host, Anne Currie, co-author of Building Green Software and The Cloud Native Attitude, and author of the science fiction Panopticon series. I also offer consultancy, training, and workshops at strategically.green. As for my guest today: to celebrate Earth Day, and arguably it is still Earth Day (we're recording this on Thursday, not Wednesday, but it's still Earth Day in Hawaii), we have sustainability and green software expert Niki, who I first met as one of my many technical contributors to Building Green Software. So, Niki, do you want to introduce yourself?
Niki (00:51) Sure. Hi everyone. And thank you, Anne, for bringing me on the podcast. I'm so excited to be here, along with the really great cohort that you have put together. I'm Niki Manoledaki. I'm a senior software engineer at Grafana Labs on the platform engineering team, and a long-time open-source contributor. I lead a few different projects in the open-source ecosystem, particularly in the Cloud Native Computing Foundation: for example, Kepler, an energy monitoring tool, and previously the Technical Advisory Group for Environmental Sustainability, which has been a big focus of mine. I'm so excited to be here to talk about cloud sustainability. And it is still Earth Day in Hawaii, so it's Earth Day somewhere. I'm mentally in Hawaii right now.
Anne Currie (01:46) Yeah, we all are, I think. I've never been to Hawaii actually. It sounds fantastic.
Niki (01:58) I mean, I was researching the meaning of Earth Day and definitely Hawaii is a beautiful place to think of Earth Day happening. It's beautiful, beautiful nature for sure.
Anne Currie (02:14) One day we'll all go to Hawaii. I'd say it's very appropriate that this has turned into a Hawaii-themed episode. Fortunately we didn't do anything culturally insensitive; we couldn't grab our leis, so you've all been spared the probably culturally inappropriate experience of watching us pretend to be in Hawaii.
So today, being Earth Day, we will be talking about green things. One of the interesting things is that you were a major contributor to the operational chapter of Building Green Software, because ops is your area. You do a lot of interesting work around operational efficiency and aligning the operation of your systems with the availability of renewable power, and you approach it from multiple really interesting angles. One is your involvement in Kepler. The other is your involvement in monitoring and tracking sustainability for systems through your job. So it would be great to talk about both of those things today. I don't know which one you want to talk about first.
Niki (03:50) My day-to-day is cost monitoring, and that's something maybe people don't know about me: monitoring the cost of resources with a lot of granularity. What I generally care about is what we talked about in Building Green Software, the trifecta of cloud sustainability, which I describe as energy, carbon, and cost. On the cost side of things, I know it's a complicated topic when it comes to time, cost, and carbon efficiency in general. But what I do day-to-day is create cost metrics that engineers can use, by tying together pricing and utilization metrics.
That's where it really connects with operational efficiency, because we multiply pricing by utilization. For example, CPU utilization, load balancer utilization, or virtual machine utilization for EC2 instances: different kinds of utilization metrics that we get depending on whether or not a workload is running on Kubernetes. And we do all of this through Prometheus recording rules. So we have a set of cost metrics built from Prometheus queries that we build, maintain, and extend. We add granularity to them and aggregate them in different ways, including team-level attribution. From there we can calculate unit costs, and we can use those unit costs during the release process to show that the unit cost has stayed the same release by release.
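To make that concrete, here is a minimal sketch of what a pricing-times-utilization recording rule could look like. This is not Grafana Labs' actual rule set: `node_hourly_price_usd` and `node_team_ownership` are hypothetical stand-ins for a pricing feed and a node-to-team ownership mapping, and only `node_cpu_seconds_total` is a standard node_exporter metric.

```yaml
# Illustrative Prometheus recording rules; metric and label names are invented.
groups:
  - name: cost-metrics
    rules:
      # Estimated hourly node cost: on-demand price weighted by CPU busy fraction.
      - record: node:estimated_cost_usd_per_hour
        expr: |
          node_hourly_price_usd
          * on (instance)
            (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])))
      # Roll the per-node estimate up to teams via the ownership-mapping metric.
      - record: team:estimated_cost_usd_per_hour
        expr: |
          sum by (team) (
            node:estimated_cost_usd_per_hour
            * on (instance) group_left (team) node_team_ownership
          )
```

The design choice here is that cost becomes just another Prometheus time series, so it can be aggregated, attributed, and alerted on like any other metric.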
Unit economics is, as I was reading in O'Reilly's Cloud FinOps book, the "nirvana" of FinOps. That's where you want to be. You can then use this unit cost to tie production costs to SKUs, stock keeping units, and that's where my team is right now. It's a very exciting place to be. Energy comes into it if you look at cost per watt, which is not something we've achieved yet, at least in my team, but I'm very interested in going in that direction. Especially when it comes to AI: I've been listening to a lot of discussions around the limits of AI scaling, and obviously one of the limiting factors for AI is energy consumption. So a lot of companies are trying to tie together operational efficiency and cost per watt, which I think is where we're heading.
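A unit-cost metric can be layered on top of rules like the ones above. The sketch below assumes the hypothetical `team:estimated_cost_usd_per_hour` rule from the previous example plus an invented throughput counter, `samples_ingested_total`; the "unit" would be whatever your product actually ships.

```yaml
# Illustrative unit-cost rule: dollars per million ingested samples, per team.
groups:
  - name: unit-economics
    rules:
      - record: team:cost_per_million_samples_usd
        expr: |
          team:estimated_cost_usd_per_hour
          / on (team)
            (sum by (team) (rate(samples_ingested_total[1h])) * 3600 / 1e6)
```

Tracked release over release, a flat line on a metric like this is what demonstrates that a change hasn't regressed unit cost.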
Anne Currie (07:18) Yeah, absolutely. There's loads in there, and now we're going to unpack it. I'm going to roll all the way back for our listeners and viewers to the very beginning of this: why FinOps? One of the things that you suggested we put in Building Green Software was a very simple formulation, which I like a lot, which is GreenOps. This was when you were still very heavily involved with the CNCF. The CNCF definition of GreenOps is FinOps plus GitOps, which is effectively automated FinOps.
And the reason GreenOps is all about cost is that cost and carbon are really about the same thing. Operational efficiency and being green are both about cutting the resources you use to do the job. Not cutting the amount of work you're doing, but cutting the resources you use per unit of delivery, whether that's a SKU, an order, or a visit. The resources you're trying to cut are generally hardware, meaning the number of machines you're using, and electricity. Those are the two key resources, and they both have a cost associated with them: the cost of buying, maintaining, and housing the hardware, and then the cost of the electricity.
One of the things it sounds like you are trying to achieve, and I wholeheartedly applaud this, is making comparisons "like for like". The only way you can move forward with improving things is if you can say, "this is better this week than last week". You have to isolate the moment-by-moment changes that would otherwise hide the underlying issue from you. That's why money is often a better metric in the early days of improving your operational efficiency than, for example, carbon: the carbon intensity of the grid varies moment by moment. So it's very hard to tell whether a change you made caused your system to use twice as much electricity and hardware (which is often a decision under the control of the ops or coding team) or whether it was just less sunny than expected this week. Operational monitoring is really about giving you actionable information, prioritizing actionability over moment-by-moment correctness. Is that right?
Niki (10:57) Absolutely. As a platform engineering team, we've decentralized the optimization of resources so that each engineering team can implement these optimizations themselves, because they are the domain experts. At Grafana Labs, there are logs, metrics, and traces. As a platform engineer and a Kubernetes specialist, I might not have much in my toolkit beyond deploying autoscalers like Karpenter and the Vertical Pod Autoscaler, but we do that very well. We've achieved really good utilization efficiency as a platform team using Karpenter, and we maintain the Vertical Pod Autoscaler so that right-sizing is available to engineering teams.
Anne Currie (12:09) I'm going to interrupt you there, because you've introduced concepts that are utterly key to operational efficiency, that is, to using fewer machines to produce the same results. Probably the most fundamental one is utilization. A couple of years back, Gartner said the average enterprise has a machine utilization of 10 to 15% in its data centers. If you look at somebody who's doing things really well, like Google, they have more like 80, maybe even 90% utilization. That means that for a similar job, Google might need roughly an order of magnitude fewer machines, because they're really good at machine utilization. It's a no-brainer to reduce the number of machines and increase utilization to decrease resource usage. Techniques to do that include autoscaling and clever schedulers like Karpenter. Why don't you tell us how autoscalers do it?
Niki (14:42) There are a couple of different concepts we use in Kubernetes: bin packing and right-sizing. Bin packing means scheduling as many pods as possible onto the same node so that we can reach higher utilization, ideally around 80%. Anything above 90% starts to become a performance regression and increases latency, and research shows energy efficiency can degrade at those extremely high levels. So 80% is the target.
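Karpenter, which Niki mentioned earlier, is one of the tools that drives this kind of bin packing. Below is a minimal sketch of a Karpenter v1 NodePool with consolidation enabled; the NodePool and EC2NodeClass names are illustrative, and exact fields vary by Karpenter version.

```yaml
# Illustrative Karpenter NodePool; consolidation repacks pods onto fewer nodes.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default               # assumes an EC2NodeClass named "default" exists
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m            # wait a minute before repacking underutilized nodes
```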
Anne Currie (15:59) Bin packing is a subject of another book I wrote, The Cloud Native Attitude, which is about the modern techniques that allow the cloud to be as profitable as it is. If you have one application per server, that leads to 10% utilization or less. If you wrap each application in an encapsulating layer, like a container, it's more lightweight and quicker to move and start, and then you can bin pack your applications onto physical servers.
Kelsey Hightower used to do an excellent visual representation of this: he would play Tetris live on screen, and he said running a cluster is like playing Tetris live with all the applications you want to run. But his point was that you can't do it manually; if it required people to play Tetris live, you wouldn't do it. Schedulers like Karpenter pack applications onto servers automatically, in a way humans never could. These efficiency levels are only achievable because of programmatic control of data centers.
Niki (18:56) Yes, it should be a must. This kind of smart bin packing using an orchestrator like Kubernetes is a lifesaver for computing at scale. But the maturity model continues: after you've bin packed, the next step is right-sizing. Right-sizing looks at the anatomy of the pod itself.
Anne Currie (19:40) Define a pod for us.
Niki (19:54) It's the smallest deployable unit in Kubernetes. You wrap up your application in a container, and you might have three replicas of it running. You set the CPU and memory resources the application requests when it's scheduled. You might not always get those requests right and end up over-provisioning. The Vertical Pod Autoscaler (VPA) helps with right-sizing by adapting the memory and CPU the pod requests, based on a recommendation algorithm that watches actual usage. It's another smart tool for optimizing resources at that smallest unit.
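For anyone who hasn't seen one, this is roughly what a VPA object looks like. A minimal sketch, assuming a hypothetical Deployment named `my-app`; the min/max bounds are invented for the example.

```yaml
# Illustrative VerticalPodAutoscaler; target and bounds are invented.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "Auto"          # let VPA apply its recommendations automatically
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        minAllowed:
          cpu: 50m
          memory: 64Mi
        maxAllowed:
          cpu: "2"
          memory: 4Gi
```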
Anne Currie (21:50) That's interesting, because right-sizing used to just mean shrinking an oversized VM in the cloud. A good ops team should be doing that all the time, but in practice manual right-sizing reviews rarely happen because ops teams are busy. You're talking about automatic right-sizing. Most people are familiar with horizontal autoscaling, firing up more VMs, but this is a cleverer, zoomed-in version: instead of just adding more VMs, it works at the container level. It's like playing live Tetris where the shapes change size as they fall. That could never be done manually by an operations team. That is what Karpenter and VPA are about: making things smaller and replaying the Tetris screen to fit more in.
Niki (27:19) Yes, and a lot of this can be automated. VPA fits most workloads, but there are exceptions, like workloads with a big utilization spike at startup. VPA used to have to restart the pod to increase its resources, which could lead to endless out-of-memory loops. But Kubernetes recently released in-place pod resizing, so VPA no longer requires a restart to adjust CPU or memory, which helps with those harder edge cases. It took a long time to build in the open-source community, but it's out there now.
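In-place resizing is controlled per container through a `resizePolicy` field (behind the `InPlacePodVerticalScaling` feature gate on older clusters). A minimal sketch, with invented names, of a pod that opts in to restart-free CPU resizing while still restarting on memory changes:

```yaml
# Illustrative pod spec using in-place resize; image and names are invented.
apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  containers:
    - name: app
      image: registry.example.com/my-app:latest
      resizePolicy:
        - resourceName: cpu
          restartPolicy: NotRequired        # CPU can change without a restart
        - resourceName: memory
          restartPolicy: RestartContainer   # this app re-reads its memory limit at start
      resources:
        requests:
          cpu: 250m
          memory: 256Mi
        limits:
          memory: 512Mi
```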
Anne Currie (30:22) Doing this complicated automation is difficult and involves a lot of trial and error, because things change all the time, and you have to be really comfortable with that. VPA is quite a sophisticated autoscaler that has been around for a long time, but I wouldn't start there if you're new to Kubernetes.
Niki (32:04) That's the beauty of a platform engineering team: we can hide that complexity. I've contributed to the VPA code, and it's a very complicated algorithm. But the user interface, just adding a YAML definition, isn't that hard; deploying and understanding it is the challenge. My goal was to make it as simple as possible for product engineers, so they don't have to learn the internals.
Anne Currie (33:53) You're an expert, and you're signaling that while VPA has been hard to use in the past, maybe it's time for people to take another look.
Niki (35:26) I think so, especially with the in-place resize feature. We still have some workloads where we don't use VPA, like our large Prometheus instances. For those, we use a cron job with Argo CD to right-size periodically, rather than at VPA's quicker cadence.
Anne Currie (36:51) Grafana is clearly at the sophisticated end of programmatic infrastructure, pushing down costs and carbon emissions per unit. Most enterprises are still thinking about maybe running a "thrift-a-thon". It's interesting that you're using VPA in the wild for almost everything. We rely on companies like yours to push through the horrendous edge cases so ordinary enterprises can eventually do it too.
Niki (40:12) We're definitely pushing the limits. We've deployed Karpenter, VPA, and event-driven autoscaling with KEDA. Now we're gamifying cost optimization, because efficiency doesn't end with Kubernetes. We used to use OpenCost for cost monitoring, which attributes the cost of each resource. But OpenCost didn't scale for us; in some large clusters it actually took down the Kubernetes API because of the number of queries it made. So we built our own open-source tool, called Cloud Cost Exporter.
Anne Currie (43:55) You are absolutely demonstrating the classic behavior of an early adopter. You mentioned that attribution is the name of the game here. This isn't about blame; it's about delegating ownership to the people who have the power to make changes. It requires a mature culture where attribution equals ownership, not blame. When I wrote The Cloud Native Attitude, the one thing consistent across everyone doing well wasn't a specific piece of technology: it was an iterative, trial-and-error mindset.
Niki (48:07) Our culture is very engineering-driven. Open source is a core value for us, which fosters openness and a willingness to share. Our CTO, Tom Wilkie, was a Prometheus and Kubernetes maintainer himself, which is a huge contributor to how we operate today. We innovate in the open.
Anne Currie (51:06) We've talked for 50 minutes now about some very complicated high-end ops. But the takeaway is that operational efficiency is green. You do not have to be as sophisticated as Grafana to start. Be open, accept attribution as a useful tool, and iterate on units of cost.
Niki (53:04) I love decentralizing this information. My role as a platform engineer is to create "golden paths" for other engineers. I can multiply my environmental impact by sharing these best practices. We could talk more about unit economics and AI another time.
Anne Currie (54:23) Thank you very much indeed for being on the podcast and showing us what the highest end of operational efficiency is. All the tools you use are open source and available to everybody.
Niki (56:21) Everything is achievable. I feel like I'm clearing a path and showing others how to follow along.
Anne Currie (57:00) Token maxing versus token ops was the subject of a podcast I did with Sara Bergman that will come out imminently. We can talk about tokenomics next time you are on. You've given me a hell of a lot of stuff to edit to get this out while it's still Earth Day somewhere in the world.
Niki (57:51) It's Earth Day somewhere in the world! It's Earth Day every day for us. I'm so excited to come back and talk more about tokenomics and everything else.
Anne Currie (58:18) Well, I look forward to it. Thank you to all our watchers and listeners, and hopefully I will see you again on a future episode of Asynchronous and Unreliable. Goodbye.