Early on, we didn’t have much in the way of choice. We could not hire SREs as fast as the demand for them, so there was always scarcity. I used that to our advantage by simply saying, “We will assign SREs to the places where they’re going to do the most good”.
The solution that we have in SRE, and it’s worked extremely well, is an error budget. An error budget stems from this basic observation: 100% is the wrong reliability target for basically everything. Perhaps a pacemaker is a good exception! But, in general, for any software service or system you can think of, 100% is not the right reliability target, because no user can tell the difference between a system being 100% available and, let’s say, 99.999% available. Typically there are so many other things sitting between the user and the software service you’re running that the marginal difference is lost in the noise of everything else that can go wrong.
If 100% is the wrong reliability target for a system, what, then, is the right reliability target for the system? I propose that’s a product question. It’s not a technical question at all. It’s a question of what the users will be happy with, given how much they’re paying, whether directly or indirectly, and what their alternatives are.
The business or the product must establish what the availability target is for the system. Once you’ve done that, one minus the availability target is what we call the error budget; if the target is 99.99% available, that means the system can be 0.01% unavailable. We are allowed to have 0.01% unavailability, and this is a budget. We can spend it on anything we want, as long as we don’t overspend it.
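A minimal sketch of that arithmetic, assuming availability is measured over a 30-day window. The window length and the function names are illustrative assumptions, not something specified here:

```python
def error_budget(availability_target: float) -> float:
    """Return the error budget as a fraction: one minus the availability target."""
    if not 0.0 < availability_target <= 1.0:
        raise ValueError("availability target must be in (0, 1]")
    return 1.0 - availability_target


def allowed_downtime_minutes(availability_target: float, window_days: int = 30) -> float:
    """Translate the budget into minutes of full outage over the window.

    The 30-day default is an assumption made for illustration.
    """
    window_minutes = window_days * 24 * 60
    return error_budget(availability_target) * window_minutes


budget = error_budget(0.9999)
downtime = allowed_downtime_minutes(0.9999, window_days=30)
print(f"budget: {budget:.4%} of the window, about {downtime:.2f} minutes")
```

Under these assumptions, a 99.99% target leaves roughly 4.32 minutes of complete unavailability per 30 days to spend.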
So what do we want to spend the error budget on? The development team wants to launch features and get users. So ideally, we would spend all of our unavailability budget taking risks with things we launch in order to get them launched quickly. This basic premise describes the whole model. As soon as you conceptualize SRE activities in this way, then you say, oh, okay, so having things that do phased rollout or 1% experiments, all these are ways of putting less of our unavailability budget at risk, so that we can take more chances with our launches, so that we can launch more quickly. Sounds good.
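To make the point about 1% experiments concrete, here is a rough sketch with made-up numbers: a hypothetical bad launch that fails every request it touches for 10 minutes, against a 99.99% target measured over an assumed 30-day window with uniform traffic. The function name and all parameters are illustrative assumptions:

```python
def budget_spent_fraction(
    exposure: float,        # fraction of traffic that sees the new release (0..1)
    failure_rate: float,    # fraction of exposed requests that fail (0..1)
    bad_minutes: float,     # how long the bad release stays live
    slo: float = 0.9999,    # availability target (assumed)
    window_days: int = 30,  # measurement window (assumed)
) -> float:
    """Return the share of the window's error budget consumed by one bad launch.

    Assumes traffic is uniform over time, so lost availability can be
    expressed as "equivalent minutes of full outage".
    """
    budget_minutes = (1.0 - slo) * window_days * 24 * 60
    lost_minutes = exposure * failure_rate * bad_minutes
    return lost_minutes / budget_minutes


full_rollout = budget_spent_fraction(exposure=1.0, failure_rate=1.0, bad_minutes=10)
one_percent = budget_spent_fraction(exposure=0.01, failure_rate=1.0, bad_minutes=10)
print(f"full rollout: {full_rollout:.0%} of budget, 1% experiment: {one_percent:.1%}")
```

With these numbers the full rollout overspends the budget more than twice over, while the same bug behind a 1% experiment costs a few percent of it, which leaves room for many more launches.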
Outside Google, we often observe that there isn’t parity of esteem between the SWE and operations teams, which combines poorly with the fact that they often have different incentives. That’s how we end up with the model that exists in the industry today, where SWE teams write something and throw it over a wall to the operations teams, who then try to make it work, and can’t, and throw it back, and so on.
This approach also has another good consequence: if the service natively sits there and throws errors 0.01% of the time, you’re blowing your entire unavailability budget on something that gets you nothing. So both the development team and the SRE team have an incentive to improve the service’s native stability, so that you’ll have budget left to spend on things you do want, like feature launches.
The other crucial advantage of this is that SRE no longer has to apply any judgment about what the development team is doing. SRE measures and enforces, but we do not assess or judge. Our take is “As long as your availability as we measure it is above your Service Level Objective (SLO), you’re clearly doing a good job. You’re making accurate decisions about how risky something is, how many experiments you should run, and so on. So knock yourselves out and launch whatever you want. We’re not going to interfere.” And this continues until you blow the budget.
Once you’ve blown the budget, we don’t know how well you’re testing. There can be a huge information asymmetry between the development team and the SRE team about features, how risky they are, how much testing went into them, who the engineers were, and so on. We don’t generally know, and it’s not going to be fruitful for us to guess. So we’re not even going to try. The only sure way that we can bring the availability level back up is to stop all launches until you have earned back that unavailability. So that’s what we do. Of course, such an occurrence happens very rarely, but it does happen. We simply freeze launches, other than P0 bug fixes — things that by themselves represent improved availability.
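The measure-and-enforce rule described above can be sketched as a simple gate. This is an illustration under stated assumptions, not a description of any real tooling: `ServiceWindow`, `launch_allowed`, and the request-based availability measure are hypothetical names and choices:

```python
from dataclasses import dataclass


@dataclass
class ServiceWindow:
    """Measurements for one SLO window (hypothetical structure)."""
    slo: float            # availability target, e.g. 0.9999
    total_requests: int
    failed_requests: int

    @property
    def availability(self) -> float:
        # Request-based availability; with no traffic, nothing has failed.
        if self.total_requests == 0:
            return 1.0
        return 1.0 - self.failed_requests / self.total_requests


def launch_allowed(window: ServiceWindow, is_p0_fix: bool = False) -> bool:
    """Allow launches while measured availability meets the SLO.

    No judgment about the launch itself is applied. Once the budget is
    blown, only P0 bug fixes go out, since those improve availability.
    Treating "exactly at the SLO" as still within budget is an assumption.
    """
    if is_p0_fix:
        return True
    return window.availability >= window.slo


healthy = ServiceWindow(slo=0.9999, total_requests=1_000_000, failed_requests=50)
blown = ServiceWindow(slo=0.9999, total_requests=1_000_000, failed_requests=150)

print(launch_allowed(healthy))                # feature launch within budget
print(launch_allowed(blown))                  # feature launch after overspend
print(launch_allowed(blown, is_p0_fix=True))  # P0 fix during a freeze
```

The gate takes only measurements as input, which mirrors the point that SRE does not need to know how risky a feature is or how well it was tested.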
This has two nice effects. One, SRE isn’t in this game of second-guessing what the dev team is doing. So there’s no need for an adversarial relationship or information hiding or anything else. That’s what you want.
The other thing, which is one I hadn’t anticipated but turns out to be really important, is, once the development team figures out that this is how the game works, they self-police. Often, for the kind of systems we run at Google, it’s not one development team; it’s a bunch of small development teams working on different features. If you think about it from the perspective of the individual developer, such a person may not want a poorly tested feature to blow the error budget and block the cool launch that’s coming out a week later. The developer is incentivized to find the overall manager of the teams to make sure that more testing is done before the launch. Generally there’s much less information asymmetry inside the development team than there is between the development and SRE teams, so they are best equipped to have that conversation.