Learning—and teaching—the art of service-level objectives — CRE Life Lessons

Avid readers of CRE Life Lessons blog posts (there are dozens of us!) will appreciate the value of well-tuned service-level indicators (SLIs) and service-level objectives (SLOs). These concepts are fundamental building blocks of a site reliability engineering (SRE) practice. After all, how can you have a meaningful discussion about the reliability you want your services to achieve without properly measuring that reliability?

The Customer Reliability Engineering (CRE) team has helped many of Google Cloud’s customers create their first SLOs and better understand the reliability of their services. We want to make sure that teams everywhere can implement these principles. We’re pleased to announce that we’re making all the materials for our Art of SLOs workshop freely available under the Creative Commons CC-BY 4.0 license for anyone to use and re-use—as long as Google is credited as the original author. We’ve been inviting customers from around the world to this workshop for the past year. From now on, anyone can run their own version of this workshop, to teach their coworkers, their customers, or conference attendees why all services need SLOs.

What’s covered in the Art of SLOs

The Art of SLOs teaches the essential elements of developing SLOs to an audience from across the realms of development, operations, product, and business. The workshop slides are accompanied by a 28-page participant handbook, which is part reference and part background material for the practical problems tackled during the workshop.

In the workshop, we start by making a business case for the value of SLOs based on two fundamental assertions. First, that reliability is the most important feature of any service, and second, that 100% is the wrong reliability target for basically everything. These assertions underlie the concept of an error budget, a non-zero quantity of allowable errors in a given time window that arises from an SLO target set somewhere just short of 100%. The tension between a fast pace of innovation and service reliability can be resolved by aiming to roll out new features as fast as possible without exhausting this error budget.
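To make the error budget idea concrete, here is a minimal sketch of the arithmetic, assuming an event-based SLO (a target proportion of good events in a window). The function names `error_budget` and `budget_remaining` and the sample numbers are illustrative assumptions, not taken from the workshop materials.

```python
import math
from fractions import Fraction


def error_budget(slo_target: str, total_events: int) -> int:
    """Return how many bad events the SLO tolerates in the window.

    slo_target is passed as a decimal string (e.g. "0.999") so that
    Fraction can represent it exactly and avoid float rounding.
    """
    target = Fraction(slo_target)
    if not 0 < target < 1:
        # A 100% target leaves no budget at all, which is the point of
        # setting the target somewhere short of 100%.
        raise ValueError("SLO target must be strictly between 0 and 1")
    return math.floor((1 - target) * total_events)


def budget_remaining(slo_target: str, total_events: int, bad_events: int) -> Fraction:
    """Return the unspent share of the error budget (negative if overspent)."""
    budget = error_budget(slo_target, total_events)
    if budget == 0:
        raise ValueError("window too small for this target: budget is zero")
    return 1 - Fraction(bad_events, budget)


if __name__ == "__main__":
    # Hypothetical window: 1,000,000 requests against a 99.9% target.
    print(error_budget("0.999", 1_000_000))           # 1000
    print(budget_remaining("0.999", 1_000_000, 250))  # 3/4
```

With these made-up numbers, a 99.9% target over a million requests allows 1,000 failures. Having seen 250 so far, three quarters of the budget is still available to spend on rolling out new features.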

Once everyone is (hopefully) convinced that SLOs are a Good Thing, we explain how to choose good SLIs from the wealth of telemetry generated by a service running in production, and introduce the SLI equation, our recommended way of expressing any SLI. We cover two alternative ways of setting your first SLO targets, which arise from making different tradeoffs, and offer advice on how to converge these targets over time. We introduce a hands-on example, the server-side infrastructure supporting a fictional mobile game called Fang Faction, and use it to demonstrate the process of refining an SLI from a simple, generic specification to a concrete implementation that could be measured by a monitoring system.
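As a rough illustration of what an SLI expressed as an equation can look like, the sketch below assumes the commonly used form: the proportion of valid events that were good, expressed as a percentage. The request records, endpoint paths, the `/healthz` exclusion, and the 200 ms threshold are all hypothetical stand-ins for Fang Faction telemetry, not values from the workshop.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    path: str
    status: int
    latency_ms: float


def sli(good: int, valid: int) -> float:
    """Assumed SLI form: good events / valid events, as a percentage."""
    if valid == 0:
        raise ValueError("no valid events in the window")
    return 100.0 * good / valid


def valid_requests(requests: list[Request]) -> list[Request]:
    # Assumption: health checks are not user traffic, so they are not valid events.
    return [r for r in requests if r.path != "/healthz"]


def availability_sli(requests: list[Request]) -> float:
    """Share of valid requests that did not fail with a server error."""
    valid = valid_requests(requests)
    good = [r for r in valid if r.status < 500]
    return sli(len(good), len(valid))


def latency_sli(requests: list[Request], threshold_ms: float) -> float:
    """Share of valid requests served within the latency threshold."""
    valid = valid_requests(requests)
    good = [r for r in valid if r.latency_ms <= threshold_ms]
    return sli(len(good), len(valid))


# Hypothetical sample of request logs.
SAMPLE = [
    Request("/api/profile", 200, 120.0),
    Request("/api/leaderboard", 200, 450.0),
    Request("/api/profile", 503, 900.0),
    Request("/api/leaderboard", 200, 90.0),
    Request("/healthz", 200, 5.0),
]

if __name__ == "__main__":
    print(availability_sli(SAMPLE))    # 75.0
    print(latency_sli(SAMPLE, 200.0))  # 50.0
```

Deciding which events count as valid and which count as good is where a generic specification becomes a concrete implementation. Here both SLIs share one definition of valid and differ only in what they call good.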

Art (noun): A skill at doing a specified thing, typically one acquired through practice.

Critically, participants put this newly acquired knowledge to practical use straight away, as they develop more SLIs and SLOs for Fang Faction. Typically, when we run this workshop with customers, we break them up into groups of eight or so and unleash them on the workshop problems for 90 minutes. Each group is paired with an experienced SRE volunteer, who facilitates the discussion, encourages participation, and keeps the group on track.

Run your own SLO workshop!

If this sounds interesting, you’ll want to check out the Facilitators Handbook, which has a lot more information on how to organize an Art of SLOs workshop. If you don’t have a whole team to educate, you might be interested in our Measuring and Managing Reliability course on Coursera, which is a more thorough, self-paced dive into the world of SLIs, SLOs and error budgets.