Introducing a new Coursera course on Site Reliability Engineering

Our Customer Reliability Engineering (CRE) team is on a mission to help every business become more reliable by making it easy to adopt Site Reliability Engineering (SRE). SRE is a discipline founded here at Google that uses prescriptive methods and principles for building and running reliable systems. Through CRE, we work with customers and partners to help you reduce the operational burden of your systems, become more agile, and run reliable services for your users and customers.

We want to make sure that teams everywhere can adopt SRE and implement these principles. That’s why we’re pleased to introduce a new Coursera course dedicated to helping you get started with SRE. The new course, Site Reliability Engineering: Measuring and Managing Reliability, distills years of collective Google SRE experience designing and managing complex systems that meet their reliability targets. We’re making it easy for developers to start learning the basics of SRE, and helping the larger SRE community continue on its journey. You’ll learn at your own pace and find insight whether you’re a new or experienced SRE.

Some of the terms and concepts you’ll learn include:

  • How to describe and measure the desired reliability of a service
  • What it means to operate reliably
  • What SLOs, SLIs and SLAs are
  • What error budgets are and how to use them
  • How to measure against your metrics and assess whether they’re realistic
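
As a rough illustration of the error budget idea listed above, here is a minimal Python sketch. It assumes the common definition in which the error budget is the fraction of events allowed to fail under an SLO (1 minus the SLO target) and the SLI is the ratio of good events to total events. The function names and the numbers are hypothetical and are not taken from the course.

```python
def sli(good_events: int, total_events: int) -> float:
    """Service-level indicator: the fraction of events that were good."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    return good_events / total_events


def error_budget(slo_target: float, total_events: int) -> float:
    """Number of events that may fail while still meeting the SLO."""
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in (0, 1]")
    return (1.0 - slo_target) * total_events


def budget_remaining(slo_target: float, total_events: int, bad_events: int) -> float:
    """Error budget left after subtracting the failures seen so far."""
    return error_budget(slo_target, total_events) - bad_events


if __name__ == "__main__":
    # Hypothetical month: 1,000,000 requests, 250 failures, 99.9% SLO.
    total, bad, target = 1_000_000, 250, 0.999
    print(f"SLI: {sli(total - bad, total):.5f}")
    print(f"Budget: {error_budget(target, total):.0f} failed requests allowed")
    print(f"Remaining: {budget_remaining(target, total, bad):.0f}")
```

A negative result from `budget_remaining` would mean the service has missed its SLO for the period. How a team responds to that is a policy decision, not something this sketch encodes.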

Getting started with Coursera and SRE

In the SRE course, you’ll learn the basics of SRE, including how it came to be part of Google engineering and what kinds of tools SREs use to make decisions. You’ll start by learning about the goals of a reliable system and how those goals relate to user expectations. You’ll also learn about common monitoring practices, the pros and cons of different measurement strategies, and specific recommendations on how to choose your own metrics.

The course also uses a case study to dive into the details you’ll need to build your own set of service-level indicators (SLIs) and service-level objectives (SLOs). You’ll see a method for performing risk analysis and learn how to incorporate those findings into your long-term reliability goals. You’ll also cover documenting SLOs and assigning responsibilities, so that you set up a sustainable SRE practice.

Get started today with SRE on Coursera as the next step in your SRE journey!