The service mesh era: Using Istio and Stackdriver to build an SRE service
Just to recap, so far our ongoing series about the Istio service mesh we’ve talked about the benefits of using a service mesh, using Istio for application deployments and traffic management, and how Istio helps you achieve your security goals. In today’s installment, we’re going to dig further into monitoring, tracing, and service-level objectives. The goal of this post is to demonstrate how you can use Istio to level up your own Site Reliability Engineering (SRE) practices for workloads running in Kubernetes. You can follow along in this post with the step-by-step tutorial here.
The pillars of SRE
At Google, we literally wrote the book on SRE, and it has now become an industry term; but let’s quickly review what the term really means to us at Google. The goal of SRE is to improve service reliability and performance and, in turn, the end-user experience. Conceptually, that means proactively managing and incorporating three main components: service level objectives (SLOs), service level agreements (SLAs), and service level indicators (SLIs). We can summarize these as follows:
- SLOs: targets you set for overall service health
- SLAs: promises you make about your service’s health (so, they often include specific SLOs)
- SLIs: metrics that you use to define the SLO targets
How do we take these ideas from conceptual to practical? To provide guarantees about your service (SLAs), you need to set targets (SLOs) that incorporate several key service metrics (SLIs). That’s where Istio and Stackdriver come in.
Surfacing application metrics with Stackdriver Monitoring
In our second post, we talked about how Google Kubernetes Engine (GKE), Istio, and Stackdriver are integrated right out of the box. This means that Stackdriver Monitoring gives you the ability to monitor a dozen Istio-specific metrics without any special configuration or setup. These include metrics for bytes sent and received, request counts, and roundtrip latencies, for both clients and servers. Once you create a Stackdriver Workspace, you can immediately head to the Metrics Explorer and start visualizing those metrics from Istio. Without any manual instrumentation, Istio provides a significant amount of telemetry information for your workloads—enough to begin thinking about which of those metrics (Istio-provided or GCP-provided) could make for useful SLIs.
Which SLIs make the most sense will depend on your application and deployments, but for Istio-enabled workloads we typically recommend creating Dashboards that include some combination of GKE cluster resource monitoring (node availability, CPU, RAM) along with service request counts and service request/response latency, broken out by Kubernetes Namespaces and/or Pods. The example Dashboard below provides a combined overview of cluster and service health (see the tutorial here for steps to set up your own Dashboard).
After identifying the appropriate SLIs for your deployment, the next step is to create alerting policies that notify you or your team about any problems in your deployment. Alerting policies in Stackdriver are driven by metrics-based conditions that you define as part of the policy. In addition, you can combine multiple metrics-based conditions to trigger alerts when any or all of the conditions are met.
With a working metrics dashboard and alerting policies in place, you’re now at a point where you can keep track of the health of each of your services. But what happens when you see an alert? What if it turns out that one of your services has a server response latency that’s much higher than expected—and that it’s happening on a pretty regular basis? The good news is that now you know there’s a problem; but now the challenge is tracking it down.
Digging into requests using Stackdriver Trace
So far we’ve been talking about monitoring, but Istio’s telemetry support also includes the ability to capture distributed tracing spans directly from individual services. Distributed tracing allows you to track the progression of a single user-driven request, and follow along as it is handled by other services in your deployment.
Once the Stackdriver Trace API is enabled in your GCP project, Istio’s telemetry capture components start sending trace data to Stackdriver, where you can view it in the trace viewer. Without instrumenting any of your services or workloads, Istio captures basic span information, like HTTP requests or RPCs.
This is a good start, but to truly diagnose our example (higher than expected server response latency) we’ll need more than just the time it takes to execute a single service call. To get that next level of information, you need to instrument your individual services so that Istio (and by extension, Stackdriver) can show you the complete code path taken by the service called. Using OpenCensus tracing libraries, you can add tracing statements to your application code. We recommend instrumenting tracing for critical code paths that could affect latency, for example, calls to databases, caches, or internal/external services. The following is a Python example of tracing within a Flask application:
We instrumented our sample microservices demo using OpenCensus libraries. Once you’ve deployed that app and the built-in load generator has had a chance to generate some requests, you can head over to Stackdriver Trace to examine one of the higher latency service calls.
As you can see in the diagram above, Stackdriver Trace lets you examine the complete code path and determine the root of the high latency call.
Examining application output using Stackdriver Logging
The final telemetry component that Istio provides is the ability to direct logs to Stackdriver Logging. By themselves, logs are useful for examining application status or debugging individual functions and processes. And with Istio’s telemetry components sending metrics, trace data, and logging output to Stackdriver, you can tie all of your application’s events together. Istio’s Stackdriver integration allows you to quickly navigate between monitoring dashboards, request traces, and application logs. Taken together, this information gives you a more complete picture of what your app is doing at all times, which is especially useful when an incident or policy violation occurs.
Stackdriver Logging’s integration comes full circle with Stackdriver Monitoring by giving you the ability to create metrics based on structured log messages. That means you can create specific log-based metrics, then add them to your monitoring dashboards right alongside your other application monitoring metrics. And Stackdriver Logging also provides additional integrations with other parts of Google Cloud—specifically, the ability to automatically export logs to Cloud Storage or BigQuery for retention and follow-on ad-hoc analysis, respectively. Stackdriver Logging also supports integration with Cloud Pub/Sub where each output log entry is exported as an individual pub/sub message, which can then be analyzed in real-time using Cloud Dataflow or Cloud Dataproc.
Coming soon: SLOs and service monitoring using Stackdriver
So far we’ve reviewed the various mechanisms Stackdriver provides to assess your application’s SLIs; and now available for early access, Stackdriver will provide native support for setting SLOs against your specific service metrics. That means you will be able to set specific SLO targets for the metrics you care about, and Stackdriver will automatically generate SLI graphs, and track your target compliance over time. If any part of your workload violates your SLOs, you are immediately alerted to take action.
SRE isn’t about tools; it’s a lifestyle
Think of SRE as a set of practices, and not as a specific set of tools or processes. It’s a principled approach to managing software reliability and availability, through the constant awareness of key metrics (SLIs) and how those metrics are measured against your own targets (SLOs)—which you might use to provide guarantees to your customers (via SLAs). When you combine the power of Istio and Stackdriver and apply it to your own Kubernetes-based workloads, you end up with an in-depth view of your services and the ability to diagnose and debug problems before they become outages.
As you can see, Istio provides a number of telemetry features for your deployments. And when combined with deep Stackdriver integration, you can develop and implement your own SRE practices.
We haven’t even begun to scratch the surface on defining SRE and these terms so we’d recommend taking a look at SRE Fundamentals: SLIs, SLAs, and SLOs as well as SLOs, SLIs, SLAs, oh my – CRE life lessons for more background.
To try out the Istio and Stackdriver integration features we discussed here, check out the tutorial here. In our next post in the Service Mesh era series, we’ll take a deep-dive into Istio from an IT perspective and talk about some practical operator scenarios, like maintenance, upgrades, and debugging Istio itself.