All together now: our operations products in one place

Our suite of operations products has come a long way since the acquisition of Stackdriver back in 2014. The suite has constantly evolved with significant new capabilities since then, and today we reach another important milestone with complete integration into the Google Cloud Console. We’re now saying goodbye to the Stackdriver brand, and announcing an operations suite of products, which includes Cloud Logging, Cloud Monitoring, Cloud Trace, Cloud Debugger, and Cloud Profiler. 

The Stackdriver functionality you’ve come to depend on isn’t changing. Over the years, these operations products have seen strong growth in usage, not just by application developers and DevOps teams, but also by IT operators and security teams. Complete integration of the products into the Cloud Console, along with in-context presence on the key service pages themselves—like the integrations into the Compute Engine, Google Kubernetes Engine, Cloud SQL, and Dataflow management consoles—brings a great experience to all users. Operations tasks are now a quick click away, so users can act without losing the context of whatever they were doing—a seamless operations journey.

In addition to this console integration, we’re very happy to share some of the progress in our products, with lots of exciting features launching today. 

Cloud Logging

Continuing with our goal to build easy-to-use products, we have completely overhauled the Logs Viewer and will be rolling it out to everyone over the next week, making it even easier for you to quickly identify and troubleshoot issues. We’re also pleased to announce that the ability to customize how long logs are retained, for up to 10 years, is now available in beta. With the new Cloud Logging user interface and extended log retention, you can search logs quickly and identify trends and correlations. We also understand that in some cases it is very useful to export logs from Cloud Logging to other locations like BigQuery, Cloud Storage, or even third-party log management systems. To make this easier, we are making the Logs Router generally available. Similarly, data from Cloud Trace can also be exported to BigQuery. The Logs Router’s support for customer-managed encryption keys (CMEK) also makes it a good fit for environments that need to meet that security requirement for compliance or other purposes.
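As a minimal sketch of the Logs Router in action (the sink name, dataset, and filter below are placeholders), routing audit logs to a BigQuery dataset is a single sink definition:

    # Create a sink that routes matching log entries to a BigQuery dataset
    gcloud logging sinks create audit-to-bq \
        bigquery.googleapis.com/projects/my-project/datasets/audit_logs \
        --log-filter='logName:"cloudaudit.googleapis.com"'

The command reports the sink’s writer identity, a service account that you then grant write access on the destination dataset.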

Cloud Monitoring

Our biggest change of all that you’ll see in the console is Cloud Monitoring, as this was the last Stackdriver product to migrate over to our Google Cloud Console. You’ll now find a better designed, easy-to-navigate site, and important new features targeted to make your life easier. We are increasing our metrics retention to 24 months and writing metrics at up to 10-second granularity. The increased granularity is especially useful when making quick operational decisions like load balancing, scaling, etc. Like Cloud Logging, you can now access what you need more quickly, as well as prepare for future troubleshooting with longer retention. 

An additional key launch is the Dashboards API, which lets you develop a dashboard once and reuse it across other Workspaces and environments. You might also notice better metric recommendations: the most popular metrics for a given resource type now surface at the top of the selection list. This draws on what millions of Google Cloud users actually chart for each resource type, so the metrics you’re most likely to need appear first.

This release also adds Pub/Sub support for routing alerts to external systems, continuing to broaden the set of operational tools you can connect to Cloud Monitoring. To keep up with the needs of some of our largest users, we are also expanding support to hundreds of projects within a Workspace—providing a single point of control and a single management interface for multiple projects. Stay tuned for more details about all of these new capabilities in a series of blog posts over the next few weeks.

2020 will continue to see momentum for our operations suite of products, and we’re looking forward to the road ahead as we continue to help developers and operators across the world to manage and troubleshoot issues quickly and keep their systems up and running.

Learn more about the operations suite here.

Making your monolith more reliable

In cloud operations, we often hear about the benefits of microservices over monolithic architecture. Indeed, microservices help teams cope with hardware being abstracted away and push developers toward resilient, distributed designs. However, many enterprises still have monolithic architectures that they need to maintain.

For this post, we’ll use Wikipedia’s definition of a monolith: “A single-tiered software application in which the user interface and data access code are combined into a single program from a single platform.”

When and why to choose monolithic architecture is usually a matter of what works best for each business. Whatever the reason for using monolithic services, you still have to support them. They do, however, bring their own reliability and scaling challenges, and that’s what we’ll tackle in this post. At Google, we use site reliability engineering (SRE) principles to ensure that systems run smoothly, and these principles apply to monoliths as well as microservices. 

Common problems with monoliths

We’ve noticed some common problems that arise in the course of operating monoliths. Particularly, as monoliths grow (either scaling with increased usage, or growing more complex as they take on more functionality), there are several issues we commonly have to address:

  • Code base complexity: Monoliths contain a broad range of functionality, meaning they often have a large amount of code and dependencies, as well as hard-to-follow code paths, including RPC calls that are not load-balanced. (These RPCs may call back into the same binary, or between different instances of the binary if the data is sharded.)

  • Release process difficulty: Frequently, monoliths consist of code submitted by contributors across many different teams. With more cooks in the kitchen and more code being cooked up every release cycle, the chances of failure increase. A release could fail QA or fail to deploy into production. These services often have difficulty reaching a mature state of automation where we can safely and continuously deploy to production, because the services require human decision-making to promote them into production. This puts additional burden on the monolith owners to detect and resolve bugs, and slows overall velocity.

  • Capacity: Monolithic servers typically serve many different types of requests, and different requests consume different amounts of compute resources—CPU, memory, storage I/O, and so on. For example, an RDBMS-backed server might handle view-only requests that read from the database and are reasonably cacheable, but may also serve RPCs that write to the database, which must be committed before returning to the user. The impact on CPU and memory consumption can vary greatly between these two. Let’s say you load-test and determine your deployment handles 100 queries per second (qps) of your typical traffic. What happens if usage or features change, resulting in a higher proportion of expensive write queries? It’s easy to introduce these changes—they happen organically when your users decide to do something different, and they can threaten to overwhelm your system. If you don’t check your capacity regularly, you can gradually end up underprovisioned.

  • Operational difficulty: With so much functionality in one monolithic system, responding to operational incidents becomes higher-stakes. Business-critical code shares a failure domain with low-priority code and features. Our Google SRE guidelines require changes to our services to be safe to roll back. In a monolith with many stakeholders, we need to coordinate rollbacks more carefully than with microservices, since a rollback may revert changes unrelated to the outage, slow development velocity, and potentially cause other issues.

How does an SRE address the issues commonly found in monoliths? The rest of this post discusses some best practices, but these can be distilled down to a single idea: Treat your monolith as a platform. Doing so helps address the operational challenges inherent in this type of design. We’ll describe this monolith-as-a-platform concept to illustrate how you can build and maintain reliable monoliths in the cloud.

Monolith as a platform

A software platform is essentially a piece of software that provides an environment for other software to run. Taking this platform approach toward how you operate your monolith does a couple of things. First, it establishes responsibility for the service. The platform itself should have clear owners who define policy and ensure that the underlying functionality is available for the various use cases. Second, it helps frame decisions about how to deploy and run code in a way that balances reliability with development velocity. 

Having all the monolith code contributors share operational responsibility sets individuals against each other as they try to launch their particular changes. Instead of sharing operational responsibility, however, the goal should be to have a knowledgeable arbiter who ensures that the health of the monolith is represented when designing changes, and also during production incidents.

Scaling your platform

Monoliths that are run well converge on some common best practices. This is not meant to be a complete list and is in no particular order. We recommend considering these solutions individually to see if they might improve monolith reliability in your organization:

  • Plug-in architecture: One way to manifest the platform mindset is to structure your code to be modular, in a way that supports the service’s functional requirements. Differentiate between core code needed by most/all features and dedicated feature code. The platform owners can be gatekeepers for changes to core code, while feature owners can change their code without owner oversight. Isolate different code paths so you can still build and run a working binary with some chosen features disabled.

  • Policies for new code and backends: Platform owners should be clear with the requirements for adding new functionality to the monolith. For example, to be resilient to outages in downstream dependencies, you may set a latency requirement stating that new back-end calls are required to time out within a reasonable time span (milliseconds or seconds), and are only retried a limited number of times before returning an error. This prevents a serving thread from getting stuck, waiting indefinitely on an RPC call to a backend, and possibly exhausting CPU or memory.

    Similarly, you might require developers to load test their changes before committing or enabling a new feature in production, to ensure there are no performance or resource requirement regressions. You may want to restrict new endpoints from being added without your operation team’s knowledge.

  • Bucket your SLOs: For a monolith serving many different types of requests, there’s a tendency to define a new SLI and SLO for each request. As the number of SLOs increases, however, it gets more confusing to track and harder to assess the impact of error budget burn for one SLO vs. all the others. To overcome this issue, try bucketing requests based on the similarity of the code path and performance characteristics. For example, we can often bucket latency for most “read” requests into one group (usually lower latency), and create a separate SLO bucket for “write” requests (usually higher latency). The idea is to create groupings that indicate when your users are suffering from reliability issues.

    Which team owns a particular SLO, and whether an SLO is even needed for each feature, are important considerations. While you want your on-call engineer to respond to business-critical outages, it’s fine to decide that some parts of the service are lower-priority or best-effort, as long as they don’t threaten the overall stability of the platform.

  • Set up traffic filtering: Make sure you have the ability to filter traffic by various characteristics, using a web application firewall (WAF) or similar method. If one RPC method experiences a Query of Death (QoD), you can temporarily block similar queries, thereby mitigating the situation and giving you time to fix the issue.

  • Use feature flags: As described in the SRE book, giving specific features a knob to disable all or some percentage of traffic is a powerful tool for incident response. If a particular feature threatens the stability of the whole system, you can throttle it down or turn it off, and continue serving all your other traffic safely.

  • Flavors of monoliths: This last practice is important, but should be carefully considered, depending on your situation. Once you have feature flags, it’s possible to run different pools of the same binary, with each pool configured to handle different types of requests. This helps tremendously when a reliability issue requires you to re-architect your service, which may take some time to develop. Within Google, we once ran different pools of the same web server binary to serve web search and image search traffic separately, because performance profiles were so different. It was challenging to support them in a single deployment but they all shared the same code, and each pool only handled its own type of request.
    There are downsides to this mode of operation, so it’s important to approach this thoughtfully. Separating services this way may tempt engineers to fork services, in spite of the large amount of shared code, and running separate deployments increases operational and cognitive load. Therefore, instead of indefinitely running different pools of the same binary, we suggest setting a limited timeframe for running the different pools, giving you time to fix the underlying reliability issue that caused the split in the first place. Then, once the issue is resolved, merge serving back to one deployment. 

Regardless of where your code sits on the monolith-microservice spectrum, your service’s reliability and users’ experience are what ultimately matters. At Google, we’ve learned—sometimes the hard way—from the challenges that various design patterns bring. In spite of these challenges, we continue to serve our users 24/7 by calling to mind SRE principles, and putting these principles into practice.

Introducing the Cloud Monitoring dashboards API

Using dashboards in Cloud Monitoring makes it easy to track critical metrics across time. Dashboards can, for example, provide visualizations to help debug high latency in your application or track key metrics for your applications. Creating dashboards by hand in the Monitoring UI can be a time-consuming process, which may require many iterations. Once dashboards are created, you can save time by using them in multiple Workspaces within your organization. 

Today, we’re pleased to announce that the Cloud Monitoring dashboards API is generally available from Google Cloud. The dashboards API lets you read the configuration for existing dashboards, create new dashboards, update existing dashboards and delete dashboards that you no longer use. These methods follow the REST and gRPC semantics and are consistent with other Google Cloud APIs. 

A common use case for the dashboards API is to deploy a dashboard developed in one Monitoring Workspace into one or more additional Workspaces. For example, you may have a separate Workspace for your development, QA and production environments (learn more on selecting Workspace structures). In one of the environments, you may have developed a standard operational dashboard that you’d like to use across all your Workspaces. By first reading the dashboard configuration via the projects.dashboards.get method, you can save the dashboard configuration and then use the projects.dashboards.create method to create the same dashboard across the other environments.

How the dashboards API works

When creating a dashboard, you have to specify the layout and the widgets that go inside that layout. A dashboard must use one of three layout types: GridLayout, RowLayout or ColumnLayout. 

  • GridLayout divides the available space into vertical columns of equal width and arranges a set of widgets using a row-first strategy.

  • RowLayout divides the available space into rows and arranges a set of widgets horizontally in each row.

  • ColumnLayout divides the available space into vertical columns and arranges a set of widgets vertically in each column.

The widgets available to place inside the layouts include an XyChart, Scorecard and Text object.

  • XyChart: displays data using X and Y axes. Charts created through the Google Cloud Console are instances of this widget.

  • Scorecard: displays the latest value of a metric, and how this value relates to one or more thresholds. 

  • Text: displays textual content, either as raw text or a markdown string. 

Here’s an example of a JSON dashboard configuration that specifies a GridLayout with a single XyChart widget; you can see more examples in our sample dashboards and layouts documentation.
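A minimal sketch of what such a configuration might look like (the display name, chart title, and metric filter are illustrative; see the API reference for the full schema):

    {
      "displayName": "Example dashboard",
      "gridLayout": {
        "columns": "2",
        "widgets": [
          {
            "title": "VM CPU utilization",
            "xyChart": {
              "dataSets": [
                {
                  "timeSeriesQuery": {
                    "timeSeriesFilter": {
                      "filter": "metric.type=\"compute.googleapis.com/instance/cpu/utilization\"",
                      "aggregation": {
                        "alignmentPeriod": "60s",
                        "perSeriesAligner": "ALIGN_MEAN"
                      }
                    }
                  },
                  "plotType": "LINE"
                }
              ]
            }
          }
        ]
      }
    }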

Dashboard configuration as a template

A simple approach to building a dashboard configuration is to first create a dashboard in the Cloud Monitoring console, then use the dashboards API projects.dashboards.get method to export the JSON configuration. Then, you can share that configuration as a template either via source control or however you normally share files with your colleagues.
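For instance, the round trip might look roughly like this from the command line (the project IDs and DASHBOARD_ID are placeholders, and output-only fields such as name and etag typically need to be stripped from the export before re-creating it):

    # Export an existing dashboard's configuration from the dev project
    curl -s -H "Authorization: Bearer $(gcloud auth print-access-token)" \
        "https://monitoring.googleapis.com/v1/projects/dev-project/dashboards/DASHBOARD_ID" \
        > dashboard.json

    # After removing output-only fields, create the same dashboard in another project
    curl -s -X POST \
        -H "Authorization: Bearer $(gcloud auth print-access-token)" \
        -H "Content-Type: application/json" \
        -d @dashboard.json \
        "https://monitoring.googleapis.com/v1/projects/prod-project/dashboards"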

You can try out the dashboards API in the Try this API section of the API documentation, and learn more about managing dashboards by reading the Managing Dashboards documentation. We’re working on features to make the API even more useful, including support in the gcloud command line. Contributors are also discussing and planning a Terraform module for the Monitoring dashboards API on GitHub.


A special thanks to our colleagues David Batelu, Technical Lead, and Joy Wang, Product Manager, Cloud Monitoring, for their contributions to this post.

Logging + Trace: love at first insight

Meet Stackdriver Logging, a gregarious individual who loves large-scale data and is openly friendly to structured and unstructured data alike. Although they grew up at Google, Stackdriver Logging welcomes data from any cloud or even on-prem. Logging has many close friends, including Monitoring, BigQuery, Pub/Sub, Cloud Storage and all the other Google Cloud services that integrate with them. Recently, however, Logging has been looking for a deeper relationship, one that leads to insight.

Now meet Stackdriver Trace, a brilliant and organized being. Trace also grew up at Google and is a bit more particular about data, making sense out of the chaos of distributed systems.

Logging and Trace were brought together by mutual friends, such as Alex Van Boxel, cloud architect at Veepee. “Tracking down performance issues is like solving a murder mystery, having tracing and logging linked together is a big help for the forensics team,” he says. With a strong union, Trace and Logging are a match made in heaven: Developers are able to see exactly what is happening with their code, and how it fits within other services in the ecosystem. 

By embedding logs, Trace is able to show the detail of what happened during a particular service call. Combined with Trace’s ability to show the complete request, this gives you full-stack observability. By adding trace IDs to logs, Logging is able to filter for the logs within a trace and link out to end-to-end traces. You can see not only how your code functions, but the context in which it does.
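Under the hood, this correlation relies on the LogEntry trace field, which holds the trace resource name. If your agent or client library populates that field (the project and trace ID below are made up), you can pull up every log line for a single request with a filter like:

    trace="projects/my-project/traces/0af7651916cd43dd8448eb211c80319c"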

What Trace loves most about Logging

“Logging, you complete me” — Trace 

Complete your workflow by showing logs inline for each service call. In the Trace UI, logs appear in line as events on each service call, so you can understand them in context.

Drill into the logs that relate to a particular service in the logs view. You can understand how your code is operating at a deeper level by linking from the Trace UI right into the relevant log entry in Stackdriver Logging.

Search across the entire request. In the Trace UI, you can filter for labels on any service in the trace, showing logs for a downstream service when an upstream condition is true.

[Animation: moving between Trace and Logging in the Cloud Console]

What Logging loves most about Trace

“Trace, you help me be a better platform.”  — Logging

See logs from the entire request. In the logs viewer, filtering by Trace ID shows you all logs for that specific request.

Drill into a trace of the complete request. In the logs viewer, you can drill into the trace of the complete request right from the log entry of interest, which helps you understand the richer, more complete context. 

Diagnose the root cause of errors. In the Trace UI, you can search for error traces, and easily see which downstream service is responsible for the error.

[Animation: filtering logs by trace ID and opening the related trace in the Cloud Console]

Identifying and tracking toil using SRE principles

One of the key measures that Google site reliability engineers (SREs) use to verify our effectiveness is how we spend our time day-to-day. We want ample time available for long-term engineering project work, but we’re also responsible for the continued operation of Google’s services, which sometimes requires doing some manual work. We aim for less than half of our time to be spent on what we call “toil.” So what is toil, and how do we stop it from interfering with our engineering velocity? We’ll look at these questions in this post.

First, let’s define toil, from chapter 5 of the Site Reliability Engineering book:

“Toil is the kind of work that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.”

Some examples of toil may include:

  • Handling quota requests

  • Applying database schema changes

  • Reviewing non-critical monitoring alerts

  • Copying and pasting commands from a playbook

A common thread in all of these examples is that they do not require an engineer’s human judgment. The work is easy but it’s not very rewarding, and it interrupts us from making progress on engineering work to scale services and launch features.

Here’s how to take your team through the process of identifying, measuring, and eliminating toil.

Identifying toil

The hardest part of tackling toil is identifying it. If you aren’t explicitly tracking it, there’s probably a lot of work happening on your team that you aren’t aware of. Toil often arrives as a request sent over chat or email to an individual, who dutifully completes the work without anyone else noticing. We heard a great example of this from CRE Jamie Wilkinson in Sydney, Australia, who shared this story of his experience as an SRE on a team managing one of Google’s datastore services.

Jamie’s SRE team was split between Sydney and Mountain View, CA, and there was a big disconnect between the achievements of the two sites. Sydney was frustrated that the project work they relied upon—and the Mountain View team committed to—never got done. One of the engineers from Sydney visited the team in Mountain View, and discovered they were being interrupted frequently throughout the day, handling walk-ups and IMs from the Mountain View-based developers. 

Despite regular meetings to discuss on-call incidents and project work, and complaints that the Mountain View side felt overworked, the Sydney team couldn’t help because they didn’t know the extent of these requests. So the team decided to require all the requests to be submitted as bugs. The Mountain View team had been trained to leap in and help with every customer’s emergency, so it took three months just to make the cultural change. Once that happened, they could establish a rotation of people across both sites to distribute load, see stats on how much work there was and how long it took, and identify repetitive issues that needed fixing.

“The one takeaway from this was that when you start measuring the right thing, you can show people what is happening, and then they agree with you,” Jamie said. “Showing everyone on the team the incoming vs. outgoing ticket rates was a watershed moment.”

When tracking your work this way, it helps to gather some lightweight metadata in a tracking system of your choice, such as:

  • What type of work was it (quota changes, push release to production, ACL update, etc.)?

  • What was the degree of difficulty: Easy (<1 hour); Medium (hours); Hard (days) (based on human hands-on time, not elapsed time)?

  • Who did the work?

This initial data lets you measure the impact of your toil. Remember, however, that the emphasis is on lightweight in this step. Extreme precision has little value here; it actually places more burden on your team if they need to capture many details, and makes them feel micromanaged.

Another way to successfully identify toil is to survey your team. Another Google CRE, Vivek Rau, would regularly survey Google’s entire SRE organization. Because the size and shape of toil varied between different SRE teams, at a company-wide level ticket metrics were harder to analyze. He surveyed SREs every three months to identify common issues across Google that were eating away at our time for project work. Try this sample toil survey to start:

  • Averaging over the past four weeks, approximately what fraction of your time did you spend on toil?  

    • Scale 0-100%

  • How happy are you with the quantity of time you spend on toil? 

    • Not happy / OK / No problem at all

  • What are your top three sources of toil?

    • On-call Response / Interrupts / Pushes / Capacity / Other / etc.

  • Do you have a long-term engineering project in your quarterly objectives?

    • Yes / No

  • If so, averaging over the past four weeks, approximately what fraction of your time did you spend on your engineering project? (estimate)

    • Scale 0-100%

  • In your team, is there toil you can automate away but you don’t do so, because that very toil takes time away from long-term engineering work? If so, please describe below.

    • Open response

Measuring toil

Once you’ve identified the work being done, how do you determine if it’s too much? It’s pretty simple: Regularly (we find monthly or quarterly to be a good interval), compute an estimate of how much time is being spent on various types of work. Look for patterns or trends in your tickets, surveys, and on-call incident response, and prioritize based on the aggregate human time spent. Within Google SRE, we aim to keep toil below 50% of each SRE’s time, to preserve the other 50% for engineering project work. If the estimates show that we have exceeded the 50% toil threshold, we plan work explicitly with the goal of reducing that number and getting the work balance back into a healthy state. 

Eliminating toil

Now that you’ve identified and measured your toil, it’s time to minimize it. As we’ve hinted at already, the solution here is typically to automate the work. This is not always straightforward, however, and the aim shouldn’t be to eliminate all toil.

Automating tasks that you rarely do (for example, deploying your service at a new location) can be tricky, because the procedure you used or assumptions you made while automating may change by the time you do that same task again. If a large amount of your time is spent on this kind of toil, consider how you might change the underlying architecture to smooth this variability. Do you use an infrastructure as code (IaC) solution for managing your systems? Can the procedure be executed multiple times without negative side effects? Is there a test to verify the procedure?

Treat your automation like any other production system. If you have an SLO practice, use some of your error budget to automate away toil. Complete postmortems when your automation fails, and fix it as you would any user-facing system. You want your automation available to you in any situation, including production incidents, to free humans to do the work they’re good at.

If you’ve gotten your users familiar with opening tickets to request help, use your ticketing system as the API for automation, making the work fully self-service.

Also, because toil isn’t just technical, but also cultural, make sure the only people doing toil work are the people explicitly assigned to it. This might be your oncaller, or a rotation of engineers scheduled to deal with “tickets” or “interrupts.” This preserves the rest of the team’s time to work on projects and reinforces a culture of surfacing and accounting for toil.

A note on complexity vs. toil

Sometimes we see engineers and leadership mistaking technical or organizational complexity as toil. The effects on humans are similar, but the work fails to meet the definition at the start of this post. Where toil is work that is basically of no enduring value, complexity often makes valuable work feel onerous. 

Google SRE Laura Beegle has been investigating this within Google, and suggests a different approach to addressing complexity: While there’s intense satisfaction in designing a simple, robust system, it inevitably becomes somewhat more complex, simply by existing in a distributed environment, used by a diverse range of users, or growing to serve more functionality over time. We want our systems to evolve over time, while also reducing what we call “experienced complexity”—the negative feelings based on mismatched expectations about how long or difficult a task is to complete. Quantifying the subjective experience of your systems is known by another name: user experience. The users in this case are SREs. The observable outcome of well-managed system complexity is a better user experience.

Addressing the user experience of supporting your systems is engineering work of enduring value, and therefore not the same as toil. If you find that complexity is threatening your system’s reliability, take action. By following a blameless postmortem process, or surveying your team, you can identify situations where complexity resulted in unexpected results or a longer-than-expected recovery time.

Some manual care and feeding of the systems we build is inevitably required, but the number of humans needed shouldn’t grow linearly with the number of VMs, users, or requests. As engineers, we know the power of using computers to complete routine tasks, but we often find ourselves doing that work by hand anyway. By identifying, measuring, and reducing toil, we can reduce operating costs and ensure time to focus on the difficult and interesting projects instead.

For more about SRE, learn about the fundamentals or explore the full SRE book.

Introducing Google Cloud’s Secret Manager

Many applications require credentials to connect to a database, API keys to invoke a service, or certificates for authentication. Managing and securing access to these secrets is often complicated by secret sprawl, poor visibility, or lack of integrations.

Secret Manager is a new Google Cloud service that provides a secure and convenient method for storing API keys, passwords, certificates, and other sensitive data. Secret Manager provides a central place and single source of truth to manage, access, and audit secrets across Google Cloud. 

Secret Manager offers many important features:

  • Global names and replication: Secrets are project-global resources. You can choose between automatic and user-managed replication policies, so you control where your secret data is stored.

  • First-class versioning: Secret data is immutable and most operations take place on secret versions. With Secret Manager, you can pin a secret to a specific version like 42 or use a floating alias like latest.

  • Principle of least privilege: Only project owners have permissions to access secrets. Other roles must be explicitly granted permissions through Cloud IAM.

  • Audit logging: With Cloud Audit Logging enabled, every interaction with Secret Manager generates an audit entry. You can ingest these logs into anomaly detection systems to spot abnormal access patterns and alert on possible security breaches.  

  • Strong encryption guarantees: Data is encrypted in transit with TLS and at rest with AES-256 encryption keys. Support for customer-managed encryption keys (CMEK) is coming soon.

  • VPC Service Controls: Enable context-aware access to Secret Manager from hybrid environments with VPC Service Controls.

The Secret Manager beta is available to all Google Cloud customers today. To get started, check out the Secret Manager Quickstarts. Let’s take a deeper dive into some of Secret Manager’s functionality.

Global names and replication

Early customer feedback identified that regionalization is often a pain point in existing secrets management tools, even though credentials like API keys or certificates rarely differ across cloud regions. For this reason, secret names are global within their project.

While secret names are global, the secret data is regional. Some enterprises want full control over the regions in which their secrets are stored, while others do not have a preference. Secret Manager addresses both of these customer requirements and preferences with replication policies.

  • Automatic replication: The simplest replication policy is to let Google choose the regions where Secret Manager secrets should be replicated.

  • User-managed replication: If given a user-managed replication policy, Secret Manager replicates secret data into all the user-supplied locations. You don’t need to install any additional software or run additional services—Google handles data replication to your specified regions. Customers who want more control over the regions where their secret data is stored should choose this replication strategy.

First-class versioning

Versioning is a core tenet of reliable systems to support gradual rollout, emergency rollback, and auditing. Secret Manager automatically versions secret data using secret versions, and most operations—like access, destroy, disable, and enable—take place on a secret version.

Production deployments should always be pinned to a specific secret version. Updating a secret should be treated in the same way as deploying a new version of the application. Rapid iteration environments like development and staging, on the other hand, can use Secret Manager’s latest alias, which always returns the most recent version of the secret.

Integrations

In addition to the Secret Manager API and client libraries, you can also use the Cloud SDK to create secrets:
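(A sketch of the beta-era syntax; the secret name and data file are placeholders, and newer gcloud releases may no longer need the beta component.)

    gcloud beta secrets create my-api-key \
        --replication-policy="automatic" \
        --data-file=./api-key.txt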

and to access secret versions:
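(Again a sketch; latest resolves to the most recently added version of the secret.)

    gcloud beta secrets versions access latest --secret=my-api-key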

Discovering secrets

As mentioned above, Secret Manager can store a variety of secrets. You can use Cloud DLP to help find secrets using infoType detectors for credentials and secrets. For example, you can scan all files in a source directory and produce a report of possible secrets to migrate to Secret Manager:
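The exact tooling can vary; as a rough sketch (illustrative infoTypes, small text files only, and assuming gcloud, curl, and jq are available), you could loop over files and call the DLP content:inspect method:

    # Scan each file in ./src for likely credentials (sketch; PROJECT_ID must be set)
    for f in $(find ./src -type f); do
      data="$(base64 -w0 "$f")"   # -w0 is GNU base64; macOS base64 does not wrap by default
      curl -s -X POST \
        -H "Authorization: Bearer $(gcloud auth print-access-token)" \
        -H "Content-Type: application/json" \
        "https://dlp.googleapis.com/v2/projects/${PROJECT_ID}/content:inspect" \
        -d "{\"item\":{\"byteItem\":{\"type\":\"TEXT_UTF8\",\"data\":\"${data}\"}},
             \"inspectConfig\":{\"infoTypes\":[{\"name\":\"GCP_API_KEY\"},{\"name\":\"AUTH_TOKEN\"},{\"name\":\"PASSWORD\"}]}}" \
        | jq -r --arg f "$f" '.result.findings[]? | "\($f): \(.infoType.name)"'
    done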

If you currently store secrets in a Cloud Storage bucket, you can configure a DLP job to scan your bucket in the Cloud Console. 

Over time, native Secret Manager integrations will become available in other Google Cloud products and services.

What about Berglas?

Berglas is an open source project for managing secrets on Google Cloud. You can continue to use Berglas as-is and, beginning with v0.5.0, you can use it to create and access secrets directly from Secret Manager using the sm:// prefix.

If you want to move your secrets from Berglas into Secret Manager, the berglas migrate command provides a one-time automated migration.

Accelerating security

Security is central to modern software development, and we’re excited to help you make your environment more secure by adding secrets management to our existing Google Cloud security product portfolio. With Secret Manager, you can easily manage, audit, and access secrets like API keys and credentials across Google Cloud. 

To learn more, check out the Secret Manager documentation and Secret Manager pricing pages.

Using deemed SLIs to measure customer reliability

Do you own and operate a software service? If so, is your service a “platform”? In other words, does it run and manage applications from a wide range of users and companies? There are both simple and complex types of platforms, all of which serve customers. One example could be Google Cloud, which provides, among other things, relatively low-level infrastructure for starting and running VM images. A higher-level example of a platform might be a blogging service that allows any customer to create and contribute to a blog, design and sell merchandise featuring pithy blog quotes, and lets readers send tips to the blog author.

If you do run a platform, it’s going to break sooner or later. Some breakages are large and easy to understand, such as no one being able to reach websites hosted on your platform while your company’s failure is frequently mentioned on social media. However, other kinds of breakage may be less obvious to you—but not to your customers. What if you’ve accidentally dropped all inbound network traffic from Kansas, for example?

At Google Cloud, we follow SRE principles to ensure reliability for our own systems, and the Customer Reliability Engineering (CRE) team applies the same principles with partner customers. A core SRE operating principle is the use of service-level indicators (SLIs) to detect when your users start having a bad time. In this blog post, we’ll look at how to measure your platform customers’ approximate reliability using approximate SLIs, which we term “deemed SLIs.” We use these to detect low-level outages and drive the operational response.

Why use deemed SLIs?

CRE founder Dave Rensin noted in his SRECon 2017 talk, Reliability When Everything Is A Platform, that as a platform operator, your monitoring doesn’t decide your reliability—your customers do! The best way to get direct visibility into your customers’ reliability experience is to get them to define their own SLIs, and share those signals directly with you. That level of transparency is wonderful, but it requires active and ongoing participation from your customers. What if your customers can’t currently prioritize the time to do this?

As a platform provider, you might use any number of internal monitoring metrics related to what’s happening with customer traffic. For instance, say you’re providing an API to a storage service:

  • You may be measuring the total number of queries and number of successful responses as cumulative numeric metrics, grouped by each API function.

  • You may also be recording the 95th percentile response latency with the same grouping, and get a good idea of how your service is doing overall by looking at the ratio of successful queries and the response latency values. If your success ratio suddenly drops from its normal value of 99% to 75%, you likely have many customers experiencing errors. Similarly, if the 95th percentile latency rises from 600ms to 1400ms, your customers are waiting much longer than normal for responses.

The key insight to motivate the use of “deemed SLIs” is that metrics aggregated across all customers will miss edge cases—and your top customers are very likely to depend on those edge cases. Your top customers need to know about outages as soon as, or even before, they happen. Therefore, you most likely want to know when any of your top customers is likely to experience a problem, even if most of your customers are fine. 

Suppose FooCorp, one of your biggest customers, uses your storage service API to store virtual machine images:

  • They build and write three different images every 15 minutes.
  • The VM images are much larger than most blobs in your service. 
  • Every time one of their 10,000 virtual machines is restarted, it reads an image from the API. 
  • Therefore, their traffic rate is one write per five minutes and assuming a daily VM restart, one read per 8.6 seconds. 
  • Your overall API traffic rate is one write per second and 100 reads per second.

Let’s say you roll out to your service a change that has a bug, causing very large image reads and writes, which are likely to time out and not complete. You initially don’t see any noticeable effect on your API’s overall success rate and think your platform is running just fine. FooCorp, however, is furious. Wouldn’t you like to know what just happened?

Implementation of deemed SLIs

The first and foremost step is to see key metrics at the granularity of a single customer. This requires careful assessment and trade-offs. 

For our storage API, assuming we were originally storing two cumulative measures (success, total) and one gauge (latency) at one-minute intervals, we can measure and store three data points per minute with no problem at all. However, if we have 20,000 customers, then storing 60,000 points per minute is a very different problem. Therefore, we need to be careful in the selection of metrics for which we provide the per-customer breakdown. In some cases, it may be sensible to have per-customer breakdowns only for a subset of customers, such as those contracting for a certain level of paid support.

Next, identify your top customers. “Top” could mean:

  • invests the most money on your platform;

  • is expected to invest the most money on your platform in the next two years;

  • is strategic from the point of view of partnerships or publicity; or even

  • raises the most support cases and hence causes the greatest operational load on your team.

As we mentioned, customers use your platform in different ways and, as a result, have different expectations of it. To find out what your customer might regard as an outage, you need to understand in some depth what their workload really does. For example, a customer’s clients might automatically read data from your API every 30 minutes and update their state if new information is available. In that case, even if the API is completely broken for an hour, the customer might barely notice.

To determine your deemed SLIs, apply your understanding of the customer’s workload to the limited selection of per-customer metrics you have. Consider how volatile each metric is over time and, if possible, how it behaved during a known customer outage. From this, pick the subset of metrics that you think best represent customer happiness. Identify the normal ranges of those metrics, and aggregate them into a dashboard view for that customer.

This is why we call these metrics “deemed SLIs”—you deem them to be representative of your particular customer’s happiness, in the absence of better information. 

Some of the metrics you look at for your deemed SLIs of the storage service might include:

  • Overall API success rate and latency

  • Read and write success rate for large objects (i.e., FooCorp’s main use case)

  • Read latency for objects below a certain size (i.e., excluding large image read bursts so there’s a clear view of API performance for its more common read use case).

The main challenges are:

  • Lack of technical transparency into the customer’s key considerations. For instance, if you only provide TCP load balancing to your customer, you can’t observe HTTP response codes. 

  • Lack of organizational transparency—you don’t have enough understanding of the customer’s workload to be able to identify what SLIs are meaningful to them.

  • Missing per-customer metrics. You might find that you need to know whether an API call is made internally or externally because the latter is the key representative of availability. However, this distinction isn’t captured in the existing metrics.

It’s important to remember that we don’t expect these metrics to be perfect at first—these metrics are often quite inconsistent with the customer’s experience in the beginning. So how do we fix this? Simple—we iterate.

Iteration when choosing deemed SLIs

Now sit back and wait for a significant outage of your platform. There’s a good chance that you won’t have to wait too long, particularly if you deploy configuration changes or binary releases often.

When your outage happens:

  • Do an initial impact analysis. Look at each of your deemed SLIs, see if they indicate an outage for that customer, and feed that information to your platform leadership.
  • Feed quantitative data into the postmortem being written for the incident. For example, “Top customer X first showed impact at 10:30 EST, reached a maximum of 30% outage at 10:50 EST, and had effectively recovered by 11:10 EST.”
  • Reach out to those customers via your account management teams, to discover what their actual impact was.

Here’s a quick reference table for what you need to do for each customer:

[Table: quick reference of actions to take for each customer]

As you gain confidence in some of the deemed SLIs, you may start to set alerts for your platform’s on-call engineers based on those SLIs going out of bounds. For each such alert, see whether it represents a material customer outage, and adjust the bounds accordingly. 

It’s important to note that customers can also shoot themselves in the foot and cause SLIs to go out of bounds. For example, they might cause themselves a high error rate in the API by providing an out-of-date decryption key for the blob. In this case, it’s a real outage, and your on-caller might want to know about it. There’s nothing for the on-caller to do, however—the customer has to fix it themselves. At a higher level, your product team may also be interested in these signals because there may be opportunities to design the product to guard against customers making such mistakes—or at least advise the customer when they are about to do so.

If a top customer has too many “it was us, not the platform” alerts, that’s a signal to turn off the alerts until things improve. This may also indicate that your engineers should collaborate with the customer to improve their reliability on your platform.

When your on-call engineer gets deemed SLI alerts from multiple customers, on the other hand, they can have a high confidence that the proximate cause is likely on the platform side.

Getting started with your own deemed SLIs

In Google Cloud, some of these metrics are exposed to customers directly, on a per-project basis, through Transparent SLIs.

If you run a platform, you need to know what your customers are experiencing.

  • Knowing that a top customer has started having a problem before they phone your support hotline shrinks incident detection time by many minutes, reduces the overall impact of the outage, and improves your relationship with that customer. 

  • Knowing that several top customers have started to have problems can even be used to signal that a recent deployment should presumptively be rolled back, just in case.

  • Knowing roughly how many customers are affected by an outage is a very helpful signal for incident triage—is this outage minor, significant, or huge? 

Whatever your business, you know who your most important customers are. This week, go and look at the monitoring of your top three customers. Identify a “deemed SLI” for each of them, measure it in your monitoring system, and set up an automated alert for when those SLIs go squirrelly. You can tune your SLI selection and alert thresholds over the next few weeks, but right now, you are in tune with your top three customers’ experience on your platform. Isn’t that great? 

Learn more about SLIs and other SRE practices from previous blog posts and the online Site Reliability Workbook.


Thanks to additional contributions from Anna Emmerson, Matt Brown, Christine Cignoli and Jessie Yang.

Performance art: Making cloud network performance benchmarking faster and easier

Before you migrate workloads to the cloud, you need to benchmark network performance in order to understand how that performance affects your business applications. Unfortunately, the cloud hasn’t offered the standards, tools, and methods to do the benchmark testing you need. As a result, you’re forced to make deployment decisions without comprehensively understanding the implications of network performance for your use case.

Today, we’re excited to make a few announcements that will help you understand cloud network performance more quickly and easily:

  • We are investing in performance benchmarking tools. To begin with, we merged new contributions to PerfKit Benchmarker, an open-source tool created inside Google that makes network performance benchmarking faster and easier by automating network setup, provisioning of VMs, and test runs. With the updates, PerfKit Benchmarker now supports a broader range of network performance tests for VM-to-VM latency, throughput, and packets per second across multiple clouds (inter-region, inter-zone, intra-zone, and on-prem to cloud), and lets you view the results in Google Data Studio (free to use). With this information, you can more accurately predict the performance impact of moving workloads to or across different clouds.

  • The publication of a new benchmarking methodology for using PerfKit Benchmarker continuously and consistently. This methodology, co-developed with performance engineering researchers at Southern Methodist University’s AT&T Center for Virtualization, is based on Google’s own internal best practices. “Continuous performance measurement and benchmarking are essential for understanding trends and patterns in large-scale cloud deployments,” said Suku Nair, director of SMU AT&T Center for Virtualization. “PerfKit Benchmarker, which wraps over 100 industry standard benchmark testing tools in an easy-to-use and extensible manner, is a key enabler in automating this process.”

Read on for an overview of how to use PerfKit Benchmarker to take advantage of its new features, such as support for additional performance metrics (e.g., packets per second) and deployment use cases (e.g., VPN). 

Using PerfKit Benchmarker

PerfKit Benchmarker automates the setup and teardown of all the resources you need to run tests on (or between) most major public cloud providers, as well as on-premises deployments like Docker and OpenStack. Specifically, it automates the setup and provisioning of networks, subnets, firewalls and firewall rules, virtual machines, and drives required to run a large variety of benchmarks, as well as running the benchmarks themselves and tearing down the infrastructure afterwards. 

Along with installing and running the actual benchmark tests, PerfKit Benchmarker packages the test results in an easy-to-consume JSON format and offers hooks into backend storage providers like Google BigQuery, Elasticsearch, and InfluxDB to automate publishing results, making reporting and analytics a breeze.

When performing network tests, the critical metrics you need to understand include throughput, latency, jitter, and packets per second. To find the values of these metrics across various configurations, you can use PerfKit Benchmarker to draw upon a number of testing tools, including iperf2, iperf3, ping, netperf, nuttcp, nttcp and NTttcp, just to name a few.

Once PerfKit Benchmarker has been installed, running a single benchmark is simple: Specify the test you want to run and where you want to run it. As a basic example, here is a ping benchmark between two VMs that are located in zone us-east1-b of Google’s cloud:

./pkb.py --benchmarks=ping --zones=us-east1-b --cloud=GCP

This command creates a new Virtual Private Cloud (VPC) and two new VMs in zone us-east1-b of Google Cloud, configures them for a ping test (including setting the appropriate firewall rules), runs the test, and then deletes the VMs and the VPC. Finally, it outputs the results to the console and stores them in a file in the /tmp directory. You can also store results in BigQuery or Elasticsearch when appropriate flags have been set. 
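For example, here is a rough sketch of publishing ping results straight to a BigQuery table (the project, dataset, and table are placeholders; flag names can change between PerfKit Benchmarker versions, so check ./pkb.py --helpmatch=bigquery):

    ./pkb.py --benchmarks=ping --cloud=GCP --zones=us-east1-b \
        --bq_project=my-project \
        --bigquery_table=pkb_results.network_benchmarks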

Measuring Google Cloud inter-region latency with PerfKit Benchmarker

When designing your environment, it’s important to understand the latency between components in different Google Cloud regions. As an example, here are the results of our own all-region to all-region round trip latency tests using n1-standard-2 machine types and internal IP addresses. The daily benchmark tests ran over the course of the last month. The statistics were all collected using PerfKit Benchmarker to run ping benchmarks between VMs in each pair of regions.

[Chart: round-trip latency between Google Cloud regions, measured with PerfKit Benchmarker]

To reproduce this chart, you can run the command below with a benchmark config file that lists the regions you want to test. To run a smaller subset of regions, just remove the regions you don’t want included from the zones and extra_zones lists in that file.

You can also add the --run_processes=<# of processes> flag to tell it to run multiple benchmarks in parallel. Furthermore, you can add the --gce_network_name=<network name> flag to have each benchmark use a Cloud VPC you have already created so each benchmark doesn’t make its own VPC.

./pkb.py --benchmarks=ping --benchmark_config_file=/path/to/all_region_latency.yaml

More benchmarks using PerfKit Benchmarker

Other examples of network performance benchmark tests you can run using PerfKit Benchmarker include:

  • Inter-region, inter-zone, and intra-zone network performance tests

  • On-premises to cloud and cross-cloud performance benchmarks between a VM in one cloud, and a VM on-premises or in another cloud

  • Performance benchmarks using various network tiers

  • Benchmarking across various guest OSes (e.g., Linux vs Windows) and machine types (e.g., general purpose, compute-optimized)

For complete details about the methodology for running more of these benchmarks, read the “Measuring cloud network performance with PerfKit Benchmarker” methodology white paper.

More good stuff on the way

By using PerfKit Benchmarker, you can make better decisions about where to put workloads, and improve the experience of your end-users. As time goes on, we’ll continue to add coverage for new performance benchmarking use cases, publish additional guidelines for cloud performance benchmarking, and report on the experiences of cloud adopters. In the meantime, we welcome and encourage new contributions to the PerfKit Benchmarker codebase, and look forward to seeing the community grow!

Learning—and teaching—the art of service-level objectives — CRE Life Lessons

Avid readers of CRE Life Lessons blog posts (there are dozens of us!) will appreciate the value of well-tuned service-level indicators (SLIs) and service-level objectives (SLOs). These concepts are fundamental building blocks of a site reliability engineering (SRE) practice. After all, how can you have a meaningful discussion about the reliability you want your services to achieve without properly measuring that reliability?

The Customer Reliability Engineering (CRE) team has helped many of Google Cloud’s customers create their first SLOs and better understand the reliability of their services. We want to make sure that teams everywhere can implement these principles. We’re pleased to announce that we’re making all the materials for our Art of SLOs workshop freely available under the Creative Commons CC-BY 4.0 license for anyone to use and re-use—as long as Google is credited as the original author. We’ve been inviting customers from around the world to this workshop for the past year. From now on, anyone can run their own version of this workshop, to teach their coworkers, their customers, or conference attendees why all services need SLOs.

What’s covered in the Art of SLOs

The Art of SLOs teaches the essential elements of developing SLOs to an audience from across the realms of development, operations, product, and business. The workshop slides are accompanied by a 28-page supporting handbook for participants, which is part reference and part background material for the practical problems that workshop participants engage with.

In the workshop, we start by making a business case for the value of SLOs based on two fundamental assertions. First, that reliability is the most important feature of any service, and second, that 100% is the wrong reliability target for basically everything. These assertions underlie the concept of an error budget, a non-zero quantity of allowable errors in a given time window that arises from an SLO target set somewhere just short of 100%. The tension between a fast pace of innovation and service reliability can be resolved by aiming to roll out new features as fast as possible without exhausting this error budget.

Once everyone is (hopefully) convinced that SLOs are a Good Thing, we explain how to choose good SLIs from the wealth of telemetry generated by a service running in production, and introduce the SLI equation, our recommended way of expressing any SLI. We cover two alternate ways of setting your first SLO targets, which arise from making different tradeoffs, and offer advice on how to converge these targets over time. We introduce a hands-on example—the server-side infrastructure supporting a fictional mobile game called Fang Faction—and use it to demonstrate the process of refining an SLI from a simple, generic specification to a concrete implementation that could be measured by a monitoring system.
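
To make the error budget arithmetic and the SLI equation concrete, here's a minimal sketch of our own (not taken from the workshop materials), using the common good-events-over-valid-events form of the SLI; all numbers are purely illustrative:

def sli(good_events: int, valid_events: int) -> float:
    """SLI expressed as the percentage of valid events that were good."""
    return 100.0 * good_events / valid_events

def error_budget_remaining(slo_target_pct: float, good_events: int, valid_events: int) -> float:
    """Fraction of the error budget left in the current window (1.0 means untouched)."""
    allowed_bad = (1 - slo_target_pct / 100.0) * valid_events  # non-zero budget from an SLO below 100%
    actual_bad = valid_events - good_events
    return 1.0 - actual_bad / allowed_bad

# Example: a 99.9% availability SLO over a window with 10 million requests.
print(sli(9_993_000, 10_000_000))                           # 99.93
print(error_budget_remaining(99.9, 9_993_000, 10_000_000))  # 0.3, i.e., 30% of the budget is left

In this sketch, a result of 0.0 or below would mean the budget is exhausted, which is the signal to prioritize reliability work over new feature rollouts.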


Art (noun): A skill at doing a specified thing, typically one acquired through practice.


Critically, participants put this newly acquired knowledge to practical use straight away, as they develop more SLIs and SLOs for Fang Faction. Typically, when we run this workshop with customers, we break them up into groups of eight or so and unleash them on the workshop problems for 90 minutes. Each group is paired with an experienced SRE volunteer, who facilitates the discussion, encourages participation, and keeps the group on track.

Run your own SLO workshop!

If this sounds interesting, you’ll want to check out the Facilitators Handbook, which has a lot more information on how to organize an Art of SLOs workshop. If you don’t have a whole team to educate, you might be interested in our Measuring and Managing Reliability course on Coursera, which is a more thorough, self-paced dive into the world of SLIs, SLOs and error budgets.

5 favorite tools for improved log analytics

Stackdriver Logging, part of our set of operations management tools at Google Cloud, is designed to manage and analyze logs at scale to help you troubleshoot your hybrid cloud environment and gain insight from your applications. But the sheer volume of machine-generated data can pose a challenge when searching through logs. 

Through our years of working with Stackdriver Logging users, we’ve identified the easiest ways and best practices to get the value you need from your logs. We’ve collected our favorite tips for more effective log analysis and fast troubleshooting, including a few new features to help you quickly and easily get value from your logs: saved searches, a query library, support for partition tables when exporting logs to BigQuery, and more.

1. Take advantage of the advanced query language

The default basic mode for searching Stackdriver logs is using the drop-down menus to select the resource, log, or severity level. Though this makes it incredibly easy to get started with your logs, most users gravitate toward the advanced filter to write more complex queries, as shown here:

[Screenshot: an advanced query in the Logs Viewer]

Some powerful tools in this advanced query mode include:

  • Comparison operators:
    =           # equal
    !=          # not equal
    > < >= <=   # numeric ordering
    :           # "has" matches any substring in the log entry field

  • Boolean operators: By default, multiple clauses are combined with AND, though you can also use OR and NOT (be sure to use upper case!)

  • Functions: ip_in_net() is a favorite for analyzing network logs, like this:
        ip_in_net(jsonPayload.realClientIP, "10.1.2.0/24")

Pro tip: Include the full log name, time range, and other indexed fields to speed up your search results. See these and other tips on speeding up performance.
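
If you want to run the same kind of advanced query from code rather than the Logs Viewer, here's a hedged sketch using the google-cloud-logging Python client; the project ID, log name, and jsonPayload field are placeholders, and the client surface may vary slightly between library versions. Note that separate lines in an advanced filter are implicitly combined with AND:

from google.cloud import logging

client = logging.Client(project="my-project")  # hypothetical project ID

# Indexed fields (logName, resource.type, timestamp) narrow the scan, per the pro tip above.
FILTER = """
logName="projects/my-project/logs/my-app"
resource.type="gce_instance"
severity>=WARNING
timestamp>="2019-11-01T00:00:00Z"
ip_in_net(jsonPayload.realClientIP, "10.1.2.0/24")
"""

for entry in client.list_entries(filter_=FILTER):
    print(entry.timestamp, entry.severity, entry.payload)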

New queries library: We’ve polled experts from around Google Cloud to collect some of our most common advanced queries by use case, including Kubernetes, security, and networking logs, which you can find in a new sample queries library in our documentation. Is there something different you’d like to see? Click the “Send Feedback” button at the top of the Sample Queries page and let us know.

2. Customize your search results

Often there is a specific field buried in your log entries that is of particular interest when you’re analyzing logs. You can customize the search results to include this field by clicking on a field and selecting “Add field to summary line.” You can also manually add, remove, or reorganize fields, or toggle the control limiting their width under View Options. This configuration can dramatically speed up troubleshooting, since you get the necessary context right in the summary line. See an example here:

[Screenshot: customized summary fields in the search results]

3. Save your favorite searches and custom search results in your personal search library

We often hear that you use the same searches over and over again, or that you wish you could save custom field configurations for future searches. So, we recently launched a new feature that lets you save your searches, including the custom fields, in your own library.

[Screenshot: saving a search to your personal search library]

You can share your saved searches with users who have permissions on your project by clicking the selector next to Submit and then Preview. Click Copy link to filter and share the link with your team. This feature is currently in beta, and we’ll continue working on the query library functionality to help you quickly analyze your logs.

4. Use logs-based metrics for dashboarding and alerting

Now that you’ve mastered advanced queries, you can take your analysis to the next level with real-time monitoring using logs-based metrics. For example, suppose you want to get an alert any time someone grants access to an email address from outside your organization. You can create a metric to match audit logs from Cloud Resource Manager SetIamPolicy calls where a member not under the “my-org.com” domain is granted access.

With the filter set, simply click Create Metric and give it a name.

[Screenshot: creating a logs-based metric from the advanced filter]
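
If you prefer to create the metric from code instead of the UI, the sketch below uses the google-cloud-logging Python client; the metric name is ours, and the audit log field paths and “my-org.com” domain are assumptions you should check against your own SetIamPolicy log entries:

from google.cloud import logging

client = logging.Client(project="my-project")  # hypothetical project ID

# Audit-log field paths are assumptions; verify them against your own entries.
FILTER = """
resource.type="project"
protoPayload.serviceName="cloudresourcemanager.googleapis.com"
protoPayload.methodName="SetIamPolicy"
protoPayload.serviceData.policyDelta.bindingDeltas.action="ADD"
NOT protoPayload.serviceData.policyDelta.bindingDeltas.member:"my-org.com"
"""

metric = client.metric(
    "external-member-granted-access",   # hypothetical metric name
    filter_=FILTER,
    description="IAM grants to members outside the my-org.com domain",
)
if not metric.exists():
    metric.create()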

To alert if a matching log arrives, select Create Alert From Metric from the three-dot menu next to your newly created user-defined metric. This will open a new alerting policy in Stackdriver Monitoring. Change the aggregator to “sum” and the threshold to 0 for “Most recent value” so you’ll be alerted any time a matching log occurs. Don’t worry if there’s no data yet, as your metric will only count log entries since it was created.

[Screenshot: creating an alert from the logs-based metric]

Additionally, you can add an email address, Slack channel, SMS, or PagerDuty account and name, and save your alerting policy. You can also add these metrics to dashboards along with custom and system metrics.

5. Perform faster SQL queries on logs in BigQuery using partitioned tables

Stackdriver Logging supports sending logs to BigQuery through log sinks, so you can perform advanced analytics with SQL or join logs with other data sources, such as Cloud Billing. We’ve heard from you that it would be easier to analyze logs across multiple days in BigQuery if we supported partitioned tables. So we recently added a partitioned-table option that simplifies SQL queries on logs in BigQuery.

When creating a sink to export your logs to BigQuery, you can either use date-sharded tables or partitioned tables. The default selection is a date-sharded table, in which a _YYYYMMDD suffix is added to the table name to create daily tables based on the timestamp in the log entry. Date-sharded tables have a few disadvantages that can add to query overhead:

  • Querying multiple days is harder, as you need to use the UNION operator to simulate partitioning.

  • BigQuery needs to maintain a copy of the schema and metadata for each date-named table.

  • BigQuery might be required to verify permissions for each queried table.

When creating a log sink, you can now select the Use Partitioned Tables option to write to partitioned tables in BigQuery and avoid these date-sharded table issues.

[Screenshot: the Use Partitioned Tables option when creating a BigQuery sink]

Logs streamed to a partitioned table use the log entry’s timestamp field to write to the correct partition. Queries on such ingestion-time partitioned tables can specify predicate filters on the _PARTITIONTIME or _PARTITIONDATE pseudo column to limit the amount of logs scanned. You can specify a range of dates using a WHERE filter, like this: 

WHERE _PARTITIONTIME BETWEEN TIMESTAMP("2019-11-01") AND TIMESTAMP("2019-11-05")
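
Put together, a query over the exported logs might look like the hedged sketch below, here wrapped in the BigQuery Python client; the dataset and table names are placeholders, and the column names follow the Cloud Logging export schema:

from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project ID

sql = """
SELECT timestamp, severity, logName
FROM `my-project.my_log_dataset.my_log_table`
WHERE _PARTITIONTIME BETWEEN TIMESTAMP("2019-11-01") AND TIMESTAMP("2019-11-05")
  AND severity = "ERROR"
ORDER BY timestamp DESC
LIMIT 100
"""

for row in client.query(sql):  # iterating the job waits for and returns the result rows
    print(row.timestamp, row.severity, row.logName)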

Learn more about querying partitioned tables.

Find out more about Stackdriver Logging, and join the conversation directly with our engineers and product management team.

Shrinking the time to mitigate production incidents – CRE life lessons

Your pager is going off. Your service is down and your automated recovery processes have failed. You need to get people involved in order to get things fixed. But people are slow to react, have limited expertise, and tend to panic. However, they are your last line of defense, so you’re glad you prepared them for handling this situation.

At Google, we follow SRE practices to ensure the reliability of our services, and here on the Customer Reliability Engineering (CRE) team, we share tips and tricks we’ve learned from our experiences helping customers get up and running. If you read our previous post on shrinking the impact of production incidents, you might remember that the time to mitigate an issue (TTM) is the time from when a first responder acknowledges the reception of a page to the time users stop feeling pain from the incident. Today’s post dives deeper into the mitigation phase, focusing on how to train your first responders so they can react efficiently under pressure. You’ll also find templates so you can get started testing these methods in your own organization.

Understanding unmanaged vs. untrained responses

Effective incident response and mitigation requires both capable technical people and proper incident management. Without the latter, teams can end up fixing technical problems in parallel instead of working together to mitigate the outage. Under these circumstances, actions performed by engineers can actually worsen the state of the outage, since different groups of people may be undoing each other’s progress. This total lack of incident management is what we refer to as an “unmanaged” response.

Check out the Site Reliability Workbook for a real example of the consequences of the lack of proper incident management, along with a structure to introduce that incident management to your organization.

Solving the problem of the untrained response

What we’ll focus on here is the problem that arises when the personnel responding to an outage are organized under a properly established incident response structure, but lack the training to work through the response effectively. In this “untrained” response, the effort is coordinated and the responders know and understand their roles, but they lack the technical preparedness to troubleshoot the problem and identify the mitigation path that will restore the service. Even engineers who were once well prepared can lose their edge if the service pages only rarely or if an individual’s on-call shifts are widely spaced in time.

Other causes include fast-paced software development and new service dependencies, which can leave on-call engineers unfamiliar with the tools and procedures needed to work through an outage. They know what they are supposed to be doing; they just don’t know how to do it.

How can we fix the untrained response to minimize the mean time to mitigation (MTTM)?

Teaching response teams with hands-on activities

Humans cope with sudden changes in their environment, such as those introduced by an emergency, and respond in a measured way by establishing mental models that help with pattern recognition. Psychologists call this “expert intuition,” and it helps us identify underlying commonalities in situations we have never faced before: “Hmm, I don’t recognize this specifically, but the symptoms we’re seeing make me think of X.”

The best way to gain knowledge and, in turn, establish long-term memory and expert intuition isn’t through one-time viewings of documents or videos. Instead, it’s through a series of exercises that include (but are not limited to) low-stakes struggles. These are situations with never-before-seen (or at least rarely seen) problems, in which failure to solve them will not have a severe impact on your service. These brain challenges help the learning process by practicing memory retrieval and strengthening the neural pathways that access memory, thus improving analytical capacity.

At Google, we use two types of exercises to help our learning process: Disaster Recovery Testing (DiRT) and Wheel of Misfortune.

DiRT, or how to get dirty 

The disaster recovery testing we perform internally at Google is a coordinated set of events organized across the company, in which a group of engineers plan and execute real and fictitious outages for a defined period of time to test the effective response of the involved teams. These complex, non-routine outages are performed in a controlled manner, so that they can be rolled back as quickly as possible by the proctors should the tests get out of hand.

To ensure consistent behavior across the company, there are some rules of engagement that the coordinating team publishes, and every participating team has to adhere to. These rules include:

  • Prioritizations, i.e., real emergencies take precedence over DiRT exercises

  • Communication protocols for the different announcements and global coordination

  • Impact expectations: “Are services in production expected to be affected?”

  • Test design requirements: all tests must include a revert/rollback plan in case something goes wrong

All tests are reviewed and approved by a cross-functional technical team, different from the coordinating team. One dimension of special interest during this review process is the overall impact of the test. It not only has to be clearly defined, but if there’s a high risk of affecting production services, the test has to be approved by a group of VP-level representatives. It is paramount to understand if a service outage is happening as a direct result of the test being run, or if something is out of control and the test needs to be stopped to fix the unrelated problem.

Some examples of practical exercises include disconnecting complete data centers, disruptively diverting the traffic headed to a specific application to a different target, modifying live service configurations, or bringing up services with known bugs. The resilience of the services is also tested by “disabling” people who might have knowledge or experience that isn’t documented, or removing documentation, process elements, or communication channels.

Back in the day, Google performed DiRT exercises in a different way, which may be more practical for companies without a dedicated disaster testing team. Initially, DiRT comprised a small set of theoretical tests done by engineers working on user-facing services, and the tests were isolated and very narrow in scope: “What would happen if access to a specific DNS server is down?” or “Is this engineer a single point of failure when trying to bring this service up?”

How to start: the basics

Once you embrace the idea that testing your infrastructure and procedures is a way to learn what works and what does not, and use the failures as a way of growing, it is very tempting to go nuts with your tests. But doing so can easily create too many complications in an already complex system.

To avoid the initial unnecessary overhead of interdependencies, start small with service-specific tests, and evolve your exercises, analyzing which ones provide value and which ones don’t. Clearly defining your tests is also important, as it helps to verify whether there are hidden dependencies: “Bring down DNS” is not the same as “Shut down all primary DNS servers running in North America data centers, but not the forwarding servers.” Forwarding rules may mask the fact that all the primary DNS servers are down, because clients are still sending their DNS queries to external providers.

Over the years, your DiRT tests will naturally evolve and increase in size and scope, with the goal of identifying weaknesses in the interfaces between services and teams. This can be achieved, for instance, by failing services in parallel, or by bringing down entire clusters, buildings, geographical domains, cloud zones, network layers, or similar physical or logical groupings.

What to test: human learning

As we described earlier, technical knowledge is not everything. Processes and communications are also fundamental to reducing the MTTM. Therefore, DiRT exercises should also test how people organize themselves and interact with each other, and how they use the processes that have previously been established for the resolution of emergencies. It’s not helpful to have a process for purchasing fuel to keep a generator running during an extended power outage if nobody knows the process exists, or where it is documented.

Once you identify failures in your processes, you can put in place a remediation plan. Once the remediation plan has been implemented and a fix is in place, you should make sure the fix is effective by testing it. After that, expand your tests and restart the cycle. 

If you plan to introduce a DiRT-style exercise in your company, you can use this Test Plan Scenario template to define your tests.

Of course, you should note that these exercises can produce accidental user-facing outages, or even revenue loss. During a DiRT exercise, as we are operating on production services, an unknown bug can potentially bring an entire service to a point in which recovery is not automatic, easy, or even documented.

We think the learning value of DiRT exercises justifies the cost in the long term, but it’s important to consider whether these exercises might be too disruptive. There are, fortunately, other practices that can be used without creating a major business disruption. Let’s describe the other one we use at Google, and how you can try it.

Spinning the Wheel of Misfortune

A Wheel of Misfortune is a role-playing scenario to test techniques for responding to an emergency. The purpose of this exercise is to learn through a purely simulated emergency, using a traditional role-playing setup, where engineers walk through the steps of debugging and troubleshooting. It provides a risk-free environment, where the actions of the engineers have no effect on production, so the learning process can be reinforced through low-stakes struggles.

The use of scenarios portraying both real and fictitious events also allows the creation of complete operational environments. These scenarios require the use of skills and bits of knowledge that might not be exercised otherwise, helping the learning process by exposing the engineers to real (but rarely occurring) patterns that build a more complete mental model.

If you have played any role-playing game, you probably already know how it works: a leader such as the Dungeon Master, or DM, runs a scenario where some non-player characters get into a situation (in our case, a production emergency) and interact with the players, who are the people playing the role of the emergency responders.

Running the scenario

The DM is generally an experienced engineer who knows how the services work and interact in order to respond to the operations requested by the player(s). It is important that the DM knows what the underlying problem is, and the main path to mitigate its effects. Understanding the information the consoles and dashboards would present, the way the debugging tools work, and the details of their outputs all add realism to the scenario, and will avoid derailing the exercise by providing information and details that are not relevant to the resolution.

The exercise usually starts with the DM describing how the player(s) become aware of the service breakage: an alert received through a paging device, a call from a call-center support person, an IM from a manager, etc. The information should be complete, and the DM should avoid holding back information that would otherwise be known during a real incident. Information should also be relayed as-is, without commentary on what it might mean.

From there, the player should be driving the scenario: They should give clear explanations of what they want to do, the dashboards they want to visualize, the diagnostic commands they want to run, the config files they want to inspect, and more. The DM in turn should provide answers to those operations, such as the shape of the graphs, the outputs of the different commands, or the content of the files. Screenshots of the different elements (graphs, command outputs, etc.) projected on a screen for everybody to see should be favored over verbal descriptions.

It is important for the DM to ask questions like “How would you do that?” or “Where would you expect to find that information?” Exact file system paths or URLs are not required, but it should be evident that the player could find the relevant resource in a real emergency. One option is for the player to do the investigation for real by projecting their laptop screen to the room and looking at the real graphs and logs for the service.

In these exercises, it’s important to test not only the players’ knowledge of the systems and their troubleshooting capacity, but also the understanding of incident command procedures. In the case of a large disaster, declaring a major outage and proceeding to identify the incident commander and the rest of the required roles is as important as digging to the bottom of the root cause.

The rest of the team should be spectators, unless specifically called in by the DM or the player. However, the DM should exercise veto power for the sake of the learning process. For example, if the player declares the operations lead is another very experienced engineer and calls them in, with the goal of unloading all the troubleshooting operations, the DM could indicate that the experienced engineer is trapped inside a subway car without cell phone reception, and is unable to respond to the call.

The DM should be literal in the details: If a page has a three-minute escalation timeout and has not been acknowledged after the timeout, escalate to the secondary. The secondary can be a non-player who then calls the player on the phone to inform them about the page. The DM should also be flexible in the structure. If the scenario is taking too long, or the player is stuck on one part, allow suggestions from the audience, or provide hints through non-player observations.

Finally, once the scenario has concluded, the DM should clearly state that the situation is resolved (if it is). Allow some time at the end for debriefing and discussion, explaining the background story that led to the emergency and calling out the contributors to the situation. If the scenario was based on a real outage, the DM can provide some factual details of the context, as these usually help participants understand the different steps that led to the outage.

To make bootstrapping the exercise easier, check out the Wheel of Misfortune template we’ve created to help with your preparation.

Putting it all together

The people involved in incident response directly affect the time needed to recover from an outage, so it’s important to prepare teams as well as systems. Now that you’ve seen how some testing and learning methods work, try them out for yourself. In the next few weeks, try running a simple Wheel of Misfortune with your team. Choose (or write!) a playbook for an important alert, and walk through it as if you were solving a real incident. You might be amazed at how many seemingly obvious steps turn out to need documenting.


Keep a better eye on your Google Cloud environment

Monitoring, managing, and understanding your cloud environment can be a challenging task for large-scale organizations. We built Google Cloud Asset Inventory so IT, security, and ops admins can get easy visibility into their Google Cloud Platform (GCP) environment. Cloud Asset Inventory is a fully managed metadata inventory service that lets you access your GCP assets and view their history. Two new features make it even easier to do continuous monitoring and deep analysis across your GCP assets.

Real-time notification feature for continuous monitoring
Cloud Asset Inventory now brings the real-time notification feature to beta, letting you do real-time config monitoring. For example, you can get notifications as soon as a firewall rule is changed for your web front end, or if an IAM policy binding in your production project has changed. The notifications are sent through Cloud Pub/Sub, from which you can then trigger actions.

The example diagram below shows you how to monitor an IAM policy and trigger actions using Cloud Asset Inventory. In this scenario, a Gmail account was added to an IAM policy, which is generally against organizational security policy. If real-time notifications are set up on that IAM policy, Cloud Asset Inventory will send a Cloud Pub/Sub message containing the new change as soon as the change occurs. You can then write Cloud Functions to trigger an email notification, as well as directly revert the change. You can see the IAM policy’s previous state by getting the change history of the IAM policy through the existing Cloud Asset Inventory export history feature.

[Diagram: monitoring an IAM policy change with a real-time feed, Cloud Pub/Sub, and Cloud Functions]
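
As one hedged illustration of the “trigger actions” step, the sketch below shows a Cloud Function with a Pub/Sub trigger that looks for Gmail members in the changed policy; the function name is ours, and the message fields assume the feed’s TemporalAsset JSON with IAM_POLICY content, so verify them against your own feed messages:

import base64
import json

def on_asset_feed_message(event, context):
    """Triggered by a Pub/Sub message from a Cloud Asset Inventory feed."""
    payload = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    asset = payload.get("asset", {})
    bindings = asset.get("iamPolicy", {}).get("bindings", [])

    gmail_members = [
        member
        for binding in bindings
        for member in binding.get("members", [])
        if member.endswith("@gmail.com")
    ]

    if gmail_members:
        # Placeholder action: send an email, open a ticket, or call SetIamPolicy
        # to revert the change, as described above.
        print("Gmail members granted access on %s: %s" % (asset.get("name"), gmail_members))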

Native BigQuery export feature for in-depth asset analysis
Given high demand from customers, and the popularity of the related open source tool, we’ve launched native BigQuery export support in Cloud Asset Inventory. You can directly export your asset snapshots and write them to a BigQuery table using the same API or CLI. This enables in-depth asset analysis, asset validation, and rule-based scanning.
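
As a rough sketch of what triggering an export from code could look like (the project, dataset, and table names are placeholders, and the exact client surface differs between versions of the google-cloud-asset Python library):

from google.cloud import asset_v1

client = asset_v1.AssetServiceClient()

output_config = asset_v1.OutputConfig(
    bigquery_destination=asset_v1.BigQueryDestination(
        dataset="projects/my-project/datasets/asset_inventory",  # hypothetical dataset
        table="resource_snapshot",                               # hypothetical table
        force=True,  # overwrite the table if it already exists
    )
)

operation = client.export_assets(
    request={
        "parent": "projects/my-project",                # hypothetical project
        "content_type": asset_v1.ContentType.RESOURCE,  # or IAM_POLICY, etc.
        "output_config": output_config,
    }
)
print(operation.result())  # blocks until the export finishes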

Paypal has been a longtime Cloud Asset Inventory customer, and recently got a chance to adopt the BigQuery export feature. Here’s how they’ve been using it:

“With the adoption of GCP and all of the associated services, Paypal was drowning in unorganized data. With multiple organizations and thousands of projects, we needed a method to gain insight and control of our cloud usage,” says Micah Norman, cloud engineer at Paypal. He initially created a Python application that queried all of the relevant APIs individually and stored the results in CloudSQL and BigQuery. This application worked well, but since Paypal has such a large number of assets, the entire job took about three hours per run. 

“The release of the Asset Export API allowed me to cut out nearly half of the code,” says Norman. “No longer did I have to query multiple APIs for each project. Now, with a simple bash script of around 60 lines, I was able to collect all of the relevant data in seconds. The remaining code primarily dealt with reading the resulting data and storing it correctly in CloudSQL and BigQuery.”

With the most recent release of the Asset Export API, Norman was able to write directly to BigQuery, eliminating 40% of the remaining code. What was left was rewritten in Go and supports collecting data external to GCP, such as G Suite data. SQL is then used to denormalize the collected information for reporting, auditing, and compliance efforts.

Here’s how a table with Cloud Asset Inventory data looks in BigQuery:

[Screenshot: a BigQuery table populated with Cloud Asset Inventory data]

For example, you can easily answer common questions like the following in BigQuery; sample queries are sketched below the list:

1. Find the quantity of each asset type.

2. Find Cloud IAM policies containing Gmail accounts as a member.
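
Here are hedged sketches of both queries, wrapped in the BigQuery Python client; the dataset and table names are placeholders, and the columns assume the Cloud Asset Inventory export schema (asset_type for resource snapshots, iam_policy.bindings for policy snapshots):

from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project ID

# 1. Quantity of each asset type in a resource snapshot table.
count_by_type = """
SELECT asset_type, COUNT(*) AS asset_count
FROM `my-project.asset_inventory.resource_snapshot`
GROUP BY asset_type
ORDER BY asset_count DESC
"""

# 2. IAM policies that grant a role to a Gmail account.
gmail_members = """
SELECT name, binding.role, member
FROM `my-project.asset_inventory.iam_policy_snapshot`,
     UNNEST(iam_policy.bindings) AS binding,
     UNNEST(binding.members) AS member
WHERE member LIKE "%@gmail.com"
"""

for row in client.query(count_by_type):
    print(row.asset_type, row.asset_count)

for row in client.query(gmail_members):
    print(row.name, row.role, row.member)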

With the broad resource and policy coverage from Cloud Asset Inventory, plus the powerful query capability of BigQuery, in-depth inventory analysis has gotten so much easier. Read more about how to analyze your asset data in BigQuery.

Try these new real-time notifications and BigQuery export features for better inventory management, monitoring, and deep analysis.