Unify Kubernetes and GCP resources for simpler and faster deployments

Adopting containers and Kubernetes means adopting new ways of doing things, not least of which is how you configure and maintain your resources. As a declarative system, Kubernetes lets you express your intent for your resources, then creates and updates them through continuous reconciliation. Compared with imperative configuration approaches, Kubernetes-style declarative config helps ensure that your organization follows GitOps best practices, such as storing configuration in a version control system and defining it in YAML.

However, applications that run on Kubernetes often use resources that live outside of Kubernetes, for example, Cloud SQL or Cloud Storage, and those resources typically don’t use the same approach to configuration. This can cause friction between teams, and force developers into frequent “context switching”. Further, configuring and operating those applications is a multi-step process: configuring the external resources, then the Kubernetes resources, and finally making the former available to the latter. 

To help, today we're announcing the general availability of Config Connector, which lets you manage Google Cloud Platform (GCP) resources as Kubernetes resources, giving you a single place to configure your entire application.

Config Connector is a Kubernetes operator that makes GCP resources behave as if they were Kubernetes resources, so you don't have to learn and use multiple conventions and tools to manage your infrastructure. For cloud-native developers, Config Connector simplifies operations and resource management by providing a uniform and consistent way to manage all of your cloud infrastructure through Kubernetes.

Automating infrastructure consistency

With its declarative approach, Kubernetes is continually reconciling the resources it manages. Resources managed by Kubernetes are continuously monitored and "self-heal" to meet the user's desired state. However, monitoring and reconciliation of non-Kubernetes resources (a SQL server instance, for example) happen as part of a separate workflow. In the most extreme cases, changes to your desired configuration, for example, changes to the number of your Cloud Spanner nodes, are not propagated to your monitoring and alerting infrastructure, causing false alarms and creating additional work for your teams.

By bringing these resources under the purview of Kubernetes with Config Connector, you get resource reconciliation across your infrastructure, automating the work of achieving eventual consistency. Instead of spinning up that SQL server instance separately and monitoring it for changes in a second workflow, you ask Config Connector to create a SQL server instance and a SQL database on that instance. Config Connector creates these resources, and now that they're part of your declarative configuration, the SQL server instance is effectively self-healing, just like the rest of your Kubernetes deployment.

Using the Kubernetes resource model relieves you of having to explicitly order resources in your deployment scripts. Just as with pods, deployments, or other native Kubernetes resources, you no longer have to wait for the SQL instance to be ready before provisioning a SQL database on that instance, as illustrated in the YAML manifests below.
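A minimal sketch of such manifests might look like this (the names, labels, and settings are illustrative placeholders; see the Config Connector resource reference for the authoritative fields):

    apiVersion: sql.cnrm.cloud.google.com/v1beta1
    kind: SQLInstance
    metadata:
      name: my-sql-instance          # placeholder name
      labels:
        cost-center: cc-1234         # illustrative cost-center label
    spec:
      region: us-central1
      databaseVersion: MYSQL_5_7
      settings:
        tier: db-n1-standard-1
    ---
    apiVersion: sql.cnrm.cloud.google.com/v1beta1
    kind: SQLDatabase
    metadata:
      name: my-sql-database          # placeholder name
      labels:
        cost-center: cc-1234
    spec:
      instanceRef:
        name: my-sql-instance        # references the instance above; Config Connector handles the ordering

You can apply both manifests together with kubectl apply -f, and Config Connector reconciles them in the right order.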

Additionally, by defining GCP resources as Kubernetes objects, you get to use familiar Kubernetes features with these resources, such as labels and selectors. For example, here we used cost-center as a label on the resources, so you can now filter by this label using kubectl get. Furthermore, you can apply your organization's governance policy using admission controllers, such as Anthos Policy Controller. For example, you can enforce that the cost-center label exists on all resources in the cluster and that it only takes values from an allowed range:
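As a sketch, assuming the K8sRequiredLabels constraint template from the Policy Controller (Gatekeeper) template library is installed, such a policy might look like this (the resource kinds and the allowed value pattern are illustrative):

    apiVersion: constraints.gatekeeper.sh/v1beta1
    kind: K8sRequiredLabels
    metadata:
      name: require-cost-center
    spec:
      match:
        kinds:
          - apiGroups: ["sql.cnrm.cloud.google.com"]
            kinds: ["SQLInstance", "SQLDatabase"]
      parameters:
        labels:
          - key: cost-center
            allowedRegex: "^cc-[0-9]{4}$"   # illustrative range of allowed values

Once this constraint is applied, any in-scope resource that is missing the label, or that uses a value outside the allowed pattern, is rejected at admission time.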

Faster development with simplified operations

For Etsy, Kubernetes was instrumental in helping them to move to the cloud, but the complexity of their applications meant they were managing resources in multiple places, slowing down their deployments.

“At Etsy, we run complex Kubernetes applications that combine custom code and cloud resources across many environments. Config Connector will allow Etsy to move from having two distinct, disconnected CI/CD pipelines to a single pipeline for both application code and the infrastructure it requires. Config Connector will simplify our delivery and enable end-to-end testing of cloud infrastructure changes, which we expect will result in faster deployment and lower friction usage of cloud infrastructure” – Gregg Donovan, Senior Staff Software Engineer, Etsy. 

Getting started with Config Connector

Today, Config Connector can be used to manage more than 60 GCP services, including Bigtable, BigQuery, IAM policies, service accounts and service account keys, Pub/Sub, Redis, Spanner, Cloud SQL, Cloud Storage, Compute Engine, networking, and Cloud Load Balancing.

Config Connector can be installed standalone on any Kubernetes cluster, and is also integrated into Anthos Config Management, for managing hybrid and multi-cloud environments. Get started with Config Connector today to simplify configuration management across GKE and GCP.

Using deemed SLIs to measure customer reliability

Do you own and operate a software service? If so, is your service a "platform"? In other words, does it run and manage applications for a wide range of users and/or companies? There are both simple and complex types of platforms, all of which serve customers. One example could be Google Cloud, which provides, among other things, relatively low-level infrastructure for starting and running VM images. A higher-level example of a platform might be a blogging service that allows any customer to create and contribute to a blog, design and sell merchandise featuring pithy blog quotes, and let readers send tips to the blog author.

If you do run a platform, it's going to break sooner or later. Some breakages are large and easy to understand, such as when no one can reach the websites hosted on your platform and your company's failure is all over social media. However, other kinds of breakage may be less obvious to you, but not to your customers. What if you've accidentally dropped all inbound network traffic from Kansas, for example?

At Google Cloud, we follow SRE principles to ensure reliability for our own systems, and also for the customers who partner with our Customer Reliability Engineering (CRE) team. A core SRE operating principle is the use of service-level indicators (SLIs) to detect when your users start having a bad time. In this blog post, we'll look at how to measure your platform customers' reliability using approximate SLIs, which we term "deemed SLIs." We use these to detect low-level outages and drive the operational response.

Why use deemed SLIs?

CRE founder Dave Rensin noted in his SRECon 2017 talk, Reliability When Everything Is A Platform, that as a platform operator, your monitoring doesn’t decide your reliability—your customers do! The best way to get direct visibility into your customers’ reliability experience is to get them to define their own SLIs, and share those signals directly with you. That level of transparency is wonderful, but it requires active and ongoing participation from your customers. What if your customers can’t currently prioritize the time to do this?

As a platform provider, you might use any number of internal monitoring metrics related to what’s happening with customer traffic. For instance, say you’re providing an API to a storage service:

  • You may be measuring the total number of queries and number of successful responses as cumulative numeric metrics, grouped by each API function.

  • You may also be recording the 95th percentile response latency with the same grouping, and get a good idea of how your service is doing overall by looking at the ratio of successful queries and the response latency values. If your success ratio suddenly drops from its normal value of 99% to 75%, you likely have many customers experiencing errors. Similarly, if the 95th percentile latency rises from 600ms to 1400ms, your customers are waiting much longer than normal for responses.

The key insight motivating the use of "deemed SLIs" is that metrics aggregated across all customers will miss edge cases, and your top customers are very likely to depend on those edge cases. You need to know about an outage affecting a top customer as soon as, or even before, they do. Therefore, you most likely want to know when any of your top customers is likely to experience a problem, even if most of your customers are fine.

Suppose FooCorp, one of your biggest customers, uses your storage service API to store virtual machine images:

  • They build and write three different images every 15 minutes.
  • The VM images are much larger than most blobs in your service. 
  • Every time one of their 10,000 virtual machines is restarted, it reads an image from the API. 
  • Therefore, their traffic rate is one write per five minutes and, assuming a daily restart of each VM, one read every 8.6 seconds.
  • Your overall API traffic rate is one write per second and 100 reads per second.

Let's say you roll out a change to your service that has a bug causing very large image reads and writes to time out and fail. You initially don't see any noticeable effect on your API's overall success rate and think your platform is running just fine. FooCorp, however, is furious. Wouldn't you like to know what just happened?

Implementation of deemed SLIs

The first and most important step is to be able to see key metrics at the granularity of a single customer. This requires careful assessment and trade-offs.

For our storage API, assuming we were originally storing two cumulative measures (success, total) and one gauge (latency) at one-minute intervals, we can measure and store three data points per minute with no problem at all. However, if we have 20,000 customers, a per-customer breakdown means storing 60,000 points per minute, which is a very different problem. Therefore, we need to be careful in selecting the metrics for which we provide a per-customer breakdown. In some cases, it may be sensible to have per-customer breakdowns only for a subset of customers, such as those contracting for a certain level of paid support.

Next, identify your top customers. “Top” could mean:

  • invests the most money on your platform;

  • is expected to invest the most money on your platform in the next two years;

  • is strategic from the point of view of partnerships or publicity; or even

  • raises the most support cases and hence causes the greatest operational load on your team.

As we mentioned, customers use your platform in different ways and, as a result, have different expectations of it. To find out what your customer might regard as an outage, you need to understand in some depth what their workload really does. For example, a customer's clients might automatically read data from your API every 30 minutes and update their state only if new information is available. In that case, even if the API is completely broken for an hour, that customer might barely notice.

To determine your deemed SLIs, combine your understanding of the customer's workload with the limited selection of per-customer metrics you have. Consider how volatile those metrics are over time and, if possible, how they behaved during a known customer outage. From this, pick the subset of metrics that you think best represents customer happiness. Identify the normal ranges of those metrics, and aggregate them into a dashboard view for that customer.

This is why we call these metrics “deemed SLIs”—you deem them to be representative of your particular customer’s happiness, in the absence of better information. 

Some of the metrics you look at for your deemed SLIs of the storage service might include:

  • Overall API success rate and latency

  • Read and write success rate for large objects (i.e., FooCorp’s main use case)

  • Read latency for objects below a certain size (i.e., excluding large image read bursts so there’s a clear view of API performance for its more common read use case).

The main challenges are:

  • Lack of technical transparency into the customer’s key considerations. For instance, if you only provide TCP load balancing to your customer, you can’t observe HTTP response codes. 

  • Lack of organizational transparency—you don’t have enough understanding of the customer’s workload to be able to identify what SLIs are meaningful to them.

  • Missing per-customer metrics. You might find that you need to know whether an API call is made internally or externally because the latter is the key representative of availability. However, this distinction isn’t captured in the existing metrics.

It's important to remember that we don't expect these metrics to be perfect at first; early on, they are often quite inconsistent with the customer's experience. So how do we fix this? Simple: we iterate.

Iteration when choosing deemed SLIs

Now sit back and wait for a significant outage of your platform. There’s a good chance that you won’t have to wait too long, particularly if you deploy configuration changes or binary releases often.

When your outage happens:

  • Do an initial impact analysis. Look at each of your deemed SLIs, see if they indicate an outage for that customer, and feed that information to your platform leadership.
  • Feed quantitative data into the postmortem being written for the incident. For example, “Top customer X first showed impact at 10:30 EST, reached a maximum of 30% outage at 10:50 EST, and had effectively recovered by 11:10 EST.”
  • Reach out to those customers via your account management teams, to discover what their actual impact was.

Here’s a quick reference table for what you need to do for each customer:

reference table.png

As you gain confidence in some of the deemed SLIs, you may start to set alerts for your platform’s on-call engineers based on those SLIs going out of bounds. For each such alert, see whether it represents a material customer outage, and adjust the bounds accordingly. 

It’s important to note that customers can also shoot themselves in the foot and cause SLIs to go out of bounds. For example, they might cause themselves a high error rate in the API by providing an out-of-date decryption key for the blob. In this case, it’s a real outage, and your on-caller might want to know about it. There’s nothing for the on-caller to do, however—the customer has to fix it themselves. At a higher level, your product team may also be interested in these signals because there may be opportunities to design the product to guard against customers making such mistakes—or at least advise the customer when they are about to do so.

If a top customer has too many “it was us, not the platform” alerts, that’s a signal to turn off the alerts until things improve. This may also indicate that your engineers should collaborate with the customer to improve their reliability on your platform.

When your on-call engineer gets deemed SLI alerts from multiple customers, on the other hand, they can have high confidence that the proximate cause is on the platform side.

Getting started with your own deemed SLIs

In Google Cloud, some of these metrics are exposed to customers directly through project-level Transparent SLIs.

If you run a platform, you need to know what your customers are experiencing.

  • Knowing that a top customer has started having a problem before they phone your support hotline shrinks incident detection time by many minutes, reduces the overall impact of the outage, and improves your relationship with that customer.

  • Knowing that several top customers have started to have problems can even be used to signal that a recent deployment should presumptively be rolled back, just in case.

  • Knowing roughly how many customers are affected by an outage is a very helpful signal for incident triage—is this outage minor, significant, or huge? 

Whatever your business, you know who your most important customers are. This week, go and look at the monitoring of your top three customers. Identify a “deemed SLI” for each of them, measure it in your monitoring system, and set up an automated alert for when those SLIs go squirrelly. You can tune your SLI selection and alert thresholds over the next few weeks, but right now, you are in tune with your top three customers’ experience on your platform. Isn’t that great? 

Learn more about SLIs and other SRE practices from previous blog posts and the online Site Reliability Workbook.


Thanks to additional contributions from Anna Emmerson, Matt Brown, Christine Cignoli and Jessie Yang.

Accelerate GCP Foundation Buildout with automation

We know from working with customers that starting your cloud journey can be daunting. Fortunately, there are a variety of formal options to help you on your way, such as engaging trusted advisors in the Google Cloud Professional Services Organization or one of the many partners in the Google Cloud universe.

To further accelerate your cloud journey, we recently released the Cloud Foundation Toolkit, a set of templates that help you rapidly build a strong cloud foundation according to best practices.

The Cloud Foundation Toolkit provides a series of reference templates built by the Google Cloud Professional Services team with help from partners, with a focus on the foundational elements of Google Cloud Platform. These modules are available both for the popular Terraform infrastructure-as-code framework and for our own Cloud Deployment Manager.

The templates themselves are entirely open source and available freely on GitHub. 

Top Cloud Foundation Toolkit modules
The Cloud Foundation Toolkit already includes more than 60 Terraform modules and 50 Deployment Manager modules (and counting). Below are some of the most popular and fundamental GCP components, according to GitHub repo stars and watches, to get you started:

Getting started
To get started with the Cloud Foundation Toolkit, first you need to understand Terraform or Deployment Manager. Then, to start using the toolkit itself, check out the Project Factory and GCP Folders modules. Watch this quick demo to learn more about the Deployment Manager integration, or this video to learn how to use the Cloud Foundation Toolkit with Terraform. Be sure to watch/star your favorite Cloud Foundation Toolkit repos and provide feedback by raising issues in their respective repositories.
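To give a sense of what working with the Terraform modules looks like, here's a hedged sketch of a Project Factory invocation (the project name, organization ID, folder ID, and billing account are placeholders; in practice you'd also pin a module version, and the module's README documents the full set of inputs):

    module "example_project" {
      source          = "terraform-google-modules/project-factory/google"
      name            = "example-service-project"   # placeholder project name
      org_id          = "123456789012"               # placeholder organization ID
      folder_id       = "folders/234567890123"       # placeholder folder ID
      billing_account = "ABCDEF-012345-6789AB"       # placeholder billing account
    }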

Kubernetes development, simplified—Skaffold is now GA

Back in 2017, we noticed that developers creating Kubernetes-native applications spent a long time building and managing container images across registries, manually updating their Kubernetes manifests, and redeploying their applications every time they made even the smallest code changes. We set out to create a tool to automate these tasks, helping them focus on writing and maintaining code rather than managing the repetitive steps required during the edit-debug-deploy ‘inner loop’. From this observation, Skaffold was born.

Today, we’re announcing our first generally available release of Skaffold. Skaffold simplifies common operational tasks that you perform when doing Kubernetes development, letting you focus on your code changes and see them rapidly reflected on your cluster. It’s the underlying engine that drives Cloud Code, and a powerful tool in and of itself for improving developer productivity.

Skaffold’s central command, skaffold dev, watches local source code for changes, and rebuilds and redeploys applications to your cluster in real time. But Skaffold has grown to be much more than just a build and deployment tool—instead, it’s become a tool to increase developer velocity and productivity.

Feedback from Skaffold users bears this out. “Our customers love [Kubernetes], but consistently gave us feedback that developing on Kubernetes was cumbersome. Skaffold hit the mark in addressing this problem,” says Warren Strange, Engineering Director at ForgeRock. “Changes to a Docker image or a configuration that previously took several minutes to deploy now take seconds. Skaffold’s plugin architecture gives us the ability to deploy to Helm or Kustomize and use various Docker build plugins such as Kaniko. Skaffold replaced our bespoke collection of utilities and scripts with a streamlined tool that is easy to use.”

A Kubernetes developer’s best friend

Skaffold is a command line tool that saves developers time by automating most of the development workflow from source to deployment in an extensible way. It natively supports the most common image-building and application deployment strategies, making it compatible with a wide variety of both new and pre-existing projects. Skaffold also operates completely on the client-side, with no required components on your cluster, making it super lightweight and high-performance.

Skaffolds inner development loop.png
Skaffold’s inner development loop

By taking care of the operational tasks of iterative development, Skaffold removes a large burden from application developers and substantially improves productivity.

Over the last two years, there have been more than 5,000 commits from nearly 150 contributors to the Skaffold project, resulting in 40 releases, and we’re confident that Skaffold’s core functionality is mature. To commemorate this, let’s take a closer look at some of Skaffold’s core features.

Fast iterative development
When it comes to development, skaffold dev is your personal ops assistant: it knows about the source files that comprise your application, watches them while you work, and rebuilds and redeploys only what’s necessary. Skaffold comes with highly optimized workflows for local and remote deployment, giving you the flexibility to develop against local Kubernetes clusters like Minikube or Kind, as well as any remote Kubernetes cluster.
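A minimal skaffold.yaml is enough to drive this loop. The sketch below assumes a single Docker-built image deployed with kubectl; the image name and manifest path are placeholders, and the schema version depends on your Skaffold release:

    apiVersion: skaffold/v1
    kind: Config
    build:
      artifacts:
        - image: gcr.io/my-project/my-app   # placeholder image, built from the local Dockerfile
    deploy:
      kubectl:
        manifests:
          - k8s/*.yaml                      # placeholder path to your Kubernetes manifests

Run skaffold dev in the same directory, and every file change triggers a rebuild and redeploy against your current Kubernetes context.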

“Skaffold is an amazing tool that simplified development and delivery for us,” says Martin Höfling, Principal Consultant at TNG Technology Consulting GmbH. “Skaffold hit our sweet spot by covering two dimensions: First, the entire development cycle from local development, integration testing to delivery. Second, Skaffold enabled us to develop independently of the platform on Linux, OSX, and Windows, with no platform-specific logic required.”

Skaffold’s dev loop also automates typical developer tasks. It automatically tails logs from your deployed workloads, and port-forwards the remote application to your machine, so you can iterate directly against your service endpoints. Using Skaffold’s built-in utilities, you can do true cloud-native development, all while using a lightweight, client-side tool.

Production-ready CI/CD pipelines
Skaffold can be used as a building block for your production-level CI/CD pipelines. Taylor Barrella, Software Engineer at Quora, says that “Skaffold stood out as a tool we’d want for both development and deployment. It gives us a common entry point across applications that we can also reuse for CI/CD. Right now, all of our CI/CD pipelines for Kubernetes applications use Skaffold when building and deploying.”

Skaffold can be used to build images and deploy applications safely to production, reusing most of the same tooling that you use to run your applications locally. skaffold run runs an entire pipeline from build to deploy in one simple command, and can be decomposed into skaffold build and skaffold deploy for more fine-tuned control over the process. skaffold render can be used to build your application images, and output templated Kubernetes manifests instead of actually deploying to your cluster, making it easy to integrate with GitOps workflows.
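Here's a hedged sketch of how those commands compose in a pipeline (flag names can vary between Skaffold versions, so confirm them with skaffold <command> --help):

    # One-shot pipeline: build, tag, and deploy in a single step
    skaffold run

    # Or decompose it for finer control, passing the build results
    # from the build stage to the deploy stage
    skaffold build --file-output=artifacts.json
    skaffold deploy --build-artifacts=artifacts.json

    # Render hydrated manifests for a GitOps repository instead of deploying
    skaffold render --output=rendered.yaml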

Profiles let you use the same Skaffold configuration across multiple environments, express the differences via a Skaffold profile for each environment, and activate a specific profile using the current Kubernetes context. This means you can push images and deploy applications to completely different environments without ever having to modify the Skaffold configuration. This makes it easy for all members of a team to share the same Skaffold project configuration, while still being able to develop against their own personal development environments, and even use that same configuration to do deployments to staging and production environments.
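For example, a profile can be activated automatically by the Kubernetes context you're pointed at, overriding only what differs per environment. This fragment, appended to a skaffold.yaml, is illustrative (the context name and manifest path are hypothetical):

    profiles:
      - name: staging
        activation:
          - kubeContext: gke_my-project_us-central1_staging   # hypothetical context name
        deploy:
          kubectl:
            manifests:
              - k8s/staging/*.yaml

You can also select a profile explicitly, for example with skaffold run -p staging.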

On-cluster application debugging
Skaffold can help with a whole lot more than application deployment, not least of which is debugging. Skaffold natively supports direct debugging of Golang, NodeJS, Java, and Python code running on your cluster!

The skaffold debug command runs your application with a continuous build and deploy loop, and forwards any required debugging ports to your local machine. This allows Skaffold to automatically attach a debugger to your running application. Skaffold also takes care of any configuration changes dynamically, giving you a simple yet powerful tool for developing Kubernetes-native applications. skaffold debug powers the debugging features in Cloud Code for IntelliJ and Cloud Code for Visual Studio Code.

google cloud code.png

Cloud Code: Kubernetes development in the IDE

Cloud Code comes with tools to help you write, deploy, and debug cloud-native applications quickly and easily. It provides extensions to IDEs such as Visual Studio Code and IntelliJ to let you rapidly iterate, debug, and deploy code to Kubernetes. If that sounds similar to Skaffold, that’s because it is—Skaffold powers many of the core features that make Cloud Code so great! Things like local debugging of applications deployed to Kubernetes and continuous deployment are baked right into the Cloud Code extensions with the help of Skaffold.

To get the best IDE experience with Skaffold, try Cloud Code for Visual Studio Code or IntelliJ IDEA!

What’s next?

Our goal with Skaffold and Cloud Code is to offer industry-leading tools for Kubernetes development, and since Skaffold’s inception, we’ve engaged the broader community to ensure that Skaffold evolves in line with what users want. There are some amazing ideas from external contributors that we’d love to see come to fruition, and with the Kubernetes development ecosystem still in a state of flux, we’ll prioritize features that will have the most impact on Skaffold’s usefulness and usability. We’re also working closely with the Cloud Code team to surface Skaffold’s capabilities inside your IDE.

With the move to general availability, there's never been a better time to start using (or continue using) Skaffold, trusting that it will provide an excellent, production-ready development experience you can rely on.

For more detailed information and docs, check out the Skaffold webpage, and as always, you can reach out to us on GitHub and Slack.


Special thanks to all of our contributors (you know who you are) who helped make Skaffold the awesome tool it is today!

How to integrate Policy Intelligence recommendations into an IaC pipeline

Chances are, you want to configure your Google Cloud environment for optimal security, cost and efficiency. Lucky for you, Google Cloud Policy Intelligence helps you do just that. Policy Intelligence’s new IAM and Compute Engine Rightsizing recommenders are currently in beta and automatically suggest ways to make your cloud deployment more secure and cost-effective.

google cloud iam.png

It's easy enough to review and apply these recommendations from within the Google Cloud Console. But what if you use Infrastructure as Code (IaC)? Treating your cloud infrastructure as code can make the administration, rollout, and upkeep of your environment more consistent and repeatable, and free your teams from having to troubleshoot snowflake environments that tend to drift over time. If you do have IaC pipelines, you may need to manually review your IaC manifests to prevent infrastructure drift, and to ensure that they reflect any recommendations you apply from within the Google Cloud Console.

Further, as your Google Cloud footprint expands, relying on manual techniques alone to review and track recommendations is inefficient. 

What if you could make Policy Intelligence recommenders and your IaC pipelines work together?

In a perfect world, you'd be able to use the recommendations that GCP surfaces with your repeatable IaC pipelines. Imagine if you could set up a serverless pipeline to track Policy Intelligence recommendations, automatically update your IaC manifests, generate pull requests for authorized teams to review and approve, and finally roll them out with your CI/CD tool.

iac pipelines.png

Turns out, you can! To get you started, here's a tutorial that shows you how.

In this tutorial, we walk you through a pipeline that parses the Policy Intelligence recommendations generated by the platform and maps them to the configuration in your Terraform manifests. The service updates your IaC manifests to reflect these recommendations and generates a pull request for your teams to review and approve. Upon approval and merge, a Cloud Build job rolls out the changes to the infrastructure in your GCP organization.
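As a rough sketch of the pipeline's input stage, you could export the current IAM recommendations with the gcloud recommender surface, which was in beta at the time of writing (the project ID is a placeholder, and the command and flag names are worth double-checking against the current documentation):

    # Export active IAM recommendations for a project as JSON, ready to be
    # mapped against the bindings declared in your Terraform manifests
    gcloud beta recommender recommendations list \
      --project=my-project \
      --location=global \
      --recommender=google.iam.policy.Recommender \
      --format=json > iam-recommendations.json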

Now you don't have to choose between following Policy Intelligence's latest recommendations and practicing Infrastructure as Code. The open source codebase for this tutorial is available on GitHub. Download it today and modify it to suit your specific DevOps pipeline.

Push configuration with zero downtime using Cloud Pub/Sub and Spring Framework

As application configuration grows more complex, treating it with the same care we treat code—applying best practices for code review, and rolling it out gradually—makes for more stable, predictable application behavior. But deploying application configuration together with the application code takes away a lot of the flexibility that having separate configuration offers in the first place. Compared with application code, configuration data has different:

  • Granularity – per server or per region, rather than (unsurprisingly) per application.

  • Lifecycle – configuration often changes more frequently than the application if you don't deploy very often; less frequently, perhaps, if you embrace continuous deployment for code.

This leads us to an important best practice for software development teams: separating code from configuration when deploying applications. More recently, DevOps teams have started to practice “configuration as code”—storing configuration in version-tracked repositories. 

But if you update your configuration data separately, how will your code learn about it and use it? It’s possible, of course, to push new settings and restart all application instances to pick up the updates, but that could result in unnecessary downtime.

If you’re a Java developer and use the Spring Framework, there’s good news. Spring Cloud Config lets applications monitor a variety of sources (source control, database etc.) for configuration changes. It then notifies all subscriber applications that changes are available using Spring Cloud Bus and the messaging technology of your choice. 

If you’re running on Google Cloud, one great messaging option is Cloud Pub/Sub. In the remainder of this blog post, you’ll learn how to configure Spring Cloud Config and Spring Cloud Bus with Cloud Pub/Sub, so you can enjoy the benefits of configuration maintained as code and propagated to environments automatically.

Setting up the server and the client

Imagine you want to store your application configuration data in a GitHub repository. You’ll need to set up a dedicated configuration server (to monitor and fetch configuration data from its true source), as well as a configuration client embedded in the application that contains your business logic. In a real world scenario, you’d have many business applications or microservices, each of which has an embedded configuration client talking to the server and retrieving the latest configuration from it. You can find the full source code for all the examples in this post in this Spring Cloud GCP sample app.

Spring Cloud GCP.png

Configuration server setup

To take advantage of the power of distributed configuration, it’s common to set up a dedicated configuration server. You configure a GitHub webhook to notify it whenever there are changes, and the configuration server, in turn, notifies all the interested applications that run the business logic that new configuration is available to be picked up.

The configuration server has the following three dependencies (we recommend using the Spring Cloud GCP Bill Of Materials for setting up dependency versions):

pom.xml
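A sketch of those dependencies follows (artifact coordinates as they appeared in the 1.x Spring Cloud GCP line; versions come from the BOM mentioned above):

    <dependencies>
      <!-- Cloud Pub/Sub as the Spring Cloud Bus transport -->
      <dependency>
        <groupId>org.springframework.cloud</groupId>
        <artifactId>spring-cloud-gcp-starter-bus-pubsub</artifactId>
      </dependency>
      <!-- Turns the application into a Spring Cloud Config server -->
      <dependency>
        <groupId>org.springframework.cloud</groupId>
        <artifactId>spring-cloud-config-server</artifactId>
      </dependency>
      <!-- Exposes the /monitor endpoint that receives GitHub webhooks -->
      <dependency>
        <groupId>org.springframework.cloud</groupId>
        <artifactId>spring-cloud-config-monitor</artifactId>
      </dependency>
    </dependencies>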

The first dependency, spring-cloud-gcp-starter-bus-pubsub, ensures that Cloud Pub/Sub is the Spring Cloud Bus implementation that powers all the messaging functionality.

The other two dependencies make this application act as a Spring Cloud Config server capable of being notified of changes by the configuration source (GitHub) on the /monitor HTTP endpoint it sets up.

The config server application also needs to be told where to find the updated configuration; we use a standard Spring application properties file to point it to the GitHub repository containing the configuration:

application.properties
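For example (the repository URL is a placeholder for your own configuration repository):

    # GitHub repository that holds the configuration files
    spring.cloud.config.server.git.uri=https://github.com/<your-org>/<your-config-repo>
    # The client application below uses the default port 8080, so run the
    # config server on the conventional config-server port instead
    server.port=8888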

You'll need to customize the port if you are running the example locally. Like all Spring Boot applications, the configuration server runs on port 8080 by default, but that port is used by the business application we are about to configure, so an override is needed.

The last piece you need to run a configuration server is the Java code!

PubSubConfigGitHubServerApplication.java
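A minimal version of that class might look like this (package declaration omitted for brevity):

    import org.springframework.boot.SpringApplication;
    import org.springframework.boot.autoconfigure.SpringBootApplication;
    import org.springframework.cloud.config.server.EnableConfigServer;

    // A standard Spring Boot entry point; @EnableConfigServer does the rest.
    @SpringBootApplication
    @EnableConfigServer
    public class PubSubConfigGitHubServerApplication {

      public static void main(String[] args) {
        SpringApplication.run(PubSubConfigGitHubServerApplication.class, args);
      }
    }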

As is typical for Spring Boot applications, the boilerplate code is minimal—all the functionality is driven by a single annotation, @EnableConfigServer. This annotation, combined with the dependencies and configuration, gives you a fully functional configuration server capable of being notified when a new configuration arrives by way of the /monitor endpoint. Then, in turn, the configuration server notifies all the client applications through a Cloud Pub/Sub topic.

Speaking of the Cloud Pub/Sub topic, if you run just the server application, you’ll notice in the Google Cloud Console that a topic named springCloudBus was created for you automatically, along with a single anonymous subscription (a bit of trivia: every configuration server is capable of receiving the configuration it broadcasts, but configuration updates are suppressed on the server by default).

Configuration client setup

Now that you have a configuration server, you’re ready to create an application that subscribes to that server’s vast (well… not that vast) knowledge of configuration.

The client application dependencies are as follows:

pom.xml
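A sketch of the client's dependencies (again with versions managed by the BOM; spring-boot-starter-web is included only because the demo client is a web application):

    <dependencies>
      <!-- Subscribes to configuration-change notifications over Cloud Pub/Sub -->
      <dependency>
        <groupId>org.springframework.cloud</groupId>
        <artifactId>spring-cloud-gcp-starter-bus-pubsub</artifactId>
      </dependency>
      <!-- Pulls configuration from the config server over HTTP -->
      <dependency>
        <groupId>org.springframework.cloud</groupId>
        <artifactId>spring-cloud-config-client</artifactId>
      </dependency>
      <!-- The demo client exposes a simple HTTP endpoint -->
      <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-web</artifactId>
      </dependency>
    </dependencies>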

The client needs a dependency on spring-cloud-gcp-starter-bus-pubsub, just as the server did. This dependency enables the client application to subscribe to configuration change notifications arriving over Cloud Pub/Sub. The notifications do not contain the configuration changes; the client applications will pull those over HTTP.

Notice that the client application only has one Spring Cloud Config dependency: spring-cloud-config-client. This application doesn’t need to know how the server finds out about configuration changes, hence the simple dependency.

For this demo, we made a web application, but client applications can be any type of application that you need. They don’t even need to be Java applications, as long as they know how to subscribe to a Cloud Pub/Sub topic and retrieve content from an HTTP endpoint!

Nor do you need any special application configuration for a client application. By default, all configuration clients look for a configuration server on local port 8888 and subscribe to a topic named springCloudBus. To customize the configuration server location for a real-world deployment, simply configure the spring.cloud.config.uri property in the bootstrap.properties file, which is read before the regular application initialization. To customize the topic name, add the spring.cloud.bus.destination property to the regular application.properties file, making sure that the config server and all client applications have the same value.

And now, it’s time to add the client application’s code:

PubSubConfigApplication.java
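A minimal sketch of the application class:

    import org.springframework.boot.SpringApplication;
    import org.springframework.boot.autoconfigure.SpringBootApplication;

    // Plain Spring Boot entry point; all the config-bus behavior comes from the dependencies.
    @SpringBootApplication
    public class PubSubConfigApplication {

      public static void main(String[] args) {
        SpringApplication.run(PubSubConfigApplication.class, args);
      }
    }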

ExampleController.java
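And a sketch of the controller (the property name example.message is illustrative; the "none" default is served until the config server supplies a value):

    import org.springframework.beans.factory.annotation.Value;
    import org.springframework.cloud.context.config.annotation.RefreshScope;
    import org.springframework.web.bind.annotation.GetMapping;
    import org.springframework.web.bind.annotation.RestController;

    // @RefreshScope re-reads the property whenever a refresh event arrives on the bus.
    @RefreshScope
    @RestController
    public class ExampleController {

      @Value("${example.message:none}")
      private String message;

      @GetMapping("/message")
      public String getMessage() {
        return this.message;
      }
    }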

Again, the boilerplate here is minimal—PubSubConfigApplication starts up a Spring Boot application, and ExampleController sets up a single HTTP endpoint /message. If no configuration server is available, the endpoint serves the default message of “none”. If a configuration server is found on the default localhost:8888 URL, the configuration found there at client startup time will be served. The @RefreshScope annotation ensures that the message property gets a new value whenever a configuration refresh event is received.

The code is now complete! You can use the mvn spring-boot:run command to start up the config server and client in different terminals and try it out. 

To test that configuration changes propagate from GitHub to the client application, update configuration in your GitHub repository, and then manually invoke the /monitor endpoint of your config server (you would configure this to be done automatically through a GitHub webhook for a deployed config server):
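A hand-rolled invocation might look like this (the payload mimics a minimal GitHub push event, which is what spring-cloud-config-monitor expects on /monitor; adjust the file name to whichever configuration file you changed):

    curl -X POST http://localhost:8888/monitor \
      -H "Content-Type: application/json" \
      -H "X-Github-Event: push" \
      -d '{"commits": [{"modified": ["application.properties"]}]}'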

After running the above command, the /message endpoint serves the most recent value retrieved from GitHub.

And that’s all that’s required for a basic Spring Cloud Config with Cloud Pub/Sub-enabled bus server/client combination. In the real world, you’ll most likely serve different configurations to different environments (dev, QA etc.). Because Spring Cloud Config supports hierarchical representation of configuration, it can grow to adapt to any environment setup.

For more information, visit the Spring Cloud GCP documentation and sample.

Cloud Build named a Leader for Continuous Integration in the Forrester Wave

Today, we are honored to share that Cloud Build, Google Cloud's continuous integration (CI) and continuous delivery (CD) platform, was named a Leader in The Forrester Wave™: Cloud-Native Continuous Integration Tools, Q3 2019. The report identifies the 10 providers that matter most for cloud-native CI and how they stack up against 27 criteria. Cloud Build received the highest score in both the current offering and strategy categories.

“Google Cloud Build comes out swinging, matching up well with other cloud giants. Google Cloud Build is relatively new when compared to the other public cloud CI offerings; this vendor had a lot to prove, and it did…. Customer references are happy to trade the cost of paying for the operation and management of build servers for moving their operations to Google’s pay-per-compute system. One reference explained that it’s in the middle of a cloud shift and is currently executing up to 650 builds per day. With that proved out, this customer plans to move an additional 250 repositories to the cloud” – Forrester Wave™ report

Top score in the Current Offering category: Among all 10 CI providers evaluated in the Wave, Cloud Build got the highest score in the Current Offering category. The Current Offering score is based on Cloud Build’s strength in the developer experience, build speed and scale, enterprise security and compliance, and enterprise support, amongst other criteria.

Top score in the Strategy category: Along with the top score in the Current Offering category, Cloud Build also received the highest score in the Strategy category. In particular, Cloud Build's scores in the partner ecosystem, commercial model, enterprise strategy and vision, and product roadmap criteria contributed to the result.

CI plays an increasingly important role in DevOps, allowing enterprises to drive quality from the start of their development cycle. The report notes that "cloud-native CI is the secret development sauce that enterprises need to be fast, responsive, and ready to take on incumbents and would-be digital disruptors."

At Google, we've seen first-hand how a cloud-based, serverless CI tool can help drive quality and security at scale, and the lessons we've learned in that process manifest directly in Cloud Build.

Today, organizations of all sizes use Cloud Build to drive productivity improvements via automated, repeatable CI processes. Some customers include Zendesk, Shopify, Snap, Lyft, and Vendasta, who chose Cloud Build for its:

  • Fully serverless platform: Cloud Build scales up and down in response to load, with no need to pre-provision servers or pay in advance for additional capacity.

  • Flexibility: With custom build steps and pre-created extensions to third-party apps, enterprises can easily tie their legacy or home-grown tools into their build process.

  • Security and compliance features: Developers can perform deep security scans within the CI/CD pipeline and ensure only trusted container images are deployed to production.

We’re thrilled by Forrester’s recognition for Cloud Build. You can download a copy of the report here.

Shrinking the impact of production incidents using SRE principles—CRE Life Lessons

If you run any kind of internet service, you know that production incidents happen. No matter how much robustness you've engineered into your architecture, no matter how careful your release process, eventually the right combination of things goes wrong and your customers can't effectively use your service.

You work hard to build a service that your users will love. You introduce new features to delight your current users and to attract new ones. However, when you deploy a new feature (or make any change, really), it increases the risk of an incident; that is, something user-visible goes wrong. Production incidents burn customer goodwill. If you want to grow your business, and keep your current users, you must find the right balance between reliability and feature velocity. The cool part is, though, that once you do find that balance, you’ll be poised to increase both reliability and feature velocity.

In this post, we’ll break down the production incident cycle into phases and correlate each phase with its effect on your users. Then we’ll dive into how to minimize the cost of reliability engineering to keep both your users and your business happy. We’ll also discuss the Site Reliability Engineering (SRE) principles of setting reliability targets, measuring impact, and learning from failure so you can make data-driven decisions on which phase of the production incident cycle to target for improvements.

Understanding the production incident cycle

A production incident is something that affects the users of your service negatively enough that they notice and care. Your service and its environment are constantly changing. A flood of new users exploring your service (yay!) or infrastructure failures (boo!), for example, threaten the reliability of your service. Production incidents are a natural—if unwelcome—consequence of your changing environment. Let’s take a look at the production incident cycle and how it affects the happiness of your users:

1 User happiness falls.png
User happiness falls during a production incident and stabilizes when the service is reliable.

Note that the time between failures for services includes the time for the failure itself. This differs from the traditional measure since modern services can fail in independent, overlapping ways. We want to avoid negative numbers in our analysis.

Your service-level objective, or SLO, represents the level of reliability below which your service will make your users unhappy in some sense. Your goal is clear: Keep your users happy by sustaining service reliability above its SLO. Think about how this graph could change if the time to detect or the time to mitigate were shorter, or if the slope of the line during the incident were less steep, or if you had more time to recover between incidents. You would be in less danger of slipping into the red. If you reduce the duration, impact, and frequency of production incidents—shrinking them in various ways—it helps keep your users happy.

Graphing user happiness vs. reliability vs. cost

If keeping your reliability above your SLO will keep most of your users happy, how much higher than your SLO should you aim? The further below your SLO you go, of course, the unhappier your users become. The amazing thing, though, is that the further above your SLO target you go, the more indifferent your users become to the extra reliability. You will still have incidents, and your users will notice them, but as long as your service is, on average, above its SLO, the incidents happen infrequently enough that your users stay sufficiently satisfied. In other words, once you're above your SLO, improving your reliability further is not valuable to your users.

2 The optimal SLO threshold.png
The optimal SLO threshold keeps most users happy while minimizing engineering costs.

Reliability is not cheap. There are costs not only in engineering hours, but also in lost opportunities. For example, your time to market may be delayed due to reliability requirements. Moreover, reliability costs tend to be exponential. This means it can be 100 times more expensive to run a service that is 10 times more reliable. Your SLO sets a minimum reliability requirement, something strictly less than 100%. If you’re too far above your SLO, though, it indicates that you are spending more on reliability than you need to. The good news is that you can spend your excess reliability (i.e., your error budget) on things that are more valuable than maintaining excess reliability that your users don’t notice. You could, for example, release more often, run stress tests against your production infrastructure and uncover hidden problems, or let your developers work on features instead of more reliability. Reliability above your SLO is only useful as a buffer to prevent your users from noticing your instability. Stabilize your reliability, and you can maximize the value you get out of your error budget.

3 An unstable reliability curve.png
An unstable reliability curve prevents you from spending your error budget efficiently.

Laying the foundation to shrink production incidents

When you’re thinking about best practices for improving phases of the production incident cycle, there are three SRE principles that particularly matter for this task. Keep these in mind as you think about reliability.

1. Create and maintain SLOs
When SREs talk about reliability, SLOs tend to come up a lot. They’re the basis for your error budgets and define the desired measurable reliability of your service. SLOs have an effect across the entire production incident cycle, since they determine how much effort you need to put into your preparations. Do your users only need a 90% SLO? Maybe your current “all at once” version rollout strategy is good enough. Need a 99.95% SLO? Then it might be time to invest in gradual rollouts and automatic rollbacks.
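To make that concrete: over a 30-day month (43,200 minutes), a 90% SLO leaves an error budget of 4,320 minutes, or three full days of unavailability, while a 99.95% SLO leaves only about 21.6 minutes, which is hard to stay within if detecting and rolling back a bad release is a fully manual process.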

4 SLOs closer to 100%.png
SLOs closer to 100% take greater effort to maintain, so choose your target wisely.

During an incident, your SLOs give you a basis for measuring impact. That is, they tell you when something is bad, and, more importantly, exactly how bad it is, in terms that your entire organization, from the people on call to the top-level executives, can understand.

If you’d like help creating good SLOs, there is an excellent (and free, if you don’t need the official certification) video walkthrough on Coursera.

2. Write postmortems
Think of production incidents as unplanned investments where all the costs are paid up front. You may pay in lost revenue. You may pay in lost productivity. You always pay in user goodwill. The returns on that investment are the lessons you learn about avoiding (or at least reducing the impact of) future production incidents. Postmortems are a mechanism for extracting those lessons. They record what happened and why it happened, and they identify specific areas to improve. It may take a day or more to write a good postmortem, but doing so captures the value of your unplanned investment instead of letting it evaporate.

5 Identifying both technical and non-technical causes.png
Identifying both technical and non-technical causes of incidents is key to preventing recurrence.

When should you write a postmortem? Write one whenever your SLO takes a significant hit. Your postmortems become your reliability feedback loop. Focus your development efforts on the incident cycle phases that have recurring problems. Sometimes you’ll have a near miss when your SLO could have taken a hit, but it didn’t because you got lucky for some reason. You’ll want to write one then, too. Some organizations prefer to have meetings to discuss incidents instead of collaborating on written postmortems. Whatever you do, though, be sure to leave some written record that you can later use to identify trends. Don’t leave your reliability to luck! As the SRE motto says: Hope is not a strategy. Postmortems are your best tool for turning hope into concrete action items.

For really effective postmortems, those involved in the incident need to be able to trust that their honesty in describing what happened during the incident won’t be held against them. For that, you need the final key practice:

3. Promote a blameless culture
A blameless culture recognizes that people will do what makes sense to them at the time. It’s taken as a given that later analysis will likely determine these actions were not optimal (or sometimes flat-out counterproductive). If a person’s actions initiated a production incident, or worsened an existing one, we should not blame the person. Rather we should seek to make improvements in the system to positively influence the person’s actions during the next emergency.

6 A blameless culture.png
A blameless culture means team members assume coworkers act with good intentions and seek technical solutions to human fallibility instead of demanding perfection from people.

For example, suppose an engineer is paged in the middle of the night, acknowledges the page, and goes back to bed while a production incident develops. In the morning we could fire that engineer and assume the problem is solved now that there are only “competent” engineers on the team. But to do so would be to misunderstand the problem entirely: competence is not an intrinsic property of the engineer. Rather, it’s something that arises from the interaction between the person and the system that conditions them, and the system is the one we can change to durably affect future results. What kind of training are the on-call engineers given? Did the alert clearly convey the gravity of the incident? Was the engineer receiving more alerts than they could handle? These are the questions we should investigate in the postmortem. The answers to these questions are far more valuable than determining just that one person dropped the ball.

A blameless culture is essential for people to be unafraid to reach out for help during an emergency and to be honest and open in the resulting postmortem. This makes the postmortem more useful as a learning tool. Without a blameless culture, incident response is far more stressful. Your first priority becomes protecting yourself and your coworkers from blame instead of helping your users. This could come out as a lack of diligence, too. Investigations may be shallow and inconclusive if specifics could get someone—maybe you—fired. This ultimately harms the users of your service.

Blameless culture doesn’t happen overnight. If your organization does not already have a blameless culture, it can be quite a challenge to kick-start it. It requires significant support from all levels of management in order to succeed. But once a blameless culture has taken root, it becomes much easier to focus on identifying and fixing systemic problems.

What’s next?

If you haven't already, start thinking about SLOs, postmortems, and blameless culture, and discuss them with your coworkers. Think about what it would take to stabilize your reliability curve, and what your organization could do if you had that stability. And if you're just getting started with SRE, learn more about developing your SRE journey.

Many thanks to Nathan Bigelow, Matt Brown, Christine Cignoli, Jesús Climent Collado, David Ferguson, Gustavo Franco, Eric Harvieux, Adrian Hilton, Piotr Hołubowicz, Ib Lundgren, Kevin Mould, and Alec Warner for their contributions to this post.

The 2019 Accelerate State of DevOps: Elite performance, productivity, and scaling

DevOps Research and Assessment (DORA), a pioneer in helping organizations achieve high DevOps and organizational performance with data-driven insights, and Google Cloud are excited to announce the launch of the 2019 Accelerate State of DevOps Report. The report offers a comprehensive view of the DevOps industry, with actionable guidance for organizations of all sizes and in all industries to improve their software delivery performance and ultimately become elite DevOps performers. With six years of research and data from more than 31,000 professionals worldwide, the 2019 Accelerate State of DevOps Report is the largest and longest-running research of its kind.

New insights in 2019

We saw continued evidence that software speed, stability, and availability contribute to organizational performance, and this year we were able to uncover new insights into the practices and capabilities that drive high DevOps performance. Some insights include:

  • DevOps has “crossed the chasm”: Organizations across industries continue to improve their DevOps expertise, particularly among the highest performers. The proportion of elite performers has almost tripled, now at 20% of all organizations. This confirms reports from other industry analysts.
DevOps insights.jpg
  • Elite performers are more likely to use the cloud: Fast autoscaling, cost visibility, and reliability are some of the key benefits offered by cloud computing. The highest performing DevOps teams were 24 times more likely than low performers to execute on all five capabilities of cloud computing defined by the National Institute of Standards and Technology (NIST), which include on-demand self-service, broad network access, resource pooling, rapid elasticity, and measured service.

  • Most cloud users aren’t using it to its full potential: Only 29% of respondents who use cloud met all five of NIST’s above-mentioned criteria. This underscores the fact that organizations who claim to use cloud computing haven’t necessarily adopted all the essential patterns that matter for driving elite performance, which could be holding them back from reaping the benefits of the cloud.

  • For the first time, industry matters: In this year's report, the retail industry saw better performance in terms of both speed and stability. However, consistent with previous years, we saw continued evidence that no other industry sees notably better or worse performance. This suggests that organizations of all types and sizes, including those in highly regulated industries such as financial services and government, can achieve high levels of performance by adopting DevOps practices.

  • DevOps in the enterprise — Part 1: For the first time, we found evidence that enterprise organizations (those with more than 5,000 employees) are lower performers than those with fewer than 5,000 employees. Heavyweight processes and controls, as well as tightly coupled architectures, are among the reasons for slower speed and the associated instability.

  • DevOps in the enterprise — Part 2: Our analysis shows the highest DevOps performers (that is, the high and elite performers), focus on structural solutions that build community, which fall into one of these four patterns: Community Builders, University, Emergent, and Experimenters. 

  • No “one size fits all” approach, but concurrent efforts drive success: When investing in DevOps capabilities, particularly in large organizations, focus needs to be on both team-level and organization-level efforts. Continuous integration, automated testing, and monitoring are some of the efforts that work well at the team level.  Examples of organization-level capabilities include the ability to set architectural or change approval policies that span across departments and teams. The report breaks down these capabilities and outlines the strategies to adopt so you can execute on a DevOps strategy for maximum impact. 

  • Low performers use more proprietary software than high and elite performers: The cost to maintain and support proprietary software can be prohibitive, prompting high and elite performers to use open source solutions. This is in line with results from previous reports. In fact, the 2018 Accelerate State of DevOps Report found that elite performers were 1.75 times more likely to make extensive use of open source components, libraries, and platforms. 

How do you improve at DevOps?

This year’s report provides two research models to help drive DevOps improvements: performance and productivity. 

The performance research model looks at the constructs and levers you can pull to drive organizational performance, providing insights into how cloud, continuous delivery, disaster recovery testing, clear change management, and a culture of psychological safety can positively impact software delivery performance. The research also finds that heavyweight change processes don’t work.

[Figure: performance research model]

The productivity research model shows that organizations can improve engineer productivity by investing in easy-to-use tools and information search, a culture of psychological safety, and by reducing technical debt. Improved productivity also helps drive better employee work/life balance and reduces burnout.

[Figure: productivity research model]

This year’s report revalidates important findings for the sixth year in a row: First, that it’s possible to optimize for stability without sacrificing speed. Second, DevOps delivers value to customers and end users by impacting both commercial and non-commercial goals. 

Thanks to everyone who contributed to the survey. We hope this report helps organizations of all sizes, industries, and regions improve. We look forward to hearing your thoughts and feedback on the report. Here are some ways you can learn more about the 2019 Accelerate State of DevOps Report.

Introducing Spinnaker for Google Cloud Platform—continuous delivery made easy

Development teams want to adopt continuous integration (CI) and continuous delivery (CD), to identify and correct problems early in the development process, and to make the release process safe, low-risk, and quick. However, with CI/CD, developers often spend more time setting up and maintaining end-to-end pipelines and crafting deployment scripts than writing code.

Spinnaker, developed jointly by Google and Netflix, is an open-source multi-cloud continuous delivery platform. Companies such as Box, Cisco, and Samsung use Spinnaker to create fast, safe, repeatable deployments. Today, we are excited to introduce the Spinnaker for Google Cloud Platform solution, which lets you install Spinnaker in Google Cloud Platform (GCP) with a couple of clicks, and start creating pipelines for continuous delivery.

Spinnaker for GCP comes with built-in deployment best practices that can be leveraged whether teams’ resources (source code, artifacts, other build dependencies) are on-premises or in the cloud. Teams get the flexibility of building, testing, and deploying to Google-managed runtimes such as Google Kubernetes Engine (GKE), Google Compute Engine (GCE), or Google App Engine (GAE), as well as other clouds or on-prem deployment targets for hybrid and multi-cloud CD. 

Spinnaker for GCP integrates Spinnaker with other Google Cloud services, allowing you to extend your CI/CD pipeline and integrate security and compliance in the process. For instance, Cloud Build gives you the flexibility to create Docker containers or non-container artifacts.

Likewise, integration with Container Registry vulnerability scanning helps to automatically scan images, and Binary Authorization ensures that you only deploy trusted container images. Then, for monitoring hybrid deployments, you can use Stackdriver to gain visibility into the performance, uptime, and overall health of your application, and of Spinnaker itself.

Google’s Chrome Ops Developer Experience team uses Spinnaker to deploy some of their services:

“Getting a new Spinnaker instance up and running with Spinnaker for GCP was really simple,” says Ola Karlsson, SRE on the Chrome Ops Developer Experience team. “The solution takes care of the details of managing Spinnaker and still gives us the flexibility we need. We’re now using it to manage our production and test Spinnaker installations.” 

Spinnaker for GCP lets you add sample pipelines and applications to Spinnaker that demonstrate best practices for deployments to Kubernetes, VMs and more. DevOps teams can use these as starting points to provide “golden path” deployment pipelines tailored to their company’s requirements.

“We want to make sure that the solution is great both for developers and DevOps or SRE teams,” says Matt Duftler, Tech Lead for Google’s Spinnaker effort. “Developers want to get moving fast with a minimum of overhead. Platform teams can allow them to do that safely by encoding their recommended practices into Spinnaker, using Spinnaker for GCP to get up and running quickly and start onboarding development teams.”


The Spinnaker for GCP advantage
The availability of Spinnaker for GCP gives customers a fast and easy way to set up Spinnaker in a production-ready configuration, optimized for GCP. Some other benefits include: 

  • Secure installation: Spinnaker for GCP supports one-click HTTPS configuration with Cloud Identity-Aware Proxy (IAP), letting you control who can access your Spinnaker installation.
  • Automatic backups: The configuration of your Spinnaker installation is automatically backed up securely, for auditing and fast recovery.
  • Integrated auditing and monitoring: Spinnaker for GCP integrates Spinnaker with Stackdriver for simplified monitoring, troubleshooting and auditing of changes and deployments.
  • Simplified maintenance: Spinnaker for GCP includes many helpers to simplify and automate maintenance of your Spinnaker installations, including configuring Spinnaker to deploy to new GKE clusters and GCE or GAE in other GCP projects.

Existing Spinnaker users can migrate to Spinnaker for GCP today if they’re already using Spinnaker’s Halyard tool to manage their Spinnaker installations.

Production debugging comes to Cloud Source Repositories

Google Cloud has some great tools for software developers. Cloud Source Repositories and Stackdriver Debugger are used daily by thousands of developers who value Cloud Source Repositories’ excellent code search and Debugger’s ability to quickly and safely find errors in production services.

But Debugger isn’t a full-fledged code browser, and isn’t tightly integrated with all the most common developer environments. The good news is that these tools are coming together! Starting today, you can debug your production services directly in Cloud Source Repositories, for every service where Stackdriver Debugger is enabled.


What’s new in Cloud Source Repositories

This integration brings two pieces of functionality to Cloud Source Repositories: support for snapshots and logpoints.

Snapshots
Snapshots are point-in-time images of your application’s local variables and stack that are triggered when a code condition is met. Think of snapshots as breakpoints that don’t halt execution. To create one, simply click on a line number as you would with a traditional debugger, and the snapshot will activate the next time that one of your instances executes the selected line. When this happens, you’ll see the local variables captured during the snapshot and the complete call stack—without halting the application or impacting its state and ongoing operations!

You can navigate and view local variables in this snapshot from each frame in the stack, just as with any other debugger. You also have full access to conditions and expressions, and there are safeguards in place to protect against accidental changes to your application’s state.

Logpoints
Logpoints allow you to dynamically insert log statements into your running services without redeploying them. Each logpoint operates just like a log statement that you write into your code normally: you can add free text, reference variables, and set the conditions for the log to be saved. Logpoints are written to your standard output path, meaning that you can use them with any logging backend, not just Stackdriver Logging.
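To make the format concrete, here is a hedged sketch of what a logpoint can look like when created with the gcloud debug commands rather than the Cloud Source Repositories UI; the file, line number, message, and condition are hypothetical placeholders for a Python service:

    # Hypothetical example: "checkout.py:142", the message, and the condition
    # are placeholders. Text inside {} is evaluated against the running
    # service's variables, and --condition limits logging to matching requests.
    gcloud debug logpoints create checkout.py:142 \
        "Cart for user {user_id} has {len(cart_items)} items" \
        --condition="len(cart_items) > 100"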

Creating a logpoint is a lot like creating a snapshot: simply click on the line number of the line where you wish to set it, and you’re done.

Upon adding a logpoint to your application, it’s pushed out to all instances of the selected service. Logpoints last for 24 hours or until the service is redeployed, whichever comes first. Logpoints have the same performance impact as any other log statement that exists normally in your source code.

Getting started

To use Cloud Source Repositories’ production debugging capabilities, you must first enable your Google Cloud Platform projects for Stackdriver Debugger. You can learn more about these setup steps in the Stackdriver Debugger documentation.
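For example, assuming you use the gcloud CLI, enabling the Stackdriver Debugger API on a project might look like the sketch below; you still need to run the language-specific Debugger agent described in the documentation:

    # Enable the Stackdriver Debugger API for the currently selected project.
    gcloud services enable clouddebugger.googleapis.com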

Once this is complete, navigate to the source code you wish to debug in Cloud Source Repositories, then select ‘Debug application’. Today, this works best with code stored in Cloud Source Repositories or mirrored from supported third-party sources, including GitHub, Bitbucket, and GitLab. Once you’ve selected your application, you can start placing snapshots and logpoints in your code by clicking on the line numbers in the left gutter.


Production debugging done right

Being able to debug code that’s running in production is a critical capability—and being able to do so from a full-featured code browser is even better! Now, by bringing production debugging to Cloud Source Repositories, you can track down hard-to-find problems deep in your code while also continually syncing code from a variety of sources, cross-referencing classes, looking at blame layers, viewing change history, and searching by class name, method name, and more. To learn more, check out this getting started guide.

Russell Wolf, Product Manager, also contributed to this blog post.

GCP DevOps tricks: Create a custom Cloud Shell image that includes Terraform and Helm

If you develop or manage apps on Google Cloud Platform (GCP), you’re probably familiar with Cloud Shell, which provides you with a secure CLI that you can use to manage your environment directly from the browser. But while Cloud Shell’s default image contains most of the tools you could wish for, in some cases you might need more—for example, Terraform for infrastructure provisioning, or Helm, the Kubernetes package manager.

In this blog post, you will learn how to create a custom Docker image for Cloud Shell that includes the Helm client and Terraform. At a high level, this is a two-step process:

  1. Create and publish a Docker image
  2. Configure your custom image to be used in Cloud Shell

Let’s take a closer look. 

1. Create and publish a custom Cloud Shell Docker image

First, you need to create a new Docker image that’s based on the default Cloud Shell image, and then publish that image to Container Registry.

1. Create a new repo and set the project ID where the Docker image should be published:
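For example, a minimal sketch (the working directory name is arbitrary; GCP_PROJECT_ID should be set to the project that will host the image, and it is referenced again in the image URL later in this post):

    # Set the project that will host the custom image.
    export GCP_PROJECT_ID=<your-project-id>
    gcloud config set project $GCP_PROJECT_ID

    # Create a working directory (the repo) to hold the Dockerfile.
    mkdir cloud-shell-image && cd cloud-shell-image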

2.  With your file editor of choice, create a file named Dockerfile with the following content:
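A minimal sketch of such a Dockerfile, assuming the default Cloud Shell base image at gcr.io/cloudshell-images/cloudshell:latest and example Terraform and Helm versions (pin whichever releases your team has validated):

    # Extend the default Cloud Shell image.
    FROM gcr.io/cloudshell-images/cloudshell:latest

    # Example versions; adjust to the releases you need.
    ARG TERRAFORM_VERSION=0.12.24
    ARG HELM_VERSION=3.2.1

    # Install Terraform from the official release archive.
    RUN curl -fsSL -o /tmp/terraform.zip \
          "https://releases.hashicorp.com/terraform/${TERRAFORM_VERSION}/terraform_${TERRAFORM_VERSION}_linux_amd64.zip" && \
        unzip /tmp/terraform.zip -d /usr/local/bin && \
        rm /tmp/terraform.zip

    # Install the Helm client from the official release archive.
    RUN curl -fsSL "https://get.helm.sh/helm-v${HELM_VERSION}-linux-amd64.tar.gz" | tar -xz -C /tmp && \
        mv /tmp/linux-amd64/helm /usr/local/bin/helm && \
        rm -rf /tmp/linux-amd64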

3. Build the Docker image:
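For example, tagging the image with the Container Registry path used in the rest of this post:

    # Build the image and tag it for Container Registry.
    docker build -t gcr.io/$GCP_PROJECT_ID/cloud-shell-image:latest .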

4. Push the Docker image to Container Registry:
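For example (see the note below about configuring Docker authentication for Container Registry first):

    # Push the image so Cloud Shell can pull it from Container Registry.
    docker push gcr.io/$GCP_PROJECT_ID/cloud-shell-image:latest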

Note: You will need to configure Docker to authenticate with Container Registry (gcr.io) by following the steps here.

2. Configure Cloud Shell image to use the published image

Now that you’ve created and published your image, you need to configure your Cloud Shell environment to use the image you published to Container Registry. In the Cloud Console, follow these steps:

  1. Go to Cloud Shell Environment settings
  2. Click Edit
  3. Click “Select image from project”
  4. In the Image URL field enter: gcr.io/$GCP_PROJECT_ID/cloud-shell-image:latest
  5. Click “Save”

Now open a new Cloud Shell session, and you should see that the new custom image is used.
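To confirm the extra tools are available, you can run a quick check in the new session, for example:

    # Both tools should now be on the PATH, baked into the custom image.
    terraform version
    helm version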

There you have it—a way to configure your Cloud Shell environment with all your favorite tools. To learn more about Cloud Shell, check out the documentation.