Run your ML workloads cheaper and faster with the latest GPUs

Running ML workloads more cost effectively

Google Cloud wants to help you run your ML workloads as efficiently as possible. To do this, we offer many options for accelerating ML training and prediction, including many types of NVIDIA GPUs. This flexibility is designed to let you get the right tradeoff between cost and throughput during training or cost and latency for prediction.

We recently reduced the price of NVIDIA T4 GPUs, making AI acceleration even more affordable. In this post, we’ll revisit some of the features of recent generation GPUs, like the NVIDIA T4, V100, and P100. We’ll also touch on native 16-bit (half-precision) arithmetic and Tensor Cores, both of which provide significant performance boosts and cost savings. We’ll show you how to use these features, and how the performance benefit of using 16-bit and automatic mixed precision for training often outweighs the higher list price of NVIDIA’s newer GPUs.

Half-precision (16-bit float)

Half-precision floating point format (FP16) uses 16 bits, compared to 32 bits for single precision (FP32). Storing data in FP16 reduces a neural network’s memory usage, which allows you to train and deploy larger networks, and transfers data faster than FP32 or FP64.

1.png
32-bit Float structure (Source: Wikipedia)
2.png
16-bit Float structure (Source: Wikipedia)

Execution time of ML workloads can be sensitive to memory and/or arithmetic bandwidth. Half-precision halves the number of bytes accessed, reducing the time spent in memory-limited layers. Lowering the required memory lets you train larger models or train with larger mini-batches.

The FP16 format is not new to GPUs. In fact, it has been supported as a storage format for many years on NVIDIA GPUs, and high-performance FP16 is supported at full speed on NVIDIA T4, V100, and P100 GPUs. 16-bit precision is a great option for running inference applications. However, if you’re training a neural network entirely at this precision, the network may not converge to the required accuracy levels without higher-precision result accumulation.

Automatic mixed precision mode in TensorFlow

Mixed precision uses both FP16 and FP32 data types when training a model. Mixed-precision training offers significant computational speedup by performing operations in half-precision format whenever it’s safe to do so, while storing minimal information in single precision to retain as much information as possible in critical parts of the network. Mixed-precision training usually achieves the same accuracy as single-precision training using the same hyper-parameters.

NVIDIA T4 and NVIDIA V100 GPUs incorporate Tensor Cores, which accelerate certain types of FP16 matrix math, enabling faster and easier mixed-precision computation. NVIDIA has also added automatic mixed-precision capabilities to TensorFlow.

To use Tensor Cores, FP32 models need to be converted to use a mix of FP32 and FP16. Performing arithmetic operations in FP16 takes advantage of the performance gains of using lower-precision hardware (such as Tensor Cores). Due to the smaller representable range of float16, though, performing the entire training with FP16 tensors can result in gradient underflow and overflow errors. However, performing only certain arithmetic operations in FP16 results in performance gains when using compatible hardware accelerators, decreasing training time and reducing memory usage, typically without sacrificing model performance.

TensorFlow supports FP16 storage and Tensor Core math. Models that contain convolutions or matrix multiplication using the tf.float16 data type will automatically take advantage of Tensor Core hardware whenever possible.

This process can be configured automatically using automatic mixed precision (AMP). This feature is available on V100 and T4 GPUs, and TensorFlow 1.14 and newer supports AMP natively. Let’s see how to enable it.

Manually: Enable automatic mixed precision via TensorFlow API

Wrap your tf.train or tf.keras.optimizers Optimizer as follows:
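
A minimal sketch, assuming TensorFlow 1.14+ (the optimizer choice here is illustrative):

    import tensorflow as tf

    # Build your optimizer as usual; Adam is just an example.
    opt = tf.train.AdamOptimizer(learning_rate=1e-3)

    # Rewrite the graph so that safe ops run in FP16, with automatic loss scaling.
    opt = tf.train.experimental.enable_mixed_precision_graph_rewrite(opt)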

This change applies automatic loss scaling to your model and enables automatic casting to half precision.

(Note: To enable mixed precision in TensorFlow 2 with Keras, you can use tf.keras.mixed_precision.Policy.)
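
For TensorFlow 2.4 and later, a minimal sketch of the Keras mixed precision API looks like this (earlier 2.x releases expose the same functionality under tf.keras.mixed_precision.experimental):

    from tensorflow.keras import mixed_precision

    # Compute in float16 where safe, while keeping variables in float32.
    mixed_precision.set_global_policy('mixed_float16')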

Automatically: Enable automatic mixed precision via an environment variable

When using the NVIDIA NGC TensorFlow Docker image, simply set one environment variable:
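
For example:

    # Enables the automatic mixed precision graph rewrite globally.
    export TF_ENABLE_AUTO_MIXED_PRECISION=1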

As an alternative, the environment variable can be set inside the TensorFlow Python script:
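
A minimal sketch (set the variable before the TensorFlow graph is built):

    import os

    # Equivalent to exporting the variable in the shell.
    os.environ['TF_ENABLE_AUTO_MIXED_PRECISION'] = '1'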

(Note: For a complete AMP example showing the speed-up on training an image classification task on CIFAR10, check out this notebook.)

Please take a look at the list of models that have been successfully tested with mixed precision.

Configure AI Platform to use accelerators

If you want to start taking advantage of the newer NVIDIA GPUs like the T4, V100, or P100, you need to use the customization options: define a config.yaml file that describes the GPU options you want. The structure of the YAML file represents the Job resource.

The first example shows a configuration file for a training job that uses Compute Engine machine types with a T4 GPU.
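
A sketch of such a file, assuming a single n1-standard-8 master with one T4 attached (adjust the machine type and any worker configuration to your job):

    trainingInput:
      scaleTier: CUSTOM
      # Compute Engine machine type for the master worker.
      masterType: n1-standard-8
      masterConfig:
        acceleratorConfig:
          count: 1
          type: NVIDIA_TESLA_T4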

(Note: For a P100 or V100 GPU, configuration is similar, just replace type with the correct GPU type—NVIDIA_TESLA_P100 or NVIDIA_TESLA_V100.)

Use the gcloud command to submit the job, including a --config argument pointing to your config.yaml file. This example assumes you’ve set up environment variables—indicated by a $ sign followed by capital letters—for the values of some arguments:
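
A sketch of the submit command (JOB_NAME, JOB_DIR, MODULE_NAME, PACKAGE_PATH, and REGION are placeholders for your own values):

    gcloud ai-platform jobs submit training $JOB_NAME \
      --job-dir $JOB_DIR \
      --module-name $MODULE_NAME \
      --package-path $PACKAGE_PATH \
      --region $REGION \
      --config config.yaml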

The following example shows how to submit a job with a similar configuration (using Compute Engine machine types with GPUs attached), but without using a config.yaml file:
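
A sketch using command-line flags instead of a config file (same placeholders as above):

    gcloud ai-platform jobs submit training $JOB_NAME \
      --job-dir $JOB_DIR \
      --module-name $MODULE_NAME \
      --package-path $PACKAGE_PATH \
      --region $REGION \
      --scale-tier custom \
      --master-machine-type n1-standard-8 \
      --master-accelerator count=1,type=nvidia-tesla-t4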

(Note: Please verify you are running the latest Google Cloud SDK to get access to the different machine types.)

Hidden cost of low-priced instances

The conventional practice most organizations follow is to select lower-priced cloud instances to save on per-hour compute cost. However, the performance improvements of newer GPUs can significantly reduce costs for running compute-intensive workloads like AI.

To validate the concept that modern GPUs reduce the total cost of some common training workloads, we trained Google’s Neural Machine Translation (GNMT) model—which is used for applications like real-time language translations—on several GPUs. In this particular example we tested the GNMTv2 model using AI Platform Training using Custom Containers. By simply using modern hardware like a T4, we are able to train the model at 7% of the cost while obtaining the result eight times faster, as shown in the table below. (For details about the setup please take a look at the NVIDIA site.)

price.png
  • Each GPU model was tested using three different runs, and the numbers shown are the averages across those runs.

  • Additional costs for storing data (GNMT input data was stored on GCS) are not included, since they are the same for all tests.

A quick note: when calculating the cost of a training job using Consumed ML units, use the following formula: job cost = Consumed ML units * ML unit cost (the unit cost varies by region).

In this case, to calculate the cost of running the job on the K80, apply the Consumed ML units * $0.49 formula: 465 * $0.49 = $227.85.

The Consumed ML units can be found on your Job details page (see below), and are equivalent to training units with the duration of the job factored in:

3.png

Looking at the specific NVIDIA GPUs, we can get more granular on the performance-price proposition.

  • NVIDIA T4 is well known for its low power consumption and great inference performance for image/video recognition, natural language processing, and recommendation engines, just to name a few use cases. It supports half-precision (16-bit float) and automatic mixed precision for model training, and gives an 8.1x speed boost over the K80 at only 7% of the original cost.

  • NVIDIA P100 introduced half-precision (16-bit float) arithmetic. Using it gives a 7.6x performance boost over K80, at 27% of the original cost.

  • NVIDIA V100 introduced tensor cores that accelerate half-precision and automatic mixed precision. It provides an 18.7x speed boost over K80 at only 15% of the original cost. In terms of time savings, the time to solution (TTS) was reduced from 244 hours (about 10 days) to just 13 hours (an overnight run). 

What about model prediction?

GPUs can also drastically lower latency for online prediction (inference). However, the high availability demands of online prediction often require keeping machines alive 24/7 and provisioning sufficient capacity in case of failures or traffic spikes. This can potentially make low-latency online prediction expensive.

The latest price cuts to T4s, however, make low latency, high availability serving more affordable on the Google Cloud AI Platform. You can deploy your model on a T4 for about the same price as eight vCPUs, but with the low latency and high-throughput of a GPU.

The following example shows how to deploy a TensorFlow model for Prediction using 1 NVIDIA T4 GPU:
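
A sketch of the deployment command (the model name, version name, bucket path, and runtime version are placeholders; GPU-backed machine types for prediction were in beta at the time of writing):

    gcloud beta ai-platform versions create $VERSION_NAME \
      --model $MODEL_NAME \
      --origin gs://$BUCKET/model_dir \
      --runtime-version 1.15 \
      --framework tensorflow \
      --machine-type n1-standard-4 \
      --accelerator count=1,type=nvidia-tesla-t4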

Conclusion

Model training and serving on GPUs has never been more affordable. Price reductions, mixed precision, and Tensor Cores accelerate AI performance for training and prediction when compared to older GPUs such as K80s. As a result, you can complete your workloads much faster, saving both time and money. To leverage these capabilities and reduce your costs, we recommend the following rules of thumb:

  • If your training job is short lived (under 20 minutes), use a T4, since it is the cheapest per hour.

  • If your model is relatively simple (fewer layers, smaller number of parameters, etc.), use a T4, since it is the cheapest per hour.

  • If you want the fastest possible runtime and have enough work to keep the GPU busy, use a V100.

  • To take full advantage of the newer NVIDIA GPUs, use 16-bit precision on the P100, and enable mixed-precision mode when using the T4 or V100.

If you haven’t explored GPUs for model prediction or inference, take a look at our GPUs on Compute Engine page for more details. For more information on getting started, check out our blog post on the topic.



Acknowledgements: Special thanks to the following people who contributed to this post: 
NVIDIA: Alexander Tsado, Cloud Product Marketing Manager
Google: Henry Tappen, Product Manager; Robbie Haertel, Software Engineer; Viesturs Zarins, Software Engineer

1. Price is calculated as described here. Consumed ML Units * Unit Cost (different per region).

New AMD EPYC-based Compute Engine family, now in beta

At Google Cloud, we want you to be able to choose the best VMs for your workloads. Today, we’re excited to announce a new addition to our general purpose VMs: the N2D family, built atop 2nd Gen AMD EPYC™ Processors. 

N2D VMs are a great option for both general-purpose workloads and workloads that require high memory bandwidth.

  • General-purpose workloads that require a balance of compute and memory, like web applications and databases, can benefit from N2D’s performance, price, and features. N2D VMs are designed to provide you with the same features as N2 VMs, including local SSD, custom machine types, and transparent maintenance through live migration, while adding large machine types with up to an industry-leading 224 vCPUs, the largest general-purpose VM on Compute Engine. At the same time, N2D instances provide savings of up to 13% over comparable N-series instances, and up to a 39% performance improvement on the Coremark benchmark compared to comparable N1 instances1.
  • HPC workloads, such as crash analysis, financial modeling, rendering, and reservoir analysis, will benefit from the N2D machine types configured with 128 and 224 vCPUs, which offer up to 70% higher platform memory bandwidth than comparable N1 instances. This, combined with higher core counts, provides over a 100% performance improvement on a variety of representative benchmarks, including Gromacs and NAMD, compared to an n1-standard-96 instance.

N2D machine type details

N2D VMs are now available in beta in us-central1, asia-southeast1, and europe-west4, with more regions on the way! You can launch them on-demand or as preemptible VMs. When you sign up for committed use discounts, you can save up to 55% for three-year commitments versus on-demand pricing. Long-running N2D VMs can take advantage of sustained use discounts, automatically saving up to 20%. You can configure N2D VMs as predefined machine types with vCPU-to-memory ratios of 1:1, 1:4, and 1:8, up to 224 vCPUs, or create custom machine types with N2Ds, helping you meet the needs of diverse workloads.

Get started

It’s easy to get started with N2D VMs—simply visit the Google Cloud Console and launch one! To learn more about N2D VMs or other Compute Engine VM options, check out our machine types and our pricing pages.
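
If you prefer the command line, a minimal sketch looks like this (the instance name and zone are illustrative):

    gcloud compute instances create my-n2d-vm \
      --zone us-central1-a \
      --machine-type n2d-standard-8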


1. N2D-standard-32 performed 39% better than N1-standard-32 when evaluated using Coremark.

How Udacity students succeed with Google Cloud

Editor’s note: Today we hear from Udacity, which uses a variety of Google Cloud technologies for its online learning platform. Read on to learn how they built online workspaces that give students immediate access to fast, isolated compute resources and private data sets. 

At Udacity, we use advanced technologies to teach about technology. One example is our  interactive “Workspaces,” which students use to gain hands-on experience with a variety of advanced topics like artificial intelligence, data science, programming and cloud. These online environments comprise everything from SQL interpreters to coding integrated development environments (IDEs), Jupyter Notebooks and even fully functional 3D graphical desktops—all accessible via an everyday browser.

udacity uLab.gif
Udacity’s latest IDE environment, “uLab,” where “Learning Guides” can demonstrate skills interactively.

To build these Workspaces, we relied heavily on Google Cloud Platform (GCP) in numerous interesting and novel ways. This article details our implementation and where we hope to take it in the future. 

Workspaces design goals

Udacity customers are smart, busy learners from all over the world, who access our courses remotely. To meet their needs, we designed Udacity Workspaces to:

  • Be ready to use in under 15 seconds

  • Offer advanced functionality directly inside the browser-based Udacity Classroom

  • Instantly furnish starter and example files to students in a new Workspace, and automatically save all student work and progress for the next session

  • Provide quick access to large external datasets

  • Function well with any session duration… from two minutes to four hours, or more

  • Provide reliable compute availability and GPU power wherever needed

We chose GCP for its ease of use, reliability, and cost-effectiveness. Let’s see how we used different GCP offerings to meet these goals.

Fast, personalized access to Workspaces 

Students demand immediate access to their Workspaces, but booting up a full GCP host from an image can take a while. That’s OK if a student plans on using their Workspace for an hour, but not if they’re using it for a two-minute Workspace coding challenge.

To address this, we built a custom server management tool (“Nebula”) that maintains pools of ready servers to assign to students immediately. To control costs, the pools are sized by a custom usage-pressure measurement algorithm to be fairly surge ready, but which also reduces the pools to as small as a single instance during idle periods. Pools are maintained in multiple data centers, to maximize access to GPUs.

GCP’s by-the-second pricing and flexible reservations policy served us well here. Given the short usage windows of some student exercises, hourly billing or bulk billing might have proved cost prohibitive.

Having ready-to-go server pools minimizes startup time, but we also needed to place “starter files,” or later on, the student’s own work from a previous session, onto the hosts as quickly as possible. After experimenting with several approaches, we decided to store these files as tarballs in Cloud Storage. We found that we can copy up to 3GB to and from Cloud Storage within our SLA time window, so we set a hard limit of 3GB for student drives.

Every time a student’s session goes idle for half an hour, we deallocate the host, compress and copy the student’s files to Cloud Storage, then delete the host. In this manner, we make time-stamped backups of each session’s files, which students can restore any time they need to (via the Workspaces GUI). An alternative approach could be to leverage Cloud Storage’s version control, which provides access to GCP’s lifecycle controls as well. However, at the time we built the student files storage system, this GCP feature was still in beta, so we opted for a home-grown facility.
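
As a rough sketch of that backup step (the paths, bucket name, and naming scheme here are hypothetical, not our actual tooling):

    # Compress the student's workspace and copy it to Cloud Storage with a timestamp.
    STAMP=$(date +%Y%m%d-%H%M%S)
    tar -czf /tmp/workspace-$STAMP.tar.gz -C /home/student/workspace .
    gsutil cp /tmp/workspace-$STAMP.tar.gz gs://example-workspace-backups/$STUDENT_ID/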

In addition, we take advantage of Cloud Functions to duplicate the student files in a second region to ensure against regional outages. Side note: if we were to build this feature today, we could take advantage of dual-region buckets to automatically save student files in two regions.

Access to large datasets

From time to time, students need to access large datasets, e.g., in our machine learning courses. Rather than writing these datasets on server images, we mount read-only drives to share a single dataset across multiple student hosts. We can update these datasets on new shared drives, and Nebula can point new sessions at these new drives without interrupting existing session mounts. 

To date, we’ve never run into a concurrent read-only mount limit for these drives. However, we do see a need for quick-mount read-write dataset drives. One example could be a large SQL database that a student is expected to learn to modify in bulk. Duplicating a large drive on the fly isn’t feasible, so one approach could be to manage a pool of writeable drive copies to mount just-in-time, or to leverage Google Cloud’s Filestore.

With the Filestore approach, you’d pre-create many copies of data drives in a folder tree, and mount a particular folder on the Filestore to a specific student’s container when access is needed; that copy would then never be assigned to anybody else, and asynchronously deleted/replaced with a fresh, unaltered copy when the student’s work is finished.

Consistent compute power

In a shared environment (e.g., Google Kubernetes Engine), one student’s runaway process could affect the compute performance of another student’s entire container (on the same metal). To avoid that, we decided on a “one-server-per-student” model, where each student gets access to a single Compute Engine VM, running several Docker containers—one container for the student’s server, another for an auto-grading system, and yet another for handling file backups and restores. In addition to providing consistent compute power, this approach also has a security advantage: it allows us to run containers in privileged mode, say, to use specialized tools, without risking a breach beyond the single VM allocated to any one student.

This architecture also ensures that GPU-equipped hosts aren’t shared either, so students benefit from all available performance. This is especially important as students fire up long-running, compute intensive jobs such as performing image recognition. 

As a cost control measure, we meter GPU host usage and display the available remaining GPU time to students, so they can switch their GPUs on and off. This “switching” actually allocates a new host from our pools to the student (either GPU-enabled or not). Because we can do the switch in under 15 seconds, it feels approximately like a toggle switch, but some aspects of the session (such as open files) may be reset (e.g., in an IDE configuration). We encourage students to ration their GPU time and perform simpler tasks such as editing or file management in “CPU mode.”

One of our GPU host configurations provides an in-browser Ubuntu desktop with pass-through NVIDIA K80 GPUs for high-performance compute and graphics. This configuration is heavily employed by our Autonomous Systems students, who run graphics-intensive programs like Gazebo (shown below) and robot environment simulations. You can read more about this configuration here.

udacity gpu.gif

Wanted: flexible and isolated images

This configuration has hit all our goals except for true image flexibility. For our large variety of courses, we require many variations of software installations. Normally, such needs would be satisfied with containers, but the requirement for isolated compute environments eliminates that option.

In the past two years, we’ve empowered hundreds of thousands of Udacity students to advance their careers and learn new skills with powerful learning environments called Workspaces, built on top of GCP. Throughout, GCP has proven itself to be a robust platform and a supportive partner, and we look forward to future product launches on top of Google Cloud. If you’d like to learn more about the solutions we’ve built, feel free to reach out to me on Twitter, @Atlas3650.

Running workloads on dedicated hardware just got better

At Google Cloud, we repeatedly hear how important flexibility, openness, and choice are for your cloud migration and modernization journey. For enterprise customers that require dedicated hardware due to requirements such as performance isolation (for gaming), physical separation (for finance or healthcare), or license compliance (Windows workloads), we’ve improved the flexibility of our sole-tenant nodes to better meet your isolation, security, and compliance needs. 

Sole-tenant nodes already let you mix, match, and right-size different VM shapes on each node, take advantage of live migration for maintenance events, as well as auto-schedule your instances onto a specific node, node group, or group of nodes using node affinity labels. Today, we are excited to announce the availability of three new features on sole-tenant nodes: 

  • Live migration within a fixed node pool for bring your own license (BYOL) (beta)

  • Node group autoscaler (beta)

  • Migrate between sole- and multi-tenant nodes (GA) 

These new features make it easier and more cost-effective to deploy, manage, and run workloads on dedicated Google Cloud hardware.

More refined maintenance controls for Windows BYOL

There are several ways to license Windows workloads to run on Google Cloud: you can purchase on-demand licenses, use License Mobility for Microsoft applications, or bring existing eligible server-bound licenses onto sole-tenant nodes. Sole-tenant nodes let you launch your instances onto physical Compute Engine servers that are dedicated exclusively to your workloads to comply with dedicated hardware requirements. At the same time, sole-tenant nodes also provide visibility into the underlying host hardware and support your license reporting through integration with BigQuery.

Now, sole-tenant nodes offer you extended control over your dedicated machines with a new node group maintenance policy. This setting lets you specify how the instances on your sole-tenant node group behave during host maintenance events. The new ‘Migrate Within Node Group’ maintenance policy setting enables transparent installation of kernel and security updates without VM downtime, while keeping your unique physical core usage to a minimum, so you avoid additional licensing costs while staying compliant with per-core or per-processor licenses.

migrate with node group.gif

Node groups configured with this setting live migrate instances within a fixed pool of sole-tenant nodes (dedicated servers) during host maintenance events. By limiting migrations to that fixed pool of hosts, you are able to dynamically move your virtual machines between already licensed servers and avoid license pollution. It also helps us keep you running on the newest kernel updates for better performance and security, and enables continuous uptime through automatic migrations. Now your server-bound bring-your-own-license workloads can strike a better balance between licensing cost, workload uptime, and platform security.

autoscale.png

In addition to the ‘Migrate Within Node Group’ setting, you can also choose to configure your node group to the ‘Default’ setting, which moves instances to a new host during maintenance events (recommended for workloads without server affinity requirements), or to the ‘Restart In Place’ setting, which terminates the instances and restarts them on the same physical server following host maintenance events.

For more information on node-group maintenance policies, visit the bring your own license documentation.

Node group autoscaler

If you have dynamic capacity requirements, autoscaler for sole-tenant node groups automatically manages your pool of sole-tenant nodes, allowing you to scale your workloads without worrying about independently scaling your node group. Autoscaler for sole-tenant node groups increases the size of your node group when there is insufficient capacity to accommodate a new instance, and automatically decreases the size of a node group when it detects the presence of an empty node. This reduces scheduling overhead, increases resource utilization, and drives down your infrastructure costs.

node group autoscaler.png

Autoscaler allows you to set the minimum and maximum boundaries for your node group size and scales behind the scenes to accommodate your changing workload. For additional flexibility, autoscaling also supports a scale-out (increase-only) mode to support monotonically increasing workloads or workloads whose licenses are tied to physical cores or processors.
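
As a sketch, a node group with autoscaling enabled can be created along these lines (the names, zone, and bounds are illustrative; see the node group autoscaler documentation for the exact flags in your gcloud version):

    gcloud beta compute sole-tenancy node-groups create my-node-group \
      --zone us-central1-a \
      --node-template my-node-template \
      --target-size 2 \
      --autoscaler-mode on \
      --min-nodes 1 \
      --max-nodes 10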

Migrating into sole tenancy

Finally, if you need additional agility for your workloads, you can now move instances into, between, and out of sole-tenant nodes. This allows you to achieve hardware isolation for existing VM instances based on your changing security, compliance, or performance isolation needs. You might want to move an instance into a sole-tenant node for special events like a big online shopping day, game launch, or any moment that requires peak performance and the highest level of control. The example below illustrates the steps for migrating an instance onto a sole-tenant node:
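
One possible sequence with the gcloud CLI (the instance, zone, and node group names are illustrative; the rescheduling documentation linked below has the authoritative flags):

    # Stop the VM before changing where it is scheduled.
    gcloud compute instances stop my-instance --zone us-central1-a

    # Point the VM's scheduling at a sole-tenant node group.
    gcloud compute instances set-scheduling my-instance \
      --zone us-central1-a \
      --node-group my-node-group

    # Start the VM again; it now runs on dedicated hardware.
    gcloud compute instances start my-instance --zone us-central1-a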

For details on rescheduling your instances onto dedicated hardware, see the documentation.

Pricing and availability

Pricing for sole-tenant nodes remains simple: pay only for the nodes you use on a per-second basis, with a one-minute minimum charge.  Sustained use discounts automatically apply, as do any new or existing committed use discounts. Visit the pricing page to learn more about sole-tenant nodes, as well as the regional availability page to find out if they are available in your region.

Windows Server applications, welcome to Google Kubernetes Engine

The promise of Kubernetes is to make container management easy and ubiquitous. Up until recently though, the benefits of Kubernetes were limited to Linux-based applications, preventing enterprise applications running on Windows Server from taking advantage of its agility, speed of deployment and simplified management. 

Last year, the community brought Kubernetes support to Windows Server containers. Building on this, we’re thrilled to announce that you can now run Windows Server containers on Google Kubernetes Engine (GKE). 

GKE, the industry’s first Kubernetes-based container management solution for the public cloud, is top rated by analysts and widely used by customers across a variety of industries. Supporting Windows on GKE is part of our commitment to provide a first-class experience for hosting and modernizing Windows Server-based applications on Google Cloud. To this end, in the past six months we added capabilities such as the ability to bring your own Windows Server licenses (BYOL), virtual displays, and managed services for SQL Server and Active Directory. Volusion and Travix are among the many thousands of customers who have chosen Google Cloud to run and modernize their Windows-based application portfolios.

Bringing Kubernetes’ goodness to Windows Server apps

By running Windows Server apps as containers on Kubernetes, you get many of the benefits that Linux applications have enjoyed for years. Running your Windows Server containers on GKE can also save you on licensing costs, as you can pack many Windows Server containers on each Windows node.

kubernetes windows server app.png
Illustration of Windows Server and Linux containers running side-by-side in the same GKE cluster

In the beta release of Windows Server container support in GKE (version 1.16.4), Windows and Linux containers can run side-by-side in the same cluster. This release also includes several other features aimed at helping you meet the security, scalability, integration and management needs of your Windows Server containers. Some highlights include:

  • Private clusters: a security and privacy feature that allows you to restrict access to a cluster’s nodes and the master from the public internet—your cluster’s nodes can only be accessed from within a trusted Google Virtual Private Cloud (VPC).

  • Node Auto Upgrades: a feature that reduces management overhead and provides ease of use and better security by automatically upgrading GKE nodes on your behalf. Make sure you build your container images using the Docker ‘multi-arch’ feature to avoid any version mismatch issues between the node OS version and the base container image.

  • Regional clusters: an availability and reliability feature that allows you to create a multi-master, highly-available Kubernetes cluster that spreads both the control plane and the nodes across multiple zones in the same region. This provides increased control plane uptime of 99.95% (up from 99.5%), and zero-downtime upgrades.

  • Support for Group Managed Service Accounts (gMSA): gMSA is a type of Active Directory account that provides automatic password management, simplified service principal name (SPN) management, etc. for multiple servers. gMSAs are supported by Google Cloud’s Managed Microsoft Active Directory Service for easier administration.

  • Choice of Microsoft Long-Term Servicing Channel (LTSC) or Semi-Annual Channel (SAC) servicing channels, allowing you to choose the version that best fits your support and feature requirements. 

For full details on each of these features and more, please consult the documentation.
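
As a quick sketch, adding a Windows Server node pool to a cluster looks roughly like this (the names, zone, and node counts are illustrative; check the documentation for the supported image types and minimum cluster versions):

    # Windows node pools require a VPC-native (IP alias) cluster.
    gcloud container clusters create my-cluster \
      --zone us-central1-a \
      --enable-ip-alias \
      --num-nodes 2

    # Add a Windows Server node pool alongside the default Linux node pool.
    gcloud container node-pools create windows-pool \
      --cluster my-cluster \
      --zone us-central1-a \
      --image-type WINDOWS_LTSC \
      --machine-type n1-standard-4 \
      --num-nodes 2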

With Windows Server 2008 and 2008 R2 reaching End of Support recently, you may be exploring ways to upgrade your legacy applications. This may be an opportune time to consider containerizing your applications and deploying them in GKE. In general, good candidates for containerization include custom-built .NET applications as well as batch and web applications. For applications provided by third-party ISVs, please consult the ISV for containerized versions of the applications.  

What customers are saying

We’ve been piloting Windows Server container support in GKE for several months now with preview customers, who have been impressed by GKE’s performance, reliability and security, as well as differentiated features such as automated setup and configuration for easier cluster management. 

Helix RE creates software that makes digital models of buildings, and recently switched from setting up and running Windows Kubernetes clusters manually to using GKE. Here’s what they had to say: 

“What used to take us weeks to set up and configure, now takes a few minutes. Besides saving time, features like autoscaling, high-availability, Stackdriver Logging and Monitoring are already baked in. Windows in GKE gives us the same scale, reliability, and ease of management that we have come to expect from running Linux in GKE.” -Premkumar Masilamani, Cloud Architect, Helix RE

Making it easier with partner solutions

Modernizing your applications means more than just deploying and managing containers. That is why we are working with several partners who can help you build, integrate and deploy Windows Server containers into GKE, for a seamless CI/CD and container management experience. We’re excited to announce that the following partners have already worked to integrate their solutions with Windows on GKE.

CircleCI 
CircleCI allows teams to rapidly release code they trust by automating the build, test, and delivery process. CircleCI ‘orbs’ bundle CircleCI configuration into reusable packages. They make it easy to integrate with modern tools, eliminating the need for teams to spend time and cycles building the integrations themselves. 

“We are excited to further our partnership with Google with our latest Google Kubernetes Engine (GKE) Orb. This orb supports deployment to Windows containers running on GKE, and allows users to automate deploys in minutes directly from their CI/CD pipeline. By simplifying the process of automating deploys, teams can build confidence in their process, ship new features faster, and take advantage of cutting-edge technology without having to overhaul their existing infrastructure.”  -Tom Trahan, VP of Business Development, CircleCI

CloudBees 
CloudBees enables enterprise developer teams to accelerate software delivery with continuous integration and continuous delivery (CI/CD). The CloudBees solutions optimize delivery of high quality applications while ensuring they are secure and compliant.

“We are pleased to offer support for Windows containers on Google Cloud Platform. This announcement broadens the options for CloudBees users to now run Microsoft workloads on GCP. It’s all about speeding up software delivery time and, with CloudBees running Windows containers on GCP, our users can enjoy a fast, modernized experience, leveraging the Microsoft technologies already pervasive within their organization.”  -Francois Dechery, Chief Strategy Officer, CloudBees 

GitLab 
GitLab is a complete DevOps platform, delivered as a single application, with the goal of fundamentally changing the way Development, Security, and Ops teams collaborate.

“GitLab and Google Cloud are lowering the barrier of adoption for DevOps and Kubernetes within the Windows developer community. Within minutes, developers can create a project, provision a GKE cluster, and execute a CI/CD pipeline with Windows Runners now on GitLab.com or with GitLab Self-managed to automatically deploy Windows apps onto Kubernetes.” -Darren Eastman, Senior Product Manager, GitLab

Check out GitLab’s blog and video to learn more.

Get started today

We hope that you will take your Windows Server containers for a spin on GKE—to get started, you can find detailed documentation on our website. If you are new to GKE, check out the Google Kubernetes Engine page and the Coursera course on Architecting with GKE.

Please don’t hesitate to reach out to us at [email protected]. And please take a few minutes to give us your feedback and ideas to help us shape upcoming releases.

Cheaper Cloud AI deployments with NVIDIA T4 GPU price cut

Google Cloud offers a wide range of GPUs to accelerate everything from AI deployment to 3D visualization. These use cases are now even more affordable with the price reduction of the NVIDIA T4 GPU. As of early January, we’ve reduced T4 prices by more than 60%, making it the lowest-cost GPU instance on Google Cloud.

Hourly Pricing Per T4 GPU.png
Prices above are for us-central1 and vary by region. A full GPU pricing table is here.

Locations and configurations

Google Cloud was the first major cloud provider to launch the T4 GPU and offer it globally (in eight regions). This worldwide footprint, combined with the performance of the T4 Tensor Cores, opens up more possibilities to our customers. Since our global rollout, T4 performance has improved. The T4 and V100 GPUs now boast networking speeds of up to 100 Gbps, in beta, with additional regions coming online in the future. 

These GPU instances are also flexible enough to suit different workloads. The T4 GPUs can be attached to our n1 machine types that support custom VM shapes. This means you can create a VM tailored specifically to meet your needs, whether it’s a low-cost option like one vCPU, one GB of memory, and one T4 GPU, or as high performance as 96 vCPUs, 624 GB of memory, and four T4 GPUs—and almost anything in between. This is helpful for machine learning (ML), since you may want to adjust your vCPU count based on your pre-processing needs. For visualization, you can create VM shapes for lower-end solutions all the way up to powerful, cloud-based professional workstations.
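
As a sketch, attaching a T4 to a custom-shaped VM looks like this (the name, zone, and shape are illustrative; GPU instances must use a TERMINATE maintenance policy):

    gcloud compute instances create my-t4-vm \
      --zone us-central1-a \
      --custom-cpu 1 \
      --custom-memory 1GB \
      --accelerator count=1,type=nvidia-tesla-t4 \
      --maintenance-policy TERMINATE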

Machine Learning

With mixed precision support and 16 GB of memory, the T4 is also a great option for ML workloads. For example, Compute Engine preemptible VMs work well for batch ML inference workloads, offering lower cost compute in exchange for variable capacity availability. We previously shared sample T4 GPU performance numbers for ML inference of up to 4,267 images per second (ResNet-50, batch size 128, precision INT8). At that rate, you can perform roughly 15 million image predictions in an hour (4,267 images/sec * 3,600 seconds ≈ 15.4 million) for a $0.11 add-on cost for a single T4 GPU with your n1 VM.

Google Cloud offers several options to access these GPUs. One of the simplest ways to get started is through Deep Learning VM Images for AI Platform and Compute Engine, and Deep Learning Containers for Google Kubernetes Engine (GKE). These are configured for software compatibility and performance, and come pre-packaged with your favorite ML frameworks, including PyTorch and TensorFlow Enterprise.

We’re committed to making GPU acceleration more accessible, whatever your budget and performance requirements may be. With the reduced cost of NVIDIA T4 instances, we now have a broad selection of accelerators for a multitude of workloads, performance levels, and price points. Check out the full pricing table and regional availability and try the NVIDIA T4 GPU for your workload today.

Connect to your VPC and managed Redis from App Engine and Cloud Functions

Do you wish you could access resources in your Virtual Private Cloud (VPC) with serverless applications running on App Engine or Cloud Functions? Now you can, with the new Serverless VPC Access service.

Available now, Serverless VPC Access lets you access virtual machines, Cloud Memorystore Redis instances, and other VPC resources from both Cloud Functions and the App Engine standard environment, with support for Cloud Run coming soon.

How it works

App Engine and Cloud Functions services exist on a different logical network from Compute Engine, where VPCs run. Under the covers, Serverless VPC Access connectors bridge these networks. These resources are fully managed by Google Cloud, requiring no management on your part. The connectors also provide complete customer and project-level isolation for consistent bandwidth and security. 

Serverless VPC Access connectors allow you to choose a minimum and maximum bandwidth for the connection, ranging from 200–1,000 Mbps. The capacity of the connector is scaled to meet the needs of your service, up to the maximum configured (please note that you can obtain higher maximum throughput if you need by reaching out to your account representative).

While Serverless VPC Access allows connections to resources in a VPC, it does not place your App Engine service or Cloud Functions inside the VPC. You should still shield App Engine services from public internet access via firewall rules, and secure Cloud Functions via IAM. Also note that a Serverless VPC Access connector can only operate with a single VPC network; support for Shared VPCs is coming in 2020.

You can, however, share a single connector between multiple apps and functions, provided that they are in the same region and that the Serverless VPC Access connector was created in the same region as the apps or functions that use it.

Using Serverless VPC Access

You can provision and use a Serverless VPC Access connector alongside an existing VPC network by using the Cloud SDK command line. Here’s how to enable it with an existing VPC network:
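
A minimal sketch (the connector name, region, and IP range are illustrative):

    # Enable the Serverless VPC Access API once per project.
    gcloud services enable vpcaccess.googleapis.com

    # Create a connector in the same region as your app or function.
    gcloud compute networks vpc-access connectors create my-connector \
      --network default \
      --region us-central1 \
      --range 10.8.0.0/28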

Then, for App Engine, modify your app.yaml and redeploy your application:
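
For example (PROJECT_ID, REGION, and the connector name are placeholders):

    vpc_access_connector:
      name: projects/PROJECT_ID/locations/REGION/connectors/my-connector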

To use Serverless VPC Access with Cloud Functions, first set the appropriate permissions, then redeploy the function with the --vpc-connector flag:
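
A sketch for an HTTP-triggered function (the function name, runtime, and region are illustrative):

    gcloud functions deploy my-function \
      --runtime python37 \
      --trigger-http \
      --region us-central1 \
      --vpc-connector projects/PROJECT_ID/locations/us-central1/connectors/my-connector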

Once you’ve created and configured a VPC connector for an app or function, you can access VMs and Redis instances via their private network IP address (e.g. 10.0.0.123). 

Get started

Serverless VPC Access is currently available in Iowa, South Carolina, Belgium, London, and Tokyo, with more regions in the works. To learn more about using Serverless VPC Access connectors, check out the documentation and the usage guides for Cloud Functions and App Engine.

Introducing E2, new cost-optimized general purpose VMs for Google Compute Engine

General-purpose virtual machines are the workhorses of cloud applications. Today, we’re excited to announce our E2 family of VMs for Google Compute Engine featuring dynamic resource management to deliver reliable and sustained performance, flexible configurations, and the best total cost of ownership of any of our VMs.  

Now in beta, E2 VMs offer similar performance to comparable N1 configurations, providing:

  • Lower TCO: 31% savings compared to N1, offering the lowest total cost of ownership of any VM in Google Cloud.

  • Consistent performance: Your VMs get reliable and sustained performance at a consistent low price point. Unlike comparable options from other cloud providers, E2 VMs can sustain high CPU load without artificial throttling or complicated pricing. 

  • Flexibility: You can tailor your E2 instance with up to 16 vCPUs and 128 GB of memory. At the same time, you only pay for the resources that you need, with 15 new predefined configurations or the ability to use custom machine types.

Since E2 VMs are based on industry-standard x86 chips from Intel and AMD, you don’t need to change your code or recompile to take advantage of this price-performance. 

E2 VMs are a great fit for a broad range of workloads including web servers, business-critical applications, small-to-medium sized databases and development environments. If you have workloads that run well on N1, but don’t require large instance sizes, GPUs or local SSD, consider moving them to E2. For all but the most demanding workloads, we expect E2 to deliver similar performance to N1, at a significantly lower cost. 

Dynamic resource management

Using resource balancing technologies developed for Google’s own latency-critical services, E2 VMs make better use of hardware resources to drive costs down and pass the savings on to you. E2 VMs place an emphasis on performance and protect your workloads from the type of issues associated with resource-sharing thanks to our custom-built CPU scheduler and performance-aware live migration.

You can learn more about how dynamic resource management works by reading the technical blog on E2 VMs.

E2 machine types

At launch, we’re offering E2 machine types as custom VM shapes or as predefined configurations in standard, highmem, and highcpu variants with 2 to 16 vCPUs.

We’re also introducing new shared-core instances, similar to our popular f1-micro and g1-small machine types. These are a great fit for smaller workloads like micro-services or development environments that don’t require the full vCPU.

E2 VMs can be launched on-demand or as preemptible VMs. They are also eligible for committed use discounts, bringing additional savings of up to 55% for three-year commitments. E2 VMs are powered by Intel Xeon and AMD EPYC processors, which are selected automatically based on availability.
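
As a sketch, launching a predefined or custom E2 shape from the command line looks like this (the names, zone, and sizes are illustrative):

    # Predefined shape:
    gcloud compute instances create my-e2-vm \
      --zone us-central1-a \
      --machine-type e2-standard-2

    # Custom shape:
    gcloud compute instances create my-custom-e2-vm \
      --zone us-central1-a \
      --custom-vm-type e2 \
      --custom-cpu 2 \
      --custom-memory 4GB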

Get started

E2 VMs are rolling out this week to eight regions: Iowa, South Carolina, Oregon, Northern Virginia, Belgium, Netherlands, Taiwan and Singapore; with more regions in the works. To learn more about E2 VMs or other GCE VM options, check out our machine types page and our pricing page.

Performance-driven dynamic resource management in E2 VMs

Editor’s note: This is the second post in a two-post series. Click here for part 1: E2 introduction.

As one of the most avid users of compute in the world, Google has invested heavily in making compute infrastructure that is cost effective, reliable and performant. The new E2 VMs are the result of innovations Google developed to run its latency-sensitive, user-facing services efficiently. In this post, we dive into the technologies that enable E2 VMs to meet rigorous performance, security, and reliability requirements while also reducing costs.

In particular, the consistent performance delivered by E2 VMs is enabled by:

  • An evolution toward large, efficient physical servers

  • Intelligent VM placement

  • Performance-aware live migration

  • A new hypervisor CPU scheduler

Together we call these technologies dynamic resource management. Just as Google’s Search, Ads, YouTube, and Maps services benefited from earlier versions of this technology, we believe Google Cloud customers will find the value, performance, and flexibility offered by E2 VMs improves the vast majority of their workloads.

Introducing dynamic resource management

Behind the scenes, Google’s hypervisor dynamically maps E2 virtual CPU and memory to physical CPU and memory on demand. This dynamic management drives cost efficiency in E2 VMs by making better use of the physical resources.

Concretely, virtual CPUs (vCPUs) are implemented as threads that are scheduled to run on demand like any other thread on the host—when the vCPU has work to do, it is assigned an available physical CPU on which to run until it goes to sleep again. Similarly, virtual RAM is mapped to physical host pages via page tables that are populated when a guest-physical page is first accessed. This mapping remains fixed until the VM indicates that a guest-physical page is no longer needed.

The image below shows vCPU work coming and going over the span of a single millisecond. Empty space indicates a given CPU is free to run any vCPU that needs it.

A trace of 1 millisecond of CPU scheduler execution.png
A trace of 1 millisecond of CPU scheduler execution. Each row represents a CPU over time and each blue bar represents a vCPU running for a time span. Empty regions indicate the CPU is available to run the next vCPU that needs it.

Notice two things: there is a lot of empty space, but few physical CPUs are continuously empty. Our goal is to better utilize this empty space by scheduling VMs to machines and scheduling vCPU threads to physical CPUs such that wait time is minimized. In most cases, we are able to do this extremely well. As a result, we can run more VMs on fewer servers, allowing us to offer E2 VMs for significantly less than other VM types.

For most workloads, the majority of which are only moderately performance sensitive, E2 performance is almost indistinguishable from that of traditional VMs. Where dynamic resource management can differ in performance is in the long tail—the worst 1% or 0.1% of events. For example, a web serving application might see marginally increased response times once per 1,000 requests. For the vast majority of applications, including Google’s own latency-sensitive services, this difference is lost in the noise of other performance variations such as Java garbage collection events, I/O latencies and thread synchronization.

The reason behind the difference in tail performance is statistical. Under dynamic resource management, virtual resources only consume physical resources when they are in use, enabling the host to accommodate more virtual resources than it could otherwise. However, occasionally, resource assignment needs to wait several microseconds for a physical resource to become free. This wait time can be monitored in Stackdriver and in guest programs like vmstat and top. We closely track this metric and optimize it in four ways that we detail below.

1. An evolution toward large, efficient physical servers

Over the past decade, core counts and RAM density have steadily increased, such that our servers now have far more resources than any individual E2 VM. For example, Google Cloud servers can have over 200 hardware threads available to serve vCPUs, yet an E2 VM has at most 16 vCPUs. This ensures that a single VM cannot cause an unmanageable increase in load.

We continually benchmark new hardware and look for platforms that are cost-effective and perform well for the widest variety of cloud workloads and services. The best ones become the “machines of the day” and we deploy them broadly. E2 VMs automatically take advantage of these continual improvements by flexibly scheduling across the zone’s available CPU platforms. As hardware upgrades land, we live-migrate E2 VMs to newer and faster hardware, allowing you to automatically take advantage of these new resources.

2. Intelligent VM placement

Google’s cluster management system, Borg, has a decade of experience scheduling billions of diverse compute tasks across diverse hardware, from TensorFlow training jobs to Search front- and back-ends. Scheduling a VM begins by understanding the resource requirements of the VM based on static creation-time characteristics.

By observing the CPU, RAM, memory bandwidth, and other resource demands of VMs running on a physical server, Borg is able to predict how a newly added VM will perform on that server. It then searches across thousands of servers to find the best location to add a VM.

These observations ensure that when a new VM is placed, it is compatible with its neighbors and unlikely to experience interference from those instances.

3. Performance-aware live migration

After VMs are placed on a host, we continuously monitor VM performance and wait times so that if the resource demands of the VMs increase, we can use live migration to transparently shift E2 load to other hosts in the data center.

The policy is guided by a predictive approach that gives us time to shift load, often before any wait time is encountered.

VM live migration is a tried-and-true part of Compute Engine that we introduced six years ago. Over time, its performance has continually improved to the point where its impact on most workloads is negligible.

4. A new hypervisor CPU scheduler

In order to meet E2 VMs’ performance goals, we built a custom CPU scheduler with significantly better latency guarantees and co-scheduling behavior than Linux’s default scheduler. It was purpose-built not just to improve scheduling latency, but also to handle hyperthreading vulnerabilities such as L1TF that we disclosed last year, and to eliminate much of the overhead associated with other vulnerability mitigations. The graph below shows how TCP-RR benchmark performance improves under the new scheduler.

netperf tcp request response.png

The new scheduler provides sub-microsecond average wake-up latencies and extremely fast context switching. This means that, with the exception of microsecond-sensitive workloads like high-frequency trading or gaming, the overhead of dynamic resource management is negligible for nearly all workloads.

Get started

E2 VMs were designed to provide sustained performance and the lowest TCO of any VM family in Google Cloud. Together, our unique approach to fleet management, live-migration at scale, and E2’s custom CPU scheduler work behind the scenes to help you maximize your infrastructure investments and lower costs.

E2 complements the other VM families we announced earlier this year—general-purpose (N2) and compute-optimized (C2) VMs. If your applications require high CPU performance for use-cases like gaming, HPC or single-threaded applications, these VM types offer great per-core performance and larger machine sizes.

Delivering performant and cost-efficient compute is our bread and butter. The E2 machine types are now in beta. If you’re ready to get started, check out the E2 docs page and try them out for yourself!

App Engine Java 11 is GA—deploy a JAR, scale it, all fully managed

Attention, Java developers. If you want to build modern Java backends, use modern frameworks, or use the latest language features of Java 11, know that you can now deploy and scale your Java 11 apps in App Engine with ease. 

We’re happy to announce that the App Engine standard environment Java 11 runtime is now generally available, giving you the flexibility to run any Java 11 application, web framework, or service in a fully managed serverless environment. 

Modern, unrestricted, managed
With the App Engine standard environment Java 11 runtime, you are in control of what you want to use to develop your application. You can use your favorite framework, such as Spring Boot, Micronaut, Quarkus, Ktor, or Vert.x. In fact, you can use pretty much any Java application that serves web requests specified by the $PORT environment variable (typically 8080). You can also use any JVM language, be it Apache Groovy, Kotlin, Scala, etc.

With no additional work, you also get the benefits of the fully managed App Engine serverless platform. App Engine can transparently scale your application up to handle traffic spikes, and also scale it back down to zero when there’s no traffic. App Engine automatically updates your runtime environment with latest security patches to the operating system and the JDK, so you don’t have to spend time provisioning or managing servers, load balancer, or even any infrastructure at all!

You also get traffic splitting, request tracing, monitoring, centralized logging, and production debugger capabilities out of the box.

In addition, if you can start your Java 11 application locally with java -jar app.jar, then you can run it on App Engine standard environment Java 11 runtime, with all the benefits of a managed serverless environment.

Finally, the App Engine standard environment Java 11 runtime comes with twice the memory of the earlier Java 8 runtime, at no additional cost. Below is a table outlining the memory limit for each instance class.

memory limits.png

Getting started with a Spring Boot application
At beta, we showed you how to get started with a simple hello world example. Now, let’s take a look at how to start up a new Spring Boot application.

In this example, you’ll deploy a Spring Boot application packaged as a JAR file to the App Engine standard environment for Java 11. The runtime can now deploy a JAR file using the gcloud command line, or the Maven and Gradle plugins.

To start up a new Spring Boot application, all you need is a GCP project and the latest gcloud CLI installed locally. Then, follow these steps:

1. Create a new Spring Boot application from the Spring Boot Initializr with the Web dependency and unzip the generated archive. Or, simply use this command line:
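
For example, using the Spring Initializr REST API (the project name is illustrative):

    curl https://start.spring.io/starter.zip \
      -d dependencies=web \
      -d javaVersion=11 \
      -d name=demo \
      -o demo.zip
    unzip demo.zip -d demo && cd demo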

2. Add a new REST Controller that returns “Hello App Engine!”:

src/main/java/com/example/demo/HelloController.java
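
A minimal version of that controller might look like this:

    package com.example.demo;

    import org.springframework.web.bind.annotation.GetMapping;
    import org.springframework.web.bind.annotation.RestController;

    @RestController
    public class HelloController {

      // Respond to GET / with a plain-text greeting.
      @GetMapping("/")
      public String hello() {
        return "Hello App Engine!";
      }
    }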

3. Build the application JAR:
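
With the Maven wrapper generated by the Initializr:

    ./mvnw package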

4. Deploy it using gcloud CLI:
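
For example (the JAR name depends on your artifact ID and version):

    gcloud app deploy target/demo-0.0.1-SNAPSHOT.jar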

Once the deployment is complete, browse over to https://[PROJECT-ID].appspot.com to test it out (or simply run gcloud app browse). Your application will use the default app.yaml configuration, on an F1 instance class.

To customize your runtime options, such as running with more memory and CPU power, configuring an environment variable, or changing a Java command-line flag, add an app.yaml file:

src/main/appengine/app.yaml
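
A sketch of such a file (the instance class, environment variable, and commented entrypoint are illustrative):

    runtime: java11
    # Run on a larger instance class than the default F1.
    instance_class: F4
    # Optional environment variables visible to the application.
    env_variables:
      SPRING_PROFILES_ACTIVE: "prod"
    # Optionally set a custom entrypoint to pass JVM flags, e.g.:
    # entrypoint: java -Xmx64m -jar demo.jar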

Then, you can deploy an application using either a Maven or Gradle plugin:
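
Assuming the App Engine plugin is configured in your build (see the note below):

    # Maven:
    ./mvnw package appengine:deploy

    # Gradle:
    ./gradlew appengineDeploy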

Note: You can also configure the plugin directly into Maven’s pom.xml, or in Gradle’s build script.

Finally, you can also deploy a pre-built JAR with an app.yaml configuration using the gcloud CLI tool. First create an empty directory and place both the JAR file and app.yaml in that directory so the directory content looks like this:
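
For example, with a JAR named demo.jar (your file name will differ):

    .
    ├── app.yaml
    └── demo.jar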

Then, from that directory, simply run:
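
For example:

    gcloud app deploy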

Try it out!
Read the App Engine Standard Java 11 runtime documentation to learn more. Try it with your favorite frameworks with samples in the GCP Java Samples Github Repository. If you have an existing App Engine Java 8 application, read the migration guide to move it to App Engine Java 11. Finally, don’t forget you can take advantage of the App Engine free tier while you experiment with our platform.

From the App Engine Java 11 team: Ludovic Champenois, Eamonn McManus, Ray Tsang, Guillaume Laforge, Averi Kitsch, Lawrence Latif, and Angela Funk.

Optimize your Google Cloud environment with new AI-based recommenders

You want your Google Cloud environment to be as unique as your organization, configured for optimal security, cost and efficiency. We are excited to offer new recommenders for Google Cloud Platform (GCP) in beta, which automatically suggest ways to make your cloud deployment more secure and cost-effective, with maximum performance.

Now in beta, the recommender family includes the Identity and Access Management (IAM) Recommender and the Compute Engine Rightsizing Recommender, with more to come. 

With IAM Recommender, you can automatically detect overly permissive access policies and receive suggested adjustments to them based on the access patterns of similar users in your organization.

The Compute Engine Rightsizing Recommender helps you choose the optimal virtual machine size for your workload. You can use this recommender to help avoid provisioning machines that are too small or too large. 

How recommenders work

Our recommenders use analytics and machine learning to automatically analyze your usage patterns and to determine if your Google Cloud resources and policies are optimally configured. For example, the Compute Engine Rightsizing Recommender analyzes CPU and memory utilization over the previous eight days to identify the optimal machine type for the workload.

IAM recommendations are generated by analyzing the IAM permissions for each customer individually to create an overall model to recommend more secure IAM policies. The recommendations are custom tailored to your environment. For example, if a set of permissions hasn’t been used in 90 days, the IAM Recommender may recommend that you apply a less permissive role. 

Access recommendations today

You can check out your IAM recommendations today by visiting the IAM page in the Cloud Console and viewing the policy bindings that can be optimized. You can learn more about how to access the IAM Recommender through the API by looking at the IAM Recommender documentation.
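
For example, recommendations can also be listed with the gcloud CLI (the command may sit under gcloud beta depending on your SDK version; PROJECT_ID is a placeholder):

    gcloud recommender recommendations list \
      --project PROJECT_ID \
      --location global \
      --recommender google.iam.policy.Recommender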

The Compute Engine Rightsizing Recommender is available within the Compute Engine page in the Cloud Console, where you can see which VMs can be optimized. Most notably, with this beta you can access recommendations programmatically through an API. To learn more, take a look at the VM Rightsizing Recommender documentation.

You can opt-out of these recommendations by going to the Recommendation section in the Security & Privacy navigation panel from the Cloud Console.

Virtual display devices for Compute Engine now GA

Today, we’re excited to announce the general availability (GA) of virtual display devices for Compute Engine virtual machines (VMs), letting you add a virtual display device to any VM on Google Cloud. This gives your VM Video Graphics Array (VGA) capabilities without having to use GPUs, which can be powerful but also expensive. 

Many solutions such as system management tools, remote desktop software, and graphical applications require you to connect to a display device on a remote server. Compute Engine virtual displays allow you to add a virtual display to a VM at startup, as well as to existing, running VMs. For Windows VMs, the drivers are already included in the Windows public images; and for Linux VMs, this feature works with the default VGA driver. Plus, this feature is offered at no extra cost. 
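
As a sketch, a virtual display device can be requested when creating a VM, or added to an existing one (the instance names and zone are illustrative):

    # Create a VM with a virtual display device attached.
    gcloud compute instances create my-vm \
      --zone us-central1-a \
      --enable-display-device

    # Add a virtual display device to an existing VM.
    gcloud compute instances update my-existing-vm \
      --zone us-central1-a \
      --enable-display-device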

We’ve been hard at work with partners Itopia, Nutanix, Teradici and others to help them integrate their remote desktop solutions with Compute Engine virtual displays to allow our mutual customers to leverage Google Cloud Platform (GCP) for their remote desktop and management needs. 

Customers such as Forthright Technology Partners and PALFINGER Structural Inspection GmbH (StrucInspect) are already benefiting from partner solutions enabled by virtual display devices. 

“We needed a cloud provider that could effectively support both our 3D modelling and our artificial intelligence requirements with remote workstations,” said Michael Diener, Engineering Manager for StrucInspect. “Google Cloud was well able to handle both of these applications, and with Teradici Cloud Access Software, our modelling teams saw a vast improvement in virtual workstation performance over our previous solution. The expansion of GCP virtual display devices to support a wider range of use cases and operating systems is a welcome development that ensures customers like us can continue to use any application required for our client projects.”

Our partners are equally excited about the general availability of virtual display devices.

“We’re excited that the GCP Virtual Display feature is now GA because it enables our mutual customers to quickly leverage Itopia CAS with Google Cloud to power their Virtual Desktop Infrastructure (VDI) initiatives,” said Jonathan Lieberman, itopia Co-Founder & CEO.

“With the new Virtual Display feature, our customers get a much wider variety of cost-effective virtual machines (versus GPU VMs) to choose from in GCP,” said Carsten Puls, Sr. Director, Frame at Nutanix. “The feature is now available to our joint customers worldwide in our Early Access of Xi Frame for GCP.”

Now that virtual display devices are GA, we welcome you to start using the feature in your production environment. For simple steps on how you can use a virtual display device when you create a VM instance or add it to a running VM, please refer to the documentation.