BigQuery under the hood: How zone assignments work

BigQuery is a serverless, highly scalable, and cost-effective cloud data warehouse, designed to be flexible and easy to use. Many interesting design decisions went into building BigQuery, and in this post we'll dive into how zone assignments work. Like other Google Cloud services, BigQuery takes advantage of our global cloud regions to make sure your data is available when you need it.

One key aspect of BigQuery’s architecture is that it is multi-tenant; it runs workloads from different customers on compute and storage infrastructure that does not require any customer involvement in capacity planning. When loading data into BigQuery, customers choose a region to load the data and, optionally, purchase a compute reservation. Then, the service takes care of provisioning.

Zone assignments keep your data flowing

Each BigQuery region is internally deployed in multiple availability zones. Customer data is replicated between these zones, and there is fast automatic failover to the secondary zone if the primary zone is experiencing issues. The failover is designed to be transparent to customers and have no downtime. We’re always expanding our capacity footprint to support customer growth and onboard new customers. To make sure that each customer gets to think of storage as infinite, and gets sufficient compute resources to load and analyze their data, we continuously recompute the best placement for the primary and secondary zones in the region.

To ensure the best primary and secondary zone assignments, the assignment algorithm takes into account the storage and compute usage of each customer and the available capacity in each zone. Then it makes sure that the usage will fit in the currently assigned zones. If it doesn’t, it finds another suitable zone for that customer and orchestrates a move from their current zone to a new zone. All of this happens in the background without causing any disruptions to your workload.

Any datasets that share the same region can be joined together in a single query. To ensure good query performance, we attempt to colocate compute and storage so that I/O stays within a single zone and can take advantage of high-throughput networking there. I/O bandwidth within a zone is very high (Google's Jupiter network fabric can sustain more than 1 petabit/second of total bisection bandwidth), but network capacity between zones is much more constrained. Our assignment algorithm makes sure that Google Cloud projects within the same Google Cloud organization are assigned to the same subset of zones in every region. To support very large organizations, we compute project cliques based on cross-project query patterns within the organization; this breaks the organization into more manageable chunks that can be placed separately. To handle cross-org reads, the algorithm also looks at past query patterns to discover relationships between organizations and makes an effort to keep at least one common zone between related orgs. The query engine can also handle small amounts of non-colocated data by reading it remotely or copying it to the compute zone before running the query. In rare cases, when the algorithm cannot ensure a common zone or a new cross-org query pattern appears, queries that read large amounts of data may fail.

Best practices for organizing and moving your data

  • To get the best performance for workloads that read/write data in datasets belonging to different projects, ensure that the projects are in the same Google Cloud org.

  • If you want to make your data available to other BigQuery users in your Google Cloud organization, you can use IAM permissions to grant access. 

  • If you wish to share data with BigQuery users outside your organization, use Table Copy to move data to the target project; they can then do any subsequent analysis in that project. Table Copy runs as asynchronous replication and supports cross-zone data moves (see the sketch after this list). 

  • You can move data across regions using the dataset copy feature.
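As a rough sketch, a cross-project table copy can be issued with the bq command-line tool; the project, dataset, and table names below are placeholders:

    bq cp mydataset.mytable other-project:shared_dataset.mytable

The copy runs as an asynchronous job, so BigQuery can move the data across zones in the background.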

Learn more about regions and zones in Google Cloud services, and find more details on how Compute Engine handles zones.

10 top tips: Unleash your BigQuery superpowers

Lots of us are already tech heroes by day. If you know SQL, for example, you're a hero. You have the power to transform data into insights. You can save the day when someone in need comes to you to reveal the magic numbers they can then use in their business proposals. You can also amaze your colleagues with patterns you found while roaming around your data lakes.

With BigQuery, Google Cloud’s enterprise data warehouse, you quickly become a superhero: You can run queries faster than anyone else. You’re not afraid of running full table scans. You’ve made your datasets highly available, and you no longer live in fear of maintenance windows. Indexes? We don’t need indexes where we’re going, or vacuums either.

If you’re a BigQuery user, you’re already a superhero. But superheroes don’t always know all their superpowers, or how to use them. Here are the top 10 BigQuery superpowers to discover.

01 BQ super powers.jpg

1. The power of data

Let’s say your favorite person has been trapped by an evil force, which will only release them if you answer this simple riddle: Who were the top superheroes on Wikipedia the first week of February 2018?

Oh no! Where will you get a log of all the Wikipedia page views? How can you tell which pages are superheroes? How long will it take to collect all of this data, and comb through it all? Well, I can answer that question (see the source data here). Once the data is loaded, it only takes 10 seconds to get the query results. This is how:
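A sketch of the kind of query involved; the dataset, table, and column names below are assumptions for illustration, not the exact public tables:

    SELECT title, SUM(views) AS views
    FROM `fh-bigquery.wikipedia_v3.pageviews_2018`
    WHERE wiki = 'en'
      AND datehour BETWEEN TIMESTAMP('2018-02-01') AND TIMESTAMP('2018-02-08')
      AND title IN (
        -- hypothetical Wikidata extract listing pages whose "instance of" is superhero
        SELECT en_wiki FROM `fh-bigquery.wikidata.latest` WHERE 'superhero' IN UNNEST(instance_of)
      )
    GROUP BY title
    ORDER BY views DESC
    LIMIT 10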

There it is—all the superheroes on the English Wikipedia page, and the number of page views for whatever time period you choose. And these are the top 10, for the first week of February 2018:

02 wikipedia results.png

You’re about to save your friend! But first, the evil spirit needs more detail. Well, this query will do:

03.png
04.png

You can have the power of data too: check out the Wikipedia pageviews and my latest Wikidata experiments (plus all of BigQuery’s public datasets), then copy and paste these queries, modify them, and save your friends.

2. The power of teleportation

You want to see the tables with the Wikipedia pageviews and Wikidata? Let’s jump to the BigQuery web UI. Did you know that you can autocomplete your queries while typing them? Just press Tab while writing your queries. Or you can run a sub-query by selecting it and pressing CMD-E. And teleportation? Jump straight to your tables by holding CMD and clicking on them. For example, that Wikipedia 2018 page views table we queried previously has more than 2TB of data, and the Wikidata one has facts for more than 46 million entities. And we just joined them to get the results we wanted.

Also, while looking at the schema, you can click on the fields, and that will auto-populate your query. Ta-da!

3. The power of miniaturization

Did I just say that the page views table has more than 2TB of data? That’s a lot! Remember that in BigQuery you have 1TB of free queries every month, so going through 2TB in one query means you will be out of the free quota pretty quickly. So how much data did I just consume? Let me run that first query again, without hitting the cache.

The result? 4.6 sec elapsed, 9.8 GB processed.

How is that possible? I just joined a 2TB table with a 750GB one. Even with partitioning, one week of Wikipedia page views would be 2TB divided by 52 weeks…that’s 38.5GB. So even with daily partitioning, I’m somehow querying less data than that.

Well, it turns out I have the data in these tables clustered by Wikipedia language and title, so I can make sure to always use those filters when going through the Wikipedia logs.
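For reference, this is roughly how such a clustered table is created; the source table and destination names are placeholders:

    CREATE TABLE `mydataset.wiki_pageviews_2018`
    PARTITION BY DATE(datehour)
    CLUSTER BY wiki, title AS
    SELECT datehour, wiki, title, views
    FROM `mydataset.raw_pageviews`;

Because the blocks are sorted by wiki and title, a query that filters on those columns only scans the blocks that can contain matching rows.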

And that’s how you miniaturize your queries!

4. The power of X-ray vision

Let’s say you want to get more data out of Wikidata for each superhero. Well, this query will do:

05.png
06.png

Why did this query take more time to process? Well, with our X-ray vision powers, we can see what BigQuery did in the background. Let’s look at the query history and the execution details tab.

07.png

Those are all the steps BigQuery had to go through to run our query. Now, if this is a little hard to read, we have some alternatives. For example, the legacy BigQuery web UI has more compact results:

08.png

You can see that the slowest operations were computing while reading the 56-million-row table twice.

09.png

I’ll focus on that to improve performance. If I change the two, shown in these lines:

Now my query runs in half the time! The slowest part has moved elsewhere, as shown here:

10.png

Which is this JOIN now:

11.png

It even shows us that it’s looking for all the superheroes between “3-D Man” and “Zor-El”… yes, it’s going through the whole alphabet. Get an even deeper view with the BigQuery query plan visualizer.

5. The power of materialization

It’s really cool to have these tables in BigQuery. But how did I load them? I periodically bring new raw files into Cloud Storage, and then I read them raw into BigQuery. In the case of the Wikipedia pageviews, I do all the CSV parsing inside BigQuery, as there are many edge cases and I need to handle some of them case by case.
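A sketch of that staging step with the bq tool (the bucket, table, and delimiter choices here are assumptions): load each raw line into a single STRING column, then parse it with SQL functions like SPLIT() and SAFE_CAST() when materializing.

    # Stage each raw line as one STRING column (pick a delimiter that never appears in the data)
    bq load --source_format=CSV --field_delimiter='\t' \
      mydataset.pageviews_raw gs://my-bucket/pageviews/*.gz line:STRING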

Then I materialize these tables periodically into my partitioned and clustered tables. In the case of Wikidata, they have some complicated JSON—so I read each JSON row raw into BigQuery. I could parse it with SQL, but that’s not enough. And that brings us to our next super power.

6. Navigating the multiverse

So we live in this SQL universe, a place where you can go beyond SQL alone. It’s an incredible place to manipulate and understand data, but each universe has its limitations and its rules. What if we could jump to a different universe, with different rules and powers, and manage to connect both universes, somehow? What if we could jump into the…JavaScript universe? We can, with UDFs—user-defined functions. They can easily extend BigQuery’s standard SQL. For example, I can download a random JavaScript library and use it from within BigQuery, like for performing natural language processing and lots more. 

Using UDFs means I can take each row of Wikidata JSON from above and parse it inside BigQuery, using whatever JavaScript logic I want to use, and then materialize this into BigQuery.
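A minimal sketch of such a JavaScript UDF, assuming a hypothetical raw table with one Wikidata entity JSON string per row:

    CREATE TEMP FUNCTION label_languages(json STRING)
    RETURNS ARRAY<STRING>
    LANGUAGE js AS """
      // Return the language codes present in a Wikidata entity's labels object
      var entity = JSON.parse(json);
      return Object.keys(entity.labels || {});
    """;

    SELECT label_languages(raw_json) AS languages
    FROM `mydataset.wikidata_raw`
    LIMIT 10;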

7. Time travel

Let’s take one particular table. It’s a beautiful table, with a couple thousand rows. But not everyone is happy—turns out someone wants to delete half of its rows, randomly. How would our super-enemy pull this off?
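Something like this would do it (the table name is a placeholder):

    -- Delete roughly half of the rows, chosen at random
    DELETE FROM `mydataset.peaceful_universe`
    WHERE RAND() < 0.5;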

Oh no. Half of the rows of our peaceful universe are gone. Randomly. How is that even fair? How will we ever recover from this?

5 days later

We learned how to move forward without these rows, but we still miss them. If only there was a way to travel back in time and bring them back.

Yes we can.

Instead of querying only the current state of the table, we can use time travel to read the table as it was before the deletion.
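A minimal sketch of both queries; the table name and the five-day offset are placeholders:

    -- The current state of the table (half the rows are gone)
    SELECT * FROM `mydataset.peaceful_universe`;

    -- The table as it was five days ago, before the deletion
    SELECT *
    FROM `mydataset.peaceful_universe`
    FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 5 DAY);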

Warning: CREATE OR REPLACE TABLE deletes the table history, so write the results elsewhere. 

8. The power of super-speed

How fast is BigQuery? It’s this fast.

12.png

The quick summary: BigQuery can run HyperLogLog++, Google’s internal implementation of the HyperLogLog algorithm, for cardinality estimation. It lets BigQuery count uniques a lot faster than other databases can, and it has some other cool features that make BigQuery perform incredibly well.
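For example, approximate aggregation functions such as APPROX_COUNT_DISTINCT are backed by HyperLogLog++; a quick illustration on a public sample table:

    SELECT APPROX_COUNT_DISTINCT(title) AS approx_unique_titles
    FROM `bigquery-public-data.samples.wikipedia`;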

9. Invulnerability

Our most annoying enemy? It’s a black hole of data: that thing that happens when we try to divide by zero. However, it’s possible to avoid that using BigQuery expressions like the SAFE. prefix.

SAFE. prefix

Syntax: SAFE.function_name()

Description: If you begin a function with the SAFE. prefix, it will return NULL instead of an error.

Operators such as + and = do not support the SAFE. prefix. To prevent errors from a division operation, use SAFE_DIVIDE. Some operators, such as IN, ARRAY, and UNNEST, resemble functions, but do not support the SAFE.prefix. The CAST and EXTRACT functions also do not support the SAFE. prefix. To prevent errors from casting, use SAFE_CAST. Find out more in the BigQuery docs.
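A few examples of the pattern (each of these would raise an error without the safe variant):

    SELECT
      SAFE.SUBSTR('hello', 0, -2)        AS safe_substring,  -- NULL instead of an error
      SAFE_DIVIDE(10, 0)                 AS safe_division,   -- NULL instead of divide-by-zero
      SAFE_CAST('not a number' AS INT64) AS safe_cast;       -- NULL instead of a cast error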

10. The power of self-control

All superheroes struggle when they first discover their superpowers. Having super strength is cool, but you can break a lot of things if you’re not careful. Having super-speed is fun—but only if you also learn how to brake. You can query 5PB of data in three minutes, sure—but then remember that querying 1PB is one thousand times more expensive than querying 1TB. And you only have 1TB free every month. If you haven’t entered a credit card, don’t worry: you still get your free terabyte every month, no credit card needed. But if you want to go further, you need to be aware of your budget and set up cost controls.
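One simple guardrail, sketched with the bq tool, is to cap how many bytes a single query may bill; the limit below (about 1 GB) is arbitrary:

    bq query --use_legacy_sql=false --maximum_bytes_billed=1000000000 \
      'SELECT COUNT(DISTINCT title) FROM `bigquery-public-data.samples.wikipedia`'

If the query would bill more than the limit, it fails instead of eating into your quota or budget.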

Check out this doc on creating custom cost controls, and find out how BigQuery Reservations work to easily use our flat-rate pricing model. Remember, with great powers comes great responsibility. Turn on your cost controls.

And there are a lot more superpowers. How about the power to predict the future? There’s a whole world of ML to explore, not to mention all the GIS capabilities you can find in BigQuery. Check out Lak Lakshmanan talking about more of the awesome resources we have. And that brings me to our bonus superpower:

11. The power of community

No superhero should stand alone. Join our Reddit community, where we share tips and news. Come to Stack Overflow for answers, and to help new superheroes learn the ropes. We can all learn from each other. And follow me and my friends on Twitter.

If you’re ready to test your powers, try to solve our weekly BigQuery Data Challenge. It’s fun, free of charge, and you might win $500 in cloud credits!

What’s happening in BigQuery: New federated queries, easier ML, and more metadata

BigQuery, Google Cloud’s petabyte-scale data warehouse, lets you ingest and analyze data quickly and with high availability, so you can find new insights, trends, and predictions to efficiently run your business. Our engineering team is continually making improvements to BigQuery so you can get even more out of it. Recently added BigQuery features include new federated data sources, BigQuery ML transforms, integer partitions, and more.

Read on to learn more about these new capabilities and get quick demos and tutorial links so you can try them yourself.

Query ORC and Parquet files directly with BigQuery

Parquet and ORC are popular columnar open source formats for large-scale data analytics. As you make your move to the cloud, you can use BigQuery to analyze data stored in these formats. Choosing between keeping these files in Cloud Storage vs. loading your data into BigQuery can be a difficult decision. 

To make it easier, we launched federated query support for Apache ORC and Parquet files in Cloud Storage from BigQuery’s Standard SQL interface. Check out a demo:

This video demonstrates a newly-released set of BigQuery features! BigQuery now supports querying Parquet and ORC files stored in GCS, and BigQuery is now able to understand Hive-partitioned tables in GCS.

This new feature joins other federated querying capabilities from within BigQuery, including storage systems such as Cloud Bigtable, Google Sheets, and Cloud SQL, as well as Avro, CSV, and JSON file formats in Cloud Storage—all part of BigQuery’s commitment to building an open and accessible data warehouse. Read more details on this launch.
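For example, with current DDL you can define an external table over Parquet files in Cloud Storage along these lines (the dataset, table, and bucket names are placeholders):

    CREATE EXTERNAL TABLE `mydataset.parquet_logs`
    OPTIONS (
      format = 'PARQUET',
      uris = ['gs://my-bucket/logs/*.parquet']
    );

    -- Query it like any other table
    SELECT COUNT(*) FROM `mydataset.parquet_logs`;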

Your turn: Load and query millions of movie recommendations 

Love movies? Here’s a way to try query federation: analyze over 20 million movie ratings and compare analytic performance between Cloud SQL, BigQuery federated queries, and BigQuery native storage. Launch the code workbook and follow along with the video. Don’t have a Google Cloud account? Sign up for free.

New data transformations with BigQuery ML

The success of machine learning (ML) models depends heavily on the quality of the dataset used in training. Preprocessing your training data during feature engineering can get complicated when you also have to do those same transformations on your production data at prediction time. 

We announced some new features in BigQuery ML that can help preprocess and transform data with simple SQL functions. In addition, because BigQuery automatically applies these transformations at the time of predictions, the productionization of ML models is greatly simplified.

Binning data values with ML.BUCKETIZE

One decision you will face when building your models is whether to throw away records where there isn’t enough data for a given dimension. For example, if you’re evaluating taxi cab fares in NYC, do you throw away rides to latitude and longitudes that only appear once in the data? One common technique is to bucketize continuous values (like lat/long) into discrete bins. This will help group and not discard infrequent dropoff locations in the long tail:

Binning data values.png

BigQuery provides out-of-the-box support for several common machine learning operations that do not require a separate analysis pass through the data. Here’s an example of bucketizing the inputs, knowing the latitude and longitude boundaries of New York:

Binning data values 1.png

BigQuery ML now supports running the same preprocessing steps at serving time if you wrap all of your transformations in the new TRANSFORM clause. This saves you from having to remember and implement the same data transformations you did during training on your raw prediction data. Check out the blog post to see the complete example for NYC taxi cab fare prediction, and read more in the BigQuery ML preprocessing documentation.
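A hedged sketch of what that looks like, with hypothetical table and column names and rough latitude/longitude boundaries for New York:

    CREATE OR REPLACE MODEL `mydataset.taxi_fare_model`
    TRANSFORM (
      ML.BUCKETIZE(pickup_latitude,  GENERATE_ARRAY(40.5, 41.0, 0.01))  AS pickup_lat_bin,
      ML.BUCKETIZE(pickup_longitude, GENERATE_ARRAY(-74.3, -73.7, 0.01)) AS pickup_lon_bin,
      fare_amount
    )
    OPTIONS (model_type = 'linear_reg', input_label_cols = ['fare_amount']) AS
    SELECT pickup_latitude, pickup_longitude, fare_amount
    FROM `mydataset.taxi_trips`;

Because the transformations live in the TRANSFORM clause, ML.PREDICT applies the same bucketization automatically at prediction time.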

Use flat-rate pricing with BigQuery Reservations

BigQuery Reservations is now in beta in U.S. and E.U. regions. BigQuery Reservations allows you to seamlessly purchase BigQuery slots to take advantage of flat-rate pricing and manage BigQuery spending with complete predictability.

BigQuery Reservations allows you to:

  • Purchase dedicated BigQuery slots by procuring commitments in a matter of seconds
  • Programmatically and dynamically distribute committed BigQuery slots to reservations for workload management purposes
  • Use assignments to assign Google Cloud projects, folders, or your entire organization to reservations
BigQuery Reservations.png

Quickly analyze metadata with INFORMATION_SCHEMA

As a data engineer or analyst, you are often handed a dataset or table name with little or no context of what’s inside or how it is structured. Knowing what tables and columns are available across your datasets is a critical part of exploring for insights. Previously, you could select a preview of the data or click each table name in the BigQuery UI to inspect the schema. Now, with INFORMATION_SCHEMA, you can do these same tasks at scale with SQL. 

How can you quickly tell how many tables are in a dataset? What about the total number and names of columns and whether they are partitioned or clustered? BigQuery natively stores all this metadata about your datasets, tables, and columns in a queryable format that you can quickly access with INFORMATION_SCHEMA:
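For example, listing every column in the public baseball dataset is a single query:

    SELECT table_name, column_name, data_type, is_partitioning_column
    FROM `bigquery-public-data.baseball.INFORMATION_SCHEMA.COLUMNS`
    ORDER BY table_name, ordinal_position;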

metadata.png

As you can see, with INFORMATION_SCHEMA.COLUMNS there is a list of every column and its data type for all tables in the baseball dataset. 

Let’s expand the previous query to aggregate some useful metrics (a sketch of such a query follows the list), like:

  • Count of tables
  • Names of tables 
  • Total number of columns
  • Count of partitioned columns
  • Count of clustered columns
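A sketch of that aggregation over the same INFORMATION_SCHEMA.COLUMNS view:

    SELECT
      COUNT(DISTINCT table_name)                        AS table_count,
      STRING_AGG(DISTINCT table_name)                   AS table_names,
      COUNT(column_name)                                AS column_count,
      COUNTIF(is_partitioning_column = 'YES')           AS partitioning_column_count,
      COUNTIF(clustering_ordinal_position IS NOT NULL)  AS clustering_column_count
    FROM `bigquery-public-data.baseball.INFORMATION_SCHEMA.COLUMNS`;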
metadata results.png

Try the above query with different public datasets like github_repos or new_york or even your own Google Cloud project and dataset. 

Your turn: Analyze BigQuery public dataset and table metadata quickly with INFORMATION_SCHEMA

Practice analyzing dataset metadata with this 10-minute demo video and code workbook. And check out the documentation for INFORMATION_SCHEMA for reference.

Partition your tables by an integer range

BigQuery natively supports partitioning your tables, which makes managing big datasets easier and faster to query. You can segment your table by a timestamp column, a date column, or a range of integer values in your data.

Creating an integer range partitioned table

Let’s assume we have a large dataset of transactions and a customer_id field that we want to partition by. After the table is set up, we will be able to filter for a specific customer without having to scan the entire table, which means faster queries and more control over costs.

We’ll use the BigQuery CLI for this example. Here we create an integer range partitioned table named mypartitionedtable in mydataset in your default project.
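A sketch of the command; the --range_partitioning flag takes the column name, start, end, and interval, and the schema here is a hypothetical one:

    bq mk --table \
      --range_partitioning=customer_id,0,100,10 \
      --schema='customer_id:INTEGER,amount:NUMERIC,transaction_ts:TIMESTAMP' \
      mydataset.mypartitionedtable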

The partitioning is based on a start of 0, an end of 100, and an interval of 10. That means we have customer IDs from 0 to 100 and want to segment them in groups of 10 (so IDs 0-9 will be in one partition, and so on). Note that new data written to an integer range partitioned table will automatically be partitioned. This includes writing to the table via load jobs, queries, and streaming. As your dataset changes over time, BigQuery will automatically re-partition your data too.

Using table decorators to work with specific partitions 

You can use table decorators to quickly access specific segments of your data. For example, if you want all the customers in the first partition (the 0-9 ID range), add the $0 suffix to the table name:
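A sketch with the bq tool; quote the table name so the shell doesn't expand the $, and note that the bucket path is a placeholder:

    # Load additional rows directly into the first partition (customer IDs 0-9)
    bq load --source_format=CSV 'mydataset.mypartitionedtable$0' gs://my-bucket/customers_0_9.csv

    # In SQL, filtering on the partitioning column prunes to the same partition
    bq query --use_legacy_sql=false \
      'SELECT * FROM `mydataset.mypartitionedtable` WHERE customer_id BETWEEN 0 AND 9'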

This is particularly useful when you need to load in additional data—you can just specify the partitions using decorators. 

For additional information about integer range partitions, see Integer range partitioned tables.

In case you missed it

For more on all things BigQuery, check out these recent posts, videos and how-tos:

To keep up on what’s new with BigQuery, subscribe to our release notes. You can try BigQuery with no charge in our sandbox. And let us know how we can help.

Use third-party keys in the cloud with Cloud External Key Manager, now beta

At Google Cloud Next UK last month, we announced the alpha version of Google Cloud’s External Key Manager (Cloud EKM). Today, Cloud EKM is available in beta, so we wanted to provide a deeper look at what Cloud EKM is and how it can be valuable for your organization. 

In a first for any public cloud, Cloud EKM will let you achieve full separation between your data and your encryption keys. At its heart, Cloud EKM lets you protect data at rest in BigQuery and Compute Engine using encryption keys that are stored and managed in a third-party key management system that’s deployed outside Google’s infrastructure.

Cloud EKM.png
Cloud EKM provides the bridge between Cloud KMS and an external key manager.

This approach offers several unique security benefits: 

  • Maintain key provenance over your third-party keys. You have strict control over the creation, location, and distribution of your keys.  

  • Full control over who accesses your keys. Because keys are always stored outside Google Cloud, you can enforce that access to data at rest for BigQuery and Compute Engine requires an external key. 

  • Centralized key management. Use one key manager for both on-premises and cloud-based keys, ensuring a single policy point and allowing enterprises to easily take advantage of hybrid deployments. 

To make Cloud EKM easy to implement, we are working with five industry-leading key management vendors: Equinix, Fortanix, Ionic, Thales, and Unbound. (The Ionic and Fortanix integrations are ready today; Equinix, Thales, and Unbound are coming soon.) Check out the videos below to learn more.

Equinix and Cloud EKM

In collaboration with Equinix, Google Cloud brings customers the next level of control for their cloud environments with External Key Manager. Check out the video to learn more.

Fortanix and Cloud EKM

In collaboration with Fortanix, Google Cloud brings customers the next level of control for their cloud environments with External Key Manager. Check out the video to learn more.

Ionic and Cloud EKM

In collaboration with Ionic, Google Cloud brings customers the next level of control for their cloud environments with External Key Manager. Check out the video to learn more.

Thales and Cloud EKM

In collaboration with Thales, Google Cloud brings customers the next level of control for their cloud environments with External Key Manager. Watch the video to learn more.

Unbound and Cloud EKM

In collaboration with Unbound, Google Cloud brings customers the next level of control for their cloud environments with External Key Manager. Check out the video to learn more.

For more information on Cloud EKM, including how to get started, check out the documentation.

TensorFlow Enterprise makes accessing data on Google Cloud faster and easier

Data is at the heart of all AI initiatives. Put simply, you need to collect and store a lot of it to train a deep learning model, and with the advancements and increased availability of accelerators such as GPUs and Cloud TPUs, the speed of getting the data from its storage location to the training process is increasingly important.

TensorFlow, one of the most popular machine learning frameworks, was open sourced by Google in 2015. Although it caters to every user, those deploying it on Google Cloud can benefit from enterprise-grade support and performance from the creators of TensorFlow. That’s why we recently launched TensorFlow Enterprise—for AI-enabled businesses on Google Cloud.

In this post, we look at the improvements TensorFlow Enterprise offers in terms of accessing data stored on Google Cloud. If you use Cloud Storage for storing training data, jump to the Cloud Storage reader improvements section to see how TensorFlow Enterprise doubles data throughput from Cloud Storage. If you use BigQuery to store data, jump to the BigQuery reader section to learn how TensorFlow Enterprise allows you to access BigQuery data directly in TensorFlow with high throughput.

Cloud Storage reader improvements
TensorFlow Enterprise introduces some improvements in the way TensorFlow Dataset reads data from Cloud Storage. To measure the effect of these improvements, we will run the same TensorFlow code with TensorFlow 1.14 and with TensorFlow Enterprise, and compare the average number of examples per second read from Cloud Storage. The code we run simply reads TFRecord files and prints the number of examples per second:
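A minimal sketch of such a measurement (eager-style iteration; the file pattern, batch size, and cycle length are assumptions, and the original benchmark code likely differs):

    import time
    import tensorflow as tf

    # List the TFRecord shards of the fake imagenet copy in Cloud Storage
    files = tf.io.gfile.glob("gs://cloud-samples-data/ai-platform/fake_imagenet/train*")

    dataset = tf.data.Dataset.from_tensor_slices(files)
    dataset = dataset.apply(tf.data.experimental.parallel_interleave(
        tf.data.TFRecordDataset, cycle_length=16))
    dataset = dataset.batch(32).prefetch(8)

    # Count serialized examples per second over a fixed number of batches
    start, examples = time.time(), 0
    for batch in dataset.take(500):
        examples += int(batch.shape[0])
    print("examples/sec:", examples / (time.time() - start))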

The output looks something like this:

The dataset we use for this experiment will be the fake imagenet data, a copy of which can be found at gs://cloud-samples-data/ai-platform/fake_imagenet.

We will run the same code on a Compute Engine VM first with TensorFlow 1.14, then with TensorFlow Enterprise and compare the average number of examples per second. In our experiments we ran the code on a VM with 8 CPUs and 64GB of memory, and reading from a regional Cloud Storage bucket in the same region as the VM.

We see significant improvement:

TensorFlow Speeds.png

You can further improve data reading speed from Cloud Storage by adjusting various parameters TensorFlow provides. For example, when we replace the parallel_interleave call above with the following code:
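The exact replacement isn't shown here, but one such tuning, sketched below, is to use the interleave transformation with autotuned parallelism:

    dataset = tf.data.Dataset.from_tensor_slices(files).interleave(
        tf.data.TFRecordDataset,
        cycle_length=16,
        num_parallel_calls=tf.data.experimental.AUTOTUNE)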

We see further improvement:

TensorFlow Speed Improvements.png

There are other factors that affect reading speed from Cloud Storage. One common mistake is to have too many small tfrecord files on Cloud Storage as opposed to fewer larger ones. To achieve high throughput from TensorFlow reading data from Cloud Storage, you should group the data so that each file is more than 150MB.

BigQuery Reader
TensorFlow Enterprise introduces the BigQuery reader that allows you to read data directly from BigQuery. For example:
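A rough sketch of how the reader is used; the project, dataset, table, and column names are placeholders, and the exact signatures may differ from this sketch:

    import tensorflow as tf
    from tensorflow_io.bigquery import BigQueryClient

    client = BigQueryClient()
    session = client.read_session(
        "projects/my-project",                 # parent resource
        "my-project", "my_table", "my_dataset",
        selected_fields=["user_id", "label"],
        output_types=[tf.int64, tf.int64],
        requested_streams=2)

    dataset = session.parallel_read_rows().batch(32)
    for batch in dataset.take(1):
        print(batch)   # an OrderedDict of column name -> tensor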

The BigQuery reader uses BigQuery’s Storage API for parallelized data access to allow high data throughput.  The output looks like the following:

With the code above, each batch of examples is structured as a Python OrderedDict whose keys are the column names specified in the read_session call, and whose values are TensorFlow tensors ready to be consumed by a model. For instance:

Conclusion
The speed of getting data from its storage location to the machine learning training process is increasingly critical to a deep learning model builder’s productivity. TensorFlow Enterprise can help by providing optimized performance and easy access to data sources, and we continue to work with users to introduce improvements that make TensorFlow workloads more efficient on GCP.

For more information, please see our summary blog of TensorFlow Enterprise. To get started, visit TensorFlow Enterprise.

How to build your first Google Maps Platform integration with deck.gl

We live in a world full of data. It’s often hard to tease out the hidden meaning found within it because–in its rawest form–the data is still very abstract. This is particularly true of Geo-based data, where the very elements that give a row of data a sense of location (its “geo-ness”) can be difficult to mentally model. Take latitude and longitude. While they describe precise location, they’re largely useless without a means to plot and calculate over them. Computers are great at this. Humans, not so much. This is where visualizations come to the rescue, revealing a world of interconnectedness that’s often hidden when looked at as raw numbers. 

In this post I’ll walk you through how to set up and use Google Maps Platform with deck.gl. We’ll start with a simple “Hello, World!” style introduction to ensure all our plumbing is in order. Then we’ll move to a basic visualization of some real data. And once we’re done, you’ll have sound boilerplate code to work from for future visualizations. Finally, we’ll demonstrate some techniques for interactivity that will help unlock hidden meaning and opportunity.

Hello, world!
Let’s start by reviewing the simplest of map visualizations using deck.gl. To begin, download the starter project here.

All of the files you’ll need to work with in this tutorial are in the /src directory. The main file we will be looking at is hello-world.html, where you’ll see I’ve already included the code for importing the Maps JavaScript API, as well as using it to initialize and apply some basic styling to the map. 

To view this page I recommend using a simple web server. While this page doesn’t absolutely need it, future examples will need some serving mechanism to avoid the dreaded CORS error when loading data and map styles.

To begin, from the command line, go to the directory of the starter project you just downloaded.

Next, run the following from the command line to start a simple web server to serve the project files: 
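Any static file server works; one option, assuming Python 3 is installed, is:

    python3 -m http.server 8000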

Next, you’ll need to replace YOUR_API_KEY with your actual API key in the script tag at the top of hello-world.html where we’re importing the Maps JavaScript API. If you aren’t familiar with how to generate an API key, check out our quick video walkthrough.

Once you’ve done this, you can view the page at:

It should look like this:

Example1

Adding a basic deck.gl visualization
While our map is nice, it’s not exactly ready for exploring real-world use cases. Let’s add a very basic deck.gl visualization: Scatterplot Layer. 

To start, we include the unpackaged version of the deck.gl library in the app by adding the following script tag in the <head> section of index.html:
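For example, a script tag along these lines (the CDN path and version are assumptions):

    <script src="https://unpkg.com/deck.gl@^8.0.0/dist.min.js"></script>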

Next, in our main script in index.html, you’ll see we have already initialized the map and defined a data point for you: 

The first step in applying deck.gl visualization layers to this map is to import the GoogleMapsOverlay and ScatterplotLayer objects from deck.gl at the top of this script tag:

Next, we initialize GoogleMapsOverlay and pass an instance of the deck.gl ScatterplotLayer to it at the bottom of the same script. A “Layer” is a core deck.gl concept: a visualization type that takes data and renders it on the map based on a set of properties that define the layer’s look and behavior.

Here we specify one data point and a radius of 10 meters, then add the overlay to the map:
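Sketched out, with the scripting-bundle globals and a placeholder coordinate, that looks roughly like this:

    const {GoogleMapsOverlay, ScatterplotLayer} = deck;

    const overlay = new GoogleMapsOverlay({
      layers: [
        new ScatterplotLayer({
          id: 'hello-world',
          data: [{position: [-74.006, 40.7128]}],   // a single placeholder point
          getPosition: d => d.position,
          getRadius: 10,                            // meters
          getFillColor: [255, 0, 0]
        })
      ]
    });

    overlay.setMap(map);   // `map` is the google.maps.Map created earlier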

The data attribute of deck.gl layers is quite robust. It will take an Iterable (an array of data), String (a URL), Promise or a generic Object (that implements length). For more details on what you can feed a layer see the deck.gl properties documentation. 

The end result should be a single lonely marker, lost in the hustle and bustle of the big city.  

ScatterplotLayer
Scatterplot layer with a single data point

Adding a datasource
Let’s add some more data from one of BigQuery’s public datasets to create a more meaningful scatterplot visualization. For those new to BigQuery, see this getting started guide to get through the initial setup and to see an example of how to query a public dataset. 

We’ll be using New York City’s Citi Bike dataset. Citi Bike is New York City’s bike share system, and the largest in the nation with 10,000 bikes and 600 stations across Manhattan, Brooklyn, Queens, and Jersey City.

To retrieve the data, run the following query in the BigQuery query editor in the Google Cloud console:
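A query along these lines returns the station locations (column names assumed from the public new_york_citibike dataset):

    SELECT name, latitude, longitude, capacity
    FROM `bigquery-public-data.new_york_citibike.citibike_stations`;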

Once the query completes, click SAVE RESULTS beneath the query editor, and select JSON (local file) to save the data to your working directory. For this example, we named the file stations.json.

QueryEditor

Now let’s tell the layer about this new JSON data source and how to parse the data it contains.

getPosition, getFillColor, and getRadius are our data accessors. They tell the layer how to reach into the data and extract values that it uses to determine how to render different attributes of the visualization. In the case of a ScatterplotLayer, these accessors determine the position, size, and color for each data point, but you don’t always have to base rendering decisions on the data.

For example, here we are keeping the fill color constant for all rows, but dynamically setting the position and radius of the points based on the data:
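A sketch of the updated layer, assuming the stations.json export from the previous step:

    const scatterplot = new ScatterplotLayer({
      id: 'citibike-stations',
      data: 'stations.json',                           // fetched as a URL
      getPosition: d => [Number(d.longitude), Number(d.latitude)],
      getFillColor: [255, 0, 0],                       // constant for every row
      getRadius: d => Number(d.capacity) / 2,          // hypothetical scaling by station capacity
      radiusMinPixels: 2
    });

    overlay.setProps({layers: [scatterplot]});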

Now when we reload the app, our end result visualizes about 800 Citi Bike stations across New York City.  

Header
Finished Scatterplot layer

That’s all there is to it. You’ve built your first deck.gl data visualization using data from one of Google Cloud’s BigQuery public datasets. If you ran into any issues, you can also refer to the code in scatterplot.html to see the final code of this tutorial.

In our next post, we’ll pick up where we left off and integrate even more data and some interesting interactivity into this Scatterplot Layer.

For more information on Google Maps Platform, visit our website.

How to quickly solve machine learning forecasting problems using Pandas and BigQuery

Time-series forecasting problems are ubiquitous throughout the business world. For example, you may want to predict the probability that some event will happen in the future or forecast how many units of a product you’ll sell over the next six months. Forecasting like this can be posed as a supervised machine learning problem. 

Like many machine learning problems, the most time-consuming part of forecasting can be setting up the problem, constructing the input, and feature engineering. Once you have created the features and labels that come out of this process, you are ready to train your model.

A common approach to creating features and labels is to use a sliding window where the features are historical entries and the label(s) represent entries in the future. As any data scientist who works with time series knows, this sliding window approach can be tricky to get right.

1_sliding window on an example dataset.gif
A sliding window on an example dataset. Each window represents a feature vector for the dataset and the label(s) is one or more points in the future.

Below is a good workflow for tackling forecasting problems:

1. Create features and labels on a subsample of data using Pandas and train an initial model locally
2. Create features and labels on the full dataset using BigQuery
3. Utilize BigQuery ML to build a scalable machine learning model
4. (Advanced) Build a forecasting model using Recurrent Neural Networks in Keras and TensorFlow

In the rest of this blog, we’ll use an example to provide more detail into how to build a forecasting model using the above workflow. (The code is available on AI Hub)

First, train locally

Machine learning is all about running experiments. The faster you can run experiments, the more quickly you can get feedback, and thus the faster you can get to a Minimum Viable Model (MVM). It’s beneficial, then, to first work on a subsample of your dataset and train locally before scaling out your model using the entire dataset.

Let’s build a model to forecast the median housing price week-by-week for New York City. We spun up a Deep Learning VM on Cloud AI Platform and loaded our data from nyc.gov into BigQuery. Our dataset goes back to 2003, but for now let’s just use prices beginning in 2011.

2_median housing price.png

Since our goal is to forecast future prices, let’s create sliding windows that accumulate historical prices (features) and a future price (label). Our source table contains date and median price:

3_forecast future prices.png

Here is the entire dataset plotted over time:

4_entire dataset plotted.png

To create our features, we’ll pick a historical window size—e.g., one year—that will be used to forecast the median home price in six months. To do this, we have implemented a reusable function based on Pandas that allows you to easily generate time-series features and labels. Feel free to use this function on your own dataset.
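The notebook's create_rolling_features_label function is more general; a minimal sketch of the idea in Pandas looks like this:

    import pandas as pd

    def rolling_features_label(series: pd.Series, window: int, horizon: int) -> pd.DataFrame:
        """Each row holds `window` past values as features and the value `horizon` steps ahead as the label."""
        rows = []
        for end in range(window, len(series) - horizon + 1):
            features = series.iloc[end - window:end].tolist()
            label = series.iloc[end + horizon - 1]
            rows.append(features + [label])
        columns = [f"price_back_{window - i}" for i in range(window)] + ["label"]
        return pd.DataFrame(rows, columns=columns)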

After running create_rolling_features_label, a feature vector of length 52 (plus the date features) is created for each example, representing the features before the prediction date.

5_create_rolling_features_label.png

This can be shown with a rolling window:

6_rolling window.gif
The create_rolling_features_label function creates windows for the feature and label. In this case, the features consist of 52 weeks and the label consists of a week 6 months into the future.

Once we have the features and labels, the next step is to create a training and test set. In time-series problems, it’s important to split them temporally so that you are not leaking future information that would not be available at test time into the trained model.

7_training and test set.png
The training set (blue) will consist of data where the label occurs before the split date ('2015-12-30'), while the test set (green) consists of rows where the label is after this date.

In practice, you may want to scale your data using z-normalization or detrend your data to reduce seasonality effects. It may also help to use differencing to remove trend information. Now that we have features and labels, this simply becomes a traditional supervised learning problem, and you can use your favorite ML library to train a model. Here is a simple example using sklearn:
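A minimal sketch, assuming train_df and test_df came out of the temporal split above with price_back_* feature columns and a label column:

    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_absolute_error

    # train_df and test_df are assumed to come from the temporal split described above
    feature_cols = [c for c in train_df.columns if c.startswith("price_back_")]

    model = LinearRegression()
    model.fit(train_df[feature_cols], train_df["label"])

    predictions = model.predict(test_df[feature_cols])
    print("test MAE:", mean_absolute_error(test_df["label"], predictions))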

Scale our model

Let’s imagine we want to put our model into production and automatically run it every week, using batch jobs, to get a better idea of future sales. Let’s also imagine we may want to forecast day-by-day.

Our data is stored in BigQuery, so let’s use the same logic that we used in Pandas to create features and labels, but instead run it at scale using BigQuery. We have developed a generalized Python function that creates a SQL string that lets you do this with BigQuery:

We pass the table name that contains our data, the value name that we are interested in, the window size (which is the input sequence length), the horizon of how far ahead in time we skip between our features and our labels, and the labels_size (which is the output sequence length). Labels size is equal to 1 here because, for now, we are only modeling sequence-to-one—even though this data pipeline can handle sequence-to-sequence. Feel free to write your own sequence-to-sequence model to take full advantage of the data pipeline!

We can then execute the SQL string scalable_time_series in BigQuery. A sample of the output shows that each row is a different sequence. For each sequence, we can see the time ranges of the features and the labels. For the features, the timespan is 52 weeks, which is the window_size, and for labels it is one day, which is the labels_size.

8_scalable_time_series.png

Looking at the same sampled rows, we can see how the training data is laid out. We have a column for each timestep of the previous price, starting with the farthest back in time on the left  and moving forward. The last column is the label, the price one week ahead.

9_price one week ahead.png

Now we have our data, ready for training, in a BigQuery table. Let’s take advantage of BigQuery ML and build a forecasting model using SQL.

10_forecasting model using SQL.png

Above we are creating a linear regression model using our 52 past price features and predicting our label price_ahead_1. This will create a BQML MODEL in our bqml_forecasting dataset.
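A minimal sketch of such a statement, with hypothetical dataset, table, and column names:

    CREATE OR REPLACE MODEL `bqml_forecasting.median_price_model`
    OPTIONS (model_type = 'linear_reg', input_label_cols = ['price_ahead_1']) AS
    SELECT *   -- the 52 price_back_* feature columns plus the price_ahead_1 label
    FROM `bqml_forecasting.training_data`;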

11_52 past price features.png

We can check how our model performed by calling TRAINING_INFO. This shows the training run index, iteration index, the training and eval loss at each iteration, the duration of the iteration, and the iteration’s learning rate. Our model is training well since the eval loss is continually getting smaller for each iteration.
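TRAINING_INFO is queried like a table function (the model name matches the sketch above):

    SELECT *
    FROM ML.TRAINING_INFO(MODEL `bqml_forecasting.median_price_model`);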

12_TRAINING_INFO.png

We can also do an evaluation of our trained model by calling EVALUATE. This will show common evaluation metrics that we can use to compare our model with other models to find the best choice among all of our options.
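For example, evaluating against a held-out table (the table name is a placeholder):

    SELECT *
    FROM ML.EVALUATE(MODEL `bqml_forecasting.median_price_model`,
      (SELECT * FROM `bqml_forecasting.eval_data`));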

13_EVALUATE.png

Lastly, machine learning is all about prediction. The training is just a means to an end. We can get our predictions by using the above query, where we have prepended predicted_ to the name of our label.
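A sketch of the prediction query, reusing the placeholder names from above:

    SELECT predicted_price_ahead_1, price_ahead_1
    FROM ML.PREDICT(MODEL `bqml_forecasting.median_price_model`,
      (SELECT * FROM `bqml_forecasting.eval_data`));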

Now, let’s imagine that we want to run this model every week. We can easily create a batch job that is automatically executed using a scheduled query.

14_scheduled query.png

Of course, if we want to build a more custom model, we can use TensorFlow or another machine library, while using this same data engineering approach to create our features and labels to be read into our custom machine learning model. This technique could possibly improve performance.

15_custom machine learning model.png

To use an ML framework like TensorFlow, we’ll need to write the model code and also get our data in the right format to be read into our model. We can make a slight modification to the previous query we used for BigQuery ML so that the data will be amenable to the CSV file format. 

For this example, imagine you wanted to build a sequence-to-sequence model in TensorFlow that can handle variable length features. One approach to achieve this would be to aggregate all the features into a single column named med_sales_price_agg, separated by semicolons. The features (if we have more than just this feature in the future) and the label are all separated by a comma.

16_med_sales_price_agg.png

We’ll execute the query in BigQuery and will make a table for train and eval. This will then get exported to CSV files in Cloud Storage. The diagram above is what one of the exported CSV files looks like—at least the header and the first line—with some comments added. Then when reading the data into our model using tf.data, we will specify the delimiter pattern shown above to correctly parse the data.

Please check out our notebook on AI Hub for an end-to-end example showing how this would work in practice and how to submit a training job on Google Cloud AI Platform. For model serving, the model can be deployed on AI Platform or directly in BigQuery.

Conclusion

That’s it! The workflow we shared will allow you to automatically and quickly set up any time-series forecasting problem. Of course, this framework can also be adapted for a classification problem, like using a customer’s historical behavior to predict the probability of churn or to identify anomalous behavior over time. Regardless of the model you build, these approaches let you quickly build an initial model locally, then scale to the cloud using BigQuery.

Learn more about BigQuery and AI Platform.

Skip the maintenance, speed up queries with BigQuery’s clustering

BigQuery is Google Cloud’s serverless data warehouse, automating much of the toil and complexity associated with setting up and managing an enterprise-grade data warehouse. If there’s a maintenance task to be done, BigQuery’s philosophy is to take care of it for our users autonomously. 

To get the most out of BigQuery, one of our key best practices is table partitioning and clustering. In this blog post, we’ll explain the internal workings of clustered tables in BigQuery. We’ll also cover automatic re-clustering, which fixes a classic data warehousing concern in a completely transparent way. It does this as a background process unique to BigQuery that continuously optimizes clustered tables for best performance. 

You’ll find partitioning and clustering vastly improve the cost and performance profiles of your workloads. In addition, automatic re-clustering seamlessly provides a practical benefit: making your clustered tables resilient to real-world demands such as continuous inserts. One less thing to worry about!

How clustering works in BigQuery

Within a partitioned table, individual partitions behave as independent tables—one per partition. As such, the clustering behavior described here for each partition of a partitioned table applies equally to non-partitioned tables.

Clustering is supported on primitive non-repeated top-level columns, such as INT64, BOOL, NUMERIC, STRING, DATE, GEOGRAPHY, and TIMESTAMP.

In general, there are two typical usage patterns for clustering within a data warehouse:

  • Clustering on columns that have a very high number of distinct values, like userId or transactionId.

  • Clustering on multiple columns that are frequently used together. When clustering by multiple columns, the order of columns you specify is important. The order of the specified columns determines the sort order of the data. You can filter by any prefix of the clustering columns and get the benefits of clustering, like regionId, shopId and productId together; or regionId and shopId; or just regionId.

Data in a BigQuery table is stored in Capacitor format blocks. This means that table clustering defines a “weak” sort order on these blocks. In other words, BigQuery attempts to distribute the data such that the blocks store non-overlapping ranges of values for the clustering keys. BigQuery automatically determines the boundaries of these new blocks within each partition as it is written to a table. 

You can see below the layout of data in a table partitioned on the eventDate date column and clustered on the eventId clustering column:

Partition and clustering pruning in BigQuery.png
Partition and clustering pruning in BigQuery

In this query, BigQuery first applies the filter on the partitioning column to limit the blocks to only those in partition for “2019-08-02.” Additionally, since the table is clustered on eventID, blocks are organized so that there are non-overlapping ranges of values for eventID column. Among other things, BigQuery’s scalable metadata management system stores the range information of columns within each block. It uses this information to further limit the scan to blocks that have eventId between 201 and 300, as well as between 10301 and 10400. 
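The query being described would look something like this (the table name and literal values are illustrative):

    SELECT *
    FROM `mydataset.events`
    WHERE eventDate = '2019-08-02'
      AND eventId IN (250, 10350);   -- values that fall in two separate clustered block ranges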

Clustering improves performance of aggregation queries

Since clustering implies sort order, rows with the same value for the clustering columns are stored in the same or nearby blocks. This allows BigQuery to optimize aggregation queries that group by the clustering columns. In order to compute aggregates, BigQuery first computes partial aggregates from each block. It then shuffles and merges these partial aggregates to compute the final aggregate. Since rows with same value for clustering columns are generally together, partial aggregations produced are significantly smaller in size, thus reducing the amount of intermediate data that needs to be shuffled. This improves aggregation query performance.

Maintaining table clustering

Data is typically written to a BigQuery table on a continuous basis using load, query, or copy jobs, or through the streaming API. As more data comes in, the newly inserted data may be written to blocks that have column value ranges that overlap with those of the currently active blocks in the table. To maintain the performance characteristics of a clustered table, BigQuery performs automatic re-clustering in the background to restore the sort property of the table. Remember, in a partitioned table, clustering is maintained for data within the scope of each partition.

Traditionally, similar processes in data warehouses, such as VACUUM or automatic clustering, require some level of setup and administration. They also require the user to bear the cost of the process, since the processes use cluster time otherwise dedicated to valuable analytics. BigQuery’s automatic re-clustering is unique in two ways:

  • This background operation does not use any of your resources, and is thus provided free to users.

  • The system performs automatic re-clustering autonomously and transparently, with no action required from you.

Automatic re-clustering.png
Automatic re-clustering: no cost, no setup, no maintenance

How automatic re-clustering works

Automatic re-clustering works in a manner similar to an LSM tree. In steady state, most of the data in a partition is in fully sorted blocks, referred to as the baseline. As new data is inserted into a partition, BigQuery may either perform a local sort for the new data or defer such sorting until there is sufficient data to require a write. Once there is sufficient amount of data, the system generates locally sorted blocks, called deltas. After the deltas have accumulated enough data, comparable in size to the size of the current baseline, BigQuery merges the baseline and deltas to generate a new baseline. While regenerating baselines is I/O- and CPU-intensive, you won’t notice it one bit.

Automatic re-clustering automatically maintains clustered tables in BigQuery.png
Automatic re-clustering automatically maintains clustered tables in BigQuery

BigQuery attempts to merge deltas and baselines into a new baseline until the resulting baseline reaches 500GB. After this, as more deltas are created, they are merged into a new baseline without perturbing the previous baselines. This method avoids wasting time and resources rewriting baselines every time new data enters BigQuery.

Learn more about using BigQuery

In addition to automatic re-clustering being completely free and autonomous, you don’t pay for ingest into BigQuery, and neither workload consumes your query capacity. We hear from customers that these two workloads combined can consume up to 30% of the processing resources that could otherwise be dedicated to business-critical analytics and data processing. BigQuery’s approach means that these efficiencies directly translate to a more resilient data warehouse, providing faster queries and more savings.

For more practical examples of the cost and performance impact of clustering, head over to Felipe Hoffa’s blog post on clustering. As always, you can try BigQuery with our free perpetual tier of 1TB of data processed and 10GB of data stored per month.

What’s happening in BigQuery: New persistent user-defined functions, increased concurrency limits, GIS and encryption functions, and more

We’ve been busy this summer releasing new features for BigQuery, Google Cloud’s petabyte-scale data warehouse. BigQuery lets you ingest and analyze data quickly and with high availability, so you can find new insights, trends, and predictions to efficiently run your business. Recently added BigQuery features include new user-defined functions, faster reporting capabilities, increased concurrency limits, and new functions for encryption and GIS, all with the goal of helping you get more out of your data faster.  

Read on to learn more about these new capabilities and get quick demos and tutorial links so you can try these features yourself.

BigQuery persistent user-defined functions are now in beta

The new persistent user-defined functions (UDFs) in BigQuery let you create SQL and JavaScript functions that you can reuse across queries and share with others. Setting up these functions allows you to save time and automate for consistency and scalability. For example, if you have a custom function that handles date values a certain way, you can now create a shared UDF library, and anyone who has access to your dataset can invoke those date values in their queries. UDFs can be defined in SQL or JavaScript. Here’s an example:

Creating a function to parse JSON into a SQL STRUCT

Ingesting and transforming semi-structured data from JSON objects into your SQL tables is a common engineering task. With BigQuery UDFs, you can now create a persistent JavaScript UDF that does the parsing for you. Here, we’ll take a JSON string input and convert it into multiple fields in a SQL STRUCT.

First we’ll define the function in this query:
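A minimal sketch of such a persistent UDF; the dataset name and the STRUCT fields are hypothetical (note that JavaScript UDFs return FLOAT64 rather than INT64 for numbers):

    CREATE OR REPLACE FUNCTION mydataset.json_to_struct(json STRING)
    RETURNS STRUCT<name STRING, age FLOAT64>
    LANGUAGE js AS """
      // Parse the JSON string and pick out the fields we care about
      var row = JSON.parse(json);
      return {name: row.name, age: row.age};
    """;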

After executing the query, click the “Go to function” button in the BigQuery UI to see the function definition:

Go to function.png

You can now execute a separate query to call the UDF:
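And a separate query that calls it on a couple of sample strings:

    SELECT mydataset.json_to_struct(json_text) AS person
    FROM UNNEST([
      '{"name": "Ada", "age": 36}',
      '{"name": "Grace", "age": 45}'
    ]) AS json_text;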

And voila! Our JSON string is now a SQL STRUCT:

JSON string.png

Share your Persistent UDFs 

The benefit of persistent UDFs is that other project team members can now invoke your new function in their scripts without having to re-create it or import it first. Keep in mind that you will need to share the dataset that contains your UDFs in order for them to access it. 

Learn more:

Concurrent query limit has doubled

To help enterprises get insights faster, we’ve raised the concurrent rate limit for on-demand, interactive queries from 50 to 100 concurrent queries per project in BigQuery. This means you can run twice as many queries at the same time. As before, queries with results returned from the query cache, dry run queries, and queries run in batch mode do not count against this limit.

You can monitor your team’s concurrent queries in Stackdriver and visualize them in Data Studio.

Data Studio.png

Learn more:

BigQuery’s new user interface is now GA

We introduced the new BigQuery user interface (UI) last year to make it easier for you to uncover data insights and share them with teammates and colleagues in reports and charts. The BigQuery web UI is now generally available in the Google Cloud Platform (GCP) console. 

You can check out key features of the new UI in the quick demo below:

  • Easily discover data by searching across tables, datasets, and projects

  • Quickly preview table metadata (size, last updated) and total rows

  • Start writing queries faster by clicking on columns to add.

BigQuery Web UI - Quick Demo

If you haven’t seen the new UI yet, try it out by clicking the blue button in the top right of your Google Cloud console window.

Learn more:

BigQuery’s GIS functions are now GA

We’re continually working on adding new functionality to BigQuery so you can expand your data analysis to other data types. You might have noticed in the BigQuery Web UI demo that there’s now a field for hurricane latitude and longitude. These Geographic Information System (GIS) data types are now natively supported in BigQuery, as are the GIS functions to analyze, transform, and derive insights from GIS data. 

Here’s a look at using BigQuery GIS functions, following this tutorial, to plot the path of a hurricane:
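The gist of the tutorial's query, with column names and types assumed from the public NOAA hurricanes dataset:

    SELECT
      name,
      iso_time,
      ST_GEOGPOINT(longitude, latitude) AS position
    FROM `bigquery-public-data.noaa_hurricanes.hurricanes`
    WHERE name = 'MARIA' AND season = '2017'
    ORDER BY iso_time;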

BigQuery GIS functions.png

Applying GIS functions to geographic data (including lat/long, city, state, and zip code) lets analysts perform geographic operations within BigQuery. You can more easily answer common business questions like “Which store is closest for this customer?” “Will my package arrive on time?” or “Who should we mail a promotion coupon to?”

You can also now cluster your tables using geography data type columns. The order of the specified clustered columns determines the sort order of the data. For our hurricane example, we clustered on `iso_time` to increase performance for common reads that want to track the hurricane path sorted by time. 

Learn more:

AEAD encryption functions are now available in Standard SQL

BigQuery uses encryption at rest to help keep your data safe, and provides support for customer-managed encryption keys (CMEK), so you can encrypt tables with specific encryption keys you control. But in some cases, you may want to encrypt individual values within a table. AEAD (Authenticated Encryption with Associated Data) encryption functions, now available in BigQuery, allow you to create keysets that contain keys for encryption and decryption, use these keys to encrypt and decrypt individual values in a table, and rotate keys within a keyset.

This can be particularly useful for applications of crypto-deletion or crypto-shredding. For example, say you want to keep data for all your customers in a common table. By encrypting each of your customers’ data using a different key, you can easily render that data unreadable by simply deleting the encryption key. If you’re not familiar with the concept of crypto-shredding, you’ve probably already used it without realizing it—it’s a common practice for things like factory-resetting a device and securely wiping its data. Now you can do the same type of data wipe on your structured datasets in BigQuery. 
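
Here's a hedged sketch of the per-customer-key pattern; the table and column names are hypothetical, while KEYS.NEW_KEYSET, AEAD.ENCRYPT, and AEAD.DECRYPT_STRING are the new Standard SQL functions:

```sql
-- 1) One keyset per customer (hypothetical table names).
CREATE TABLE `my_dataset.customer_keys` AS
SELECT customer_id, KEYS.NEW_KEYSET('AEAD_AES_GCM_256') AS keyset
FROM `my_dataset.customers`;

-- 2) Encrypt each customer's rows with that customer's keyset.
CREATE TABLE `my_dataset.orders_encrypted` AS
SELECT
  o.customer_id,
  AEAD.ENCRYPT(k.keyset, o.order_details, CAST(o.customer_id AS STRING)) AS order_details_enc
FROM `my_dataset.orders` AS o
JOIN `my_dataset.customer_keys` AS k USING (customer_id);

-- 3) Decrypt on read; deleting a customer's row from customer_keys makes
--    their ciphertext permanently unreadable (crypto-deletion).
SELECT
  o.customer_id,
  AEAD.DECRYPT_STRING(k.keyset, o.order_details_enc, CAST(o.customer_id AS STRING)) AS order_details
FROM `my_dataset.orders_encrypted` AS o
JOIN `my_dataset.customer_keys` AS k USING (customer_id);
```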


Check out a few more updates worth sharing

Our Google Cloud engineering team is continually making improvements to BigQuery to accelerate time-to-value for our customers. Here are a few other recent highlights: 

  • You can now run scheduled queries at more frequent intervals: the minimum time interval for custom schedules has changed from three hours to 15 minutes. Faster schedules mean fresher data for your reporting needs.

  • The BigQuery Data Transfer Service now supports transferring data into BigQuery from Amazon S3. These Amazon S3 transfers are now in beta.

  • Creating a new dataset? Want to make it easy for all to use? Add descriptive column labels within SQL using DDL, as shown in the sketch after this list.

  • Clean up your old BigQuery ML models with new SQL DDL statement support for DROP MODEL.
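
A quick sketch of the last two items (dataset, table, and model names are placeholders):

```sql
-- Attach table labels and column descriptions at creation time with DDL.
CREATE TABLE `my_dataset.daily_sales` (
  sale_date DATE OPTIONS (description = 'Calendar date of the sale'),
  revenue NUMERIC OPTIONS (description = 'Gross revenue in USD')
)
OPTIONS (labels = [('team', 'analytics'), ('env', 'prod')]);

-- Clean up a BigQuery ML model you no longer need.
DROP MODEL IF EXISTS `my_dataset.old_churn_model`;
```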

In case you missed it

For more on all things BigQuery, check out our recent posts, videos, and how-tos.

To keep up with what’s new in BigQuery, subscribe to our release notes and stay tuned to the blog for news and announcements. And let us know how else we can help.

Predictive marketing analytics using BigQuery ML machine learning templates

Enterprises are collecting and generating more data than ever—to better understand their business landscape, their market, and their customers. As a result, data scientists and analysts increasingly need to build robust machine learning models that can forecast business trajectories and help leaders plan for the future. However, current machine learning tools make it difficult to quickly and easily create ML models, delaying time to insights.

To address these challenges, we announced BigQuery ML, a capability inside BigQuery that allows data scientists and analysts to build and operationalize machine learning models in minutes on massive structured or semi-structured datasets. BigQuery ML, now generally available, democratizes predictive analytics so that users unfamiliar with programming languages like Python and Java can build machine learning models with basic SQL.

To make it even easier for anyone to get started with BigQuery ML, we have open-sourced a repository of SQL templates for common machine learning use cases. The first of these templates, tailored specifically for marketing, were built in collaboration with SpringML, a premier Google Cloud Platform partner that helps customers successfully deploy BigQuery and BigQuery ML. Each template is tutorial-like in nature and includes a sample Google Analytics 360 and CRM dataset, along with SQL code for each step of machine learning modeling: data aggregation and transformation (for feature and label creation), model creation, and surfacing predictions from the model on a dashboard. Here’s more on the three templates:

Customer segmentation—By dividing a customer base into groups of individuals that are similar in specific ways, marketers can custom-tailor their content and media to unique audiences. With this template, users can implement a BigQuery ML k-means clustering model to build customer segmentations.
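
A minimal sketch of what the segmentation template boils down to (table, feature, and model names are placeholders; the template itself includes the full feature engineering and dashboarding steps):

```sql
-- Train a k-means model on a few behavioral features, then assign each
-- customer to a segment with ML.PREDICT.
CREATE OR REPLACE MODEL `my_dataset.customer_segments`
OPTIONS (model_type = 'kmeans', num_clusters = 4) AS
SELECT total_visits, total_revenue, days_since_last_purchase
FROM `my_dataset.customer_features`;

SELECT customer_id, centroid_id AS segment
FROM ML.PREDICT(
  MODEL `my_dataset.customer_segments`,
  (SELECT customer_id, total_visits, total_revenue, days_since_last_purchase
   FROM `my_dataset.customer_features`));
```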

Customer Lifetime Value (LTV) prediction—Many organizations need to identify and prioritize customer segments that are most valuable to the company. To do this, LTV can be an important metric that measures the total revenue reasonably expected from a customer. This template implements a BigQuery ML multiclass logistic regression model to predict the LTV of a customer to be high, medium, or low.
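
A sketch of the corresponding model (placeholder names); with a three-value label such as 'high', 'medium', or 'low', logistic_reg trains a multiclass classifier:

```sql
-- Predict an LTV bucket; the three-valued label makes this multiclass.
CREATE OR REPLACE MODEL `my_dataset.ltv_model`
OPTIONS (model_type = 'logistic_reg', input_label_cols = ['ltv_bucket']) AS
SELECT total_visits, avg_order_value, tenure_days, ltv_bucket
FROM `my_dataset.ltv_training_data`;
```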

Conversion or purchase prediction—Many marketing use cases benefit from predicting the likelihood that a user will convert or make a purchase: for example, ads retargeting, where the advertiser can bid higher for website visitors with higher purchase intent, or email campaigns, where emails are sent to the subset of customers most likely to click on content or purchase. This template implements a BigQuery ML binary logistic regression model to build conversion or purchase predictions.
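
And a sketch of the conversion template's core (placeholder names): train a binary classifier on a 0/1 converted label, then score current visitors with ML.PREDICT.

```sql
-- Train the binary logistic regression model.
CREATE OR REPLACE MODEL `my_dataset.conversion_model`
OPTIONS (model_type = 'logistic_reg', input_label_cols = ['converted']) AS
SELECT pageviews, time_on_site, device_category, converted
FROM `my_dataset.visitor_training_data`;

-- Score visitors; ML.PREDICT also returns class probabilities in
-- predicted_converted_probs.
SELECT visitor_id, predicted_converted
FROM ML.PREDICT(
  MODEL `my_dataset.conversion_model`,
  (SELECT visitor_id, pageviews, time_on_site, device_category
   FROM `my_dataset.current_visitors`));
```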

To start using these open-source SQL templates and more, visit our repository—the code is licensed under Apache v2. We will also be adding templates for more use cases in the future. And to learn more about applying BigQuery ML for marketing analytics, watch this Google Cloud OnAir webinar.

Migrating your traditional data warehouse platform to BigQuery: announcing the data warehouse migration offer

Today, we’re announcing a data warehousing migration offer for qualifying customers, one that makes it easier for them to move from traditional data warehouses such as Teradata and Netezza to BigQuery, our serverless enterprise data warehouse.

For decades, enterprises have relied on traditional on-premises data warehouses to collect and store their most valuable data. But these traditional data warehouses can be costly, inflexible, and difficult to maintain, and for many, they no longer meet today’s business needs. Enterprises need an easy, scalable way to store all that data, as well as to take advantage of advanced analytic tools that can help them find valuable insights. As a result, many are turning to cloud data warehousing solutions like BigQuery.

BigQuery is Google Cloud’s serverless, highly scalable, low-cost enterprise data warehouse designed to make all data analysts productive. There’s no infrastructure to manage, so you can focus on finding meaningful insights using familiar Standard SQL. Leading global enterprises like 20th Century Fox, Domino’s Pizza, Heathrow Airport, HSBC, Lloyds Bank UK, The New York Times, and many others rely on BigQuery for their data analysis needs, helping them do everything from break down data silos to jump-start their predictive analytics journey—all while greatly reducing costs.

Here’s a little more on the benefits of BigQuery in contrast to traditional on-premises data warehouses.

[Table: benefits of BigQuery compared to traditional on-premises data warehouses]

Recently, independent analyst firm Enterprise Strategy Group (ESG) released a report examining the economic advantages of migrating enterprise data warehouse workloads to BigQuery. They developed a three-year total-cost-of-ownership (TCO) model that compared the expected costs and benefits of upgrading an on-premises data warehouse, migrating to a cloud-based solution provided by the same on-premises vendor, or redesigning and migrating data warehouse workloads to BigQuery. ESG found that an organization could potentially reduce its overall three-year costs by 52 percent versus the on-premises equivalent, and by 41 percent when compared to an AWS deployment.

[Chart: three-year expected total cost of ownership comparison from the ESG report]

You can read more about these TCO analyses in ESG’s blog post.

How to begin your journey to a modern data warehouse

While many businesses understand the value of modernizing, not all know where to start. A typical data warehouse migration requires three distinct steps:

  1. Data migration: the transfer of the actual data contents from the source data warehouse to the destination system.
  2. Schema migration: the transfer of metadata definitions and topologies.
  3. Workload migration: the transfer of workloads, including ETL pipelines, processing jobs, stored procedures, reports, and dashboards.

Today, we’re also pleased to announce the launch of BigQuery’s data warehouse migration utility. Building on our existing migration experience, we built this service to automate the migration of data and schemas to BigQuery and significantly reduce migration time.

How to get started with our data warehousing migration offer

Our data warehousing migration offer and tooling equip you with architecture and design guidance from Google Cloud engineers, proof-of-concept funding, free training, and usage credits to help speed up your modernization process.

Here’s how it works: 

Step 1: Planning consultation

You’ll receive expert advice, examples, and proof-of-concept funding support from Google Cloud, and you’ll work with our professional services or a specialized data analytics partner on your proof of concept.

Step 2: Complimentary training

You’ll get free training from Qwiklabs, Coursera, or Google Cloud-hosted classroom courses to deepen your understanding of BigQuery and related GCP services. 

Step 3: Expert design guidance

Google Cloud engineers will provide you with architecture design guidance through personalized deep-dive workshops at no additional cost.

Step 4: Migration support

Google Cloud’s professional services organization—along with our partners—has helped enterprises all over the world migrate their traditional data warehouses to BigQuery. And as part of this offer, qualified customers may also be eligible to receive partner funding support to offset migration and BigQuery implementation costs.

Interested in learning more? Contact us.