Introducing BigQuery Flex Slots for unparalleled flexibility and control

Organizations of all sizes look to BigQuery to meet their growing analytics needs. We hear that customers value BigQuery’s radically innovative architecture, serverless delivery model, and integrated advanced capabilities in machine learning, real-time analytics, and business intelligence. To help you balance explosive demand for analytics with the need for predictable spend, central control, and powerful workload management, we recently launched BigQuery Reservations.

Today we are introducing Flex Slots, a new way to purchase BigQuery slots for short durations, as little as 60 seconds at a time. A slot is the unit of BigQuery analytics capacity. Flex Slots let you quickly respond to rapid demand for analytics and prepare for business events such as retail holidays and app launches. Flex Slots are rolling out to all BigQuery Reservations customers in the coming days!


Flex Slots give BigQuery Reservations users immense flexibility without sacrificing cost predictability or control.

  • Flex Slots are priced at $30 per slot per month, and are available in increments of 500 slots.
  • It only takes seconds to deploy Flex Slots in BigQuery Reservations. 
  • You can cancel after just 60 seconds, and you will only be billed for the seconds Flex Slots are deployed.

Benefits of Flex Slots

You can seamlessly combine Flex Slots with existing annual and monthly commitments to supplement steady-state workloads with bursty analytics capability. You may find Flex Slots especially helpful for short-term uses, including:

  • Planning for major calendar events, such as the tax season, Black Friday, popular media events, and video game launches. 
  • Meeting cyclical periods of high demand for analytics, like Monday mornings.
  • Completing your data warehouse evaluations and dialing in the optimal number of slots to use.

Major calendar events. For many businesses, specific days or weeks of the year are crucial. Retailers care about Black Friday and Cyber Monday, gaming studios focus on the first few days of launching new titles, and financial services companies worry about quarterly reporting and tax season. Flex Slots enable such organizations to scale up their analytics capacity for the few days necessary to sustain the business event, and scale down thereafter, only paying for what they consumed.

Payment technology provider Global Payments plans to add even more flexibility to their usage with this feature. “BigQuery has been a steady engine driving our Merchant Portal Platform and analytics use cases. As a complex multinational organization, we were anxious to leverage BigQuery Reservations to manage BigQuery cost and resources. We had been able to manage our resources effectively in most areas but were missing a few,” says Mark Kubik, VP BI, data and analytics, application delivery at Global Payments. “With Flex Slots, we can now better plan for automated test suites, load testing, and seasonal events and respond to rapid growth in our business. We are eager to implement this new feature in our workloads to drive efficiency, customer experience, and improved testing.”

Cyclical demand. If the majority of your users log into company systems at nine every Monday morning to check their business dashboards, you may spin up Flex Slots to rapidly respond to increased demand on your data warehouse. This is something that the team at Forbes has found helpful. 

“Moving to BigQuery Reservations enabled us to self-manage our BigQuery costs,” says David Johnson, vice president, business intelligence, Forbes. “Flex Slots will give us an additional layer of flexibility—we can now bring up slots whenever we have a large processing job to complete, and only pay for the few minutes they were needed.”

Evaluations. Whether you’re deciding on BigQuery as your cloud data warehouse or trying to understand the right number of BigQuery slots to purchase, Flex Slots provide the flexibility to quickly experiment with your environment.


The BigQuery advantage

Flex Slots are especially powerful considering BigQuery’s unique architecture and true separation of storage and compute. Because BigQuery is serverless, provisioning Flex Slots doesn’t require instantiating virtual machines. It’s a simple back-end configuration change, so acquiring Flex Slots happens very quickly. And because BigQuery doesn’t rely on local disk for performance, there is no warm-up period with poor and unpredictable performance. Flex Slots perform optimally from the moment they’re provisioned. 

Flex Slots are an essential part of our BigQuery Reservations platform. BigQuery Reservations gives intelligence-hungry enterprises the control necessary to enable their organizations with a powerful tool like BigQuery while minimizing fiscal and security risks:

  • With Reservations, administrators can centrally decide who in their organization can make purchasing decisions, neutralizing the fear of shadow IT.  

  • Users can manage and predict their organizations’ BigQuery spend and adherence to fixed budgets.

  • Administrators can optionally manage how their departments, teams, and workloads get access to BigQuery in order to meet their specific analytics needs. 

  • Flex Slots offer BigQuery users an unparalleled level of flexibility—purchase slots for short bursts to complement your steady-state workloads. 

Getting started with Flex Slots

Flex Slots are rolling out as we speak, and will be available in the coming days in the BigQuery Reservations UI.

You can purchase Flex Slots alongside monthly and annual commitment types, with the added benefit of being able to cancel them at any time after the first 60 seconds. To get started right away, try the BigQuery sandbox. If you are thinking about migrating to BigQuery from other data warehouses, check out our data warehouse migration offer.


BigQuery leads the way toward modern data analytics

At Google, we think you should have the right tools and support to let you embrace data growth. Enterprises and digital-native organizations are generating incredible value from their data using Google Cloud’s smart analytics platform. At the heart of the platform is BigQuery, a cloud-native enterprise data warehouse. BigQuery helps organizations develop and operationalize massively scalable, data-driven intelligent applications for digital transformation.   

Enterprises are modernizing with BigQuery to unlock blazing-fast business insights 

Businesses are using BigQuery to run mission-critical applications at scale in order to optimize operations, improve customer experiences, and lower total cost of ownership (TCO). We have customers running queries on massive datasets, as large as 100 trillion rows, and others running more than 10,000 concurrent queries across their organization. We’re seeing adoption across regions and industry verticals including retail, telecommunications, financial services, and more.

Wayfair is one example of a retailer that was looking to scale its growing $8 billion global business while providing a richer experience for its 19 million active customers, 6,000 employees, and 11,000 suppliers. By moving to BigQuery, Wayfair can now make real-time decisions, from merchandising and personalized customer experiences to marketing and promotional campaigns. Wayfair’s data-driven approach provides the company with valuable and actionable insights across every part of the business. And they’re now able to seamlessly fulfill millions of transactions during peak shopping seasons. 

Financial services company KeyBank is migrating to BigQuery for scalability and reduced costs compared to their on-prem data warehouse. “We are modernizing our data analytics strategy by migrating from an on-premises data warehouse to Google’s cloud-native data warehouse, BigQuery,” says Michael Onders, chief data officer at KeyBank. “This transformation will help us scale our compute and storage needs seamlessly and lower our overall total cost of ownership. Google Cloud’s smart analytics platform will give us access to a broad ecosystem of data transformation tools and advanced machine learning tools so that we can easily generate predictive insights and unlock new findings from our data.” 

Other customers finding success with our smart analytics tools are Lowe’s, Sabre, and Lufthansa. They’re all modernizing their data analytics strategies and transforming their businesses to remain competitive in a changing data landscape.

Product innovation is simplifying migrations and improving price predictability

We are continuing to make it easy to modernize your data warehouse with BigQuery. The new product capabilities we’re announcing are helping customers democratize advanced analytics, be assured of price predictability, and simplify migrations at scale.

  • Simplifying migrations at scale: We’re helping customers fast-track data warehouse migrations to BigQuery with the general availability of Redshift and S3 migration tools. Customers can now seamlessly migrate from Amazon Redshift and Amazon S3 right into BigQuery with BigQuery Data Transfer Service. Customers such as John Lewis Partnership, Home Depot, Reddit, and Discord have all accelerated business insights with BigQuery by freeing themselves of the performance and analytics limitations of their Teradata and Redshift environments. “Migrating from Redshift to BigQuery has been game-changing for our organization,” says Spencer Aiello, tech lead and manager, machine learning at Discord. “We’ve been able to overcome performance bottlenecks and capacity constraints as well as fearlessly unlock actionable insights for our business.” 

  • Offering enterprise readiness and price predictability: Enterprise customers need price predictability to do accurate forecasting and planning. We recently launched Reservations for workload management, and today, we’re pre-announcing beta availability of BigQuery Flex Slots, which enable customers to instantly scale up and down their BigQuery data warehouse to meet analytics demands without sacrificing price predictability. With Flex Slots, you can now purchase BigQuery commitments for short durations—as little as seconds at a time. This lets organizations instantly respond to rapid demand for analytics and plan for major business events, such as retail holidays and game launches. Learn more about Flex Slots here. We’re also pre-announcing the beta availability of column-level access controls in BigQuery. With BigQuery column-level security, you can now have access policies applied not just at the data container level, but also to the meaning and/or content of the data in your columns across your enterprise data warehouse. Finally, we now support unlimited DML/DDL statements on a table in BigQuery—find more details here.

  • Democratizing advanced analytics: We’re making advanced analytics even more accessible to users across an organization. We’re excited to announce that BigQuery BI Engine is becoming generally available. Customers can analyze large and complex datasets interactively with sub-second query response time and high concurrency for interactive dashboarding and reporting. One Fortune 500 global media outlet using BI Engine summarized it well: “To deliver timely insights to our editors, journalists and management, it’s important we answer questions quickly. Once we started using BigQuery BI Engine, we saw an immediate performance boost with our existing Data Studio dashboards—everyone’s drilling down, filtering, and exploring data at so much faster a pace.” Learn more here.

All these product innovations and more are helping customers jump-start their digital transformation journeys with ease. 

Our cohesive partner ecosystem creates a strong foundation

We’re making deep investments in our partner ecosystem and working with global and regional system integrators (GSIs) and other tech partners to simplify migrations across the planning phase, offer expertise, and make go-to-market delivery easier. GSI partners such as Wipro, Infosys, Accenture, Deloitte, Capgemini, Cognizant, and more have dedicated centers of excellence and Google Cloud partner teams. These teams are committed to defining and executing on a joint business plan, and have built end-to-end migration programs, accelerators, and services that are streamlining the modernization path to BigQuery. The Accenture Data Studio, Infosys Migration Workbench (MWB), and Wipro’s GCP Data and Insights Migration Studio are all examples of partner solutions that can help modernize your analytics landscape by supporting migrations at scale.

Partners are essential for many cloud migration journeys. “Enterprises today are seeking to be data-driven as they navigate their digital journey,” says Satish H.C., EVP, data analytics at Infosys. “For our clients, we enable this transformation with our solutions like Digital Brain, Information Grid, Data Marketplace and Next Gen Analytics platform, powered by Google-native technologies like BigQuery, BigQuery ML, AI Platform and Cloud Functions.”

“We are excited to be partnering with Google Cloud to help streamline data warehouse migrations to BigQuery so that organizations can unlock the full potential of their data,” says Sriram Anand, managing director and North America lead for data engineering at Accenture. “As our clients are managing increasingly fast-changing business needs, they are looking for ways to scale up to petabytes of data on-demand without performance disruptions and run blazing-fast queries to drive business innovation.”

Tech partners are also core to our data warehouse modernization solution. With Informatica, customers can easily and securely migrate data and its schema from their on-prem applications and systems into BigQuery. Datometry and CompilerWorks both help customers migrate workloads without having to rewrite queries. Datometry eliminates the need to rewrite queries and instead converts the incoming request into target dialect on the fly, while CompilerWorks converts queries’ source dialect SQL into target dialect SQL. Along with their core offerings, these tech partners have also developed additional migration accelerators.

We’re also happy to share that SADA, a leading global business and technology consultancy and managed service provider, just announced a multi-year agreement with Google Cloud. They will be introducing new solutions to help organizations modernize data analytics and data warehousing with Google Cloud, including support for Netezza, Teradata, and Hadoop migrations to BigQuery. These solutions offer a shorter time to value on new releases, expedite decision making with data-driven insights, and allow customers to focus more on innovation. Learn more here.

Making the BigQuery move

We’re seeing this momentum across our smart analytics portfolio as industry analysts such as Gartner and Forrester have recognized Google Cloud as a Leader in five new analyst reports over the past year, including the new Forrester Wave™: Data Management for Analytics, Q1 2020.

These launches, updates, and new migration options are all designed to help businesses digitally transform their operations. Try the BigQuery sandbox to get started with BigQuery right away. Jumpstart your modernization journey with the data warehouse migration offer, and get expert design guidance and tools, partner solutions, and funding support to expedite your cloud migration.

Taking a practical approach to BigQuery slot usage analysis

Google BigQuery is a fully managed serverless solution for your enterprise data warehouse workloads. Nothing could be easier than that: Just upload your data via batch or streaming and start running your queries. The underlying system will seamlessly take care of providing the infrastructural resources needed to complete your different jobs. 

It seems like magic, right? Especially when you think about the fact that, behind the scenes, there is a large-scale distributed system with a multitude of parallel tasks executed by dozens of microservices spread across several availability zones within your selected Google Cloud region (find more details about BigQuery technology).  

But what if you want more visibility into the power under the hood? In this blog post, we’ll dive into all the currently available options to monitor and analyze resource usage. We’ll also describe the newly available extension to INFORMATION_SCHEMA views, now in beta, that offers you practical access to the underlying data. Let’s start by exploring the computational capacity unit we use to understand the load of the system.

BigQuery’s computational capacity unit: the slot

Every time you run a query, several back-end tasks need to be completed (such as reading data from tables, pruning data that isn’t needed, performing any aggregations, and so on). Each task, executed on an ad-hoc microservice, requires an adequate amount of computational power to be fulfilled. The slot is the unit of computational capacity used to measure that power. The BigQuery engine dynamically identifies the number of slots needed to perform a single query, and background processes transparently allocate the computational power needed to accomplish the task.

So it’s essential to understand how to monitor and analyze slot usage: it lets your technical team spot bottlenecks, and it helps the business choose the best pricing model (on-demand vs. flat-rate).

We’ll now analyze three different strategies to gain better visibility of slot usage and see how you can get started using them:

  • Slot usage analysis with system tables

  • Real-time usage monitoring with Stackdriver

  • Slot usage analysis with BigQuery audit logs 

Slot usage analysis with system tables

We’re announcing an extended version of INFORMATION_SCHEMA views that contain real-time information about BigQuery jobs. 

This is part of our internal series of views, called INFORMATION_SCHEMA, that lets you extract useful information related to datasets, routines, tables and views. With a simple query, you can access a series of metadata that facilitate the analysis of the current data warehouse definition (i.e., tables list, descriptions of the fields, etc).

This new extended version makes it easier to analyze slot usage (and other resources as well) because all the jobs information needed is just a query away.

Here’s an example. Let’s assume that you want to understand last month’s daily consumption of slots, split by user (along with other information), within a single project. The query you have to write is really simple:
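Here’s a minimal sketch of what that query could look like (the `region-us` qualifier, project, and grouping are illustrative; point it at the region and project your jobs actually run in):

-- Average daily slot usage per user over the last 30 days (illustrative sketch).
SELECT
  DATE(creation_time) AS job_date,
  user_email,
  job_type,
  COUNT(*) AS job_count,
  -- Slot-milliseconds divided by the milliseconds in a day = average slots used that day.
  SUM(total_slot_ms) / (1000 * 60 * 60 * 24) AS avg_daily_slot_usage
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
  AND total_slot_ms IS NOT NULL
GROUP BY job_date, user_email, job_type
ORDER BY job_date DESC, avg_daily_slot_usage DESC;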

The pivotal element here is the total_slot_ms field, which contains the total slot-milliseconds consumed by a query—that is, the number of slots it used, summed over every millisecond of its execution. If you want to compute the average slot usage of the query, divide that value by the query’s duration in milliseconds, which you get by subtracting creation_time from end_time. For example, if a 10-second query used a total_slot_ms of 20,000, its average slot usage is 2: 20,000 / (10 * 1,000).

If you continue digging into the column definitions of the views, you will find a lot of information that will enable you to implement different kinds of analysis. You can, for example, easily compute the most expensive queries within your organization, find users who are heavily issuing queries, understand what the most expensive projects are, and also perform deep analysis on single queries by studying their query stages. Since details are available within one second of the job completion, you can use that information to implement your own triggers (e.g., once a load job has been successfully completed, launch the query to clean and transform the data for production usage).

Note that the data is typically available within seconds, and jobs data is retained for a period of 180 days. If you want to keep a backup of the data for historical reasons or to perform analysis later, use scheduled queries to automate exporting the data into an ad-hoc (partitioned) table, as sketched below. And keep in mind that real-time slot usage can fluctuate over the runtime of a query. To get deeper into the details, try this open-source solution to visualize slot consumption.
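As a sketch of that export, a scheduled query along these lines could snapshot each day’s job metadata into your own table (my_dataset.jobs_history is a hypothetical, pre-created destination table partitioned on job_date; adjust the region qualifier and columns to your needs):

-- Illustrative scheduled query: copy yesterday's job metadata into a user-owned table.
INSERT INTO my_dataset.jobs_history (job_date, job_id, user_email, job_type, state, total_slot_ms)
SELECT
  DATE(creation_time) AS job_date,
  job_id,
  user_email,
  job_type,
  state,
  total_slot_ms
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE DATE(creation_time) = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY);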

Real-time slot usage monitoring with Stackdriver

If instead you want to monitor your project’s slot usage in real time, Stackdriver is the place to go. It lets you: 

  • Have an immediate overview of the project slot usage, thanks to the native slot utilization chart

  • Create ad-hoc charts using several available metrics (e.g., slots allocated, or slots allocated per job type)

  • Create ad-hoc alerts to receive notifications when a certain event occurs (e.g., when the number of available slots is below a certain threshold for more than five minutes)

Check out this guide on implementing monitoring for BigQuery: https://cloud.google.com/bigquery/docs/monitoring.

Slot usage analysis with BigQuery audit logs

Another possible option to get access to the underlying information is exploring BigQuery audit logs. These logs contain a multitude of information, including slot usage. 

You have two ways to query the data:

  • Stackdriver Logging, if you want to perform a quick search of a precise value of interest 

  • BigQuery, if you want to perform a more in-depth analysis

Stackdriver Logging

From the Stackdriver Logging page:

  • Select BigQuery as the resource

  • Select the desired time frame (such as last hour or no limit) 

  • Filter for all the INFO entries 

  • Search for the entries containing the getQueryResult method name


Once you’ve found the query you were looking for, expand its payload and look for the entry of interest. For example, protoPayload.serviceData.jobGetQueryResultResponse.jobStatistics.totalSlotMs represents the total slot-milliseconds, as described earlier.

If you want to perform more in-depth analysis, you can use BigQuery. First, create a sink to transfer the data from Stackdriver Logging to BigQuery. Then you can run more complex queries to analyze the slot usage of your project, much like the INFORMATION_SCHEMA query shown earlier.
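Here’s a rough sketch of such a query. The dataset and wildcard table name are placeholders for whatever your sink creates, and the nested column path (the audit log’s job statistics) may differ depending on the log version you export, so check the schema of your own tables:

-- Illustrative: total slot-milliseconds per user from exported BigQuery audit logs.
SELECT
  protopayload_auditlog.authenticationInfo.principalEmail AS user_email,
  SUM(protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobStatistics.totalSlotMs) AS total_slot_ms
FROM `my_project.my_audit_logs.cloudaudit_googleapis_com_data_access_*`  -- placeholder table name
GROUP BY user_email
ORDER BY total_slot_ms DESC;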

We’re excited to see how you’ll use these tools to build your customized solution. Learn more here about organizing slots. And your next step might be to use Data Studio to generate custom dashboards to be shared within your organization.

Google completes Looker acquisition

I’m pleased to announce that Google has completed its acquisition of Looker. Together, we’re excited to offer customers a comprehensive analytics solution that integrates and visualizes insights at every layer of their business.  

Today, hundreds of joint customers already rely on Google Cloud and Looker to make smarter business decisions. And by joining together, we believe we will be uniquely positioned to address the data analytics and business intelligence demands of even more enterprises globally, across all industries. Looker will strengthen Google Cloud’s analytics and data warehouse capabilities, including BigQuery, enabling our customers to address some of their toughest business challenges, faster—all while maintaining complete control of their data.

Google Cloud and Looker share a common philosophy around delivering open solutions and supporting customers wherever they are—be it on Google Cloud, in other public clouds, or on premises. As more organizations adopt a multi-cloud strategy, Looker customers and partners can expect continued support of all cloud data management systems like Amazon Redshift, Azure SQL, Snowflake, Oracle, Microsoft SQL Server and Teradata.

We’re thrilled to officially welcome the Looker team to Google Cloud. For more on what this means for Looker and its community of customers and partners, please read the blog from Looker CEO, Frank Bien.

Comments on the news:

  • “Companies across the globe are migrating to the cloud and adopting a multi-cloud strategy. For leading technology providers, it’s critical that we put customers first to help them achieve their digital transformation goals,” says Frank Slootman, CEO of Snowflake. “The combination of Google Cloud and Looker and their commitment to multi-cloud support will help deliver more value to customers faster. We look forward to our continued partnership with both companies.”
  • “There is no slowing in the data analytics and business intelligence market, in fact, it’s only growing as more and more data is created and enterprises across all industries look to use data to be better informed, make smarter decisions and digitally transform their organizations,” said Howard Dresner, Chief Research Officer, Dresner Advisory Services. “Our research derived directly from customer feedback shows Google Cloud as a Leader in Analytical Data Infrastructure, and Looker as a Leader in enterprise business intelligence. We see synergy and nominal overlap between their solutions, and believe this acquisition should strengthen their ability to serve enterprise customers together.”

Google Cloud named a Leader in the Forrester Wave: Data Management for Analytics

We’re pleased to announce that Forrester has named Google Cloud a Leader in its report, The Forrester Wave™: Data Management for Analytics (DMA) Q1, 2020. This is the first DMA Wave Forrester is publishing and it assesses Google Cloud’s broader data analytics portfolio, with BigQuery at the heart. 

In this report, Google Cloud received the highest score possible in the categories of roadmap, performance, high availability, scalability, data ingestion, data storage, data security, and customer use cases. We believe this evaluation validates Google Cloud’s data analytics strategy and echoes the feedback we’re hearing from customers, who are now able to process huge amounts of data quickly and efficiently.

Enterprise customers are betting big on BigQuery

The need to digitally transform their businesses means that organizations are looking for solutions that can seamlessly scale with their growing data needs. 

Providing simplicity and speed

We build our data analytics products to be easy to use, and tightly integrated with the rest of Google Cloud’s ecosystem. BigQuery is built for speed, letting you analyze gigabytes to petabytes of data using ANSI SQL with no operational overhead. BigQuery’s serverless model means that resource provisioning happens behind the scenes, so you can focus on analytics instead of managing infrastructure. Customers can now run analytics at scale with a 26% to 34% lower three-year TCO than cloud data warehouse alternatives.

S4 Agtech, which offers risk management solutions for agriculture companies, migrated to Google Cloud to save money and time, as well as to scale databases and machine learning models faster. The company uses BigQuery as its data warehouse, and has reduced storage and processing costs by 80% while delivering analytics results to customers 25% faster. S4 has also gained more time for innovation in helping its customers de-risk crop production, including updating and improving algorithms.

And our broader data analytics platform extends to streaming analytics, with options for both batch and streaming data. Customers can query data in real time  and know what is happening right now in order to drive fast and informed business decisions. 

Ensuring security and trust

With a secure, trusted data analytics platform, it becomes easier to let users access the data they need to run their businesses. BigQuery provides strong security and governance controls with Cloud Data Loss Prevention (DLP) and Cloud Identity and Access Management (IAM). Data is encrypted at rest and in transit by default, ensuring that customer data is protected.

We are honored to be a leader in this Forrester Wave™, and look forward to continuing to innovate and partner with you on your digital transformation journey. 

Download the full Forrester Wave™: Data Management for Analytics Q1, 2020 report.

And check out how UPS uses our smart analytics platform. To learn more about BigQuery, visit our website, and get started immediately with the free BigQuery Sandbox.

BigQuery under the hood: How zone assignments work

BigQuery is a serverless, highly scalable, and cost-effective cloud data warehouse. It’s designed to be flexible and easy to use. There are lots of interesting features and design decisions made when creating BigQuery, and we’ll dive into how zone assignments work in this post. Like other Google Cloud services, BigQuery takes advantage of our global cloud regions to make sure your data is available when you need it.

One key aspect of BigQuery’s architecture is that it is multi-tenant; it runs workloads from different customers on compute and storage infrastructure that does not require any customer involvement in capacity planning. When loading data into BigQuery, customers choose a region to load the data and, optionally, purchase a compute reservation. Then, the service takes care of provisioning.

Zone assignments keep your data flowing

Each BigQuery region is internally deployed in multiple availability zones. Customer data is replicated between these zones, and there is fast automatic failover to the secondary zone if the primary zone is experiencing issues. The failover is designed to be transparent to customers and have no downtime. We’re always expanding our capacity footprint to support customer growth and onboard new customers. To make sure that each customer gets to think of storage as infinite, and gets sufficient compute resources to load and analyze their data, we continuously recompute the best placement for the primary and secondary zones in the region.

To ensure the best primary and secondary zone assignments, the assignment algorithm takes into account the storage and compute usage of each customer and the available capacity in each zone. Then it makes sure that the usage will fit in the currently assigned zones. If it doesn’t, it finds another suitable zone for that customer and orchestrates a move from their current zone to a new zone. All of this happens in the background without causing any disruptions to your workload.

Any datasets that share the same region can be joined together in a single query. To ensure good query performance, we attempt to colocate compute and storage so that I/O is within the same zone in order to take advantage of high-throughput networking within the zone. I/O bandwidth within a zone is very high (Google’s Jupiter network fabric can sustain more than 1 petabit/second of total bisection bandwidth), but network capacity between zones is much more constrained. Our assignment algorithm makes sure that Google Cloud projects within the same Google Cloud organization are assigned to the same subset of zones in every region. To support very large orgs, we compute project cliques based on cross-project query patterns within the organization. That breaks it up into more manageable chunks that can be placed separately. To handle cross-org reads, the algorithm also looks into past query patterns to discover relationships between organizations and make an effort to have at least one common zone between orgs that are related. The query engine also allows reading small amounts of data not being colocated by either reading remotely or copying some data to the compute zone before running the query. In rare cases, when the algorithm cannot ensure this or there is a new cross-org query pattern, queries that read large amounts of data may fail.

Best practices for organizing and moving your data

  • To get the best performance for workloads that read/write data in datasets belonging to different projects, ensure that the projects are in the same Google Cloud org.

  • If you want to make your data available to other BigQuery users in your Google Cloud organization, you can use IAM permissions to grant access. 

  • If you wish to share data with BigQuery users outside your organization, use Table Copy to move data to the target project. From there, they can do any subsequent analysis in that project. Table Copy is supported via asynchronous replication and can support cross-zone data moves. 

  • You can move data across regions using the dataset copy feature.

Learn more about regions and zones in Google Cloud services, and find more details on how Compute Engine handles zones.

Now updated: Our Data Engineering Learning Path

With the market for artificial intelligence and machine learning-powered solutions projected to grow to $1.2B by 2023, it’s important to consider business needs now and in the future. We’ve heard from our customers and have witnessed internally that the data engineering role has evolved and now requires a larger set of skills. In the past, data engineers worked with distributed systems and Java programming to use Hadoop MapReduce in the data center, but now they need to leverage AI, machine learning, and business intelligence skills to efficiently manage and analyze data. To address the new skills data engineers now need, we updated our Data Engineering on Google Cloud learning path.

We’ve added new course content to this learning path like introductions to Data Fusion and Cloud Composer. We also added more labs on advanced BigQuery, BigQuery ML, and Bigtable streaming to help you get more hands-on practice.

This learning path covers the primary responsibilities of data engineers and consists of five courses: 

  • Google Cloud Big Data and Machine Learning Fundamentals – Start off by learning the important GCP big data and machine learning concepts and terminologies. 

  • Modernizing Data Lakes and Data Warehouses with Google Cloud – Understand the responsibilities of data engineers, the business need for effective data pipelines, and the benefits of data engineering in the cloud. This course will also dig deeper into the use cases and available GCP solutions for data lakes and warehouses, the key components of data pipelines. 

  • Building Batch Data Pipelines on Google Cloud – Discover which paradigm to use for different batch data as this course walks you through the main data pipeline paradigms: extract-load, extract-load-transform, or extract-transform-load. You’ll also learn more about data transformation technologies, such as how to use BigQuery, execute Spark on Dataproc, build pipeline graphs in Data Fusion, and do serverless data processing with Dataflow. 

  • Building Resilient Streaming Analytics Systems on Google Cloud – Learn how to build streaming data pipelines on Google Cloud, apply aggregations and transformations to streaming data using Dataflow, and store processed records to BigQuery or Bigtable for analysis in order to get real-time metrics on business operations. 

  • Smart Analytics, Machine Learning, and AI on Google Cloud – Extract more insights from your data by learning how to customize machine learning in data pipelines on Google Cloud. You will learn how to use AutoML when you need little to no customization, and how to use AI Platform Notebooks and BigQuery ML for more tailored machine learning capabilities. You will also learn how to productionize machine learning solutions using Kubeflow Pipelines. 

Want to learn more? Join us for a special webinar Data Engineering, Big Data, and Machine Learning 2.0, on Feb 21 at 9:00 AM PST with Lak Lakshmanan, Head of Google Cloud Data Analytics and AI Solutions. We will go over what this learning path has to offer, demonstrate hands-on labs, and answer any questions you have. Also, just for attending the webinar, we will give you special discounts on training. Register today!

Bring 20/20 vision to your pipelines with enhanced monitoring

Stream analytics is bringing data to life in a way that was previously unimaginable, unlocking new use cases, from connected medical devices in healthcare to predictive maintenance on the factory floor. But with new uses comes new challenges that, if left unaddressed, can lead to unintended behaviors for end-user applications. 

Before the days of modern stream analytics, you could guarantee the reliability of your batch data processing by re-executing your data workflows. Plus, since batch processing latency was a lesser concern, ensuring that your data was delivered within your SLOs was a manageable task. 

Stream processing is a different beast, however. Stream analytics shrinks the time horizon between a user event and an application action, which means it is more important than ever to quickly respond to performance degradations in your data pipelines. To that end, Dataflow, Google Cloud’s fully managed batch and stream data processing service, now includes new observability features that will allow you to identify, diagnose, and remediate your pipelines faster than ever. With better observability, you can spend less time fixing problems and more time getting value out of your data.

Introducing Dataflow observability

With this launch, we are introducing new charts into the Dataflow monitoring UI and streamlined workflows with the Cloud Monitoring interface. You will find these charts in the new “Job metrics” tab located at the top of the screen when you navigate to the job details page within Dataflow.

In addition to the data freshness, system latency, and autoscaling graphs that have historically been a part of the Dataflow monitoring experience, you’ll now also see throughput and CPU utilization charts. Throughput charts, shown below, display how many elements (or bytes) are flowing through your pipeline. The time-series graph contains a line for each step of your pipeline, which can quickly illustrate which step(s) could be slowing down the overall processing of your job. Our new time selector tool lets you drag your cursor over interesting points in the graph to zoom in for higher fidelity.


CPU utilization charts show the utilization of your workers over time. These charts can indicate whether you have allocated the appropriate number of cores for your workers, or whether you have selected the appropriate number of workers for your job (assuming you have disabled autoscaling). You can toggle between multiple views, including an all-worker view, a stats view, the four top-utilized machines, and the four least-utilized machines, as shown here:


Developers can create alerts with just a few clicks by using the “Create alerting policy” link in the top right corner of the chart card. You can find job and worker logs in an expandable panel at the bottom of your screen, giving you all of the tools to debug your stuck pipelines.

Dataflow observability in the real world

We’ve heard from customers about how useful this new feature has been already. 

“We are loving the new UI! In the last day we’ve already been using it to track throughput of our pipelines and diagnose issues,” said Vinay Mayar, senior software engineer at Expanse.

It’s been helpful for Ocado too. “The killer feature of the page is the ability to see the throughput for each processing step,” says Mateusz Juraszek, software engineer at Ocado Technology. “It’s great that all the statistics are gathered in one place on the JOB METRICS page. Displaying data freshness and system latency enables us to quickly and preemptively detect anything that might affect reliability, and then use other charts or logs to investigate and address what we discover.”

What’s next for pipeline observability

The general availability of these observability charts is our first step toward making Dataflow monitoring the best in class for data engineers. Over the coming months, we plan to add new features including memory and disk usage charts, I/O metrics such as response latencies and error rates for Pub/Sub calls, and visualizers that will significantly enhance the explainability of Dataflow jobs. By spending less time managing reliability and performance and more time extracting value from your data, you can spend your time laying the foundation for tomorrow’s cutting-edge streaming analytics applications.

Learn more about these new Dataflow features.

Building more secure data pipelines with Cloud Data Fusion

For those of you working in data analytics, ETL and ELT pipelines are an important piece of your data foundation. Cloud Data Fusion is our fully managed data integration service for quickly building and managing data pipelines. 

Cloud Data Fusion is built on the open source project CDAP, and this open core lets you build portable data pipelines. A CDAP server might satisfy your need to run a few simple data pipelines. But when it comes to securing a larger number of business-critical data pipelines, you’ll often need to put a lot more effort into logging and monitoring those pipelines. You will also need to manage authentication and authorization to protect that data when you have servers running workloads for multiple teams and environments. These additional services can require a lot of maintenance effort from your operations team and take time away from development. The goal is running pipelines, not logging, monitoring, or the identity and access management (IAM) service.

We designed Cloud Data Fusion to take care of most of this work for you. And since it’s part of Google Cloud, you can take advantage of built-in security benefits when using Cloud Data Fusion rather than self-managed CDAP servers:

  • Cloud-native security control with Cloud IAM—Identity management and authentication efforts are taken care of by Cloud Identity

  • Full observability with Stackdriver Logging and Monitoring—Logs include pipeline logs and audit logs

  • Reduced exposure to public internet with private networking

Let’s take a look at these features in detail.

Access control with Cloud IAM 

The number one reason to use Cloud Data Fusion over self-managed CDAP servers is that it integrates seamlessly with Cloud IAM. That lets you control access to your Cloud Data Fusion resources. With Cloud IAM, Cloud Data Fusion is able to easily integrate with other Google Cloud services. You can also use Cloud Identity for users and groups management and authentication [such as multi-factor authentication (MFA)], instead of implementing or deploying your own.

There are two predefined roles in Cloud Data Fusion: admin and viewer. As a practice of the IAM principle of least privilege, the admin role should only be assigned to users who need to manage (create and delete) the instances. The viewer role should be assigned to users who only need to access the instances, not manage them. Both roles can access the Cloud Data Fusion web UI to create pipelines and plugins.

Assign roles and permissions to groups with users instead of assigning them to users directly whenever possible. This helps you control users’ access to Cloud Data Fusion resources in a more organized manner, especially when you assign permissions to the groups repeatedly on multiple projects.

Read more about the two Cloud Data Fusion roles and their corresponding permissions.

Private IP instance

The private IP instance of Cloud Data Fusion connects with your Virtual Private Cloud (VPC) privately. Traffic over this network does not go through the public internet, which reduces the potential attack surface. You can find out more about setting up private IP for Cloud Data Fusion.

VPC Service Controls

We’re also announcing beta support for VPC Service Controls in Cloud Data Fusion. You can now help prevent data exfiltration by adding a Cloud Data Fusion instance to your service perimeter. When configured with VPC-SC, any pipeline that reads data from within the perimeter will fail if it tries to write the data outside the service perimeter.

Stackdriver Logging

Stackdriver Logging and Monitoring are disabled by default in Cloud Data Fusion, but we recommend you enable these tools for observability.

With the extra information provided by the logs and metrics, you can not only investigate and respond to incidents faster, but understand how to manage your particular infrastructure and workloads more effectively in the long run. There are a range of logs that can help you run your Cloud Data Fusion pipelines better.

Pipeline logs

These are generated by your pipelines in Cloud Data Fusion. They are useful for understanding and troubleshooting your Cloud Data Fusion pipelines. You can find these logs in the Cloud Data Fusion UI as well as in the Stackdriver logs of the Dataproc clusters that execute the pipelines.

Admin activity audit logs

These logs record operations that modify the configuration or metadata of your resources. Admin activity audit logs are enabled by default and cannot be disabled.

Data access audit logs

Data access audit logs contain API calls that read the configuration or metadata of the resources, as well as user-driven API calls that create, modify, or read user-provided resource data.

Admin activity audit logs and data access audit logs are useful for tracking who accessed or made changes to your Cloud Data Fusion resources. In case there’s any malicious activity, a security admin will be able to find and track down the bad actor in the audit logs.

These Google Cloud features can give you extra control and visibility into your Cloud Data Fusion pipelines. Cloud IAM helps you to control who can access your Cloud Data Fusion resources; private instance minimizes exposure to public internet; and Stackdriver Logging and Monitoring provides information about your workloads, changes in permission, and access to your resources. Together, they create a more secure solution for your data pipeline on Google Cloud.

Learn more about Cloud Data Fusion.

Scale your Composer environment together with your business

When you’re building data pipelines, it’s important to consider business needs now and in the future. We often hear from customers that they want to configure and optimize their Cloud Composer environments. So we on the Cloud Composer engineering team will share in this post how Cloud Composer—built on Apache Airflow—works, and offer some tips to optimize your Cloud Composer performance. 

Cloud Composer is a fully managed workflow orchestration service that lets you author, schedule, and monitor pipelines that span across clouds and on-premises data centers. It’s built on Apache Airflow open source software and operated using the Python programming language.

We’ll start by analyzing how Airflow configurations can affect performance, then offer tips on ways to quickly bootstrap your initial settings for high performance. You may also find this sizing guide helpful—make a copy and add your own numbers. 

Understanding Apache Airflow scheduler architecture with CeleryExecutor

Let’s start with this detailed architecture of Airflow scheduler/worker in Cloud Composer. This assumes you’re already familiar with overall Cloud Composer architecture and Apache Airflow concepts.

In the diagram below, you can see that the process of parsing DAGs loads DAGs from files repeatedly. The process checks DAGs and fires scheduling actions, such as starting a DAG run or creating a task. Tasks are sent to workers for execution via a Redis-based task queue.

Scheduler architecture of Airflow on Composer

The scheduler launches multiple processes by calling Python’s multiprocessing.Process to parse DAG files in parallel. The total number of DAG processing processes that the scheduler can launch is limited by the Airflow config [scheduler] max_threads.

Each DAG parsing process will complete the following steps:

  • Parse a subset of DAG files to generate DAG runs and tasks for those DAG runs.

  • Collect tasks that meet all dependencies.

  • Set these tasks to the SCHEDULED state.

The main process of the scheduler will do the following in a loop:

  • Collect all SCHEDULED tasks from DAG parsing processes.

  • Set eligible tasks to QUEUED state.

  • Send a certain number of QUEUED tasks into the Celery queue. This number is limited by the parallelism config parameter, which represents the maximum number of tasks running concurrently.

  • Any remaining tasks stay in the QUEUED state.

Life of a task in Cloud Composer

Every Airflow task goes through the process and constraints depicted below before being executed by a worker. In sequence, an Airflow task needs to pass these Airflow config constraints before it is finally executed by a worker:

Constraints in Airflow at different stages
  1. The DAG parsing process in the scheduler parses the DAG definition, creating task instances for each task in the DAG.

  2. If all of the task dependencies are met, the task is set to the SCHEDULED state.

  3. Once the task is in the SCHEDULED state, the scheduler main process picks it for processing.

  4. The scheduler main process picks tasks in the SCHEDULED state, taking into account the constraints `dag_concurrency` (maximum number of tasks per DAG) and `non_pooled_task_slot_count` (maximum number of tasks in the system), together with other criteria for queueing. Tasks that pass these checks are set to the QUEUED state.

  5. Next, the scheduler main process pushes tasks into the Celery queue, subject to the `parallelism` constraint, which limits the number of tasks in the Celery queue. Tasks placed in the Celery queue remain in the QUEUED state.

  6. Finally, worker processes take tasks from the Celery queue as long as the number of tasks on the worker is lower than the `worker_concurrency` constraint. Tasks actually running on a worker are set to the RUNNING state.

Recommended Airflow config variables for optimal performance

Here’s a quick reference to our recommendations for the various Airflow configs that may affect performance. We’re going to discuss the rationale behind each of them in the following sections.


Choose the right Airflow scheduler settings 

When you need to scale your Cloud Composer environment, you’ll want to choose the right Airflow configs as well as node and machine type settings.

The Airflow default config for scheduler max_threads is only two, which means that even if the Airflow scheduler pod runs on a 32-core node, it can only launch two DAG parsing processes. Therefore, we recommend setting max_threads to at least the number of vCPUs per machine.

If you find tasks are spending a long time in the SCHEDULED state, they may be constrained by dag_concurrency or non_pooled_task_slot_count. Consider increasing the values of these two options.

If you find tasks are stuck in the QUEUED state, they may be constrained by parallelism. They may, however, also be limited by worker processing power, because tasks are only set to the RUNNING state after they’re picked up by a worker. Consider increasing parallelism or adding more worker nodes.

Test Airflow worker performance 

Cloud Composer launches a worker pod for each node you have in your environment. Each worker pod can launch multiple worker processes to fetch and run a task from the Celery queue. The number of processes a worker pod can launch is limited by the Airflow config worker_concurrency.

To test worker performance, we ran a test based on a no-op PythonOperator and found that six or seven concurrent worker processes seem to already fully utilize one vCPU with 3.75GB RAM (the default n1-standard-1 machine type). Adding more worker processes can introduce large context-switching overhead and can even result in out-of-memory issues for worker pods, ultimately disrupting task execution.

`worker_concurrency` = 6-8 * cores_per_node (or per 3.75GB of RAM)

Cloud Composer uses six as the default concurrency value for environments. For environments with more cores in a single node, use the above formula to quickly get a worker_concurrency number that works for you. If you do want a higher concurrency, we recommend monitoring worker pod stability closely after the new value takes effect. Worker pod evictions that happen because of out-of-memory errors may indicate the concurrency value is too high. Your real limit may vary depending on your worker process memory consumption.

Another consideration is long-running operations that are not CPU-intensive, such as polling status from a remote server, which still consume memory by keeping a whole Airflow process running. We advise raising your worker_concurrency number slowly and monitoring closely after each adjustment.

Consider more nodes vs. more powerful machines

Big node setup vs. small node setup with the same number of vCPUs. In the image on the right, the Airflow Scheduler pod runs in a relatively less powerful machine.

Our internal tests show that worker processing power is most influenced by the total number of vCPU cores rather than machine type. There’s not much difference in terms of worker processing power between a small number of multi-core machines and a large number of single-core machines, as long as the total number of CPU cores is the same.

However, in the small node setup, with a large number of nodes but less powerful machines, the Scheduler runs in a small machine, and it may not have enough compute power to produce tasks for workers to execute. 

Therefore, we recommend setting up a Cloud Composer cluster with a relatively small number of powerful machines, keeping in mind that if the number of machines is too small, a failure of one machine will impact the cluster severely.

Our internal tests show that with a worker_cores:scheduler_cores ratio of up to around 9:1, there is no performance difference in terms of overall throughput for the same number of cores, as long as there are no long-running tasks. We recommend exceeding that ratio only when you have long-running tasks. You can use the formula below to quickly calculate a good worker_cores:scheduler_cores ratio to start with.
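As a rough sketch, assuming the default worker_concurrency of about six processes per core, that calculation looks like:

`recommended_nodes ≈ 9 + long_running_tasks / (cores_per_node * 6)`

where the first term reflects the 9:1 worker-to-scheduler core ratio and the second term adds nodes for tasks that hold a worker slot for a long time.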

For example, if you set up your environment initially with three nodes and two cores per machine and then estimate you may have 24 long-running tasks at the same time, you could try to scale your environment up to 9 + 24 / (2 * 6) = 11 nodes. If you want to have more performance, it may be worth trying with a more powerful machine type instead.

Use our sizing guide to get started, and have a wonderful journey with Cloud Composer!

10 top tips: Unleash your BigQuery superpowers

Lots of us are already tech heroes by day. If you know SQL, for example, you’re a hero. You have the power to transform data into insights. You can save the day when someone in need comes to you for the magic numbers they can then use in their business proposals. You can also amaze your colleagues with patterns you found while roaming around your data lakes.

With BigQuery, Google Cloud’s enterprise data warehouse, you quickly become a superhero: You can run queries faster than anyone else. You’re not afraid of running full table scans. You’ve made your datasets highly available, and you no longer live in fear of maintenance windows. Indexes? We don’t need indexes where we’re going, or vacuums either.

If you’re a BigQuery user, you’re already a superhero. But superheroes don’t always know all their superpowers, or how to use them. Here are the top 10 BigQuery superpowers to discover.


1. The power of data

Let’s say your favorite person has been trapped by an evil force, which will only release them if you answer this simple riddle: Who were the top superheroes on Wikipedia the first week of February 2018?

Oh no! Where will you get a log of all the Wikipedia page views? How can you tell which pages are superheroes? How long will it take to collect all of this data, and comb through it all? Well, I can answer that question (see the source data here). Once data is loaded, it will only take 10 seconds to get the query results. This is how:
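A sketch of that query might look like the following. The table names, column names, and the superhero filter are placeholders for the Wikipedia pageviews and Wikidata tables mentioned above; adjust them to the datasets you actually load:

-- Illustrative sketch: top superheroes by English Wikipedia pageviews, first week of February 2018.
SELECT
  wd.en_title AS superhero,
  SUM(pv.views) AS pageviews
FROM `my_project.wikipedia.pageviews_2018` AS pv   -- placeholder pageviews table
JOIN `my_project.wikidata.entities` AS wd          -- placeholder Wikidata table
  ON pv.title = wd.en_title
WHERE pv.wiki = 'en'
  AND pv.datehour >= TIMESTAMP '2018-02-01'
  AND pv.datehour < TIMESTAMP '2018-02-08'
  AND wd.description LIKE '%superhero%'            -- placeholder "is a superhero" filter
GROUP BY superhero
ORDER BY pageviews DESC
LIMIT 10;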

There it is—all the superheroes on the English Wikipedia page, and the number of page views for whatever time period you choose. And these are the top 10, for the first week of February 2018:

[Results: top 10 superheroes by English Wikipedia pageviews, first week of February 2018]

You’ve saved your friend! But first, the evil spirit needs more detail. Well, another query can provide that.


You can have the power of data too: check out the Wikipedia pageviews and my latest Wikidata experiments (plus all of BigQuery’s public datasets), then copy and paste these queries, modify them, and save your friends.

2. The power of teleportation

You want to see the tables with the Wikipedia pageviews and Wikidata? Let’s jump to the BigQuery web UI. Did you know that you can autocomplete your queries while typing them? Just press tab while writing your queries. Or you can run a sub-query by selecting it and pressing CMD-E. And teleportation? Jump straight to your tables with CMD and click on them. For example, that Wikipedia 2018 page views table we queried previously has more than 2TB of data, and the Wikidata one has facts for more than 46 million entities. And we just joined them to get the results we wanted.

Also, while looking at the schema, you can click on the fields, and that will auto-populate your query. Ta-da!

3. The power of miniaturization

Did I just say that the page views table has more than 2TB of data? That’s a lot! Remember that in BigQuery you have 1TB of free queries every month, so going through 2TB in one query means you will be out of the free quota pretty quickly. So how much data did I just consume? Let me run that first query again, without hitting the cache.

The result? 4.6 sec elapsed, 9.8 GB processed.

How is that possible? I just joined a 2TB table with a 750GB one. Even with partitioning, one week of Wikipedia page views would be 2TB divided by 52 weeks, which is about 38.5GB. So even with daily partitioning, I’m somehow querying far less data than that.

Well, it turns out the tables are clustered by Wikipedia language and article title. As long as I always filter on those columns when going through the Wikipedia logs, BigQuery only has to scan the blocks that match, not the whole table.
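As a sketch (table and column names as assumed above), the trick is simply to always include the clustering columns in the WHERE clause:

-- Filtering on the clustering columns (wiki, title) lets BigQuery skip most blocks.
SELECT title, SUM(views) AS views
FROM `bigquery-public-data.wikipedia.pageviews_2018`
WHERE datehour BETWEEN TIMESTAMP '2018-02-01' AND TIMESTAMP '2018-02-08'
  AND wiki = 'en'
  AND title IN ('Spider-Man', 'Wonder_Woman', 'Black_Panther_(film)')
GROUP BY title
ORDER BY views DESC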

And that’s how you miniaturize your queries!

4. The power of X-ray vision

Let’s say you want to get more data out of Wikidata for each superhero. Well, this query will do:

05.png
06.png

Why did this query take more time to process? Well, with our X-ray vision powers, we can see what BigQuery did in the background. Let’s look at the query history and the execution details tab.

07.png

Those are all the steps BigQuery had to go through to run our query. Now, if this is a little hard to read, we have some alternatives. For example, the legacy BigQuery web UI has more compact results:

08.png

You can see that the slowest operations were the compute stages that read the 56-million-row table, and that the table was read twice.

09.png

I’ll focus on that to improve performance. If I change those two reads, shown in these lines:

Now my query runs in half the time! The slowest part has moved elsewhere, as shown here:

10.png

Which is this JOIN now:

11.png

It even shows us that it’s looking for all the superheroes between “3-D Man” and “Zor-El”… yes, it’s going through the whole alphabet. Get an even deeper view with the BigQuery query plan visualizer.

5. The power of materialization

It’s really cool to have these tables in BigQuery. But how did I load them? I periodically bring new raw files into Cloud Storage, and then I read them raw into BigQuery. In the case of the Wikipedia pageviews, I do all the CSV parsing inside BigQuery, as there are many edge cases, and I need to handle some of them case by case.

Then I materialize these tables periodically into my partitioned and clustered tables. In the case of Wikidata, they have some complicated JSON—so I read each JSON row raw into BigQuery. I could parse it with SQL, but that’s not enough. And that brings us to our next super power.
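Before jumping universes, here is a rough sketch of that periodic materialization step. The table names, the raw staging schema (a datehour column plus one raw text line per record), and the parsing logic are illustrative only, not the exact pipeline from the post:

-- Materialize raw staging rows into a partitioned, clustered table.
CREATE OR REPLACE TABLE `mydataset.wiki_pageviews`
PARTITION BY DATE(datehour)
CLUSTER BY wiki, title AS
SELECT
  datehour,
  SPLIT(line, ' ')[OFFSET(0)] AS wiki,
  SPLIT(line, ' ')[OFFSET(1)] AS title,
  CAST(SPLIT(line, ' ')[OFFSET(2)] AS INT64) AS views
FROM `mydataset.wiki_pageviews_raw`;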

6. Navigating the multiverse

So we live in this SQL universe. It’s an incredible place to manipulate and understand data, but each universe has its limitations and its rules. What if we could jump to a different universe, with different rules and powers, and manage to connect both universes somehow? What if we could jump into the…JavaScript universe? We can, with UDFs (user-defined functions), which let you go beyond SQL alone and easily extend BigQuery’s standard SQL. For example, I can download a random JavaScript library and use it from within BigQuery, for tasks like natural language processing and lots more. 

Using UDFs means I can take each row of Wikidata JSON from above and parse it inside BigQuery, using whatever JavaScript logic I want to use, and then materialize this into BigQuery.
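A toy sketch of that pattern, with hypothetical table and field names (the real Wikidata JSON is far more complex):

-- A JavaScript UDF that parses one raw JSON row per entity.
CREATE TEMP FUNCTION parse_entity(json STRING)
RETURNS STRUCT<id STRING, label STRING>
LANGUAGE js AS """
  var e = JSON.parse(json);
  return {id: e.id, label: e.labels && e.labels.en ? e.labels.en.value : null};
""";

SELECT parse_entity(raw_json) AS entity
FROM `mydataset.wikidata_raw`   -- hypothetical staging table with one raw JSON row per entity
LIMIT 10;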

7. Time travel

Let’s take one particular table. It’s a beautiful table, with a couple thousand rows. But not everyone is happy—turns out someone wants to delete half of its rows, randomly. How would our super-enemy pull this off?

Oh no. Half of the rows of our peaceful universe are gone. Randomly. How is that even fair? How will we ever recover from this?

5 days later

We learned how to move forward without these rows, but we still miss them. If only there was a way to travel back in time and bring them back.

Yes we can.

Instead of:

we can write:

Warning: CREATE OR REPLACE TABLE deletes the table history, so write the results elsewhere. 
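The original before-and-after snippets aren’t reproduced above, but here is a minimal sketch of the idea in today’s standard SQL, using FOR SYSTEM_TIME AS OF (which works within the table’s time travel window, seven days by default) and heeding the warning by writing the recovered rows to a new table. The table names are placeholders:

-- Instead of querying the table as it is now...
SELECT COUNT(*) FROM `mydataset.peaceful_universe`;

-- ...we can query it as it was before the deletion:
SELECT COUNT(*)
FROM `mydataset.peaceful_universe`
  FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 5 DAY);

-- And save those rows somewhere new (CREATE OR REPLACE on the original would drop its history):
CREATE TABLE `mydataset.peaceful_universe_restored` AS
SELECT *
FROM `mydataset.peaceful_universe`
  FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 5 DAY);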

8. The power of super-speed

How fast is BigQuery? It’s this fast.

12.png

The quick summary: BigQuery can run HyperLogLog++, Google’s internal implementation of the HyperLogLog algorithm, for cardinality estimation. It lets BigQuery count uniques a lot faster than other databases can, and it has some other cool features that make BigQuery perform incredibly well.  
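A quick example of the idea, using a public sample table: APPROX_COUNT_DISTINCT is the HyperLogLog++-based counterpart to an exact COUNT(DISTINCT ...).

-- Exact vs. approximate distinct counts.
SELECT
  COUNT(DISTINCT word)        AS exact_count,    -- exact, slower on very large tables
  APPROX_COUNT_DISTINCT(word) AS approx_count    -- HyperLogLog++, much faster at scale
FROM `bigquery-public-data.samples.shakespeare`;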

9. Invulnerability

Our most annoying enemy? It’s a black hole of data: that thing that happens when we try to divide by zero. However, it’s possible to avoid that with BigQuery features like the SAFE. prefix.

SAFE. prefix

If you begin a function with the SAFE. prefix, it will return NULL instead of an error.

Operators such as + and = do not support the SAFE. prefix. To prevent errors from a division operation, use SAFE_DIVIDE. Some operators, such as IN, ARRAY, and UNNEST, resemble functions but do not support the SAFE. prefix. The CAST and EXTRACT functions also do not support the SAFE. prefix. To prevent errors from casting, use SAFE_CAST. Find out more in the BigQuery docs.
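A few concrete examples of those escape hatches (each returns NULL where the unprefixed version would raise an error):

SELECT
  SAFE.LOG(10, -5)            AS safe_log,      -- NULL: LOG(10, -5) would error on the negative base
  SAFE_DIVIDE(10, 0)          AS safe_divide,   -- NULL instead of a division-by-zero error
  SAFE_CAST('apple' AS INT64) AS safe_cast;     -- NULL instead of a cast error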

10. The power of self-control

All superheroes struggle when they first discover their super-powers. Having super strength is cool, but you can break a lot of things if you’re not careful. Having super-speed is fun—but only if you also learn how to brake. You can query 5PB of data in three minutes, sure—but then remember that querying 1PB is one thousand times more expensive than querying 1TB. And you only have 1TB free every month. If you have not entered a credit card, don’t worry: you still get your free terabyte every month. But if you want to go further, you need to be aware of your budget and set up cost controls.

Check out this doc on creating custom cost controls, and find out how BigQuery Reservations works if you want to use our flat-rate pricing model. Remember, with great power comes great responsibility. Turn on your cost controls.

And there are a lot more superpowers. How about the power to predict the future? There’s a whole world of ML to explore, not to mention all the GIS capabilities you can find in BigQuery. Check out Lak Lakshmanan’s talk about more of the awesome resources we have. And that brings me to our bonus superpower:

11. The power of community

No superhero should stand alone. Join our Reddit community, where we share tips and news. Come to Stack Overflow for answers, and to help new superheroes learn the ropes. We can all learn from each other. And follow me and my friends on Twitter. 

If you’re ready to test your powers, try to solve our weekly BigQuery Data Challenge. It’s fun, free of charge, and you might win $500 in cloud credits!

What’s happening in BigQuery: New federated queries, easier ML, and more metadata

BigQuery, Google Cloud’s petabyte-scale data warehouse, lets you ingest and analyze data quickly and with high availability, so you can find new insights, trends, and predictions to efficiently run your business. Our engineering team is continually making improvements to BigQuery so you can get even more out of it. Recently added BigQuery features include new federated data sources, BigQuery ML transforms, integer partitions, and more.

Read on to learn more about these new capabilities and get quick demos and tutorial links so you can try them yourself.

Query ORC and Parquet files directly with BigQuery

Parquet and ORC are popular columnar open source formats for large-scale data analytics. As you make your move to the cloud, you can use BigQuery to analyze data stored in these formats. Choosing between keeping these files in Cloud Storage vs. loading your data into BigQuery can be a difficult decision. 

To make it easier, we launched federated query support for Apache ORC and Parquet files in Cloud Storage from BigQuery’s Standard SQL interface. Check out a demo:

This video demonstrates a newly-released set of BigQuery features! BigQuery now supports querying Parquet and ORC files stored in GCS, and BigQuery is now able to understand Hive-partitioned tables in GCS.

This new feature joins other federated querying capabilities from within BigQuery, including storage systems such as Cloud Bigtable, Google Sheets, and Cloud SQL, as well as AVRO, CSV, and JSON file formats in Cloud Storage—all part of BigQuery’s commitment to building an open and accessible data warehouse. Read more details on this launch.  
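For example, one way to set this up today is with an external table definition over the files in Cloud Storage; the bucket, path, and table names below are placeholders:

-- Define an external table over Parquet files in Cloud Storage...
CREATE EXTERNAL TABLE `mydataset.movie_ratings_ext`
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://my-bucket/movie_ratings/*.parquet']
);

-- ...then query it like any other table.
SELECT COUNT(*) AS num_ratings
FROM `mydataset.movie_ratings_ext`;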

Your turn: Load and query millions of movie recommendations 

Love movies? Here’s a way to try query federation: analyze over 20 million movie ratings and compare analytic performance between Cloud SQL, BigQuery federated queries, and BigQuery native storage. Launch the code workbook and follow along with the video. Don’t have a Google Cloud account? Sign up for free.

New data transformations with BigQuery ML

The success of machine learning (ML) models depends heavily on the quality of the dataset used in training. Preprocessing your training data during feature engineering can get complicated when you also have to do those same transformations on your production data at prediction time. 

We announced some new features in BigQuery ML that can help preprocess and transform data with simple SQL functions. In addition, because BigQuery automatically applies these transformations at the time of predictions, the productionization of ML models is greatly simplified.

Binning data values with ML.BUCKETIZE

One decision you will face when building your models is whether to throw away records where there isn’t enough data for a given dimension. For example, if you’re evaluating taxi cab fares in NYC, do you throw away rides to latitudes and longitudes that only appear once in the data? One common technique is to bucketize continuous values (like lat/long) into discrete bins. This helps you group, rather than discard, the infrequent dropoff locations in the long tail:

Binning data values.png

BigQuery provides out-of-the-box support for several common machine learning operations that do not require a separate analysis pass through the data. For example, here’s how you can bucketize the inputs, given the latitude and longitude boundaries of New York:

Binning data values 1.png

BigQuery ML now supports running the same preprocessing steps at serving time if you wrap all of your transformations in the new TRANSFORM clause. This saves you from having to remember and reimplement the same data transformations you did during training on your raw prediction data. Check out the blog post to see the complete example for NYC Taxi cab fare prediction, and read more in the BigQuery ML preprocessing documentation.
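As a simplified sketch (not the full example from the post; the table, columns, and bucket boundaries are illustrative), the TRANSFORM clause and ML.BUCKETIZE fit together like this:

-- Bucketize pickup coordinates inside TRANSFORM so the same preprocessing
-- is applied automatically at prediction time.
CREATE OR REPLACE MODEL `mydataset.taxi_fare_model`
TRANSFORM(
  ML.BUCKETIZE(pickup_longitude, GENERATE_ARRAY(-74.3, -73.7, 0.01)) AS pickup_lon_bin,
  ML.BUCKETIZE(pickup_latitude,  GENERATE_ARRAY(40.5, 41.0, 0.01))   AS pickup_lat_bin,
  passenger_count,
  fare_amount
)
OPTIONS (model_type = 'linear_reg', input_label_cols = ['fare_amount']) AS
SELECT pickup_longitude, pickup_latitude, passenger_count, fare_amount
FROM `mydataset.taxi_trips`;

Because the bucketization lives inside TRANSFORM, ML.PREDICT can apply it automatically to raw prediction rows.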

Use flat-rate pricing with BigQuery Reservations

BigQuery Reservations is now in beta in U.S. and E.U. regions. BigQuery Reservations allows you to seamlessly purchase BigQuery slots to take advantage of flat-rate pricing and manage BigQuery spending with complete predictability.

BigQuery Reservations allows you to:

  • Purchase dedicated BigQuery slots by procuring commitments in a matter of seconds
  • Programmatically and dynamically distribute committed BigQuery slots to reservations for workload management purposes
  • Use assignments to assign Google Cloud projects, folders, or your entire organization to reservations
BigQuery Reservations.png

Quickly analyze metadata with INFORMATION_SCHEMA

As a data engineer or analyst, you are often handed a dataset or table name with little or no context of what’s inside or how it is structured. Knowing what tables and columns are available across your datasets is a critical part of exploring for insights. Previously, you could select a preview of the data or click each table name in the BigQuery UI to inspect the schema. Now, with INFORMATION_SCHEMA, you can do these same tasks at scale with SQL. 

How can you quickly tell how many tables are in a dataset? What about the total number and names of columns and whether they are partitioned or clustered? BigQuery natively stores all this metadata about your datasets, tables, and columns in a queryable format that you can quickly access with INFORMATION_SCHEMA:

metadata.png

As you can see, INFORMATION_SCHEMA.COLUMNS returns a list of every column and its data type for all tables in the baseball dataset. 
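For instance, a query along these lines lists every column in the public baseball dataset, along with its type and whether it participates in partitioning or clustering:

SELECT
  table_name,
  column_name,
  data_type,
  is_partitioning_column,
  clustering_ordinal_position
FROM `bigquery-public-data.baseball.INFORMATION_SCHEMA.COLUMNS`
ORDER BY table_name, ordinal_position;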

Let’s expand the previous query to aggregate some useful metrics like the ones below (a sketch of such a query appears after the list):

  • Count of tables
  • Names of tables 
  • Total number of columns
  • Count of partitioned columns
  • Count of clustered columns
metadata results.png
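A sketch of that aggregation over the same INFORMATION_SCHEMA.COLUMNS view (column names per the current schema) might look like this:

SELECT
  COUNT(DISTINCT table_name)                       AS table_count,
  STRING_AGG(DISTINCT table_name)                  AS table_names,
  COUNT(column_name)                               AS column_count,
  COUNTIF(is_partitioning_column = 'YES')          AS partitioned_column_count,
  COUNTIF(clustering_ordinal_position IS NOT NULL) AS clustered_column_count
FROM `bigquery-public-data.baseball.INFORMATION_SCHEMA.COLUMNS`;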

Try the above query with different public datasets like github_repos or new_york or even your own Google Cloud project and dataset. 

Your turn: Analyze BigQuery public dataset and table metadata quickly with INFORMATION_SCHEMA

Practice analyzing dataset metadata with this 10-minute demo video and code workbook. And check out the INFORMATION_SCHEMA documentation for reference.

Partition your tables by an integer range

BigQuery natively supports partitioning your tables, which makes managing big datasets easier and faster to query. You can segment your tables by a timestamp or date column, and now also by a range of integer values. 

Creating an integer range partitioned table

Let’s assume we have a large dataset of transactions and a customer_id field that we want to partition by. After the table is set up, we will be able to filter for a specific customer without having to scan the entire table, which means faster queries and more control over costs.

We’ll use the BigQuery CLI for this example. Here we create an integer range partitioned table named mypartitionedtable in mydataset in your default project.

The partitioning is based on a start of 0, an end of 100, and an interval of 10. That means we have customer IDs from 0 to 100 and want to segment them into groups of 10 (so IDs 0-9 will be in one partition, and so on). Note that new data written to an integer range partitioned table will automatically be partitioned. This includes writing to the table via load jobs, queries, and streaming. As your dataset changes over time, BigQuery will automatically re-partition your data too.
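The original walkthrough uses the bq command-line tool, which isn’t reproduced here; as a rough equivalent, the same table can be declared in SQL DDL with RANGE_BUCKET (the non-key columns are illustrative):

CREATE TABLE `mydataset.mypartitionedtable` (
  customer_id INT64,
  transaction_amount NUMERIC,
  transaction_ts TIMESTAMP
)
PARTITION BY RANGE_BUCKET(customer_id, GENERATE_ARRAY(0, 100, 10));

-- Queries that filter on customer_id then only scan the matching partitions, e.g.:
-- SELECT * FROM `mydataset.mypartitionedtable` WHERE customer_id BETWEEN 30 AND 39;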

Using table decorators to work with specific partitions 

You can use table decorators to quickly access specific segments of your data. For example, if you want all the customers in the first partition (the 0-9 ID range), add the $0 suffix to the table name:

This is particularly useful when you need to load in additional data—you can just specify the partitions using decorators. 

For additional information about integer range partitions, see Integer range partitioned tables.

In case you missed it

For more on all things BigQuery, check out these recent posts, videos and how-tos:

To keep up on what’s new with BigQuery, subscribe to our release notes. You can try BigQuery with no charge in our sandbox. And let us know how we can help.