Learn how to create dynamic partitions with PostgreSQL

A step-by-step guide to automatically sharding your big data tables

I’m a software engineer at Panya Studios where we’re moving to a microservices architecture — and the first service that we created is for our real-time chat system.

As part of creating this new service, we’ll be storing our chat data for analytics. While seemingly simple, this effort required us to pause and consider the following question:

“How can we efficiently read and write chat data?”

With over a million users and counting, our chat database can grow really big, really fast — so it’s all about scalability. The answer? Partitioning!

Some basic definitions

Before we get started, here are a few basic terms.

  • Partitioning refers to splitting a large table into smaller tables.
  • Dynamic refers to constantly changing.
  • Dynamic Partitioning thus refers to automatically splitting a large table into smaller tables.

Now that we’re on the same page, let’s go more in depth on how we can achieve dynamic partitioning with PostgreSQL!

PostgreSQL

PostgreSQL is a powerful, open source object-relational database system that uses and extends the SQL language combined with many features that safely store and scale the most complicated data workloads. https://www.postgresql.org/about/

One of the many features that PostgreSQL supports is table partitioning. In fact, they have a great post about partitioning which I highly encourage you to read.

Although the examples they provide are hard-coded, their documentation is excellent and covers the core features of partitioning:

  • Inheritance
  • Functions
  • Triggers

The last ingredient needed to make this truly dynamic is string interpolation! Combine the four features together, and you get . . . dynamic partitioning!

TL;DR — Show me the code!

Our goal is to have new partitions created automatically each month — instead of manually creating each one ahead of time.

Visualization of a chat table partitioned by month.

Here’s how to create dynamic partitioning in 3 simple steps.

Step #1 — Create the parent table
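
Here’s a minimal sketch of the parent table — the column names and types are assumptions inferred from the query output shown later in this post:

CREATE TABLE public.chat_master (
    id          SERIAL,
    program_id  VARCHAR NOT NULL,
    user_id     VARCHAR NOT NULL,
    dialogue    TEXT,
    created_at  TIMESTAMP NOT NULL
);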

Step #2 — Create the function
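
Below is a minimal sketch of such a function, reconstructed from the explanation that follows — the variable names and the exact NOTICE wording are assumptions:

CREATE OR REPLACE FUNCTION chat_insert_function()
RETURNS TRIGGER AS $$
DECLARE
    partition_name  TEXT;
    month_start     TIMESTAMP;
    month_end       TIMESTAMP;
BEGIN
    -- Table names can't contain dashes, so we build chat_yyyy_mm
    partition_name := 'chat_' || to_char(NEW.created_at, 'YYYY_MM');
    month_start    := date_trunc('month', NEW.created_at);
    month_end      := month_start + INTERVAL '1 month';

    -- Create the monthly partition the first time it's needed
    IF NOT EXISTS (SELECT 1 FROM pg_tables WHERE tablename = partition_name) THEN
        EXECUTE format(
            E'CREATE TABLE public.%I (\n'
             '    CHECK (created_at >= %L AND created_at < %L)\n'
             ') INHERITS (public.chat_master)',
            partition_name, month_start, month_end);
        RAISE NOTICE 'A partition has been created %', partition_name;
    END IF;

    -- Redirect the NEW row into its partition
    EXECUTE format('INSERT INTO public.%I VALUES ($1.*)', partition_name)
        USING NEW;

    -- Return nothing to the caller (the trigger),
    -- so chat_master itself never stores the row
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;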

I’ll try my best to explain what’s going on:

  • First we CREATE a new FUNCTION called “chat_insert_function()”
  • Then we DECLARE the variables that will be used
  • Afterwards we BEGIN initializing the variables
  • Next we query to see if the partitioned table exists. If not, we dynamically create a new partitioned table (e.g. chat_yyyy_mm) that INHERITS from chat_master.
  • Finally, we insert the new record into the partitioned table

Some things to note:

  • The parent table is always empty.
  • “NEW” signifies a new database row for INSERT/UPDATE operations
  • Table names cannot contain dashes (-) unless quoted, so they are underscored (_) instead
  • “||” is PostgreSQL syntax for string concatenation
  • For STRING INTERPOLATION, we use the built-in function “format()”
  • The escape string syntax is backslash (\), but the letter ‘E’ has to come before the opening single quote — see the E'CREATE TABLE …' literal in the function sketch above.
  • RETURN NULL means that this function returns nothing to the caller, which is the TRIGGER.

Step #3 — Create the trigger
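
A sketch of the trigger — the trigger name here is an assumption:

CREATE TRIGGER chat_insert_trigger
BEFORE INSERT ON public.chat_master
FOR EACH ROW EXECUTE PROCEDURE chat_insert_function();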

To put it simply, this trigger fires chat_insert_function() before every insert into chat_master — thereby redirecting each NEW record into its respective partitioned table.

Querying the data

Now that we’ve created a script that dynamically creates partitions when new records get inserted — what about READing from them?

Good question! In order to test this, we must first INSERT a new row into chat_master.

INSERT statement

INSERT INTO public.chat_master
    (program_id, user_id, dialogue, created_at)
VALUES
    ('program_1', 'A01', 'hello world!', '2018-11-11');

Assuming that you followed the previous steps to create the parent table, function, and trigger, you should see the following output (INSERT 0 0 appears because the function returns NULL, so no row is stored in chat_master itself):

NOTICE:  A partition has been created chat_2018_11 
INSERT 0 0

If you refresh your tables, you should see a new one “chat_2018_11” with the same columns as chat_master.

SELECT statement

Now that we have some data stored in our database, we can SELECT it back.

SELECT * FROM chat_master
|id | program_id  |user_id|   dialogue     | created_at
-----------------------------------------------------------------
| 2 | "program_1" | "A01" | "hello world!" | "2018-11-11 00:00:00"

We can also SELECT directly from the partitioned table

SELECT * FROM chat_2018_11
|id | program_id  |user_id|   dialogue     | created_at
-----------------------------------------------------------------
| 2 | "program_1" | "A01" | "hello world!" | "2018-11-11 00:00:00"

Wait a second … isn’t this duplicate data?

Nope! That’s what I thought at first too, but this isn’t the case. When you’re inserting data into chat_master, it’s only writing to the partitioned table. However, when you’re reading from chat_master, it’s grabbing data from all the inherited partitioned tables.

Let me illustrate this concept with the following queries:

INSERT INTO public.chat_master
    (program_id, user_id, dialogue, created_at)
VALUES
    ('program_2', 'A01', 'hello panya!', '2018-12-12');
NOTICE:  A partition has been created chat_2018_12 
INSERT 0 0

SELECT * FROM chat_master
|id | program_id  |user_id|   dialogue     | created_at
-----------------------------------------------------------------
| 2 | "program_1" | "A01" | "hello world!" | "2018-11-11 00:00:00"
| 3 | "program_2" | "A01" | "hello panya!" | "2018-12-12 00:00:00"
SELECT * FROM chat_2018_12
|id | program_id  |user_id|   dialogue     | created_at
-----------------------------------------------------------------
| 3 | "program_2" | "A01" | "hello panya!" | "2018-12-12 00:00:00"

You can also run the following command to verify that chat_master itself is empty — ONLY restricts the query to the parent table, excluding its children:

SELECT * FROM ONLY chat_master

Why Partitioning Matters

Dynamic partitioning is great when you want to automatically split a large table into smaller ones. It pays off when queries filter on the partition key in the WHERE clause, because the planner can skip the partitions that can’t match instead of scanning the entire table.
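
To see this in action, you can ask the planner which tables it will scan. This example isn’t from the original post and assumes the CHECK constraints from the function sketch above:

SET constraint_exclusion = partition;  -- the default; prunes children by CHECK constraints

EXPLAIN SELECT * FROM chat_master
WHERE created_at >= '2018-11-01' AND created_at < '2018-12-01';
-- The plan scans chat_master (empty) and chat_2018_11;
-- chat_2018_12 is excluded by its CHECK constraint.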

Partitioning also turns expensive bulk operations into cheap ones — for example, dropping a partition table instead of running a bulk delete. By implementing dynamic partitioning on large tables, you might see improvements in your database’s query performance, maintainability, and scalability.
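
As a quick illustration, compare removing a month of chat data both ways:

-- Bulk delete: scans the table and removes rows one by one, leaving bloat behind
DELETE FROM chat_master
WHERE created_at >= '2018-11-01' AND created_at < '2018-12-01';

-- Dropping the partition: a near-instant metadata operation
DROP TABLE chat_2018_11;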

Try giving partitioning a shot! If you have another way of doing this, or you run into any problems with the examples above, just drop a comment below to let me know.

Thanks for reading — and please follow me here on Medium for more interesting software engineering articles!



Amazon ElastiCache for Redis adds support for Redis 5.0.3 and the ability to change Redis command names

February 28, 2019: Amazon ElastiCache for Redis adds support for Redis 5.0.3. Redis 5.0.3 is a maintenance release that includes several bug fixes and minor enhancements. You can now stay current with the most up-to-date version of Redis and benefit from recently introduced capabilities such as Redis Streams (made available with the launch of Redis 5.0 on ElastiCache in November 2018).

Amazon DocumentDB (with MongoDB compatibility) now supports new features for aggregations, arrays, and indexing

Amazon DocumentDB (with MongoDB compatibility) is a fast, scalable, highly available, and fully managed document database service that supports MongoDB workloads. Today, Amazon DocumentDB added support for new aggregation string operators ($concat, $substr, $substrBytes, $substrCP, $strcasecmp), an array aggregation operator ($size), an aggregation group accumulator operator ($push), and aggregation stages ($redact and $indexStats) that allow you to compose powerful aggregations over your documents. Additionally, Amazon DocumentDB now supports the positional array operators ($[] and $[<identifier>]) for updating elements in an array and hint() for selecting an index.

To use the new features, you can create a new Amazon DocumentDB cluster with just a few clicks in the Amazon DocumentDB Management Console or you can upgrade your existing cluster.

For more information, see the AWS Database Blog: New Amazon DocumentDB features for aggregations, arrays, and indexing

Amazon Connect Simplifies Adding AWS Lambda Functions to Contact Flows

You can now select any AWS Lambda function available in your AWS account from within the Amazon Connect contact flow editor. AWS Lambda functions are used in contact flows to customize the caller experience by pulling data from an external database, such as a CRM system containing customer information or a database containing order history. Previously, to add AWS Lambda functions into your contact flow, you had to grant permissions to execute AWS Lambda from Amazon Connect using the AWS Command Line Interface (CLI) and then open the AWS Lambda console to copy over the function ARN. Now, you can do both using drop-downs in the UI.

Building a render farm in GCP using OpenCue—new guide available

From rendering photorealistic humans and fantastical worlds that blend seamlessly with live-action photography, to creating stylized characters and environments for animated features, we are in a golden age of computer-generated imagery. It’s no wonder that this work requires more and more processing power, faster networks, and more capable storage to complete each frame of these projects.

As the work necessary to complete each frame in a movie grows in complexity, so does the number of scenes requiring visual effects (VFX) or animation. A blockbuster film’s shot count is now in the thousands, and for an animated feature, every shot requires a multitude of different rendering tasks. In addition, visual content created for streaming services, television, advertisements, and game cinematics increasingly call for visual effects and animation augmentation—much of it at the same level of quality as feature films. The number of projects requiring VFX and animation work is growing rapidly and pushing render requirements to new heights.

Google Cloud Platform (GCP) can help by providing resources to get this work done efficiently and cost-effectively. By using Instance Templates to tailor a Virtual Machine (VM) to the resource requirements of each individual frame or task, you optimize your spend by right-sizing your VMs. Managed Instance Groups (MIGs) can then scale the number of VMs created from these templates to match the number of tasks you need to render. When processing is complete, simply shut down the associated resources so you pay only for what you use, when you use it.

But how does one orchestrate the distribution of the multitude of rendering tasks required for an individual film, much less the group of films larger studios work on concurrently?

For a long time, studios have carried the cost of building their own render management tools, or used a third party software provider to help solve this problem. There is now another option. In collaboration with Sony Pictures Imageworks, Google recently released OpenCue, an open source, high-performance render manager built specifically for the needs of the visual effects and animation industry. OpenCue can be run in a variety of ways, and it’s capable of managing resources that are exclusively on-premise, entirely in the cloud, or spanning both in a hybrid environment.

Today, we’re announcing a new solution: Building a render farm in GCP using OpenCue. This tutorial guides you through deploying OpenCue, and all the resources required, to build a render farm in GCP. It explores a workflow for creating and deploying all prerequisite software as Docker images, as well as managing the size and scale of compute resources through Instance Templates and MIGs. It also provides an overview of the OpenCue interface as you manage rendering an animation scene from start to finish.

We hope you find this guide useful. Please tell us what you think and be sure to sign up for a trial at no cost to explore building a render farm in GCP using OpenCue.