By Rob Giseburt May 24, 2023
Are you looking at your stale, flat relational database and thinking: “It’s so rigid and lifeless!” Do you need polymorphism? Do you have a NoSQL database that has no obvious shape and is causing you lack-of-relationship nightmares? With JSON in PostgreSQL, you can have your cake and relationships too!
What Problem Are We Trying to Solve?
I’d like to explain the decades-old issue we’re working on here.
Let’s take a look at a fairly common use case: a Content Management System (CMS) for an online publication. The publication has articles or stories, which contain references to a mix of Copy, Images, Videos, and Ads. People are also involved in the Story – some involved in the making of the Story (author, editor, photographer, etc), some are referenced in the Story. Add in Places, Things, Events, Products, and other references you don’t even know about yet, and it gets pretty complicated.
Each of these entities have different but overlapping content types, metadata types, and references. Essentially, each entity has its own schema. To make matters worse, you may have new entities being discovered and added over time – like shooting scripts, image rights, embedded ad copy, etc.
The final straw is that you have been tasked with the impossible job of adding a change-log to your database that tracks certain changes of certain entities, keeping both the old and new data, and consolidating it into a single, unified stream to inform downstream systems.
This post will explain how to build just such a beast and we’ll do it using PostgreSQL’s powerful mix of relational and JSON capabilities!
When building a system handling complex data, technology decisions boil down to finding the right balance of flexibility, performance, and maintainability. For this article, we’ll focus on the database storing the data behind all of this.
Relational databases are the poster child for rigid data schemas. Costs in performance, storage usage, and maintainability rise exponentially for every degree-of-freedom added to the dataset.
Representing complex, extensible data in a relational database is a decades-old problem and many solutions have been used:
- Adding tables – adding one or more new tables for each new entity. This gets cumbersome, and usually leads to frequent database schema upgrades.
- Adding columns / sparsing – adding additional columns to tables for specific cases. This makes a table row the union of all possible attributes for different types of entities. The data is spread thin and has to be gathered up knowing the type of entity the row represents. This pushes responsibility for data integrity and other constraints up to the application. These guarantees must be provided by the code writing to the database, compensated for in anything reading from the database, or both.
- Column overloading – providing columns where the value is interpreted differently based on context. For example, using a text or binary column and storing data that may be a number, boolean, text, or even a deep data structure such as JSON. This is a terrible idea, but people do it. This also requires deep knowledge of the dataset, and has no data shape or referential integrity guarantees.
- External data or opaque data – where the data that’s flexible is housed in a completely different system, such as in a file or other database system, or perhaps as an opaque “blob” within the database. This can have performance impact, and also lacks data quality guarantees.
- Many-to-many relationships – are another sticking point for relational databases. We’ll come back to this later.
NoSQL databases solve some of these problems – but not all:
- NoSQL databases are often JSON-document based with one of a variety of query mechanisms. Most NoSQL databases are schema-less, and without concept of relationships. These often amount to a folder of partially indexed JSON documents with a REST-like interface.
- While NoSQL databases have their uses, relationships and data integrity are left up to the consuming code, and often end up being combined with indexing and caching services, and the data shape integrity is left up to the code using the database.
Looking at PostgreSQL: Why JSON?
Most NoSQL databases use JSON for its flexibility and expressivity. However, a JSON document on its own is a flat, two-dimensional tree without any linkage to other documents. Without structure enforcement, managing large amounts of this data can quickly become a burden that’s left up to the code (or people) using the database.
Contrast that to very rigid, row-and-column relational database tables that can, through references, become highly dimensional data. Since the data in columns is traditionally purely scalar, it’s effectively a tree of all leaves. The rigidity of relational databases provides intelligent indexing, storage, constraints providing the structural integrity of data and references, uniqueness guarantees, and prevention of losing essential data.
Now we can combine the two to get the best parts of both!
PostgreSQL Support for JSON
PostgreSQL added JSON support in version 9.2 and has greatly expanded this capability since. AWS Aurora Serverless (PostgreSQL-flavor) uses PostgreSQL 10 protocol and language. Unfortunately that doesn’t include the newer SQL/JSON Path that arrived in PostgreSQL 12, so plan accordingly. Nothing that can be done with SQL/JSON Path can’t be done without it, but it is more convenient.
For better efficiency in usage and space, instead of storing strings and parsing them at the time of usage in
json columns, one should use
jsonb (JSON-binary) columns. To quote the docs:
The JSON data type stores an exact copy of the input text, which processing functions must reparse on each execution; while JSONb data is stored in a decomposed binary format that makes it slightly slower to input due to added conversion overhead, but significantly faster to process since no reparsing is needed. JSONb also supports indexing, which can be a significant advantage.
JSONb, due to the additional indexing, also has the operators needed to accomplish the tasks described below, whereas
json columns don’t. In every case where I say “JSON data,” I mean JSON data stored in a
jsonb column type.
One of the most powerful features of using JSON in a database column is the ability to express many-to-many relationships all at once, along with any additional data about those relationships. Options for many-to-many relations include: in order of more-to-less traditional:
- Junction tables – in classic relational DBs it’s common to use associative tables, or more formally “Associative Entities“. This is a table with columns of primary key references to join two or more other tables. Additional metadata about the relationship can also be held in additional columns.
- These index well, support constraint checks, and are a well-known design pattern. But they can be a lot of work to maintain.
- Since the data is stored across multiple tables, it takes multiple statements to freeze or thaw a dataset complete with relationships.
- Array columns – allowing a single column to contain a variable-length list of same-typed values. These can mimic an associative table, but joining is asymmetric, as it’s easier to join from an array column than it is to join to one. Array columns are supported in PostgreSQL, but not in many other databases.
- Arrays index using the same GIN index type as
jsonbcolumns, and cannot contain additional metadata with each value, as you can with JSON. (Note: It is possible to make a
jsonbcolumn, but we don’t recommend it, as there’s no value over a
jsonbcolumn that contains an array.)
- Arrays index using the same GIN index type as
- JSON – allows a complete tree of relationships to be expressed, along with relevant metadata about those relationships. This is the pattern we use in this post.
- This is not as performant as using an Associative Table for joins mostly due to differences in indexing, but this allows a detailed relationship to be expressed with metadata in a single freeze or thaw statement.
- Similar to arrays, joining performance is asymmetric.
- Graph databases – represent relationships as specific “first class entities”. Until PostgreSQL grows graph capabilities we compromise by using JSON.