From SQL to Nested Data Structures
(Published on 2019-08-13)
This is the most typical problem you’ll find in every application that talks with an SQL database: You want to fetch a listing of some kind of database entry, but the entries themselves may contain nested structured data.
To demonstrate what I mean, let’s start with a simple example of a game database1: We have a table with rows for each top-level game entry and a table for releases of said games. There is a one-to-many relation between games and releases.
CREATE TABLE games (
integer PRIMARY KEY,
game_id
title text
);
CREATE TABLE releases (
integer PRIMARY KEY,
release_id integer REFERENCES games,
game_id date
release_date );
Now suppose we want to display a listing of games with their release dates. It would be nice if the database could give us a single JSON document with the following structure:
[
{
"game_id": 7,
"title": "Tsukihime",
"releases": [
{ "release_id": 339, "release_date": "2000-12-29" },
{ "release_id": 341, "release_date": "2006-12-28" }
]
},
{ ... }
]
But relational databases are designed to only deal with rows and columns, so getting that JSON structure with a single query is not going to happen. Some databases do support structured column types, but dynamically constructing those can be unwieldy and/or slow.
A common and simple approach to fetch nested data is to first fetch the list of games from the database, then iterate over the resulting rows and fetch the relevant release entries at each iteration. I’ll be using Perl 5 and DBIx::Simple in this article, but the ideas should translate to any other dynamic programming language. Here’s an example of such iteration:
my @games = $db->query('SELECT game_id, title FROM games')->hashes;
for my $game (@games) {
$game->{releases} = [
$db->query(
'SELECT release_id, release_date FROM releases WHERE game_id = ?',
$game->{game_id}
)->hashes
];
}# `@games` now has the requested structure.
This approach works, but it’s not super convenient and it can be really slow when fetching many rows. Of course, this is not a new problem. Almost every programming language has several libraries and frameworks available to it to help with fetching complex data structures from SQL databases, ranging from a few helper functions to complete ORMs. What I propose here is not a groundbreaking new technique, nor am I going to pitch a particular library. Instead, I describe a simple yet generic API to deal with the problem.
Constructing SQL queries in Perl
But before I get to the solution, let me first briefly cover how one can dynamically construct and compose SQL queries. Having such an abstraction is useful in general, but these concepts will also simplify the rest of this article.
SQL queries are treated as strings in most programming languages and one can dynamically construct and compose queries by simple string concatenation. There are several problems with actually constructing queries that way, but, for our purposes, it almost suffices. Almost. The biggest problem in terms of composability is that, not only do you need to concatenate strings to get the final query, you also need to keep track of bind parameters. And that can get annoying real quick when your queries are dynamic.
Here’s a simple example of using string interpolation to abstract common functionality into a separate function, without losing much on flexibility:
# Our abstraction to fetch game entries.
sub fetch_games {
my $game_ids = shift;
$db->query("SELECT * FROM games WHERE game_id IN($game_ids)");
# We could be doing some transformations here.
# Sorting, pagination, fetching extra information, etc.
}
# We can use that function to fetch games that have already been released.
my $ids = "SELECT game_id FROM releases WHERE release_date <= CURRENT_DATE";
my $released_games = fetch_games($ids);
But what if we wanted to replace that CURRENT_DATE
with
a user-provided value? Doing that without fear of arbitrary SQL
injection involves using a bind parameter, which our
fetch_games()
function doesn’t support. Fortunately,
there’s a solution. Or rather, there are many solutions, but let’s stick
with one I’m familiar with: SQL::Interp. That module
provides, among other things, a wonderful function called
sql()
that takes a list of SQL strings and variable
references (which are converted into bind parameters) and returns a
single value that can be interpolated elsewhere. Let’s use it to
transform the above example:
use SQL::Inerp 'sql';
sub fetch_games {
my $game_ids = shift;
# Note the use of `iquery` instead of `query`,
# this is a DBIx::Simple wrapper around `sql_interp()`.
$db->iquery("SELECT * FROM games WHERE game_id IN(", $game_ids, ")");
}
my $latest_date = '2004-01-20'; # Our user input
# This is our "SQL + bind parameters" in a single variable.
my $ids = sql("SELECT game_id FROM releases WHERE release_date <=", \$latest_date);
# ...which can be passed to fetch_games():
my $games_with_release_before_latest_date = fetch_games($ids);
That gives us safe variable interpolation into our SQL queries and we can compose queries as if they were normal strings2. You can do a lot more powerful query construction tricks with SQL::Interp, but I’ll not cover that here. Its documentation has several examples and there are hints at even more fun tricks in the SQL::Interpolate docs (which is a different and older module, but those examples are easily ported).
Structured documents
With that out of the way, we can go back to fetching nested
structures from a relational database. Let’s build upon the iteration
example given earlier and improve its performance. Instead of running a
separate query to fetch the releases for each game, let’s fetch the
releases of all the relevant games in a single query. We select the
necessary rows with an IN(id1,id2,,...)
clause and then
merge the returned data back into the @games
structure, as
follows:
my @games = $db->query('SELECT game_id, title FROM games')->hashes;
# List of game_ids, which we can use for our IN() clause.
my @game_ids = map $_->{game_id}, @games;
# Fetch all the releases linked to @game_ids.
my @releases = $db->iquery(
'SELECT game_id, release_id, release_date FROM releases WHERE game_id IN',
@game_ids
\
)->hashes;
# Create a 'game_id' => [releases] lookup table for quick access
my %releases;
for my $release (@releases) {
# Add this release to %releases and remove the 'game_id' column,
# which was only needed for this merging step.
push @{$releases{ delete $release->{game_id} }}, $release;
}
# Now merge the release information back into @games
for my $game (@games) {
$game->{releases} = $releases{ $game->{game_id} } || [];
}# `@games` now has the requested structure.
It works and it’s beautifully fast, but it’s also verbose and not something you’d like to type out every time you want to query the database. Okay, I admit, it’s more verbose than necessary because I wanted this code to be readable by people who aren’t experts in Perl. It can be done with less code but the number of steps won’t change much. Rather than playing code golf, let’s do what every sensible programmer does when they notice repetition: Let’s abstract the repeated bits into a separate function.
Let’s first define the inputs of the function, we’ll need:
@games
: The list of objects that we want to extend."releases"
: The name of the field we are adding to the game objects."game_id"
: The column used for querying and merging.- A function that, given a list of game ids, returns a query for fetching the release information.
The function would return a modified list of objects with the new information embedded inside. Since we’ll be passing the objects to the function as a reference and since it’s easier to modify data in-place in Perl, that’s what we’ll be doing. In other (more functional) languages, it may be easier to avoid mutating the given objects and return a list of fresh new objects instead.
Here’s what that function looks like. It does exactly the same as the code listed above3, so I figure I can get away with a shorter and slightly less readable version this time:
sub enrich {
my($field_name, $merge_field, $sql, @objects) = @_;
my %ids = map +($_->{$merge_field}, []), @objects;
return if !keys %ids;
my @result = $db->iquery( $sql->([keys %ids]) )->hashes;
push @{$ids{ delete $_->{$merge_field} }}, $_ for (@result);
$_->{ $field_name } = $ids{ $_->{$merge_field} } for (@objects);
}
And here’s how we can rewrite the previous example by using that function:
my @games = $db->query('SELECT game_id, title FROM games')->hashes;
'releases', 'game_id', sub {
enrich 'SELECT game_id, release_id, release_date FROM releases WHERE game_id IN', $_[0]
sql @games;
},
# `@games` now has the requested structure.
It’s short, fast and readable. At least, it is once you get used to
the enrich
function - I admit it may not be very intuitive
the first time you see it.
More nesting
Let’s expand our example database with a new table. Each release may be available for one or more platforms:
CREATE TABLE releases_platforms (
integer REFERENCES releases,
release_id
platform text );
We also want that list of platforms to be included in our nested data structure, so that it looks like this:
[
{
"game_id": 7,
"title": "Tsukihime",
"releases": [
{
"release_id": 339,
"release_date": "2000-12-29",
"platforms": [
{ "platform": "Linux" },
{ "platform": "Windows" }
]
},
{ ... }
]
},
{ ... }
]
Fortunately, the enrich
function can easily handle
nested objects: all we have to do is pass a list of release
objects to enrich
instead of game objects, and we
can do that with a simple map
:
'platforms', 'release_id', sub {
enrich 'SELECT release_id, platform FROM releases_platforms WHERE release_id IN', $_[0]
sql map @{$_->{releases}}, @games; },
That’s it, we now have the data structure as described above. It is possible to nest arbitrarily deep with this approach.
This works because enrich
modifies the data structure
in-place. A version that copies the given data structure would be
slightly more complex, but that’s a surmountable problem.
Variations
That array of platform
objects in the previous example
is more verbose than we’d like. It would be nicer if we could flatten
that to an array of platform strings, so that a release object looks
more like this:
{
"release_id": 339,
"release_date": "2000-12-29",
"platforms": [ "Linux", "Windows" ]
}
Of course, such a variation of enrich
could be written
quite easily:
sub enrich_flatten {
# This is the same as the original enrich()
my($field_name, $merge_field, $sql, @objects) = @_;
my %ids = map +($_->{$merge_field}, []), @objects;
return if !keys %ids;
my @result = $db->iquery( $sql->([keys %ids]) )->hashes;
# This is the actual merge strategy, which we'll modify slightly:
push @{$ids{ delete $_->{$merge_field} }}, values %$_ for (@result);
$_->{ $field_name } = $ids{ $_->{$merge_field} } for (@objects);
}
Usage is exactly the same:
'platforms', 'release_id', sub {
enrich_flatten 'SELECT release_id, platform FROM releases_platforms WHERE release_id IN', $_[0]
sql map @{$_->{releases}}, @games; },
Other merge strategies could be implemented in the same way. For
example, one could imagine a scenario where, instead of a one-to-many
relation as in the previous examples, there is a one-to-one mapping
between the list of objects and the query results. An alternative merge
strategy in that case could be to copy the columns of the new query into
the already existing object. You could achieve the exact same effect in
the original SQL query by adding a plain old JOIN
, but
dynamic application-level joins could still have their uses in
abstracting common functionality or structuring a larger code base.
This version of enrich
has a limitation that the
$merge_field
is used to identify both the field in the
original objects and the column in the query results. This is fine in
the example database - I deliberately used full game_id and
release_id column names rather than calling them just
id - but there will no doubt come a time when this limitation
starts to get annoying and you’ll want to support different column names
for object and result queries. It’s easy to support that, in any
case.
A more complex example
This enrich
function is cute and all, but the examples
I’ve given so far can be implemented just as easily with your average
ORM. Yet there is an important difference (apart from having one less
dependency): enrich
provides for easier ad-hoc flexibility,
meaning: you can use the full power of SQL in the enrichment
queries.
Let’s add some joins and filters to our example. Let’s say we want the following structure:
[
{
"game_id": 7,
"title": "Tsukihime",
"platforms": [ "Windows", "Linux" ],
"releases": [
{
"release_id": 339,
"release_date": "2000-12-29",
"platform_count": 2
}
]
},
{ ... }
]
That is, we want to have a per-game “platforms” array rather than per-release, and instead of the platform list for releases, we just want a single “platform_count”. Let’s say we want the list of releases ordered by release date and we also want to be able to filter on the release date. Here’s a full example to achieve all of that:
my @games = $db->query('SELECT game_id, title FROM games')->hashes;
# This is the per-game platforms list.
'platforms', 'game_id', sub {
enrich_flatten 'SELECT DISTINCT game_id, platform
sql FROM releases
JOIN releases_platforms USING (release_id)
WHERE game_id IN', $_[0]
@games;
},
my $latest_date = '2004-01-20'; # Our user input
# This is the releases list.
'releases', 'game_id', sub {
enrich 'SELECT game_id, release_id, release_date,
sql (SELECT COUNT(*) FROM releases_platforms rp
WHERE rp.release_id = r.release_id) AS platform_count
FROM releases r
WHERE game_id IN', $_[0], '
AND release_date <=', \$latest_date, '
ORDER BY release_date'
@games; },
It’s not the most elegant piece of code ever written, but it does the trick. The alternatives aren’t likely to be much better.
Final notes
I’ve been using this approach - with a slightly different API - for
the experimental rewrite of VNDB.org and
I’m happy with the result. The main advantage of the enrich
family of functions is that you can easily and efficiently run complex
queries and merge the results to construct complex JSON documents -
ad-hoc and without any boilerplate, as there’s no need to define,
declare or even name your data structure up front. This sometimes makes
it harder to follow the code, as you may have to visualize a monstrously
complex data structure in your mind, but this approach can be incredibly
helpful for quick prototyping. Good documentation and/or a good
debugging strategy can make up for lack of boilerplate.
The exact API for the enrich()
functions is something
I’ve not fully decided on yet. I’ve not put that much thought into the
order of its arguments, the need for “variations” for different
scenarios could be reduced by adding the merge strategy as a separate
argument, at which point named arguments may be necessary to keep things
readable, etc. The exact API doesn’t matter all that much, it’s the idea
that matters.
As I mentioned in the introduction: while I’ve been using Perl for these examples, I’m convinced that this idea can be easily ported to other dynamic languages. I did in fact experiment with porting this idea to Elixir and it was every bit as convenient, in a way that Ecto’s built-in ORM functionality wasn’t.
It’s quite possible that this abstraction is already available in some languages and libraries and that I’ve just missed it. It’s also possible that this is, in fact, a terrible idea in the long run. Only way to find out is to give it a try. On that note, if you feel like playing around with the examples in this article, I wrote a little script ⓘ to test that the examples actually work, which also makes for a good base for quick experimentation.
This is a rather simplified schema inspired by VNDB.org - the project that made me experiment with all this.↩︎
I’m sure that at this point you’d call me out as a madman for suggesting that treating SQL as strings is an acceptable strategy for constructing dynamic queries, and I’d actually be sympathetic to that view - I’m well aware that it’s rather brittle. So now you’re going to try and convince me that your favorite query builder library is the ultimate solution to this problem and that everyone should use it. Except it’s not available for Perl and/or it has severe limitations and/or it has so much boilerplate that it’s an utter waste of time for quick prototyping. I’ve played around with query builders in various languages, but the only one I actually kinda liked was Elixir’s Ecto. But alas, I’m stuck with Perl for now, and string interpolation isn’t all that bad.↩︎
I lied, this version does two more things: It removes duplicate identifiers from the
IN
clause and does an early return if there are no identifiers to fetch.↩︎