Yoran Heling
projects@yorhel.nl
home - git - @ayo
= donate = paypal
= pgp = only used for releases
key - mit
7446 0D32 B808 10EB A9AF A2E9 6239 4C69 8C27 39FA

Cute decorative scissors, cutting through your code.

From SQL to Nested Data Structures

(Published on 2019-08-13)

This is the most typical problem you'll find in every application that talks with an SQL database: You want to fetch a listing of some kind of database entry, but the entries themselves may contain nested structured data.

To demonstrate what I mean, let's start with a simple example of a game database1: We have a table with rows for each top-level game entry and a table for releases of said games. There a one-to-many relation between games and releases.

CREATE TABLE games (
    game_id  integer PRIMARY KEY,
    title    text
);

CREATE TABLE releases (
    release_id    integer PRIMARY KEY,
    game_id       integer REFERENCES games,
    release_date  date
);

Now suppose we want to display a listing of games with their release dates. It would be nice if the database could give us a single JSON document with the following structure:

[
    {
        "game_id": 7,
        "title": "Tsukihime",
        "releases": [
            { "release_id": 339, "release_date": "2000-12-29" },
            { "release_id": 341, "release_date": "2006-12-28" }
        ]
    },
    { ... }
]

But relational databases are designed to only deal with rows and columns, so getting that JSON structure with a single query is not going to happen. Some databases do support structured column types, but dynamically constructing those can be unwieldy and/or slow.

A common and simple approach to fetch nested data is to first fetch the list of games from the database, then iterate over the resulting rows and fetch the relevant release entries at each iteration. I'll be using Perl 5 and DBIx::Simple in this article, but the ideas should translate to any other dynamic programming language. Here's an example of such iteration:

my @games = $db->query('SELECT game_id, title FROM games')->hashes;

for my $game (@games) {
    $game->{releases} = [
        $db->query(
            'SELECT release_id, release_date FROM releases WHERE game_id = ?',
            $game->{game_id}
        )->hashes
    ];
}
# `@games` now has the requested structure.

This approach works, but it's not super convenient and it can be really slow when fetching many rows. Of course, this is not a new problem. Almost every programming language has several libraries and frameworks available to it to help with fetching complex data structures from SQL databases, ranging from a few helper functions to complete ORMs. What I propose here is not a groundbreaking new technique, nor am I going to pitch a particular library. Instead, I describe a simple yet generic API to deal with the problem.

Constructing SQL queries in Perl

But before I get to the solution, let me first briefly cover how one can dynamically construct and compose SQL queries. Having such an abstraction is useful in general, but these concepts will also simplify the rest of this article.

SQL queries are treated as strings in most programming languages and one can dynamically construct and compose queries by simple string concatenation. There are several problems with actually constructing queries that way, but, for our purposes, it almost suffices. Almost. The biggest problem in terms of composability is that, not only do you need to concatenate strings to get the final query, you also need to keep track of bind parameters. And that can get annoying real quick when your queries are dynamic.

Here's a simple example of using string interpolation to abstract common functionality into a separate function, without losing much on flexibility:

# Our abstraction to fetch game entries.
sub fetch_games {
    my $game_ids = shift;
    $db->query("SELECT * FROM games WHERE game_id IN($game_ids)");
    # We could be doing some transformations here.
    # Sorting, pagination, fetching extra information, etc.
}

# We can use that function to fetch games that have already been released.
my $ids = "SELECT game_id FROM releases WHERE release_date <= CURRENT_DATE";
my $released_games = fetch_games($ids);

But what if we wanted to replace that CURRENT_DATE with a user-provided value? Doing that without fear of arbitrary SQL injection involves using a bind parameter, which our fetch_games() function doesn't support. Fortunately, there's a solution. Or rather, there are many solutions, but let's stick with one I'm familiar with: SQL::Interp. That module provides, among other things, a wonderful function called sql() that takes a list of SQL strings and variable references (which are converted into bind parameters) and returns a single value that can be interpolated elsewhere. Let's use it to transform the above example:

use SQL::Inerp 'sql';

sub fetch_games {
    my $game_ids = shift;
    # Note the use of `iquery` instead of `query`,
    # this is a DBIx::Simple wrapper around `sql_interp()`.
    $db->iquery("SELECT * FROM games WHERE game_id IN(", $game_ids, ")");
}

my $latest_date = '2004-01-20'; # Our user input

# This is our "SQL + bind parameters" in a single variable.
my $ids = sql("SELECT game_id FROM releases WHERE release_date <=", \$latest_date);

# ...which can be passed to fetch_games():
my $games_with_release_before_latest_date = fetch_games($ids);

That gives us safe variable interpolation into our SQL queries and we can compose queries as if they were normal strings2. You can do a lot more powerful query construction tricks with SQL::Interp, but I'll not cover that here. Its documentation has several examples and there are hints at even more fun tricks in the SQL::Interpolate docs (which is a different and older module, but those examples are easily ported).

Structured documents

With that out of the way, we can go back to fetching nested structures from a relational database. Let's build upon the iteration example given earlier and improve its performance. Instead of running a separate query to fetch the releases for each game, let's fetch the releases of all the relevant games in a single query. We select the necessary rows with an IN(id1,id2,,...) clause and then merge the returned data back into the @games structure, as follows:

my @games = $db->query('SELECT game_id, title FROM games')->hashes;

# List of game_ids, which we can use for our IN() clause.
my @game_ids = map $_->{game_id}, @games;

# Fetch all the releases linked to @game_ids.
my @releases = $db->iquery(
    'SELECT game_id, release_id, release_date FROM releases WHERE game_id IN',
    \@game_ids
)->hashes;

# Create a 'game_id' => [releases] lookup table for quick access
my %releases;
for my $release (@releases) {
    # Add this release to %releases and remove the 'game_id' column,
    # which was only needed for this merging step.
    push @{$releases{ delete $release->{game_id} }}, $release;
}

# Now merge the release information back into @games
for my $game (@games) {
    $game->{releases} = $releases{ $game->{game_id} } || [];
}
# `@games` now has the requested structure.

It works and it's beautifully fast, but it's also verbose and not something you'd like to type out every time you want to query the database. Okay, I admit, it's more verbose than necessary because I wanted this code to be readable by people who aren't experts in Perl. It can be done with less code but the number of steps won't change much. Rather than playing code golf, let's do what every sensible programmer does when they notice repetition: Let's abstract the repeated bits into a separate function.

Let's first define the inputs of the function, we'll need:

The function would return a modified list of objects with the new information embedded inside. Since we'll be passing the objects to the function as a reference and since it's easier to modify data in-place in Perl, that's what we'll be doing. In other (more functional) languages, it may be easier to avoid mutating the given objects and return a list of fresh new objects instead.

Here's what that function looks like. It does exactly the same as the code listed above3, so I figure I can get away with a shorter and slightly less readable version this time:

sub enrich {
    my($field_name, $merge_field, $sql, @objects) = @_;
    my %ids = map +($_->{$merge_field}, []), @objects;
    return if !keys %ids;
    my @result = $db->iquery( $sql->([keys %ids]) )->hashes;
    push @{$ids{ delete $_->{$merge_field} }}, $_ for (@result);
    $_->{ $field_name } = $ids{ $_->{$merge_field} } for (@objects);
}

And here's how we can rewrite the previous example by using that function:

my @games = $db->query('SELECT game_id, title FROM games')->hashes;

enrich 'releases', 'game_id', sub {
    sql 'SELECT game_id, release_id, release_date FROM releases WHERE game_id IN', $_[0]
}, @games;

# `@games` now has the requested structure.

It's short, fast and readable. At least, it is once you get used to the enrich function - I admit it may not be very intuitive the first time you see it.

More nesting

Let's expand our example database with a new table. Each release may be available for one or more platforms:

CREATE TABLE releases_platforms (
    release_id  integer REFERENCES releases,
    platform    text
);

We also want that list of platforms to be included in our nested data structure, so that it looks like this:

[
    {
        "game_id": 7,
        "title": "Tsukihime",
        "releases": [
            {
                "release_id": 339,
                "release_date": "2000-12-29",
                "platforms": [
                    { "platform": "Linux" },
                    { "platform": "Windows" }
                ]
            },
            { ... }
        ]
    },
    { ... }
]

Fortunately, the enrich function can easily handle nested objects: all we have to do is pass a list of release objects to enrich instead of game objects, and we can do that with a simple map:

enrich 'platforms', 'release_id', sub {
    sql 'SELECT release_id, platform FROM releases_platforms WHERE release_id IN', $_[0]
}, map @{$_->{releases}}, @games;

That's it, we now have the data structure as described above. It is possible to nest arbitrarily deep with this approach.

This works because enrich modifies the data structure in-place. A version that copies the given data structure would be slightly more complex, but that's a surmountable problem.

Variations

That array of platform objects in the previous example is more verbose than we'd like. It would be nicer if we could flatten that to an array of platform strings, so that a release object looks more like this:

{
    "release_id": 339,
    "release_date": "2000-12-29",
    "platforms": [ "Linux", "Windows" ]
}

Of course, such a variation of enrich could be written quite easily:

sub enrich_flatten {
    # This is the same as the original enrich()
    my($field_name, $merge_field, $sql, @objects) = @_;
    my %ids = map +($_->{$merge_field}, []), @objects;
    return if !keys %ids;
    my @result = $db->iquery( $sql->([keys %ids]) )->hashes;

    # This is the actual merge strategy, which we'll modify slightly:
    push @{$ids{ delete $_->{$merge_field} }}, values %$_ for (@result);
    $_->{ $field_name } = $ids{ $_->{$merge_field} } for (@objects);
}

Usage is exactly the same:

enrich_flatten 'platforms', 'release_id', sub {
    sql 'SELECT release_id, platform FROM releases_platforms WHERE release_id IN', $_[0]
}, map @{$_->{releases}}, @games;

Other merge strategies could be implemented in the same way. For example, one could imagine a scenario where, instead of a one-to-many relation as in the previous examples, there is a one-to-one mapping between the list of objects and the query results. An alternative merge strategy in that case could be to copy the columns of the new query into the already existing object. You could achieve the exact same effect in the original SQL query by adding a plain old JOIN, but dynamic application-level joins could still have their uses in abstracting common functionality or structuring a larger code base.

This version of enrich has a limitation that the $merge_field is used to identify both the field in the original objects and the column in the query results. This is fine in the example database - I deliberately used full game_id and release_id column names rather than calling them just id - but there will no doubt come a time when this limitation starts to get annoying and you'll want to support different column names for object and result queries. It's easy to support that, in any case.

A more complex example

This enrich function is cute and all, but the examples I've given so far can be implemented just as easily with your average ORM. Yet there is an important difference (apart from having one less dependency): enrich provides for easier ad-hoc flexibility, meaning: you can use the full power of SQL in the enrichment queries.

Let's add some joins and filters to our example. Let's say we want the following structure:

[
    {
        "game_id": 7,
        "title": "Tsukihime",
        "platforms": [ "Windows", "Linux" ],
        "releases": [
            {
                "release_id": 339,
                "release_date": "2000-12-29",
                "platform_count": 2
            }
        ]
    },
    { ... }
]

That is, we want to have a per-game "platforms" array rather than per-release, and instead of the platform list for releases, we just want a single "platform_count". Let's say we want the list of releases ordered by release date and we also want to be able to filter on the release date. Here's a full example to achieve all of that:

my @games = $db->query('SELECT game_id, title FROM games')->hashes;

# This is the per-game platforms list.
enrich_flatten 'platforms', 'game_id', sub {
    sql 'SELECT DISTINCT game_id, platform
           FROM releases
           JOIN releases_platforms USING (release_id)
          WHERE game_id IN', $_[0]
}, @games;

my $latest_date = '2004-01-20'; # Our user input

# This is the releases list.
enrich 'releases', 'game_id', sub {
    sql 'SELECT game_id, release_id, release_date,
                (SELECT COUNT(*) FROM releases_platforms rp
                  WHERE rp.release_id = r.release_id) AS platform_count
           FROM releases r
          WHERE game_id IN', $_[0], '
            AND release_date <=', \$latest_date, '
          ORDER BY release_date'
}, @games;

It's not the most elegant piece of code ever written, but it does the trick. The alternatives aren't likely to be much better.

Final notes

I've been using this approach - with a slightly different API - for the experimental rewrite of VNDB.org and I'm happy with the result. The main advantage of the enrich family of functions is that you can easily and efficiently run complex queries and merge the results to construct complex JSON documents - ad-hoc and without any boilerplate, as there's no need to define, declare or even name your data structure up front. This sometimes makes it harder to follow the code, as you may have to visualize a monstrously complex data structure in your mind, but this approach can be incredibly helpful for quick prototyping. Good documentation and/or a good debugging strategy can make up for lack of boilerplate.

The exact API for the enrich() functions is something I've not fully decided on yet. I've not put that much thought into the order of its arguments, the need for "variations" for different scenarios could be reduced by adding the merge strategy as a separate argument, at which point named arguments may be necessary to keep things readable, etc. The exact API doesn't matter all that much, it's the idea that matters.

As I mentioned in the introduction: while I've been using Perl for these examples, I'm convinced that this idea can be easily ported to other dynamic languages. I did in fact experiment with porting this idea to Elixir and it was every bit as convenient, in a way that Ecto's built-in ORM functionality wasn't.

It's quite possible that this abstraction is already available in some languages and libraries and that I've just missed it. It's also possible that this is, in fact, a terrible idea in the long run. Only way to find out is to give it a try. On that note, if you feel like playing around with the examples in this article, I wrote a little script to test that the examples actually work, which also makes for a good base for quick experimentation.


  1. This is a rather simplified schema inspired by VNDB.org - the project that made me experiment with all this.

  2. I'm sure that at this point you'd call me out as a madman for suggesting that treating SQL as strings is an acceptable strategy for constructing dynamic queries, and I'd actually be sympathetic to that view - I'm well aware that it's rather brittle. So now you're going to try and convince me that your favorite query builder library is the ultimate solution to this problem and that everyone should use it. Except it's not available for Perl and/or it has severe limitations and/or it has so much boilerplate that it's an utter waste of time for quick prototyping. I've played around with query builders in various languages, but the only one I actually kinda liked was Elixir's Ecto. But alas, I'm stuck with Perl for now, and string interpolation isn't all that bad.

  3. I lied, this version does two more things: It removes duplicate identifiers from the IN clause and does an early return if there are no identifiers to fetch.