Fast Key Lookup with a Small Read-Only Database

(Published on 2019-05-14)

I was recently thinking about how to help users on VNDB.org improve their password security. One idea is to show a warning when someone’s password is already in some kind of database. While there are several online APIs we can query for this purpose, I chose to ignore these (apart from the potential privacy implications, I’d also rather avoid dependencies on external services when possible). As an alternative, there are several password dictionaries available for download that can be used for offline querying. I decided to go with the free CrackStation password dictionary.

The dictionary comes in the form of a compressed text file with one password per line. This is a great format to keep the size of the dictionary small, but it’s a really shitty format for fast querying. The obvious solution is to import the dictionary into some database (any RDBMS or NoSQL store of your choice), but those tend to be relatively heavyweight solutions in terms of code base, operational dependencies, database size or all of the above. Generic databases are typically optimized to support (moderately) fast inserts, updates and deletes, which means they often can’t pack data very tightly. Some databases do support compression, though.

So all I need is a small, compressed database that supports only a single operation: check if a key (password, in this case) is present. Surely I could quickly hack together a simple database that provides a good size/performance ratio?

My Little B-tree

I had some misgivings about implementing a B-tree. It may be a relatively simple data structure, but I’ve written enough off-by-one errors in my life to be wary of actually implementing one, and given this use case, I felt that there had to be an even simpler solution. After thinking up and rejecting some alternative strategies and after realizing that I really only needed to support two operations - bulk data insertion and exact key lookup - I came to the conclusion that a B-tree isn’t such a bad idea after all. One major advantage of B-trees is that they provide a natural solution to split up the data into fixed-sized blocks, and that is a very useful property if we want to use compression.

The database format I came up with is basically a concatenation of compressed blocks. Blocks come in two forms: leaf blocks and, uh, let’s call them index blocks.

The leaf blocks contain all the password strings. I initially tried to encode the passwords as a concatenated sequence of length-prefixed strings, but I could not find a way to quickly parse and iterate through that data in Perl 5 (the target language for this little project[1]). As it turns out, doing a substring match on the full leaf block is faster than trying to work with unpack and substr, even if a naive substring match can’t take advantage of the string ordering to skip over data when it has found a “larger” key. So I made sure that the leaf block starts with a null byte and encoded the passwords as a sequence of null-terminated strings. That way we can reliably find the key by doing a substring search on "\x00${key}\x00".
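As a rough sketch of that encoding (the variable names here are mine, not from the actual code):

# Build a leaf block: a leading null byte followed by null-terminated
# passwords, in sorted order. @passwords_in_block is a hypothetical name.
my $leaf = "\x00" . join '', map "$_\x00", @passwords_in_block;

# Exact-match probe: the surrounding null bytes guarantee we never match
# a substring of a longer password.
my $found = index($leaf, "\x00${key}\x00") >= 0;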

The index blocks are the intermediate B-tree nodes and consist of a sorted list of block references interleaved with keys. Like this:

$block1 # Contains all the passwords "less than" $password1
$password1
$block2 # Contains all the passwords "less than" $password2
$password2
$block3 # Contains all the passwords "greater than" $password2

The $blockN references are stored as a 64-bit integer and encode the byte offset of the block relative to the start of the database file, the compressed byte length of the block, and whether it is a leaf or index block. Strictly speaking only the offset is necessary; the length and leaf flag could be stored alongside the block itself, but encoding them in the reference saves a separate decoding step. The $passwordN keys are encoded as null-terminated strings, if only for consistency with the leaf blocks.
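The exact bit layout isn’t important for the rest of this post, but as a hypothetical example of packing such a reference into 8 bytes (mkref/unpack_ref are illustrative names; the real dref on git may split the bits differently):

# Hypothetical layout: 1 bit leaf flag, 23 bits compressed length,
# 40 bits file offset.
sub mkref {
    my($is_leaf, $length, $offset) = @_;
    return pack 'Q>', ($is_leaf ? 1 : 0) << 63 | $length << 40 | $offset;
}

sub unpack_ref {
    my $int = unpack 'Q>', shift;
    # Same shape as the $ref arrayref used by the lookup code below,
    # where element 0 is the leaf flag.
    return [ $int >> 63, ($int >> 40) & 0x7F_FFFF, $int & 0xFF_FFFF_FFFF ];
}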

Finally, at the end of the file, we store a reference to the parent block, so that our lookup algorithm knows where to start its search.

Given a sorted list of passwords[2], creating a database with that format is relatively easy. The algorithm goes roughly as follows (apologies for the pseudocode; the actual code is slightly messier in order to do the proper data encoding):

my @blocks; # Stack of blocks; $blocks[0] is a leaf block.

for my $password (@passwords) {
    $blocks[0]->append_password($password);

    # Flush blocks when they get large enough.
    for(my $i=0; $blocks[$i]->length() > $block_size; $i++) {
        my $reference = $blocks[$i]->flush();
        $blocks[$i+1]->append_block_reference($reference);
    }
}

# Flush the remaining blocks, bottom-up. The reference produced by the
# topmost flush points at the root of the tree.
my $parent_reference;
for my $i (0..$#blocks) {
    $parent_reference = $blocks[$i]->flush();
    $blocks[$i+1]->append_block_reference($parent_reference) if $i < $#blocks;
}

write_parent_node_reference($parent_reference);

That’s it, really. No weird tree balancing tricks, no need to “modify” index blocks in any other way than appending some data. Flushing a block is nothing more than compressing the thing, appending it to the database file and noting its length and byte offset for future reference.
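For illustration, flushing could look roughly like this (a sketch assuming zlib compression via Compress::Zlib and the hypothetical mkref() from the earlier sketch; the real code also supports other compressors):

use Compress::Zlib;

# Compress a block, append it to the database file and return the
# 8-byte reference to where it ended up.
sub flush_block {
    my($FH, $is_leaf, $data) = @_;
    my $compressed = compress($data);
    my $offset = sysseek $FH, 0, 1;  # SEEK_CUR: current write position
    syswrite $FH, $compressed or die $!;
    return mkref($is_leaf, length $compressed, $offset);
}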

Lookup is just as easy. I don’t even need pseudocode to demonstrate it; here’s the actual implementation:

sub lookup_rec {
    my($q, $F, $ref) = @_;
    my $buf = readblock $F, $ref;
    if($ref->[0]) { # Is this a leaf block?
        return $buf =~ /\x00\Q$q\E\x00/; # Substring search on "\x00${q}\x00"
    } else {
        # This is an index block, walk through the block references and
        # password strings until we find a string that's larger than our query.
        while($buf =~ /(.{8})([^\x00]*)\x00/sg) {
            return lookup_rec($q, $F, dref $1) if $q lt $2;
        }
        # The query is larger than all keys in this block; recurse into the
        # last block reference.
        return lookup_rec($q, $F, dref substr $buf, -8);
    }
}

# Usage: lookup($query, $database_filename)
#   returns true if $query exists in the database.
sub lookup {
    my($q, $f) = @_;
    open my $F, '<', $f or die $!;
    # Read the last 8 bytes in the file for the reference to the parent block.
    sysseek $F, -8, 2 or die $!;
    die $! if 8 != sysread $F, (my $buf), 8;
    # Start the recursive lookup
    lookup_rec($q, $F, dref $buf)
}

The full code, including that of the benchmarks below, can be found on git.

Benchmarks

I benchmarked my little B-tree implementation with a few different compression settings (no compression, gzip and zstandard) and block sizes (1k and 4k). For comparison I also added a naive implementation that performs a simple linear lookup in the sorted dictionary, and another one that uses LMDB.
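For reference, the naive lookup is nothing more than a line-by-line scan through the sorted plain-text dictionary; roughly (my sketch, not the exact benchmarked code):

sub naive_lookup {
    my($q, $file) = @_;
    open my $F, '<', $file or die $!;
    while(my $line = <$F>) {
        chomp $line;
        return 1 if $line eq $q;
        return 0 if $line gt $q;  # the file is sorted, so we can stop early
    }
    return 0;
}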

Here are the results with the crackstation-human-only.txt.gz dictionary, containing 63,941,069 passwords at 247 MiB original size.

Database           DB Size (MiB)   Memory (RES, KiB)   Lookups/sec
Naive (plain)                684               6,376   0.16 (6.1 s)
Naive (gzip)                 246               6,340   0.12 (8.3 s)
B-tree plain 1k              698               6,460   17,857
B-tree plain 4k              687               6,436   9,345
B-tree gzip 1k               261              10,772   9,345
B-tree gzip 4k               244              10,572   5,076
B-tree zstd 1k               291               6,856   12,345
B-tree zstd 4k               268               6,724   6,944
LMDB                       1,282             590,792   333,333

Well shit. My little B-tree experiment does have an awesome size/performance ratio when compared to the Naive approach (little surprise there), but the performance difference with LMDB is insane. Although, really, that isn’t too surprising either: LMDB is written in C and has been heavily optimized for performance. LMDB’s memory usage in this benchmark should be taken with a grain of salt; its use of mmap() tends to throw off memory measurements.

I used the default compression levels of zstd and gzip. I expect that a slightly higher compression level for zstd could reduce the database sizes to below gzip levels without too much of a performance penalty.
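If the zstd bindings in use are Compress::Zstd, its compress() accepts an explicit level as a second argument, so trying that would be a one-line change (untested here):

use Compress::Zstd qw(compress);

# Compress a block at a higher zstd level than the module's default.
my $compressed = compress($block_data, 19);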

What’s curious is that the B-tree gzip 4k database is smaller than the Naive (gzip) one. I wonder if I have a bug somewhere that throws away a chunk of the original data. Or if I somehow ended up using a different compression level. Or if gzip is just being weird.

Here’s the same benchmark with the crackstation.txt.gz dictionary, containing 1,212,356,398 passwords at 4.2 GiB original size[3].

Database           DB Size (MiB)   Memory (RES, KiB)   Lookups/sec
Naive (plain)             14,968              38,536   0.01 (110 s)
Naive (gzip)               4,245              38,760   0.01 (136 s)
B-tree plain 1k           15,377               6,288   15,384
B-tree plain 4k           15,071               6,456   8,196
B-tree gzip 1k             4,926              10,780   7,352
B-tree gzip 4k             4,344              10,720   4,273
B-tree zstd 1k             5,389               6,708   10,000
B-tree zstd 4k             4,586               6,692   5,917
LMDB                      26,453           3,259,368   266,666

The main conclusion I draw from this benchmark is that the B-tree implementation scales pretty well with increasing database sizes, as one would expect. I’m not sure why Perl decided to use more memory for the Naive benchmarks, but it doesn’t really matter.

Improvements

Is this the best we can do? No way! Let’s start with some low-hanging fruit:

Thinking beyond B-trees, an alternative and likely much more efficient approach is to use a hash function to assign strings to leaf blocks, store an array of block pointers in the index blocks (without interleaving keys) and then use the hash function to index the array for lookup. This makes it harder to cap the size of leaf blocks, but with the small password strings that’s not likely to be a problem. It does significantly complicate creating the database in the first place.
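A rough sketch of that idea (the hash choice and bucket count here are arbitrary, just to illustrate):

use Digest::MD5 qw(md5);

# Map a password to one of $num_leaves leaf blocks. A lookup then reads
# the single block reference at that index and probes only that leaf.
sub leaf_index {
    my($key, $num_leaves) = @_;
    return unpack('N', md5($key)) % $num_leaves;
}

The index blocks then degenerate into a flat array of references, indexed directly by the hash value.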

Perhaps an even better approach is to not store the strings in the first place and simply use a (sufficiently) large bloom filter.
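A minimal sketch of the membership test for such a filter (assuming the bit string was built with the same derived positions; the names are illustrative):

use Digest::MD5 qw(md5);

# Test $k derived bit positions; any zero bit means the key is definitely
# absent, all ones means "probably present" (false positives are possible).
# $bits is a reference to the bit string, to avoid copying it around.
sub bloom_has {
    my($bits, $nbits, $k, $key) = @_;
    for my $i (1..$k) {
        my $pos = unpack('N', md5("$i:$key")) % $nbits;
        return 0 unless vec($$bits, $pos, 1);
    }
    return 1;
}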

But this was just a little side project. My goal was to get 1 ms lookups with the least amount of code and with a database size that isn’t too far off from the compressed dictionary. That goal turned out to be pretty unambitious.


  1. Yeah, people still use Perl 5 in 2019.

  2. But note that the passwords in the CrackStation dictionary are not sorted according to Perl’s string comparison algorithm, so it requires a separate LC_COLLATE=C sort command to fix that. Also note that sorting a billion strings is a pretty challenging problem in its own right, but enough has been written about that. Arguably, enough has been written about B-trees and databases as well.

  3. Running these benchmarks was a bit of a nightmare as I was running low on free space on my SSD. I had to delete some unused build artefacts from other projects in an emergency in order for sort to be able to finish sorting and writing the Naive (plain) database, upon which all the others are based. Ncdu saved this experiment; its author deserves an extra tasty pizza for dinner today.