projects@yorhel.nl

Ncdu Binary Export File Format

This document describes the new binary file format added in ncdu 2.6. Compared to the JSON export file format, it offers advantages such as a smaller file size and the ability to browse an export without loading the entire directory tree into memory.

These features come at the cost of increased complexity. The JSON format is generally easier to work with and therefore remains the recommended way for external tooling to interact with ncdu’s export/import functionality.

A binary export can be created with the -O option to ncdu. It is also possible to convert to and from the JSON format:

ncdu -O export.ncdu /         # Scan root, write to 'export.ncdu'
ncdu -f in.json -O out.ncdu   # Convert from JSON to binary
ncdu -f in.ncdu -o out.json   # Convert from binary to JSON

Format description

File signature

An exported file starts with the following file signature (in hex):

bf 6e 63 64 75 45 58 31

Formatted as a C string, that is "\xbfncduEX1".

Non-backwards-compatible changes to the export format should use a different file signature. Note that switching to a different compression algorithm is a non-backwards-compatible change.
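As an illustration, a minimal signature check could look like the following Python sketch (not part of ncdu itself):

```python
# The 8-byte signature at the start of every ncdu binary export:
# bf 6e 63 64 75 45 58 31, i.e. "\xbfncduEX1" as a C string.
SIGNATURE = b"\xbfncduEX1"

def check_signature(data: bytes) -> bool:
    """Return True if data starts with the ncdu binary export signature."""
    return data[:8] == SIGNATURE
```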

Block format

The file signature is followed by one or more blocks. A block has the following format:

TypeLen 4 bytes Big-endian type + length of this block
Content n bytes n = Length - 8
TypeLen 4 bytes Repeat of TypeLen

The high 4 bits of the TypeLen indicate the block type; the lower 28 bits encode the length of the block, including the header and footer.

The TypeLen is repeated at the end of the block to allow for reading the file in both forwards and backwards direction.

The block type determines how the Content should be interpreted. There are currently two block types:

Type Meaning
0 Data block
1 Index block

Parsers should ignore blocks with an unknown type.

A valid file must have at least one data block and exactly one index block. The index block must be the last block in the file.
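The block layout above can be walked with a small parser. The following Python sketch (function and variable names are mine, not ncdu's) validates the matching TypeLen header/footer pair and yields each block's type and content:

```python
import struct

def iter_blocks(data: bytes, start: int = 8):
    """Yield (type, content) for each block after the 8-byte signature.

    TypeLen: high 4 bits = block type, low 28 bits = total block length
    including the 4-byte header and the 4-byte footer.
    """
    pos = start
    while pos < len(data):
        (typelen,) = struct.unpack_from(">I", data, pos)
        btype, blen = typelen >> 28, typelen & 0x0FFFFFFF
        if blen < 8 or pos + blen > len(data):
            raise ValueError("block length out of bounds")
        (footer,) = struct.unpack_from(">I", data, pos + blen - 4)
        if footer != typelen:
            raise ValueError("TypeLen footer does not match header")
        yield btype, data[pos + 4 : pos + blen - 4]
        pos += blen
```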

Data blocks

Data blocks have the following contents:

Number 4 bytes Big-endian unsigned block number
Compressed_data n bytes

Every data block must have a unique number, starting from zero and ideally (but not necessarily) allocated without gaps. Data blocks may appear in a different order than their numbering.

Data is compressed with Zstandard. Data must be compressed in a single frame and the uncompressed size must be available through ZSTD_getFrameContentSize(), so that readers can pre-allocate a properly-sized buffer for decompression.

The total length of a data block, including block header and footer, must not exceed 16 MiB minus one byte. The total size of the decompressed data must also not exceed 16 MiB minus one byte.

The decompressed data consists of a stream of one or more Items (see below).
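A data block's content can be split without decompressing it. The sketch below (a hypothetical helper, not ncdu code) extracts the block number and the compressed payload; actual decompression would additionally need a Zstandard library, e.g. the third-party `zstandard` Python package:

```python
import struct

def parse_data_block(content: bytes):
    """Split a data block's content into (block_number, compressed_data).

    Decompressing the payload requires a Zstandard library; with the
    third-party `zstandard` package it would be roughly:
        zstandard.ZstdDecompressor().decompress(compressed)
    which relies on the frame content size being stored, as the
    format requires.
    """
    if len(content) < 4:
        raise ValueError("data block too short")
    (number,) = struct.unpack_from(">I", content, 0)
    return number, content[4:]
```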

Index block

The index block provides a lookup table for data blocks and a reference to the root item:

Block_pointers n*8 bytes
Root_itemref 8 bytes

Block_pointers is an array containing an 8-byte pointer for each data block in the file. Pointers are indexed by block number: the first pointer is for block number 0, the second for block number 1, and so on. Each pointer is a 64-bit big-endian unsigned integer. The higher 40 bits hold the byte offset of the data block header, relative to the start of the file; the lower 24 bits hold the block length and must equal the length in the TypeLen of the corresponding data block. An all-zero value indicates that there is no block with this number in the file.

The last 8 bytes of the index block represent an unsigned big-endian integer that refers to the root item of the directory tree. See Itemref below.
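Decoding a block pointer is a matter of splitting the 40/24-bit fields. A Python sketch, with names of my own choosing:

```python
def decode_block_pointer(ptr: int):
    """Split an 8-byte index pointer into (file_offset, block_length).

    High 40 bits: byte offset of the data block header from the start
    of the file; low 24 bits: total block length. An all-zero pointer
    means no block with this number exists.
    """
    if ptr == 0:
        return None
    return ptr >> 24, ptr & 0xFFFFFF
```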

Itemref

An Itemref encodes a reference to an Item. There are two types:

Absolute
An absolute Itemref is a 64-bit unsigned integer that encodes a block number in the higher 40 bits and the byte offset of the start of the item within the block in the lower 24 bits. Every item in the file has exactly one absolute Itemref value. The Root_itemref in the index block must be absolute.
Relative
A relative Itemref is a negative integer that represents the byte offset of the referenced item relative to the start of the item containing the reference. Relative references can only point to a previously written item within the same block.
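Both variants can be resolved to an absolute (block number, byte offset) pair. The following sketch (a hypothetical helper) assumes the reference is already available as a signed Python integer, with the current item's location given for the relative case:

```python
def resolve_itemref(ref: int, cur_block: int, cur_offset: int):
    """Resolve an Itemref to an absolute (block_number, byte_offset).

    ref >= 0: absolute; high 40 bits = block number, low 24 = offset.
    ref < 0:  relative; the target is the current item's offset plus
              the (negative) value, within the same block.
    """
    if ref >= 0:
        return ref >> 24, ref & 0xFFFFFF
    target = cur_offset + ref
    if target < 0:
        raise ValueError("relative itemref points before start of block")
    return cur_block, target
```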

Item

An Item represents a file or directory entry, encoded as a CBOR map. Key/value pairs may be encoded in any order and unknown keys are ignored. Summary of keys recognized by ncdu:

Key Field Value
0 type i32
1 name String
2 prev Itemref
3 asize u64
4 dsize u64
5 dev u64
6 rderr bool
7 cumasize u64
8 cumdsize u64
9 shrasize u64
10 shrdsize u64
11 items u64
12 sub Itemref
13 ino u64
14 nlink u32
15 uid u32
16 gid u32
17 mode u16
18 mtime u64
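To make the encoding concrete, here is a hand-rolled CBOR sketch (covering only the integer, byte-string, and boolean types the table needs; a real implementation would likely use a CBOR library) that encodes an Item map:

```python
def cbor_head(major: int, n: int) -> bytes:
    """Encode a CBOR head: major type (0-7) with unsigned argument n."""
    if n < 24:
        return bytes([(major << 5) | n])
    for bits, ai in ((8, 24), (16, 25), (32, 26), (64, 27)):
        if n < 1 << bits:
            return bytes([(major << 5) | ai]) + n.to_bytes(bits // 8, "big")
    raise ValueError("integer too large for CBOR head")

def cbor_int(n: int) -> bytes:
    """Encode a signed integer (major type 0 or 1)."""
    return cbor_head(0, n) if n >= 0 else cbor_head(1, -1 - n)

def encode_item(fields: dict) -> bytes:
    """Encode an Item as a CBOR map of integer keys to int/bytes/bool values."""
    out = [cbor_head(5, len(fields))]                     # map header
    for key, value in fields.items():
        out.append(cbor_int(key))
        if isinstance(value, bytes):                      # e.g. name
            out.append(cbor_head(2, len(value)) + value)  # byte string
        elif isinstance(value, bool):                     # e.g. rderr
            out.append(b"\xf5" if value else b"\xf4")
        else:                                             # ints, incl. type/prev
            out.append(cbor_int(value))
    return b"".join(out)
```

For example, `encode_item({0: 1, 1: b"foo", 3: 1234})` produces the bytes `a3 00 01 01 43 66 6f 6f 03 19 04 d2`: a 3-pair map with type=1 (regular file), name="foo", and asize=1234.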

Common fields for all items

type

Mandatory. A negative value indicates that the item has been excluded from the size calculations for some reason; non-negative values indicate the item type:

-4 Excluded with --exclude-kernfs
-3 Excluded with -x
-2 Excluded by pattern match
-1 Error while reading this entry
0 Directory
1 Regular file
2 Non-regular file (symlink, device, etc)
3 Hardlink candidate (i.e. stat().st_nlink > 1)

Unrecognized negative values are treated as equivalent to -2, unrecognized positive values are treated as a non-regular file (type=2).

name
Mandatory. Ncdu always encodes the name as a byte string, but also accepts UTF-8 text strings. Ncdu does not support indefinite-length CBOR strings; the name must be encoded with a known length.
prev
Reference to the previous item in the same directory. This field must be absent if this is the first item in a directory. This field forms a singly-linked list of all items in a directory.

Fields for type >= 0

asize
Apparent size of this file/directory as reported by stat().st_size. Optional, defaults to 0.
dsize
Disk usage of this file/directory as reported by stat().st_blocks multiplied by the block size. Optional, defaults to 0.

Fields for type = 0

dev
Device number. Optional, defaults to the same device number as the parent directory, or 0 if this is the root item.
rderr
Whether an error occurred while reading this directory. When true, an error occurred while reading the directory listing itself and the item list may therefore be incomplete. When false, an error occurred while reading a child item instead; somewhere in this sub-tree there must then be at least one item of type=-1 or a directory with rderr=true.
cumasize
Cumulative apparent size of this directory. Optional, defaults to 0.
cumdsize
Cumulative disk usage of this directory. Optional, defaults to 0.
shrasize
Shared apparent size. Optional, defaults to 0.
shrdsize
Shared disk usage. Optional, defaults to 0.
items
Cumulative number of items in this directory. Ncdu currently caps this number to 2^32-1 when reading, but supports larger numbers when exporting. Optional, defaults to 0.
sub
Reference to the last item in this directory, or absent if the directory is empty.
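Given items decoded into plain dicts, the sub/prev chain can be walked as follows. This is a sketch under a simplifying assumption: all references have already been resolved to absolute Itemrefs, which are used as dictionary keys.

```python
def list_children(items: dict, dir_ref: int):
    """Collect a directory's children by walking its prev-chain.

    `items` maps absolute Itemref -> decoded item dict, with integer
    keys as in the table above (2 = prev, 12 = sub). The sub field
    points at the LAST item, so following prev yields reverse order.
    """
    children = []
    ref = items[dir_ref].get(12)   # sub: last item, absent if empty
    while ref is not None:
        children.append(items[ref])
        ref = items[ref].get(2)    # prev: previous sibling, if any
    children.reverse()
    return children
```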

Fields for type = 3

ino
Inode number.
nlink
Number of links to this inode.

Extended information

These fields are only exported when the -e flag is passed to ncdu. They are relevant to all items with type >= 0.

uid
User id.
gid
Group id.
mode
File mode.
mtime
Last modification time as a UNIX timestamp.

Limitations

Compressed data block size
16 MiB minus 1 byte. This limit comes from Block_pointers in the index block using 24 bits to encode the block length.
Uncompressed data block size
16 MiB minus 1 byte. This limit comes from Itemref encoding item offset in 24 bits.
Largest data block number
33,554,428. The size of the index block is limited by the 28-bit length in the block’s TypeLen header, which limits the number of Block_pointers it can hold to ((2^28 - 1) - 16) / 8 (subtract one to get the maximum block number because counting starts at 0).
Compressed data size
Excluding block overhead, the total amount of compressed data is limited to about 1 TiB. This is limited by Block_pointers using 40 bits to encode the data block offset within the file.
Uncompressed data size
Limited by either the maximum number of data blocks or the compressed data size, depending on compression ratio and the chosen data block size. Assuming the number of data blocks is the limit, about 512 TiB of uncompressed data can be stored with the maximum data block size of 16 MiB. Ncdu’s adaptive block size selection has a limit of about 40 TiB.
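The numbers above can be re-derived from the field widths, as a quick sanity check (the same arithmetic as in the text):

```python
MiB = 1 << 20

# 24-bit block length fields: 16 MiB minus one byte.
block_len_limit = (1 << 24) - 1
assert block_len_limit == 16 * MiB - 1

# Index block content: 28-bit length, minus the 4-byte header,
# 4-byte footer and 8-byte Root_itemref.
index_content = (1 << 28) - 1 - 16
max_block_number = index_content // 8 - 1   # numbering starts at 0
assert max_block_number == 33_554_428

# 40-bit file offsets limit compressed data to about 1 TiB.
compressed_limit = 1 << 40

# With maximum-size blocks, uncompressed capacity is about 512 TiB.
uncompressed_max = (max_block_number + 1) * block_len_limit
assert uncompressed_max // (1 << 40) == 511   # just under 512 TiB
```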

The real question is how many items an export can hold with the above limits in place. This will heavily depend on the average encoded item size and the compression ratio, both of which can vary wildly from one directory structure to another.

I’ve had one report with ~1.4 billion files resulting in a ~21 GiB file. Extrapolating from that and assuming the compressed data size is the limiting factor, this format could hold ~68 billion items. Increasing the compression level and using larger data block sizes to further improve compression ratio, one could perhaps store about 100 billion items. On the one hand, that sounds like an insane number nobody will ever reach. On the other hand, a decade ago I couldn’t imagine people having more than 100 million files, yet here we are.

On the upside, all the major limitations can be attributed to the maximum size and format of the index block. It’s possible to implement an alternative index format in the future that can be automatically switched to whenever any of the above limits are exceeded, thus providing a seamless upgrade path without breaking compatibility for the existing exports that do fit within the limits.

Security considerations

Directory trees can get very large and you can easily exceed available RAM when attempting to read everything into memory. Reading only small parts of the tree can help cut down on memory use, but it’s still a good idea to implement limits or detect and handle when you’re about to run out of memory.

There are several places in the format where byte offsets are used to refer to blocks or items. These offsets must be validated to ensure that they stay within the bounds of the respective file or block. In particular, itemref offsets could potentially refer to memory before (in the case of a relative itemref) or after (absolute itemref) the decompressed data, and pointers in the index block could refer to offsets beyond the end of the file.

The CBOR encoding used for items is self-delimiting, but a badly formatted item may not be properly terminated before the end of the decompressed block contents. Readers should take care that this does not lead to reading past the allocated buffer.

In a well-formed directory tree, each item is referenced exactly once, by either the Root_itemref or a prev or sub field. However, it is also possible to construct a file where this is not the case, and implementers should be aware that itemref loops are possible.

Implementation notes

Data block size
It is up to the file writer to choose a suitable data block size. This is a compromise between compression efficiency and memory use: larger blocks compress better but also require more memory, both for reading and writing. Ncdu currently keeps 8 uncompressed blocks in memory when reading and one block per thread when writing. Ncdu starts with blocks of 64 KiB, but gradually increases the size to 2 MiB for very large directory trees in order to not bloat the index size too much and to prevent running into the maximum data block number limit.
Testing

If you’re implementing a custom writer for this format, make sure to check out the ncdubinexp.pl script in the git repository. Ncdu only reads the parts of a file that it actually needs, so passing a file to ncdu is no guarantee that it is well-formed. The ncdubinexp.pl script is more thorough in validating file correctness but misses a few invariants that ncdu does check for, so the best way to verify a file is to run both:

ncdu -o/dev/null -f file.ncdu   # Read entire tree and export to /dev/null
ncdubinexp.pl <file.ncdu        # Read and verify the entire file