Yxml - A small, fast and correct* XML parser
*But see the Bugs and Limitations and Conformance Issues below.
Yxml is a small (6 KiB) non-validating yet mostly conforming XML parser written in C. Its primary goals are small binary size, simplicity and correctness. It also happens to be pretty fast.
The code can be obtained from the git repo and is available under a permissive MIT license. The only two files you need are yxml.c and yxml.h, which can easily be included and compiled as part of your project. Complete API documentation is available in the manual.
The API follows a simple and mostly buffer-less design, and only consists of three functions:
void yxml_init(yxml_t *x, void *buf, size_t bufsize);
yxml_ret_t yxml_parse(yxml_t *x, int ch);
yxml_ret_t yxml_eof(yxml_t *x);
Be aware that simple is not necessarily easy or convenient. The API is relatively low-level and designed to integrate into pretty much any application and for any use case. This includes incrementally parsing data from a socket in an event-driven fashion and parsing large XML files on memory-restricted devices. It is possible to implement a more convenient and high-level API on top of yxml, but I’m not very fond of libraries that do more than what I strictly need.
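To illustrate what a buffer-less, push-style design looks like from the caller's side, here is a self-contained toy that mirrors the shape of the three-function API (initialise, feed one byte, signal end of input). It is not yxml and does not use yxml's types: `toy_t`, `toy_init`, `toy_feed`, `toy_feed_chunk` and `toy_eof` are hypothetical names, and the scanner only tracks element nesting for a simplified subset of XML (no comments, processing instructions, CDATA or doctype, and no `>` inside attribute values).

```c
#include <stddef.h>

/* Hypothetical toy, NOT part of yxml: it only mirrors the init/feed/eof shape. */
typedef enum { TOY_TEXT, TOY_TAG_START, TOY_IN_TAG } toy_state_t;

typedef struct {
    toy_state_t state;
    int depth;    /* number of currently open elements */
    int closing;  /* 1 while inside a closing tag such as </a> */
    int prev;     /* previous byte seen inside the current tag */
} toy_t;

void toy_init(toy_t *t) {
    t->state = TOY_TEXT;
    t->depth = 0;
    t->closing = 0;
    t->prev = 0;
}

/* Feed one byte. Returns 0 on success, -1 on input outside the toy subset. */
int toy_feed(toy_t *t, int ch) {
    switch (t->state) {
    case TOY_TEXT:
        if (ch == '<')
            t->state = TOY_TAG_START;
        return 0;
    case TOY_TAG_START:
        if (ch == '?' || ch == '!' || ch == '>')
            return -1; /* PIs, comments, doctype and "<>" are not handled */
        t->closing = (ch == '/');
        t->prev = ch;
        t->state = TOY_IN_TAG;
        return 0;
    case TOY_IN_TAG:
        if (ch != '>') {
            t->prev = ch;
            return 0;
        }
        if (t->closing) {
            if (t->depth == 0)
                return -1; /* closing tag without a matching open tag */
            t->depth--;
        } else if (t->prev != '/') {
            t->depth++; /* ordinary open tag; <x/> leaves depth unchanged */
        }
        t->state = TOY_TEXT;
        return 0;
    }
    return -1;
}

/* Signal end of input. Returns 0 if every element was closed, -1 otherwise. */
int toy_eof(const toy_t *t) {
    return (t->state == TOY_TEXT && t->depth == 0) ? 0 : -1;
}

/* Feed a chunk of any size; the struct carries all state between calls. */
int toy_feed_chunk(toy_t *t, const char *buf, size_t len) {
    for (size_t i = 0; i < len; i++)
        if (toy_feed(t, (unsigned char)buf[i]) != 0)
            return -1;
    return 0;
}
```

Because all state lives in `toy_t`, the caller can split the input at arbitrary byte boundaries, for example by feeding whatever each read from a socket returns. No input buffer has to be kept around between calls.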
There are no tarball releases available at the moment. The API is relatively stable, but I won’t currently promise any ABI stability. Dynamic linking against yxml is therefore not a very good idea.
Features
- Simple and low-level API.
- Does not require malloc().
- Pure C, should be very portable.
- Recognizes and consumes the UTF-8 BOM.
- Parses entity references (&amp;) and character references (&#38;).
- Verifies most well-formedness constraints, including the correct nesting of elements.
- Parses XML documents in any ASCII-compatible encoding.
- Extensively fuzzed.
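As an illustration of the difference between the two kinds of reference mentioned in the list above, the following sketch decodes a numeric character reference such as `&#38;` or `&#x26;` into a code point. It is not taken from yxml: `decode_charref` and `is_xml_char` are hypothetical helpers written for this example, and the validity check is the `Char` production from the XML 1.0 specification.

```c
#include <stddef.h>

/* XML 1.0 "Char" production: the code points allowed in a document. */
static int is_xml_char(unsigned long c) {
    return c == 0x9 || c == 0xA || c == 0xD ||
           (c >= 0x20 && c <= 0xD7FF) ||
           (c >= 0xE000 && c <= 0xFFFD) ||
           (c >= 0x10000 && c <= 0x10FFFF);
}

/* Hypothetical helper: decode "&#NNN;" or "&#xHHH;" at the start of s.
 * Returns the code point, or -1 if s is not a valid character reference. */
long decode_charref(const char *s) {
    unsigned long cp = 0;
    int base = 10, digits = 0;

    if (s[0] != '&' || s[1] != '#')
        return -1;
    s += 2;
    if (*s == 'x') {
        base = 16;
        s++;
    }
    for (; *s != ';'; s++) {
        int d;
        if (*s >= '0' && *s <= '9')
            d = *s - '0';
        else if (base == 16 && *s >= 'a' && *s <= 'f')
            d = *s - 'a' + 10;
        else if (base == 16 && *s >= 'A' && *s <= 'F')
            d = *s - 'A' + 10;
        else
            return -1; /* also hit by the terminating NUL: missing ';' */
        cp = cp * (unsigned long)base + (unsigned long)d;
        if (cp > 0x10FFFF)
            return -1; /* beyond Unicode; also keeps cp from overflowing */
        digits++;
    }
    if (digits == 0 || !is_xml_char(cp))
        return -1;
    return (long)cp;
}
```

A named entity reference such as `&amp;` is a lookup by name rather than a number, which is why the helper rejects it.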
But let’s not be too optimistic, because there are also…
Bugs and Limitations
- A conditional section in a <!DOCTYPE ..> declaration will result in a parse error.
- Allows multiple <!DOCTYPE ..> declarations.
- Information encoded in the XML and doctype declarations is currently not available through the API.
These issues may be addressed in future versions.
Conformance Issues
- Does not verify that non-ASCII characters in element names, element content, attribute names and attribute values are within the allowed Unicode character ranges.
- Does not verify that attribute names within the same element are unique.
- Does not verify that the contents of a <!DOCTYPE ..> declaration follow the XML grammar.
- Can’t parse documents in a non-ASCII-compatible encoding. You’ll have to convert them to UTF-8 or something similar first.
- No support for custom entity references, neither through the API nor using <!ENTITY>.
These conformance issues are the result of the byte-oriented and minimal design of yxml, and I do not intend to fix them directly within the library. The intention is that all of the above-mentioned issues can be fixed on top of yxml (by the application, or by a wrapper) if strict conformance is required. However, the functionality required to support custom entity references and DTD handling has not been implemented yet.
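As one example of fixing a conformance issue in the application, the sketch below checks that attribute names within a single element are unique. It assumes the application collects each attribute name as the parser reports it. `attr_set_t`, `attr_set_reset` and `attr_set_add` are hypothetical names for this example and are not part of yxml, and the fixed limits are arbitrary.

```c
#include <string.h>

#define MAX_ATTRS 32     /* arbitrary limit chosen for this sketch */
#define MAX_NAME_LEN 64  /* arbitrary limit chosen for this sketch */

/* Hypothetical helper, not part of yxml: names seen in the current element. */
typedef struct {
    char names[MAX_ATTRS][MAX_NAME_LEN];
    int count;
} attr_set_t;

/* Call when a new element starts. */
void attr_set_reset(attr_set_t *s) {
    s->count = 0;
}

/* Call once per attribute name.
 * Returns 0 if the name is new, 1 if it is a duplicate,
 * and -1 if the name or the set exceeds the fixed limits. */
int attr_set_add(attr_set_t *s, const char *name) {
    size_t len = strlen(name);
    if (len >= MAX_NAME_LEN)
        return -1;
    for (int i = 0; i < s->count; i++)
        if (strcmp(s->names[i], name) == 0)
            return 1;
    if (s->count == MAX_ATTRS)
        return -1;
    memcpy(s->names[s->count], name, len + 1); /* copy including the NUL */
    s->count++;
    return 0;
}
```

The linear scan over a fixed array keeps the helper free of malloc(), in keeping with the library's own design. An application that expects elements with very many attributes might prefer a hash set instead.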
Non-features
And now follows a list of things that are not part of the core XML specification and are not directly supported. As with the conformance issues, these features can be implemented on top of yxml.
- No helper functions to deal with namespaces. Yxml will parse XML files with namespaces just fine, but it’s up to the application to do the rest.
- No DTD or XML Schema validation.
- No XSLT.
- No XPath.
- Doesn’t do your household chores.
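As a small illustration of the namespace handling that is left to the application, the sketch below splits a qualified name such as `svg:rect` into its prefix and local part. `split_qname` is a hypothetical helper for this example, not a yxml function. Resolving the prefix to a namespace URI through the in-scope `xmlns` attributes would be a further step that is not shown here.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of yxml.
 * Splits "prefix:local" at the colon. On success, *local points at the
 * local part inside qname and the return value is the prefix length
 * (0 when there is no prefix). Returns -1 for malformed names such as
 * "", ":x", "x:" or "a:b:c". */
long split_qname(const char *qname, const char **local) {
    const char *colon = strchr(qname, ':');
    if (colon == NULL) {
        if (*qname == '\0')
            return -1; /* empty name */
        *local = qname;
        return 0;
    }
    if (colon == qname || colon[1] == '\0' || strchr(colon + 1, ':') != NULL)
        return -1;
    *local = colon + 1;
    return (long)(colon - qname);
}
```

The helper does not copy anything: the prefix is the first N bytes of the original name and the local part is a pointer into it, so it can be applied directly to an element or attribute name the application already holds.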
Users
Yxml is used in a few products. Let me know if I missed one.
- FreeBSD’s PKG uses it to parse VuXML metadata (src).
- getdns uses it to parse DNSSEC trust anchor metadata (src).
- Fuchsia uses it to parse SVG images (src).
- ncdc uses it to parse XML-encoded file lists (src).
- BTstack - apparently Bluetooth uses XML somewhere.
- A MATLAB GIfTI library (src).
- RetroArch (src).
- radare2 uses it to parse information out of XNU binaries (src).
- Crank Software’s Storyboard uses it to parse runtime configurations (license).
Comparison
The following benchmark compares expat, libxml2 and Mini-XML with yxml. A strlen(3) implementation is also included as an indication of the “theoretical” minimum.
                                  SIZE (bytes)  PERFORMANCE (s)
LIB      VER    LICENSE           OBJ   STATIC    WIKI  DISCOGS
strlen   -      -                  25      816    0.16     0.09
expat    2.1.0  MIT           162 139  194 432    1.47     1.09
libxml2  2.9.1  MIT           464 328  518 816    2.53     1.75
mxml     2.7    LGPL2+static   32 733   75 832   12.38     7.80
yxml     git    MIT             5 971   31 416    1.15     0.74
The code for these benchmarks is available in the bench/ directory on git. Some explanatory notes:
- OBJ is the total size of all object code of the library, measured with size(1).
- STATIC is the file size of a minimal statically linked binary when linked against musl 0.9.13, measured with wc(1) after running strip(1).
- The performance is the time, in seconds, to load a large XML file. WIKI refers to enwiki-20130805-abstract5.xml (162 MiB) from a Wikipedia Dump, DISCOGS refers to discogs_20130801_labels.xml (94 MiB) from a Discogs Data Dump.
- Libxml2 has been compiled with most of its features disabled with ./configure, but it still manages to be the very definition of bloat.
- Everything has been compiled with gcc 4.8.1 at -O2.
- Benchmarks are run on Linux 3.10.7 with a 3 GHz Intel Core 2 Duo E8400 and with 4 GB RAM.
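To make the role of the strlen row concrete: strlen makes a single pass that touches every byte once without interpreting it, which is roughly the least work any parser reading the whole input has to do. The following is a minimal sketch of timing such a pass. It is not the code from bench/; `time_scan` is a hypothetical helper, and it reports CPU time from `clock()`, which may differ from what the real benchmark measures.

```c
#include <string.h>
#include <time.h>

/* Hypothetical helper, not the code from bench/.
 * Runs strlen() over a NUL-terminated buffer, stores the length in *len,
 * and returns the CPU time spent, in seconds, as reported by clock(). */
double time_scan(const char *buf, size_t *len) {
    clock_t start = clock();
    *len = strlen(buf);
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

On a small buffer the measured time will round to zero; the numbers only become meaningful on inputs the size of the benchmark files.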
And just for fun, here’s the same comparison when compiled with -Os, i.e. optimized for small size. Interestingly enough, Mini-XML actually runs faster with -Os than with -O2.
                                  SIZE (bytes)  PERFORMANCE (s)
LIB      VER    LICENSE           OBJ   STATIC    WIKI  DISCOGS
strlen   -      -                  25      816    0.16     0.09
expat    2.1.0  MIT           113 314  145 632    1.58     1.20
libxml2  2.9.1  MIT           356 948  412 256    3.01     2.08
mxml     2.7    LGPL2+static   27 725   71 704   11.70     7.44
yxml     git    MIT             4 955   30 392    1.67     1.02
Validating vs. non-validating
TL;DR: yxml does not accept garbage XML documents. It will correctly handle and report issues if the input does not strictly follow the XML grammar.
The terms validating and non-validating have specific meanings within the context of XML. A validating parser is one that reads the document type definition (DTD) associated with a document, and validates that the contents of the document follow the rules described in the DTD. A DTD may also include instructions on how to parse the document, including the definition of custom entity references (&whatever;) and instructions on how attribute values or element contents should be normalized before the data is passed to the application.
A non-validating parser is one that ignores the DTD and happily parses documents that do not follow the rules described in that DTD. Such parsers (usually) don’t support custom entity references and will not normalize attribute values or element contents. A non-validating parser still has to verify that the XML document follows the XML syntax rules.
It should be noted that a lot of XML documents found in the wild are not described with a DTD, but instead use an alternative technology such as XML Schema. Wikipedia has more information on this. Using a validating parser for such documents would only add bloat and may introduce security vulnerabilities.