Moving away from HDF5

Update [2016-01-30]: I wrote a follow-up here

In the research lab where I work, we've been developing a data processing pipeline for several years. This includes not only a program but also a new file format, based on HDF5, for a specific type of data. While the choice of HDF5 looked compelling on paper, we found many issues with it. Recently, despite the high costs, we decided to abandon this format in our software.

In this post, I'll describe what HDF5 is and the issues that made us move away from it.

What is HDF5?

For those who haven't come across it, Hierarchical Data Format, or HDF [in this post I'll only talk about the current version, HDF5], is a multipurpose hierarchical container format capable of storing large numerical datasets with their metadata. The specification is open and the tools are open source. Development of HDF5 is done by the HDF Group, a non-profit corporation.

What's in an HDF5 file?

An HDF5 file contains a POSIX-like hierarchy of numerical arrays (aka datasets) organized within groups.

A dataset can be stored in two ways: contiguous or chunked. In the former case, the dataset is stored as a single contiguous buffer in the file. In the latter, it is split into uniformly sized rectangular chunks, which are indexed by a B-tree.

HDF5 also supports lossless compression of datasets.
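
To make this concrete, here is a minimal h5py sketch (file, group, and attribute names are purely illustrative) showing a group hierarchy, metadata attributes, and the two storage layouts:

    import numpy as np
    import h5py

    with h5py.File("example.h5", "w") as f:
        grp = f.create_group("recordings/session1")   # POSIX-like group hierarchy
        grp.attrs["experimenter"] = "alice"           # metadata attached to a group

        data = np.random.rand(10000, 32)
        # Contiguous layout: the array is written as a single buffer in the file.
        grp.create_dataset("raw", data=data)
        # Chunked layout: the array is split into fixed-size chunks indexed by a
        # B-tree, which also enables resizing and lossless compression.
        grp.create_dataset("filtered", data=data, chunks=(1024, 32),
                           maxshape=(None, 32), compression="gzip")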

File system within a file

Effectively, you can see HDF5 as a file system within a file, where files are datasets and folders are groups. However, the HDF Group doesn't seem to like this comparison. The major differences are as follows:

  • An HDF5 file is portable: the entire structure is contained in the file and doesn't depend on the underlying file system. However it does depend on the HDF5 library.
  • HDF5 datasets have a rigid structure: they are all homogeneous (hyper)rectangular numerical arrays, whereas files in a file system can be anything.
  • You can attach metadata (attributes) to groups and datasets, whereas regular file systems offer little support for this.

A short story

Many neuroscience labs working on extracellular recordings had been using a file format for almost two decades. It was meant to be a temporary format, and no one expected that it would become so widely used, so not much thought had been given to it. The format mixed text and binary files, and metadata was stored in a poorly-specified XML file. There were quirks like off-by-one discrepancies between files. Scientific results could end up being wrong simply because the experimenter was confused by the format. There were also serious performance problems, and the format wouldn't have scaled to modern recording devices.

These files were used in a suite of graphical programs that had also been developed a while ago, and that wouldn't have scaled to these new devices.

As we worked on a new version of the processing software, we decided to also design a new version of this file format that would be based on HDF5.

HDF5 looked like an ideal choice: widely-supported, supposedly fast and scalable, versatile. We couldn't find any argument against it. The following advantages were the main reasons we chose HDF5 in the first place:

  • Open
  • Large community
  • You can create symlink-like links between datasets, both within and across HDF5 files (see the sketch after this list)
  • Transparent endianness support
  • Portability and metadata, as seen above
  • Chunked datasets can be resized along a given dimension
  • Optional support for compression
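
As an illustration of two of these points, here is a hedged sketch (dataset and file names are hypothetical) of links and resizable chunked datasets with h5py:

    import numpy as np
    import h5py

    with h5py.File("links.h5", "w") as f:
        dset = f.create_dataset("raw", data=np.zeros((1000, 32)),
                                chunks=(100, 32), maxshape=(None, 32))
        # Symlink-like references, within the file and to another HDF5 file.
        f["alias"] = h5py.SoftLink("/raw")
        f["external"] = h5py.ExternalLink("other.h5", "/some_dataset")
        # A chunked dataset can grow along any dimension declared resizable.
        dset.resize((2000, 32))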

We spent months, if not years, designing the perfect HDF5-based file format that would work for everyone. We ran many benchmarks on various configurations to find the best compromise between design and performance. We rewrote our entire Python software around this new format using the h5py library. People around the world started to generate petabytes of data with our program.

That's when we started to see several practical problems, which also made us aware of deeper issues with HDF5:

  • High risks of data corruption
  • Bugs and crashes in the HDF5 library and in the wrappers
  • Poor performance in some situations
  • Limited support for parallel access
  • Inability to explore datasets with standard Unix/Windows tools
  • Hard dependence on a single implementation of the library
  • High complexity of the specification and the implementation
  • Opaque development process and slow responsiveness of the development team

Our users were upset. They couldn't do things they could do with the old format, however clunky it may have been. We implemented hacks and patches around these bugs and limitations, and ended up with an unmaintainable code mess.

At some point, we said stop. For us, HDF5 was too much trouble, and we estimated that dropping it completely was the least painful choice. With so much data in this format in the wild, we still need to provide support, conversion, and export facilities, but we encourage our users to move to a simpler format.

Disadvantages of HDF5

What went wrong? The first mistake we made was to design a file format in the first place. This is an extremely hard problem, and the slightest mistake has huge and expensive consequences. It is better left to dedicated working groups.

Let's now see the disadvantages of HDF5 in detail.

Single implementation

The HDF5 specification is very complex and low level. It spans about 150 pages. In theory, since the specification is open, anyone can write their own implementation. In practice, this is so complex that there is a single implementation, spanning over 300,000 lines of C code. The library may be hard to compile on some systems. There are wrappers in many languages, including Python. They all rely on the same C library, so they all share the bugs and performance issues of this implementation. Of course, the wrappers can add their own bugs and issues.

The code repository of the reference implementation is hard to find. It looks like there is an unofficial GitHub clone of an SVN repository. There are no issues or pull requests and little documentation; just a bunch of commits. To understand the code properly, you have to become very familiar with the 150 pages of specification.

Overall, using HDF5 means that, to access your data, you're going to depend on a very complex specification and library, slowly developed over almost 20 years by a handful of people, and probably understood by just a few people in the world. This is a bit scary.

Corruption risks

Corruption may happen if your software crashes while it's accessing an HDF5 file. Once a file is corrupted, all of your data is basically lost forever. This is a major drawback of storing a lot of data in a single file, which is what HDF5 is designed for. Users of our software have lost many hours of work because of this. Of course, you need to write your software properly to minimize the risk of crashes, but it is hard to avoid them completely. Some crashes are due to the HDF5 library itself.

To mitigate corruption issues, journaling was being considered for a future version of HDF5. I can find mentions of this feature on the mailing list, for example here in 2008, or in 2012. It was planned for the 1.10 version, which itself was originally planned for 2011, if not earlier. In the end, it looks like journaling is not going to make it into the 1.10 release [see the Comments section in this page]. This release is currently planned for 2016, and the very first alpha version was released a few days ago.

[Anecdotally, this version seems to break compatibility in that earlier releases [of HDF5] will not be able to read HDF5-1.10 files. Also, there is a big warning for the alpha release: PLEASE BE AWARE that the file format is not yet stable. DO NOT keep files created with this version.]

Various limitations and bugs

Once, we had to tell our Windows users to downgrade their version of h5py because a segmentation fault occurred with variable-length strings in the latest version. This is one of the disadvantages of using a compiled library instead of a pure Python library. There is no other choice since the only known implementation of HDF5 is written in C.

UTF-8 support in HDF5 seems limited, so in practice you need to rely on ASCII to avoid any potential problems.

There are few supported data types for metadata attributes. In Python, if your attributes are of an unsupported type (for example, tuples), they might be silently serialized via pickle to an opaque binary blob, making them unreadable in another language like MATLAB.

A surprising limitation: as of today, you still can't delete an array in an HDF5 file. More precisely, you can delete the link, but the data remains in the file so that the file size isn't reduced. The only way around this is to make a copy of the file without the deleted array (for example with the h5repack tool). This is problematic when you have 1TB+ files. The upcoming HDF5 1.10 promises to fix this partially, but it is still in alpha stage at the time of this writing.
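
The following sketch (file and dataset names are illustrative) shows this limitation: deleting the item removes only the link, and reclaiming the space requires rewriting the whole file, for example with h5repack:

    import os

    import numpy as np
    import h5py

    with h5py.File("big.h5", "w") as f:
        f.create_dataset("raw", data=np.zeros((100000, 100)))
    before = os.path.getsize("big.h5")

    with h5py.File("big.h5", "a") as f:
        del f["raw"]                      # removes the link only
    after = os.path.getsize("big.h5")
    print(before, after)                  # the file size is essentially unchanged

    # Reclaiming the space means rewriting the file, e.g.:
    #   h5repack big.h5 big_repacked.h5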

Performance issues

Since HDF5 is a sort of file system within a file, it cannot benefit from the smart caching/predictive strategies provided by modern operating systems. This can lead to poor performance.

If you use chunking, you need to choose the chunk shape carefully with respect to your access patterns and to the chunk cache size; otherwise you might end up with terrible performance. Optimizing performance with HDF5 is a rather complicated topic.
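
For example, the chunk shape should match the dominant access pattern. In the hedged sketch below (shapes are illustrative), the same array is stored with a row-friendly and a row-hostile chunking; reading a single row from the latter forces libhdf5 to touch one chunk per column:

    import numpy as np
    import h5py

    data = np.random.rand(10000, 1000).astype(np.float32)
    with h5py.File("chunks.h5", "w") as f:
        # Reading one row touches a single chunk here...
        f.create_dataset("by_rows", data=data, chunks=(64, 1000))
        # ...but touches 1000 chunks (one per column) here, which is much slower.
        f.create_dataset("by_columns", data=data, chunks=(1000, 1))

    with h5py.File("chunks.h5", "r") as f:
        row_fast = f["by_rows"][42]
        row_slow = f["by_columns"][42]    # same result, many more chunk reads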

In our application, we have a particular use case where we have a large contiguous array with, say, 100,000 rows and 1,000 columns (in reality these numbers may be much larger), and we need to access a small number of rows quickly. Unfortunately, there is no locality in our access patterns. We found that using h5py led to very slow access times, but this is due to a known weakness of the implementation of fancy indexing in h5py.

When performing a regular selection with slices, we also found that h5py is several times slower than memory-mapping the file with NumPy, but it's unclear whether this is due to h5py or to HDF5 itself.

We also found that we can actually bypass libhdf5 when reading an HDF5 file, provided that we use uncompressed contiguous datasets. All we have to do is find the address of the first byte of the array in the file, and memory-map the buffer with NumPy. This also leads to faster access times.
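
Here is a sketch of that bypass, assuming a file "data.h5" containing an uncompressed, contiguous dataset at "/raw" (both names are hypothetical); h5py's low-level API exposes the byte offset of the buffer:

    import numpy as np
    import h5py

    with h5py.File("data.h5", "r") as f:
        dset = f["/raw"]
        offset = dset.id.get_offset()     # byte offset of the raw buffer in the file
        shape, dtype = dset.shape, dset.dtype

    # Memory-map the buffer directly with NumPy; libhdf5 is no longer involved.
    arr = np.memmap("data.h5", mode="r", dtype=dtype, shape=shape, offset=offset)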

Overall, in this situation, using memory-mapping instead of h5py/HDF5 leads to read access times that are significantly faster.

You'll find a standalone benchmark as a Jupyter notebook here.

In conclusion, we found out the hard way that HDF5 may be quite slower than simpler container formats, and as such, it is not always a good choice in performance-critical applications. This was quite surprising as we (wrongly) expected HDF5 to be particularly fast in most situations. Note that performance might be good enough in other use-cases. If you consider using HDF5 or another format, be sure to run detailed benchmarks in challenging situations before you commit to it.

[Update: note that an earlier version of this paragraph mentioned a 100x speed increase, but it's been pointed out in the comments below that the benchmark was not comparing the right thing. The paragraph above and the benchmark have been updated accordingly. Earlier versions of the benchmark can be found in the notebook history.]

Poor support on distributed architectures

Parallel access in HDF5 exists, but it is limited and not easy to use: parallel HDF5 requires MPI.

HDF5 was designed at a time when MPI was the state of the art for high-performance computing. Now we have large-scale distributed architectures like Hadoop, Spark, etc. HDF5 isn't well supported on these systems. For example, on Spark, you have to split your data into multiple HDF5 files, which is precisely the opposite of what HDF5 encourages you to do [see also this document by the HDF Group].

By contrast, flat binary files are natively supported on Spark.
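
For instance, with PySpark (and assuming fixed-size float32 records in hypothetical files matching data/*.bin), flat binary files can be read directly:

    import numpy as np
    from pyspark import SparkContext

    sc = SparkContext("local[*]", "flat-binary-example")
    n_columns = 1000                                   # hypothetical record width
    records = sc.binaryRecords("data/*.bin", recordLength=4 * n_columns)
    rows = records.map(lambda rec: np.frombuffer(rec, dtype=np.float32))
    print(rows.count())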

Opacity

You depend on the HDF5 library to do anything with an HDF5 file. What is in a file? How many arrays are there? What are their paths, shapes, and data types? What is the metadata? Without the HDF5 library, you can't answer any of these questions. Even when HDF5 is installed, you need dedicated tools or, worse, you need to write your own script. This adds considerable cognitive overhead when working with scientific data in HDF5.

You can't use standard Unix/Windows tools like awk, wc, grep, Windows Explorer, text editors, and so on, because the structure of HDF5 files is hidden in a binary blob that only the standard libhdf5 understands. There is a Windows-Explorer-like HDFView tool written in Java that allows you to look inside HDF5 files, but it is very limited compared to the tools you find in modern operating systems.

A simpler and roughly equivalent alternative to HDF5 would be to store each array in its own file, within a sensible file hierarchy, with the metadata stored in JSON or YAML files. For the format of the individual arrays, one can choose, for example, a raw binary format without a header (arr.tofile() in NumPy), or the NumPy .npy format, which is just a flat binary file with a fixed-length ASCII header. [Note the paragraph about HDF5 in the page linked above] These files can easily be memory-mapped with very good performance, since the file system and the OS are then in charge.
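
A hedged sketch of this layout (directory and field names are illustrative): each array is a .npy file, the metadata lives next to it in JSON, and reading back is a memory map handled by the OS:

    import json
    import os

    import numpy as np

    os.makedirs("session1", exist_ok=True)
    np.save("session1/raw.npy", np.random.rand(100000, 1000).astype(np.float32))
    with open("session1/metadata.json", "w") as f:
        json.dump({"experimenter": "alice", "sample_rate": 20000.0}, f, indent=2)

    # Reading back: only the pages actually touched are read from disk.
    arr = np.load("session1/raw.npy", mmap_mode="r")
    rows = arr[[3, 57, 1024]]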

This leads to a self-documenting format that anyone can immediately explore with any command-line shell, on any computer on the planet, with any programming language, without installing anything, and without reading any specific documentation. In 20 or 30 years, your files are much more likely to be readable if they are stored in this format than if they're stored in HDF5.

Philosophy

HDF5 encourages you to put within a single file many data arrays corresponding to a given experiment or trial. These arrays are organized in a POSIX-like structure.

One might wonder: why not just use a hierarchy of files within a directory?

Modern file systems are particularly complex. They have been designed, refined, battle-tested, and optimized over decades. As such, despite their complexity, they're now very robust. They're also highly efficient, and they implement advanced caching strategies. HDF5 is just more limited and slower. Perhaps things were different when HDF5 was originally developed.

If you replace your HDF5 file by a hierarchy of flat binary files and text files, as described in the previous section, you obtain a file format that is more robust, more powerful, more efficient, more maintainable, more transparent, and more amenable to distributed systems than HDF5.

The only disadvantage of this more rudimentary container format I can think of is portability. You can always zip up the directory, but this is generally slow, especially with huge datasets. That being said, today's datasets are so big that they don't tend to move around much. Rather than sharing huge datasets, it might be a better idea to fire up a Jupyter server and serve analysis notebooks.

When datasets are really too big to fit on a single computer, distributed architectures like Spark are preferred, and we saw that these architectures don't support HDF5 well.

Conclusion

We've learned our lesson. Designing, maintaining, and promoting a file format within a community is hard. It cannot be reasonably done by a small group of people who also need to write software, develop algorithms, and do research.

I don't think we could have predicted all of our problems with HDF5, since we had only heard enthusiastic opinions. Maybe HDF5 was great a decade ago, and it has just become outdated.

What I do know is that we wouldn't have had these problems if we hadn't tried to develop a file format in the first place.

We've now rewritten our software to make it modular and completely agnostic to file formats. We've moved from writing a monolithic application to writing a library. We're encouraging our users to adapt these components to whatever file format they're already using. The APIs we provide make this straightforward.

There is always a tension: many of our users are biologists without a computer science background [to simplify, they're using Windows, Word, and MATLAB instead of Unix, vim/emacs, and Python], and they expect an integrated, single-click graphical program. The solution we've found is to develop the library first, and then separately write an integrated solution based on this library.

Thanks to Max Hunter and others for their comments on this post.