PyQC / json_schema

How to serialize arrays efficiently?

tovrstra opened this issue

@matt-chan @sunqm @dgasmith @avirshup Sorry for spamming. This might be of interest. If not, feel free to ignore.

So far, there seems to be no mention of support for (numpy) arrays, even though efficient handling of array data is obviously relevant, e.g. to represent orbitals or density matrices.

There are a few ways to cram numpy arrays into a JSON format, but they are inefficient, e.g. lists of lists, base64 encoding, etc. I just came across MessagePack (http://msgpack.org/), which seems to be a promising alternative, so we should at least keep it on our radar.

MessagePack is similar to JSON but more compact, with support for binary data. This is a Python implementation: https://github.com/vsergeev/u-msgpack-python, and there is also an extension for numpy arrays: https://github.com/lebedov/msgpack-numpy/
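For concreteness, here is a minimal sketch of what that would look like, following the usage documented in the msgpack-numpy README (it builds on the official msgpack package rather than u-msgpack-python; treat the exact calls as an assumption to be checked against the current docs):

import numpy as np
import msgpack                 # official Python MessagePack bindings
import msgpack_numpy as m      # the numpy extension linked above

a = np.random.normal(0, 1, 1000)

# Pack the array into a compact binary blob; m.encode stores dtype and shape next to the raw bytes.
packed = msgpack.packb(a, default=m.encode)

# Unpack it back into an equivalent numpy array.
b = msgpack.unpackb(packed, object_hook=m.decode)
assert np.array_equal(a, b)

print(len(packed))  # on the order of the 8000 bytes of raw float64 data, plus a small header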

Similarities with JSON:

  • Ease of use
  • Serialization of hierarchical data
  • Suitable for long-term archiving of results
  • Cross-platform, Cross-language, ...
  • It is not possible to deserialize only a part of the data. (HDF5 can do this.)

The main advantage of MessagePack over HDF5 is simplicity. HDF5 is a horribly complex file format, with only a single implementation, which turns out to have small glitches, albeit very rarely. MessagePack already has many implementations because of its simple spec: https://github.com/msgpack/msgpack/blob/master/spec.md

A potential drawback is that MessagePack is not human readable.

What kind of "efficient" do you mean? Performance? Data compression? Human readability?

I personally prefer .tolist(). It's the easiest way to serialize an array. For 2D tensors, the performance and compression level are not a problem with this approach. For higher-dimensional tensors, we need to sort out other solutions anyway.
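A minimal sketch of that round trip, for reference: JSON preserves the nested-list structure and the float values, and the dtype comes back as numpy's default on reload.

import json
import numpy as np

a = np.random.normal(0, 1, (10, 10))   # a small 2D example

text = json.dumps(a.tolist())          # nested lists of Python floats
b = np.array(json.loads(text))         # rebuild the array from the parsed lists

# Python 3's float repr round-trips exactly, so values and shape survive;
# float64 is recovered only because it is numpy's default dtype.
assert b.shape == a.shape and np.array_equal(a, b)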

I was mainly thinking of fast IO and small files, without resorting to compression algorithms. I don't think readability matters much because I believe this is just a protocol to communicate between two different pieces of software. (?)

The size overhead of json.dumps(a.tolist()) is about 150%. E.g. try the following:

import numpy as np
import json
from base64 import b64encode

a = np.random.normal(0, 1, 1000)      # 1000 float64 values = 8000 bytes of raw data

print(len(json.dumps(a.tolist())))    # JSON text: a decimal representation of every float
print(len(b64encode(a.dumps())))      # base64-encoded pickle of the array
print(len(a.dumps()))                 # raw pickle bytes, close to the 8000-byte minimum

This gives me:

20588
10844
8133

I guess this is representative. Even for storing orbitals from a large DFT calculation or a database of DFT calculations, wasting 150% of the space seems too much. Base64 encoding "only" has a 33% overhead. It all depends on what you're willing to accept, obviously.
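For completeness, a sketch of what the base64 route implies in practice: the raw bytes alone do not carry dtype or shape, so some metadata has to be stored next to the payload (the field names below are just illustrative, not part of any schema):

import json
import numpy as np
from base64 import b64encode, b64decode

a = np.random.normal(0, 1, 1000)

# Store the raw bytes as base64 text, plus the metadata needed to rebuild the array.
doc = {
    "dtype": str(a.dtype),
    "shape": a.shape,
    "data": b64encode(a.tobytes()).decode("ascii"),
}
text = json.dumps(doc)

# Reconstruct: decode the payload and reimpose dtype and shape.
doc2 = json.loads(text)
b = np.frombuffer(b64decode(doc2["data"]), dtype=doc2["dtype"]).reshape(doc2["shape"])
assert np.array_equal(a, b)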

Can we move this discussion here?

I think it's unlikely that we will move off JSON as the upper level, as it's everywhere compared to BSON/MsgPack/HDF5 and the like. However, the JSON "spec" can be transferred to HDF5, BSON, or even MsgPack as is. Each of these formats encodes the same basic types, but compresses them in different ways. If you have something like a JSON format and you want to store it, hopefully you're dropping the data into Mongo or Redis, which will use a better encoding and compress it.
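To illustrate that point, a minimal sketch (assuming the msgpack package; the document contents here are made up): the same JSON-level document can be re-encoded as MessagePack without changing its structure.

import json
import msgpack   # pip install msgpack; BSON or HDF5 could play the same role

# Any JSON-level document: dicts, lists, strings, numbers.
doc = {"method": "B3LYP", "basis": "cc-pVDZ", "energy": -76.402345}

as_json = json.dumps(doc)
as_msgpack = msgpack.packb(doc)

# Same logical content, different wire encoding.
assert msgpack.unpackb(as_msgpack, raw=False) == json.loads(as_json)
print(len(as_json.encode("utf-8")), len(as_msgpack))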

Thanks for the link!

I see your point. Closing...