ifduyue / python-xxhash

Python Binding for xxHash

Home Page: https://pypi.org/project/xxhash/


Picklable Interface to save state

mohammadhzp opened this issue

Hello, is it possible to make xxhash objects picklable? How?

It would really be great if they were picklable; that way we could save the hash state and resume hashing later on.

I spent a few hours on this but had no luck.

Hello @mohammadhzp, right now xxhash objects aren't picklable.

I agree with you that it's necessary to provide save/load APIs for the xxhash state.

I would also really like this feature. My use case is that I'm doing a partial hash of files: I want to hash the first N bytes of each file, then continue hashing the contents of the files that have hash collisions.

It is much faster to use xxhash in parallel processes when files live on an HDD.
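
To make that concrete, here is roughly the workflow I'd like to be able to write. This is only a sketch: the pickle round trip is exactly the part that does not work today, and partial_hash / resume_hash are hypothetical helpers.

import pickle
import xxhash


def partial_hash(fpath, num_bytes=65536):
    # Hash only the first num_bytes of a file.
    h = xxhash.xxh64()
    with open(fpath, 'rb') as file:
        h.update(file.read(num_bytes))
    # Hypothetical: pickling would capture the internal hasher state so
    # hashing could resume later, possibly in a different process.
    return h.intdigest(), pickle.dumps(h)


def resume_hash(fpath, saved, num_bytes=65536):
    # Continue hashing a file whose prefix hash collided.
    h = pickle.loads(saved)  # hypothetical round trip
    with open(fpath, 'rb') as file:
        file.seek(num_bytes)
        for chunk in iter(lambda: file.read(65536), b''):
            h.update(chunk)
    return h.intdigest()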

Benchmark:

from os.path import join, exists
import ubelt as ub
import random
import string


def _demodata_files(dpath=None, num_files=10, pool_size=3, size_pool=None):

    def _random_data(rng, num):
        return ''.join([rng.choice(string.hexdigits) for _ in range(num)])

    def _write_random_file(dpath, part_pool, size_pool, rng):
        namesize = 16
        # Choose the number of data parts from the size pool
        num_parts = rng.choice(size_pool)
        chunks = [rng.choice(part_pool) for _ in range(num_parts)]
        contents = ''.join(chunks)
        fname_noext = _random_data(rng, namesize)
        ext = ub.hash_data(contents)[0:4]
        fname = '{}.{}'.format(fname_noext, ext)
        fpath = join(dpath, fname)
        with open(fpath, 'w') as file:
            file.write(contents)
        return fpath

    if size_pool is None:
        size_pool = [1, 4, 16]

    if dpath is None:
        dpath = ub.ensure_app_cache_dir('pfile/random')
    rng = random.Random(0)
    # Create a pool of random chunks of data
    chunksize = 65536
    part_pool = [_random_data(rng, chunksize) for _ in range(pool_size)]
    # Write random files that have a reasonable collision probability
    fpaths = [_write_random_file(dpath, part_pool, size_pool, rng)
              for _ in ub.ProgIter(range(num_files), desc='write files')]

    for fpath in fpaths:
        assert exists(fpath)
    return fpaths


def benchmark():
    import timerit
    import ubelt as ub
    from kwcoco.util.util_futures import JobPool  # NOQA
    ti = timerit.Timerit(3, bestof=1, verbose=2)

    max_workers = 4

    # Choose a path to an HDD
    dpath = ub.ensuredir('/raid/data/tmp')

    fpath_demodata = _demodata_files(dpath=dpath, num_files=1000,
                                     size_pool=[10, 20, 50], pool_size=8)

    for timer in ti.reset('hash_file(hasher=xx64)'):
        with timer:
            for fpath in fpath_demodata:
                ub.hash_file(fpath, hasher='xx64')

    for timer in ti.reset('hash_file(hasher=xxhash) - serial'):
        jobs = JobPool(mode='serial', max_workers=max_workers)
        with timer:
            for fpath in fpath_demodata:
                jobs.submit(ub.hash_file, fpath, hasher='xx64')
            results = [job.result() for job in jobs.jobs]

    for timer in ti.reset('hash_file(hasher=xxhash) - thread'):
        jobs = JobPool(mode='thread', max_workers=max_workers)
        with timer:
            for fpath in fpath_demodata:
                jobs.submit(ub.hash_file, fpath, hasher='xx64')
            results = [job.result() for job in jobs.jobs]

    for timer in ti.reset('hash_file(hasher=xxhash) - process'):
        jobs = JobPool(mode='process', max_workers=max_workers)
        with timer:
            for fpath in fpath_demodata:
                jobs.submit(ub.hash_file, fpath, hasher='xx64')
            results = [job.result() for job in jobs.jobs]


if __name__ == '__main__':
    benchmark()

Results are:

Timed hash_file(hasher=xx64) for: 3 loops, best of 1
    time per loop: best=420.524 ms, mean=449.902 ± 21.0 ms
Timed hash_file(hasher=xxhash) - serial for: 3 loops, best of 1
    time per loop: best=541.824 ms, mean=594.758 ± 39.6 ms
Timed hash_file(hasher=xxhash) - thread for: 3 loops, best of 1
    time per loop: best=670.375 ms, mean=816.609 ± 144.3 ms
Timed hash_file(hasher=xxhash) - process for: 3 loops, best of 1
    time per loop: best=288.406 ms, mean=294.248 ± 5.7 ms

The hash_file(hasher=xxhash) - process variant is the fastest, clocking in at about 294 ms; threading actually does worse than serial mode; and the fastest non-process variant (the plain loop) clocks in at about 450 ms.

Of course this does not account for the overhead that would be required to pickle an existing hasher, but I think that should be minimal compared to the cost of reading files in a single loop.


If an instance of a hasher object were able to return some sort of picklable "state" object, and a new hasher instance could then be created at that state, that would be sufficient to achieve picklability.

Looking at the code, I think this would involve exposing xxhash_state via a public API, and then allowing the user to create a new xxhash.xxh32 or xxhash.xxh64 passing in that state. I'm not sure what type of data xxhash_state holds, but if it's just bytes it shouldn't be that bad to expose.
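
For example, a round trip through such an API might look like this (the state attribute and the state keyword argument are hypothetical names for the proposed feature, not existing xxhash APIs):

import xxhash

h = xxhash.xxh64()
h.update(b'some data')

state = h.state                 # hypothetical: packed state as bytes
h2 = xxhash.xxh64(state=state)  # hypothetical: resume from a saved state
h2.update(b'more data')

assert h2.hexdigest() == xxhash.xxh64(b'some datamore data').hexdigest()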


For my reference, I'm copying the info about the xxhash state I found. It looks like it's somewhat complicated. I'm not sure how straightforward it is to expose a picklable version in Python and also provide bindings to access a state / create a new hasher from one. But it is just a collection of 32-bit integers, so it should be possible.

typedef uint32_t XXH32_hash_t;

struct XXH32_state_s {
   XXH32_hash_t total_len_32; /*!< Total length hashed, modulo 2^32 */
   XXH32_hash_t large_len;    /*!< Whether the hash is >= 16 (handles @ref total_len_32 overflow) */
   XXH32_hash_t v1;           /*!< First accumulator lane */
   XXH32_hash_t v2;           /*!< Second accumulator lane */
   XXH32_hash_t v3;           /*!< Third accumulator lane */
   XXH32_hash_t v4;           /*!< Fourth accumulator lane */
   XXH32_hash_t mem32[4];     /*!< Internal buffer for partial reads. Treated as unsigned char[16]. */
   XXH32_hash_t memsize;      /*!< Amount of data in @ref mem32 */
   XXH32_hash_t reserved;     /*!< Reserved field. Do not read or write to it, it may be removed. */
};   /* typedef'd to XXH32_state_t */
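
To get a feel for what serializing this would involve: the fields above pack into a fixed 48-byte buffer (12 uint32 values). Here is a rough pure-Python sketch of that idea, where the field order and the fixed little-endian layout are my assumptions rather than a format xxHash defines:

import struct

# XXH32_state_s: 8 scalar uint32 fields plus mem32[4] -> 12 uint32s, 48 bytes.
# An explicit little-endian format ('<') keeps the packed bytes identical
# across platforms regardless of native byte order.
XXH32_STATE_FMT = '<12I'


def pack_xxh32_state(total_len_32, large_len, v1, v2, v3, v4,
                     mem32, memsize, reserved=0):
    # Pack the state fields into a portable byte string (illustrative).
    return struct.pack(XXH32_STATE_FMT, total_len_32, large_len,
                       v1, v2, v3, v4, *mem32, memsize, reserved)


def unpack_xxh32_state(data):
    # Inverse of pack_xxh32_state.
    values = struct.unpack(XXH32_STATE_FMT, data)
    names = ('total_len_32', 'large_len', 'v1', 'v2', 'v3', 'v4')
    state = dict(zip(names, values[:6]))
    state['mem32'] = list(values[6:10])
    state['memsize'] = values[10]
    state['reserved'] = values[11]
    return state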

OK, I'll try to add this feature in my spare time.

If you want help, I can also donate some time to this.

If you lay out the general structure, I can help with filling things out and testing.

@Erotemic glad you can help, welcome!

On the Python side, we need to implement __getstate__ and __setstate__; see https://docs.python.org/3/library/pickle.html#pickle-inst .
On the xxHash side, we need to convert xxh*_state_t to/from char*; there are no public APIs for this provided by xxHash.
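
As a pure-Python illustration of that protocol (a toy stand-in hasher, not the real binding, which would do the state serialization in C):

import pickle


class ToyHasher:
    # A trivial stand-in "hasher" used only to show the pickle protocol.
    def __init__(self):
        self._state = 0  # analogous to XXH64_state_t

    def update(self, data):
        for byte in data:
            self._state = (self._state * 31 + byte) & 0xFFFFFFFFFFFFFFFF

    def intdigest(self):
        return self._state

    def __getstate__(self):
        # The real binding would serialize XXH64_state_t to bytes here.
        return self._state.to_bytes(8, 'little')

    def __setstate__(self, packed):
        # ... and rebuild the internal state from those bytes here.
        self._state = int.from_bytes(packed, 'little')


# Round trip: the restored hasher continues where the original left off.
a = ToyHasher()
a.update(b'first')
b = pickle.loads(pickle.dumps(a))
b.update(b'second')
a.update(b'second')
assert a.intdigest() == b.intdigest()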

Hmm, so I think this is doable short-term without a public API provided by xxHash, but that means any change to the xxHash internals might break the bindings. It might be worth submitting an issue there so they can provide us with a public API that is guaranteed not to change.

Assuming either a public API that returns the packed char* bytes, or a hack that creates them by making assumptions about the xxHash internals, my guess is that __getstate__ would be implemented by allocating space for that char*, populating it, passing it to PyBytes_FromStringAndSize, deallocating the char*, and then returning the PyBytesObject, at which point Python could do whatever it wants with it. (Or would it be better to allocate that char* on the stack? I don't program in C very often, so I don't have a good intuition for memory management practices; I assume using the heap is correct in this instance, even though we would know the length of the buffer at compile time.)

Then to implement __setstate__, we could make a function that accepts a PyBytesObject, extracts a char* using PyBytes_AsString (which points into the bytes object's internal buffer, so it should not be freed), unpacks the information, and overwrites the existing state of the XXH\d\d_state_s object.

Does this seem reasonable? Am I missing any C-specific details?

It seems reasonable to me.

> Hmm, so I think this is doable short-term without a public API provided by xxHash, but that means any change to the xxHash internals might break the bindings. It might be worth submitting an issue there so they can provide us with a public API that is guaranteed not to change.

I remember that xxHash changed its state from a concrete struct to an opaque struct a long time ago, so I don't think there will be a public API.

The packed char* bytes of a state probably aren't the same across OS architectures (x86/x64/ARM, endianness, ...). Even if we complete this feature, it will still be limited in that way, and we would want to warn users.
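
For what it's worth, the endianness half of that could be mitigated by defining the packed format with an explicit byte order instead of dumping the raw struct memory; with Python's struct module, for example:

import struct
import sys

value = 0x01020304
native = struct.pack('=I', value)  # native byte order: differs across hosts
fixed = struct.pack('<I', value)   # explicit little-endian: always the same

print(sys.byteorder, native.hex(), fixed.hex())
# On a little-endian host both print '04030201'; on a big-endian host the
# native packing would be '01020304' while the fixed one stays '04030201'.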

Just chiming in that a pickleable xxhash object would be very useful. Thank you for working on this important feature.