tweag / funflow

Functional workflows

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

ContentHashable for Hash{Map, Set} is not stable

guibou opened this issue · comments

Describe the bug

  • use ContentHashable to hash the content of an HashMap (directly or indirectly)
  • update my haskell packages. A new version of hashable is released, which change the hash function. The order of HashMap is hence changed, and my nicely content hashed object suddently change their hash.

See the instance for HashMap in the code, it uses toList, the order is unspecified.

https://github.com/tweag/funflow/blob/master/cas/hashable/src/Data/CAS/ContentHashable.hs#L361-L369

To Reproduce

main = do
   print =<< contentHash $ HashMap.fromList [("hello", 10), ("goodbye", 20)]

Run this code with different version of hashable (You may have to try different keys in the map, or a bigger map, in order to trigger the effect).

Expected behavior

It is impossible to guarantee that the hash won't change, but either:

  • Do not expose an instance for a type when it is known that the hash will change systematically on each release of a library as fundamental as hashable
  • (or) sort the values, at the cost of an Ord, Eq constraints on the keys of the HashMap.

The branch https://github.com/guibou/funflow/tree/stable_unordered-containers-hash-cas-hashable contains a commit which fix for this issue: it pre-sort the HashMap before hashing.

It adds an Ord constraint, so the design is discutable. I can open an MR if you agree with the idea.

However, removing cas-hashable from our codebase took me less time than fixing this (I'm now using a one line function which compute the sha256 sum of the output of cereal serialization, which is enough for my needs and should be more robust and easier to maintain), so I won't continue my efforts here unless you ask / continue the discussion.

Note that I wrote a custom instance for HashMap / HashSet for Serialize which sort the data first, as described here.

Hi @guibou, thanks for reporting this issue - I imagine that was a tricky one to track down 😬. We will take a closer look and get back to you regarding possible fix you linked. I would also be interested in hearing your use case for cas-hashable and funflow if you don't mind sharing.

Hi @dorranh.

We are solving virtual population of patient and the effect of treatments on them. So basically, we have a list of tasks (i.e. simulating a patient), some tasks depends on other task (i.e. comparing treatment effect versus control arm need to wait for the result of both task on the same patient. Or generating complete trial statistics needs to wait on all tasks) and we want to run theses tasks on a cluster of machines on AWS and we want to have an efficient and robust cache.

Caching is important for us because imagine the same simulation, we can run it with 1 million patient on two protocol arms (i.e. placebo versus treatment), that's 2 millions patients.. At the end of this run, we realize that we want to test another treatment (for example, double dose). With caching, we can avoid recomputing 2 millions patients (placebo / first treatment) and only recompute the new treatment.

We use content hash of the patient information / model / treatment.

Initially all of that was done with funflow and hence cas-hashable. It was completely rewritten with porcupine (https://github.com/YPares/porcupine). cas-hashable was still used to compute the hash of our resources until recently.

edit: that's why robust hash is important for us (see also #202) because when hash change, caching is invalidated and it may lead to additional computation cost, storage cost and user frustration.