o1-labs / o1js

TypeScript framework for zk-SNARKs and zkApps

Home Page: https://docs.minaprotocol.com/en/zkapps/how-to-write-a-zkapp

Indexed Merkle tree to improve offchain state efficiency

mitschabaude opened this issue · comments

Why:
Our current way to implement maps, or sets that support non-inclusion proofs, via Merkle trees is highly inefficient. It requires a Merkle tree of height 256 -- large enough to use arbitrary field elements as indices, so that we can write a key-value pair at the index determined by a hash of its key.

The excessive amount of hashing required to update a Merkle tree of this height (2 x 256 = 512 hashes per update) is the main bottleneck in our offchain state zkprogram, and the reason we can only process 6 state updates per proof at the moment.
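The arithmetic behind these numbers can be sketched as follows (my own cost model, assuming one hash per tree level and that an update recomputes the path twice, once against the old root and once for the new root):

```typescript
// In-circuit cost model: verifying one leaf update recomputes the
// Merkle path twice (old root check + new root computation),
// so the hash count is 2 * height per touched leaf.
function hashesPerLeafUpdate(height: number): number {
  return 2 * height;
}

// Current map: height 256, so indices can be arbitrary field elements.
const current = hashesPerLeafUpdate(256); // 512

// Indexed map sized for ~2^30 keys (height 30). A plain update touches
// 1 leaf; an insertion touches 2 (the new leaf and its predecessor).
const indexedUpdate = hashesPerLeafUpdate(30); // 60
const indexedInsert = 2 * hashesPerLeafUpdate(30); // 120

console.log(current / indexedUpdate); // ~8.5, consistent with "factor of 8-10"
```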

How:
Indexed Merkle trees are a recently invented, vastly more efficient way to implement the same primitives (map, set). They allow us to store key-value pairs at subsequent indices, so that our tree only has to be the size that we want as the max number of keys (i.e. something like height 30 to support 2^30 ~= 1 billion keys).

Implementing this would reduce hashing for Merkle updates by a factor of 8-10. For offchain state, we could easily process ~50 updates per proof.

I suggest implementing two different variants: IndexedMerkleMap and IndexedMerkleSet. Maps need to encode an additional value field along each leaf, which makes them slightly more complex to implement. Sets are fully described here.

This also presents an opportunity to encode the new Merkle tree as a provable type in a more natural way, so that Merkle trees can just be passed into methods, and a simple call like IndexedMerkleMap.set(key, value) can replace the current complex back-and-forth of witnessing a value and a Merkle witness, computing the root twice, etc.

Sketch of API

This is a suggestion for the API that IndexedMerkleMap should support:

```ts
type IndexedMerkleMap = {
  // (lower-level) method to insert a new leaf `(key, value)`; proves that `key` doesn't exist yet
  insert(key: Field, value: Field): void;

  // (lower-level) method to update an existing leaf `(key, value)`; proves that the `key` exists
  update(key: Field, value: Field): void;

  // method that performs _either_ an insertion or an update, depending on whether the key exists
  set(key: Field, value: Field): void;

  // method to get a value from a key; returns an option to account for the key not existing
  // note: this has to prove that the option's `isSome` is correct
  get(key: Field): Option<Field>; // the optional `Field` here is the value

  // optional / nice-to-have: remove a key and its value from the tree; proves that the key is included
  // (implementation: leave a wasted leaf in place but skip it in the linked-list encoding)
  remove(key: Field): void;
};
```
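To make the intended semantics of this proposal concrete, here is a minimal, non-provable reference model of the same API surface, using `bigint` in place of `Field` (this is only an illustration of the call semantics; the real version would additionally verify Merkle paths and non-inclusion in-circuit):

```typescript
// Plain-JS model of the proposed semantics; no proofs, no Merkle tree.
type Option<T> = { isSome: boolean; value: T };

class IndexedMerkleMapModel {
  private entries = new Map<bigint, bigint>();

  // insert a new leaf; the provable version proves the key doesn't exist yet
  insert(key: bigint, value: bigint): void {
    if (this.entries.has(key)) throw Error('key already exists');
    this.entries.set(key, value);
  }
  // update an existing leaf; the provable version proves the key exists
  update(key: bigint, value: bigint): void {
    if (!this.entries.has(key)) throw Error('key does not exist');
    this.entries.set(key, value);
  }
  // insert-or-update, depending on whether the key exists
  set(key: bigint, value: bigint): void {
    this.entries.set(key, value);
  }
  // returns an option to account for the key not existing
  get(key: bigint): Option<bigint> {
    let v = this.entries.get(key);
    return v === undefined ? { isSome: false, value: 0n } : { isSome: true, value: v };
  }
  // the provable version would leave a skipped leaf in the linked list
  remove(key: bigint): void {
    if (!this.entries.has(key)) throw Error('key does not exist');
    this.entries.delete(key);
  }
}
```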
commented

Having toJSON() and fromJSON() is also important. These are the challenges I'm facing with the current MerkleMap:

  • MerkleMap.set takes a long time, and reconstructing the big MerkleMap from the elements can easily take 10 minutes
  • It is possible to reduce this time 10x by serializing the Merkle Map, but the files are huge - can be hundreds of MB

Having the possibility to use indexed Maps to reduce Map reconstruction time and serialized Map size is very important.
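One way such serialization could stay small (an illustrative sketch, not the o1js API): store only the leaves, which keeps the serialized form proportional to the number of keys, and recompute the inner nodes on load.

```typescript
// Hypothetical leaves-only serialization; inner Merkle nodes are
// recomputed after loading instead of being stored.
type SerializedLeaf = { key: string; value: string };

function leavesToJSON(leaves: Map<bigint, bigint>): string {
  return JSON.stringify(
    [...leaves].map(([k, v]) => ({ key: k.toString(), value: v.toString() }))
  );
}

function leavesFromJSON(json: string): Map<bigint, bigint> {
  let parsed: SerializedLeaf[] = JSON.parse(json);
  return new Map(parsed.map((l) => [BigInt(l.key), BigInt(l.value)]));
}
```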

commented

Can you give me some links to the Indexed Maps design? I'm curious how they handle proofs of exclusion to make sure that no two keys are the same.

commented

Now I understand that they maintain pointers that allow for easy generation of exclusion proofs.
indexed merkle map
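The pointer mechanism can be sketched like this (my own illustration, using `bigint` in place of `Field`): leaves are kept as a linked list sorted by key, and a "low leaf" whose pointer jumps over a key proves that the key is absent. Since every insertion must first prove exclusion, no two leaves can ever hold the same key.

```typescript
// Each leaf stores its key and a pointer to the next-larger key;
// nextKey = 0n marks the leaf with the largest key.
type Leaf = { key: bigint; nextKey: bigint };

// A "low leaf" for k proves k is absent: its key is below k and its
// pointer skips over k, so no leaf with key k can exist in the list.
function provesExclusion(low: Leaf, k: bigint): boolean {
  return low.key < k && (k < low.nextKey || low.nextKey === 0n);
}

// Example: a tree holding keys {5, 20}, plus the sentinel leaf for key 0
const leaves: Leaf[] = [
  { key: 0n, nextKey: 5n },  // sentinel
  { key: 5n, nextKey: 20n },
  { key: 20n, nextKey: 0n }, // largest key
];
```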

Indexed Merkle Maps is a great idea and will significantly speed up the code of rollups.
Thank you, @mitschabaude for this great addition to o1js

Great idea for the exclusion proof.

> main bottleneck in our offchain state zkprogram, and the reason we can only process 6 state updates per proof at the moment.

A little off topic here: why not implement the rollup's state the same way Mina's ledger is implemented? The Mina ledger has depth 35, which is achieved by giving each account an index unrelated to the account's key, with some book-keeping outside the tree. It should be more efficient than an Indexed Merkle Tree, since you don't have to update 2 leaves, and there is no range check.

I've completely missed that you need to prove the account exclusion when creating new account, so this is indeed useful also for a state of the rollup. I wonder why transaction snark doesn't do that or if it does where...

> I've completely missed that you need to prove the account exclusion when creating new account, so this is indeed useful also for a state of the rollup. I wonder why transaction snark doesn't do that or if it does where...

@MartinOndejka Transaction snark doesn't prove it, it relies on consensus

> I've completely missed that you need to prove the account exclusion when creating new account, so this is indeed useful also for a state of the rollup. I wonder why transaction snark doesn't do that or if it does where...
>
> @MartinOndejka Transaction snark doesn't prove it, it relies on consensus

That begs the question: what is the point of the transaction snark, then?

How does this compare to hashing the keyspace to, say 64 bits, and using MerkleTree(65) with the hashed keys?

This should also give 1 billion collision free insertions.
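As a quick sanity check on "collision free" (my own birthday-bound arithmetic, not from the thread): with N random keys hashed into b bits, the collision probability is roughly 1 - exp(-N^2 / 2^(b+1)), so ~2^30 insertions into a 64-bit space carry about a 3% collision risk.

```typescript
// Birthday-bound estimate: P(collision) ≈ 1 - exp(-N^2 / 2^(b+1))
// for N uniformly random keys hashed down to b bits.
function collisionProbability(n: number, bits: number): number {
  return 1 - Math.exp(-(n * n) / 2 ** (bits + 1));
}

// ~1 billion (2^30) insertions into a 64-bit key space:
const p = collisionProbability(2 ** 30, 64);
console.log(p.toFixed(4)); // ≈ 0.0308, i.e. about a 3% chance of some collision
```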

> How does this compare to hashing the keyspace to, say 64 bits, and using MerkleTree(65) with the hashed keys?
>
> This should also give 1 billion collision free insertions.

@KimlikDAO-bot some operations in IndexedMerkleTree have the same efficiency as a normal Merkle tree of the same size; others (e.g. insertions) use 2x as many constraints, because they update 2 leaves instead of 1. So, if we compare against a normal Merkle tree of double the height, IndexedMerkleTree will perform at least as well, and sometimes better.

@KimlikDAO-bot the PR where we implemented it has concrete numbers: #1666 (these scale about linearly with the height)

> @KimlikDAO-bot the PR where we implemented it has concrete numbers: #1666 (these scale about linearly with the height)

Thank you! If I'm interpreting this right, IndexedMerkleTree is an optimization for when you have a bound N << 2^255 on the number of insertions you will make (which is almost all use cases).

Here is another proposed solution if you have a bound N on the number of insertions: hash the keys from 255 bits down to roughly log(N^2) bits. Now the Merkle height is reduced from 255 to log(N^2), and hash collisions are very unlikely. The constraints emitted per level should be slightly lower in the normal MerkleTree than in IndexedMerkleTree (since IndexedMerkleTree keeps more intricate Merkle nodes).

I'm curious how these two would compare. I will try to benchmark it if I can figure out how to print the number of constraints.

> Here is another proposed solution if you have a bound N on the number of insertions: Hash the keys from 255 bits to only roughly log(N^2) bits. Now your Merkle height is reduced from 255 to log(N^2) and hash collisions are very unlikely. The constraints we're emitting per depth should be slightly lower in the normal MerkleTree compared to IndexedMerkleTree. (since IndexedMerkleTree keeps more intricate merkle nodes)

Your proposal is scarier to me, because you are reducing the key space to N^2 values. There is a difference between the number of supported insertions (which can be fairly small) and the size of the key space (the number of keys supported in theory). For normal Merkle maps, both are the same; but for an indexed Merkle map, the key space is all field elements, no matter how small we make the map.

If you reduce the key space, then, since keys are usually hashes of larger values, you make it easier for two such key hashes to collide even though the actual keys differ. Such a collision could be detected offchain by an attacker and exploited.
For example, if keys are hashes of public keys and the Merkle map represents coin balances, then finding a second preimage of someone else's public key hash could enable you to spend their balance.

I like the IndexedMerkleMap design since it side-steps all security hazards associated with making the key space smaller.

> Such a collision could be detected offchain by an attacker and be exploited. For example, if keys are hashes of public keys and the Merkle map represents coin balances, then finding a second preimage of someone else's public key hash could enable you spend their balance.

You're right, the hashing is not good for most use cases!

Even in the cases where the keys are guaranteed to be pseudo-random (due to some other proof) I'm not sure which one would be cheaper.

We applied the key-hashing trick to keep a MerkleSet of HumanIDs. HumanIDs come with their own proofs of truthful computation, and, when computed truthfully, they are random (with the randomness provided by the "blinding helpers" from the zk passport discussion).