pleonasm / merkle-tree

An implementation of a Merkle Tree in PHP

Migrate the code to zephir

alrik11es opened this issue · comments

Hey! I've tested this code and it looks nice. I tried using it just for fun, but the performance for big trees seems to become really, really slow.

I think this project should be migrated to a PHP C extension with Zephir, or just be redone in another language like Go.

https://zephir-lang.com/

What do you think? Worth a try?

Could you point me to an example script that shows off the performance problem? I'd like to at least verify that there isn't some issue with the current implementation before looking into other options like Zephir.

Also: if you are looking for a Go implementation of a merkle tree, many already exist.

Yes I know, but none as easy to use as this one. I was playing with this example:

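// $hasher and $finished are the hashing and "complete" callbacks, defined elsewhere.
$tree = new FixedSizeTree(10000, $hasher, $finished);
$tree->set(0, 'genesis');
for ($i = 1; $i < 10000; $i++) {
    $result = $tree->set($i, md5(mt_rand(0, 99999)));
    $output->write($i . "\r"); // progress indicator
}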

It's a real possibility that I was doing something wrong... Thanks for taking the time on this.

Okay, so the problem is I'm iterating over the entire tree basically every time ->set() is called in order to decide if I should execute the "complete" callback or not. Turns out for very large trees like 10000, this means at least 10000 ->set() calls which means at least 10000 * 10000 iterations... so O(n^2) time based on the size of the tree (and actually more, since I'm only talking about the base nodes). This is not ideal.
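To make the quadratic cost concrete, here is a minimal sketch that only counts work. It is not the library's actual code: `countScanSteps` and `countCounterSteps` are hypothetical helpers, and the scan is assumed to visit every leaf with no early exit.

```php
<?php
// Hypothetical illustration only; this does not use the merkle-tree library.

// Rescan every leaf slot after each set(), as described above.
function countScanSteps(int $size): int
{
    $leaves = array_fill(0, $size, null);
    $steps = 0;
    for ($i = 0; $i < $size; $i++) {
        $leaves[$i] = 'value';
        // Full pass over the leaves to decide whether the tree is complete.
        foreach ($leaves as $leaf) {
            $steps++;
        }
    }
    return $steps;
}

// Keep a running count of filled slots, so each set() does one check.
function countCounterSteps(int $size): int
{
    $leaves = array_fill(0, $size, null);
    $filled = 0;
    $steps = 0;
    for ($i = 0; $i < $size; $i++) {
        if ($leaves[$i] === null) {
            $filled++;
        }
        $leaves[$i] = 'value';
        $steps++; // stands for the single comparison $filled === $size
    }
    return $steps;
}

echo countScanSteps(10000), "\n";    // 10000 * 10000 slot visits
echo countCounterSteps(10000), "\n"; // one check per set()
```

The first function grows with the square of the tree size, the second linearly, which is the gap being described for the base nodes alone.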

The only reason I try to calculate the hash on every ->set() is for the case where the value being passed to ->set() is very large. For example, if you had a tree size of 32 and your ->set() values were 1GB each or something, then there would be no reason to keep them around once the hash has been made of those values, so I try to re-calc every time.

So basically in the case of very large values passed to ->set(), you want to re-calc more often, and in the case of very large trees, you want to re-calc less often. Right now I don't let the user choose. I'll look into a way to allow the choice for the user so the 10000 node case can be covered.
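One possible way to cover both cases, sketched under assumptions: hash each value as soon as it arrives so a large value need not be kept, but build the root only once every leaf is present. The class `LazyRootTree` and its methods are hypothetical and are not the library's API or necessarily what any release does. Pairing an odd node with itself is just one common convention.

```php
<?php
// Hypothetical sketch; names and behavior are illustrative, not the library's.
final class LazyRootTree
{
    /** @var array<int, string> leaf hashes keyed by index */
    private $leafHashes = [];
    private $size;
    private $hasher;

    public function __construct(int $size, callable $hasher)
    {
        if ($size < 1) {
            throw new InvalidArgumentException('Tree size must be at least 1');
        }
        $this->size = $size;
        $this->hasher = $hasher;
    }

    /** Returns the root hash once every leaf is set, otherwise null. */
    public function set(int $index, string $value): ?string
    {
        if ($index < 0 || $index >= $this->size) {
            throw new OutOfRangeException("Index $index is outside the tree");
        }
        // Hash immediately so a large $value does not have to be retained.
        $this->leafHashes[$index] = ($this->hasher)($value);
        // count() on a PHP array is O(1), so no rescan of the leaves is needed.
        if (count($this->leafHashes) < $this->size) {
            return null;
        }
        return $this->root();
    }

    /** Builds the root bottom-up from the stored leaf hashes. */
    private function root(): string
    {
        ksort($this->leafHashes);
        $level = array_values($this->leafHashes);
        while (count($level) > 1) {
            $next = [];
            for ($i = 0; $i < count($level); $i += 2) {
                // An odd node out is paired with itself.
                $right = $level[$i + 1] ?? $level[$i];
                $next[] = ($this->hasher)($level[$i] . $right);
            }
            $level = $next;
        }
        return $level[0];
    }
}

$tree = new LazyRootTree(2, 'md5');
var_dump($tree->set(0, 'genesis')); // NULL, the tree is not complete yet
var_dump($tree->set(1, 'second'));  // the root hash as a string
```

In this sketch the root is rebuilt on every ->set() made after the tree is full, so repeated updates to a complete tree still cost one pass per call.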

Just as a note, it took my computer 2 minutes to do a tree size of 10000 with re-calc on. I hacked it to turn the re-calc off just to see how fast it was, and it took 300ms instead, so roughly a 400x improvement (between 2 and 3 orders of magnitude).

Look for an update soon.

That was fast! Nice.

Seems right to me to try to optimize this library before any possible port to other languages.

I have at least one other question, but I'm going to open another issue for that.

So I know this was a LONG time ago, but I released 2.0 and it definitely fixes this issue by default now. So I'm going to close this as fixed.