Issue with replication

Question

pfrazee opened this issue 3 years ago

Repro repo: https://github.com/pfrazee/hyper-replication-bug

Something in the replication code is failing, causing the program to exit prematurely without printing any information.

What's happening in the repro repo is that I'm creating a set of "Nodes" which are corestores, a set of cores, and autobases. I'm then randomly connecting and disconnecting them using core.replicate(), and also issuing put() or del() operations. There are no reads occurring yet, so the rebased-hypercore index isn't being touched (no apply calls).
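The shape of that loop can be modeled without any of the real libraries. The sketch below is only a stand-in for the repro harness: `mulberry32`, `runSimulation`, and the log format are hypothetical names, and it tracks only who is connected to whom and which operations were issued. A seeded random source is assumed here so that a failing sequence can be replayed.

```javascript
// Hypothetical stand-in for the repro harness. It does not use Corestore,
// Hypercore, or Autobase; it only records connections and operations.

// Small seeded PRNG so that a given seed always yields the same sequence.
function mulberry32 (seed) {
  let a = seed >>> 0
  return function () {
    a = (a + 0x6D2B79F5) >>> 0
    let t = a
    t = Math.imul(t ^ (t >>> 15), t | 1)
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61)
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296
  }
}

// Requires nodeCount >= 2 so that a node always has someone to connect to.
function runSimulation ({ nodeCount, steps, seed }) {
  const rand = mulberry32(seed)
  const pick = (n) => Math.floor(rand() * n)
  const links = new Set() // "a-b" with a < b, one entry per live connection
  const log = []          // every action taken, in order

  for (let i = 0; i < steps; i++) {
    const a = pick(nodeCount)
    const roll = rand()
    if (roll < 0.5) {
      // Toggle the connection between node a and a different node b.
      let b = pick(nodeCount - 1)
      if (b >= a) b++
      const key = Math.min(a, b) + '-' + Math.max(a, b)
      if (links.has(key)) {
        links.delete(key)
        log.push({ type: 'disconnect', a, b })
      } else {
        links.add(key)
        log.push({ type: 'connect', a, b })
      }
    } else {
      // Issue a write on node a; no reads are performed.
      const type = roll < 0.75 ? 'put' : 'del'
      log.push({ type, node: a, key: 'key' + pick(10) })
    }
  }
  return { links, log }
}

const { links, log } = runSimulation({ nodeCount: 4, steps: 200, seed: 1 })
console.log(log.length) // 200
```

Replaying the same seed reproduces the same sequence of connects, disconnects, and writes, which makes it easier to narrow down which topology precedes the failure.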

The failure seems to occur during an oplog append, and I traced it as follows:

head() is calling _getInputNode() on a remote core,
which in turn is calling core.get().
That cache misses,
so it calls out to the replicator which is creating a request.
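The chain traced above can be sketched as a toy model. `ToyCore` and `ToyBase` are hypothetical stand-ins, not the real Hypercore or Autobase classes, and the sketch assumes that `_getInputNode()` reads the latest block of an input core. It only illustrates how a cache miss on a remote core becomes a request that stays pending when no connected peer can answer it.

```javascript
// Toy model of the traced call path. These classes are hypothetical and do
// not reflect the real Hypercore or Autobase implementations.
class ToyCore {
  constructor (length) {
    this.length = length   // number of blocks the core is known to have
    this.cache = new Map() // blocks available locally
    this.requests = []     // requests handed to the "replicator"
  }

  get (index) {
    if (this.cache.has(index)) return Promise.resolve(this.cache.get(index))
    // Cache miss: queue a request that only a connected peer could resolve.
    return new Promise((resolve) => {
      this.requests.push({ index, resolve })
    })
  }
}

class ToyBase {
  constructor (inputs) {
    this.inputs = inputs
  }

  // Stand-in for _getInputNode(): read the latest block of one input core.
  getInputNode (core) {
    return core.get(core.length - 1)
  }

  // Stand-in for head(): needs the latest block of every input core.
  head () {
    return Promise.all(this.inputs.map((core) => this.getInputNode(core)))
  }
}

const local = new ToyCore(1)
local.cache.set(0, 'local-node')
const remote = new ToyCore(3) // length is known, but no blocks are downloaded

const base = new ToyBase([local, remote])
const pendingHead = base.head() // never settles: nobody answers the request
console.log(remote.requests.length) // 1
```

In this model the promise returned by `head()` stays pending for as long as the remote core has no peer to serve block 2, so anything awaiting it (such as an append) never completes.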

At that point, you get into the complexities of the request code, and I figured it'd be better to pass this off to y'all.

Another interesting piece of info: if I connect all of my "nodes" (corestores) to each other for replication, the error doesn't occur.

Mathias Buus · Answer 1 · Mon Nov 08 2021 22:34:16 GMT+0800 (China Standard Time)

Can you reopen this on autobase? I'll comment here for now. This is because append calls heads, which wants to replicate the data.

This should error with "block not available" (but that hasn't landed yet, which is why it hangs).
Autobase probably should NOT fetch it all for an append.
Your test doesn't replicate writers with each other. Not sure if that's intentional, but that's why it bails.
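As a rough illustration of the first point, a fail-fast lookup could look like the sketch below. `ToyCore` and `lookup()` are hypothetical and are not the Hypercore API. The only detail taken from this thread is the error message. The lookup is synchronous for brevity, whereas a real `get()` would be asynchronous.

```javascript
// Hypothetical sketch of failing fast instead of waiting on a request that
// no connected peer can answer. Not the real Hypercore implementation.
class ToyCore {
  constructor () {
    this.cache = new Map() // blocks available locally
    this.peers = new Set() // peers currently replicating this core
    this.requested = []    // block indexes requested from peers
  }

  lookup (index) {
    if (this.cache.has(index)) {
      return { status: 'hit', value: this.cache.get(index) }
    }
    // Nobody to ask: surface an error rather than leaving the caller waiting.
    if (this.peers.size === 0) throw new Error('block not available')
    this.requested.push(index)
    return { status: 'requested' }
  }
}

const isolated = new ToyCore()
let message = null
try {
  isolated.lookup(0)
} catch (err) {
  message = err.message
}
console.log(message) // block not available
```

With an error like this, a caller that only has a partial view of the writers would see an explicit failure instead of a silent stall.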

Paul Frazee · Answer 2 · Mon Nov 08 2021 23:17:56 GMT+0800 (China Standard Time)

Moved it over.

> Your test doesn't replicate writers with each other. Not sure if that's intentional, but that's why it bails.

It does, but it's intentionally connecting various subsets of the corestores to simulate various network conditions.

Mathias Buus · Answer 3 · Mon Nov 08 2021 23:32:22 GMT+0800 (China Standard Time)

Closing to keep it tidy here.