dotmesh-io / dotmesh

dotmesh (dm) is like git for your data volumes (databases, files etc) in Docker and Kubernetes

Home Page: https://dotmesh.com


dotmesh pull retry logic gets confused by partial errors when creating a filesystem

alaric-dotmesh opened this issue

A clone operation failed with a strange error:

time="2019-09-16T09:17:46Z" level=info msg="Downloading workspace & data from hub - downloaded 477.45/5171.85 MiB at 78.62 MiB/s (1/1)"
time="2019-09-16T09:17:46Z" level=info msg="Still pulling..." dots_pulling=1
time="2019-09-16T09:17:47Z" level=info msg="Transfer status polled" elapsed_ns=6776212170 index=1 message="Attempting to pull d671c4be-fb95-4835-a501-33c707fb66c2 got <Event zfs-recv-failed: err: \"exit status 1\", filesystemId: \"d671c4be-fb95-4835-a501-33c707fb66c2\", stderr: \"cannot receive incremental stream: checksum mismatch or incomplete stream\\n\">" sent_bytes=551458107 size_bytes=5423081024 status="retry 1" total=1 transfer_id=6edae9a4-7620-4c7a-acd2-a15566221b69
time="2019-09-16T09:17:47Z" level=info msg="Downloading workspace & data from hub - downloaded 525.91/5171.85 MiB at 77.61 MiB/s (1/1)"

The retry loop then tried again. However, the original failed attempt had already created some snapshots on the receiving side, while each retry kept trying to create the filesystem from scratch, and so kept failing:

time="2019-09-16T09:17:47Z" level=info msg="Still pulling..." dots_pulling=1
time="2019-09-16T09:17:48Z" level=info msg="Transfer status polled" elapsed_ns=33781652 index=1 message="Attempting to pull d671c4be-fb95-4835-a501-33c707fb66c2 got <Event zfs-recv-failed: err: \"exit status 1\", filesystemId: \"d671c4be-fb95-4835-a501-33c707fb66c2\", stderr: \"cannot receive new filesystem stream: destination 'pool/dmfs/d671c4be-fb95-4835-a501-33c707fb66c2' exists\\nmust specify -F to overwrite it\\n\">" sent_bytes=51 size_bytes=5423081024 status="retry 2" total=1 transfer_id=6edae9a4-7620-4c7a-acd2-a15566221b69

I've not dug into the code, but I suspect the "calculation of what we need to pull" step isn't being re-done inside the retry loop, so a failure that pulls in some snapshots will cause all subsequent retries to fail, as they try to pull in snapshots we've already got.

If we re-discovered the on-disk state when a fetch fails, we could probably cope with this scenario better.