Add retrying for BatchCalculateChecksumTask exceptions

Question

mikkonie opened this issue 2 months ago

The storage system we use for iRODS in production is experiencing a lot of performance issues. This results in checksum calculation errors as iRODS becomes unable to read files as required. SODAR gets the blame for that, which is factually incorrect but understandable, as it's the part of the system most visible to the user.

Taskflowbackend is able to recover from these crashes and continue the operation, so failing to calculate a single checksum will not stop the landing_zone_move flow execution. Alas, once we get to the actual validation part, the flow will naturally fail, as not all of the checksums have been computed correctly.

Restarting the flow does often (albeit not reliably) help, as the storage system may have recovered from its issues in the meantime.

Hence, we could try retrying the checksum calculation up to N times in case it fails due to a temporary server failure.
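A minimal sketch of what such a retry could look like, not the actual BatchCalculateChecksumTask code. The names `ChecksumError`, `calculate_with_retry`, `calc_func`, `retries` and `interval` are all hypothetical, and the real task would catch whatever exception the iRODS client raises instead of this stand-in.

```python
import time


class ChecksumError(Exception):
    """Hypothetical stand-in for a transient checksum calculation failure."""


def calculate_with_retry(calc_func, path, retries=5, interval=0.0):
    """
    Call calc_func(path), retrying up to `retries` more times if it fails.

    Returns the checksum on success. Re-raises the last exception once all
    attempts have been used up.
    """
    for attempt in range(retries + 1):
        try:
            return calc_func(path)
        except ChecksumError:
            if attempt == retries:
                raise  # Out of attempts, let the caller handle the failure
            time.sleep(interval)  # Give the storage system time to recover
```

With a non-zero `interval`, the wait between attempts gives the storage a chance to recover, which is the same effect we currently get by restarting the flow manually.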

This is, obviously, a workaround and a hack. The proper solution involves improving the storage backend. But if this does end up helping with the case of failed validations, it could be an acceptable temporary solution with a simple implementation. Might as well give it a shot.

Mikko Nieminen · Answer 1 · Mon Jun 03 2024 22:05:13 GMT+0800 (China Standard Time)

Done. It remains to be seen if this actually helps in production. This is one of those things which is not exactly trivial to test in dev.