Encountered: "BUG! Probe cycle finished without running its full course"

Question

Encountered: "BUG! Probe cycle finished without running its full course"

jeromegn opened this issue 2 years ago · comments

Jerome Gravel-Niquet commented 2 years ago

I figured I might bring this up since it was labeled "BUG".

I'm not sure what can cause this, I've been restarting a few instances and encountered it. It seems to have recovered from it.

From my logs, I see this sequence of things happened:

Notification::Rejoin
Notification::Active

and then:

ERROR foca: error handling timer: BUG! Probe cycle finished without running its full course

caio · Answer 1 · Tue Apr 12 2022 14:59:35 GMT+0800 (China Standard Time)

Oooh this is cool! The notification log lines are super useful, thanks a lot!

This error is meant to catch mistakes wiring timer events - I put it in place because I knew I'd mess something up when adding auto rejoin. I wonder if this happened because the rejoin happened right in the middle of a probe cycle.. don't think I've mapped this scenario to a test.

What I'm trying to catch there are incomplete cycles that may lead to the instance suspecting a member is down needlessly- If the cluster is working fine, this just means potential extra noise from false positive suspicions (a few extra messages being exchanged so that the suspected member can remain alive in the cluster), but if the whole cluster is experiencing issues, this added noise may become problematic.

I'll try to write a test that reproduces this as soon as free time comes- It's probably (hopefully 😅 ) a missing cleanup somewhere.

caio · Answer 2 · Sun Apr 17 2022 18:53:28 GMT+0800 (China Standard Time)

Landed the fix, thanks again for the report! There are more details in the PR/commits, but it's as I guessed previously: missing cleanup, could've made reasoning difficult during cluster issues but a healthy cluster would barely notice the added noise caused by this scenario.

I'll see if I can get the time to wrap #3 up before Monday EOD then ship it all together, otherwise I'll release a 0.1.1 just with the bugfix by then.

Jerome Gravel-Niquet · Answer 3 · Mon Oct 17 2022 22:57:42 GMT+0800 (China Standard Time)

I've started seeing these again:

foca: error handling timer: BUG! Probe cycle finished without running its full course

However this time I don't think I have any interesting logs surrounding it! Mostly because I'm running this in production with a not-too-verbose log level.

caio · Answer 4 · Sat Oct 22 2022 17:50:31 GMT+0800 (China Standard Time)

Ugh this log line should've at least contained the event / timer_token used but I messed up defining the span on debug level :( Not quite sure what could cause this yet (my hunch is that this is again related to auto rejoin) - I'll save some time Monday to see if I can map it out properly. At the very least I'll fix the traces to include more info when warning

Thanks for the report! Reopening

caio · Answer 5 · Mon Oct 24 2022 17:42:48 GMT+0800 (China Standard Time)

Found a bug which I expect is what triggered this new occurrence. Pretty similar to the previous case: In the previous the issue was missing cleanup when the cluster declared the instance down; This one was missing a cleanup when the instance perceives the cluster as empty (A cluster-wide restart could've triggered it, for example)

One thing that I realized as I fixed is that while foca would keep working Ok after emitting this error, it would never resume probing. I'm fixing that too on #13. Will merge and cut a new release soon.

As for the tracing thing I mentioned in the previous comment: I thought the error log was coming from a trace, but it isn't. I've opened #14 to address this properly sometime in the future.

Jerome Gravel-Niquet · Answer 6 · Mon Oct 24 2022 20:25:38 GMT+0800 (China Standard Time)

Thank you! Just upgraded, will let you know if I encounter something like this again :)