lovoo / goka

Goka is a compact yet powerful distributed stream processing library for Apache Kafka written in Go.

Panics in goka

pfortin-urbn opened this issue

I see that your code sometimes panics, and it seems like this can happen during the normal course of running. As an example, I incurred a panic because my decoder returned an error. That is NOT a wanted panic in production: my app just dies, and when it gets restarted with the same consumer offset in Kafka, the scenario just panics forever. I adjusted my decoder to stop returning errors so my application would not die.

Can I get an explanation of the scenario I just described, and also of why goka actually panics instead of passing a fatal error back to the application and letting the app decide what to do?
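For concreteness, here is a minimal sketch of a codec in that style; goka's Codec interface expects Encode and Decode methods that may return an error, and a Decode error is what triggered the panic described above. The Event type and JSON payload are purely illustrative.

```go
package example

import (
	"encoding/json"
	"fmt"
)

// Event is a made-up message type used only for this illustration.
type Event struct {
	ID   string `json:"id"`
	Name string `json:"name"`
}

// EventCodec implements goka's Codec interface (Encode/Decode with error).
type EventCodec struct{}

func (c *EventCodec) Encode(value interface{}) ([]byte, error) {
	e, ok := value.(*Event)
	if !ok {
		return nil, fmt.Errorf("EventCodec: expected *Event, got %T", value)
	}
	return json.Marshal(e)
}

// Decode returns an error on malformed input; in the scenario above this
// error surfaced as a panic instead of a clean shutdown.
func (c *EventCodec) Decode(data []byte) (interface{}, error) {
	var e Event
	if err := json.Unmarshal(data, &e); err != nil {
		return nil, fmt.Errorf("EventCodec: decoding failed: %w", err)
	}
	return &e, nil
}
```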

Hi Paul,
you're right, the code sometimes panics. The reason is to get out of a user callback, e.g. when ctx.Fail is called. It also panics if invalid calls are made.
What you're experiencing though is a bug, a critical one actually, as it can easily be triggered. It's been there for a while now, but I'm actually about to fix it.
The intended behavior is like this:

  • if sarama's group consumer reports a transient error, it should attempt a rebalance
  • if there's an error during rebalance/setup/shutdown or in a user callback (ctx.Fail, panic, invalid call), the app should shut down with an error (see the sketch below).
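As a rough sketch of the second point: a user callback can report a non-recoverable problem via ctx.Fail, and goka is then expected to shut the processor down with that error. The topic name and message type below are assumptions for illustration only.

```go
package example

import (
	"errors"

	"github.com/lovoo/goka"
)

// process sketches a callback that hits a non-recoverable problem and
// reports it via ctx.Fail, which aborts the callback and is expected to
// stop the processor with that error.
func process(ctx goka.Context, msg interface{}) {
	text, ok := msg.(string)
	if !ok {
		ctx.Fail(errors.New("unexpected message type"))
		return
	}
	// Forward the message; "output-topic" is a made-up stream name.
	ctx.Emit("output-topic", ctx.Key(), text)
}
```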

However, our current implementation of multierr.Errors does not allow errors to be unwrapped, so we can't distinguish between those cases and hence always restart, as you've experienced. Or even worse, sometimes it gets stuck. This is related to issue #302 and also triggers the bug mentioned in #334.

The current goal is to get rid of goka's multierr package and use the one from hashicorp. That allows the distinction, so we can get rid of the flaky behavior, along with some bugfixes and a simplification of the whole error handling, because it clearly is a bit messy.
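For illustration, here is a minimal sketch of the distinction hashicorp's go-multierror enables: its error type implements Unwrap, so errors.Is can look inside accumulated errors and tell a transient error apart from a fatal one. The errRebalance sentinel is made up for this example.

```go
package example

import (
	"errors"
	"fmt"

	"github.com/hashicorp/go-multierror"
)

// errRebalance stands in for a transient error that should trigger a
// rebalance rather than a shutdown; the name is made up for this sketch.
var errRebalance = errors.New("transient: rebalance required")

func classify() {
	var result error
	result = multierror.Append(result, fmt.Errorf("setup failed: %w", errRebalance))
	result = multierror.Append(result, errors.New("state recovery failed"))

	// Because hashicorp's multierror supports Unwrap, errors.Is can reach
	// the wrapped sentinel inside the accumulated errors.
	if errors.Is(result, errRebalance) {
		fmt.Println("transient: attempt a rebalance")
	} else {
		fmt.Println("fatal: shut down with an error")
	}
}
```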

Would that solve your issue?

I believe that your change will definitely help, but my thought is that a library should never panic unless you are in the creation phase of the application and a critical dependency does not exist or is inaccessible for some reason. Having the library panic in the middle of a running application could cause other inconsistencies in that application's code or state that "could" be worse. My thinking is that a library should always return errors (some being fatal) which the application MUST always handle properly, and in the case of a "fatal" error ensure that whatever caused it is fixed, or else the application code can decide to panic.

Thanks for the great library!!

Sorry, then I got you wrong. We use panics internally, but of course the user will never face a panic, as the processor fully recovers from them. If the decode errored and goka panicked, then it's another bug and we'll look into it; that should never happen!
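For clarity, the general Go pattern meant here (not goka's actual implementation) is a deferred recover that converts a panic inside a callback into an ordinary error for the caller:

```go
package example

import "fmt"

// callSafely runs a callback and converts any panic it raises into an
// error, so the caller sees an error value instead of a crashing goroutine.
func callSafely(cb func()) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("recovered from panic in callback: %v", r)
		}
	}()
	cb()
	return nil
}
```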

What should happen on a decode error? Also, I failed to mention this, but we use goka v1.0.6.

I'd say the processor should shut down, as this is a non-recoverable error: there's something wrong/unexpected on the wire. If the processor is supposed to tolerate/handle erroneous data, the codec must handle that. This is explained here (formatting is a bit broken, just realized).
What I'm not fully sure about is what happens if the codec decides to "drop" a value. Without further testing, I'd say it has to return nil, and the processor or the callback then drops it.
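A rough sketch of that tolerant-codec idea, under the assumption that returning nil from Decode is the way to drop a value and that the callback then skips nil messages:

```go
package example

import (
	"encoding/json"

	"github.com/lovoo/goka"
)

// TolerantCodec swallows decode errors and returns nil so that bad data is
// dropped instead of stopping the processor. Whether nil is the right
// "drop" signal is an assumption, as noted above.
type TolerantCodec struct{}

func (c *TolerantCodec) Encode(value interface{}) ([]byte, error) {
	return json.Marshal(value)
}

func (c *TolerantCodec) Decode(data []byte) (interface{}, error) {
	var v map[string]interface{}
	if err := json.Unmarshal(data, &v); err != nil {
		// Drop the message rather than failing the processor.
		return nil, nil
	}
	return v, nil
}

// processTolerant shows the callback side: skip values the codec dropped.
func processTolerant(ctx goka.Context, msg interface{}) {
	if msg == nil {
		return // dropped by the codec
	}
	// ... normal handling of msg ...
}
```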

The way I got around it was to put error values in the return struct and always return nil for the error. Then I had my processor handle the error by logging it and aborting the processing of that message. This does not seem ideal, but it's the only way: if you crash, you need to figure out what happened, fix Kafka's offsets, and then restart, which seemed worse from a production standpoint.
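A sketch of that workaround, with illustrative names rather than the actual production code: the codec never returns a decode error to goka, it carries the error inside the returned value, and the processor callback logs and skips such messages.

```go
package example

import (
	"encoding/json"
	"fmt"
	"log"

	"github.com/lovoo/goka"
)

// DecodedMsg carries either the decoded payload or the decode error.
type DecodedMsg struct {
	Payload map[string]interface{}
	Err     error
}

// WorkaroundCodec never returns a decode error to goka; it stores the
// error in the returned struct instead.
type WorkaroundCodec struct{}

func (c *WorkaroundCodec) Encode(value interface{}) ([]byte, error) {
	m, ok := value.(*DecodedMsg)
	if !ok {
		return nil, fmt.Errorf("WorkaroundCodec: expected *DecodedMsg, got %T", value)
	}
	return json.Marshal(m.Payload)
}

func (c *WorkaroundCodec) Decode(data []byte) (interface{}, error) {
	var payload map[string]interface{}
	if err := json.Unmarshal(data, &payload); err != nil {
		return &DecodedMsg{Err: err}, nil // the error travels with the value
	}
	return &DecodedMsg{Payload: payload}, nil
}

// processWorkaround logs the carried error and aborts handling of that
// message instead of letting a decode error crash the processor.
func processWorkaround(ctx goka.Context, msg interface{}) {
	m, ok := msg.(*DecodedMsg)
	if !ok {
		log.Printf("skipping message with unexpected type %T", msg)
		return
	}
	if m.Err != nil {
		log.Printf("skipping undecodable message at key %q: %v", ctx.Key(), m.Err)
		return
	}
	// ... normal handling of m.Payload ...
}
```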