ankane / datasketches-ruby

Sketch data structures for Ruby

core dump: std::bad_any_cast using DataSketches::KllIntsSketch

pda opened this issue · comments

We're occasionally seeing this error, mostly in our CI test suite, but possibly in production too:

terminate called after throwing an instance of 'std::bad_any_cast'
  what():  bad any_cast
Aborted (core dumped)

I don't have a good stack trace, but I believe the code that triggers this looks something like:

require "datasketches"

sketch = DataSketches::KllIntsSketch.new

subsketch = DataSketches::KllIntsSketch.new.tap do |s|
  s.update(45_000_000)
  s.update(60_000_000)
  s.update(60_000_000)
end

sketch.merge(subsketch)

sketch.quantile(0.95)

I haven't catalogued all the instances of this error, so it may or may not affect more than just this DataSketches::KllIntsSketch use-case.

Sorry I don't have much more detail, but I figured I'd report what I know so far.

Versions:

  • datasketches (0.4.0) (but experienced on older versions too)
  • rice (4.1.0)
  • ruby 3.2.2 in ruby:3.2.2-slim-bookworm Docker image on amd64

Hi @pda, thanks for reporting, but I'm not able to reproduce. Are you using any other gems that use Rice?

Nope, no other gems using Rice.

I also can't reproduce it outside of the ~1% of CI runs where it happens. In theory we could configure those machines to write core dump files as build artifacts to get the full stack trace, but we haven't found time to do that just yet.
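For reference, one way to allow core files from inside the Ruby process itself is to raise the soft `RLIMIT_CORE` limit to the hard limit before the tests run. This is only a sketch: `enable_core_dumps` is a hypothetical helper name, it assumes the hard limit on the CI machine is non-zero, and on Linux the location of the resulting file is still decided by the kernel's `core_pattern` setting rather than by this code.

```ruby
# Hypothetical helper (not part of datasketches or Rice): raise the soft
# core-file size limit to the current hard limit for this process and any
# child processes it starts afterwards.
def enable_core_dumps
  _soft, hard = Process.getrlimit(Process::RLIMIT_CORE)
  # Assumption: the hard limit is non-zero; if it is 0, no core file
  # can be written regardless of the soft limit.
  Process.setrlimit(Process::RLIMIT_CORE, hard, hard)
  hard
end

enable_core_dumps
```

Calling this early (for example from a test helper) would apply the limit to the whole test run.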

It's possible I haven't reproduced the DataSketches::KllIntsSketch usage quite accurately in that code snippet, since the actual code is a bit more coupled to our data model.

It was also occurring on datasketches-ruby v0.2.4 with rice v4.0.4 before we upgraded in the hopes it would stop happening.

Anyway, I'm equally happy for this issue to either remain open or be closed now as unactionable, staying as a searchable record of the bug lurking in there somewhere :)

If it's happening on CI, I'd recommend running the specific test case 1000 times to try to get it to error, then trying to make it minimal. I've tried running the code above with GC.stress = true, which can help find memory issues, but didn't have any luck. Will close for now, but let me know if you find a reliable way to reproduce.
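A rough sketch of that repeat-until-it-fails loop is below. Because `std::bad_any_cast` aborts the whole process, each attempt runs in a forked child so the parent survives and can count the crashes. `count_crashes` is a hypothetical helper, not something provided by the gem, and it assumes a platform where `Process.fork` is available (so not Windows).

```ruby
# Hypothetical helper: run the given block `runs` times, each time in a
# forked child with GC.stress enabled, and return how many children were
# terminated by a signal (an uncaught C++ exception ends in SIGABRT).
def count_crashes(runs)
  crashes = 0
  runs.times do
    pid = Process.fork do
      GC.stress = true # force a GC at every opportunity to surface memory bugs
      yield
      exit!(0)         # leave the child immediately, skipping at_exit hooks
    end
    _, status = Process.wait2(pid)
    crashes += 1 if status.signaled?
  end
  crashes
end
```

In practice the block would contain the suspect code, e.g. `count_crashes(1000) { ... }` wrapping the `KllIntsSketch` snippet from the report. A child that merely raises a Ruby exception exits with a non-zero status and is not counted, so only hard crashes show up in the total.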

Hi @pda, it looks like there was a fix in datasketches-cpp 5.0.1 for undefined behavior in KLL sketches. I'm not sure if it would cause this, but you can try upgrading the gem to see if that fixes it.

Oh nice, thanks for letting me know @ankane.
I'll give that a try and report back here when I get a chance 👍