Eblob: lots of errors after updating version from 0.22.16 to 0.23.0

rudneff opened this issue 9 years ago.

After updating the eblob library I got some errors. I think they are linked.
But first, one question, please: what does -1 mean in this message:

2015-07-28 19:11:33.291628 2: blob: start
2015-07-28 19:11:33.292360 2: bctl: index: 2/-1, using unsorted index: size: 146208, num: 1523, data: size: 30878861, max blob size: 200000000

In version 0.22.23 this message appeared after a forced restart of my application; after that, eblob_init aborted with SIGABRT every time I tried to restart. Now I can't reproduce this terrible bug.

Error on every second start of the application (errno 29)

This appears on every second start. The write thread is blocked with errno 29, while the read thread works normally. A restart is required to get it working properly again.

2015-07-28 18:36:39.683964 1: blob: c66800000000: i18: eblob_fill_write_control_from_ram: ERROR-pread-index: position: 54998906, offset: 0, size: 23254, flags: 0x100 [chunked_csum], total data size: 0, disk-size: 0, data_fd: 11, index_fd: 12, bctl: 0x65e820: -29
2015-07-28 18:36:39.684032 1: blob: c66800000000: i18: eblob_writev: finished: position: 54998906, offset: 0, size: 23254, flags: 0x100 [chunked_csum], total data size: 0, disk-size: 0, data_fd: 11, index_fd: 12, bctl: (nil): -29
Write thread 140737219917568: going sleep!
2015-07-28 18:36:39.716113 1: blob: c66800000000: i18: eblob_fill_write_control_from_ram: ERROR-pread-index: position: 54998906, offset: 0, size: 23254, flags: 0x100 [chunked_csum], total data size: 0, disk-size: 0, data_fd: 11, index_fd: 12, bctl: 0x65e820: -29
2015-07-28 18:36:39.716170 1: blob: c66800000000: i18: eblob_writev: finished: position: 54998906, offset: 0, size: 23254, flags: 0x100 [chunked_csum], total data size: 0, disk-size: 0, data_fd: 11, index_fd: 12, bctl: (nil): -29
Write thread 140737203132160: going sleep!
Stat: W: done: 0.000100 recs: 0 wps: 0 err: 2 R: done: 0.043850 reads: 877 rps: 877 err: 0
Stat: W: done: 0.000100 recs: 0 wps: 0 err: 2 R: done: 0.043850 reads: 877 rps: 0 err: 0
Stat: W: done: 0.000100 recs: 0 wps: 0 err: 2 R: done: 0.043850 reads: 877 rps: 0 err: 0
2015-07-28 18:36:42.684256 1: blob: c66800000000: i18: eblob_fill_write_control_from_ram: ERROR-pread-index: position: 54998906, offset: 0, size: 15040, flags: 0x100 [chunked_csum], total data size: 0, disk-size: 0, data_fd: 11, index_fd: 12, bctl: 0x65e820: -29
2015-07-28 18:36:42.684327 1: blob: c66800000000: i18: eblob_writev: finished: position: 54998906, offset: 0, size: 15040, flags: 0x100 [chunked_csum], total data size: 0, disk-size: 0, data_fd: 11, index_fd: 12, bctl: (nil): -29
Write thread 140737219917568: going sleep!
2015-07-28 18:36:42.716380 1: blob: c66800000000: i18: eblob_fill_write_control_from_ram: ERROR-pread-index: position: 54998906, offset: 0, size: 17873, flags: 0x100 [chunked_csum], total data size: 0, disk-size: 0, data_fd: 11, index_fd: 12, bctl: 0x65e820: -29
2015-07-28 18:36:42.716451 1: blob: c66800000000: i18: eblob_writev: finished: position: 54998906, offset: 0, size: 17873, flags: 0x100 [chunked_csum], total data size: 0, disk-size: 0, data_fd: 11, index_fd: 12, bctl: (nil): -29
Write thread 140737203132160: going sleep!

Thank you for your support.

Evgeniy Polyakov · Answer 1 · Wed Jul 29 2015 01:47:04 GMT+0800 (China Standard Time)

bctl: index: 2/-1 shows the current and the maximum index known to eblob. When eblob starts, it doesn't know how many blob files (and indexes) are available, so it has to enumerate them. During this enumeration eblob sets the maximum known index to -1, which makes it possible to determine the real maximum index (a real index is always greater than -1).
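The -1 sentinel described in the answer can be sketched as a running maximum that starts at -1, so that any real index found during enumeration replaces it. This is a minimal illustration under assumptions: the function name max_known_index and the idea of passing the discovered indexes as an array are hypothetical and not eblob's actual code.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical sketch: compute the highest blob index among those found
 * while enumerating blob files. The running maximum starts at -1 ("no
 * index known yet"); any real index (0 or greater) replaces it. A result
 * of -1 means no blob files were found at all.
 */
static int max_known_index(const int *found, size_t n)
{
    int max_index = -1; /* sentinel: nothing enumerated yet */

    for (size_t i = 0; i < n; i++) {
        if (found[i] > max_index)
            max_index = found[i];
    }
    return max_index;
}
```

Until enumeration finishes, the maximum still holds its initial -1, which is one plausible reading of why a log line can show a current index of 2 next to a maximum of -1.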

Evgeniy Polyakov · Answer 2 · Wed Jul 29 2015 01:50:21 GMT+0800 (China Standard Time)

2015-07-28 18:36:39.683964 1: blob: c66800000000: i18: eblob_fill_write_control_from_ram: ERROR-pread-index: position: 54998906, offset: 0, size: 23254, flags: 0x100 [chunked_csum], total data size: 0, disk-size: 0, data_fd: 11, index_fd: 12, bctl: 0x65e820: -29

A -29 error usually means data corruption: eblob tries to read outside of the index.
Try running the eblob_merge tool on your blobs and check again whether this issue persists.
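For reference, on Linux errno 29 is ESPIPE ("Illegal seek"). The sketch below is a hypothetical illustration of the kind of bounds check the answer describes, not eblob's implementation: the name check_index_read and its parameters are assumptions. A read of size bytes at offset is rejected with -ESPIPE if it would run past the end of an index of index_size bytes, matching the negative-errno style of the log lines.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical sketch (not eblob's code): refuse a read that would extend
 * past the end of the index file. The comparison is written so that
 * offset + size cannot overflow.
 */
static int check_index_read(uint64_t offset, uint64_t size, uint64_t index_size)
{
    if (offset > index_size || size > index_size - offset)
        return -ESPIPE; /* -29 on Linux: read would fall outside the index */
    return 0;
}
```

With an index of 146208 bytes (the unsorted index size from the first log), a read ending exactly at the last byte passes, while any read starting at or beyond the end fails. A corrupted index entry pointing past the end would trip the same kind of check.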

rudneff · Answer 3 · Wed Jul 29 2015 02:25:17 GMT+0800 (China Standard Time)

I merged the two blobs into one before the presumably "bad" start (the one with error 29). Result:

Completed input stream /tmp/test_queue/data-0.36: total: 2, rest: 2
Completed all blobs
Total records: 17972
Written records: 5627
Removed records: 12345
Broken records: 0

After the merge everything works fine. But in the earlier version (0.21.16) such problems didn't exist: a merge was never needed, whether I shut the program down correctly or not.

Evgeniy Polyakov · Answer 4 · Wed Jul 29 2015 02:29:27 GMT+0800 (China Standard Time)

A crash only has to happen once for the corruption to become visible; you were probably lucky that previous crashes did not corrupt data. For example, whether a crash corrupts data may depend on when it happens relative to the sync timeout (you can set a rather small timeout, or even zero, but that heavily affects performance).
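The timeout trade-off mentioned in the answer can be pictured as a periodic-flush policy. This is a hypothetical sketch, not eblob's sync logic: the name should_sync_now and its parameters are assumptions. With a timeout of 0 every write is flushed immediately (safest, slowest); otherwise data is flushed only once sync_timeout seconds have passed since the last sync, so anything written inside that window is at risk if the process crashes.

```c
#include <assert.h>
#include <time.h>

/*
 * Hypothetical flush policy: return 1 if dirty data should be synced now.
 * A timeout of 0 (or less) means "sync after every write".
 */
static int should_sync_now(int sync_timeout, time_t last_sync, time_t now)
{
    if (sync_timeout <= 0)
        return 1; /* flush on every write: no unsynced window */
    return (now - last_sync) >= sync_timeout;
}
```

In a real writer the true branch would call fsync(2) on the data and index file descriptors; the smaller the timeout, the more often that expensive call runs.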

Kirill Bushminkin · Answer 5 · Wed Jul 29 2015 04:51:29 GMT+0800 (China Standard Time)

@bioothod, and does the merge tool fix broken blobs?

Evgeniy Polyakov · Answer 6 · Wed Jul 29 2015 06:30:08 GMT+0800 (China Standard Time)

@agend eblob_merge tries to fix broken blobs: it iterates over the blobs and indexes, skips broken and removed entries, and copies all good records into the destination blob.
It is possible to run it over multiple input blobs and produce one output blob.
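The merge behaviour described above can be sketched as one pass over every record of every input, skipping broken and removed entries and copying the rest. This is a hypothetical illustration, not eblob_merge's source: the record layout, the REC_REMOVED/REC_BROKEN flags, and merge_blobs are assumptions. The counters mirror the summary in Answer 3, where Total = Written + Removed + Broken (17972 = 5627 + 12345 + 0).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record flags, for illustration only (not eblob's values). */
enum {
    REC_REMOVED = 1u << 0,
    REC_BROKEN  = 1u << 1, /* e.g. bad checksum or out-of-bounds index entry */
};

struct record {
    unsigned flags;
    /* a real record would also carry a key, data offset and size */
};

struct merge_stats {
    size_t total, written, removed, broken;
};

/*
 * Walk all records of all input blobs, skip broken and removed entries,
 * and copy good records into `out`. Returns the number of records written.
 */
static size_t merge_blobs(const struct record *const *inputs, const size_t *counts,
                          size_t n_inputs, struct record *out, struct merge_stats *st)
{
    *st = (struct merge_stats){0};

    for (size_t b = 0; b < n_inputs; b++) {
        for (size_t i = 0; i < counts[b]; i++) {
            const struct record *r = &inputs[b][i];

            st->total++;
            if (r->flags & REC_BROKEN) {
                st->broken++;
                continue;
            }
            if (r->flags & REC_REMOVED) {
                st->removed++;
                continue;
            }
            out[st->written++] = *r; /* good record: copy to destination */
        }
    }
    return st->written;
}
```

Because several inputs feed a single output array, this also reflects the "many input blobs, one output blob" mode mentioned in the answer.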

Kirill Bushminkin · Answer 7 · Wed Jul 29 2015 06:49:24 GMT+0800 (China Standard Time)

That's strange, as we never had this issue before (I think since the beginning of our usage, a couple of years ago), and now we have a reproducible scenario. Could you take a deeper look at our case? We can also run the test against the previous version of eblob.


Evgeniy Polyakov · Answer 8 · Wed Jul 29 2015 06:55:03 GMT+0800 (China Standard Time)

If it is easily reproducible, please show us a backtrace (with the debug package installed) after the SIGABRT.
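The usual way to get the requested backtrace is to run the program under gdb and type bt after the abort. As an in-process alternative, the sketch below installs a SIGABRT handler that prints the call stack using glibc's backtrace() and backtrace_symbols_fd() from <execinfo.h>. The helper names are hypothetical, and this is an assumption-laden sketch rather than anything eblob provides.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <execinfo.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

enum { MAX_FRAMES = 64 };

/* Write the current call stack to fd; returns the number of frames captured. */
static int dump_backtrace(int fd)
{
    void *frames[MAX_FRAMES];
    int n = backtrace(frames, MAX_FRAMES);

    backtrace_symbols_fd(frames, n, fd); /* writes directly, no malloc */
    return n;
}

static void on_sigabrt(int sig)
{
    static const char msg[] = "caught SIGABRT, backtrace:\n";
    ssize_t rc = write(STDERR_FILENO, msg, sizeof(msg) - 1);

    (void)rc;
    dump_backtrace(STDERR_FILENO);
    /* Restore the default action and re-raise so the process still aborts. */
    signal(sig, SIG_DFL);
    raise(sig);
}

/* Hypothetical helper: call early in main(), before eblob_init(). */
static int install_sigabrt_backtrace(void)
{
    void *warmup[1];
    struct sigaction sa;

    /* First call may load libgcc; do it now rather than inside the handler. */
    backtrace(warmup, 1);

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_sigabrt;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGABRT, &sa, NULL);
}
```

Linking with -rdynamic lets backtrace_symbols_fd() print function names instead of bare addresses; installed debug symbols still make a gdb backtrace far more informative.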

Kirill Bushminkin · Answer 9 · Fri Jul 31 2015 21:40:42 GMT+0800 (China Standard Time)

I think we have found the issue, and it's on our side. @rudneff, please close it.