Getting event sequence number mismatch error on publishBuildEvents
reimai opened this issue
Hi, we are using BuildBuddy v2.12.32 and Bazel v5.4.0, and we get errors like this from time to time:
on client:
Error:The Build Event Protocol upload failed: Not retrying publishBuildEvents, no more attempts left: status='Status{code=UNKNOWN, description=event sequence number mismatch: received 47325, wanted 1, cause=null}' UNKNOWN: UNKNOWN: event sequence number mismatch: received 47325, wanted 1 UNKNOWN: UNKNOWN: event sequence number mismatch: received 47325, wanted 1
on server:
2023-05-15 19:40:20.410
stderr 2023/05/15 16:40:20.410 WRN Missing ack: saw 47325 and wanted 1. Bailing! invocation_id=57f82ac5-3910-41a4-ba05-439f3b6a34e2 request_id=05fb796b-8554-4219-8aaf-236f6d73363b
2023-05-15 19:40:20.410
stderr 2023/05/15 16:40:20.410 WRN We got over 100 build events before an event with options for invocation 57f82ac5-3910-41a4-ba05-439f3b6a34e2. Dropped the 186916 earliest event(s). invocation_id=57f82ac5-3910-41a4-ba05-439f3b6a34e2 request_id=05fb796b-8554-4219-8aaf-236f6d73363b
2023-05-15 19:40:10.260
stderr 2023/05/15 16:40:10.260 WRN Missing ack: saw 47325 and wanted 1. Bailing! invocation_id=57f82ac5-3910-41a4-ba05-439f3b6a34e2 request_id=3bb859e8-9282-4f03-87cf-d9758f7fb309
2023-05-15 19:40:10.260
stderr 2023/05/15 16:40:10.259 WRN We got over 100 build events before an event with options for invocation 57f82ac5-3910-41a4-ba05-439f3b6a34e2. Dropped the 186916 earliest event(s). invocation_id=57f82ac5-3910-41a4-ba05-439f3b6a34e2 request_id=3bb859e8-9282-4f03-87cf-d9758f7fb309
2023-05-15 19:40:01.691
stderr 2023/05/15 16:40:01.691 WRN Missing ack: saw 47325 and wanted 1. Bailing! invocation_id=57f82ac5-3910-41a4-ba05-439f3b6a34e2 request_id=03c9efbe-bc1a-4e71-a66e-d468767dc0f2
2023-05-15 19:40:01.690
stderr 2023/05/15 16:40:01.690 WRN We got over 100 build events before an event with options for invocation 57f82ac5-3910-41a4-ba05-439f3b6a34e2. Dropped the 186916 earliest event(s). invocation_id=57f82ac5-3910-41a4-ba05-439f3b6a34e2 request_id=03c9efbe-bc1a-4e71-a66e-d468767dc0f2
2023-05-15 19:39:53.942
stderr 2023/05/15 16:39:53.942 WRN Missing ack: saw 47325 and wanted 1. Bailing! invocation_id=57f82ac5-3910-41a4-ba05-439f3b6a34e2 request_id=f93af676-1f0d-45a8-8bb8-39a75e79ad4a
2023-05-15 19:39:53.941
stderr 2023/05/15 16:39:53.941 WRN We got over 100 build events before an event with options for invocation 57f82ac5-3910-41a4-ba05-439f3b6a34e2. Dropped the 186916 earliest event(s). invocation_id=57f82ac5-3910-41a4-ba05-439f3b6a34e2 request_id=f93af676-1f0d-45a8-8bb8-39a75e79ad4a
2023-05-15 19:39:46.362
stderr 2023/05/15 16:39:46.362 WRN Missing ack: saw 47325 and wanted 1. Bailing! invocation_id=57f82ac5-3910-41a4-ba05-439f3b6a34e2 request_id=a2d64657-1011-4d5c-b64f-6eeeb949f8ba
2023-05-15 19:39:46.362
stderr 2023/05/15 16:39:46.362 WRN We got over 100 build events before an event with options for invocation 57f82ac5-3910-41a4-ba05-439f3b6a34e2. Dropped the 186916 earliest event(s). invocation_id=57f82ac5-3910-41a4-ba05-439f3b6a34e2 request_id=a2d64657-1011-4d5c-b64f-6eeeb949f8ba
2023-05-15 19:39:38.625
stderr 2023/05/15 16:39:38.625 INF Finalized invocation in primary DB and enqueued for stats recording (status: COMPLETE_INVOCATION_STATUS) invocation_attempt=1 invocation_id=57f82ac5-3910-41a4-ba05-439f3b6a34e2 request_id=9c38dfe0-12cd-4524-b078-dca98b672100
2023-05-15 19:38:53.006
stderr 2023/05/15 16:38:53.006 INF Created invocation "57f82ac5-3910-41a4-ba05-439f3b6a34e2", attempt 1 invocation_attempt=1 invocation_id=57f82ac5-3910-41a4-ba05-439f3b6a34e2 request_id=9c38dfe0-12cd-4524-b078-dca98b672100
Somehow all the acks have gone missing, and the retries get the exact same error. Dropping 186916 events also looks kind of scary.
We would very much appreciate any clues as to what could be the source of the problem.
Strange issue.
This log indicates that Bazel, when sending build events over a gRPC stream to BuildBuddy, failed to send either the Started or the OptionsParsed event among the first 100 events.
Because of this, BuildBuddy started dropping the earliest events in the buffer in the hope of eventually finding the Started/OptionsParsed events.
Does this happen consistently on your setup? If so, enabling --build_event_json_file may help us understand Bazel's event output better. In that file, look for the BuildStarted and/or OptionsParsed events (look for fields like command or cmd_line).
We will investigate this further from our end as well. Thanks for reporting!
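For anyone checking their own --build_event_json_file output, here is a minimal sketch of that lookup. It assumes the file is newline-delimited JSON with one build event per line, and that the event's "id" object carries a "started" or "optionsParsed" key (the usual JSON rendering of the Build Event Protocol). The function name find_event_lines is made up for this example.

```python
import json

# Keys of the BEP event "id" object that identify the events of interest.
# Assumption: one JSON-encoded build event per line of the file.
WANTED_ID_KEYS = ("started", "optionsParsed")


def find_event_lines(lines):
    """Return {id_key: first 1-based line number where that event appears}."""
    found = {}
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        event = json.loads(line)
        event_id = event.get("id", {})
        for key in WANTED_ID_KEYS:
            if key in event_id and key not in found:
                found[key] = line_number
    return found


# Usage (the path is a placeholder):
#   with open("bep.json", encoding="utf-8") as f:
#       print(find_event_lines(f))
```

If both keys show up within the first few lines, the events were produced early by Bazel, and the question becomes why they did not reach the server in that order.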
We currently buffer 100 events while waiting for a Started event: https://github.com/buildbuddy-io/buildbuddy/blob/master/server/build_event_protocol/build_event_handler/build_event_handler.go#L979
We could make this configurable, but 186,916 sounds like a lot of events to buffer. If you can tell us where the Started event is in your --build_event_json_file, that might help us make a more informed decision.
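To illustrate the general idea of a bounded buffer that waits for a Started event, here is a simplified sketch. It is not BuildBuddy's implementation (that lives in the Go file linked above); the class name, the "type" field, and the drop-oldest policy are assumptions chosen to mirror the "Dropped the N earliest event(s)" wording in the log.

```python
from collections import deque

MAX_BUFFERED_EVENTS = 100  # the limit mentioned in this thread


class PreStartBuffer:
    """Hypothetical sketch: hold events until a started event arrives."""

    def __init__(self, limit=MAX_BUFFERED_EVENTS):
        self.limit = limit
        self.buffer = deque()
        self.dropped = 0
        self.started_seen = False

    def add(self, event):
        """Add an event; return the events now ready for processing, in order."""
        if self.started_seen:
            return [event]  # pass-through once the started event was seen
        self.buffer.append(event)
        if event.get("type") == "started":
            self.started_seen = True
            ready = list(self.buffer)
            self.buffer.clear()
            return ready
        if len(self.buffer) > self.limit:
            # Buffer is full and still no started event: drop the earliest one.
            self.buffer.popleft()
            self.dropped += 1
        return []
```

With a scheme like this, a stream that never delivers the Started event within the window keeps discarding its oldest events, which is why the dropped count can grow far beyond the buffer size.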
I'm trying to reproduce the issue with the JSON file enabled. No luck yet; I will post if I catch it.
Got it! bejp.json.zip
I had to obfuscate the log a little bit (URLs and paths).
Server error:
2023-05-17 13:25:33.494
stderr 2023/05/17 10:25:33.494 WRN Missing ack: saw 31288 and wanted 1. Bailing! invocation_id=0cc888fc-5f7e-4553-bd95-fe4b25e0e4e2 request_id=b2a4a19c-c2bd-4e73-a68e-1f65e2c57654
2023-05-17 13:25:33.494
stderr 2023/05/17 10:25:33.494 WRN We got over 100 build events before an event with options for invocation 0cc888fc-5f7e-4553-bd95-fe4b25e0e4e2. Dropped the 112310 earliest event(s). invocation_id=0cc888fc-5f7e-4553-bd95-fe4b25e0e4e2 request_id=b2a4a19c-c2bd-4e73-a68e-1f65e2c57654
2023-05-17 13:25:25.734
stderr 2023/05/17 10:25:25.734 WRN Missing ack: saw 31288 and wanted 1. Bailing! invocation_id=0cc888fc-5f7e-4553-bd95-fe4b25e0e4e2 request_id=2044d8af-a3f7-45a4-b238-5a4f38d16fa0
2023-05-17 13:25:25.733
stderr 2023/05/17 10:25:25.733 WRN We got over 100 build events before an event with options for invocation 0cc888fc-5f7e-4553-bd95-fe4b25e0e4e2. Dropped the 112310 earliest event(s). invocation_id=0cc888fc-5f7e-4553-bd95-fe4b25e0e4e2 request_id=2044d8af-a3f7-45a4-b238-5a4f38d16fa0
2023-05-17 13:25:19.178
stderr 2023/05/17 10:25:19.178 WRN Missing ack: saw 31288 and wanted 1. Bailing! invocation_id=0cc888fc-5f7e-4553-bd95-fe4b25e0e4e2 request_id=cbc53e0e-019d-4e80-b915-ad861b9a90d8
2023-05-17 13:25:19.178
stderr 2023/05/17 10:25:19.178 WRN We got over 100 build events before an event with options for invocation 0cc888fc-5f7e-4553-bd95-fe4b25e0e4e2. Dropped the 112310 earliest event(s). invocation_id=0cc888fc-5f7e-4553-bd95-fe4b25e0e4e2 request_id=cbc53e0e-019d-4e80-b915-ad861b9a90d8
2023-05-17 13:25:13.779
stderr 2023/05/17 10:25:13.779 WRN Missing ack: saw 31288 and wanted 1. Bailing! invocation_id=0cc888fc-5f7e-4553-bd95-fe4b25e0e4e2 request_id=a38e9c03-79cc-48b9-9665-87db3a5be864
2023-05-17 13:25:13.779
stderr 2023/05/17 10:25:13.778 WRN We got over 100 build events before an event with options for invocation 0cc888fc-5f7e-4553-bd95-fe4b25e0e4e2. Dropped the 112310 earliest event(s). invocation_id=0cc888fc-5f7e-4553-bd95-fe4b25e0e4e2 request_id=a38e9c03-79cc-48b9-9665-87db3a5be864
2023-05-17 13:25:09.033
stderr 2023/05/17 10:25:09.033 WRN Missing ack: saw 31288 and wanted 1. Bailing! invocation_id=0cc888fc-5f7e-4553-bd95-fe4b25e0e4e2 request_id=6c475e64-8fe8-4f29-8e60-374c46210b58
2023-05-17 13:25:09.032
stderr 2023/05/17 10:25:09.032 WRN We got over 100 build events before an event with options for invocation 0cc888fc-5f7e-4553-bd95-fe4b25e0e4e2. Dropped the 112310 earliest event(s). invocation_id=0cc888fc-5f7e-4553-bd95-fe4b25e0e4e2 request_id=6c475e64-8fe8-4f29-8e60-374c46210b58
2023-05-17 13:25:04.017
stderr 2023/05/17 10:25:04.016 WRN Error sending ack stream for invocation "0cc888fc-5f7e-4553-bd95-fe4b25e0e4e2": rpc error: code = Canceled desc = context canceled invocation_id=0cc888fc-5f7e-4553-bd95-fe4b25e0e4e2 request_id=ba007de3-ae03-448c-91ab-06efe35f748d
2023-05-17 13:25:03.846
stderr 2023/05/17 10:25:03.846 INF Finalized invocation in primary DB and enqueued for stats recording (status: COMPLETE_INVOCATION_STATUS) invocation_attempt=1 invocation_id=0cc888fc-5f7e-4553-bd95-fe4b25e0e4e2 request_id=ba007de3-ae03-448c-91ab-06efe35f748d
2023-05-17 13:15:52.257
stderr 2023/05/17 10:15:52.256 INF Created invocation "0cc888fc-5f7e-4553-bd95-fe4b25e0e4e2", attempt 1 invocation_attempt=1 invocation_id=0cc888fc-5f7e-4553-bd95-fe4b25e0e4e2 request_id=ba007de3-ae03-448c-91ab-06efe35f748d
Both BuildStarted and OptionsParsed are in your JSON file, at lines 1 and 5 respectively.
I am confused as to why they were not sent to the server successfully.
Do you see the same result when setting --bes_upload_mode=wait_for_upload_complete? We have seen several customers report that fully_async causes problems.
We've found the culprit here and are working on a fix - thanks for your help @reimai!
This should be fixed by #3998 and will go out in this week's release.