buildbuddy-io / buildbuddy

BuildBuddy is an open source Bazel build event viewer, result store, remote cache, and remote build execution platform.

Home Page:https://buildbuddy.io

Getting event sequence number mismatch error on publishBuildEvents

reimai opened this issue · comments

Hi, we are using BuildBuddy v2.12.32 and Bazel v5.4.0, and we get errors like these from time to time:

on client:

Error:The Build Event Protocol upload failed: Not retrying publishBuildEvents, no more attempts left: status='Status{code=UNKNOWN, description=event sequence number mismatch: received 47325, wanted 1, cause=null}' UNKNOWN: UNKNOWN: event sequence number mismatch: received 47325, wanted 1 UNKNOWN: UNKNOWN: event sequence number mismatch: received 47325, wanted 1

on server:

2023-05-15 19:40:20.410	
   stderr   2023/05/15 16:40:20.410 WRN Missing ack: saw 47325 and wanted 1. Bailing! invocation_id=57f82ac5-3910-41a4-ba05-439f3b6a34e2 request_id=05fb796b-8554-4219-8aaf-236f6d73363b   
2023-05-15 19:40:20.410	
   stderr   2023/05/15 16:40:20.410 WRN We got over 100 build events before an event with options for invocation 57f82ac5-3910-41a4-ba05-439f3b6a34e2. Dropped the 186916 earliest event(s). invocation_id=57f82ac5-3910-41a4-ba05-439f3b6a34e2 request_id=05fb796b-8554-4219-8aaf-236f6d73363b   
2023-05-15 19:40:10.260	
   stderr   2023/05/15 16:40:10.260 WRN Missing ack: saw 47325 and wanted 1. Bailing! invocation_id=57f82ac5-3910-41a4-ba05-439f3b6a34e2 request_id=3bb859e8-9282-4f03-87cf-d9758f7fb309   
2023-05-15 19:40:10.260	
   stderr   2023/05/15 16:40:10.259 WRN We got over 100 build events before an event with options for invocation 57f82ac5-3910-41a4-ba05-439f3b6a34e2. Dropped the 186916 earliest event(s). invocation_id=57f82ac5-3910-41a4-ba05-439f3b6a34e2 request_id=3bb859e8-9282-4f03-87cf-d9758f7fb309   
2023-05-15 19:40:01.691	
   stderr   2023/05/15 16:40:01.691 WRN Missing ack: saw 47325 and wanted 1. Bailing! invocation_id=57f82ac5-3910-41a4-ba05-439f3b6a34e2 request_id=03c9efbe-bc1a-4e71-a66e-d468767dc0f2   
2023-05-15 19:40:01.690	
   stderr   2023/05/15 16:40:01.690 WRN We got over 100 build events before an event with options for invocation 57f82ac5-3910-41a4-ba05-439f3b6a34e2. Dropped the 186916 earliest event(s). invocation_id=57f82ac5-3910-41a4-ba05-439f3b6a34e2 request_id=03c9efbe-bc1a-4e71-a66e-d468767dc0f2   
2023-05-15 19:39:53.942	
   stderr   2023/05/15 16:39:53.942 WRN Missing ack: saw 47325 and wanted 1. Bailing! invocation_id=57f82ac5-3910-41a4-ba05-439f3b6a34e2 request_id=f93af676-1f0d-45a8-8bb8-39a75e79ad4a   
2023-05-15 19:39:53.941	
   stderr   2023/05/15 16:39:53.941 WRN We got over 100 build events before an event with options for invocation 57f82ac5-3910-41a4-ba05-439f3b6a34e2. Dropped the 186916 earliest event(s). invocation_id=57f82ac5-3910-41a4-ba05-439f3b6a34e2 request_id=f93af676-1f0d-45a8-8bb8-39a75e79ad4a   
2023-05-15 19:39:46.362	
   stderr   2023/05/15 16:39:46.362 WRN Missing ack: saw 47325 and wanted 1. Bailing! invocation_id=57f82ac5-3910-41a4-ba05-439f3b6a34e2 request_id=a2d64657-1011-4d5c-b64f-6eeeb949f8ba   
2023-05-15 19:39:46.362	
   stderr   2023/05/15 16:39:46.362 WRN We got over 100 build events before an event with options for invocation 57f82ac5-3910-41a4-ba05-439f3b6a34e2. Dropped the 186916 earliest event(s). invocation_id=57f82ac5-3910-41a4-ba05-439f3b6a34e2 request_id=a2d64657-1011-4d5c-b64f-6eeeb949f8ba   
2023-05-15 19:39:38.625	
   stderr   2023/05/15 16:39:38.625 INF Finalized invocation in primary DB and enqueued for stats recording (status: COMPLETE_INVOCATION_STATUS) invocation_attempt=1 invocation_id=57f82ac5-3910-41a4-ba05-439f3b6a34e2 request_id=9c38dfe0-12cd-4524-b078-dca98b672100   
2023-05-15 19:38:53.006	
   stderr   2023/05/15 16:38:53.006 INF Created invocation "57f82ac5-3910-41a4-ba05-439f3b6a34e2", attempt 1 invocation_attempt=1 invocation_id=57f82ac5-3910-41a4-ba05-439f3b6a34e2 request_id=9c38dfe0-12cd-4524-b078-dca98b672100  

Somehow all the acks have gone missing, and the retries get the exact same error. Dropping 186916 events also looks kind of scary.
We would very much appreciate any clues as to what could be the source of the problem.

Strange issue.

This log indicates that Bazel, when sending Build Tool Events over a gRPC stream to BuildBuddy, did not send either the Started or the OptionsParsed event among the first 100 events.

Because of this, BuildBuddy started dropping the earliest events in the buffer in the hope of eventually finding the Started/OptionsParsed events.

Does this happen consistently on your setup? If so, enabling --build_event_json_file may help us better understand the events Bazel emits. In that file, look for the BuildStarted and/or OptionsParsed events (fields like command or cmd_line).
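If it helps, here is a rough sketch of a script for locating those events in the JSON file. It assumes the file has one JSON object per line and that each event's `id` object carries a `started` or `optionsParsed` key, which is my understanding of Bazel's JSON rendering of the Build Event Protocol. The function name `find_key_events` is made up for this example.

```python
import json
from typing import Dict, Iterable

# Event id keys to look for. Assumption: these match the lowerCamelCase
# names used in the JSON form of Bazel's build events.
KEYS = ("started", "optionsParsed")


def find_key_events(lines: Iterable[str]) -> Dict[str, int]:
    """Return the 1-based line number of the first occurrence of each key event."""
    found: Dict[str, int] = {}
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # blank lines carry no event
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or non-JSON lines
        if not isinstance(event, dict):
            continue
        event_id = event.get("id")
        if not isinstance(event_id, dict):
            continue
        for key in KEYS:
            if key in event_id and key not in found:
                found[key] = lineno
        if len(found) == len(KEYS):
            break  # both events located, no need to read further
    return found
```

To use it, open the file written by --build_event_json_file and pass the file object, for example `find_key_events(open("bep.json", encoding="utf-8"))`, where `bep.json` is a placeholder path. A missing key in the result means that event was not found in the file.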

We will investigate this further from our end as well. Thanks for reporting!

We currently buffer 100 events while waiting for a Started event https://github.com/buildbuddy-io/buildbuddy/blob/master/server/build_event_protocol/build_event_handler/build_event_handler.go#L979

We could make this configurable, but 186,916 sounds like a lot of events to buffer. If you can tell us where the Started event is in your --build_event_json_file, that might help us make a more informed decision.
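For readers following along, the buffering behavior described above can be sketched roughly as follows. This is not the actual BuildBuddy code (which is Go, at the link above); it is a simplified illustration that assumes events are plain dicts with a hypothetical `kind` field, and the class name `EventBuffer` is invented for the example. Only the limit of 100 and the "drop the earliest events" behavior come from the thread and the server log.

```python
from typing import List

MAX_BUFFERED = 100  # limit mentioned in the thread; the rest is hypothetical


class EventBuffer:
    """Holds events until one carrying the invocation's options shows up."""

    def __init__(self, limit: int = MAX_BUFFERED) -> None:
        self.limit = limit
        self.events: List[dict] = []
        self.dropped = 0    # number of earliest events discarded so far
        self.ready = False  # True once a started/options event was seen

    def add(self, event: dict) -> List[dict]:
        """Add one event; return the events that can now be processed."""
        if self.ready:
            return [event]  # no more buffering once options are known
        self.events.append(event)
        if event.get("kind") in ("started", "optionsParsed"):
            self.ready = True
            flushed, self.events = self.events, []
            return flushed
        if len(self.events) > self.limit:
            # Over the limit without options: discard the earliest event.
            self.events.pop(0)
            self.dropped += 1
        return []
```

With this sketch, a stream that begins with a long run of other events keeps only the most recent 100 and counts the rest as dropped, which is the shape of the "Dropped the ... earliest event(s)" warning in the server log.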

I'm trying to reproduce the issue with the JSON file enabled. No luck yet; I'll post if I catch it.

Got it! bejp.json.zip

I had to obfuscate the log a little (URLs and paths).

Server error:

2023-05-17 13:25:33.494 | stderr   2023/05/17 10:25:33.494 WRN Missing ack: saw 31288 and wanted 1. Bailing! invocation_id=0cc888fc-5f7e-4553-bd95-fe4b25e0e4e2 request_id=b2a4a19c-c2bd-4e73-a68e-1f65e2c57654
2023-05-17 13:25:33.494 | stderr   2023/05/17 10:25:33.494 WRN We got over 100 build events before an event with options for invocation 0cc888fc-5f7e-4553-bd95-fe4b25e0e4e2. Dropped the 112310 earliest event(s). invocation_id=0cc888fc-5f7e-4553-bd95-fe4b25e0e4e2 request_id=b2a4a19c-c2bd-4e73-a68e-1f65e2c57654
2023-05-17 13:25:25.734 | stderr   2023/05/17 10:25:25.734 WRN Missing ack: saw 31288 and wanted 1. Bailing! invocation_id=0cc888fc-5f7e-4553-bd95-fe4b25e0e4e2 request_id=2044d8af-a3f7-45a4-b238-5a4f38d16fa0
2023-05-17 13:25:25.733 | stderr   2023/05/17 10:25:25.733 WRN We got over 100 build events before an event with options for invocation 0cc888fc-5f7e-4553-bd95-fe4b25e0e4e2. Dropped the 112310 earliest event(s). invocation_id=0cc888fc-5f7e-4553-bd95-fe4b25e0e4e2 request_id=2044d8af-a3f7-45a4-b238-5a4f38d16fa0
2023-05-17 13:25:19.178 | stderr   2023/05/17 10:25:19.178 WRN Missing ack: saw 31288 and wanted 1. Bailing! invocation_id=0cc888fc-5f7e-4553-bd95-fe4b25e0e4e2 request_id=cbc53e0e-019d-4e80-b915-ad861b9a90d8
2023-05-17 13:25:19.178 | stderr   2023/05/17 10:25:19.178 WRN We got over 100 build events before an event with options for invocation 0cc888fc-5f7e-4553-bd95-fe4b25e0e4e2. Dropped the 112310 earliest event(s). invocation_id=0cc888fc-5f7e-4553-bd95-fe4b25e0e4e2 request_id=cbc53e0e-019d-4e80-b915-ad861b9a90d8
2023-05-17 13:25:13.779 | stderr   2023/05/17 10:25:13.779 WRN Missing ack: saw 31288 and wanted 1. Bailing! invocation_id=0cc888fc-5f7e-4553-bd95-fe4b25e0e4e2 request_id=a38e9c03-79cc-48b9-9665-87db3a5be864
2023-05-17 13:25:13.779 | stderr   2023/05/17 10:25:13.778 WRN We got over 100 build events before an event with options for invocation 0cc888fc-5f7e-4553-bd95-fe4b25e0e4e2. Dropped the 112310 earliest event(s). invocation_id=0cc888fc-5f7e-4553-bd95-fe4b25e0e4e2 request_id=a38e9c03-79cc-48b9-9665-87db3a5be864
2023-05-17 13:25:09.033 | stderr   2023/05/17 10:25:09.033 WRN Missing ack: saw 31288 and wanted 1. Bailing! invocation_id=0cc888fc-5f7e-4553-bd95-fe4b25e0e4e2 request_id=6c475e64-8fe8-4f29-8e60-374c46210b58
2023-05-17 13:25:09.032 | stderr   2023/05/17 10:25:09.032 WRN We got over 100 build events before an event with options for invocation 0cc888fc-5f7e-4553-bd95-fe4b25e0e4e2. Dropped the 112310 earliest event(s). invocation_id=0cc888fc-5f7e-4553-bd95-fe4b25e0e4e2 request_id=6c475e64-8fe8-4f29-8e60-374c46210b58
2023-05-17 13:25:04.017 | stderr   2023/05/17 10:25:04.016 WRN Error sending ack stream for invocation "0cc888fc-5f7e-4553-bd95-fe4b25e0e4e2": rpc error: code = Canceled desc = context canceled invocation_id=0cc888fc-5f7e-4553-bd95-fe4b25e0e4e2 request_id=ba007de3-ae03-448c-91ab-06efe35f748d
2023-05-17 13:25:03.846 | stderr   2023/05/17 10:25:03.846 INF Finalized invocation in primary DB and enqueued for stats recording (status: COMPLETE_INVOCATION_STATUS) invocation_attempt=1 invocation_id=0cc888fc-5f7e-4553-bd95-fe4b25e0e4e2 request_id=ba007de3-ae03-448c-91ab-06efe35f748d
2023-05-17 13:15:52.257 | stderr   2023/05/17 10:15:52.256 INF Created invocation "0cc888fc-5f7e-4553-bd95-fe4b25e0e4e2", attempt 1 invocation_attempt=1 invocation_id=0cc888fc-5f7e-4553-bd95-fe4b25e0e4e2 request_id=ba007de3-ae03-448c-91ab-06efe35f748d

Both BuildStarted and OptionsParsed are in your JSON file, on lines 1 and 5 respectively.

I am confused as to why they were not sent to the server successfully.
Do you see the same result when setting --bes_upload_mode=wait_for_upload_complete? Several customers have reported that fully_async causes problems.

We've found the culprit here and are working on a fix - thanks for your help @reimai!

This should be fixed by #3998 and will go out in this week's release.