Ingest commit can fail due to a conflict in the status store

patchwork01 opened this issue 2 months ago

Description

When an ingest job status update is added to the status store, it's done in a transaction with TransactWriteItems. When a file is added to the state store asynchronously during an ingest job, the task submits a request to commit that to the state store committer, then carries on and finishes the job. The task can add the job finished status update at the same time as the state store committer adds the file added status update. This produces a transaction conflict because the two DynamoDB tables that implement the status store are updated in a single transaction.
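To make the race concrete, here is a toy, in-memory model of an item-level transaction conflict. It is not the DynamoDB client and does not use Sleeper's classes. It assumes each status update is its own item while the job summary is a single item per job that both writers touch, which would be consistent with only one of the two cancellation reasons in the stack trace being TransactionConflict. The class names and item keys are hypothetical.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model of item-level transaction conflicts. Not the DynamoDB client. */
class ToyTransactionalStore {
    /** Thrown when a transaction touches an item held by another in-flight transaction. */
    static class ConflictException extends RuntimeException {
        ConflictException(String key) {
            super("Transaction conflict on item: " + key);
        }
    }

    private final Set<String> itemsInFlight = new HashSet<>();

    /** Starts a transaction over the given item keys, or fails if any key is already held. */
    synchronized void begin(List<String> keys) {
        for (String key : keys) {
            if (itemsInFlight.contains(key)) {
                throw new ConflictException(key);
            }
        }
        itemsInFlight.addAll(keys);
    }

    /** Finishes a transaction and releases its item keys. */
    synchronized void commit(List<String> keys) {
        itemsInFlight.removeAll(keys);
    }
}

class ConflictDemo {
    /** Returns true if the second writer is rejected while the first is still in flight. */
    static boolean secondWriterConflicts() {
        ToyTransactionalStore store = new ToyTransactionalStore();
        // Hypothetical keys: one item per status update, one shared summary item per job.
        List<String> committerTx = List.of("updates/job-1/files-added", "summary/job-1");
        List<String> ingestTaskTx = List.of("updates/job-1/finished", "summary/job-1");

        store.begin(committerTx); // state store committer starts its transaction
        try {
            store.begin(ingestTaskTx); // ingest task overlaps on summary/job-1
            return false;
        } catch (ToyTransactionalStore.ConflictException e) {
            return true;
        } finally {
            store.commit(committerTx);
        }
    }
}
```

In this model the two status update items never collide. Only the shared summary item does, so whichever writer starts second is rejected.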

If the status update is refused in the state store committer, the files are still committed successfully to the state store, but the ingest job gets stuck in an uncommitted state for reporting, the system test fails, and the state store commit request goes to the dead letter queue, causing an alarm in CloudWatch.

If the status update is refused in the ingest task, the task records a failure for that job, then terminates. The system test fails because the task goes down without processing all the jobs.

Steps to reproduce

1. Run system test MultipleTablesIT
2. See failure intermittently because an ingest job gets stuck in uncommitted state
3. Check logs in ingest task and state store committer
4. See state store committer or ingest task failed status store update due to a transaction conflict
5. See files were committed successfully to the state store, but the reporting is incorrect, the test failed, and the commit request went to the dead letter queue, causing an alarm in CloudWatch

Expected behaviour

The state store committer and ingest task should be able to add status updates for the same job simultaneously without any failure.

The two DynamoDB tables that implement the status store could be updated separately, without a transaction. This risks the two of them becoming inconsistent. There's one table for the status updates, and one for a summary of the current status of each job. It seems more important for the status updates table to stay correct, as the job summary table is not always used.

We could add the status update and then update the job summary in a separate request, without combining them into a transaction. That would prevent any transaction conflict, but add a failure case where the status update was added but the job summary was not updated. The job summary is only used for certain queries, currently just to find jobs that failed validation, so it might be worth making it two separate updates.
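A minimal sketch of that ordering, using in-memory stand-ins rather than the real DynamoDB calls. `StatusTableWriter` and `SeparateWritesStatusStore` are hypothetical names, not existing Sleeper classes, and logging to standard error stands in for whatever reporting would actually be wanted when the summary falls behind.

```java
/** Hypothetical writer for one of the two status store tables. */
interface StatusTableWriter {
    void write(String jobId, String status);
}

/** Sketch: save the status update first, then update the job summary as a separate write. */
class SeparateWritesStatusStore {
    private final StatusTableWriter updatesTable;
    private final StatusTableWriter summaryTable;

    SeparateWritesStatusStore(StatusTableWriter updatesTable, StatusTableWriter summaryTable) {
        this.updatesTable = updatesTable;
        this.summaryTable = summaryTable;
    }

    /**
     * Saves a status update. A failure writing the update propagates to the caller.
     * A failure updating the summary is reported through the return value, not thrown.
     *
     * @return true if the summary was also updated
     */
    boolean save(String jobId, String status) {
        // The status updates table is treated as the source of truth, so it is written first.
        updatesTable.write(jobId, status);
        try {
            summaryTable.write(jobId, status);
            return true;
        } catch (RuntimeException e) {
            // The update is already saved. The summary is now stale for this job.
            System.err.println("Failed updating summary for job " + jobId + ": " + e.getMessage());
            return false;
        }
    }
}
```

With this shape, a failed summary write no longer fails the ingest task or the commit request. The cost is the failure case described above: the summary can lag behind the status updates until something corrects it.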

Screenshots/Logs

Stack trace from state store committer lambda:

[main] committer.lambda.StateStoreCommitterLambda ERROR  - Failed commit request
sleeper.ingest.IngestStatusStoreException: Failed saving added files event for job <job-id>
at sleeper.ingest.status.store.job.DynamoDBIngestJobStatusStore.jobAddedFiles(DynamoDBIngestJobStatusStore.java:128)
at sleeper.commit.StateStoreCommitter.apply(StateStoreCommitter.java:115)
at sleeper.commit.StateStoreCommitter.apply(StateStoreCommitter.java:80)
at sleeper.statestore.committer.lambda.StateStoreCommitterLambda.handleRequest(StateStoreCommitterLambda.java:76)
at jdk.internal.reflect.GeneratedMethodAccessor28.invoke(Unknown Source)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.base/java.lang.reflect.Method.invoke(Unknown Source)
at lambdainternal.EventHandlerLoader$PojoMethodRequestHandler.handleRequest(EventHandlerLoader.java:290)
at lambdainternal.EventHandlerLoader$PojoHandlerAsStreamHandler.handleRequest(EventHandlerLoader.java:207)
at lambdainternal.EventHandlerLoader$2.call(EventHandlerLoader.java:925)
at lambdainternal.AWSLambda.startRuntime(AWSLambda.java:268)
at lambdainternal.AWSLambda.startRuntime(AWSLambda.java:207)
at lambdainternal.AWSLambda.main(AWSLambda.java:196)
Caused by: com.amazonaws.services.dynamodbv2.model.TransactionCanceledException: Transaction cancelled, please refer cancellation reasons for specific reasons [None, TransactionConflict] (Service: AmazonDynamoDBv2; Status Code: 400; Error Code: TransactionCanceledException; Request ID: <request-id>; Proxy: null)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1879)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleServiceErrorResponse(AmazonHttpClient.java:1418)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1387)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1157)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:814)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:781)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:755)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:715)
at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:697)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:561)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:541)
at com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.doInvoke(AmazonDynamoDBClient.java:6858)
at com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.invoke(AmazonDynamoDBClient.java:6825)
at com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.executeTransactWriteItems(AmazonDynamoDBClient.java:5836)
at com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.transactWriteItems(AmazonDynamoDBClient.java:5800)
at sleeper.ingest.status.store.job.DynamoDBIngestJobStatusStore.save(DynamoDBIngestJobStatusStore.java:177)
at sleeper.ingest.status.store.job.DynamoDBIngestJobStatusStore.jobAddedFiles(DynamoDBIngestJobStatusStore.java:126)
... 12 more