awslabs / dynamodb-streams-kinesis-adapter

The Amazon DynamoDB Streams Adapter implements the Amazon Kinesis interface so that your application can use KCL to consume and process data from a DynamoDB stream.

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

dynamodb-streams-kinesis-adapter broken on newest kcl 1.14.0 version

gjesse opened this issue · comments

See awslabs/amazon-kinesis-client#746 for more background:

Hello - yesterday I upgraded to 1.14.0 kcl client for our application that uses dynamodb streams for processing. Since then I've noticed these very consistent errors. we've seen 10s of thousands of these in just a few hours, and repeated for the same shard ids.

ERROR [2020-10-20 17:07:45,185] [RecordProcessor-0015] c.a.s.k.c.lib.worker.ProcessTask: ShardId shardId-00000001603151190143-090943ac: Caught exception: 
com.amazonaws.SdkClientException: Shard shardId-00000001603151190143-090943ac: GetRecordsResult is not valid. NextShardIterator: null. ChildShards: []
	at com.amazonaws.services.kinesis.clientlibrary.lib.worker.KinesisDataFetcher$AdvancingResult.accept(KinesisDataFetcher.java:126)
	at com.amazonaws.services.kinesis.clientlibrary.lib.worker.SynchronousGetRecordsRetrievalStrategy.getRecords(SynchronousGetRecordsRetrievalStrategy.java:31)
	at com.amazonaws.services.kinesis.clientlibrary.lib.worker.BlockingGetRecordsCache.getNextResult(BlockingGetRecordsCache.java:50)
	at com.amazonaws.services.kinesis.clientlibrary.lib.worker.ProcessTask.getRecordsResultAndRecordMillisBehindLatest(ProcessTask.java:377)
	at com.amazonaws.services.kinesis.clientlibrary.lib.worker.ProcessTask.getRecordsResult(ProcessTask.java:342)
	at com.amazonaws.services.kinesis.clientlibrary.lib.worker.ProcessTask.call(ProcessTask.java:159)
	at com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:49)
	at com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:24)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

As best I can tell, a GetRecordsResult with a null NextShardIterator and no child shards is a valid response - in fact there is no field specified for child shards at all here:
docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_streams_GetRecords.html

Using kcl 1.14.0, and creating a worker using dynamodb-streams-kinesis-adapter 1.5.2. The worker is setup using this method: https://github.com/awslabs/dynamodb-streams-kinesis-adapter/blob/master/src/main/java/com/amazonaws/services/dynamodbv2/streamsadapter/StreamsWorkerFactory.java#L44

I am not setting any special configuration other than the following, which I believe shouldn't be relevant.

 KinesisClientLibConfiguration config =  new KinesisClientLibConfiguration(applicationName,
                                                                                  streamName,
                                                                                  credentialsProvider,
                                                                                  workerId)
                        .withInitialPositionInStream(initialPositionInStream)
                        .withMaxRecords(kclConfiguration.getMaxRecordsToFetch())
                        .withTaskBackoffTimeMillis(kclConfiguration.getBackOffTimeMillis())
                        .withIdleTimeBetweenReadsInMillis(kclConfiguration.getIdleTimeBetweenReads());

Also hitting this problem, and a downgrade for us is not trivial at this point :(

It's been feeling pretty dead for a long time. I wish AWS would either deprecate it officially or support it. We are looking at transitioning to kinesis.

I had the same error with KCL 1.14.3 and Kinesis Adapter 1.5.3.
The DynamoDB Streams shard is said to be automatically recreated every few hours (about 4 hours), and this error starts to occur at that timing.

Cause

After reading the source code, it seems that the cause is that the Kinesis Adapter does not meet the specifications of the #isValidResult() check process that was added to KinesisDataFetcher in KCL 1.14.0.

KCL's KinesisDataFetcher#isValidResult() has the following comment.

// GetRecords result should contain childShard information. There are two valid combinations for the nextShardIterator and childShards
// If the GetRecords call does not reach the shard end, getRecords result should contain a non-null nextShardIterator and an empty list of childShards.
// If the GetRecords call does not reach the shard end, getRecords result should contain a null nextShardIterator and a non-empty list of childShards.
// All other combinations are invalid and indicating an issue with GetRecords result from Kinesis service.

However, the Kinesis Adapter's AmazonDynamoDBStreamsAdapterClient#getRecords() implementation, it seems that we are not getting the child shards in the first place.
Therefore, it does not meet the specification that KCL assumes, "If nextShardIterator is null, then childShards must be a non-empty list.

First aid

At first, this error could be avoided by using KCL 1.13.x before the checking process was implemented.

However, as it is, DynamoDB Streams consumer applications will have to continue using the old version of KCL.

Idea for supports

I have come up with some ideas for a radical solution.

Idea 1. Make Kinesis Adapter get the child shards.

Currently, AmazonDynamoDBStreamsAdapterClient#getRecords() in Kinesis Adapter doesn't get the child shard information, but we can fix it to get it and set it in GetRecordsResult and return it.

I tried to fix it, but to get the child shards in the DynamoDB Streams API, use describeStream to get a list of shards, and then find the child shards of the shard in question.
However, since only a ShardIterator is passed as an argument to getRecords(), the ID of the parent shard is not known, and it is not possible to determine which shard is the target child shard from the list of shards.

Therefore, we think this idea is not feasible.

Idea 2. Don't check with KCL

When KCL is used in combination with Kinesis Adapter, it should not be checked by KinesisDataFetcher#isValidResult().

In this case, we can't make KCL depend on Kinesis Adapter, so how do we determine "used in combination with Kinesis Adapter"?

I think this can be solved by passing a flag to the constructor of KinesisDataFetcher that specifies whether or not to perform the check. (This flag can be specified in the KCL configuration.)

Do you have any other ideas?

Almost happy birthday for this issue, with zero comments from the maintainers..

@gguptp was this fixed or why is this issue closed now?

commented

We have released the newest dynamodb-streams-kinesis-adapter version 1.6.0, which is compatible with KCL 1.14.9 version

Thanks for the update!

@gguptp we've made the migration to dynamodb-streams-kinesis-adapter version 1.6.0 using KCL 1.14.9. However, we're still seeing the ERROR level logs about "GetRecordsResult is not valid" from com.amazonaws.services.kinesis.clientlibrary.lib.worker.ProcessTask. But the stream processing does seem to be working. Can we suppress this error log with good conscience?

commented

Please make sure StreamsWorkerFactory is getting used to initialize KCL worker

Ah yes, thanks for support @gguptp!! We were already on 1.5.3 but we were not creating our workers with StreamsWorkerFactory yet and I didn't read the older release notes. Now it's running just fine on 1.6.0 with 1.14.9! We're seeing a slight increase in CPU usage but also lower latency. All good! 🙏🏻