opensearch-project / data-prepper

Data Prepper is a component of the OpenSearch project that accepts, filters, transforms, enriches, and routes data at scale.

Home Page: https://opensearch.org/docs/latest/clients/data-prepper/index/

[BUG] Ownership can timeout on full buffer for pull based sources

graytaylor0 opened this issue

Describe the bug
We currently update source coordination ownership for partitions synchronously in pull-based sources such as S3, OpenSearch, and DynamoDB. This happens in a loop approximately every 2 minutes. When the buffer is very full, we spend time retrying writes to the buffer, so ownership of the partition expires and another Data Prepper node reprocesses that partition.

Expected behavior
Asynchronously update ownership every 2 minutes without depending on the primary loop. For example, this is done here for DynamoDB:

if (System.currentTimeMillis() - lastCheckpointTime > DEFAULT_CHECKPOINT_INTERVAL_MILLS) {

We should update ownership in a timely manner regardless of how long it takes to write to the buffer.
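One way to read the expected behavior is a renewal task that runs on its own scheduled thread. The following is a minimal sketch only. `OwnershipRenewer` and the `renewOwnership` callback are hypothetical names, standing in for whatever source coordination call extends the ownership timeout; neither comes from the Data Prepper code base.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of Data Prepper: renews partition ownership on
// its own thread so that a slow buffer write cannot delay the renewal.
final class OwnershipRenewer implements AutoCloseable {
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor(runnable -> {
                Thread thread = new Thread(runnable, "ownership-renewer");
                thread.setDaemon(true); // do not keep the JVM alive for renewals
                return thread;
            });

    // renewOwnership is an assumed callback that extends the ownership timeout.
    OwnershipRenewer(Runnable renewOwnership, long interval, TimeUnit unit) {
        scheduler.scheduleAtFixedRate(() -> {
            try {
                renewOwnership.run();
            } catch (RuntimeException e) {
                // An exception escaping the task would suppress all later runs,
                // so it is swallowed here and the next tick tries again.
            }
        }, interval, interval, unit);
    }

    // Stops renewing, for example once the partition is completed or given up.
    @Override
    public void close() {
        scheduler.shutdownNow();
    }
}
```

Under these assumptions, a source would create the renewer after acquiring a partition and close it when it is done with that partition, so renewals continue while the main loop is blocked on the buffer.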

Alternative consideration
Increase the ownership timeout to a higher value, or check for ownership updates between attempts to write to the buffer.

Maybe we can just have the buffer accumulator take a callback that runs when the buffer write times out.

@graytaylor0,

I'm not sure how that would be different. When it times out, doesn't the current thread continue and then iterate back to getting ownership? Or is there something in between?

If I understand the problem correctly, the ownership is expiring during the write to the buffer.

@dlvenable The buffer accumulator currently blocks and retries internally here ( ). So I was thinking the callback would run between the backoff retries at some point here ( ).
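The callback idea could look roughly like the sketch below. It is not Data Prepper's buffer accumulator: `RetryingWriter`, `writeWithRetries`, and their parameters are hypothetical, and a fixed sleep stands in for the real backoff.

```java
import java.util.function.BooleanSupplier;

// Hypothetical sketch, not Data Prepper's buffer accumulator: retries a buffer
// write and runs a caller-supplied callback between attempts.
final class RetryingWriter {
    // tryWrite returns true once the buffer accepts the write.
    // betweenRetries is where an ownership renewal could run.
    // Returns true if the write succeeded within maxAttempts.
    static boolean writeWithRetries(BooleanSupplier tryWrite,
                                    Runnable betweenRetries,
                                    int maxAttempts,
                                    long backoffMillis) throws InterruptedException {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (tryWrite.getAsBoolean()) {
                return true;
            }
            if (attempt < maxAttempts) {
                betweenRetries.run();        // e.g. renew partition ownership
                Thread.sleep(backoffMillis); // fixed backoff keeps the sketch short
            }
        }
        return false;
    }
}
```

With this shape the renewal still runs on the writing thread, so its timing depends on the backoff interval staying well below the ownership timeout.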