boto / botocore

The low-level, core functionality of boto3 and the AWS CLI.

PageIterator skipping a page when browsing `list_objects_v2` with Delimiter

dboyadzhiev opened this issue · comments

Describe the bug

The paginator for S3 list_objects_v2 skips a page when CommonPrefixes (i.e. Delimiter) and StartingToken are used together.

Use case:
Our API provides a list of S3 "folders" and supports pagination. It is a wrapper over our internal S3 bucket and forwards the information. The first response of the API returns a list of common prefixes and the next token provided by the PageIterator. The second request uses this token to continue the listing.

Expected Behavior

Using the paginator.paginate() method with the Delimiter parameter and not setting StartingToken should return all pages starting from the first one and its next token.
Using it again but this time with a given StartingToken (the first page next token) should return all pages starting from the second one and its next token.

Current Behavior

When paginator.paginate() is called with StartingToken, the first page it returns (the second page overall) has an empty CommonPrefixes list, while the page after it (the third) has a valid CommonPrefixes list.

Reproduction Steps

You need a bucket with date partitions and files in them.

S3://by_bucket/2023-01-01/file1.json
S3://by_bucket/2023-01-01/file2.json
S3://by_bucket/2023-01-02/file1.json
S3://by_bucket/2023-01-02/file2.json
...
S3://by_bucket/2023-12-01/file1.json
S3://by_bucket/2023-12-01/file2.json

import boto3

BUCKET_NAME = ""
PREFIX = ""
token = None

s3_client = boto3.client("s3")
paginator = s3_client.get_paginator('list_objects_v2')

def request_page(token):
    paginator = s3_client.get_paginator('list_objects_v2')
    return paginator.paginate(
        Bucket=BUCKET_NAME,
        Delimiter='/',
        Prefix=PREFIX,
        PaginationConfig={'PageSize': 5, 'StartingToken': token}
    )

# Simulate multiple requests to an API
steps = 0

# First request 
# print page 1 prefixes
# keep the token for page 2
print("Request 1")
for page in request_page(token):
    steps += 1

    print(page['CommonPrefixes'])
    next_token = page['NextContinuationToken']

    if page['CommonPrefixes']:
        print(f"done in step: {steps}")
        break

# Second request
# print page 2 prefixes
# keep the token for page 3
print("Request 2")
for page in request_page(next_token):
    steps += 1

    print(page['CommonPrefixes'])
    next_token = page['NextContinuationToken']

    if page['CommonPrefixes']:
        print(f"done in step: {steps}")
        break

Output:

> Request 1
> S3://by_bucket/2023-01-01
> S3://by_bucket/2023-01-02
> S3://by_bucket/2023-01-03
> S3://by_bucket/2023-01-04
> S3://by_bucket/2023-01-05
> done in step: 1
>
> Request 2
> []
> S3://by_bucket/2023-01-11
> S3://by_bucket/2023-01-12
> S3://by_bucket/2023-01-13
> S3://by_bucket/2023-01-14
> S3://by_bucket/2023-01-15
> done in step: 3

Possible Solution

No response

Additional Information/Context

I followed the issue down to PageIterator.__iter__() (.venv/lib/python3.11/site-packages/botocore/paginate.py)

            if first_request:
                # The first request is handled differently.  We could
                # possibly have a resume/starting token that tells us where
                # to index into the retrieved page.
                if self._starting_token is not None:
                    starting_truncation = self._handle_first_request(
                        parsed, primary_result_key, starting_truncation
                    )
                first_request = False
                self._record_non_aggregate_key_values(parsed)

The primary_result_key is initialized a few lines before that as self.result_keys[0], and result_keys essentially comes from the JSON paginator definition in venv/lib/python3.11/site-packages/botocore/data/s3/2006-03-01/paginators-1.json

"ListObjectsV2": {
      "more_results": "IsTruncated",
      "limit_key": "MaxKeys",
      "output_token": "NextContinuationToken",
      "input_token": "ContinuationToken",
      "result_key": [
        "Contents",
        "CommonPrefixes"
      ]
    },

Here the first result_key entry is Contents, so the primary result key is Contents, which is missing from the parsed S3 response body.
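
One way to picture the suspected mechanism is the standalone model below. It is only a sketch, not botocore's code: it assumes that, when a starting token is supplied, the first-request handling slices the primary result key and resets every other result key to an empty value. The function name `handle_first_request` and the sample page are hypothetical.

```python
# Simplified stand-in for the paginator's first-request handling.
# Assumption: every result key other than the primary one is blanked
# when a starting token is supplied. This is NOT botocore's actual code.
RESULT_KEYS = ["Contents", "CommonPrefixes"]  # order taken from paginators-1.json


def handle_first_request(parsed, result_keys, starting_truncation=0):
    """Mimic resuming from a token on a hypothetical parsed page."""
    primary = result_keys[0]
    data = parsed.get(primary)
    # Slice the primary key; a missing key is stored as None.
    parsed[primary] = data[starting_truncation:] if isinstance(data, list) else None
    # Blank the secondary keys (the assumed behaviour under discussion).
    for key in result_keys[1:]:
        value = parsed.get(key)
        parsed[key] = [] if isinstance(value, list) else None
    return parsed


# A delimiter-only listing: the page carries CommonPrefixes but no Contents.
page = {
    "IsTruncated": True,
    "CommonPrefixes": [{"Prefix": "2023-01-06/"}, {"Prefix": "2023-01-07/"}],
}
resumed = handle_first_request(dict(page), RESULT_KEYS)
print(resumed["CommonPrefixes"])
```

Under that assumption the resumed page comes back with an empty CommonPrefixes list even though the raw response contained prefixes, which would match the `[]` seen at the start of "Request 2".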

SDK version used

1.31.17

Environment details (OS name and version, etc.)

MacOS 14.2.1 (23C71)


Hey @dboyadzhiev, thanks for reaching out and for the detailed reproduction steps. I was able to reproduce this behavior, and will bring it up with the team. I'll provide an update when I know more.

Hi @dboyadzhiev, thanks for your patience. Could you clarify why you have the first call separate from the rest? I was able to get all the common prefixes by using just one loop, and initializing next_token to None. This seems to be what you're trying to achieve, unless I'm misunderstanding the problem.
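
The single-loop approach described in this comment might look roughly like the sketch below. The helper name `collect_common_prefixes` is hypothetical, and the paginator is passed in as an argument, for example `s3_client.get_paginator('list_objects_v2')`.

```python
def collect_common_prefixes(paginator, bucket, prefix="", page_size=5):
    """Walk every page in one loop and gather the 'folder' prefixes."""
    prefixes = []
    pages = paginator.paginate(
        Bucket=bucket,
        Delimiter="/",
        Prefix=prefix,
        # No StartingToken: the listing always starts from the first page.
        PaginationConfig={"PageSize": page_size},
    )
    for page in pages:
        # CommonPrefixes can be absent on a page, so default to an empty list.
        for entry in page.get("CommonPrefixes", []):
            prefixes.append(entry["Prefix"])
    return prefixes
```

Because no StartingToken is involved, this path never goes through the resume handling, which is presumably why it returns every prefix.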

We use that logic to implement pagination in our API. The code above simulates two separate requests. Imagine an app that lists 20 files per page: the second request corresponds to clicking the "next" button.
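
For a stateless "next page" endpoint like this, one possible workaround is to call `list_objects_v2` directly and pass the previous `NextContinuationToken` back as `ContinuationToken`, bypassing the paginator's resume logic. This was not proposed in the thread. It is a minimal sketch, and the helper name `list_folder_page` is hypothetical.

```python
def list_folder_page(s3_client, bucket, prefix="", page_size=5, token=None):
    """Return (prefixes, next_token) for a single page of 'folders'."""
    kwargs = {
        "Bucket": bucket,
        "Delimiter": "/",
        "Prefix": prefix,
        "MaxKeys": page_size,
    }
    # Only send ContinuationToken when resuming from a previous page.
    if token is not None:
        kwargs["ContinuationToken"] = token
    response = s3_client.list_objects_v2(**kwargs)
    prefixes = [p["Prefix"] for p in response.get("CommonPrefixes", [])]
    # NextContinuationToken is absent on the last page, hence .get().
    return prefixes, response.get("NextContinuationToken")


# Usage, assuming a client created with boto3.client("s3"):
#   first, token = list_folder_page(s3_client, BUCKET_NAME)
#   second, token = list_folder_page(s3_client, BUCKET_NAME, token=token)
```

Each API request then maps to exactly one S3 call, and the token handed to the caller is the one S3 itself returned.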