How to analyze data directly from an S3 bucket via the S3 REST API or the S3 SDK for Python?

Question

robinatw opened this issue 2 years ago

Robin Guo commented 2 years ago

Requirement

Dear Community,

This is Robin from Novo Nordisk Pharmaceutical company.

First, thank you for your contributions to the cellpainting-gallery project.

Since cellpainting-gallery is a public S3 bucket, we'd like to do some data analysis based on the data you host on AWS S3.

I can list the datasets with the AWS CLI: aws s3 ls --no-sign-request s3://cellpainting-gallery/cpg0000-jump-pilot/

It seems the bucket allows anonymous access through the AWS CLI, but when I open the bucket in a browser, I get Access Denied.

http://s3.amazonaws.com/cellpainting-gallery/cpg0000-jump-pilot

<Error>
<Code>AccessDenied</Code>
<Message>Access Denied</Message>
<RequestId>1Z4Z91ER4NJVXQYX</RequestId>
<HostId>0O1/TtCuJTVnxKUfT9tD4hfK78DTNAxLd/wEc4nYyGwhykyaKJzoZQljS+AKHjjYl1IbjRcJOIg=</HostId>
</Error>

We'd like to do this in a smart way: analyze the data directly in the S3 bucket instead of downloading the datasets to local storage first.

Because local storage space is limited, I'd like to know whether it's possible to access the cellpainting-gallery datasets via the S3 REST API or the S3 SDK for Python (such as boto3).

Let's take the dataset cellpainting-gallery/cpg0000-jump-pilot as an example: I'd like to get the file sizes recursively across the whole dataset.

What prerequisites do I need?

Do I need to create an AWS account?
What default AWS S3 profile (such as account and region) do I need to set up to access the cellpainting-gallery bucket?
Could you please provide a code snippet using the S3 REST API or the S3 SDK for Python to retrieve the dataset?

It would be great if you could describe the detailed steps to implement this.

Best Regards,
Robin

Erin Weisbart · Answer 1 · Fri Mar 17 2023 00:50:53 GMT+0800 (China Standard Time)

You can use boto3/botocore to access cellpainting-gallery anonymously as follows:

import boto3
from botocore import UNSIGNED
from botocore.client import Config

# UNSIGNED skips request signing, so no AWS account or credentials are needed.
# Set region_name to your region.
s3 = boto3.client('s3', config=Config(signature_version=UNSIGNED), region_name='us-east-1')

Small "folders" (prefixes) can then be listed directly:

s3.list_objects_v2(
    Bucket='cellpainting-gallery',
    Prefix='cpg0000-jump-pilot/source_4/images/2020_11_04_CPJUMP1/illum/BR00116991/')
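The call above returns a dict. The object entries sit under its 'Contents' key, each with a 'Key' and a 'Size' in bytes. A minimal sketch of pulling those out follows; summarize_listing is a hypothetical helper name, and the sample response is made up for illustration rather than taken from the bucket:

```python
def summarize_listing(response):
    """Return (key, size_in_bytes) pairs from one list_objects_v2 response."""
    # 'Contents' is absent when the prefix matches no objects.
    return [(obj["Key"], obj["Size"]) for obj in response.get("Contents", [])]


# Made-up response with the same shape boto3 returns (not real bucket data).
sample_response = {
    "KeyCount": 2,
    "IsTruncated": False,
    "Contents": [
        {"Key": "example-prefix/a.npy", "Size": 1024},
        {"Key": "example-prefix/b.npy", "Size": 2048},
    ],
}

for key, size in summarize_listing(sample_response):
    print(f"{size:>10d}  {key}")
```

With the real client, pass the result of s3.list_objects_v2(...) instead of sample_response. If the response has 'IsTruncated' set to True, the listing was cut off at the per-call limit of 1,000 keys and a paginator is needed.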

For large "folders" (prefixes), you'll need to use a paginator, because list_objects_v2 returns at most 1,000 keys per call:

paginator = s3.get_paginator('list_objects_v2')
pages = paginator.paginate(Bucket='cellpainting-gallery', Prefix='cpg0000-jump-pilot/')
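To get the recursive file size asked about in the question, the pages can be iterated and the 'Size' fields summed. This is a rough sketch under the same anonymous-access assumptions as above; make_anonymous_client, total_size, and dataset_size are hypothetical helper names, not part of boto3:

```python
def make_anonymous_client(region_name="us-east-1"):
    """Create an unsigned (anonymous) S3 client, as in the snippet above."""
    # Imported lazily so total_size() can be used without boto3 installed.
    import boto3
    from botocore import UNSIGNED
    from botocore.client import Config

    return boto3.client(
        "s3",
        config=Config(signature_version=UNSIGNED),
        region_name=region_name,
    )


def total_size(pages):
    """Return (object_count, total_bytes) over list_objects_v2 pages."""
    n_objects = 0
    n_bytes = 0
    for page in pages:
        # A page with no matching objects has no 'Contents' key.
        for obj in page.get("Contents", []):
            n_objects += 1
            n_bytes += obj["Size"]
    return n_objects, n_bytes


def dataset_size(bucket, prefix):
    """List every object under the prefix and sum the sizes (network call)."""
    s3 = make_anonymous_client()
    paginator = s3.get_paginator("list_objects_v2")
    return total_size(paginator.paginate(Bucket=bucket, Prefix=prefix))
```

Calling dataset_size('cellpainting-gallery', 'cpg0000-jump-pilot/') would walk the whole dataset. Nothing is downloaded, since only object metadata is listed, but a large prefix means many requests, so it may take a while.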

Shantanu Singh · Answer 2 · Mon Apr 01 2024 20:32:45 GMT+0800 (China Standard Time)

We now have https://github.com/broadinstitute/cpg/tree/main/cpgdata which will make indexing and finding files dramatically easier. I'll close this out now.