How to analyze data directly from an S3 bucket via the S3 REST API or an S3 SDK for Python?
robinatw opened this issue
Requirement
Dear Community,
This is Robin from the pharmaceutical company Novo Nordisk.
First, thank you for your contributions to the cellpainting-gallery project.
Since cellpainting-gallery is a public S3 bucket, we'd like to run some data analysis on the data you host on AWS S3.
I can list datasets with the AWS CLI:
aws s3 ls --no-sign-request s3://cellpainting-gallery/cpg0000-jump-pilot/
The bucket seems to allow anonymous access through the AWS CLI, but when I open it in a browser, I get Access Denied:
http://s3.amazonaws.com/cellpainting-gallery/cpg0000-jump-pilot
<Error>
<Code>AccessDenied</Code>
<Message>Access Denied</Message>
<RequestId>1Z4Z91ER4NJVXQYX</RequestId>
<HostId>0O1/TtCuJTVnxKUfT9tD4hfK78DTNAxLd/wEc4nYyGwhykyaKJzoZQljS+AKHjjYl1IbjRcJOIg=</HostId>
</Error>
We'd like to do this efficiently: analyze the data directly in the S3 bucket instead of downloading the datasets for local analysis.
Because local storage space is limited, I'd like to know whether the cellpainting-gallery datasets can be accessed via the S3 REST API or an S3 SDK for Python (such as boto3).
Taking the dataset cellpainting-gallery/cpg0000-jump-pilot as an example, I'd like to get the file sizes recursively across the whole dataset.
What prerequisites do I need?
- Do I need to create an AWS account?
- Which default AWS S3 profile settings (such as account and region) do I need to access the cellpainting-gallery bucket?
- Could you please provide a code snippet using the S3 REST API or an S3 SDK for Python to retrieve the dataset?
It would be great if you could describe the detailed steps to implement this.
Best Regards,
Robin
You can use boto3/botocore to access cellpainting-gallery anonymously as follows:
import boto3
from botocore import UNSIGNED
from botocore.client import Config
s3 = boto3.client('s3', config=Config(signature_version=UNSIGNED), region_name='us-east-1')  # set your region
Small "folders" (prefixes) can then be directly listed as:
s3.list_objects_v2(
    Bucket='cellpainting-gallery',
    Prefix='cpg0000-jump-pilot/source_4/images/2020_11_04_CPJUMP1/illum/BR00116991/')
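The call returns a plain dictionary, with one entry per object under the `Contents` key. As a rough sketch of how to pull out the file names and sizes, the two helpers below (`keys_and_sizes` and `is_truncated` are hypothetical names for illustration, not part of boto3) read the `Key`, `Size`, and `IsTruncated` fields of a single response:

```python
from typing import List, Tuple


def keys_and_sizes(response: dict) -> List[Tuple[str, int]]:
    """Return (key, size in bytes) for each object in one list_objects_v2 response."""
    # 'Contents' is absent when no object matches the prefix
    return [(obj['Key'], obj['Size']) for obj in response.get('Contents', [])]


def is_truncated(response: dict) -> bool:
    """True when more keys remain beyond this single response."""
    return bool(response.get('IsTruncated', False))
```

Passing the return value of the `list_objects_v2` call to `keys_and_sizes` gives a list you can loop over or sum. If `is_truncated` reports `True`, the prefix holds more objects than one response can carry.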
For large "folders" (prefixes), a single list_objects_v2 call returns at most 1,000 keys, so you'll need to use a paginator such as:
paginator = s3.get_paginator('list_objects_v2')
pages = paginator.paginate(Bucket='cellpainting-gallery', Prefix='cpg0000-jump-pilot/')
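To get the recursive file size asked about in the original question, the pages can be iterated and the `Size` of every entry summed. This is only a minimal sketch: `summarize_pages` and `dataset_size` are hypothetical helper names, and the region is assumed to be `us-east-1` as in the client above.

```python
from typing import Iterable, Tuple


def summarize_pages(pages: Iterable[dict]) -> Tuple[int, int]:
    """Return (object_count, total_bytes) across list_objects_v2 result pages."""
    count = 0
    total_bytes = 0
    for page in pages:
        # 'Contents' is missing from a page when the prefix matches no objects
        for obj in page.get('Contents', []):
            count += 1
            total_bytes += obj['Size']  # object size in bytes
    return count, total_bytes


def dataset_size(bucket: str, prefix: str) -> Tuple[int, int]:
    """List every object under the prefix anonymously and total the sizes."""
    # Imported here so summarize_pages stays usable without boto3 installed
    import boto3
    from botocore import UNSIGNED
    from botocore.client import Config

    s3 = boto3.client('s3', config=Config(signature_version=UNSIGNED),
                      region_name='us-east-1')  # assumed region
    paginator = s3.get_paginator('list_objects_v2')
    return summarize_pages(paginator.paginate(Bucket=bucket, Prefix=prefix))
```

Calling `dataset_size('cellpainting-gallery', 'cpg0000-jump-pilot/')` would walk the whole dataset. Only object metadata is transferred, so no local storage is needed, but a prefix with many objects takes many requests and can run for a while.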
We now have https://github.com/broadinstitute/cpg/tree/main/cpgdata which will make indexing and finding files dramatically easier. I'll close this out now.