The goal of this project is to make it easy to work with local or cloud storage as part of a data science workflow.
e.g. `download_dataset azure_demo mnist/hand_drawn_digits`
Just import the storage tools core module, create a client and download your dataset.
Secret keys live in their own files, and storage tools knows how to find them.
Storage tools makes it easy to:

- manage multiple versions of a dataset, and
- know which version of the dataset you are working with locally.
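A typical round trip is only a few lines. As a preview, the sketch below assumes the `azure_demo` profile and the `new_storage_client` helper that are introduced later in this README:

```python
from storage_tools.core import new_storage_client

# Build a client from a profile defined in secrets/settings.ini
client = new_storage_client('azure_demo')

# Fetch a named dataset into the local data folder
client.download_dataset('mnist/hand_drawn_digits')
```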
```
pip install storage_tools
```
We recommend that you ...

Use forward slashes when specifying files, paths and dataset names, and lay out your project as follows:

```
project_root
∟ data
∟ secrets
  ∟ settings.ini
```
Add the following to `.gitignore`

```
secrets/
data/
```
Add the following to `settings.ini`

```
[DEFAULT]
local_path=data
```
Run from your project root, a storage client will read `project_root/secrets/settings.ini` and save all local data to `project_root/data`.
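Because `settings.ini` is a standard INI file, you can check what a client will see using the standard library's `configparser` (a debugging aid only; it is not part of the storage_tools API):

```python
from configparser import ConfigParser

# Read the same file storage tools reads, relative to the project root
config = ConfigParser()
config.read('secrets/settings.ini')

print(config['DEFAULT']['local_path'])  # -> data
print(config.sections())                # -> profile names, e.g. ['azure_demo']
```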
If we follow the above conventions and have a project folder containing

```
project_root
∟ data
  ∟ mnist
    ∟ hand_drawn_digits
      ∟ digit0.png
      ∟ digit1.png
      ∟ ...
∟ secrets
  ∟ settings.ini
∟ main.py
```
where `settings.ini` contains

```
[DEFAULT]
local_path=data

[azure_demo]
storage_client=storage_tools.core.AzureStorageClient
conn_str=<A connection string to an Azure Storage account without credential>
credential=<The credentials with which to authenticate>
container=<The name of a storage container>
```
we can use `main.py` to upload the dataset:

```python
from storage_tools.core import new_storage_client

# Build a client from the [azure_demo] profile in secrets/settings.ini
storage_client = new_storage_client('azure_demo')
storage_client.ls()              # list datasets in the Azure container
storage_client.ls('local_path')  # list datasets in the local data folder

# Upload the local dataset as a new major version
storage_client.upload_dataset('mnist/hand_drawn_digits', 'major')
```
Note: If you run `storage_client.ls()` again, you'll see the new file in the Azure container.
Feel free to delete your local copy of this dataset (from `data`) and download it again from Azure Storage:

```python
storage_client.download_dataset('mnist/hand_drawn_digits')
```
Note: If you run `storage_client.ls('local_path')` again, you'll see the dataset in `data`.
See the BlobServiceClient docs for more details on the settings used in `settings.ini`: `from_connection_string` (`conn_str` and `credential`) and `get_container_client` (`container`).
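To make that mapping concrete, here is roughly how those settings flow into the Azure SDK. This is an illustrative sketch, not the actual `AzureStorageClient` implementation:

```python
from azure.storage.blob import BlobServiceClient

# Placeholder values, as they appear in the [azure_demo] section
conn_str = '<A connection string to an Azure Storage account without credential>'
credential = '<The credentials with which to authenticate>'
container = '<The name of a storage container>'

# conn_str and credential feed from_connection_string ...
service = BlobServiceClient.from_connection_string(conn_str, credential=credential)

# ... and container feeds get_container_client
container_client = service.get_container_client(container)
```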
It's the same as Azure, except `settings.ini` contains

```
[DEFAULT]
local_path=data

[aws_demo]
storage_client=storage_tools.core.AwsStorageClient
service_name=s3
aws_access_key_id=<An AWS access key ID>
aws_secret_access_key=<An AWS secret access key>
bucket=<The name of an AWS bucket that the access key is allowed to read from and write to>
```
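These settings map directly onto boto3's standard client arguments. Again, an illustrative sketch rather than the actual `AwsStorageClient` implementation:

```python
import boto3

# Placeholder values, as they appear in the [aws_demo] section
aws_access_key_id = '<An AWS access key ID>'
aws_secret_access_key = '<An AWS secret access key>'
bucket = '<The name of an AWS bucket>'

# service_name and the two keys feed boto3.client ...
s3 = boto3.client('s3',
                  aws_access_key_id=aws_access_key_id,
                  aws_secret_access_key=aws_secret_access_key)

# ... and bucket names the bucket to read from and write to
response = s3.list_objects_v2(Bucket=bucket)
```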
To set up a development environment:

```
git config --global core.autocrlf input
conda create -n storage_tools python==3.8 -y
conda activate storage_tools
pip install fastcore nbdev jupyter
pip install boto3 azure-storage-blob
pip install mypy
```
Then, from the storage_tools project folder, run

```
nbdev_build_lib
mypy storage_tools/core.py --ignore-missing-imports
```
For now, I'm ignoring the "Skipping analyzing 'azure': found module but no type hints or library stubs" error.
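If you'd rather silence only the azure modules instead of all missing imports, mypy supports per-module overrides in a config file. A possible `mypy.ini` for that (a sketch; it isn't part of this repo):

```ini
[mypy]

# Ignore missing type stubs for the azure SDK only
[mypy-azure.*]
ignore_missing_imports = True
```

Depending on your environment, boto3 may need a similar override.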