google / weather-tools

Tools to make weather data accessible and useful.

Home Page: https://weather-tools.readthedocs.io/

Create a performant, cloud-agnostic way to download & upload files to cloud buckets.

alxmrs opened this issue

See discussion here: #254 (comment)

To investigate:

  • Can we do better than shutil.copyfileobj?
  • What are the optimal chunk sizes?
  • Can we copy data in parallel?
  • Are there optimizations we can do for large files?
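One way to start on the chunk-size question is a small timing harness. The sketch below is an illustration under assumptions, not code from weather-tools: `timed_copy` is a hypothetical helper, it copies between local files rather than GCS objects, and the candidate sizes are arbitrary starting points.

```python
import os
import shutil
import tempfile
import time


def timed_copy(src_path: str, dst_path: str, chunk_size: int) -> float:
    """Copy src_path to dst_path in chunk_size-byte reads; return seconds taken."""
    start = time.perf_counter()
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        # The third argument is the buffer length used for each read/write.
        shutil.copyfileobj(src, dst, chunk_size)
    return time.perf_counter() - start


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        src_path = os.path.join(tmp, "src.bin")
        with open(src_path, "wb") as f:
            f.write(os.urandom(8 * 1024 * 1024))  # 8 MiB of test data

        # Arbitrary candidate sizes: 64 KiB, 1 MiB, 8 MiB.
        for chunk_size in (64 * 1024, 1024 * 1024, 8 * 1024 * 1024):
            dst_path = os.path.join(tmp, f"dst-{chunk_size}.bin")
            seconds = timed_copy(src_path, dst_path, chunk_size)
            print(f"chunk_size={chunk_size:>9}: {seconds:.4f}s")
```

Local-disk timings mostly reflect the page cache, so for a meaningful answer the same loop would need to run against the remote file objects and a file closer to production size.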

One idea that @bahmandar has explored is calling gsutil in a subprocess (the CLI is really efficient at file transfer).

Here is where gcloud is used for GCS files:
https://github.com/bahmandar/weather-tools/blob/mv-faster/weather_mv/loader_pipeline/sinks.py#L416

The fallback for downloading is shutil, and the fallback for remote paths is the Apache Beam file systems.

I also have a different shutil-based copy optimized for GCS:
https://github.com/bahmandar/weather-tools/blob/mv-faster/weather_mv/loader_pipeline/sinks.py#L402
One thing to note: it seems to be more beneficial to change the buffer sizes for GCS IO and shutil together than to change just one of them.

Here are some advantages of just using gsutil vs a hand-rolled Python solution:

  • over a size threshold, gsutil will automatically parallelize file transfer
  • gsutil uses checksums to verify the integrity of data transferred, and will automatically retry on corrupted data
  • the default Dataflow image already has gcloud installed, so in theory it's easy to manage this dependency
  • we get to reuse code that the GCP + boto devs have invested in heavily (maybe including some magic constants found from trial & error)
  • we get all these features with a slick one-liner: subprocess.run(f'gsutil cp {src!r} {dst!r}', shell=True, check=True)
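A slightly more defensive variant of that one-liner is sketched below. Note that `{src!r}` applies Python `repr` quoting, which is not the same as shell quoting, so a path containing a quote character could break the command. Passing an argument list avoids the shell entirely. The helper names (`build_gsutil_cp`, `copy_local`, `copy`) are hypothetical, and the shutil fallback shown here only handles local paths; it is an assumption for illustration, not the logic in the linked branch.

```python
import shutil
import subprocess
from typing import List


def build_gsutil_cp(src: str, dst: str) -> List[str]:
    """Return the argv for `gsutil cp`, one list element per argument."""
    return ["gsutil", "cp", src, dst]


def copy_local(src: str, dst: str) -> None:
    """Fallback copy for local paths only, using shutil."""
    with open(src, "rb") as fsrc, open(dst, "wb") as fdst:
        shutil.copyfileobj(fsrc, fdst)


def copy(src: str, dst: str) -> str:
    """Copy with gsutil when it is on PATH; otherwise fall back to shutil.

    Returns the name of the strategy that was used.
    """
    if shutil.which("gsutil") is not None:
        # No shell=True: arguments are passed through without shell parsing.
        subprocess.run(build_gsutil_cp(src, dst), check=True)
        return "gsutil"
    copy_local(src, dst)
    return "shutil"
```

With `check=True`, a non-zero exit from gsutil raises `subprocess.CalledProcessError`, so a caller can catch it and decide whether to retry or fall back.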

Thanks @mahrsee1997 for pointing this out! https://cloud.google.com/blog/products/storage-data-transfer/new-gcloud-storage-enables-super-fast-data-transfers/

With a 10GB file, gcloud storage was 94% faster than gsutil on download and 57% faster on upload.

@mahrsee1997 did some benchmarking of different cloud utilities to see which would be the fastest. Our results show that gsutil is the best fit for us. However, it is possible that gcloud alpha storage would be faster if we upgraded to the latest version of gcloud. Using that version of the SDK requires that we update the versions of all our GCP dependencies. This is something that we'll tackle, but in a future PR. #265 is still a great win.

I did the benchmarking on a file of size ~18.42 GiB, and it appears that gsutil is the most efficient approach here. It takes ~77% less time than our original shutil approach.

=========================================
gcloud alpha storage – 1st run: 6.48 minutes & 2nd run: 6.53 minutes
2022-12-04 00:46:35.871 GMT
[licence4.0] Uploading to store for 'gs://$BUCKET/tmp2/rahul/aplha-storage-20-00:00:00z-tprate.gb'.
2022-12-04 00:53:04.977 GMT
[licence4.0] Upload to store complete for 'gs://$BUCKET/tmp2/rahul/aplha-storage-20-00:00:00z-tprate.gb'.

-----------------------------------------
gsutil – 1st run : 3.82 minutes & 2nd run : 4.72 minutes
2022-12-04 01:24:48.613 GMT
[licence4.0] Uploading to store for 'gs://$BUCKET/tmp2/rahul/gsutil-20-00:00:00z-tprate.gb'.
2022-12-04 01:28:37.283 GMT
[licence4.0] Upload to store complete for 'gs://$BUCKET/tmp2/rahul/gsutil-20-00:00:00z-tprate.gb'.

-----------------------------------------
storage-client – 7.5 minutes
2022-12-04 08:07:27.727 GMT
[licence4.0] Uploading to store for 'gs://$BUCKET/tmp2/rahul/storage-client-20-00:00:00z-tprate.gb'.
2022-12-04 08:14:57.435 GMT
[licence4.0] Upload to store complete for 'gs://$BUCKET/tmp2/rahul/storage-client-20-00:00:00z-tprate.gb'.

-----------------------------------------
shutil – 16.75 minutes
2022-12-04 08:07:18.234 GMT
[licence4.0] Uploading to store for 'gs://$BUCKET/tmp2/rahul/shutil-20-00:00:00z-tprate.gb'.
2022-12-04 08:24:03.314 GMT
[licence4.0] Upload to store complete for 'gs://$BUCKET/tmp2/rahul/shutil-20-00:00:00z-tprate.gb'.
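The durations and the ~77% figure can be re-derived from the log timestamps. The sketch below copies the start and end times of the logged runs from the lines above and compares each tool against shutil; the variable and function names are just for illustration.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"

# (start, end) timestamps copied from the log lines above.
RUNS = {
    "gcloud alpha storage": ("2022-12-04 00:46:35.871", "2022-12-04 00:53:04.977"),
    "gsutil": ("2022-12-04 01:24:48.613", "2022-12-04 01:28:37.283"),
    "storage-client": ("2022-12-04 08:07:27.727", "2022-12-04 08:14:57.435"),
    "shutil": ("2022-12-04 08:07:18.234", "2022-12-04 08:24:03.314"),
}


def minutes(start: str, end: str) -> float:
    """Elapsed time between two log timestamps, in minutes."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds() / 60


durations = {name: minutes(*span) for name, span in RUNS.items()}
baseline = durations["shutil"]

for name, mins in durations.items():
    # Percentage of time saved relative to the shutil baseline.
    reduction = 100 * (1 - mins / baseline)
    print(f"{name}: {mins:.2f} min, {reduction:.0f}% less time than shutil")
```

For the logged gsutil run this works out to roughly 77% less time than shutil, which matches the figure quoted above.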