MIT-LCP / physionet-build

The new PhysioNet platform.

Home Page: https://physionet.org/

S3 sync performance improvements

bemoody opened this issue

Uploading files to S3 is coming along, but it's slow. The server just spent about 2 days uploading one database (143 GB, 57k files).

There was also a recent problem where a project's zip file was missing, so the server retried five times, re-uploading the entire project each time.

There are a couple of things we could do to improve the situation without changing the S3 upload code itself:

  • Parallelizing background tasks (see pull #699)

  • Routing S3 connections through a separate interface or proxy (see issue #2103)

Here are some things we could do to improve the S3 upload logic:

  1. Upload files in order and track progress, so that when the upload task is interrupted/retried, it can be resumed without restarting the whole process.

  2. Check checksums and avoid uploading files that haven't changed when re-uploading a project.

  3. Detect files that already exist in S3 (in a previous project version) and do an S3-to-S3 copy instead of re-uploading.
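Points 1 and 2 above might look something like the following: walk the project tree in a deterministic order, checksum each file, and skip anything whose checksum matches a persisted manifest. This is only a sketch; `files_to_upload` and the manifest format are hypothetical, and in the real task the manifest would be stored somewhere durable (e.g. the database) so an interrupted upload can resume where it left off.

```python
import hashlib
import os


def file_md5(path, chunk_size=1 << 20):
    """Checksum a file in chunks so large files aren't read into memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def files_to_upload(root, manifest):
    """Yield (relative_path, md5) for files whose checksum differs from
    the manifest entry; files already uploaded unchanged are skipped.

    Sorting the names gives a stable order, which is what makes
    "resume after interruption" meaningful (point 1)."""
    for dirpath, _, names in os.walk(root):
        for name in sorted(names):
            path = os.path.join(dirpath, name)
            rel = os.path.relpath(path, root)
            digest = file_md5(path)
            if manifest.get(rel) != digest:
                yield rel, digest
```

After each successful upload the task would record the file's checksum in the manifest, so a retry only processes the remainder.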

(Keep in mind, all these problems apply equally to GCP.)

I'm somewhat inclined to throw away the Python upload code and just use rclone or awscli.
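If we went the awscli route, the background task could simply shell out to `aws s3 sync`, which already does incremental transfers (by default it skips files whose size and modification time match the object in S3). A minimal sketch; the bucket name and paths are placeholders:

```python
import subprocess


def s3_sync_command(local_dir, bucket, prefix):
    """Build the awscli invocation for an incremental sync of a
    project directory to its S3 prefix."""
    return [
        "aws", "s3", "sync",
        local_dir,
        f"s3://{bucket}/{prefix}",
        "--no-progress",  # keep task logs readable
    ]


def sync_project(local_dir, bucket, prefix):
    # check=True turns a non-zero exit status into an exception,
    # so a failed sync is not silently recorded as success.
    subprocess.run(s3_sync_command(local_dir, bucket, prefix), check=True)
```

rclone's `rclone sync` would be the equivalent, with the added option of checksum-based comparison, and either tool would give us resumability and skip-unchanged behavior for free.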

Duplicate of issue #1903