MIT-LCP / physionet-build

The new PhysioNet platform.

Home Page: https://physionet.org/

S3 sync performance improvements

bemoody opened this issue

Uploading files to S3 is coming along, but it's slow. The server just spent about 2 days uploading one database (143 GB, 57k files).

There was also a recent problem where a project's zip file was missing, so the server retried five times, re-uploading the entire project each time.

There are a couple of things we could do to improve the situation without changing the S3 upload code itself:

  • Parallelizing background tasks (see pull #699)

  • Routing S3 connections through a separate interface or proxy (see issue #2103)

Here are some things we could do to improve the S3 upload logic:

  1. Upload files in order and track progress, so that when the upload task is interrupted/retried, it can be resumed without restarting the whole process.

  2. Check checksums and avoid uploading files that haven't changed when re-uploading a project.

  3. Detect files that already exist in S3 (in a previous project version) and do an S3-to-S3 copy instead of re-uploading.
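Points 1 and 2 above might look something like the following: walk the project tree in a deterministic order, checksum each file, and skip anything whose checksum matches a persisted manifest. This is only a sketch; `files_to_upload` and the manifest format are hypothetical, and in the real task the manifest would be stored somewhere durable (e.g. the database) so an interrupted upload can resume where it left off.

```python
import hashlib
import os


def file_md5(path, chunk_size=1 << 20):
    """Checksum a file in chunks so large files aren't read into memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def files_to_upload(root, manifest):
    """Yield (relative_path, md5) for files whose checksum differs from
    the manifest entry; files already uploaded unchanged are skipped.

    Sorting the names gives a stable order, which is what makes
    "resume after interruption" meaningful (point 1)."""
    for dirpath, _, names in os.walk(root):
        for name in sorted(names):
            path = os.path.join(dirpath, name)
            rel = os.path.relpath(path, root)
            digest = file_md5(path)
            if manifest.get(rel) != digest:
                yield rel, digest
```

After each successful upload the task would record the file's checksum in the manifest, so a retry only processes the remainder.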

(Keep in mind, all these problems apply equally to GCP.)

I'm somewhat inclined to throw away the Python upload code and just use rclone or awscli.
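If we went the awscli route, the background task could simply shell out to `aws s3 sync`, which already does incremental transfers (by default it skips files whose size and modification time match the object in S3). A minimal sketch; the bucket name and paths are placeholders:

```python
import subprocess


def s3_sync_command(local_dir, bucket, prefix):
    """Build the awscli invocation for an incremental sync of a
    project directory to its S3 prefix."""
    return [
        "aws", "s3", "sync",
        local_dir,
        f"s3://{bucket}/{prefix}",
        "--no-progress",  # keep task logs readable
    ]


def sync_project(local_dir, bucket, prefix):
    # check=True turns a non-zero exit status into an exception,
    # so a failed sync is not silently recorded as success.
    subprocess.run(s3_sync_command(local_dir, bucket, prefix), check=True)
```

rclone's `rclone sync` would be the equivalent, with the added option of checksum-based comparison, and either tool would give us resumability and skip-unchanged behavior for free.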

Duplicate of issue #1903