S3 sync performance improvements
bemoody opened this issue · comments
Uploading files to S3 is coming along, but it's slow. The server just spent about 2 days uploading one database (143 GB, 57k files).
There was also a problem recently where a project's zip file was missing, so the server retried five times (re-uploading the entire project each time).
There are a couple of things we could do to improve the situation without changing the S3 upload code itself:
- Parallelizing background tasks (see pull #699)
- Routing S3 connections through a separate interface or proxy (see issue #2103)
Here are some things we could do to improve the S3 upload logic:
- Upload files in order and track progress, so that when the upload task is interrupted or retried, it can be resumed without restarting the whole process.
- Check checksums and avoid re-uploading files that haven't changed when re-uploading a project.
- Detect files that already exist in S3 (in a previous project version) and do an S3-to-S3 copy instead of re-uploading.
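The checksum comparison could piggyback on S3's ETag semantics: for a single-part upload the ETag is the hex MD5 of the object, and for a multipart upload it is the MD5 of the concatenated binary part MD5s with a `-<part count>` suffix. A rough sketch (function names are hypothetical, not from our codebase; it assumes we know the part size the uploader used):

```python
import hashlib


def expected_etag(path, part_size=8 * 1024 * 1024):
    """Compute the ETag S3 would report for this local file.

    Single-part upload: ETag is the hex MD5 of the contents.
    Multipart upload: ETag is the MD5 of the concatenated binary
    part MD5s, suffixed with "-<number of parts>".  Assumes the
    uploader's part size (8 MiB here, a common default).
    """
    part_md5s = []
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(part_size)
            if not chunk:
                break
            part_md5s.append(hashlib.md5(chunk).digest())
    if not part_md5s:
        # Empty object: S3 reports the MD5 of zero bytes.
        return hashlib.md5(b'').hexdigest()
    if len(part_md5s) == 1:
        return part_md5s[0].hex()
    combined = hashlib.md5(b''.join(part_md5s)).hexdigest()
    return '{}-{}'.format(combined, len(part_md5s))


def needs_upload(path, remote_etag):
    """Skip the upload when the local file matches the stored ETag."""
    return expected_etag(path) != remote_etag
```

One caveat: this only works if we control (or record) the part size used for multipart uploads; otherwise we'd have to store our own checksum in object metadata at upload time and compare against that instead.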
(Keep in mind, all these problems apply equally to GCP.)
I'm somewhat inclined to throw away the Python upload code and just use rclone or awscli.
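For reference, both tools already do most of the above out of the box. A sketch of what the invocation might look like (bucket and path names are placeholders):

```shell
# rclone: compares checksums, retries, and parallelizes transfers itself
rclone sync /data/projects/mydb remote:my-bucket/projects/mydb \
    --checksum --transfers 8

# awscli: skips files whose size and mtime match the object in S3
aws s3 sync /data/projects/mydb s3://my-bucket/projects/mydb/
```

Note the default comparison differs: `aws s3 sync` uses size and timestamp, while `rclone sync --checksum` compares hashes, which is closer to what we want for catching corrupted or partially-uploaded files.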