siebrand / s3p

list/copy/sync/compare S3 buckets 5x-50x faster than aws-cli

S3P - 5x to 50x faster than aws-cli

S3P provides a radically faster way to copy, list, sync and do other bulk operations over large AWS S3 buckets.

You can use it as a command-line tool for common operations, or you can use it as a library for nearly anything you can imagine.

Why is S3P so fast?

S3's API is structured around listing items in serial: request 1000 items, wait, then request the next 1000. This is how nearly all S3 tools work. S3P, however, can list items in parallel. It leverages S3's ability to request the first 1000 items equal-to or after a given key. Then, with the help of algorithmic bisection and some intelligent heuristics, S3P can scan the contents of a bucket with an arbitrary degree of parallelism. In practice, S3P can list buckets up to 15x faster than conventional methods.
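
As a rough illustration of the bisection idea, here is a minimal sketch. It is not S3P's actual implementation. It assumes a hypothetical listAfter(startAfter, limit) function standing in for one S3 list request (for example ListObjectsV2 with StartAfter and MaxKeys), faked here with an in-memory array, and it uses a deliberately naive midpoint heuristic. The names makeFakeBucket, middleKey, listRange and listAllParallel are made up for this example.

```javascript
const MIN_CODE = 0x20; // " ", the lowest character allowed in a key
const MAX_CODE = 0x7e; // "~", the highest character allowed in a key
const END = String.fromCharCode(MAX_CODE + 1); // sorts after every valid key

// Hypothetical stand-in for one S3 list request: resolves to at most `limit`
// keys that sort strictly after `startAfter`, in ascending order.
function makeFakeBucket(keys) {
  const sorted = [...keys].sort();
  return async (startAfter, limit) =>
    sorted.filter((key) => key > startAfter).slice(0, limit);
}

// Naive heuristic: guess a key somewhere between lo and hi (assumes lo < hi).
// The guess may fall outside the range; the caller checks it before use.
function middleKey(lo, hi) {
  let i = 0;
  while (i < lo.length && i < hi.length && lo[i] === hi[i]) i++;
  const a = i < lo.length ? lo.charCodeAt(i) : MIN_CODE - 1;
  const b = i < hi.length ? hi.charCodeAt(i) : MAX_CODE + 1;
  if (b - a >= 2) return lo.slice(0, i) + String.fromCharCode((a + b) >> 1);
  // Adjacent characters: bisect the keys that extend lo's prefix instead.
  const c = i + 1 < lo.length ? lo.charCodeAt(i + 1) : MIN_CODE - 1;
  return lo.slice(0, i + 1) + String.fromCharCode((c + MAX_CODE + 1) >> 1);
}

// List every key in the range (startAfter, stopAt].
async function listRange(listAfter, startAfter, stopAt, limit) {
  const page = (await listAfter(startAfter, limit)).filter((k) => k <= stopAt);
  if (page.length < limit) return page; // less than a full page: range is done
  const last = page[page.length - 1];
  if (last >= stopAt) return page;
  const mid = middleKey(last, stopAt);
  if (!(last < mid && mid < stopAt)) {
    // No usable midpoint: continue serially from the last key seen.
    return page.concat(await listRange(listAfter, last, stopAt, limit));
  }
  // Scan both halves of the remaining range concurrently.
  const [left, right] = await Promise.all([
    listRange(listAfter, last, mid, limit),
    listRange(listAfter, mid, stopAt, limit),
  ]);
  return page.concat(left, right);
}

// List the whole bucket, from before the first key to past the last one.
function listAllParallel(listAfter, limit = 1000) {
  return listRange(listAfter, "", END, limit);
}
```

A real version would replace makeFakeBucket with calls to the S3 list API and cap the number of requests in flight. With keys that share long prefixes, this naive midpoint degrades toward serial listing, which is presumably where better heuristics earn their keep.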

S3P is really just a fancy, really fast, S3 listing tool. Summarizing, copying and syncing are all boosted by S3P's core ability to list objects radically faster.

We've sustained copy speeds up to 8 gigabytes/second between two buckets in the same region using a single EC2 instance to run S3P.

S3P Blog Post

Read more about S3P on Medium.

Requirements

  1. NodeJS

  2. AWS-CLI

    The aws-cli is required for copying large files. By default, files of 100 megabytes or more are treated as large, for performance reasons. You can raise that threshold to as much as 5 gigabytes with the large-copy-threshold option, but files larger than 5 gigabytes can only be copied with the help of the aws-cli. (Why? The aws-sdk does not support copying larger files without a much more complicated solution. TODO!)

  3. Key names must use a limited character set:

    <space>
    !"#$%&'()*+,-./
    0123456789:;<=>?@
    ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`
    abcdefghijklmnopqrstuvwxyz{|}~
    

    Why? Since AWS S3 doesn't support listing keys in descending order, S3P uses a character-range-based divide-and-conquer algorithm.
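
Requirements 2 and 3 can be sketched as two small checks. This is an illustration under stated assumptions, not code from S3P: copyMethodFor and isSupportedKey are hypothetical helpers, megabytes and gigabytes are assumed to be binary units, and a file of exactly 5 gigabytes is assumed to go to the aws-cli.

```javascript
const MEGABYTE = 1024 * 1024; // assumption: binary units
const GIGABYTE = 1024 * MEGABYTE;
const DEFAULT_LARGE_COPY_THRESHOLD = 100 * MEGABYTE; // default "large" cutoff
const MAX_LARGE_COPY_THRESHOLD = 5 * GIGABYTE; // the threshold cannot go higher

// Hypothetical helper: which tool would copy a file of this size?
function copyMethodFor(sizeInBytes, largeCopyThreshold = DEFAULT_LARGE_COPY_THRESHOLD) {
  // Requests for a threshold above 5 gigabytes are clamped to 5 gigabytes.
  const threshold = Math.min(largeCopyThreshold, MAX_LARGE_COPY_THRESHOLD);
  return sizeInBytes >= threshold ? "aws-cli" : "aws-sdk";
}

// Hypothetical helper: true if every character of the key lies between
// <space> (0x20) and "~" (0x7e), the character set listed above.
function isSupportedKey(key) {
  return /^[\x20-\x7e]+$/.test(key);
}

console.log(copyMethodFor(50 * MEGABYTE)); // below the default threshold
console.log(copyMethodFor(6 * GIGABYTE, 10 * GIGABYTE)); // always aws-cli
console.log(isSupportedKey("logs/2020-01-01.txt"));
console.log(isSupportedKey("caf\u00e9.txt")); // "é" is outside the range
```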

AWS Credentials

s3p uses the same credentials as aws-cli, so see the aws-cli documentation: https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html

CLI

There is no need to install s3p directly. As long as you have NodeJS installed, you can run s3p directly using npx.

npx s3p help

Install NPM Package

npm install s3p

Features

In addition to performance, S3P provides flexible options for custom listing, copying and comparing:

  • Only list files with a matching prefix, starting-after a given key, and/or stopping-at a given key. These options are very fast; the rest of the bucket not matching these criteria is ignored completely.
  • Filter source files with arbitrary JavaScript. Further filter every listed file based on Key, Size, or Date. This is slower, since every file must be filtered in JavaScript, but nonetheless quite useful.
  • When copying, syncing or comparing, re-key files by replacing prefixes, adding prefixes, or with an arbitrary JavaScript function.
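
The three kinds of options above can be pictured as plain JavaScript functions. This sketch does not use S3P's API (which is not documented here). The helper names are hypothetical, and the item shape { Key, Size, LastModified } is an assumption borrowed from S3's list responses.

```javascript
// Range selection: cheap, because keys outside the range are never listed.
// Assumed semantics: starting-after is exclusive, stopping-at is inclusive.
function inKeyRange(key, { prefix = "", startAfter = "", stopAt = "\x7f" } = {}) {
  return key.startsWith(prefix) && key > startAfter && key <= stopAt;
}

// Arbitrary filter: has to run on every listed item, so it is slower.
function isLargeAndRecent({ Size, LastModified }) {
  return Size >= 1024 * 1024 && LastModified >= new Date("2020-01-01");
}

// Re-keying by replacing one prefix with another.
function replacePrefix(key, fromPrefix, toPrefix) {
  return key.startsWith(fromPrefix) ? toPrefix + key.slice(fromPrefix.length) : key;
}

// Made-up sample items for illustration only.
const items = [
  { Key: "logs/2019/a.txt", Size: 5000000, LastModified: new Date("2019-06-01") },
  { Key: "logs/2021/b.txt", Size: 5000000, LastModified: new Date("2021-06-01") },
  { Key: "tmp/c.txt", Size: 10, LastModified: new Date("2021-06-01") },
];

const selected = items
  .filter((item) => inKeyRange(item.Key, { prefix: "logs/" }))
  .filter(isLargeAndRecent)
  .map((item) => replacePrefix(item.Key, "logs/", "archive/logs/"));

console.log(selected);
```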

Performance

Surprisingly, you don't even need to run S3P in the cloud to see much of its benefits. You can run it on your local machine and, since S3 copying never goes directly through S3P, it doesn't use up any AWS bandwidth.

S3-bucket-listing performance can hit almost 20,000 items per second.

S3-bucket-copying performance can exceed 8 gigabytes per second.

Yes, I've seen 9 gigabytes per second sustained! This was on a bucket with an average file size slightly larger than 100 megabytes. S3P was running on a single c5.2xlarge instance. By comparison, I've never seen aws s3 cp get more than 150 MB/s. That's over 53x faster.

The average file-size has a big impact on s3p's overall bytes-per-second:

location   command   aws-cli         s3p              speedup   average size
local      ls        2000 items/s    20000 items/s    10x       n/a
local      cp        30 MB/s         150 MB/s         5x        512 kB
ec2        cp        150 MB/s        8 GB/s           54x       100 MB
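
To put the ec2 row in perspective, here is a quick back-of-the-envelope calculation. It assumes decimal units (1 MB = 10^6 bytes), a hypothetical 100-terabyte bucket, and that the sustained rates from the table hold for the whole copy.

```javascript
const MB = 1e6; // assumption: decimal units
const GB = 1e9;
const TB = 1e12;

// Hours needed to move totalBytes at a constant bytesPerSecond.
function hoursToCopy(totalBytes, bytesPerSecond) {
  return totalBytes / bytesPerSecond / 3600;
}

const bucketSize = 100 * TB; // hypothetical bucket size
const awsCliHours = hoursToCopy(bucketSize, 150 * MB);
const s3pHours = hoursToCopy(bucketSize, 8 * GB);

console.log(awsCliHours.toFixed(1)); // 185.2 hours, almost 8 days
console.log(s3pHours.toFixed(1)); // 3.5 hours
console.log((awsCliHours / s3pHours).toFixed(1)); // 53.3x
```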

S3P was developed to operate on buckets with millions of items and hundreds of terabytes. Currently, S3P is still only a single-core NodeJS application. There are opportunities for even more massively parallel S3 operations by forking workers or even distributing the work across instances with something like Elastic-Queue. If someone needs solutions that are 100-1000x faster than aws-cli, let us know. We'd love to work with you.
- shane@genui.com

TODO

  • local file system support
    • S3P was built to accelerate copying between two S3 buckets, but there's no reason it can't also accelerate copying to and from a local file system on an EC2 instance, an on-premises machine or your own dev machine.
    • currently supported:
      • copy to local file system
    • not supported yet:
      • copy from local file system
      • sync/compare to or from local file system
  • eliminate the dependency on aws-cli
    • aws-cli is currently used to copy "large" files. Files larger than 5 gigabytes can't be copied with the standard copyObject API call, so aws-cli is used as a subprocess.
  • document the API

Developed

S3P was originally developed by GenUI.com in conjunction with Resolution Bioscience, Inc.

License: ISC License