gogama / aws-s3-stream

Read S3 objects in parallel into a stream

s3stream - Merge lines from S3 objects to stdout


Read large numbers of text files from Amazon S3 line-by-line and dump their merged contents to standard output, so you can process them with a pipeline of Unix/Linux filters (sed, grep, gzip, jq, the AWS CLI, and so on).
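
Conceptually the tool is a fan-in: fetch several objects at once and interleave whole lines onto a single stdout stream. Below is a minimal sketch of that pattern using the AWS SDK for Go v2; it is illustrative only and need not match s3stream's actual implementation. The maxParallel constant, the hard-coded bucket, and the error handling are all assumptions.

// Sketch: bounded-parallel S3 reads merged line-by-line to stdout.
package main

import (
	"bufio"
	"context"
	"fmt"
	"os"
	"sync"

	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

const maxParallel = 32 // illustrative; matches the documented upper bound

func main() {
	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx) // honors AWS_REGION
	if err != nil {
		panic(err)
	}
	client := s3.NewFromConfig(cfg)

	bucket := "your-bucket" // hypothetical
	keys := os.Args[1:]     // object keys to merge

	out := bufio.NewWriter(os.Stdout)
	defer out.Flush()

	var mu sync.Mutex // keeps each output line intact
	var wg sync.WaitGroup
	sem := make(chan struct{}, maxParallel) // bounds concurrent downloads

	for _, key := range keys {
		wg.Add(1)
		sem <- struct{}{}
		go func(key string) {
			defer wg.Done()
			defer func() { <-sem }()
			obj, err := client.GetObject(ctx, &s3.GetObjectInput{
				Bucket: &bucket,
				Key:    &key,
			})
			if err != nil {
				fmt.Fprintln(os.Stderr, err)
				return
			}
			defer obj.Body.Close()
			sc := bufio.NewScanner(obj.Body) // note: 64 KiB default line limit
			for sc.Scan() {
				mu.Lock()
				out.Write(sc.Bytes())
				out.WriteByte('\n')
				mu.Unlock()
			}
		}(key)
	}
	wg.Wait()
}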


Features

  • Read S3 objects line-by-line and write their lines to standard output.
  • Download and merge up to 32 S3 objects in parallel.
  • Automatically unzip GZIP-ed text files.
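
The GZIP feature above can be implemented by peeking at the first two bytes of the stream for the GZIP magic number (0x1f 0x8b) and only then wrapping the reader in compress/gzip. A hedged sketch of that approach; maybeGunzip is a hypothetical helper, not s3stream's API:

package main

import (
	"bufio"
	"compress/gzip"
	"io"
	"os"
)

// maybeGunzip transparently decompresses GZIP input, passing
// everything else through unchanged.
func maybeGunzip(r io.Reader) (io.Reader, error) {
	br := bufio.NewReader(r)
	magic, err := br.Peek(2)
	if err != nil || magic[0] != 0x1f || magic[1] != 0x8b {
		return br, nil // too short or not GZIP: leave as-is
	}
	return gzip.NewReader(br) // yields the decompressed byte stream
}

func main() {
	// Demo: echo stdin, decompressing first when it is GZIP data.
	r, err := maybeGunzip(os.Stdin)
	if err != nil {
		panic(err)
	}
	io.Copy(os.Stdout, r)
}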

Install

$ go get github.com/gogama/aws-s3-stream

Build

$ go build -o s3stream

Run

Read a single S3 object and dump its lines to standard output.

$ AWS_REGION=<region> ./s3stream s3://your-bucket/path/to/object 

Read multiple objects and dump them to stdout.

$ AWS_REGION=<region> ./s3stream -p s3://your-bucket/some/prefix obj1 obj2 obj3 obj4 obj5

Read all objects under a prefix with maximum concurrency, merge their lines to stdout, then GZIP the result and upload it back to S3 (using the AWS CLI).

$ aws --region <region> s3 ls --recursive input-bucket/prefix |
    awk '{print $4}' |
    AWS_REGION=<region> ./s3stream -p s3://input-bucket/prefix -c 32 |
    gzip --best |
    aws s3 cp - s3://output-bucket/merged.gz

License

MIT

Backlog

  • Profile performance, identify bottlenecks, and assess whether increasing or reducing parallelism would help in specific stages.
  • Support ways to provide the AWS region other than the AWS_REGION environment variable.
  • Add an option, on by default, to detect and ignore objects that aren't text.
  • Add an option, off by default, to unpack and look inside archives.
  • Support the S3 ARN format as well as S3 URLs (a possible approach is sketched after this list).
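
For the last item, both reference styles could be normalized to a bucket/key pair up front; S3 object ARNs take the form arn:aws:s3:::bucket/key. A minimal sketch, where parseS3Ref is a hypothetical helper and not part of s3stream today:

package main

import (
	"fmt"
	"strings"
)

// parseS3Ref extracts bucket and key from an s3:// URL or an S3 ARN.
func parseS3Ref(ref string) (bucket, key string, err error) {
	switch {
	case strings.HasPrefix(ref, "s3://"):
		bucket, key, _ = strings.Cut(strings.TrimPrefix(ref, "s3://"), "/")
	case strings.HasPrefix(ref, "arn:aws:s3:::"):
		bucket, key, _ = strings.Cut(strings.TrimPrefix(ref, "arn:aws:s3:::"), "/")
	default:
		err = fmt.Errorf("not an S3 URL or ARN: %s", ref)
	}
	return
}

func main() {
	for _, ref := range []string{
		"s3://your-bucket/path/to/object",
		"arn:aws:s3:::your-bucket/path/to/object",
	} {
		fmt.Println(parseS3Ref(ref))
	}
}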
