gogama / aws-s3-stream

Read S3 objects in parallel into a stream

s3stream - Merge lines from S3 objects to stdout


Read large numbers of text files from Amazon S3 line-by-line and dump their merged contents to standard output, so you can process them with a pipeline of Unix/Linux filters (sed, grep, gzip, jq, the AWS CLI, and so on).
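
Conceptually the tool is a fan-in: fetch several objects at once and interleave whole lines onto a single stdout stream. Below is a minimal sketch of that pattern using the AWS SDK for Go v2; it is illustrative only and need not match s3stream's actual implementation. The maxParallel constant, the hard-coded bucket, and the error handling are all assumptions.

// Sketch: bounded-parallel S3 reads merged line-by-line to stdout.
package main

import (
	"bufio"
	"context"
	"fmt"
	"os"
	"sync"

	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

const maxParallel = 32 // illustrative; matches the documented upper bound

func main() {
	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx) // honors AWS_REGION
	if err != nil {
		panic(err)
	}
	client := s3.NewFromConfig(cfg)

	bucket := "your-bucket" // hypothetical
	keys := os.Args[1:]     // object keys to merge

	out := bufio.NewWriter(os.Stdout)
	defer out.Flush()

	var mu sync.Mutex // keeps each output line intact
	var wg sync.WaitGroup
	sem := make(chan struct{}, maxParallel) // bounds concurrent downloads

	for _, key := range keys {
		wg.Add(1)
		sem <- struct{}{}
		go func(key string) {
			defer wg.Done()
			defer func() { <-sem }()
			obj, err := client.GetObject(ctx, &s3.GetObjectInput{
				Bucket: &bucket,
				Key:    &key,
			})
			if err != nil {
				fmt.Fprintln(os.Stderr, err)
				return
			}
			defer obj.Body.Close()
			sc := bufio.NewScanner(obj.Body) // note: 64 KiB default line limit
			for sc.Scan() {
				mu.Lock()
				out.Write(sc.Bytes())
				out.WriteByte('\n')
				mu.Unlock()
			}
		}(key)
	}
	wg.Wait()
}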


Features

  • Read S3 objects line-by-line and write their lines to standard output.
  • Download and merge up to 32 S3 objects in parallel.
  • Automatically unzip GZIP-ed text files.
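
The GZIP feature above can be implemented by peeking at the first two bytes of the stream for the GZIP magic number (0x1f 0x8b) and only then wrapping the reader in compress/gzip. A hedged sketch of that approach; maybeGunzip is a hypothetical helper, not s3stream's API:

package main

import (
	"bufio"
	"compress/gzip"
	"io"
	"os"
)

// maybeGunzip transparently decompresses GZIP input, passing
// everything else through unchanged.
func maybeGunzip(r io.Reader) (io.Reader, error) {
	br := bufio.NewReader(r)
	magic, err := br.Peek(2)
	if err != nil || magic[0] != 0x1f || magic[1] != 0x8b {
		return br, nil // too short or not GZIP: leave as-is
	}
	return gzip.NewReader(br) // yields the decompressed byte stream
}

func main() {
	// Demo: echo stdin, decompressing first when it is GZIP data.
	r, err := maybeGunzip(os.Stdin)
	if err != nil {
		panic(err)
	}
	io.Copy(os.Stdout, r)
}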

Install

$ go get github.com/gogama/aws-s3-stream

Build

$ go build -o s3stream

Run

Read a single S3 object and dump its lines to standard output.

$ AWS_REGION=<region> ./s3stream s3://your-bucket/path/to/object 

Read multiple objects and dump them to stdout.

$ AWS_REGION=<region> ./s3stream -p s3://your-bucket/some/prefix obj1 obj2 obj3 obj4 obj5

Read all objects under a prefix with maximum concurrency, merge their lines to stdout, then GZIP the result and upload it back to S3 (using the AWS CLI).

$ aws --region <region> s3 ls --recursive input-bucket/prefix |
    awk '{print $4}' |
    AWS_REGION=<region> ./s3stream -p s3://input-bucket/prefix -c 32 |
    gzip --best |
    aws s3 cp - s3://output-bucket/merged.gz

License

MIT

Backlog

  • Profile performance, identify bottlenecks, and assess whether increasing or reducing parallelism would help in specific stages.
  • Support ways to provide the AWS region other than the AWS_REGION environment variable.
  • Add an option, on by default, to detect and ignore objects that aren't text.
  • Add an option, off by default, to unpack and look inside archives.
  • Support the S3 ARN format as well as S3 URLs (a possible approach is sketched after this list).
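
For the last item, both reference styles could be normalized to a bucket/key pair up front; S3 object ARNs take the form arn:aws:s3:::bucket/key. A minimal sketch, where parseS3Ref is a hypothetical helper and not part of s3stream today:

package main

import (
	"fmt"
	"strings"
)

// parseS3Ref extracts bucket and key from an s3:// URL or an S3 ARN.
func parseS3Ref(ref string) (bucket, key string, err error) {
	switch {
	case strings.HasPrefix(ref, "s3://"):
		bucket, key, _ = strings.Cut(strings.TrimPrefix(ref, "s3://"), "/")
	case strings.HasPrefix(ref, "arn:aws:s3:::"):
		bucket, key, _ = strings.Cut(strings.TrimPrefix(ref, "arn:aws:s3:::"), "/")
	default:
		err = fmt.Errorf("not an S3 URL or ARN: %s", ref)
	}
	return
}

func main() {
	for _, ref := range []string{
		"s3://your-bucket/path/to/object",
		"arn:aws:s3:::your-bucket/path/to/object",
	} {
		fmt.Println(parseS3Ref(ref))
	}
}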
