stream-to-s3

streamtos3 is a utility to stream data (mainly from stdin) into an S3 bucket.

Motivation

Surprisingly, the official aws-cli does not support piping data into S3 storage. To adhere to the Unix philosophy and make use of the pipelining paradigm, two tools, namely s3-echoer and 2s3, have been built. While the former is written in Go, the latter was written in Python. Unfortunately, it is 8 years old, unmaintained, still uses Python 2, which has reached its end of life, and relies on an old boto version. Therefore, I decided to recreate a similar utility, named streamtos3, using Python 3 and the AWS SDK for Python, boto3.

You may wonder why anyone would want to use this and consider the otherwise great aws-cli insufficient in that regard. This utility was born from the need to handle situations where you do not want to write to disk, for example during forensic investigations of a live system, or simply for space reasons on tiny cloud instances.

Integrity protection

To protect the integrity of the uploaded data, a three-fold approach is implemented:

  1. An MD5 hash is calculated for each chunk of data and supplied as ContentMD5 to the upload_part() call, so that the AWS endpoint checks the integrity of the received part.
  2. The ETag checksum, which is returned in the response to an upload_part() call, is compared to the originally calculated MD5 hash. If the checksums do not match, the part is uploaded again, up to retry times with an interval of secs seconds between attempts, before the upload fails.
  3. After completing the upload, the overall ETag of the object in the S3 bucket is retrieved and compared to the "local" ETag, which is constructed by hashing the concatenation of the MD5 sums of the individual parts, exactly as the S3 endpoint does remotely.

By doing so, we can be fairly sure that the data that was stored is exactly the data that was read (see the sketch below).
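The following sketch illustrates the scheme with boto3's multipart-upload API. It is a minimal, simplified outline rather than the actual implementation; the bucket name, object key, chunk size and the direct read from sys.stdin.buffer are assumptions made for this example, and the retry logic is omitted.

import base64
import hashlib
import sys

import boto3

# Illustrative values; the real script derives these from its command-line flags.
BUCKET, KEY, CHUNK_SIZE = "my-bucket", "my-object", 8 * 1024 * 1024

s3 = boto3.client("s3")  # credentials are taken from the keyfile in the real tool
upload_id = s3.create_multipart_upload(Bucket=BUCKET, Key=KEY)["UploadId"]

parts, digests = [], []
part_number = 1
while True:
    chunk = sys.stdin.buffer.read(CHUNK_SIZE)
    if not chunk:
        break
    digest = hashlib.md5(chunk).digest()
    # Step 1: ContentMD5 makes the S3 endpoint verify the part it received.
    response = s3.upload_part(
        Bucket=BUCKET,
        Key=KEY,
        PartNumber=part_number,
        UploadId=upload_id,
        Body=chunk,
        ContentMD5=base64.b64encode(digest).decode("ascii"),
    )
    # Step 2: the returned ETag is the hex MD5 of the part; compare it locally
    # (the real tool retries the part here instead of raising).
    if response["ETag"].strip('"') != digest.hex():
        raise RuntimeError(f"Checksum mismatch for part {part_number}")
    parts.append({"ETag": response["ETag"], "PartNumber": part_number})
    digests.append(digest)
    part_number += 1

s3.complete_multipart_upload(
    Bucket=BUCKET, Key=KEY, UploadId=upload_id,
    MultipartUpload={"Parts": parts},
)

# Step 3: the ETag of a (non-KMS-encrypted) multipart object is the MD5 over the
# concatenated binary part digests, suffixed with the number of parts.
local_etag = "{}-{}".format(hashlib.md5(b"".join(digests)).hexdigest(), len(digests))
remote_etag = s3.head_object(Bucket=BUCKET, Key=KEY)["ETag"].strip('"')
assert local_etag == remote_etag, "overall ETag mismatch"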

Usage

Basic Usage

usage: streamtos3.py [-h] -k KEYFILE -b BUCKET -o OBJ [-c CHUNKSIZE] [-s SECS_WAIT] [-r RETRY] [-d] [infile]

Store data from stdin into S3 bucket DISCLAIMER: This is an _unvalidated_ proof of concept, use at _your own risk_!

positional arguments:
  infile                Filepath or '-' for stdin

optional arguments:
  -h, --help            show this help message and exit
  -k KEYFILE, --keyfile KEYFILE
                        File that contains: <AWS_KEY>:<AWS_SECRET> for S3 access.
  -b BUCKET, --bucket BUCKET
                        Name of the target bucket.
  -o OBJ, --obj OBJ     Name of the object to write
  -c CHUNKSIZE, --chunksize CHUNKSIZE
                        Split upload in CHUNK_SIZE bytes. (Default: 8 MiB)
  -s SECS_WAIT, --secs-wait SECS_WAIT
                        Time in seconds to wait until retry upload a failed part again (Default: 5)
  -r RETRY, --retry RETRY
                        Retries until giving up (Default: 5)
  -d, --debug           Print debug information

Examples
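
To image a disk of a live system and stream it straight into a bucket without writing to local storage (bucket, object and keyfile names below are purely illustrative):

sudo dd if=/dev/sda bs=1M | python3 streamtos3.py -k keyfile.txt -b my-bucket -o sda.img -

A regular file can be uploaded in the same way by passing its path instead of '-' as the positional argument:

python3 streamtos3.py -k keyfile.txt -b my-bucket -o report.pdf ./report.pdf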

Dependencies

stream-to-s3 depends on the AWS SDK for Python, boto3, which itself has the following dependencies:

  • botocore
  • jmespath
  • python-dateutil
  • s3transfer
  • six
  • urllib3

Please refer to the requirements.txt file for details on the versions.

Installation

Currently, a setup.py is not provided. To use the tool, set up a venv and install the requirements:

python3 -m venv venv
source venv/bin/activate
pip3 install -r requirements.txt
python3 streamtos3.py -h

License

MIT License

