Tool to process S3 dumps of JSON records made with AWS Kinesis Firehose

If you are using AWS Kinesis Firehose https://aws.amazon.com/kinesis/firehose/ to stream JSON records into some S3 bucket then this bucket contains some files in directories named based on the date of creation:

2016/04/01/00/aaa-2-2016-04-01-00-11-51-9c03f701-5a59-4b3e-9c49-b34c1c84b68c.gz
2016/04/01/04/bbb-2-2016-04-01-04-57-10-d70d8ee6-7369-446e-8047-b3341f532be8.gz
2016/04/01/05/ccc-2-2016-04-01-05-12-11-b2902d59-5193-4e6a-ac3d-ff971dca5910.gz
...

where each file contains a sequence of JSON records:

{-1st json record-}{-2nd json record-}...{-Nth json record-}

You can use this tool to export some fields from all JSON records that satisfies certain condition as CSV file. For example if each JSON record has following format:

{
  "username": "David",
  "event": {
    "time": "2016/12/10 23:30",
    "name": "some event",
    ...
  },
  ...
}

then you can export CSV file containing username and time for each record with event name equal to "login". Call:

node index.js -f 2016-04-01 -t 2016-04-01 -F username,event.time -C event.name=login some-bucket

to get:

username;event.time
David;2016/12/10 23:30
John;2016/12/10 23:45
...

Installation

Install npm dependencies:

npm install

Usage

  Usage: index [options] <bucket>

  Options:

    -h, --help                 output usage information
    -f, --date-from [string]   From date [2016-04-01]
    -t, --date-to [string]     To date [2016-04-02]
    -F, --fields [string]      Fields to be extracted [name,date,search.query]
    -C, --conditions [string]  Conditions [name=John,gl=uk]
    -p, --pool [integer]       Number of parallel files processed [20]
    -o, --out [string]         Output file name [output.csv]

Options --date-from, --date-to, --fields are required.

mtrunkat / node-firehose-s3-processor

Tool to process S3 dumps of JSON records made with AWS Kinesis Firehose

Installation

Usage

About

Languages