tap-s3-csv

A Meltano compatible Singer tap for reading delimited content from S3 by: Eric Simmerman (ets@bitacuity.com)

Forked from the smart_open adaptation from: https://github.com/retentionscience/tap-s3-csv

Original author of the Singer tap was Connor McArthur (connor@fishtownanalytics.com)

Singer tap that produces JSON-formatted data following the Singer spec.

Given a configuration that specifies a bucket, a file pattern to match, a file format (csv or excel), and a table name, this tap reads new files from S3, parses them, infers a schema, and outputs the data according to the Singer spec.

Installation

To run locally, clone this repo, then run:

python setup.py install

Now you can run:

tap-s3-csv --config configuration.json

to generate data.

How it works

This tap:

Expects to read your AWS credentials from the environment variables: "AWS_ACCESS_KEY_ID" and "AWS_SECRET_ACCESS_KEY"
Searches S3 for files matching the spec given.
Samples 1000 records out of the first five files found to infer datatypes.
Iterates through files from least recently modified to most recently modified, outputting data according to the generated schema & Singer spec.
After completing each file, it writes out state that can be used to enable incremental replication.

Example

Given a source file: s3://csv-bucket/csv-exports/today.csv

id,First Name, Last Name
1,Michael,Bluth
2,Lindsay,Bluth Fünke
3,Tobias,Fünke

And a config file:

{
    "start_date": "2017-05-01T00:00:00Z",
    "bucket": "csv-bucket",
    "strip_newlines": true,
    "tables": [
        {
            "name": "bluths",
            "pattern": "csv-exports/(.*)\\.csv$",
            "key_properties": ["id"],
            "format": "csv"
        }
    ]
}

An output record might look like:

{
  "id": 3,
  "first_name": "Tobias",
  "last_name": "Funke",
  "_s3_source_bucket": "csv-bucket",
  "_s3_source_file": "csv-exports/today.csv",
  "_s3_source_lineno": 4,
  "_s3_extra": null
}

Input File Gotchas

Input files MUST have a header row.
Input files MUST have cells fully populated. Missing cells will break the integration. Empty cells are handled by the tap.
If you have the choice, use CSV, not Excel. This tap is able to stream CSV files from S3 row-by-row, but it cannot stream Excel files. CSV files are more efficient and more reliable.
This tap can convert datetimes, but it does not infer date-time as an output datatype. If you want the tap to convert a field to datetime, you must specify date-time as the _conversion_type in schema_overrides. See "Configuration Format" below for more information.

Configuration Format

See below for an exhaustive list of configuration fields:

{
    // the start date to use on the first run. the tap outputs an updated state on each
    // run which you can use going forward for incremental replication
    "start_date": "2017-05-01T00:00:00Z",

    // the bucket to use. make sure the AWS credentials provided have read access.
    "bucket": "csv-bucket",

    // Attempt to strip newlines from each row as it is written. Some Singer targets do not properly handle newlines in message bodies 
    "strip_newlines": true,

    // table definitions. you can specify multiple tables to be pulled from a given
    // bucket.
    "tables": [
        // example csv table definition with schema overrides
        {
            // table name to output
            "name": "bluths_from_csv",

            // you can limit the paths searched in s3 if there are many files in your
            // bucket
            "search_prefix": "csv-exports",

            // pattern to match in the bucket
            "pattern": "csv-exports/(.*)\\.csv$",

            // primary key for this table. if append only, use:
            //   ["_s3_source_file", "_s3_source_lineno"]
            "key_properties": ["id"],

            // format, either "csv" or "excel"
            "format": "csv",

            // if the files don't have a header row, you can specify the field names
            "field_names": ["id", "first_name", "last_name"],

            // for any field in the table, you can hardcode the json schema datatype.
            // "_conversion_type" is the type that the tap will try to coerce the field
            // to -- one of "string", "integer", "number", or "date-time". this tap
            // also assumes that all fields are nullable to be more resilient to empty cells.
            "schema_overrides": {
                "id": {
                    "type": ["null", "integer"],
                    "_conversion_type": "integer"
                },

                // if you want the tap to enforce that a field is not nullable, you can do
                // it like so:
                "first_name": {
                    "type": "string",
                    "_conversion_type": "string"
                }
            }
        },

        // example excel definition
        {
            "name": "bluths_from_excel",
            "pattern": "excel-exports/(.*)\\.xlsx$",
            "key_properties": ["id"],
            "format": "excel",

            // the excel definition is identical to csv except that you must specify
            // the worksheet name to pull from in your xls(x) file.
            "worksheet_name": "Names"
        }
    ]
}

Output Format

Column names have whitespace removed and replaced with underscores.
They are also downcased.
A few extra fields are added for help with auditing:
- _s3_source_bucket: The bucket that this record came from
- _s3_source_file: The path to the file that this record came from
- _s3_source_lineno: The line number in the source file that this record was found on
- _s3_extra: If you specify field names in the config, and there are more records in a row than field names, the overflow will end up here.

ets / tap-s3-csv