tap-text
A Singer tap for extracting data from text files.
Written for the Stitch 2018 Q1 internal hackathon. This code should not be relied upon in production systems :)
Features
- extracts data from
- JSONL files
- CSV files
- logs and other unstructured messages
- schema inferred from source data with GenSON
- csv fields typed with Pandas
- unstructured data parsed with pygrok
Usage
See the example_data
directory for different configuration options.
pipenv install --dev
pipenv run tap-text -c example_data/json_config.json | $(pipenv --venv)/bin/singer-check-tap
TODO
- Optimization. Presently the code makes a complete first pass over the input data to build a schema, but the input data may be homogeneous enough that sampling every nth row could accurately describe the structure.
- Better Grok support. Perhaps give the ability to define a set of grok patterns and define them per-directory. Also have better handling of newlines in the source logs. E.g. a stacktrace may get logged over many lines but you'd want all those lines to be part of a single log entry.
- Refactor