RaekwonIII / DataEngineerTest

Repository contains code produced for a Data Engineer Test. It' about importing sample data and create an ETL process.

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Pre-requisites

This software uses python 2.7 and runs against RedShift data warehouse. A Linux environment is needed to execute bash scripts. Packages for python, python-dev, postgre-sql-dev, pip and virtualenv have to be installed in the system.

To perform a test against sample data

  • In a terminal, cd into the project directory
  • Execute ./bootstrap.sh to setup the virtual environment, activate it, check that necessary Python packages are installed, and set script files as executable.
  • Change configuration in config.ini, this file includes config for:
  • database connection details
  • table names
  • query templates
  • Launch python import.py to setup sample data in the database. The module logs information and errors to import.log
  • Launch python test.py to run metrics for each day in the sample data set. The metrics.py module logs information and errors to metrics.log.

Extras

  • metrics.py is designed to accept an optional date string as argument (python metrics.py -d '2014-09-02') which can be use in case of failure of batch process.
  • metrics.sh is a wrapper for it's namesake python counterpart and is designed to be used for scheduling the ETL process
  • It invokes bootstrap.sh to make sure virtual environment is active
  • It uses yesterday's date to run the python script, so should be scheduled after midnight.
  • The suggestion would be to execute this (crontab -l ; echo "0 3 * * * /path/to/metrics.sh") | sort - | uniq - | crontab - to launch the script during the night and collect data from previous day.

About

Repository contains code produced for a Data Engineer Test. It' about importing sample data and create an ETL process.


Languages

Language:Python 87.1%Language:Shell 12.9%