HvyD / vidIQ-Technical-Task

Round three of Data Engineer interview process

vidIQ Technical Assessment

The tasks at hand are:

1. Load the attached data file to an S3 bucket

2. Implement a partitioned Athena database, creating its schema using Python (a sketch of tasks 1 and 2 follows this list)

3. Use Airflow to add new partitions for daily events
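
For orientation, here is a minimal sketch of tasks 1 and 2 using boto3; the bucket, keys, columns, database, and file format are placeholder assumptions, and the actual logic lives in upload_to_aws.py and vidIQELT.py.

import boto3

# Task 1: load the data file into S3 (placeholder file, bucket, and key).
s3 = boto3.client("s3")
s3.upload_file("events.json", "my-bucket", "events/event_date=2021-01-01/events.json")

# Task 2: create a partitioned Athena table over the uploaded data
# (placeholder columns, database, and result location).
athena = boto3.client("athena")
athena.start_query_execution(
    QueryString="""
        CREATE EXTERNAL TABLE IF NOT EXISTS events (
            user_id string,
            event_type string,
            event_time timestamp
        )
        PARTITIONED BY (event_date string)
        ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
        LOCATION 's3://my-bucket/events/'
    """,
    QueryExecutionContext={"Database": "my_database"},
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)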

Setup

  1. Python 3 & Airflow are used

  2. Dependencies are set in the requirements file and at the imports

  3. Create the AWS config (see the configparser sketch after this list)

    • create a file named dl.cfg
    • add the following contents (fill in the fields):
    [AWS]
    AWS_ACCESS_KEY_ID=
    AWS_SECRET_ACCESS_KEY=
    
    
    [S3]
    BUCKET_NAME = 
    OUTPUT_LOCATION = 
    SOURCE_S3_KEY = 
    DEST_S3_KEY = 
    DATABASE = 
    
  4. Initialize Airflow & Run Webserver

  5. Run Scheduler (Open New Terminal Tab)
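
A minimal sketch of how dl.cfg can be read with configparser (the actual scripts may read it differently):

import configparser

# Read credentials and S3 settings from dl.cfg.
config = configparser.ConfigParser()
config.read("dl.cfg")

aws_access_key_id = config["AWS"]["AWS_ACCESS_KEY_ID"]
aws_secret_access_key = config["AWS"]["AWS_SECRET_ACCESS_KEY"]
bucket_name = config["S3"]["BUCKET_NAME"]
output_location = config["S3"]["OUTPUT_LOCATION"]
source_s3_key = config["S3"]["SOURCE_S3_KEY"]
dest_s3_key = config["S3"]["DEST_S3_KEY"]
database = config["S3"]["DATABASE"]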

Usage

  1. First run upload_to_aws.py, then vidIQELT.py
  2. Access the Airflow UI at localhost (port 8080 by default)
  3. Create the Airflow connections
  4. Run the DAGs in the Airflow UI (a sketch of the partition-adding DAG follows)
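
For reference, a minimal sketch of a DAG that registers the run's daily partition in Athena, assuming Airflow 2.x and boto3; the task body, table, bucket, and database names are placeholders, while the DAG id matches the one used in the Test section below.

from datetime import datetime

import boto3
from airflow import DAG
from airflow.operators.python import PythonOperator


def add_daily_partition(ds, **_):
    # ds is the run's execution date (YYYY-MM-DD); table, bucket, and
    # database names below are placeholders.
    athena = boto3.client("athena")
    athena.start_query_execution(
        QueryString=(
            f"ALTER TABLE events ADD IF NOT EXISTS "
            f"PARTITION (event_date = '{ds}') "
            f"LOCATION 's3://my-bucket/events/event_date={ds}/'"
        ),
        QueryExecutionContext={"Database": "my_database"},
        ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
    )


with DAG(
    dag_id="partitioned_athena_and_S3move",  # id taken from the Test section
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(
        task_id="add_daily_partition",
        python_callable=add_daily_partition,
    )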

Alternatively, you can export the connection and credentials via the CLI:

export AIRFLOW_CONN_AWS_DEFAULT="s3://$AWS_CLIENT_ID:$AWS_CLIENT_SECRET@my-bucket?region_name=$AWS_REGION"

export AWS_DEFAULT_REGION=$AWS_REGION

export AWS_ACCESS_KEY_ID=$AWS_CLIENT_ID

export AWS_SECRET_ACCESS_KEY=$AWS_CLIENT_SECRET
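
If you would rather script the connection than create it in the UI or export it, a sketch along these lines (assuming Airflow 2.x and the aws_default connection id implied by the export above) can insert it into the Airflow metadata database:

import json

from airflow import settings
from airflow.models import Connection

# Assumes aws_default does not already exist; the region is a placeholder.
conn = Connection(
    conn_id="aws_default",
    conn_type="aws",
    login="<AWS_ACCESS_KEY_ID>",
    password="<AWS_SECRET_ACCESS_KEY>",
    extra=json.dumps({"region_name": "us-east-1"}),
)

session = settings.Session()
session.add(conn)
session.commit()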

Test

airflow test partitioned_athena_and_S3move <EXECUTION_DATE>
