
SparkPlatform


Data Engineering Test

Why..

Running the PySpark Job in a Local Environment for Unit Testing

  1. Since there will be no heavy load in the dev/testing environment, I will use a pre-built image and run it locally.

  2. If the workflow is more complex, a Dockerfile can be used to create a custom image.

Running the PySpark Job in a Production Environment

  1. I have chosen an AKS (Azure Kubernetes Service) cluster.

What..

Prerequisites for Running the Job in a Local Environment

This setup can be used for unit testing.

  1. Docker application in a running state. I am using Docker Desktop on Windows
  2. Command prompt
  3. Python script (a minimal sketch is shown after this list)
  4. Optional: You can also use a Dockerfile for complex workflows
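
For reference, the repository's readJSON.py is not reproduced on this page; a minimal script of that kind might look like the sketch below (the input file name data.json is an assumption for illustration):

# Hypothetical sketch of readJSON.py: load a JSON file with the
# standard library and print each record.
import json

with open("data.json") as f:  # input file name is an assumption
    records = json.load(f)

for record in records:
    print(record)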

Prerequisites for Running the Job in a Production Environment (Distributed Mode)

  • Azure account
  • A valid access key/secret key
  • Azure CLI installed
  • AKS cluster created

How..

Local/Unit Testing

  1. Once the Docker app is up and running, open the command prompt.
  • Note: I am using Docker Desktop on Windows
  2. Run the command below without making use of a Dockerfile:
docker run -it --rm --name python-script -v "$PWD":/usr/Subash/HelloFresh -w /usr/Subash/HelloFresh python:3 python readJSON.py

I am using -v to mount the current working directory into the container, and -w to make it the working directory so the script can be found.

  • Note: I am pulling the image from a public repository, but this is not always recommended due to security concerns. Instead, we can push the images to Azure Container Registry or Amazon ECR and use the docker pull command to retrieve them (a sketch follows below).
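
As a rough sketch of the ACR route, once an image such as sparkdata (built in the next step) exists locally (the registry name myregistry is a placeholder, not a value from this repo):

az acr login --name myregistry
docker tag sparkdata myregistry.azurecr.io/sparkdata:latest
docker push myregistry.azurecr.io/sparkdata:latest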
  3. To create an image with a Dockerfile:
docker build -t sparkdata -f /<path to Dockerfile>/Dockerfile .

Dockerfile contents: You can also use the RUN command with pip to install libraries; currently I am not using any external libraries. The COPY command copies the files into container directories.

  • Note: Ensure that the Dockerfile has no extension (i.e. Docker does not recognize Dockerfile.txt).
FROM gcr.io/datamechanics/spark:platform-3.1-dm14

ENV PYSPARK_MAJOR_PYTHON_VERSION=3

WORKDIR /opt/application/

COPY readJSON_Spark.py .
  4. Execute the command below to run the image locally:
docker run sparkdata
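
For context, readJSON_Spark.py (copied into the image above) is not reproduced here; a minimal PySpark script of that kind might look like this sketch (the input path is an assumption):

# Hypothetical sketch of readJSON_Spark.py: read a JSON file into a
# DataFrame and display its schema and contents.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("readJSON").getOrCreate()

df = spark.read.json("data.json")  # input path is an assumption
df.printSchema()
df.show()

spark.stop()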

Deploying on a Cluster

  1. Create an AKS cluster

I referred to this link for hassle-free deployment: https://docs.microsoft.com/en-us/azure/aks/kubernetes-action#code-try-0. A minimal CLI sketch is shown below.
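
As a rough sketch using the Azure CLI (the resource group and cluster names are placeholders, not values from this repo):

az login
az group create --name sparkRG --location eastus
az aks create --resource-group sparkRG --name sparkAKS --node-count 2 --generate-ssh-keys
az aks get-credentials --resource-group sparkRG --name sparkAKS
kubectl get nodes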
