jpacerqueira-zz / spark-on-kubernetes

A deployment and setup of Apache Spark for multi-tenant usage in Kubernetes clusters. This deploys one executor per K8S pod and scales linearly.


  • A setup to deploy Kubernetes on AWS EKS

  • The standard GKE Spark setup was adapted here to EKS (each Terraform deploy/run costs less than $0.29 USD)

  • Adopt multi-cloud strategies with these technologies and others compatible with Kubernetes and its auto-scaling capabilities.

    1. Step 0 : Set up your AWS CLI and account
    
     i.a.     $ brew install awscli
     i.b.     $ aws configure
    

In AWS/EKS core - Execution with Terraform deployment - analysis

     2. Setup requires an available EC2 gateway and the provisioning of an EKS Kubernetes cluster

       i. Follow the example in folder eks_deployment with the script
       i.a. [eks_deployment] $ bash -x step1-setup-terraform.sh  VAR1:_YOUR_LOCAL.pem   VAR2:_AWS_ACCESS_KEY_ID   VAR3:_AWS_SECRET_ACCESS_KEY   VAR4:_AWS_REGION  VAR5:_EKS_TF_CONFIG  

EKS for spark-K8S - Terraform Hashicorp Default package

     3. Use terraform destroy in folder eks_deployment/eks-cluster to tear the cluster down

       i.a.  [eks-cluster] $  terraform destroy

EKS for spark-K8S - Terraform Hashicorp Default package

     4. Deployment of Kubernetes infrastructure with AWS CLI

       i. Confirm with ' aws eks ' the kubeconfig context once your cluster is deployed
       i.a. [eks-cluster] $  aws eks --region eu-west-1 update-kubeconfig --name spark-eks-QiTsE99z 
         a.            Output: Added new context arn:aws:eks:eu-west-1:512336214250:cluster/spark-eks-QiTsE99z to /Users/joci/.kube/config

EKS for spark-K8S - Terraform Hashicorp Default package

     5. AWS Console - EKS regional setup details for Spark-EKS-Version.x.y.z

AWS Console - Deployment of Kubernetes infrastructure with AWS CLI

     6. Use kubectl to obtain a token and log in to the Kubernetes console

Kubernetes - kubectl - proxy token - JOB1

  • K8S workloads, where all operations are executed in order:

./setup-k8s-spark-workload.sh

     1. spark/install-spark-kubernetes-operator
     2. spark/create-spark-service-account
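The create-spark-service-account step can be sketched as a Kubernetes manifest. This follows the RBAC pattern from the Spark running-on-kubernetes docs; the names spark, spark-role, and the default namespace are assumptions — the repo's script is authoritative.

```yaml
# Service account the Spark driver runs as (name assumed)
apiVersion: v1
kind: ServiceAccount
metadata:
  name: spark
  namespace: default
---
# Bind the account to the built-in edit role so the driver can create executor pods
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: spark-role
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: edit
subjects:
  - kind: ServiceAccount
    name: spark
    namespace: default
```

Apply with kubectl apply -f on the EKS cluster from step 4.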

Kubernetes - proxy - JOB2

./execute-k8s-spark-workload.sh

     3. Run the default spark-py-pi, or a PySpark job under folder /jobs

       i.   spark/run-spark-pi-2k8s-pods
       ii. e.g. ./execute-k8s-spark-workload.sh dataminer-categorized-pdf-to-csv-analytics
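With the GCP Spark operator installed in step 1, a job like spark-py-pi is expressed as a SparkApplication resource, roughly as below. The image tag, file path, and sizing values are assumptions sketched from the operator's examples, not copied from this repo's scripts.

```yaml
# Hypothetical SparkApplication for the Python Pi example (one executor per pod)
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: spark-py-pi
  namespace: default
spec:
  type: Python
  pythonVersion: "3"
  mode: cluster
  image: "gcr.io/spark-operator/spark-py:v2.4.5"   # assumed image tag
  mainApplicationFile: local:///opt/spark/examples/src/main/python/pi.py
  sparkVersion: "2.4.5"
  driver:
    cores: 1
    memory: "512m"
    serviceAccount: spark   # service account from the setup step
  executor:
    cores: 1
    instances: 2            # 2 executor pods, scales linearly
    memory: "512m"
```

Submit it with kubectl apply -f or with sparkctl from the operator repo.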

In Kubernetes Proxy - Execution Analysis and logs

Kubernetes - proxy - JOB3

In Docker/Kubernetes core - Execution analysis

DataMiner Notebook - package running in local Kubernetes
DataMiner Notebook - delta-lake package issue JOB1
DataMiner Notebook - delta-lake package issue JOB2

Spark Running on Kubernetes

i. Follow the latest Spark 2.4.5 docs at: https://spark.apache.org/docs/latest/running-on-kubernetes.html

Spark Operator from GCP
i. https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/docs/api-docs.md
ii. sparkctl (a dedicated kubectl) available from: https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/tree/master/sparkctl

EKS Operator for Terraform from the HashiCorp HCL website
i. https://learn.hashicorp.com/terraform/kubernetes/provision-eks-cluster
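The HashiCorp tutorial above provisions EKS with the public terraform-aws-modules EKS module; a minimal sketch is below. The cluster name, VPC wiring, and worker sizing are assumptions — the actual configuration lives in this repo's eks_deployment folder.

```hcl
provider "aws" {
  region = "eu-west-1"
}

# Minimal EKS cluster via the public registry module (illustrative values only)
module "eks" {
  source       = "terraform-aws-modules/eks/aws"
  cluster_name = "spark-eks-example"
  subnets      = module.vpc.private_subnets   # assumes a companion VPC module
  vpc_id       = module.vpc.vpc_id

  worker_groups = [
    {
      instance_type        = "m4.large"
      asg_desired_capacity = 2
    }
  ]
}
```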

Next to do : adapt the benchmark to EKS/AKS with a dedicated s3a S3 access layer as the filesystem for spark-executors/pods. At the moment it works in Spark with the K8S filesystem.
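The planned s3a access layer would be configured with the standard hadoop-aws fs.s3a.* properties, roughly as below; how these get wired into this repo's jobs is an assumption. The executor images would also need the hadoop-aws and matching aws-java-sdk jars on the classpath.

```
spark.hadoop.fs.s3a.impl          org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.access.key    <AWS_ACCESS_KEY_ID>
spark.hadoop.fs.s3a.secret.key    <AWS_SECRET_ACCESS_KEY>
spark.hadoop.fs.s3a.endpoint      s3.eu-west-1.amazonaws.com
spark.hadoop.fs.s3a.fast.upload   true
```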

Additional notes : check out branch execution_with_datapoints for data executions

Additional literature : Apache Spark in Kubernetes with fast S3 access layer (s3a) : https://towardsdatascience.com/apache-spark-with-kubernetes-and-fast-s3-access-27e64eb14e0f




Languages

Python 41.5%, Shell 30.5%, Dockerfile 15.4%, HCL 12.7%