jpacerqueira-zz / spark-on-kubernetes

A deployment and setup of Apache Spark for multi-tenant usage in Kubernetes clusters. This deploys one executor per K8S pod and scales linearly.


  • A setup to deploy Kubernetes on AWS EKS

  • The standard GKE Spark setup was adapted here to EKS (each Terraform deploy/run costs less than $0.29 USD)

  • Adopt multi-cloud strategies with these technologies and others compatible with Kubernetes and its auto-scaling capabilities.

    1. Step 0 : Set up your AWS CLI and account
    
     i.a.     $ brew install awscli
     i.b.     $ aws configure
    

In AWS/EKS core - Execution with Terraform deployment - analysis

     2. Setup requires an available EC2 gateway and the provisioning of an EKS Kubernetes cluster

       i. Follow the example in folder eks_deployment with the script
       i.a. [eks_deployment] $ bash -x step1-setup-terraform.sh  VAR1:_YOUR_LOCAL.pem   VAR2:_AWS_ACCESS_KEY_ID   VAR3:_AWS_SECRET_ACCESS_KEY   VAR4:_AWS_REGION  VAR5:_EKS_TF_CONFIG  

EKS for spark-K8S - Terraform Hashicorp Default package

     3. Use terraform destroy in folder eks_deployment/eks-cluster to tear the cluster down

       i.a.  [eks-cluster] $  terraform destroy

EKS for spark-K8S - Terraform Hashicorp Default package

     4. Deployment of Kubernetes infrastructure with AWS CLI

       i. Confirm with ' aws eks ' the kubeconfig context once your cluster is deployed
       i.a. [eks-cluster] $  aws eks --region eu-west-1 update-kubeconfig --name spark-eks-QiTsE99z 
         a.            Output: Added new context arn:aws:eks:eu-west-1:512336214250:cluster/spark-eks-QiTsE99z to /Users/joci/.kube/config

EKS for spark-K8S - Terraform Hashicorp Default package

     5. AWS Console - EKS regional setup details for Spark-EKS-Version.x.y.z

AWS Console - Deployment of Kubernetes infrastructure with AWS CLI

     6. Use kubectl to obtain a token and log in to the Kubernetes console

Kubernetes - kubectl - proxy token - JOB1

  • K8S workloads, where all operations are executed in order:

./setup-k8s-spark-workload.sh

     1. spark/install-spark-kubernetes-operator
     2. spark/create-spark-service-account
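The create-spark-service-account step can be sketched as a Kubernetes manifest. This follows the RBAC pattern from the Spark running-on-kubernetes docs; the names spark, spark-role, and the default namespace are assumptions — the repo's script is authoritative.

```yaml
# Service account the Spark driver runs as (name assumed)
apiVersion: v1
kind: ServiceAccount
metadata:
  name: spark
  namespace: default
---
# Bind the account to the built-in edit role so the driver can create executor pods
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: spark-role
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: edit
subjects:
  - kind: ServiceAccount
    name: spark
    namespace: default
```

Apply with kubectl apply -f on the EKS cluster from step 4.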

Kubernetes - proxy - JOB2

./execute-k8s-spark-workload.sh

     3. Run the default spark-py-pi, or a PySpark job under folder /jobs

       i.   spark/run-spark-pi-2k8s-pods
       ii. e.g. ./execute-k8s-spark-workload.sh dataminer-categorized-pdf-to-csv-analytics
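With the GCP Spark operator installed in step 1, a job like spark-py-pi is expressed as a SparkApplication resource, roughly as below. The image tag, file path, and sizing values are assumptions sketched from the operator's examples, not copied from this repo's scripts.

```yaml
# Hypothetical SparkApplication for the Python Pi example (one executor per pod)
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: spark-py-pi
  namespace: default
spec:
  type: Python
  pythonVersion: "3"
  mode: cluster
  image: "gcr.io/spark-operator/spark-py:v2.4.5"   # assumed image tag
  mainApplicationFile: local:///opt/spark/examples/src/main/python/pi.py
  sparkVersion: "2.4.5"
  driver:
    cores: 1
    memory: "512m"
    serviceAccount: spark   # service account from the setup step
  executor:
    cores: 1
    instances: 2            # 2 executor pods, scales linearly
    memory: "512m"
```

Submit it with kubectl apply -f or with sparkctl from the operator repo.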

In Kubernetes Proxy - Execution Analysis and logs

Kubernetes - proxy - JOB3

In Docker/Kubernetes core - Execution analysis

DataMiner Notebook - package running in local Kubernetes
DataMiner Notebook - delta-lake package issue JOB1
DataMiner Notebook - delta-lake package issue JOB2

Spark Running on Kubernetes

i. Follow the latest Spark 2.4.5 docs at: https://spark.apache.org/docs/latest/running-on-kubernetes.html

Spark Operator from GCP
i. https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/docs/api-docs.md
ii. sparkctl (a dedicated kubectl) available from: https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/tree/master/sparkctl

EKS Operator for Terraform from the HashiCorp HCL website
i. https://learn.hashicorp.com/terraform/kubernetes/provision-eks-cluster
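The HashiCorp tutorial above provisions EKS with the public terraform-aws-modules EKS module; a minimal sketch is below. The cluster name, VPC wiring, and worker sizing are assumptions — the actual configuration lives in this repo's eks_deployment folder.

```hcl
provider "aws" {
  region = "eu-west-1"
}

# Minimal EKS cluster via the public registry module (illustrative values only)
module "eks" {
  source       = "terraform-aws-modules/eks/aws"
  cluster_name = "spark-eks-example"
  subnets      = module.vpc.private_subnets   # assumes a companion VPC module
  vpc_id       = module.vpc.vpc_id

  worker_groups = [
    {
      instance_type        = "m4.large"
      asg_desired_capacity = 2
    }
  ]
}
```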

Next to do : adapt the benchmark to EKS/AKS with a dedicated s3a S3 access layer as the filesystem for spark-executors/pods. At the moment it works in Spark with the K8S filesystem.
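The planned s3a access layer would be configured with the standard hadoop-aws fs.s3a.* properties, roughly as below; how these get wired into this repo's jobs is an assumption. The executor images would also need the hadoop-aws and matching aws-java-sdk jars on the classpath.

```
spark.hadoop.fs.s3a.impl          org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.access.key    <AWS_ACCESS_KEY_ID>
spark.hadoop.fs.s3a.secret.key    <AWS_SECRET_ACCESS_KEY>
spark.hadoop.fs.s3a.endpoint      s3.eu-west-1.amazonaws.com
spark.hadoop.fs.s3a.fast.upload   true
```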

Additional notes : check out branch execution_with_datapoints for data executions

Additional literature : Apache Spark in Kubernetes with fast S3 access layer (s3a) : https://towardsdatascience.com/apache-spark-with-kubernetes-and-fast-s3-access-27e64eb14e0f




Languages

Python 41.5%, Shell 30.5%, Dockerfile 15.4%, HCL 12.7%