big-data-europe / docker-spark

Apache Spark docker image

Kubernetes example clarification on spark-submit

Fixmetal opened this issue

Hello,
I'm trying to migrate an application to Kubernetes using this solution. I'm a bit confused because I don't come from the Spark world, so I need some direction; I hope someone out there can help me out.
Following https://github.com/big-data-europe/docker-spark#kubernetes-deployment I can successfully create a pod based on the base image with my application. But this pod stays up forever, since spark-submit is effectively an endless script.
I think it's just me, but I don't see how this is correct, since this way we would have:

  1. A master pod (which is the cluster manager, right?)
  2. One or more worker pods, which should compute whatever the submitted applications instruct them to
  3. A pod per application, which stays up forever (until the application eventually ends)

What I was expecting from spark-submit was to submit the application to the workers and then exit, but maybe I'm just looking at this in the wrong light.
Can some Spark expert clarify the exact use case on k8s?
This is how I handled the spark-submit operation:

---
apiVersion: batch/v1
kind: Job
metadata:
  name: my-spark-submit-job
spec:
  template:
    metadata:
      labels:
        app: spark-client
    spec:
      containers:
      - name: my-spark-submit-job-container
        image: my-custom-image
        command: [ "bin/spark-submit" ]
        args: 
        - "--master"
        - "spark://spark-master:7077"
        - "--deploy-mode"
        - "client"
        - "--conf"
        - "spark.yarn.submit.waitAppCompletion=false"
        - "--conf"
        - "spark.driver.host=spark-client"
        - "--conf"
        - "spark.executor.memory=2g"
        - "--conf"
        - "spark.executor.cores=1"
        - "--conf"
        - "spark.locality.wait=0"
        - "--conf"
        - "spark.network.timeout=432000"
        - "--conf"
        - "spark.ui.showConsoleProgress=false"
        - "--conf"
        - "spark.driver.extraClassPath=<path-to-dependency-file.jar>"
        - "--conf"
        - "spark.driver.extraJavaOptions=-Dlog4j.configurationFile=<log4jdriver-properties-file> -Djava.security.egd=file:///dev/urandom"
        - "--class"
        - "<className>"
        - "--jars"
        - "<path-to-dependency-file.jar>"
        - "<path-to-mainClass-file.jar>"
        - "-c"
        - "<path-to-application-config-file>"
      restartPolicy: OnFailure
  backoffLimit: 3
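
A side note on name resolution: spark.driver.host=spark-client only works if the name spark-client actually resolves to this pod, e.g. via a headless Service along the lines of the sketch below. This is only a sketch, and the port entry is a placeholder, since with a headless Service the executors reach the pod IP directly.

---
apiVersion: v1
kind: Service
metadata:
  name: spark-client
spec:
  clusterIP: None        # headless: DNS resolves spark-client straight to the pod IP
  selector:
    app: spark-client    # matches the label on the spark-submit pod above
  ports:
  - name: driver
    port: 7078           # placeholder; the real driver port is whatever spark.driver.port pins (or a random one)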

I think I found an answer: in client mode the submitting pod itself becomes the driver and instructs the workers to run the application's tasks. A proper answer is here, I guess. From what I understood, the master is the Cluster Manager, the workers become the Executors, and my application pod acts as the Driver.
Hence I converted the whole thing into a Deployment instead of using a Job.
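Roughly, the converted manifest looks like the sketch below. It reuses the image, labels and spark-submit arguments from the Job above (most --conf flags are omitted for brevity, and the names are placeholders), and it still assumes that spark-client resolves to this pod:

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-spark-submit-deployment        # placeholder name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: spark-client
  template:
    metadata:
      labels:
        app: spark-client
    spec:
      containers:
      - name: my-spark-submit-container   # placeholder name
        image: my-custom-image            # placeholder image
        command: [ "bin/spark-submit" ]
        args:
        - "--master"
        - "spark://spark-master:7077"
        - "--deploy-mode"
        - "client"
        - "--conf"
        - "spark.driver.host=spark-client"
        - "--class"
        - "<className>"
        - "--jars"
        - "<path-to-dependency-file.jar>"
        - "<path-to-mainClass-file.jar>"
        - "-c"
        - "<path-to-application-config-file>"
      # no restartPolicy here: Deployments only allow Always, which also restarts the driver if it ever exits

The practical difference from the Job is exactly that restart behaviour: instead of OnFailure and backoffLimit, the driver pod is simply kept running (and recreated if it dies).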
Feel free to comment, but I feel this was the point I was missing, so I'm closing the issue.