Jacobbishopxy / spark-jottings

Spark Jotting

Cluster mode

Cluster mode means submitting jobs directly to Kubernetes with spark-submit --master k8s://${MASTER_ADDR} ...

  • Since we are using a k8s Spark cluster (see detail), spark-submit needs bcpkix-jdk15on and bcprov-jdk15on. In other words, both JARs must be present in $SPARK_HOME/jars (Note: run echo 'sc.getConf.get("spark.home")' | spark-shell to find $SPARK_HOME if needed).

  • In addition, hadoop-aws must be passed as an extra package when running spark-submit.

  • Check the username/password of a deployed standalone MinIO.

    Where the credentials are persisted depends on your Volume. Log in to your node IP, then:

    cat <minio path>/.root_user
    cat <minio path>/.root_password
  • k8s spark cluster job cleaner
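
The cluster-mode requirements above can be sketched as a small dry-run script: one function reports which of the two Bouncy Castle JARs are missing from a jars directory, and another prints (but does not run) a spark-submit command that pulls in hadoop-aws via --packages. This is a minimal sketch under assumptions, not the repo's own script: the helper names missing_jars and submit_cmd, the --deploy-mode cluster flag, the dev namespace, and every default value (image, class, JAR location, MinIO endpoint, hadoop-aws version) are placeholders to replace.

```shell
#!/bin/sh
# Sketch only: every default below is a placeholder assumption, not a value from this repo.
: "${SPARK_HOME:=/opt/spark}"
: "${MASTER_ADDR:=https://127.0.0.1:6443}"
: "${HADOOP_AWS_VERSION:=3.3.4}"                  # should match the Hadoop version bundled with your Spark build
: "${SPARK_IMAGE:=my-registry/spark:latest}"      # hypothetical image name
: "${APP_CLASS:=org.example.Main}"                # hypothetical main class
: "${APP_JAR:=s3a://my-bucket/jars/my-app.jar}"   # hypothetical JAR location on MinIO
: "${MINIO_ENDPOINT:=http://127.0.0.1:9000}"      # hypothetical MinIO endpoint

# Hypothetical helper: print the name of each required JAR family
# that has no matching file in the given jars directory.
missing_jars() {
  jars_dir=$1
  for prefix in bcpkix-jdk15on bcprov-jdk15on; do
    # ls fails when the glob matches nothing
    if ! ls "$jars_dir"/"$prefix"-*.jar >/dev/null 2>&1; then
      echo "$prefix"
    fi
  done
}

# Hypothetical helper: print the spark-submit command on one line instead of running it.
# MinIO credentials would normally be added with
# --conf spark.hadoop.fs.s3a.access.key=... and --conf spark.hadoop.fs.s3a.secret.key=...
submit_cmd() {
  echo spark-submit \
    --master "k8s://${MASTER_ADDR}" \
    --deploy-mode cluster \
    --class "${APP_CLASS}" \
    --packages "org.apache.hadoop:hadoop-aws:${HADOOP_AWS_VERSION}" \
    --conf "spark.kubernetes.namespace=dev" \
    --conf "spark.kubernetes.container.image=${SPARK_IMAGE}" \
    --conf "spark.hadoop.fs.s3a.endpoint=${MINIO_ENDPOINT}" \
    --conf "spark.hadoop.fs.s3a.path.style.access=true" \
    "${APP_JAR}"
}

missing_jars "${SPARK_HOME}/jars"
submit_cmd
```

If missing_jars prints nothing, both JAR families are present. Once the printed command looks right, it can be copied and run as is.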

Client mode

Client mode means using bitnami/charts in k8s.

  • NFS shared volume (only required in Spark client mode, where it is used for uploading local JARs).

    On the client server:

    sudo apt update
    sudo apt install nfs-common

    Check available mounting directories:

    showmount -e <HOST_IP>

    Make the share directory and grant permission:

    sudo mkdir -p <YOUR_MOUNT_DIRECTORY>
    sudo chown nobody:nogroup <YOUR_MOUNT_DIRECTORY>

    Mount host directory:

    sudo mount <HOST_IP>:<HOST_SHARE_ADDRESS> <YOUR_MOUNT_DIRECTORY>
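
The NFS client steps above can be collected into one script. The sketch below defaults to a dry run that only prints each command, so it is safe to try before touching the machine. The variable names, their default values, and the run and mount_share helpers are assumptions for illustration; the -y flag on apt install is added so the script does not stop for confirmation.

```shell
#!/bin/sh
# Sketch: run the NFS client steps in order.
# HOST_IP, HOST_SHARE and MOUNT_DIR defaults are placeholders, not values from this repo.
: "${HOST_IP:=192.0.2.10}"
: "${HOST_SHARE:=/srv/nfs/share}"
: "${MOUNT_DIR:=/mnt/spark-jars}"
: "${DRY_RUN:=1}"   # set DRY_RUN=0 to actually execute the commands

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

mount_share() {
  run sudo apt update
  run sudo apt install -y nfs-common
  run showmount -e "$HOST_IP"                 # list directories exported by the host
  run sudo mkdir -p "$MOUNT_DIR"
  run sudo chown nobody:nogroup "$MOUNT_DIR"
  run sudo mount "$HOST_IP:$HOST_SHARE" "$MOUNT_DIR"
}

mount_share
```

Running it with DRY_RUN=0 and the real host values performs the mount. The mount does not survive a reboot unless it is also added to /etc/fstab.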

Utilities

  • accessing logs:

    kubectl logs -f -n dev <DRIVER_POD_NAME>
  • accessing UI

    kubectl port-forward -n dev <DRIVER_POD_NAME> 4040:4040
  • debugging

    kubectl describe pod -n dev <SPARK_DRIVER_POD>
  • killing driver

    kubectl delete pod -n dev <SPARK_DRIVER_POD>
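
The commands above all need the driver pod name. Spark on Kubernetes labels driver pods with spark-role=driver, so the name can be looked up rather than copied by hand. The following is a minimal sketch: the helper names driver_pods and tail_driver_logs are hypothetical, and the dev namespace is taken from the commands above.

```shell
#!/bin/sh
# Sketch: find Spark driver pods by label and follow the logs of the first one.
: "${NAMESPACE:=dev}"

# Print driver pod names, one per line, in the form pod/<name>.
driver_pods() {
  kubectl get pods -n "$NAMESPACE" -l spark-role=driver -o name
}

# Follow the logs of the first driver pod found.
tail_driver_logs() {
  pod=$(driver_pods | head -n 1)
  if [ -z "$pod" ]; then
    echo "no driver pod found in namespace $NAMESPACE" >&2
    return 1
  fi
  kubectl logs -f -n "$NAMESPACE" "$pod"
}
```

After sourcing the file, call tail_driver_logs to stream logs, or pass the output of driver_pods to kubectl describe or kubectl port-forward.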

Examples

Notes

  • --jars is used for local or remote JAR files specified by URL and does not resolve dependencies; --packages is used for Maven coordinates and does resolve dependencies. Source
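
To make the distinction concrete, here is a small sketch that guesses which flag a dependency string belongs to: anything that looks like a path or URL to a JAR goes to --jars, and anything shaped like a Maven coordinate (groupId:artifactId:version) goes to --packages. The dep_flag helper and its matching rules are hypothetical and only meant as an illustration, not part of spark-submit.

```shell
#!/bin/sh
# Hypothetical helper: print the spark-submit flag that fits a dependency string.
dep_flag() {
  case "$1" in
    */*|*.jar) echo "--jars" ;;       # file path or URL: shipped as is, no dependency resolution
    *:*:*)     echo "--packages" ;;   # groupId:artifactId:version: resolved with its dependencies
    *)         echo "unknown"; return 1 ;;
  esac
}

dep_flag "org.apache.hadoop:hadoop-aws:3.3.4"   # prints --packages
dep_flag "/opt/jars/my-udfs.jar"                # prints --jars
```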

Materials
