Examples and custom spark images for working with the spark-on-k8s operator on AWS.
Allows using Spark 2 with IRSA and Spark 3 with IRSA and AWS Glue as a metastore.
Note: Spark 3 images also include relevant jars for working with the S3A commiters
If you're looking for the Spark 3 custom distributions, you can find them here
Note: Spark 2 images will not be updated, please see the FAQ
- Deploy spark-on-k8s operator using the helm chart and the patched operator image
bbenzikry/spark-eks-operator:latest
Suggested values for the helm chart can be found in the flux example.
Note: Do not create the spark service account automatically as part of chart use.
- Create an AWS role for driver
- Create an AWS role for executors
- Add default service account EKS role for executors in your spark job namespace ( optional )
# NOTE: Only required when not building spark from source or using a version of spark < 3.1. In 3.1, executor roles will rely on the driver definition. At the moment they execute with the default service account.
apiVersion: v1
kind: ServiceAccount
metadata:
name: default
namespace: SPARK_JOB_NAMESPACE
annotations:
# can also be the driver role
eks.amazonaws.com/role-arn: "arn:aws:iam::ACCOUNT_ID:role/executor-role"
- Make sure spark service account ( used by driver pods ) is configured to an EKS role as well
apiVersion: v1
kind: ServiceAccount
metadata:
name: spark
namespace: SPARK_JOB_NAMESPACE
annotations:
eks.amazonaws.com/role-arn: "arn:aws:iam::ACCOUNT_ID:role/driver-role"
-
For spark < 3.0.0, see spark2.Dockerfile
-
For spark 3.0.0+, see spark3.Dockerfile
-
For pyspark, see pyspark.Dockerfile
Below are examples for latest versions.
If you want to use pinned versions, all images are tagged by the commit SHA.
You can find a full list of tags here
# spark2
FROM bbenzikry/spark-eks:spark2-latest
# spark3
FROM bbenzikry/spark-eks:spark3-latest
# pyspark2
FROM bbenzikry/spark-eks:pyspark2-latest
# pyspark3
FROM bbenzikry/spark-eks:pyspark3-latest
hadoopConf:
# IRSA configuration
"fs.s3a.aws.credentials.provider": "com.amazonaws.auth.WebIdentityTokenCredentialsProvider"
driver:
.....
labels:
.....
serviceAccount: SERVICE_ACCOUNT_NAME
# See: https://github.com/kubernetes/kubernetes/issues/82573
# Note: securityContext has changed in recent versions of the operator to podSecurityContext
podSecurityContext:
fsGroup: 65534
- Make sure your driver and executor roles have the relevant glue permissions
{
/*
Example below depicts the IAM policy for accessing db1/table1.
Modify this as you deem worthy for spark application access.
*/
Effect: "Allow",
Action: ["glue:*Database*", "glue:*Table*", "glue:*Partition*"],
Resource: [
"arn:aws:glue:us-west-2:123456789012:catalog",
"arn:aws:glue:us-west-2:123456789012:database/db1",
"arn:aws:glue:us-west-2:123456789012:table/db1/table1",
"arn:aws:glue:eu-west-1:123456789012:database/default",
"arn:aws:glue:eu-west-1:123456789012:database/global_temp",
"arn:aws:glue:eu-west-1:123456789012:database/parquet",
],
}
- Make sure you are using the patched operator image
- Add a config map to your spark job namespace as defined here
apiVersion: v1
data:
hive-site.xml: |-
<configuration>
<property>
<name>hive.imetastoreclient.factory.class</name>
<value>com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory</value>
</property>
</configuration>
kind: ConfigMap
metadata:
namespace: SPARK_JOB_NAMESPACE
name: spark-custom-config-map
In order to submit an application with glue support, you need to add a reference to the configmap in your SparkApplication
spec.
kind: SparkApplication
metadata:
name: "my-spark-app"
namespace: SPARK_JOB_NAMESPACE
spec:
sparkConfigMap: spark-custom-config-map
-
Where can I find a Spark 2 build with Glue support?
As spark 2 becomes less and less relevant, I opted against the need to add glue support. You can take a look here for a reference build script which you can use to build a Spark 2 distribution to use with the Spark 2 dockerfile
-
Why a patched operator image?
The patched image is a simple implementation for properly working with custom configuration files with the spark operator. It may be added as a PR in the future or another implementation will take its place. For more information, see the related issue kubeflow/spark-operator#216