bde2020/spark-python-template:3.1.1-hadoop3.2 not working out of the box
devantler opened this issue
As the bde2020/spark-base:3.1.1-hadoop3.2 image sets the ENV SPARK_APPLICATION_JAR_LOCATION to a default value, submit.sh picks that value up in favor of the ENV SPARK_APPLICATION_PYTHON_LOCATION, so the bde2020/spark-python-template:3.1.1-hadoop3.2 image always tries to submit a JAR file instead of the Python file.
I have fixed this locally by overwriting the submit.sh file so that it ignores SPARK_APPLICATION_JAR_LOCATION when I use the python template. Below is my submit.sh file:
#!/bin/bash
export SPARK_MASTER_URL=spark://${SPARK_MASTER_NAME}:${SPARK_MASTER_PORT}
export SPARK_HOME=/spark
function is_uri() {
    regex='(https?|hdfs|file)://[-A-Za-z0-9\+&@#/%?=~_|!:,.;]*[-A-Za-z0-9\+&@#/%=~_|]'
    if [[ $1 =~ $regex ]]; then
        echo "true"
    else
        echo "false"
    fi
}
/wait-for-step.sh
/execute-step.sh
# Note: the spaces around == are required; `cmd`=="true" would be parsed
# as a single non-empty word and the test would always succeed.
if [[ "$(is_uri "${SPARK_APPLICATION_PYTHON_LOCATION}")" == "true" || -f "${SPARK_APPLICATION_PYTHON_LOCATION}" ]]; then
    echo "Submit application ${SPARK_APPLICATION_PYTHON_LOCATION} to Spark master ${SPARK_MASTER_URL}"
    echo "Passing arguments ${SPARK_APPLICATION_ARGS}"
    PYSPARK_PYTHON=python3 /spark/bin/spark-submit \
        --master ${SPARK_MASTER_URL} \
        ${SPARK_SUBMIT_ARGS} \
        ${SPARK_APPLICATION_PYTHON_LOCATION} ${SPARK_APPLICATION_ARGS}
else
    echo "Not recognized application."
fi
/finish-step.sh
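One pitfall worth flagging for anyone adapting this comparison: without spaces around `==`, bash parses the whole expression as a single word, and `[[ word ]]` only tests that the word is non-empty, so the branch is always taken. A minimal sketch of the difference (standalone demo, not part of submit.sh):

```shell
#!/bin/bash
# Illustrative demo: why a [[ ... ]] string comparison needs spaces
# around '=='.
result="false"

# No spaces: bash sees the single word 'false==true', and [[ word ]]
# is a non-empty-string test, so this branch is ALWAYS taken.
if [[ ${result}=="true" ]]; then
    no_spaces="taken"
fi

# With spaces: a real string comparison, so this branch is skipped.
if [[ "${result}" == "true" ]]; then
    with_spaces="taken"
else
    with_spaces="skipped"
fi

echo "no spaces: ${no_spaces}, with spaces: ${with_spaces}"
```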
I then copy this submit.sh file into my image, which builds FROM bde2020/spark-base:3.1.1-hadoop3.2, to overwrite the original submit.sh file.
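A minimal Dockerfile sketch of this override (the destination path /submit.sh is an assumption about where the bde2020 images place the script; adjust it to the actual location in your image):

```dockerfile
# Hypothetical sketch: build on the base image and replace its submit.sh
# with the patched version. The /submit.sh destination is assumed.
FROM bde2020/spark-base:3.1.1-hadoop3.2

COPY submit.sh /submit.sh
RUN chmod +x /submit.sh
```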
This is a bit of a hacky solution; I suggest removing the default ENV values or moving them to the template images, so this does not happen.
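As a hedged sketch of what that could look like (the function name choose_application and the precedence rule are my illustration, not the project's actual code), submit.sh could dispatch on whichever location variable is set, preferring the Python one:

```shell
#!/bin/bash
# Illustrative only: prefer SPARK_APPLICATION_PYTHON_LOCATION when it is
# set, and fall back to SPARK_APPLICATION_JAR_LOCATION otherwise. A
# default JAR value in the base image then cannot shadow a Python app.
choose_application() {
    if [ -n "${SPARK_APPLICATION_PYTHON_LOCATION}" ]; then
        echo "${SPARK_APPLICATION_PYTHON_LOCATION}"
    else
        echo "${SPARK_APPLICATION_JAR_LOCATION}"
    fi
}

SPARK_APPLICATION_PYTHON_LOCATION=/app/app.py
SPARK_APPLICATION_JAR_LOCATION=/app/application.jar
echo "selected: $(choose_application)"
```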
I attached the code for further inspection:
The error is reproducible for me on two different computers using this setup. The error is:
docker run --rm -e ENABLE_INIT_DAEMON=false --network hadoop --name pyspark pysparkexampleimage
sh: =~: unknown operand
Submit application /app/application.jar with main class my.main.Application to Spark master spark://spark-master:7077
Passing arguments
21/11/15 19:46:41 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
21/11/15 19:46:41 WARN DependencyUtils: Local jar /app/application.jar does not exist, skipping.
Error: Failed to load class my.main.Application.
21/11/15 19:46:41 INFO ShutdownHookManager: Shutdown hook called
21/11/15 19:46:41 INFO ShutdownHookManager: Deleting directory /tmp/spark-e27c7b0c-ef11-4302-921a-98f639371594
The second output line, sh: =~: unknown operand, also shows another error with the if-statement in the is_uri() function, probably semi-related: it suggests the script is being run by a shell that does not support bash's [[ ... =~ ]] regex operator.
The error in this issue was not present at the time of the linked commit (27 September 2021), when the setup ran perfectly fine.
Copying the Dockerfile and submit.sh from @niem94 into the linked setup (replacing the existing Dockerfile) removes the error for me.