big-data-europe / docker-spark

Apache Spark docker image

bde2020/spark-python-template:3.1.1-hadoop3.2 not working out of the box

devantler opened this issue · comments

Because the bde2020/spark-base:3.1.1-hadoop3.2 image sets the ENV SPARK_APPLICATION_JAR_LOCATION to a default value, submit.sh picks that value up in favor of the ENV SPARK_APPLICATION_PYTHON_LOCATION, so the bde2020/spark-python-template:3.1.1-hadoop3.2 image always tries to submit a jar file instead of the Python file.
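
A quick way to confirm that the base image ships this default is to dump its environment (this one-liner only inspects the image, nothing else; it should list the defaulted SPARK_APPLICATION_JAR_LOCATION if the base image sets it as described):

docker run --rm --entrypoint env bde2020/spark-base:3.1.1-hadoop3.2 | grep SPARK_APPLICATION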

I have fixed this locally by overwriting the submit.sh file to ignore SPARK_APPLICATION_JAR_LOCATION when I use the python-template. Below is my submit.sh file:

#!/bin/bash

export SPARK_MASTER_URL=spark://${SPARK_MASTER_NAME}:${SPARK_MASTER_PORT}
export SPARK_HOME=/spark

# Prints "true" if the argument looks like an http(s)/hdfs/file URI, "false" otherwise.
function is_uri() {
    regex='(https?|hdfs|file)://[-A-Za-z0-9\+&@#/%?=~_|!:,.;]*[-A-Za-z0-9\+&@#/%=~_|]'
    if [[ $1 =~ $regex ]]; then
        echo "true"
    else
        echo "false"
    fi
}

/wait-for-step.sh
/execute-step.sh

# Only the Python application is considered; SPARK_APPLICATION_JAR_LOCATION is deliberately ignored.
if [[ $(is_uri "${SPARK_APPLICATION_PYTHON_LOCATION}") == "true" || -f "${SPARK_APPLICATION_PYTHON_LOCATION}" ]]; then
    echo "Submit application ${SPARK_APPLICATION_PYTHON_LOCATION} to Spark master ${SPARK_MASTER_URL}"
    echo "Passing arguments ${SPARK_APPLICATION_ARGS}"
    PYSPARK_PYTHON=python3 /spark/bin/spark-submit \
        --master ${SPARK_MASTER_URL} \
        ${SPARK_SUBMIT_ARGS} \
        ${SPARK_APPLICATION_PYTHON_LOCATION} ${SPARK_APPLICATION_ARGS}
else
    echo "Not recognized application."
fi

/finish-step.sh

I then copy this submit.sh into my image, which builds FROM bde2020/spark-base:3.1.1-hadoop3.2, so that it overwrites the original submit.sh; a minimal Dockerfile for this is sketched below.
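
Roughly like this (a sketch only: the /submit.sh destination path, the CMD, and the /app/app.py entry point are my assumptions, not values taken from the report):

FROM bde2020/spark-base:3.1.1-hadoop3.2

# Patched submit script from above; /submit.sh is an assumed destination,
# adjust it to wherever the base/template images expect the script.
COPY submit.sh /submit.sh
RUN chmod +x /submit.sh

# The PySpark application itself; /app/app.py is a placeholder entry point.
COPY app /app
ENV SPARK_APPLICATION_PYTHON_LOCATION /app/app.py

# Assumes the container is started by running the submit script directly.
CMD ["/bin/bash", "/submit.sh"]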

This is a bit of a hacky solution, so I suggest removing the default ENV values or moving them to the template images so this does not happen.
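
In Dockerfile terms, the suggestion would look something like the following (hypothetical, not the current upstream Dockerfiles; /app/application.jar is the default visible in the error output below, while /app/app.py is illustrative):

# spark-base / spark-submit: drop the ENV SPARK_APPLICATION_JAR_LOCATION default entirely.

# spark-java-template only:
ENV SPARK_APPLICATION_JAR_LOCATION /app/application.jar

# spark-python-template only:
ENV SPARK_APPLICATION_PYTHON_LOCATION /app/app.py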

I attached the code for further inspection:

working-pyspark.zip

The error is reproducible for me on two different computers using this setup. The error is:

docker run --rm -e ENABLE_INIT_DAEMON=false --network hadoop --name pyspark pysparkexampleimage
sh: =~: unknown operand
Submit application /app/application.jar with main class my.main.Application to Spark master spark://spark-master:7077
Passing arguments
21/11/15 19:46:41 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
21/11/15 19:46:41 WARN DependencyUtils: Local jar /app/application.jar does not exist, skipping.
Error: Failed to load class my.main.Application.
21/11/15 19:46:41 INFO ShutdownHookManager: Shutdown hook called
21/11/15 19:46:41 INFO ShutdownHookManager: Deleting directory /tmp/spark-e27c7b0c-ef11-4302-921a-98f639371594

The second line of the output also shows another error, in the if-statement of the is_uri() function, which is probably semi-related.
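
The "sh: =~: unknown operand" message looks like that line is being evaluated by a non-bash shell, where the bash-only [[ ... =~ ... ]] test is not available. If that is the cause, a POSIX-portable variant of the check, for example the sketch below, would avoid =~ entirely:

# POSIX-portable URI check (illustrative sketch, equivalent in spirit to is_uri above):
is_uri() {
    case "$1" in
        http://*|https://*|hdfs://*|file://*) echo "true" ;;
        *) echo "false" ;;
    esac
}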

The error in this issue was not present at the time of the commit in the link (27th September, 2021), when it ran perfectly fine.

Using the Dockerfile and submit.sh from @niem94, copied into the linked setup (replacing the other Dockerfile), "removes" the error for me.

Hi @niem94, @dstoft,

thanks a lot for reaching out and giving a great trace of the issue. I see :( , I think we didn't do many tests when this is_uri check was introduced in 7f5ba39. I will look into it and try to test it so that it doesn't happen again.

Best regards,