big-data-europe / docker-spark

Apache Spark docker image

Could you please help: how can I add external jars in a Dockerfile that uses bde2020 as the base image?

arsoni20 opened this issue · comments

I have downloaded the PostgreSQL JDBC jar into the project directory that contains my Dockerfile. I have tried various ways to add it, but I am still getting a ClassNotFoundException.

Any help will be greatly appreciated

My Dockerfile:

FROM bde2020/spark-python-template:3.1.1-hadoop3.2
# set the working directory in the container
WORKDIR /code

COPY requirements.txt .

# install dependencies
RUN pip install --user -r requirements.txt

COPY src/ .

COPY postgresql-42.2.10.jar .
#ENV CLASSPATH postgresql-42.2.10.jar:${CLASSPATH}
COPY ./postgresql-42.2.10.jar /opt/spark/jars

CMD [ "spark-submit","./RunETL.py" ]

My Spark code that tries to write a DataFrame to a PostgreSQL table:

df.write.format('jdbc') \
    .options(
        url='jdbc:postgresql://<ip>/postgres',
        dbtable='employeee',
        user='dba',
        password='dba',
        driver='org.postgresql.Driver') \
    .mode('append') \
    .save()

Hi @arsoni20 ,

thanks a lot for reaching out.

One thing you can try is changing the command from:

COPY ./postgresql-42.2.10.jar /opt/spark/jars

to

COPY ./postgresql-42.2.10.jar /spark/jars

as that is the path we defined when setting up the base image.
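
If the /spark/jars path alone doesn't do it, one more thing you could try (a rough sketch, not something I have tested in this exact image) is pointing the SparkSession at the jar explicitly via the spark.jars option. The /code/postgresql-42.2.10.jar path below is an assumption based on your Dockerfile, which copies the jar into WORKDIR /code:

# Sketch: add the driver jar to the driver/executor classpaths explicitly.
# The path is assumed from the Dockerfile above (WORKDIR /code + COPY).
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("RunETL") \
    .config("spark.jars", "/code/postgresql-42.2.10.jar") \
    .getOrCreate()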

Do let me know if that doesn't resolve it.

Best regards,

Hi @GezimSejdiu, I have the same problem even when using the path /spark/jars.
@arsoni20 Did you find a solution?

Hi @GezimSejdiu, I have tried adding Kafka jar files and it didn't work. Would you mind sharing the solution?

Hi @AmeliaPessoa , @ahlag ,

thanks a lot for reaching out with this issue.

Hmm, it is a bit hard to say why it isn't working, but maybe I'm also missing some of the context. To reproduce it I built a whole example, which I may also turn into a small blog post in the future :) (if I manage to find some time to do it). Until then, I will post it here:

Create a simple Python script that just prints the table schema from Postgres using PySpark:

# postgresql.py
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .getOrCreate()

df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:postgresql://postgres:5432/postgres") \
    .option("dbtable", "public.your_table") \
    .option("user", "postgres") \
    .option("password", "mysecretpassword") \
    .option("driver", "org.postgresql.Driver") \
    .load()

df.printSchema()

Add the jar to the custom Docker image (I'm using the Python template here):

postgresql.Dockerfile will contain:

FROM bde2020/spark-python-template:3.2.0-hadoop3.2
COPY postgresql.py /app/

COPY postgresql-42.3.3.jar /spark/jars

ENV SPARK_APPLICATION_PYTHON_LOCATION /app/postgresql.py
ENV SPARK_APPLICATION_ARGS ""

where postgresql-42.3.3.jar is your downloaded jar.

Build your custom Docker image using the Python template:

docker build -t postgres-app -f postgresql.Dockerfile .

Add a postgres service to the docker-compose file where Spark is also running:

  postgres:
    image: postgres
    container_name: postgres
    ports:
      - "5432:5432"
    environment:
      - POSTGRES_PASSWORD=mysecretpassword

and create a simple table there so that you can reference it in the configuration above (see the sketch below).
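
Here is a minimal sketch (my own assumption, not something the template requires) of seeding that table from PySpark itself, so the read example above has rows to show; the table and column names are placeholders:

# Sketch: create and seed public.your_table through the same JDBC writer.
# mode("overwrite") lets Spark create the table from the DataFrame schema;
# table and column names here are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("seed-postgres-table").getOrCreate()

seed = spark.createDataFrame([(1, "ABC")], ["thekey", "ticker"])

seed.write \
    .format("jdbc") \
    .option("url", "jdbc:postgresql://postgres:5432/postgres") \
    .option("dbtable", "public.your_table") \
    .option("user", "postgres") \
    .option("password", "mysecretpassword") \
    .option("driver", "org.postgresql.Driver") \
    .mode("overwrite") \
    .save()

Switching mode("overwrite") to mode("append") gives the same pattern the original question needed for writing into an existing table.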

Execute your app and enjoy:

docker run --rm --network dockerspark_default --name pyspark-example-postgress postgres-app

You will then be able to see something like this:

22/02/28 22:24:45 INFO BlockManagerMasterEndpoint: Registering block manager 172.19.0.4:38513 with 366.3 MiB RAM, BlockManagerId(0, 172.19.0.4, 38513, None)
root
 |-- thekey: integer (nullable = true)
 |-- ticker: string (nullable = true)
 |-- date_val: date (nullable = true)
 |-- open_val: decimal(10,4) (nullable = true)

22/02/28 22:24:47 INFO SparkContext: Invoking stop() from shutdown hook

I hope to find some time to document this step by step and perhaps also expand the Python examples to cover these cases.

Best regards,
Gezim

@GezimSejdiu

Hi Gezim,

Thank you so much for your reply.
I have another question regarding the error I receive when using the following setup with Kafka:
https://github.com/ahlag/Spark-Streaming-with-Scala/blob/main/docker-setup/docker-compose.yaml

The error was that the Kafka jars were not found. Do you think adding the Kafka jars specified below to /spark/jars would solve the issue?
https://github.com/ahlag/Spark-Streaming-with-Scala/blob/8a741fd84ca244cb78737bad31742dc049d71a1c/10-WatermarkDemo/build.sbt#L9-L11

Hey @ahlag ,

yes, the Kafka jars need to be present on all the Spark nodes, in this case in the image you are using (whether it is built from the template image or from the base image). So yes, you need to provide those Kafka jars yourself, as they don't ship with Spark, unless you bundle them inside your application jar.
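
If it helps, here is a rough sketch of letting Spark resolve the Kafka connector itself via spark.jars.packages instead of copying jars by hand; the connector version (3.2.0, Scala 2.12), the broker address kafka:9092 and the topic name are assumptions you would need to match to your own Spark build and compose file:

# Sketch: pull the Kafka connector at startup via spark.jars.packages and
# read a topic with Structured Streaming. Version, broker and topic names
# are placeholders -- align them with your Spark/Scala build and setup.
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("kafka-example") \
    .config("spark.jars.packages",
            "org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.0") \
    .getOrCreate()

df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "kafka:9092") \
    .option("subscribe", "your_topic") \
    .load()

query = df.selectExpr("CAST(value AS STRING) AS value") \
    .writeStream \
    .format("console") \
    .start()

query.awaitTermination()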

I will close this issue for now, as I have provided an example of how to do it, but feel free to open a new issue if you are still facing the same problem with Kafka.

Best regards,