dask / dask-docker

Docker images for dask

Home Page: https://hub.docker.com/u/daskdev

Error: Unable to instantiate java compiler

LaurentEsingle opened this issue

What happened:

After installing Java and dask-sql using pip, whenever I try to run a SQL query from my Python code, I get the following error:

...
File "/home/vquery/.local/lib/python3.8/site-packages/dask_sql/context.py", line 378, in sql
    rel, select_names, _ = self._get_ral(sql)
  File "/home/vquery/.local/lib/python3.8/site-packages/dask_sql/context.py", line 515, in _get_ral
    nonOptimizedRelNode = generator.getRelationalAlgebra(validatedSqlNode)
java.lang.java.lang.IllegalStateException: java.lang.IllegalStateException: Unable to instantiate java compiler
...
...
File "JaninoRelMetadataProvider.java", line 426, in org.apache.calcite.rel.metadata.JaninoRelMetadataProvider.compile
  File "CompilerFactoryFactory.java", line 61, in org.codehaus.commons.compiler.CompilerFactoryFactory.getDefaultCompilerFactory
java.lang.java.lang.NullPointerException: java.lang.NullPointerException

What you expected to happen:

I should get a dataframe as a result.

Minimal Complete Verifiable Example:

# The cluster/client setup is done first, in a separate module from the one executing the SQL query
# Other cluster/scheduler types were also tried, with the same error
from dask.distributed import Client, LocalCluster
cluster = LocalCluster(
    n_workers=4,
    threads_per_worker=1,
    processes=False,
    dashboard_address=':8787',
    asynchronous=False,
    memory_limit='1GB'
    )
client = Client(cluster)

# The SQL code is executed in its own module
import dask.dataframe as dd
from dask_sql import Context

c = Context()
df = dd.read_parquet('/vQuery/files/results/US_Accidents_June20.parquet') 
c.register_dask_table(df, 'df')
df = c.sql("""select ID, Source from df""") # This line fails with the error reported

Anything else we need to know?:

As mentioned in the code snippet above, due to the way my application is designed, the Dask client/cluster setup is done before the dask-sql context is created.

Environment:

  • Dask version:
    • dask: 2020.12.0
    • dask-sql: 0.3.1
  • Python version:
    • Python 3.8.5
  • Operating System:
    • Ubuntu 20.04.1 LTS
  • Install method (conda, pip, source):
    • pip

Install steps

$ sudo apt install default-jre

$ sudo apt install default-jdk

$ java -version
openjdk version "11.0.10" 2021-01-19
OpenJDK Runtime Environment (build 11.0.10+9-Ubuntu-0ubuntu1.20.04)
OpenJDK 64-Bit Server VM (build 11.0.10+9-Ubuntu-0ubuntu1.20.04, mixed mode, sharing)

$ javac -version
javac 11.0.10

$ echo $JAVA_HOME
/usr/lib/jvm/java-11-openjdk-amd64

$ pip install dask-sql

$ pip list | grep dask-sql
dask-sql               0.3.1

Thanks @jrbourbeau for the mention.

Hi @LaurentEsingle and thank you very much for (1) using and testing dask-sql and (2) writing this very nice bug report!

Since you opened this issue in the dask-docker repository, I assume you are using the dask docker image? If so, is there a reason you are not using the conda-installable dask-sql package or the dask-sql docker image? (That question is mostly for my own understanding :-))

I tried to reproduce your problem both inside the dask docker image and on my local computer (which also runs Ubuntu 20.04), but I was not able to see this error (I tried with both Java 11.0.10 and 11.0.9). To be fair: I only checked with a CSV file, not a Parquet file, but I do not see any reason why that should matter :-)

Looking into an old mailing-list post by Julian, the core developer of Apache Calcite (which dask-sql uses), I think you are running into a similar problem: do you have other Java libraries in your classpath, such as janino or codehaus artifacts? Do you set a custom classpath?
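
One quick way to check that from Python (a sketch, assuming dask-sql starts its JVM through jpype when it is imported, which is also what the debugging snippet below relies on):

import dask_sql  # importing dask_sql starts the JVM
import jpype

# The java.class.path property should point at the bundled DaskSQL.jar only;
# extra janino/codehaus entries here would explain the error
print(jpype.JClass("java.lang.System").getProperty("java.class.path"))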

We can try some basic debugging. If you want, start your Python interpreter and run:

from dask_sql import java
# print where it gets the class from. That should be the DaskSQL.jar
print(java.org.codehaus.commons.compiler.CompilerFactoryFactory.class_.getProtectionDomain().getCodeSource().getLocation())
# print the JVM path, that should be your java installation
print(java.jvmpath)
# and now try to replicate the error
java.org.codehaus.commons.compiler.CompilerFactoryFactory.getDefaultCompilerFactory()

Do those printed paths correspond to what you expect? Is the last line also failing?

Some possible solutions you might want to try out (depending on your use case):

  • use conda for installation: the conda package of dask-sql ships with a "clean" Java installation without dependency problems (see the first command in the sketch after this list)
  • use the dask-sql docker image (nbraun/dask-sql) if that is an alternative for you
  • check with $JAVA_HOME/bin/jps and then $JAVA_HOME/bin/jinfo whether the classpath is actually picked up correctly after just importing dask_sql in a Python shell (the first command lists all running JVMs and the second shows additional information for a given JVM; see the sketch after this list)
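
Rough shell sketches for the first and last item (<pid> is a placeholder; take it from the jps output):

$ conda install -c conda-forge dask-sql

$ python -q
>>> import dask_sql  # keep this shell open so the JVM stays alive

# then, in a second terminal:
$ $JAVA_HOME/bin/jps                                  # lists running JVMs and their PIDs
$ $JAVA_HOME/bin/jinfo <pid> | grep java.class.path   # show the classpath of that JVM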

Last but not least, a question: as you mentioned, you create a Dask cluster before importing dask-sql (which is totally fine). Did you also test without any Dask cluster created? It shouldn't make a difference, but you never know... A sketch of such a test follows below.
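
For reference, that no-cluster test could look like this (a sketch reusing the file path and query from your report; adjust the path to your setup):

import dask.dataframe as dd
from dask_sql import Context

# No Client/LocalCluster is created: dask falls back to its default local scheduler
c = Context()
df = dd.read_parquet('/vQuery/files/results/US_Accidents_June20.parquet')
c.register_dask_table(df, 'df')
print(c.sql("select ID, Source from df").head())  # should print the first rows if everything works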

Thank you @jrbourbeau for alerting Nils.
Thank you @nils-braun for your quick reply.

I will now close the issue here and reopen it on the dask-sql repo.

Again thanks!