Error: Unable to instantiate java compiler

LaurentEsingle opened this issue · comments

What happened:

After installing Java (via apt) and dask-sql (via pip), every SQL query I run from my Python code fails with the following error:

File "/home/vquery/.local/lib/python3.8/site-packages/dask_sql/context.py", line 378, in sql
    rel, select_names, _ = self._get_ral(sql)
  File "/home/vquery/.local/lib/python3.8/site-packages/dask_sql/context.py", line 515, in _get_ral
    nonOptimizedRelNode = generator.getRelationalAlgebra(validatedSqlNode)
java.lang.java.lang.IllegalStateException: java.lang.IllegalStateException: Unable to instantiate java compiler
File "JaninoRelMetadataProvider.java", line 426, in org.apache.calcite.rel.metadata.JaninoRelMetadataProvider.compile
  File "CompilerFactoryFactory.java", line 61, in org.codehaus.commons.compiler.CompilerFactoryFactory.getDefaultCompilerFactory
java.lang.java.lang.NullPointerException: java.lang.NullPointerException

What you expected to happen:

I should get a dataframe as a result.

Minimal Complete Verifiable Example:

# The cluster/client setup is done first, in a different module from the one executing the SQL query
# Also tried other cluster/scheduler types with the same error
from dask.distributed import Client, LocalCluster
cluster = LocalCluster()
client = Client(cluster)

# The SQL code is executed in its own module
import dask.dataframe as dd
from dask_sql import Context

c = Context()
df = dd.read_parquet('/vQuery/files/results/US_Accidents_June20.parquet') 
c.register_dask_table(df, 'df')
df = c.sql("""select ID, Source from df""") # This line fails with the error reported

Anything else we need to know?:

As mentioned in the code snippet above, due to the way my application is designed, the Dask client/cluster setup is done before the dask-sql context is created.


  • Dask version:
    • dask: 2020.12.0
    • dask-sql: 0.3.1
  • Python version:
    • Python 3.8.5
  • Operating System:
    • Ubuntu 20.04.1 LTS
  • Install method (conda, pip, source):
    • pip

Install steps

$ sudo apt install default-jre

$ sudo apt install default-jdk

$ java -version
openjdk version "11.0.10" 2021-01-19
OpenJDK Runtime Environment (build 11.0.10+9-Ubuntu-0ubuntu1.20.04)
OpenJDK 64-Bit Server VM (build 11.0.10+9-Ubuntu-0ubuntu1.20.04, mixed mode, sharing)

$ javac -version
javac 11.0.10

$ echo $JAVA_HOME

$ pip install dask-sql

$ pip list | grep dask-sql
dask-sql               0.3.1
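
The empty output of `echo $JAVA_HOME` above shows that the variable is unset in that shell. Whether this contributes to the error is not established in this thread. As a minimal sketch, the same information can be collected from inside the Python process that will run the query. The helper name `java_environment` is hypothetical and not part of dask-sql:

```python
import os
import shutil


def java_environment(env=None):
    """Summarise the Java-related settings visible to this Python process."""
    env = os.environ if env is None else env
    return {
        # An empty string is treated like an unset variable.
        "JAVA_HOME": env.get("JAVA_HOME") or None,
        "CLASSPATH": env.get("CLASSPATH") or None,
        # First `java` executable found on PATH, or None if there is none.
        "java_on_path": shutil.which("java", path=env.get("PATH")),
    }


if __name__ == "__main__":
    for key, value in java_environment().items():
        print(f"{key}: {value if value is not None else '(not set)'}")
```

Running this in the same environment as the failing application shows whether that process sees a different Java setup than the interactive shell.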

Thanks @jrbourbeau for the mention.

Hi @LaurentEsingle and thank you very much for (1) using and testing dask-sql and (2) writing this very nice bug report!

As you opened this issue in the dask-docker repository, I assume you are using the dask docker image? If so, is there a reason you are not using the conda-installable dask-sql package or the dask-sql docker image? (That question is more for my own understanding :-))

I tried to reproduce your problem both inside the dask docker image and on my local computer (which also runs Ubuntu 20.04), but I could not trigger this error (I tried with both Java 11.0.10 and 11.0.9). To be fair, I only checked with a CSV file, not a parquet file, but I do not see any reason why that should make a difference :-)

Judging from an old mail by Julian, the core developer of Apache Calcite (which dask-sql uses), I think you are running into a problem similar to the one described there: do you have other Java libraries on your classpath, such as janino or codehaus? Do you set a custom classpath?
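
One way to answer the classpath question is to scan the `CLASSPATH` environment variable for entries that mention Janino or Codehaus. The following is only a sketch: the function name and the marker list are illustrative assumptions, and a JVM can also pick up jars by other means that this check does not see.

```python
import os

# Substrings that hint at a Janino / Codehaus commons-compiler jar (illustrative list).
SUSPECT_MARKERS = ("janino", "commons-compiler", "codehaus")


def find_suspect_entries(classpath, sep=os.pathsep):
    """Return the classpath entries whose path mentions one of the markers."""
    entries = [entry for entry in classpath.split(sep) if entry]
    return [
        entry
        for entry in entries
        if any(marker in entry.lower() for marker in SUSPECT_MARKERS)
    ]


if __name__ == "__main__":
    hits = find_suspect_entries(os.environ.get("CLASSPATH", ""))
    print(hits or "no Janino/Codehaus entries in CLASSPATH")
```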

We can try some basic debugging. If you want, start your python interpreter and call

from dask_sql import java
# print where it gets the class from. That should be the DaskSQL.jar
# print the JVM path, that should be your java installation
# and now try to replicate the error

Do those printed paths correspond to what you expect? Is the last line also failing?
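
The statements that belong to the comments in the snippet above are not shown in this thread. As a rough sketch of equivalent checks, the function below uses only the public JPype API rather than anything specific to dask-sql. It assumes that `dask_sql` has already been imported in the same process (so that the JVM is running) and that JPype is the bridge in use; the function name `jvm_diagnostics` is hypothetical. The class name is taken from the stack trace in the report.

```python
def jvm_diagnostics(class_name="org.codehaus.commons.compiler.CompilerFactoryFactory"):
    """Collect details about the running JVM via JPype, or return None if unavailable."""
    try:
        import jpype
    except ImportError:
        return None  # JPype is not installed in this environment
    if not jpype.isJVMStarted():
        return None  # no JVM in this process yet; import dask_sql first

    system = jpype.JClass("java.lang.System")
    info = {
        # Installation directory of the JVM that is actually running.
        "java_home": str(system.getProperty("java.home")),
        # Classpath the JVM was started with.
        "class_path": str(system.getProperty("java.class.path")),
    }
    try:
        cls = jpype.JClass(class_name)
        source = cls.class_.getProtectionDomain().getCodeSource()
        # Jar (or directory) the class was loaded from; expected to be DaskSQL.jar.
        info["class_location"] = None if source is None else str(source.getLocation())
    except Exception as exc:  # class missing or lookup failed
        info["class_location"] = f"lookup failed: {exc!r}"
    return info
```

Calling `print(jvm_diagnostics())` after `import dask_sql` shows which Java installation is running and from which jar the compiler factory class is loaded.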

Some possible solutions you might want to try out (depending on your use case):

  • use conda for installation (the conda package of dask-sql ships with a "clean" java installation without dependency problems)
  • use the dask-sql docker image (nbraun/dask-sql) if that is an alternative
  • check with $JAVA_HOME/bin/jps and then $JAVA_HOME/bin/jinfo whether the classpath is actually picked up correctly after just having imported dask_sql in a Python shell (the first command prints all running JVMs and the second shows additional information about a given JVM)
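
The `jps`/`jinfo` check from the last bullet can be scripted. This is a sketch under a few assumptions: the helper names are hypothetical, the tools are looked up under `$JAVA_HOME/bin` (falling back to `PATH` when `JAVA_HOME` is unset), and it must be run from a second shell while the Python session that imported `dask_sql` is still open.

```python
import os
import subprocess


def jdk_tool(name, java_home=None):
    """Path to a JDK tool under JAVA_HOME/bin, or the bare name if JAVA_HOME is unset."""
    if java_home is None:
        java_home = os.environ.get("JAVA_HOME", "")
    return os.path.join(java_home, "bin", name) if java_home else name


def parse_jps(output):
    """Turn `jps -l` output (one "<pid> <main class or jar>" per line) into a dict."""
    jvms = {}
    for line in output.splitlines():
        pid, _, label = line.strip().partition(" ")
        if pid.isdigit():
            jvms[int(pid)] = label
    return jvms


def running_jvm_sysprops():
    """Run jps, then `jinfo -sysprops` for each JVM found; returns {pid: text}."""
    listing = subprocess.run(
        [jdk_tool("jps"), "-l"], capture_output=True, text=True, check=True
    )
    result = {}
    for pid in parse_jps(listing.stdout):
        # jinfo can fail for JVMs that have already exited (such as jps itself),
        # so its exit code is deliberately not checked.
        info = subprocess.run(
            [jdk_tool("jinfo"), "-sysprops", str(pid)],
            capture_output=True,
            text=True,
        )
        result[pid] = info.stdout
    return result
```

In the text returned by `running_jvm_sysprops()`, the `java.class.path` property of the Python process's JVM is the value to compare against the expected DaskSQL.jar location.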

Last but not least, a question: you mentioned that you create a dask cluster before importing dask-sql (which is totally fine). Did you also test without creating any dask cluster? It shouldn't make a difference, but you never know...

Thank you @jrbourbeau for alerting Nils.
Thank you @nils-braun for your quick reply.

I will now close the issue here and reopen it in the dask-sql repo.

Again thanks!