Error: Unable to instantiate java compiler
LaurentEsingle opened this issue
What happened:
After installing Java and dask-sql using pip, whenever I try to run a SQL query from my python code I get the following error:
```
...
  File "/home/vquery/.local/lib/python3.8/site-packages/dask_sql/context.py", line 378, in sql
    rel, select_names, _ = self._get_ral(sql)
  File "/home/vquery/.local/lib/python3.8/site-packages/dask_sql/context.py", line 515, in _get_ral
    nonOptimizedRelNode = generator.getRelationalAlgebra(validatedSqlNode)
java.lang.java.lang.IllegalStateException: java.lang.IllegalStateException: Unable to instantiate java compiler
...
  File "JaninoRelMetadataProvider.java", line 426, in org.apache.calcite.rel.metadata.JaninoRelMetadataProvider.compile
  File "CompilerFactoryFactory.java", line 61, in org.codehaus.commons.compiler.CompilerFactoryFactory.getDefaultCompilerFactory
java.lang.java.lang.NullPointerException: java.lang.NullPointerException
...
```
What you expected to happen:
I should get a dataframe as a result.
Minimal Complete Verifiable Example:
```python
# The cluster/client setup is done first, in a different module from the one
# executing the SQL query. Other cluster/scheduler types fail with the same error.
from dask.distributed import Client, LocalCluster

cluster = LocalCluster(
    n_workers=4,
    threads_per_worker=1,
    processes=False,
    dashboard_address=':8787',
    asynchronous=False,
    memory_limit='1GB',
)
client = Client(cluster)

# The SQL code is executed in its own module
import dask.dataframe as dd
from dask_sql import Context

c = Context()
df = dd.read_parquet('/vQuery/files/results/US_Accidents_June20.parquet')
c.register_dask_table(df, 'df')
df = c.sql("""select ID, Source from df""")  # This line fails with the error reported
```
Anything else we need to know?:
As mentioned in the code snippet above, due to the way my application is designed, the Dask client/cluster setup is done before the dask-sql context is created.
Environment:
- Dask version:
- dask: 2020.12.0
- dask-sql: 0.3.1
- Python version:
- Python 3.8.5
- Operating System:
- Ubuntu 20.04.1 LTS
- Install method (conda, pip, source):
- pip
Install steps:

```
$ sudo apt install default-jre
$ sudo apt install default-jdk
$ java -version
openjdk version "11.0.10" 2021-01-19
OpenJDK Runtime Environment (build 11.0.10+9-Ubuntu-0ubuntu1.20.04)
OpenJDK 64-Bit Server VM (build 11.0.10+9-Ubuntu-0ubuntu1.20.04, mixed mode, sharing)
$ javac -version
javac 11.0.10
$ echo $JAVA_HOME
/usr/lib/jvm/java-11-openjdk-amd64
$ pip install dask-sql
$ pip list | grep dask-sql
dask-sql 0.3.1
```
cc @nils-braun
Thanks @jrbourbeau for the mention.
Hi @LaurentEsingle and thank you very much for (1) using and testing dask-sql and (2) writing this very nice bug report!
As you opened this issue in the dask-docker repository, I assume you are using the dask docker image? If so, is there a reason you are not using the conda-installable dask-sql package or the dask-sql docker image? (That question is more for my own understanding :-))
I tried to reproduce your problem both inside the dask docker image and on my local computer (which also runs Ubuntu 20.04), but I was not able to trigger this error (I tried with both Java 11.0.10 and 11.0.9). To be fair, I only checked with a CSV file, not a Parquet file, but I see no reason why that should matter :-)
Looking into an old mail by Julian, the core developer of Apache Calcite (which dask-sql uses internally), I think you may be running into a similar problem: do you have other Java libraries on your classpath, such as janino or codehaus artifacts? Do you set a custom classpath?
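As a quick way to answer the classpath question, a stdlib-only scan of the `CLASSPATH` environment variable along these lines might help (a sketch: the function name and the suspect list are my own, not part of dask-sql, and a custom classpath could also be set elsewhere):

```python
import os

def find_conflicting_jars(classpath=None):
    """Scan a Java classpath string for jars that look like they could
    clash with the janino/commons-compiler classes bundled with dask-sql."""
    suspects = ("janino", "commons-compiler", "codehaus")
    if classpath is None:
        classpath = os.environ.get("CLASSPATH", "")
    hits = []
    for entry in classpath.split(os.pathsep):
        name = os.path.basename(entry).lower()
        if any(s in name for s in suspects):
            hits.append(entry)
    return hits
```

If this returns anything, those jars are worth removing from the classpath (or at least mentioning in the bug report).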
We can try some basic debugging. If you want, start your Python interpreter and run:

```python
from dask_sql import java

# Print where the class is loaded from. That should be the DaskSQL.jar.
print(java.org.codehaus.commons.compiler.CompilerFactoryFactory.class_.getProtectionDomain().getCodeSource().getLocation())

# Print the JVM path; that should be your Java installation.
print(java.jvmpath)

# And now try to replicate the error.
java.org.codehaus.commons.compiler.CompilerFactoryFactory.getDefaultCompilerFactory()
```
Do those printed paths correspond to what you expect? Is the last line also failing?
Some possible solutions you might want to try, depending on your use case:
- use conda for installation (the conda package of dask-sql ships with a "clean" Java installation without dependency problems)
- use the dask-sql docker image (nbraun/dask-sql), if that is an alternative for you
- check with `$JAVA_HOME/bin/jps` and then `$JAVA_HOME/bin/jinfo <pid>` whether the classpath is actually picked up correctly after just having imported dask_sql in a Python shell (the first command prints all running JVMs; the second shows additional information for a given JVM)
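Related to the jar checks above, you can also verify which jar actually contains the missing compiler class by scanning candidate jars with Python's `zipfile` module (a sketch: the helper name is mine, and the DaskSQL.jar location in the comment is an assumption about where pip placed it):

```python
import os
import zipfile

def jars_containing(class_name, jar_paths):
    """Return the jars whose entries include the compiled class file."""
    entry = class_name.replace(".", "/") + ".class"
    found = []
    for jar in jar_paths:
        if not os.path.exists(jar):
            continue
        with zipfile.ZipFile(jar) as zf:
            if entry in zf.namelist():
                found.append(jar)
    return found

# e.g. jars_containing(
#     "org.codehaus.commons.compiler.CompilerFactoryFactory",
#     ["/home/vquery/.local/lib/python3.8/site-packages/dask_sql/jar/DaskSQL.jar"],
# )
```

If the class is missing from the jar you expected, that points at a broken install rather than a classpath conflict.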
Last but not least, a question: as you mentioned, you are creating a dask cluster before importing dask-sql (which is totally fine). Did you also test without creating a dask cluster at all? It shouldn't make a difference, but you never know...
Thank you @jrbourbeau for alerting Nils.
Thank you @nils-braun for your quick reply.
I will now close the issue here and reopen it in the dask-sql repo.
Again thanks!