pyarrow seems to be missing from the images
martinedgefocus opened this issue · comments
This seems fairly fundamental and necessary?
Per discussion here: https://dask.discourse.group/t/missing-pyarrow/796/2 it was suggested I report this here
Thanks for raising an issue @martinedgefocus. As there is a wide variety of use cases where folks use Dask, we try to keep the packages included by default in these images minimal, but allow users to specify additional dependencies they may need installed through setting environment variables (see https://docs.dask.org/en/stable/deploying-docker.html#extensibility). For example, if you would like pyarrow installed you could set EXTRA_CONDA_PACKAGES="pyarrow"
OK, thanks. I'm new to this I'm afraid.
The environment we set up is an AWS EC2Cluster with automatic sizing per the adapt() mechanism.
Is it still feasible to pass the environment variables through like that to add extra packages?
I can see there's an env_vars param, would that be sufficient?
> The environment we set up is an AWS EC2Cluster
Just to confirm, does this mean you're using dask_cloudprovider.aws.EC2Cluster to create your Dask cluster?
> I can see there's an env_vars param, would that be sufficient?
Looking at the dask_cloudprovider.aws.EC2Cluster docstring, it appears that env_vars are environment variables passed to the workers. I don't know if that's quite what you're after, as it will depend on when those environment variables are set. For this use case, I think you want them to be set when Docker is pulling the image. Maybe docker_args is what you're after (I think @jacobtomlinson will have more insight here).
Yup, that is one of the intended uses of env_vars; those variables are passed to the docker run command that happens under the hood on the EC2 instances.
cluster = EC2Cluster(..., env_vars={"EXTRA_CONDA_PACKAGES": "pyarrow"})
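For anyone landing here later, a slightly fuller sketch of the same idea, combined with the adapt() scaling mentioned above. The helper names extra_package_env and start_cluster are made up for illustration, and the worker counts are placeholders. It assumes dask_cloudprovider is installed and AWS credentials are configured. The Dask Docker images treat EXTRA_CONDA_PACKAGES and EXTRA_PIP_PACKAGES as space-separated lists.

```python
def extra_package_env(conda=(), pip=()):
    """Build the env vars the Dask Docker images read at container start-up.

    Hypothetical helper, not part of dask or dask_cloudprovider.
    """
    env = {}
    if conda:
        # Multiple packages are passed as one space-separated string.
        env["EXTRA_CONDA_PACKAGES"] = " ".join(conda)
    if pip:
        env["EXTRA_PIP_PACKAGES"] = " ".join(pip)
    return env


def start_cluster(n_workers=2):
    """Create an EC2Cluster whose containers install pyarrow on start-up."""
    # Imported lazily so the helper above works without dask_cloudprovider.
    from dask_cloudprovider.aws import EC2Cluster

    cluster = EC2Cluster(
        n_workers=n_workers,
        # Forwarded to `docker run` on each EC2 instance.
        env_vars=extra_package_env(conda=["pyarrow"]),
    )
    # Placeholder bounds for adaptive scaling; tune for your workload.
    cluster.adapt(minimum=1, maximum=10)
    return cluster
```

Note the extra packages are installed each time a container starts, so every new worker will take a bit longer to come up; for anything beyond a package or two, building a custom image is likely the better route.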
I'm going to close this out but please feel free to follow up here if you have more questions about how to do this.