dask / dask-labextension

JupyterLab extension for Dask

Specify default address to look for scheduler

mrocklin opened this issue

So, I'm in an interesting situation where I'm running a Jupyter server and I know that it will have exactly one Dask cluster attached to it. I would like to populate the Dask labextension with that scheduler address on startup. Is this easy to do?

This is already possible today using the defaultURL setting value. Doing this as part of a deployment would look like:

  1. Identify the relevant server address
  2. Prior to users loading the page (not necessarily prior to the Jupyter server startup, but might as well be), put the setting value in an overrides.json file for JupyterLab to pick up. This could be baked into the environment if it's a stable URL, or done as part of some setup script (see the sketch below).
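
For example, a minimal version of step 2 might look like the following. This is a sketch only: the plugin id ("dask-labextension:plugin") and the address are assumptions to verify against the schema shipped with the installed extension.

import json

# Hypothetical values: check the plugin id against the installed schema and
# substitute the real scheduler dashboard address for the deployment.
overrides = {
    "dask-labextension:plugin": {
        "defaultURL": "http://localhost:8787",
    }
}

# JupyterLab picks this up from <sys.prefix>/share/jupyter/lab/settings/overrides.json
with open("overrides.json", mode="w") as f:
    json.dump(overrides, f, indent=2)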

Clearly, I should put a bit of effort into docs here...

Not with the current design -- the default URL that populates the search bar is decided on the frontend, and feeding information to it goes through the config system (i.e., env variables aren't directly visible to the frontend).

Is there an issue with writing a small config file in that case, or is it just more convenient to set an env variable?

So I would do something like the following before starting up the Jupyter server?

with open("overrides.json", mode="w") as f:
    f.write(json.dumps(...))

Yes, something like that, at least for a proof-of-concept. A more complete solution might be to use json5 and merge with other possible config options.
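
For instance, a rough sketch of that fuller approach, assuming the json5 package (which tolerates comments and trailing commas) and a hypothetical settings dict:

import os

import json5  # pip install json5

OVERRIDES = "overrides.json"
new_settings = {
    "dask-labextension:plugin": {"defaultURL": "http://localhost:8787"},  # hypothetical
}

# Load any existing overrides instead of clobbering them.
existing = {}
if os.path.exists(OVERRIDES):
    with open(OVERRIDES) as f:
        existing = json5.load(f)

# Shallow-merge per plugin so unrelated settings survive.
for plugin, values in new_settings.items():
    existing.setdefault(plugin, {}).update(values)

with open(OVERRIDES, "w") as f:
    json5.dump(existing, f, indent=2)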

To be clear, we could have some kind of translation layer between the dask config system and the JupyterLab one, but we'd have to build it. I'm a little reluctant to build out a new set of special-case environment variables rather than go through the existing path. I know that some JupyterHub/QHub/2i2c deployments also need to distribute custom settings.

The frontend chooses in order:

  1. Any user-populated URL (which is persisted between page refreshes)
  2. The default URL from the settings
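
In pseudocode, the selection amounts to the following (illustrative only; the real logic lives in the frontend TypeScript):

def initial_url(persisted_url, default_url):
    # A URL the user entered (persisted across refreshes) wins over the setting.
    return persisted_url or default_url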

I also noticed when kicking the tires on this that the user-populated URL can be a bit too sticky at the moment (you can reset it with a ?reset query parameter). A fix for that is pretty straightforward here.

@ian-r-rose and I spoke. There is some possibility of using the system that currently sends the default at-start-time clusters up to the frontend. This is low enough priority though that we're going to wait until jupyter-on-dask becomes more of a major thing (maybe never).

If we switched out the internals for dask-ctl, this would be handled by cluster discovery: discovered clusters would automatically be listed in the sidebar. xref #189

we don't have any Cluster objects, just a scheduler address

I am not sure that this would be insurmountable in a refactor to use dask-ctl. Today, the sidebar in some sense owns the clusters listed there, and they are backed by real Cluster instances. But if we can, I'd love to get out of the business of having a Cluster-backed object altogether, and just have something like "here is a list of clusters we know how to connect to". In that case maybe an address (+ some related metadata?) would be enough.

@mrocklin that should be fine. dask_ctl.ProxyCluster fulfils the Cluster API and is useful for representing clusters that can't be rehydrated into other cluster manager objects. Currently, the discovery method for ProxyCluster looks through open ports on localhost and returns any schedulers it finds, so classes like LocalCluster and SSHCluster can be included in the list. It would be very quick to expand this to include other addresses configured in the environment, like DASK_SCHEDULER_ADDRESS.
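
To illustrate the shape of that, a hedged sketch of such a discovery hook follows. The (name, class) async-generator convention matches dask-ctl's discovery entry points as I understand them, but treat the details as illustrative rather than the actual implementation:

import os
from typing import AsyncIterator, Callable, Tuple

from dask_ctl.proxy import ProxyCluster

async def discover() -> AsyncIterator[Tuple[str, Callable]]:
    # Alongside the existing localhost port scan, surface an address
    # configured via DASK_SCHEDULER_ADDRESS (illustrative only).
    address = os.environ.get("DASK_SCHEDULER_ADDRESS")
    if address:
        yield address, ProxyCluster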

But if we can, I'd love to get out of the business of having a Cluster-backed object altogether

I've been down the same thought process too. The trouble is that cluster objects are generally the only place where we can actually represent the abstract concept of a cluster. Dask Gateway and the Dask Kubernetes Operator both have other ways to store and represent this internally, but most other deployment mechanisms don't. My goal with ProxyCluster is to hold this representation in a catch-all way for clusters that aren't easily put back into their original classes.

My goal with ProxyCluster is to hold this representation in a catch-all way for clusters that aren't easily put back into their original classes.

This seems like it could be a good solution -- thanks for the explanation @jacobtomlinson. I'll see if I can put together an example using dask/distributed#6737 and ProxyCluster.

I'm getting more excited about the possibility of integrating dask-ctl here

I'm also interested in providing a default address.

I tried the following in overrides.json but it doesn't seem to work. Maybe I'm using the wrong plugin name?

{
        "dask-labextension:plugin": {
                "hideClusterManager": true,
                "defaultURL": "<hidden>"
        }
}
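
For what it's worth, one way to double-check the plugin id might be to list the installed settings schemas. A hedged sketch, assuming a standard sys.prefix install; the id should be "<package>:<schema-file-stem>":

import sys
from pathlib import Path

# JupyterLab settings schemas typically live under share/jupyter/lab/schemas/.
schema_dir = Path(sys.prefix) / "share/jupyter/lab/schemas/dask-labextension"
for schema in schema_dir.glob("*.json"):
    print(f"dask-labextension:{schema.stem}")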

Thanks for your help.