dask / dask-blog

Dask development blog

Home Page: https://blog.dask.org/

Blogpost idea: how to choose good settings for Dask on HPC

GenevieveBuckley opened this issue · comments

It'd be good to have a blogpost about how to choose good settings for Dask on HPC. Users are often confused about this.

I think one reason this is particularly confusing is that settings often need to be defined in multiple locations, and people are confused about how they interact. For example, someone might submit a job to SLURM with sbatch, which then runs a python program involving Dask, and want to know how that fits together.
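As an illustration of that layering (a hedged sketch; the script name and resource values are made up), the directives in the batch script apply only to the job that runs the client code, while the workers' resources are configured separately inside Python:

```bash
#!/bin/bash
#SBATCH --job-name=dask-launcher
#SBATCH --time=01:00:00
#SBATCH --cpus-per-task=1   # resources for the *launcher* job only
#SBATCH --mem=4G

# my_dask_script.py (hypothetical) creates a SLURMCluster, which in
# turn submits *additional* batch jobs for the Dask workers. Those
# worker jobs get their resources from the cluster kwargs in Python,
# not from the #SBATCH lines above.
python my_dask_script.py
```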

#116 (comment)

...you know what would ALSO be a good blogpost? How to choose good cluster settings. Eg: how your SLURM/PBS/whatever batch submission settings relate to the settings you need to put in your dask-jobqueue cluster object.

To be honest I'm still a bit confused by this, and it is something other people ask me too.

If either @jacobtomlinson or @ian-r-rose would like to help make this, that would be very useful to refer people to (hint, hint) 😄

@guillaumeeb has kindly agreed to help put this together #116 (comment)

Hi all, I saw this issue, and I agree that both ideas would make great articles. Those are questions we see a lot as HPC admin/experts.

I can try to help with the second one, on batch submission settings! Everyone is confused by it.

These resources don't necessarily answer the question about how to choose good settings, but might be good to link to:

It'd be good to collect other, non-SLURM links too

Thanks @GenevieveBuckley for starting the discussion.

In my experience, the thing users have the most difficulty understanding is how to configure the JobQueueCluster (be it PBS, Slurm, or whatever) correctly, and what the kwargs mean. More specifically:

  • What are processes, cores, and memory, and how do they change the job configuration and the dask-worker configuration?
  • "Wait, when I use scale, am I scaling jobs, dask-workers, or dask-worker processes? What is a dask-worker?"
  • Why is this not consistent with other Dask Cluster objects like LocalCluster?
  • Why are there different options between job queue Cluster objects?
  • How do I pass arguments to my dask workers?
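On the first bullet, my understanding of how dask-jobqueue divides one batch job's allocation among worker processes can be sketched with simple arithmetic (the helper below is hypothetical, not part of the dask-jobqueue API, and worth double-checking against the real cluster objects):

```python
def resources_per_worker(cores, processes, memory_gb):
    """Hypothetical helper illustrating how one batch job's allocation
    is split among its Dask worker processes: each job starts
    `processes` worker processes, and the job's `cores` threads and
    `memory` are divided evenly among them."""
    return {
        "threads_per_worker": cores // processes,
        "memory_gb_per_worker": memory_gb / processes,
    }

# One job asking for 24 cores and 120 GB, split into 4 worker processes:
print(resources_per_worker(cores=24, processes=4, memory_gb=120))
# {'threads_per_worker': 6, 'memory_gb_per_worker': 30.0}
```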

I think one reason this is particularly confusing is that settings often need to be defined in multiple locations, and people are confused about how they interact

On top of this, there is also the dask config YAML file vs. the kwargs. Which to use, and when?
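For reference, the same options can live in either place. Here is a sketch of the config-file side, assuming the standard `~/.config/dask/jobqueue.yaml` location (all values are illustrative). Kwargs passed to the cluster object override these file defaults:

```yaml
# ~/.config/dask/jobqueue.yaml (illustrative values only)
jobqueue:
  slurm:
    cores: 24            # total threads per batch job
    processes: 4         # dask-worker processes per batch job
    memory: 120GB        # total memory per batch job
    queue: regular       # hypothetical partition name
    walltime: '01:00:00'
```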

For example, someone might submit a job to SLURM with sbatch, which then runs a python program involving Dask, and want to know how that fits together.

I agree; we also need to describe the different possibilities and the "big picture":

  • Just run your Python script using a JobQueueCluster on the login/front node.
  • Run a batch job that starts the Python script using a JobQueueCluster.
  • Why you can't yet run a client script that submits a Scheduler as a batch job (or maybe this should go in a dedicated "improvements" section).
  • Hey, there are alternatives to dask-jobqueue that are better suited for some cases:
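One such alternative, sketched here with hypothetical sizes, is dask-mpi: the batch job requests the whole allocation up front, and the MPI ranks are carved into scheduler, client, and workers, so no extra jobs are submitted from inside Python:

```bash
#!/bin/bash
#SBATCH --nodes=2     # hypothetical sizes: the whole allocation
#SBATCH --ntasks=48   # is requested up front in this one job
#SBATCH --time=01:00:00

# Calling dask_mpi.initialize() inside the script turns rank 0 into
# the scheduler, runs the client code on rank 1, and makes the
# remaining ranks Dask workers.
mpirun -np 48 python my_mpi_script.py
```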

And we could also list improvements to be made, or point to https://blog.dask.org/2019/06/12/dask-on-hpc, which presents a lot of things that are still true. And maybe try to develop point 7 at the end of the post.

That is an excellent and thorough summary @guillaumeeb!

We also might add:

  • which of these kwargs/settings are basically required, and which are optional (helps people sort out a structure for what to worry about first/next/last)

how to configure the JobQueueCluster (be it PBS, Slurm, or whatever) correctly, and what the kwargs mean.

Building on "what the kwargs mean", it would be good if we could not only explain each concept, but also map it to the words used for the same concept in other places. I'm suggesting this because it's the type of question I get: someone has read all the beginner documentation and asks "Is $foo the same as $bar? Does that mean I should set these values to the same thing?"
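One concrete form that mapping could take is a small glossary. The table below reflects my understanding of the SLURM directives that dask-jobqueue's SLURMCluster generates by default, and should be verified against the output of `cluster.job_script()` before publishing:

```python
# Hypothetical glossary mapping SLURMCluster kwargs to the SLURM
# directives they generate by default (my understanding only).
JOBQUEUE_TO_SLURM = {
    "cores": "--cpus-per-task",  # CPUs requested for the batch job
    "memory": "--mem",           # memory requested for the batch job
    "queue": "--partition",      # which queue/partition to submit to
    "walltime": "--time",        # maximum run time of the batch job
}

# e.g. answering "is `queue` the same as SLURM's partition?" -> yes:
print(JOBQUEUE_TO_SLURM["queue"])  # --partition
```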