catWeight documentation/explanation

ludoro opened this issue 4 years ago

Super cool library. I am working on MLJTuning, where we want to have a LatinHypercube hyper-parameter optimization method, so I am using your library there.
One small issue I have is the use of "catWeight". There are many cases where we have categorical values, but it's not very clear how that parameter works.
At the moment I just always set it to 0. I have not found any reference to it in the two papers you list as references. Could you shed some light on it?

Thanks a lot!

Magnus Urquhart · Answer 1 · Thu Oct 15 2020 22:48:39 GMT+0800 (China Standard Time)

Hi @ludoro,

Glad you are finding it useful! You can read more about it in our paper, https://doi.org/10.1016/j.asoc.2019.106050, where Figure 3 shows its effect. In general, you can think of it as a distance between the categorical dimensions when the sampling plan is optimised.

For the example in the documentation (https://mrurq.github.io/LatinHypercubeSampling.jl/stable/man/categorical/), you can think of catWeight=1000 as a large separation between the categorical dimensions, which is similar to making separate LHC plans for each category. catWeight=0 can be interpreted as no separation between the categorical dimensions, so the categorical value of each point is selected randomly. The risk of setting it to 0 is that all points in one category could become clustered on one side of the design space without any penalty. In general, I would suggest using some separation, such as catWeight=1, to prevent this from happening.
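
To make the three cases concrete, here is a minimal Python sketch. It is not the package's code: it assumes an Audze-Eglais-style objective (sum of inverse squared distances over all point pairs) and assumes that a categorical dimension adds cat_weight squared to the squared distance whenever two points are in different categories. The names pair_distance_sq, objective, clustered and interleaved are made up for this illustration.

```python
from itertools import combinations


def pair_distance_sq(p, q, is_categorical, cat_weight):
    """Squared distance between two plan points.

    Assumption for this sketch: a categorical dimension adds cat_weight**2
    when the two points are in different categories, and nothing otherwise.
    """
    total = 0.0
    for a, b, is_cat in zip(p, q, is_categorical):
        if is_cat:
            if a != b:
                total += cat_weight ** 2
        else:
            total += (a - b) ** 2
    return total


def objective(plan, is_categorical, cat_weight):
    """Audze-Eglais-style objective: sum of 1/d^2 over all pairs (lower is better)."""
    return sum(
        1.0 / pair_distance_sq(p, q, is_categorical, cat_weight)
        for p, q in combinations(plan, 2)
    )


# One continuous dimension (integer levels 1..6) and one two-category dimension.
is_categorical = (False, True)
clustered = [(1, 1), (2, 1), (3, 1), (4, 2), (5, 2), (6, 2)]    # each category on one side
interleaved = [(1, 1), (2, 2), (3, 1), (4, 2), (5, 1), (6, 2)]  # categories alternate

for w in (0.0, 1.0, 1000.0):
    print(w, objective(clustered, is_categorical, w), objective(interleaved, is_categorical, w))

# Score of each category treated as its own independent plan, summed.
separate = sum(
    objective([p for p in clustered if p[1] == c], is_categorical, 0.0)
    for c in (1, 2)
)
print(separate)
```

Under these assumptions, a weight of 0 gives both plans the same score, so nothing discourages the clustered one. A weight of 1 gives the interleaved plan the lower (better) score. With a weight of 1000 the cross-category terms nearly vanish, and the clustered plan's score is almost exactly the sum of the two per-category scores, which matches the "separate LHC plans for each category" picture.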

A small note: in the paper, the weight values refer to an LHC that is scaled from 0 to 1. In this package, the LHC consists of unscaled integers from 1 to N, where N is the number of samples. So catWeight=1 is the same as the step distance in each dimension.
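
If you want to compare a weight from the paper with one in package units, a rough conversion can be sketched as below. This assumes the scaled plan maps the integer levels 1..N linearly onto [0, 1], i.e. scaled = (level - 1) / (N - 1), so one step is 1 / (N - 1). That mapping is an assumption of this sketch, not something confirmed in the thread, and both helper names are hypothetical.

```python
def package_to_paper_weight(package_weight, n_samples):
    """Convert a weight in integer-step units to the 0..1 scale.

    Assumption: levels 1..N map linearly onto [0, 1], so one step is 1 / (N - 1).
    """
    return package_weight / (n_samples - 1)


def paper_to_package_weight(paper_weight, n_samples):
    """Inverse of package_to_paper_weight, under the same assumption."""
    return paper_weight * (n_samples - 1)


# Example: with 101 samples, one integer step is 1/100 of the scaled range.
print(package_to_paper_weight(1.0, 101))
print(paper_to_package_weight(0.01, 101))
```

So under this assumption the same catWeight in package units corresponds to a smaller scaled weight as the number of samples grows.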

LudovicoBessi · Answer 2 · Fri Oct 16 2020 18:35:40 GMT+0800 (China Standard Time)

I see, thanks a lot!