MrUrq / LatinHypercubeSampling.jl

Julia package for the creation of optimised Latin Hypercube Sampling Plans

catWeight documentation/explanation

ludoro opened this issue

Hey @MrUrq,

Super cool library. I am working on MLJTuning, where we want to have a Latin hypercube hyper-parameter optimization method, so I am using your library there.
One small issue I have is the use of "catWeight". There are many cases where we have categorical values, but it's not very clear how that parameter works.
At the moment I always set it to 0. I have not found any reference to it in the two papers you list as references; would you care to shed some light on it?

Thanks a lot!

Hi @ludoro,

Glad you are finding it useful! You can read more about it in our paper https://doi.org/10.1016/j.asoc.2019.106050 where Figure 3 shows its effect. In general, you can think of it as a distance between the categorical dimensions when the sampling plan is optimised.

For the example in the documentation (https://mrurq.github.io/LatinHypercubeSampling.jl/stable/man/categorical/) you can think of catWeight=1000 as a large separation between the categorical dimensions, which is similar to making separate LHC plans for each category. catWeight=0 can be interpreted as no separation between the categorical dimensions, where the category of each point is selected randomly. The risk of setting it to 0 is that all points sharing a category could become clustered on one side of the design space without any penalty. In general I would suggest using some separation, like catWeight=1, to prevent this from happening.
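
To make the "distance" interpretation concrete, here is a minimal Python sketch. It is not the package's code: the function name `mixed_distance` and the exact formula (squared differences on continuous columns, plus `cat_weight` squared whenever two points differ in a categorical column) are assumptions chosen purely for illustration.

```python
import math

def mixed_distance(p, q, is_categorical, cat_weight):
    """Distance between two plan rows (hypothetical illustration only)."""
    total = 0.0
    for a, b, cat in zip(p, q, is_categorical):
        if cat:
            # Assumption: points in different categories are cat_weight apart,
            # points in the same category are 0 apart along this dimension.
            total += cat_weight ** 2 if a != b else 0.0
        else:
            # Continuous dimensions use the unscaled integer levels directly.
            total += (a - b) ** 2
    return math.sqrt(total)

# Two continuous dimensions followed by one categorical dimension.
is_cat = [False, False, True]
p = (1, 4, 1)
q = (2, 4, 2)  # one step away in the first dimension, different category

d0 = mixed_distance(p, q, is_cat, 0.0)         # category ignored entirely
d1 = mixed_distance(p, q, is_cat, 1.0)         # category counts like one step
d_large = mixed_distance(p, q, is_cat, 1000.0) # categories effectively separated
```

With a weight of 0 the two points look like close neighbours regardless of category, so an optimiser that maximises spread has no reason to distribute each category evenly. With a very large weight, points in different categories are always far apart, so only the spacing within each category matters.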

A small note: in the paper the weight values refer to an LHC that is scaled from 0 to 1. In this package the LHC consists of unscaled integers from 1 to N, where N is the number of samples. So catWeight=1 is the same as the step distance in each dimension.
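
The scaling note can be turned into a rough conversion between the two conventions. The sketch below assumes the paper's 0 to 1 plan is a min-max rescaling of the integer levels 1..N, so that one integer step corresponds to 1/(N-1); that assumption and both helper names are hypothetical and not part of the package.

```python
def package_to_paper_weight(cat_weight, n_samples):
    """Convert a weight in integer-step units to the 0..1 scale (assumed min-max)."""
    return cat_weight / (n_samples - 1)

def paper_to_package_weight(w_paper, n_samples):
    """Convert a weight on the 0..1 scale back to integer-step units."""
    return w_paper * (n_samples - 1)

n = 101  # number of samples, so integer levels run from 1 to 101

w_paper = package_to_paper_weight(1.0, n)    # one step on the scaled plan
w_back = paper_to_package_weight(w_paper, n) # round trip to package units
```

Under this assumption, the same catWeight represents a smaller fraction of the design space as the number of samples grows.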

I see, thanks a lot!