one_hot_encoder in Test

Question

dotRData opened this issue 7 years ago

How do we use one_hot_encoder on test data?
Let's say a new value appears in some column:
it will add an extra column to the test dataset, which is not a problem.

But let's say some values are missing from the test data:
one_hot_encoder will then not create the corresponding columns,
and that might create a problem while scoring.

ELToulemonde · Answer 1 · Fri Jan 12 2018 17:49:14 GMT+0800 (China Standard Time)

Hi,

That's a good one.

A quick fix: I would recommend using sameShape, which allows you to control the columns of your test set.

Beyond that, I don't know what the best approach is. Do you have an example of another package that lets you have the same columns in train and test?

Rahul Anand · Answer 2 · Fri Jan 12 2018 21:08:08 GMT+0800 (China Standard Time)

Currently I am using this:
testData[, setdiff(names(trainData), names(testData)):=0]

I thought you might have some better way.
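
For readers who also want to drop columns that exist only in the test set and keep the train column order, the one-liner above could be extended along these lines. This is only a sketch in base R on plain data frames; `align_columns` is a hypothetical helper, not a function from the package.

```r
# Hypothetical helper (not part of any package): give `test` exactly the
# columns listed in `train_cols`, in the same order.
align_columns <- function(test, train_cols, fill = 0) {
  missing_cols <- setdiff(train_cols, names(test))
  # Add columns seen in train but absent from test, filled with `fill`.
  for (col in missing_cols) {
    test[[col]] <- rep(fill, nrow(test))
  }
  # Keep only the train columns, in train order; extra test columns are dropped.
  test[, train_cols, drop = FALSE]
}

train <- data.frame(x = c(1, 2), color.red = c(1, 0), color.blue = c(0, 1))
test  <- data.frame(x = 3, color.red = 1, color.green = 0)

aligned <- align_columns(test, names(train))
```

Filling with 0 matches the one-hot meaning "this level is not present in the row"; for non-encoded columns another fill value may be more appropriate.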

ELToulemonde · Answer 3 · Tue Jan 16 2018 16:17:09 GMT+0800 (China Standard Time)

I guess a future modification would be to make one_hot_encoder work the way fastScale does, for example...

First, a buildEncoding function would build the encoding parameters, which could then be applied with one_hot_encoder to both train and test.

This feature should be developed in the next version.
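
To illustrate the build-then-apply idea, here is a minimal sketch in base R. The names `build_encoding_sketch` and `apply_encoding_sketch` are hypothetical and this is not the package's implementation; it only shows how storing the train levels guarantees identical columns on train and test.

```r
# Hypothetical sketch, not the dataPreparation API.
# Step 1: learn, on train only, which levels each column contains.
build_encoding_sketch <- function(data, cols) {
  lapply(setNames(cols, cols), function(col) {
    sort(unique(as.character(data[[col]])))
  })
}

# Step 2: apply the stored levels to any data set with the same columns.
apply_encoding_sketch <- function(data, encoding) {
  out <- data[, setdiff(names(data), names(encoding)), drop = FALSE]
  for (col in names(encoding)) {
    values <- as.character(data[[col]])
    for (lvl in encoding[[col]]) {
      # 1 where the value equals this level, 0 otherwise
      # (values unseen in train get 0 in every column).
      out[[paste(col, lvl, sep = ".")]] <- as.integer(values %in% lvl)
    }
  }
  out
}

train <- data.frame(id = 1:3, color = c("red", "blue", "red"),
                    stringsAsFactors = FALSE)
test  <- data.frame(id = 4:5, color = c("green", "red"),
                    stringsAsFactors = FALSE)

encoding  <- build_encoding_sketch(train, "color")
train_enc <- apply_encoding_sketch(train, encoding)
test_enc  <- apply_encoding_sketch(test, encoding)
```

Here "green" never appeared in train, so it produces no new column in `test_enc`, and "blue" is absent from test but still gets its column.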

Rahul Anand · Answer 4 · Tue Jan 16 2018 22:37:41 GMT+0800 (China Standard Time)

Yes, buildEncoding might also take as input a minimum frequency for the levels present in the features. That way we would have control over the final dimension of the dataset.
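
A rough sketch of what such a filter could look like, assuming "frequency" means the share of rows in which a level appears. `frequent_levels_sketch` and its `min_frequency` argument are hypothetical names used only for illustration.

```r
# Hypothetical sketch: keep only the levels whose share of rows
# reaches `min_frequency` (a value between 0 and 1).
frequent_levels_sketch <- function(x, min_frequency = 0) {
  shares <- table(as.character(x)) / length(x)
  names(shares)[as.vector(shares) >= min_frequency]
}

colors <- c("red", "red", "red", "blue", "blue", "green")
kept <- frequent_levels_sketch(colors, min_frequency = 0.3)
```

With these six rows, "green" covers one row out of six and is filtered out, so only two one-hot columns would be built instead of three.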

ELToulemonde · Answer 5 · Thu Jan 18 2018 01:30:25 GMT+0800 (China Standard Time)

Good idea. I added it. It is implemented in branch v0.3.5, which will be merged soon.