dreamquark-ai / tabnet

PyTorch implementation of TabNet paper: https://arxiv.org/pdf/1908.07442.pdf

Home Page: https://dreamquark-ai.github.io/tabnet/

Different categorical variable values in the test set and train set

labxpub opened this issue · comments

Describe the bug

File ~/anaconda3/envs/deepl/lib/python3.11/site-packages/pytorch_tabnet/abstract_model.py:258, in TabModel.fit(self, X_train, y_train, eval_set, eval_name, eval_metric, loss_fn, weights, max_epochs, patience, batch_size, virtual_batch_size, num_workers, drop_last, callbacks, pin_memory, from_unsupervised, warm_start, augmentations, compute_importance)
253 for epoch_idx in range(self.max_epochs):
254
255 # Call method on_epoch_begin for all callbacks
256 self._callback_container.on_epoch_begin(epoch_idx)
--> 258 self._train_epoch(train_dataloader)
260 # Apply predict epoch to all eval sets
261 for eval_name, valid_dataloader in zip(eval_names, valid_dataloaders):

File ~/anaconda3/envs/deepl/lib/python3.11/site-packages/pytorch_tabnet/abstract_model.py:489, in TabModel._train_epoch(self, train_loader)
486 for batch_idx, (X, y) in enumerate(train_loader):
487 self._callback_container.on_batch_begin(batch_idx)
--> 489 batch_logs = self._train_batch(X, y)
491 self._callback_container.on_batch_end(batch_idx, batch_logs)
493 epoch_logs = {"lr": self._optimizer.param_groups[-1]["lr"]}
...
2231 # remove once script supports set_grad_enabled
2232 no_grad_embedding_renorm(weight, input, max_norm, norm_type)
-> 2233 return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)

IndexError: index out of range in self

What is the current behavior?
Hi, I am using TabNetClassifier, it is a good tool.
But I'm running into IndexError: index out of range in self.
I browsed through previous issues and learned that this can happen when the categorical variables in the test set contain more categories than those in the train set.
The workaround I have tried so far is to set cat_dims to the number of categories in the whole dataset (train and test combined), but that does not seem to work yet.
Because my dataset is relatively small, it is inevitable that some categories appear in only one of the training and test sets.
Do you have any suggestions for fixing this?
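For illustration, here is a minimal, hypothetical sketch of the failure mode described above (not the reporter's actual code), assuming the standard pytorch_tabnet TabNetClassifier API: the encoder is fit on the full dataset, but cat_dims is derived from the training split only, so a category that appears only in the test split indexes past the embedding table.

```python
# Hypothetical reproduction sketch: "d" never appears in the training rows,
# so the embedding table built from the training split is too small for it.
import numpy as np
from sklearn.preprocessing import LabelEncoder
from pytorch_tabnet.tab_model import TabNetClassifier

rng = np.random.default_rng(0)

# Column 0 is categorical; the test split contains an extra category "d".
train_cat = rng.choice(["a", "b", "c"], size=200)
test_cat = np.array(["a", "b", "c", "d"] * 10)

# Encoding train and test together hides the problem at encoding time ...
le = LabelEncoder().fit(np.concatenate([train_cat, test_cat]))
X_train = np.column_stack([le.transform(train_cat), rng.normal(size=200)])
X_test = np.column_stack([le.transform(test_cat), rng.normal(size=40)])
y_train = rng.integers(0, 2, size=200)

# ... but cat_dims computed from the training split only (3 categories)
# builds an embedding table of size 3, while "d" encodes to index 3.
clf = TabNetClassifier(
    cat_idxs=[0],
    cat_dims=[int(X_train[:, 0].max()) + 1],  # 3, not 4
    cat_emb_dim=1,
)
clf.fit(X_train, y_train, max_epochs=5, batch_size=64, drop_last=False)
clf.predict(X_test)  # raises IndexError: index out of range in self
```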

If the current behavior is a bug, please provide the steps to reproduce.

Expected behavior

Screenshots

Other relevant information:
poetry version:
python version:
Operating System:
Additional tools:

Additional context

Hello,

This is not a bug. Even if you could get rid of the error during training and inference by setting your embedding sizes to a larger value, that would not solve your problem, only silence it.

You can't expect any model to predict something meaningful for an integer-encoded category it has never seen. You would simply feed it a random representation and make predictions out of noise: garbage in, garbage out.

You need to decide in your own pipeline what happens with a "new" or "unknown" category. There are many options to pick from: replace any new category with the most frequent one, create a "rare values" category during training and map unseen categories to it, and many more. This is something you need to handle in your own preprocessing pipeline; it is not taken care of by the library, because it's important that you understand and have control over this.
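As an illustration of the "rare values" option, here is a hedged sketch of a training-side mapping that reserves an explicit unknown bucket; the helper names, the min_count threshold, and the pandas-based encoding are assumptions for the example and not part of the library:

```python
# Sketch: fold rare training categories into an "__unknown__" bucket and map
# any category unseen at training time to that same bucket before prediction.
import pandas as pd

UNKNOWN = "__unknown__"

def fit_category_maps(train_df, cat_cols, min_count=5):
    """Build per-column category-to-integer maps from training data only.

    Categories rarer than `min_count` are folded into the UNKNOWN bucket so
    the model sees that bucket during training and learns an embedding for it.
    """
    maps = {}
    for col in cat_cols:
        counts = train_df[col].astype(str).value_counts()
        kept = counts[counts >= min_count].index
        categories = [UNKNOWN] + sorted(kept)
        maps[col] = {cat: idx for idx, cat in enumerate(categories)}
    return maps

def encode(df, cat_cols, maps):
    """Encode categories to integers; anything not in the map becomes UNKNOWN (0)."""
    out = df.copy()
    for col in cat_cols:
        mapping = maps[col]
        out[col] = (
            out[col].astype(str).map(mapping).fillna(mapping[UNKNOWN]).astype(int)
        )
    return out

# Usage sketch: cat_dims comes from the training-side maps, so unseen test
# categories can never index past the embedding table.
# maps = fit_category_maps(train_df, cat_cols)
# X_train = encode(train_df, cat_cols, maps)
# X_test  = encode(test_df,  cat_cols, maps)
# cat_dims = [len(maps[c]) for c in cat_cols]
```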