dreamquark-ai / tabnet

PyTorch implementation of TabNet paper: https://arxiv.org/pdf/1908.07442.pdf

Home Page: https://dreamquark-ai.github.io/tabnet/


Large dataset is not training

amos-coder opened this issue · comments

Describe the bug
When I train with a large dataset, the model trains completely for the given number of epochs, but after that it keeps processing for some time and then the process is killed.

What is the current behavior?
I tried with a small dataset and it works completely fine, but when I train with a large dataset this problem occurs.

Expected behavior
I need the fit function to finish and move on to the next step.

Screenshots
Screenshot from 2023-06-12 10-57-29
Screenshot from 2023-06-12 10-48-05

Other relevant information:
Python version: 3.8.8
Operating System: Ubuntu

Additional context

Can you share the rest of the error message?

The killed process is due to an out-of-memory error.

So I would suggest that you:

  • try to reduce your chunk_size and see if it works
  • read your training data in chunks and free your memory; here X_train[start:end] is simply a copy of a chunk of a large dataset, so you are not freeing any memory on your machine but adding extra consumption.
  • do not evaluate your model on the training set; set eval_set=[(x_valid, y_valid)] only, as evaluation requires saving all predictions and targets in memory for the AUC computation.
Can you share the rest of the error message?
That's the complete error message.

read your training data in chunks and free your memory; here X_train[start:end] is simply a copy of a chunk of a large dataset, so you are not freeing any memory on your machine but adding extra consumption

=> How do I achieve that? Do I need to write my own dataloader and try it?

No need for a custom loader, just never load your entire dataset:

  • if you can do X_train[start:end], it means that you already have your entire X_train in memory
  • if your data is saved in a CSV file, just read some lines for every chunk and never load the entire dataset. If it's another format, you can almost certainly still load individual chunks rather than the whole dataset, as sketched below.
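
A minimal sketch of that idea, assuming the data lives in CSV files (train.csv, valid.csv, and the label column name are placeholders) and that your pytorch-tabnet version's fit accepts warm_start=True so successive calls continue training instead of restarting:

```python
import pandas as pd
from pytorch_tabnet.tab_model import TabNetClassifier

TRAIN_CSV = "train.csv"   # placeholder paths and column name -- adapt to your files
VALID_CSV = "valid.csv"
TARGET = "label"

clf = TabNetClassifier()

# the validation set is small enough to keep fully in memory
valid = pd.read_csv(VALID_CSV)
X_valid = valid.drop(columns=[TARGET]).values
y_valid = valid[TARGET].values

# read the training file chunk by chunk so the full dataset is never in RAM at once
for chunk in pd.read_csv(TRAIN_CSV, chunksize=100_000):
    X_chunk = chunk.drop(columns=[TARGET]).values
    y_chunk = chunk[TARGET].values
    clf.fit(
        X_chunk,
        y_chunk,
        eval_set=[(X_valid, y_valid)],  # evaluate on the validation set only
        max_epochs=1,
        batch_size=1024,
        warm_start=True,  # only if your installed version supports this flag
    )
    del X_chunk, y_chunk  # drop the chunk before reading the next one
```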

Thanks for your reply! But the problem is that I am training on binary classification data; if I load it in chunks and train, some chunks will be full of 0-labeled data and other chunks will be full of 1-labeled data, so the training will not be efficient. Please correct me if I am wrong.

Also, loading the data and training on it is not the problem here. After training for the desired number of epochs, some process keeps running; I don't know what it is, and it takes a lot of time and memory. I would be happy if you could explain what it is! Thanks

If your code runs fine with a smaller dataset, it means that the issue comes from memory, so smaller chunks should help, and not loading your entire dataset should help as well.

I have no means to reproduce your error, so I can't help you more than that.

You could preprocess your data in chunks beforehand and make sure that all your chunks have the same positive/negative ratio, as sketched below.
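
A minimal sketch of that preprocessing step, using scikit-learn's StratifiedKFold purely to cut the row indices into chunks that preserve the class ratio (the labels alone usually fit in memory even when the features do not); the file names are placeholders:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold


def stratified_chunk_indices(y, n_chunks, seed=0):
    """Yield index arrays that all keep roughly the same positive/negative ratio."""
    skf = StratifiedKFold(n_splits=n_chunks, shuffle=True, random_state=seed)
    # each "test" fold of StratifiedKFold is one stratified chunk of row indices
    for _, chunk_idx in skf.split(np.zeros((len(y), 1)), y):
        yield chunk_idx


# example: persist each chunk's row indices so training can later read only those rows
y = np.load("labels.npy")  # placeholder: the full label array, loaded on its own
for i, idx in enumerate(stratified_chunk_indices(y, n_chunks=20)):
    np.save(f"chunk_{i}_idx.npy", idx)
```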

Screenshot from 2023-06-13 10-59-02
Screenshot from 2023-06-13 10-57-34

I tried using fewer parameters and 60% of the total dataset, and I still see the processing going on for hours after training for the desired number of epochs (1 epoch).
Thanks for your help in advance

![Screenshot from 2023-06-13 12-29-14](https://github.com/dreamquark-ai/tabnet/assets/78432329/fcaea85a-b0ed-432e-a8ea-d5fbd3d40a79)
I found that line 271 is the reason for the long processing time; after I commented it out, it works perfectly fine.

The feature importance is computed by default during training; how many columns do you have?

997 columns

This (#493) should make things easier in the future.
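
Assuming that change adds a flag to skip the importance pass during fit (the compute_importance name below is based on that assumption, so check your installed version's fit signature), you could train without it and estimate importances afterwards on a subsample via the existing explain() method:

```python
import numpy as np

# clf is a TabNetClassifier as in the earlier sketch; X_train, y_train, X_valid,
# y_valid are placeholders for your own arrays
clf.fit(
    X_train,
    y_train,
    eval_set=[(X_valid, y_valid)],
    compute_importance=False,  # assumed flag from the linked change -- verify in your version
)

# later: estimate feature importances on a small random subsample instead of running
# the explanation pass over every training row of a 997-column dataset
idx = np.random.choice(len(X_train), size=10_000, replace=False)
explain_matrix, masks = clf.explain(X_train[idx])
importances = explain_matrix.sum(axis=0)
importances = importances / importances.sum()  # normalized like feature_importances_
```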