Large dataset is not training
amos-coder opened this issue · comments
Describe the bug
When I train with a large dataset, the model trains completely for the given number of epochs, but afterwards it keeps processing for some time and then the process is killed.
What is the current behavior?
I tried with a small dataset and it works completely fine, but when I train with the large dataset this problem occurs.
Expected behavior
I need the fit function to finish and move on to the next step.
Other relevant information:
python version: python 3.8.8
Operating System: ubuntu
Additional context
Can you share the rest of the error message?
The process being killed is due to an out-of-memory error.
So I would suggest to:
- try reducing your chunk_size and see if it works
- read your training data by chunks and free your memory -> here X_train[start:end] will simply be a copy of a chunk of a large dataset; you are not freeing memory on your machine but adding extra consumption
- do not evaluate your model on the training set: set eval_set=[(X_valid, y_valid)] only, as evaluation requires saving all predictions and targets in memory for the AUC computation
- your
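The "read by chunks" idea above can be sketched with only the standard library; the CSV layout (header row, label in the last column) and the chunk size are assumptions for illustration, not the library's canonical recipe:

```python
import csv
from itertools import islice

def iter_chunks(path, chunk_rows):
    """Yield (X, y) chunks from a CSV whose last column is the label.

    Only `chunk_rows` rows are held in memory at any time, so the full
    dataset is never loaded at once.
    """
    with open(path, newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        while True:
            rows = list(islice(reader, chunk_rows))
            if not rows:
                break
            X = [[float(v) for v in row[:-1]] for row in rows]
            y = [int(row[-1]) for row in rows]
            yield X, y
```

Each (X, y) pair can then be fed to a fit call; whether successive fits continue from the previous weights depends on the library and version, so check your estimator's documentation before relying on that.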
> Can you share the rest of the error message?
That's the complete error message.
> read your training data by chunks and free your memory -> here X_train[start:end] will simply be a copy of a chunk of a large dataset; you are not freeing memory on your machine but adding extra consumption

=> How do I achieve that? Do I need to write my own dataloader and try it?
No need for a custom loader, just never load your entire dataset:
- if you can do X_train[start:end], it means you already have your entire X_train in memory
- if your data is saved in a CSV file, just read some lines for every chunk and never load the entire dataset. If it's another format, you can certainly still load only some chunks rather than the entire dataset
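For the CSV case specifically, pandas can stream the file in fixed-size chunks so the whole table never sits in memory at once. A minimal sketch, assuming your labels live in a column called `target` (the function name, column name, and default chunk size are illustrative):

```python
import pandas as pd

def train_by_chunks(csv_path, train_step, chunk_rows=100_000):
    """Stream a CSV in fixed-size chunks and call `train_step(X, y)`
    on each one; only one chunk is ever resident in memory.

    `read_csv` with `chunksize` returns an iterator of DataFrames
    instead of the full table.
    """
    for chunk in pd.read_csv(csv_path, chunksize=chunk_rows):
        X = chunk.drop(columns=["target"]).to_numpy()
        y = chunk["target"].to_numpy()
        train_step(X, y)
```

Tune `chunk_rows` to what comfortably fits in your RAM; too small a value just adds I/O overhead.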
Thanks for your reply! But the problem is that I am training on binary classification data: if I load it in chunks and train, some chunks will be full of 0-labeled data and other chunks full of 1-labeled data, so the training will not be efficient. Please correct me if I'm wrong.
Also, loading the data and training on it are not the problem here. After training for the desired number of epochs, some process keeps running; I don't know what it is, but it takes a lot of time and memory. I'd be happy if you could explain what it is! Thanks
If your code runs fine with a smaller dataset, the issue comes from memory, so smaller chunks should help, and not loading your entire dataset should help as well.
I have no means to reproduce your error, so I can't help you more than that.
You could preprocess your data by chunk beforehand and make sure that all your chunks have the same positive/negative ratio.
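That preprocessing step can be sketched in plain Python: shuffle each class's row indices and deal them round-robin across the chunks, so every chunk keeps the overall positive/negative ratio (the function name and chunk count are illustrative, not part of any library):

```python
import random

def stratified_chunk_indices(labels, n_chunks, seed=0):
    """Split row indices into n_chunks chunks that all keep roughly
    the same class ratio as the full label column.

    Indices of each class are shuffled, then dealt round-robin to the
    chunks, so no chunk ends up with only one class.
    """
    rng = random.Random(seed)
    chunks = [[] for _ in range(n_chunks)]
    for cls in set(labels):
        idx = [i for i, lab in enumerate(labels) if lab == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            chunks[pos % n_chunks].append(i)
    for c in chunks:
        rng.shuffle(c)  # mix the classes within each chunk
    return chunks
```

You would then read (or slice) each chunk's rows by these indices and train on them in turn; scikit-learn's StratifiedKFold offers a similar ratio-preserving split if you prefer a library solution.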
![Screenshot from 2023-06-13 12-29-14](https://github.com/dreamquark-ai/tabnet/assets/78432329/fcaea85a-b0ed-432e-a8ea-d5fbd3d40a79)
I found that line 271 is the reason for the long processing time; after I commented it out, everything works perfectly fine.
The feature importance is computed by default during training. How many columns do you have?
997 columns
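For scale: if the post-fit importance step materialises one float value per sample per feature (an assumption about the internals, not something documented in this thread), 997 columns makes that matrix enormous on a large dataset:

```python
def importance_matrix_bytes(n_samples, n_features, bytes_per_value=8):
    """Rough size of an n_samples x n_features float64 matrix, such as
    per-sample feature masks aggregated for feature importance."""
    return n_samples * n_features * bytes_per_value

# 10 million rows x 997 columns of float64 is already ~80 GB,
# enough on its own to trigger the OOM killer after training ends
gb = importance_matrix_bytes(10_000_000, 997) / 1e9
```

Recent pytorch-tabnet releases expose a switch to skip this computation at fit time; check the fit signature of your installed version rather than assuming a specific argument name.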