dreamquark-ai / tabnet

PyTorch implementation of TabNet paper: https://arxiv.org/pdf/1908.07442.pdf

Home Page: https://dreamquark-ai.github.io/tabnet/


Large dataset is not training

amos-coder opened this issue · comments

Describe the bug
When I train with a large dataset, the model trains completely for the given number of epochs, but after that it keeps processing for some time and then the process is killed.

What is the current behavior?
I tried with a small dataset and it works completely fine, but when I train with a large dataset this problem occurs.

Expected behavior
I need the fit function to finish and move on to the next step.

Screenshots
Screenshot from 2023-06-12 10-57-29
Screenshot from 2023-06-12 10-48-05

Other relevant information:
Python version: 3.8.8
Operating System: Ubuntu

Additional context

Can you share the rest of the error message?

The killed process is due to an out-of-memory error.

So I would suggest that you:

  • try to reduce your chunk_size and see if it works
  • read your training data in chunks and free your memory; here X_train[start:end] is simply a copy of a chunk of a large dataset, so you are not freeing any memory on your machine but adding extra consumption.
  • do not evaluate your model on the training set; set eval_set=[(x_valid, y_valid)] only, as evaluation requires saving all predictions and targets in memory for the AUC computation.
Can you share the rest of the error message?
That's the complete error message.

read your training data in chunks and free your memory; here X_train[start:end] is simply a copy of a chunk of a large dataset, so you are not freeing any memory on your machine but adding extra consumption

=> How do I achieve that? Do I need to write my own dataloader and try it?

No need for a custom loader, just never load your entire dataset:

  • if you can do X_train[start:end], it means that you already have your entire X_train in memory
  • if your data is saved in a CSV file, just read some lines for every chunk and never load the entire dataset. If it's another format, you can almost certainly still load individual chunks rather than the whole dataset, as sketched below.
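
A minimal sketch of that idea, assuming the data lives in CSV files (train.csv, valid.csv, and the label column name are placeholders) and that your pytorch-tabnet version's fit accepts warm_start=True so successive calls continue training instead of restarting:

```python
import pandas as pd
from pytorch_tabnet.tab_model import TabNetClassifier

TRAIN_CSV = "train.csv"   # placeholder paths and column name -- adapt to your files
VALID_CSV = "valid.csv"
TARGET = "label"

clf = TabNetClassifier()

# the validation set is small enough to keep fully in memory
valid = pd.read_csv(VALID_CSV)
X_valid = valid.drop(columns=[TARGET]).values
y_valid = valid[TARGET].values

# read the training file chunk by chunk so the full dataset is never in RAM at once
for chunk in pd.read_csv(TRAIN_CSV, chunksize=100_000):
    X_chunk = chunk.drop(columns=[TARGET]).values
    y_chunk = chunk[TARGET].values
    clf.fit(
        X_chunk,
        y_chunk,
        eval_set=[(X_valid, y_valid)],  # evaluate on the validation set only
        max_epochs=1,
        batch_size=1024,
        warm_start=True,  # only if your installed version supports this flag
    )
    del X_chunk, y_chunk  # drop the chunk before reading the next one
```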

Thanks for your reply! But the problem is that I am training on binary classification data; if I load it in chunks and train, some chunks will be full of 0-labeled data and other chunks will be full of 1-labeled data, so the training will not be efficient. Please correct me if I am wrong.

Also, loading the data and training on it is not the problem here. After training for the desired number of epochs, some process keeps running; I don't know what it is, and it takes a lot of time and memory. I would be happy if you could explain what it is! Thanks

If your code runs fine with a smaller dataset, it means that the issue comes from memory, so smaller chunks should help, and not loading your entire dataset should help as well.

I have no means to reproduce your error, so I can't help you more than that.

You could preprocess your data in chunks beforehand and make sure that all your chunks have the same positive/negative ratio, as sketched below.
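
A minimal sketch of that preprocessing step, using scikit-learn's StratifiedKFold purely to cut the row indices into chunks that preserve the class ratio (the labels alone usually fit in memory even when the features do not); the file names are placeholders:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold


def stratified_chunk_indices(y, n_chunks, seed=0):
    """Yield index arrays that all keep roughly the same positive/negative ratio."""
    skf = StratifiedKFold(n_splits=n_chunks, shuffle=True, random_state=seed)
    # each "test" fold of StratifiedKFold is one stratified chunk of row indices
    for _, chunk_idx in skf.split(np.zeros((len(y), 1)), y):
        yield chunk_idx


# example: persist each chunk's row indices so training can later read only those rows
y = np.load("labels.npy")  # placeholder: the full label array, loaded on its own
for i, idx in enumerate(stratified_chunk_indices(y, n_chunks=20)):
    np.save(f"chunk_{i}_idx.npy", idx)
```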

Screenshot from 2023-06-13 10-59-02
Screenshot from 2023-06-13 10-57-34

I tried using fewer parameters and 60% of the total dataset, and I still see the processing going on for hours after training for the desired number of epochs (1 epoch).
Thanks for your help in advance

![Screenshot from 2023-06-13 12-29-14](https://github.com/dreamquark-ai/tabnet/assets/78432329/fcaea85a-b0ed-432e-a8ea-d5fbd3d40a79)
I found that line 271 is the reason for the long processing time; after I commented it out, it works perfectly fine.

The feature importance is computed by default during training; how many columns do you have?

997 columns

This (#493) should make things easier in the future.
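
Assuming that change adds a flag to skip the importance pass during fit (the compute_importance name below is based on that assumption, so check your installed version's fit signature), you could train without it and estimate importances afterwards on a subsample via the existing explain() method:

```python
import numpy as np

# clf is a TabNetClassifier as in the earlier sketch; X_train, y_train, X_valid,
# y_valid are placeholders for your own arrays
clf.fit(
    X_train,
    y_train,
    eval_set=[(X_valid, y_valid)],
    compute_importance=False,  # assumed flag from the linked change -- verify in your version
)

# later: estimate feature importances on a small random subsample instead of running
# the explanation pass over every training row of a 997-column dataset
idx = np.random.choice(len(X_train), size=10_000, replace=False)
explain_matrix, masks = clf.explain(X_train[idx])
importances = explain_matrix.sum(axis=0)
importances = importances / importances.sum()  # normalized like feature_importances_
```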