breizhn / DTLN

TensorFlow 2.x implementation of the DTLN real-time speech denoising model, with TF-Lite, ONNX and real-time audio processing support.


Error during training

DyncEric opened this issue · comments

Epoch 00095: val_loss did not improve from -16.76465
2021-01-21 00:24:52.466681: W tensorflow/core/kernels/data/generator_dataset_op.cc:107] Error occurred when finalizing GeneratorDataset iterator: Failed precondition: Python interpreter state is not initialized. The process may be terminated.
[[{{node PyFunc}}]]

commented

Epoch 00095: val_loss did not improve from -16.76465
2021-01-21 00:24:52.466681: W tensorflow/core/kernels/data/generator_dataset_op.cc:107] Error occurred when finalizing GeneratorDataset iterator: Failed precondition: Python interpreter state is not initialized. The process may be terminated.
[[{{node PyFunc}}]]

I ran into this problem too, on TF 2.3, once training reaches around epoch 80. I am now training on TF 2.2 to check whether something is wrong with my project.
It's also possible that TensorFlow itself causes this problem.

Thanks, I haven't used TF 2.2; I will try it.

Did it solve the problem?

Did it solve the problem?

No, my graphics card is an RTX 3090, which doesn't support TF 2.2 and below.

Has this problem been solved yet? I'm also running into it.

commented

@yudashuixiao1
Maybe the reason is this code:

        early_stopping = EarlyStopping(monitor='val_loss', min_delta=0,
                                       patience=10, verbose=0, mode='auto', baseline=None)

When val_loss does not improve for more than patience=10 epochs, training is stopped.
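
For reference, here is a minimal sketch of how that callback could be relaxed so a val_loss plateau does not end the run after only 10 stagnant epochs. This is standard tf.keras usage rather than anything specific to this repo, and where exactly it gets wired in (presumably next to the snippet quoted above) is left to you:

        # sketch: raise the early-stopping patience so a val_loss plateau
        # does not end training after only 10 stagnant epochs
        from tensorflow.keras.callbacks import EarlyStopping

        early_stopping = EarlyStopping(monitor='val_loss', min_delta=0,
                                       patience=30, verbose=1, mode='auto',
                                       baseline=None, restore_best_weights=True)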

@DyncEric
I only have an RTX 3050, but I am pretty sure that error ("Error occurred when finalizing GeneratorDataset iterator") is a memory error.

I did change my code, since the repo is really set up for a headless GPU server. If you are running a desktop that competes for GPU RAM, the following is probably better:

        # GPU memory handling for TF 2.x; the repo's original
        # set_memory_growth lines are kept commented out below
        #physical_devices = tf.config.experimental.list_physical_devices('GPU')
        #if len(physical_devices) > 0:
        #    for device in physical_devices:
        #        tf.config.experimental.set_memory_growth(device, enable=True)
        gpus = tf.config.list_physical_devices('GPU')
        if gpus:
            # Restrict TensorFlow to allocating at most 6 GB (6144 MB) on the first GPU
            try:
                tf.config.set_logical_device_configuration(
                    gpus[0],
                    [tf.config.LogicalDeviceConfiguration(memory_limit=6144)])
                logical_gpus = tf.config.list_logical_devices('GPU')
                print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
            except RuntimeError as e:
                # Virtual devices must be set before GPUs have been initialized
                print(e)

It didn't make much difference for me, as 6144 MB is close to what gets allocated anyway. I'm going to see whether a headless server install will give me more RAM and, fingers crossed, a longer run before failure.
I wanted to try how it would do with the same keyword with noise added, and whether the noise suppression would work at higher levels if you only have a single keyword spoken by multiple speakers. But with my luck I got an RTX 3050, as I only have a 400 W PSU.

So what I did was change the ReduceLROnPlateau settings, because a reduce factor of 0.5 with a patience of just 3 seems a pretty steep LR drop-off to me. I don't know what happens when you get down to min_lr=10**(-10), which you hit pretty quickly, but TensorFlow doesn't seem to like it. I just set it to min_lr=0.00001 and will keep experimenting to see what works best, but something goes really weird when you are training at an LR as low as 10**(-10), and is it really beneficial to go that low?
I don't know what happens with the memory, but one time even my screen crashed and did that pixelation thing, so I'd say memory is involved; still, you get the last saved best checkpoint, so I guess it doesn't matter.
I went with factor=0.8, patience=1, so the LR decays more gradually, and I upped the early-stopping patience so training keeps ticking away for longer.

        # create callback for the adaptive learning rate
        reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.8,
                              patience=1, verbose=1, min_lr=0.0000001, cooldown=1)
        # create callback for early stopping
        early_stopping = EarlyStopping(monitor='val_loss', min_delta=0, 
            patience=50, verbose=1, mode='auto', baseline=None)
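
As a quick back-of-the-envelope check on that choice (assuming the starting learning rate of 0.001 reported later in this thread), factor=0.8 walks down to the LR floor much more gradually than the original factor=0.5:

        # rough check: how many ReduceLROnPlateau reductions it takes to go
        # from the starting LR down to min_lr for two different factors
        import math

        start_lr, min_lr = 1e-3, 1e-7
        for factor in (0.5, 0.8):
            steps = math.log(min_lr / start_lr) / math.log(factor)
            print(f"factor={factor}: ~{steps:.0f} reductions to reach min_lr")

That works out to roughly 13 reductions at factor=0.5 versus about 41 at factor=0.8, so with patience=1 the LR still drops on every stalled epoch, just in smaller steps.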

Such a pain to have an error that only shows up after epoch 70+ :)
But yeah, lower the factor and use a bit more patience than the current defaults, and maybe set your min_lr a bit higher. It still bails out eventually, but at least training runs for longer and you still have the last best checkpoint.
Why it comes up with the Python interpreter error on early stop eludes me, as I've tried most things now, but it's not the worst thing.

@StuartIanNaylor
Hi, I have some issues training this nice model.

I followed the documentation step by step.
The training data was created with the breizhn/DNS-Challenge repository.
I get a validation loss around 49.29,
a training loss around 0.0014,
and epoch_lr decreases from 0.0010 to 1.5259e-08.

The validation loss is far too high.
I don't think the data is corrupt, since I am using the same training-data repository you mentioned.
Can I get some hints or help? How can I get a correct validation loss?
The only thing I've changed is total_hours from 500 to 40.
I tried changing the dropout and numLayer in DTLN_model.py, but it did not improve things.

Thanks in advance.

The noisyspeech_synthesizer.cfg I used:

sampling_rate: 16000
audioformat: *.wav
audio_length: 30
silence_length: 0.2
total_hours: 40
snr_lower: -5
snr_upper: 25
randomize_snr: True
target_level_lower: -35
target_level_upper: -15
total_snrlevels: 31
clean_activity_threshold: 0.6
noise_activity_threshold: 0.0
fileindex_start: None
fileindex_end: None
is_test_set: False
noise_dir: ./datasets/noise
speech_dir: ./datasets/clean
noisy_destination: ./training_set/noisy
clean_destination: ./training_set/clean
noise_destination: ./training_set/noise
log_dir: ./logs
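
For scale (assuming the synthesizer produces exactly total_hours of audio): total_hours: 40 with audio_length: 30 means 40 * 3600 / 30 = 4,800 noisy clips before the train/validation split, versus about 60,000 clips at the default 500 hours.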

I've totally forgotten what numbers I got, but it is probably because you changed total_hours from 500 to 40.

When validation loss is much higher than training loss, it is usually a sign of overfitting: there is not enough variance in the training data.

Your model is working fine on what it has been trained on, but not on external data such as the validation set, so most likely your training data is not large or varied enough.
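
If you want to see the gap directly, here is a minimal sketch for plotting train vs. validation loss. It assumes you add a standard tf.keras.callbacks.CSVLogger('training.csv') to the callback list yourself (or point it at whatever CSV your training run already writes); the file name and the 'loss'/'val_loss' columns are Keras defaults, not something guaranteed by this repo:

        # minimal sketch: plot train vs. validation loss to see the overfitting gap
        # (assumes a CSVLogger('training.csv') callback was added to training;
        #  'epoch', 'loss' and 'val_loss' are the default Keras column names)
        import pandas as pd
        import matplotlib.pyplot as plt

        log = pd.read_csv('training.csv')
        plt.plot(log['epoch'], log['loss'], label='train loss')
        plt.plot(log['epoch'], log['val_loss'], label='val loss')
        plt.xlabel('epoch')
        plt.ylabel('loss')
        plt.legend()
        plt.show()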

@StuartIanNaylor
Thanks for the fast reply.
I've tried both 40 h and 500 h, both created automatically by the provided Python script (noisyspeech_synthesizer_multiprocessing.py).
The validation/test data was also split by the script (split_dns_corpus.py).
I also think this is overfitting, but I cannot find any solution besides managing the dataset.

I can tell you what it might be: git clone the breizhn fork, not the Microsoft one
(https://github.com/breizhn/DNS-Challenge), and create the dataset from that. As I remember, the first dataset I created was garbage, but as usual I have forgotten the details.
I have MS, so apologies for the amnesia.

@StuartIanNaylor
I am still struggling with the validation loss when training DTLN.
I've spot-checked the dataset randomly and found no problems.

How can I figure out which part of my first dataset was garbage?
And do you get a correct validation loss when you run run_training.py? Have you changed any configuration?

Sorry for the many questions; I have been struggling with this for a while.

All I remember is to use the breizhn fork and follow the exact same instructions first.

Training data preparation:
Clone the forked DNS-Challenge repository. Before cloning the repository, make sure git-lfs is installed. Also make sure your disk has enough space. I recommend downloading the data to an SSD for faster dataset creation.

Run noisyspeech_synthesizer_multiprocessing.py to create the dataset. noisyspeech_synthesizer.cfg was changed according to my training setup used for the DNS-Challenge.

Run split_dns_corpus.py to divide the dataset into training and validation data. The classic 80:20 split is applied. This file was added to the forked repository by me.

Follow the steps above exactly with the breizhn fork and it will work; that gives you a known-good baseline before trying a custom dataset.
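
As a quick sanity check after the synthesizer runs, something like this confirms it actually wrote matching noisy/clean pairs; the paths are the destination directories from the noisyspeech_synthesizer.cfg quoted above, so adjust them if yours differ:

        # sanity check: the synthesizer should write one clean file per noisy file
        # (paths taken from noisyspeech_synthesizer.cfg above; adjust if yours differ)
        import glob

        noisy = sorted(glob.glob('./training_set/noisy/*.wav'))
        clean = sorted(glob.glob('./training_set/clean/*.wav'))
        print(len(noisy), 'noisy files,', len(clean), 'clean files')
        assert len(noisy) == len(clean), 'noisy/clean file counts should match'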

commented

@yudashuixiao1 Maybe the reason is this code: early_stopping = EarlyStopping(monitor='val_loss', min_delta=0, patience=10, verbose=0, mode='auto', baseline=None). When val_loss does not improve for more than patience=10 epochs, training is stopped.

Maybe you're right. I hit the same error right after 10 epochs without improvement.