Training with continue and fork mode terminated due to unhandled system error
drremo1 opened this issue
Bug Description
Hello, I've been trying to train the sota/2019 pre-trained dev-clean transformer model for 1 more epoch using flashlight's train continue mode. However, training fails to start because the pre-trained models from wav2letter are not compatible with flashlight. I then installed wav2letter v0.2 to retrain the pre-trained models with train continue, but that fails as well and shows this error:
I0401 12:50:50.303958 29472 Train.cpp:80] Parsing command line flags
I0401 12:50:50.303997 29472 Train.cpp:81] Overriding flags should be mutable when using `continue`
terminate called after throwing an instance of 'std::runtime_error'
what(): unhandled system error
*** Aborted at 1680324650 (unix time) try "date -d @1680324650" if you are using GNU date ***
PC: @ 0x7f4669af9e87 gsignal
*** SIGABRT (@0x3e800007320) received by PID 29472 (TID 0x7f4697288380) from PID 29472; stack trace: ***
@ 0x7f468f59f980 (unknown)
@ 0x7f4669af9e87 gsignal
@ 0x7f4669afb7f1 abort
@ 0x7f466a4ee957 (unknown)
@ 0x7f466a4f4ae6 (unknown)
@ 0x7f466a4f4b21 std::terminate()
@ 0x7f466a4f4d54 __cxa_throw
@ 0x55673215c6f8 fl::detail::ncclCheck()
@ 0x55673215ddd7 fl::distributedInit()
@ 0x5567320cb387 w2l::initDistributed()
@ 0x556731e3eab2 main
@ 0x7f4669adcc87 __libc_start_main
@ 0x556731ea7e4a _start
Aborted
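As a reading aid for the trace above, here is a minimal Python sketch that skips the throw/abort machinery (gsignal, abort, std::terminate(), __cxa_throw, and unresolved frames) and reports the innermost remaining frame. The helper names `parse_frames`, `first_application_frame`, and `RUNTIME_SYMBOLS` are hypothetical and made up for this illustration; they are not part of wav2letter, flashlight, or glog. The frame format is assumed from the lines pasted in this issue.

```python
import re

# Frame lines in the pasted trace look like:
#   "@ 0x55673215c6f8 fl::detail::ncclCheck()"
# (format assumed from this issue's log, not from any glog specification)
FRAME_RE = re.compile(r"^\s*@\s+(0x[0-9a-f]+)\s+(.+?)\s*$")

# Hypothetical skip list: frames that belong to throwing/aborting,
# not to the code that detected the failure.
RUNTIME_SYMBOLS = {"(unknown)", "gsignal", "abort", "std::terminate()", "__cxa_throw"}


def parse_frames(trace):
    """Return (address, symbol) pairs for every frame line in the trace."""
    frames = []
    for line in trace.splitlines():
        match = FRAME_RE.match(line)
        if match:
            frames.append((match.group(1), match.group(2)))
    return frames


def first_application_frame(frames):
    """Return the innermost symbol that is not in RUNTIME_SYMBOLS, or None."""
    for _address, symbol in frames:
        if symbol not in RUNTIME_SYMBOLS:
            return symbol
    return None


# Frame lines copied from the stack trace in this issue.
TRACE = """\
@ 0x7f468f59f980 (unknown)
@ 0x7f4669af9e87 gsignal
@ 0x7f4669afb7f1 abort
@ 0x7f466a4ee957 (unknown)
@ 0x7f466a4f4ae6 (unknown)
@ 0x7f466a4f4b21 std::terminate()
@ 0x7f466a4f4d54 __cxa_throw
@ 0x55673215c6f8 fl::detail::ncclCheck()
@ 0x55673215ddd7 fl::distributedInit()
@ 0x5567320cb387 w2l::initDistributed()
@ 0x556731e3eab2 main
@ 0x7f4669adcc87 __libc_start_main
@ 0x556731ea7e4a _start
"""

frames = parse_frames(TRACE)
origin = first_application_frame(frames)
print(origin)
```

For this trace the sketch reports fl::detail::ncclCheck(), called from fl::distributedInit() and w2l::initDistributed(). That suggests the exception is raised while the distributed (NCCL) setup is being initialized, before any training step runs.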
I also tried train fork, and the error persists. The error does not occur when using train alone.
Reproduction Steps
This is what I ran:
wav2letter/build/Train continue /mnt/d/198 --minloglevel=0 --logtostderr=1 --rndv_filepath=
Is there another way to train the pre-trained models for just 1 epoch?