flashlight / flashlight

A C++ standalone library for machine learning

Home Page: https://fl.readthedocs.io/en/latest/

Training with continue and fork mode terminated due to unhandled system error

drremo1 opened this issue

Bug Description

Hello, I've been trying to train the sota/2019 pre-trained dev-clean transformer model for 1 more epoch using flashlight's train continue mode. However, training fails to start because the pre-trained models from wav2letter are not compatible with flashlight. I then installed wav2letter v0.2 and tried retraining the pre-trained models with train continue, but that also fails, with this error:

I0401 12:50:50.303958 29472 Train.cpp:80] Parsing command line flags
I0401 12:50:50.303997 29472 Train.cpp:81] Overriding flags should be mutable when using `continue`
terminate called after throwing an instance of 'std::runtime_error'
  what():  unhandled system error
*** Aborted at 1680324650 (unix time) try "date -d @1680324650" if you are using GNU date ***
PC: @     0x7f4669af9e87 gsignal
*** SIGABRT (@0x3e800007320) received by PID 29472 (TID 0x7f4697288380) from PID 29472; stack trace: ***
    @     0x7f468f59f980 (unknown)
    @     0x7f4669af9e87 gsignal
    @     0x7f4669afb7f1 abort
    @     0x7f466a4ee957 (unknown)
    @     0x7f466a4f4ae6 (unknown)
    @     0x7f466a4f4b21 std::terminate()
    @     0x7f466a4f4d54 __cxa_throw
    @     0x55673215c6f8 fl::detail::ncclCheck()
    @     0x55673215ddd7 fl::distributedInit()
    @     0x5567320cb387 w2l::initDistributed()
    @     0x556731e3eab2 main
    @     0x7f4669adcc87 __libc_start_main
    @     0x556731ea7e4a _start
Aborted

I also tried train fork and the error persists. The error does not occur when using train alone.

Reproduction Steps

This is what I ran:

wav2letter/build/Train continue /mnt/d/198 --minloglevel=0 --logtostderr=1 --rndv_filepath=
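
Since the stack trace shows the exception coming from fl::detail::ncclCheck() inside fl::distributedInit(), one way to get more detail might be to rerun the same command with NCCL's NCCL_DEBUG=INFO environment variable set. Below is a minimal sketch of that, not a confirmed fix. The helper names nccl_debug_env and run_with_nccl_debug are made up for illustration, and the command is copied unchanged from above, including the empty --rndv_filepath= value.

```python
import os
import shutil
import subprocess

# Command copied verbatim from the report; the empty --rndv_filepath= is kept as-is.
TRAIN_CMD = [
    "wav2letter/build/Train", "continue", "/mnt/d/198",
    "--minloglevel=0", "--logtostderr=1", "--rndv_filepath=",
]


def nccl_debug_env(base=None):
    """Return a copy of the environment with NCCL debug logging enabled.

    Hypothetical helper: NCCL_DEBUG is a real NCCL environment variable,
    the function itself is only for this sketch.
    """
    env = dict(os.environ if base is None else base)
    env["NCCL_DEBUG"] = "INFO"
    return env


def run_with_nccl_debug(cmd=TRAIN_CMD):
    """Run cmd with NCCL_DEBUG=INFO; return its exit code, or None if the binary is missing."""
    if shutil.which(cmd[0]) is None:
        return None
    return subprocess.run(cmd, env=nccl_debug_env()).returncode


if __name__ == "__main__":
    print("exit code:", run_with_nccl_debug())
```

The extra NCCL log lines printed to stderr before the abort would be the part worth attaching to this issue.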

Is there another way to train the pre-trained models for just 1 epoch?