Fine-tune guidance

Question

be-redAsmara opened this issue 3 years ago · comments

Hi, I really loved the work. I'm trying to fine-tune the downloaded models (fetched with download_model.py) on another domain. Could you tell me where to put the data and which command to run for training?

Thank you

be-redAsmara commented 3 years ago

Yes

be-redAsmara · Answer 1 · Fri Oct 08 2021 16:55:18 GMT+0800 (China Standard Time)

for both ivos and svos

Rex Cheng · Answer 2 · Sat Oct 09 2021 00:43:47 GMT+0800 (China Standard Time)

You can prepare your data following the DAVIS/YouTubeVOS format, change the data paths (see util/hyper_para.py), and train using the "main training" command in the readme. You can change load_network to load the network file that you want to fine-tune on.
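Before launching training on a custom dataset, it can help to confirm that the folders actually follow the DAVIS convention. The sketch below is not part of the repository. It assumes the usual DAVIS layout of JPEGImages/480p/<video>/*.jpg with matching Annotations/480p/<video>/*.png and a mask for every frame; the function name check_davis_layout and the default resolution folder are illustrative choices.

```python
from pathlib import Path


def check_davis_layout(root, resolution="480p"):
    """Return a list of problems found under a DAVIS-style dataset root.

    Assumed layout (standard DAVIS, not verified against this repository):
      <root>/JPEGImages/<resolution>/<video>/*.jpg
      <root>/Annotations/<resolution>/<video>/*.png
    """
    root = Path(root)
    image_root = root / "JPEGImages" / resolution
    mask_root = root / "Annotations" / resolution

    problems = [f"missing folder: {folder}"
                for folder in (image_root, mask_root)
                if not folder.is_dir()]
    if problems:
        return problems

    for video_dir in sorted(p for p in image_root.iterdir() if p.is_dir()):
        frames = {p.stem for p in video_dir.glob("*.jpg")}
        mask_dir = mask_root / video_dir.name
        masks = {p.stem for p in mask_dir.glob("*.png")} if mask_dir.is_dir() else set()

        if not frames:
            problems.append(f"{video_dir.name}: no .jpg frames")
        # Frames whose name has no matching mask file.
        unlabeled = sorted(frames - masks)
        if unlabeled:
            problems.append(
                f"{video_dir.name}: {len(unlabeled)} frame(s) without a mask, "
                f"first is {unlabeled[0]}"
            )
    return problems
```

An empty list means the layout matches these assumptions. For YouTubeVOS-style data, where only some frames are annotated, the per-frame mask check would need to be relaxed.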

be-redAsmara · Answer 3 · Tue Oct 26 2021 21:38:25 GMT+0800 (China Standard Time)

Hello again,
I have prepared my data following the DAVIS format and made the changes to util/hyper_para.py. I used the following main training command:
CUDA_VISIBLE_DEVICES=0,1 OMP_NUM_THREADS=4 python -m torch.distributed.launch --master_port 7550 --nproc_per_node=2 train.py --load_prop saves/propagation_model.pth --stage 1 --id retrain_s012 --load_network saves/propagation_model.pth

but I am getting this error

           CHILD PROCESS FAILED WITH NO ERROR_FILE

CHILD PROCESS FAILED WITH NO ERROR_FILE
Child process 32045 (local_rank 0) FAILED (exitcode 1)
Error msg: Process failed with exitcode 1
Without writing an error file to <N/A>.
While this DOES NOT affect the correctness of your application,
no trace information about the error will be available for inspection.
Consider decorating your top level entrypoint function with
torch.distributed.elastic.multiprocessing.errors.record. Example:

from torch.distributed.elastic.multiprocessing.errors import record

@record
def trainer_main(args):
    # do train

warnings.warn(_no_error_file_warning_msg(rank, failure))
Traceback (most recent call last):
  File "/home/bereket/anaconda3/envs/mivoss/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/bereket/anaconda3/envs/mivoss/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/bereket/anaconda3/envs/mivoss/lib/python3.6/site-packages/torch/distributed/launch.py", line 173, in <module>
    main()
  File "/home/bereket/anaconda3/envs/mivoss/lib/python3.6/site-packages/torch/distributed/launch.py", line 169, in main
    run(args)
  File "/home/bereket/anaconda3/envs/mivoss/lib/python3.6/site-packages/torch/distributed/run.py", line 624, in run
    )(*cmd_args)
  File "/home/bereket/anaconda3/envs/mivoss/lib/python3.6/site-packages/torch/distributed/launcher/api.py", line 116, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/bereket/anaconda3/envs/mivoss/lib/python3.6/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/home/bereket/anaconda3/envs/mivoss/lib/python3.6/site-packages/torch/distributed/launcher/api.py", line 247, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

        train.py FAILED

=======================================
Root Cause:
[0]:
time: 2021-10-26_15:33:37
rank: 0 (local_rank: 0)
exitcode: 1 (pid: 32045)
error_file: <N/A>
msg: "Process failed with exitcode 1"

Other Failures:
[1]:
time: 2021-10-26_15:33:37
rank: 1 (local_rank: 1)
exitcode: 1 (pid: 32046)
error_file: <N/A>
msg: "Process failed with exitcode 1"

Rex Cheng · Answer 4 · Wed Oct 27 2021 00:02:11 GMT+0800 (China Standard Time)

Do you encounter the same error with standard training datasets?

Rex Cheng · Answer 5 · Wed Oct 27 2021 00:22:36 GMT+0800 (China Standard Time)

It seems to be a general PyTorch/DDP problem with your environment... I don't really have a suggestion/solution.
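The launcher output above only reports that each child exited with code 1, so the first step is to recover the real exception. The log itself suggests PyTorch's record decorator. The sketch below is a standard-library stand-in for the same idea, not that decorator and not code from this repository. It assumes the launcher exports RANK to each worker; TRAIN_ERROR_DIR and the function names are hypothetical.

```python
import functools
import os
import traceback


def log_failures(func):
    """Write the traceback of any exception to a per-rank file, then re-raise."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception:
            # RANK is assumed to be set by the PyTorch launcher; fall back to "0".
            rank = os.environ.get("RANK", "0")
            # TRAIN_ERROR_DIR is a hypothetical variable used only in this sketch.
            log_dir = os.environ.get("TRAIN_ERROR_DIR", ".")
            path = os.path.join(log_dir, f"train_error_rank{rank}.log")
            with open(path, "w") as f:
                f.write(traceback.format_exc())
            raise
    return wrapper


@log_failures
def trainer_main():
    # Placeholder for the real training entry point.
    raise RuntimeError("example failure")
```

Wrapping the top-level training function this way leaves one file per rank containing the actual Python traceback, which is usually enough to tell a data-path or checkpoint-loading error apart from a genuine DDP setup problem.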

be-redAsmara · Answer 6 · Wed Oct 27 2021 20:20:15 GMT+0800 (China Standard Time)

OK, but just to make sure I'm using the right command: can you confirm that this is the command to use on the standard training datasets with the pre-trained models?

CUDA_VISIBLE_DEVICES=0,1 OMP_NUM_THREADS=4 python -m torch.distributed.launch --master_port 7550 --nproc_per_node=2 train.py --load_prop saves/propagation_model.pth --stage 1 --id retrain_s012 --load_network saves/propagation_model.pth

Rex Cheng · Answer 7 · Thu Oct 28 2021 00:53:48 GMT+0800 (China Standard Time)

Are you trying to finetune the fusion module or the propagation module? If you are finetuning the former, you should load the network file for the fusion module in load_network, not the propagation module.
