hkchengrex / MiVOS

[CVPR 2021] Modular Interactive Video Object Segmentation: Interaction-to-Mask, Propagation and Difference-Aware Fusion. Semi-supervised VOS as well!

Home Page: https://hkchengrex.com/MiVOS/

Fine-tune guidance

be-redAsmara opened this issue · comments

Hi, I really loved the work. I'm trying to fine-tune the downloaded models (obtained with download_model.py) on another domain. Could you tell me where to put the data and which command to run for training?

Thank you

This is for both iVOS and sVOS.

You can prepare your data following the DAVIS/YouTubeVOS format, change the data paths (see util/hyper_para.py), and train using the "main training" command in the readme. You can change load_network to load the network file that you want to fine-tune on.
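Before launching training on a custom dataset, it can help to sanity-check the folder layout. The following is a minimal sketch, not part of MiVOS: it assumes the usual DAVIS arrangement (`JPEGImages/480p/<video>/*.jpg` with matching `Annotations/480p/<video>/*.png`) and that every frame has a mask. The function name `check_davis_layout` and the `resolution` parameter are hypothetical; adjust them to whatever paths you set in util/hyper_para.py.

```python
from pathlib import Path


def check_davis_layout(root, resolution="480p"):
    """Return a list of problems found under a DAVIS-style dataset root.

    Assumed layout (an assumption, verify against your own data):
        <root>/JPEGImages/<resolution>/<video>/<frame>.jpg
        <root>/Annotations/<resolution>/<video>/<frame>.png
    """
    root = Path(root)
    img_root = root / "JPEGImages" / resolution
    ann_root = root / "Annotations" / resolution

    # Both top-level folders must exist before per-video checks make sense.
    problems = [f"missing directory: {d}" for d in (img_root, ann_root) if not d.is_dir()]
    if problems:
        return problems

    for video in sorted(p for p in img_root.iterdir() if p.is_dir()):
        frames = {f.stem for f in video.glob("*.jpg")}
        ann_dir = ann_root / video.name
        masks = {f.stem for f in ann_dir.glob("*.png")} if ann_dir.is_dir() else set()

        if not frames:
            problems.append(f"{video.name}: no .jpg frames")
        # Frames whose mask file is absent from the annotation folder.
        missing = sorted(frames - masks)
        if missing:
            problems.append(
                f"{video.name}: {len(missing)} frame(s) without a mask, e.g. {missing[0]}"
            )
    return problems
```

An empty list means the layout matches the assumption above; otherwise each entry names the video and the first offending frame. For YouTubeVOS-style data, where only some frames are annotated, the "every frame has a mask" check would need to be relaxed.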

Hello again,
I have prepared my data following the DAVIS format and made the changes to util/hyper_para.py. I used the following main training command:
CUDA_VISIBLE_DEVICES=0,1 OMP_NUM_THREADS=4 python -m torch.distributed.launch --master_port 7550 --nproc_per_node=2 train.py --load_prop saves/propagation_model.pth --stage 1 --id retrain_s012 --load_network saves/propagation_model.pth

but I am getting this error


           CHILD PROCESS FAILED WITH NO ERROR_FILE                

CHILD PROCESS FAILED WITH NO ERROR_FILE
Child process 32045 (local_rank 0) FAILED (exitcode 1)
Error msg: Process failed with exitcode 1
Without writing an error file to <N/A>.
While this DOES NOT affect the correctness of your application,
no trace information about the error will be available for inspection.
Consider decorating your top level entrypoint function with
torch.distributed.elastic.multiprocessing.errors.record. Example:

from torch.distributed.elastic.multiprocessing.errors import record

@record
def trainer_main(args):
    # do train


warnings.warn(_no_error_file_warning_msg(rank, failure))
Traceback (most recent call last):
  File "/home/bereket/anaconda3/envs/mivoss/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/bereket/anaconda3/envs/mivoss/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/bereket/anaconda3/envs/mivoss/lib/python3.6/site-packages/torch/distributed/launch.py", line 173, in <module>
    main()
  File "/home/bereket/anaconda3/envs/mivoss/lib/python3.6/site-packages/torch/distributed/launch.py", line 169, in main
    run(args)
  File "/home/bereket/anaconda3/envs/mivoss/lib/python3.6/site-packages/torch/distributed/run.py", line 624, in run
    )(*cmd_args)
  File "/home/bereket/anaconda3/envs/mivoss/lib/python3.6/site-packages/torch/distributed/launcher/api.py", line 116, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/bereket/anaconda3/envs/mivoss/lib/python3.6/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/home/bereket/anaconda3/envs/mivoss/lib/python3.6/site-packages/torch/distributed/launcher/api.py", line 247, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:


        train.py FAILED            

=======================================
Root Cause:
[0]:
time: 2021-10-26_15:33:37
rank: 0 (local_rank: 0)
exitcode: 1 (pid: 32045)
error_file: <N/A>
msg: "Process failed with exitcode 1"

Other Failures:
[1]:
time: 2021-10-26_15:33:37
rank: 1 (local_rank: 1)
exitcode: 1 (pid: 32046)
error_file: <N/A>
msg: "Process failed with exitcode 1"


Do you encounter the same error with standard training datasets?

It seems to be a general PyTorch/DDP problem with your environment... I don't really have a suggestion/solution.
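The launcher summary above only reports that both workers exited with code 1; it does not show why. The log itself suggests decorating the entry point with `record` so that the underlying exception is captured. A minimal sketch of that suggestion follows. It assumes the top-level code of the training script can be moved into a function; `trainer_main` is the placeholder name from the log message, not a function that exists in MiVOS, and the fallback `record` is only there so the sketch runs where torch is not installed.

```python
try:
    # Available in PyTorch releases that ship torch.distributed.elastic.
    from torch.distributed.elastic.multiprocessing.errors import record
except ImportError:
    # Assumption: a no-op stand-in so this sketch still runs without torch.
    def record(fn):
        return fn


@record
def trainer_main(args):
    """Hypothetical wrapper around the body of a training script.

    Any exception raised in here is handed to torchelastic's error handler
    before it propagates, so the launcher can report the real cause.
    """
    if not args:
        raise ValueError("no arguments given")
    # Placeholder for the actual training loop.
    return 0
```

In a real script the function would be called from the `if __name__ == "__main__":` block with the parsed command-line arguments.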

OK, but just to make sure I'm using the right command: can you confirm that this is the command to use on the standard training datasets with the pre-trained models?

CUDA_VISIBLE_DEVICES=0,1 OMP_NUM_THREADS=4 python -m torch.distributed.launch --master_port 7550 --nproc_per_node=2 train.py --load_prop saves/propagation_model.pth --stage 1 --id retrain_s012 --load_network saves/propagation_model.pth

Are you trying to finetune the fusion module or the propagation module? If you are finetuning the former, you should load the network file for the fusion module in load_network, not the propagation module.