hkchengrex / STCN

[NeurIPS 2021] Rethinking Space-Time Networks with Improved Memory Coverage for Efficient Video Object Segmentation

Home Page: https://hkchengrex.com/STCN/

How to train this model on my own dataset?

Corawill opened this issue

It's excellent work, thanks for your code!
However, I want to train this model on my own dataset, which is a long video with 200 labeled frames. Because my dataset is a unique scenario, I don't want to train on the static datasets. Or maybe I can train the static stage on my own dataset. I changed yv_root to point to my dataset and commented out davis_root.
I then ran the training command with the --load_network [path_to_trained_s0.pth] part removed:

CUDA_VISIBLE_DEVICES=0,1 OMP_NUM_THREADS=4 python3 -m torch.distributed.launch --master_port 9842 --nproc_per_node=2 train.py --id retrain_s012 --stage 2

but I got this error:
Concat dataset size: 0
Renewed with skip: 5
Concat dataset size: 0
Renewed with skip: 5
Traceback (most recent call last):
  File "train.py", line 163, in <module>
    total_epoch = math.ceil(para['iterations']/len(train_loader))
ZeroDivisionError: division by zero
Traceback (most recent call last):
  File "train.py", line 163, in <module>
    total_epoch = math.ceil(para['iterations']/len(train_loader))
ZeroDivisionError: division by zero
Killing subprocess 3645
Killing subprocess 3646
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 340, in <module>
    main()
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 326, in main
    sigkill_handler(signal.SIGTERM, None) # not coming back
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 301, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', '-u', 'train.py', '--local_rank=1', '--id', 'retrain_s012', '--stage', '2']' returned non-zero exit status 1.

How can I fix it? Or could you give me some guidance on training this model on my own dataset, please?
Thanks for your response.

Hello,

This line

  File "train.py", line 163, in <module>
    total_epoch = math.ceil(para['iterations']/len(train_loader))
ZeroDivisionError: division by zero

tells me that the data loader cannot find any of your data: len(train_loader) is 0, which matches the "Concat dataset size: 0" line in your output. In general, we cannot provide debugging support for custom datasets because everyone's data is different. In this case, it might have something to do with subset loading:

STCN/train.py

Lines 92 to 93 in 23a2141

yv_dataset = VOSDataset(path.join(yv_root, 'JPEGImages'),
                        path.join(yv_root, 'Annotations'), max_skip//5, is_bl=False, subset=load_sub_yv())
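
To illustrate how a subset filter could empty a custom dataset, here is a minimal sketch. It is not the repository's code: list_videos is a hypothetical helper, and it assumes that each video lives in its own folder under JPEGImages and that a non-None subset keeps only the folder names it contains. Only the JPEGImages and Annotations folder names and the subset argument come from the snippet above.

```python
import os
import tempfile


def list_videos(root, subset=None):
    """Return the video folder names under root/JPEGImages that survive the filter.

    Hypothetical helper for illustration; the real dataset class may apply
    additional checks.
    """
    image_root = os.path.join(root, "JPEGImages")
    kept = []
    for name in sorted(os.listdir(image_root)):
        if not os.path.isdir(os.path.join(image_root, name)):
            continue
        # Assumed behaviour: a non-None subset keeps only the listed names.
        if subset is not None and name not in subset:
            continue
        kept.append(name)
    return kept


# Throwaway directory standing in for a custom dataset with one long video.
demo_root = tempfile.mkdtemp()
os.makedirs(os.path.join(demo_root, "JPEGImages", "my_video"))
os.makedirs(os.path.join(demo_root, "Annotations", "my_video"))

# Hypothetical subset listing video names from a different dataset.
foreign_subset = {"some_other_video"}

print(list_videos(demo_root, subset=foreign_subset))  # -> []
print(list_videos(demo_root, subset=None))            # -> ['my_video']
```

If the same pattern applies to your setup, passing subset=None (or a subset that lists your own folder names) would be the first thing to try.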

If not, I recommend starting debugging from https://github.com/hkchengrex/STCN/blob/main/dataset/vos_dataset.py
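
While debugging, it can also help to replace the bare division with a check that fails with a clearer message. The sketch below is only an illustration: total_epochs is a hypothetical helper, not a function in train.py, and it mirrors the math.ceil expression from the traceback.

```python
import math


def total_epochs(iterations, batches_per_epoch):
    """Number of epochs needed to cover `iterations`, failing early on empty data."""
    if batches_per_epoch == 0:
        # An empty loader means no training samples were found at all.
        raise ValueError(
            "The data loader is empty: check the dataset root, "
            "the folder layout, and the subset filter."
        )
    return math.ceil(iterations / batches_per_epoch)


print(total_epochs(10, 3))  # -> 4
```

With an empty loader, total_epochs(10, 0) raises ValueError with the message above instead of ZeroDivisionError.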