SamsungLabs / ritm_interactive_segmentation

Reviving Iterative Training with Mask Guidance for Interactive Segmentation

It takes a long time to train

juntawu opened this issue

Hello. Thanks for your work.
I trained the HRNetV2-W18-C+OCR ITER-M model with the command
python3 train.py models/iter_mask/hrnet18_cocolvis_itermask_3p.py --gpus=0,1 --workers=6 --exp-name=first-try
on the COCO_LVIS dataset, with 2 GPUs (Tesla-V100-SXM2-32GB).
However, training 200 epochs took me more than 70 hours. Is this normal?

@juntawu How long did your training take, and what batch_size did you use? I ran into the same problem: I trained exactly as described in the code, on a single GPU, and training feels far too slow.

I trained hrnet18s on a single 1080 Ti for 200 epochs. It took approximately 20 minutes per epoch. The result is lower than reported; I wonder if this is normal.
[screenshot of the evaluation results]

It's normal. #3 (comment)
I need 3 days to train 220 epochs. This is why the authors only trained 55 epochs for their experiments.

The batch size is set to 32 by default. To save time, you only need to train for 55 epochs on COCO_LVIS, as the authors did in their experiments.
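
For reference, the epoch count is not a command-line flag of train.py; it is set inside the model config. A rough sketch of the relevant lines, assuming the config ends with a trainer.run(...) call as the iter_mask configs in this repo do (exact variable names may differ):

# near the end of models/iter_mask/hrnet18_cocolvis_itermask_3p.py (names assumed)
cfg.batch_size = 32           # default; lower it if a GPU runs out of memory
trainer.run(num_epochs=55)    # shorten the schedule to 55 epochs instead of the default

If train.py also exposes a --batch-size flag, the batch size can be overridden there without editing the config.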

Save checkpoint to experiments\iter_mask\sbd_hrnet18\000_first-try\checkpoints\last_checkpoint.pth
Save checkpoint to experiments\iter_mask\sbd_hrnet18\000_first-try\checkpoints\000.pth
While training the first epoch, once it finished the program just stayed on this screen. Is that normal? Does it stall here for a long time? I don't dare to press anything.

After training finishes, validation is run. You can walk through the code; it is well written. There is a pause during validation, but it does not last long, and a progress bar is shown.
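
Roughly what one training cycle does (a simplified, self-contained sketch, not the repo's actual code), which is why the process seems to stop right after the checkpoint messages are printed:

# toy sketch of train -> save checkpoint -> validate; requires tqdm
import time
from tqdm import tqdm

num_epochs, train_batches, val_batches = 2, 5, 3
for epoch in range(num_epochs):
    for _ in tqdm(range(train_batches), desc=f"train epoch {epoch}"):
        time.sleep(0.1)                                   # stands in for forward/backward/step
    print(f"Save checkpoint to .../{epoch:03d}.pth")      # the last line seen before the pause
    for _ in tqdm(range(val_batches), desc=f"val epoch {epoch}"):
        time.sleep(0.1)                                   # validation pass with its own progress bar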

I did read the code and set breakpoints line by line to track down the problem. Sometimes it does not even enter the for loop. If none of you have this issue, maybe it is my machine? Or something wrong with my dataset?

Do you also see the training loss restarting at every epoch? It feels as if each epoch is independent.

Hello, may I ask how you got the results in that evaluation screenshot above? My validation process only gives me the validation loss.

How was this problem solved in the end? I hit the same issue training on my own dataset: after the first epoch finishes, it freezes at validation.

How did you solve this problem in the end?

It seems it was caused by the images: removing the ones without labels fixed it.

By "without labels", do you mean there is a raw image images/sth.jpg but no corresponding mask masks/sth.png?

In my case, the training set is built from slices of 3D medical images. Every raw image images/sth.jpg has a corresponding masks/sth.png, but a certain proportion of the mask images are pure black (there is no target inside the mask).
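
If the all-black masks are indeed the problem, a small preprocessing pass can drop those samples before training. A rough sketch under the flat images/ + masks/ layout described above (paths, extensions, and the filtering criterion are assumptions, not the repo's own data loader):

# drop image/mask pairs whose mask is completely black (no labeled object)
from pathlib import Path
import numpy as np
from PIL import Image

images_dir, masks_dir = Path("images"), Path("masks")
kept, total = [], 0
for mask_path in sorted(masks_dir.glob("*.png")):
    total += 1
    mask = np.array(Image.open(mask_path))
    if mask.max() == 0:                       # pure-black mask: nothing to segment
        continue                              # skip this slice entirely
    image_path = images_dir / (mask_path.stem + ".jpg")
    if image_path.exists():
        kept.append((image_path, mask_path))
print(f"kept {len(kept)} of {total} pairs")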

@yangshunDragon Hello, I also want to use this model for medical image segmentation. May I ask whether you solved this problem?