LoSealL / VideoSuperResolution

A collection of state-of-the-art video or single-image super-resolution architectures, reimplemented in tensorflow.

How to train with multiple videos in my custom dataset?

iPrayerr opened this issue · comments

Hello, I've tried to train with my own dataset, whose folder structure is as below:
/dataset_folder/HR/video_num/*.png
/dataset_folder/LR/X4/video_num/*.png

And I've organized them following the instructions in Data/datasets.yaml:
Root: /home/user

Path:
  CUSTOM-TRAINHR[video]: dataset_folder/HR
  CUSTOM-TRAINLR[video]: dataset_folder/LR/X4

Dataset:
  CUSTOM[video]:
    train:
      hr: CUSTOM_TRAINHR
      lr: CUSTOM_TRAINLR
My validation set is organized in the same format as above.

However, a traceback was thrown when I tried the command:
python train.py sofvsr --dataset custom --epochs 100 --cuda
which is:

Traceback (most recent call last):
File "train.py", line 99, in <module>
main()
File "train.py", line 93, in main
t.fit([lt, lv], config)
File "/home/zp/VideoSuperResolution-master/VSR/Backend/Torch/Framework/Trainer.py", line 110, in fit
memory_limit=mem)
File "/home/zp/VideoSuperResolution-master/VSR/DataLoader/Loader.py", line 322, in make_one_shot_iterator
raise fs.exception()
File "/home/user/.conda/envs/zp/lib/python3.6/concurrent/futures/thread.py", line 56, in run result = self.fn(*self.args, **self.kwargs)
File "/home/zp/VideoSuperResolution-master/VSR/DataLoader/Loader.py", line 393, in _prefecth_chunk
self.cache['hr'].append(img.read_frame(img.frames))
File "/home/zp/VideoSuperResolution-master/VSR/DataLoader/VirtualFile.py", line 362, in read_frame
image_bytes = [BytesIO(self.read()) for _ in range(frames)]
File "/home/zp/VideoSuperResolution-master/VSR/DataLoader/VirtualFile.py", line 362, in <listcomp>
image_bytes = [BytesIO(self.read()) for _ in range(frames)]
File "/home/zp/VideoSuperResolution-master/VSR/DataLoader/VirtualFile.py", line 129, in read raise EOFError(f'End of File! {self.name}')
EOFError: End of File! 068

It usually occurs after one epoch, but sometimes only after epoch 2 or 3.
Besides, my training set has 240 videos in total, i.e. 240*100 = 24000 frames. However, only 200 batches are read per epoch when batch_size is set to 4.

So how can I train with multiple videos as mentioned above?
When I use just one video folder containing 100 frames, everything is fine.

Thx.

@LoSealL Are you still maintaining the project? If so, I really need your help.

@iPrayerr Sorry, I didn't notice this; I've been too busy recently. I will check soon.

@iPrayerr Hi, the Root in datasets.yaml means the root folder of your image/video data. In your case, Root must be /. Alternatively, you can set Root: /dataset_folder and CUSTOM-TRAINHR[video]: HR

Besides, there is a typo in your names: CUSTOM_TRAINHR vs. CUSTOM-TRAINHR (underscore vs. hyphen).
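
For example, a corrected datasets.yaml fragment might look like this (assuming your data really lives under /dataset_folder):

Root: /dataset_folder

Path:
  CUSTOM-TRAINHR[video]: HR
  CUSTOM-TRAINLR[video]: LR/X4

Dataset:
  CUSTOM[video]:
    train:
      hr: CUSTOM-TRAINHR
      lr: CUSTOM-TRAINLR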

It usually occurs after one epoch, but sometimes only after epoch 2 or 3.
Besides, my training set has 240 videos in total, i.e. 240*100 = 24000 frames. However, only 200 batches are read per epoch when batch_size is set to 4.
So how can I train with multiple videos as mentioned above?
When I use just one video folder containing 100 frames, everything is fine.

After correcting the data path, the training procedure runs fine in my test environment. If you meet any errors, please feel free to paste the full logs here to help me debug.

In VSR training, the total number of batches per epoch is fixed by --steps, which defaults to 200, no matter how many pictures are in your dataset, because patches are randomly cropped from them.
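
For example, if you want each epoch to cover more of your 24000 frames, you can raise --steps; the value below is only an illustration:

python train.py sofvsr --dataset custom --epochs 100 --steps 1000 --cuda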


OK, thanks a lot, I'll try. I'll paste more logs if new problems come up.

@LoSealL Hi, I've followed your instructions, but it still doesn't work.

Here are more details:

  1. My environment:
    Hardware: 32G RAM, 12G GTX 1080Ti
    Software: Ubuntu 16.04, Python 3.6.5, tensorflow-gpu 1.10.0, tensorboardX 2.0, PyTorch 1.1.0, Protobuf 3.6.0

  2. My dataset folders (it's actually REDS):
    train_hr: /data/ruan/REDS/train/train_sharp//.png
    train_lr: /data/ruan/REDS/train/train_sharp_bicubic//.png
    val_hr: /data/ruan/REDS/val/val_sharp//.png
    val_lr: /data/ruan/REDS/val/val_sharp_bicubic//.png
    Here the first wildcard stands for the video number (from 000 up to a maximum), and the second stands for a specific frame (from 00000000 up to a maximum).

  3. Settings related to my dataset in Data/datasets.yaml:

Root: /data/ruan/REDS

Path:
  REDSTRAIN-HR[video]: train/train_sharp
  REDSTRAIN-LR[video]: train/train_sharp_bicubic
  REDSVAL-HR[video]: val/val_sharp
  REDSVAL-LR[video]: val/val_sharp_bicubic

Dataset:
  REDS[video]:
    train:
      hr: REDSTRAIN-HR
      lr: REDSTRAIN-LR
    val:
      hr: REDSVAL-HR
      lr: REDSVAL-LR

  4. Parameters of SOF-VSR in Train/par/pytorch/sofvsr.yaml:

sofvsr:
  channel: 1
  scale: 4
  depth: 3

batch_shape: [2, 3, 1, 32, 32]
lr: 1.0e-4
lr_decay:
  method: multistep
  decay_step: [250, 500, 750, 1000, 1250]
  decay_rate: 0.5

  5. Training logs:

(1) Without setting memory_limit:
zp@HP-Z840:/data/zp/Graduation/VideoSuperResolution-master/Train$ CUDA_VISIBLE_DEVICES=1 python train.py sofvsr --dataset reds --cuda --epochs 50
2020-04-15 04:07:04,621 WARNING: [!] PyTorch version too low: 1.1.0, recommended 1.2.0
2020-04-15 04:07:05,914 INFO: LICENSE: SOF-VSR is implemented by Longguan Wang. @LongguanWang https://github.com/LongguangWang/SOF-VSR.
2020-04-15 04:07:14,027 INFO: Total params: 1639676
2020-04-15 04:07:14,028 WARNING: trying to restore state for optimizer opt, but failed.
2020-04-15 04:07:14,028 INFO: Fitting: [SOF]
| 2020-04-15 04:09:25 | Epoch: 1/50 | LR: 0.0001 |
30%|####################6 | 59/200 [04:33<04:56, 2.10s/batch, image=00.09236, flow/lvl1=00.02955, flow/lvl2=00.02333, flow/lvl3=00.08839]
Killed

(2) With memory_limit set:
zp@HP-Z840:/data/zp/Graduation/VideoSuperResolution-master/Train$ CUDA_VISIBLE_DEVICES=1 python train.py sofvsr --dataset reds --cuda --epochs 50 --memory_limit 2GB
2020-04-15 04:29:45,743 WARNING: [!] PyTorch version too low: 1.1.0, recommended 1.2.0
2020-04-15 04:30:29,301 INFO: LICENSE: SOF-VSR is implemented by Longguan Wang. @LongguanWang https://github.com/LongguangWang/SOF-VSR.
2020-04-15 04:31:20,773 INFO: Total params: 1639676
2020-04-15 04:31:20,775 WARNING: trying to restore state for optimizer opt, but failed.
2020-04-15 04:31:20,775 INFO: Fitting: [SOF]
| 2020-04-15 04:31:51 | Epoch: 1/50 | LR: 0.0001 |
100%|#####################################################################| 200/200 [05:31<00:00, 1.45s/batch, image=00.07591, flow/lvl1=00.06990, flow/lvl2=00.08228, flow/lvl3=00.10484]
| Epoch average image = 0.106827 |
| Epoch average flow/lvl1 = 0.048361 |
| Epoch average flow/lvl2 = 0.063642 |
| Epoch average flow/lvl3 = 0.105035 |

Test: 100%|################################################################################################################################################| 10/10 [00:39<00:00, 1.91s/it]
psnr: 12.935657,
| 2020-04-15 04:38:32 | Epoch: 2/50 | LR: 0.0001 |
100%|#####################################################################| 200/200 [05:10<00:00, 1.43s/batch, image=00.04348, flow/lvl1=00.05901, flow/lvl2=00.06701, flow/lvl3=00.07531]
| Epoch average image = 0.034973 |
| Epoch average flow/lvl1 = 0.040674 |
| Epoch average flow/lvl2 = 0.045621 |
| Epoch average flow/lvl3 = 0.059479 |

Traceback (most recent call last):
File "train.py", line 99, in
main()
File "train.py", line 93, in main
t.fit([lt, lv], config)
File "/data/zp/Graduation/VideoSuperResolution-master/VSR/Backend/Torch/Framework/Trainer.py", line 130, in fit
self.benchmark(v.val_loader, v, memory_limit='1GB')
File "/data/zp/Graduation/VideoSuperResolution-master/VSR/Backend/Torch/Framework/Trainer.py", line 159, in benchmark
memory_limit=v.memory_limit)
File "/data/zp/Graduation/VideoSuperResolution-master/VSR/DataLoader/Loader.py", line 322, in make_one_shot_iterator
raise fs.exception()
File "/home/zp/anaconda3/lib/python3.6/concurrent/futures/thread.py", line 56, in run
result = self.fn(*self.args, **self.kwargs)
File "/data/zp/Graduation/VideoSuperResolution-master/VSR/DataLoader/Loader.py", line 393, in _prefecth_chunk
self.cache['hr'].append(img.read_frame(img.frames))
File "/data/zp/Graduation/VideoSuperResolution-master/VSR/DataLoader/VirtualFile.py", line 362, in read_frame
image_bytes = [BytesIO(self.read()) for _ in range(frames)]
File "/data/zp/Graduation/VideoSuperResolution-master/VSR/DataLoader/VirtualFile.py", line 362, in
image_bytes = [BytesIO(self.read()) for _ in range(frames)]
File "/data/zp/Graduation/VideoSuperResolution-master/VSR/DataLoader/VirtualFile.py", line 129, in read
raise EOFError(f'End of File! {self.name}')
EOFError: End of File! 023

In (2), whatever value --memory_limit is set to, the traceback is the same.

Besides, when I reload the saved parameters to continue training, it runs for at most one more epoch:
zp@HP-Z840:/data/zp/Graduation/VideoSuperResolution-master/Train$ CUDA_VISIBLE_DEVICES=1 python train.py sofvsr --dataset reds --cuda --epochs 50 --memory_limit 6GB
2020-04-15 04:45:22,443 WARNING: [!] PyTorch version too low: 1.1.0, recommended 1.2.0
2020-04-15 04:45:23,823 INFO: LICENSE: SOF-VSR is implemented by Longguan Wang. @LongguanWang https://github.com/LongguangWang/SOF-VSR.
2020-04-15 04:45:32,135 INFO: Total params: 1639676
2020-04-15 04:45:32,136 INFO: Restoring params for sof from /data/zp/Graduation/VideoSuperResolution-master/Results/sofvsr/save/sof_ep0001.pth.
2020-04-15 04:45:32,375 INFO: Fitting: [SOF]
| 2020-04-15 04:46:51 | Epoch: 2/50 | LR: 0.0001 |
100%|#####################################################################| 200/200 [06:21<00:00, 1.48s/batch, image=00.05740, flow/lvl1=00.06096, flow/lvl2=00.07924, flow/lvl3=00.09723]
| Epoch average image = 0.081686 |
| Epoch average flow/lvl1 = 0.055793 |
| Epoch average flow/lvl2 = 0.086770 |
| Epoch average flow/lvl3 = 0.147451 |

Test: 100%|################################################################################################################################################| 10/10 [00:39<00:00, 2.19s/it]
psnr: 12.557086,
Traceback (most recent call last):
File "train.py", line 99, in
main()
File "train.py", line 93, in main
t.fit([lt, lv], config)
File "/data/zp/Graduation/VideoSuperResolution-master/VSR/Backend/Torch/Framework/Trainer.py", line 110, in fit
memory_limit=mem)
File "/data/zp/Graduation/VideoSuperResolution-master/VSR/DataLoader/Loader.py", line 322, in make_one_shot_iterator
raise fs.exception()
File "/home/zp/anaconda3/lib/python3.6/concurrent/futures/thread.py", line 56, in run
result = self.fn(*self.args, **self.kwargs)
File "/data/zp/Graduation/VideoSuperResolution-master/VSR/DataLoader/Loader.py", line 393, in _prefecth_chunk
self.cache['hr'].append(img.read_frame(img.frames))
File "/data/zp/Graduation/VideoSuperResolution-master/VSR/DataLoader/VirtualFile.py", line 362, in read_frame
image_bytes = [BytesIO(self.read()) for _ in range(frames)]
File "/data/zp/Graduation/VideoSuperResolution-master/VSR/DataLoader/VirtualFile.py", line 362, in
image_bytes = [BytesIO(self.read()) for _ in range(frames)]
File "/data/zp/Graduation/VideoSuperResolution-master/VSR/DataLoader/VirtualFile.py", line 129, in read
raise EOFError(f'End of File! {self.name}')
EOFError: End of File! 020

  6. When I loaded the corresponding TensorBoard log, only one point was recorded in each subfigure.

Hope that the above will be helpful for your debugging.

@iPrayerr I don't understand how you arranged your training data.

My dataset folders(actually it's REDS):
train_hr: /data/ruan/REDS/train/train_sharp//.png
train_lr: /data/ruan/REDS/train/train_sharp_bicubic//.png
val_hr: /data/ruan/REDS/val/val_sharp//.png
val_lr: /data/ruan/REDS/val/val_sharp_bicubic//.png
Here the first wildcard stands for the video number (from 000 up to a maximum), and the second stands for a specific frame (from 00000000 up to a maximum).

Do they look like:

train_sharp/001.png   train_sharp/002.png
train_sharp_bicubic/001.png   train_sharp_bicubic/002.png 

Then all images in the same folder in video mode are treated as one video clip.
To give the dataloader the right video structure, I'd arrange the data like:

train_sharp/v01/001.png   train_sharp/v01/002.png
train_sharp/v02/001.png   train_sharp/v02/002.png
...
train_sharp_bicubic/v01/001.png   train_sharp_bicubic/v01/002.png 
train_sharp_bicubic/v02/001.png   train_sharp_bicubic/v02/002.png 
...

I don't even remember adding SSIM to the validation!
SSIM is heavy and significantly slows down validation, so I usually calculate it offline.

BTW, skimage.measure.compare_ssim is very handy for computing SSIM and other metrics.
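
As a reference, here is a minimal offline sketch using the old skimage.measure API (the folder paths are placeholders; in scikit-image >= 0.18 these functions moved to skimage.metrics as peak_signal_noise_ratio and structural_similarity):

from pathlib import Path

import numpy as np
from skimage.io import imread
from skimage.measure import compare_psnr, compare_ssim

sr_dir = Path('Results/sofvsr/REDS')                 # hypothetical folder of saved SR frames
gt_dir = Path('/data/ruan/REDS/val/val_sharp/000')   # matching ground-truth frames

psnr_list, ssim_list = [], []
for sr_path in sorted(sr_dir.glob('*.png')):
    sr = imread(sr_path)
    gt = imread(gt_dir / sr_path.name)               # assumes SR and GT share file names
    psnr_list.append(compare_psnr(gt, sr, data_range=255))
    ssim_list.append(compare_ssim(gt, sr, multichannel=True, data_range=255))

print(f'PSNR: {np.mean(psnr_list):.4f}, SSIM: {np.mean(ssim_list):.4f}')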

train_sharp/v01/001.png train_sharp/v01/002.png
train_sharp/v02/001.png train_sharp/v02/002.png
...
train_sharp_bicubic/v01/001.png train_sharp_bicubic/v01/002.png
train_sharp_bicubic/v02/001.png train_sharp_bicubic/v02/002.png
...

@LoSealL Sorry, I forgot a "*". Previously I meant /data/ruan/REDS/train/train_sharp/*/*.png, not /data/ruan/REDS/train/train_sharp/*.png

Here the "*" stands for a specific number(video_num or frame_num), for example,
/data/ruan/REDS/train/train_sharp/007/00000000.png
or
/data/ruan/REDS/val/val_sharp_bicubic/013/00000099.png

The data is arranged just like what you described.
And it still raises the problem I mentioned before.

@iPrayerr You're right, this is a bug when memory_limit is enabled. Sorry for that; I've made a patch to fix it.

Thanks a lot.
I've just tested it, and everything seems OK now.

@LoSealL Hi, sorry to bother you again. Now that the above problem is solved, I find that both PSNR and SSIM are quite low during testing after training, for many algorithms (SSIM was added by myself and doesn't affect the training process). I've tried 6 algorithms and all of them have the same problem.

Here are two examples:

zp@HP-Z840:/data/zp/Graduation/VideoSuperResolution-master/Train$ CUDA_VISIBLE_DEVICES=1 python train.py vespcn --dataset reds --epochs 35 --steps 8000 --val_steps 300 --cuda --memory_limit 1.5GB
2020-04-17 06:03:20,380 WARNING: [!] PyTorch version too low: 1.1.0, recommended 1.2.0
2020-04-17 06:03:32,087 INFO: LICENSE: VESPCN is proposed at CVPR2017 by Twitter. Implemented by myself @LoSealL.
2020-04-17 06:03:55,234 INFO: Total params: 878787
2020-04-17 06:03:55,235 WARNING: trying to restore state for optimizer opt, but failed.
2020-04-17 06:03:55,235 INFO: Fitting: [VESPCN]
| 2020-04-17 06:04:20 | Epoch: 1/35 | LR: 0.0001 |
100%|#################################################################################################| 8000/8000 [1:56:12<00:00, 1.19batch/s, image=00.02813, flow=00.05529, tv=00.09024]
| Epoch average image = 0.056315 |
| Epoch average flow = 0.040478 |
| Epoch average tv = 0.330291 |
Test: 100%|##############################################################################################################################################| 300/300 [05:33<00:00, 1.08it/s]
psnr: 12.205831, ssim: 0.483721,
| 2020-04-17 08:06:16 | Epoch: 2/35 | LR: 0.0001 |
100%|#################################################################################################| 8000/8000 [1:49:09<00:00, 1.25batch/s, image=00.09327, flow=00.03729, tv=00.04959]
| Epoch average image = 0.052351 |
| Epoch average flow = 0.035497 |
| Epoch average tv = 0.070800 |
Test: 100%|##############################################################################################################################################| 300/300 [05:32<00:00, 1.00s/it]
psnr: 11.771148, ssim: 0.445891,

zp@HP-Z840:/data/zp/Graduation/VideoSuperResolution-master/Train$ CUDA_VISIBLE_DEVICES=1 python train.py rbpn --dataset reds --epochs 35 --steps 8000 --val_steps 300 --cuda --memory_limit 2GB
2020-04-17 07:44:38,519 WARNING: [!] PyTorch version too low: 1.1.0, recommended 1.2.0
2020-04-17 07:44:40,218 INFO: LICENSE: RBPN is implemented by M. Haris, et. al. @alterzero
2020-04-17 07:44:40,218 WARNING: I use unsupervised flownet to estimate optical flow, rather than pyflow module.
2020-04-17 07:44:47,539 INFO: Total params: 14510537
2020-04-17 07:44:47,540 WARNING: trying to restore state for optimizer adam, but failed.
2020-04-17 07:44:47,540 INFO: Fitting: [RBPN]
| 2020-04-17 07:45:05 | Epoch: 1/35 | LR: 0.0001 |
100%|##############################################################################################| 8000/8000 [1:16:33<00:00, 1.89batch/s, flow=00.02468, image=00.19080, total=00.21548]
| Epoch average flow = 0.131890 |
| Epoch average image = 0.258228 |
| Epoch average total = 0.390117 |
Test: 100%|##############################################################################################################################################| 300/300 [02:50<00:00, 2.13it/s]
psnr: 12.385075, ssim: 0.432265,
| 2020-04-17 09:04:39 | Epoch: 2/35 | LR: 0.0001 |
100%|##############################################################################################| 8000/8000 [1:19:01<00:00, 1.94batch/s, flow=00.17723, image=00.29885, total=00.47607]
| Epoch average flow = 0.159079 |
| Epoch average image = 0.214459 |
| Epoch average total = 0.373538 |
Test: 100%|##############################################################################################################################################| 300/300 [02:49<00:00, 2.09it/s]
psnr: 12.186343, ssim: 0.446204,

I just clipped two epochs as an example; training for more epochs leads to the same result (SRCNN was trained for over 37 epochs, i.e. 296000 iterations, and still shows the same problem).
However, for some algorithms like SOF-VSR, when I used a single video as training data and set a value for --steps, the PSNR could reach a normal level of around 24 to 26, while for others like SRCNN the problem remains.
I guess there may be something wrong with some hyper-parameters. Could you offer me some suggestions?

Thx. :)

Usually there are a few ways to debug:

  • Record the training image patches (through SummaryWriter/TensorBoard, or just save them to disk) and check whether the training pairs match as desired (see the sketch after this list).
  • Check the hyper-parameters, especially the learning rate and batch size. Convergence may be very sensitive to them.
  • Train from pre-trained weights; it will be easier to converge.
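
For the first bullet, a minimal sketch (not part of VSR itself; the log directory and tensor layout are assumptions) that dumps a training pair to TensorBoard via tensorboardX, so you can eyeball whether the LR and HR patches match:

from tensorboardX import SummaryWriter

writer = SummaryWriter('Results/debug_patches')  # hypothetical log directory

def log_pair(step, lr_batch, hr_batch):
    # lr_batch / hr_batch: float tensors in [0, 1], shaped [N, C, H, W]
    writer.add_images('debug/lr', lr_batch, step)
    writer.add_images('debug/hr', hr_batch, step)

Call log_pair every few hundred steps inside the training loop, then inspect the pairs with tensorboard --logdir Results/debug_patches.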

I didn't train SOF-VSR from scratch; you can check the paper and my implementation carefully.
I just fine-tuned SOF-VSR on top of the official pre-trained weights, and the result is as expected to me.

OK, I'll try.

@LoSealL Hello, I've seen your latest commit named "Fix dataloader mess up the file order".

I've tested it on the previous dataset, treating it as an image dataset to test CARN. However, nothing seems to be fixed at all.

Specifically, in the test process I separately saved the LR, GT and CARN SR results. Some of them match, but most don't, which is the same as with the previous version.

However, when I ran check_dataset.py, it didn't find any unmatched pairs:

zp@HP-Z840:/data/zp/Graduation/VideoSuperResolution-master/Train$ python check_dataset.py redimg
2020-04-23 00:51:37,680 WARNING: [!] PyTorch version too low: 1.1.0, recommended 1.2.0
Dataset: REDIMG

========= CHECKING train =========

Found train set in "REDIMG":
Found 24000 ground-truth train data
Found 24000 custom degraded train data

========= CHECKING val =========

Found val set in "REDIMG":
Found 3000 ground-truth val data
Found 3000 custom degraded val data

========= CHECKING test =========

REDIMG doesn't contain any test data.

Do you know what the problem is?

@iPrayerr Affirmative, it's a bug. To work around it, use --threads=1. It will take me some time to fix this :(
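
For example, appended to the command you used for CARN (the exact model/dataset names here are just taken from your logs):

python train.py carn --dataset redimg --cuda --threads=1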