aimagelab / RefiNet


Attempt at training depth patches module

dagle593 opened this issue · comments

Hi, I'm Daryl. I'm attempting to train module A (the depth patch model) from scratch using the ITOP dataset. I think the model is initialised properly, as the error only occurs when model.train() runs in the main.py file.

A. As I currently understand it,

the input path folder is to contain the datasets and the baseline keypoints (the kpts_train.pkl & kpts_test.pkl files)

the separate weights folder is used to continue training from the 27th epoch, which is passed on the command line via --resume

B. Below is the altered train.json file under itop/depth:

  1. I changed some of the file directories but kept the names as-is: my main file was moved into a folder "REFINET_STUFF" while the other downloads from the "src" folder remain inside it

  2. "train_dir": "path/to/itop" — for this I created a folder "path/to/itop" in my directory to store all the ITOP .h5 files

  3. I've changed "kpts_path": "/path/to/kpts_to_load_train.pkl" --> "kpts_path": "path/to/kpts_train.pkl" and "kpts_path_test": "/path/to/kpts_to_load_test.pkl" --> "kpts_path_test": "path/to/kpts_test.pkl", since the downloads do not include kpts_to_load_train.pkl or kpts_to_load_test.pkl

  4. I did create the path "path/to/BodyPoseRefine", but it remains empty as I have no idea what files it requires

C. Current Issue

  1. After it loads the datasets within model.init_model() and starts model.train(), the error points to the __train() function in train.py and says it is unable to process empty datasets. I'll attach the full error output in a bit, as it crashed my conda environment, which is being restored at the moment.

D. Could anyone confirm whether the things I've done in parts A and B are right or wrong? Additionally, please advise on the current issue. Thank you.

{
  "name": "BodyPose Refinement with Depth patches",
  "dataset": "ITOP",
  "side": "side",
  "task": "Pose Estimation",
  "project_dir": "path/to/BodyPoseRefine",
  "train_dir": "path/to/itop",
  "epochs": 30,
  "data": {
    "type": "depth",
    "from_gt": true,
    "patch_dim": 40,
    "batch_size": 64,
    "input_size": [40, 40],
    "output_size": [15, 2],
    "image_size": [240, 320],
    "num_keypoints": 15,
    "kpts_type": "2D",
    "kpts_path": "path/to/kpts_train.pkl",
    "kpts_path_test": "path/to/kpts_test.pkl",
    "result_dir": "result"
  },
  "metrics": {
    "sigmas": 0.107,
    "gt_type": "plain",
    "kpts_type": "2D",
    "dist_thresh": 100
  },
  "data_aug": {
    "mu": 0,
    "sigma": 5
  },
  "checkpoints": {
    "best": true,
    "save_name": "train_depth",
    "save_dir": "checkpoints/depth/itop",
    "save_iters": 30,
    "tb_path": "train_log/itop"
  },
  "solver": {
    "type": "Adam",
    "workers": 4,
    "weight_decay": 0.0001,
    "decay_steps": [10, 20],
    "base_lr": 0.001
  },
  "network": {
    "model_name": "V1",
    "residual": true,
    "dropout": true,
    "batch_norm": true,
    "activation": "relu"
  }
}
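
Since several of these values are still the placeholder paths from the config template, one way to catch an "empty dataset" early is to sanity-check every configured path before training. This is a minimal sketch (the paths below are the placeholders from the config above, not real files; on an actual setup they would be the resolved locations):

```python
import json
from pathlib import Path

# A subset of the train.json above, with the path-valued keys we want to check.
cfg = json.loads("""{
  "train_dir": "path/to/itop",
  "data": {
    "kpts_path": "path/to/kpts_train.pkl",
    "kpts_path_test": "path/to/kpts_test.pkl"
  }
}""")

# Collect every configured path and report the ones that do not exist;
# a dataset that loads as empty often traces back to a path that is wrong here.
paths = [cfg["train_dir"], cfg["data"]["kpts_path"], cfg["data"]["kpts_path_test"]]
missing = [p for p in paths if not Path(p).exists()]
print("missing:", missing)
```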

Here is the error that came out:

Starting epoch: 1
15
0.107
15
100
0%| | 0/221 [00:00<?, ?it/s]Traceback (most recent call last):
  File "main.py", line 60, in <module>
    model.train()
  File "d:\FYP_walao\REFINET_STUFF\src\train.py", line 492, in train
    self.__train()
  File "d:\FYP_walao\REFINET_STUFF\src\train.py", line 237, in __train
    for data_tuple in tqdm(self.train_loader):
  File "C:\Users\Owner\anaconda3\envs\conda37v2\lib\site-packages\tqdm\_tqdm.py", line 1022, in __iter__
    for obj in iterable:
  File "C:\Users\Owner\anaconda3\envs\conda37v2\lib\site-packages\torch\utils\data\dataloader.py", line 352, in __iter__
    return self._get_iterator()
  File "C:\Users\Owner\anaconda3\envs\conda37v2\lib\site-packages\torch\utils\data\dataloader.py", line 294, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "C:\Users\Owner\anaconda3\envs\conda37v2\lib\site-packages\torch\utils\data\dataloader.py", line 801, in __init__
    w.start()
  File "C:\Users\Owner\anaconda3\envs\conda37v2\lib\multiprocessing\process.py", line 112, in start
    self._popen = self._Popen(self)
  File "C:\Users\Owner\anaconda3\envs\conda37v2\lib\multiprocessing\context.py", line 223, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "C:\Users\Owner\anaconda3\envs\conda37v2\lib\multiprocessing\context.py", line 322, in _Popen
    return Popen(process_obj)
  File "C:\Users\Owner\anaconda3\envs\conda37v2\lib\multiprocessing\popen_spawn_win32.py", line 89, in __init__
    reduction.dump(process_obj, to_child)
  File "C:\Users\Owner\anaconda3\envs\conda37v2\lib\multiprocessing\reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
TypeError: 'NoneType' object is not callable

(conda37v2) d:\FYP_walao\REFINET_STUFF>Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Users\Owner\anaconda3\envs\conda37v2\lib\multiprocessing\spawn.py", line 105, in spawn_main
    exitcode = _main(fd)
  File "C:\Users\Owner\anaconda3\envs\conda37v2\lib\multiprocessing\spawn.py", line 115, in _main
    self = reduction.pickle.load(from_parent)
EOFError: Ran out of input
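
Reading the two tracebacks together: the crash happens while the DataLoader is starting worker processes. On Windows, workers are spawned rather than forked, so the entire dataset object is pickled and sent to each worker; if anything attached to it cannot be pickled (a lambda, a local function, or a broken/None attribute used as a callable), the parent fails inside ForkingPickler, and the half-started child then dies with the secondary "EOFError: Ran out of input". A minimal sketch of the pickling constraint, using a hypothetical stand-in class (not RefiNet's actual dataset):

```python
import pickle

class PatchDataset:
    """Hypothetical stand-in for the dataset object a spawned worker receives."""
    def __init__(self, transform):
        self.transform = transform  # the transform is pickled along with the dataset

def is_picklable(obj):
    """Return True if obj survives the pickling step that spawn-based workers perform."""
    try:
        pickle.dumps(obj)
        return True
    except Exception:
        return False

# A lambda stored on the dataset cannot be pickled, so worker start-up fails
# in the parent process exactly as in the first traceback above.
print(is_picklable(PatchDataset(transform=lambda x: x * 2)))  # False
```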

Hi @dagle593, could you please share how you solved this issue? Thank you!

I altered the "workers" parameter in the .json file to 0.
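
Assuming the "workers" key under "solver" in the config is passed through to the DataLoader's num_workers, setting it to 0 keeps all data loading in the main process, so nothing has to be pickled into spawned workers and the Windows crash disappears. A minimal sketch with toy tensors standing in for the 40x40 depth patches and 15 2D keypoints:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy tensors shaped like the config above: 40x40 patches, 15 keypoints in 2D.
dataset = TensorDataset(torch.zeros(8, 40, 40), torch.zeros(8, 15, 2))

# num_workers=0 loads batches in the main process; no worker processes are
# spawned, so the dataset is never pickled and the TypeError cannot occur.
loader = DataLoader(dataset, batch_size=4, num_workers=0)

shapes = [tuple(patches.shape) for patches, _ in loader]
print(shapes)  # [(4, 40, 40), (4, 40, 40)]
```

The trade-off is that single-process loading can become a throughput bottleneck on large datasets, but it is a reliable workaround for spawn-related pickling errors on Windows.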