Not able to train

Question

indhu26 opened this issue 4 years ago

indhu26 commented 4 years ago

Hi all,

Conda env details -

cudatoolkit 9.0
cudnn 7.1.2
tensorflow-gpu 1.6.0

Data Preprocessing -

I followed the same preprocessing command given in the repo.

Training fails with an error, and I can't work out what causes it.
Am I missing something, @FangGet?

The error is attached below:
Traceback (most recent call last):
  File "monodepth2.py", line 63, in <module>
    args.func(config, output_dir, args)
  File "monodepth2.py", line 17, in _cli_train
    monodepth2_learner.train(output_dir)
  File "/home/DEPTH_MODEL/tf-monodepth2/model/monodepth2_learner.py", line 405, in train
    self.save(sess, ckpt_dir, 'latest')
  File "/home/anaconda3/envs/tf_mono/lib/python3.6/contextlib.py", line 99, in __exit__
    self.gen.throw(type, value, traceback)
  File "/home/anaconda3/envs/tf_mono/lib/python3.6/site-packages/tensorflow/python/training/supervisor.py", line 1000, in managed_session
    self.stop(close_summary_writer=close_summary_writer)
  File "/home/anaconda3/envs/tf_mono/lib/python3.6/site-packages/tensorflow/python/training/supervisor.py", line 828, in stop
    ignore_live_threads=ignore_live_threads)
  File "/home/anaconda3/envs/tf_mono/lib/python3.6/site-packages/tensorflow/python/training/coordinator.py", line 389, in join
    six.reraise(*self._exc_info_to_raise)
  File "/home/anaconda3/envs/tf_mono/lib/python3.6/site-packages/six.py", line 703, in reraise
    raise value
  File "/home/anaconda3/envs/tf_mono/lib/python3.6/site-packages/tensorflow/python/training/queue_runner_impl.py", line 252, in _run
    enqueue_callable()
  File "/home/anaconda3/envs/tf_mono/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1259, in _single_operation_run
    None)
  File "/home/anaconda3/envs/tf_mono/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 516, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: Expected size[1] in [0, 608], but got 640
  [[Node: data_loading/Slice = Slice[Index=DT_INT32, T=DT_UINT8, _device="/job:localhost/replica:0/task:0/device:CPU:0"](data_loading/DecodeJpeg, data_loading/Slice_4/begin, data_loading/Slice_3/size)]]

Bayram Bayramli · Answer 1 · Fri Dec 11 2020 11:36:17 GMT+0800 (China Standard Time)

I believe this is a resolution mismatch. First check what resolution you used when preprocessing the data, then compare it with the resolution configured for training.

xuchen-dev · Answer 2 · Thu Mar 03 2022 16:47:22 GMT+0800 (China Standard Time)

I got the same problem. Could you give me some advice?
My preprocessed images are 1248 × 128, which I believe is correct, and I did not change the size in the YAML config:
image_height: 192
image_width: 640
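
The numbers in this thread can be checked by hand. The sketch below is not code from the repo; it assumes the preprocessing step stores each training sample as several frames concatenated side by side (three frames is an assumption here, as is the helper name `check_frame_slices`) and that the loader cuts them apart using `image_width` from the YAML. Under that assumption, a 1248-pixel-wide sample with `image_width: 640` leaves only 1248 - 640 = 608 columns for the second slice, which is the same 608 reported in the `InvalidArgumentError` above.

```python
def check_frame_slices(seq_width, frame_width, num_frames=3):
    """List the slices that would run past the right edge of the image.

    seq_width   -- width in pixels of one preprocessed sample on disk
    frame_width -- image_width from the training config
    num_frames  -- frames assumed to be concatenated horizontally (assumption)

    Returns a list of (begin, requested_size, columns_remaining) tuples,
    one for each slice that does not fit.
    """
    problems = []
    for i in range(num_frames):
        begin = i * frame_width
        remaining = seq_width - begin
        if frame_width > remaining:
            problems.append((begin, frame_width, remaining))
    return problems


# Values reported in this thread: 1248-wide samples, image_width: 640.
bad = check_frame_slices(seq_width=1248, frame_width=640)
print(bad[0])  # slice starting at column 640 asks for 640 columns, 608 remain

# A frame width of 416 (1248 / 3) fits without overrun.
print(check_frame_slices(seq_width=1248, frame_width=416))
```

If this assumption about the data layout holds, the two ways out are to rerun preprocessing at the resolution set in the YAML, or to set `image_height` and `image_width` in the YAML to match the preprocessed frames.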