Project Not Running as Expected

Question

singhangadin opened this issue 7 years ago

Hi, that's a very interesting project you have developed. However, I have been facing many issues running it. The readme.md appears to have been updated recently, yet it still seems outdated: it refers to model, train and image directories, but I couldn't find them in the project. Issue #1 was really helpful in preparing the images. I would strongly suggest mentioning those steps in the README.

I extracted 10,000 frames from the video link you shared in issue #1. This is how the processed images look after running the scripts in utils:

[image: Images Directory] <https://cloud.githubusercontent.com/assets/7099405/24464351/26401178-14c7-11e7-8bfb-0e33818ca01f.png>

Next I ran the train.py script on the images shown above to train the model. This is how it ran:

[image: train.py] <https://cloud.githubusercontent.com/assets/7099405/24464504/b8cdaf28-14c7-11e7-9d0a-ae0f45dff96d.png>

After going through the script, I came to the conclusion that "Model saved" is the result I should expect to see if everything goes well. I couldn't find it in my run, and the training time was unbelievably short.

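train.py creates a pickle file that looks like this:

[image: Pickle File] <https://cloud.githubusercontent.com/assets/7099405/24464819/c50ee4b8-14c8-11e7-962c-4cbcc8ad7a35.png>

and a checkpoint directory as below:

[image: screenshot from 2017-03-29 21-45-26] <https://cloud.githubusercontent.com/assets/7099405/24464958/21ca3694-14c9-11e7-9fb5-101a39e21904.png>

The checkpoint/images directory seems to be empty, and the rest of the files don't look right (maybe because of their small size).
I might be doing something wrong in the procedure. It would be really helpful if you could guide me through this project.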

Cameron Fabbri · Answer 1 · Thu Mar 30 2017 09:23:13 GMT+0800 (China Standard Time)

Hi Angad,

Thank you for the very detailed question, the screenshots and explanations were very helpful. The repo is in a pretty messy state now, as you can obviously tell.

I originally used my scripts to resize and convert the images to gray, as you saw, and read them in every step using a feed dictionary. Recently I decided to change this to the "Tensorflow preferred" way, which involves using Tensorflow queue runners. The reason for this switch is I believe using this with their prefetching method speeds up the training and fully utilizes your GPU, and also it allows you to convert images to gray and resize them on the fly (without sacrificing speed because of the prefetches). This way you won't need all of those resized gray images, just your original ones.

The reason your training took essentially no time is because of the epoch limit. If you look in train.py, the training goes while epoch_num < EPOCHS where EPOCHS is the maximum number of epochs to train for. I had a bug (thank you for pointing this out) in my parser arguments up top that made this default to 0. I was setting it manually when testing, so did not come across this issue. I've just pushed that fix.

As for cleaning up the rest of it, unfortunately I'm busy until next week, so that will have to wait. I do plan on completely fixing this up though, so hang tight! Although with the recent epoch fix I think you will be able to at least train for now (no guarantees on the results). Any other questions in the meantime feel free to ask.

Cam
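To illustrate the epoch-limit bug described above, here is a minimal sketch of how a default of 0 makes a `while epoch_num < EPOCHS` loop exit immediately. The flag names, the default of 100, and the helper functions are hypothetical and are not taken from the actual train.py; the validation step is just one possible guard.

```python
import argparse


def build_parser():
    # Hypothetical flag names and defaults; the real train.py may differ.
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=100)
    parser.add_argument("--batch_size", type=int, default=10)
    return parser


def count_epochs_run(max_epochs):
    """Mimic the `while epoch_num < EPOCHS` loop and return how many epochs ran."""
    epoch_num = 0
    while epoch_num < max_epochs:
        epoch_num += 1  # one pass over the training data would happen here
    return epoch_num


def parse_and_validate(argv):
    """Parse arguments and fail loudly instead of silently training for 0 epochs."""
    args = build_parser().parse_args(argv)
    if args.epochs <= 0:
        raise ValueError("--epochs must be positive, got %d" % args.epochs)
    return args
```

With a limit of 0, `count_epochs_run(0)` never enters the loop, which matches a run that finishes almost instantly without saving a model.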

Angad Singh · Answer 2 · Fri Mar 31 2017 02:11:59 GMT+0800 (China Standard Time)

Now getting this error.

EPOCHS: 10
DATA_DIR: ./Images
BATCH_SIZE: 10

W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE3 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
Traceback (most recent call last):
File "train.py", line 111, in <module>
sess.run(train_op)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 767, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 965, in _run
feed_dict_string, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1015, in _do_run
target_list, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1035, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.OutOfRangeError: FIFOQueue '_1_batch/fifo_queue' is closed and has insufficient elements (requested 10, current size 0)
[[Node: batch = QueueDequeueManyV2[component_types=[DT_STRING, DT_FLOAT, DT_FLOAT], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/cpu:0"](batch/fifo_queue, batch/n)]]

Caused by op u'batch', defined at:
File "train.py", line 57, in <module>
Data = data_ops.loadData(DATA_DIR, BATCH_SIZE)
File "/home/localhost/Downloads/Colorize/New/Colorful-Image-Colorization/data_ops.py", line 253, in loadData
paths_batch, inputs_batch, targets_batch = tf.train.batch([paths, input_images, target_images], batch_size=batch_size)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/input.py", line 872, in batch
name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/input.py", line 667, in _batch
dequeued = queue.dequeue_many(batch_size, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/data_flow_ops.py", line 458, in dequeue_many
self._queue_ref, n=n, component_types=self._dtypes, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_data_flow_ops.py", line 1310, in _queue_dequeue_many_v2
timeout_ms=timeout_ms, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 763, in apply_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2327, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1226, in __init__
self._traceback = _extract_stack()

OutOfRangeError (see above for traceback): FIFOQueue '_1_batch/fifo_queue' is closed and has insufficient elements (requested 10, current size 0)
[[Node: batch = QueueDequeueManyV2[component_types=[DT_STRING, DT_FLOAT, DT_FLOAT], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/cpu:0"](batch/fifo_queue, batch/n)]]

Cameron Fabbri · Answer 3 · Fri Mar 31 2017 02:15:05 GMT+0800 (China Standard Time)

That means that the list of training images you have doesn't contain a correct path to an image. Try running it with a / at the end of your directory, i.e. DATA_DIR=./Images/
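A trailing slash only matters if paths are built by plain string concatenation. Whether data_ops.py does that is an assumption here, not something the thread confirms. The sketch below shows how `os.path.join` gives the same result with or without the trailing slash; `image_paths` is a hypothetical helper.

```python
import os


def image_paths(data_dir, filenames):
    """Build image paths that do not depend on a trailing slash in data_dir."""
    return [os.path.normpath(os.path.join(data_dir, name)) for name in filenames]


# Plain concatenation silently produces a wrong path without the slash.
naive = "./Images" + "frame_0001.png"  # "./Imagesframe_0001.png"

with_slash = image_paths("./Images/", ["frame_0001.png"])
without_slash = image_paths("./Images", ["frame_0001.png"])
```

Here `with_slash` and `without_slash` are identical, while `naive` points at a file that does not exist.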

Angad Singh · Answer 4 · Fri Mar 31 2017 02:55:39 GMT+0800 (China Standard Time)

Still getting the same error. I have also tried with the full directory path. I am getting the correct file names in the pickle file. I am training on 500 images; I don't think that could be the cause of this.

Cameron Fabbri · Answer 5 · Fri Mar 31 2017 03:02:51 GMT+0800 (China Standard Time)

It shouldn't be. That's very odd that you're getting that error if you confirmed your pickle file contains correct paths. Did you make sure to delete the pickle file before trying to run it with the full directory path?
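One way to rule out a stale pickle file is to check that every path stored in it still exists on disk. This is only a sketch: it assumes the pickle holds a flat list of path strings, which the thread suggests but does not spell out, and `missing_paths` is a hypothetical helper.

```python
import os
import pickle


def missing_paths(pickle_file):
    """Return the cached paths that no longer point at a file on disk."""
    with open(pickle_file, "rb") as f:
        paths = pickle.load(f)  # assumed to be a list of path strings
    return [p for p in paths if not os.path.isfile(p)]
```

An empty result means the cached list matches the files on disk; anything else suggests deleting the pickle file so it is rebuilt on the next run.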

Angad Singh · Answer 6 · Fri Mar 31 2017 04:02:55 GMT+0800 (China Standard Time)

Hey, I am uploading a video this time; hopefully it helps you find the problem.
Screencast_Friday 31 March 2017_01:23:42 IST.webm.zip

Angad Singh · Answer 7 · Fri Mar 31 2017 04:07:36 GMT+0800 (China Standard Time)

I forgot to show the pickle file at the end; here is a picture of it:

Cameron Fabbri · Answer 8 · Fri Mar 31 2017 04:09:26 GMT+0800 (China Standard Time)

Interesting. One thing I noticed is you still have the resized and gray images, which you don't need, but that shouldn't be the problem. You can remove them by doing:

    find Images/ -iname '*.resized*.png' -exec rm {} \;

And similarly for gray:

    find Images/ -iname '*.gray*.png' -exec rm {} \;

That shouldn't be the problem though. I will try to run my version and see if it runs for me.
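For anyone who prefers to preview what would be deleted before running the find commands, here is a rough Python equivalent with a dry-run mode. The two patterns come from the commands above; the function names and the dry-run default are assumptions made for this sketch.

```python
import fnmatch
import os

# Same patterns as the find commands, matched case-insensitively like -iname.
PATTERNS = ("*.resized*.png", "*.gray*.png")


def find_derived_images(root, patterns=PATTERNS):
    """Return sorted paths under root whose file name matches any pattern."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            lowered = name.lower()
            if any(fnmatch.fnmatchcase(lowered, p) for p in patterns):
                matches.append(os.path.join(dirpath, name))
    return sorted(matches)


def remove_files(paths, dry_run=True):
    """Print what would be removed, or delete the files when dry_run is False."""
    for path in paths:
        if dry_run:
            print("would remove", path)
        else:
            os.remove(path)
```

Calling `remove_files(find_derived_images("Images/"))` only prints the candidates; passing `dry_run=False` deletes them.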

Angad Singh · Answer 9 · Fri Mar 31 2017 04:20:32 GMT+0800 (China Standard Time)

One more thing I want to mention. In the error text I posted earlier:

OutOfRangeError (see above for traceback): FIFOQueue '_1_batch/fifo_queue' is closed and has insufficient elements (requested 10, current size 0)

the current size was 0, whereas the error I got in the video shows 3.

I tried changing the batch size to 3 but still got the same error.

Cameron Fabbri · Answer 10 · Fri Mar 31 2017 04:27:07 GMT+0800 (China Standard Time)

I noticed that. I've been getting this same error with another project of mine, and unfortunately haven't solved it yet. It's quite frustrating. One thing that "almost" worked was removing the dashes in the path to my images. You can give that a shot, change them to underscores or something.

Angad Singh · Answer 11 · Fri Mar 31 2017 13:40:43 GMT+0800 (China Standard Time)

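Finally the script is running. I tried running it after removing the '-' from the directory name, but it didn't make any difference. Removing the resized and grayscale images worked for me. I am saving the model after every 5th iteration.
Here is a snapshot of that:

[image: screenshot from 2017-03-31 10-54-13] <https://cloud.githubusercontent.com/assets/7099405/24537536/6baf9c26-1600-11e7-8317-8eb0379523e0.png>

I will continue after a few more iterations.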

Cameron Fabbri · Answer 12 · Fri Mar 31 2017 21:16:46 GMT+0800 (China Standard Time)

Awesome, glad you got it figured out. I will say though, this takes a while to train. I trained for over a day on a GTX 1080 to get the results I showed in the repo. I usually save my models after 500 steps... It looks like you're running on a CPU. I will try and get a pretrained model up at some point.

Angad Singh · Answer 13 · Sat Apr 01 2017 02:15:07 GMT+0800 (China Standard Time)

Right, I don't have a GPU, so I will try it out with far fewer images. It would be a huge help if you could provide a trained model. Thank you for helping.

Angad Singh · Answer 14 · Sat Apr 01 2017 15:30:00 GMT+0800 (China Standard Time)

I trained the model with 10 images to make things faster. Now I am getting a similar error with eval.py. I changed the batch size to 1. Earlier, deleting the grayscale and resized images from the training image set worked for me. I still can't figure out why we are getting this error, as I can't see any hard-coded expression for images.

Below is the snapshot of error:

Below are test images I am trying to colorize:

Angad Singh · Answer 15 · Sun Apr 02 2017 18:52:14 GMT+0800 (China Standard Time)

I think the grayscale images are the cause of this error. As an experiment, I trained the model on the frames/resized images and used the same images for testing/colorizing, and the scripts ran without any problem. However, if grayscale images are part of either the train or test images, I get the errors mentioned above.
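One plausible explanation, which the thread does not confirm, is that single-channel PNG files fail in a pipeline that expects three-channel input, and the failed reader then closes the queue. Under that assumption, the sketch below flags grayscale PNGs by reading the colour-type byte of the IHDR chunk (0 is grayscale, 4 is grayscale with alpha, per the PNG specification). The function names are hypothetical.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
GRAYSCALE_TYPES = {0, 4}  # PNG colour types: grayscale, grayscale + alpha


def png_color_type(header):
    """Return the colour type from the first 26 bytes of a PNG file."""
    if len(header) < 26 or header[:8] != PNG_SIGNATURE or header[12:16] != b"IHDR":
        raise ValueError("not a PNG header")
    # Layout: signature (8), chunk length (4), "IHDR" (4), width (4),
    # height (4), bit depth (1), colour type (1) at byte offset 25.
    return struct.unpack(">B", header[25:26])[0]


def is_grayscale_png(path):
    """True if the PNG at path is stored as grayscale."""
    with open(path, "rb") as f:
        return png_color_type(f.read(26)) in GRAYSCALE_TYPES
```

Running `is_grayscale_png` over the train and test directories would show whether any grayscale files are still mixed in with the colour frames.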