tensorflow / models

Models and examples built with TensorFlow

[deeplab] Training deeplab model with ADE20K dataset

walkerlala opened this issue · comments

commented

System information

  • What is the top-level directory of the model you are using: deeplab
  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version (use command below): 1.6.0
  • Bazel version (if compiling from source):
  • CUDA/cuDNN version: 9.0/7.0.4
  • GPU model and memory: 1080Ti * 2 , 10Gb * 2
  • Exact command to reproduce:

Describe the problem

This is a feature request. I am trying to train the deeplab model with the ADE20K dataset (see this presentation). I have finished the data format conversion and "successfully" trained the model on a small subset of ADE20K. Below is the modification to the file research/deeplab/datasets/segmentation_dataset.py, which is used to extract segmentation data.

diff --git a/research/deeplab/datasets/segmentation_dataset.py b/research/deeplab/datasets/segmentation_dataset.py
index a777252..8648fb2 100644
--- a/research/deeplab/datasets/segmentation_dataset.py
+++ b/research/deeplab/datasets/segmentation_dataset.py
@@ -85,10 +85,22 @@ _PASCAL_VOC_SEG_INFORMATION = DatasetDescriptor(
     ignore_label=255,
 )
 
+_ADE20K_INFORMATION = DatasetDescriptor(
+    splits_to_sizes = {
+        'train': 40,
+        'val': 5,
+    },
+    # TODO temporarily change it to 21 otherwise dimension mismatch
+    num_classes=21,
+    ignore_label=255,
+)
+
 
 _DATASETS_INFORMATION = {
     'cityscapes': _CITYSCAPES_INFORMATION,
     'pascal_voc_seg': _PASCAL_VOC_SEG_INFORMATION,
+    'ade20k': _ADE20K_INFORMATION,
 }
 
 # Default file pattern of TFRecord of TensorFlow Example.

The problem is that the ADE20K dataset has 150 classes, which differs from the number of classes in the VOC and Cityscapes datasets. That causes a problem with the checkpoint file: currently there are only pretrained models for the VOC and Cityscapes datasets. So we have two choices here:

  1. Do not use the checkpoint file. In this case, there is an error:
absl.flags._exceptions.IllegalFlagValueError: flag --tf_initial_checkpoint=None: Flag --tf_initial_checkpoint must be specified.
  2. Set num_classes=21 to use one of the two provided checkpoint files.

Are there any alternatives to these?

If anyone has a workable solution for the ADE20K dataset, it would be really appreciated.

  1. You could modify the code here so that the exclude_list only includes the `_LOGITS_SCOPE_NAME` and also set the flag initialize_last_layer = False. (Note you still want to restore the variables in ASPP, decoder and so on). By doing so, only the weights in the last classification layer are not initialized (then you could use a classification layer with 150 classes).

  2. You need to explore the min_resize_value and max_resize_value (set resize_factor = output_stride) for ADE20K, which contains images of widely varying scales (e.g., dimensions range from 50 to 2000). By setting min_resize_value and max_resize_value, you are able to resize the images on-the-fly to a similar range (or you could do that manually while pre-processing the dataset). Note, however, that these hyper-parameters may affect the performance, and we have not yet explored that carefully.
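
The effect of the exclude list in point 1 can be sketched without TensorFlow. The variable names below are made up for illustration, and `variables_to_restore` is a hypothetical helper that mimics scope-based filtering by name; it is not the function in train_utils.py.

```python
def variables_to_restore(variable_names, exclude_scopes):
    """Return the names that would still be loaded from the checkpoint.

    A name is skipped when one of its '/'-separated path components
    equals an excluded scope name.
    """
    kept = []
    for name in variable_names:
        parts = name.split('/')
        if any(scope in parts for scope in exclude_scopes):
            continue
        kept.append(name)
    return kept


# Hypothetical variable names, loosely modelled on the scopes named in this thread.
names = [
    'MobilenetV2/Conv/weights',
    'aspp0/weights',
    'decoder/feature_projection0/weights',
    'logits/semantic/weights',
]

restored = variables_to_restore(names, exclude_scopes=['logits'])
print(restored)
```

With only the logits scope excluded, the backbone, ASPP and decoder entries stay in the restore list, so the classification layer is the only part that starts from fresh weights.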

commented

@aquariusjay Thanks for the hints. Now I have started the training, using the provided VOC model checkpoint, setting fine_tune_batch_norm to False, using the mobilenet_v2 variant and a batch size of 8. Hopefully the loss will drop after several hours...

There are still two things confusing me:

  1. The segmentation annotation images within the ADE20K dataset have three channels, but I am reading them with label_reader = build_data.ImageReader('png', channels=1), as is done for the VOC dataset (in datasets/build_voc2012_data.py). Will that be a problem?

  2. Why do we have the resize_factor parameter?

commented

Oh, will it be OK to prepare a pull request for the ADE20K dataset?

Regarding your previous questions:

  1. The groundtruth images should contain only 1 channel with values = semantic labels.
  2. You could check the code for details.
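
As a rough illustration of point 1, the sketch below collapses a three-channel annotation into a single-channel label map. It assumes the three channels are identical copies of the class index (a grayscale image saved as RGB), which has to be checked for the files actually in use; if the channels encode something else, the annotation needs a dataset-specific decoder instead. `to_single_channel` is a hypothetical helper, not part of deeplab.

```python
import numpy as np


def to_single_channel(label):
    """Collapse an H x W x C label array into H x W class indices.

    Raises ValueError if the channels differ, because then the first
    channel alone would not be a faithful label map.
    """
    label = np.asarray(label)
    if label.ndim == 2:
        return label  # already single-channel
    if label.ndim != 3:
        raise ValueError('expected a 2-D or 3-D array, got %d-D' % label.ndim)
    first = label[..., 0]
    for c in range(1, label.shape[-1]):
        if not np.array_equal(first, label[..., c]):
            raise ValueError('channels differ; not a replicated grayscale map')
    return first


# Toy 2 x 2 annotation whose three channels all hold the same class indices.
rgb_label = np.stack([np.array([[0, 3], [7, 255]], dtype=np.uint8)] * 3, axis=-1)
single = to_single_channel(rgb_label)
print(single.shape, single.tolist())
```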

We currently do not have any plan to prepare that.
However, note that one should be able to do that by using the provided code/model/script.
Also, any contribution of an extra dataset to the codebase is welcome.

Cheers,

@aquariusjay,

I'm currently having similar issues attempting to train with a custom dataset and was hoping you could offer some insight.

You could modify the code here so that the exclude_list only includes the `_LOGITS_SCOPE_NAME' and also set the flag initialize_last_layer = False.

The link you included ("here") appears to require a Google SSO login. I am assuming that it was a link to the train_util.py script. Here are the changes I have made so far to use your architecture on my custom dataset:

  1. segmentation_dataset.py
  • I added the information for my "toy_dataset"
_TOY_DATASET_INFORMATION = DatasetDescriptor(
    splits_to_sizes={
        'train': 800,
        'trainval': 1000,
        'val': 200,
    },
    num_classes=10,
    ignore_label=255,
)

_DATASETS_INFORMATION = {
    'cityscapes': _CITYSCAPES_INFORMATION,
    'pascal_voc_seg': _PASCAL_VOC_SEG_INFORMATION,
    'toy_dataset': _TOY_DATASET_INFORMATION,
}
  2. train.py
  • I do not initialize the final layer of the network.
  • I point training to the directory containing my custom "toy_dataset"
flags.DEFINE_boolean('initialize_last_layer', False,
                     'Initialize the last layer.')

flags.DEFINE_string('dataset', 'toy_dataset',
                    'Name of the segmentation dataset.')
  3. train_utils.py
  • I modify the code here so that the exclude_list only includes the `_LOGITS_SCOPE_NAME`, as you stated above.
  exclude_list = ['_LOGITS_SCOPE_NAME']
  if not initialize_last_layer:
    exclude_list.extend(last_layers)
  4. eval.py
  • I point evaluation to my custom "toy_dataset".
flags.DEFINE_string('dataset', 'toy_dataset',
                    'Name of the segmentation dataset.')

However, when I run this, my code appears to train successfully, but then runs into an issue with the confusion matrix during evaluation (I include the traceback below for reference). Any tips/suggestions on how to fix this?

Thanks for your help!
Brett

Error Traceback:

~/brett/wss-python/models/research/deeplab$ sh local_test_custom.sh 
Converting toy dataset...
>> Converting image 50/200 shard 0
>> Converting image 100/200 shard 1
>> Converting image 150/200 shard 2
>> Converting image 200/200 shard 3
>> Converting image 250/1000 shard 0
>> Converting image 500/1000 shard 1
>> Converting image 750/1000 shard 2
>> Converting image 1000/1000 shard 3
>> Converting image 200/800 shard 0
>> Converting image 400/800 shard 1
>> Converting image 600/800 shard 2
>> Converting image 800/800 shard 3
--2018-03-30 12:33:03--  http://download.tensorflow.org/models/deeplabv3_pascal_train_aug_2018_01_04.tar.gz
Resolving download.tensorflow.org (download.tensorflow.org)... 172.217.8.176, 2607:f8b0:4009:80d::2010
Connecting to download.tensorflow.org (download.tensorflow.org)|172.217.8.176|:80... connected.
HTTP request sent, awaiting response... 416 Requested range not satisfiable

    The file is already fully retrieved; nothing to do.

toy_dataset
INFO:tensorflow:Training on trainval set
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/losses/losses_impl.py:731: softmax_cross_entropy_with_logits (from tensorflow.python.ops.nn_ops) is deprecated and will be removed in a future version.
Instructions for updating:

Future major versions of TensorFlow will allow gradients to flow
into the labels input on backprop by default.

See tf.nn.softmax_cross_entropy_with_logits_v2.

INFO:tensorflow:Summary name /clone_loss is illegal; using clone_loss instead.
INFO:tensorflow:Ignoring initialization; other checkpoint exists
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/slim/python/slim/learning.py:736: __init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
INFO:tensorflow:Restoring parameters from /home/makbar/brett/wss-python/models/research/deeplab/datasets/toy_dataset/exp/train_on_trainval_set/train/model.ckpt-11
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Starting Session.
INFO:tensorflow:Saving checkpoint to path /home/makbar/brett/wss-python/models/research/deeplab/datasets/toy_dataset/exp/train_on_trainval_set/train/model.ckpt
INFO:tensorflow:Starting Queues.
INFO:tensorflow:global_step/sec: 0
INFO:tensorflow:Recording summary at step 11.
INFO:tensorflow:Stopping Training.
INFO:tensorflow:Finished training! Saving model to disk.
toy_dataset
INFO:tensorflow:Evaluating on val set
INFO:tensorflow:Performing single-scale test.
INFO:tensorflow:Eval num images 200
INFO:tensorflow:Eval batch size 1 and num batch 200
INFO:tensorflow:Waiting for new checkpoint at /home/makbar/brett/wss-python/models/research/deeplab/datasets/toy_dataset/exp/train_on_trainval_set/train
INFO:tensorflow:Found new checkpoint at /home/makbar/brett/wss-python/models/research/deeplab/datasets/toy_dataset/exp/train_on_trainval_set/train/model.ckpt-12
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/training/python/training/evaluation.py:303: get_or_create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.get_or_create_global_step
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /home/makbar/brett/wss-python/models/research/deeplab/datasets/toy_dataset/exp/train_on_trainval_set/train/model.ckpt-12
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Starting evaluation at 2018-03-30-16:35:58
Traceback (most recent call last):
  File "/home/makbar/brett/wss-python/models/research/deeplab/eval.py", line 175, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "/home/makbar/brett/wss-python/models/research/deeplab/eval.py", line 168, in main
    eval_interval_secs=FLAGS.eval_interval_secs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/slim/python/slim/evaluation.py", line 301, in evaluation_loop
    timeout=timeout)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/training/python/training/evaluation.py", line 452, in evaluate_repeatedly
    session.run(eval_ops, feed_dict)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 546, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1022, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1113, in run
    raise six.reraise(*original_exc_info)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1098, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1170, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 950, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 905, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1137, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1355, in _do_run
    options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1374, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: assertion failed: [`predictions` out of bound] [Condition x < y did not hold element-wise:] [x (mean_iou/confusion_matrix/control_dependency_1:0) = ] [255 255 255...] [y (mean_iou/ToInt64_2:0) = ] [10]
	 [[Node: mean_iou/confusion_matrix/assert_less_1/Assert/AssertGuard/Assert = Assert[T=[DT_STRING, DT_STRING, DT_STRING, DT_INT64, DT_STRING, DT_INT64], summarize=3, _device="/job:localhost/replica:0/task:0/device:CPU:0"](mean_iou/confusion_matrix/assert_less_1/Assert/AssertGuard/Assert/Switch, mean_iou/confusion_matrix/assert_less_1/Assert/AssertGuard/Assert/data_0, mean_iou/confusion_matrix/assert_less_1/Assert/AssertGuard/Assert/data_1, mean_iou/confusion_matrix/assert_less_1/Assert/AssertGuard/Assert/data_2, mean_iou/confusion_matrix/assert_less_1/Assert/AssertGuard/Assert/Switch_1, mean_iou/confusion_matrix/assert_less_1/Assert/AssertGuard/Assert/data_4, mean_iou/confusion_matrix/assert_less_1/Assert/AssertGuard/Assert/Switch_2)]]

Caused by op u'mean_iou/confusion_matrix/assert_less_1/Assert/AssertGuard/Assert', defined at:
  File "/home/makbar/brett/wss-python/models/research/deeplab/eval.py", line 175, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "/home/makbar/brett/wss-python/models/research/deeplab/eval.py", line 142, in main
    predictions, labels, dataset.num_classes, weights=weights)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/metrics_impl.py", line 1009, in mean_iou
    num_classes, weights)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/metrics_impl.py", line 263, in _streaming_confusion_matrix
    labels, predictions, num_classes, weights=weights, dtype=dtypes.float64)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/confusion_matrix.py", line 183, in confusion_matrix
    message='`predictions` out of bound')],
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/check_ops.py", line 579, in assert_less
    return control_flow_ops.Assert(condition, data, summarize=summarize)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/tf_should_use.py", line 118, in wrapped
    return _add_should_use_warning(fn(*args, **kwargs))
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 177, in Assert
    guarded_assert = cond(condition, no_op, true_assert, name="AssertGuard")
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py", line 432, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 2027, in cond
    orig_res_f, res_f = context_f.BuildCondBranch(false_fn)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 1868, in BuildCondBranch
    original_result = fn()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 175, in true_assert
    condition, data, summarize, name="Assert")
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_logging_ops.py", line 48, in _assert
    name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 3271, in create_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1650, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): assertion failed: [`predictions` out of bound] [Condition x < y did not hold element-wise:] [x (mean_iou/confusion_matrix/control_dependency_1:0) = ] [255 255 255...] [y (mean_iou/ToInt64_2:0) = ] [10]
	 [[Node: mean_iou/confusion_matrix/assert_less_1/Assert/AssertGuard/Assert = Assert[T=[DT_STRING, DT_STRING, DT_STRING, DT_INT64, DT_STRING, DT_INT64], summarize=3, _device="/job:localhost/replica:0/task:0/device:CPU:0"](mean_iou/confusion_matrix/assert_less_1/Assert/AssertGuard/Assert/Switch, mean_iou/confusion_matrix/assert_less_1/Assert/AssertGuard/Assert/data_0, mean_iou/confusion_matrix/assert_less_1/Assert/AssertGuard/Assert/data_1, mean_iou/confusion_matrix/assert_less_1/Assert/AssertGuard/Assert/data_2, mean_iou/confusion_matrix/assert_less_1/Assert/AssertGuard/Assert/Switch_1, mean_iou/confusion_matrix/assert_less_1/Assert/AssertGuard/Assert/data_4, mean_iou/confusion_matrix/assert_less_1/Assert/AssertGuard/Assert/Switch_2)]]
commented

@walkerlala

I am trying to train the deeplab model with the ADE20K dataset.
I'm having some problems with the data format conversion.
Would you mind sharing the code for the ADE20K dataset? It would be really appreciated.

@brett-whitford When I use my own data, I get the same error as you. Can you share your solution?
Thank you very much. I'm looking forward to your reply.

commented

@wonderit Of course. Please wait for a while until I have access to my GPU server.

commented

@wonderit Here is the patch for converting training data and training deeplabv3 on ADE20K.

https://gist.github.com/walkerlala/82d978e68407e65158e8825cd470d7e1

(it can also be found at http://fastdrivers.org/misc/patch-for-ade20k.patch )

You can apply this patch on top of commit 1d38a22 or 5281c9a without conflict.

Note:

  1. You need to manually adjust the path in train_ade20k.py for training, and supply the correct path to the training data when converting it, as documented in the doc.

  2. training data can be found at: http://data.csail.mit.edu/places/ADEchallenge/ADEChallengeData2016.zip

I am also going to submit a PR to get these into the repo. However, I don't have enough GPUs to get a good pretrained model (I only have two Nvidia 1080s...). If you can obtain a decent pretrained model, please share!

commented

Also, anyone interested in adding ADE20K to deeplabv3 can take a look at this PR I just created: #3853

@walkerlala When using val.py, did you get the error `predictions` out of bound? It is the same as @brett-whitford's question.
Thank you

@walkerlala Can you share your eval script?

commented

@walkerlala @aquariusjay
Hi, I am confused about the exclude_list and initialize_last_layer.

I am not sure whether I understand it correctly:
If one wants to fine-tune deeplab-v3+ on another dataset, does only _LOGITS_SCOPE_NAME need to be excluded?

If so, following @aquariusjay 's suggestion, in "train_utils.py":

exclude_list = [_LOGITS_SCOPE_NAME]
if not initialize_last_layer:
    exclude_list.extend(last_layers)

If initialize_last_layer=false is set, then exclude_list will include the last_layers. In "train.py", last_layers is the list [_LOGITS_SCOPE_NAME, _IMAGE_POOLING_SCOPE, _ASPP_SCOPE, _CONCAT_PROJECTION_SCOPE, _DECODER_SCOPE].
So all variables in the list will be excluded. This seems inconsistent.

Shouldn't it be the following?
initialize_last_layer=true and exclude_list = [_LOGITS_SCOPE_NAME]

Hi, I'm training on my own dataset as well (only two classes).

When I set initialize_last_layer=false and

exclude_list = ['logits']
if not initialize_last_layer:
    exclude_list.extend(last_layers)

Then when I run vis.py, it gives me all-black images (not binary).

When I only set initialize_last_layer=false, I get binary images (the result is not good, but at least it shows some learning). But it gives me this when I run train.py:

INFO:tensorflow:Starting Queues.
INFO:tensorflow:global_step/sec: 0
INFO:tensorflow:Recording summary at step 6390723.
INFO:tensorflow:Stopping Training.
INFO:tensorflow:Finished training! Saving model to disk.

when training_number_of_steps=100000

Does anyone know why this happens? Thanks!

commented

@lydialixia
Hello.
You should add 'global_step' to exclude_list:

exclude_list = ['global_step']
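
One plausible reading of the log quoted earlier, stated here as an assumption: the step counter stored in the pretrained checkpoint is restored together with the weights, and training ends at once because that counter already exceeds training_number_of_steps. The arithmetic is sketched below with a hypothetical helper `steps_left`; the numbers are the ones from that log.

```python
def steps_left(restored_global_step, training_number_of_steps):
    """Number of training steps still to run after restoring a checkpoint."""
    return max(0, training_number_of_steps - restored_global_step)


# global_step restored from the checkpoint (value taken from the quoted log).
print(steps_left(6390723, 100000))

# global_step excluded from restoring, so the counter starts from zero.
print(steps_left(0, 100000))
```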

But I am still confused about whether one should set initialize_last_layer=false when fine-tuning deeplab-v3+ on another task.

When you want to fine-tune DeepLab on other datasets, there are a few cases:

  1. You want to re-use ALL the trained weights: set initialize_last_layer = True (last_layers_contain_logits_only does not matter in this case).

  2. You want to re-use ONLY the network backbone (i.e., exclude ASPP, decoder and so on): set initialize_last_layer = False and last_layers_contain_logits_only = False.

  3. You want to re-use ALL the trained weights EXCEPT the logits (since the num_classes may be different): set initialize_last_layer = False and last_layers_contain_logits_only = True.
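
The three cases can be written down as a small lookup. This is a sketch of the logic as described in this thread, not a copy of train.py. The scope strings other than 'logits' and 'global_step' are assumed values for the constants listed earlier (_IMAGE_POOLING_SCOPE, _ASPP_SCOPE, _CONCAT_PROJECTION_SCOPE, _DECODER_SCOPE), and always skipping 'global_step' follows the suggestion made above rather than anything confirmed about the code.

```python
LOGITS_SCOPE = 'logits'
# Assumed string values for the remaining "last layer" scopes.
OTHER_LAST_LAYER_SCOPES = ['image_pooling', 'aspp', 'concat_projection', 'decoder']


def scopes_to_exclude(initialize_last_layer, last_layers_contain_logits_only):
    """Return the scopes that are NOT restored from the initial checkpoint."""
    # global_step is skipped so that the step counter starts from zero.
    exclude = ['global_step']
    if initialize_last_layer:
        return exclude  # case 1: re-use all trained weights
    if last_layers_contain_logits_only:
        return exclude + [LOGITS_SCOPE]  # case 3: drop only the classifier
    # case 2: keep only the backbone
    return exclude + [LOGITS_SCOPE] + OTHER_LAST_LAYER_SCOPES


for flags in [(True, False), (False, False), (False, True)]:
    print(flags, scopes_to_exclude(*flags))
```

For a dataset whose num_classes differs from the checkpoint's, case 3 is the one that leaves everything but the classification layer initialized from the pretrained weights.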

Hi @walkerlala: did you manage to finetune the ADE20K dataset?
I'm trying to finetune on a dataset of the same size, but without success: after the first ~2K iterations the loss stops decreasing and starts to oscillate (~20K iterations).
I tried different learning rates, removed the regularization, but for the moment no improvement.

commented

@georgosgeorgos No, in the end I couldn't fine-tune the model on the ADE20K dataset. I don't have enough GPU resources. Every time I try to fine-tune the batch normalization parameters, the model blows up with an out-of-memory error. So I froze the batch normalization layers when training. In the end I only got a model with "modest" performance:

Here is the original image (too large to display here): http://www.fastdrivers.org/misc/stuffseg-origin.jpg

Here is the segmentation result:
result

However I can get a satisfying result with PSPNet:

mmexport_1_473_seg

According to the slides from the 2017 Coco + Places Workshop, deeplabv3 should also be able to do that, but I haven't had any luck fine-tuning it. Hopefully Google can provide a fine-tuned pre-trained model in the future @aquariusjay.

@brett-whitford - Hi Brett, I am having the exact same problem as you. How did you end up solving it?

@shipeng-uestc - Hi shipeng, did you manage to solve the issue? I am currently using exclude_list=[_LOGITS_SCOPE_NAME] with _LOGITS_SCOPE_NAME imported from deeplab.model as @walkerlala suggested but I am still having the same error as Brett.

When I run

python deeplab/eval.py \
  --logtostderr \
  --eval_split="val" \
  --model_variant="xception_65" \
  --atrous_rates=6 \
  --atrous_rates=12 \
  --atrous_rates=18 \
  --output_stride=16 \
  --decoder_output_stride=4 \
  --eval_crop_size=513 \
  --eval_crop_size=513 \
  --dataset="ade20k" \
  --checkpoint_dir="./deeplab/datasets/ADE20K/exp/train_on_train_set/train" \
  --eval_logdir="./deeplab/datasets/ADE20K/exp/train_on_train_set/eval" \
  --dataset_dir="./deeplab/datasets/ADE20K/tfrecord"

I get:

NotFoundError (see above for traceback): Key aspp1_depthwise/BatchNorm/beta not found in checkpoint
[[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]
[[Node: save/RestoreV2/_299 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_306_save/RestoreV2", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]]
Please help me! Thanks.

@hhwxxx Hello, in your answer to lydialixia, do you mean in train_util.py, exclude_list should be like this:
exclude_list = ['global_step']
exclude_list = ['logits']

But I still can't start training; the output is:
INFO:tensorflow:Starting Queues.
INFO:tensorflow:global_step/sec: 0
INFO:tensorflow:Recording summary at step 30000.
INFO:tensorflow:Stopping Training.
INFO:tensorflow:Finished training! Saving model to disk.

I have also tried exclude_list = ['_LOGITS_SCOPE_NAME'], but this doesn't work.
When I just set exclude_list = ['global_step'], the model achieves mean IoU = 0.93 after 10000 iterations; I don't know whether this is wrong.
Waiting online, thank you!

commented

@qmy612

Hello. Maybe you can try this:
exclude_list = ['global_step', 'logits']

As for _LOGITS_SCOPE_NAME, it is defined in "model.py", so you should use it like this: model._LOGITS_SCOPE_NAME.
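
The difference between the quoted name and the constant is easy to miss. In the sketch below, `_LOGITS_SCOPE_NAME` is a stand-in defined locally with the value 'logits' (the value used elsewhere in this thread), and the variable name is hypothetical.

```python
# Stand-in for the constant defined in model.py (value assumed to be 'logits').
_LOGITS_SCOPE_NAME = 'logits'

quoted = ['_LOGITS_SCOPE_NAME']   # holds the literal text of the identifier
by_value = [_LOGITS_SCOPE_NAME]   # holds the constant's value

variable_name = 'logits/semantic/weights'  # hypothetical variable name

# Only the second list names a scope that the variable actually lives under.
print(any(variable_name.startswith(s + '/') for s in quoted))
print(any(variable_name.startswith(s + '/') for s in by_value))
```

The quoted form is a valid Python list, so it raises no error; it simply matches no scope, which would explain why it appears to have no effect.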

And I have no idea about miou=0.93.

Just setting initialize_last_layer = False and last_layers_contain_logits_only = True works for me, if you want to train on your own dataset with a different number of classes.

@BeSlower, yes, the solution works for me, but there is another problem: the result is all black with no other label, even though the loss decreases during training. Can anyone help me?

@qmy612 Did you get the problem solved? I am having the exact same problem as you.

@xiangjinwu Yes, the answer from hhwxxx works.
exclude_list = ['global_step', 'logits']

@aquariusjay
Hello, I am training deeplab on my own dataset, which has only one class (excluding unlabeled) and follows the same style as Cityscapes, but some problems keep happening.
One is that the server always restarts during training.
Another is that the result contains only the one color of the class I labeled.
Can you give me some advice? Thanks.

@qmy612 Thx a lot, It works

@Soulempty,
Regarding your questions:

  1. I have no idea what you mean by the server always restarting. Could you please provide more details, such as logs?
  2. In your case, the data samples may be strongly biased toward one of the classes. That is why the model only predicts one class in the end. To handle that, I would suggest using a larger loss_weight for the under-sampled class (i.e., the class that has fewer data samples). You could modify the weights in line 72 by doing something like
    weights = tf.to_float(tf.equal(scaled_labels, 0)) * label0_weight + tf.to_float(tf.equal(scaled_labels, 1)) * label1_weight + tf.to_float(tf.equal(scaled_labels, ignore_label)) * 0.0
    where you need to tune the label0_weight and label1_weight (e.g., set label0_weight=1 and increase label1_weight).
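
The same weighting can be tried out on a toy array with NumPy before touching the TensorFlow code. `pixel_weights`, the label values and the weight values below are illustrative assumptions.

```python
import numpy as np


def pixel_weights(labels, label0_weight, label1_weight, ignore_label=255):
    """Per-pixel loss weights for a two-class problem with an ignore label."""
    labels = np.asarray(labels)
    # Start from zero so that ignored (and any unexpected) pixels get no weight.
    weights = np.zeros(labels.shape, dtype=np.float32)
    weights[labels == 0] = label0_weight
    weights[labels == 1] = label1_weight
    return weights


labels = np.array([[0, 0, 1], [0, 255, 1]])
weights = pixel_weights(labels, label0_weight=1.0, label1_weight=5.0)
print(weights.tolist())
```

Pixels of the rarer class contribute more to the loss, and pixels carrying the ignore label contribute nothing.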

@aquariusjay
Thank you for your detailed solution. I want to give you more details about my problem.
1. My dataset follows the style of Cityscapes but has only one class ("road"), so the ground truth label only has road pixels and ground pixels (which are not labelled).
2. The following is my ground truth label.
17_nov00100472_gtfine_labeltrainids
3. The following is my JSON label.
{"imgWidth": 1280, "imgHeight": 1080, "objects": [{"label": "road", "polygon": [[1.0, 612.0], [0.0, 953.0], [407.1, 965.1], [711.0, 963.4], [1094.2, 970.3], [1147.7, 963.4], [1185.9, 961.7], [1279.1, 969.9], [1279.0, 696.0], [918.7, 584.6], [881.0, 573.1], [837.4, 561.6], [821.4, 564.1], [795.0, 565.4], [769.2, 565.2], [769.8, 589.9], [763.2, 600.3], [716.7, 603.5], [706.3, 601.4], [703.5, 578.0], [709.2, 566.3], [702.5, 565.2], [697.8, 573.7], [682.6, 571.6], [671.2, 574.8], [666.5, 579.1], [660.8, 582.2], [632.4, 582.2], [624.8, 580.1], [619.5, 569.3], [422.2, 582.2], [427.8, 613.5], [426.5, 646.0], [418.9, 654.5], [367.2, 664.5], [355.9, 667.3], [258.7, 665.9], [247.3, 664.5], [233.4, 640.4], [227.0, 598.3]]}]}
4. The following is part of the Cityscapes label script.
labels = [
    # name id trainId category catId hasInstances ignoreInEval color
    Label( 'unlabeled' , 0 , 255 , 'void' , 0 , False , True , ( 0, 0,0) ),
    Label( 'road' , 1 , 1 , 'flat' , 1 , False , False , (128, 64,128) ),
]

The picture is the prediction result; the colour is the colour of road, but there is no ground colour.

000002_prediction

@aquariusjay I got black images when using the default loss_weight. By setting the loss_weight, my problem was solved, since my data is imbalanced.

@aquariusjay
Hello, when I train on my dataset, which has only one class (the label is "road"), with the background set to unlabeled, the loss stays at the same value, 0.2622.
Can you give some advice on how to train on a dataset with one class? I think this is important for other people too. Thank you.
The following are some details:
image
image
screenshot from 2018-05-10 02-58-20

screenshot from 2018-05-10 02-56-27

@Soulempty
Your question is not related to this issue (ADE20K).
Could you please open a new one so that people who have similar experience could share (e.g., @shanyucha)?
I do not have access to your dataset, and it usually takes experimental experience to tune the hyper-parameters.

Thank you. I think I have solved the problem of how to train on a dataset with one class, inspired by your first piece of advice.

@brett-whitford To solve this problem you could inspect the maximum pixel value in the pre-processed gray scale images (after being processed by remove_gt_colormap.py). Your num_classes should be greater than the max pixel value in the images.
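
A check along these lines can be run over the converted label images before building the TFRecords. It is a minimal sketch: `check_label_values` is a hypothetical helper, the toy array stands in for an image loaded from disk, and treating 255 as the ignore label follows the dataset descriptors shown earlier in this thread.

```python
import numpy as np


def check_label_values(label, num_classes, ignore_label=255):
    """Return the sorted label values that fall outside [0, num_classes).

    Pixels equal to ignore_label are not counted as errors.
    """
    values = [int(v) for v in np.unique(np.asarray(label))]
    return [v for v in values if v != ignore_label and not 0 <= v < num_classes]


# Toy label map: class 12 is out of range for a 10-class dataset.
label = np.array([[0, 3, 9], [12, 255, 1]], dtype=np.uint8)
print(check_label_values(label, num_classes=10))
```

An empty result for every label image means each pixel is either a valid class index or the ignore label.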

I retrained deeplab on the ADE20K dataset in my Google Colab notebook; below are the results with MobileNet-v2 and Xception_65 as the initial checkpoint. However, I couldn't fine-tune because of an OOM error. Maybe others can share training parameters that give better results?

MobileNet-v2
ade20k-mobile-2000iter-2batch

Xception_65
ade20k-xception-2000iter-2batch

commented

@Soulempty Could you please share more details about how to train on a custom dataset with only one class? I would really appreciate it. Thanks!

Just as in the details I showed above, but set the trainId of unlabelled to 1.

commented

@Soulempty
Thanks. I am still confused, since I have no idea what the label variable is or where I can find it.
39856204-04da98e8-53fd-11e8-9876-0165c575b0e7

the ground truth label

commented

@lydialixia Could you please share a more detailed tutorial on how to train on a custom dataset with two classes?

commented

@Soulempty I am sorry, but I still cannot figure out how to train on a custom dataset with two classes. Could you give a tutorial on how to do it? Thanks very much!

My dataset has the same style as Cityscapes. What is your data like?

commented

@Soulempty
Thanks for your reply. My dataset is from Kaggle, https://www.kaggle.com/c/ultrasound-nerve-segmentation/data.

This dataset contains 5635 images in total. (I split it into a training set of 4000 images and a validation set of 1635 images.)

The original image and its corresponding mask are shown below:

image
image

I have changed the images in the training set to the *.jpg extension and the images in the validation set to *.png. Then I saved them in the style of VOC2012, as shown here:

image

Then I followed the tutorial of @brettkoonce, but it seems there is something wrong with the training procedure.

@RomRoc I am retraining on ADE20K too.
Maybe the link to download the dataset has changed (http://groups.csail.mit.edu/vision/datasets/ADE20K/), right?
Could you share what you changed in the code to retrain on ADE20K?
Thanks

@urgonguyen Check my Jupyter notebook here, which runs in Google Colab.
To download ADE20K and convert it, you should use the download_and_convert_ade20k.sh script.

@walkerlala Why, when training on ADE20K, did you set min_resize_value=350 and max_resize_value=500?

commented

@urgonguyen That is a hyper-parameter. You can try to tune that to see whether the result would be better.
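
To get a feel for what these values do, the sketch below computes a target shape from min_resize_value, max_resize_value and resize_factor. It is an approximation written for this thread, not the library's implementation: `resized_shape` is a hypothetical helper, and rounding each side up so that (side - 1) is a multiple of resize_factor is an assumed convention, chosen because it is consistent with the 513 crop size used with output_stride=16 elsewhere here.

```python
def resized_shape(height, width, min_resize_value, max_resize_value, resize_factor):
    """Sketch of a target shape for range-based resizing (integer arithmetic)."""
    def ceil_div(a, b):
        return -(-a // b)

    short, long_ = min(height, width), max(height, width)
    # First scale so that the shorter side reaches min_resize_value.
    new_h = ceil_div(height * min_resize_value, short)
    new_w = ceil_div(width * min_resize_value, short)
    # If the longer side now exceeds the cap, scale by the longer side instead.
    if max(new_h, new_w) > max_resize_value:
        new_h = ceil_div(height * max_resize_value, long_)
        new_w = ceil_div(width * max_resize_value, long_)
    # Round each side up so that (side - 1) is a multiple of resize_factor.
    new_h += (resize_factor - (new_h - 1) % resize_factor) % resize_factor
    new_w += (resize_factor - (new_w - 1) % resize_factor) % resize_factor
    return new_h, new_w


# A very small and a very large image, using the values quoted above.
print(resized_shape(50, 80, 350, 500, 16))
print(resized_shape(2000, 1500, 350, 500, 16))
```

Both extremes end up with sides of a few hundred pixels, which is the "similar range" effect described earlier in the thread.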

@Soulempty
Hi Soulempty, I am trying to train deeplabv3 on my dataset, but something is wrong and I don't know why. Please help me.
My environment:
CUDA V9.0.176, cuDNN 7.0, tensorflow-gpu 1.8, Titan X x8
My data: 20000 pictures, 2 labels; the object is lane (4 pixels wide), everything else is background. Every picture has size [512,512]; the segmentation annotation is a grayscale picture in which lane pixels are 1 and other (background) pixels are 0.
My changes include:

_LANE_SEG_INFORMATION = DatasetDescriptor(
    splits_to_sizes={
        'train': 18000,
        'trainval': 20000,
        'val': 2000,
    },
    num_classes=2,
    ignore_label=255,
)

and add:

_DATASETS_INFORMATION = {
    'cityscapes': _CITYSCAPES_INFORMATION,
    'pascal_voc_seg': _PASCAL_VOC_SEG_INFORMATION,
    'lane_seg': _LANE_SEG_INFORMATION,
    'ade20k': _ADE20K_INFORMATION,
}

train:

NUM_ITERATIONS=10000
python "${WORK_DIR}"/train.py \
  --logtostderr \
  --initialize_last_layer=False \
  --num_clones=1 \
  --last_layers_contain_logits_only=False \
  --dataset='lane_seg' \
  --train_split="trainval" \
  --model_variant="xception_65" \
  --atrous_rates=6 \
  --atrous_rates=12 \
  --atrous_rates=18 \
  --output_stride=16 \
  --decoder_output_stride=4 \
  --train_crop_size=513 \
  --train_crop_size=513 \
  --train_batch_size=4 \
  --training_number_of_steps="${NUM_ITERATIONS}" \
  --fine_tune_batch_norm=true \
  --tf_initial_checkpoint="${INIT_FOLDER}/deeplabv3_pascal_train_aug/model.ckpt" \
  --train_logdir="${TRAIN_LOGDIR}" \
  --dataset_dir="${LANE_DATASET}"

python "${WORK_DIR}"/eval.py \
  --logtostderr \
  --eval_split="val" \
  --dataset="lane_seg" \
  --model_variant="xception_65" \
  --atrous_rates=6 \
  --atrous_rates=12 \
  --atrous_rates=18 \
  --output_stride=16 \
  --decoder_output_stride=4 \
  --eval_crop_size=513 \
  --eval_crop_size=513 \
  --checkpoint_dir="${TRAIN_LOGDIR}" \
  --eval_logdir="${EVAL_LOGDIR}" \
  --dataset_dir="${LANE_DATASET}" \
  --max_number_of_evaluations=1

python "${WORK_DIR}"/vis_lane.py \
  --logtostderr \
  --vis_split="val" \
  --dataset="lane_seg" \
  --model_variant="xception_65" \
  --atrous_rates=6 \
  --atrous_rates=12 \
  --atrous_rates=18 \
  --output_stride=16 \
  --decoder_output_stride=4 \
  --vis_crop_size=513 \
  --vis_crop_size=513 \
  --checkpoint_dir="${TRAIN_LOGDIR}" \
  --vis_logdir="${VIS_LOGDIR}" \
  --dataset_dir="${LANE_DATASET}" \
  --max_number_of_iterations=1
python "${WORK_DIR}"/export_model.py \
  --logtostderr \
  --checkpoint_path="${CKPT_PATH}" \
  --export_path="${EXPORT_PATH}" \
  --model_variant="xception_65" \
  --atrous_rates=6 \
  --atrous_rates=12 \
  --atrous_rates=18 \
  --output_stride=16 \
  --decoder_output_stride=4 \
  --num_classes=2 \
  --crop_size=513 \
  --crop_size=513 \
  --inference_scales=1.0

I changed exclude_list in train_utils.py to exclude_list=['global_step',logits]
When I train on the dataset, I get this error:

Input checkpoint '/home/mc12/models/research/deeplab/datasets/lane_seg/exp/train_on_trainval_set/train/model.ckpt-10000' doesn't exist!

My questions are:

  • Why, although I set NUM_ITERATIONS=10000, does the train script not train when I run it, and instead use the weights the script downloaded from the internet (model.3000)?
INFO:tensorflow:Restoring parameters from /home/mc12/models/research/deeplab/datasets/lane_seg/exp/train_on_trainval_set/train/model.ckpt-30005

  • When I visualize the pictures, the result is like:
    I think it only uses the VOC pretrained weights for inference, so the result is bad. Is that right?
  • My data samples may be strongly biased towards background. How can I tune the loss weights?

@parachutel

Your num_classes should be greater than the max pixel value in the images
I don't understand this. My labels are 0: background and 1: lane. Does this mean I should set num_classes to 1, or to 2, 3, 4, ...? (In my case I set it to 2, for background and lane.)

Thanks for your help!
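
One possible reading of the log above (an assumption, not confirmed in the thread): the training directory already holds a checkpoint at step 30005, so a run that asks for 10000 steps has nothing left to do and model.ckpt-10000 is never written. A minimal sketch for checking that, using only file names; `latest_checkpoint_step` and `will_train` are hypothetical helpers, not part of DeepLab:

```python
import re


def latest_checkpoint_step(filenames):
    """Return the highest step among names like 'model.ckpt-30005.index', or None."""
    pattern = re.compile(r"^model\.ckpt-(\d+)\.index$")
    steps = [int(m.group(1)) for m in map(pattern.match, filenames) if m]
    return max(steps) if steps else None


def will_train(filenames, training_number_of_steps):
    """True if the requested step count lies beyond the newest saved checkpoint."""
    latest = latest_checkpoint_step(filenames)
    return latest is None or latest < training_number_of_steps


# Hypothetical directory listing, e.g. what os.listdir(train_logdir) might return.
listing = ["checkpoint", "model.ckpt-30005.index", "model.ckpt-30005.meta",
           "graph.pbtxt"]
print(latest_checkpoint_step(listing))
print(will_train(listing, 10000))
```

If the check reports that nothing would be trained, starting from an empty train_logdir or raising the step count are the two obvious things to try.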

@hhwxxx,

Could you explain your previous answer in this thread in more detail? What does logits stand for here? Thanks.

@qmy612

Hello. Maybe you can try this:
exclude_list = ['global_step', 'logits']

The line tf_initial_checkpoint="${INIT_FOLDER}/deeplabv3_pascal_train_aug/model.ckpt" decides which weights are used.
You can set the number of classes to 2, with the lane labeled 0 and the background labeled 1.

commented

@wenouyang hello.
The logits are the last feature maps before the softmax.

Maybe this can help you.

The vector of raw (non-normalized) predictions that a classification model generates, which is ordinarily then passed to a normalization function. If the model is solving a multi-class classification problem, logits typically become an input to the softmax function. The softmax function then generates a vector of (normalized) probabilities with one value for each possible class.
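
As a small illustration of that definition, here is a sketch of how raw logits for one pixel become class probabilities. The three logit values are made up for the example; DeepLab itself does this inside TensorFlow, not with this function:

```python
import math


def softmax(logits):
    """Turn raw scores (logits) into probabilities that sum to 1."""
    # Subtracting the maximum is a standard numerical-stability trick;
    # it does not change the resulting probabilities.
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for a single pixel and three classes.
pixel_logits = [2.0, 1.0, 0.1]
probs = softmax(pixel_logits)
# The predicted class is the index of the largest probability.
predicted_class = probs.index(max(probs))
print(probs, predicted_class)
```

The length of the logits vector equals num_classes, which is why the logits layer cannot be restored from a checkpoint trained with a different number of classes.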

@Soulempty Hi, have you solved the issue of assigning different weights for different classes? I have tried the advice from the contributor, but there seems to be something wrong as my loss keeps oscillating around 0.11.

Hi

Currently I’m struggling with improving the results using deeplab trained on my own dataset.
I’ve trained deeplab successfully a few times using different pretrained models from the model zoo, all based on xception_65, but my results keep staying in the same miou range, somewhere around this interval [10, 11].
I have only one GPU at my disposal with 11GB GPU memory.
My dataset has 8 classes with various object sizes, from little to big, and is quite unbalanced.
Here are the label weights: [1, 4, 4, 17, 42, 36, 19, 20].
In my dataset I have 757 instances for training and 100 validation.
I’ve tried to adjust parameters like learning rate, last_layer_gradient_multiplier, weight decay.
I’ve also tried some kind of weighting using the above weights in this formula

weights = (
    tf.to_float(tf.equal(scaled_labels, 0)) * 1 +
    tf.to_float(tf.equal(scaled_labels, 1)) * 4 +
    tf.to_float(tf.equal(scaled_labels, 2)) * 4 +
    tf.to_float(tf.equal(scaled_labels, 3)) * 17 +
    tf.to_float(tf.equal(scaled_labels, 4)) * 42 +
    tf.to_float(tf.equal(scaled_labels, 5)) * 36 +
    tf.to_float(tf.equal(scaled_labels, 6)) * 19 +
    tf.to_float(tf.equal(scaled_labels, 7)) * 20 +
    tf.to_float(tf.equal(scaled_labels, ignore_label)) * 0.0)

but it turns out the algorithm won’t converge.
I’ve trained without fine tuning the batch normalization parameters. Although I tried training those parameters with a 321 crop size in order to be able to fit a batch size of 12 in my GPU.
I’ve tried training on various sizes 321, 513, 769.
The point being I need some tips to figure out what I can do to improve those results.
What do you guys think? Do I need more data in order to increase my miou or hardware?
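
To make the weighting formula in the post above easier to check by hand, here is a plain-Python sketch of the same idea: every pixel gets the weight of its class and ignored pixels get 0. `per_pixel_weights` is a hypothetical helper, and ignore_label=255 is assumed because that is the value used in the dataset descriptors in this thread:

```python
def per_pixel_weights(labels, class_weights, ignore_label=255):
    """Map each label to its class weight; ignored pixels get weight 0."""
    return [
        [0.0 if v == ignore_label else float(class_weights[v]) for v in row]
        for row in labels
    ]


class_weights = [1, 4, 4, 17, 42, 36, 19, 20]  # label weights from the post
# Tiny made-up label map with one ignored pixel.
labels = [
    [0, 1, 255],
    [4, 7, 0],
]
weights = per_pixel_weights(labels, class_weights)
print(weights)
```

Printing such a map for a few real label images is a cheap way to confirm that the weights land on the classes you intended.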

@weehe91325 I'm afraid you are doing it wrong. If the ratio between classes is 1:4, for example, then the weights should be 4:1 instead of 1:4.

@shanyucha my bad, I meant label weights, not ratios. I updated the comment.

Hello, I input pictures of 513*513 and get the following error. How can I solve it?
InvalidArgumentError (see above for traceback): Assign requires shapes of both tensors to match. lhs shape= [320] rhs shape= [2048] [[Node: save/Assign_8 = Assign[T=DT_FLOAT, _class=["loc:@aspp1_depthwise/BatchNorm/beta"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/device:CPU:0"](aspp1_depthwise/BatchNorm/beta, save/RestoreV2:8)]]

Hi guys,
It's now been a couple of weeks trying to adapt DeepLabV3+ to a TestSet with two classes: circles and squares, below you can see one example of my images and the respective annotation.

I've tried all instructions and suggestions from this thread but all I get are black images as output... I really don't understand where and what I'm doing wrong so I hope you guys can give me a hand. Below I'll summarize all my info:

1. TestSet

I created 1000 images and split 800 for training and 200 for validation. Images are 512x512 RGB (3 channels) JPGs and annotations are 512x512 gray (single channel) PNGs where circles have an intensity of 250, squares 150 and background 0. The dataset folder is organized in the same fashion as ADE20K:

2. datasets > segmentation_dataset.py

When setting ignore_label = 0 predictions come back as a red or green image; and when setting ignore_label = 255 predictions come back as a black image

_TESTSET_INFORMATION = DatasetDescriptor(
    splits_to_sizes={
        'train': 800,  # num of samples in images/training
        'val': 200,  # num of samples in images/validation
    },
    num_classes=3,
    ignore_label=255,
)
_DATASETS_INFORMATION = {
    'cityscapes': _CITYSCAPES_INFORMATION,
    'pascal_voc_seg': _PASCAL_VOC_SEG_INFORMATION,
    'ade20k': _ADE20K_INFORMATION,
    'testset': _TESTSET_INFORMATION,
}

3. utils > train_utils.py

  # Variables that will not be restored.
  exclude_list = ['logits']

Although I've also tried exclude_list = ['global_step'] and exclude_list = ['global_step', 'logits'] but the results are the same.
I've also played a little bit with line 72 and the loss_weight as suggested by @aquariusjay but that doesn't seem to help either.

weights = (tf.to_float(tf.equal(scaled_labels, 0)) * 1.0 +
           tf.to_float(tf.equal(scaled_labels, 1)) * 2.0 +
           tf.to_float(tf.equal(scaled_labels, ignore_label)) * 0.0)
not_ignore_mask = tf.to_float(tf.not_equal(scaled_labels, ignore_label)) * weights
# === original lines ===
# not_ignore_mask = tf.to_float(tf.not_equal(scaled_labels, ignore_label)) * loss_weight

4. eval.py

I added the lines suggested HERE to solve the ['predictions' out of bound] error in the evaluation stage:

# Define the evaluation metric.
metric_map = {}
# ============ Added by B.A.D. =====================
indices = tf.squeeze( tf.where( tf.less_equal(labels, dataset.num_classes-1) ), 1 )
labels = tf.cast( tf.gather( labels, indices ), tf.int32 )
predictions = tf.gather( predictions, indices )
# ==============================================
metric_map[predictions_tag] = tf.metrics.mean_iou(
        predictions, labels, dataset.num_classes, weights=weights)

5. main_TestSet.sh

The relevant sections of this shell script are shown below. Training starts with a loss around 1.32 and gets to 0.31 at the end of 5000 iterations. It performs the evaluation stage without errors or warnings and at the end it prints: miou_1.0[1]. Finally, it runs the visualization script with the 200 images, but all predictions are just black images. Please give me a hand!

NUM_ITERATIONS=5000
CKPT_NAME="deeplabv3_mnv2_pascal_train_aug"
CKPT_PATH="${TRAIN_LOGDIR}/model.ckpt-${NUM_ITERATIONS}"
EXPORT_PATH="${EXPORT_DIR}/frozen_inference_graph.pb"

# === Train the network ===
python "${WORK_DIR}"/train.py \
  --logtostderr \
  --train_split="train" \
  --model_variant="mobilenet_v2" \
  --output_stride=16 \
  --train_crop_size=513 \
  --train_crop_size=513 \
  --train_batch_size=4 \
  --training_number_of_steps="${NUM_ITERATIONS}" \
  --dataset="testset" \
  --tf_initial_checkpoint="${INIT_FOLDER}/${CKPT_NAME}/model.ckpt-30000.index"\
  --train_logdir="${TRAIN_LOGDIR}" \
  --dataset_dir="${FINAL_DATASET}" \
  --initialize_last_layer=False \
  --last_layers_contain_logits_only=False \
  --fine_tune_batch_norm=False

# === Run evaluation ===
python "${WORK_DIR}"/eval.py \
  --logtostderr \
  --eval_split="val" \
  --model_variant="mobilenet_v2" \
  --eval_crop_size=513 \
  --eval_crop_size=513 \
  --dataset="testset" \
  --checkpoint_dir="${TRAIN_LOGDIR}" \
  --eval_logdir="${EVAL_LOGDIR}" \
  --dataset_dir="${FINAL_DATASET}" \
  --max_number_of_evaluations=1

# === Visualize the results ===
python "${WORK_DIR}"/vis.py \
  --logtostderr \
  --vis_split="val" \
  --model_variant="mobilenet_v2" \
  --vis_crop_size=513 \
  --vis_crop_size=513 \
  --dataset="testset" \
  --checkpoint_dir="${TRAIN_LOGDIR}" \
  --vis_logdir="${VIS_LOGDIR}" \
  --dataset_dir="${FINAL_DATASET}" \
  --max_number_of_iterations=1

# === Export the trained checkpoint ===
python "${WORK_DIR}"/export_model.py \
  --logtostderr \
  --checkpoint_path="${CKPT_PATH}" \
  --export_path="${EXPORT_PATH}" \
  --model_variant="mobilenet_v2" \
  --num_classes=3 \
  --crop_size=513 \
  --crop_size=513 \
  --inference_scales=1.0

Hello, I have a question.
1. I changed exclude_list to include only _LOGITS_SCOPE_NAME and also set the flags initialize_last_layer=False and last_layers_contain_logits_only=True, and I train with xception_65,
with min_resize_value=50 and max_resize_value=2000. The loss is high and never goes down, like this:
INFO:tensorflow:global step 570: loss = 8.3546 (0.559 sec/step)
INFO:tensorflow:global step 580: loss = 8.5703 (0.548 sec/step)
INFO:tensorflow:global step 590: loss = 8.9560 (0.543 sec/step)
INFO:tensorflow:global step 600: loss = 8.2486 (0.512 sec/step)
INFO:tensorflow:global step 610: loss = 8.1094 (0.508 sec/step)
INFO:tensorflow:global step 620: loss = 8.2317 (0.520 sec/step)
INFO:tensorflow:global step 630: loss = 7.9649 (0.511 sec/step)
INFO:tensorflow:global step 640: loss = 8.2240 (0.517 sec/step)
INFO:tensorflow:global step 650: loss = 7.9889 (0.517 sec/step)
INFO:tensorflow:global step 660: loss = 8.0038 (0.507 sec/step)
INFO:tensorflow:global step 670: loss = 8.0465 (0.529 sec/step)
Could anyone help me with it? The result is not good.

I want to train ade20K

I retrained deeplab with the ADE20K dataset in my Google Colab notebook; below are the results with MobileNet-v2 and Xception_65 as the initial checkpoint. However, I couldn't fine-tune because of an OOM error. Maybe others can share training parameters that give better results?

MobileNet-v2
ade20k-mobile-2000iter-2batch

Xception_65
ade20k-xception-2000iter-2batch

Could you tell me your training_number_of_steps and your other parameters, and whether you used xception65_ade20k_train to train on your images?
My result is too bad. Thank you! Have a good day!

@apolo74 Have you solved the problem? I ran into the same issue you mentioned above.

@BeSlower, yes, the solution works for me, but there is another problem: the result is all black with no other label, even though the loss decreases during training. Can anyone help me?

I have the same problem. Have you solved it?

Hi guys, sorry I've been disconnected from this thread... the black output is related to 2 very important settings:

  1. Assuming that you are re-training on your own data that, for example, has 2 classes... in my toy case I mentioned I created a dataset with circles and squares. Then I have 2 classes BUT the parameter called "--num-classes" should be 4 because: 2 (own classes) + 1 (background) + 1 (ignore_label)
  2. The pixel values in your "background" class are supposed to be 0, pixel values for your first class should be 1, for your second class should be 2 and so on... DON'T save your classes with other values like 100 or 224, you have to save your class images following the order from 1 to N
    Hope this helps
    /B
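
The second point above can be sketched as a small remapping step. The gray levels 0, 150 and 250 are the ones from the toy dataset described earlier in the thread; `remap_mask` is a hypothetical helper that works on a mask given as nested lists, and sending unknown values to ignore_label=255 is an assumption of this sketch, not something the thread prescribes:

```python
def remap_mask(mask, value_to_class, ignore_label=255):
    """Replace raw gray values with class IDs; unknown values become ignore_label."""
    return [[value_to_class.get(v, ignore_label) for v in row] for row in mask]


# Gray levels of the toy dataset: background 0, squares 150, circles 250.
value_to_class = {0: 0, 150: 1, 250: 2}

raw_mask = [
    [0, 0, 150],
    [250, 250, 0],
]
clean_mask = remap_mask(raw_mask, value_to_class)
print(clean_mask)
```

The remapped masks look almost black in an image viewer, which is expected: the values 1 and 2 are barely distinguishable from 0 on a 0 to 255 gray scale.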

@apolo74 how do I set The pixel values in "background" class = 0. I don't see this option in train.py, or segmentation_dataset.py file

When you are creating your training dataset you have to create the "annotations" in grayscale, like the example I shared before:

The image on the left is a normal input to the system, but the image on the right is manually created (Photoshop, Paint, any image-processing tool, or in my case a Python script that generated these test images). The main idea is that you put everything you are not interested in detecting as background with value 0 (all the black area inside the right image), and the different classes with values starting from 1.
The image on the right was created before I realized this very important point... that is why the rectangle has a light gray color and the circle an even lighter gray. When setting all pixels of the rectangle equal to 1 and all pixels of the circle equal to 2 they wouldn't show clearly in this example, but that's how they are supposed to be: background=0, rectangle=1 and circle=2... and ignore_label=255
I hope it's clear now

@apolo74 Hi, hope you could help me, as you said the class number should be 0, 1 , 2 and so on... where do I specify that? Thank you.

@apolo74 thank you for getting me one step closer. Please take a look at my image:

Screenshot 2019-04-11 at 11 05 47 AM

The image on right is segmentation mask, it has background=zero and object to be detected is a very dark shade of grey (I am not sure what is the value for this shade of gray)
The image on left is the original image.

Would you recommend converting the image on the right so that the dark shade of grey becomes white?

Or is there another way to accept the existing right-hand image by modifying my code?

Hola Carlos, the idea is that you CREATE your segmentation images with these values as pixels... for example let's say that I have a color image of 10x8 pixels, this means your input is going to be a 10x8x3 (where 3 represents color channels R, G, B). Let's say you want to detect squares and in this example there is a small square in the bottom right, then your segmentation mask will be a single 10x8 matrix with VALUES:
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 1 1 1 0
0 0 0 0 0 0 1 1 1 0
0 0 0 0 0 0 1 1 1 0
0 0 0 0 0 0 0 0 0 0
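
The mask above can also be generated programmatically. This is only a sketch: `make_mask` is a hypothetical helper, and the box coordinates are chosen to reproduce exactly the 10x8 example (10 columns, 8 rows, a 3x3 square of class 1):

```python
def make_mask(width, height, box, class_id=1):
    """Build a height x width mask of zeros with one filled rectangle.

    box is (left, top, right, bottom); right and bottom are exclusive.
    """
    left, top, right, bottom = box
    return [
        [class_id if left <= x < right and top <= y < bottom else 0
         for x in range(width)]
        for y in range(height)
    ]


mask = make_mask(width=10, height=8, box=(6, 4, 9, 7))
for row in mask:
    print(" ".join(str(v) for v in row))
```

Saving such a matrix as a single-channel PNG gives a label image in the format the posts above describe.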

Hi again @ajinkya933, I'm glad to hear you are making some progress... about your questions:

  1. no, don't convert your object mask to white because that means your class pixels will have a value of 255. From what I see they have a very low value but you need to be sure that it's 1 assuming that's the first class you want to detect.
  2. I used Photoshop to create my segmentation masks, but I'm sure you can do this in any other program: Matlab, Paint, even Excel.
    You could even create a very simple script in Python to open the image and print out the value of your pixels in specific positions so you'll be sure what values they have. Don't believe what your eyes see :)
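
A minimal sketch of such a script, assuming the mask has already been loaded into nested lists of integers (how you load it depends on your image library and is left out here); `pixel_value_counts` is a hypothetical helper and the sample mask is made up:

```python
from collections import Counter


def pixel_value_counts(mask):
    """Count how often each pixel value occurs in a 2-D mask."""
    return Counter(v for row in mask for v in row)


# Made-up mask whose object would look almost black on screen.
mask = [
    [0, 0, 0, 0],
    [0, 3, 3, 0],
    [0, 3, 3, 0],
]
counts = pixel_value_counts(mask)
for value, n in sorted(counts.items()):
    print(f"value {value}: {n} pixels")
```

In this made-up case the object pixels turn out to be 3 rather than 1, which is the kind of surprise the advice above is meant to catch.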

Thank you so much, I just trained DeepLab with satisfactory results so far :D!

@apolo74 Thanks I got the output now

Happy to hear that!

commented

@BeSlower

Just setting initialize_last_layer = False and last_layers_contain_logits_only = True works for me, if you want to train on your own dataset with a different number of classes.

Hi, I tried training on my own data (classes = 2 = 1 + background)
initialize_last_layer = False
last_layers_contain_logits_only = True
label = grayscale image (0, 1)
but the predicted mask I get in the test is a black mask.
Can anyone help me with this?

commented

@holyprince

@BeSlower , yes, the solution is work for me but there is another problem that the result is all black and no other label , but during the training process , the loss is decrease. Can anyone help me ?

Hi, I have the same problem as you: the predicted mask is a black image.
Did you fix it?

commented

@apolo74

Hi guys, sorry I've been disconnected from this thread... the black output is related to 2 very important settings:

Assuming that you are re-training on your own data that, for example, has 2 classes... in my toy case I mentioned I created a dataset with circles and squares. Then I have 2 classes BUT the parameter called "--num-classes" should be 4 because: 2 (own classes) + 1 (background) + 1 (ignore_label)
The pixel values in your "background" class are supposed to be 0, pixel values for your first class should be 1, for your second class should be 2 and so on... DON'T save your classes with other values like 100 or 224, you have to save your class images following the order from 1 to N
Hope this helps
/B

Hi, I tried training on my custom dataset
as you said:
classes = 3 = 1 (obj) + background + ignore_label
label = grayscale image (0, 1)
In my labels there are two pixel values: 0 for background and 1 for object,
so should I include the ignore_label in the class count?
What I get as output is a black mask.
Can you help me fix it?

Hey guys! Have you ever evaluated the provided ADE20K pretrained models on the val set? I have tested them, but both mobilenetv2_ade20k_train and xception65_ade20k_train are about 3%-4% lower than the reported performance.
Here is my evaluation script:
python eval.py \
  --logtostderr \
  --eval_split="val" \
  --model_variant="xception_65" \
  --atrous_rates=12 \
  --atrous_rates=24 \
  --atrous_rates=36 \
  --output_stride=8 \
  --decoder_output_stride=4 \
  --eval_crop_size=513 \
  --eval_crop_size=513 \
  --min_resize_value=513 \
  --max_resize_value=513 \
  --resize_factor=8 \
  --aspp_with_batch_norm=true \
  --aspp_with_separable_conv=true \
  --decoder_use_separable_conv=true \
  --dataset="ade20k" \
  --checkpoint_dir="datasets/ADE20K/deeplabv3_xception_ade20k_train" \
  --eval_logdir="datasets/ADE20K/exp/v3plus/eval_ori" \
  --dataset_dir="datasets/ADE20K/tfrecord" \
  --max_number_of_evaluations=1 \
  --eval_scales=0.5 \
  --eval_scales=0.75 \
  --eval_scales=1.0 \
  --eval_scales=1.25 \
  --eval_scales=1.5 \
  --eval_scales=1.75 \
  --add_flipped_images=true
By the way, the pretrained models for pascal and cityscapes work well. Could someone help me verify the performance or give me some advice?

commented

@apolo74

Hi guys, sorry I've been disconnected from this thread... the black output is related to 2 very important settings:
Assuming that you are re-training on your own data that, for example, has 2 classes... in my toy case I mentioned I created a dataset with circles and squares. Then I have 2 classes BUT the parameter called "--num-classes" should be 4 because: 2 (own classes) + 1 (background) + 1 (ignore_label)
The pixel values in your "background" class are supposed to be 0, pixel values for your first class should be 1, for your second class should be 2 and so on... DON'T save your classes with other values like 100 or 224, you have to save your class images following the order from 1 to N
Hope this helps
/B

Thanks to your descriptive comment I was able to train deeplab successfully on my custom dataset (14000 images).
After 20000 iterations I tested the trained model with Python code and it detects fine, but when I put the model in an iOS application (after converting it to a tflite model) it gives bad and wrong segmentation.
Do you have any idea about using a deeplab MobileNet trained model on mobile?

Hey guys,
does anyone know how one can freeze layers for training? Say I want to freeze the weights of the backbone and only train the rest. Is that possible?

I would really appreciate some help on this matter. Thanks in advance

@ma8tsch did you manage to freeze some layers eventually? If yes, can you please provide some details?

Hi guys, sorry I've been disconnected from this thread... the black output is related to 2 very important settings:

  1. Assuming that you are re-training on your own data that, for example, has 2 classes... in my toy case I mentioned I created a dataset with circles and squares. Then I have 2 classes BUT the parameter called "--num-classes" should be 4 because: 2 (own classes) + 1 (background) + 1 (ignore_label)
  2. The pixel values in your "background" class are supposed to be 0, pixel values for your first class should be 1, for your second class should be 2 and so on... DON'T save your classes with other values like 100 or 224, you have to save your class images following the order from 1 to N
    Hope this helps
    /B

Thank you for your help :) setting

--initialize_last_layer=False\
--last_layers_contain_logits_only=True

allowed me to no longer have all black masks. I am now getting color-spotted, but not acceptable, masks after only 100 steps.

To phrase what you said more clearly (for me at least), you are saying that images should be labeled with only values from 1...N where N is the number of classes, and 0 is reserved for background, and possibly even N+1 because of the ignore label (I am not utilizing this).

In other words, if you have 2 classes (circle and triangle), you will have 4 labels/indexes in your image.

  • index 0 = bg
  • index 1 = class1, say circle
  • index 2 = class2, say triangle
  • index 3 (which by default in the other datasets is 255 instead of 3) = IGNORE_LABEL

How can I confirm that this is the case for my dataset?
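
One way to confirm it is to collect every pixel value that is neither a class index nor the ignore label. A minimal sketch, assuming the label image is available as nested lists of integers, that class indices run from 0 to num_classes - 1, and that ignore_label is 255 as in the other datasets; `unexpected_label_values` is a hypothetical helper:

```python
def unexpected_label_values(mask, num_classes, ignore_label=255):
    """Return the sorted pixel values that are neither a class ID nor ignore_label."""
    allowed = set(range(num_classes)) | {ignore_label}
    found = {v for row in mask for v in row}
    return sorted(found - allowed)


# Two made-up label images: one clean, one still using raw gray levels.
good = [[0, 1, 2], [255, 0, 1]]
bad = [[0, 150, 250], [255, 0, 0]]
print(unexpected_label_values(good, num_classes=3))
print(unexpected_label_values(bad, num_classes=3))
```

An empty result for every label image in the dataset means the indexing matches the scheme listed above; anything else names the values that still need remapping.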

I'll report back tomorrow after 10,000 steps to confirm.

How did y'all color index your images? It seems that my images ARE color indexed as @apolo74 specified.

Here is what my model got after 10000 steps:
non-seg
seg

This is what a color indexed image looks like in my dataset (not from same picture as above):
000017

Any possible help?

Hi, I am trying to run deeplab on my own dataset, but I get an error when running train.py. It is related to the number of classes: I have 5, but apparently the program expects 21, the number of classes in the VOC dataset:
Assign requires shapes of both tensors to match. lhs shape= [5] rhs shape= [21]

@aquariusjay
Hi there, about the class imbalance problem: the new version of deeplab's train_utils.py seems to have changed the code, so maybe I can't add variables like label1_weight to fix the class imbalance problem.
Could you give me some advice on how to edit the code to add class weights?
Thank you very much.

When you want to fine-tune DeepLab on other datasets, there are a few cases:

  1. You want to re-use ALL the trained weights: set initialize_last_layer = True (last_layers_contain_logits_only does not matter in this case).
  2. You want to re-use ONLY the network backbone (i.e., exclude ASPP, decoder and so on): set initialize_last_layer = False and last_layers_contain_logits_only = False.
  3. You want to re-use ALL the trained weights EXCEPT the logits (since the num_classes may be different): set initialize_last_layer = False and last_layers_contain_logits_only = True.
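
The three cases above can be written down as a small decision function. This is only an illustration of the flag combinations as described in the list; `reuse_plan` is a hypothetical helper and not the code DeepLab uses to build its restore list:

```python
def reuse_plan(initialize_last_layer, last_layers_contain_logits_only):
    """Describe which pretrained weights are reused for a flag combination."""
    if initialize_last_layer:
        # Case 1: the second flag does not matter.
        return "all weights"
    if last_layers_contain_logits_only:
        # Case 3: only the logits layer is re-initialized.
        return "all weights except logits"
    # Case 2: ASPP, decoder and logits are re-initialized.
    return "backbone only"


for flags in [(True, True), (True, False), (False, False), (False, True)]:
    print(flags, "->", reuse_plan(*flags))
```

For a dataset whose num_classes differs from the checkpoint's, as in the [5] versus [21] shape error quoted earlier, case 1 is the combination that cannot work, because the logits shapes do not match.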

Hi, My loss does not change. It has become stagnant. I have tried everything mentioned related to deeplabv3+ on every blog.
I am training to detect roads. My images are of 2000x2000.
My training data has 45k images.
I have created my image in the form of PASCAL VOC. I have three kinds of pixels.
background = [0,0,0]
Void class = [255,255,255]
road = [1,1,1]
so the number of classes = 3
I am using PASCAL VOC pre trained weights.

changes in train_utils.py are:
1.
ignore_weight = 0
label0_weight = 10
label1_weight = 15
not_ignore_mask = (
    tf.to_float(tf.equal(scaled_labels, 1)) * label0_weight
    + tf.to_float(tf.equal(scaled_labels, 2)) * label1_weight
    + tf.to_float(tf.equal(scaled_labels, ignore_label)) * ignore_weight)

# Variables that will not be restored.
exclude_list = ['global_step','logits']
if not initialize_last_layer:
  exclude_list.extend(last_layers)

my train.py

nohup python deeplab/train.py \
  --logtostderr \
  --training_number_of_steps=65000 \
  --train_split="train" \
  --model_variant="xception_65" \
  --atrous_rates=6 \
  --atrous_rates=12 \
  --atrous_rates=18 \
  --output_stride=16 \
  --decoder_output_stride=4 \
  --train_batch_size=2 \
  --initialize_last_layer=False \
  --last_layers_contain_logits_only=True \
  --dataset="pascal_voc_seg" \
  --tf_initial_checkpoint="/data/old_model/models/research/deeplabv3_pascal_trainval/model.ckpt" \
  --train_logdir="/data/old_model/models/research/deeplab/mycheckpoints" \
  --dataset_dir="/data/models/research/deeplab/datasets/tfrecord" > my_output.log &

Please help 👍
INFO:tensorflow:global step 700: loss = 0.1759 (0.449 sec/step)
INFO:tensorflow:global step 710: loss = 0.1695 (0.655 sec/step)
INFO:tensorflow:global step 720: loss = 0.1742 (0.689 sec/step)
INFO:tensorflow:global step 730: loss = 0.1710 (0.505 sec/step)
INFO:tensorflow:global step 740: loss = 0.1708 (0.868 sec/step)
INFO:tensorflow:global step 750: loss = 0.1683 (0.632 sec/step)
INFO:tensorflow:global step 760: loss = 0.1692 (0.442 sec/step)
INFO:tensorflow:global step 770: loss = 0.1693 (0.597 sec/step)
INFO:tensorflow:global step 780: loss = 0.1665 (0.441 sec/step)
INFO:tensorflow:global step 790: loss = 0.1680 (0.548 sec/step)
INFO:tensorflow:global step 800: loss = 0.1708 (0.372 sec/step)
INFO:tensorflow:global step 810: loss = 0.1674 (0.327 sec/step)
INFO:tensorflow:global step 820: loss = 0.1666 (0.951 sec/step)
INFO:tensorflow:global step 830: loss = 0.1651 (0.557 sec/step)
INFO:tensorflow:global step 840: loss = 0.1663 (0.506 sec/step)
INFO:tensorflow:global step 850: loss = 0.1646 (0.446 sec/step)
INFO:tensorflow:global step 860: loss = 0.1666 (0.424 sec/step)
INFO:tensorflow:global step 870: loss = 0.1654 (0.520 sec/step)
INFO:tensorflow:global step 880: loss = 0.1662 (0.675 sec/step)
INFO:tensorflow:global step 890: loss = 0.1673 (0.325 sec/step)
INFO:tensorflow:global step 900: loss = 0.1633 (0.548 sec/step)
INFO:tensorflow:global step 910: loss = 0.1659 (0.374 sec/step)
INFO:tensorflow:global step 920: loss = 0.1639 (0.663 sec/step)
INFO:tensorflow:global step 930: loss = 0.1658 (0.442 sec/step)
INFO:tensorflow:global step 940: loss = 0.1654 (0.568 sec/step)
.
.
.
INFO:tensorflow:global step 17850: loss = 0.1416 (0.555 sec/step)
INFO:tensorflow:global step 17860: loss = 0.1417 (0.684 sec/step)
INFO:tensorflow:global step 17870: loss = 0.1415 (0.572 sec/step)
INFO:tensorflow:global step 17880: loss = 0.1417 (0.569 sec/step)
INFO:tensorflow:global step 17890: loss = 0.1415 (0.535 sec/step)
INFO:tensorflow:global step 17900: loss = 0.1415 (0.541 sec/step)
INFO:tensorflow:global step 17910: loss = 0.1419 (0.459 sec/step)
INFO:tensorflow:global step 17920: loss = 0.1415 (0.800 sec/step)
INFO:tensorflow:global step 17930: loss = 0.1417 (0.647 sec/step)
INFO:tensorflow:global step 17940: loss = 0.1416 (0.509 sec/step)
INFO:tensorflow:global step 17950: loss = 0.1416 (0.755 sec/step)
INFO:tensorflow:global step 17960: loss = 0.1417 (0.495 sec/step)
INFO:tensorflow:global step 17970: loss = 0.1419 (0.556 sec/step)
INFO:tensorflow:global step 17980: loss = 0.1417 (0.492 sec/step)
INFO:tensorflow:global step 17990: loss = 0.1416 (0.878 sec/step)
INFO:tensorflow:global step 18000: loss = 0.1415 (0.803 sec/step)
INFO:tensorflow:global step 18010: loss = 0.1418 (0.695 sec/step)
INFO:tensorflow:global step 18020: loss = 0.1418 (0.449 sec/step)
INFO:tensorflow:global step 18030: loss = 0.1415 (0.678 sec/step)
INFO:tensorflow:global step 18040: loss = 0.1418 (0.449 sec/step)
INFO:tensorflow:global step 18050: loss = 0.1415 (0.681 sec/step)
INFO:tensorflow:global step 18060: loss = 0.1415 (0.866 sec/step)
INFO:tensorflow:global step 18070: loss = 0.1417 (0.534 sec/step)
INFO:tensorflow:global step 18080: loss = 0.1415 (0.939 sec/step)
INFO:tensorflow:global step 18090: loss = 0.1416 (0.349 sec/step)
INFO:tensorflow:global step 18100: loss = 0.1416 (0.576 sec/step)
INFO:tensorflow:global step 18110: loss = 0.1416 (0.626 sec/step)
INFO:tensorflow:global step 18120: loss = 0.1418 (0.951 sec/step)
INFO:tensorflow:global step 18130: loss = 0.1417 (0.386 sec/step)
INFO:tensorflow:global step 18140: loss = 0.1417 (0.375 sec/step)
@aquariusjay

I do not have access to your dataset, and tuning the hyper-parameters usually takes experimental experience.

@aquariusjay Hi, may I ask how we can analyze our dataset to determine these values?
ignore_weight
label0_weight
label1_weight
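
One common heuristic for picking such weights (not something confirmed in this thread) is to count how many pixels each class occupies in the training masks and weight each class by its inverse frequency. Below is a minimal NumPy sketch under that assumption; the function names `class_pixel_counts` and `inverse_frequency_weights` are hypothetical, and it assumes the masks are integer arrays with 255 as the ignore label, as in the dataset descriptor above.

```python
import numpy as np


def class_pixel_counts(label_maps, num_classes, ignore_label=255):
    """Count pixels per class over an iterable of integer label masks.

    Hypothetical helper: pixels equal to ignore_label are skipped.
    """
    counts = np.zeros(num_classes, dtype=np.int64)
    for label in label_maps:
        label = np.asarray(label)
        valid = label[label != ignore_label].astype(np.int64)
        # minlength pads missing classes; the slice drops unexpected ids.
        counts += np.bincount(valid, minlength=num_classes)[:num_classes]
    return counts


def inverse_frequency_weights(counts):
    """Turn pixel counts into weights proportional to 1 / frequency.

    Weights are rescaled so the most frequent class gets weight 1.0.
    Classes that never occur keep weight 0.0 (an assumption of this sketch).
    """
    counts = np.asarray(counts, dtype=np.float64)
    freq = counts / counts.sum()
    weights = np.zeros_like(freq)
    present = freq > 0
    weights[present] = 1.0 / freq[present]
    weights[present] /= weights[present].min()
    return weights


if __name__ == "__main__":
    # Toy 2-class mask: five pixels of class 0, two of class 1, one ignored.
    masks = [np.array([[0, 0, 0, 1], [0, 0, 255, 1]])]
    counts = class_pixel_counts(masks, num_classes=2)
    print(counts, inverse_frequency_weights(counts))
```

With this scheme the ignore weight stays at 0, and the label0/label1 weights come from the computed array. These values are only a starting point; they would still need to be tuned experimentally.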

@PallawiSinghal Did you solve it? I also want to change the loss_weight.

@jinyuan30 Did you solve it? I also want to change the loss_weight.

@aquariusjay Hi there, regarding the class imbalance problem: the new version of train_utils.py in deeplab seems to have changed the code, so I can no longer add variables like label1_weight = to fix the class imbalance.
Could you give me some advice on how to edit the code to add class weights?
Thank you very much.

Hello, it seems that I have run into the same problem. Have you solved it yet?

@LightingX Hi, friend! Have you figured out how to adjust the loss weights in the new version of train_utils.py?
I tried changing label_weights from None to a Python list in common.py, but I got a ValueError: Subscripts with ellipses are not yet supported

Did you solve it? I have the same problem now :/.

@Alive1024 @claudiourbina Hey guys, in the latest version it seems we can adjust the weights through a parameter. When training, add the label_weights parameter to the training command. For example, if I have 2 classes whose weights are 0.01 and 1, I can add this to the train params:

--label_weights={0.01,1.0}
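
To make concrete what per-class weights like these do to the loss, here is a small NumPy sketch of a class-weighted softmax cross-entropy. It is an illustration only, not deeplab's actual code: the function name `weighted_cross_entropy` is hypothetical, and normalizing by the sum of weights is an assumption of this sketch.

```python
import numpy as np


def weighted_cross_entropy(logits, labels, label_weights, ignore_label=255):
    """Class-weighted softmax cross-entropy over flattened pixels.

    logits: (num_pixels, num_classes) float array.
    labels: (num_pixels,) integer array; ignore_label pixels get weight 0.
    label_weights: one weight per class, indexed by class id.
    """
    logits = np.asarray(logits, dtype=np.float64)
    labels = np.asarray(labels)
    label_weights = np.asarray(label_weights, dtype=np.float64)

    valid = labels != ignore_label
    # Replace ignored labels with class 0 so they can be used as indices;
    # their weight is zeroed below, so they do not affect the result.
    safe_labels = np.where(valid, labels, 0)

    # Numerically stable log-softmax.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

    per_pixel = -log_probs[np.arange(len(labels)), safe_labels]
    weights = np.where(valid, label_weights[safe_labels], 0.0)

    total = weights.sum()
    # Assumption of this sketch: normalize by the total weight.
    return float((weights * per_pixel).sum() / total) if total > 0 else 0.0


if __name__ == "__main__":
    # Both pixels are predicted as class 0; the second one is really class 1.
    logits = [[2.0, 0.0], [2.0, 0.0]]
    labels = [0, 1]
    print(weighted_cross_entropy(logits, labels, [1.0, 1.0]))
    print(weighted_cross_entropy(logits, labels, [0.01, 1.0]))
```

With weights 0.01 and 1.0, the mistake on the class 1 pixel dominates the loss, which is the intended effect when class 0 is an over-represented background. Also note that in bash the braces in the flag above are expanded by the shell, so the script receives `--label_weights=0.01 --label_weights=1.0`; a shell without brace expansion would pass the braces through literally.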

@essalahsouad Hi! Did you solve the problem with the black images? It is still an issue for me.