mobilenet V2 train fail
fanweiya opened this issue · comments
fanweiya commented
i use mobilenet V2 backbone, but train fail
[-] Importing tensorflow...
2021-01-14 13:49:10.317068: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
[+] Done! Tensorflow version: 2.5.0-dev20201230
[-] Importing Deeplabv3plus Trainer class...
[-] Importing config files...
2021-01-14 13:49:11.537581: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-01-14 13:49:11.591072: E tensorflow/stream_executor/cuda/cuda_driver.cc:328] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2021-01-14 13:49:11.591101: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (alit-PowerEdge-T640): /proc/driver/nvidia/version does not exist
2021-01-14 13:49:11.591383: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
WARNING:tensorflow:Some requested devices in `tf.distribute.Strategy` are not visible to TensorFlow: /job:localhost/replica:0/task:0/device:GPU:0,/job:localhost/replica:0/task:0/device:GPU:1
WARNING:tensorflow:Some requested devices in `tf.distribute.Strategy` are not visible to TensorFlow: /job:localhost/replica:0/task:0/device:GPU:0,/job:localhost/replica:0/task:0/device:GPU:1
Train Images are good to go
[+] Data points in train dataset: 6400
Train Dataset: <PrefetchDataset shapes: ((16, 512, 512, 3), (16, 512, 512, 1)), types: (tf.float32, tf.float32)>
Train Images are good to go
Data points in train dataset: 1600
Val Dataset: <PrefetchDataset shapes: ((16, 512, 512, 3), (16, 512, 512, 1)), types: (tf.float32, tf.float32)>
2021-01-14 13:49:12.045387: I tensorflow/core/profiler/lib/profiler_session.cc:126] Profiler session initializing.
2021-01-14 13:49:12.045414: I tensorflow/core/profiler/lib/profiler_session.cc:141] Profiler session started.
2021-01-14 13:49:12.100790: I tensorflow/core/profiler/lib/profiler_session.cc:158] Profiler session tear down.
2021-01-14 13:49:12.268507: W tensorflow/core/grappler/optimizers/data/auto_shard.cc:656] In AUTO-mode, and switching to DATA-based sharding, instead of FILE-based sharding as we cannot find appropriate reader dataset op(s) to shard. Error: Found an unshardable source dataset: name: "TensorSliceDataset/_2"
op: "TensorSliceDataset"
input: "Placeholder/_0"
input: "Placeholder/_1"
attr {
key: "Toutput_types"
value {
list {
type: DT_STRING
type: DT_STRING
}
}
}
attr {
key: "output_shapes"
value {
list {
shape {
}
shape {
}
}
}
}
2021-01-14 13:49:12.362496: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:127] None of the MLIR optimization passes are enabled (registered 2)
2021-01-14 13:49:12.367114: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2300000000 Hz
Epoch 1/100
WARNING:tensorflow:`input_shape` is undefined or non-square, or `rows` is not in [96, 128, 160, 192, 224]. Weights for input shape (224, 224) will be loaded as the default.
WARNING:tensorflow:`input_shape` is undefined or non-square, or `rows` is not in [96, 128, 160, 192, 224]. Weights for input shape (224, 224) will be loaded as the default.
Traceback (most recent call last):
File "trainer.py", line 47, in <module>
HISTORY = TRAINER.train()
File "/data/deeplab/DeepLabV3-Plus/deeplabv3plus/train.py", line 191, in train
epochs=self.config['epochs'], callbacks=callbacks
File "/data/Anaconda3/envs/tf2/lib/python3.7/site-packages/wandb/integration/keras/keras.py", line 119, in new_v2
return old_v2(*args, **kwargs)
File "/data/Anaconda3/envs/tf2/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py", line 1135, in fit
tmp_logs = self.train_function(iterator)
File "/data/Anaconda3/envs/tf2/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 797, in __call__
result = self._call(*args, **kwds)
File "/data/Anaconda3/envs/tf2/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 841, in _call
self._initialize(args, kwds, add_initializers_to=initializers)
File "/data/Anaconda3/envs/tf2/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 695, in _initialize
*args, **kwds))
File "/data/Anaconda3/envs/tf2/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 2998, in _get_concrete_function_internal_garbage_collected
graph_function, _ = self._maybe_define_function(args, kwargs)
File "/data/Anaconda3/envs/tf2/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 3390, in _maybe_define_function
graph_function = self._create_graph_function(args, kwargs)
File "/data/Anaconda3/envs/tf2/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 3235, in _create_graph_function
capture_by_value=self._capture_by_value),
File "/data/Anaconda3/envs/tf2/lib/python3.7/site-packages/tensorflow/python/framework/func_graph.py", line 998, in func_graph_from_py_func
func_outputs = python_func(*func_args, **func_kwargs)
File "/data/Anaconda3/envs/tf2/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 603, in wrapped_fn
out = weak_wrapped_fn().__wrapped__(*args, **kwds)
File "/data/Anaconda3/envs/tf2/lib/python3.7/site-packages/tensorflow/python/framework/func_graph.py", line 985, in wrapper
raise e.ag_error_metadata.to_exception(e)
ValueError: in user code:
/data/Anaconda3/envs/tf2/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py:840 train_function *
return step_function(self, iterator)
/data/deeplab/DeepLabV3-Plus/deeplabv3plus/model/deeplabv3_plus.py:104 call *
tensor = tf.keras.layers.Concatenate(axis=-1)([input_a, input_b])
/data/Anaconda3/envs/tf2/lib/python3.7/site-packages/tensorflow/python/keras/engine/base_layer.py:1015 __call__ **
self._maybe_build(inputs)
/data/Anaconda3/envs/tf2/lib/python3.7/site-packages/tensorflow/python/keras/engine/base_layer.py:2709 _maybe_build
self.build(input_shapes) # pylint:disable=not-callable
/data/Anaconda3/envs/tf2/lib/python3.7/site-packages/tensorflow/python/keras/utils/tf_utils.py:273 wrapper
output_shape = fn(instance, input_shape)
/data/Anaconda3/envs/tf2/lib/python3.7/site-packages/tensorflow/python/keras/layers/merge.py:519 build
raise ValueError(err_msg)
ValueError: A `Concatenate` layer requires inputs with matching shapes except for the concat axis. Got inputs shapes: [(8, 128, 128, 256), (8, 64, 64, 48)]
jeremy-cv commented
Hi, thanks for this implementation.
I'm having the same issue with Mobilenetv2 backbone model.
@fanweiya did you solve this ?
thanks
jeremy-cv commented
It seems that training runs with factor 8 in ._get_upsample_layer_fn(input_shape, factor=8)
Diogo Silva commented
+1 Same error here
shivaraj commented
Same error
ValueError: A Concatenate
layer requires inputs with matching shapes except for the concat axis. Got inputs shapes: [(1, 128, 128, 256), (1, 64, 64, 48)]
Mateusz Koza commented
Has anyone managed to fix this?