Training loss does not change in the first 700 iterations

Question

DanielNehemiah opened this issue 4 years ago

Hey,

I have started training the network using the code in this repo.
Over the first 700 iterations, the training accuracy has not gone above 0.65 and mostly stays between 0.45 and 0.52.
Is this normal? The loss is also barely changing; it hovers around 5.1.

Thanks for this code!

Tiago Freitas · Answer 1 · Mon Feb 10 2020 20:08:28 GMT+0800 (China Standard Time)

Hi!

Let it run more iterations.

In the original code, validation is run every 1000 iterations. See whether the loss is still constant after 3000 to 5000 iterations.

Let me know how it goes.

Asish Chakrapani · Answer 2 · Fri Feb 28 2020 01:54:25 GMT+0800 (China Standard Time)

@DanielNehemiah There's a stopping condition in the training function, so just let it run. Training keeps going as long as the validation error improves between two evaluations (i.e. 1000 iterations apart); once it stops improving, training ends on its own.
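
To make the stopping condition concrete, here is a minimal sketch of that kind of loop. It is not the repo's actual training function: `train_step`, `evaluate`, `evaluate_each`, and `patience` are hypothetical names chosen for illustration, and the real code may use a different criterion or different defaults.

```python
def train_with_early_stopping(train_step, evaluate, evaluate_each=1000,
                              max_iterations=1000000, patience=10):
    """Run train_step() repeatedly, validating every `evaluate_each` iterations.

    Stops once `patience` consecutive validations fail to beat the best
    validation accuracy seen so far. All parameter names are assumptions.
    """
    best_accuracy = 0.0
    best_iteration = 0
    checks_without_improvement = 0
    iteration = 0

    for iteration in range(1, max_iterations + 1):
        train_step()

        # Only validate on every `evaluate_each`-th iteration.
        if iteration % evaluate_each != 0:
            continue

        accuracy = evaluate()
        if accuracy > best_accuracy:
            best_accuracy = accuracy
            best_iteration = iteration
            checks_without_improvement = 0
        else:
            checks_without_improvement += 1
            if checks_without_improvement >= patience:
                break  # validation stopped improving, so end training

    return best_accuracy, best_iteration, iteration


# Toy usage: stand-in callables instead of a real model.
scores = iter([0.5, 0.6, 0.6, 0.6, 0.9])
print(train_with_early_stopping(lambda: None, lambda: next(scores),
                                evaluate_each=10, max_iterations=1000,
                                patience=2))
```

With a loop like this, a flat training loss during the first few hundred iterations triggers nothing by itself; only the periodic validation results decide when training stops.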

JacksonLaw577 · Answer 3 · Sat Nov 21 2020 11:47:57 GMT+0800 (China Standard Time)

I've tried several times, and it always stops after 10000 iterations, which means the model fails to converge.

Tiago Freitas · Answer 4 · Mon Nov 23 2020 21:22:18 GMT+0800 (China Standard Time)

@JacksonLaw577 Could you give more details?

JacksonLaw577 · Answer 5 · Tue Nov 24 2020 14:33:54 GMT+0800 (China Standard Time)

When I ran your original code, this error occurred, so I changed self.lr to self.lr0. I don't know whether that is the cause.

Tiago Freitas · Answer 6 · Thu Nov 26 2020 23:27:03 GMT+0800 (China Standard Time)

@JacksonLaw577 Are you running the exact same dataset? I developed this on an older version of Keras and Tensorflow. Haven't been able to retry on the newer versions.

dawei-hao · Answer 7 · Sat Feb 20 2021 17:58:10 GMT+0800 (China Standard Time)

/home/user/anaconda3/envs/tensorflow-gpu/bin/python3.6 "/home/user/ShadowCreative/Siamese network/Siamese-Networks-for-One-Shot-Learning/train_siamese_network.py"
2021-02-20 17:56:39.594927: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
Using TensorFlow backend.
2021-02-20 17:56:40.525094: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-02-20 17:56:40.525605: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-02-20 17:56:40.553530: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-02-20 17:56:40.554140: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce GTX 1650 computeCapability: 7.5
coreClock: 1.71GHz coreCount: 14 deviceMemorySize: 3.82GiB deviceMemoryBandwidth: 178.84GiB/s
2021-02-20 17:56:40.554175: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-02-20 17:56:40.558338: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2021-02-20 17:56:40.558412: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2021-02-20 17:56:40.559689: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-02-20 17:56:40.559968: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-02-20 17:56:40.560085: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcusolver.so.10'; dlerror: libcusolver.so.10: cannot open shared object file: No such file or directory
2021-02-20 17:56:40.560968: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2021-02-20 17:56:40.561058: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudnn.so.8'; dlerror: libcudnn.so.8: cannot open shared object file: No such file or directory
2021-02-20 17:56:40.561073: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1757] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
2021-02-20 17:56:40.561376: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-02-20 17:56:40.562933: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-02-20 17:56:40.562966: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-02-20 17:56:40.562976: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267]
2021-02-20 17:56:40.606270: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 150994944 exceeds 10% of free system memory.
2021-02-20 17:56:40.623839: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 150994944 exceeds 10% of free system memory.
2021-02-20 17:56:40.641658: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 150994944 exceeds 10% of free system memory.
['Alphabet_of_the_Magi', 'Tifinagh', 'Gujarati', 'Syriac_(Estrangelo)', 'Futurama', 'Early_Aramaic', 'Latin', 'Japanese_(hiragana)', 'Grantha', 'Sanskrit', 'Greek', 'Burmese_(Myanmar)', 'Mkhedruli_(Georgian)', 'Asomtavruli_(Georgian)', 'Anglo-Saxon_Futhorc', 'Arcadian', 'Balinese', 'Japanese_(katakana)', 'Blackfoot_(Canadian_Aboriginal_Syllabics)', 'Tagalog', 'Armenian', 'Inuktitut_(Canadian_Aboriginal_Syllabics)', 'Korean', 'Bengali', 'Ojibwe_(Canadian_Aboriginal_Syllabics)', 'Cyrillic', 'Braille', 'N_Ko', 'Hebrew', 'Malay_(Jawi_-_Arabic)']
30
Traceback (most recent call last):
File "/home/user/ShadowCreative/Siamese network/Siamese-Networks-for-One-Shot-Learning/train_siamese_network.py", line 58, in <module>
main()
File "/home/user/ShadowCreative/Siamese network/Siamese-Networks-for-One-Shot-Learning/train_siamese_network.py", line 46, in main
model_name='siamese_net_lr10e-4')
File "/home/user/ShadowCreative/Siamese network/Siamese-Networks-for-One-Shot-Learning/siamese_network.py", line 235, in train_siamese_network
train_loss, train_accuracy = self.model.train_on_batch(images, labels)
File "/home/user/anaconda3/envs/tensorflow-gpu/lib/python3.6/site-packages/tensorflow/python/keras/engine/training.py", line 1727, in train_on_batch
logs = self.train_function(iterator)
File "/home/user/anaconda3/envs/tensorflow-gpu/lib/python3.6/site-packages/tensorflow/python/eager/def_function.py", line 828, in __call__
result = self._call(*args, **kwds)
File "/home/user/anaconda3/envs/tensorflow-gpu/lib/python3.6/site-packages/tensorflow/python/eager/def_function.py", line 871, in _call
self._initialize(args, kwds, add_initializers_to=initializers)
File "/home/user/anaconda3/envs/tensorflow-gpu/lib/python3.6/site-packages/tensorflow/python/eager/def_function.py", line 726, in _initialize
*args, **kwds))
File "/home/user/anaconda3/envs/tensorflow-gpu/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 2969, in _get_concrete_function_internal_garbage_collected
graph_function, _ = self._maybe_define_function(args, kwargs)
File "/home/user/anaconda3/envs/tensorflow-gpu/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 3361, in _maybe_define_function
graph_function = self._create_graph_function(args, kwargs)
File "/home/user/anaconda3/envs/tensorflow-gpu/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 3206, in _create_graph_function
capture_by_value=self._capture_by_value),
File "/home/user/anaconda3/envs/tensorflow-gpu/lib/python3.6/site-packages/tensorflow/python/framework/func_graph.py", line 990, in func_graph_from_py_func
func_outputs = python_func(*func_args, **func_kwargs)
File "/home/user/anaconda3/envs/tensorflow-gpu/lib/python3.6/site-packages/tensorflow/python/eager/def_function.py", line 634, in wrapped_fn
out = weak_wrapped_fn().wrapped(*args, **kwds)
File "/home/user/anaconda3/envs/tensorflow-gpu/lib/python3.6/site-packages/tensorflow/python/framework/func_graph.py", line 977, in wrapper
raise e.ag_error_metadata.to_exception(e)
TypeError: in user code:

/home/user/anaconda3/envs/tensorflow-gpu/lib/python3.6/site-packages/tensorflow/python/keras/engine/training.py:805 train_function  *
    return step_function(self, iterator)
/home/user/anaconda3/envs/tensorflow-gpu/lib/python3.6/site-packages/tensorflow/python/keras/engine/training.py:795 step_function  **
    outputs = model.distribute_strategy.run(run_step, args=(data,))
/home/user/anaconda3/envs/tensorflow-gpu/lib/python3.6/site-packages/tensorflow/python/distribute/distribute_lib.py:1259 run
    return self._extended.call_for_each_replica(fn, args=args, kwargs=kwargs)
/home/user/anaconda3/envs/tensorflow-gpu/lib/python3.6/site-packages/tensorflow/python/distribute/distribute_lib.py:2730 call_for_each_replica
    return self._call_for_each_replica(fn, args, kwargs)
/home/user/anaconda3/envs/tensorflow-gpu/lib/python3.6/site-packages/tensorflow/python/distribute/distribute_lib.py:3417 _call_for_each_replica
    return fn(*args, **kwargs)
/home/user/anaconda3/envs/tensorflow-gpu/lib/python3.6/site-packages/tensorflow/python/keras/engine/training.py:788 run_step  **
    outputs = model.train_step(data)
/home/user/anaconda3/envs/tensorflow-gpu/lib/python3.6/site-packages/tensorflow/python/keras/engine/training.py:757 train_step
    self.optimizer.minimize(loss, self.trainable_variables, tape=tape)
/home/user/anaconda3/envs/tensorflow-gpu/lib/python3.6/site-packages/tensorflow/python/keras/optimizer_v2/optimizer_v2.py:497 minimize
    loss, var_list=var_list, grad_loss=grad_loss, tape=tape)
/home/user/anaconda3/envs/tensorflow-gpu/lib/python3.6/site-packages/tensorflow/python/keras/optimizer_v2/optimizer_v2.py:547 _compute_gradients
    with ops.name_scope_v2(self._name + "/gradients"):

TypeError: unsupported operand type(s) for +: 'Modified_SGD' and 'str'

Process finished with exit code 1
Hi, I ran into this problem. Could you help me take a look?
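
For anyone reading the traceback: the final message says the left operand of `+` in `self._name + "/gradients"` is a `Modified_SGD` object, so the optimizer's `_name` attribute holds the optimizer instance instead of a string. The sketch below reproduces the same kind of message in plain Python. `BaseOptimizer`, `BrokenSGD`, and `FixedSGD` are hypothetical stand-ins, not the Keras or repo classes, and how the instance actually ends up in `_name` in the repo is an assumption here (see #14 for the real answer).

```python
class BaseOptimizer:
    """Hypothetical stand-in for a base class that expects a string name."""

    def __init__(self, name):
        self._name = name

    def gradient_scope(self):
        # Mirrors the failing expression from the traceback.
        return self._name + "/gradients"


class BrokenSGD(BaseOptimizer):
    def __init__(self, lr=0.01):
        # Illustrative bug: the instance lands where a string name is expected.
        super().__init__(self)
        self.lr = lr


class FixedSGD(BaseOptimizer):
    def __init__(self, lr=0.01, name="FixedSGD"):
        # The base class receives a real string as its name.
        super().__init__(name)
        self.lr = lr


def try_scope(optimizer):
    """Return the scope string, or the TypeError message if building it fails."""
    try:
        return optimizer.gradient_scope()
    except TypeError as error:
        return str(error)


print(try_scope(BrokenSGD()))
print(try_scope(FixedSGD()))
```

If this is what is happening, the thing to check is how the custom optimizer's constructor calls its parent constructor under the installed Keras/TensorFlow version.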

Tiago Freitas · Answer 8 · Sun Feb 21 2021 05:49:22 GMT+0800 (China Standard Time)

Answered in #14