Race condition during EvalCallback

Question

shinstra opened this issue 10 months ago

Christopher J. Bateman commented 10 months ago

Bit of a weird one that I'm hoping someone has encountered before and can give some direction on. I'm getting errors during EvalCallback.on_epoch_end that vary between runs (see examples below) and seem to relate to data inconsistencies. If I step through the code in debug mode there is no problem, and it seems to work fine in the following epochs. If I add time.sleep(1) at the start of the callback execution, no errors are thrown.

My best guess is that some data used by the callback has not been fully initialized when the first call to EvalCallback.on_epoch_end is made. However, I'm not sure whether this is caused by something at the underlying TensorFlow/Keras level or arises at the tfsim level.
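For anyone wanting to reproduce the time.sleep(1) workaround without editing the installed package, here is a minimal sketch of one way to do it. The mixin name DelayedEpochEndMixin and the stand-in _RecordingCallback are hypothetical and exist only so the snippet runs without TensorFlow; this only masks the symptom and says nothing about the root cause.

```python
import time


class DelayedEpochEndMixin:
    """Hypothetical mixin: pause before delegating to the next on_epoch_end."""

    # Assumption: 1 second, the delay that made the errors disappear above.
    delay_seconds = 1.0

    def on_epoch_end(self, epoch, logs=None):
        time.sleep(self.delay_seconds)
        # Delegate to the next class in the MRO (the real callback).
        return super().on_epoch_end(epoch, logs)


class _RecordingCallback:
    """Stand-in for a real callback so this sketch runs on its own."""

    def __init__(self):
        self.calls = []

    def on_epoch_end(self, epoch, logs=None):
        self.calls.append((epoch, logs))


class DelayedRecordingCallback(DelayedEpochEndMixin, _RecordingCallback):
    delay_seconds = 0.05  # short delay, just for the demo


demo = DelayedRecordingCallback()
start = time.monotonic()
demo.on_epoch_end(0, {"loss": 1.0})
elapsed = time.monotonic() - start
```

With tensorflow_similarity installed, the same mixin could presumably be combined with the real callback, e.g. class DelayedEvalCallback(DelayedEpochEndMixin, EvalCallback): pass, and constructed with the same arguments as EvalCallback.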

Error Examples

Epoch 1/800
62/62 [==============================] - ETA: 0s - loss: 332.3957 - proj_std: 0.0441Traceback (most recent call last):
  File "C:\Users\chris\Documents\XXXXX\Projects\PythonScratch\tfsim_contrastive_model\train_synthetic.py", line 194, in <module>
    main()
  File "C:\Users\chris\Documents\XXXXX\Projects\PythonScratch\tfsim_contrastive_model\train_synthetic.py", line 178, in main
    history = contrastive_model.fit(
  File "C:\Users\chris\anaconda3\envs\PythonScratchTF_Test\lib\site-packages\keras\utils\traceback_utils.py", line 70, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "C:\Users\chris\anaconda3\envs\PythonScratchTF_Test\lib\site-packages\tensorflow_similarity\callbacks.py", line 188, in on_epoch_end
    known_results = _compute_metrics(
  File "C:\Users\chris\anaconda3\envs\PythonScratchTF_Test\lib\site-packages\tensorflow_similarity\callbacks.py", line 291, in _compute_metrics
    classification_results = evaluator.evaluate_classification(
  File "C:\Users\chris\anaconda3\envs\PythonScratchTF_Test\lib\site-packages\tensorflow_similarity\evaluators\memory_evaluator.py", line 152, in evaluate_classification
    matcher.compute_count(
  File "C:\Users\chris\anaconda3\envs\PythonScratchTF_Test\lib\site-packages\tensorflow_similarity\matchers\classification_match.py", line 177, in compute_count
    match_mask, distance_mask = self._compute_match_indicators(
  File "C:\Users\chris\anaconda3\envs\PythonScratchTF_Test\lib\site-packages\tensorflow_similarity\matchers\classification_match.py", line 130, in _compute_match_indicators
    d_labels, d_dist = self.derive_match(lookup_labels, lookup_distances)
  File "C:\Users\chris\anaconda3\envs\PythonScratchTF_Test\lib\site-packages\tensorflow_similarity\matchers\match_nearest.py", line 55, in derive_match
    return lookup_labels[:, :1], lookup_distances[:, :1]
tensorflow.python.framework.errors_impl.InvalidArgumentError: {{function_node __wrapped__StridedSlice_device_/job:localhost/replica:0/task:0/device:GPU:0}} Index out of range using input dim 1; input has only 1 dims [Op:StridedSlice] name: strided_slice/

Epoch 1/800
62/62 [==============================] - ETA: 0s - loss: 331.3144 - proj_std: 0.0441Traceback (most recent call last):
  File "C:\Users\chris\Documents\XXXXX\Projects\PythonScratch\tfsim_contrastive_model\train_synthetic.py", line 194, in <module>
    main()
  File "C:\Users\chris\Documents\XXXXX\Projects\PythonScratch\tfsim_contrastive_model\train_synthetic.py", line 178, in main
    history = contrastive_model.fit(
  File "C:\Users\chris\anaconda3\envs\PythonScratchTF_Test\lib\site-packages\keras\utils\traceback_utils.py", line 70, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "C:\Users\chris\anaconda3\envs\PythonScratchTF_Test\lib\site-packages\tensorflow_similarity\callbacks.py", line 186, in on_epoch_end
    self.model.index(self.targets, self.target_labels, verbose=0)
  File "C:\Users\chris\anaconda3\envs\PythonScratchTF_Test\lib\site-packages\tensorflow_similarity\models\contrastive_model.py", line 558, in index
    predictions = self.predict(x)
  File "C:\Users\chris\anaconda3\envs\PythonScratchTF_Test\lib\site-packages\tensorflow_similarity\models\contrastive_model.py", line 457, in predict
    x = self.backbone.predict(
ValueError: can only convert an array of size 1 to a Python scalar

Epoch 1/800
62/62 [==============================] - ETA: 0s - loss: 329.6670 - proj_std: 0.0441Traceback (most recent call last):
  File "C:\Users\chris\Documents\XXXXX\Projects\PythonScratch\tfsim_contrastive_model\train_synthetic.py", line 194, in <module>
    main()
  File "C:\Users\chris\Documents\XXXXX\Projects\PythonScratch\tfsim_contrastive_model\train_synthetic.py", line 178, in main
    history = contrastive_model.fit(
  File "C:\Users\chris\anaconda3\envs\PythonScratchTF_Test\lib\site-packages\keras\utils\traceback_utils.py", line 70, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "C:\Users\chris\anaconda3\envs\PythonScratchTF_Test\lib\site-packages\tensorflow_similarity\callbacks.py", line 188, in on_epoch_end
    known_results = _compute_metrics(
  File "C:\Users\chris\anaconda3\envs\PythonScratchTF_Test\lib\site-packages\tensorflow_similarity\callbacks.py", line 291, in _compute_metrics
    classification_results = evaluator.evaluate_classification(
  File "C:\Users\chris\anaconda3\envs\PythonScratchTF_Test\lib\site-packages\tensorflow_similarity\evaluators\memory_evaluator.py", line 152, in evaluate_classification
    matcher.compute_count(
  File "C:\Users\chris\anaconda3\envs\PythonScratchTF_Test\lib\site-packages\tensorflow_similarity\matchers\classification_match.py", line 177, in compute_count
    match_mask, distance_mask = self._compute_match_indicators(
  File "C:\Users\chris\anaconda3\envs\PythonScratchTF_Test\lib\site-packages\tensorflow_similarity\matchers\classification_match.py", line 128, in _compute_match_indicators
    ClassificationMatch._check_shape(query_labels, lookup_labels, lookup_distances)
  File "C:\Users\chris\anaconda3\envs\PythonScratchTF_Test\lib\site-packages\tensorflow_similarity\matchers\classification_match.py", line 305, in _check_shape
    raise ValueError("Number of query labels must match the number of " "lookup_label sets.")
ValueError: Number of query labels must match the number of lookup_label sets.

I'm working pretty close to the unsupervised-learning example notebook with the following key exceptions:

- The dataset is custom, with input size (None, 64, 64, 1).
- The backbone is the same as the one in the supervised learning notebook.
- The only callback I'm using is the EvalCallback.

I'm using python==3.8.16 and tensorflow==2.10.1

Owen Vallis · Answer 1 · Fri Aug 11 2023 13:28:12 GMT+0800 (China Standard Time)

Thanks @shinstra, I'll try to take a look at this. The lookup error may be caused by something in the result set returned by nmslib, but I'll have to dig into the other errors to find out more.
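One way to narrow down the result-set theory would be to check the lookup output before it reaches the matcher. The helper below is a hypothetical sketch, not part of tensorflow_similarity, and it assumes the lookup labels are available as one plain Python list per query. The first check mirrors the message in the third traceback; the second (every list having the same length) is an assumption about what a malformed result set might look like.

```python
def find_lookup_shape_problems(query_labels, lookup_labels):
    """Return a list of problem descriptions; empty if the shapes look consistent.

    Hypothetical helper: query_labels is a flat sequence, lookup_labels is a
    sequence holding one list of neighbor labels per query.
    """
    problems = []

    # One lookup label set is expected per query label.
    if len(lookup_labels) != len(query_labels):
        problems.append(
            f"{len(query_labels)} query labels but "
            f"{len(lookup_labels)} lookup label sets"
        )

    # Assumption: every query should return the same number of neighbors.
    row_lengths = sorted({len(row) for row in lookup_labels})
    if len(row_lengths) > 1:
        problems.append(f"lookup label sets have differing lengths: {row_lengths}")

    return problems


consistent = find_lookup_shape_problems([0, 1], [[0, 0], [1, 0]])
ragged = find_lookup_shape_problems([0, 1, 2], [[0, 0], [1]])
```

Here consistent is an empty list, while ragged reports both a count mismatch and differing list lengths. Logging the output of such a check on the first epoch could show whether the failure is already present in the result set or only appears later in the pipeline.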