tensorflow / model-analysis

Model analysis tools for TensorFlow

TFMA unable to find metrics for Keras model when loading eval result

thisisandreeeee opened this issue · comments

commented

System information

  • Have I written custom code (as opposed to using a stock example script
    provided in TensorFlow Model Analysis): Yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): macOS Catalina
  • TensorFlow Model Analysis installed from (source or binary): pypi
  • TensorFlow Model Analysis version (use command below): 0.22.1
  • Python version: 3.7.5
  • Jupyter Notebook version: 1.0.0

Describe the problem

I have trained a Keras model (not estimator) with the following serving signature:

signature_def['serving_default']:
  The given SavedModel SignatureDef contains the following input(s):
    inputs['examples'] tensor_info:
        dtype: DT_STRING
        shape: (-1)
        name: serving_default_examples:0
  The given SavedModel SignatureDef contains the following output(s):
    outputs['mu'] tensor_info:
        dtype: DT_FLOAT
        shape: (-1, 1)
        name: StatefulPartitionedCall_1:0
    outputs['sigma'] tensor_info:
        dtype: DT_FLOAT
        shape: (-1, 1)
        name: StatefulPartitionedCall_1:1
  Method name is: tensorflow/serving/predict

The weights are updated using a custom training loop with gradient tape, instead of the model.fit method, before the model is exported as a saved_model. As I am unable to get TFMA to work without first compiling the model, I compile the model while specifying a set of custom Keras metrics:

model.compile(metrics=custom_keras_metrics) # each custom metric inherits from keras.Metric
custom_training_loop(model)
model.save("path/to/saved_model", save_format="tf")
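
For context, the training loop is roughly of the following shape. This is only a sketch of what is described above, not the actual code; the dataset, optimizer, loss, and function signature are placeholders:

import tensorflow as tf

# Sketch only: train_dataset, the optimizer, and the loss are illustrative
# placeholders, and the signature differs from the call shown above.
def custom_training_loop(model, train_dataset, custom_keras_metrics):
    optimizer = tf.keras.optimizers.Adam()
    loss_fn = tf.keras.losses.MeanSquaredError()
    for x_batch, y_batch in train_dataset:
        with tf.GradientTape() as tape:
            predictions = model(x_batch, training=True)
            loss = loss_fn(y_batch, predictions)
        gradients = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(gradients, model.trainable_variables))
        # Metric state is updated by hand instead of via model.fit.
        for metric in custom_keras_metrics:
            metric.update_state(y_batch, predictions)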

I would like to evaluate this model using TFMA, so I first initialise an eval shared model as follows:

eval_config = tfma.EvalConfig(
    model_specs=[tfma.ModelSpec(label_key="my_label_key")],
    slicing_specs=[tfma.SlicingSpec()] # empty slice refers to the entire dataset
)
eval_shared_model = tfma.default_eval_shared_model("path/to/saved_model", eval_config=eval_config)

However, when I try to run model analysis:

eval_results = tfma.run_model_analysis(
    eval_shared_model=eval_shared_model,
    data_location="path/to/test/tfrecords*",
    file_format="tfrecords"
)

I am faced with the following error:

ValueError          Traceback (most recent call last)
<ipython-input-156-f9a9684a6797> in <module>
      2     eval_shared_model=eval_shared_model,
      3     data_location="tfma/test_raw-*",
----> 4     file_format="tfrecords"
      5 )

~/.pyenv/versions/miniconda3-4.3.30/envs/tensorflow/lib/python3.7/site-packages/tensorflow_model_analysis/api/model_eval_lib.py in run_model_analysis(eval_shared_model, eval_config, data_location, file_format, output_path, extractors, evaluators, writers, pipeline_options, slice_spec, write_config, compute_confidence_intervals, min_slice_size, random_seed_for_testing, schema)
   1204 
   1205   if len(eval_config.model_specs) <= 1:
-> 1206     return load_eval_result(output_path)
   1207   else:
   1208     results = []

~/.pyenv/versions/miniconda3-4.3.30/envs/tensorflow/lib/python3.7/site-packages/tensorflow_model_analysis/api/model_eval_lib.py in load_eval_result(output_path, model_name)
    383       metrics_and_plots_serialization.load_and_deserialize_metrics(
    384           path=os.path.join(output_path, constants.METRICS_KEY),
--> 385           model_name=model_name))
    386   plots_proto_list = (
    387       metrics_and_plots_serialization.load_and_deserialize_plots(

~/.pyenv/versions/miniconda3-4.3.30/envs/tensorflow/lib/python3.7/site-packages/tensorflow_model_analysis/writers/metrics_and_plots_serialization.py in load_and_deserialize_metrics(path, model_name)
    180       raise ValueError('Fail to find metrics for model name: %s . '
    181                        'Available model names are [%s]' %
--> 182                        (model_name, ', '.join(keys)))
    183 
    184     result.append((

ValueError: Fail to find metrics for model name: None . Available model names are []

Why is TFMA raising this exception, and where should I begin debugging this error? I tried specifying the model names manually (which should not be required since I'm only using one model), but that did not seem to help either. I tried tracing the source code and it seems this happens when TFMA tries to load the eval result generated by the PTransform.

Can you try adding a non-custom metric? I suspect that no metrics are being computed (the poor error message has since been fixed, but I'm not sure that fix has been released yet). Also, can you try loading the model with the custom metrics and check whether the metrics were saved:

model = tf.keras.models.load_model(model_path)
model.metrics
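
For the first suggestion, a stock metric can be compiled in next to the custom ones, e.g. (a sketch, reusing the custom_keras_metrics list from above):

model.compile(metrics=custom_keras_metrics + [tf.keras.metrics.MeanAbsoluteError()])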

commented

When I do not pass compile=False, the load_model call raises the following error:

ValueError: Unknown metric function: CustomMetric
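
For reference, that deserialization error can usually be avoided by registering the custom class at load time; a minimal sketch, where my_metrics is a hypothetical module containing CustomMetric:

import tensorflow as tf
from my_metrics import CustomMetric  # hypothetical module

# Registering the class lets Keras deserialize the compiled custom metrics.
model = tf.keras.models.load_model(
    "path/to/saved_model", custom_objects={"CustomMetric": CustomMetric})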

It seems this also causes issues when creating the default eval shared model, which then infers the model type to be TF_GENERIC instead of TF_KERAS. I think this might be related to how I am creating the Keras model.

I require a custom training loop using gradient tape, with low-level handling of custom metrics. As such, the model does not need to be compiled, since the .fit() method is never called. I am able to train the model and compute its metrics successfully.

However, when I try passing an uncompiled model, TFMA seems to have difficulty loading it (more details here) with the following exception:

AttributeError: 'NoneType' object has no attribute 'metrics'

Therefore, I tried to compile it while passing the custom metrics:

model.compile(metrics=custom_keras_metrics)

Should I be compiling or saving the model differently in order to ensure compatibility with TFMA?

commented

On a side note, I tried following this example for creating custom metrics. I made no code changes, but even after compiling the model, the metrics attribute seems to be empty.
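
The metric itself follows the usual stateful subclass pattern from that guide, roughly along these lines (a sketch with illustrative names, not the code from the linked example):

import tensorflow as tf

class CustomMetric(tf.keras.metrics.Metric):
    """Illustrative stateful metric: mean absolute error accumulated by hand."""

    def __init__(self, name="custom_metric", **kwargs):
        super().__init__(name=name, **kwargs)
        self.total = self.add_weight(name="total", initializer="zeros")
        self.count = self.add_weight(name="count", initializer="zeros")

    def update_state(self, y_true, y_pred, sample_weight=None):
        self.total.assign_add(tf.reduce_sum(tf.abs(y_true - y_pred)))
        self.count.assign_add(tf.cast(tf.size(y_true), tf.float32))

    def result(self):
        return self.total / self.count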

>>> model.metrics
[]

Is that expected?

What version of TF are you using? There was a bug in some versions of TF where the metrics were not restored on load.

commented

I am using tensorflow 2.3.0. I'm not sure this is related, because the metrics are not available when I call model.metrics immediately after compilation (without saving and loading).

commented

@mdreves Is there an example that runs model analysis for Keras models trained using gradient tape with custom metrics that I could reference?

Not that I'm aware of.

I think the problem is that TF creates these lazily and doesn't recognize the metrics as part of the model until after model.fit is called, which means that in your case they will not be saved and are unknown to TFMA. One option is to manually add your metrics via the TFMA config (see [1]). You will need to make sure the library containing the custom code is available on the workers, though.

[1] https://github.com/tensorflow/model-analysis/blob/master/g3doc/metrics.md#customization

commented

Ah, I see. What worked for me was:

First doing a no-op compile:

model.compile(optimizer=my_custom_optimizer) # did not specify loss or metrics here

Then passing the custom metric through the MetricsSpec:

eval_config = tfma.EvalConfig(
    model_specs=[tfma.ModelSpec(label_key="my_label")],
    metrics_specs=[
        tfma.MetricsSpec(
            metrics=[tfma.MetricConfig(
                class_name="MyCustomMetric",
                module="module.containing.metric"
            )]
        )
    ],
    slicing_specs=[
        tfma.SlicingSpec(), # empty slice refers to the entire dataset
    ]
)
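
With that config in place, the earlier calls become roughly the following (same placeholder paths as above, with the config now passed to run_model_analysis as well):

eval_shared_model = tfma.default_eval_shared_model(
    "path/to/saved_model", eval_config=eval_config)

eval_results = tfma.run_model_analysis(
    eval_shared_model=eval_shared_model,
    eval_config=eval_config,
    data_location="path/to/test/tfrecords*",
    file_format="tfrecords",
)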

I'm running into another error now, but it doesn't seem to be related to this issue, so I'll go ahead and resolve this one.

Thank you, I appreciate the help!