tensorflow / model-analysis

Model analysis tools for TensorFlow


Unable to use custom Keras metrics with TFX evaluator component

rclough opened this issue · comments

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow Model Analysis): Using the TFX Evaluator component
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): macOS, executing on Kubeflow (both direct and Dataflow runners)
  • TensorFlow Model Analysis installed from (source or binary): TFX 0.22.1
  • TensorFlow Model Analysis version (use command below): 0.22.2
  • Python version: 3.6
  • Exact command to reproduce: N/A

Describe the problem

TL;DR: We are attempting to use a custom Keras metric, but the Beam job fails with the following error (full stack trace in the logs section below):

ValueError: Unknown metric function: NormalizedBinaryCrossEntropy

We have the metric implemented in our own Python code and reference it in the eval_config/MetricsSpec like so, based on the TFMA documentation for custom Keras metrics:

import tensorflow_model_analysis as tfma
from google.protobuf.wrappers_pb2 import BoolValue

eval_config = tfma.EvalConfig(
    metrics_specs=[
        tfma.MetricsSpec(
            metrics=[
                tfma.config.MetricConfig(
                    class_name="BinaryCrossentropy",
                ),
                ... # more metrics
                tfma.config.MetricConfig(
                    class_name="NormalizedBinaryCrossEntropy", module="my.module_file.path.metrics"
                ),
            ]
        ),
    ],
    slicing_specs=[
        tfma.SlicingSpec(),
        tfma.SlicingSpec(feature_keys=["gender"]),
    ],
    options=tfma.Options(include_default_metrics=BoolValue(value=True)),
)

*For clarification, my.module_file.path.metrics is a sanitized name that represents a real module file path
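
For reference, the custom metric follows the standard pattern from the TFMA documentation for custom Keras metrics: a tf.keras.metrics.Metric subclass that is serializable via get_config. The sketch below is illustrative only; the actual normalization logic in our NormalizedBinaryCrossEntropy is omitted:

import tensorflow as tf


class NormalizedBinaryCrossEntropy(tf.keras.metrics.Metric):
    """Illustrative sketch only -- not our real implementation."""

    def __init__(self, name="normalized_binary_cross_entropy", **kwargs):
        super().__init__(name=name, **kwargs)
        self._total = self.add_weight(name="total", initializer="zeros")
        self._count = self.add_weight(name="count", initializer="zeros")

    def update_state(self, y_true, y_pred, sample_weight=None):
        bce = tf.keras.losses.binary_crossentropy(y_true, y_pred)
        self._total.assign_add(tf.reduce_sum(bce))
        self._count.assign_add(tf.cast(tf.size(bce), tf.float32))

    def result(self):
        return tf.math.divide_no_nan(self._total, self._count)

    def get_config(self):
        # Required so tf.keras.metrics.deserialize can reconstruct the metric.
        return {"name": self.name}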

We first ran into this on Dataflow and suspected we might not have packaged the code correctly. However, we provide a Beam extra package with the custom metric module, verified from the Dataflow worker logs that it was installed, and then pulled the package down to inspect it; it contains all the expected code and setup.py.

Furthermore, we also ran the code with the direct Beam runner. In that environment we can verify that from my.module_file.path.metrics import NormalizedBinaryCrossEntropy succeeds, yet the pipeline still fails with the same error above.
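
For reference, the extra package is supplied through the Beam pipeline options roughly as follows (the path and package name here are placeholders, not our real ones):

beam_pipeline_args = [
    "--runner=DataflowRunner",
    # sdist built from our setup.py; contains my.module_file.path.metrics.
    "--extra_package=/local/path/to/my_metrics-0.1.0.tar.gz",
]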

I'm hoping that I simply missed something about making the module available at runtime, or that our version of TFX/TFMA is too old to support custom Keras metrics. But given that there isn't any further documentation or any examples of custom metrics running in Beam, I made an attempt at following the code path in the stack trace (see the logs section for the full trace).

I'm not super familiar with the TFMA or tensorflow/keras codebases, so pardon me if this analysis is misinformed; the code is fairly convoluted for someone not intimately acquainted with it.

My current suspicion is that the code doesn't actually make use of the module attribute specified in the tfma.config.MetricConfig. I followed the stack trace, and unfortunately I don't have a good way of testing Beam code, but I suspect the custom objects scope isn't loading the modules correctly, since that seems to be how TFMA provides the module scope to Keras when it deserializes the metrics (it is not otherwise passing the module from the config, which initially threw me off until I saw this line).
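
To illustrate the mechanism I suspect is failing (a standalone sketch, independent of TFMA, with an illustrative config dict): Keras deserialization only finds a custom class if it is registered in the custom objects scope at the time deserialize is called.

import tensorflow as tf
from my.module_file.path.metrics import NormalizedBinaryCrossEntropy  # sanitized module name

config = {"class_name": "NormalizedBinaryCrossEntropy", "config": {"name": "norm_bce"}}

# Without a custom objects scope this raises:
#   ValueError: Unknown metric function: NormalizedBinaryCrossEntropy
# tf.keras.metrics.deserialize(config)

# With the class registered in scope, deserialization succeeds.
with tf.keras.utils.custom_object_scope(
        {"NormalizedBinaryCrossEntropy": NormalizedBinaryCrossEntropy}):
    metric = tf.keras.metrics.deserialize(config)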

Indeed, I just looked through the release notes and saw that 0.23 had some changes in how custom objects are loaded, though my brain hurts now and I'm not sure whether that could be causing our problem; the way the commit is worded, it seems our module would have to end with keras.metrics to be affected, which it does not.

Unfortunately it is difficult to try newer versions, both because of the corresponding TFX changes and because we run on shared infrastructure, but I'm working on that as well.

Source code / logs

Stack Trace from direct runner:

Traceback (most recent call last):
  File "spotify_kubeflow/component/sdk/execution/run_component.py", line 53, in <module>
    fire.Fire(main)
  File "/usr/local/lib/python3.6/dist-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/usr/local/lib/python3.6/dist-packages/fire/core.py", line 471, in _Fire
    target=component.__name__)
  File "/usr/local/lib/python3.6/dist-packages/fire/core.py", line 681, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "spotify_kubeflow/component/sdk/execution/run_component.py", line 49, in main
    run_component(**kwargs)
  File "spotify_kubeflow/component/sdk/execution/run_component.py", line 37, in run_component
    runner.launch()
  File "/ml/spotify_kubeflow/component/sdk/execution/component_launcher.py", line 73, in launch
    self._run_and_publish(execution_decision)
  File "/ml/spotify_kubeflow/component/sdk/execution/component_launcher.py", line 90, in _run_and_publish
    execution_decision.exec_properties,
  File "/ml/spotify_kubeflow/component/sdk/execution/component_launcher.py", line 127, in _run_executor
    executor.Do(input_dict, output_dict, exec_properties)
  File "/ml/spotify_kubeflow/component/common/evaluator/with_module_file/executor.py", line 184, in Do
    eval_config=eval_config,
  File "/usr/local/lib/python3.6/dist-packages/apache_beam/pipeline.py", line 524, in __exit__
    self.run().wait_until_finish()
  File "/usr/local/lib/python3.6/dist-packages/apache_beam/pipeline.py", line 510, in run
    return self.runner.run_pipeline(self, self._options)
  File "/usr/local/lib/python3.6/dist-packages/apache_beam/runners/direct/direct_runner.py", line 130, in run_pipeline
    return runner.run_pipeline(pipeline, options)
  File "/usr/local/lib/python3.6/dist-packages/apache_beam/runners/portability/fn_api_runner/fn_runner.py", line 179, in run_pipeline
    pipeline.to_runner_api(default_environment=self._default_environment))
  File "/usr/local/lib/python3.6/dist-packages/apache_beam/runners/portability/fn_api_runner/fn_runner.py", line 189, in run_via_runner_api
    return self.run_stages(stage_context, stages)
  File "/usr/local/lib/python3.6/dist-packages/apache_beam/runners/portability/fn_api_runner/fn_runner.py", line 335, in run_stages
    bundle_context_manager,
  File "/usr/local/lib/python3.6/dist-packages/apache_beam/runners/portability/fn_api_runner/fn_runner.py", line 545, in _run_stage
    expected_timer_output)
  File "/usr/local/lib/python3.6/dist-packages/apache_beam/runners/portability/fn_api_runner/fn_runner.py", line 1051, in process_bundle
    for result, split_result in executor.map(execute, part_inputs):
  File "/usr/lib/python3.6/concurrent/futures/_base.py", line 586, in result_iterator
    yield fs.pop().result()
  File "/usr/lib/python3.6/concurrent/futures/_base.py", line 432, in result
    return self.__get_result()
  File "/usr/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
  File "/usr/local/lib/python3.6/dist-packages/apache_beam/utils/thread_pool_executor.py", line 44, in run
    self._future.set_result(self._fn(*self._fn_args, **self._fn_kwargs))
  File "/usr/local/lib/python3.6/dist-packages/apache_beam/runners/portability/fn_api_runner/fn_runner.py", line 1048, in execute
    part_map, expected_outputs, fired_timers, expected_output_timers)
  File "/usr/local/lib/python3.6/dist-packages/apache_beam/runners/portability/fn_api_runner/fn_runner.py", line 947, in process_bundle
    result_future = self._worker_handler.control_conn.push(process_bundle_req)
  File "/usr/local/lib/python3.6/dist-packages/apache_beam/runners/portability/fn_api_runner/worker_handlers.py", line 349, in push
    response = self.worker.do_instruction(request)
  File "/usr/local/lib/python3.6/dist-packages/apache_beam/runners/worker/sdk_worker.py", line 471, in do_instruction
    getattr(request, request_type), request.instruction_id)
  File "/usr/local/lib/python3.6/dist-packages/apache_beam/runners/worker/sdk_worker.py", line 506, in process_bundle
    bundle_processor.process_bundle(instruction_id))
  File "/usr/local/lib/python3.6/dist-packages/apache_beam/runners/worker/bundle_processor.py", line 977, in process_bundle
    op.finish()
  File "apache_beam/runners/worker/operations.py", line 982, in apache_beam.runners.worker.operations.PGBKCVOperation.finish
  File "apache_beam/runners/worker/operations.py", line 985, in apache_beam.runners.worker.operations.PGBKCVOperation.finish
  File "apache_beam/runners/worker/operations.py", line 993, in apache_beam.runners.worker.operations.PGBKCVOperation.output_key
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_model_analysis/evaluators/metrics_and_plots_evaluator_v2.py", line 364, in compact
    return super(_ComputationsCombineFn, self).compact(accumulator)
  File "/usr/local/lib/python3.6/dist-packages/apache_beam/transforms/combiners.py", line 706, in compact
    return [c.compact(a) for c, a in zip(self._combiners, accumulator)]
  File "/usr/local/lib/python3.6/dist-packages/apache_beam/transforms/combiners.py", line 706, in <listcomp>
    return [c.compact(a) for c, a in zip(self._combiners, accumulator)]
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_model_analysis/metrics/tf_metric_wrapper.py", line 560, in compact
    self._process_batch(accumulator)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_model_analysis/metrics/tf_metric_wrapper.py", line 518, in _process_batch
    self._setup_if_needed()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_model_analysis/metrics/tf_metric_wrapper.py", line 513, in _setup_if_needed
    _deserialize_metrics(self._metric_configs[i]))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_model_analysis/metrics/tf_metric_wrapper.py", line 258, in _deserialize_metrics
    return [tf.keras.metrics.deserialize(c) for c in metric_configs]
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_model_analysis/metrics/tf_metric_wrapper.py", line 258, in <listcomp>
    return [tf.keras.metrics.deserialize(c) for c in metric_configs]
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/metrics.py", line 3443, in deserialize
    printable_module_name='metric function')
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/utils/generic_utils.py", line 347, in deserialize_keras_object
    config, module_objects, custom_objects, printable_module_name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/utils/generic_utils.py", line 296, in class_and_config_for_serialized_keras_object
    raise ValueError('Unknown ' + printable_module_name + ': ' + class_name)
ValueError: Unknown metric function: NormalizedBinaryCrossEntropy
 

In general, this sounds like an issue where the custom module is not available on the Beam worker, even though it is available in your main program (where the Beam pipeline is constructed). My first suggestion is to try setting the module_file parameter in the Evaluator component to an absolute path to your module file. This will package the module file and distribute it to the Beam workers.
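
For example, something along these lines (a sketch only; the channel names and path are placeholders, and module_file support depends on your TFX version):

from tfx.components import Evaluator

evaluator = Evaluator(
    examples=example_gen.outputs["examples"],
    model=trainer.outputs["model"],
    eval_config=eval_config,
    # Absolute path to the file that defines NormalizedBinaryCrossEntropy.
    module_file="/absolute/path/to/metrics.py",
)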

A couple clarifications which might help narrow things down:

  • How are you invoking the Evaluator component? Are you running it in a TFX pipeline? Ideally we could find a portable way to reproduce this issue.
  • Was this custom metric setup previously working, and it broke with TFMA version 0.22.2? Or has this never worked?
  • Does your real metric module name (not what I assume is the placeholder, my.module_file.path.metrics) include the literal tf.keras.metrics? If so, it might have been affected by the change you linked, though this seems pretty unlikely.

Have some updates here -

  • We publish the custom metrics code in an extra_package that is sent to the Beam workers. We can confirm this package contains the code and, from the logs, that the workers install the extra_package.
  • The setup was not previously working; this was our first attempt at implementing custom metrics.
  • It does not include the literal tf.keras.metrics, which is why I thought it wouldn't be affected.

That said, we have managed to update the code to use TFX 0.27 (TFMA 0.27), and magically it seems to work 🤷 (no other changes to the evaluator component, the custom metrics code, or its packaging). I have no idea what change caused the fix, but perhaps we can close the issue.

Well, glad it's fixed. It would be nice to understand the root cause, but it's probably not a high priority unless this issue reappears for others. Thanks for taking the time to report the issue and follow up.

Closing this based on the above comment trace. Thanks.