Unable to use custom Keras metrics with TFX evaluator component
rclough opened this issue · comments
System information
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow Model Analysis): Using TFX Evaluator component
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): macOS, executing on Kubeflow (both direct and Dataflow runners)
- TensorFlow Model Analysis installed from (source or binary): TFX 0.22.1
- TensorFlow Model Analysis version (use command below): 0.22.2
- Python version: 3.6
- Exact command to reproduce: N/A
Describe the problem
TLDR: We are attempting to use a custom Keras metric, but the beam job fails with the following (longer stack trace posted later):
ValueError: Unknown metric function: NormalizedBinaryCrossEntropy
We have the metric implemented in our own Python code and reference it in the eval_config/MetricsSpec like so, based on the TFMA documentation for custom Keras metrics:
import tensorflow_model_analysis as tfma
from google.protobuf.wrappers_pb2 import BoolValue

eval_config = tfma.EvalConfig(
    metrics_specs=[
        tfma.MetricsSpec(
            metrics=[
                tfma.config.MetricConfig(
                    class_name="BinaryCrossentropy",
                ),
                ...  # more metrics
                tfma.config.MetricConfig(
                    class_name="NormalizedBinaryCrossEntropy",
                    module="my.module_file.path.metrics",
                ),
            ]
        ),
    ],
    slicing_specs=[
        tfma.SlicingSpec(),
        tfma.SlicingSpec(feature_keys=["gender"]),
    ],
    options=tfma.Options(include_default_metrics=BoolValue(value=True)),
)
*For clarification, my.module_file.path.metrics is a sanitized name that represents a real module file path.
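For context, a module like my.module_file.path.metrics would typically define the metric as a tf.keras.metrics.Metric subclass. Below is a minimal sketch (a hypothetical reconstruction, not our actual implementation; here "normalized" means binary cross-entropy divided by the entropy of the label base rate, one common definition). It also exercises the serialize/deserialize round trip that TFMA ultimately performs:

```python
import tensorflow as tf


class NormalizedBinaryCrossEntropy(tf.keras.metrics.Metric):
    """Hypothetical sketch: mean binary cross-entropy divided by the
    entropy of the empirical positive rate of the labels."""

    def __init__(self, name="normalized_binary_cross_entropy", **kwargs):
        super().__init__(name=name, **kwargs)
        self._bce_sum = self.add_weight(name="bce_sum", initializer="zeros")
        self._label_sum = self.add_weight(name="label_sum", initializer="zeros")
        self._count = self.add_weight(name="count", initializer="zeros")

    def update_state(self, y_true, y_pred, sample_weight=None):
        # sample_weight is ignored in this sketch.
        eps = 1e-7
        y_true = tf.cast(tf.reshape(y_true, [-1]), tf.float32)
        y_pred = tf.clip_by_value(
            tf.cast(tf.reshape(y_pred, [-1]), tf.float32), eps, 1.0 - eps)
        bce = -(y_true * tf.math.log(y_pred)
                + (1.0 - y_true) * tf.math.log(1.0 - y_pred))
        self._bce_sum.assign_add(tf.reduce_sum(bce))
        self._label_sum.assign_add(tf.reduce_sum(y_true))
        self._count.assign_add(tf.cast(tf.size(y_true), tf.float32))

    def result(self):
        eps = 1e-7
        p = tf.clip_by_value(self._label_sum / self._count, eps, 1.0 - eps)
        base_entropy = -(p * tf.math.log(p)
                         + (1.0 - p) * tf.math.log(1.0 - p))
        return (self._bce_sum / self._count) / base_entropy


# The round trip TFMA performs internally: serialize to a config dict, then
# deserialize, which only succeeds if Keras can resolve the class name.
config = tf.keras.metrics.serialize(NormalizedBinaryCrossEntropy())
metric = tf.keras.metrics.deserialize(
    config,
    custom_objects={"NormalizedBinaryCrossEntropy": NormalizedBinaryCrossEntropy})
```

Deserializing without the custom_objects mapping is what produces the "Unknown metric function" error we see.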
We first ran into this on Dataflow and suspected we hadn't packaged the code correctly, but we provide a Beam extra package with the custom metric module installed, verified from the Dataflow worker logs that it was installed, and pulled the package down to inspect it: it contains all the expected code and setup.py.
Furthermore, we also ran the code with the direct Beam runner. There we can verify that from my.module_file.path.metrics import NormalizedBinaryCrossEntropy succeeds, yet the pipeline still fails with the same error as above.
I'm hoping I missed something simple about making the module available at runtime, or maybe our version of TFX/TFMA is too old to support custom Keras metrics. But given that there isn't any further documentation or example of custom metrics running in Beam, I attempted to follow the code path in the stack trace (see the logs section for the full trace).
I'm not very familiar with the TFMA or TensorFlow/Keras codebases, so pardon me if this analysis is misinformed; the code is fairly convoluted for someone not intimately acquainted with it.
My current suspicion is that the code doesn't actually make use of the module attribute specified in the tfma.config.MetricConfig. I followed the stack trace and, while I don't have a good way of testing Beam code, I suspect the custom objects scope isn't loading the modules correctly, since that seems to be how TFMA provides the module scope to Keras when it deserializes the metrics (it does not otherwise pass the module from the config, which initially threw me off until I saw this line).
I also looked through the release notes and saw that 0.23 made some changes to how custom objects are loaded, though I'm not sure that could be causing our problem; the way the commit is worded, our module would have to end with keras.metrics to be affected, which it does not.
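The mechanism in question can be exercised directly: Keras resolves non-built-in class names through the active custom-object scope at deserialization time. A minimal sketch (DoubledMean is a made-up stand-in for a user-defined metric class):

```python
import tensorflow as tf


class DoubledMean(tf.keras.metrics.Mean):
    """Hypothetical stand-in for a user-defined metric class."""


# Serialize an instance to the same kind of config dict TFMA stores.
config = tf.keras.metrics.serialize(DoubledMean(name="doubled_mean"))

# Deserialization succeeds only when the class is resolvable, e.g. via the
# custom-object scope that TFMA is expected to set up from `module`.
with tf.keras.utils.custom_object_scope({"DoubledMean": DoubledMean}):
    metric = tf.keras.metrics.deserialize(config)

print(type(metric).__name__)
```

If the scope (or an equivalent custom_objects mapping) is missing, this is exactly where an "Unknown metric function" error surfaces.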
Unfortunately it is difficult to try newer versions due to the corresponding TFX changes, and running on shared infrastructure, but I'm working to try that as well.
Source code / logs
Stack Trace from direct runner:
Traceback (most recent call last):
File "spotify_kubeflow/component/sdk/execution/run_component.py", line 53, in <module>
fire.Fire(main)
File "/usr/local/lib/python3.6/dist-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/usr/local/lib/python3.6/dist-packages/fire/core.py", line 471, in _Fire
target=component.__name__)
File "/usr/local/lib/python3.6/dist-packages/fire/core.py", line 681, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "spotify_kubeflow/component/sdk/execution/run_component.py", line 49, in main
run_component(**kwargs)
File "spotify_kubeflow/component/sdk/execution/run_component.py", line 37, in run_component
runner.launch()
File "/ml/spotify_kubeflow/component/sdk/execution/component_launcher.py", line 73, in launch
self._run_and_publish(execution_decision)
File "/ml/spotify_kubeflow/component/sdk/execution/component_launcher.py", line 90, in _run_and_publish
execution_decision.exec_properties,
File "/ml/spotify_kubeflow/component/sdk/execution/component_launcher.py", line 127, in _run_executor
executor.Do(input_dict, output_dict, exec_properties)
File "/ml/spotify_kubeflow/component/common/evaluator/with_module_file/executor.py", line 184, in Do
eval_config=eval_config,
File "/usr/local/lib/python3.6/dist-packages/apache_beam/pipeline.py", line 524, in __exit__
self.run().wait_until_finish()
File "/usr/local/lib/python3.6/dist-packages/apache_beam/pipeline.py", line 510, in run
return self.runner.run_pipeline(self, self._options)
File "/usr/local/lib/python3.6/dist-packages/apache_beam/runners/direct/direct_runner.py", line 130, in run_pipeline
return runner.run_pipeline(pipeline, options)
File "/usr/local/lib/python3.6/dist-packages/apache_beam/runners/portability/fn_api_runner/fn_runner.py", line 179, in run_pipeline
pipeline.to_runner_api(default_environment=self._default_environment))
File "/usr/local/lib/python3.6/dist-packages/apache_beam/runners/portability/fn_api_runner/fn_runner.py", line 189, in run_via_runner_api
return self.run_stages(stage_context, stages)
File "/usr/local/lib/python3.6/dist-packages/apache_beam/runners/portability/fn_api_runner/fn_runner.py", line 335, in run_stages
bundle_context_manager,
File "/usr/local/lib/python3.6/dist-packages/apache_beam/runners/portability/fn_api_runner/fn_runner.py", line 545, in _run_stage
expected_timer_output)
File "/usr/local/lib/python3.6/dist-packages/apache_beam/runners/portability/fn_api_runner/fn_runner.py", line 1051, in process_bundle
for result, split_result in executor.map(execute, part_inputs):
File "/usr/lib/python3.6/concurrent/futures/_base.py", line 586, in result_iterator
yield fs.pop().result()
File "/usr/lib/python3.6/concurrent/futures/_base.py", line 432, in result
return self.__get_result()
File "/usr/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
raise self._exception
File "/usr/local/lib/python3.6/dist-packages/apache_beam/utils/thread_pool_executor.py", line 44, in run
self._future.set_result(self._fn(*self._fn_args, **self._fn_kwargs))
File "/usr/local/lib/python3.6/dist-packages/apache_beam/runners/portability/fn_api_runner/fn_runner.py", line 1048, in execute
part_map, expected_outputs, fired_timers, expected_output_timers)
File "/usr/local/lib/python3.6/dist-packages/apache_beam/runners/portability/fn_api_runner/fn_runner.py", line 947, in process_bundle
result_future = self._worker_handler.control_conn.push(process_bundle_req)
File "/usr/local/lib/python3.6/dist-packages/apache_beam/runners/portability/fn_api_runner/worker_handlers.py", line 349, in push
response = self.worker.do_instruction(request)
File "/usr/local/lib/python3.6/dist-packages/apache_beam/runners/worker/sdk_worker.py", line 471, in do_instruction
getattr(request, request_type), request.instruction_id)
File "/usr/local/lib/python3.6/dist-packages/apache_beam/runners/worker/sdk_worker.py", line 506, in process_bundle
bundle_processor.process_bundle(instruction_id))
File "/usr/local/lib/python3.6/dist-packages/apache_beam/runners/worker/bundle_processor.py", line 977, in process_bundle
op.finish()
File "apache_beam/runners/worker/operations.py", line 982, in apache_beam.runners.worker.operations.PGBKCVOperation.finish
File "apache_beam/runners/worker/operations.py", line 985, in apache_beam.runners.worker.operations.PGBKCVOperation.finish
File "apache_beam/runners/worker/operations.py", line 993, in apache_beam.runners.worker.operations.PGBKCVOperation.output_key
File "/usr/local/lib/python3.6/dist-packages/tensorflow_model_analysis/evaluators/metrics_and_plots_evaluator_v2.py", line 364, in compact
return super(_ComputationsCombineFn, self).compact(accumulator)
File "/usr/local/lib/python3.6/dist-packages/apache_beam/transforms/combiners.py", line 706, in compact
return [c.compact(a) for c, a in zip(self._combiners, accumulator)]
File "/usr/local/lib/python3.6/dist-packages/apache_beam/transforms/combiners.py", line 706, in <listcomp>
return [c.compact(a) for c, a in zip(self._combiners, accumulator)]
File "/usr/local/lib/python3.6/dist-packages/tensorflow_model_analysis/metrics/tf_metric_wrapper.py", line 560, in compact
self._process_batch(accumulator)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_model_analysis/metrics/tf_metric_wrapper.py", line 518, in _process_batch
self._setup_if_needed()
File "/usr/local/lib/python3.6/dist-packages/tensorflow_model_analysis/metrics/tf_metric_wrapper.py", line 513, in _setup_if_needed
_deserialize_metrics(self._metric_configs[i]))
File "/usr/local/lib/python3.6/dist-packages/tensorflow_model_analysis/metrics/tf_metric_wrapper.py", line 258, in _deserialize_metrics
return [tf.keras.metrics.deserialize(c) for c in metric_configs]
File "/usr/local/lib/python3.6/dist-packages/tensorflow_model_analysis/metrics/tf_metric_wrapper.py", line 258, in <listcomp>
return [tf.keras.metrics.deserialize(c) for c in metric_configs]
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/metrics.py", line 3443, in deserialize
printable_module_name='metric function')
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/utils/generic_utils.py", line 347, in deserialize_keras_object
config, module_objects, custom_objects, printable_module_name)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/utils/generic_utils.py", line 296, in class_and_config_for_serialized_keras_object
raise ValueError('Unknown ' + printable_module_name + ': ' + class_name)
ValueError: Unknown metric function: NormalizedBinaryCrossEntropy
In general, this sounds like an issue where the custom module is not available on the Beam worker, even though it is available in your main program (where the Beam pipeline is constructed). My first thought is to suggest you try setting the module_file parameter in the Evaluator component with an absolute path to your module file. This will package the module_file and distribute it to the Beam workers.
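In code, that suggestion would look roughly like this (a sketch, assuming a TFX release whose Evaluator accepts module_file; the upstream component handles and the path are placeholders):

```python
from tfx.components import Evaluator

# Hypothetical upstream handles (example_gen, trainer) and module path.
evaluator = Evaluator(
    examples=example_gen.outputs["examples"],
    model=trainer.outputs["model"],
    eval_config=eval_config,
    module_file="/abs/path/to/metrics.py",  # absolute path, shipped to workers
)
```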
A couple clarifications which might help narrow things down:
- How are you invoking the Evaluator component? Are you running it in a TFX pipeline? Ideally we could find a portable way to reproduce this issue.
- Was this custom metric setup previously working, and it broke with TFMA version 0.22.2? Or has this never worked?
- Does your real metric module name (not what I assume is the placeholder, my.module_file.path.metrics) include the literal tf.keras.metrics? If so, it might have been affected by the change you linked, though this seems pretty unlikely.
Have some updates here:
- We publish the custom metrics code in an extra_package that is sent to Beam workers, and can confirm this package has the code and that, per the logs, the workers install that extra_package.
- The setup was not previously working; this was a first attempt at implementing custom metrics.
- It does not include the literal tf.keras.metrics, which is why I thought it wouldn't be affected.
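For reference, this is roughly how such a package is handed to Beam (the runner and tarball path here are placeholders, not our actual values):

```python
# Hypothetical values; the Beam Python SDK's --extra_package flag stages a
# local package tarball and pip-installs it on each worker at startup.
beam_pipeline_args = [
    "--runner=DataflowRunner",
    "--extra_package=/path/to/my_metrics-0.1.0.tar.gz",
]
```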
That said, we have managed to update the code to use TFX 0.27 (TFMA 0.27), and the code now magically works 🤷 (no other changes to the evaluator component, the custom metrics code, or its packaging). I have no idea which change caused the fix, but perhaps we can close the issue.
Well, glad it's fixed. It would be nice to understand the root cause, but it's probably not a high priority unless this issue reappears for others. Thanks for taking the time to report the issue and follow up.
Closing this based on the above comment thread. Thanks.