Reading FeatureStats.pb created by MLMD produces: UnicodeDecodeError: 'utf-8' codec can't decode byte

Question

dwsmith1983 opened this issue 3 years ago · comments

I am running on:

Databricks 8.4
TFX version 1.0.0
MLMD version 1.0.0
Using the Penguin data from the TFX/MLMD examples

I wrote the data to an MLMD server. I can access the data from the MySQL server, and I can access it through MLMD in a Databricks notebook (plain Python). When I try to load the statistics with tfdv, I get the stack trace below.

import tensorflow_data_validation as tfdv

stats = tfdv.load_statistics("gs://data/path/Penguin-Training-Metadata/StatisticsGen/statistics/7/Split-train/FeatureStats.pb")

I also tried load_stats_text, with the same result. Similarly, I cannot read the anomalies. It looks like files written with the .pb extension cannot be loaded correctly, whereas the Schema, which is written as .pbtxt, loads without issues.

I found a TF issue reporting that, in Python 3, passing "r" instead of "rb" when reading a binary file produces this error. I am trying to locate how tfdv reads and parses the file to see whether this is the cause.
tensorflow/tensorflow#11312 (comment)
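
The "r" versus "rb" behavior can be reproduced with the standard library alone. The sketch below is illustrative only: the bytes are a hypothetical stand-in for the start of a serialized proto (chosen so that byte 0xb4 sits at position 1, as in the error message), and read_file and demo are made-up helpers, not tfdv's io_util code.

```python
import os
import tempfile

# Hypothetical stand-in for a serialized proto (not real FeatureStats data):
# 0x0a = field 1, length-delimited; 0xb4 0x01 = varint length 180.
FAKE_PROTO_BYTES = b"\x0a\xb4\x01" + b"\x00" * 180


def read_file(path, binary_mode):
    """Read a file as raw bytes (binary_mode=True) or as UTF-8 text."""
    if binary_mode:
        with open(path, "rb") as f:
            return f.read()
    with open(path, "r", encoding="utf-8") as f:
        return f.read()


def demo():
    """Write the fake bytes to a temp file, then read them back both ways."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "FeatureStats.pb")
        with open(path, "wb") as f:
            f.write(FAKE_PROTO_BYTES)

        # Binary mode returns the bytes untouched.
        raw = read_file(path, binary_mode=True)

        # Text mode tries to decode the bytes as UTF-8 and fails on 0xb4.
        try:
            read_file(path, binary_mode=False)
            text_error = None
        except UnicodeDecodeError as exc:
            text_error = exc
    return raw, text_error
```

Calling demo() returns the original bytes from the binary read and the UnicodeDecodeError raised by the text read, which is consistent with the text-mode read shown in the stack trace.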

Stack Trace:

DataLossError                             Traceback (most recent call last)
/databricks/python/lib/python3.8/site-packages/tensorflow_data_validation/utils/stats_util.py in load_statistics(input_path)
    355   try:
--> 356     return load_stats_tfrecord(input_path)
    357   except Exception:  # pylint: disable=broad-except

/databricks/python/lib/python3.8/site-packages/tensorflow_data_validation/utils/stats_util.py in load_stats_tfrecord(input_path)
    246   """
--> 247   serialized_stats = next(tf.compat.v1.io.tf_record_iterator(input_path))
    248   result = statistics_pb2.DatasetFeatureStatisticsList()

DataLossError: corrupted record at 0

During handling of the above exception, another exception occurred:

UnicodeDecodeError                        Traceback (most recent call last)
<command-2625483361098304> in <module>
----> 1 stats = tfdv.load_statistics("gs://data/path/Penguin-Training-Metadata/StatisticsGen/statistics/7/Split-train/FeatureStats.pb")

/databricks/python/lib/python3.8/site-packages/tensorflow_data_validation/utils/stats_util.py in load_statistics(input_path)
    358     logging.info('File %s did not look like a TFRecord. Try reading as a plain '
    359                  'file.', input_path)
--> 360     return load_stats_text(input_path)

/databricks/python/lib/python3.8/site-packages/tensorflow_data_validation/utils/stats_util.py in load_stats_text(input_path)
    213   """
    214   stats_proto = statistics_pb2.DatasetFeatureStatisticsList()
--> 215   stats_text = io_util.read_file_to_string(input_path)
    216   text_format.Parse(stats_text, stats_proto)
    217   return stats_proto

/databricks/python/lib/python3.8/site-packages/tensorflow_data_validation/utils/io_util.py in read_file_to_string(filename, binary_mode)
     51   else:
     52     f = tf.io.gfile.GFile(filename, mode="r")
---> 53   return f.read()

/databricks/python/lib/python3.8/site-packages/tensorflow/python/lib/io/file_io.py in read(self, n)
    120     else:
    121       length = n
--> 122     return self._prepare_value(self._read_buf.read(length))
    123 
    124   @deprecation.deprecated_args(

/databricks/python/lib/python3.8/site-packages/tensorflow/python/lib/io/file_io.py in _prepare_value(self, val)
     92       return compat.as_bytes(val)
     93     else:
---> 94       return compat.as_str_any(val)
     95 
     96   def size(self):

/databricks/python/lib/python3.8/site-packages/tensorflow/python/util/compat.py in as_str_any(value)
    137   """
    138   if isinstance(value, bytes):
--> 139     return as_str(value)
    140   else:
    141     return str(value)

/databricks/python/lib/python3.8/site-packages/tensorflow/python/util/compat.py in as_str(bytes_or_text, encoding)
    116     return as_bytes(bytes_or_text, encoding)
    117   else:
--> 118     return as_text(bytes_or_text, encoding)
    119 
    120 tf_export('compat.as_text')(as_text)

/databricks/python/lib/python3.8/site-packages/tensorflow/python/util/compat.py in as_text(bytes_or_text, encoding)
    107     return bytes_or_text
    108   elif isinstance(bytes_or_text, bytes):
--> 109     return bytes_or_text.decode(encoding)
    110   else:
    111     raise TypeError('Expected binary or unicode string, got %r' % bytes_or_text)

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb4 in position 1: invalid start byte

dustin · Answer 1 · Mon Aug 16 2021 12:24:45 GMT+0800 (China Standard Time)

After some digging, it appears that load_stats_binary works, since it sets the binary_mode flag when calling read_file_to_string in io_util.py, so the file is read as bytes rather than decoded as UTF-8 text.
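
A minimal sketch of that workaround, assuming tfdv is installed and the file holds a serialized DatasetFeatureStatisticsList. The wrapper name load_feature_stats is hypothetical; the path in the usage comment is the one from the question.

```python
def load_feature_stats(path):
    """Load a binary FeatureStats.pb through tfdv.load_stats_binary.

    Hypothetical convenience wrapper; returns whatever tfdv returns
    (a DatasetFeatureStatisticsList proto).
    """
    # Imported lazily so this module can be imported without tfdv present.
    import tensorflow_data_validation as tfdv

    return tfdv.load_stats_binary(path)


# Usage (requires tfdv and read access to the bucket):
# stats = load_feature_stats(
#     "gs://data/path/Penguin-Training-Metadata/StatisticsGen/"
#     "statistics/7/Split-train/FeatureStats.pb"
# )
# print(len(stats.datasets))
```

Calling load_stats_binary directly skips the TFRecord and text fallbacks in load_statistics, which are the two code paths that fail in the stack trace.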