Reading FeatureStats.pb created by MLMD produces: UnicodeDecodeError: 'utf-8' codec can't decode byte
dwsmith1983 opened this issue
I am running on:
- Databricks 8.4
- TFX version 1.0.0
- MLMD version 1.0.0
- Using the Penguin data from the TFX/MLMD examples
I wrote the data to an MLMD server. I can access the data from the MySQL server, and I can access it from MLMD in a Databricks notebook (plain Python). However, when I try to load the statistics with tfdv, I get the stack trace below:
import tensorflow_data_validation as tfdv
stats = tfdv.load_statistics("gs://data/path/Penguin-Training-Metadata/StatisticsGen/statistics/7/Split-train/FeatureStats.pb")
I also tried load_stats_text, but the result is the same. Similarly, I cannot read the anomalies. It looks like the files written with the .pb extension cannot be loaded correctly. However, the Schema is written as .pbtxt and loads without issue.
I found a TF issue noting that if "r" is passed instead of "rb" when reading binary files in Python 3, this error occurs. I am trying to locate how tfdv parses and reads the file to see whether this is the issue.
tensorflow/tensorflow#11312 (comment)
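The failure mode can be reproduced with the standard library alone (the bytes below are made up; in TFDV the read happens inside tf.io.gfile.GFile, but the text-vs-binary mode distinction is the same):

```python
import tempfile

# Write some bytes that are not valid UTF-8 (0xb4, as in the stack trace),
# the way a serialized binary proto typically would contain.
with tempfile.NamedTemporaryFile(delete=False, suffix=".pb") as f:
    f.write(b"\x08\xb4\x02")
    path = f.name

# Reading in text mode (mode="r") tries to decode as UTF-8 and fails.
decode_failed = False
try:
    with open(path, "r", encoding="utf-8") as f:
        f.read()
except UnicodeDecodeError as e:
    decode_failed = True
    print("text mode fails:", e)

# Reading in binary mode (mode="rb") returns the bytes intact.
with open(path, "rb") as f:
    data = f.read()
print("binary mode ok:", data)
```

This matches the bottom of the stack trace: read_file_to_string opens the file with mode="r" unless binary mode is requested, so the raw proto bytes hit bytes.decode("utf-8") and raise.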
Stack Trace:
DataLossError Traceback (most recent call last)
/databricks/python/lib/python3.8/site-packages/tensorflow_data_validation/utils/stats_util.py in load_statistics(input_path)
355 try:
--> 356 return load_stats_tfrecord(input_path)
357 except Exception: # pylint: disable=broad-except
/databricks/python/lib/python3.8/site-packages/tensorflow_data_validation/utils/stats_util.py in load_stats_tfrecord(input_path)
246 """
--> 247 serialized_stats = next(tf.compat.v1.io.tf_record_iterator(input_path))
248 result = statistics_pb2.DatasetFeatureStatisticsList()
DataLossError: corrupted record at 0
During handling of the above exception, another exception occurred:
UnicodeDecodeError Traceback (most recent call last)
<command-2625483361098304> in <module>
----> 1 stats = tfdv.load_statistics("gs://data/path/Penguin-Training-Metadata/StatisticsGen/statistics/7/Split-train/FeatureStats.pb")
/databricks/python/lib/python3.8/site-packages/tensorflow_data_validation/utils/stats_util.py in load_statistics(input_path)
358 logging.info('File %s did not look like a TFRecord. Try reading as a plain '
359 'file.', input_path)
--> 360 return load_stats_text(input_path)
/databricks/python/lib/python3.8/site-packages/tensorflow_data_validation/utils/stats_util.py in load_stats_text(input_path)
213 """
214 stats_proto = statistics_pb2.DatasetFeatureStatisticsList()
--> 215 stats_text = io_util.read_file_to_string(input_path)
216 text_format.Parse(stats_text, stats_proto)
217 return stats_proto
/databricks/python/lib/python3.8/site-packages/tensorflow_data_validation/utils/io_util.py in read_file_to_string(filename, binary_mode)
51 else:
52 f = tf.io.gfile.GFile(filename, mode="r")
---> 53 return f.read()
/databricks/python/lib/python3.8/site-packages/tensorflow/python/lib/io/file_io.py in read(self, n)
120 else:
121 length = n
--> 122 return self._prepare_value(self._read_buf.read(length))
123
124 @deprecation.deprecated_args(
/databricks/python/lib/python3.8/site-packages/tensorflow/python/lib/io/file_io.py in _prepare_value(self, val)
92 return compat.as_bytes(val)
93 else:
---> 94 return compat.as_str_any(val)
95
96 def size(self):
/databricks/python/lib/python3.8/site-packages/tensorflow/python/util/compat.py in as_str_any(value)
137 """
138 if isinstance(value, bytes):
--> 139 return as_str(value)
140 else:
141 return str(value)
/databricks/python/lib/python3.8/site-packages/tensorflow/python/util/compat.py in as_str(bytes_or_text, encoding)
116 return as_bytes(bytes_or_text, encoding)
117 else:
--> 118 return as_text(bytes_or_text, encoding)
119
120 tf_export('compat.as_text')(as_text)
/databricks/python/lib/python3.8/site-packages/tensorflow/python/util/compat.py in as_text(bytes_or_text, encoding)
107 return bytes_or_text
108 elif isinstance(bytes_or_text, bytes):
--> 109 return bytes_or_text.decode(encoding)
110 else:
111 raise TypeError('Expected binary or unicode string, got %r' % bytes_or_text)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb4 in position 1: invalid start byte
After some digging, it appears load_stats_binary will work, since it passes the binary flag to read_file_to_string in io_util.py.
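A rough sketch of why the binary path works, using a simplified local stand-in for io_util.read_file_to_string (the helper and sample bytes below are illustrative, not TFDV's actual code):

```python
import tempfile

def read_file_to_string(filename, binary_mode=False):
    # Simplified analogue of io_util.read_file_to_string: the binary flag
    # selects "rb" instead of "r", so no UTF-8 decode is attempted.
    mode = "rb" if binary_mode else "r"
    with open(filename, mode) as f:
        return f.read()

# A stand-in for FeatureStats.pb containing non-UTF-8 bytes.
with tempfile.NamedTemporaryFile(delete=False, suffix=".pb") as f:
    f.write(b"\x0a\xb4\x01")
    path = f.name

raw = read_file_to_string(path, binary_mode=True)  # bytes, no decode step
print(type(raw), raw)

# With real TFDV this corresponds to (not executed here):
#   stats = tfdv.load_stats_binary(path)
# which parses the raw bytes as a
# statistics_pb2.DatasetFeatureStatisticsList binary proto.
```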