Using tfdv to validate text based data

Question

Capsar opened this issue 2 years ago

Hi,

I have been searching online to find out whether tfdv can be used to validate data that contains text, for instance a dataset of sentences that have to be mapped to labels. I could not find any really useful tutorials: the ones I did find only cover numerical features of a dataset, such as height and weight.

After looking around in the data-validation package I have found a couple of files that seem to be related to this.
https://github.com/tensorflow/data-validation/blob/master/tensorflow_data_validation/statistics/generators/natural_language_stats_generator.py
And
https://github.com/tensorflow/data-validation/blob/master/tensorflow_data_validation/statistics/generators/natural_language_domain_inferring_stats_generator.py

Furthermore, in the TensorFlow documentation for the StatsOptions class I found the following:
https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/StatsOptions

Arguments	Description
enable_semantic_domain_stats	If True statistics for semantic domains are generated (e.g: image, text domains).
semantic_domain_stats_sample_rate	An optional sampling rate for semantic domain statistics. If specified, semantic domain statistics is computed over a sample.
vocab_paths	An optional dictionary mapping vocab names to paths. Used in the schema when specifying a NaturalLanguageDomain. The paths can either be to GZIP-compressed TF record files that have a tfrecord.gz suffix or to text files.

These arguments and files suggest that tfdv can be used to analyze and validate data for NLP or text-classification problems.

However, it is unclear to me how to use these features to validate text-based data.
I have enabled the enable_semantic_domain_stats argument, and this does give information such as sequence length.
But how would one build on this to validate vocabularies, for example by checking the ratio of known to unknown words?
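
To make the last point concrete, here is a minimal plain-Python sketch of the kind of check I have in mind. It does not use any tfdv API: the function name unknown_token_ratio, the lowercase whitespace tokenization, the toy vocabulary, and the MAX_UNKNOWN_RATIO threshold are all assumptions made up for illustration.

```python
from typing import Iterable, Set


def unknown_token_ratio(sentences: Iterable[str], vocab: Set[str]) -> float:
    """Return the fraction of tokens that do not appear in `vocab`.

    Tokenization is a naive lowercase whitespace split (an assumption;
    a real pipeline would reuse the model's own tokenizer).
    """
    total = 0
    unknown = 0
    for sentence in sentences:
        for token in sentence.lower().split():
            total += 1
            if token not in vocab:
                unknown += 1
    # With no tokens at all, report 0.0 rather than dividing by zero.
    return unknown / total if total else 0.0


# Toy data, purely illustrative.
vocab = {"the", "cat", "sat", "on", "mat"}
sentences = ["The cat sat", "the dog sat on the mat"]

ratio = unknown_token_ratio(sentences, vocab)

# Hypothetical validation rule: fail if too many tokens are out of vocabulary.
MAX_UNKNOWN_RATIO = 0.2
is_valid = ratio <= MAX_UNKNOWN_RATIO
print(f"unknown ratio: {ratio:.3f}, valid: {is_valid}")
```

What I am looking for is the tfdv equivalent of this: a way to declare the vocabulary and such a threshold in the schema, so that anomalies are reported automatically.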

Any tips or thoughts are highly appreciated!
Kind Regards,
Caspar