tensorflow / data-validation

Library for exploring and validating machine learning data

TFDV fails every time on a Google Cloud Dataflow job

AnoderPersona opened this issue

For some reason, starting a Dataflow job for tfdv.generate_statistics_from_csv with Google Cloud Storage paths does not work for me in this version: it fails on the fourth step every time. It does work with the previous TFX version, so I had to downgrade.

Version that supposedly has the issue: tfx 1.6.0
Version that works for me: tfx 1.5.0

Code example:

from apache_beam.options.pipeline_options import (
 PipelineOptions, GoogleCloudOptions, StandardOptions)

options = PipelineOptions()
google_cloud_options = options.view_as(GoogleCloudOptions)
google_cloud_options.project = '****'  # redacted
google_cloud_options.region = 'us-west1'
google_cloud_options.job_name = 'generando-stats'
google_cloud_options.staging_location = 'gs://***/staging'
google_cloud_options.temp_location = 'gs://***/tmp'
options.view_as(StandardOptions).runner = 'DataflowRunner'

from apache_beam.options.pipeline_options import SetupOptions

setup_options = options.view_as(SetupOptions)
setup_options.extra_packages = ['/home/jupyter/tensorflow_data_validation-1.5.0-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl']

import tensorflow_data_validation as tfdv

# path to the data file
data_set_path = 'gs://***/csv/consumer_complaints_with_narrative.csv'

# path to the output file with the statistics
output_path = 'gs://***/stats.pb'
stats = tfdv.generate_statistics_from_csv(
    data_location=data_set_path,
    output_path=output_path,
    pipeline_options=options)

Step at which it always fails:
[image: screenshot of the failing Dataflow step]

Error according to dataflow:

TypeError: 'int' object is not iterable 

This was run in a Google Cloud Dataflow Jupyter notebook.
Hope this helps someone.

Thanks for opening this issue. Were you using a released version of TFDV 1.6.0, or did you build TFDV from source?
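
One way to confirm which released versions are installed in the notebook is sketched below. It is only an illustration: the helper names `installed_version` and `version_report` are hypothetical, and the distribution names are assumed to be the PyPI names `tfx`, `tensorflow-data-validation`, and `apache-beam`. Since the code sample stages a 1.5.0 wheel through `extra_packages`, comparing that with the locally installed version may be informative.

```python
# Sketch: report installed versions of the packages relevant to this issue.
try:
    from importlib import metadata  # standard library on Python 3.8+
except ImportError:
    # Assumption: on Python 3.7 the importlib-metadata backport is installed.
    import importlib_metadata as metadata


def installed_version(package_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


def version_report(package_names=("tfx", "tensorflow-data-validation", "apache-beam")):
    """Map each distribution name to its installed version (or None)."""
    return {name: installed_version(name) for name in package_names}


if __name__ == "__main__":
    for name, version in version_report().items():
        print(f"{name}: {version or 'not installed'}")
```

Running this in the same kernel that launches the job shows the client-side versions only; it says nothing about what is installed on the Dataflow workers.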

The released one. I actually didn't even notice it had updated until two days later, haha.

Thanks for the info -- Could you provide an excerpt of your logs containing the error? It would be very useful to know what stage this is happening at.
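
If the notebook only shows the final error message, a small wrapper like the following could capture the full client-side traceback as text so it can be pasted here. This is a minimal sketch: `run_and_capture` is a hypothetical helper, and it is an assumption that the exception raised locally includes the worker-side stack trace. The Dataflow job logs may contain more detail.

```python
import traceback


def run_and_capture(fn, *args, **kwargs):
    """Call fn; return (result, None) on success or (None, traceback_text) on failure."""
    try:
        return fn(*args, **kwargs), None
    except Exception:
        # format_exc() renders the exception currently being handled,
        # including the full stack, as a single string.
        return None, traceback.format_exc()


if __name__ == "__main__":
    # Stand-in failure: iter(5) raises the same message reported in this issue.
    _, tb_text = run_and_capture(iter, 5)
    print(tb_text)
```

In the notebook, the stand-in call would be replaced by something like `run_and_capture(tfdv.generate_statistics_from_csv, data_location=data_set_path, output_path=output_path, pipeline_options=options)`, using the same variables as in the code sample above.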

Closing this issue in light of the lack of further information, but please reopen if this is still an issue and you can provide logs or more detail about where the error you saw is happening. Thanks!