swapnil3597 / dataflow-tfrecord

This repository is a reference for building a custom ETL pipeline that creates TF-Records using the Apache Beam Python SDK on Google Cloud Dataflow.


This repository is a reference ETL pipeline for creating TF-Records using the Apache Beam Python SDK on Google Cloud Dataflow. You may find the blog for this repo here

To run this pipeline:

Step 1:

First, place a CSV file in the required format in the GCS bucket, together with the corresponding dummy square images, all of the same size, stored in the GCS bucket at the correct paths.

Step 2:

Before running the pipeline make sure you initialize the following variables in create_tfrecords/create_tfrecords.py:

# TODO: Initialize below variables
IMG_SIZE = 28 # TODO: Enter your own int value for square image

PROJ_NAME = 'Your Project Name'

CSV_PATH = 'gs://<bucket-name>/path-to.csv'
RUNNER = 'DataflowRunner'
STAGING_LOCATION = 'gs://<bucket-name>/staging/'
TEMP_LOCATION = 'gs://<bucket-name>/temp/'
TEMPLATE_LOCATION = 'gs://<bucket-name>/path/to/template_location/template_name'
JOB_NAME = 'random-job-name'
OUTPUT_PATH = 'gs://<bucket-name>/output_path/'
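As a rough illustration of how these variables are typically consumed, the sketch below turns the Dataflow-related ones into Beam-style command-line flags. The flag names (`--project`, `--runner`, `--staging_location`, `--temp_location`, `--template_location`, `--job_name`) are standard Beam pipeline options, but the helper `build_pipeline_args` is hypothetical and the repo's own script may wire the values differently. `IMG_SIZE`, `CSV_PATH` and `OUTPUT_PATH` are left out because they are specific to this pipeline rather than standard options.

```python
# Placeholder values mirroring the variables from Step 2.
PROJ_NAME = 'Your Project Name'
RUNNER = 'DataflowRunner'
STAGING_LOCATION = 'gs://<bucket-name>/staging/'
TEMP_LOCATION = 'gs://<bucket-name>/temp/'
TEMPLATE_LOCATION = 'gs://<bucket-name>/path/to/template_location/template_name'
JOB_NAME = 'random-job-name'


def build_pipeline_args():
    """Return the settings above as a list of Beam-style flags.

    Hypothetical helper for illustration; not taken from the repo.
    """
    return [
        f'--project={PROJ_NAME}',
        f'--runner={RUNNER}',
        f'--staging_location={STAGING_LOCATION}',
        f'--temp_location={TEMP_LOCATION}',
        f'--template_location={TEMPLATE_LOCATION}',
        f'--job_name={JOB_NAME}',
    ]


if __name__ == '__main__':
    # Print the flags so they can be inspected before launching a job.
    for flag in build_pipeline_args():
        print(flag)
```

A list like this can be passed to `apache_beam.options.pipeline_options.PipelineOptions(flags)` when constructing the pipeline. Setting `RUNNER = 'DirectRunner'` is a common way to try a pipeline locally before submitting it to Dataflow.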

Step 3:

Now, to run the pipeline on a Google VM instance, run:

bash run.sh




Languages: Python 99.2%, Shell 0.8%