TFRecorder makes it easy to create TFRecords from images and labels in Pandas DataFrames or CSV files. Today, TFRecorder supports data stored in 'image csv format' similar to GCP AutoML Vision. In the future TFRecorder will support converting any Pandas DataFrame or CSV file into TFRecords.
From the top directory of the repo, run the following command:
pip install tfrecorder
import pandas as pd
import tfrecorder
df = pd.read_csv(...)
df.tensorflow.to_tfr(output_dir='gs://my/bucket')
Google Cloud Platform Dataflow workers need to be supplied with the tfrecorder package that you would like to run remotely. To do so, first download or build the package (a Python wheel file) and then specify the path to the file when tfrecorder is called.
Step 1: Download or create the wheel file.
To download the wheel from pip:
pip download tfrecorder --no-deps
To build from source/git:
python setup.py sdist
Step 2: Specify the project, region, and path to the tfrecorder wheel for remote execution.
import pandas as pd
import tfrecorder
df = pd.read_csv(...)
df.tensorflow.to_tfr(
    output_dir='gs://my/bucket',
    runner='DataFlowRunner',
    project='my-project',
    region='us-central1',
    tfrecorder_wheel='/path/to/my/tfrecorder.whl')
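Because `pip download` writes a file whose name includes the package version, it can be convenient to look the wheel up instead of hard-coding its path. The following is a minimal sketch, assuming the wheel follows the usual `tfrecorder-<version>-...whl` naming; `find_wheel` is a hypothetical helper written for illustration and is not part of TFRecorder.

```python
import glob
import os


def find_wheel(directory="."):
    """Return the absolute path of the newest tfrecorder wheel in `directory`.

    Hypothetical helper: assumes the file name starts with "tfrecorder-"
    and ends with ".whl", as wheels downloaded by pip normally do.
    """
    pattern = os.path.join(directory, "tfrecorder-*.whl")
    candidates = glob.glob(pattern)
    if not candidates:
        raise FileNotFoundError(f"no tfrecorder wheel matching {pattern}")
    # If several versions are present, pick the most recently modified one.
    return os.path.abspath(max(candidates, key=os.path.getmtime))
```

The returned path could then be passed as the `tfrecorder_wheel` argument shown above, for example `tfrecorder_wheel=find_wheel('/path/to/downloads')`.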
To create TFRecords from a CSV file using the Python interpreter:
import tfrecorder
tfrecorder.create_tfrecords(
    input_data='/path/to/data.csv',
    output_dir='gs://my/bucket')
Using the command line:
tfrecorder create-tfrecords \
    --input_data=/path/to/data.csv \
    --output_dir=gs://my/bucket
To check the contents of generated TFRecords using the Python interpreter:
import tfrecorder
tfrecorder.check_tfrecords(
    file_pattern='/path/to/tfrecords/train*.tfrecord.gz',
    num_records=5,
    output_dir='/tmp/output')
This will generate a CSV file containing structured data and image files representing the images encoded into TFRecords.
Using the command line:
tfrecorder check-tfrecords \
    --file_pattern='/path/to/tfrecords/train*.tfrecord.gz' \
    --num_records=5 \
    --output_dir=/tmp/output
TFRecorder currently expects data to be in the same format as
AutoML Vision.
This format looks like a Pandas DataFrame or CSV formatted as:
split | image_uri | label |
---|---|---|
TRAIN | gs://my/bucket/image1.jpg | cat |
where:
split: can take on the values TRAIN, VALIDATION, and TEST.
image_uri: specifies a local or Google Cloud Storage location for the image file.
label: can be either a text-based label that will be integerized, or an integer.
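To make the format concrete, here is a small sketch that writes rows in this layout to CSV text and checks them against the rules listed above. It uses only the Python standard library. The `validate_rows` helper is hypothetical and not part of TFRecorder, and writing a header row is an assumption of this example rather than a documented requirement.

```python
import csv
import io

# Column names and split values taken from the format described above.
EXPECTED_COLUMNS = ["split", "image_uri", "label"]
VALID_SPLITS = {"TRAIN", "VALIDATION", "TEST"}


def validate_rows(rows):
    """Return a list of problems found in image CSV rows (hypothetical helper)."""
    problems = []
    for i, row in enumerate(rows):
        missing = [c for c in EXPECTED_COLUMNS if row.get(c) in (None, "")]
        if missing:
            problems.append(f"row {i}: missing {missing}")
            continue
        if row["split"] not in VALID_SPLITS:
            problems.append(f"row {i}: invalid split {row['split']!r}")
    return problems


rows = [
    {"split": "TRAIN", "image_uri": "gs://my/bucket/image1.jpg", "label": "cat"},
    {"split": "VALIDATION", "image_uri": "/local/path/image2.jpg", "label": "1"},
]

# Serialize the rows to CSV text (header row included by assumption).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=EXPECTED_COLUMNS)
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()

# Read the text back and validate every row before handing it to TFRecorder.
parsed = list(csv.DictReader(io.StringIO(csv_text)))
problems = validate_rows(parsed)
print(problems or "all rows look valid")
```

Running a check like this before conversion surfaces a misspelled split value or an empty cell early, instead of partway through a pipeline run.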
Pull requests are welcome.