TFRecorder

TFRecorder makes it easy to create TFRecords from images and labels in Pandas DataFrames or CSV files. Today, TFRecorder supports data stored in 'image csv format' similar to GCP AutoML Vision. In the future TFRecorder will support converting any Pandas DataFrame or CSV file into TFRecords.

Installation

Install the package with pip:

pip install tfrecorder

Example usage

Generating TFRecords

From Pandas DataFrame

Running on a local machine

import pandas as pd
import tfrecorder

df = pd.read_csv(...)
df.tensorflow.to_tfr(output_dir='gs://my/bucket')

Running on Cloud Dataflow

Google Cloud Platform Dataflow workers must be supplied with the tfrecorder package that you want to run remotely. To do so, first download or build the package (a Python wheel file), and then specify the path to the file when tfrecorder is called.

Step 1: Download or create the wheel file.

To download the wheel from pip: pip download tfrecorder --no-deps

To build from source/git: python setup.py sdist

Step 2: Specify the project, region, and path to the tfrecorder wheel for remote execution.

import pandas as pd
import tfrecorder

df = pd.read_csv(...)
df.tensorflow.to_tfr(
    output_dir='gs://my/bucket',
    runner='DataflowRunner',
    project='my-project',
    region='us-central1',
    tfrecorder_wheel='/path/to/my/tfrecorder.whl')
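
The wheel downloaded in Step 1 has a version-specific file name, so it can be convenient to look it up rather than hard-code it. The following is a minimal sketch under stated assumptions: find_wheel is a hypothetical helper, not part of tfrecorder, and it assumes the wheel was downloaded into a directory you choose.

```python
from pathlib import Path


def find_wheel(directory, package='tfrecorder'):
    """Return the most recently modified wheel for `package` in `directory`.

    Hypothetical helper for illustration; not part of the tfrecorder API.
    """
    wheels = sorted(
        Path(directory).glob(f'{package}-*.whl'),
        key=lambda path: path.stat().st_mtime)
    if not wheels:
        raise FileNotFoundError(
            f'No {package} wheel found in {directory}; run Step 1 first.')
    # The last entry is the newest file by modification time.
    return str(wheels[-1])
```

The returned string could then be passed as the tfrecorder_wheel argument shown above.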

From CSV

Using the Python interpreter:

import tfrecorder

tfrecorder.create_tfrecords(
    input_data='/path/to/data.csv',
    output_dir='gs://my/bucket')

Using the command line:

tfrecorder create-tfrecords \
    --input_data=/path/to/data.csv \
    --output_dir=gs://my/bucket

Verifying data in TFRecords generated by TFRecorder

Using the Python interpreter:

import tfrecorder

tfrecorder.check_tfrecords(
    file_pattern='/path/to/tfrecords/train*.tfrecord.gz',
    num_records=5,
    output_dir='/tmp/output')

This generates a CSV file containing the structured data, along with image files for the images encoded in the TFRecords.

Using the command line:

tfrecorder check-tfrecords \
    --file_pattern='/path/to/tfrecords/train*.tfrecord.gz' \
    --num_records=5 \
    --output_dir=/tmp/output

Input format

TFRecorder currently expects data to be in the same format as AutoML Vision.
As a Pandas DataFrame or CSV file, this format looks like:

split	image_uri	label
TRAIN	gs://my/bucket/image1.jpg	cat

where:

split can take the values TRAIN, VALIDATION, or TEST.
image_uri specifies a local or Google Cloud Storage location for the image file.
label can be either an integer or a text label, which will be converted to an integer.
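
The rules above can be checked before generating TFRecords. The following is a minimal sketch, not part of tfrecorder: check_image_csv_format and integerize_labels are hypothetical helpers, the second and third image URIs are made up for illustration, and the sorted-vocabulary mapping is an assumption that may differ from the integer IDs TFRecorder actually assigns.

```python
import pandas as pd

REQUIRED_COLUMNS = ['split', 'image_uri', 'label']
VALID_SPLITS = {'TRAIN', 'VALIDATION', 'TEST'}


def check_image_csv_format(df):
    """Raise ValueError if df does not follow the format described above."""
    missing = [col for col in REQUIRED_COLUMNS if col not in df.columns]
    if missing:
        raise ValueError(f'Missing columns: {missing}')
    bad_splits = set(df['split'].unique()) - VALID_SPLITS
    if bad_splits:
        raise ValueError(
            f'Unexpected split values: {sorted(map(str, bad_splits))}')


def integerize_labels(labels):
    """Map text labels to integers by sorted vocabulary order.

    Illustrative only; TFRecorder's own mapping may assign different IDs.
    """
    vocab = sorted(set(labels))
    lookup = {name: index for index, name in enumerate(vocab)}
    return [lookup[name] for name in labels], vocab


# Example rows; the bucket paths are placeholders.
df = pd.DataFrame({
    'split': ['TRAIN', 'VALIDATION', 'TEST'],
    'image_uri': [
        'gs://my/bucket/image1.jpg',
        'gs://my/bucket/image2.jpg',
        'gs://my/bucket/image3.jpg',
    ],
    'label': ['cat', 'dog', 'cat'],
})
check_image_csv_format(df)
label_ids, vocab = integerize_labels(df['label'])
```

Running the check first surfaces a misspelled split value or a missing column locally, before a pipeline is launched.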

Contributing

Pull requests are welcome.
