Hvass-Labs / TensorFlow-Tutorials

TensorFlow Tutorials with YouTube Videos


TFRecords with variable length data

verbeemen opened this issue · comments

Hey Hvass,
I don't know if this is the proper channel to ask you a question,
but since I'm here, I'll try anyway.

For the past two weeks I've been working with TFRecords and the tf.estimator API, and I can definitely feel your pain.
But I have a question that has been bothering me for a few days.

How do you use TFRecords when you don't know what shape they have?
Okay, creating a TFRecord might be a mess, and you can do it in several ways. But once you have a TFRecord, it doesn't know its own shape. That is a bit problematic when you build your own NN, because the first layer somehow requires a shape:

first_hidden_layer = tf.layers.dense(features['x'], 10, activation=tf.nn.relu)

Tensor("IteratorGetNext:0", shape=(?, ?), dtype=float64, device=/device:CPU:0)

ValueError: The last dimension of the inputs to `Dense` should be defined. Found `None`.

And that is a bit awkward, I think.
I can imagine that you sometimes want to use variable-length arrays (for example, when you work with text).
And I can also imagine that if I create a TFRecord now, I would like to use that same record in the future, when I've forgotten its shape.

Basically, I don't want to have to remember the shape of a TFRecord, or write it down somewhere. And this goes for fixed-length as well as variable-length TFRecords.

So my question is: have you managed to use TFRecords and tf.estimator together while (acting as if) you don't know the shape of the TFRecord? (By "shape" I mean the number of features in a record, or pixels, or ...)

Because in every single example these values are hard-coded
(even you use img_size).

So, can you help me?

  • I hope that my question is clear -- if you want, I can provide a notebook.

I only recently started using these, and as you can see in the tutorial, I am rather displeased with the API and file format, so I don't know how to solve your problem. Did you ask on StackOverflow?

An idea is to save the shape in the TFRecord as well. In my Tutorial #18, look for this code and then add the shape:

data = \
    {
        'image': wrap_bytes(img_bytes),
        'label': wrap_int64(label)
    }

And then in the parse() function you would get the shape back again and then use it to reshape the arrays.
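As a rough sketch of that idea (not code from the tutorial itself): the wrap_int64/wrap_bytes helpers mimic Tutorial #18, the 'height'/'width' feature names and the sample data are invented here, and the round trip uses the tf.train.Example protobuf directly so it runs without building a graph.

```python
import numpy as np
import tensorflow as tf

def wrap_int64(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

def wrap_bytes(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

# Stand-in for a real image: a 2x3 uint8 array.
img = np.arange(6, dtype=np.uint8).reshape(2, 3)

data = {
    'image':  wrap_bytes(img.tobytes()),
    'label':  wrap_int64(7),
    'height': wrap_int64(img.shape[0]),   # store the shape alongside the pixels
    'width':  wrap_int64(img.shape[1]),
}
example = tf.train.Example(features=tf.train.Features(feature=data))
serialized = example.SerializeToString()

# On the reading side, recover the shape and reshape the raw bytes.
parsed = tf.train.Example.FromString(serialized)
h = parsed.features.feature['height'].int64_list.value[0]
w = parsed.features.feature['width'].int64_list.value[0]
restored = np.frombuffer(
    parsed.features.feature['image'].bytes_list.value[0],
    dtype=np.uint8).reshape(h, w)
```

In a parse() function fed by a Dataset you would do the same thing with tf.parse_single_example and tf.reshape, using the stored 'height'/'width' values.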

Would that work?

If you find a solution then please write it here, in case others have the same problem in the future and find this post via Google.

I don't think that will work, and I believe I've tried it before, together with the tutorial at "https://www.tensorflow.org/extend/estimators".
The only thing that works, and what we are doing now, is creating a configuration script that knows the sizes and types of every record/feature/tensor... (Speaking of records: you also have to remember what type something is, float32, float64, etc., which doesn't make sense...)
Nevertheless,
I hope they change their TFRecords in a next version of TensorFlow, because the idea is great, and it is much easier to create a network via the Estimator (even though the code can be a mess).

OK, I'm closing this issue. Perhaps you could ask on StackOverflow or TensorFlow's GitHub forum instead? If you find a solution somewhere else, then please provide a link here for people who may have a similar problem and find this post in the future.

What you could do is write a TFRecord of variable length using this wrapper instead:

def _wrap_int64Array(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=value))

(note the parameter passed to Int64List)
So writing the data becomes:

data = \
    {
        'variableDataFeature': _wrap_int64Array(myDataArray),
        'arrayLengthFeature': wrap_int64(len(myDataArray))
    }

When parsing the TFRecord, you need lines like these:

features = \
    {
        'variableDataFeature': tf.FixedLenSequenceFeature(
            [],
            tf.int64,
            allow_missing=True),
        'arrayLengthFeature': tf.FixedLenFeature([], tf.int64)
    }
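To make the parsing side concrete, here is a hedged sketch of a complete parse() built from that feature spec. It uses the tf.io.* names (the newer equivalents of the tf.FixedLenFeature / tf.FixedLenSequenceFeature names above) and assumes TensorFlow 2 with eager execution; the feature names match the writing example.

```python
import tensorflow as tf

def parse(serialized):
    features = {
        'variableDataFeature': tf.io.FixedLenSequenceFeature(
            [], tf.int64, allow_missing=True),
        'arrayLengthFeature': tf.io.FixedLenFeature([], tf.int64),
    }
    parsed = tf.io.parse_single_example(serialized, features)
    # The data comes back as a 1-D tensor whose length matches whatever
    # was written; the stored length can be used to check or reshape it.
    return parsed['variableDataFeature'], parsed['arrayLengthFeature']
```

With a Dataset you would map this function over the serialized records, e.g. tf.data.TFRecordDataset(path).map(parse).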

As far as I understand, you cannot create a TensorFlow graph without knowing the shape of the input data anyway, so I'm not sure how it would be possible even without TFRecords. If the max length is all you need, then you can just store that value in a small text file that resides in the same folder as the TFRecords. Your code then reads this file first, and then starts processing the records.
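That last idea can be sketched in a few lines of plain Python; the file name 'max_length.txt' and the value 128 are invented for this example, and a temporary directory stands in for the TFRecords folder.

```python
import os
import tempfile

folder = tempfile.mkdtemp()          # stands in for the tfrecords folder
meta_path = os.path.join(folder, 'max_length.txt')

# When writing the records, store the max length once:
with open(meta_path, 'w') as f:
    f.write(str(128))

# Before building the graph, read it back and use it for the shapes:
with open(meta_path) as f:
    max_length = int(f.read())
```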