Difficulty using "tfsim.samplers.TFRecordDatasetSampler"
tonylincon1 opened this issue · comments
Hello, how are you?
I really like the TensorFlow Similarity library for building recommendations; however, I am having a hard time using tfsim.samplers.TFRecordDatasetSampler, as I have too much data to keep in memory.
I tried the following way to save ".tfrecords" files:
import tensorflow as tf

def _bytes_feature(value):
    """Returns a bytes_list from a string / byte."""
    if isinstance(value, type(tf.constant(0))):  # if value is a tensor
        value = value.numpy()  # get the value of the tensor
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _float_feature(value):
    """Returns a float_list from a float / double."""
    return tf.train.Feature(float_list=tf.train.FloatList(value=[value]))

def _int64_feature(value):
    """Returns an int64_list from a bool / enum / int / uint."""
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

def serialize_array(array):
    return tf.io.serialize_tensor(array)
def parse_single_image(image, label):
    # define the dictionary -- the structure -- of our single example
    data = {
        'height': _int64_feature(image.shape[0]),
        'width': _int64_feature(image.shape[1]),
        'depth': _int64_feature(image.shape[2]),
        'raw_image': _bytes_feature(serialize_array(image)),
        'label': _int64_feature(label)
    }
    # create an Example, wrapping the single features
    out = tf.train.Example(features=tf.train.Features(feature=data))
    return out
def write_images_to_tfr_short(images, labels, filename: str = "images"):
    filename = filename + ".tfrecords"
    writer = tf.io.TFRecordWriter(filename)  # create a writer that'll store our data to disk
    count = 0
    for index in range(len(images)):
        # get the data we want to write
        current_image = images[index]
        current_label = labels[index]
        out = parse_single_image(image=current_image, label=current_label)
        writer.write(out.SerializeToString())
        count += 1
    writer.close()
    print(f"Wrote {count} elements to TFRecord")
    return count

count = write_images_to_tfr_short(x_train, y_train, filename=f"{data_path}small_images")
With this I was able to save two files with my images, and then I wrote the deserialization function:
def parse_tfr_element(element):
    # use the same structure as above; it's an outline of the structure we now want to parse
    data = {
        'height': tf.io.FixedLenFeature([], tf.int64),
        'width': tf.io.FixedLenFeature([], tf.int64),
        'label': tf.io.FixedLenFeature([], tf.int64),
        'raw_image': tf.io.FixedLenFeature([], tf.string),
        'depth': tf.io.FixedLenFeature([], tf.int64),
    }
    content = tf.io.parse_single_example(element, data)

    height = content['height']
    width = content['width']
    depth = content['depth']
    label = content['label']
    raw_image = content['raw_image']

    # get our 'feature' -- our image -- and reshape it appropriately
    feature = tf.io.parse_tensor(raw_image, out_type=tf.uint8)
    feature = tf.reshape(feature, shape=[height, width, depth])
    return (feature, label)
def get_dataset_small(filename):
    # create the dataset
    dataset = tf.data.TFRecordDataset(filename)
    # pass every single example through our mapping function
    dataset = dataset.map(parse_tfr_element)
    return dataset
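As a sanity check before wiring this into the sampler, the write/parse scheme above can be exercised end to end. The following is a minimal sketch (the toy images, labels, and filename are assumptions for illustration, not part of the original code):

```python
import numpy as np
import tensorflow as tf

# Toy data: two 4x4 RGB "images" (assumed purely for illustration).
images = np.zeros((2, 4, 4, 3), dtype=np.uint8)
labels = [0, 1]

# Write them using the same feature layout as above.
with tf.io.TFRecordWriter("roundtrip.tfrecords") as writer:
    for img, lbl in zip(images, labels):
        feature = {
            "height": tf.train.Feature(int64_list=tf.train.Int64List(value=[img.shape[0]])),
            "width": tf.train.Feature(int64_list=tf.train.Int64List(value=[img.shape[1]])),
            "depth": tf.train.Feature(int64_list=tf.train.Int64List(value=[img.shape[2]])),
            "raw_image": tf.train.Feature(
                bytes_list=tf.train.BytesList(value=[tf.io.serialize_tensor(img).numpy()])),
            "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[lbl])),
        }
        writer.write(tf.train.Example(
            features=tf.train.Features(feature=feature)).SerializeToString())

# Read them back with the same parsing spec and confirm the shapes survive.
def parse(element):
    spec = {
        "height": tf.io.FixedLenFeature([], tf.int64),
        "width": tf.io.FixedLenFeature([], tf.int64),
        "depth": tf.io.FixedLenFeature([], tf.int64),
        "label": tf.io.FixedLenFeature([], tf.int64),
        "raw_image": tf.io.FixedLenFeature([], tf.string),
    }
    content = tf.io.parse_single_example(element, spec)
    image = tf.io.parse_tensor(content["raw_image"], out_type=tf.uint8)
    image = tf.reshape(image, [content["height"], content["width"], content["depth"]])
    return image, content["label"]

dataset = tf.data.TFRecordDataset("roundtrip.tfrecords").map(parse)
shapes = [tuple(img.shape) for img, _ in dataset]
```

If the round trip works, each recovered image should have shape (4, 4, 3), which isolates any remaining problem to the sampler itself rather than the serialization.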
When I try to use tfsim.samplers.TFRecordDatasetSampler, the following error occurs:
sampler = tfsim.samplers.TFRecordDatasetSampler(
    shard_path=data_path,
    deserialization_fn=get_dataset_small,
)
InvalidArgumentError: buffer_size must be greater than zero. [Op:ShuffleDatasetV3]
Any suggestions?
Hi Tony,
Thanks for using TF Sim. I think the issue might be with how you are writing your TFRecord files. The TFRecordDatasetSampler uses the interleave function to randomly sample examples from K different TFRecord files, where K is equal to or greater than the number of classes in your dataset. Additionally, the length of each TFRecord file must be an integer multiple of the number of examples per class per batch.
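To make that layout concrete, here is a hedged sketch of writing one shard per class, with each shard truncated to an integer multiple of the examples per class per batch. The shard filenames, toy data, and the examples_per_class_per_batch value are all assumptions for illustration:

```python
import numpy as np
import tensorflow as tf

# Assumed batch setting for illustration.
examples_per_class_per_batch = 2

# Toy dataset: 5 examples for each of 2 classes (assumed for illustration).
x = np.zeros((10, 4, 4, 3), dtype=np.uint8)
y = np.array([0] * 5 + [1] * 5)

for cls in np.unique(y):
    cls_images = x[y == cls]
    # Truncate so the shard length is an integer multiple of
    # examples_per_class_per_batch (5 -> 4 here).
    keep = (len(cls_images) // examples_per_class_per_batch) * examples_per_class_per_batch
    with tf.io.TFRecordWriter(f"class_{cls}.tfrec") as writer:
        for img in cls_images[:keep]:
            feature = {
                "raw_image": tf.train.Feature(
                    bytes_list=tf.train.BytesList(value=[tf.io.serialize_tensor(img).numpy()])),
                "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[int(cls)])),
            }
            writer.write(tf.train.Example(
                features=tf.train.Features(feature=feature)).SerializeToString())

# Each shard now holds a single class; count the records per shard.
counts = {int(cls): sum(1 for _ in tf.data.TFRecordDataset(f"class_{cls}.tfrec"))
          for cls in np.unique(y)}
```

With this layout there is one shard per class (K = number of classes) and each shard length (4) divides evenly by the per-class batch count, which is the structure the sampler's interleave expects.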
Let me know if that unblocks you, and see #171 and #213 for more details.
An alternative approach is to use the in-memory sampler (or the new tf.data.Dataset sampler I'm working on in this branch). You can pass the URIs of the images as the X values and then use the load_fn to read them per batch. See here for a working version using the MultiShotMemorySampler for images.
I will try the MultiShotMemorySampler. Thank you for the response xD
Thanks. Closing this for now but let us know if you run into any other issues.