tensorflow / similarity

TensorFlow Similarity is a python package focused on making similarity learning quick and easy.

Difficulty using "tfsim.samplers.TFRecordDatasetSampler"

tonylincon1 opened this issue

Hello, how are you?

I really like the TensorFlow Similarity solution for making recommendations. However, I am having a hard time using tfsim.samplers.TFRecordDatasetSampler, as I have too much data to keep in memory.

I tried the following approach to save the ".tfrecords" files:

import tensorflow as tf

def _bytes_feature(value):
  """Returns a bytes_list from a string / byte."""
  if isinstance(value, type(tf.constant(0))): # if value is a tensor
    value = value.numpy() # get the value of the tensor
  return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _float_feature(value):
  """Returns a float_list from a float / double."""
  return tf.train.Feature(float_list=tf.train.FloatList(value=[value]))

def _int64_feature(value):
  """Returns an int64_list from a bool / enum / int / uint."""
  return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

def serialize_array(array):
  array = tf.io.serialize_tensor(array)
  return array

def parse_single_image(image, label):

  #define the dictionary -- the structure -- of our single example
  data = {
      'height': _int64_feature(image.shape[0]),
      'width': _int64_feature(image.shape[1]),
      'depth': _int64_feature(image.shape[2]),
      'raw_image': _bytes_feature(serialize_array(image)),
      'label': _int64_feature(label)
  }

  #create an Example, wrapping the single features
  out = tf.train.Example(features=tf.train.Features(feature=data))

  return out

def write_images_to_tfr_short(images, labels, filename: str = "images"):
  filename = filename + ".tfrecords"
  writer = tf.io.TFRecordWriter(filename) #create a writer that'll store our data to disk
  count = 0

  for index in range(len(images)):

    #get the data we want to write
    current_image = images[index]
    current_label = labels[index]

    out = parse_single_image(image=current_image, label=current_label)
    writer.write(out.SerializeToString())
    count += 1

  writer.close()
  print(f"Wrote {count} elements to TFRecord")
  return count

count = write_images_to_tfr_short(x_train, y_train, filename=f"{data_path}small_images")

From this I was able to save two files with my images, and then I wrote the function to read the records back:

def parse_tfr_element(element):
  #use the same structure as above; it's kinda an outline of the structure we now want to create
  data = {
      'height': tf.io.FixedLenFeature([], tf.int64),
      'width': tf.io.FixedLenFeature([], tf.int64),
      'label': tf.io.FixedLenFeature([], tf.int64),
      'raw_image': tf.io.FixedLenFeature([], tf.string),
      'depth': tf.io.FixedLenFeature([], tf.int64),
  }

  content = tf.io.parse_single_example(element, data)

  height = content['height']
  width = content['width']
  depth = content['depth']
  label = content['label']
  raw_image = content['raw_image']

  #get our 'feature' -- our image -- and reshape it appropriately
  feature = tf.io.parse_tensor(raw_image, out_type=tf.uint8)
  feature = tf.reshape(feature, shape=[height, width, depth])
  return (feature, label)

def get_dataset_small(filename):
  #create the dataset
  dataset = tf.data.TFRecordDataset(filename)

  #pass every single feature through our mapping function
  dataset = dataset.map(
      parse_tfr_element
  )
    
  return dataset
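
As a quick sanity check of the round trip, one element can be read back from the file written above (the path matches the earlier write call):

ds = get_dataset_small(f"{data_path}small_images.tfrecords")
for image, label in ds.take(1):
  print(image.shape, label.numpy())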

When I try to use tfsim.samplers.TFRecordDatasetSampler, the following error occurs:

sampler = tfsim.samplers.TFRecordDatasetSampler(
    shard_path=data_path,
    deserialization_fn=get_dataset_small,
)

InvalidArgumentError: buffer_size must be greater than zero. [Op:ShuffleDatasetV3]

Any suggestions?

Hi Tony,

Thanks for using TF Sim. I think the issue might be with how you are writing your TFRecord files. The TFRecordDatasetSampler uses the interleave function to randomly sample examples from K different TFRecord files, where K is equal to or greater than the number of classes in your dataset. Additionally, the length of each TFRecord file must be an integer multiple of the number of examples per class per batch. One common layout is a single shard per class, as in the sketch below.
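
Here is a minimal sketch of that layout, reusing your write_images_to_tfr_short helper from above. It assumes y_train is a 1-D array of integer class ids, and the examples_per_class_per_batch value of 2 is just an illustrative choice:

import numpy as np

examples_per_class_per_batch = 2  #assumption: matches the value you pass to the sampler

#write one TFRecord shard per class so interleave can sample across classes
for class_id in np.unique(y_train):
  class_images = x_train[y_train == class_id]
  class_labels = y_train[y_train == class_id]

  #truncate so each shard's length is an integer multiple of
  #examples_per_class_per_batch, as the sampler requires
  usable = (len(class_images) // examples_per_class_per_batch) * examples_per_class_per_batch

  write_images_to_tfr_short(
      class_images[:usable],
      class_labels[:usable],
      filename=f"{data_path}class_{class_id}",
  )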

Let me know if that unblocks you, and see #171 and #213 for more details.

An alternative approach is to use the in-memory sampler (or the new tf.data.Dataset sampler I'm working on in this branch). You can pass the image URIs as the x values and then use the load_fn to read them per batch. See here for a working version using the MultiShotMemorySampler for images.
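
A rough sketch of that pattern follows; the exact MultiShotMemorySampler signature (in particular the load_fn hook and its contract) may differ in your version, so treat this as an outline rather than the final API:

import tensorflow as tf
import tensorflow_similarity as tfsim

#x_uris: list of image file paths; y: matching integer class ids
def load_fn(x, y):
  #assumption: load_fn receives a batch of x values and labels and
  #returns the decoded image tensors plus the labels
  images = tf.stack([
      tf.io.decode_image(tf.io.read_file(uri), expand_animations=False)
      for uri in x
  ])
  return images, y

sampler = tfsim.samplers.MultiShotMemorySampler(
    x_uris,
    y,
    classes_per_batch=10,
    examples_per_class_per_batch=4,
    load_fn=load_fn,  #assumption: keyword name as described above
)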

I will try the MultiShotMemorySampler. Thank you for the response xD

Thanks. Closing this for now but let us know if you run into any other issues.