Hvass-Labs / TensorFlow-Tutorials

TensorFlow Tutorials with YouTube Videos

Issue with extending TensorFlow Tutorial #20

pathakrohit08 opened this issue

I am trying to implement your TensorFlow Tutorial 20. I have a huge dataset, approximately 2.5 GB. I understand that the line x_train_text, y_train = imdb.load_data(train=True) in your code loads the whole training set into memory, which I cannot do because I will run out of memory. So I was trying to create TFRecords for the same data by combining this with your TensorFlow Tutorial 18. Can you please take a look at my code?

I have three text files, A.txt, B.txt and C.txt; the labels are A, B and C.

def convert(file_paths, labels, out_path):
    print("Converting: " + out_path)

    # Number of files. Used when printing the progress.
    num_files = len(file_paths)

    # Open a TFRecordWriter for the output-file.
    with tf.python_io.TFRecordWriter(out_path) as writer:
        # Iterate over all the file-paths and class-labels.
        for i, (path, label) in enumerate(zip(file_paths, labels)):
            # Print the percentage-progress.
            print_progress(count=i, total=num_files - 1)

            lines = getModifiedLines(path)

            # Create a dict with the data we want to save in the
            # TFRecords file. You can add more relevant data here.
            data = \
                {
                    'text': wrap_int64(lines),
                    'label': wrap_int64(label)
                }

            # Wrap the data as TensorFlow Features.
            feature = tf.train.Features(feature=data)

            # Wrap again as a TensorFlow Example.
            example = tf.train.Example(features=feature)

            # Serialize the data.
            serialized = example.SerializeToString()
            writer.write(serialized)

def getModifiedLines(filePath):
    data = open(filePath, 'r', encoding='UTF8', errors='ignore').read()
    lines = re.split("\n", data)
    all_lines = []
    for line in lines:
        _l = count_vect.fit_transform(line)
        all_lines.append(_l)

    return all_lines

I am getting an error at the line `'text': wrap_int64(lines)`:

Error:

    [<12071x21108 sparse matrix of type '<class 'numpy.int64'>'
    with 226655 stored elements in Compressed Sparse Row format>] has type
    <class 'list'>, but expected one of: (<class 'int'>,)
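For context on why this fails: tf.train.Int64List only accepts a flat sequence of Python ints, so passing it a list of scipy sparse matrices raises exactly this TypeError. Below is a minimal sketch of writing text as int64 features, assuming the text has first been converted to a flat list of integer token-ids; the wrap_int64 helper is written roughly as in Tutorial #18, and wrap_int64_list and the token_ids values are illustrative, not from the tutorials:

    import tensorflow as tf

    # A typical wrap_int64 helper (roughly as in Tutorial #18) wraps a
    # single scalar int, so passing it a list of sparse matrices fails.
    def wrap_int64(value):
        return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

    # Hypothetical variant for variable-length sequences: takes a flat
    # list of Python ints, e.g. token-ids for one text file.
    def wrap_int64_list(values):
        return tf.train.Feature(int64_list=tf.train.Int64List(value=values))

    # Illustrative token-ids; in practice these would come from a
    # tokenizer/vocabulary, not from CountVectorizer's sparse matrices.
    token_ids = [12, 5, 873, 3]

    data = {
        'text': wrap_int64_list(token_ids),
        'label': wrap_int64(0)
    }
    example = tf.train.Example(features=tf.train.Features(feature=data))
    serialized = example.SerializeToString()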

I do not provide support to people who try to customize the tutorials. I could spend the rest of my life doing that. This was stated clearly in the text you deleted before posting your issue:

Questions about modifications or how to use these tutorials on your own data-set should also be asked on StackOverflow. Thousands of people are using these tutorials. It is impossible for me to give individual support for your project.

If your data-set is only 2.5 GB then it should fit into the RAM of most modern machines. If you have less than 8 GB then you have too little RAM anyway and need to upgrade. You should probably have at least 16 GB of RAM.
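For readers who still want the TFRecord route: once the examples are written with integer features as above, they can be streamed from disk with the TF 1.x Dataset API, so the full 2.5 GB never has to sit in RAM at once. A minimal parsing sketch, assuming the feature names 'text' and 'label' from the snippet in the question and a hypothetical file name train.tfrecords:

    import tensorflow as tf

    def parse(serialized):
        # 'text' is a variable-length sequence of int64 token-ids,
        # 'label' a single int64 scalar.
        features = {
            'text': tf.VarLenFeature(tf.int64),
            'label': tf.FixedLenFeature([], tf.int64)
        }
        parsed = tf.parse_single_example(serialized=serialized,
                                         features=features)
        text = tf.sparse_tensor_to_dense(parsed['text'])
        label = parsed['label']
        return text, label

    # Streams and decodes records lazily from disk.
    dataset = tf.data.TFRecordDataset(filenames=['train.tfrecords'])
    dataset = dataset.map(parse)
    # Variable-length texts need padded_batch rather than batch.
    dataset = dataset.padded_batch(32, padded_shapes=([None], []))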