Issue with extending TensorFlow Tutorial #20
pathakrohit08 opened this issue · comments
I am trying to implement your TensorFlow Tutorial 20. I have a huge dataset, approximately 2.5 GB. In your code, the line `x_train_text, y_train = imdb.load_data(train=True)` loads the whole training set into memory, which I can't do because I will run out of memory. So I was trying to create TFRecords for it instead by combining this with your TensorFlow Tutorial 18. Can you please take a look at my code?
I have three text files, A.txt, B.txt and C.txt; the labels are A, B and C.
```python
def convert(file_paths, labels, out_path):
    print("Converting: " + out_path)

    # Number of files. Used when printing the progress.
    num_files = len(file_paths)

    # Open a TFRecordWriter for the output-file.
    with tf.python_io.TFRecordWriter(out_path) as writer:
        # Iterate over all the file-paths and class-labels.
        for i, (path, label) in enumerate(zip(file_paths, labels)):
            # Print the percentage-progress.
            print_progress(count=i, total=num_files - 1)

            lines = getModifiedLines(path)

            # Create a dict with the data we want to save in the
            # TFRecords file. You can add more relevant data here.
            data = \
                {
                    'text': wrap_int64(lines),
                    'label': wrap_int64(label)
                }

            # Wrap the data as TensorFlow Features.
            feature = tf.train.Features(feature=data)

            # Wrap again as a TensorFlow Example.
            example = tf.train.Example(features=feature)

            # Serialize the data.
            serialized = example.SerializeToString()

            # Write the serialized data to the TFRecords file.
            writer.write(serialized)
```
```python
def getModifiedLines(filePath):
    data = open(filePath, 'r', encoding='UTF8', errors='ignore').read()
    lines = re.split("\n", data)

    all_lines = []
    for line in lines:
        _l = count_vect.fit_transform(line)
        all_lines.append(_l)
    return all_lines
```
I am getting an error at the line `'text': wrap_int64(lines),`

Error:

```
[<12071x21108 sparse matrix of type '<class 'numpy.int64'>'
with 226655 stored elements in Compressed Sparse Row format>] has type
<class 'list'>, but expected one of: (<class 'int'>,)
```
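For context, the error occurs because `wrap_int64` builds a `tf.train.Int64List`, which only accepts plain Python integers, while `CountVectorizer.fit_transform` returns a SciPy sparse matrix. A minimal sketch of serializing text as int64 features, assuming each document has already been converted to a flat list of integer token ids (the `token_ids` values below are made up for illustration, and this `wrap_int64` is a guess at the tutorial's helper):

```python
import tensorflow as tf

def wrap_int64(value):
    # tf.train.Int64List needs a flat list of Python ints --
    # passing a SciPy sparse matrix raises the TypeError above.
    if not isinstance(value, (list, tuple)):
        value = [value]
    return tf.train.Feature(int64_list=tf.train.Int64List(value=value))

# Hypothetical integer token ids for one document,
# e.g. produced by a vocabulary lookup rather than CountVectorizer.
token_ids = [12, 7, 305, 7, 41]
label = 1

data = {
    'text': wrap_int64(token_ids),
    'label': wrap_int64(label),
}
example = tf.train.Example(features=tf.train.Features(feature=data))
serialized = example.SerializeToString()
```

The serialized bytes can then be written with `writer.write(serialized)` exactly as in the `convert` function above.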
I do not give support to people who try to customize the tutorials. I could spend the rest of my life doing that. This was stated clearly in the text you deleted before posting your issue:
Questions about modifications or how to use these tutorials on your own data-set should also be asked on StackOverflow. Thousands of people are using these tutorials. It is impossible for me to give individual support for your project.
If your data-set is only 2.5 GB then it should fit into the RAM of most modern machines. If you have less than 8 GB of RAM then you have too little anyway and need to upgrade; you should probably have at least 16 GB.