inspirehep / magpie

Deep neural network framework for multi-label text classification


suggestion: Use a single file for labels and text

shashi-netra opened this issue · comments

In the current version you have .lab and .txt files - one of each per training row. Wouldn't it be easier to save everything in a single file, or one file for labels and another for texts? Wouldn't this also be more idiomatic (à la scikit-learn)?

Having several million .lab and .txt files is especially problematic: the filesystem chokes up.
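One way the suggestion could look in practice is a single JSON Lines file with one record per training row. This is only a sketch of a hypothetical format (magpie does not currently support it); the `iter_corpus` helper and the `labels`/`text` field names are assumptions, not part of the library:

```python
import json

def iter_corpus(path):
    """Yield (labels, text) pairs from a single JSON Lines file.

    Hypothetical format, one JSON object per line, e.g.:
    {"labels": ["gravitation", "black hole"], "text": "..."}
    """
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                record = json.loads(line)
                yield record["labels"], record["text"]
```

A single file like this also streams well, so the corpus never has to fit in memory at once.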

@shashi-netra you're right, having another option for loading files would be a reasonable feature. I think you're actually not the first to suggest it. It shouldn't be difficult to implement, but I can't promise I'll have time to do it in the near future. You're welcome to take a stab at it and open a PR!

@jstypka Can you please indicate what the input format looks like? Is it embedding arrays for the inputs and one-hot arrays for the labels?

@dorg-ekrolewicz the output is one-hot arrays and the input is a 2D array - each row being a word represented as a word2vec vector. A batch of several documents would make a 3D tensor. Does that help?
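The per-document 2D array described above can be sketched like this. This is a minimal illustration, not magpie's actual code; `word_vectors` is assumed to be any dict-like mapping word → 1D embedding vector (e.g. loaded from a word2vec model):

```python
import numpy as np

def doc_to_matrix(tokens, word_vectors, embedding_dim):
    """Build the 2D input for one document: one row per word.

    Each row is the word2vec vector of a word; words without a
    representation fall back to a zero vector.
    """
    rows = [np.asarray(word_vectors.get(w, np.zeros(embedding_dim)))
            for w in tokens]
    return np.stack(rows)  # shape: (num_words, embedding_dim)
```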

Are you using padding?

Example for classifying cats and dogs: num_classes = 2,
max_num_words = maximum number of words in x = 10 (in this example)

Inputs:

  1. x = "the dog is red" y = [0,1] where num_words = 4
  2. x = "the cat and dog are blue" y = [1,1] where num_words = 6

Since we have m=2 examples, the input dimensions would be (m, embedding_dim, max_num_words)?

@dorg-ekrolewicz yes, that looks correct. We pad with 0s up to max_num_words and use a zero vector when we don't have a representation for a word (out-of-vocabulary).
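The padding scheme described above can be sketched as follows. This is an illustration, not magpie's implementation; `word_vectors` is an assumed dict-like word → vector lookup, and the axis order follows the earlier description (one row per word), i.e. (m, max_num_words, embedding_dim) - the actual implementation may order the axes differently:

```python
import numpy as np

def build_batch(docs, word_vectors, embedding_dim, max_num_words):
    """Stack tokenized documents into a 3D batch tensor.

    Documents shorter than max_num_words are padded with zero rows;
    longer ones are truncated; out-of-vocabulary words become zero
    vectors, matching the padding scheme described in the thread.
    """
    batch = np.zeros((len(docs), max_num_words, embedding_dim))
    for i, tokens in enumerate(docs):
        for j, word in enumerate(tokens[:max_num_words]):
            batch[i, j] = word_vectors.get(word, np.zeros(embedding_dim))
    return batch
```

With the two example sentences above and max_num_words = 10, the result is a (2, 10, embedding_dim) tensor where rows past each sentence's length are all zeros.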

Pretty much all the code is in this function.