inspirehep / magpie

Deep neural network framework for multi-label text classification


suggestion: Use a single file for labels and text

shashi-netra opened this issue · comments

In the current version you have .lab and .txt files - one of each per training row. Wouldn't it be easier to save everything in a single file, or one file for labels and another for texts? Wouldn't this also be more idiomatic (à la scikit-learn)?

Having several million .lab and .txt files is especially problematic: the filesystem chokes up.
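One way the suggestion could look in practice is a single JSON Lines file with one record per training row. This is only a sketch of a hypothetical format (magpie does not currently support it); the `iter_corpus` helper and the `labels`/`text` field names are assumptions, not part of the library:

```python
import json

def iter_corpus(path):
    """Yield (labels, text) pairs from a single JSON Lines file.

    Hypothetical format, one JSON object per line, e.g.:
    {"labels": ["gravitation", "black hole"], "text": "..."}
    """
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                record = json.loads(line)
                yield record["labels"], record["text"]
```

A single file like this also streams well, so the corpus never has to fit in memory at once.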

@shashi-netra you're right, having another option for loading files would be a reasonable feature. I think you're actually not the first to suggest it. It shouldn't be difficult to implement, but I can't promise I'll have time to do it in the near future. You're welcome to take a stab at it and open a PR!

@jstypka Can you please indicate what the input format looks like? Is it embedding arrays for the inputs and one-hot arrays for the labels?

@dorg-ekrolewicz the output is one-hot arrays and the input is a 2D array - each row being a word represented as a word2vec vector. A batch of several documents would make a 3D tensor. Does that help?
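The per-document 2D array described above can be sketched like this. This is a minimal illustration, not magpie's actual code; `word_vectors` is assumed to be any dict-like mapping word → 1D embedding vector (e.g. loaded from a word2vec model):

```python
import numpy as np

def doc_to_matrix(tokens, word_vectors, embedding_dim):
    """Build the 2D input for one document: one row per word.

    Each row is the word2vec vector of a word; words without a
    representation fall back to a zero vector.
    """
    rows = [np.asarray(word_vectors.get(w, np.zeros(embedding_dim)))
            for w in tokens]
    return np.stack(rows)  # shape: (num_words, embedding_dim)
```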

Are you using padding?

Example for classifying cats and dogs: num_classes = 2,
max_num_words = maximum number of words in x = 10 (in this example)

Inputs:

  1. x = "the dog is red" y = [0,1] where num_words = 4
  2. x = "the cat and dog are blue" y = [1,1] where num_words = 6

Since we have m=2 examples, the input dimensions would be (m, embedding_dim, max_num_words)?

@dorg-ekrolewicz yes, that looks correct. We pad with 0s up to max_num_words and use a zero vector when we don't have a representation for a word (out-of-vocabulary).
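The padding scheme described above can be sketched as follows. This is an illustration, not magpie's implementation; `word_vectors` is an assumed dict-like word → vector lookup, and the axis order follows the earlier description (one row per word), i.e. (m, max_num_words, embedding_dim) - the actual implementation may order the axes differently:

```python
import numpy as np

def build_batch(docs, word_vectors, embedding_dim, max_num_words):
    """Stack tokenized documents into a 3D batch tensor.

    Documents shorter than max_num_words are padded with zero rows;
    longer ones are truncated; out-of-vocabulary words become zero
    vectors, matching the padding scheme described in the thread.
    """
    batch = np.zeros((len(docs), max_num_words, embedding_dim))
    for i, tokens in enumerate(docs):
        for j, word in enumerate(tokens[:max_num_words]):
            batch[i, j] = word_vectors.get(word, np.zeros(embedding_dim))
    return batch
```

With the two example sentences above and max_num_words = 10, the result is a (2, 10, embedding_dim) tensor where rows past each sentence's length are all zeros.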

Pretty much all the code is in this function.