inspirehep / magpie

Deep neural network framework for multi-label text classification

Queries on Inputs to feed

RupeshGoud opened this issue

Hey, thanks for the amazing package! Could you please clear up a few questions?

  1. What is the recommended number of words for each input text? My use case is to use all the text on a company's web homepage as the input text.
  2. What is the minimum recommended number of text documents?
  3. Can I just concatenate additional string information to the text document if I want?

Thanks in advance!

@RupeshGoud good questions!

  1. It cuts off all words beyond the configured limit (200 by default; this can be changed in the config file). In principle, there's no reason not to use a larger value, but bear in mind that training might get really slow, as the model needs to load all this data into memory and encode it.

  2. It depends on the number of labels you want to classify. Generally, the number of samples should be in the thousands or tens of thousands, and the number of labels in the hundreds at most. The model is designed to work well on big data, e.g. we ran it on a corpus with 300k samples and 10k labels and it performed pretty well, but you will usually have smaller use cases.

  3. Usually yes, especially if it's free text. Bear in mind that a line break in the document is treated as the end of a sentence. If the concatenated document makes sense to you as a human, it's probably fine to use it for the model as well.
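
To make point 1 concrete, here is a minimal sketch of what cutting a document at a word limit amounts to. The helper `truncate_words` and its plain whitespace tokenisation are assumptions made for illustration, not magpie's actual implementation; only the default of 200 comes from the answer above.

```python
from typing import List


def truncate_words(text: str, limit: int = 200) -> List[str]:
    """Return at most the first `limit` whitespace-separated words of `text`."""
    words = text.split()   # naive tokenisation, assumed for this sketch
    return words[:limit]   # everything beyond the limit is dropped


# A long homepage dump keeps only its first 200 words.
homepage_text = "word " * 500
kept = truncate_words(homepage_text)
print(len(kept))
```

With a homepage as input, it is worth checking that the first few hundred words are the informative ones (rather than menus or cookie notices), since anything past the limit never reaches the model.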
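
For point 2, a quick way to see where a dataset stands is to count documents and labels before training. The sketch below assumes the corpus is already in memory as `(text, labels)` pairs; `label_counts` and the sample data are hypothetical and not part of magpie.

```python
from collections import Counter


def label_counts(samples):
    """Count how many documents carry each label.

    `samples` is an iterable of (text, labels) pairs.
    """
    counts = Counter()
    for _text, labels in samples:
        counts.update(set(labels))  # count each label once per document
    return counts


# Toy corpus, made up for illustration.
corpus = [
    ("Acme builds industrial robots.", ["manufacturing", "tech"]),
    ("Bolt ships parcels overnight.", ["logistics"]),
    ("Cirrus sells cloud storage.", ["tech"]),
]
counts = label_counts(corpus)
print(len(corpus), "documents,", len(counts), "labels")
print(counts.most_common())
```

Labels that appear in only a handful of documents are the ones most likely to need more data.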
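
For point 3, one way to concatenate extra information is to append each extra field on its own line, so that each one ends up as a separate sentence given the line-break behaviour described above. This is only a sketch: `build_document` and the example field values are hypothetical.

```python
def build_document(body, extra_fields):
    """Append extra string fields to `body`, one field per line."""
    parts = [body.strip()]
    for value in extra_fields:
        value = " ".join(value.split())  # collapse whitespace and stray newlines
        if value:                        # skip empty fields
            parts.append(value)
    return "\n".join(parts)


doc = build_document(
    "Acme builds industrial robots.",
    ["Industry: manufacturing", "   ", "Founded in 1999"],
)
print(doc)
```

Collapsing newlines inside each field keeps a single field from being split into several sentences by accident.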

Good luck!

Thank you so much @jstypka for the quick response. 👍 💯