inspirehep / magpie

Deep neural network framework for multi-label text classification

Predictions are horribly wrong

davidniki02 opened this issue

I have trained magpie on a news dataset. I have 9 labels for my data.

I trained the model and tested the following text with magpie.predict_from_text():

Más de 690 mil casos de inmigrantes esperan ser resueltos por tribunales de Inmigración WASHINGTON— La Administración Trump ha convertido las protecciones de menores en sinónimo de “lagunas legales” que el Congreso debe eliminar pero mientras tanto, sobre el terreno, tampoco ha mejorado el atasco de más de 692,000 casos pendientes en los tribunales de Inmigración, según expertos.

(English translation: More than 690 thousand immigrant cases are waiting to be resolved by the Immigration courts. WASHINGTON— The Trump administration has turned protections for minors into a synonym for “legal loopholes” that Congress must eliminate, but meanwhile, on the ground, it has not improved the backlog of more than 692,000 pending cases in the Immigration courts either, according to experts.)

While I don't have ANY Spanish documents in my training samples, magpie returns a 90% chance that this text belongs to one of my labels! It also predicts similarly high scores for 3 other categories, all of them irrelevant. I tried to see whether any particular words were causing this, but could not find any.

What could be wrong here? I trained on 400-500 documents per category and set epochs to 30 as well as 50 (no change in the results).
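For reference, this is roughly my setup. It is only a sketch: the paths, label names and vector dimension below are placeholders, not my exact values.

```python
from magpie import Magpie

# Placeholder labels -- my real corpus has 9 news categories,
# each with 400-500 training documents.
labels = ['politics', 'business', 'sports']  # ... 9 in total

magpie = Magpie()

# Build the word2vec vectors and the scaler from the training corpus
magpie.init_word_vectors('data/news-corpus', vec_dim=100)

# Train the classifier (I tried both epochs=30 and epochs=50)
magpie.train('data/news-corpus', labels, test_ratio=0.2, epochs=30)

# Predict on the Spanish paragraph -- this is where the ~90% scores appear
print(magpie.predict_from_text(u'Más de 690 mil casos de inmigrantes ...'))
```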

Well, if you didn't feed it any Spanish text before, the network will return random results. For the network to build representations for words (in any language), they need to appear in the training set at least N times (N=5 by default). Otherwise Magpie simply has no idea what is being fed into it and might be triggered by random noise, like "Washington" or "Trump" in your case.
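If you want to check this yourself, save the word2vec model after training and look at how many tokens from that Spanish paragraph are actually in its vocabulary. A rough sketch (the save path is a placeholder, it assumes the saved embeddings load back as a standard gensim Word2Vec model, and the lowercase/whitespace split is only an approximation of Magpie's own tokenization):

```python
from gensim.models import Word2Vec

# assumes you saved the embeddings earlier, e.g.
# magpie.save_word2vec_model('embeddings/news_w2v')
model = Word2Vec.load('embeddings/news_w2v')

# newer gensim versions keep the vocabulary on model.wv,
# older ones directly on the model
vocab = getattr(model, 'wv', model).vocab

text = u"Más de 690 mil casos de inmigrantes esperan ser resueltos ..."
tokens = text.lower().split()  # rough stand-in for Magpie's tokenizer

known = [t for t in tokens if t in vocab]
print('%d of %d tokens are in the training vocabulary' % (len(known), len(tokens)))
print(known)  # probably only a few names and numbers, if anything
```

Anything that is not in that vocabulary is effectively invisible to the network.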

The rule is: you should test/predict on the same type of data that you train on.

The thing that worries me is the high confidence (95% in some cases). If the model does not recognize the words, shouldn't it at least be more cautious in its predictions?
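In the meantime I am thinking of guarding against this on my side, along the lines of the sketch below: skip the prediction entirely when most of the input tokens were never seen during training. The vocabulary loading and the 0.5 threshold are my own placeholders, nothing built into Magpie, and as far as I understand the per-label scores come from independent outputs, so they don't form a single probability distribution anyway.

```python
from gensim.models import Word2Vec

def predict_with_oov_guard(trained_magpie, vocab, text, min_coverage=0.5):
    """Only return Magpie's scores if enough of the input is in-vocabulary."""
    tokens = text.lower().split()  # rough tokenization, as in the check above
    known = sum(1 for t in tokens if t in vocab)
    coverage = float(known) / max(len(tokens), 1)
    if coverage < min_coverage:
        return []  # too little overlap with the training vocabulary -- abstain
    return trained_magpie.predict_from_text(text)

# usage -- 'magpie' is the trained instance from my earlier snippet,
# and the embeddings path is the same placeholder as above
w2v = Word2Vec.load('embeddings/news_w2v')
vocab = getattr(w2v, 'wv', w2v).vocab
print(predict_with_oov_guard(magpie, vocab, u'Más de 690 mil casos ...'))
```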

I have the same issue, and I get these poor results even when I use part of the training corpus for testing.