inspirehep / magpie

Deep neural network framework for multi-label text classification

How to deal with the problem of label imbalance?

JiaWenqi opened this issue

My training set has 100,000 doc samples and 1,000 tags, but I found that the tags follow a long-tail distribution. Some tags appear in fewer than 10 docs, while others are included in nearly every doc. How should I deal with this?

Magpie will likely learn to almost never recommend the classes from the long tail and will frequently default to the most common classes. If that's not the behaviour you want, you might want to repartition your dataset so that it has a more balanced class distribution.
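
One possible way to do that repartitioning, as a rough sketch only: count how many docs carry each tag, drop the tags at both extremes, and duplicate docs that carry the remaining rare tags. Nothing here is part of Magpie. The helper names `tag_counts` and `rebalance` and the thresholds `min_count`, `max_fraction` and `target` are made up for illustration, and the data is assumed to be two parallel Python lists (docs and their tag lists).

```python
import random
from collections import Counter


def tag_counts(labels):
    """Count in how many docs each tag appears."""
    return Counter(tag for tags in labels for tag in set(tags))


def rebalance(docs, labels, min_count=10, max_fraction=0.9, target=50, seed=0):
    """Hypothetical helper: filter extreme tags, then oversample rare ones.

    docs and labels are parallel lists; labels[i] holds the tags of docs[i].
    Returns a list of (doc, tags) pairs.
    """
    counts = tag_counts(labels)
    n_docs = len(docs)

    # Keep tags that are neither too rare nor present in almost every doc.
    keep = {
        tag for tag, count in counts.items()
        if count >= min_count and count <= max_fraction * n_docs
    }

    # Strip the dropped tags and discard docs left without any tag.
    data = [(doc, [t for t in tags if t in keep]) for doc, tags in zip(docs, labels)]
    data = [(doc, tags) for doc, tags in data if tags]

    # Oversample: duplicate random docs carrying each tag that is below target.
    rng = random.Random(seed)
    counts = tag_counts([tags for _, tags in data])
    extra = []
    for tag in sorted(keep):
        carriers = [item for item in data if tag in item[1]]
        missing = target - counts[tag]
        if missing > 0:
            extra.extend(rng.choice(carriers) for _ in range(missing))
    return data + extra
```

Because a duplicated doc keeps all of its tags, oversampling a rare tag also raises the counts of the tags that co-occur with it, so it is worth re-running `tag_counts` on the result to check the new distribution. Duplicates should only go into the training split, otherwise the same doc can end up in both training and test data.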