Deal with empty documents
jboynyc opened this issue
John Boy commented
>>> import textnets as tn
>>> import pandas as pd
>>>
>>> s = pd.Series(['text 1', None, 'text 3', 'text 4'], index=list('ABCD'))
>>> tn.Corpus(s)
Corpus(4 docs: A, B, C…)
>>> tn.Corpus(s).tokenized()
# results in error because of document B
Possible fixes: silently discard empty documents, discard them with a warning,
or provide an option in the Corpus init method.
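To make the three options concrete, here is a minimal sketch using pandas only. The helper `drop_empty_documents` and its `on_empty` parameter are hypothetical names chosen for illustration and are not part of textnets; treating whitespace-only strings as empty is also an assumption.

```python
import warnings

import pandas as pd


def drop_empty_documents(docs: pd.Series, on_empty: str = "warn") -> pd.Series:
    """Hypothetical helper: remove empty documents from a Series of texts.

    on_empty: "ignore" drops silently, "warn" drops and warns,
    "raise" refuses to continue. (Illustrative option names, not textnets API.)
    """
    # A document counts as empty if it is missing (None/NaN) or,
    # by assumption, contains only whitespace.
    empty = docs.isna() | docs.fillna("").str.strip().eq("")
    missings = int(empty.sum())
    if missings:
        if on_empty == "raise":
            raise ValueError(f"Found {missings} empty document(s).")
        if on_empty == "warn":
            warnings.warn(f"Dropping {missings} empty document(s).")
    return docs[~empty]


s = pd.Series(['text 1', None, 'text 3', 'text 4'], index=list('ABCD'))
cleaned = drop_empty_documents(s)  # warns and keeps A, C, D
```

Defaulting to "warn" keeps the common case working while still telling the user that the corpus is smaller than the input.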
John Boy commented
New behavior:
>>> import textnets as tn
>>> import pandas as pd
>>> s = pd.Series(['text 1', None, 'text 3', 'text 4'], index=list('ABCD'))
>>> tn.Corpus(s).tokenized()
.../textnets/textnets/corpus.py:64: UserWarning: Dropping 1 empty document(s).
  warnings.warn(f"Dropping {missings} empty document(s).")
      term  n
label
A     text  1
C     text  1
D     text  1
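If you would rather see which documents are empty before they are dropped, you can inspect and filter the Series yourself first. This is a sketch using pandas only, and it assumes that "empty" means a missing value such as None; the variable names are illustrative.

```python
import pandas as pd

s = pd.Series(['text 1', None, 'text 3', 'text 4'], index=list('ABCD'))

# Labels of the documents that have no text at all.
empty_labels = s.index[s.isna()].tolist()

# Remove the missing documents up front.
clean = s.dropna()

# clean can then be passed on as before, e.g. tn.Corpus(clean).tokenized()
```

Filtering up front also gives you the labels of the removed documents, which the warning only counts.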