Deal with empty documents

Question

jboynyc opened this issue 4 years ago

>>> import textnets as tn
>>> import pandas as pd
>>> 
>>> s = pd.Series(['text 1', None, 'text 3', 'text 4'], index=list('ABCD'))
>>> tn.Corpus(s)
Corpus(4 docs: A, B, C…)
>>> tn.Corpus(s).tokenized()
# raises an error because document B is None

Possible fixes: silently discard empty documents, discard them and emit a warning, or add an option to the Corpus init method that controls how empty documents are handled.

John Boy · Answer 1 · Fri Jul 10 2020 00:38:52 GMT+0800 (China Standard Time)

New behavior:

>>> import textnets as tn
>>> import pandas as pd
>>> s = pd.Series(['text 1', None, 'text 3', 'text 4'], index=list('ABCD'))
>>> tn.Corpus(s).tokenized()
.../textnets/textnets/corpus.py:64: UserWarning: Dropping 1 empty document(s).
  warnings.warn(f"Dropping {missings} empty document(s).")
       term  n
label         
A      text  1
C      text  1
D      text  1
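
The warning in the output above suggests that missing documents are counted and then dropped before tokenization. The following is a minimal sketch of that drop-and-warn pattern in plain pandas. It is not the textnets source: `drop_empty_documents` and its `warn` flag are hypothetical names used only for illustration, and it treats only None/NaN entries as empty.

```python
import warnings

import pandas as pd


def drop_empty_documents(docs: pd.Series, warn: bool = True) -> pd.Series:
    """Return docs without missing (None/NaN) entries.

    Hypothetical helper, not part of textnets.
    """
    missing = docs.isna()
    n_missing = int(missing.sum())
    # Warn only when something is actually dropped and warnings are enabled.
    if n_missing and warn:
        warnings.warn(f"Dropping {n_missing} empty document(s).")
    return docs[~missing]


s = pd.Series(["text 1", None, "text 3", "text 4"], index=list("ABCD"))
cleaned = drop_empty_documents(s)
print(list(cleaned.index))  # ['A', 'C', 'D']
```

Calling the helper with `warn=False` would correspond to the "silently discard" option from the question, while the default matches "discard and warn".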