jboynyc / textnets

Text analysis with networks.

Home Page: https://textnets.readthedocs.io/

Deal with empty documents

jboynyc opened this issue

>>> import textnets as tn
>>> import pandas as pd
>>> 
>>> s = pd.Series(['text 1', None, 'text 3', 'text 4'], index=list('ABCD'))
>>> tn.Corpus(s)
Corpus(4 docs: A, B, C…)
>>> tn.Corpus(s).tokenized()
# raises an error because document B is None

Possible fixes: silently discard empty documents, discard them and emit a warning, or provide an option in the Corpus init method.

New behavior:

>>> import textnets as tn
>>> import pandas as pd
>>> s = pd.Series(['text 1', None, 'text 3', 'text 4'], index=list('ABCD'))
>>> tn.Corpus(s).tokenized()
.../textnets/textnets/corpus.py:64: UserWarning: Dropping 1 empty document(s).
  warnings.warn(f"Dropping {missings} empty document(s).")
       term  n
label         
A      text  1
C      text  1
D      text  1
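
For illustration, the drop-and-warn behavior can be sketched with plain pandas. This is only a rough approximation, not the actual code in corpus.py: the helper name `drop_empty_documents` is hypothetical, and treating whitespace-only strings as empty is an assumption that the issue does not confirm.

```python
import warnings

import pandas as pd


def drop_empty_documents(docs: pd.Series) -> pd.Series:
    """Hypothetical helper (not part of textnets): drop empty documents and warn."""
    # Assumption: None/NaN and whitespace-only strings both count as empty.
    is_empty = docs.isna() | docs.fillna("").astype(str).str.strip().eq("")
    missings = int(is_empty.sum())
    if missings:
        warnings.warn(f"Dropping {missings} empty document(s).")
    return docs[~is_empty]


s = pd.Series(["text 1", None, "text 3", "text 4"], index=list("ABCD"))

# Record the warning so it can be inspected instead of printed to stderr.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    kept = drop_empty_documents(s)
```

After this runs, `kept` holds only the documents labeled A, C, and D, and `caught` contains the single warning about the dropped document B.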