how to remove sentences from ODM
fredzannarbor opened this issue
Hi,
I want to preprocess certain tokenized sentences before submitting them to the summarizer. For example, I would like to be able to remove any sentence that contains five consecutive periods (these are often 'noisy' ToC lines).
parser = PlaintextParser.from_string(text, Tokenizer("english"))
summarizer = LexRankSummarizer()
summarizer.stop_words = get_stop_words("english")
summary = summarizer(parser.document, sentences_count)
summary_text = '\n'.join([str(sentence) for sentence in summary])
So I want to insert something like this pseudocode before the "summarizer" line:
for s in parser.document.sentences:
    if s.str.contains("...."):
        s.remove()
Of course, this doesn't work because the ODM is not iterable. So how do I iterate through the components of the document and remove or edit them as I see fit?
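The check itself does not depend on sumy. A minimal sketch, assuming a hypothetical helper named `has_dot_run` and a configurable threshold (the text above mentions five periods, while the pseudocode uses four):

```python
import re


def has_dot_run(text, min_dots=5):
    """Return True if `text` contains at least `min_dots` consecutive periods.

    Hypothetical helper for illustration; it is not part of sumy.
    """
    return re.search(r"\.{%d,}" % min_dots, text) is not None


# A typical ToC line versus ordinary prose with a plain ellipsis.
print(has_dot_run("Chapter 1 ........ 12"))   # True
print(has_dot_run("Wait... what happened?"))  # False
```

It would be applied to the text of each sentence, for example as `has_dot_run(str(sentence))`.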
Hello, the DOM is just an object consisting of paragraphs and sentences. You can filter out the sentences you don't want and build a new document from what remains.
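from sumy.models.dom import ObjectDocumentModel, Paragraph
paragraphs = []
for p in parser.document.paragraphs:
    # Keep only sentences without the dot run, then wrap them in a Paragraph for the new document.
    kept = [s for s in p.sentences if "...." not in str(s)]
    paragraphs.append(Paragraph(kept))
dom = ObjectDocumentModel(paragraphs)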
You may need to handle the edge case where every sentence in a paragraph gets removed, although empty paragraphs may work as well.
Thank you. I did not understand how to reconstitute the dom from the constituents.
OK, one more obstacle.
parser = PlaintextParser.from_string(text, Tokenizer("english"))
summarizer = LexRankSummarizer()
summarizer.stop_words = get_stop_words("english")
print(len(parser.document.sentences))
paragraphs = []
drops = []
#print(len(parser.document.paragraphs))
for paragraph in parser.document.paragraphs:
    #print(len(paragraph.sentences))
    for sentence in paragraph.sentences:
        if "......" in str(sentence):
            drops.append(sentence)
        else:
            paragraphs.append(sentence)
print(len(drops), len(paragraphs))
dom = ObjectDocumentModel(paragraphs)
print(len(dom.paragraphs))
summary = summarizer(dom, sentences_count)
The extra code is just to make sure that the filter is dropping the problem sentences, and the keeps & drops add up correctly. But when I try to summarize the filtered dom, it throws an error.
1746
9 1737
1737
Traceback (most recent call last):
  File "app/utilities/text2sumy_summarize.py", line 53, in <module>
    result = sumy_summarize(text, sentences_count=args.sentences_count)
  File "app/utilities/text2sumy_summarize.py", line 32, in sumy_summarize
    summary = summarizer(dom, sentences_count)
  File "/Users/fred/.virtualenvs/pycharmed-unity/lib/python3.8/site-packages/sumy/summarizers/lex_rank.py", line 36, in __call__
    sentences_words = [self._to_words_set(s) for s in document.sentences]
  File "/Users/fred/.virtualenvs/pycharmed-unity/lib/python3.8/site-packages/sumy/utils.py", line 53, in decorator
    setattr(self, key, getter(self))
  File "/Users/fred/.virtualenvs/pycharmed-unity/lib/python3.8/site-packages/sumy/models/dom/_document.py", line 23, in sentences
    return tuple(chain(*sentences))
  File "/Users/fred/.virtualenvs/pycharmed-unity/lib/python3.8/site-packages/sumy/models/dom/_document.py", line 22, in <genexpr>
    sentences = (p.sentences for p in self._paragraphs)
AttributeError: 'Sentence' object has no attribute 'sentences'
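The bug is on this line: paragraphs.append(sentence). It fills the list with individual Sentence objects, but ObjectDocumentModel expects a sequence of paragraphs, each with its own .sentences attribute. That is why the traceback ends with 'Sentence' object has no attribute 'sentences'.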
😉
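A corrected version of the loop could look like the sketch below. It keeps the surviving sentences grouped per paragraph and wraps each group before building the document. The function name `rebuild_without` is hypothetical, and the two factory arguments are assumed to be `Paragraph` and `ObjectDocumentModel` from `sumy.models.dom`; they are passed in rather than imported so that the sketch assumes nothing else about the installed version.

```python
def rebuild_without(document, marker, make_paragraph, make_document):
    """Rebuild `document` without sentences whose text contains `marker`.

    Hypothetical helper for illustration. `make_paragraph` takes a list of
    sentences and `make_document` takes a list of paragraphs.
    Returns the new document and the list of dropped sentences.
    """
    paragraphs = []
    drops = []
    for paragraph in document.paragraphs:
        kept = []
        for sentence in paragraph.sentences:
            if marker in str(sentence):
                drops.append(sentence)
            else:
                kept.append(sentence)
        # Append a paragraph, not bare sentences; skip paragraphs left empty.
        if kept:
            paragraphs.append(make_paragraph(kept))
    return make_document(paragraphs), drops
```

With sumy this would presumably be called as `dom, drops = rebuild_without(parser.document, "......", Paragraph, ObjectDocumentModel)`, and `dom` can then be passed to the summarizer as before.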