Sentences within a dataset
gabrer opened this issue · comments
I am working on a quite "noisy" dataset, so it is very difficult to detect sentence boundaries exactly (for example, I have a lot of abbreviations with periods, and these periods are detected as the end of a sentence).
Do you think that having many short sentences (often with just 3 words) could hurt the algorithm's performance? Is it important to preserve the information about which words belong to which sentence?
PS: Furthermore, if the punctuation is filtered out, the information about a "sentence" is completely lost, as each document becomes a bag of words. Could it also work in this case?
Oh, thank you for confirming this!
I've already modified the regular expression. Unfortunately, the problematic periods are not only abbreviations but also outright "mistakes" in the text.
Thank you anyway!
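
For anyone who runs into the same problem, here is a minimal sketch of the kind of workaround discussed above: split on sentence-ending punctuation, then glue a fragment back onto the previous one when that previous fragment ends in a known abbreviation or is very short. It is not taken from this repository; `ABBREVIATIONS`, `split_sentences`, and the `min_words` threshold are hypothetical names and values chosen for illustration, and the abbreviation list would need tuning for a real corpus.

```python
import re

# Hypothetical abbreviation list (lowercase, without the trailing period).
# Extend it to match the abbreviations in your own data.
ABBREVIATIONS = {"dr", "mr", "mrs", "prof", "etc", "fig", "a.m", "p.m"}


def split_sentences(text, min_words=4):
    """Split text into sentences, merging fragments caused by noisy periods.

    A fragment is appended to the previous one when the previous fragment
    ends in a known abbreviation or has fewer than `min_words` words.
    """
    # Naive split: break after '.', '!' or '?' when followed by whitespace.
    pieces = re.split(r"(?<=[.!?])\s+", text.strip())

    sentences = []
    for piece in pieces:
        if not piece:
            continue
        if sentences:
            prev = sentences[-1]
            last_token = prev.split()[-1].rstrip(".").lower()
            ends_in_abbrev = prev.endswith(".") and last_token in ABBREVIATIONS
            too_short = len(prev.split()) < min_words
            if ends_in_abbrev or too_short:
                # Treat the break as spurious and keep the words together.
                sentences[-1] = prev + " " + piece
                continue
        sentences.append(piece)
    return sentences


if __name__ == "__main__":
    sample = ("Dr. Smith met Prof. Jones at 5 p.m. today. "
              "They talked. It went well for everyone involved.")
    for sentence in split_sentences(sample):
        print(sentence)
```

This only merges forward, so a very short final fragment stays on its own, and a sentence that genuinely ends in an abbreviation such as "etc." gets merged with the next one. Periods that are plain typos, like the "mistakes" mentioned above, are only caught when they happen to produce a fragment shorter than `min_words`.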