askerlee / topicvec

Sentences within a dataset

gabrer opened this issue

I am working on a rather "noisy" dataset, so it is very difficult to detect sentence boundaries exactly. For example, it contains many abbreviations written with periods, and those periods are detected as sentence ends.

Do you think that having many short sentences (often just 3 words long) could hurt the algorithm's performance? Is it important to preserve the information about which words belong to the same sentence?

PS: Furthermore, if the punctuation is filtered out, the information about sentences is lost completely and each document becomes a bag of words. Could the algorithm still work in this case?

Oh, thank you for confirming this!
I've already modified the regular expression, but unfortunately the problematic periods come not only from abbreviations but also from "mistakes" in the text.

Thank you anyway!
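
For anyone with a similarly noisy corpus, here is a minimal sketch of one possible workaround. It is not the regular expression used in topicvec; the `ABBREVIATIONS` set, the `split_sentences` function and the `min_words` threshold are all hypothetical names and values chosen for illustration. The idea is to split on whitespace after `.`, `!` or `?`, then glue a piece onto the following one whenever it ends in a known abbreviation or is shorter than a few words.

```python
import re

# Hypothetical abbreviation list (an assumption): replace with the ones in your corpus.
ABBREVIATIONS = {"dr", "mr", "mrs", "e.g", "i.e", "etc", "vs"}


def split_sentences(text, min_words=4):
    """Heuristic splitter: merges abbreviation splits and very short fragments."""
    # Candidate boundaries: whitespace that follows '.', '!' or '?'.
    pieces = re.split(r"(?<=[.!?])\s+", text.strip())
    sentences = []
    carry = ""  # text waiting to be joined with the next piece
    for piece in pieces:
        if not piece:
            continue
        current = f"{carry} {piece}" if carry else piece
        last_token = current.split()[-1].rstrip(".!?").lower()
        ends_with_abbrev = current.endswith(".") and last_token in ABBREVIATIONS
        too_short = len(current.split()) < min_words
        if ends_with_abbrev or too_short:
            # Probably a false boundary: keep accumulating.
            carry = current
        else:
            sentences.append(current)
            carry = ""
    if carry:
        # Leftover fragment at the end: attach it to the previous sentence if any.
        if sentences:
            sentences[-1] = f"{sentences[-1]} {carry}"
        else:
            sentences.append(carry)
    return sentences


if __name__ == "__main__":
    sample = ("Dr. Smith arrived late. He said e.g. nothing at all. "
              "Ok. The meeting then went on as planned.")
    for sentence in split_sentences(sample):
        print(sentence)
```

Merging short fragments forward is only a heuristic: it will sometimes join two genuinely separate short sentences, so `min_words` would need tuning against a sample of the real data.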