Sentences within a dataset

Question

gabrer opened this issue 7 years ago · comments

I am working on a rather "noisy" dataset, so it is very difficult to detect sentence boundaries exactly (for example, I have a lot of abbreviations containing periods, and these periods are detected as the end of sentences).

Do you think that having many short sentences (often just 3 words long) could compromise the algorithm's performance? Is it important to preserve the information about which words belong to the same sentence?

PS: Furthermore, if the punctuation is filtered out, the information about sentences is completely lost, as documents become bags of words. Could it also work in this case?

askerlee · Answer 1 · Thu May 18 2017 14:15:12 GMT+0800 (China Standard Time)

The sentence information is actually not used, so it should not impact the performance. Do you mean that the dots are part of the abbreviations? In that case you could modify the regular expression used to extract tokens from the text.
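
As a rough illustration of that kind of change, here is a minimal Python sketch of a token pattern that keeps dotted abbreviations such as "e.g." or "U.S.A." together as single tokens. The pattern `TOKEN_RE` and the helper `tokenize` are hypothetical names made up for this example; this is not the regular expression the project actually uses, and it only covers abbreviations built from single letters each followed by a dot.

```python
import re

# Hypothetical token pattern, NOT the project's actual regular expression.
# First alternative: dotted abbreviations made of single letters ("e.g.", "U.S.A.").
# Second alternative: ordinary words, optionally with an apostrophe part ("don't").
TOKEN_RE = re.compile(r"(?:[A-Za-z]\.){2,}|[A-Za-z]+(?:'[A-Za-z]+)?")

def tokenize(text):
    """Return lowercase tokens, keeping dotted abbreviations in one piece."""
    return [tok.lower() for tok in TOKEN_RE.findall(text)]

tokens = tokenize("The U.S.A. team met, e.g. on Monday.")
print(tokens)
```

Abbreviations such as "Dr." or "approx." do not fit this pattern and would need an explicit list; stray periods that are plain typos cannot be recovered by a pattern like this at all.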

Gabriele Pergola · Answer 2 · Thu May 18 2017 19:56:41 GMT+0800 (China Standard Time)

Oh, thank you for confirming this!
I've already modified the regular expression, but unfortunately the stray periods come not only from abbreviations but also from "mistakes" in the text.

Thank you anyway!