Make a Python script for separating a post into sentences and store that info in the database
ykdojo opened this issue · comments
This is so that users will be able to edit a post sentence-by-sentence.
I'm starting to think, this might be a good structure for the database:
We already have: Post.
Each Post has text.
In addition to that, we should have a Sentence model.
Each Sentence will belong to a Post.
And each Sentence will have a sentence_index, which will be the index that shows where in the Post it appears.
So, the first sentence will have sentence_index = 0, and the second sentence will have sentence_index = 1, and so on.
I'm going to look into the best way to do this right now.
This actually seems like a non-trivial problem.
Some StackOverflow discussions about this:
Looks like nltk.tokenize is a preferred solution, as described here: https://stackoverflow.com/questions/9474395/how-to-break-up-a-paragraph-by-sentences-in-python
I'm going to try and see if it works with Japanese, too.
Anyway, for now, I'm just going to make a simplified version of this algorithm and move on (maybe like break paragraphs by line breaks for now).
I'm going to work on this now.
I'm planning to break the sentences by line breaks and ignore empty lines.
Note: I'm planning to work on this branch for this: https://github.com/ykdojo/editdojoprivate/tree/split-post-into-paragraphs