ykdojo / editdojo2

This used to be Edit Dojo's private repo - now it's public.

Home Page: https://www.csdojo.io/edit

Make a Python script that separates a post into sentences and stores that info in the database

ykdojo opened this issue

commented

This is so that users will be able to edit a post sentence-by-sentence.

commented

@Jonathantsho FYI, I'm working on this one now.

commented

I'm starting to think this might be a good structure for the database:

We already have: Post.

Each Post has text.

In addition to that, we should have a Sentence model.

Each Sentence will belong to a Post.

And each Sentence will have a sentence_index, which is its zero-based position within the Post.

So, the first sentence will have sentence_index = 0, and the second sentence will have sentence_index = 1, and so on.
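To make the proposed structure concrete, here is a minimal sketch using Python's built-in sqlite3 module. It is only an illustration: the table names, the `post_id` column, the `save_post` and `load_sentences` helpers, and the choice to store each sentence's own text are all assumptions, not the project's actual models. Only `text` on the post and `sentence_index` come from the description above.

```python
import sqlite3

# Hypothetical schema for illustration. Only `text` (on the post) and
# `sentence_index` come from the discussion; everything else is assumed.
SCHEMA = """
CREATE TABLE post (
    id INTEGER PRIMARY KEY,
    text TEXT NOT NULL
);
CREATE TABLE sentence (
    id INTEGER PRIMARY KEY,
    post_id INTEGER NOT NULL REFERENCES post(id),
    sentence_index INTEGER NOT NULL,
    text TEXT NOT NULL,
    UNIQUE (post_id, sentence_index)
);
"""


def save_post(conn, text, sentences):
    """Insert a post and its sentences; return the new post id."""
    cur = conn.execute("INSERT INTO post (text) VALUES (?)", (text,))
    post_id = cur.lastrowid
    # enumerate() yields 0 for the first sentence, 1 for the second, and so on.
    conn.executemany(
        "INSERT INTO sentence (post_id, sentence_index, text) VALUES (?, ?, ?)",
        [(post_id, i, s) for i, s in enumerate(sentences)],
    )
    conn.commit()
    return post_id


def load_sentences(conn, post_id):
    """Return (sentence_index, text) pairs for a post, in order."""
    rows = conn.execute(
        "SELECT sentence_index, text FROM sentence "
        "WHERE post_id = ? ORDER BY sentence_index",
        (post_id,),
    )
    return rows.fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
post_id = save_post(
    conn,
    "First sentence. Second sentence.",
    ["First sentence.", "Second sentence."],
)
print(load_sentences(conn, post_id))
```

The uniqueness constraint on the post and index pair is one way to guarantee that no two sentences claim the same position in a post.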

I'm going to look into the best way to do this right now.

commented

This actually seems like a non-trivial problem.

Some StackOverflow discussions about this:
https://stackoverflow.com/questions/9474395/how-to-break-up-a-paragraph-by-sentences-in-python
https://stackoverflow.com/questions/4576077/python-split-text-on-sentences

Looks like nltk.tokenize is the preferred solution, as described in the first link above.

I'm going to try and see if it works with Japanese, too.

commented

I tried using nltk.tokenize, but I got this error:
[image: screenshot of the error]

Looks like we'll need to load some data somewhere first?
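For reference, a sketch of what the nltk.tokenize route could look like. The screenshot is not available, so this assumes the error was NLTK's usual complaint about missing tokenizer data (a LookupError), which is normally resolved by a one-time download. The `split_sentences` helper and its regex fallback are hypothetical, added here only so the snippet still runs without NLTK or its data.

```python
import re


def split_sentences(text):
    """Split text into sentences with NLTK if usable, else a naive regex."""
    try:
        from nltk.tokenize import sent_tokenize

        # sent_tokenize needs the Punkt data, usually fetched once with
        # nltk.download("punkt"); newer NLTK releases may ask for
        # "punkt_tab" instead. A missing resource raises LookupError.
        return sent_tokenize(text)
    except (ImportError, LookupError):
        # Assumed fallback: split after ., ! or ? followed by whitespace.
        # This is far cruder than Punkt (it breaks on "Dr. Smith", etc.).
        parts = re.split(r"(?<=[.!?])\s+", text.strip())
        return [p for p in parts if p]


print(split_sentences("Hello world. How are you?"))
```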

Anyway, for now, I'm just going to make a simplified version of this algorithm and move on (maybe like break paragraphs by line breaks for now).

commented

> Anyway, for now, I'm just going to make a simplified version of this algorithm and move on (maybe like break paragraphs by line breaks for now).

I'm going to work on this now.

I'm planning to split the post on line breaks and ignore empty lines.
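A minimal sketch of that simplified approach follows. The function name `split_by_line_breaks` is hypothetical, and stripping surrounding whitespace from each line is an assumption on top of what is described here.

```python
def split_by_line_breaks(text):
    """Split a post on line breaks, dropping empty or whitespace-only lines."""
    # str.splitlines() handles "\n", "\r\n" and other line boundaries.
    return [line.strip() for line in text.splitlines() if line.strip()]


post_text = "First paragraph.\n\n  Second one.\r\nThird.\n"
for sentence_index, chunk in enumerate(split_by_line_breaks(post_text)):
    print(sentence_index, chunk)
```

Each resulting chunk can then be given its position from `enumerate`, which lines up with the sentence_index idea discussed earlier in the thread.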

commented

Note: I'm planning to do this work on the following branch: https://github.com/ykdojo/editdojoprivate/tree/split-post-into-paragraphs