tecoholic / ner-annotator

Named Entity Recognition (NER) Annotation tool for SpaCy. Generates training data as JSON that is ready to use.

Home Page: https://tecoholic.github.io/ner-annotator/

Ambiguity in indexing of labels

Faran-Javaid opened this issue

Hi @tecoholic. First of all, thanks for this great repository. Secondly, I would like to ask you a question.
Can you please explain which string the indices refer to (the original text or the text after tokenization)? When I test the exported JSON file, the indices do not match the expected spans.

@Faran-Javaid Hi, the indices are calculated as per the TreebankTokenizer algorithm. The relevant code can be seen here:

https://github.com/tecoholic/ner-annotator/blob/main/annotator/server.py#L20

This index issue happened to me as well. If I index into the original text, the indices work fine, but if I index into the text present in the JSON, I get the wrong result.

Thanks for the response, @tecoholic. I have noticed that this tool works completely fine when the text is properly formatted. However, if the string contains multiple spaces, or \t or \n characters, the indexing goes wrong. I have figured it out and fixed the issue by using the original text instead of the tokenized text for annotations. Please let me know if you would like me to open a pull request with these changes. Cheers!
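
The mismatch described above can be reproduced with a minimal sketch. It assumes a simplified whitespace tokenizer rather than the tool's actual TreebankTokenizer-based code, and the helper names `spans_on_joined_tokens` and `spans_on_original` are hypothetical, introduced only for illustration. It shows how offsets computed against tokens re-joined with single spaces drift once the original text contains repeated spaces, tabs, or newlines, while offsets taken directly from the original text stay valid.

```python
import re


def spans_on_joined_tokens(text):
    """Hypothetical: offsets relative to the tokens re-joined with single spaces."""
    spans = []
    pos = 0
    for token in text.split():
        spans.append((pos, pos + len(token), token))
        pos += len(token) + 1  # assumes exactly one space between tokens
    return spans


def spans_on_original(text):
    """Hypothetical: offsets relative to the original, unmodified text."""
    return [(m.start(), m.end(), m.group()) for m in re.finditer(r"\S+", text)]


text = "John  lives\tin\nNew   York"

joined = spans_on_joined_tokens(text)
original = spans_on_original(text)

# Offsets from the joined tokens are only valid for the normalized string.
start, end, token = joined[1]
print(repr(text[start:end]))                    # ' live'  (wrong slice of the original)
print(repr(" ".join(text.split())[start:end]))  # 'lives'  (right slice of the joined text)

# Offsets from the original text always slice back to the token.
for start, end, token in original:
    assert text[start:end] == token
```

With properly formatted text (single spaces only) both functions return the same offsets, which matches the observation that the problem appears only with irregular whitespace.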

@Faran-Javaid It would be nice to have the improvement included. Please send a PR, I will be happy to merge. Thanks in advance.

PR & Further Discussion in #23