Ambiguity in indexing of labels

Question

Faran-Javaid opened this issue 3 years ago

Hi @tecoholic. First of all, thanks for this great repository. Secondly, I would like to ask a question.
Can you please explain which string the indices refer to (the original text or the text after tokenization)? When I test the exported JSON file, the indices do not point to the right places.

Arunmozhi · Answer 1 · Thu Sep 23 2021 22:42:52 GMT+0800 (China Standard Time)

@Faran-Javaid Hi, the indices are calculated as per the TreebankTokenizer algorithm. The relevant code can be seen here:

https://github.com/tecoholic/ner-annotator/blob/main/annotator/server.py#L20

rsparth · Answer 2 · Wed Sep 29 2021 15:21:47 GMT+0800 (China Standard Time)

This index issue happened to me as well. If I apply the indices to the original text, they work fine, but if I apply them to the text present in the JSON, I get the wrong result.
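
For anyone trying to reproduce the symptom, here is a minimal sketch of how the same offsets can select different substrings in two versions of a text. It is not taken from the tool: `slice_labels`, the sample sentence, the spans, and the assumption that the exported text has its whitespace collapsed are all hypothetical and only illustrate the kind of drift described above.

```python
import re


def slice_labels(text, spans):
    """Return the substring that each (start, end, label) span selects in text."""
    return [(text[start:end], label) for start, end, label in spans]


# Hypothetical input containing double spaces, a tab, and a newline.
original = "John  Smith\tlives in\nNew   York"

# Assumption for illustration: the exported text has every run of
# whitespace collapsed into a single space.
exported = re.sub(r"\s+", " ", original)

# Offsets that were computed against the original string.
spans = [(0, 11, "PERSON"), (21, 31, "GPE")]

print(slice_labels(original, spans))  # slices line up with the entities
print(slice_labels(exported, spans))  # slices drift after the first extra space
```

Whichever string the offsets were computed on, slicing the other one drifts as soon as the two differ in length, so the first thing to check is which text the indices belong to.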

Faran-Javaid · Answer 3 · Wed Sep 29 2021 15:27:04 GMT+0800 (China Standard Time)

Thanks for the response, @tecoholic. I have noticed that this tool works completely fine when the text is properly formatted. However, if the string contains multiple consecutive spaces, or \t or \n characters, the indexing goes wrong. I have tracked this down and fixed it by using the original text instead of the tokenized text for annotations. Please let me know if you want me to make a pull request with these changes. Cheers!
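
The idea of anchoring annotations to the original text can be sketched roughly as follows. This is only an illustration, not the change itself: `token_spans` is a hypothetical helper, and a plain whitespace split via the standard `re` module stands in for the Treebank tokenizer the tool uses. The point is that every token keeps its start and end offsets into the unmodified string, so extra spaces, tabs, and newlines cannot shift the indices.

```python
import re


def token_spans(text):
    """Yield (start, end, token) with offsets into the unmodified text.

    A whitespace split stands in for the real tokenizer here (assumption).
    """
    for match in re.finditer(r"\S+", text):
        yield match.start(), match.end(), match.group()


# Hypothetical input containing double spaces, a tab, and a newline.
text = "John  Smith\tlives in\nNew   York"
spans = list(token_spans(text))

# Every span slices back to exactly its token in the original text.
for start, end, token in spans:
    assert text[start:end] == token

print(spans)
```

Because the offsets are never recomputed on a normalized copy, label indices built from these spans stay valid as long as the same original text is exported alongside them.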

Arunmozhi · Answer 4 · Thu Sep 30 2021 11:24:24 GMT+0800 (China Standard Time)

@Faran-Javaid It would be nice to have the improvement included. Please send a PR; I will be happy to merge it. Thanks in advance.

Arunmozhi · Answer 5 · Sun Oct 10 2021 17:21:31 GMT+0800 (China Standard Time)

PR & Further Discussion in #23