glample / tagger

Named Entity Recognition Tool


Which tokenizer did you use?

janwendt opened this issue

Your documentation says:

The input file should contain one sentence per line, and the sentences have to be tokenized. Otherwise, the tagger will perform poorly.

Simple Question: Which tokenizer did you use?

Hi,

I only trained the model on the CoNLL datasets that were already tokenized, so I did not have to tokenize anything. The Moses tokenizer should probably work well:
https://github.com/moses-smt/mosesdecoder/tree/master/scripts/tokenizer
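For reference, a minimal sketch using the sacremoses Python port of the Moses tokenizer (the port and the file names raw.txt / input.txt are just assumptions for illustration, not part of this repo):

```python
# Sketch: tokenize raw sentences with the Moses tokenizer (sacremoses port),
# writing one space-separated, tokenized sentence per line as tagger.py expects.
# pip install sacremoses
from sacremoses import MosesTokenizer

mt = MosesTokenizer(lang="en")

with open("raw.txt") as fin, open("input.txt", "w") as fout:
    for line in fin:
        # return_str=True joins the tokens with single spaces
        fout.write(mt.tokenize(line.strip(), return_str=True) + "\n")
```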

@glample Can you give an example line of how input.txt should look?


@janwendt What did you eventually do?

@mrmotallebi I am using the StanfordCoreNLP API, which does a very good job, but there are similar Python libraries (NLTK is pretty good) as well.
Post that got me to the API: https://www.ibm.com/developerworks/community/blogs/nlp/entry/tokenization?lang=en
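For instance, with NLTK something along these lines produces one tokenized, space-separated sentence per line, which is the format the tagger expects (a sketch; raw.txt and input.txt are placeholder names):

```python
# Sketch: tokenize with NLTK so each output line is one sentence of
# space-separated tokens.
# pip install nltk; run nltk.download("punkt") once before using word_tokenize.
from nltk.tokenize import word_tokenize

with open("raw.txt") as fin, open("input.txt", "w") as fout:
    for sentence in fin:
        tokens = word_tokenize(sentence.strip())
        fout.write(" ".join(tokens) + "\n")
```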

I would personally recommend the Moses one; it's pretty standard and very fast.

@janwendt Do you have a demo of input.txt for tagger.py?

@janwendt It isn't needed. I got it working.


@bjtu-lucas-nlp Could you please share an example of input.txt?
I have tried all kinds of combinations but still get O tags for everything in the output.

@gui-li If you have a column-based data structure like:
example1;example2;example3;

the tokenized output and input.txt for your tagger should be:

example1
example2
example3
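In code, that conversion is just a split on the delimiter, for example (a sketch; columns.txt is a placeholder name for your ';'-separated file):

```python
# Sketch: split ';'-separated values onto separate lines for input.txt.
with open("columns.txt") as fin, open("input.txt", "w") as fout:
    for line in fin:
        for item in line.strip().split(";"):
            if item:  # skip the empty field left by a trailing ';'
                fout.write(item + "\n")
```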

@janwendt Thanks for replying.