glample / tagger

Named Entity Recognition Tool


Which tokenizer did you use?

janwendt opened this issue

Your documentation says:

The input file should contain one sentence per line, and the sentences have to be tokenized. Otherwise, the tagger will perform poorly.

Simple Question: Which tokenizer did you use?

Hi,

I only trained the model on the CoNLL datasets that were already tokenized, so I did not have to tokenize anything. The Moses tokenizer should probably work well:
https://github.com/moses-smt/mosesdecoder/tree/master/scripts/tokenizer
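For reference, a minimal sketch using the sacremoses Python port of the Moses tokenizer (the port and the file names raw.txt / input.txt are just assumptions for illustration, not part of this repo):

```python
# Sketch: tokenize raw sentences with the Moses tokenizer (sacremoses port),
# writing one space-separated, tokenized sentence per line as tagger.py expects.
# pip install sacremoses
from sacremoses import MosesTokenizer

mt = MosesTokenizer(lang="en")

with open("raw.txt") as fin, open("input.txt", "w") as fout:
    for line in fin:
        # return_str=True joins the tokens with single spaces
        fout.write(mt.tokenize(line.strip(), return_str=True) + "\n")
```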

@glample Can you give an example line of how input.txt should look?


@janwendt What did you eventually do?

@mrmotallebi I am using the StanfordCoreNLP API, which does a very good job, but there are similar Python libraries (NLTK is pretty good) as well.
Post that got me to the API: https://www.ibm.com/developerworks/community/blogs/nlp/entry/tokenization?lang=en
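For instance, with NLTK something along these lines produces one tokenized, space-separated sentence per line, which is the format the tagger expects (a sketch; raw.txt and input.txt are placeholder names):

```python
# Sketch: tokenize with NLTK so each output line is one sentence of
# space-separated tokens.
# pip install nltk; run nltk.download("punkt") once before using word_tokenize.
from nltk.tokenize import word_tokenize

with open("raw.txt") as fin, open("input.txt", "w") as fout:
    for sentence in fin:
        tokens = word_tokenize(sentence.strip())
        fout.write(" ".join(tokens) + "\n")
```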

I would personally recommend the Moses one; it's pretty standard and very fast.

@janwendt Do you have a demo of input.txt for tagger.py?

@janwendt It isn't needed. I got it working.


@bjtu-lucas-nlp Could you please share an example of input.txt?
I have tried all kinds of combinations but still get O tags for everything in the output.

@gui-li If you have a column-based data structure like:
example1;example2;example3;

the tokenized output and input.txt for your tagger should be:

example1
example2
example3
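In code, that conversion is just a split on the delimiter, for example (a sketch; columns.txt is a placeholder name for your ';'-separated file):

```python
# Sketch: split ';'-separated values onto separate lines for input.txt.
with open("columns.txt") as fin, open("input.txt", "w") as fout:
    for line in fin:
        for item in line.strip().split(";"):
            if item:  # skip the empty field left by a trailing ';'
                fout.write(item + "\n")
```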

@janwendt Thanks for replying.