ottokart / punctuator2

A bidirectional recurrent neural network model with attention mechanism for restoring missing punctuation in unsegmented text

Home Page: http://bark.phon.ioc.ee/punctuator

No punctuation in result - Train more lines or preprocess data.dev.txt?

ErfolgreichCharismatisch opened this issue

I followed these steps:

python data.py <data_dir>
python main.py <model_name> 256 0.02
cat data.dev.txt | python punctuator.py <model_path> <model_output_path>

I used the europarl-v7.de-en.de dataset and took

1800 lines for ep.dev.txt
1800 lines for ep.test.txt
7200 lines for ep.train.txt
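A split like the one above could be produced with a few lines of Python. This is only a sketch: `split_corpus` is a hypothetical helper, not part of punctuator2, and taking the dev and test lines from the front of the corpus is an assumption about how the split was done.

```python
from typing import List, Tuple


def split_corpus(
    lines: List[str], n_dev: int, n_test: int
) -> Tuple[List[str], List[str], List[str]]:
    """Return (train, dev, test) slices of a list of lines.

    Hypothetical helper: dev and test are taken from the front,
    everything that remains becomes the training set.
    """
    if n_dev < 0 or n_test < 0:
        raise ValueError("split sizes must be non-negative")
    if n_dev + n_test > len(lines):
        raise ValueError("not enough lines for the requested dev/test sizes")
    dev = lines[:n_dev]
    test = lines[n_dev:n_dev + n_test]
    train = lines[n_dev + n_test:]
    return train, dev, test
```

With 10800 input lines and `n_dev = n_test = 1800`, the remaining 7200 lines go to training, matching the sizes listed above.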

data.dev.txt is a single long line of output from Kaldi, a speech-to-text engine. It is all lowercase, contains some misrecognized words, and has no punctuation.

The content written to <model_output_path> is identical to data.dev.txt.

Is the solution to train on more lines, or do I have to preprocess data.dev.txt? If the latter, how?

I think you need to have each sentence on its own line, and many more samples. I used https://www.statmt.org/wmt14/training-monolingual-europarl-v7/europarl-v7.fr.gz and modified run.sh to use this file. Run it and see what the output .txt files look like.
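To get punctuated training text into a one-sentence-per-line layout, something like the following might work. It is a rough sketch, not what run.sh does: `one_sentence_per_line` is a hypothetical helper, and splitting on whitespace after `.`, `!` or `?` is a naive assumption that will break on abbreviations and numbers. It also only applies to text that already contains punctuation, so it cannot segment the Kaldi output itself.

```python
import re
from typing import List

# Split at whitespace that directly follows a sentence-final mark.
# Naive assumption: every ".", "!" or "?" ends a sentence.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def one_sentence_per_line(text: str) -> List[str]:
    """Return the sentences of `text`, one list item per sentence."""
    # Collapse newlines and repeated spaces into single spaces first.
    collapsed = " ".join(text.split())
    if not collapsed:
        return []
    return [s for s in _SENTENCE_END.split(collapsed) if s]
```

Writing the result back with `"\n".join(...)` gives a file with one sentence per line, which can then be compared against the .txt files that run.sh produces.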