No punctuation in result - Train on more lines or preprocess data.dev.txt?
ErfolgreichCharismatisch opened this issue · comments
I followed these steps:
python data.py <data_dir>
python main.py <model_name> 256 0.02
cat data.dev.txt | python punctuator.py <model_path> <model_output_path>
I used the europarl-v7.de-en.de dataset and took
1800 lines for ep.dev.txt
1800 lines for ep.test.txt
7200 lines for ep.train.txt
data.dev.txt is one long line of output from Kaldi, a speech-to-text engine. It is all lowercase, contains some wrong words, and has no punctuation.
The file at <model_output_path> is identical to data.dev.txt.
Is the solution to train on more lines, or do I have to preprocess data.dev.txt? If the latter, how?
Push
I think you need each sentence on its own line, and many more samples. I used https://www.statmt.org/wmt14/training-monolingual-europarl-v7/europarl-v7.fr.gz and modified run.sh to use this file. Run it and see what the output .txt files look like.
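In case it helps, here is a rough sketch of how the corpus could be split into the three files, assuming the input already has one sentence per line. The helper names `split_corpus` and `write_split` are made up for illustration and are not part of the repository, and the counts are whatever you choose to pass in. It keeps the original line order and drops empty lines.

```python
import os


def split_corpus(lines, n_dev, n_test):
    """Split one-sentence-per-line text into (train, dev, test) lists.

    Hypothetical helper: the first n_dev sentences go to dev, the next
    n_test to test, and everything left over to train.
    """
    # Strip surrounding whitespace and skip blank lines.
    sentences = [ln.strip() for ln in lines if ln.strip()]
    if n_dev + n_test >= len(sentences):
        raise ValueError("not enough sentences left for training")
    dev = sentences[:n_dev]
    test = sentences[n_dev:n_dev + n_test]
    train = sentences[n_dev + n_test:]
    return train, dev, test


def write_split(corpus_path, out_dir, n_dev, n_test):
    """Read corpus_path and write ep.train.txt, ep.dev.txt, ep.test.txt."""
    with open(corpus_path, encoding="utf-8") as f:
        train, dev, test = split_corpus(f, n_dev, n_test)
    for name, part in (("train", train), ("dev", dev), ("test", test)):
        out_path = os.path.join(out_dir, "ep.%s.txt" % name)
        with open(out_path, "w", encoding="utf-8") as out:
            # One sentence per line, as suggested above.
            out.write("\n".join(part) + "\n")
```

Something like `write_split("europarl-v7.de-en.de", "<data_dir>", 1800, 1800)` would then leave every remaining line for training instead of only 7200.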