No punctuation in result - Train more lines or preprocess data.dev.txt?

Question

ErfolgreichCharismatisch opened this issue 3 years ago · comments

ErfolgreichCharismatisch commented 3 years ago

I followed these steps:

python data.py <data_dir>
python main.py <model_name> 256 0.02
cat data.dev.txt | python punctuator.py <model_path> <model_output_path>

I used the europarl-v7.de-en.de dataset and took:

1800 lines for ep.dev.txt
1800 lines for ep.test.txt
7200 lines for ep.train.txt

data.dev.txt is a single long line of output from Kaldi, a speech-to-text engine. It is all lowercase, contains some misrecognized words, and has no punctuation.
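For reference, a split of this size could be produced with a short script like the one below. This is only a sketch: `split_corpus` and `write_lines` are hypothetical helpers, not part of the repository, and taking consecutive slices in dev, test, train order is an assumption, since the post does not say how the lines were chosen.

```python
def split_corpus(lines, n_dev=1800, n_test=1800, n_train=7200):
    """Cut a list of lines into consecutive dev, test, and train slices.

    Hypothetical helper: the slice order (dev, test, train) is an assumption.
    """
    needed = n_dev + n_test + n_train
    if len(lines) < needed:
        raise ValueError(f"need {needed} lines, got {len(lines)}")
    dev = lines[:n_dev]
    test = lines[n_dev:n_dev + n_test]
    train = lines[n_dev + n_test:needed]
    return train, dev, test


def write_lines(path, lines):
    """Write one entry per line, UTF-8 encoded."""
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines) + "\n")


if __name__ == "__main__":
    # Input file name taken from the post; empty lines are skipped.
    with open("europarl-v7.de-en.de", encoding="utf-8") as f:
        corpus = [line.strip() for line in f if line.strip()]
    train, dev, test = split_corpus(corpus)
    write_lines("ep.train.txt", train)
    write_lines("ep.dev.txt", dev)
    write_lines("ep.test.txt", test)
```

Raising the default sizes is the only change needed to train on more lines.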

The file written to <model_output_path> is identical to data.dev.txt: no punctuation was added.

Is the solution to train on more lines, or do I have to preprocess data.dev.txt? If the latter, how?

ErfolgreichCharismatisch commented 3 years ago

Push

ssabatier · Answer 1 · Tue Apr 05 2022 15:53:11 GMT+0800 (China Standard Time)

I think you need to put each sentence on its own line and train on many more samples. I used https://www.statmt.org/wmt14/training-monolingual-europarl-v7/europarl-v7.fr.gz and modified run.sh to use this file. Run it and see what the output .txt files look like.
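If your training text is not already one sentence per line, a rough way to get there might look like the sketch below. The function names are hypothetical, and the split rule (break after `.`, `!` or `?` followed by whitespace) is a naive assumption that will mis-split abbreviations such as "z. B.", so check the result by eye before training.

```python
import re


def split_sentences(text):
    """Naively split text after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def one_sentence_per_line(text):
    """Return the text with each detected sentence on its own line."""
    return "\n".join(split_sentences(text))


if __name__ == "__main__":
    sample = "Das ist ein Satz. Ist das noch einer? Ja!"
    print(one_sentence_per_line(sample))
```

This only applies to punctuated training data. Text that has no punctuation at all, such as raw speech-to-text output, gives the rule nothing to split on and comes back as a single line.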