Rule-based language model
kezakool opened this issue
Hi,
Thank you for all your work and for sharing it!
I'm trying to use KenLM to build some rule-based language models on small texts, to detect misreadings in children's reading.
I need to tweak the probabilities of each node so that roughly 70% of the mass stays on the nominal text and 30% goes to the errors. What is the best way to build the correct ARPA file with your package?
For now, I've generated artificial text variations that include errors, and added a large proportion of duplicated nominal sentences to get the desired probabilities (a sketch of the approach is below).
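Roughly, the corpus-building step looks like this; the sentences, the file name, and the 7/3 duplication counts are just placeholders for illustration, and the `lmplz` invocation in the comment assumes a standard KenLM build:

```python
# Minimal sketch of the weighted-corpus idea: duplicate the nominal
# sentences so they carry ~70% of the counts, leaving the error
# variants with the remaining ~30%.

nominal_sentences = [
    "the cat sat on the mat",
    "the dog ran in the park",
]
error_variants = [
    "the cat sat on the hat",   # artificial misreading
    "the dog run in the park",  # artificial misreading
]

# With 7 copies of each nominal sentence against 3 copies of each
# variant, the raw counts approximate the 70/30 split.
with open("train.txt", "w", encoding="utf-8") as f:
    for s in nominal_sentences:
        for _ in range(7):
            f.write(s + "\n")
    for s in error_variants:
        for _ in range(3):
            f.write(s + "\n")

# Then build the ARPA file with lmplz, e.g.:
#   lmplz -o 3 --discount_fallback < train.txt > model.arpa
```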
With this, I've succeeded in generating some LMs using the --discount_fallback parameter, but I find them hard to verify and I get some weird results, which makes me think the duplicated sentences are not actually affecting the probabilities (the check I've been using is sketched below).
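For verification, here is a minimal sketch of how I inspect per-token probabilities with the kenlm Python module (`pip install kenlm`); the model path and the test sentences are placeholders:

```python
import kenlm

# Load the ARPA (or binarized) model; path is a placeholder.
model = kenlm.Model("model.arpa")

def show_scores(sentence):
    # full_scores yields (log10 prob, ngram order used, is_oov)
    # for each token plus the implicit </s>.
    words = sentence.split() + ["</s>"]
    for (logprob, ngram_len, oov), word in zip(
            model.full_scores(sentence), words):
        print(f"{word:>10}  p={10 ** logprob:.4f}  "
              f"(order={ngram_len}, oov={oov})")
    print()

# Compare a nominal sentence against an error variant: if the
# duplication worked, the nominal tokens should get higher
# probabilities where the two texts diverge.
show_scores("the cat sat on the mat")
show_scores("the cat sat on the hat")
```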
Thank you for your time :)