Issue with using a Cyrillic corpus
barrucadu opened this issue · comments
I have a Cyrillic corpus here; see the relevant part of an email exchange:
I was as surprised as you are.
See attached a novel in Russian.
Here's what I do:
(Cmd) train 3 dom.txt
(Cmd) tokens 10
Warning: using seed 1459321644
и знал, что ты свой поповский туман на меня нагонять
The output can be found verbatim in the input file. So it's just 10 consecutive
words, taken from a random place.
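For anyone wanting to confirm the symptom on their own corpus, here is a minimal sketch of a check for whether generated output is just a verbatim run of corpus tokens. It assumes plain whitespace tokenisation, which may not match what the library does, and `appears_verbatim` is a hypothetical helper name, not part of markov.

```python
def appears_verbatim(output: str, corpus: str) -> bool:
    """Return True if the whitespace-split tokens of `output` occur
    consecutively, in the same order, somewhere in `corpus`.

    Assumption: tokens are separated by whitespace only.
    """
    out_tokens = output.split()
    corpus_tokens = corpus.split()
    n = len(out_tokens)
    if n == 0:
        return True
    # Slide a window of n tokens across the corpus and compare.
    return any(
        corpus_tokens[i:i + n] == out_tokens
        for i in range(len(corpus_tokens) - n + 1)
    )
```

Something like `appears_verbatim(generated, open("dom.txt", encoding="utf-8").read())` would then report whether a given run of tokens was copied straight from the file; the encoding here is a guess and may need adjusting for the actual corpus.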
Interestingly, if I comment out the call to sys.intern in https://github.com/barrucadu/markov/blob/master/markov/markov.py#L24 then it works. So something funky seems to be going on with interning strings consisting of Cyrillic characters, although I've not yet managed to replicate this with a corpus small enough to make manually examining the trained data structure fruitful.
I'm a bit wary of just removing that line, as string interning is typically a massive memory saving, but if you are affected by this you can remove it to make things work.
I have just checked the dictionaries produced by markov.Markov.train, both with and without interning. They are identical, so that's not where the issue creeps in.
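For reference, a comparison along these lines can be sketched as follows. This is not the library's actual training code: `build_table` is a hypothetical stand-in that maps each n-gram to follower counts, and the token list is a toy example rather than the real corpus. It only illustrates that interning should not change dictionary contents, since interned strings compare and hash equal to the originals.

```python
import sys
from collections import defaultdict


def build_table(tokens, n, intern=False):
    """Map each n-gram (a tuple of tokens) to a dict of follower counts.

    Hypothetical stand-in for a Markov training step; with intern=True
    every token is passed through sys.intern first.
    """
    if intern:
        tokens = [sys.intern(t) for t in tokens]
    table = defaultdict(lambda: defaultdict(int))
    for i in range(len(tokens) - n):
        key = tuple(tokens[i:i + n])
        table[key][tokens[i + n]] += 1
    # Convert to plain dicts so the two tables can be compared with ==.
    return {k: dict(v) for k, v in table.items()}


# Toy Cyrillic token list (an assumption, not taken from the corpus file).
tokens = "и знал что ты и знал что я".split()
plain = build_table(tokens, 2)
interned = build_table(tokens, 2, intern=True)
assert plain == interned
```

If the tables match like this on the full corpus too, the difference in behaviour presumably comes from somewhere after training, for example wherever identity rather than equality of strings could matter.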
Here's the corpus: http://lib.ru/PROZA/ABRAMOW/abramov_dom.txt