barrucadu / markov

Markov chain text generator, as used for KingJamesProgramming

Issue with using a Cyrillic corpus

barrucadu opened this issue

I have a Cyrillic corpus here; see the relevant part of an email exchange:

I was as surprised as you are.
Attached is a novel in Russian.
Here's what I do:
(Cmd) train 3 dom.txt
(Cmd) tokens 10
Warning: using seed 1459321644
и знал, что ты свой поповский туман на меня нагонять
The output can be found verbatim in the input file. So it's just 10 consecutive
words, taken from a random place.
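
To confirm that a generated line really is a verbatim run from the corpus, one option is to search for its token sequence in the tokenized input. This is only a minimal sketch: it assumes plain whitespace tokenization, which may not match how markov splits its input, and `is_verbatim_run` is a hypothetical helper, not part of the project.

```python
def is_verbatim_run(corpus_text, output_text):
    """Return True if the output's tokens occur consecutively in the corpus.

    Assumes whitespace tokenization (an assumption, not markov's tokenizer).
    """
    corpus = corpus_text.split()
    out = output_text.split()
    if not out:
        return True
    n = len(out)
    # Slide a window of length n over the corpus and compare token lists.
    return any(corpus[i:i + n] == out for i in range(len(corpus) - n + 1))
```

For the report above, this would be called with the contents of dom.txt and the generated line; a True result for every sample would suggest the chain is replaying the input rather than recombining it.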

Interestingly, if I comment out the call to sys.intern in https://github.com/barrucadu/markov/blob/master/markov/markov.py#L24, it works. So something funky seems to be going on with interning strings made of Cyrillic characters, although I've not yet managed to replicate this with a corpus small enough to make manually examining the trained data structure fruitful.
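
For context, here is a small sketch of what sys.intern is expected to do with Cyrillic tokens: the string values stay the same, and equal tokens end up sharing one object. The sample tokens and the helper name `intern_tokens` are made up for illustration and are not taken from markov.py.

```python
import sys

def intern_tokens(tokens):
    """Return the tokens with each string replaced by its interned copy."""
    return [sys.intern(t) for t in tokens]

# Hypothetical sample; split() creates fresh string objects at runtime.
tokens = "и знал, что ты свой и знал".split()
interned = intern_tokens(tokens)

# Interning must not change the values of the strings.
values_unchanged = interned == tokens
# Equal tokens ("и" at positions 0 and 5) now share a single object.
shared_object = interned[0] is interned[5]
```

If either check failed on a real corpus, that would point at interning itself; if both hold, the problem is more likely somewhere downstream.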

I'm a bit wary of just removing that line, as string interning is typically a massive memory saving, but if you are affected by this you can remove it to make it work.

I have just checked the dictionaries produced by markov.Markov.train, both with and without interning. They are identical. So that's not where the issue creeps in.
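
The kind of comparison described here can be sketched with a toy trainer. This is not the code in markov.Markov.train: the `train` function below, its table layout, and the sample corpus are all assumptions made for illustration.

```python
import sys
from collections import defaultdict

def train(tokens, n, intern=True):
    """Toy n-gram trainer (hypothetical; not markov.Markov.train).

    Maps each n-token prefix to a dict of {next_token: count}.
    """
    if intern:
        tokens = [sys.intern(t) for t in tokens]
    table = defaultdict(lambda: defaultdict(int))
    for i in range(len(tokens) - n):
        prefix = tuple(tokens[i:i + n])
        table[prefix][tokens[i + n]] += 1
    # Convert to plain dicts so the two results compare by value.
    return {prefix: dict(nexts) for prefix, nexts in table.items()}

# Hypothetical sample built from the output quoted in the issue.
corpus = ("и знал, что ты свой поповский туман на меня нагонять "
          "и знал, что я").split()

with_intern = train(corpus, 3, intern=True)
without_intern = train(corpus, 3, intern=False)
same = with_intern == without_intern
```

Dictionary equality compares keys and values by value, not by object identity, so identical tables are what one would expect as long as interning leaves the string contents alone.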