barrucadu / markov

Markov chain text generator, as used for KingJamesProgramming

Issue with using a cyrillic corpus

barrucadu opened this issue

I have a Cyrillic corpus here; see the relevant part of an email exchange:

I was as surprised as you are.
See attached a novel in Russian.
Here's what I do:
(Cmd) train 3 dom.txt
(Cmd) tokens 10
Warning: using seed 1459321644
и знал, что ты свой поповский туман на меня нагонять
The output can be found in the input file verbatim. So it's just 10 consecutive words, taken from a random place.
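
The symptom described in the email can be checked mechanically rather than by eye. The following is a minimal sketch, assuming both the generated output and the corpus are tokenised by whitespace splitting; the helper name is_verbatim is hypothetical and not part of the repository.

```python
def is_verbatim(output_tokens, corpus_tokens):
    """Return True if output_tokens occurs as one contiguous run in corpus_tokens."""
    k = len(output_tokens)
    # Slide a window of length k over the corpus and compare it to the output.
    return any(
        corpus_tokens[i:i + k] == output_tokens
        for i in range(len(corpus_tokens) - k + 1)
    )


# Toy data for illustration only; the real corpus would be read from dom.txt.
corpus = "и знал, что ты свой поповский туман на меня нагонять".split()
copied = "что ты свой".split()       # contiguous in the corpus
recombined = "что свой ты".split()   # same words, different order

print(is_verbatim(copied, corpus))
print(is_verbatim(recombined, corpus))
```

A generator that is recombining text should sometimes produce sequences for which this check returns False; if every sample passes, the output is just being copied from the input.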

Interestingly, if I comment out the call to sys.intern, it works. So something funky seems to be going on with interning strings consisting of Cyrillic characters, although I have not yet managed to replicate this with a corpus small enough to make manually examining the trained data structure fruitful.

I'm a bit wary of just removing that line, as string interning is typically a massive memory saving, but if you are affected by this you can remove it to make generation work.
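
For context on why interning saves memory: sys.intern returns one canonical object per distinct string value, so repeated tokens share storage instead of each occurrence holding its own copy. A minimal sketch, assuming whitespace tokenisation; the helper intern_tokens is hypothetical and is not the code in the repository.

```python
import sys


def intern_tokens(tokens):
    """Return the tokens with each string replaced by its interned copy."""
    return [sys.intern(tok) for tok in tokens]


# Splitting at runtime yields a separate string object per token,
# so the two occurrences of "и" are not guaranteed to be shared.
words = "и знал, что ты свой и знал".split()
interned = intern_tokens(words)

# Interning must not change any value, only which object holds it.
print(interned == words)
# After interning, equal tokens are the same object.
print(interned[0] is interned[5])
```

Both checks are expected to hold for Cyrillic tokens just as for ASCII ones, since interning is defined on string values rather than on any particular alphabet.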

I have just checked the dictionaries produced by markov.Markov.train, both with and without interning. They are identical. So that's not where the issue creeps in.
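
The comparison above can be sketched roughly as follows. This is an illustration under assumptions: the train function here is a hypothetical stand-in that maps each n-token prefix to next-token counts, and it is not the actual markov.Markov.train, whose data layout may differ.

```python
import sys
from collections import defaultdict


def train(tokens, n, intern=False):
    """Hypothetical trainer: map each n-token prefix to counts of the next token."""
    if intern:
        tokens = [sys.intern(t) for t in tokens]
    model = defaultdict(lambda: defaultdict(int))
    for i in range(len(tokens) - n):
        prefix = tuple(tokens[i:i + n])
        model[prefix][tokens[i + n]] += 1
    # Convert to plain dicts so the two models can be compared by value.
    return {prefix: dict(nexts) for prefix, nexts in model.items()}


# Toy corpus for illustration only.
corpus = "и знал, что ты свой поповский туман и знал, что я".split()

plain = train(corpus, 3)
interned = train(corpus, 3, intern=True)

# Dictionary equality compares keys and values, not object identity,
# so interning alone should never make these differ.
print(plain == interned)
```

Because equality here ignores object identity, identical dictionaries rule out differences in the trained counts, but not code elsewhere that depends on whether two equal strings are the same object.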