Issue with using a Cyrillic corpus

Question

barrucadu opened this issue 9 years ago

I have a Cyrillic corpus here; see the relevant part of an email exchange:

I was as surprised as you are.
See attached a novel in Russian.
Here's what I do:
(Cmd) train 3 dom.txt
(Cmd) tokens 10
Warning: using seed 1459321644
и знал, что ты свой поповский туман на меня нагонять
The output can be found verbatim in the input file. So it's just 10 consecutive
words taken from a random place.
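
The "verbatim" observation can be checked mechanically rather than by eye. Below is a minimal sketch, assuming whitespace tokenization; `is_verbatim_run` is a hypothetical helper, not part of the markov tool, and the toy corpus is just the sample output line from above.

```python
def is_verbatim_run(output: str, corpus: str) -> bool:
    """Return True if the whitespace-separated tokens of `output` occur
    as one contiguous run among the tokens of `corpus`."""
    out = output.split()
    toks = corpus.split()
    n = len(out)
    if n == 0:
        return True
    # Slide a window of n tokens over the corpus and compare each one.
    return any(toks[i:i + n] == out for i in range(len(toks) - n + 1))


# Toy corpus (assumption): the generated line quoted in the email.
corpus = "и знал, что ты свой поповский туман на меня нагонять"

print(is_verbatim_run("знал, что ты свой", corpus))  # contiguous run
print(is_verbatim_run("туман свой", corpus))         # words present, not adjacent
```

If a long generated sequence passes this check against the real corpus every time, the generator is probably replaying the text instead of mixing n-grams.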

Interestingly, if I comment out the call to sys.intern in https://github.com/barrucadu/markov/blob/master/markov/markov.py#L24, then it works. So something funky seems to be going on with interning strings that contain Cyrillic characters, although I haven't yet managed to replicate this with a corpus small enough to make manually examining the trained data structure fruitful.

I'm a bit wary of just removing that line, as string interning is typically a massive memory saving, but if you are affected by this you can remove it to make it work.
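
For reference, here is a small sketch of what sys.intern is expected to do with Cyrillic text. It is an illustration only, not the code in markov.py: equal strings built at run time are collapsed to one shared object, and the text itself is unchanged.

```python
import sys

# Build two equal strings at run time so they are not simply the same
# compile-time constant.
a = "".join(["ту", "ман"])
b = "".join(["ту", "ман"])

ia = sys.intern(a)
ib = sys.intern(b)

print(a == b)    # True: equal by value
print(ia is ib)  # True: after interning, both names refer to one object
print(ia == a)   # True: interning does not alter the text
```

The memory saving comes from the second line of output: every repeated word in the corpus is stored once instead of once per occurrence.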

Michael Walker · Answer 1 · Wed Mar 30 2016 16:36:15 GMT+0800 (China Standard Time)

I have just checked the dictionaries produced by markov.Markov.train, both with and without interning. They are identical. So that's not where the issue creeps in.
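
The comparison can be sketched like this. Note that `build_table` is a hypothetical stand-in for the training step, written for this example; it is not the actual markov.Markov.train implementation, and the token list is the sample output line rather than the full novel.

```python
import sys
from collections import defaultdict


def build_table(tokens, n, use_intern=False):
    """Map each (n-1)-token prefix to counts of the tokens that follow it."""
    if use_intern:
        tokens = [sys.intern(t) for t in tokens]
    table = defaultdict(lambda: defaultdict(int))
    for i in range(len(tokens) - n + 1):
        prefix = tuple(tokens[i:i + n - 1])
        table[prefix][tokens[i + n - 1]] += 1
    # Convert to plain dicts so the comparison looks at contents only.
    return {k: dict(v) for k, v in table.items()}


tokens = "и знал, что ты свой поповский туман на меня нагонять".split()
plain = build_table(tokens, 3)
interned = build_table(tokens, 3, use_intern=True)

print(plain == interned)  # True: interning changes identity, not contents
```

Dictionary equality compares keys and values by value, so this check cannot see identity-based differences; if anything downstream compares tokens with `is` rather than `==`, that would not show up here.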

Michael Walker · Answer 2 · Wed Mar 30 2016 20:44:47 GMT+0800 (China Standard Time)

Here's the corpus: http://lib.ru/PROZA/ABRAMOW/abramov_dom.txt