Umlauts
Fetzii opened this issue
Fetzii commented
- spikex version: 0.5.2
- Python version: 3.9.7
- Operating System: Windows 10
Description
I'm trying to get the categories for a page with an umlaut in its title from my dewiki_core (Cem Özdemir: https://de.wikipedia.org/wiki/Cem_%C3%96zdemir).
It crashes, which shouldn't happen. There is also an English wiki page for him (https://en.wikipedia.org/wiki/Cem_%C3%96zdemir).
What I Did
from spikex.wikigraph import load as wg_load
wg = wg_load("dewiki_core")
page = "Cem_Özdemir"
categories = wg.get_categories(page, distance=1)
TypeError: unsupported operand type(s) for +: 'NoneType' and 'int'
Paolo Arduin commented
Thank you, @Fetzii!
I'll investigate this issue; it could be related to badly handled encoding.
I'll keep you posted on what I find.
Fetzii commented
I seem to have fixed the problem locally by changing line 234 in dumptools.py from:
line = line.decode("latin1")
to:
line = line.decode(encoding="utf-8", errors="backslashreplace")
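For anyone following along, here is a minimal sketch of why that change could matter. It assumes the dump lines are UTF-8 encoded bytes, which this thread does not confirm, and the variable names are made up for illustration; none of this is taken from spikex itself. It only shows how the two decode calls treat the same bytes:

```python
# Assumption: the page title arrives as UTF-8 encoded bytes in the dump.
title = "Cem_Özdemir"
raw = title.encode("utf-8")

# Old behaviour: latin1 maps every byte to one character, so the two
# UTF-8 bytes of "Ö" (0xC3 0x96) become two unrelated characters.
as_latin1 = raw.decode("latin1")

# Proposed behaviour: decode as UTF-8; any invalid byte is kept as a
# backslash escape instead of raising UnicodeDecodeError.
as_utf8 = raw.decode(encoding="utf-8", errors="backslashreplace")

print(repr(as_latin1))       # mangled title, one character longer
print(repr(as_utf8))         # the original title
print(as_latin1 == title)    # False
print(as_utf8 == title)      # True

# Invalid UTF-8 input does not crash with backslashreplace.
broken = b"abc\xff".decode(encoding="utf-8", errors="backslashreplace")
print(broken)                # abc\xff
```

If titles were stored in the mangled latin1 form, a lookup for "Cem_Özdemir" would find no match, which could plausibly explain a None turning up further down the line. That link to the TypeError is a guess, not something verified here.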