erre-quadro / spikex

SpikeX - SpaCy Pipes for Knowledge Extraction

Umlauts

Fetzii opened this issue · comments

  • spikex version: 0.5.2
  • Python version: 3.9.7
  • Operating System: Windows 10

Description

Getting the categories for a page with umlauts in its title from my dewiki_core graph (Cem Özdemir: https://de.wikipedia.org/wiki/Cem_%C3%96zdemir)
crashes, which shouldn't happen. There is also an English wiki page for him (https://en.wikipedia.org/wiki/Cem_%C3%96zdemir)

What I Did

from spikex.wikigraph import load as wg_load
wg = wg_load("dewiki_core")
page = "Cem_Özdemir"
categories = wg.get_categories(page, distance=1)
TypeError: unsupported operand type(s) for +: 'NoneType' and 'int'

Thank you, @Fetzii!
I'll investigate this issue; it could be related to badly handled encoding.
I'll keep you posted on what I find.

I seem to have fixed the problem locally by changing line 234 in dumptools.py from:
line = line.decode("latin1")
to:
line = line.decode(encoding="utf-8", errors="backslashreplace")
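For anyone who wants to see why that change could matter, here is a minimal standalone sketch. It does not use spikex at all, and it assumes the dump lines are UTF-8 encoded bytes (I have not checked what dumptools.py actually reads). The variable names are only for illustration.

```python
# Assumption: the dump stores page titles as UTF-8 bytes.
title = "Cem_Özdemir"
raw = title.encode("utf-8")

# Decoding UTF-8 bytes as latin1 never fails, but it turns each byte into
# one character, so the two-byte "Ö" becomes two wrong characters.
garbled = raw.decode("latin1")

# Decoding as UTF-8 restores the title; backslashreplace only kicks in for
# invalid byte sequences, which it escapes instead of raising an error.
fixed = raw.decode(encoding="utf-8", errors="backslashreplace")

print(repr(garbled))
print(repr(fixed))
print(garbled == title, fixed == title)
```

If the title were stored in its garbled form, a lookup for "Cem_Özdemir" would presumably not find the page, which might explain the NoneType in the error above. That part is a guess on my side.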