erre-quadro / spikex

SpikeX - SpaCy Pipes for Knowledge Extraction

Incomplete list of categories

Fetzii opened this issue

  • spikex version: 0.5.2
  • Python version: 3.9.7
  • Operating System: Windows 10

Description

I want to get all categories of a page, but most of them are missing from the result.

What I Did

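from spikex.wikigraph import load as wg_load
# The original snippet never defined wg; the wikigraph name below is an assumption.
wg = wg_load("dewiki_core")
page = "Peking_2022"
categories = wg.get_categories(page, distance=1)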

What I get: ['Category:Olympische_Winterspiele_2022']
The output I expect: ['Austragung der Olympischen Winterspiele', 'Olympische Winterspiele 2022', 'Sport (Hebei)', 'Sportveranstaltung 2022', 'Sportveranstaltung in Peking', 'Wikipedia:Veraltet nach Jahr 2022', 'Zukünftige Sportveranstaltung']
Proof: https://de.wikipedia.org/wiki/Olympische_Winterspiele_2022

I built a categorylinks dictionary from categorylinks.sql.gz that maps each page_id to its list of categories. I then used your functions to get the page ID, page_id = self.get_pageid(self.redirect(page)), and looked it up in my dictionary. With this method I get the expected output. If the current behaviour is not intended, I suspect there is a problem with how categorylinks.sql.gz is processed on your side.
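For anyone who wants to reproduce the comparison, here is a minimal sketch of how such a dictionary could be built. It is not the code I actually ran and not how spikex processes the dump. It assumes that cl_from (the page ID) and cl_to (the category title) are the first two columns of every row in the dump, and the names parse_categorylinks and load_categorylinks are made up for illustration. The regex is deliberately simple and could be fooled by a sort key that happens to contain text looking like the start of a row.

```python
import gzip
import re
from collections import defaultdict

# Matches the start of each row tuple in an INSERT statement: (cl_from,'cl_to',
# Assumption: cl_from and cl_to are the first two columns of the dump.
ROW_START = re.compile(r"(?:VALUES |\),)\((\d+),'((?:[^'\\]|\\.)*)',")


def parse_categorylinks(lines):
    """Map page_id -> list of category titles from SQL INSERT lines."""
    links = defaultdict(list)
    for line in lines:
        if not line.startswith("INSERT INTO"):
            continue  # skip schema definitions and comments
        for page_id, category in ROW_START.findall(line):
            # Undo the SQL escaping of single quotes inside titles.
            links[int(page_id)].append(category.replace("\\'", "'"))
    return dict(links)


def load_categorylinks(path):
    """Read a gzipped categorylinks dump and build the dictionary."""
    # Sort keys can contain bytes that are not valid UTF-8, hence errors="replace".
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as fh:
        return parse_categorylinks(fh)
```

The resulting dictionary can then be indexed with the page ID obtained as described above, for example load_categorylinks("categorylinks.sql.gz")[page_id], where the file path is a placeholder.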

I'm facing the same problem with ptwiki_core.