kermitt2 / entity-fishing

A machine learning tool for fishing entities

Home Page: http://nerd.readthedocs.io/

For KB concepts the given "valueName" of statements should be the english label instead of the wikipedia page title

oterrier opened this issue

I find it counterintuitive that the valueName of a statement property is set to the title of the Wikipedia page for KB concepts.
Let's take some examples.
With "human brain" (Q492038), the partial response to a concept lookup gives:

...
Statements:
...
subclass of | Q75865
subclass of | Q25449120
subclass of | Brain
subclass of | Q66589895
...

Q75865 (encephalon), Q25449120 (human organ) and Q66589895 (organ component of neuraxis) appear as raw identifiers because they have no direct Wikipedia page.

I think we should instead provide:

...
Statements:
...
subclass of | encephalon
subclass of | human organ
subclass of | brain
subclass of | organ component of neuraxis
...

So the current valueName would be replaced by the English label of the Wikidata concept, if one exists.

Another example: with "Strasbourg" (Q6602), the partial response to a concept lookup gives:

...
Statements:
...
instance of | Communes of France
instance of | Q1549591
instance of | Capital city
...

Where we could have:

...
Statements:
...
instance of | commune of France
instance of | big city
instance of | capital
...

I can provide a PR if you approve this modification

Regards

Olivier

Thank you Olivier, yes you're right, it's a much better idea to use the Wikidata English label as value name. This is more consistent, but afaik there is no guarantee that every Wikidata entity has an English label, so a fallback to the Wikidata Q identifier will still be necessary.

A PR is very welcome, thanks!
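
To make the fallback concrete, here is a minimal sketch in Python (not the entity-fishing Java code). The function name `value_name`, the `en_labels` table and the identifier `Q999999999999` are assumptions made up for illustration; the labels are the ones quoted in this issue.

```python
from typing import Mapping


def value_name(qid: str, en_labels: Mapping[str, str]) -> str:
    """Return the English label for a Wikidata id, or the id itself if none is known."""
    label = en_labels.get(qid)
    # A missing or empty label is handled the same way: fall back to the Q identifier.
    return label if label else qid


# Toy label table built from the labels quoted in this issue.
en_labels = {
    "Q75865": "encephalon",
    "Q25449120": "human organ",
    "Q66589895": "organ component of neuraxis",
}

# Q999999999999 is a hypothetical identifier standing in for an entity without an English label.
for qid in ("Q75865", "Q999999999999"):
    print("subclass of |", value_name(qid, en_labels))
```

With this shape, the caller never has to care whether a label exists; it always gets something printable.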

Correct me if I'm wrong, but to implement this, do we have to modify the grisp project?

It could be in entity-fishing. You may have seen that the labels are not loaded right now, but the statements are loaded from the Wikidata json dump file directly in entity-fishing, so we could add the labels at the same time, as additional special statements (ok it's hacky).

We could also extend the Concept object to store the labels (at least the English one) and load the labels from the Wikidata json dump too - it will slow down the lmdb creation a bit more, but then everything is done in entity-fishing.

Otherwise, in Grisp we could put the label in the wikidataIds.csv and load the label when we create the concept database.

In #72 I was thinking of using labels as additional terms that can trigger entity candidates, but that requires estimating some usage information, so I postponed the coverage of the labels to study how to do this. For that use case it makes more sense to work in grisp, because it would require some pre-processing/smoothing. However, we can already import the labels in a simple manner first, before addressing that issue.
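
As a rough illustration of reading the labels from the Wikidata json dump, here is a Python sketch (again not the actual entity-fishing loader). It relies on the dump layout where the file is one JSON array with one entity per line and each entity carries a `labels` map keyed by language code. The names `english_label`, `iter_labels` and `sample_dump`, and the identifier `Q999999999999`, are hypothetical.

```python
import json
from typing import Iterable, Iterator, Optional, Tuple


def english_label(entity: dict) -> Optional[str]:
    """Return the English label of a Wikidata entity, or None if it has none."""
    label = entity.get("labels", {}).get("en")
    return label["value"] if label else None


def iter_labels(lines: Iterable[str]) -> Iterator[Tuple[str, Optional[str]]]:
    """Yield (id, English label) pairs from dump lines, one entity per line."""
    for line in lines:
        # Entity lines end with a comma; the first and last lines are the array brackets.
        line = line.strip().rstrip(",")
        if line in ("", "[", "]"):
            continue
        entity = json.loads(line)
        yield entity["id"], english_label(entity)


# Tiny stand-in for a dump file; the second entity is hypothetical and has no English label.
sample_dump = [
    "[",
    '{"id": "Q75865", "labels": {"en": {"language": "en", "value": "encephalon"}}},',
    '{"id": "Q999999999999", "labels": {"fr": {"language": "fr", "value": "exemple"}}}',
    "]",
]
labels = dict(iter_labels(sample_dump))
print(labels)
```

Streaming line by line keeps memory flat, which matters because the real dump is far too large to load as a single JSON document.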

Fixed with #142 without the hacky approach: we create an additional lmdb for the Wikidata labels of the supported languages.

concept look-up for Q492038 ->

Statements:
---
subclass of | encephalon
subclass of | human organ
subclass of | primate brain
subclass of | organ component of neuraxis
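
For readers who want a picture of what "labels of the supported languages" could look like, below is a small Python sketch that uses plain dictionaries as a stand-in for the lmdb. It is not the code from #142; the class `LabelStore`, its methods, and the non-English label values are placeholders invented for this example.

```python
from typing import Dict, Iterable, Optional


class LabelStore:
    """In-memory stand-in for a label database: one map per supported language."""

    def __init__(self, languages: Iterable[str]) -> None:
        self._by_lang: Dict[str, Dict[str, str]] = {lang: {} for lang in languages}

    def add_entity(self, entity: dict) -> None:
        """Keep the entity's labels for supported languages only."""
        for lang, label in entity.get("labels", {}).items():
            if lang in self._by_lang:
                self._by_lang[lang][entity["id"]] = label["value"]

    def get(self, qid: str, lang: str) -> Optional[str]:
        """Return the label for (qid, lang), or None if unknown or unsupported."""
        return self._by_lang.get(lang, {}).get(qid)


store = LabelStore(["en", "fr"])
store.add_entity({
    "id": "Q25449120",
    "labels": {
        "en": {"language": "en", "value": "human organ"},
        # Placeholder values, not real Wikidata labels.
        "fr": {"language": "fr", "value": "exemple-fr"},
        "de": {"language": "de", "value": "beispiel-de"},
    },
})
print(store.get("Q25449120", "en"), store.get("Q25449120", "de"))
```

Returning None for a missing label leaves the choice of fallback (for instance the Q identifier) to the caller.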