kuhumcst / DanNet

The Danish WordNet as an RDF graph.

Home Page: https://wordnet.dk

Missing ontotype

simongray opened this issue

Ok, so the issue relates to data in the 2023 adjective dataset and it seems to be the following:

  1. Ontotypes are not being displayed because multiple composite ontotypes are being attached to a single synset, which breaks the assumptions of the UI.
  2. In one type of case, the existing synset is actually an old synset that already has an ontotype. When this is the case, it shouldn't inherit anything. This has been fixed in 7c737d3.
  3. In the other case, the same sense ID appears multiple times in the adjectives dataset. This produces several identical synset IDs synthesized from the sense IDs, e.g. 21038758.

For the other case, I wanted to solve it by adding -N at the end of the synthesized ID for each dupe, e.g. synset-s21038758-0 and synset-s21038758-1. My first attempt just preprocessed the rows, adding this information as metadata...
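A minimal sketch of the -N suffixing idea, assuming the sense IDs are available as a plain sequence in row order. The function name `suffix-duplicates` is hypothetical, and leaving non-duplicated IDs unsuffixed is an assumption, not something the dataset code is known to do:

```clojure
(defn suffix-duplicates
  "Hypothetical helper: given sense IDs in row order, return synthesized
  synset IDs. IDs that occur more than once get a -N suffix (0-based, in
  order of appearance); unique IDs are left unsuffixed (an assumption)."
  [sense-ids]
  (let [counts (frequencies sense-ids)]
    (first
      (reduce (fn [[out seen] id]
                (let [n         (get seen id 0)
                      synset-id (if (> (counts id) 1)
                                  (str "synset-s" id "-" n)
                                  (str "synset-s" id))]
                  ;; track how many times each ID has been seen so far
                  [(conj out synset-id) (assoc seen id (inc n))]))
              [[] {}]
              sense-ids))))

(suffix-duplicates ["21038758" "21049162" "21038758"])
;; => ["synset-s21038758-0" "synset-s21049162" "synset-s21038758-1"]
```

The suffix depends on row order, so the same input order has to be used everywhere these IDs are synthesized.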

However, this other case is pretty hairy, since the synthesized IDs are generated not only for the sek_id of a particular row, but also for its siblings, so how do I know if the siblings are dupes? I need to know this when finding siblings too.
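One possible way to at least detect dupes among siblings is to compute the set of duplicated sense IDs once, up front, and consult it during sibling lookup. This is only a sketch: `duplicated-ids` is a hypothetical name, and it answers whether a sibling ID is ambiguous, not which of the -N variants the sibling should point to:

```clojure
(defn duplicated-ids
  "Hypothetical helper: return the set of sense IDs that occur more than
  once in the given sequence."
  [sense-ids]
  (into #{}
        (comp (filter (fn [[_ n]] (> n 1)))  ; keep IDs seen more than once
              (map first))                   ; keep only the ID itself
        (frequencies sense-ids)))

;; A set can be called as a predicate when checking a sibling's ID.
(def dupe? (duplicated-ids ["21038758" "21049162" "21038758"]))

(dupe? "21038758") ;; => "21038758" (truthy)
(dupe? "21049162") ;; => nil
```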

At least the hypernyms seem to be distinct from the new adjectives.

(let [rows (read-triples [identity
                            "bootstrap/other/dannet-new/adjectives.tsv"
                            :encoding "UTF-8"
                            :separator \tab
                            :preprocess rest])]
    (set/intersection (set (map #(nth % 5) rows))
                      (set (map #(nth % 7) rows))))

;; => #{""}

A side issue I have discovered is that some of the dannetsemid values in the dataset are not defined in the label dataset from Thomas, yet they do exist in our dataset (e.g. lydig, sense 21049162), so I have to make sure to also check wordsenses.csv when creating these links.

https://wordnet.dk/dannet/data/sense-21049162
https://wordnet.dk/dannet/data/synset-79018
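The fallback check could look roughly like this, assuming both sources have already been loaded into maps keyed by sense ID. The function name, the map shapes, and the placeholder entry are all hypothetical; only the lydig example comes from the data above:

```clojure
(defn lookup-sense
  "Hypothetical helper: look a sense ID up in the primary label data first,
  falling back to data loaded from wordsenses.csv when it is missing."
  [primary-labels wordsenses sense-id]
  (or (get primary-labels sense-id)
      (get wordsenses sense-id)))

;; Placeholder primary data (not real), plus the sense mentioned above.
(def primary-labels {"hypothetical-id" "hypothetical-label"})
(def wordsenses     {"21049162" "lydig"})

(lookup-sense primary-labels wordsenses "21049162")
;; => "lydig"
```

An ID missing from both sources returns nil, so those cases can still be logged or handled separately.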

Hmmm... an unsplit duplicate has appeared here: http://localhost:3456/dannet/data/sense-21086269