wetneb / pynif

A small Python library for NLP Interchange Format (NIF) for NER(D) systems

Order of NIF contexts is random

flackbash opened this issue

I have a function that looks something like this:

from pynif import NIFCollection

def print_nif_contexts(filepath: str):
    # Read the whole NIF file and parse it into a collection.
    with open(filepath, "r", encoding="utf8") as file:
        nif_content = file.read()
    nif_doc = NIFCollection.loads(nif_content)
    # Print the URI of each context in the order pynif returns them.
    for context in nif_doc.contexts:
        print(context.uri)

The order of the contexts is random and differs from run to run.

I would expect the sort order to be the same as in the file from which the contexts are read.

Having the same order in every run would already help, but sorted(nif_doc.contexts) is not supported (TypeError: '<' not supported between instances of 'NIFContext' and 'NIFContext').

Other than that, I find the package really useful so far. Thanks for making it open source :)

That sounds like a very sensible request! If you feel like making a PR for it, I would be happy to merge it and release it.

It looks like I would have to dig a bit deeper to preserve the order of the input file, since the randomness seems to come from rdflib.Graph, if I'm not mistaken.
But if adding support for sorting is fine with you, I can open a PR. I think the most natural sort order for contexts would be by URI. Do you agree?

On second thought: since it is probably not obvious which key contexts should be sorted by (other than the input file order), it might be better for users to supply their own sort key rather than for the library to support sorting contexts directly.

If preserving the order in which the contexts were read is possible, then I guess it is a natural choice. Otherwise, any deterministic order would probably be sensible too :)

@flackbash To understand the problem a bit better, could you describe a case where the order of the contexts helps?

I can only think of testing or debugging scenarios where you want to do a full text-based comparison. Apart from that, the good thing about semantic data is that it is meaningful by itself: the order in which you define the contexts should not matter, as they are referenced by IRIs. If we did want to preserve the order of the contexts, we would have to create an rdf:List to represent the order of the different Contexts of a Collection. However, I doubt this would conform to the NIF 2.1 specification.

Another option would be to stop building an rdflib Graph before serializing. The Graph is where the order of the contexts is lost, because the contexts are loaded into an in-memory graph before they are serialized. However, dropping rdflib would make the library more complex, as we would need to implement the serialization for each format by hand, including removing duplicate triples, if any. The upside of avoiding rdflib would be preserving the order and improving performance.

My suggestion is that, if the order is critical, you use the function triples() to generate each triple from the collection/context/phrase (this preserves the order) and then serialise them manually in your format of interest. The same goes for loading: instead of using NIFCollection.load(), you can write a function that builds the NIFCollection from the contents of the file. If you want to go in this direction, I recommend using JSON-LD, as the mapping to pyNIF instances would be much easier.
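A minimal sketch of that manual route, for illustration only. It assumes each term in a triple has an n3() method, as rdflib terms do, and writes one N-Triples-style line per triple in the order received, skipping duplicates. The Term class and the example.org IRIs are hypothetical stand-ins so the sketch runs without rdflib or pynif.

```python
from typing import Iterable, Iterator, Tuple

Triple = Tuple[object, object, object]


def ordered_ntriples(triples: Iterable[Triple]) -> Iterator[str]:
    """Yield one N-Triples-style line per triple, in input order, skipping repeats."""
    seen = set()
    for s, p, o in triples:
        # Assumption: each term has an n3() method, as rdflib terms do.
        line = f"{s.n3()} {p.n3()} {o.n3()} ."
        if line not in seen:
            seen.add(line)
            yield line


class Term:
    """Hypothetical stand-in for an rdflib URIRef, so the sketch runs on its own."""

    def __init__(self, iri: str):
        self.iri = iri

    def n3(self) -> str:
        return f"<{self.iri}>"


P = Term("http://example.org/p")
demo_triples = [
    (Term("http://example.org/doc2"), P, Term("http://example.org/a")),
    (Term("http://example.org/doc1"), P, Term("http://example.org/b")),
    (Term("http://example.org/doc2"), P, Term("http://example.org/a")),  # duplicate
]
demo_lines = list(ordered_ntriples(demo_triples))
print("\n".join(demo_lines))
```

With pynif, the input would presumably be the output of triples() on the collection, as described above; that combination is untested here.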

If I remember correctly, my incentive was partly debugging: I was generating NIF output files and wanted to be able to compare them easily (e.g. using diff), and in particular to check whether two files have the same content.
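For the "same content" check, an order-insensitive comparison may be enough. The sketch below assumes both outputs were written as N-Triples (one statement per line) by the same serializer and contain no blank nodes, so that equal statements are also equal strings. The function name same_triples and the sample data are made up for illustration.

```python
def same_triples(nt_a: str, nt_b: str) -> bool:
    """Compare two N-Triples documents as sets of statement lines."""

    def statements(text: str) -> set:
        # Ignore blank lines and comment lines; the order of lines is irrelevant.
        return {
            line.strip()
            for line in text.splitlines()
            if line.strip() and not line.lstrip().startswith("#")
        }

    return statements(nt_a) == statements(nt_b)


first = (
    '<http://example.org/a> <http://example.org/p> "x" .\n'
    '<http://example.org/b> <http://example.org/p> "y" .\n'
)
# Same statements, different order.
second = (
    '<http://example.org/b> <http://example.org/p> "y" .\n'
    '<http://example.org/a> <http://example.org/p> "x" .\n'
)
# One literal differs.
third = (
    '<http://example.org/a> <http://example.org/p> "x" .\n'
    '<http://example.org/b> <http://example.org/p> "z" .\n'
)

print(same_triples(first, second))
print(same_triples(first, third))
```

If blank nodes are involved, a line-based comparison is not reliable; rdflib's rdflib.compare.isomorphic is the more robust tool in that case.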

Another use case was that I was reading a benchmark from a NIF file containing several articles, and while transforming the benchmark to a different format (JSONL) I wanted to keep the articles in the order in which the benchmark is commonly displayed, to avoid confusion.

I ended up just using the context URI as the sort key when iterating over all contexts, which is sufficient for my use case. I also understand that the contexts are not really intended to be sorted.
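For anyone landing here, that workaround can be sketched roughly as follows. The Context class is a hypothetical stand-in for pynif's NIFContext (only its uri attribute is used), so the snippet runs without pynif installed; with the real library, nif_doc.contexts would be passed instead.

```python
from dataclasses import dataclass


@dataclass
class Context:
    """Hypothetical stand-in for pynif's NIFContext; only uri matters here."""
    uri: str


def contexts_by_uri(contexts):
    """Return the contexts as a list ordered by the string form of their URI."""
    return sorted(contexts, key=lambda context: str(context.uri))


demo_contexts = [
    Context("http://example.org/doc3"),
    Context("http://example.org/doc1"),
    Context("http://example.org/doc2"),
]
ordered_uris = [context.uri for context in contexts_by_uri(demo_contexts)]
print(ordered_uris)
```

This gives the same order on every run, but it is plain string ordering: a URI ending in doc10 sorts before one ending in doc2. It also does not recover the order of the input file.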