AstraZeneca / KAZU

Fast, world class biomedical NER

Home Page:https://AstraZeneca.github.io/KAZU/

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

find a solution for mapping_types in SynonymTerm

RichJackson opened this issue · comments

getting rid of mapping types would be a useful thing to do for many reasons:

  1. no one really cares about them it seems
  2. they complicate hashing behaviour of SynonymTerm and thus SynonymDB can have inconsistent behaviour and results

Originally, we had them as a term can have multiple disjointed mapping types in a KB.

original comment

          the `get_all` call 'smells' wrong to me here for a couple of reasons:
  1. We're having to iterate over all the synonym terms for every annotation here.
  2. Within the get_all call, we get the synonym term associated with the term_norm, but don't capture it here, and then have to recalculate it within drop_equivalent_id_set_from_synonym_term when we do synonym_term = self._syns_database_by_syn[name][synonym].

Is there a problem I'm missing with replacing:

        self._associated_id_sets_by_id: Dict[
            ParserName, Dict[str, Set[FrozenSet[EquivalentIdSet]]]
        ] = {}

with

        self._synonym_terms_by_id: Dict[
            ParserName, Dict[str, Set[SynonymTerm]]
        ] = {}

?

If we do that, then this function doesn't need to loop over all the synonym terms for the parser, and can just be:

    def drop_equivalent_id_set_containing_id_from_all_synonym_terms(
        self, name: ParserName, id_to_drop: Idx
    ) -> Tuple[int, int]:
        """
        remove all EquivalentIdSet's that contain this id from all SynonymTerms
        in the database

        :param name:
        :param id_to_drop:
        :return:
        """

        terms_modified = 0
        terms_dropped = 0

        syn_term: SynonymTerm
        for syn_term in self.get_associated_syn_terms_for_id(name, id_to_drop):
            new_id_sets = frozenset(  # or set( if we leave it as-is
                equiv_id_set
                for equiv_id_set in syn_term.associated_id_sets
                if id_to_drop not in equiv_id_set.ids
            )

            result = self._modify_or_drop_synonym_term_after_id_set_change(
                new_id_sets, name, syn_term
            )

            if result == DBModificationResult.ID_SET_MODIFIED:
                terms_modified += 1
            elif result == DBModificationResult.SYNONYM_TERM_DROPPED:
                terms_dropped += 1

        return terms_modified, terms_dropped

Although doing that makes me wonder if we actually need both this new self._synonym_terms_by_id and the existing self._syns_by_aggregation_strategy, since that also goes from an id to some data associated with a set of SynonymTerms (their term_norms). I wonder if we could use a single data structure here - but equally that might trade off a small-ish amount of space for a bit of performance, so not sure how bothered we would be.

Original comment from @EFord36