delph-in / matrix

The Grammar Matrix

Home Page:https://matrix.ling.washington.edu/index.html

Lexical Multiple Inheritance Broken

Ubadub opened this issue

I believe this PR introduced a bug in the validate script that appears when specifying multiple supertypes on a noun. I also believe I have a fix and would like to run it by @ltxom (the committer of that PR), @emilymbender, and anyone else working on this.

Bug description

Attempting to save a questionnaire (or upload a choices file) with a lexical type that has multiple supertypes will result in this error message:

[Screenshot of the error message, 2023-01-30 at 1:52:37 PM]

which has this Python stack trace:

[Screenshot of the Python stack trace, 2023-01-30 at 1:53:44 PM]

Steps to reproduce

A choices file that will reproduce this bug can be found here: choices_supertype_bugreport.txt

Alternatively, it can be reproduced with any choices file that has multiple inheritance for a lexical type, or on the website itself, by selecting multiple supertypes from the supertype dropdown selection box.

Suspected Cause

PR 647 introduced the following function to utils.py (direct link):

def recursive_get_supertypes_features(ch, lt, parent_features_list):
    feats = ch.get(lt).get('feat')
    for feat in feats:
        parent_features_list.append(feat)
    lt_supertypes = ch.get(lt).get('supertypes')
    if len(lt_supertypes) != 0:
        recursive_get_supertypes_features(ch, lt_supertypes, parent_features_list)

The bug occurs because ch.get(lt).get('supertypes') returns a string, not a list: for any lexical type with multiple supertypes, it is a comma-separated list of their names. Everywhere else in the repo that needs this functionality, the call is followed by .split(', ') and the resulting list is iterated over. The code above instead assumes there is only one supertype and passes the whole string to the recursive call as if it were a single type name.
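The failure mode can be illustrated without any Matrix code. The following is a minimal sketch that assumes a plain dict as a stand-in for the choices object; the type names noun1 and noun2 and the variable names are made up for illustration and are not taken from the repo.

```python
# Hypothetical stand-in for the choices data: a plain dict keyed by type name.
choices = {"noun1": {}, "noun2": {}}

# What the 'supertypes' value looks like for a type with two supertypes.
raw_supertypes = "noun1, noun2"

# Treating the whole string as one type name finds nothing.
found_as_single_key = raw_supertypes in choices
print(found_as_single_key)

# Splitting first yields the individual type names, each of which exists.
supertypes = raw_supertypes.split(", ")
found_after_split = all(st in choices for st in supertypes)
print(supertypes, found_after_split)
```

With a single supertype the string and the type name coincide, which would explain why the problem only shows up under multiple inheritance.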

However, it’s unclear to me why this function was implemented this way, or why it was added to utils.py specifically. It is called in only one place, in the validate() function of linglib/lexicon.py, here, but that file already has a function called get_all_supertypes which could be used to implement the same functionality without duplication, and also without recursion, which can be brittle in Python.

Since get_all_supertypes() does this iteratively, while keeping track of already-visited supertypes, there is no need to check beforehand for the presence of cycles.

We can't just edit the function in place to use get_all_supertypes() because we can't import lexicon.py into utils.py without creating a circular dependency, but there’s no reason to have the function in utils.py in the first place as far as I can tell.

It's also unclear to me why the same check occurs three times in the file. Based on the commit history and the comments on PR #647, it looks like it was known that the check already existed for verbs and nouns. However, when the check was added for all lexical types, the existing checks for verbs and nouns were not deleted, so the check now happens twice each for verbs and nouns, and there are at least four different sections of code all walking the supertype hierarchy to build a list of supertypes. Perhaps I am missing a reason for this, but if not, perhaps the duplicate code should be deleted.

Proposed Fix

A simple fix, which I have implemented locally and which as far as I can tell fixes the bug, is to move recursive_get_supertypes_features (which I have renamed get_all_supertypes_features) to lexicon.py and amend it to use get_all_supertypes to get a list of all supertypes, and then to build and return a list of all features of those supertypes. The calling code then puts the feature names of the parents and of the child into separate set instances and checks whether they overlap with set.isdisjoint.
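The shape of that fix can be sketched as follows. This is an illustration under stated assumptions, not the code from the local fix: the choices data is modeled as a plain dict, collect_supertypes is a hypothetical stand-in for the repo's get_all_supertypes (whose actual signature is not shown in this issue), and the type and feature names are invented.

```python
def split_supertypes(raw):
    """Turn a comma-separated supertypes string into a list of names."""
    return [name for name in raw.split(", ") if name]


def collect_supertypes(choices, lt):
    """Hypothetical stand-in for get_all_supertypes: iterative walk
    that records visited types, so cycles cannot loop forever."""
    seen = set()
    stack = split_supertypes(choices[lt].get("supertypes", ""))
    while stack:
        st = stack.pop()
        if st in seen:
            continue
        seen.add(st)
        stack.extend(split_supertypes(choices.get(st, {}).get("supertypes", "")))
    return seen


def get_all_supertypes_features(choices, lt):
    """Names of all features declared on any supertype of lt."""
    names = set()
    for st in collect_supertypes(choices, lt):
        for feat in choices.get(st, {}).get("feat", []):
            names.add(feat["name"])
    return names


# Invented example data, plain dicts instead of the real choices object.
choices = {
    "noun1": {"supertypes": "", "feat": [{"name": "person"}]},
    "noun2": {"supertypes": "noun1", "feat": [{"name": "number"}]},
    "noun3": {"supertypes": "noun1, noun2", "feat": [{"name": "person"}]},
    "noun4": {"supertypes": "noun1, noun2", "feat": [{"name": "case"}]},
}

inherited3 = get_all_supertypes_features(choices, "noun3")
own3 = {f["name"] for f in choices["noun3"]["feat"]}
own4 = {f["name"] for f in choices["noun4"]["feat"]}

# noun3 redeclares a feature name found on a supertype; noun4 does not.
print(own3.isdisjoint(inherited3))
print(own4.isdisjoint(get_all_supertypes_features(choices, "noun4")))
```

Using sets keeps the overlap test to a single isdisjoint call and makes the result independent of the order in which supertypes are visited.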

I can submit a pull request. What's the procedure for doing that? As I am not an approved contributor to this repo, would the correct procedure be to fork the repo first?

I should note that my fix seems like a bandaid; a lot of this code looks like it could be refactored in such a way as to fix this bug and eliminate a great deal of redundancy. For example, perhaps there could be a single procedure at the beginning of the function where, for every lexical type, a dictionary is constructed mapping each lexical type to a list of its own features and a list of its inherited features. Similarly, a data structure could be built to store lists of the supertypes of all lexical types. Then those data structures could be used everywhere, rather than having three or four different sections of code that all construct the same or similar data structures only to clear them out immediately after use (a dictionary of this sort is already used in the code, but it gets used only in the noun and verb loops, and is cleared in between them).
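As a rough sketch of what such a one-time precomputation could look like, the following builds both tables in a single pass over the lexical types. It again assumes plain dicts rather than the real choices object, and build_type_tables and the example type names are hypothetical.

```python
def build_type_tables(choices):
    """Return (all_supertypes, features) for every lexical type.

    all_supertypes maps a type name to the set of all its ancestors;
    features maps a type name to its own and inherited feature names.
    """
    # Direct supertypes, split once from the comma-separated strings.
    direct = {
        lt: [s for s in entry.get("supertypes", "").split(", ") if s]
        for lt, entry in choices.items()
    }

    all_supertypes = {}
    for lt in direct:
        seen = set()
        stack = list(direct[lt])
        while stack:  # iterative walk; 'seen' guards against cycles
            st = stack.pop()
            if st in seen:
                continue
            seen.add(st)
            stack.extend(direct.get(st, []))
        all_supertypes[lt] = seen

    features = {}
    for lt, entry in choices.items():
        own = {f["name"] for f in entry.get("feat", [])}
        inherited = set()
        for st in all_supertypes[lt]:
            inherited |= {f["name"] for f in choices.get(st, {}).get("feat", [])}
        features[lt] = {"own": own, "inherited": inherited}
    return all_supertypes, features


# Invented example data.
choices = {
    "noun1": {"supertypes": "", "feat": [{"name": "person"}]},
    "noun2": {"supertypes": "noun1", "feat": [{"name": "number"}]},
    "noun3": {"supertypes": "noun1, noun2", "feat": [{"name": "case"}]},
}

all_supertypes, features = build_type_tables(choices)
print(all_supertypes["noun3"])
print(features["noun3"])
```

Each later check could then read from these tables instead of walking the hierarchy again, which is the redundancy the paragraph above describes.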

Also, a unit or regression test should probably be added to check multiple inheritance works, if it doesn't already exist (if it does, it should be fixed to detect this issue).

Thank you for this. @ltxom if you have time to look into this please run the regression tests on the proposed fix before committing.

Thank you, Abhinav. I think you can fork the Matrix's repo and commit your changes there. Then, you can submit a pull request (PR) from your fork to the trunk of this repo. Once you have the PR, I can run the regression tests on them.

Tests for validation go in the unit tests subdirectory, rather than the regression tests. If you see a test that could be added, that would be great.

Thanks Emily, I saw ltxom's instructions on unit testing on the PR. Regarding the regression tests, however, I see that the regression tests folder already contains choices files that have multiple inheritance. From a quick skim of the rtest.py file, it seems the issue is that the script, when running the regression tests, skips straight to customizing the grammars without running the validation script. I'm not sure if that's intentional, but perhaps adding a few lines to the customize workflow to validate before customizing would help catch errors like this in the future?

No, I don't think we want to add a validation check to the regression tests --- at least not without the time to make sure the validation system isn't spuriously ruling out those choices files that should work.