nikitakit / self-attentive-parser

High-accuracy NLP parser with models for 11 languages.

Home Page: https://parser.kitaev.io/

`._.labels` doesn't work for spans with length of one

LawlAoux opened this issue · comments

For some reason, when the span has a length of one, `._.labels` returns an empty tuple. I would expect it to return the part of speech of the individual word (which can be obtained by taking the single token in the span and reading its `tag_` attribute).

Reproduction:

```python
import spacy, benepar

nlp = spacy.load('en_core_web_md')
nlp.add_pipe("benepar", config={"model": "benepar_en3"})
doc = nlp("Tuesday morning")
sent = tuple(doc.sents)[0]
first_child = tuple(sent._.children)[0]
pos = first_child._.labels
```

With this code, `pos` is an empty tuple, but I would expect it to equal `first_child[0].tag_`, which is `"NNP"`.

commented

I encountered the same problem. I couldn't even iterate through `._.parse_string`, as it is a complicated nested structure with parentheses.

@burak0006 @LawlAoux
You can make use of the less complicated version of the parsed string at the leaf to solve this issue.

```python
all_tokens = self.span_obj._.parse_string.split("(")
label = all_tokens[1].split(" ")[0]
```

E.g.:

(image: example parse tree)

Here, the parse strings at the leaves are:

  • (NN Stock)
  • (NNS prices)
  • (VBD soared)
  • …
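To make that trick concrete, here is a minimal sketch that applies the same split to a literal leaf string (the leaf is copied from the example above):

```python
# The parse string of a leaf constituent looks like "(TAG word)".
leaf = "(NN Stock)"
all_tokens = leaf.split("(")         # -> ["", "NN Stock)"]
label = all_tokens[1].split(" ")[0]  # take the tag before the first space
print(label)  # NN
```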

I also had the same problem. Is there some kind of conversion to CNF along the way that causes the API to go bonkers? The only working solution I could come up with is that which @anmolagarwal999 suggested, but it is unfortunate to have to parse a string constructed of a sentence that is already parsed. :/ A better API is warranted in my opinion.

If you pass the parsed_sentence string into this function it will give you an appropriate tree structure.

```python
import re

# Adapted from https://stackoverflow.com/questions/54959875/recursive-parentheses-parser-for-expressions-of-strings
def parse_tree(sentence):
    stack = []  # or a `collections.deque()` object, which is a little faster
    top = items = []
    for token in filter(None, re.compile(r'(?:([()])|\s+)').split(sentence)):
        if token == '(':
            stack.append(items)
            items.append([])
            items = items[-1]
        elif token == ')':
            if not stack:
                raise ValueError("Unbalanced parentheses")
            items = stack.pop()
        else:
            items.append(token)
    if stack:
        raise ValueError("Unbalanced parentheses")
    return top
```
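To see the shape this produces, here is a quick self-contained run (the function is inlined so the snippet executes on its own; the example sentence is made up):

```python
import re

def parse_tree(sentence):
    stack = []
    top = items = []
    for token in filter(None, re.compile(r'(?:([()])|\s+)').split(sentence)):
        if token == '(':
            stack.append(items)   # remember the parent level
            items.append([])      # open a new child list
            items = items[-1]
        elif token == ')':
            if not stack:
                raise ValueError("Unbalanced parentheses")
            items = stack.pop()   # return to the parent level
        else:
            items.append(token)
    if stack:
        raise ValueError("Unbalanced parentheses")
    return top

tree = parse_tree("(S (NP (NN Stock) (NNS prices)) (VP (VBD soared)))")
print(tree)
# [['S', ['NP', ['NN', 'Stock'], ['NNS', 'prices']], ['VP', ['VBD', 'soared']]]]
```

Note that the result is wrapped in one extra outer list, so the root node is `tree[0]`.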

This is a tree so it's not convenient to get stuff out of it. Here is an XPath-like function which you can use to query the structure.

```python
def find_pos(tree, pos):
    result = []
    if not isinstance(tree[0], str):
        result = [find_pos(subtree, pos) for subtree in tree]
    else:
        pos_parts = pos.split("/")
        if re.match(pos_parts[0], tree[0], flags=re.IGNORECASE):
            if len(pos_parts) == 1:
                return tree[1]
            else:
                result = [find_pos(subtree, "/".join(pos_parts[1:])) for subtree in tree[1:]]
    if len(result) == 0:
        return None
    result = [f for f in result if f is not None]
    if len(result) == 0:
        return None
    elif len(result) == 1:
        return result[0]
    else:
        return result
```

You provide the (re-)parsed tree and the desired part of speech (as a string, case insensitive), but you have to specify the path from the root. For example, if your sentence is an S > VP kind of sentence, then getting the verb(s) should look like `find_pos(command, 'VP/VB')`, and if there is a noun associated with that, `find_pos(command, 'VP/NP/NN.*')` should do. If you want to get prepositional nouns ("go to the store"), you can also use `find_pos(command, 'VP/PP/NP/NN.*')`. Slashes separate the tree levels you want to iterate through, but the expressions between the slashes can be complex regexes too! This allows some cleverness if you're careful with it.

Since I use regular expressions, you have to `import re` to use this code. Enjoy!
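Here is a self-contained check of those path queries, with `find_pos` lightly cleaned and the tree hand-written as a literal in the nested-list shape `parse_tree` returns (root node unwrapped):

```python
import re

def find_pos(tree, pos):
    result = []
    if not isinstance(tree[0], str):
        # A bare list of subtrees: search each one.
        result = [find_pos(subtree, pos) for subtree in tree]
    else:
        pos_parts = pos.split("/")
        if re.match(pos_parts[0], tree[0], flags=re.IGNORECASE):
            if len(pos_parts) == 1:
                return tree[1]  # leaf level reached: return the word
            # Descend one level with the rest of the path.
            result = [find_pos(subtree, "/".join(pos_parts[1:]))
                      for subtree in tree[1:]]
    result = [f for f in result if f is not None]
    if len(result) == 0:
        return None
    if len(result) == 1:
        return result[0]
    return result

# Parse of "Stock prices soared", written by hand for the demo
tree = ['S', ['NP', ['NN', 'Stock'], ['NNS', 'prices']], ['VP', ['VBD', 'soared']]]

print(find_pos(tree, 'S/VP/VB.*'))  # soared
print(find_pos(tree, 'S/NP/NN.*'))  # ['Stock', 'prices']
```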

Given any span, you can use the function below to get its labels:

```python
from typing import Tuple

def get_span_labels(span) -> Tuple[str, ...]:  # span: spacy.tokens.Span
    labels = span._.labels
    if len(labels) == 0:
        doc = span.doc
        start, end = span.start, span.end
        assert start + 1 == end  # only single-token spans lack labels
        labels = (doc[start].tag_,)
        # constituent_data = doc._._constituent_data
        # labels_index = (
        #     (constituent_data.starts == start) * (constituent_data.ends == end)
        # ).argmax()
        # labels = constituent_data.label_vocab[labels_index]
    return labels
```

Below is a portion of benepar's `parse_string()` implementation.

```python
        label = label_vocab[label_idx]
        if (i + 1) >= j:
            token = doc[i]
            s = (
                "("
                + u"{} {}".format(token.tag_, token.text)
                .replace("(", "-LRB-")
                .replace(")", "-RRB-")
                .replace("{", "-LCB-")
                .replace("}", "-RCB-")
                .replace("[", "-LSB-")
                .replace("]", "-RSB-")
                + ")"
            )
```

Here `label` is an empty tuple, but `._.parse_string` uses `token.tag_` as the label.
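The leaf-formatting step above can be reproduced in plain Python to see where the tag comes from (the helper name and the literal tag/text values are made up for this sketch):

```python
def leaf_string(tag, text):
    # Mirrors the bracket escaping in the parse_string() excerpt above:
    # brackets are replaced by their PTB escape codes.
    s = "{} {}".format(tag, text)
    for src, dst in [("(", "-LRB-"), (")", "-RRB-"),
                     ("{", "-LCB-"), ("}", "-RCB-"),
                     ("[", "-LSB-"), ("]", "-RSB-")]:
        s = s.replace(src, dst)
    return "(" + s + ")"

print(leaf_string("NNP", "Tuesday"))  # (NNP Tuesday)
print(leaf_string("NN", "a(b)"))      # (NN a-LRB-b-RRB-)
```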

  • Workaround 1
    Instead of `._.labels`, use the function below.

```python
def get_labels(span):
    return span._.labels or (span[0].tag_,)
```
  • Workaround 2
    Override the installed extensions.

```python
org_span_labels = spacy.tokens.Span.remove_extension('labels')

def get_labels(span):
    # remove_extension returns (default, method, getter, setter);
    # index 2 is the original getter registered by benepar.
    return org_span_labels[2](span) or (span[0].tag_,)

spacy.tokens.Span.set_extension('labels', getter=get_labels)

spacy.tokens.Token.remove_extension('labels')
spacy.tokens.Token.set_extension(
    'labels',
    getter=lambda token: get_labels(token.doc[token.i: token.i + 1])
)
```