```._.labels``` doesn't work for spans with length of one
LawlAoux opened this issue · comments
For some reason, when the span has a length of one, `._.labels` returns an empty tuple. I would expect it to return the part of speech of the individual word (which can be obtained by taking the word's token from the span and reading its `tag_`).
Reproduction:
import spacy, benepar
nlp = spacy.load('en_core_web_md')
nlp.add_pipe("benepar", config={"model": "benepar_en3"})
doc = nlp("Tuesday morning")
sent = tuple(doc.sents)[0]
first_child = tuple(sent._.children)[0]
pos = first_child._.labels
With this code, `pos` is an empty tuple, but I would expect it to equal `first_child[0].tag_`, which is "NNP".
I encountered the same problem. I couldn't even iterate through `._.parse_string`, since it is a complicated nested structure with parentheses.
@burak0006 @LawlAoux
You can make use of the less complicated version of the parsed string at the leaf to solve this issue.
all_tokens = self.span_obj._.parse_string.split("(")
label = all_tokens[1].split(" ")[0]
Here, the parsed strings at the leaves are:
- (NN Stock)
- (NNS prices)
- (VBD soared)
- ........
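The split-based approach above can be wrapped in a small helper. This is only a sketch: `leaf_label` is a hypothetical name, and the input strings are written by hand in the same shape as the leaf strings listed above rather than produced by benepar.

```python
def leaf_label(parse_string: str) -> str:
    """Return the first label in a bracketed parse string such as "(NN Stock)"."""
    # Splitting on "(" leaves an empty string first, then "NN Stock)".
    all_tokens = parse_string.split("(")
    # The label is everything up to the first space of that second piece.
    return all_tokens[1].split(" ")[0]


print(leaf_label("(NN Stock)"))     # NN
print(leaf_label("(VBD soared)"))   # VBD
```

For a longer span the same code returns the outermost label, so `leaf_label("(NP (NNP Tuesday) (NN morning))")` gives `NP`.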
I also had the same problem. Is there some kind of conversion to CNF along the way that causes the API to go bonkers? The only working solution I could come up with is that which @anmolagarwal999 suggested, but it is unfortunate to have to parse a string constructed of a sentence that is already parsed. :/ A better API is warranted in my opinion.
If you pass the parsed_sentence string into this function it will give you an appropriate tree structure.
# This was adapted from https://stackoverflow.com/questions/54959875/recursive-parentheses-parser-for-expressions-of-strings
import re

def parse_tree(sentence):
    stack = []  # or a `collections.deque()` object, which is a little faster
    top = items = []
    # Split on parentheses (kept as tokens) and on whitespace (dropped).
    for token in filter(None, re.compile(r'(?:([()])|\s+)').split(sentence)):
        if token == '(':
            # Open a new nested list and descend into it.
            stack.append(items)
            items.append([])
            items = items[-1]
        elif token == ')':
            if not stack:
                raise ValueError("Unbalanced parentheses")
            # Close the current list and return to its parent.
            items = stack.pop()
        else:
            items.append(token)
    if stack:
        raise ValueError("Unbalanced parentheses")
    return top
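To make the return shape concrete: for the string `(NP (NNP Tuesday) (NN morning))`, `parse_tree` yields nested lists in which each list starts with its label. The sketch below hard-codes that result and walks it to collect the leaves; `iter_leaves` is a hypothetical helper, not part of benepar or spaCy.

```python
def iter_leaves(tree):
    """Yield (tag, word) pairs from nested lists shaped like parse_tree's output."""
    for node in tree:
        if not isinstance(node, list):
            continue  # a bare label string such as "NP"
        if len(node) == 2 and all(isinstance(x, str) for x in node):
            yield (node[0], node[1])  # a leaf: [tag, word]
        else:
            yield from iter_leaves(node)  # an inner constituent


# Hand-written copy of parse_tree("(NP (NNP Tuesday) (NN morning))").
tree = [["NP", ["NNP", "Tuesday"], ["NN", "morning"]]]
print(list(iter_leaves(tree)))  # [('NNP', 'Tuesday'), ('NN', 'morning')]
```

A single-token span parses to `[["NNP", "Tuesday"]]`, so the same walk returns its tag as well.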
The result is a tree of nested lists, so it's not convenient to get things out of it directly. Here is an XPath-like function which you can use to query the structure.
def find_pos(tree, pos):
    result = []
    if not isinstance(tree[0], str):
        result = [find_pos(subtree, pos) for subtree in tree]
    else:
        pos_parts = pos.split("/")
        if re.match(pos_parts[0], tree[0], flags=re.IGNORECASE):
            if len(pos_parts) == 1:
                return tree[1]
            else:
                result = [find_pos(subtree, "/".join(pos_parts[1:])) for subtree in tree[1:]]
    if len(result) == 0:
        return None
    result = [f for f in result if f is not None]
    if len(result) == 0:
        return None
    elif len(result) == 1:
        return result[0]
    else:
        return result
You provide the (re-)parsed tree and the desired part of speech (as a string, case-insensitive), but you have to specify the path from the root. For example, if your sentence is an S > VP kind of sentence, then getting the verb(s) should look like this: `find_pos(command, 'VP/VB')`, and if there is a noun associated with that, `find_pos(command, 'VP/NP/NN.*')` should do. If you want to get prepositional nouns (go to the store), you can also use `find_pos(command, 'VP/PP/NP/NN.*')`. Slashes separate the tree levels you want to descend through, and each expression between the slashes can itself be a regular expression. This allows some cleverness if you're careful with it.
Since I use regular expressions, you have to `import re` to use this code. Enjoy!
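One detail worth knowing when writing paths: `find_pos` compares each path segment with `re.match`, which anchors only at the start of the label, so a segment matches any label that begins with it. The sketch below isolates that comparison in a hypothetical helper, `label_matches`, to show the effect.

```python
import re


def label_matches(pattern: str, label: str) -> bool:
    """Mirror the comparison find_pos applies to one path segment."""
    return re.match(pattern, label, flags=re.IGNORECASE) is not None


print(label_matches("VB", "VBD"))    # True: "VB" is a prefix of "VBD"
print(label_matches("VB$", "VBD"))   # False: "$" forces an exact match
print(label_matches("nn.*", "NNS"))  # True: matching is case-insensitive
```

So `'VP/VB'` also picks up `VBD`, `VBZ` and so on, while `'VP/VB$'` restricts the last step to plain `VB`.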
Given any span, you can use this function to get its labels:
from typing import Tuple
from spacy.tokens import Span

def get_span_labels(span: Span) -> Tuple[str, ...]:
    labels = span._.labels
    if len(labels) == 0:
        # No constituent labels: expected only for single-token spans,
        # so fall back to the token's tag.
        doc = span.doc
        start, end = span.start, span.end
        assert start + 1 == end
        labels = (doc[start].tag_,)
        # constituent_data = doc._._constituent_data
        # labels_index = (
        #     (constituent_data.starts == start) * (constituent_data.ends == end)
        # ).argmax()
        # labels = constituent_data.label_vocab[labels_index]
    return labels
Below is a portion of the parse_string() function.
label = label_vocab[label_idx]
if (i + 1) >= j:
    token = doc[i]
    s = (
        "("
        + u"{} {}".format(token.tag_, token.text)
        .replace("(", "-LRB-")
        .replace(")", "-RRB-")
        .replace("{", "-LCB-")
        .replace("}", "-RCB-")
        .replace("[", "-LSB-")
        .replace("]", "-RSB-")
        + ")"
    )
For a single-token span, `label` is an empty tuple, but `._.parse_string` shows `token.tag_` as the tag.
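The leaf formatting in that excerpt can be reproduced without spaCy to see what a single-token parse string looks like. This is a sketch under assumptions: `format_leaf` is a hypothetical standalone function that takes the tag and text as plain strings, and it mirrors only the lines quoted above, not the rest of benepar's `parse_string()`.

```python
# Bracket characters and their escapes, in the order the excerpt applies them.
ESCAPES = (
    ("(", "-LRB-"), (")", "-RRB-"),
    ("{", "-LCB-"), ("}", "-RCB-"),
    ("[", "-LSB-"), ("]", "-RSB-"),
)


def format_leaf(tag: str, text: str) -> str:
    """Build "(TAG text)" with brackets in the tag and text escaped."""
    body = "{} {}".format(tag, text)
    for raw, escaped in ESCAPES:
        body = body.replace(raw, escaped)
    return "(" + body + ")"


print(format_leaf("NNP", "Tuesday"))  # (NNP Tuesday)
print(format_leaf("-LRB-", "("))      # (-LRB- -LRB-)
```

Because brackets inside a token are escaped, the only literal parentheses in a leaf string are the outer pair, which is why splitting on `(` or a simple parenthesis parser can recover the tag.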
- Workaround 1
Instead of `._.labels`, use the function below.
def get_labels(span):
    return span._.labels or (span[0].tag_,)
- Workaround 2
Override the installed extensions.
# remove_extension returns a (default, method, getter, setter) tuple,
# so index 2 is the original getter.
org_span_labels = spacy.tokens.Span.remove_extension('labels')
def get_labels(span):
    return org_span_labels[2](span) or (span[0].tag_,)
spacy.tokens.Span.set_extension('labels', getter=get_labels)
spacy.tokens.Token.remove_extension('labels')
spacy.tokens.Token.set_extension(
    'labels',
    getter=lambda token: get_labels(token.doc[token.i: token.i+1])
)
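Both workarounds rely on the same Python behaviour: an empty tuple is falsy, so `labels or fallback` evaluates to the fallback only when no labels were found. The sketch below demonstrates this with stand-in objects so it runs without spaCy or benepar; `FakeSpan` and the token namespaces are hypothetical and imitate only the attributes that `get_labels` touches.

```python
from types import SimpleNamespace


def get_labels(span):
    # Same expression as Workaround 1.
    return span._.labels or (span[0].tag_,)


class FakeSpan(list):
    """Stand-in for a spaCy Span: a list of tokens plus a `._` namespace."""

    def __init__(self, tokens, labels):
        super().__init__(tokens)
        self._ = SimpleNamespace(labels=labels)


tuesday = SimpleNamespace(tag_="NNP")
morning = SimpleNamespace(tag_="NN")

single = FakeSpan([tuesday], labels=())                # what the issue reports
phrase = FakeSpan([tuesday, morning], labels=("NP",))  # labels already present

print(get_labels(single))  # ('NNP',)
print(get_labels(phrase))  # ('NP',)
```

Spans that already have constituent labels are returned unchanged, so the fallback only affects the single-token case described in this issue.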