Children in c2d.py when converting Sejong Corpus
j-min opened this issue · comments
In `c2d.py`, `find_gov` determines the 'HEAD' field of the CoNLL-U format for each node using four rules, including the head-final rule. The children it traverses are set up in `make_edge`.
Could you please explain the usage of `lchild` and `rchild`?
It seems that a node can have only two children, and that a child node is attached to its parent as `lchild` by default.
But I couldn't understand
- what `lchild` and `rchild` mean in the parse tree
- why one node can have only two children.
Are they the leftmost and rightmost children, surrounding the inner children?
```python
def find_gov(node) :
    '''
    * node = leaf node

    1. head final rule
      - Follow parent links from the current node; at the first node
        that has a right child, follow that node's right child
        down to a leaf node.
    2. VX rule
      - If the governor is an auxiliary predicate, replace it with the main predicate.
      - Predicates that are not auxiliary but behave like one are handled similarly. ex) '지니게 되다'
    3. VNP rule
      - If the governor has the form 'VNP 것/NNB + 이/VCP + 다/EF', replace it with the preceding predicate.
    4. VA rule
      - If the governor is '있/VA, 없/VA, 같/VA' and is preceded by a 'ㄹ NNB' form, replace it with the preceding predicate.
        Uses the node['pleaf'] link.
    '''
    # Search for the first node that has a right child,
    # using the sibling link.
    next = node
    found = None
    while next :
        if next['sibling'] :
            found = next['sibling']['parent']
            break
        next = next['parent']

    gov_node = None
    if found :
        # Follow right children down to a leaf node.
        next = found
        while next :
            if next['leaf'] :
                gov_node = next
                # -----------------------------------------------------------------
                # If gov_node satisfies the VX rule, follow parent->lchild.
                if check_vx_rule(gov_node) :
                    new_gov_node = find_for_vx_rule(node, gov_node)
                    if new_gov_node : gov_node = new_gov_node
                # If gov_node satisfies the VNP rule, follow parent->lchild.
                if check_vnp_rule(gov_node) :
                    new_gov_node = find_for_vnp_rule(node, gov_node)
                    if new_gov_node :
                        gov_node = new_gov_node
                        # If the new governor is '있다, 없다, 같다',
                        # run it through check_va_rule once.
                        if check_va_rule(gov_node) :
                            new_gov_node = find_for_va_rule(node, gov_node, search_mode=2)
                            if new_gov_node : gov_node = new_gov_node
                # If gov_node satisfies the VA rule, follow parent->lchild.
                if check_va_rule(gov_node) :
                    new_gov_node = find_for_va_rule(node, gov_node, search_mode=1)
                    if new_gov_node : gov_node = new_gov_node
                # -----------------------------------------------------------------
                break
            next = next['rchild']

    if gov_node :
        return gov_node['eoj_idx']
    return 0
```
```python
def make_edge(top, node) :
    if not top['lchild'] : # link to left child
        top['lchild'] = node
        node['parent'] = top
        if VERBOSE : print node_string(top) + '-[left]->' + node_string(node)
    elif not top['rchild'] : # link to right child
        top['rchild'] = node
        node['parent'] = top
        top['lchild']['sibling'] = node
        if VERBOSE : print node_string(top) + '-[right]->' + node_string(node)
    else :
        return False
    return True
```
Plus, could you please explain why the head-final rule works?
> - what lchild and rchild mean in parsing tree
> - why one node can have only two children.
Take a look at https://github.com/dsindex/syntree, where you can see what the constituent parse tree looks like.
A constituent parse tree must be a binary tree. It consists of inner nodes and leaves, and every inner node has a left and a right child. This is a constraint for building a parse tree under Korean constituent grammar.
If you have the Sejong Corpus, which annotates constituents, you can see this rule in the data.
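
As a rough illustration of the binary constraint, here is a minimal sketch of how the two child slots fill up. The field names follow `make_edge` in `c2d.py`; `new_node` and `attach` are hypothetical helpers written for this example, not functions from the repository.

```python
def new_node(label, leaf=False):
    # Hypothetical constructor; the keys mirror the fields make_edge touches.
    return {'label': label, 'leaf': leaf, 'parent': None,
            'lchild': None, 'rchild': None, 'sibling': None}

def attach(top, node):
    """Fill the left slot first, then the right; refuse a third child."""
    if top['lchild'] is None:
        top['lchild'] = node
    elif top['rchild'] is None:
        top['rchild'] = node
        # The left child remembers its right sibling.
        top['lchild']['sibling'] = node
    else:
        return False
    node['parent'] = top
    return True

vp = new_node('VP')
first_ok = attach(vp, new_node('학교에', leaf=True))   # becomes lchild
second_ok = attach(vp, new_node('갔다', leaf=True))    # becomes rchild
third_ok = attach(vp, new_node('extra', leaf=True))    # rejected: node is full
```

So `lchild` and `rchild` are simply the two branches of a binary inner node, and only the left child carries a `sibling` link.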
> Plus, could you please explain why head final rule works?
In Korean,
- an eojeol can have only one head (or governor). In other words, it has only one parent edge.
  - ex) '나는 학교에 갔다': '나는' -> '갔다', '학교에' -> '갔다'
- an eojeol's head is always located to its right. This differs from English grammar.

This is the head-final rule.
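
To make the two properties concrete, here is a small sketch that checks them on the example sentence. The 1-based indices, the use of 0 for the root, and the `is_head_final` helper are assumptions made for illustration.

```python
# Eojeols of the example sentence, numbered from 1; head 0 marks the root.
sentence = ['나는', '학교에', '갔다']
heads = {1: 3, 2: 3, 3: 0}  # '나는' -> '갔다', '학교에' -> '갔다', '갔다' -> root

def is_head_final(heads):
    """True if every non-root eojeol's single head lies to its right."""
    return all(head == 0 or head > idx for idx, head in heads.items())

example_is_head_final = is_head_final(heads)
```

A mapping from index to head already encodes "only one head per eojeol"; the check only has to confirm the direction.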
If you want to find an eojeol's head in a constituent parse tree:
- follow parent links upward to the first inner node whose right child is not on the path you came from (in the code, the first node on the path that has a `sibling` link; its parent is that inner node).
- from that inner node, follow right-child links down to the rightmost leaf node (eojeol).
- there are exceptional cases, such as the 'VX rule'.
  - ex) '틀이' -> '있다/VX': in this case, '달라지고' is the head of '틀이', because 'VX' has no actual meaning of its own.
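
The basic search, without the VX, VNP, and VA exceptions, can be sketched as follows. This is a simplified illustration rather than the code in `c2d.py`: `node`, `branch`, and `find_head` are hypothetical helpers, and the tree shape for '나는 학교에 갔다' is assumed.

```python
def node(label, eoj_idx=0):
    # Leaves carry a 1-based eojeol index; inner nodes keep eoj_idx 0.
    return {'label': label, 'eoj_idx': eoj_idx, 'leaf': eoj_idx > 0,
            'parent': None, 'lchild': None, 'rchild': None, 'sibling': None}

def branch(label, left, right):
    # Build a binary inner node; only the left child gets a sibling link.
    top = node(label)
    top['lchild'], top['rchild'] = left, right
    left['parent'] = right['parent'] = top
    left['sibling'] = right
    return top

def find_head(leaf):
    # Step 1: climb until we stand on a left child that has a right sibling.
    cur = leaf
    while cur is not None and cur['sibling'] is None:
        cur = cur['parent']
    if cur is None:
        return 0  # nothing to the right: this eojeol is the root
    # Step 2: from the shared parent, follow right children down to a leaf.
    cur = cur['parent']
    while not cur['leaf']:
        cur = cur['rchild']
    return cur['eoj_idx']

na, hakgyo, gatda = node('나는', 1), node('학교에', 2), node('갔다', 3)
tree = branch('S', na, branch('VP', hakgyo, gatda))
heads = [find_head(leaf) for leaf in (na, hakgyo, gatda)]
```

Here both '나는' and '학교에' resolve to eojeol 3 ('갔다'), and '갔다' itself gets 0 because no ancestor on its path has anything to its right.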
I got it! Thank you for the explanation :)