dsindex / syntaxnet

reference code for syntaxnet

Children in c2d.py when converting Sejong Corpus

j-min opened this issue

In c2d.py, find_gov assigns the CoNLL-U 'HEAD' field of each node using four rules, including the head-final rule. The children it walks over are attached in make_edge.

Could you please explain how lchild and rchild are used?
It seems that a node can have only two children, and that a child is attached to its parent as lchild by default.
But I couldn't understand

  • what lchild and rchild mean in the parse tree
  • why a node can have only two children.

Are they the leftmost and rightmost children, with any inner children between them?

def find_gov(node) :
    '''
    * node = leaf node

    1. head final rule
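      - Follow parent links from the current node;
        at the first node that has a right child,
        follow that node's right child down to a leaf node.
    2. VX rule
      - If the governor is an auxiliary predicate (보조용언), replace it with the main predicate (본용언).
      - A predicate that is not auxiliary but behaves like one is handled similarly. ex) '지니게 되다'
    3. VNP rule
      - If the governor has the form 'VNP 것/NNB + 이/VCP + 다/EF', replace it with the preceding predicate.
    4. VA rule
      - If the governor is '있/VA, 없/VA, 같/VA' and a 'ㄹ NNB' pattern precedes it, replace it with the preceding predicate.
        This uses the node['pleaf'] link.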
    '''
    # Find the first node that has a right child,
    # using the sibling link.
    next = node
    found = None
    while next :
        if next['sibling'] :
            found = next['sibling']['parent']
            break
        next = next['parent']

    gov_node = None
    if found :
        # Follow right children down to a leaf node
        next = found
        while next :
            if next['leaf'] :
                gov_node = next
                # -----------------------------------------------------------------
                # If gov_node satisfies the VX rule, follow parent->lchild.
                if check_vx_rule(gov_node) :
                    new_gov_node = find_for_vx_rule(node, gov_node)
                    if new_gov_node : gov_node = new_gov_node
                # If gov_node satisfies the VNP rule, follow parent->lchild.
                if check_vnp_rule(gov_node) :
                    new_gov_node = find_for_vnp_rule(node, gov_node)
                    if new_gov_node :
                        gov_node = new_gov_node
                        # If the new governor is '있다', '없다', or '같다',
                        # run check_va_rule on it once as well.
                        if check_va_rule(gov_node) :
                            new_gov_node = find_for_va_rule(node, gov_node, search_mode=2)
                            if new_gov_node : gov_node = new_gov_node
                # If gov_node satisfies the VA rule, follow parent->lchild.
                if check_va_rule(gov_node) :
                    new_gov_node = find_for_va_rule(node, gov_node, search_mode=1)
                    if new_gov_node : gov_node = new_gov_node
                # -----------------------------------------------------------------
                break
            next = next['rchild']
    if gov_node :
        return gov_node['eoj_idx']
    return 0


def make_edge(top, node) :
    if not top['lchild'] : # link to left child
        top['lchild'] = node
        node['parent'] = top
        if VERBOSE : print node_string(top) + '-[left]->' + node_string(node)
    elif not top['rchild'] : # link to right child
        top['rchild'] = node
        node['parent'] = top
        top['lchild']['sibling'] = node
        if VERBOSE : print node_string(top) + '-[right]->' + node_string(node)
    else :
        return False
    return True 

Also, could you please explain why the head-final rule works?

@j-min

- what lchild and rchild mean in parsing tree
- why one node can have only two children.

Take a look at https://github.com/dsindex/syntree to see what the constituent parse tree looks like.

[image: parse tree]

A constituent parse tree must be a binary tree. It consists of inner nodes and leaves, and every inner node has a left child and a right child. That is a constraint on building a parse tree under the Korean constituent grammar.

If you have the Sejong Corpus, which is annotated with constituents, you can see this rule in the data.
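As a rough illustration of the two-slot attachment, here is a minimal sketch that mirrors make_edge from the question (without the VERBOSE printing). The new_node helper, the 'label' field, and the bracketing of the example sentence are assumptions made for this sketch, not taken from c2d.py.

```python
def new_node(label, leaf=False):
    # Hypothetical helper: the link fields mirror the names used in the excerpt.
    return {'label': label, 'leaf': leaf, 'parent': None,
            'lchild': None, 'rchild': None, 'sibling': None}

def make_edge(top, node):
    # The first attachment fills the left slot, the second fills the right slot.
    if not top['lchild']:
        top['lchild'] = node
        node['parent'] = top
    elif not top['rchild']:
        top['rchild'] = node
        node['parent'] = top
        # Only a left child gets a sibling link, pointing at its right sibling.
        top['lchild']['sibling'] = node
    else:
        # A third child is rejected, so every inner node stays binary.
        return False
    return True

# '나는 학교에 갔다' bracketed as (S 나는 (VP 학교에 갔다)); the bracketing is assumed.
s, vp = new_node('S'), new_node('VP')
na = new_node('나는', leaf=True)
school = new_node('학교에', leaf=True)
went = new_node('갔다', leaf=True)

make_edge(s, na)        # S.lchild = 나는
make_edge(s, vp)        # S.rchild = VP, and 나는.sibling = VP
make_edge(vp, school)   # VP.lchild = 학교에
make_edge(vp, went)     # VP.rchild = 갔다, and 학교에.sibling = 갔다

# Trying to attach a third child to VP fails.
third_ok = make_edge(vp, new_node('extra', leaf=True))
```

So lchild and rchild are simply the two branches of a binary constituent node, not the outermost of many children.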

Plus, could you please explain why head final rule works?

In Korean,

  1. an eojeol can have only one head (or governor).
    In other words, it has only one outgoing edge to a parent.
    • ex) '나는 학교에 갔다' ("I went to school"): '나는' -> '갔다', '학교에' -> '갔다'
  2. an eojeol's head is always located to its right. This differs from English grammar.

This is the head-final rule.
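The two properties above can be checked on a CoNLL-U style HEAD column. This is only a sketch: the function name is_head_final and the list layout (one head per eojeol, 1-based, 0 for the root) are assumptions for illustration.

```python
def is_head_final(heads):
    # heads[i] is the 1-based index of the head of eojeol i+1; 0 marks the root.
    # Storing one value per eojeol already encodes "only one head";
    # the check below verifies that every non-root head lies to the right.
    return all(h == 0 or h > i for i, h in enumerate(heads, start=1))

# '나는 학교에 갔다': '나는' -> '갔다', '학교에' -> '갔다', and '갔다' is the root.
korean = is_head_final([3, 3, 0])
# A hypothetical assignment in which the third token depends on the second one.
not_final = is_head_final([2, 0, 2])
```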

If you want to find an eojeol's head in a constituent parse tree,

  1. follow parent links up to the first inner node that has a right child you did not come up through.
  2. from that inner node, follow right child links down to the rightmost leaf node (eojeol).
  • There are exceptional cases, though, such as the 'VX rule'.
  • ex) '틀이' -> '있다/VX': in this case '달라지고' is the head of '틀이', because a 'VX' has no actual meaning of its own.
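The two steps can be sketched as follows. This is a simplified version of find_gov from the question that leaves out the VX, VNP, and VA rules; the new_node and attach helpers, the find_head name, and the bracketing of the example sentence are assumptions for illustration.

```python
def new_node(label, leaf=False, eoj_idx=0):
    # Hypothetical helper: the link fields mirror the names used in the excerpt.
    return {'label': label, 'leaf': leaf, 'eoj_idx': eoj_idx, 'parent': None,
            'lchild': None, 'rchild': None, 'sibling': None}

def attach(top, left, right):
    # Hypothetical helper: links both children at once and sets the sibling link.
    top['lchild'], top['rchild'] = left, right
    left['parent'] = right['parent'] = top
    left['sibling'] = right

def find_head(leaf):
    # Step 1: climb parent links until a node with a sibling link is found.
    # Only left children carry that link, so its parent has an unvisited right child.
    node, found = leaf, None
    while node:
        if node['sibling']:
            found = node['sibling']['parent']
            break
        node = node['parent']
    # Step 2: descend through right children until a leaf (eojeol) is reached.
    node = found
    while node:
        if node['leaf']:
            return node['eoj_idx']
        node = node['rchild']
    return 0  # no node found to the right: this eojeol is the root

# '나는 학교에 갔다' bracketed as (S 나는 (VP 학교에 갔다)); the bracketing is assumed.
s, vp = new_node('S'), new_node('VP')
na = new_node('나는', leaf=True, eoj_idx=1)
school = new_node('학교에', leaf=True, eoj_idx=2)
went = new_node('갔다', leaf=True, eoj_idx=3)
attach(vp, school, went)
attach(s, na, vp)

heads = [find_head(n) for n in (na, school, went)]  # → [3, 3, 0]
```

The last eojeol never meets a sibling link on the way up, so it gets HEAD 0, which matches the `return 0` fallback in find_gov.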

I got it! Thank you for the explanation :)