./sejong/c2d.sh error

YeopIn opened this issue 6 years ago · comments

yeoineon commented 6 years ago

I ran into a problem while [training the parser from the Sejong treebank corpus].

./sejong/split.sh -v -v runs fine,

but ./sejong/c2d.sh -v -v fails with an error.

What should I do?

Myungchul Shin · Answer 1 · Tue Jul 10 2018 13:51:29 GMT+0800 (China Standard Time)

@YeopIn

You need to place the constituent parse tree corpus (sejong_treebank.txt.v1) in the sejong directory.

$ ls
align.py  align_r.py  c2d.py  c2d.sh  context.pbtxt_p  env.sh  eval.py  log  sejong_treebank.sample  sejong_treebank.txt.v1  split.py  split.sh  tagged_input.sample  tagger.py  wdir
$ more sejong_treebank.txt.v1
; 1993/06/08 19
(NP	(NP 1993/SN + //SP + 06/SN + //SP + 08/SN)
	(NP 19/SN))

; 엠마누엘 웅가로 /
(NP	(NP	(NP 엠마누엘/NNP)
		(NP 웅가로/NNP))
	(X //SP))

; 의상서 실내 장식품으로…
(NP_AJT	(NP_AJT 의상/NNG + 서/JKB)
	(NP_AJT	(NP 실내/NNG)
		(NP_AJT 장식품/NNG + 으로/JKB + …/SE)))

; 디자인 세계 넓혀
(VP	(NP_OBJ	(NP 디자인/NNG)
		(NP_OBJ 세계/NNG))
	(VP 넓히/VV + 어/EC))
...

After running split.sh, you will have:

$ ls wdir
sejong_treebank.txt.v1.test
sejong_treebank.txt.v1.training
sejong_treebank.txt.v1.tuning
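
Before moving on to c2d.sh, it can help to confirm that all three splits exist and are not empty. The following is only a minimal sketch, not part of the repository: the helper name count_entries is hypothetical, and it assumes the files are UTF-8 and that every entry starts with a "; sentence" line, as in the sample above.

```python
from pathlib import Path

SETS = ("training", "tuning", "test")


def count_entries(wdir, prefix="sejong_treebank.txt.v1"):
    """Map each split name to its number of '; sentence' lines.

    A missing file is reported as None (hypothetical helper, UTF-8 assumed).
    """
    counts = {}
    for name in SETS:
        path = Path(wdir) / f"{prefix}.{name}"
        if not path.is_file():
            counts[name] = None
            continue
        with path.open(encoding="utf-8") as handle:
            # Each treebank entry begins with a comment line holding the raw sentence.
            counts[name] = sum(1 for line in handle if line.startswith(";"))
    return counts
```

Calling count_entries("wdir") from the sejong directory returns a dictionary keyed by split name; a None or 0 value suggests that split.sh did not produce usable input for c2d.sh.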

Then run c2d.sh.

As you can see, this script generates the .v2 and .v3 files:

for SET in training tuning test; do
    ${python} ${CDIR}/c2d.py --mode=0 < ${WDIR}/sejong_treebank.txt.v1.${SET} > ${WDIR}/sejong_treebank.txt.v2.${SET} 2> ${WDIR}/sejong_treebank.txt.v2.${SET}.err
    ${python} ${CDIR}/c2d.py --mode=1 < ${WDIR}/sejong_treebank.txt.v2.${SET} > ${WDIR}/deptree.txt.v2.${SET}         2> ${WDIR}/deptree.txt.v2.${SET}.err
    [ "${SET}" == "training" ] && extend=1 || extend=0
    ${python} ${CDIR}/align.py --extend=${extend} < ${WDIR}/deptree.txt.v2.${SET} > ${WDIR}/deptree.txt.v3.${SET}
done
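
Because the loop above redirects stderr of each c2d.py call into a .err file under the working directory, those files are a natural first place to look when c2d.sh fails. A small sketch for listing the ones that are not empty follows; the helper name nonempty_error_logs is hypothetical, and a non-empty log may contain harmless warnings rather than a real failure.

```python
from pathlib import Path


def nonempty_error_logs(wdir):
    """Return (file name, first line) for every non-empty *.err file in wdir."""
    report = []
    for path in sorted(Path(wdir).glob("*.err")):
        # errors="replace" keeps the scan going even if the log is not valid UTF-8.
        text = path.read_text(encoding="utf-8", errors="replace").strip()
        if text:
            report.append((path.name, text.splitlines()[0]))
    return report
```

For example, printing each pair returned by nonempty_error_logs("wdir") shows which split and which conversion step wrote something to stderr.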

If you run into trouble, run a single step by hand, like this:

$ python c2d.py --mode=0 < wdir/sejong_treebank.txt.v1.training > wdir/sejong_treebank.txt.v2.training

Because stderr is not redirected here, any error is printed to the terminal, so you should be able to see where the problem is.
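
If the error message points at the corpus itself, one cheap check is whether every tree has balanced brackets. The sketch below is an independent sanity check, not what c2d.py does. It assumes the layout shown in the sample above ("; sentence" followed by a bracketed tree) and that morphemes look like form/TAG with an uppercase tag; the names read_entries and is_balanced are hypothetical.

```python
import re

# A morpheme token such as 넓히/VV, //SP or (/SS. These are removed before
# counting so that a literal parenthesis morpheme is not taken for tree structure.
MORPH = re.compile(r"\S+/[A-Z]+")


def read_entries(lines):
    """Yield (sentence, tree_text) pairs from '; sentence' + tree blocks."""
    sentence, tree = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(";"):
            if sentence is not None:
                yield sentence, "\n".join(tree)
            sentence, tree = line[1:].strip(), []
        elif line.strip() and sentence is not None:
            tree.append(line)
    if sentence is not None:
        yield sentence, "\n".join(tree)


def is_balanced(tree_text):
    """True if the structural parentheses of one tree open and close consistently."""
    depth = 0
    for ch in MORPH.sub("", tree_text):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0
```

Opening wdir/sejong_treebank.txt.v1.training, passing the file object to read_entries, and printing every sentence for which is_balanced returns False gives a short list of entries to inspect by hand.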

yeoineon · Answer 2 · Fri Jul 13 2018 13:43:34 GMT+0800 (China Standard Time)

I solved this problem. Thank you.

How do I train Korean POS tagging?
Is it correct to use train_dragnn.sh for Korean POS tagging, with UD_Korean (universal_dependencies-2.0-ud_treebans-v2.0tgz) as the data?
Does it need sejong_treebank.v1? As far as I know, sejong_treebank.v1 is for the Korean parser.

I downloaded UD_Korean version 2.0.

In train_dragnn.sh I changed SRC_CORPUS_DIR = UD_Korean, TRAIN_FILE = kr-ud-train.conllu and DEV_FILE = kr-ud-dev.conllu,

but I get an out-of-range error. What should I do?

Myungchul Shin · Answer 3 · Fri Jul 13 2018 16:30:18 GMT+0800 (China Standard Time)

@YeopIn

Is it correct to use train_dragnn.sh for Korean POS tagging?

-> No, train_dragnn.sh only trains a dependency parser. It is basically the same as train_dragnn_sejong.sh.

with UD_Korean (universal_dependencies-2.0-ud_treebans-v2.0tgz) as the data?
Does it need sejong_treebank.v1? As far as I know, sejong_treebank.v1 is for the Korean parser ...

-> I think you need to check the *.conllu.conv files. convert.py generates the .conv files, and those files are used as the training and tuning corpora. Note that TRAIN_FILE and DEV_FILE in the script end in .conllu.conv, not .conllu:

TRAIN_FILE=${DATA_DIR}/en-ud-train.conllu.conv
DEV_FILE=${DATA_DIR}/en-ud-dev.conllu.conv
CHECKPOINT_FILE=${DATA_DIR}/checkpoint.model

function convert_corpus {
    local _corpus_dir=$1
    for corpus in $(ls ${_corpus_dir}/*.conllu); do
        ${python} ${CDIR}/convert.py < ${corpus} > ${corpus}.conv
    done
}

...
--training_corpus_path=${TRAIN_FILE} 
--tune_corpus_path=${DEV_FILE}
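
One way to check a .conllu or .conv file is to count how many tab-separated fields each line has, since a line with fewer fields than the reader expects is a plausible source of an out-of-range error. This is only a sketch under assumptions: CoNLL-U token lines have 10 fields, but whether convert.py keeps that layout in its .conv output is not stated here, and field_count_histogram is a hypothetical helper.

```python
from collections import Counter


def field_count_histogram(lines):
    """Count lines by their number of tab-separated fields.

    Blank lines and '#' comment lines are skipped, following CoNLL-U conventions.
    """
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip() or line.startswith("#"):
            continue
        counts[len(line.split("\t"))] += 1
    return counts
```

Running it on an open file handle for the training corpus should show one dominant field count; any other count marks lines worth inspecting.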

yeoineon · Answer 4 · Mon Jul 16 2018 09:41:53 GMT+0800 (China Standard Time)

Thank you so much.
My final goal is to train both a Korean tagger and a parser on the Sejong corpus data. Is there a way to do that?

Myungchul Shin · Answer 5 · Mon Jul 16 2018 14:54:45 GMT+0800 (China Standard Time)

There was a similar discussion before:
#4 (comment)

However, I couldn't find a proper way to train a Korean POS tagger.
I think it would be worth either using another Korean POS tagger (KoNLPy) or implementing a character-based POS tagger for Korean and then reconstructing the morphemes from the inflected forms.
For example:

tagging : '하늘을 나는 새를 본다' -> '하/b-ncn 늘/i-ncn 을/b-jks 나/b-vv 는/b-etm 새/b-ncn 를/b-jko 본/b-vv 다/b-ec'
reconstruct : '하늘/ncn 을/jks 날/vv 는/etm 새/ncn 를/jko 보/vv ㄴ다/ec'

Of course, you need some extra resources to convert '본/b-vv 다/b-ec' -> '보/vv ㄴ다/ec'.
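
The first half of the reconstruction step, grouping character-level b-/i- tags back into surface chunks, can be sketched as below. It covers only the grouping: restoring base forms such as 날 or 보 + ㄴ다 still needs the extra resources mentioned above. The function name merge_char_tags is hypothetical, and the tag format is taken from the example in this comment.

```python
def merge_char_tags(tagged):
    """Group 'char/b-tag' and 'char/i-tag' tokens into (surface, tag) chunks."""
    chunks = []
    for token in tagged.split():
        # rpartition keeps a literal '/' character intact, e.g. '//b-sp'.
        char, _, label = token.rpartition("/")
        prefix, _, tag = label.partition("-")
        if prefix == "i" and chunks and chunks[-1][1] == tag:
            # An i- tag continues the chunk opened by the preceding b- tag.
            chunks[-1] = (chunks[-1][0] + char, tag)
        else:
            chunks.append((char, tag))
    return chunks
```

Applied to the tagging output above, this gives 하늘/ncn 을/jks 나/vv 는/etm 새/ncn 를/jko 본/vv 다/ec. Word (eojeol) boundaries are not encoded in the character tags, so they would have to be tracked separately.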