core-peri thesaurus for feature expansion
@inproceedings{cui2018solving,
title={Solving Feature Sparseness in Text Classification using Core-Periphery Decomposition},
author={Cui, Xia and Kojaku, Sadamori and Masuda, Naoki and Bollegala, Danushka},
booktitle={Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics},
pages={255--264},
year={2018}
}
- word_ids_generator(): generate word_ids in ../data/
- compute_links(): generate and store ppmi_links in ../data/ (requires: word_ids, ppmi.values)
- compute_freq_coreness(domain): use in-domain frequency as coreness for train and test data (requires: train, test in label-sentence format, and word_ids)
- compute_ppmi_coreness(domain): same as above, but use ppmi as coreness (requires: train, test and word_ids)
- convert_cp_nonoverlap(domain, method): convert km results to the format core coreness peri1,score1,peri2,score2,.. with ids replaced by words
- convert_cp_overlap(domain, method): convert km_overlap results to the format core coreness peri1,score1,peri2,score2,.. with ids replaced by words
- sort_peris(peris_list, core, h) and get_h(): helper functions supporting the format conversion, in descending ppmi order
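As a rough illustration of in-domain frequency as coreness, the sketch below counts how often each known word occurs in label-sentence data. It assumes that each line holds a label followed by whitespace-separated tokens and that word_ids maps each word to an integer id. The helper name freq_coreness and this file layout are assumptions made for illustration, not the code in preprocess.py.

```python
from collections import Counter


def freq_coreness(lines, word_ids):
    """Count in-domain occurrences of each known word, keyed by word id.

    Assumed (hypothetical) layout per line: "<label> <token> <token> ...".
    """
    counts = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip empty lines and lines holding only a label
        for token in parts[1:]:  # parts[0] is the label
            if token in word_ids:
                counts[word_ids[token]] += 1
    return dict(counts)


# Toy data, invented for this example only.
toy_word_ids = {"good": 0, "book": 1, "bad": 2}
toy_train = ["+1 good book good", "-1 bad book"]
print(freq_coreness(toy_train, toy_word_ids))
```

Words missing from word_ids are ignored here; whether the actual script does the same is not stated above.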
Pre-computed requirements for convert_cp_overlap() or convert_cp_nonoverlap():
- ../data/ppmi.values
- ../data/word_ids: generated from word_ids_generator()
- ../data/domain/result_method_overlap.dat: generated from km_overlap
- ../data/domain/result_method_nonoverlap.dat: generated from km
Outputs:
- ../data/domain/cpwords_method_overlap.dat: cp_overlap
- ../data/domain/cpwords_method_nonoverlap.dat: cp_nonoverlap
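The conversion to the words format can be sketched as follows. This is a minimal illustration, assuming that the score attached to each periphery is its ppmi value with the core, that ppmi values are available as a dict keyed by (core_id, peri_id), and that core and coreness are separated by single spaces. The names sort_peripheries and format_cp_line are hypothetical and are not the functions in preprocess.py.

```python
def sort_peripheries(peri_ids, core_id, ppmi):
    """Return (peri_id, score) pairs sorted by ppmi with the core, highest first.

    Pairs missing from the ppmi dict get a score of 0.0 (an assumption).
    """
    scored = [(p, ppmi.get((core_id, p), 0.0)) for p in peri_ids]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def format_cp_line(core_id, coreness, peri_ids, id2word, ppmi):
    """Build one 'core coreness peri1,score1,peri2,score2,..' line using words."""
    pairs = sort_peripheries(peri_ids, core_id, ppmi)
    peris = ",".join(f"{id2word[p]},{score}" for p, score in pairs)
    return f"{id2word[core_id]} {coreness} {peris}"


# Toy data, invented for this example only.
id2word = {0: "book", 1: "novel", 2: "page"}
ppmi = {(0, 1): 1.5, (0, 2): 2.5}
print(format_cp_line(0, 3.0, [1, 2], id2word, ppmi))
```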
- source code: xiacui2@nlp1: ~/python/cp-thesaurus/src
- datasets: xiacui2@nlp1: ~/python/cp-thesaurus/data
packages: numpy, math
specify domain and method, then uncomment functions in main() (e.g. domain = "TR", method = "ppmi")
python preprocess.py
script (runner.py) for automatically running the pipeline step by step:
- compute ppmi values
- run km (nonoverlap) or km_overlap
- convert the cp result to the words version and add the ppmi values to it
- use the words version from the previous step to expand the features and calculate the experimental results
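The final expansion step can be pictured with the sketch below. It assumes the words file uses the core coreness peri1,score1,peri2,score2,.. layout with peripheries already sorted by score, and it simply appends the top k peripheries of every core word found in a document. The function names and the parameter k are hypothetical; the selection and weighting used in the actual experiments may differ.

```python
def load_thesaurus(lines):
    """Parse 'core coreness peri1,score1,peri2,score2,..' lines into a dict."""
    thesaurus = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 3:
            continue  # no peripheries listed for this core
        core, fields = parts[0], parts[2].split(",")
        # fields alternates between a periphery word and its score
        thesaurus[core] = [(fields[i], float(fields[i + 1]))
                           for i in range(0, len(fields) - 1, 2)]
    return thesaurus


def expand_features(tokens, thesaurus, k=2):
    """Append up to k peripheries for every token that is a core word."""
    expanded = list(tokens)
    for token in tokens:
        for peri, _score in thesaurus.get(token, [])[:k]:
            if peri not in expanded:
                expanded.append(peri)
    return expanded


# Toy data, invented for this example only.
toy = load_thesaurus(["book 3.0 page,2.5,novel,1.5,cover,0.5"])
print(expand_features(["good", "book"], toy, k=2))
```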
- kmcpp code: ../../kmcpp
- word_ids: ../data/word_ids (if it does not exist, generate it with preprocess.py)
- ppmi_links: ../data/ppmi_links.dat (if it does not exist, generate it with preprocess.py)
- train and test data: ../data/domain/train and ../data/domain/test
run from ~/python/cp-thesaurus/src
- USAGE: python runner.py <option:overlap or nonoverlap> <dataset or domain>
- example: python runner.py overlap B-D
- example: python runner.py nonoverlap TR
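For readers wrapping these calls in their own scripts, the two positional arguments shown in the usage line could be validated along the following lines. This is a hypothetical sketch built on the standard argparse module; it is not the argument handling inside runner.py, which is not shown here.

```python
import argparse


def build_parser():
    """Parser mirroring: runner.py <overlap|nonoverlap> <dataset or domain>."""
    parser = argparse.ArgumentParser(prog="runner.py")
    parser.add_argument("option", choices=["overlap", "nonoverlap"],
                        help="which decomposition to run")
    parser.add_argument("domain", help="dataset or domain, e.g. B-D or TR")
    return parser


# Parse an explicit argument list instead of sys.argv for the demonstration.
args = build_parser().parse_args(["overlap", "B-D"])
print(args.option, args.domain)
```

Any value other than overlap or nonoverlap for the first argument makes argparse exit with a usage error.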