This repository contains the code for revamping the FIGER entity typing ontology and dataset, and a simple classifier on the revamped dataset based on BERT-base-uncased.
This document hosts instructions for training and using the classifier. Please download the revamped FIGER data HERE and the pre-trained checkpoints here; to learn more about the ontology revamping, please refer to REFINE_TYPING.md.
With the revamped ontology, the classifier achieves a 94.3% macro-F1 score on the test set, an XXX improvement over the original FIGER dataset.
Model | Dataset | Macro-F1 | Micro-F1 |
---|---|---|---|
BERT-base-uncased | Original FIGER | XXXX | XXXX |
BERT-base-uncased | Revamped FIGER | 94.3 | 95.1 |
BERT-large-uncased | Original FIGER | XXXX | XXXX |
BERT-large-uncased | Revamped FIGER | XXXX | XXXX |
We use this classifier to type the entities for our Entailment Graph project; stay tuned for more details.
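The table reports both macro-F1 and micro-F1. The sketch below shows one common way to compute the two for multi-label predictions: macro-F1 averages per-label F1 scores, while micro-F1 pools the counts over all labels first. It is an illustration only; the function names are hypothetical, and this README does not state which exact definition the evaluation code uses (entity typing work sometimes averages per mention rather than per label).

```python
from collections import Counter


def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when all counts are zero."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_micro_f1(gold, pred):
    """gold, pred: lists of label sets, one set per entity mention."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        for label in g & p:   # predicted and correct
            tp[label] += 1
        for label in p - g:   # predicted but wrong
            fp[label] += 1
        for label in g - p:   # missed
            fn[label] += 1
    labels = set(tp) | set(fp) | set(fn)
    # Macro: average the per-label F1 scores (label-averaged variant).
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels) if labels else 0.0
    # Micro: pool the counts over all labels, then compute a single F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro
```

Because macro-F1 weights every label equally, rare types with poor predictions pull it down more than they pull down micro-F1.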
The classifier is a multi-class multi-label sequence classifier based on `bert-base-uncased`.
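To make "multi-label" concrete: each entity mention may receive several types at once, so a typical decision rule scores every type independently and keeps those above a threshold. The following is a minimal sketch under that assumption; `predict_types`, the type names, and the default threshold are hypothetical and are not taken from this repository's code.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def predict_types(logits, type_names, threshold=0.5):
    """Keep every type whose independent sigmoid probability reaches the threshold.

    Assumption: one logit per type, scored independently (no softmax over types).
    """
    probs = [sigmoid(z) for z in logits]
    return [name for name, p in zip(type_names, probs) if p >= threshold]
```

Lowering the threshold trades precision for recall: more types pass the cut for each mention.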
There are two major design choices: which tokens to extract representations from, and how large a language model to use.
- Representation extraction:
  - (i) use the `[CLS]` token;
  - (ii) use the average of entity tokens;
  - (iii) concatenate (i) and (ii);
  - (iv) use the average of left context tokens and the average of right context tokens (concatenated);
  - (v) concatenate (ii) and (iv).
- How large a model: given a fixed computation budget, we can choose either a larger model with fp16 or a smaller model with fp32. The hypothesis is that a larger model with fp16 will give better performance.
- Hyper-parameters: other hyper-parameters of interest include the learning rate, the number of epochs, and the metric for selecting the best model (`macro_f1` or `micro_f1`).
- Note: empirically, we have found using `entity` tokens for the representation to be the most efficient; removing entity representations harms performance dramatically, whereas additionally including left and right contexts is not clearly better.
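The five representation-extraction options above can be sketched with plain Python lists standing in for token vectors. This is an illustration, not the repository's implementation: the helper names, the half-open span convention `[start, end)`, and the handling of special tokens are assumptions. The mode names `entity` and `cls_entity` appear in the commands in this README; `cls`, `context`, and `entity_context` are hypothetical labels for the remaining options.

```python
def mean_pool(vectors, dim):
    """Average a list of equal-length vectors; return zeros for an empty span."""
    if not vectors:
        return [0.0] * dim
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def extract_representation(hidden, start, end, mode):
    """hidden: per-token vectors with hidden[0] as [CLS]; entity span is [start, end)."""
    dim = len(hidden[0])
    cls = hidden[0]                                   # option (i)
    entity = mean_pool(hidden[start:end], dim)        # option (ii)
    # Option (iv): left and right context averages, concatenated.
    # Assumption: [CLS] is excluded from the left context.
    context = mean_pool(hidden[1:start], dim) + mean_pool(hidden[end:], dim)
    options = {
        "cls": cls,                          # hypothetical mode name
        "entity": entity,
        "cls_entity": cls + entity,          # option (iii): list concatenation
        "context": context,                  # hypothetical mode name
        "entity_context": entity + context,  # option (v), hypothetical mode name
    }
    return options[mode]
```

Note that the output width depends on the mode: one hidden size for (i) and (ii), two for (iii) and (iv), and three for (v), so the classifier's input layer has to match the chosen mode.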
- NOTE: if you are using the slurm scripts, please change the `#SBATCH` options to match your environment (see this for detailed documentation);
- Do cache: `python train.py --do_cache --label_smoothing_factor 0.1 --reload_data`; this will cache the representations of data entries;
- Do train: @ `./classifier`;
  - On private servers: do `nohup python -u train.py --do_train --model_name_or_path ../../lms/bert-base-uncased --encode_mode entity --lr 5e-5 --num_train_epochs 5 --metric_best_model macro_f1 --label_smoothing_factor 0.0 > ./logdir/bbu_entity_lsf0.0_5e-5.log &`;
  - On Cluster: do `sbatch -p ILCC_GPU --gres gpu:4 -o ./logdir/bbu_cls_entity_lsf0.0_5e-5.log train_script.sh bert-base-uncased cls_entity 5e-5 0.0 /disk/scratch/tli/figer_simple_classifier/model_ckpts/json_data/ /disk/scratch/tli/figer_simple_classifier/model_ckpts/ ../model_ckpts`;
  - Key tunable hyperparameters include:
    - `--encode_mode`: which tokens' representations to take for the classifier;
    - `--num_clsf_layers`: how many MLP layers to use for the classifier;
    - `--typeset_fn` / `--labels_key`: which typeset to use;
    - `--lr`: the learning rate;
    - `--use_fp16`: whether to use fp16 to speed up training;
    - `--metric_best_model`: which metric to use for selecting the best model, `macro_f1` by default; values from different metrics should be roughly aligned;
    - `--reload_data`: whether to reload the data (refresh the cache); needed when the dataset is updated;
    - `--label_smoothing_factor`: the label smoothing factor, 0.0 by default; values from 0.0 to 0.2 are reasonable;
- Do eval: @ `./classifier`:
  - On private servers, do: ``;
  - On Cluster, do: `sbatch -p ILCC_GPU --gres gpu:2 -o ./logdir/bbu_entity_lsf0.0_5e-5_dev.log eval_script.sh bert-base-uncased entity /home/s2063487/figer_simple_classifier/model_ckpts/model_ckpts/bert-base-uncased/entity_5e-05_0.0/checkpoint-95000 /tli/figer_simple_classifier/model_ckpts/json_data/ dev`;
- Test predict: @ `./classifier`:
  - On cluster, do: `sbatch -p ILCC_GPU -w duflo --gres gpu:2 -o ./logdir/predict_test.log predict_script.sh bert-base-uncased entity /home/s2063487/figer_simple_classifier/model_ckpts/model_ckpts/bert-base-uncased/entity_5e-05_0.0/checkpoint-95000 /tli/figer_simple_classifier/model_ckpts/json_data/ test_wegtypes _bert-base-uncased_entgraph_labels_0.0 0.05 --is_inference --debug`;
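For readers unfamiliar with the `--label_smoothing_factor` option described above, the sketch below shows one common form of label smoothing for multi-label (0/1) targets, where each target is pulled toward 0.5 by the given factor. This README does not state which formula the training code applies, so treat this as an assumed variant; `smooth_labels` is a hypothetical helper name.

```python
def smooth_labels(targets, factor):
    """Binary label smoothing: move each 0/1 target toward 0.5 by `factor`.

    Assumed variant: y_smooth = y * (1 - factor) + 0.5 * factor.
    """
    if not 0.0 <= factor < 1.0:
        raise ValueError("factor must be in [0, 1)")
    return [t * (1.0 - factor) + 0.5 * factor for t in targets]
```

With a factor of 0.1, a positive target becomes roughly 0.95 and a negative target roughly 0.05; a factor of 0.0 leaves the targets unchanged, which matches the default described above.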
Note: you can create your own inference scripts analogous to `news_proc` and `levy_proc`.
1. load_news_data:
   - newsspike (@news_proc): `nohup python -u load_news_data.py --in_path ../../entGraph/news_gen8_p.json --out_dir ../news_data/ --data_name newsspike --out_fn %s_gparser_typing_input.json --mode load > ns_load.log &`;
   - newscrawl (@news_proc): `nohup python -u load_news_data.py --in_path ../../news_genC_GG.json --out_dir ../news_data/ --data_name newscrawl --out_fn %s_gparser_typing_input.json --mode load > nc_load.log &`;
   - LevyHolt (@levy_proc): `python -u load_levy_data.py`;
2. split loaded news data:
   - newsspike: `nohup python -u load_news_data.py --out_dir ../news_data/ --data_name newsspike --out_fn %s_gparser_typing_input.json --mode split --num_slices 8 --expected_num_lines 63876006 > ns_split.log &`;
   - newscrawl: `nohup python -u load_news_data.py --out_dir ../news_data/ --data_name newscrawl --out_fn %s_gparser_typing_input.json --mode split --num_slices 120 --expected_num_lines 1584274524 > nc_split.log &`;
   - LevyHolt: NA;
3. do cache:
   - newsspike: NA;
   - newscrawl: NA;
   - LevyHolt: `nohup bash cache_script_pata.sh bert-base-uncased ../levy_data/dev_input.json 0.0 --reload_data --spanend_inclusive --force_encode > ./logdir/cache_levy_dev.log &` (the cached files can then be sent to the MLP server);
4. do predict:
   - newsspike: `sbatch -p ILCC_GPU --exclude duflo --array 0 --gres gpu:4 -o ./logdir/predict_ns_%a_%A.log predict_script_array.sh bert-base-uncased entity /home/s2063487/figer_simple_classifier/model_ckpts/model_ckpts/bert-base-uncased/entity_5e-05_0.0/checkpoint-95000 /disk/scratch/tli/figer_simple_classifier/model_ckpts/news_data/ newsspike_gparser_typing_input entgraph_labels 0.05 --is_inference --spanend_inclusive`;
   - newsspike: `sbatch -p PGR-Standard --array 4-7 --gres gpu:2 -o ./logdir/predict_ns_%a_%A.log predict_script_array.sh bert-base-uncased entity /home/s2063487/figer_simple_classifier/model_ckpts/model_ckpts/bert-base-uncased/entity_5e-05_0.0/checkpoint-95000 /disk/scratch_big/tli/figer_simple_classifier/model_ckpts/news_data/ newsspike_gparser_typing_input entgraph_labels 0.05 --is_inference --spanend_inclusive`;
   - newscrawl: `sbatch -p ILCC_GPU --exclude duflo,levi --array 100-119%4 --gres gpu:4 -o ./logdir/predict_nc_%a_%A.log predict_script_array.sh bert-base-uncased entity /home/s2063487/figer_simple_classifier/model_ckpts/model_ckpts/bert-base-uncased/entity_5e-05_0.0/checkpoint-95000 /disk/scratch/tli/figer_simple_classifier/model_ckpts/news_data/ newscrawl_gparser_typing_input entgraph_labels 0.01 --is_inference --spanend_inclusive`;
   - newscrawl: `sbatch -p PGR-Standard --array 0-10:4 --gres gpu:2 -o ./logdir/predict_nc_%a_%A.log predict_script_array.sh bert-base-uncased entity /home/s2063487/figer_simple_classifier/model_ckpts/model_ckpts/bert-base-uncased/entity_5e-05_0.0/checkpoint-95000 /disk/scratch_big/tli/figer_simple_classifier/model_ckpts/news_data/ newscrawl_gparser_typing_input entgraph_labels 0.05 --is_inference --spanend_inclusive`;
   - LevyHolt: `sbatch -p PGR-Standard --gres gpu:2 -o ./logdir/predict_levy_dev.log predict_script.sh bert-base-uncased entity /home/s2063487/figer_simple_classifier/model_ckpts/model_ckpts/bert-base-uncased/entity_5e-05_0.0/checkpoint-95000 levy_data /disk/scratch_big/tli/figer_simple_classifier/model_ckpts/levy_data/ dev_input _bert-base-uncased_entgraph_labels_0.0 0.05 --is_inference --spanend_inclusive --force_encode`;
5. Integrate Results:
   - newsspike: `nohup python -u integrate_results.py --data_dir ../news_data/ --data_name newsspike --num_slices 8 --job_name model > ns_integrate_modelout.log &`;
6. Generate typed corpus:
   - newsspike: `nohup python -u integrate_results.py --data_dir ../news_data/ --data_name newsspike --output_fn %s_gparser_typing_output.json --job_name corpus --parsed_fn ../../entGraph_NS/news_gen8_p.json > ns_integrate_corpus.log &`;
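The split and integrate steps above follow a slice-then-merge pattern: the loaded data is divided into `--num_slices` parts so that each Slurm array task can process one part, and the per-slice outputs are later stitched back together in order. The sketch below illustrates that pattern on in-memory lists. How `load_news_data.py` and `integrate_results.py` actually partition and merge the files is not shown in this README, so contiguous, near-equal slices are an assumption, and all function names are hypothetical.

```python
def slice_bounds(num_lines, num_slices):
    """Return [start, end) line ranges, one per slice, covering all lines in order.

    Assumption: slices are contiguous and differ in size by at most one line.
    """
    base, extra = divmod(num_lines, num_slices)
    bounds, start = [], 0
    for i in range(num_slices):
        end = start + base + (1 if i < extra else 0)
        bounds.append((start, end))
        start = end
    return bounds


def split_lines(lines, num_slices):
    """Cut a list of lines into contiguous slices."""
    return [lines[s:e] for s, e in slice_bounds(len(lines), num_slices)]


def integrate(slices):
    """Concatenate per-slice outputs back together in slice order."""
    return [line for part in slices for line in part]
```

Under this scheme, splitting followed by integration is lossless and order-preserving, which is what makes it safe to run the slices as independent array tasks.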