
SARN: Inference

Quantifiers and monotonicity (and opposite adjectives) in reasoning tasks

Setup

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Usage

source .venv/bin/activate
# Finetuning
python -m sarn.train --output-dir models/bart-mq --log-dir logs/bart-mq facebook/bart-large-mnli data/training.csv
# Inference on a pair of sequences (premise first, then hypothesis)
python -m sarn.classify --model models/bart-mq "All dogs jumped over the fence." "Some small dogs jumped over the fence."
# ROC curve (SVG and PDF diagram)
python -m sarn.roc microsoft/deberta-large-mnli data/evaluation.csv
# Model accuracy on dataset
python -m sarn.accuracy models/deberta-mq data/evaluation.csv
# Dataset statistics
python -m sarn.stats data/training.csv
# Language Interpretability Tool
python -m sarn.lit \
  --models "facebook/bart-large-mnli" \
           "microsoft/deberta-large-mnli" \
           "./models/bart-mq" \
           "./models/deberta-mq" \
           "./models/bart-adj" \
           "./models/deberta-adj" \
  --datasets "./data/evaluation.csv" "./data/evaluation-adj.csv" \
  --cache_dir=cache_dir

Any valid Hugging Face model that has been fine-tuned for sequence classification can be specified as the model, either a hub name such as facebook/bart-large-mnli or microsoft/deberta-large-mnli, or a local path such as models/bart-mq.
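Roughly what such an inference call amounts to, sketched with the standard transformers API (a minimal sketch; sarn.classify's actual implementation may differ):

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "models/bart-mq"  # hub name or local path, as described above
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name).eval()

premise = "All dogs jumped over the fence."
hypothesis = "Some small dogs jumped over the fence."
enc = tok(premise, hypothesis, return_tensors="pt", truncation=True)
with torch.no_grad():
    probs = model(**enc).logits.softmax(dim=-1)[0]
# Map each class probability to its label (e.g., contradiction/neutral/entailment)
print({model.config.id2label[i]: round(float(p), 4) for i, p in enumerate(probs)})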

Tips

Download Models from CoLi Servers

# export COLI_USER=<your name>
scp -r ${COLI_USER:?}@last.cl.uni-heidelberg.de:/mnt/semproj/sem_proj20/proj1/models .

Check that dataset labels contain no typos

for i in ./data/*.csv; do
  python3 -m sarn.validate_datasets "$i"
done
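A minimal sketch of the kind of check sarn.validate_datasets can be expected to perform, assuming the label sits in the third CSV column (as the manual labeling steps below suggest) and uses the MNLI label set; the module's actual checks may differ:

import csv
import sys

# Assumed label set; the repository's datasets appear to use MNLI labels.
ALLOWED = {"contradiction", "neutral", "entailment"}

def validate(path: str) -> bool:
    """Report every row whose third column is not a known label."""
    ok = True
    with open(path, newline="") as f:
        for lineno, row in enumerate(csv.reader(f), start=1):
            label = row[2].strip() if len(row) >= 3 else None
            if label not in ALLOWED:
                print(f"{path}:{lineno}: unexpected label {label!r}")
                ok = False
    return ok

if __name__ == "__main__":
    results = [validate(p) for p in sys.argv[1:]]
    sys.exit(0 if all(results) else 1)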

Datasets

See data/README.md for more information.

  • data/training.csv: Training dataset for quantifiers and monotonicity in reasoning tasks
  • data/evaluation.csv: Evaluation dataset for quantifiers and monotonicity in reasoning tasks
  • data/training-adj.csv: Training dataset for quantifiers and monotonicity in reasoning tasks with opposite adjectives
  • data/evaluation-adj.csv: Evaluation dataset for quantifiers and monotonicity in reasoning tasks with opposite adjectives

Creating the datasets

source .venv/bin/activate

# data/training.csv
wget https://github.com/verypluming/MED/raw/master/MED.tsv
python -m sarn.convert.med
wget https://github.com/verypluming/HELP/raw/master/output_en/pmb_train_v1.0.tsv
python -m sarn.convert.help
cat data/med.csv data/help.csv > data/training.csv

# data/evaluation.csv
wget https://nlp.stanford.edu/~wcmac/downloads/fracas.xml
python -m sarn.convert.fracas
wget -O diagnostic-full.tsv https://www.dropbox.com/s/ju7d95ifb072q9f/diagnostic-full.tsv?dl=1
python -m sarn.convert.superglue
cat data/fracas.csv data/superglue.csv > data/evaluation.csv

# data/training-adj.csv
wget https://github.com/verypluming/MED/raw/master/MED.tsv
python -m sarn.convert.med_adjectives
# manual step here (you may modify sentences where it makes sense):
# - label data/med_adjectives_1.csv by hand (third column)
# - label data/med_adjectives_2.csv by hand (third column)
# - remove fourth column in both files
wget https://nlp.stanford.edu/~wcmac/downloads/fracas.xml
python -m sarn.convert.fracas_adjectives
cat data/med_adjectives_1.csv data/med_adjectives_2.csv data/fracas_adjectives.csv > data/training-adj.csv
python -m sarn.validate_datasets data/training-adj.csv

# data/evaluation-adj.csv
python -m sarn.convert.evaluation_adjectives
# manual step here (you may modify sentences where it makes sense):
# - label data/evaluation-adj.csv by hand (third column)
# - remove fourth column
python -m sarn.validate_datasets data/evaluation-adj.csv
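The sarn.convert.* modules presumably all follow the same shape: read a source TSV, pick out premise, hypothesis, and label, and write a three-column CSV. A rough sketch for MED; the column names here are assumptions borrowed from the common NLI TSV layout, so check the actual file header before relying on them:

import csv

# Hypothetical converter sketch, not the repository's sarn.convert.med.
with open("MED.tsv", newline="") as src, \
     open("data/med.csv", "w", newline="") as dst:
    reader = csv.DictReader(src, delimiter="\t")
    writer = csv.writer(dst)
    for row in reader:
        # "sentence1"/"sentence2"/"gold_label" are assumed column names.
        writer.writerow([row["sentence1"], row["sentence2"], row["gold_label"]])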

Dataset statistics

Character length

| Dataset | | avg | median | min | max |
| --- | --- | ---: | ---: | ---: | ---: |
| data/training.csv | Premises | 48.26 | 41 | 5 | 478 |
| | Hypotheses | 48.93 | 42 | 5 | 478 |
| data/evaluation.csv | Premises | 79.84 | 58 | 26 | 206 |
| | Hypotheses | 61.57 | 50 | 26 | 186 |
| data/training-adj.csv | Premises | 48.93 | 44 | 14 | 212 |
| | Hypotheses | 50.78 | 46 | 18 | 210 |
| data/evaluation-adj.csv | Premises | 100.19 | 83 | 25 | 189 |
| | Hypotheses | 86.62 | 69 | 35 | 189 |

Word length

| Dataset | | avg | median | min | max |
| --- | --- | ---: | ---: | ---: | ---: |
| data/training.csv | Premises | 9.98 | 9 | 2 | 83 |
| | Hypotheses | 10.10 | 9 | 2 | 83 |
| data/evaluation.csv | Premises | 13.03 | 10 | 5 | 34 |
| | Hypotheses | 10.14 | 9 | 5 | 30 |
| data/training-adj.csv | Premises | 8.89 | 8 | 3 | 29 |
| | Hypotheses | 9.18 | 9 | 3 | 29 |
| data/evaluation-adj.csv | Premises | 15.44 | 12 | 5 | 31 |
| | Hypotheses | 13.47 | 11 | 5 | 31 |

Labels

| Dataset | total | contradiction | neutral | entailment |
| --- | ---: | ---: | ---: | ---: |
| data/training.csv | 41'273 | 0 (0.00%) | 20'699 (50.15%) | 20'574 (49.85%) |
| data/evaluation.csv | 118 | 15 (12.71%) | 52 (44.07%) | 51 (43.22%) |
| data/training-adj.csv | 1'206 | 420 (34.83%) | 749 (62.11%) | 37 (3.07%) |
| data/evaluation-adj.csv | 144 | 47 (32.64%) | 84 (58.33%) | 13 (9.03%) |
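
All of the statistics above can be recomputed with a few lines; a sketch for data/training.csv, assuming header-less premise,hypothesis,label columns (sarn.stats's exact implementation may differ):

import csv
import statistics
from collections import Counter

def describe(values):
    """avg / median / min / max, as in the tables above."""
    return (sum(values) / len(values), statistics.median(values),
            min(values), max(values))

with open("data/training.csv", newline="") as f:
    rows = [r[:3] for r in csv.reader(f)]

premises = [r[0] for r in rows]
print("premise chars: avg=%.2f median=%s min=%s max=%s"
      % describe([len(p) for p in premises]))
print("premise words: avg=%.2f median=%s min=%s max=%s"
      % describe([len(p.split()) for p in premises]))

# Label distribution, as in the Labels table.
labels = Counter(r[2] for r in rows)
for label, n in labels.most_common():
    print(f"{label}: {n} ({100 * n / len(rows):.2f}%)")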

Models

  • facebook/bart-large-mnli: Pretrained model of BART finetuned on MultiNLI
  • microsoft/deberta-large-mnli: Pretrained model of DeBERTa finetuned on MultiNLI
  • models/bart-mq: finetuned version of facebook/bart-large-mnli on data/training.csv
  • models/deberta-mq: finetuned version of microsoft/deberta-large-mnli on data/training.csv
  • models/bart-adj: finetuned version of models/bart-mq on data/training-adj.csv
  • models/deberta-adj: finetuned version of models/deberta-mq on data/training-adj.csv

Creating the models

source .venv/bin/activate

# facebook/bart-large-mnli and microsoft/deberta-large-mnli will automatically
# be downloaded from huggingface.co when used

# models/bart-mq
python -m sarn.train --output-dir "models/bart-mq" --log-dir "logs/bart-mq" "facebook/bart-large-mnli" "data/training.csv"

# models/deberta-mq
python -m sarn.train --output-dir "models/deberta-mq" --log-dir "logs/deberta-mq" "microsoft/deberta-large-mnli" "data/training.csv"

# models/bart-adj (continues finetuning from models/bart-mq, cf. the Models list above)
python -m sarn.train --output-dir "models/bart-adj" --log-dir "logs/bart-adj" "models/bart-mq" "data/training-adj.csv"

# models/deberta-adj (continues finetuning from models/deberta-mq)
python -m sarn.train --output-dir "models/deberta-adj" --log-dir "logs/deberta-adj" "models/deberta-mq" "data/training-adj.csv"
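
Under the hood, sarn.train presumably wraps the standard transformers Trainer. A minimal sketch of that shape, shown for models/bart-mq; the hyperparameters (epochs, batch size, max length) are illustrative assumptions, not the repository's actual settings:

import csv
from torch.utils.data import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Hypothetical sketch, not the repository's sarn.train.
BASE = "facebook/bart-large-mnli"
tok = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForSequenceClassification.from_pretrained(BASE)
label2id = {v.lower(): k for k, v in model.config.id2label.items()}

class NliCsv(Dataset):
    """Header-less premise,hypothesis,label CSV (an assumption)."""
    def __init__(self, path):
        with open(path, newline="") as f:
            self.rows = [r[:3] for r in csv.reader(f)]
    def __len__(self):
        return len(self.rows)
    def __getitem__(self, i):
        premise, hypothesis, label = self.rows[i]
        enc = tok(premise, hypothesis, truncation=True,
                  padding="max_length", max_length=128)
        enc["labels"] = label2id[label.strip().lower()]
        return enc

args = TrainingArguments(output_dir="models/bart-mq", logging_dir="logs/bart-mq",
                         num_train_epochs=3, per_device_train_batch_size=8)
Trainer(model=model, args=args, train_dataset=NliCsv("data/training.csv")).train()
model.save_pretrained("models/bart-mq")
tok.save_pretrained("models/bart-mq")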

Model statistics

Accuracy

| Model | data/evaluation.csv | data/evaluation-adj.csv |
| --- | ---: | ---: |
| facebook/bart-large-mnli | 65.25% | 40.97% |
| microsoft/deberta-large-mnli | 71.19% | 47.22% |
| models/bart-mq | 57.63% | 34.72% |
| models/deberta-mq | 61.86% | 34.72% |
| models/bart-adj | 45.76% | 58.33% |
| models/deberta-adj | 42.37% | 57.64% |
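
These numbers can be reproduced in spirit with a short loop; a sketch assuming three header-less CSV columns (premise, hypothesis, label) and case-insensitive label matching. sarn.accuracy's actual implementation may differ:

import csv
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_dir = "models/deberta-mq"
tok = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForSequenceClassification.from_pretrained(model_dir).eval()

correct = total = 0
with open("data/evaluation.csv", newline="") as f:
    for row in csv.reader(f):
        premise, hypothesis, gold = row[:3]
        enc = tok(premise, hypothesis, return_tensors="pt", truncation=True)
        with torch.no_grad():
            pred = model.config.id2label[int(model(**enc).logits.argmax())]
        correct += pred.lower() == gold.strip().lower()
        total += 1
print(f"accuracy: {correct / total:.2%}")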

ROC curves

> An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a classification model at all classification thresholds. [...] AUC stands for "Area under the ROC Curve." That is, AUC measures the entire two-dimensional area underneath the entire ROC curve (think integral calculus) from (0,0) to (1,1). [...] AUC provides an aggregate measure of performance across all possible classification thresholds. One way of interpreting AUC is as the probability that the model ranks a random positive example more highly than a random negative example.
>
> (Google Machine Learning Crash Course)

ROC curve diagrams for BART and DeBERTa (generated by sarn.roc).
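
For a concrete sense of how such a diagram is produced, here is a toy sketch with scikit-learn and matplotlib. Since the task is three-way (contradiction/neutral/entailment), a real implementation would binarize one class against the rest; the data below is made up for illustration and sarn.roc's actual handling may differ:

import matplotlib
matplotlib.use("Agg")  # headless: write files only
import matplotlib.pyplot as plt
from sklearn.metrics import auc, roc_curve

# Toy data: 1 = positive class (one-vs-rest), scores = the model's
# softmax probability for that class.
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_score = [0.10, 0.40, 0.35, 0.80, 0.70, 0.20, 0.90, 0.55]

fpr, tpr, _ = roc_curve(y_true, y_score)
plt.plot(fpr, tpr, label=f"AUC = {auc(fpr, tpr):.2f}")
plt.plot([0, 1], [0, 1], linestyle="--", color="grey")  # chance line
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.savefig("roc.svg")  # like sarn.roc: SVG and PDF output
plt.savefig("roc.pdf")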
