"A Large Negation Benchmark to Challenge Large Language Models"

We introduce a large, semi-automatically generated dataset of ~400,000 descriptive sentences about commonsense knowledge. Each sentence can be true or false, and negation is present, in different forms, in about 2/3 of the corpus. We use the dataset to evaluate LLMs.

📖 Paper: This is not a Dataset: A Large Negation Benchmark to Challenge Large Language Models (EMNLP'23)
Dataset available in the 🤗HuggingFace Hub: HiTZ/This-is-not-a-dataset

We also provide the code to train and evaluate any LLM on the dataset, as well as the scorer to reproduce the results of the paper.

Dataset

The easiest and recommended way to download the dataset is using the 🤗HuggingFace Hub. See the Dataset Card for more information about the dataset.

from datasets import load_dataset

dataset = load_dataset("HiTZ/This-is-not-a-dataset")

We also distribute the dataset in this repository. See data/README.md for more information.

Requirements

The scorer evaluate.py has no dependencies. If you want to run the training or evaluation code, you need:

# Required dependencies
PyTorch>=1.9 (2.1.0 recommended)
https://pytorch.org/get-started/locally/

transformers
pip install transformers

accelerate 
pip install accelerate

wandb
pip install wandb

# Optional dependencies

bitsandbytes >= 0.40.0 # For 4 / 8 bit quantization
pip install bitsandbytes

PEFT >= 0.4.0 # For LoRA
pip install peft

# You can install all the dependencies with:
pip3 install --upgrade torch transformers accelerate wandb bitsandbytes peft

Evaluating an LLM

We provide a script to evaluate any LLM on the dataset. First, create a configuration file; see configs/zero-shot for examples. The script evaluates the model on our dataset in a zero-shot setting. Here is an example config to evaluate Llama2-7b Chat:

We will format the inputs using the chat_template stored in the tokenizer

#Model args
model_name_or_path: meta-llama/Llama-2-7b-chat-hf
# Dtype in which we will load the model. You can use bfloat16 if you want to save memory
torch_dtype: "auto"
# Performs quantization using the bitsandbytes integration. Allows evaluating LLMs on consumer hardware
quantization: 4
# If force_auto_device_map is set to true, we will split the model across all the available GPUs and the CPU. This is useful for large models that do not fit in a single GPU's VRAM.
force_auto_device_map: false
# If set to false, we will sample the probability of generating the True or False tokens (recommended). If set to true, the model will generate a text and we will attempt to locate the string "true" or "false" in the output.
predict_with_generate: false
# Batch size for evaluation. We use auto_batch_finder, so this value only sets the maximum batch size; if a batch does not fit in memory, the batch size will be reduced.
per_device_eval_batch_size: 32
# Add fewshot examples to the input
fewshot: false

# dataset arguments
do_train: false
do_eval: false
do_predict: true
# For zero-shot settings, you can evaluate the model in the concatenation of the train, dev and test sets. Set to false to only evaluate in the test set.
do_predict_full_dataset: false
max_seq_length: 4096

# Output Dir
output_dir: results/zero-shot/llama-2-7b-chat-hf
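
The predict_with_generate comment in the config above describes two ways of turning model output into a True/False prediction. The following is a minimal sketch of both ideas, not the code used by run.py: the function names are hypothetical, and the two logits are assumed to be the scores the model assigns to the "True" and "False" tokens.

```python
import math
import re
from typing import Optional


def label_from_logits(true_logit: float, false_logit: float) -> bool:
    """predict_with_generate: false -> compare the two candidate tokens."""
    # Softmax restricted to the two candidate tokens; the rest of the
    # vocabulary is ignored. Ties resolve to True (an arbitrary choice).
    max_logit = max(true_logit, false_logit)
    exp_true = math.exp(true_logit - max_logit)
    exp_false = math.exp(false_logit - max_logit)
    prob_true = exp_true / (exp_true + exp_false)
    return prob_true >= 0.5


def label_from_text(generated: str) -> Optional[bool]:
    """predict_with_generate: true -> look for "true" or "false" in the text."""
    match = re.search(r"\b(true|false)\b", generated, flags=re.IGNORECASE)
    if match is None:
        return None  # The model produced neither word.
    return match.group(1).lower() == "true"


print(label_from_logits(2.0, -1.0))
print(label_from_text("The sentence is False."))
```

The text-matching variant can fail to find an answer at all, which is one reason the probability-based setting is the recommended one.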

Once you have created the config file, you can run the evaluation script:

accelerate launch run.py --config configs/zero-shot/Llama2-7b.yaml

You can use accelerate to run the evaluation on multiple GPUs. See the accelerate documentation for more information.

accelerate launch --multi_gpu --num_processes 2 run.py --config configs/zero-shot/Llama2-7b.yaml

We also support DeepSpeed ZeRO stage 3 to split large models across multiple GPUs. See configs/zero-shot/base_deepspeed.yaml for an example.

accelerate launch --use_deepspeed --num_processes 4 --deepspeed_config_file configs/deepspeed_configs/deepspeed_zero3.json run.py --config configs/zero-shot/base_deepspeed.yaml --model_name_or_path HuggingFaceH4/zephyr-7b-beta --output_dir results/zero-shot/zephyr-7b-beta

If you want to evaluate multiple models, you can override the model_name_or_path and output_dir values of a config from the command line.

for model_name in \
meta-llama/Llama-2-70b-chat-hf \
meta-llama/Llama-2-70b-hf \
meta-llama/Llama-2-13b-chat-hf \
meta-llama/Llama-2-13b-hf \
meta-llama/Llama-2-7b-chat-hf \
meta-llama/Llama-2-7b-hf \
mistralai/Mistral-7B-Instruct-v0.2 \
mistralai/Mixtral-8x7B-Instruct-v0.1 \
do

accelerate launch run.py --config configs/zero-shot/base.yaml --model_name_or_path "$model_name" --output_dir results/zero-shot/"$model_name"

done
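
If you prefer to drive the same sweep from Python, the loop above can be sketched as follows. The flags are the ones used in the shell loop; the helper name build_command and the shortened model list are only illustrative. Note that model names contain a "/", so each output directory is nested under the organization name.

```python
import shlex
from typing import List

# Shortened, illustrative model list; extend it as needed.
MODELS = [
    "meta-llama/Llama-2-7b-chat-hf",
    "mistralai/Mistral-7B-Instruct-v0.2",
]


def build_command(model_name: str, config: str = "configs/zero-shot/base.yaml") -> List[str]:
    """Build the argv for one evaluation run, overriding model and output dir."""
    return [
        "accelerate", "launch", "run.py",
        "--config", config,
        "--model_name_or_path", model_name,
        "--output_dir", f"results/zero-shot/{model_name}",
    ]


for model in MODELS:
    # Only print the command here. To execute it, pass the list to
    # subprocess.run(build_command(model), check=True).
    print(shlex.join(build_command(model)))
```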

Few shot examples

When doing zero-shot inference, adding few-shot examples to the input can improve the performance of the models. If you set the flag fewshot: true, we will add 4 examples from each pattern (44 in total) as few-shot examples to the input. See configs/zero-shot/base_fewshot.yaml and configs/zero-shot/base_fewshot_deepspeed.yaml for examples.
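
As a rough illustration of what "4 examples from each pattern" means, the sketch below groups examples by their pattern field and draws 4 from each group. It is an assumption-laden sketch, not the selection logic of run.py: the function name, the fixed seed, and the toy pattern names are all hypothetical, and how the shots are formatted into the prompt is not shown.

```python
import random
from collections import defaultdict
from typing import Dict, List


def sample_fewshot(examples: List[dict], per_pattern: int = 4, seed: int = 0) -> List[dict]:
    """Pick `per_pattern` examples from every pattern found in `examples`."""
    by_pattern: Dict[str, List[dict]] = defaultdict(list)
    for example in examples:
        by_pattern[example["pattern"]].append(example)

    rng = random.Random(seed)  # Fixed seed so the same shots are reused across runs.
    shots: List[dict] = []
    for pattern in sorted(by_pattern):
        pool = by_pattern[pattern]
        shots.extend(rng.sample(pool, min(per_pattern, len(pool))))
    return shots


# Toy data: 11 hypothetical pattern names with 10 sentences each.
toy = [
    {"pattern": f"Pattern{p}", "sentence": f"sentence {i}", "label": i % 2 == 0}
    for p in range(11)
    for i in range(10)
]
print(len(sample_fewshot(toy)))  # 44
```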

for model_name in \
meta-llama/Llama-2-70b-chat-hf \
meta-llama/Llama-2-70b-hf \
meta-llama/Llama-2-13b-chat-hf \
meta-llama/Llama-2-13b-hf \
meta-llama/Llama-2-7b-chat-hf \
meta-llama/Llama-2-7b-hf \
mistralai/Mistral-7B-Instruct-v0.2 \
mistralai/Mixtral-8x7B-Instruct-v0.1 \
do

# Run the model with 4 bit quantization using data parallel and 4 GPUs (One copy of the model per GPU)
accelerate launch --multi_gpu --num_processes 4 run.py \
 --config configs/zero-shot/base_fewshot.yaml --model_name_or_path "$model_name" --output_dir results/fewshot/"$model_name"

# Run the model in bfloat16 with deepspeed zero stage 3 using 4 GPUs (Split the model across 4 GPUs)
accelerate launch --use_deepspeed --num_processes 4 --deepspeed_config_file configs/deepspeed_configs/deepspeed_zero3.json run.py \
 --config configs/zero-shot/base_fewshot_deepspeed.yaml --model_name_or_path "$model_name" --output_dir results/fewshot/"$model_name"

done

Training an LLM

You can train an LLM on our dataset. First, create a configuration file; see configs/train for examples. Here is an example config to fine-tune Llama2-7b Chat:

#Model args
model_name_or_path: meta-llama/Llama-2-7b-chat-hf
torch_dtype: "bfloat16"
# We use LoRA for efficient training. Without LoRA you would need 4xA100 to train Llama2-7b Chat. See https://arxiv.org/abs/2106.09685
use_lora: true
quantization: 4
predict_with_generate: false
conversation_template: llama-2
force_auto_device_map: false

# Dataset arguments
do_train: true
do_eval: true
do_predict: true
do_predict_full_dataset: false
max_seq_length: 512

# Train only on a single pattern, e.g. Synonymy1, Hypernymy, etc.
pattern: null
# Train only on affirmative sentences
only_affirmative: False
# Train only on negative sentences
only_negative: False
# Train only on sentences without a distractor
only_non_distractor: False
# Train only on sentences with a distractor
only_distractor: False

#Training arguments
per_device_train_batch_size: 32
gradient_accumulation_steps: 1
per_device_eval_batch_size: 32
optim: paged_adamw_32bit
learning_rate: 0.0003
weight_decay: 0
num_train_epochs: 3
lr_scheduler_type: cosine
warmup_ratio: 0.03

# Output Dir
output_dir: results/finetune/Llama-2-7b-chat-hf
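
The pattern and only_* options in the config above select a subset of the training data. The sketch below shows one plausible reading of those flags using the fields that appear in the dataset files (pattern, negation_type, isDistractor). It is not the repository's implementation: the function name is hypothetical, and it assumes affirmative sentences are the ones with negation_type set to "affirmation".

```python
from typing import Iterable, List, Optional


def filter_examples(
    examples: Iterable[dict],
    pattern: Optional[str] = None,
    only_affirmative: bool = False,
    only_negative: bool = False,
    only_non_distractor: bool = False,
    only_distractor: bool = False,
) -> List[dict]:
    """Keep only the examples selected by the config-style flags."""
    kept = []
    for ex in examples:
        # Assumption: any negation_type other than "affirmation" is a negation.
        is_affirmative = ex["negation_type"] == "affirmation"
        if pattern is not None and ex["pattern"] != pattern:
            continue
        if only_affirmative and not is_affirmative:
            continue
        if only_negative and is_affirmative:
            continue
        if only_non_distractor and ex["isDistractor"]:
            continue
        if only_distractor and not ex["isDistractor"]:
            continue
        kept.append(ex)
    return kept


# Toy examples; "verbal" is used here as an illustrative negation type.
data = [
    {"pattern": "Synonymy1", "negation_type": "affirmation", "isDistractor": False},
    {"pattern": "Synonymy1", "negation_type": "verbal", "isDistractor": True},
    {"pattern": "Hypernymy", "negation_type": "verbal", "isDistractor": False},
]
print(len(filter_examples(data, pattern="Synonymy1", only_negative=True)))  # 1
```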

Once you have created the config file, you can run the training script:

# Single GPU with bfloat16 mixed precision
accelerate launch --mixed_precision bf16 run.py --config configs/train/Llama2-7b.yaml
# Multi GPU with bfloat16 mixed precision
accelerate launch --multi_gpu --num_processes 2 --mixed_precision bf16 run.py --config configs/train/Llama2-7b.yaml
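
When moving from one GPU to several, keep in mind that the number of examples per optimizer step grows with the number of processes. The sketch below works through that arithmetic for the example config and also plots out, in code, the learning-rate shape that a cosine schedule with a warmup ratio conventionally denotes (linear warmup, then cosine decay to zero). This is an illustration under those assumptions; the exact schedule depends on the trainer, and the step count used here is hypothetical.

```python
import math


def effective_batch_size(per_device: int, grad_accum: int, num_gpus: int) -> int:
    """Examples consumed per optimizer step with data parallelism."""
    return per_device * grad_accum * num_gpus


def cosine_lr(step: int, total_steps: int, peak_lr: float = 3e-4, warmup_ratio: float = 0.03) -> float:
    """Linear warmup to peak_lr, then cosine decay to zero (assumed shape)."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# per_device_train_batch_size: 32, gradient_accumulation_steps: 1, 2 GPUs
print(effective_batch_size(32, 1, 2))  # 64

# Hypothetical run of 1000 optimizer steps.
for step in (0, 15, 30, 500, 1000):
    print(step, cosine_lr(step, total_steps=1000))
```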

Scorer

If you use our run.py script, the models will be evaluated automatically. If you instead want to evaluate results generated by your own code, you can use the evaluate.py script. The scorer expects a .jsonl file as input, similar to the dataset files but with the extra field prediction. This field should contain the prediction for each example as a boolean true or false. Each line should be a JSON object. For example:

{"pattern_id": 1, "pattern": "Synonymy1", "test_id": 0, "negation_type": "affirmation", "semantic_type": "none", "syntactic_scope": "none", "isDistractor": false, "label": true, "sentence": "An introduction is commonly the first section of a communication.", "prediction": true}
{"pattern_id": 1, "pattern": "Synonymy1", "test_id": 0, "negation_type": "affirmation", "semantic_type": "none", "syntactic_scope": "none", "isDistractor": true, "label": false, "sentence": "An introduction is commonly the largest possible quantity.", "prediction": false}
...

You can call the scorer with the following command:

python3 evaluate.py --predictions_path <path_to_input_file>.jsonl --output_path <path_to_output_scores>.json
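
If your predictions come from custom code, a file in the expected shape can be produced with a few lines of Python. The sketch below copies each dataset example and adds the prediction field as a JSON boolean; the helper name write_predictions and the output file name are hypothetical.

```python
import json
import os
import tempfile
from typing import List


def write_predictions(examples: List[dict], predictions: List[bool], path: str) -> None:
    """Write one JSON object per line, adding a boolean `prediction` field."""
    if len(examples) != len(predictions):
        raise ValueError("examples and predictions must have the same length")
    with open(path, "w", encoding="utf-8") as f:
        for example, prediction in zip(examples, predictions):
            record = dict(example)  # Copy so the input is left untouched.
            record["prediction"] = bool(prediction)
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


examples = [
    {"pattern": "Synonymy1", "negation_type": "affirmation", "isDistractor": False,
     "label": True, "sentence": "An introduction is commonly the first section of a communication."},
]
out_path = os.path.join(tempfile.mkdtemp(), "predictions.jsonl")
write_predictions(examples, [True], out_path)
print(out_path)
```

The resulting file can then be passed to evaluate.py through --predictions_path.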

Scorer Result Interpretation

The scorer will output the following metrics. See the results/ folder for an example of the output of the scorer.

all_affirmations: Accuracy of the model in affirmative sentences
all_negations: Accuracy of the model in negative sentences
all: (Overall) Accuracy of the model in all sentences
input_affirmation: Accuracy of the model in affirmative sentences without distractors
input_negation: Accuracy of the model in negative sentences without distractors
distractor_affirmation: Accuracy of the model in affirmative sentences with distractors
distractor_negation: Accuracy of the model in negative sentences with distractors
Negation_analysis: Fine-grained analysis of the model in negative sentences (verbal, analytic, clausal, non_verbal, synthetic, subclausal negation types)
coherence_scores: Coherence scores of the predictions. Affirmation-Negation_Input refers to the coherence between the affirmative and negative sentences without distractor. Similarly Affirmation-Negation_Distractor refers to the coherence between the affirmative and negative sentences with distractor. Read the paper for more information about the Coherence metric.
Synonymy1, Hypernymy, Part...: Fine-grained analysis of the model in each pattern
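
To make the first seven metrics concrete, the sketch below computes them as plain accuracies over subsets of the prediction records. It is an illustration, not the evaluate.py implementation: it assumes that affirmative sentences have negation_type set to "affirmation" and that every other value counts as a negation, and it leaves out the negation analysis, the coherence scores, and the per-pattern breakdown.

```python
from typing import Callable, Dict, List, Optional


def is_affirmative(record: dict) -> bool:
    # Assumption: any negation_type other than "affirmation" is a negation.
    return record["negation_type"] == "affirmation"


def accuracy(records: List[dict], keep: Callable[[dict], bool]) -> Optional[float]:
    """Accuracy over the records selected by `keep`; None if the subset is empty."""
    selected = [r for r in records if keep(r)]
    if not selected:
        return None
    correct = sum(r["prediction"] == r["label"] for r in selected)
    return correct / len(selected)


def summarize(records: List[dict]) -> Dict[str, Optional[float]]:
    return {
        "all_affirmations": accuracy(records, is_affirmative),
        "all_negations": accuracy(records, lambda r: not is_affirmative(r)),
        "all": accuracy(records, lambda r: True),
        "input_affirmation": accuracy(records, lambda r: is_affirmative(r) and not r["isDistractor"]),
        "input_negation": accuracy(records, lambda r: not is_affirmative(r) and not r["isDistractor"]),
        "distractor_affirmation": accuracy(records, lambda r: is_affirmative(r) and r["isDistractor"]),
        "distractor_negation": accuracy(records, lambda r: not is_affirmative(r) and r["isDistractor"]),
    }


# Toy records; "verbal" is used as an illustrative negation type.
records = [
    {"negation_type": "affirmation", "isDistractor": False, "label": True, "prediction": True},
    {"negation_type": "affirmation", "isDistractor": True, "label": False, "prediction": False},
    {"negation_type": "verbal", "isDistractor": False, "label": False, "prediction": True},
    {"negation_type": "verbal", "isDistractor": True, "label": True, "prediction": True},
]
for name, value in summarize(records).items():
    print(name, value)
```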

Citation

@inproceedings{garcia-ferrero-etal-2023-dataset,
    title = "This is not a Dataset: A Large Negation Benchmark to Challenge Large Language Models",
    author = "Garc{\'\i}a-Ferrero, Iker  and
      Altuna, Bego{\~n}a  and
      Alvez, Javier  and
      Gonzalez-Dios, Itziar  and
      Rigau, German",
    editor = "Bouamor, Houda  and
      Pino, Juan  and
      Bali, Kalika",
    booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing",
    month = dec,
    year = "2023",
    address = "Singapore",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.emnlp-main.531",
    doi = "10.18653/v1/2023.emnlp-main.531",
    pages = "8596--8615",
}
