iPieter / dutch-lm-evaluation-harness

A framework for few-shot evaluation of autoregressive language models in Dutch.

Dutch LM Evaluation Harness

This is a fork of EleutherAI's (English) LM Evaluation Harness, extended with Dutch evaluations.

Overview

This project provides a unified framework to test generative language models for Dutch.

Features:

  • Support for Dutch evaluation benchmarks (e.g. SQUADNL).
  • Support for Dutch prompts.
  • Inherited features from the English LM Evaluation Harness:
    • Support for models loaded via transformers (including quantization via AutoGPTQ), GPT-NeoX, and Megatron-DeepSpeed, with a flexible tokenization-agnostic interface.
    • Support for fast and memory-efficient inference with vLLM.
    • Support for commercial APIs including OpenAI and TextSynth.
    • Support for evaluation on adapters (e.g. LoRA) supported in HuggingFace's PEFT library (see the sketch after this list).
    • Support for local models and benchmarks.
    • Evaluation with publicly available prompts ensures reproducibility and comparability between papers.
    • Easy support for custom prompts and evaluation metrics.
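
For example, the inherited PEFT support lets you evaluate a LoRA adapter on top of its base model. The command below is a minimal sketch, assuming the upstream harness's peft model argument works unchanged in this fork; the model names are placeholders taken from the upstream documentation:

# placeholder models; assumes the upstream `peft` model argument is unchanged in this fork
lm_eval --model hf \
    --model_args pretrained=EleutherAI/gpt-j-6b,peft=nomic-ai/gpt4all-j-lora \
    --tasks squadnl \
    --device cuda:0 \
    --batch_size 8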

Install

To install the lm-eval package from the GitHub repository, run:

git clone https://github.com/ipieter/dutch-lm-evaluation-harness
cd dutch-lm-evaluation-harness
pip install -e .
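
To check that the installation worked and see which tasks are available, you can list them; this assumes the upstream harness's --tasks list option is present in this fork:

# assumes the upstream `--tasks list` option is available in this fork
lm_eval --tasks list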

Basic Usage

Hugging Face transformers

To evaluate a model hosted on the Hugging Face Hub (e.g. GPT-2) on SQuAD-NL, you can use the following command (this assumes you are using a CUDA-compatible GPU):

lm_eval --model hf \
    --model_args pretrained=gpt2 \
    --tasks squadnl \
    --device cuda:0 \
    --batch_size 32
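
Additional arguments can be passed to the model constructor through --model_args. As a minimal sketch, assuming the upstream revision and dtype model arguments are available unchanged in this fork, you could pin a specific checkpoint revision and datatype like this:

# assumes the upstream `revision` and `dtype` model arguments are unchanged in this fork
lm_eval --model hf \
    --model_args pretrained=gpt2,revision=main,dtype=float32 \
    --tasks squadnl \
    --device cuda:0 \
    --batch_size 32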

Multi-GPU Evaluation with Hugging Face accelerate

To parallelize evaluation of HuggingFace models across multiple GPUs, we leverage the accelerate 🚀 library as follows:

accelerate launch -m lm_eval --model hf \
    --tasks squadnl \
    --batch_size 16

This performs data-parallel evaluation: a full copy of the model is placed on each available GPU and batches are split across GPUs, so evaluating on K GPUs is roughly K times faster than on one.
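
The number of data-parallel copies is controlled by accelerate's standard launcher options. A minimal sketch, assuming two available GPUs:

# assumes two GPUs; --num_processes is accelerate's standard launcher flag
accelerate launch --num_processes=2 -m lm_eval --model hf \
    --model_args pretrained=gpt2 \
    --tasks squadnl \
    --batch_size 16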

If your model is too large to fit on a single GPU, you can use accelerate with Fully Sharded Data Parallel (FSDP), which splits the model weights across your data-parallel ranks. To enable this, select YES when asked Do you want to use FullyShardedDataParallel? while running accelerate config. For memory-efficient loading, also select YES when asked Do you want each individually wrapped FSDP unit to broadcast module parameters from rank 0 at the start?. This ensures that only the rank 0 process loads the model and then broadcasts the parameters to the other ranks, instead of every rank loading all parameters, which can cause large RAM usage spikes around the start of the script and may lead to errors.

For finer-grained control over how model weights are split across devices and how much memory is used, the following arguments are also supported (see the sketch after this list):

  • device_map_option: how to split model weights across available GPUs. Defaults to "auto".
  • max_memory_per_gpu: the maximum GPU memory to use per GPU when loading the model.
  • max_cpu_memory: the maximum amount of CPU memory to use when offloading model weights to RAM.
  • offload_folder: a folder where model weights will be offloaded to disk if needed.
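
These options belong to the Hugging Face model backend inherited from the upstream harness and are passed through --model_args. A minimal sketch, assuming the upstream parallelize and device_map_option arguments are unchanged in this fork:

# assumes the upstream hf backend's parallelize/device_map_option arguments are unchanged in this fork
lm_eval --model hf \
    --model_args pretrained=gpt2,parallelize=True,device_map_option=auto \
    --tasks squadnl \
    --batch_size 16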

To use accelerate with the lm-eval command-line entry point, run:

accelerate launch --no_python lm-eval --model ...

License

MIT License

