GenAIEval

Evaluation, benchmark, and scorecard targeting performance (throughput and latency), accuracy on popular evaluation harnesses, safety, and hallucination.

Installation

git clone https://github.com/opea-project/GenAIEval
cd GenAIEval
pip install -e .
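
After installation, a quick sanity check is to import the package (this assumes the project installs under the import name GenAIEval used in the examples below):

import GenAIEval  # succeeds once "pip install -e ." has completed
print(GenAIEval.__file__)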

Evaluation

lm-evaluation-harness

For evaluating models on text-generation tasks, we follow lm-evaluation-harness and provide both command line and function call usage. Over 60 standard academic benchmarks for LLMs are covered, with hundreds of subtasks and variants, such as ARC, HellaSwag, MMLU, TruthfulQA, Winogrande, and GSM8K.

command line usage

python main.py \
    --model hf \
    --model_args pretrained=EleutherAI/gpt-j-6B \
    --tasks hellaswag \
    --device cpu \
    --batch_size 8

function call usage

from GenAIEval.evaluation.lm_evaluation_harness import evaluate, LMEvalParser

args = LMEvalParser(
    model="hf",
    user_model=user_model,  # a pre-loaded Hugging Face model object
    tokenizer=tokenizer,    # the tokenizer matching user_model
    tasks="hellaswag",
    device="cpu",
    batch_size=8,
)
results = evaluate(args)
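
The user_model and tokenizer arguments above are a pre-loaded Hugging Face model and its matching tokenizer. A minimal sketch of preparing them with transformers, using the same checkpoint as the command line example (the model choice and loading style here are illustrative, not required by the harness):

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the same checkpoint as in the command line example above.
model_name = "EleutherAI/gpt-j-6B"
user_model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)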

bigcode-evaluation-harness

For evaluating models on coding tasks, or coding LLMs specifically, we follow bigcode-evaluation-harness and provide both command line and function call usage. HumanEval, HumanEval+, InstructHumanEval, APPS, MBPP, MBPP+, and DS-1000 are available in both completion (left-to-right) and insertion (FIM) modes.

command line usage

A small change to the import path in main.py is required:

- from GenAIEval.evaluation.lm_evaluation_harness import evaluate, setup_parser
+ from GenAIEval.evaluation.bigcode_evaluation_harness import evaluate, setup_parser
python main.py \
    --model "codeparrot/codeparrot-small" \
    --tasks "humaneval" \
    --n_samples 100 \
    --batch_size 10 \
    --allow_code_execution

function call usage

from GenAIEval.evaluation.bigcode_evaluation_harness import evaluate, BigcodeEvalParser

args = BigcodeEvalParser(
    user_model=user_model,  # a pre-loaded Hugging Face model object
    tokenizer=tokenizer,    # the tokenizer matching user_model
    tasks="humaneval",
    n_samples=100,
    batch_size=10,
    allow_code_execution=True,
)
results = evaluate(args)
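
As with the text-generation harness, user_model and tokenizer are a pre-loaded model and tokenizer. A minimal sketch using the checkpoint from the command line example (illustrative only; any code LLM supported by bigcode-evaluation-harness can be substituted):

from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "codeparrot/codeparrot-small"
user_model = AutoModelForCausalLM.from_pretrained(checkpoint)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)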


License: Apache License 2.0

