Evaluation, benchmarks, and scorecards, targeting performance (throughput and latency), accuracy on popular evaluation harnesses, safety, and hallucination.
git clone https://github.com/opea-project/GenAIEval
cd GenAIEval
pip install -e .
For evaluating models on text-generation tasks, we follow lm-evaluation-harness and provide both command line usage and function call usage. Over 60 standard academic benchmarks for LLMs are supported, with hundreds of subtasks and variants implemented, such as ARC, HellaSwag, MMLU, TruthfulQA, Winogrande, GSM8K, and so on.
python main.py \
--model hf \
--model_args pretrained=EleutherAI/gpt-j-6B \
--tasks hellaswag \
--device cpu \
--batch_size 8
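Alternatively, the evaluation can be driven through the function call interface. It expects an already-loaded model and tokenizer; a minimal sketch, assuming a Hugging Face transformers checkpoint (the model name mirrors the command line example above and is only illustrative):

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model under evaluation and its tokenizer; any causal LM checkpoint works.
model_name = "EleutherAI/gpt-j-6B"
user_model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)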
from GenAIEval.evaluation.lm_evaluation_harness import evaluate, LMEvalParser

args = LMEvalParser(
    model="hf",
    user_model=user_model,  # the loaded model instance
    tokenizer=tokenizer,    # its matching tokenizer
    tasks="hellaswag",
    device="cpu",
    batch_size=8,
)
results = evaluate(args)
For evaluating models on coding tasks, or coding LLMs specifically, we follow bigcode-evaluation-harness and provide both command line usage and function call usage. HumanEval, HumanEval+, InstructHumanEval, APPS, MBPP, MBPP+, and DS-1000 are available, in both completion (left-to-right) and insertion (FIM) modes.
There is a small code change in main.py regarding the import path:
- from GenAIEval.evaluation.lm_evaluation_harness import evaluate, setup_parser
+ from GenAIEval.evaluation.bigcode_evaluation_harness import evaluate, setup_parser
python main.py \
--model "codeparrot/codeparrot-small" \
--tasks "humaneval" \
--n_samples 100 \
--batch_size 10 \
--allow_code_execution
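The function call usage again takes a user-loaded model and tokenizer. Assuming Hugging Face transformers and the same checkpoint as the command line example above (an illustrative choice), they can be prepared as before:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the code model and its tokenizer to pass to the evaluator.
user_model = AutoModelForCausalLM.from_pretrained("codeparrot/codeparrot-small")
tokenizer = AutoTokenizer.from_pretrained("codeparrot/codeparrot-small")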
from GenAIEval.evaluation.bigcode_evaluation_harness import evaluate, BigcodeEvalParser

args = BigcodeEvalParser(
    user_model=user_model,  # the loaded code model instance
    tokenizer=tokenizer,    # its matching tokenizer
    tasks="humaneval",
    n_samples=100,
    batch_size=10,
    allow_code_execution=True,
)
results = evaluate(args)
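In both cases, results holds the metrics computed by the underlying harness (for example, pass@k for HumanEval); printing it is a simple way to inspect the per-task scores.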