
FRAG - Framework for Retrieval Augmented Generation Evaluation and Benchmarking

Introducing FRAG, a framework for Retrieval Augmented Generation evaluation and benchmarking.

Common LLM evaluation suites are built around tasks used as a proxy for intelligence, or around style. Task-based examples include Grade School Math (GSM8k) and Massive Multitask Language Understanding (MMLU, k=5); for style, LMSys uses an LLM as a proxy for human evaluation. While these may approximate an LLM's intelligence, they do not evaluate an LLM's capabilities for production use cases.

For production use, we instead want to evaluate LLMs on the capabilities that LLM applications actually depend on: hallucination, context utilization, instruction following, and tool usage.

Benchmarking LLM API Endpoints

We benchmark online serving throughput for LLM API endpoints using httpx.
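
Below is a minimal sketch of such a throughput benchmark: httpx's async client fires concurrent requests at an OpenAI-compatible chat completions endpoint and reports requests per second. The endpoint URL, model name, payload, and concurrency values are illustrative assumptions, not FRAG's actual configuration.

```python
# Minimal sketch of benchmarking an OpenAI-compatible chat endpoint with httpx.
# The endpoint URL, model name, and payload are placeholders for illustration.
import asyncio
import time

import httpx

API_URL = "http://localhost:8000/v1/chat/completions"  # placeholder endpoint
PAYLOAD = {
    "model": "placeholder-model",
    "messages": [{"role": "user", "content": "Say hello."}],
    "max_tokens": 64,
}


async def one_request(client: httpx.AsyncClient) -> float:
    """Send one chat completion request and return its latency in seconds."""
    start = time.perf_counter()
    response = await client.post(API_URL, json=PAYLOAD, timeout=60.0)
    response.raise_for_status()
    return time.perf_counter() - start


async def benchmark(concurrency: int = 8, total_requests: int = 64) -> None:
    """Run total_requests requests with bounded concurrency and report throughput."""
    async with httpx.AsyncClient() as client:
        semaphore = asyncio.Semaphore(concurrency)

        async def bounded() -> float:
            async with semaphore:
                return await one_request(client)

        start = time.perf_counter()
        latencies = await asyncio.gather(*(bounded() for _ in range(total_requests)))
        elapsed = time.perf_counter() - start

    print(f"throughput: {total_requests / elapsed:.2f} req/s")
    print(f"mean latency: {sum(latencies) / len(latencies):.3f} s")


if __name__ == "__main__":
    asyncio.run(benchmark())
```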

Evaluating Factuality

FRAG evaluates the following capabilities; a hypothetical scoring sketch follows the list.

  1. Context utilization
  2. Hallucination
  3. Instruction following
  4. Table understanding
  5. Tool usage
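
As an illustration of how one of these dimensions (context utilization / hallucination) could be scored, the sketch below compares a model's answer against a reference answer grounded in the retrieved context. The EvalCase structure, the naive substring check, and the function names are assumptions made for illustration, not FRAG's actual API.

```python
# Hypothetical sketch: score how often a model's answer matches a reference
# answer that is grounded in the retrieved context. The data model and the
# naive substring check are illustrative assumptions, not FRAG's API.
from dataclasses import dataclass


@dataclass
class EvalCase:
    question: str      # question posed to the model
    context: str       # retrieved passage the model was given
    reference: str     # expected answer grounded in the context
    model_answer: str  # answer produced by the LLM under test


def grounded_in_context(case: EvalCase) -> bool:
    """Naive check: the grounded reference answer appears in the model answer."""
    return case.reference.lower() in case.model_answer.lower()


def score(cases: list[EvalCase]) -> float:
    """Fraction of cases whose answers match the grounded reference."""
    if not cases:
        return 0.0
    return sum(grounded_in_context(c) for c in cases) / len(cases)


if __name__ == "__main__":
    cases = [
        EvalCase(
            question="What is the capital of France?",
            context="Paris is the capital and largest city of France.",
            reference="Paris",
            model_answer="The capital of France is Paris.",
        )
    ]
    print(f"context-utilization score: {score(cases):.2f}")
```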

References

About

License: Apache License 2.0

