Benchmark Study: Large Language Models in Brazil's Law Exam

This repository accompanies my bachelor's thesis and contains most of the code used to generate its results. It does not include all of the code used for PDF parsing, but it does include every file required to run the benchmark.

Setup Python Virtual Environment

To ensure a consistent development environment, it is recommended to use a Python virtual environment. Follow these steps:

Install virtualenv if you haven't already:
```
pip install virtualenv
```
Create a virtual environment:
```
virtualenv venv
```
Activate the virtual environment:
- On Windows:
```
.\venv\Scripts\activate
```
- On Unix or macOS:
```
source venv/bin/activate
```
Install project dependencies from requirements.txt:
```
pip install -r requirements.txt
```

Now your Python virtual environment is set up.

This benchmark used GPT-4, GPT-3.5, Llama 2 13B, and Llama 2 70B. Experiments were conducted from November 9 to November 12, 2023, using the OpenAI and Replicate APIs.

RAG Hyperparameters

| Hyperparameter | Value |
| --- | --- |
| LLM Model Temperature | 0.2 |
| LLM Max Tokens | 50 |
| Text Chunk Size (Number of Chars) | 512 |
| Text Chunk Overlap (Number of Chars) | 64 |
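
To illustrate what the chunk size and overlap values mean, here is a minimal sketch of character-based chunking. It is not necessarily the splitter used in this repository: the function name `chunk_text` and the fixed sliding-window strategy are assumptions, and only the values 512 and 64 come from the table above.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 512, chunk_overlap: int = 64) -> List[str]:
    """Split text into character windows of chunk_size that overlap by chunk_overlap.

    Hypothetical helper for illustration; defaults mirror the table above.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= chunk_overlap < chunk_size")

    # Each window starts this many characters after the previous one.
    step = chunk_size - chunk_overlap
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            # This window already reaches the end of the text.
            break
    return chunks
```

With these defaults, consecutive chunks share 64 characters, so a sentence cut at a chunk boundary still appears whole in at least one chunk when it is short enough.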

Results

How well did the OpenAI models score on the 1st Phase of the 37th OAB SP Exam (Bar Exam)?

How well did the Llama 2 models score on the 1st Phase of the 37th OAB SP Exam (Bar Exam)?

How much does the embedding model matter when doing RAG? (Using GPT-3.5 and retrieving 5 text chunks.)
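
For context on what retrieving 5 text chunks involves, here is a minimal sketch of top-k retrieval by cosine similarity over precomputed embedding vectors. The function names, the choice of cosine similarity, and the plain-list vectors are assumptions for illustration; the repository may use a different vector store or similarity metric.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_chunks(
    query_vec: Sequence[float],
    chunk_vecs: Sequence[Sequence[float]],
    k: int = 5,
) -> List[int]:
    """Return the indices of the k chunks most similar to the query, best first.

    Hypothetical helper: k=5 mirrors the setting named in the heading above.
    """
    scores = [cosine_similarity(query_vec, vec) for vec in chunk_vecs]
    ranked = sorted(range(len(chunk_vecs)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]
```

The embedding model determines the vectors passed in, so swapping it can change which chunks rank highest even when the retrieval code stays the same.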

Note on Reproducibility

The results presented here are point estimates and may not be exactly reproducible, because the outputs of Large Language Models (LLMs) are stochastic. This is especially true for commercial LLMs, whose internal workings are not fully transparent. Results may vary even with identical hyperparameters and settings.
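
One way to quantify this variability is to repeat the benchmark several times and report the spread of accuracy across runs. The sketch below is a hypothetical helper, not part of this repository: `accuracy` and `summarize_runs` are illustrative names, and answers are assumed to be single option letters.

```python
import statistics
from typing import Dict, Sequence


def accuracy(predictions: Sequence[str], answer_key: Sequence[str]) -> float:
    """Fraction of questions where the predicted letter matches the answer key."""
    if len(predictions) != len(answer_key):
        raise ValueError("predictions and answer_key must have the same length")
    if not answer_key:
        raise ValueError("answer_key must not be empty")
    # Compare case-insensitively and ignore surrounding whitespace.
    correct = sum(
        p.strip().upper() == a.strip().upper()
        for p, a in zip(predictions, answer_key)
    )
    return correct / len(answer_key)


def summarize_runs(run_accuracies: Sequence[float]) -> Dict[str, float]:
    """Summarize accuracy over repeated runs of the same configuration."""
    if not run_accuracies:
        raise ValueError("run_accuracies must not be empty")
    # The sample standard deviation needs at least two runs.
    stdev = statistics.stdev(run_accuracies) if len(run_accuracies) > 1 else 0.0
    return {
        "runs": float(len(run_accuracies)),
        "mean": statistics.mean(run_accuracies),
        "stdev": stdev,
        "min": min(run_accuracies),
        "max": max(run_accuracies),
    }
```

Reporting the mean together with the standard deviation, minimum, and maximum makes it easier to judge whether a difference between two models exceeds run-to-run noise.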

mateusnobre / oab_1st_phase_brazil_law_exam_RAG