mateusnobre / oab_1st_phase_brazil_law_exam_RAG

Benchmark Study: Large Language Models in Brazil's Law Exam

This repository accompanies my bachelor's thesis and contains most of the code used to generate the results. It does not include all of the code used for PDF parsing, but it does include every file required to run the benchmark.

Setup Python Virtual Environment

To ensure a consistent development environment, it is recommended to use a Python virtual environment. Follow these steps:

  1. Install virtualenv if you haven't already:

    pip install virtualenv
  2. Create a virtual environment:

    virtualenv venv
  3. Activate the virtual environment:

    • On Windows:
      .\venv\Scripts\activate
    • On Unix or MacOS:
      source venv/bin/activate
  4. Install project dependencies from requirements.txt:

    pip install -r requirements.txt

Now your Python virtual environment is set up.

This benchmark evaluated GPT 4, GPT 3.5, Llama 2 13B, and Llama 2 70B. Experiments were conducted from November 9 to November 12, 2023, using the OpenAI and Replicate APIs.

RAG Hyperparameters

Hyperparameter                          Value
LLM Model Temperature                   0.2
LLM Max Tokens                          50
Text Chunk Size (Number of Chars)       512
Text Chunk Overlap (Number of Chars)    64
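As a rough illustration of how the chunk size and overlap interact, here is a minimal sketch of fixed-size character chunking. The function name `split_into_chunks` is hypothetical, and the repository may rely on a library splitter that behaves differently (for example, by respecting sentence boundaries). Only the values 512 and 64 come from the table.

```python
CHUNK_SIZE = 512     # characters per chunk (value from the hyperparameter table)
CHUNK_OVERLAP = 64   # characters shared by consecutive chunks (value from the table)


def split_into_chunks(text, chunk_size=CHUNK_SIZE, overlap=CHUNK_OVERLAP):
    """Split text into fixed-size character windows that overlap.

    Hypothetical helper for illustration; not taken from the repository.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

With these settings each new chunk starts 448 characters after the previous one, so the last 64 characters of one chunk reappear at the start of the next. The overlap reduces the chance that a relevant passage is cut in half at a chunk boundary.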

Results

How much did OpenAI models score on the 1st Phase of the 37th OAB SP Exam (Bar Exam)?

[Figure: OpenAI model scores on the 1st Phase of the 37th OAB SP Exam]

How much did Llama 2 models score on the 1st Phase of the 37th OAB SP Exam (Bar Exam)?

[Figure: Llama 2 model scores on the 1st Phase of the 37th OAB SP Exam]

How much does the embedding model matter when doing RAG? Using GPT 3.5 and retrieving 5 text chunks

[Figure: effect of the embedding model on RAG results, using GPT 3.5 and retrieving 5 text chunks]
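To make the "retrieving 5 text chunks" step concrete, the sketch below ranks chunks by cosine similarity between a query embedding and each chunk embedding and keeps the top 5. This is an assumption about the retrieval step, not the repository's confirmed method: the function names are hypothetical, the embeddings are plain lists of floats, and the actual code may use a vector store with a different similarity measure.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_chunks(query_vec, chunk_vecs, chunks, k=5):
    """Return the k chunks whose embeddings are most similar to the query.

    Hypothetical helper; k=5 mirrors the setting named in the heading above.
    """
    scored = [
        (cosine_similarity(query_vec, vec), index)
        for index, vec in enumerate(chunk_vecs)
    ]
    # Highest similarity first; ties keep the original chunk order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [chunks[index] for _, index in scored[:k]]
```

Because the ranking depends entirely on the embedding vectors, swapping the embedding model changes which chunks reach the LLM even when every other hyperparameter stays fixed.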

Note on Reproducibility

The results presented here are point estimates and may not be 100% reproducible due to the stochastic nature of Large Language Models (LLMs). This is especially true for commercial LLMs, where the internal workings are not fully transparent. Keep in mind that variations in results might occur even with the same hyperparameters and settings.
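One way to quantify this variation is to repeat the benchmark several times and report the spread alongside the mean. The helper below is a minimal sketch using only the standard library. `summarize_runs` is a hypothetical name, and the repository itself reports single-run point estimates.

```python
import statistics


def summarize_runs(scores):
    """Return (mean, sample standard deviation) for scores from repeated runs.

    Hypothetical helper; `scores` holds one exam score per benchmark run.
    """
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate variability")
    return statistics.mean(scores), statistics.stdev(scores)
```

A small standard deviation relative to the gap between two models suggests the ranking is stable; a large one suggests the point estimates should be compared with caution.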

About

License: MIT License

Language: Jupyter Notebook 100.0%