Create JSONL to fine tune OpenAI's Chat Model

This application is built using Streamlit and helps users create a properly formatted Jsonl file. This file format is needed to fine-tune an OpenAi chat model. This application is meant to be used locally please see our other version for a remote use.

data = {
    "messages": [
        {"role": "system", "content": system_message},
        {"role": "user", "content": prompt_text},
        {"role": "assistant", "content": ideal_generated_text}
    ]
}

Features

User Prompts: Enter a question under the label "Enter your question? Human:".
AI Response: Provide your ideal AI-generated response.
Custom System Message: Add a custom system message or stick with the default message "You are a helpful and friendly assistant.".
Data Saving: Upon pressing the "Accept Inputs" button, the provided data gets formatted and appended to an output.jsonl file.
TRAINING_FILE_ID Input: Users can input their TRAINING_FILE_ID required for fine-tuning.
Fine-Tuning: A button to send the output.jsonl file to OpenAI for fine-tuning.
Chat Window: Test the fine-tuned model by sending messages and viewing the model's response.

Setup & Run

Clone the repository using:

git clone https://github.com/raymondbernard/finetuneopenai.git

Navigate to the repository directory:
```
cd finetuneopenai
```
Install the required packages using:
```
pip install -r requirements.txt
```
Run the Streamlit app using:
```
streamlit run app.py
```

Dependencies

streamlit
jsonlines
tiktoken
numpy
requests
python-dotenv
streamlit_extras

Feedback & Contributions

Feel free to raise issues, provide feedback, or make contributions to improve the application.

License

This project is licensed under the MIT License. See LICENSE for more information.

openaicheck.py *(written by OpenAI)

openaicheck.py is a script designed to inspect and validate the structure of a dataset for chat completions. It performs the following operations:

Data Inspection: The script initially loads the dataset from output.jsonl and prints the number of examples and the first example to provide an overview.
Format Error Checks: The script checks for various formatting issues such as:
- Incorrect data types
- Missing message lists
- Unrecognized message keys
- Missing content
- Unrecognized roles in messages
- Absence of an assistant's message
Token Count: It calculates the number of tokens for each message and provides distribution statistics such as:
- Range (Min and Max)
- Average (Mean)
- Middle Value (Median)
- 5th Percentile
- 95th Percentile

Understanding the Output

Number of Messages per Example Distribution: Provides statistics about the number of messages in each example.
Total Tokens per Example Distribution: Indicates the total number of tokens in each example.
Assistant Tokens per Example Distribution: Pertains to the number of tokens in the assistant's messages within each example. For each distribution, the following statistics are provided:
Range: The smallest and largest values.
Average (Mean): The average value.
Middle Value (Median): The middle value when sorted.
5th Percentile: 5% of the data lies below this value.
95th Percentile: 95% of the data lies below this value.

OpenAI Blog: GPT-3.5 Turbo, Fine-Tuning, and API Updates

OpenAI Fine-Tuning guide

Authors

MIT License

This is a simple guide to get started with the fine-tune OpenAI Chat model. Please, do not hesitate to open an issue if you encounter any problem or have a suggestion.

xAIdrian / finetuneopenai