ChatGPT 3.5 Fine-tuning Utilities

This project provides a set of utilities to assist users in fine-tuning the ChatGPT 3.5 model with OpenAI. The utilities are wrapped into a single TrainGPT class which allows users to manage the entire fine-tuning lifecycle - from uploading data files, to starting training jobs, monitoring their progress, and managing the trained models.

I was using a collection of curl commands to "interact" with OAI API and it went out of control, so I started to group things together. I work a lot from the interactive Python console to test and "play" with things, so having things grouped up helps. I also plan to release the other collections for dealing with inference for custom models and managing the assests (fiels, embeddings, etc)

Features:

File Upload: Easily upload your fine-tuning data files.
File List: See all your files (Uploaded and results of previous trainings).
File Details: Get file details.
Count tokens: Count tokens with tiktoken library.
Start Training: Begin a new training job using your uploaded data.
List Jobs: View all your current and past training jobs.
Job Details: Retrieve detailed information about a specific training job.
Cancel: Cancel a training job.
Delete: Delete a training job.
List Models: View all your current and past fine-tuned models, filtered per your models and standard models
List Models Summaries: View all your models, grouped per owner.
Model Details: Retrieve detailed information about a specific model.
Delete Model: Delete a fine-tuned model.

PSA

~~The code contains a get_token_count() method that will count the tokens from the training file using tiktoken library.~~ ~~It will use 3 available encoders: "cl100k_base", "p50k_base", "r50k_base" and will show the results for each one.~~

YOU WILL BE CHARGED ABOUT 10 TIMES THAT NUMBER OF TOKENS. So, if you have 100k tokens returned by the get_token_count() method, you will be charged for 1M tokens.

I was wrong here. There is an overhead, but is not alwats 10x For small files (100, 500, 1000, 2000 tokens), trained tokens are 15k+, It seems you can't go bellow 15k tokens, no matter how small is your training file.

For bigger files, the overhead is still there, but lower. For a file with 3 920 281 tokens, trained tokens were 4 245 281, so the overhead is around 6%. For a file with 40 378 413 counted tokens, trained tokens were: 43 720 882.

There is an overhead that will be 10x on very small files, but it gets to bellow 10% on larger files

Here is a quick table with the overhead at different token levles:

Number of tokens in the training file	Number of charged tokens	Overhead
1 426	15 560	1091%
3 920 281	4 245 281	8.29%
40 378 413	43 720 882	8.27%
92 516 393	File exceeds maximum size of 50000000 tokens for fine-tuning
46 860 839	48 688 812	Here they removed some rows as "moderation"
25 870 859	26 903 007	9.61%
41 552 537	43 404 802	9.54%

It seems that there is a limitation to 50 000 000 tokens

Prerequisites:

API Key: Ensure you have set up your OpenAI API key. You can set it as an environment variable named OPENAI_API_KEY.

export OPENAI_API_KEY="your_api_key"

Installation:

Clone the Repository:

git clone https://github.com/your_username/chatgpt-fine-tuning-utilities.git
cd chatgpt-fine-tuning-utilities

Install Dependencies:
```
pip install -r requirements.txt
```

Prepare your data:

Data needs to be in JSONL format:

[
  {
  "messages": [
    { "role": "system", "content": "You are an assistant that occasionally misspells words" },
    { "role": "user", "content": "Tell me a story." },
    { "role": "assistant", "content": "One day a student went to schoool." }
  ]
},
  {
  "messages": [
    { "role": "system", "content": "You are an assistant that occasionally misspells words" },
    { "role": "user", "content": "Tell me a story." },
    { "role": "assistant", "content": "One day a student went to schoool." }
  ]
}
]

Save it as data.jsonl in the root directory of the project.

Detailed Usage:

Python Script Usage:

After setting up, you can utilize the TrainGPT class in your Python scripts as follows:

Initialization:

Start by importing and initializing the TrainGPT class.
```
from train_gpt_utilities import TrainGPT
trainer = TrainGPT()
```
Upload Training Data:

Upload your training data file to start the fine-tuning process.
```
trainer.create_file("path/to/your/training_data.jsonl")
```
Start a Training Job:

Begin the training process using the uploaded file.
```
trainer.start_training()
```

Listing All Jobs:

You can list all your current and past training jobs.

jobs = trainer.list_jobs()

You will get something like this:

trainer.list_jobs()
There are 1 jobs in total.
1 jobs of fine_tuning.job.
1 jobs succeeded.

List of jobs (ordered by creation date):

- Job Type: fine_tuning.job
 ID: ftjob-Sq3nFz3Haqt6fZwqts321iSH
 Model: gpt-3.5-turbo-0613
 Created At: 2023-08-24 04:19:56
 Finished At: 2023-08-24 04:29:55
 Fine Tuned Model: ft:gpt-3.5-turbo-0613:iongpt::7qwGfk6d
 Status: succeeded
 Training File: file-n3kU9Emvvoa8wRrewaafhUv

When the status is "succeeded" you should have your model ready to use. You can jump to step 7 to find the fine tuned model.

If you have multiple jobs in the list, you can use the id to fetch the details of a specific job.

Fetching Job Details:

You can get detailed statistics of a specific training job.
```
job_details = trainer.get_job_details("specific_job_id")
```
If something goes wrong, you can cancel a job using
Cancel a Job:

You can cancel a training job if it is still running.
```
trainer.cancel_job("specific_job_id")
```
Find the fine tuned model: For this we will use the list_models_summaries method.
```
models = trainer.list_models_summaries()
```
You will get something like this:
```
You have access to 61 number of models.
Those models are owned by:
openai: 20 models
openai-dev: 32 models
openai-internal: 4 models
system: 2 models
iongpt: 3 models
```
Then, you can use the owner to fetch the details of models from specific owner. The fine tuned model will be in that list.
```
trainer.list_models_by_owner("iongpt")
```

You will get something like this:

Name: ada:ft-iongpt:url-mapping-2023-04-12-17-05-19
Created: 2023-04-12 17:05:19
Owner: iongpt
Root model: ada:2020-05-03
Parent model: ada:2020-05-03
-----------------------------
Name: ada:ft-iongpt:url-mapping-2023-04-12-18-07-26
Created: 2023-04-12 18:07:26
Owner: iongpt
Root model: ada:2020-05-03
Parent model: ada:ft-iongpt:url-mapping-2023-04-12-17-05-19
-----------------------------
Name: davinci:ft-iongpt:url-mapping-2023-04-12-15-54-23
Created: 2023-04-12 15:54:23
Owner: iongpt
Root model: davinci:2020-05-03
Parent model: davinci:2020-05-03
-----------------------------
Name: ft:gpt-3.5-turbo-0613:iongpt::7qy7qwVC
Created: 2023-08-24 06:28:54
Owner: iongpt
Root model: sahara:2023-04-20
Parent model: sahara:2023-04-20
-----------------------------

Command Line Usage:

This part was not tested yet. Please use the Python script usage for now. Recommended to use from a python interactive shell.

Uploading a File:

python train_gpt_cli.py --create-file /path/to/your/file.jsonl

Starting a Training Job:

python train_gpt_cli.py --start-training

Listing All Jobs:
```
python train_gpt_cli.py --list-jobs
```

For any command that requires a specific job or file ID, you can provide it as an argument. For example:

python train_gpt_cli.py --get-job-details your_job_id

ToDo

Add support for inference on the custom fine tune models
Add suport for embeddings

Contribution:

We welcome contributions to this project. If you find a bug or want to add a feature, feel free to open an issue or submit a pull request.

License:

This project is licensed under the MIT License. See LICENSE for more details.

iongpt / ChatGPT-fine-tuning