iongpt / ChatGPT-fine-tuning

Utility class for training fine tuning ChatGPT

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool


ChatGPT 3.5 Fine-tuning Utilities

This project provides a set of utilities to assist users in fine-tuning the ChatGPT 3.5 model with OpenAI. The utilities are wrapped into a single TrainGPT class which allows users to manage the entire fine-tuning lifecycle - from uploading data files, to starting training jobs, monitoring their progress, and managing the trained models.

I was using a collection of curl commands to "interact" with OAI API and it went out of control, so I started to group things together. I work a lot from the interactive Python console to test and "play" with things, so having things grouped up helps. I also plan to release the other collections for dealing with inference for custom models and managing the assests (fiels, embeddings, etc)

Features:

  • File Upload: Easily upload your fine-tuning data files.
  • File List: See all your files (Uploaded and results of previous trainings).
  • File Details: Get file details.
  • Count tokens: Count tokens with tiktoken library.
  • Start Training: Begin a new training job using your uploaded data.
  • List Jobs: View all your current and past training jobs.
  • Job Details: Retrieve detailed information about a specific training job.
  • Cancel: Cancel a training job.
  • Delete: Delete a training job.
  • List Models: View all your current and past fine-tuned models, filtered per your models and standard models
  • List Models Summaries: View all your models, grouped per owner.
  • Model Details: Retrieve detailed information about a specific model.
  • Delete Model: Delete a fine-tuned model.

PSA

The code contains a get_token_count() method that will count the tokens from the training file using tiktoken library. It will use 3 available encoders: "cl100k_base", "p50k_base", "r50k_base" and will show the results for each one.

YOU WILL BE CHARGED ABOUT 10 TIMES THAT NUMBER OF TOKENS. So, if you have 100k tokens returned by the get_token_count() method, you will be charged for 1M tokens.

I was wrong here. There is an overhead, but is not alwats 10x For small files (100, 500, 1000, 2000 tokens), trained tokens are 15k+, It seems you can't go bellow 15k tokens, no matter how small is your training file.

For bigger files, the overhead is still there, but lower. For a file with 3 920 281 tokens, trained tokens were 4 245 281, so the overhead is around 6%. For a file with 40 378 413 counted tokens, trained tokens were: 43 720 882.

There is an overhead that will be 10x on very small files, but it gets to bellow 10% on larger files

Here is a quick table with the overhead at different token levles:

Number of tokens in the training file Number of charged tokens Overhead
1 426 15 560 1091%
3 920 281 4 245 281 8.29%
40 378 413 43 720 882 8.27%
92 516 393 File exceeds maximum size of 50000000 tokens for fine-tuning
46 860 839 48 688 812 Here they removed some rows as "moderation"
25 870 859 26 903 007 9.61%
41 552 537 43 404 802 9.54%

It seems that there is a limitation to 50 000 000 tokens

Prerequisites:

  • API Key: Ensure you have set up your OpenAI API key. You can set it as an environment variable named OPENAI_API_KEY.
export OPENAI_API_KEY="your_api_key"

Installation:

  1. Clone the Repository:

    git clone https://github.com/your_username/chatgpt-fine-tuning-utilities.git
    cd chatgpt-fine-tuning-utilities
  2. Install Dependencies:

    pip install -r requirements.txt

Prepare your data:

Data needs to be in JSONL format:

[
  {
  "messages": [
    { "role": "system", "content": "You are an assistant that occasionally misspells words" },
    { "role": "user", "content": "Tell me a story." },
    { "role": "assistant", "content": "One day a student went to schoool." }
  ]
},
  {
  "messages": [
    { "role": "system", "content": "You are an assistant that occasionally misspells words" },
    { "role": "user", "content": "Tell me a story." },
    { "role": "assistant", "content": "One day a student went to schoool." }
  ]
}
]

Save it as data.jsonl in the root directory of the project.

Detailed Usage:

Python Script Usage:

After setting up, you can utilize the TrainGPT class in your Python scripts as follows:

  1. Initialization:

    Start by importing and initializing the TrainGPT class.

    from train_gpt_utilities import TrainGPT
    trainer = TrainGPT()
  2. Upload Training Data:

    Upload your training data file to start the fine-tuning process.

    trainer.create_file("path/to/your/training_data.jsonl")
  3. Start a Training Job:

    Begin the training process using the uploaded file.

    trainer.start_training()
  4. Listing All Jobs:

    You can list all your current and past training jobs.

    jobs = trainer.list_jobs()

    You will get something like this:

    trainer.list_jobs()
    There are 1 jobs in total.
    1 jobs of fine_tuning.job.
    1 jobs succeeded.
    
    List of jobs (ordered by creation date):
    
    - Job Type: fine_tuning.job
     ID: ftjob-Sq3nFz3Haqt6fZwqts321iSH
     Model: gpt-3.5-turbo-0613
     Created At: 2023-08-24 04:19:56
     Finished At: 2023-08-24 04:29:55
     Fine Tuned Model: ft:gpt-3.5-turbo-0613:iongpt::7qwGfk6d
     Status: succeeded
     Training File: file-n3kU9Emvvoa8wRrewaafhUv

    When the status is "succeeded" you should have your model ready to use. You can jump to step 7 to find the fine tuned model.

    If you have multiple jobs in the list, you can use the id to fetch the details of a specific job.

  5. Fetching Job Details:

    You can get detailed statistics of a specific training job.

    job_details = trainer.get_job_details("specific_job_id")

    If something goes wrong, you can cancel a job using

  6. Cancel a Job:

    You can cancel a training job if it is still running.

    trainer.cancel_job("specific_job_id")
  7. Find the fine tuned model: For this we will use the list_models_summaries method.

    models = trainer.list_models_summaries()

    You will get something like this:

    You have access to 61 number of models.
    Those models are owned by:
    openai: 20 models
    openai-dev: 32 models
    openai-internal: 4 models
    system: 2 models
    iongpt: 3 models

    Then, you can use the owner to fetch the details of models from specific owner. The fine tuned model will be in that list.

  8. trainer.list_models_by_owner("iongpt")

You will get something like this:

Name: ada:ft-iongpt:url-mapping-2023-04-12-17-05-19
Created: 2023-04-12 17:05:19
Owner: iongpt
Root model: ada:2020-05-03
Parent model: ada:2020-05-03
-----------------------------
Name: ada:ft-iongpt:url-mapping-2023-04-12-18-07-26
Created: 2023-04-12 18:07:26
Owner: iongpt
Root model: ada:2020-05-03
Parent model: ada:ft-iongpt:url-mapping-2023-04-12-17-05-19
-----------------------------
Name: davinci:ft-iongpt:url-mapping-2023-04-12-15-54-23
Created: 2023-04-12 15:54:23
Owner: iongpt
Root model: davinci:2020-05-03
Parent model: davinci:2020-05-03
-----------------------------
Name: ft:gpt-3.5-turbo-0613:iongpt::7qy7qwVC
Created: 2023-08-24 06:28:54
Owner: iongpt
Root model: sahara:2023-04-20
Parent model: sahara:2023-04-20
-----------------------------

Command Line Usage:

This part was not tested yet. Please use the Python script usage for now. Recommended to use from a python interactive shell.

  1. Uploading a File:

    python train_gpt_cli.py --create-file /path/to/your/file.jsonl
  2. Starting a Training Job:

    python train_gpt_cli.py --start-training
  3. Listing All Jobs:

    python train_gpt_cli.py --list-jobs

For any command that requires a specific job or file ID, you can provide it as an argument. For example:

python train_gpt_cli.py --get-job-details your_job_id

ToDo

  1. Add support for inference on the custom fine tune models
  2. Add suport for embeddings

Contribution:

We welcome contributions to this project. If you find a bug or want to add a feature, feel free to open an issue or submit a pull request.

License:

This project is licensed under the MIT License. See LICENSE for more details.

About

Utility class for training fine tuning ChatGPT

License:Apache License 2.0


Languages

Language:Python 100.0%