gh640 / openai-fine-tuning-validate

A simple script to validate datasets for OpenAI fine tuning

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Tests and style check

openai-fine-tuning-validate

A simple script to validate datasets for OpenAI fine tuning.

The validator function was blatantly copied from the OpenAI Cookbook.

Target models

  • gpt-3.5-turbo-0125
  • gpt-3.5-turbo-1106
  • gpt-3.5-turbo-0613

Caution

babbage-002 and davinci-002 use a different format and we cannot validate datasets for them with this script.

Prerequisites

  • Python >=3.12
  • Poetry >=1.8.1

Usage

Checkout the repository.

git clone https://github.com/gh640/openai-fine-tuning-validate

Install dependencies with Poetry.

poetry install

Run openai-fine-tuning-validate command in a venv Poetry manages.

poetry run openai-fine-tuning-validate [dataset-file]

Sample outputs

Valid samples:

poetry run openai-fine-tuning-validate tests/data/dataset-1-simple.jsonl
# => Dataset is valid
poetry run openai-fine-tuning-validate tests/data/dataset-2-multi-turn.jsonl
# => Dataset is valid

Invalid samples:

echo '{}' >> invalid.jsonl
poetry run openai-fine-tuning-validate invalid.jsonl
# => {'missing_messages_list': 1}
echo '{"messages": [{"role": "unknown"}]}' > invalid.jsonl
poetry run openai-fine-tuning-validate invalid.jsonl
# =>
# {'example_missing_assistant_message': 1,
#  'message_missing_key': 1,
#  'missing_content': 1,
#  'unrecognized_role': 1}

Reference

About

A simple script to validate datasets for OpenAI fine tuning


Languages

Language:Python 100.0%