openai / openai-python

The official Python library for the OpenAI API

Home Page: https://pypi.org/project/openai/


[FEATURE REQUEST] Add stratification on train/validation split with fine_tune.prepare_data

fratambot opened this issue · comments

Hello,
first of all, many thanks for this great library! 🙏

When I prepare data for multiclass classification fine-tuning and accept the suggested split into train and validation data, the two resulting datasets end up with a different number of classes than the number I specified.
Error message:

[2022-03-25 10:12:57] Fine-tune failed. Errors:
The number of classes in file-LSGG6mb4lhNMqyAxN6dA63sc does not match the number of classes specified in the hyperparameters.
The number of classes in file-tRE2P9nw9pq2NtM4qpKgceI2 does not match the number of classes specified in the hyperparameters.

This looks to me like a problem with stratification during the split. Do you think it would be possible to include this option in the future? I know it's not an easy task: when you don't have many examples, you have to adjust test_size by hand until both splits contain the same number of classes. It could be automated, though, by progressively increasing test_size until train_dataset.nunique() == test_dataset.nunique()
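For illustration, a stratified split along these lines could look like the sketch below. It is not part of the openai library: the function name `stratified_split`, its `label_key` parameter, and the assumption that each example is a dict whose "completion" field holds the class label are all hypothetical. Instead of growing test_size until the class counts match, it splits each class separately, so every class lands in both datasets.

```python
import random
from collections import defaultdict


def stratified_split(examples, label_key="completion", test_size=0.2, seed=0):
    """Hypothetical helper: split so every class appears in both train and validation."""
    rng = random.Random(seed)

    # Group the examples by class label.
    by_class = defaultdict(list)
    for example in examples:
        by_class[example[label_key]].append(example)

    train, valid = [], []
    for label, group in by_class.items():
        if len(group) < 2:
            raise ValueError(
                f"class {label!r} needs at least 2 examples to appear in both splits"
            )
        rng.shuffle(group)
        # Send roughly test_size of the class to validation, but always keep
        # at least one example of the class on each side of the split.
        n_valid = min(max(1, round(len(group) * test_size)), len(group) - 1)
        valid.extend(group[:n_valid])
        train.extend(group[n_valid:])

    rng.shuffle(train)
    rng.shuffle(valid)
    return train, valid


# Made-up, imbalanced toy data: 10 examples of " A", 5 of " B", 2 of " C".
examples = (
    [{"prompt": f"a{i}", "completion": " A"} for i in range(10)]
    + [{"prompt": f"b{i}", "completion": " B"} for i in range(5)]
    + [{"prompt": f"c{i}", "completion": " C"} for i in range(2)]
)
train, valid = stratified_split(examples, test_size=0.2)
train_classes = {e["completion"] for e in train}
valid_classes = {e["completion"] for e in valid}
```

With this approach the effective validation fraction can drift above test_size for very small classes, since each class contributes at least one validation example. A class with a single example cannot be placed in both datasets, so the sketch raises an error rather than silently dropping it.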

Thanks for writing in, @fratambot! I'll pass this along to the fine-tuning team.

This does not look like a bug in the SDK, so I'm going to go ahead and close this issue. If it's still relevant, I encourage you to repost at community.openai.com.