openai / openai-python

The official Python library for the OpenAI API

Home Page: https://pypi.org/project/openai/

Fine-tuned classifier giving recommendations which are not from any of the specified categories used to train the model

mrmrinal opened this issue

Hello!

I made a fine-tuned classifier with 7 categories, similar to the one shown in examples/finetuning/finetuning-classification.ipynb, and used the following command to create the classifier.

openai api fine_tunes.create -t "data_prepared_train.jsonl" -v "data_prepared_valid.jsonl" --compute_classification_metrics --classification_n_classes 7 -m ada

I use the following call in Python when querying the classifier:

res = openai.Completion.create(model=ft_model, prompt=prompt + "\n\n###\n\n", max_tokens=1, temperature=0, logprobs=5)

However, when querying a completion from the classifier, the returned top_logprobs contain tokens that do not fall under any of the predefined categories. I have tried this with both numerical and string-based categories, and the same behavior is exhibited. I was wondering whether there is any way to suppress these non-predefined categories in the logprobs, since they do not aid with classification in my use case; just having the probabilities for the predefined categories would suffice. Additionally, there is also the concern that the fine-tuned model could classify the prompt into one of the undefined categories.

Here are a few examples of such occurrences:

Numerical categories
The predefined categories are the integers 0-6 inclusive.

<OpenAIObject at 0x174ab8f9450> JSON: {
  "text_offset": [
    57
  ],
  "token_logprobs": [
    -0.08814435
  ],
  "tokens": [
    " 1"
  ],
  "top_logprobs": [
    {
      " 1": -0.08814435,
      " 2": -8.10889,
      " 5": -7.6156697,
      " 6": -2.4937658,
      " 7": -9.455983
    }
  ]
}

Here it suggests the value " 7", which was never specified in the dataset.

<OpenAIObject at 0x174ab8cf360> JSON: {
  "text_offset": [
    28
  ],
  "token_logprobs": [
    -0.028152594
  ],
  "tokens": [
    " 5"
  ],
  "top_logprobs": [
    {
      " 0": -8.047939,
      " 4": -6.4044333,
      " 5": -0.028152594,
      " 6": -3.8101103,
      " very": -8.138523
    }
  ]
}

Here it suggests the value " very", which is not an integer.

<OpenAIObject at 0x174ab8cac70> JSON: {
  "text_offset": [
    60
  ],
  "token_logprobs": [
    -0.07207727
  ],
  "tokens": [
    " 2"
  ],
  "top_logprobs": [
    {
      " 2": -0.07207727,
      " 3": -6.8346596,
      " 4": -6.0346746,
      " 6": -2.7322218,
      "2": -8.06412
    }
  ]
}

Here it suggests "2" (no leading space), which is different from " 2".

String categories
The categories are ["background", "bug", "feature-request" (shows up as " feature"), "follow-up", "null", "painpoint" (shows up as " pain"), "usability"].

<OpenAIObject at 0x174ab91d630> JSON: {
  "text_offset": [
    60
  ],
  "token_logprobs": [
    -0.2158865
  ],
  "tokens": [
    " feature"
  ],
  "top_logprobs": [
    {
      " UX": -8.75913,
      " feature": -0.2158865,
      " features": -7.479752,
      " functionality": -9.452176,
      " usability": -1.6482441
    }
  ]
}

Here it returns other options such as " UX", " features", and " functionality" over other predefined categories.

<OpenAIObject at 0x174ab8ed900> JSON: {
  "text_offset": [
    52
  ],
  "token_logprobs": [
    -0.001725924
  ],
  "tokens": [
    " pain"
  ],
  "top_logprobs": [
    {
      " Pain": -6.667007,
      " bug": -9.389308,
      " pain": -0.001725924,
      " painful": -8.79043,
      "pain": -9.397688
    }
  ]
}

Here it returns several different variants of "pain".

I would really appreciate any advice or support on this, as completions that are not part of the predefined categories are unexpected.

Hi @mrmrinal, thanks for writing in with your issue.

I totally agree this is an issue, though I don't think it's as much an issue with this Python library as it is with our API. Can you send this issue to support@openai.com instead? They'll be happy to help you there.

Hi @mrmrinal, since GPT-3 is a generative model, it doesn't guarantee that the top_logprobs keys will be among your predefined categories, unlike conventional classifiers. If the GPT-3 Ada model hasn't learned the task well, it comes up with other tokens (for example, it may say "yeah" instead of "yes").

To solve this issue, I suggest increasing the number of samples in your training data (as mentioned in the documentation, it's suggested to have at least 100 samples per category) and training for more epochs (at least 3-4 epochs instead of 1-2) to increase the chance of getting predictions among the predefined categories; see the command sketched below.
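
For reference, re-running your earlier create command with an explicit epoch count might look like this (a sketch; it assumes the legacy CLI's --n_epochs flag, which sets the number of training epochs):

openai api fine_tunes.create -t "data_prepared_train.jsonl" -v "data_prepared_valid.jsonl" --compute_classification_metrics --classification_n_classes 7 -m ada --n_epochs 4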

In addition, if you're going to use string categories, prefer single-token labels, because max_tokens=1 in the prediction calls; "painpoint" and "feature-request" are 2 or 3 tokens each in your case. A quick way to check label lengths is sketched below.
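
For example, you can count how many tokens each label takes with something like the following (a sketch; it assumes the tiktoken package and the r50k_base vocabulary used by the base GPT-3 models, and is illustrative rather than part of this library):

import tiktoken

enc = tiktoken.get_encoding("r50k_base")  # tokenizer family used by the base GPT-3 models

labels = ["background", "bug", "feature-request", "follow-up", "null", "painpoint", "usability"]
for label in labels:
    # Completions follow the "\n\n###\n\n" separator, so the label token
    # normally carries a leading space -- encode it the same way.
    tokens = enc.encode(" " + label)
    print(f"{label!r}: {len(tokens)} token(s) -> {tokens}")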

Thank you @hallacy and @zafercavdar for your replies! I actually have more than 100 samples per category and have left the epochs at the default, which according to the documentation is 4 epochs, yet I am still getting these anomalous results. I have been using the numerical categories in this case. I believe it might be better to just look out for the 7 categories specified; a sketch of that filtering is below.
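
For anyone landing here, a minimal sketch of that filtering idea follows. It assumes the pre-1.0 openai.Completion API used above; the CATEGORIES list and classify helper are illustrative names, not part of the library:

import openai

# Tokens exactly as the model emits them (note the leading space).
CATEGORIES = [" 0", " 1", " 2", " 3", " 4", " 5", " 6"]

def classify(ft_model, prompt):
    res = openai.Completion.create(
        model=ft_model,
        prompt=prompt + "\n\n###\n\n",
        max_tokens=1,
        temperature=0,
        logprobs=5,
    )
    top = res["choices"][0]["logprobs"]["top_logprobs"][0]
    # Keep only the predefined category tokens and pick the most likely one.
    allowed = {tok: lp for tok, lp in top.items() if tok in CATEGORIES}
    if not allowed:
        return None  # none of the returned top tokens is a known category
    return max(allowed, key=allowed.get).strip()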

@mrmrinal I'm gonna go ahead and close out this issue since I think your idea of looking out for the categories seems like the right one. Feel free to reopen this issue if you have any concerns.