Brandon82 / llm-dataset-gen

Using LLMs (OpenAI API) to generate and add data to datasets

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

llm-dataset-gen

Provides a LLMDataset class for generating and adding data to .csv datasets using LLMs (OpenAI API)

Installation

Install the following packages: pip install openai==1.3.5 pandas==2.1.3 python-dotenv==1.0.0

Usage

1. Create a .env file in the root directory of the project and add your OpenAI API key to it:

OPENAI_API_KEY=<your-openai-api-key>

2. Create an empty dataset file using the create_dataset.py script

You can skip this step if you already have a dataset file

3. Create an instance of the LLMDataset class and provide a dataset_path:

from llm_dataset_gen import LLMDataset
data_filepath = "./data/Dataset.csv"
dataset = LLMDataset(dataset_path=data_filepath)

4. Call the add_data method by providing the context and num_samples parameters:

dataset_context="For Context, this dataset represents requirements engineering excerpts and their corresponding Language Construct (LC) and Language Quality (LQ) codings"
dataset.add_data(context=dataset_context, num_samples=20)
  • The add_data method will automatically overwrite/save the dataset file after appending the new data
  • The context parameter is the prompt that will be used to generate the data
  • The num_samples parameter is the number of data samples to generate and add to the dataset

How It Works

The LLMDataset class is designed to manage a dataset and interact with the OpenAI API to generate new data entries. By using the JSON Mode of the OpenAI API and the gpt-4-1106-preview or gpt-3.5-turbo-1106 model, it can generate new data entries (as JSON Objects) that match the structure of a given dataset, and easily append them to the dataset.

When calling the API, two messages are sent to the model: a dataset_description, and a context

  • The dataset_description is automatically generated by the LLMDataset class and describes the column names in the dataset, the number of data entries to generate, and how to format the data entries. This ensures that the generated data is consistent with the structure of the dataset.
  • The context is the prompt that is used to describe the data entries. This is provided by the user as a parameter in the add_data method.
  • If the dataset contains an ID column, the LLMDataset will ignore the LLM's generated ID and instead use the next available ID in the dataset.

About

Using LLMs (OpenAI API) to generate and add data to datasets

License:MIT License


Languages

Language:Python 100.0%