GPT4ALL-collector

GPT4ALL-collector allows you to mass collect the ChatGPT API, allowing for input/output pairs in the millions to finetune your own model for conversational use, allowing the opensourcing of datasets and models alike (unlike what a certain inudstry leader is doing.)

Features

Mass "collect" the ChatGPT API
Obtain input/output pairs in the millions
Open source the datasets to create your amazing models!

Installation

Clone the repository: git clone https://github.com/Yuvanesh-ux/GPT4ALL-collector.git
Install the required dependencies: pip install -r requirements.txt

Usage

This is an example of how you'd use the scraper on a jsonl file from the OIG project for instance - https://huggingface.co/datasets/laion/OIG/tree/main

Navigate to the prompt-scrape directory.
Create your own file with input prompts. Here is an example of an input file that will work out of the box with the scraper.py without modification:

{"prompt": "Can you write me a poem about kenneth fearing, aphrodite and jubal fearing in the style of KENNETH FEARING?", "source": "OIG - unified_poetry_instructions.jsonl"}
{"prompt": "Can you write me a poem about wallace stevens and alfred a. knopf?", "source": "OIG - unified_poetry_instructions.jsonl"}
{"prompt": "Can you write me a poem about time?", "source": "OIG - unified_poetry_instructions.jsonl"}

Other file formats will require either conversion to this format or modifications to scraper.py to accomodate your own custom format.

To create such a file from an OIG jsonl file you can run: python convert_oig_to_scraper_input.py /path/to/OIG/file.jsonl /path/to/output_file.jsonl
If you are running the scraper on an OIG file you should first map the data with atlas to see if the data is of sufficient quality like this: python atlas_mapper.py /path/to/output_file.jsonl where the output file here is what was created in the previous step.

Run python scrape.py -k <OPENAI_API_KEY> /path/to/your/input_file.jsonl /path/to/your/output_file.jsonl
You can also set your OpenAI API keys to OPENAI_API_KEY environment variable.
The script will generate output a JSON file containing the prompt and response pairs, along with the model settings and source. Note that the output files will be appended if the path already exists.
You can modify the num_workers and shard_size parameters in scrape() to change the number of workers and the number of prompts processed per worker, respectively.

Note: You will need a ChatGPT API key(s) to use this tool. You can obtain a key from the OpenAI website.

Contributing

Contributions are welcome! If you encounter any bugs or have suggestions for new features, please open an issue or submit a pull request.

License

This project is licensed under the MIT License. See the LICENSE file for more information.

manyoso / GPT4ALL-collector