younesbelkada / DataDreamer

Prompt. Generate Synthetic Data. Train & Align Models. 🤖💤

Home Page:https://datadreamer.dev

DataDreamer

DataDreamer is a powerful open-source Python library for prompting, synthetic data generation, and training workflows. It is designed to be simple, extremely efficient, and research-grade.

Installation

pip3 install datadreamer.dev

Demo: demo.py and the result of running demo.py

See the full demo script

See the synthetic dataset and the trained model

🚀 For more demonstrations and recipes see the Quick Tour page.

With DataDreamer you can:

  • 💬 Create Prompting Workflows: Create and run multi-step, complex, prompting workflows easily with major open source or API-based LLMs.
  • 📊 Generate Synthetic Datasets: Generate synthetic datasets for novel tasks or augment existing datasets with LLMs.
  • ⚙️ Train Models: Align models. Fine-tune models. Instruction-tune models. Distill models. Train on existing data or synthetic data.
  • ... learn more about what's possible in the Overview Guide
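
To make the idea of a multi-step prompting workflow concrete, here is a minimal, library-free sketch. It does not use DataDreamer's API: `fake_llm`, `run_step`, and the prompt templates are hypothetical stand-ins chosen for illustration, and a real workflow would replace `fake_llm` with a call to an open source or API-based LLM.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call: returns the prompt in upper case."""
    return prompt.upper()


def run_step(rows: List[Row], template: str, output_key: str,
             llm: Callable[[str], str]) -> List[Row]:
    """Fill the template from each row, call the LLM, and store the
    response under output_key in a copy of the row."""
    return [{**row, output_key: llm(template.format(**row))} for row in rows]


# Step 1: generate a synthetic "abstract" for each topic.
topics = [{"topic": "caching"}, {"topic": "distillation"}]
abstracts = run_step(topics, "Write an abstract about {topic}.", "abstract", fake_llm)

# Step 2: feed the output of step 1 into a second prompt.
dataset = run_step(abstracts, "Summarize: {abstract}", "summary", fake_llm)

for row in dataset:
    print(row["topic"], "->", row["summary"])
```

Each step takes the rows produced by the previous one, so a longer workflow is a chain of `run_step` calls, and the final list of rows is the synthetic dataset.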

DataDreamer is:

  • 🧩 Simple: Simple and approachable to use with sensible defaults, yet powerful with support for bleeding edge techniques.
  • 🔬 Research-Grade: Built for researchers, by researchers, but accessible to all. A focus on correctness, best practices, and reproducibility.
  • 🏎️ Efficient: Aggressive caching and resumability built-in. Support for techniques like quantization, parameter-efficient training (LoRA), and more.
  • 🔄 Reproducible: Workflows built with DataDreamer are easily shareable, reproducible, and extendable.
  • 🤝 Makes Sharing Easy: Publishing datasets and models is simple. Automatically generate data cards and model cards with metadata. Generate a list of any citations required.
  • ... learn more about the motivation and design principles behind DataDreamer.
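
The caching and resumability mentioned above can be illustrated with a small sketch. This is not DataDreamer's implementation. It assumes a hypothetical `cached_step` helper that stores each step's result as JSON in an output folder, keyed by the step name and a hash of its arguments, so that rerunning a script skips steps that have already finished.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Any, Callable, Dict, List


def cached_step(output_dir: Path, name: str, args: Dict[str, Any],
                fn: Callable[..., Any]) -> Any:
    """Run fn(**args) once; later calls with the same name and args
    load the saved result from disk instead of recomputing it."""
    key = hashlib.sha256(json.dumps(args, sort_keys=True).encode()).hexdigest()[:16]
    path = output_dir / f"{name}-{key}.json"
    if path.exists():
        return json.loads(path.read_text())
    result = fn(**args)
    path.write_text(json.dumps(result))
    return result


calls: List[int] = []  # records every real (non-cached) execution


def generate(n: int, prefix: str) -> List[str]:
    """Hypothetical expensive step, e.g. a batch of LLM generations."""
    calls.append(n)
    return [f"{prefix}-{i}" for i in range(n)]


with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp)
    first = cached_step(out, "generate", {"n": 3, "prefix": "row"}, generate)
    second = cached_step(out, "generate", {"n": 3, "prefix": "row"}, generate)  # cache hit
    third = cached_step(out, "generate", {"n": 2, "prefix": "row"}, generate)   # new args, reruns
```

Hashing the arguments means that changing any argument invalidates only that step's cache entry, while unchanged steps are still loaded from disk.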

Citation

coming soon...

Contact

Please reach out to us via email (ajayp@upenn.edu) or on Discord if you have any questions, comments, or feedback.



Copyright © 2024, Ajay Patel. Released under the MIT License.

Thank you to the maintainers at Hugging Face and LiteLLM for accepting contributions necessary for DataDreamer and providing upstream support.

About

License: MIT License

Languages: Python 96.5%, Shell 3.5%