davanstrien / data-for-fine-tuning-llms

Datasets for fine-tuning LLMs

This repo accompanies a session run as part of Mastering LLMs: A Conference For Developers & Data Scientists. The session focused on some of the data issues related to fine-tuning LLMs.

The notebooks focus on balancing three requirements for fine-tuning data: sufficient diversity, high quality, and the right quantity (i.e. avoiding duplication).
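As a rough illustration of the duplication concern, the sketch below drops exact duplicates after light normalization. It is not taken from the notebooks: the function names `normalize` and `drop_exact_duplicates` are hypothetical, and it assumes that lowercasing and collapsing whitespace is an acceptable notion of "the same example". Near-duplicate detection would need a different technique.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text.strip().lower())


def drop_exact_duplicates(texts: list[str]) -> list[str]:
    """Keep the first occurrence of each normalized text, preserving order."""
    seen: set[str] = set()
    unique: list[str] = []
    for text in texts:
        # Hash the normalized form so memory use does not grow with text length.
        digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique


# Illustrative inputs only; the second entry differs from the first
# just in casing and spacing, so it is treated as a duplicate.
examples = [
    "Summarize this dataset card.",
    "summarize   this dataset card.",
    "Translate this sentence.",
]
deduped = drop_exact_duplicates(examples)
print(deduped)
```

Keeping the first occurrence and preserving order makes the result easy to compare against the original data.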

Diagram showing goals of the notebooks

Slides

Notebooks

Synthetic data pipelines

  • dataset-card-summaries: This folder contains a pipeline for generating a synthetic dataset of tl;dr summaries of datasets, based on their dataset cards.
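To give a feel for the first step of such a pipeline, here is a minimal sketch of turning a dataset card into a summarization prompt. It is an assumption about the general shape of the task, not the code in the dataset-card-summaries folder: the function name `build_tldr_prompt`, the prompt wording, and the `max_chars` truncation limit are all hypothetical.

```python
def build_tldr_prompt(card_text: str, max_chars: int = 2000) -> str:
    """Build a prompt asking a model for a tl;dr of a dataset card.

    Hypothetical helper: the wording and the truncation limit are
    illustrative choices, not the pipeline's actual settings.
    """
    # Truncate long cards so the prompt stays within a bounded size.
    card_excerpt = card_text.strip()[:max_chars]
    return (
        "Write a one-sentence tl;dr summary of the dataset described below.\n\n"
        f"Dataset card:\n{card_excerpt}\n\n"
        "tl;dr:"
    )


# Made-up card text, used only to show the call.
sample_card = "This dataset contains short product reviews labelled by sentiment."
prompt = build_tldr_prompt(sample_card)
print(prompt)
```

The resulting string would then be sent to whichever model generates the summaries; that step is left out here because it depends on the model and client used.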

Other resources for synthetic data generation
