AJlearner46 / Deduplicate-flanv2-finetune-LLaMa3-

Deduplicate the FLAN v2 dataset and fine-tune LLaMa3 on the result.

Deduplicate-flanv2-finetune-LLaMa3-

FLAN v2 CoT Deduplicated Dataset

Data Preprocessing

  • Remove instructions whose 'targets' length is below 100.
  • Deduplicate the dataset using cosine similarity with a threshold of 0.95.
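
The two preprocessing steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the notebook's actual code: the README does not say what unit the length of 100 is measured in (tokens are assumed here), which field is compared for similarity ('targets' is assumed), or how the texts are vectorized (plain word-count vectors stand in for whatever representation was used). The function names `tokenize`, `filter_short_targets`, `cosine`, and `deduplicate` are hypothetical, and the regex tokenizer is only a stand-in for bert-base-uncased.

```python
import math
import re
from collections import Counter

MIN_TARGET_LEN = 100   # from the README; the unit (tokens) is an assumption
SIM_THRESHOLD = 0.95   # from the README


def tokenize(text):
    """Lowercase word tokenizer; a simple stand-in for bert-base-uncased."""
    return re.findall(r"[a-z0-9]+", text.lower())


def filter_short_targets(rows, min_len=MIN_TARGET_LEN):
    """Keep rows whose 'targets' field has at least min_len tokens."""
    return [r for r in rows if len(tokenize(r["targets"])) >= min_len]


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def deduplicate(rows, threshold=SIM_THRESHOLD, key="targets"):
    """Greedy pass: keep a row only if it is below the similarity
    threshold against every row already kept (assumed: >= threshold
    counts as a duplicate)."""
    kept, kept_vecs = [], []
    for row in rows:
        vec = Counter(tokenize(row[key]))
        if all(cosine(vec, kv) < threshold for kv in kept_vecs):
            kept.append(row)
            kept_vecs.append(vec)
    return kept
```

Calling `deduplicate(filter_short_targets(rows))` on a list of dicts with a 'targets' key applies both steps in order. The greedy pass compares each row against every kept row, so its cost grows quadratically with dataset size; at FLAN v2 scale, batched matrix operations or approximate nearest-neighbour search would likely be needed.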

Finetune

  • Fine-tuned the LLaMa3-8b model on this dataset and quantized it.

Huggingface links

Acknowledgments

  • The original dataset is provided by SirNeural/flan_v2.
  • Tokenizer used: bert-base-uncased from Hugging Face.


Languages

Language: Jupyter Notebook 100.0%