jiangxinke / data_management_LLM

A collection of explorations in training data management for large language models


Data Management for LLM

A curated list of resources on training data management for large language models.

Contents

Pretraining

Data Quantity

  • Scaling Laws

    • Scaling Laws for Neural Language Models (Arxiv, Jan. 2020) [Paper]

    • An empirical analysis of compute-optimal large language model training (NeurIPS 2022) [Paper]

  • Data Repetition

    • Scaling Laws and Interpretability of Learning from Repeated Data (Arxiv, May 2022) [Paper]

    • Will we run out of data? An analysis of the limits of scaling datasets in Machine Learning (Arxiv, Oct. 2022) [Paper]

    • Scaling Data-Constrained Language Models (Arxiv, May 2023) [Paper] [Code]

    • To Repeat or Not To Repeat: Insights from Scaling LLM under Token-Crisis (Arxiv, May 2023) [Paper]

    • D4: Improving LLM Pretraining via Document De-Duplication and Diversification (Arxiv, Aug. 2023) [Paper]
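
As a rough companion to the scaling-law and data-repetition papers above, here is a minimal sketch that evaluates the Chinchilla-style parametric loss L(N, D) = E + A/N^alpha + B/D^beta. The constants are approximately the fits reported by Hoffmann et al. (2022) and the example numbers are illustrative; this is not an implementation of any specific paper in this list.

```python
# Minimal sketch of the Chinchilla-style parametric scaling law:
#   L(N, D) = E + A / N**alpha + B / D**beta
# The constants below are approximately the fits reported by Hoffmann et al.
# (2022); treat them as illustrative, not authoritative.

def chinchilla_loss(n_params: float, n_tokens: float,
                    E: float = 1.69, A: float = 406.4, B: float = 410.7,
                    alpha: float = 0.34, beta: float = 0.28) -> float:
    """Predicted pretraining loss for a model with n_params parameters
    trained on n_tokens tokens."""
    return E + A / n_params ** alpha + B / n_tokens ** beta


if __name__ == "__main__":
    # Rule of thumb from the same line of work: compute-optimal training
    # uses on the order of 20 tokens per parameter.
    n_params = 7e9  # a 7B-parameter model
    for n_tokens in (140e9, 280e9, 1.4e12):
        print(f"{n_tokens:.1e} tokens -> predicted loss "
              f"{chinchilla_loss(n_params, n_tokens):.3f}")
```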

Data Quality

  • Deduplication

    • Deduplicating training data makes language models better (ACL 2022) [Paper] [Code]
    • Deduplicating training data mitigates privacy risks in language models (ICML 2022) [Paper]
    • Noise-Robust De-Duplication at Scale (ICLR 2023) [Paper]
    • SemDeDup: Data-efficient learning at web-scale through semantic deduplication (Arxiv, Mar. 2023) [Paper] [Code]
    • The MiniPile Challenge for Data-Efficient Language Models (Arxiv, Apr. 2023) [Paper] [Dataset]
  • Quality Filtering

    • An Empirical Exploration in Quality Filtering of Text Data (Arxiv, Sep. 2021) [Paper]
    • Quality at a glance: An audit of web-crawled multilingual datasets (ACL 2022) [Paper]
    • A Pretrainer's Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity (Arxiv, May 2023) [Paper]
    • Textbooks Are All You Need (Arxiv, Jun. 2023) [Paper] [Code]
    • The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only (NeurIPS Datasets and Benchmarks Track 2023) [Paper] [Dataset]
    • Textbooks Are All You Need II: phi-1.5 technical report (Arxiv, Sep. 2023) [Paper] [Model]
    • When less is more: Investigating Data Pruning for Pretraining LLMs at Scale (Arxiv, Sep. 2023) [Paper]
  • Toxicity Filtering

    • Detoxifying language models risks marginalizing minority voices (NAACL-HLT, 2021) [Paper] [Code]
    • Challenges in detoxifying language models (EMNLP Findings, 2021) [Paper]
    • What’s in the Box? A Preliminary Analysis of Undesirable Content in the Common Crawl Corpus (Arxiv, May 2021) [Paper] [Code]
    • A Pretrainer's Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity (Arxiv, May 2023) [Paper]
  • Social Biases

    • Documenting large webtext corpora: A case study on the Colossal Clean Crawled Corpus (EMNLP 2021) [Paper]

    • An empirical survey of the effectiveness of debiasing techniques for pre-trained language models (ACL, 2022) [Paper] [Code]

    • Whose language counts as high quality? Measuring language ideologies in text data selection (EMNLP, 2022) [Paper] [Code]

    • From Pretraining Data to Language Models to Downstream Tasks: Tracking the Trails of Political Biases Leading to Unfair NLP Models (ACL 2023) [Paper] [Code]

  • Diversity & Age

    • Beyond Scale: the Diversity Coefficient as a Data Quality Metric Demonstrates LLMs are Pre-trained on Formally Diverse Data (Arxiv, Jun. 2023) [Paper]

    • D2 Pruning: Message Passing for Balancing Diversity and Difficulty in Data Pruning (Arxiv, Oct. 2023) [Paper] [Code]

    • A Pretrainer's Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity (Arxiv, May 2023) [Paper]
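
The deduplication and quality-filtering entries above motivate the following minimal sketch of a document-level cleaning pass: exact duplicates are dropped by hashing normalized text, near-duplicates by n-gram Jaccard similarity. Production pipelines (suffix arrays, MinHash/LSH, model-based quality classifiers) scale far better; the threshold and n-gram size here are purely illustrative.

```python
# Minimal sketch of document-level deduplication: exact duplicates via a hash
# of normalized text, near-duplicates via n-gram Jaccard similarity.
# Illustrative only; not the method of any specific paper above.

import hashlib
from typing import Iterable, List, Set


def normalize(text: str) -> str:
    return " ".join(text.lower().split())


def ngrams(text: str, n: int = 5) -> Set[str]:
    tokens = normalize(text).split()
    if len(tokens) <= n:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def deduplicate(docs: Iterable[str], threshold: float = 0.8) -> List[str]:
    kept: List[str] = []
    kept_grams: List[Set[str]] = []
    seen_hashes: Set[str] = set()
    for doc in docs:
        digest = hashlib.md5(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate
        grams = ngrams(doc)
        if any(jaccard(grams, g) >= threshold for g in kept_grams):
            continue  # near duplicate
        seen_hashes.add(digest)
        kept.append(doc)
        kept_grams.append(grams)
    return kept
```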

Domain Composition

  • LaMDA: Language Models for Dialog Applications (Arxiv, Jan. 2022) [Paper] [Code]
  • Data Selection for Language Models via Importance Resampling (Arxiv, Feb. 2023) [Paper] [Code]
  • CodeGen2: Lessons for Training LLMs on Programming and Natural Languages (ICLR 2023) [Paper] [Model]
  • DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining (Arxiv, May 2023) [Paper] [Code]
  • A Pretrainer's Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity (Arxiv, May 2023) [Paper]
  • SlimPajama-DC: Understanding Data Combinations for LLM Training (Arxiv, Sep. 2023) [Paper] [Model] [Dataset]
  • DoGE: Domain Reweighting with Generalization Estimation (Arxiv, Oct. 2023) [Paper]
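
A minimal sketch of how domain mixture weights, the quantity that papers such as DoReMi and DoGE optimize, are applied at sampling time. The domain names and weights below are hypothetical, not taken from any paper above.

```python
# Minimal sketch of sampling pretraining documents according to fixed domain
# mixture weights. Domain names and weights are hypothetical.

import random
from typing import Dict, Iterator, List


def mixture_sampler(domain_data: Dict[str, List[str]],
                    domain_weights: Dict[str, float],
                    seed: int = 0) -> Iterator[str]:
    """Sample a domain according to its weight, then a document uniformly
    at random from that domain."""
    rng = random.Random(seed)
    domains = list(domain_weights)
    weights = [domain_weights[d] for d in domains]
    while True:
        domain = rng.choices(domains, weights=weights, k=1)[0]
        yield rng.choice(domain_data[domain])


if __name__ == "__main__":
    data = {"web": ["web doc"] * 3, "code": ["code file"] * 3, "books": ["book"] * 3}
    weights = {"web": 0.6, "code": 0.25, "books": 0.15}
    sampler = mixture_sampler(data, weights)
    print([next(sampler) for _ in range(8)])
```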

Data Management Systems

  • Data-Juicer: A One-Stop Data Processing System for Large Language Models (Arxiv, Sep. 2023) [Paper] [Code]
  • Oasis: Data Curation and Assessment System for Pretraining of Large Language Models (Arxiv, Nov. 2023) [Paper] [Code]

Supervised Fine-Tuning

Data Quantity

  • Exploring the Impact of Instruction Data Scaling on Large Language Models: An Empirical Study on Real-World Use Cases (Arxiv, Mar. 2023) [Paper]
  • LIMA: Less Is More for Alignment (Arxiv, May 2023) [Paper]
  • Maybe Only 0.5% Data is Needed: A Preliminary Exploration of Low Training Data Instruction Tuning (Arxiv, May 2023) [Paper]
  • Scaling Relationship on Learning Mathematical Reasoning with Large Language Models (Arxiv, Aug. 2023) [Paper] [Code]
  • How Abilities in Large Language Models are Affected by Supervised Fine-tuning Data Composition (Arxiv, Oct. 2023) [Paper]
  • Dynamics of Instruction Tuning: Each Ability of Large Language Models Has Its Own Growth Pace (Arxiv, Oct. 2023) [Paper]

Data Quality

  • Instruction Quality

    • LIMA: Less Is More for Alignment (Arxiv, May 2023) [Paper]
    • Enhancing Chat Language Models by Scaling High-quality Instructional Conversations (Arxiv, May 2023) [Paper] [Code]
    • INSTRUCTEVAL: Towards Holistic Evaluation of Instruction-Tuned Large Language Models (Arxiv, Jun. 2023) [Paper] [Code]
    • Instruction mining: High-quality instruction data selection for large language models (Arxiv, Jul. 2023) [Paper]
    • Harnessing the Power of David against Goliath: Exploring Instruction Data Generation without Using Closed-Source Models (Arxiv, Aug. 2023) [Paper]
    • Self-Alignment with Instruction Backtranslation (Arxiv, Aug. 2023) [Paper]
  • Instruction Diversity

    • Stanford Alpaca (Mar. 2023) [Code]
    • Enhancing Chat Language Models by Scaling High-quality Instructional Conversations (Arxiv, May 2023) [Paper] [Code]
    • LIMA: Less Is More for Alignment (Arxiv, May 2023) [Paper]
    • #InsTag: Instruction Tagging for Analyzing Supervised Fine-Tuning of Large Language Models (Arxiv, Aug. 2023) [Paper] [Code]
    • Explore-Instruct: Enhancing Domain-Specific Instruction Coverage through Active Exploration (Arxiv, Oct. 2023) [Paper] [Code]
  • Instruction Complexity

    • WizardLM: Empowering Large Language Models to Follow Complex Instructions (Arxiv, Apr. 2023) [Paper] [Code]
    • WizardCoder: Empowering Code Large Language Models with Evol-Instruct (Arxiv, Jun. 2023) [Paper] [Code]
    • Orca: Progressive Learning from Complex Explanation Traces of GPT-4 (Arxiv, Jun. 2023) [Paper] [Code]
    • A Preliminary Study of the Intrinsic Relationship between Complexity and Alignment (Arxiv, Aug. 2023) [Paper]
    • #InsTag: Instruction Tagging for Analyzing Supervised Fine-Tuning of Large Language Models (Arxiv, Aug. 2023) [Paper] [Code]
    • Can Large Language Models Understand Real-World Complex Instructions? (Arxiv, Sep. 2023) [Paper] [Benchmark]
  • Prompt Design

    • Reframing Instructional Prompts to GPTk's Language (ACL Findings, 2022) [Paper] [Code]
    • Prompt Waywardness: The Curious Case of Discretized Interpretation of Continuous Prompts (NAACL, 2022) [Paper] [Code]
    • Demystifying Prompts in Language Models via Perplexity Estimation (Arxiv, Dec. 2022) [Paper]
    • Did You Read the Instructions? Rethinking the Effectiveness of Task Definitions in Instruction Learning (ACL, 2023) [Paper] [Code]
    • Do Models Really Learn to Follow Instructions? An Empirical Study of Instruction Tuning (ACL, 2023) [Paper]
    • The False Promise of Imitating Proprietary LLMs (Arxiv, May 2023) [Paper]
    • Exploring Format Consistency for Instruction Tuning (Arxiv, Jul. 2023) [Paper]
    • Mind the instructions: a holistic evaluation of consistency and interactions in prompt-based learning (Arxiv, Oct. 2023) [Paper]
    • Dynamics of Instruction Tuning: Each Ability of Large Language Models Has Its Own Growth Pace (Arxiv, Oct. 2023) [Paper]
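
In the spirit of the perplexity-based prompt analysis listed above (e.g. "Demystifying Prompts in Language Models via Perplexity Estimation"), the sketch below ranks candidate prompt phrasings by their perplexity under a small causal LM. GPT-2 is only a stand-in scorer and the prompt variants are hypothetical, not the models or prompts used in those papers.

```python
# Minimal sketch of ranking prompt phrasings by perplexity under a small
# causal LM. GPT-2 and the candidate prompts are stand-ins.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()


def perplexity(text: str) -> float:
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    return torch.exp(loss).item()


candidate_prompts = [
    "Summarize the following article in one sentence:",
    "TL;DR:",
    "Please write a brief one-sentence summary of the text below:",
]
for prompt in sorted(candidate_prompts, key=perplexity):
    print(f"{perplexity(prompt):8.1f}  {prompt}")
```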

Task Composition

  • Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ Tasks (EMNLP 2022) [Paper] [Dataset]
  • Finetuned Language Models Are Zero-Shot Learners (ICLR 2022) [Paper] [Dataset]
  • Multitask Prompted Training Enables Zero-Shot Task Generalization (ICLR 2022) [Paper] [Code]
  • Scaling Instruction-Finetuned Language Models (Arxiv, Oct. 2022) [Paper] [Dataset]
  • OPT-IML: Scaling Language Model Instruction Meta Learning through the Lens of Generalization (Arxiv, Dec. 2022) [Paper] [Model]
  • The Flan Collection: Designing Data and Methods for Effective Instruction Tuning (ICML, 2023) [Paper] [Dataset]
  • Exploring the Benefits of Training Expert Language Models over Instruction Tuning (ICML, 2023) [Paper] [Code]
  • Maybe Only 0.5% Data is Needed: A Preliminary Exploration of Low Training Data Instruction Tuning (Arxiv, May 2023) [Paper]
  • How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources (Arxiv, Jun. 2023) [Paper] [Code]
  • How Abilities in Large Language Models are Affected by Supervised Fine-tuning Data Composition (Arxiv, Oct. 2023) [Paper]
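
Several of the multitask instruction-tuning collections above mix tasks roughly in proportion to their number of examples, capped so that very large tasks do not dominate. A minimal sketch of such capped examples-proportional mixing follows; the task sizes and cap value are illustrative, not drawn from any paper above.

```python
# Minimal sketch of examples-proportional task mixing with a cap.
# Task names, sizes, and the cap are illustrative.

from typing import Dict


def mixing_weights(task_sizes: Dict[str, int], cap: int = 30_000) -> Dict[str, float]:
    """Each task's weight is proportional to its example count, but no task
    counts for more than `cap` examples, so huge tasks cannot dominate."""
    capped = {task: min(n, cap) for task, n in task_sizes.items()}
    total = sum(capped.values())
    return {task: n / total for task, n in capped.items()}


if __name__ == "__main__":
    sizes = {"nli": 400_000, "qa": 120_000, "summarization": 25_000}
    print(mixing_weights(sizes))
```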

Data-Efficient Learning

  • Data Quantity
    • Becoming self-instruct: introducing early stopping criteria for minimal instruct tuning (Arxiv, Jul. 2023) [Paper]
    • How Abilities in Large Language Models are Affected by Supervised Fine-tuning Data Composition (Arxiv, Oct. 2023) [Paper]
  • Instruction Quality
    • NLU on Data Diets: Dynamic Data Subset Selection for NLP Classification Tasks (SustaiNLP, 2023) [Paper]
    • Instruction Mining: High-Quality Instruction Data Selection for Large Language Models (Arxiv, Jul. 2023) [Paper]
    • AlpaGasus: Training A Better Alpaca with Fewer Data (Arxiv, Jul. 2023) [Paper]
    • OpenChat: Advancing Open-source Language Models with Mixed-Quality Data (Arxiv, Sep. 2023) [Paper] [Code]
  • Instruction Diversity
    • Self-Evolved Diverse Data Sampling for Efficient Instruction Tuning (Arxiv, Nov. 2023) [Paper] [Code]
  • Task Composition
    • Data-Efficient Finetuning Using Cross-Task Nearest Neighbors (ACL Findings, 2023) [Paper] [Code]

    • Dynosaur: A Dynamic Growth Paradigm for Instruction-Tuning Data Curation (Arxiv, May 2023) [Paper] [Code]

    • MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning (Arxiv, Sep. 2023) [Paper] [Code]

  • Others
    • Data-Juicer: A One-Stop Data Processing System for Large Language Models (Arxiv, Sep. 2023) [Paper] [Code]

    • LoBaSS: Gauging Learnability in Supervised Fine-tuning Data (Arxiv, Oct. 2023) [Paper]
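
Much of the instruction-quality work in this section (e.g. Instruction Mining, AlpaGasus) selects a small, high-scoring subset of instruction data. The sketch below shows only the generic selection step; the quality_score heuristic is a placeholder, not any paper's scoring model, which would typically be an LLM judge or a learned quality estimator.

```python
# Minimal sketch of score-based instruction data selection: score each
# (instruction, response) pair and keep the top-scoring fraction.
# quality_score is a placeholder heuristic, not any paper's method.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Example:
    instruction: str
    response: str


def quality_score(ex: Example) -> float:
    """Placeholder heuristic: longer, lexically diverse responses score higher."""
    tokens = ex.response.split()
    if not tokens:
        return 0.0
    distinct_ratio = len(set(tokens)) / len(tokens)
    return min(len(tokens), 200) / 200 * distinct_ratio


def select_top_fraction(data: List[Example],
                        scorer: Callable[[Example], float],
                        keep_fraction: float = 0.1) -> List[Example]:
    ranked = sorted(data, key=scorer, reverse=True)
    k = max(1, int(len(ranked) * keep_fraction))
    return ranked[:k]
```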

Useful Resources
