There are 3 repositories under the data-selection topic.
Official Repository of "LLM × DATA" Survey Paper
[ICML 2024] LESS: Selecting Influential Data for Targeted Instruction Tuning
A Survey on Data Selection for Language Models
⛔ [DEPRECATED] Adapt Transformer-based language models to new text domains
🔥[VLDB'26] Official repository for the paper "LEAD: Iterative Data Selection for Efficient LLM Instruction Tuning".
Code for ACL 2025 Main paper "Data Whisperer: Efficient Data Selection for Task-Specific LLM Fine-Tuning via Few-Shot In-Context Learning".
InstructionGPT-4
[ACL 2025 main] SCAR: Data Selection via Style Consistency-Aware Response Ranking for Efficient Instruction-Tuning of Large Language Models
DataFlex is a data-centric training framework that enhances model performance by either selecting the most influential samples, optimizing their weights, or adjusting their mixing ratios.
[ACL2025 Findings] Official code for MIG: Automatic Data Selection for Instruction Tuning by Maximizing Information Gain in Semantic Space
[ACL 2023] The code for our ACL'23 paper Cold-Start Data Selection for Few-shot Language Model Fine-tuning: A Prompt-Based Uncertainty Propagation Approach
This is an official repository for "Performance Scaling via Optimal Transport: Enabling Data Selection from Partially Revealed Sources" (NeurIPS 2023).
Enhancing Efficiency in Multidevice Federated Learning through Data Selection
Enhanced spatio-temporal electric load forecasts with less data using active deep learning
Repository for the experiments in my paper accepted to the CLIN Journal: "Selecting Parallel In-domain Sentences for Neural Machine Translation Using Monolingual Texts"
Keras sentence classification
Dynamic Transfer Learning for Low-Resource Neural Machine Translation
Code for NeurIPS 2023 Paper (Imitation Learning from Imperfection: Theoretical Justifications and Algorithms)
An Approach to Enhancing the Efficacy of Post-Training Using Synthetic Data by Iterative Data Selection
CORE: Mitigating Catastrophic Forgetting in Continual Learning through Cognitive Replay (CogSci 2024 Oral)
Skill-Targeted Adaptive Training
Code for Generative Deduplication For Social Media Data Selection (Findings of EMNLP 2024)
This repo contains the code for "Prioritizing Data Acquisition For End-to-End Speech Model Improvement", accepted at ICASSP 2024
This repository contains the data and code for the paper "Self-training with Two-phase Self-augmentation for Few-shot Dialogue Generation" (EMNLP2022-Findings).
Use embedding data from LLMs to determine the most different text in a given corpus.
Official Repository for the Paper: Chasing Random: Instruction Selection Strategies Fail to Generalize
A Python Tool for Selecting Domain-Specific (Contextually Similar Data) for Machine Translation
Quilt: Robust Data Segment Selection against Concept Drifts (AAAI 2024)
[KDD 2025] Proxy-Validated Importance-Aware Federated Sample Selection with Meta Learning
Autoguided Online Data Curation for Diffusion Model Training
A project for selecting only part of a PDF file. It's useful when you want to extract information with a Python library such as fitz.
This repo contains the code for "Privacy Preserving Data Selection for Bias Mitigation in Speech Models"