There are 12 repositories under the data-centric-ai topic.
Cleanlab's open-source library is the standard data-centric AI package for data quality and machine learning with messy, real-world data and labels.
The data scientist's open-source choice to scale, assess and maintain natural language data. Treat training data like a software artifact.
A curated, but incomplete, list of data-centric AI resources.
Automatically find issues in image datasets and practice data-centric computer vision.
Resources for Data Centric AI
Curated list of open source tooling for data-centric AI on unstructured data.
Lab assignments for Introduction to Data-Centric AI, MIT IAP 2024 👩🏽‍💻
Official PyTorch implementation of the paper "Dataset Distillation with Neural Characteristic Function: A Minmax Perspective" (NCFM) in CVPR 2025 (Highlight).
[ICML 2025] Programming Every Example: Lifting Pre-training Data Quality Like Experts at Scale
[NeurIPS 2023] This is the code for the paper `Large Language Model as Attributed Training Data Generator: A Tale of Diversity and Bias`.
pyDVL is a library of stable implementations of algorithms for data valuation and influence function computation
Introduction to Data-Centric AI, MIT IAP 2024 🤖
OpenDataVal: a Unified Benchmark for Data Valuation in Python (NeurIPS 2023)
Papers about training data quality management for ML models.
Self-Evolved Diverse Data Sampling for Efficient Instruction Tuning
nbsynthetic is a simple and robust tabular synthetic data generation library for small and medium-sized datasets
[ECCV 2022] Official Implementation for Unsupervised Selective Labeling for More Effective Semi-Supervised Learning
Trending projects & awesome papers about data-centric LLM studies.
A list of data-efficient and data-centric LLM (Large Language Model) papers. Our Survey Paper: Towards Efficient LLM Post Training: A Data-centric Perspective
🧼🔎 A holistic self-supervised data cleaning strategy to detect irrelevant samples, near duplicates and label errors (NeurIPS'24).
This data-centric AI repository implements a robust deep learning method (LFBNet) for fully automated tumor segmentation in whole-body [18]F-FDG PET/CT images.
A curated list of awesome open source tools and commercial products to catalog, version, and manage data 🚀
Client interface to Cleanlab Studio
Frontiers in Neuroinformatics 2022: Local Label Point Correction for Edge Detection of Overlapping Cervical Cells
Unsupervised classification to improve the quality of a bird song recording dataset. https://doi.org/10.1016/j.ecoinf.2022.101952
Estimate dataset difficulty and detect label mistakes using reconstruction error ratios!
A better Alpaca Model Trained with Less Data (only 9k instructions of the original set)
Code for a top 5% finish in the Data-Centric AI Competition organized by Andrew Ng and DeepLearning.AI