There are 45 repositories under the data-cleaning topic.
Cleanlab's open-source library is the standard data-centric AI package for data quality and machine learning with messy, real-world data and labels.
Miller is like awk, sed, cut, join, and sort for name-indexed data such as CSV, TSV, and tabular JSON
A light-weight, flexible, and expressive statistical data testing library
Jupyter notebook and datasets from the pandas video series
General Assembly's 2015 Data Science course in Washington, DC
:truck: Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark
Machine learning with dataframes
The JavaScript data transformation and analysis toolkit inspired by Pandas and LINQ.
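An open-source educational chat model from ICALK, East China Normal University. An open-source Chinese-English educational dialogue large language model (general-purpose base model, GPU deployment, data cleaning). With tribute to: LLaMA, MOSS, BELLE, Ziya, vLLM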
Schema-Inspector is a simple JavaScript object sanitization and validation module.
The toolkit to test, validate, and evaluate your models and surface, curate, and prioritize the most valuable data for labeling.
Desbordante is a high-performance data profiler that can discover many different patterns in data using various algorithms. It can also run data cleaning scenarios built on these algorithms. Desbordante has a console version and an easy-to-use web application.
🗣️ A book and repo to get you started programming voice computing applications in Python (10 chapters and 200+ scripts).
Deal with bad samples in your dataset dynamically, use Transforms as Filters, and more!
A comprehensive guide to building a modern data warehouse with SQL Server, including ETL processes, data modeling, and analytics.
Data Science Feature Engineering and Selection Tutorials
Exploratory data analysis 📊 using Python 🐍 of a used car 🚘 database taken from Kaggle
A domain-specific probabilistic programming language for scalable Bayesian data cleaning
Pydantic extension for annotating autocorrecting fields.
LLM-based text extraction from unstructured data such as PDFs, Word documents, and HTML pages. Transform and cluster the text into your desired format. Less information loss, more interpretation, and faster R&D!
🗺️ Data Cleaning and Textual Data Visualization 🗺️
CSV Lint plug-in for Notepad++ for syntax highlighting, CSV validation, automatic column and datatype detection, fixed-width datasets, changing datetime formats and decimal separators, sorting data, counting unique values, and converting to XML, JSON, SQL, etc. A plugin for data cleaning and working with messy data files.
🤖 An automated machine learning framework for audio, text, image, video, or .CSV files (50+ featurizers and 15+ model trainers). Python 3.6 required.
🚕 A spreadsheet-like data preparation web app that works over Optimus (Pandas, Dask, cuDF, Dask-cuDF, Spark and Vaex)
A full data warehouse infrastructure with ETL pipelines running inside Docker, using Apache Airflow for data orchestration, AWS Redshift as the cloud data warehouse, and Metabase for data visualizations such as analytical dashboards.
pyDVL is a library of stable implementations of algorithms for data valuation and influence function computation
Data analysis practice covering cleaning, visualization, and EDA on different datasets using Python, SQL, Power BI, etc.
Portfolio of data science and data analyst projects completed by me for academic, self learning, and hobby purposes.
Cluster and merge similar string values: an R implementation of OpenRefine clustering algorithms
OpenDataVal: a Unified Benchmark for Data Valuation in Python (NeurIPS 2023)