There are 38 repositories under the data-processing topic.
A collection of handy Bash One-Liners and terminal tricks for data processing and Linux system maintenance.
Python ETL framework for stream processing, real-time analytics, LLM pipelines, and RAG.
A lightweight, flexible, and expressive statistical data testing library
Extract Transform Load for Python 3.5+
Source code accompanying the book Data Science on the Google Cloud Platform, Valliappa Lakshmanan, O'Reilly 2017
Google Cloud Dataflow provides a simple, powerful model for building both batch and streaming parallel data processing pipelines.
Integrating the Best of TF into PyTorch, for Machine Learning, Natural Language Processing, and Text Generation. This is part of the CASL project: http://casl-project.ai/
All-in-one text de-duplication
Scalable data preprocessing and curation toolkit for LLMs
A list about Apache Kafka
Machine Learning notebooks for refreshing concepts.
Harmonious distributed data analysis in Rust.
Deal with bad samples in your dataset dynamically, use Transforms as Filters, and more!
A multi-cloud framework for big data analytics and embarrassingly parallel jobs that provides a universal API for building parallel applications in the cloud ☁️🚀
Super fast list of dicts to pre-formatted tables conversion library for Python 2/3
Elastic data processing with Apache Pulsar and Apache Flink