data-processing's repositories
kafka-embedded
Runs embedded, in-memory Apache Kafka instances. Helpful for integration testing.
kafka-manager
A tool for managing Apache Kafka.
kangaroo
Hadoop utilities for Kafka
klio
Smarter data pipelines for audio.
mpire
A Python package for easy multiprocessing that is faster than the standard multiprocessing module
Neuraxle
Build neat pipelines with the right abstractions to do AutoML. Let your pipeline steps have hyperparameter spaces. Enable checkpoints to cut duplicate calculations. Go from research to production environment easily.
rabit
Reliable Allreduce and Broadcast Interface for distributed machine learning
bloop
A hot bloop for your productivity
crawler4j
Open Source Web Crawler for Java
dask
Task scheduling and blocked algorithms for parallel processing
dataduct
DataPipeline for humans.
disque
Disque is a distributed message broker
emr-bootstrap-actions
This repository holds the Amazon Elastic MapReduce sample bootstrap actions
faust
Python Stream Processing
fireant
Data analysis and reporting tool for quick access to custom charts and tables in Jupyter Notebooks and in the shell.
flink
Mirror of Apache Flink
gain
Web crawling framework based on asyncio for everyone.
GoogleScraper
A Python module to scrape several search engines (such as Google, Yandex, Bing, DuckDuckGo, Baidu and others) using proxies (socks4/5, http proxy) and many different IPs, with asynchronous networking support (very fast).
grpc-java
The Java gRPC implementation. HTTP/2-based RPC
HiBench
HiBench is a Hadoop benchmark suite.
hydra
Hydra is a framework for elegantly configuring complex applications
Persimmon
A visual dataflow programming language for sklearn
pyspider
A Powerful Spider System with Web UI
samoa
SAMOA (Scalable Advanced Massive Online Analysis) is a distributed streaming machine learning (ML) framework that contains a programming abstraction for distributed streaming ML algorithms.
spark-redshift
Spark and Redshift integration
Stream-Framework
Stream Framework is a Python library that allows you to build news feeds, activity streams and notification systems using Cassandra and/or Redis. The authors of Stream-Framework also provide a cloud service for feed technology.
ufora
Compiled, automatically parallel Python for data science