gstamatakis / bigdataprojects

Projects on Spark and Hadoop

Projects on various BigData platforms

Projects

This repository hosts the following projects; more details can be found in each project's individual README.

Clicking one of the following links takes you directly to that project's module.

Skyline operator implemented in HadoopMR.

Distributed Bloom Filter and Count-Min sketches in Apache Storm.

Scheduling workloads in Spark, Flink, Apex and GPUs based on various metrics.

Calculating the Jaccard Index of terms and categories using a Per-Split SemiJoin algorithm in HadoopMR.
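For readers unfamiliar with the metric in the last project, the Jaccard index of two sets A and B is |A ∩ B| / |A ∪ B|. The following is a minimal single-machine sketch of that formula only; it is not the Per-Split SemiJoin algorithm and is not taken from the project's code. The class name `JaccardSketch`, the method `jaccard`, the sample document ids, and the choice to return 0.0 for two empty sets are all assumptions made for illustration.

```java
import java.util.HashSet;
import java.util.Set;

public class JaccardSketch {

    // Jaccard index: |A ∩ B| / |A ∪ B|.
    // Assumption: two empty sets yield 0.0 (the formula is undefined there).
    public static <T> double jaccard(Set<T> a, Set<T> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 0.0;
        }
        // Copy first so the caller's set is not modified by retainAll.
        Set<T> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        // |A ∪ B| = |A| + |B| - |A ∩ B|
        int unionSize = a.size() + b.size() - intersection.size();
        return (double) intersection.size() / unionSize;
    }

    public static void main(String[] args) {
        // Hypothetical data: ids of documents containing a term,
        // and ids of documents belonging to a category.
        Set<String> docsWithTerm = Set.of("d1", "d2", "d3");
        Set<String> docsInCategory = Set.of("d2", "d3", "d4");
        System.out.println(jaccard(docsWithTerm, docsInCategory)); // prints 0.5
    }
}
```

Here the two sets share 2 of 4 distinct documents, so the index is 0.5. In a distributed setting the intersection and union sizes would be computed across splits rather than in memory as above.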

Used frameworks

Links redirect to each framework's download page.

Apache Spark

Apache Storm

Apache Flink

Apache Hadoop

Apache Hive

Apache Kafka

Apache NiFi

Elasticsearch (entire ELK stack)

Docker

The docker folder in the root directory contains various docker-compose.yml files for some of the frameworks used in these projects. Docker is particularly useful when complex networking is involved or rapid prototyping is necessary.

Structure

Inside each module there may be more submodules, usually one for each implementation (e.g. Spark, Hadoop, ...).

Building

This repository uses Maven 3 to build its submodules. To build all of the submodules, run the following from the root of this repo:

mvn clean package

Inside each submodule there will be a target directory with the module's uberjar.

To build just a single artifact (e.g. the Hadoop implementation of the skyline), run:

mvn clean package -pl :hadoopSkyline

Languages

Java 99.6%, Python 0.3%, Cuda 0.1%