MuhammadTaha / spark-flink-benchmark

Parallel Batch Processing: Apache Spark and Apache Flink Benchmark

Benchmark tool to measure the throughput and runtime of Flink and Spark for given workloads. The main goal of this project is to answer the following research question:

Measure the throughput and runtime (for Group-By, Sorting, and Aggregate) of Flink and Spark clusters on different input sizes with a fixed data structure (Book-Store) on fixed, comparable systems, and compare the results.

The answer can be found here.

Architecture

(Architecture diagram)

The bucket that contains the JAR files is a public AWS S3 bucket.

Benchmark Client

The Benchmark Client first validates the input parameters and returns an error if they are invalid. Once the input has been verified, data is generated randomly according to the given size and uploaded to the AWS S3 bucket. A benchmark measuring runtime and throughput is then run for the given input. Finally, the measured runtime and throughput results are stored in a CSV file.
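The sketch below illustrates this flow in Java. It is a minimal, hypothetical reconstruction, not the project's actual code: the toy data generator, the hard-coded bucket name (taken from the examples below), and the region are assumptions, and the benchmark/EMR steps are only indicated in comments.

```java
import com.amazonaws.auth.AWSStaticCredentialsProvider;
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

import java.io.File;
import java.io.FileWriter;
import java.io.IOException;

public class BenchmarkClientSketch {
    public static void main(String[] args) throws IOException {
        // 1. Validate the input parameters and fail fast on bad input.
        if (args.length < 6) {
            System.err.println("Usage: <aws_username> <aws_secret_key> <cluster_type>"
                    + " <metric_type> <s3_bucket_data_file> <s3_emr_log_dir>"
                    + " [data_size] [number_of_iterations]");
            System.exit(1);
        }

        // 2. Generate random book-store rows (toy generator; the real client
        //    sizes the file according to data_size, e.g. 4m, 1g, 5g).
        File data = new File("data.csv");
        try (FileWriter w = new FileWriter(data)) {
            for (long id = 1; id <= 1_000; id++) {
                w.write(id + "," + (id % 100) + ",Title-" + id + ",Fantasy,Author-"
                        + (id % 10) + ",300,Publisher,2019-01-01,9.99\n");
            }
        }

        // 3. Upload the generated data into the AWS S3 bucket.
        AmazonS3 s3 = AmazonS3ClientBuilder.standard()
                .withCredentials(new AWSStaticCredentialsProvider(
                        new BasicAWSCredentials(args[0], args[1])))
                .withRegion("us-east-1") // assumed region
                .build();
        s3.putObject("test-emr-ws1819", "data/data.csv", data);

        // 4. Run the benchmark on the EMR cluster and 5. write the measured
        //    runtime/throughput to a results CSV -- omitted in this sketch.
    }
}
```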

Execution

```
java -jar benchmarkclient-1.0.jar <aws_username> <aws_secret_key> <cluster_type> <metric_type> <s3_bucket_data_file> <s3_emr_log_dir> [data_size] [number_of_iterations]
```

Input Parameters

| Input Field | Type | Required | Description | Examples |
| --- | --- | --- | --- | --- |
| aws_username | String | Yes | Your AWS username. | `*****@34kj` |
| aws_secret_key | String | Yes | Your AWS secret key. | `**********************262mUimD25jtF` |
| cluster_type | String | Yes | The cluster type to benchmark. | `flink`, `spark` |
| metric_type | String | Yes | The workload to benchmark. | `aggregate`, `groupby`, `sorting` |
| s3_bucket_data_file | String | Yes | The path to the data.csv file in the AWS S3 bucket. | `s3://test-emr-ws1819/data/data.csv` |
| s3_emr_log_dir | String | Yes | The path of the log directory in your AWS S3 bucket. | `s3://test-emr-ws1819/log/` |
| data_size | String | No | The size of the generated input data. | `4m`, `1g`, `5g` |
| number_of_iterations | int | No | The number of benchmark iterations to run. | `1`, `5`, `10` |
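For example, benchmarking the sorting workload on a Flink cluster with 1 GB of generated data over 5 iterations (credentials left as placeholders) could look like:

```
java -jar benchmarkclient-1.0.jar <aws_username> <aws_secret_key> flink sorting s3://test-emr-ws1819/data/data.csv s3://test-emr-ws1819/log/ 1g 5
```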

Workloads/Metrics

We defined the following fixed data structure (CSV file):

```
id, userId, title, genre, author, pages, publisher, date, price
```
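An illustrative row matching this structure (the values are made up for this example, not output of the actual generator) might be:

```
42,7,The Hobbit,Fantasy,J. R. R. Tolkien,310,Allen & Unwin,1937-09-21,12.99
```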

This structure is used to run the workload jobs (Group-By, Sorting, and Aggregate) on each cluster type (Flink | Spark) for different input sizes.

When the jobs are done, the system collects the runtime and throughput (the metrics) for each job.
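As an illustration, a Group-By workload over this structure could look like the following Spark job in Java. This is a minimal sketch, not the project's actual job: the class name, the aggregation columns, and the in-process timing are assumptions.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import static org.apache.spark.sql.functions.avg;
import static org.apache.spark.sql.functions.count;

public class GroupByWorkload {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("GroupByWorkload").getOrCreate();

        // Read the book-store CSV with the fixed structure defined above.
        Dataset<Row> books = spark.read()
                .schema("id LONG, userId LONG, title STRING, genre STRING, author STRING,"
                        + " pages INT, publisher STRING, date STRING, price DOUBLE")
                .csv(args[0]); // e.g. s3://test-emr-ws1819/data/data.csv

        long start = System.nanoTime();

        // The Group-By workload: group the books by genre and aggregate.
        Dataset<Row> byGenre = books.groupBy("genre").agg(count("id"), avg("price"));
        long groups = byGenre.count(); // count() forces execution of the job

        double runtimeSeconds = (System.nanoTime() - start) / 1e9;
        // Throughput follows as input size / runtime, with the input size
        // known from the data_size parameter.
        System.out.printf("groups=%d runtime=%.2fs%n", groups, runtimeSeconds);

        spark.stop();
    }
}
```

The Flink jobs would run the same workloads against the same schema, which is what keeps the runtime and throughput numbers comparable across the two engines.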

Documentation

Report

Poster

Languages

- Jupyter Notebook: 88.5%
- Java: 11.4%
- Shell: 0.1%