Introduction

With cloud computing, users can tune cloud configurations to meet their performance or cost objectives. In our research project, we aim to find the best cloud configuration for a given workload and a given objective. During our research, we found that performance data is very hard to come by; at least, we could not find performance data that suited our needs. We therefore collected the required data ourselves, and this data repository is the result of that effort. We make this data available to encourage further research in cloud performance optimization.

This data repository includes large-scale performance data of Hadoop and Spark applications on AWS EC2. Since performance varies with different inputs, our data includes multiple combinations of applications and inputs. We use the term "workload" to describe an application together with its input. The workloads are taken from HiBench and spark-perf.

We ran these workloads on numerous cloud configurations on Amazon EC2. Each configuration consists of a virtual machine (VM) type and the number of VMs of that type. This data repository includes both the single-node setting and the multi-node setting. The single-node setting includes 18 VM types and the multi-node setting includes 69 configurations (9 VM types and various numbers of VMs).

For each measurement, we collect the execution time as well as low-level performance information gathered with sar. For more detail, read the description of each dataset.
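As a rough illustration of how the terms above fit together, the following Python sketch models a configuration (VM type plus VM count) and a measurement (workload, configuration, execution time), then picks the best configuration for a time or cost objective. The class names, function names, sample execution times, and hourly prices are all hypothetical and made up for this example; they are not taken from the datasets, and the sketch does not reflect the actual file layout of the repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CloudConfig:
    vm_type: str  # EC2 instance type, e.g. "c4.large" (illustrative)
    num_vms: int  # 1 for the single-node setting


@dataclass(frozen=True)
class Measurement:
    workload: str        # application plus input, e.g. "terasort"
    config: CloudConfig
    exec_time_s: float   # measured execution time in seconds


def running_cost(m, hourly_price):
    """Cost of one run: hours * price per VM-hour * number of VMs."""
    return m.exec_time_s / 3600.0 * hourly_price[m.config.vm_type] * m.config.num_vms


def best_config(measurements, workload, objective="time", hourly_price=None):
    """Return the configuration that minimizes the chosen objective."""
    candidates = [m for m in measurements if m.workload == workload]
    if not candidates:
        raise ValueError(f"no measurements for workload {workload!r}")
    if objective == "time":
        return min(candidates, key=lambda m: m.exec_time_s).config
    if objective == "cost":
        if hourly_price is None:
            raise ValueError("hourly_price is required for the cost objective")
        return min(candidates, key=lambda m: running_cost(m, hourly_price)).config
    raise ValueError(f"unknown objective {objective!r}")


# Made-up sample values, for illustration only (not from the datasets).
sample = [
    Measurement("terasort", CloudConfig("c4.large", 4), 300.0),
    Measurement("terasort", CloudConfig("m4.large", 8), 200.0),
    Measurement("terasort", CloudConfig("c4.large", 2), 500.0),
]
prices = {"c4.large": 0.10, "m4.large": 0.10}  # hypothetical USD per VM-hour

fastest = best_config(sample, "terasort", objective="time")
cheapest = best_config(sample, "terasort", objective="cost", hourly_price=prices)
print(fastest, cheapest)
```

With these made-up numbers, the fastest and the cheapest configurations differ, which is why the objective has to be stated alongside the workload.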

Dataset Overview

ID	Platforms	Systems	Workloads	Description
osr_single_node	AWS EC2	Hadoop 2.7, Spark 2.1, Spark 1.5	sort, terasort, pagerank, wordcount, aggregation, join, scan, chi-feature, chi-gof, chi-mat, spearman, statistics-summary, pearson, svd, pca, word2vec, classification, regression, als, naive-bayes, lr, mm, decision tree, gradient boosted tree, random forest, fp-growth, gmm, kmeans, lda, pic	Multiple workloads running in the single-node setting on AWS
osr_multiple_nodes	AWS EC2	Hadoop 2.7, Spark 2.1, Spark 1.5	terasort, pagerank, wordcount, join, lr, kmeans, naive-bayes, regression	Multiple workloads running in the multi-node setting on AWS

How to Contribute

We encourage researchers to share their performance data. Please submit a pull request.
You can obtain the scripts and required AMI at the scout-scripts repo.

How to Cite

@inproceedings{hsu2018arrow,
  title={Arrow: Low-Level Augmented Bayesian Optimization for Finding the Best Cloud VM},
  author={Hsu, Chin-Jung and Nair, Vivek and Freeh, Vincent W and Menzies, Tim},
  booktitle={the 2018 IEEE 38th International Conference on Distributed Computing Systems (ICDCS 2018)},
  year={2018}
}
@inproceedings{hsu2018micky,
  title={Micky: A Cheaper Alternative for Selecting Cloud Instances},
  author={Hsu, Chin-Jung and Nair, Vivek and Menzies, Tim and Freeh, Vincent},
  booktitle={the IEEE International Conference on Cloud Computing (IEEE CLOUD 2018)},
  year={2018}
}
@article{hsu2018scout,
  title={Scout: An Experienced Guide to Find the Best Cloud Configuration},
  author={Hsu, Chin-Jung and Nair, Vivek and Menzies, Tim and Freeh, Vincent},
  journal={arXiv preprint arXiv:1803.01296},
  year={2018}
}
