This page summarizes the work done as part of a Google Summer of Code 2019 project. It contains links to GitHub pull requests, a performance analysis document, presentation slides, and code documentation.
Project Description: Spark and Parquet Backend for cBioPortal Web API
cBioPortal utilizes a Spring MVC architecture with MyBatis for the persistence layer and a relational database (MySQL) for data storage. As the number and size of cancer datasets increase, high-performance computing and storage will only become more vital in providing an adequate cBioPortal user experience. The primary goal of this project was to create a prototype which improves performance of the existing web APIs that support the Study Summary View for large sample cohorts.
A utility for writing Parquet files and all 7 APIs used in the Study Summary View page were implemented, reviewed, and merged for the proof of concept.
- Parquet writer utility & mutated-genes API: a Java program for converting data files into Parquet-formatted files, plus backend code and JUnit tests for the mutated-genes/fetch API.
- Clinical-data APIs: backend code and JUnit tests for the clinical-data-count, clinical-data-bin-counts, and clinical-data-density-plot APIs.
- Filtered-samples API: backend code and JUnit tests for the filtered-samples API.
- Sample-counts API: backend code and JUnit tests for the sample-counts API.
- Cna-genes API: backend code and JUnit tests for the cna-genes and copy-number-enrichments APIs.
- Documentation: documentation for the new Spark configuration properties, and for the organization and naming conventions of the Parquet files.
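To give a sense of the kind of aggregation behind an endpoint such as clinical-data-count, here is a minimal plain-Java sketch of a filter followed by a group-by-and-count. It is not the project's Spark code: the class `ClinicalDataCountSketch`, the `ClinicalDatum` type, and the method `countByValue` are hypothetical names chosen for illustration, and the real endpoint's request and response shapes are not reproduced here.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical sketch: these names are illustrative and are not cBioPortal classes.
final class ClinicalDataCountSketch {

    /** One clinical data row: a sample, an attribute, and the attribute's value. */
    static final class ClinicalDatum {
        final String sampleId;
        final String attributeId;
        final String value;

        ClinicalDatum(String sampleId, String attributeId, String value) {
            this.sampleId = sampleId;
            this.attributeId = attributeId;
            this.value = value;
        }
    }

    /**
     * Counts rows per value for one attribute, restricted to the given samples.
     * A TreeMap keeps the values in sorted order so the result is deterministic.
     */
    static Map<String, Long> countByValue(List<ClinicalDatum> data,
                                          String attributeId,
                                          Set<String> sampleIds) {
        return data.stream()
                .filter(d -> d.attributeId.equals(attributeId))
                .filter(d -> sampleIds.contains(d.sampleId))
                .collect(Collectors.groupingBy(d -> d.value, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<ClinicalDatum> data = List.of(
                new ClinicalDatum("S1", "SEX", "Female"),
                new ClinicalDatum("S2", "SEX", "Male"),
                new ClinicalDatum("S3", "SEX", "Female"),
                new ClinicalDatum("S4", "SEX", "Male"),
                new ClinicalDatum("S1", "AGE", "61"));

        // Only S1..S3 pass the (hypothetical) study-view filter.
        System.out.println(countByValue(data, "SEX", Set.of("S1", "S2", "S3")));
        // prints {Female=2, Male=1}
    }
}
```

In a Spark-based backend the same filter and group-by steps would typically run over data loaded from Parquet files instead of an in-memory list.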
Performance analysis was done locally with small (1k samples), medium (10k samples), and large (60k and 240k samples) datasets. The results show that the Spark implementation does not improve performance on small datasets, but it does improve performance for some APIs on large datasets.
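This page does not describe how the measurements were taken. As an illustration only, a comparison of this kind could be made with a small timing harness that warms up, repeats the call, and reports the median latency. The class `LatencySketch`, its methods, and the stand-in workload below are hypothetical; in practice the timed call would be a request to the endpoint under test.

```java
import java.util.Arrays;
import java.util.function.Supplier;
import java.util.stream.LongStream;

// Hypothetical sketch of a latency harness; not taken from the project's analysis.
final class LatencySketch {

    /** Median of the given timings in nanoseconds; the input array is not modified. */
    static long median(long[] nanos) {
        if (nanos.length == 0) {
            throw new IllegalArgumentException("at least one timing is required");
        }
        long[] sorted = nanos.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return sorted.length % 2 == 1
                ? sorted[mid]
                : (sorted[mid - 1] + sorted[mid]) / 2;
    }

    /** Runs the call `warmup` times untimed, then `runs` times timed, and returns the median. */
    static long medianLatencyNanos(Supplier<?> call, int warmup, int runs) {
        if (runs <= 0) {
            throw new IllegalArgumentException("runs must be positive");
        }
        for (int i = 0; i < warmup; i++) {
            call.get(); // let the JIT and any caches settle before measuring
        }
        long[] timings = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            call.get();
            timings[i] = System.nanoTime() - start;
        }
        return median(timings);
    }

    public static void main(String[] args) {
        // Stand-in workload; replace with a call to the API being measured.
        Supplier<Long> workload = () -> LongStream.range(0, 1_000_000).sum();
        long nanos = medianLatencyNanos(workload, 3, 11);
        System.out.println("median latency (ms): " + nanos / 1_000_000.0);
    }
}
```

The median is used here instead of the mean because a single slow run, for example one that coincides with garbage collection, would otherwise skew the comparison.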
All six GSoC students working with cBioPortal presented our projects at cBioPortal community meetings. My presentation slides cover the background of the project, the Spark application components, the Parquet writing utilities, the performance analysis, and the Spark UI for monitoring the application.
Incorporating a new technology stack into an existing application always comes with challenges, from choosing between a hybrid mode and a full migration to selecting the right technologies for the application. Presenting the project to the cBioPortal community led to good discussion and suggestions, such as possibly using a Cassandra database for scalable, low-latency storage and testing Spark's parallelization on AWS. Before migrating to a Spark and Parquet technology stack, further analysis and experiments are needed in an environment comparable to the cBioPortal production environment; however, this proof of concept provides insight into the benefits we may get from using Spark and Parquet.