Cross-platform normalization enables machine learning model training on microarray and RNA-seq data simultaneously

The full output of a version of this analysis is available at Figshare under the DOI: 10.6084/m9.figshare.5035997.v2

Summary

We performed a series of supervised and unsupervised machine learning evaluations, as well as differential expression analyses, to assess which normalization methods are best suited for combining data from microarray and RNA-seq platforms.

We evaluated five normalization approaches in each of these analyses:

log-transformation (LOG)
non-paranormal transformation (NPN)
quantile normalization (QN)
Training Distribution Matching (TDM)
standardizing scores (z-scoring; Z).
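
As a rough illustration of three of these approaches (LOG, QN, and Z), the sketch below applies textbook versions of each to a made-up expression matrix using base R only. It is not the code used in this repository: the toy data, the pseudocount of 1, the choice to z-score each gene, and the helper name `quantile_normalize` are all assumptions made for the example. NPN and TDM are omitted because they rely on dedicated packages.

```r
# Toy expression matrix: rows are genes, columns are samples.
# The values are simulated for illustration only.
set.seed(1)
expr <- matrix(rexp(20, rate = 0.1), nrow = 5,
               dimnames = list(paste0("gene", 1:5), paste0("sample", 1:4)))

# LOG: log2-transform with a pseudocount of 1 (the pseudocount is an assumption).
log_expr <- log2(expr + 1)

# Z: standardize each gene (row) to mean 0 and standard deviation 1.
# Standardizing per gene is an assumption made for this sketch.
z_expr <- t(scale(t(log_expr)))

# QN: give every sample (column) the same empirical distribution by
# replacing each value with the mean of the values that share its rank.
# quantile_normalize is a hypothetical helper, not a function from this repo.
quantile_normalize <- function(mat) {
  sorted <- apply(mat, 2, sort)                        # sort each column
  target <- rowMeans(sorted)                           # reference distribution
  ranks <- apply(mat, 2, rank, ties.method = "first")  # rank within each column
  normalized <- apply(ranks, 2, function(r) target[r])
  dimnames(normalized) <- dimnames(mat)
  normalized
}
qn_expr <- quantile_normalize(log_expr)
```

After quantile normalization every column of `qn_expr` contains the same set of values, only in a different gene order.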

A version of this project is detailed in our pre-print, "Cross-Platform Normalization Enables Machine Learning Model Training On Microarray And RNA-Seq Data Simultaneously."

We are actively making improvements to this codebase; see #12.

Breast Cancer Data

The Cancer Genome Atlas BRCA data used for these analyses are available at Zenodo.

# To download data, run in top directory:
sh brca_data_download.sh

Analysis

Machine Learning Pipeline

Here's a schematic overview of our machine learning experiments:

Figure: Overview of supervised and unsupervised machine learning experiments.

520 TCGA Breast Cancer samples run on both microarray and RNA-seq were split into a training (2/3) and holdout set (1/3).
RNA-seq’d samples were "titrated" into the training set, 10% at a time (0-100%) resulting in eleven training sets for each normalization method.
Machine learning applications:
  Three supervised multi-class (BRCA PAM50 subtype) classifiers (LASSO, linear SVM, and Random Forest) were trained on each training set and tested on the microarray and RNA-seq holdout sets.
  The holdout sets were projected onto and back out of the training set space using two unsupervised techniques, Independent Components Analysis and Principal Components Analysis, to obtain reconstructed holdout sets.
  The same trained classifiers were then used to predict on the reconstructed holdout sets.
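
The split and titration steps above can be sketched in base R as follows. This is a minimal illustration rather than the pipeline's actual code: the sample identifiers are hypothetical, the rounding of set sizes is a guess, and the sketch assumes that each training sample is drawn from exactly one platform at a given titration level.

```r
set.seed(42)
sample_ids <- paste0("sample", seq_len(520))  # hypothetical identifiers

# Hold out one third of the matched samples; train on the rest.
n_holdout <- round(length(sample_ids) / 3)
holdout_ids <- sample(sample_ids, n_holdout)
train_ids <- setdiff(sample_ids, holdout_ids)

# For each RNA-seq percentage (0, 10, ..., 100), choose which training
# samples are taken from RNA-seq; the remainder are taken from microarray.
seq_percentages <- seq(0, 100, by = 10)
titration <- lapply(seq_percentages, function(pct) {
  n_seq <- round(pct / 100 * length(train_ids))
  seq_ids <- sample(train_ids, n_seq)
  list(rnaseq = seq_ids, microarray = setdiff(train_ids, seq_ids))
})
names(titration) <- paste0(seq_percentages, "%")
```

The result is a list of eleven training-set definitions, from all microarray (`titration[["0%"]]`) to all RNA-seq (`titration[["100%"]]`). Each one would then be normalized with every method under evaluation.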

# To run the machine learning pipeline, run in top directory:
sh run_machine_learning_experiments.sh

# To run one repeat of the subtype classifier pipeline, use:
Rscript run_experiments.R
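
To make the "projected onto and back out of the training set space" step more concrete, here is a hedged sketch of the Principal Components Analysis case using `stats::prcomp` on simulated data. The function name `reconstruct_with_pca`, the toy matrices, and the number of components kept are assumptions for illustration; the repository's own implementation may differ, and the Independent Components Analysis case is not shown.

```r
# Reconstruct a holdout set using principal components learned from the
# training set only. Both inputs are samples x genes matrices.
# reconstruct_with_pca is a hypothetical helper, not a function from this repo.
reconstruct_with_pca <- function(train, holdout, k) {
  pca <- prcomp(train, center = TRUE, scale. = FALSE)
  loadings <- pca$rotation[, seq_len(k), drop = FALSE]
  # Project the holdout samples into the training component space...
  centered <- sweep(holdout, 2, pca$center, "-")
  scores <- centered %*% loadings
  # ...and map them back out to gene space.
  sweep(scores %*% t(loadings), 2, pca$center, "+")
}

set.seed(7)
n_genes <- 20
train <- matrix(rnorm(40 * n_genes), nrow = 40)    # simulated training data
holdout <- matrix(rnorm(10 * n_genes), nrow = 10)  # simulated holdout data

reconstructed <- reconstruct_with_pca(train, holdout, k = 5)
reconstruction_error <- mean((holdout - reconstructed)^2)
```

Keeping fewer components than genes makes the reconstruction lossy, so predicting on the reconstructed holdout set indicates how much subtype signal the learned components retain.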

Differential Expression Pipeline

Here's a schematic overview of our main differential expression experiment:

Figure: Overview of differential expression experiment.

All matched TCGA breast cancer samples (n = 520) were considered when building the platform-specific “silver standards.” These standards are the genes that were differentially expressed at a specified False Discovery Rate (FDR) using data sets composed entirely of one platform and processed in a standard way: log2-transformed microarray data and “untransformed” RSEM count data (preprocessed using the limma::voom function).
RNA-seq’d samples were "titrated" into the data set, 10% at a time (0-100%), resulting in eleven experimental sets for each normalization method.
Differentially expressed genes (DEGs) were identified using the limma package. We compared the Her2 and LumA subtypes, and also Basal vs. all other samples.
Lists of experimental DEGs were compared to standard gene sets using Jaccard similarity.
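
The Jaccard similarity used in the last step is the size of the intersection of two gene sets divided by the size of their union. A minimal base R sketch follows; the `jaccard` helper and both gene lists are made up for illustration and are not results from this analysis.

```r
# Jaccard similarity between two sets of gene identifiers.
# Returns NA when both sets are empty, since the ratio is undefined.
jaccard <- function(a, b) {
  a <- unique(a)
  b <- unique(b)
  union_size <- length(union(a, b))
  if (union_size == 0) return(NA_real_)
  length(intersect(a, b)) / union_size
}

# Hypothetical DEG lists (not real results).
silver_standard <- c("ESR1", "ERBB2", "GATA3", "FOXA1", "KRT5")
experimental <- c("ESR1", "ERBB2", "GATA3", "MKI67")

jaccard(silver_standard, experimental)  # 3 shared genes / 6 distinct genes = 0.5
```

A value of 1 means the experimental DEG list matches the silver standard exactly, and 0 means the two lists share no genes.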

# Note: This requires the data to be processed to include matched samples only, 
# and split into training and test sets (0-expression_data_overlap_and_split.R)

# To run the differential expression pipeline, run in top directory:
sh run_differential_expression_experiments.sh

Requirements

This analysis was performed in R. It requires the R and Bioconductor packages listed in check_installs.R.

One GitHub package (TDM) is required. To install, run:

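# Install devtools first if it is not already available
if (!requireNamespace("devtools", quietly = TRUE)) install.packages("devtools")
devtools::install_github("greenelab/TDM")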

This analysis is in the process of being moved to a Docker image.

Funding

This work was supported by the Gordon and Betty Moore Foundation [GBMF 4552] and the National Institutes of Health [T32-AR007442, U01-TR001263].
