This Python package provides a series of tools to integrate and query genomics, transcriptomics, proteomics, and clinical data (also known as multi-omics data). With scalable data-frame manipulation tools, OpenOmics facilitates common coding tasks when preparing data for bioinformatics analysis.

Documentation (Latest | Stable) | OpenOmics at a glance

Features

OpenOmics assists in the integration of heterogeneous multi-omics bioinformatics data. The library provides a Python API as well as an interactive Dash web interface. It features support for:

Genomics, transcriptomics, proteomics, and clinical data.
Harmonization with 20+ popular annotation, interaction, and disease-association databases.

OpenOmics also has an efficient data pipeline that bridges the popular Pandas data manipulation library and Dask distributed processing. It offers:

A standard pipeline for dataset indexing, table joining, and querying that is transparent and customizable for end users.
Efficient disk storage for large multi-omics datasets using the Parquet format.
Multiple data types that support both interaction and sequence data, with export to NetworkX graphs or downstream machine learning.
An easy-to-use API that works with the external Galaxy tool interface or the built-in Dash web interface (WIP).

Installation via pip:

pip install openomics

How to use OpenOmics:

Importing the openomics library

from openomics import MultiOmics

Load the TCGA LUAD data included in the test dataset (preprocessed with TCGA-Assembler). It is located at tests/data/TCGA_LUAD.

folder_path = "tests/data/TCGA_LUAD/"

Load the multi-omics data: gene expression, microRNA expression, lncRNA expression, somatic mutation, and protein expression.

from openomics import MessengerRNA, MicroRNA, LncRNA, SomaticMutation, Protein

# Load each expression dataframe
mRNA = MessengerRNA(data=folder_path+"LUAD__geneExp.txt", transpose=True,
                    usecols="GeneSymbol|TCGA", gene_index="GeneSymbol", gene_level="gene_name")
miRNA = MicroRNA(data=folder_path+"LUAD__miRNAExp__RPM.txt", transpose=True,
                 usecols="GeneSymbol|TCGA", gene_index="GeneSymbol", gene_level="transcript_name")
lncRNA = LncRNA(data=folder_path+"TCGA-rnaexpr.tsv", transpose=True,
                usecols="Gene_ID|TCGA", gene_index="Gene_ID", gene_level="gene_id")
som = SomaticMutation(data=folder_path+"LUAD__somaticMutation_geneLevel.txt",
                      transpose=True, usecols="GeneSymbol|TCGA", gene_index="gene_name")
pro = Protein(data=folder_path+"protein_RPPA.txt", transpose=True,
              usecols="GeneSymbol|TCGA", gene_index="GeneSymbol", gene_level="protein_name")

# Create an integrated MultiOmics dataset
luad_data = MultiOmics(cohort_name="LUAD")
luad_data.add_clinical_data(
    clinical_data=folder_path+"nationwidechildrens.org_clinical_patient_luad.txt")

luad_data.add_omic(mRNA)
luad_data.add_omic(miRNA)
luad_data.add_omic(lncRNA)
luad_data.add_omic(som)
luad_data.add_omic(pro)

luad_data.build_samples()

Each dataset is stored as a Pandas DataFrame. Below are all the tables imported for TCGA LUAD. For each, the first number is the number of rows (samples, or patients for the PATIENTS table) and the second is the number of columns (features).

PATIENTS (522, 5)
SAMPLES (1160, 6)
DRUGS (461, 4)
MessengerRNA (576, 20472)
SomaticMutation (587, 21070)
MicroRNA (494, 1870)
LncRNA (546, 12727)
Protein (364, 154)
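
The SAMPLES table has more rows than PATIENTS, most likely because one patient can contribute several samples. In the barcodes shown in this README, the patient barcode (TCGA-05-4244) is the first three hyphen-separated fields of the sample barcode (TCGA-05-4244-01A). The sketch below illustrates that relationship in plain Python. It is not OpenOmics code: the helper name patient_barcode is hypothetical, and the last sample barcode is made up to show two samples from one patient.

```python
from collections import defaultdict


def patient_barcode(sample_barcode: str) -> str:
    """Return the patient part of a TCGA sample barcode (first three fields)."""
    return "-".join(sample_barcode.split("-")[:3])


# The last barcode is made up to illustrate two samples from one patient.
samples = ["TCGA-05-4244-01A", "TCGA-05-4249-01A", "TCGA-05-4244-11A"]

# Group sample barcodes under the patient they belong to.
samples_by_patient = defaultdict(list)
for sample in samples:
    samples_by_patient[patient_barcode(sample)].append(sample)

for patient, patient_samples in samples_by_patient.items():
    print(patient, patient_samples)
```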

Annotate LncRNAs with GENCODE genomic annotations

# Import GENCODE database (from URL)
from openomics.database import GENCODE

gencode = GENCODE(path="ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_32/",
                  file_resources={"long_noncoding_RNAs.gtf": "gencode.v32.long_noncoding_RNAs.gtf.gz",
                                  "basic.annotation.gtf": "gencode.v32.basic.annotation.gtf.gz",
                                  "lncRNA_transcripts.fa": "gencode.v32.lncRNA_transcripts.fa.gz",
                                  "transcripts.fa": "gencode.v32.transcripts.fa.gz"},
                  remove_version_num=True,
                  npartitions=5)

# Annotate LncRNAs with GENCODE by gene_id
luad_data.LncRNA.annotate_genomics(gencode, index="gene_id",
                                   columns=['feature', 'start', 'end', 'strand', 'tag', 'havana_gene'])

luad_data.LncRNA.annotations.info()

<class 'pandas.core.frame.DataFrame'>
Index: 13729 entries, ENSG00000082929 to ENSG00000284600
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   feature      13729 non-null  object
 1   start        13729 non-null  object
 2   end          13729 non-null  object
 3   strand       13729 non-null  object
 4   tag          13729 non-null  object
 5   havana_gene  13729 non-null  object
dtypes: object(6)
memory usage: 1.4+ MB
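
GENCODE gene IDs carry a version suffix after a dot, while the annotation index above uses unversioned IDs such as ENSG00000082929, which is presumably what remove_version_num=True is for. The sketch below shows, in plain Python, what stripping the version and then looking annotations up by gene_id amounts to. It is an illustration only, not the OpenOmics implementation: strip_version is a hypothetical helper, and the version suffixes and strand values are placeholders.

```python
def strip_version(gene_id: str) -> str:
    """Drop the '.N' version suffix from an Ensembl-style gene ID."""
    return gene_id.split(".", 1)[0]


# Hypothetical annotation records keyed by versioned ID (values are placeholders).
versioned_annotations = {
    "ENSG00000082929.8": {"strand": "+"},
    "ENSG00000284600.1": {"strand": "-"},
}

# Re-key the annotations by unversioned gene ID.
annotations = {strip_version(gid): rec for gid, rec in versioned_annotations.items()}

# A dictionary lookup stands in for the join on gene_id;
# the last ID is made up and has no annotation, so it maps to None.
expression_genes = ["ENSG00000082929", "ENSG00000284600", "ENSG00000000000"]
annotated = {gene: annotations.get(gene) for gene in expression_genes}
print(annotated)
```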

Each multi-omics or clinical table can be accessed through luad_data.data[], for example:

luad_data.data["PATIENTS"]

	bcr_patient_barcode	gender	race	histologic_subtype	pathologic_stage
bcr_patient_barcode
TCGA-05-4244	TCGA-05-4244	MALE	NaN	Lung Adenocarcinoma- Not Otherwise Specified (...	Stage IV
TCGA-05-4245	TCGA-05-4245	MALE	NaN	Lung Adenocarcinoma- Not Otherwise Specified (...	Stage III
TCGA-05-4249	TCGA-05-4249	MALE	NaN	Lung Adenocarcinoma- Not Otherwise Specified (...	Stage I
TCGA-05-4250	TCGA-05-4250	FEMALE	NaN	Lung Adenocarcinoma- Not Otherwise Specified (...	Stage III
TCGA-05-4382	TCGA-05-4382	MALE	NaN	Lung Adenocarcinoma Mixed Subtype	Stage I

522 rows × 5 columns

luad_data.data["MessengerRNA"]

gene_name	A1BG	A1BG-AS1	A1CF	A2M	A2ML1	A4GALT	A4GNT	AAAS	AACS	AACSP1	...	ZXDA	ZXDB	ZXDC	ZYG11A	ZYG11B	ZYX	ZZEF1	ZZZ3	psiTPTE22
TCGA-05-4244-01A	4.756500	5.239211	0.000000	13.265291	0.431997	7.043317	1.033652	9.348765	9.652057	0.763921	...	5.350285	8.197321	9.907260	0.763921	10.088859	11.471139	9.768648	9.170597	2.932118
TCGA-05-4249-01A	6.920471	7.056843	0.402722	14.650247	1.383939	9.178805	0.717123	9.241537	9.967223	0.000000	...	5.980428	8.950001	10.204971	4.411650	9.622978	11.199826	10.153700	9.433116	7.499637
TCGA-05-4250-01A	5.696542	6.136327	0.000000	14.048541	0.000000	8.481646	0.996244	9.203535	9.560412	0.733962	...	5.931168	8.517334	9.722642	4.782796	8.895339	12.408981	10.194168	9.060342	2.867956
TCGA-05-4382-01A	7.198727	6.809804	0.000000	14.509730	2.532591	9.117559	1.657045	9.251035	10.078124	1.860883	...	5.373036	8.441914	9.888267	6.041142	9.828389	12.725186	10.192589	9.376841	5.177029

576 rows × 20472 columns

To match samples across different omics, use:

luad_data.match_samples(modalities=["MicroRNA", "MessengerRNA"])

Index(['TCGA-05-4384-01A', 'TCGA-05-4390-01A', 'TCGA-05-4396-01A',
       'TCGA-05-4405-01A', 'TCGA-05-4410-01A', 'TCGA-05-4415-01A',
       'TCGA-05-4417-01A', 'TCGA-05-4424-01A', 'TCGA-05-4425-01A',
       'TCGA-05-4427-01A',
       ...
       'TCGA-NJ-A4YG-01A', 'TCGA-NJ-A4YI-01A', 'TCGA-NJ-A4YP-01A',
       'TCGA-NJ-A4YQ-01A', 'TCGA-NJ-A55A-01A', 'TCGA-NJ-A55O-01A',
       'TCGA-NJ-A55R-01A', 'TCGA-NJ-A7XG-01A', 'TCGA-O1-A52J-01A',
       'TCGA-S2-AA1A-01A'],
      dtype='object', length=465)
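
Conceptually, matching samples means keeping the sample barcodes that appear in every requested modality. The sketch below shows that intersection in plain Python. It is an assumption about the behaviour, not the OpenOmics implementation: common_samples is a hypothetical helper, and the toy membership of each list is made up.

```python
def common_samples(*sample_lists):
    """Return samples present in every list, in the order of the first list."""
    if not sample_lists:
        return []
    shared = set(sample_lists[0]).intersection(*sample_lists[1:])
    return [s for s in sample_lists[0] if s in shared]


# Toy sample indices for two modalities (membership is made up).
mirna_samples = ["TCGA-05-4384-01A", "TCGA-05-4390-01A", "TCGA-05-4396-01A"]
mrna_samples = ["TCGA-05-4390-01A", "TCGA-05-4244-01A", "TCGA-05-4384-01A"]

matched = common_samples(mirna_samples, mrna_samples)
print(matched)  # ['TCGA-05-4384-01A', 'TCGA-05-4390-01A']
```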

To prepare the data for classification

# This function selects only patients with pathologic stages "Stage I" and "Stage II"
X_multiomics, y = luad_data.load_dataframe(modalities=["MessengerRNA", "MicroRNA", "LncRNA"], target=['pathologic_stage'],
                                     pathologic_stages=['Stage I', 'Stage II'])
print(X_multiomics['MessengerRNA'].shape, X_multiomics['MicroRNA'].shape, X_multiomics['LncRNA'].shape, y.shape)

(336, 20472) (336, 1870) (336, 12727) (336, 1)

	pathologic_stage
TCGA-05-4390-01A	Stage I
TCGA-05-4405-01A	Stage I
TCGA-05-4410-01A	Stage I
TCGA-05-4417-01A	Stage I
TCGA-05-4424-01A	Stage II
TCGA-05-4427-01A	Stage II
TCGA-05-4433-01A	Stage I
TCGA-05-5423-01A	Stage II
TCGA-05-5425-01A	Stage II
TCGA-05-5428-01A	Stage II
TCGA-05-5715-01A	Stage I
TCGA-38-4631-01A	Stage I
TCGA-38-7271-01A	Stage I
TCGA-38-A44F-01A	Stage I
TCGA-44-2655-11A	Stage I

336 rows × 1 columns
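
All returned tables share the same 336 rows, which suggests that the call keeps only samples with an allowed stage that are present in every requested modality. The sketch below mimics that selection with plain dictionaries. It is a guess at the behaviour under that assumption, not the library's code: select_by_stage is a hypothetical helper and the feature values are made up.

```python
def select_by_stage(features, stages, allowed):
    """Keep samples with an allowed stage that are present in every modality."""
    keep = [s for s, stage in stages.items() if stage in allowed]
    keep = [s for s in keep if all(s in table for table in features.values())]
    X = {name: {s: table[s] for s in keep} for name, table in features.items()}
    y = {s: stages[s] for s in keep}
    return X, y


# Toy inputs: feature values are made up for illustration.
stages = {
    "TCGA-05-4390-01A": "Stage I",
    "TCGA-05-4424-01A": "Stage II",
    "TCGA-05-4244-01A": "Stage IV",
}
features = {
    "MessengerRNA": {
        "TCGA-05-4390-01A": [1.0, 2.0],
        "TCGA-05-4424-01A": [0.5, 3.0],
        "TCGA-05-4244-01A": [4.0, 0.0],
    },
    "MicroRNA": {
        "TCGA-05-4390-01A": [7.0],
        "TCGA-05-4244-01A": [2.0],
    },
}

X, y = select_by_stage(features, stages, allowed={"Stage I", "Stage II"})
print(list(y))  # only the sample that passes both filters remains
```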

Log2-transform the mRNA and microRNA expression values (the lncRNA transform is left commented out)

import numpy as np

def expression_val_transform(x):
    # log2(x + 1): the +1 keeps zero expression values at zero
    return np.log2(x+1)
X_multiomics['MessengerRNA'] = X_multiomics['MessengerRNA'].applymap(expression_val_transform)
X_multiomics['MicroRNA'] = X_multiomics['MicroRNA'].applymap(expression_val_transform)
# X_multiomics['LncRNA'] = X_multiomics['LncRNA'].applymap(expression_val_transform)
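
The transform is log2(x + 1). Adding 1 before taking the logarithm keeps a raw value of 0 defined and mapped to 0, while large values are compressed. A few worked values, using only the standard library (the input values are arbitrary examples, not taken from the dataset):

```python
import math


def expression_val_transform(x: float) -> float:
    """log2(x + 1): maps 0 to 0 and compresses large values."""
    return math.log2(x + 1)


for raw in [0, 1, 3, 1023]:
    print(raw, "->", expression_val_transform(raw))
# 0 -> 0.0
# 1 -> 1.0
# 3 -> 2.0
# 1023 -> 10.0
```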

Classification of Cancer Stage

from sklearn import preprocessing
from sklearn import metrics
from sklearn.svm import SVC, LinearSVC
import sklearn.linear_model
from sklearn.model_selection import train_test_split

binarizer = preprocessing.LabelEncoder()
binarizer.fit(y)
binarizer.transform(y)

array([0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1,
       0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1,
       0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0,
       0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0,
       0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0,
       1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1,
       1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1,
       1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0,
       0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0,
       1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0])
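
LabelEncoder assigns integer codes to the sorted unique labels, so "Stage I" becomes 0 and "Stage II" becomes 1. The plain-Python sketch below reproduces that mapping for the first seven labels of the pathologic_stage table shown earlier. encode_labels is a hypothetical stand-in written for illustration, not scikit-learn code.

```python
def encode_labels(labels):
    """Map each label to the index of its class in sorted order."""
    classes = sorted(set(labels))
    index = {label: i for i, label in enumerate(classes)}
    return classes, [index[label] for label in labels]


# First seven labels from the pathologic_stage table above.
labels = ["Stage I", "Stage I", "Stage I", "Stage I", "Stage II", "Stage II", "Stage I"]

classes, codes = encode_labels(labels)
print(classes)  # ['Stage I', 'Stage II']
print(codes)    # [0, 0, 0, 0, 1, 1, 0]
```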

for omic in ["MessengerRNA", "MicroRNA"]:
    print(omic)

    X_train, X_test, Y_train, Y_test = \
        train_test_split(X_multiomics[omic], y, test_size=0.3, random_state=np.random.randint(0, 10000), stratify=y)
    print(X_train.shape, X_test.shape)

    # Center the features using the training split only,
    # then apply the same transform to both splits
    scaler = preprocessing.StandardScaler(copy=True, with_mean=True, with_std=False)
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)

    model = LinearSVC(C=1e-2, penalty='l1', class_weight='balanced', dual=False, multi_class="ovr")
    # Alternative models:
#     model = sklearn.linear_model.LogisticRegression(C=1e-0, penalty='l1', fit_intercept=False, class_weight="balanced")
#     model = SVC(C=1e0, kernel="rbf", class_weight="balanced", decision_function_shape="ovo")

    model.fit(X=X_train, y=Y_train)
    # Number of features kept by the L1 penalty
    print("NONZERO", len(np.nonzero(model.coef_)[0]))
    print("Training accuracy", metrics.accuracy_score(Y_train, model.predict(X_train)))
    print(metrics.classification_report(y_true=Y_test, y_pred=model.predict(X_test)))

MessengerRNA
(254, 20472) (109, 20472)
NONZERO 0
Training accuracy 0.6929133858267716
             precision    recall  f1-score   support

    Stage I       0.69      1.00      0.82        75
   Stage II       0.00      0.00      0.00        34

avg / total       0.47      0.69      0.56       109

MicroRNA
(254, 1870) (109, 1870)
NONZERO 0
Training accuracy 0.6929133858267716
             precision    recall  f1-score   support

    Stage I       0.69      1.00      0.82        75
   Stage II       0.00      0.00      0.00        34

avg / total       0.47      0.69      0.56       109
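
A note on reading this output: NONZERO 0 means the L1 penalty drove every coefficient to zero, so the model returns the same class for every sample. The report is consistent with a classifier that always predicts Stage I. The arithmetic below re-derives the "avg / total" row from the support counts under that assumption, using only the standard library.

```python
# Test-set support per class, copied from the report above.
support = {"Stage I": 75, "Stage II": 34}
total = sum(support.values())

# Assumption: every test sample is predicted as "Stage I".
# Precision for a class that is never predicted is reported as 0.
precision = {"Stage I": support["Stage I"] / total, "Stage II": 0.0}
recall = {"Stage I": 1.0, "Stage II": 0.0}


def f1(p: float, r: float) -> float:
    """Harmonic mean of precision and recall, 0 when both are 0."""
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0


f1_scores = {c: f1(precision[c], recall[c]) for c in support}


def weighted(metric):
    """Support-weighted average, as in the 'avg / total' row."""
    return sum(metric[c] * support[c] for c in support) / total


print(round(weighted(precision), 2),
      round(weighted(recall), 2),
      round(weighted(f1_scores), 2))
# 0.47 0.69 0.56
```

Since C is the inverse of the regularization strength in scikit-learn, a larger C weakens the L1 penalty and may leave some coefficients nonzero.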

Credits

This package was created with Cookiecutter_ and the pyOpenSci/cookiecutter-pyopensci_ project template, based off audreyr/cookiecutter-pypackage_.

.. _Cookiecutter: https://github.com/audreyr/cookiecutter
.. _pyOpenSci/cookiecutter-pyopensci: https://github.com/pyOpenSci/cookiecutter-pyopensci
.. _audreyr/cookiecutter-pypackage: https://github.com/audreyr/cookiecutter-pypackage
