gawbul / OpenOmics

A bioinformatics API and web-app to integrate multi-omics datasets & interface with public databases.

Home Page: https://openomics.readthedocs.io/en/latest/readme.html

This Python package provides a series of tools to integrate and query genomics, transcriptomics, proteomics, and clinical data (collectively, multi-omics data). With scalable data-frame manipulation tools, OpenOmics simplifies the common coding tasks involved in preparing data for bioinformatics analysis.

Features

OpenOmics assists in the integration of heterogeneous multi-omics bioinformatics data. The library provides a Python API as well as an interactive Dash web interface. It features support for:

  • Genomics, Transcriptomics, Proteomics, and Clinical data.
  • Harmonization with 20+ popular annotation, interaction, and disease-association databases.

OpenOmics also has an efficient data pipeline that bridges the popular Pandas data manipulation library and Dask distributed processing. The pipeline offers:

  • A standard pipeline for dataset indexing, table joining, and querying that is transparent and customizable for end users.
  • Efficient disk storage for large multi-omics datasets using the Parquet format.
  • Multiple data types that support both interaction and sequence data, and that allow users to export to NetworkX graphs or to downstream machine learning.
  • An easy-to-use API that works with the external Galaxy tool interface or the built-in Dash web interface (WIP).

Installation via pip:

pip install openomics

How to use OpenOmics:

Importing the openomics library

from openomics import MultiOmics

Load the TCGA LUAD data included in the test dataset (preprocessed with TCGA-Assembler). It is located at tests/data/TCGA_LUAD.

folder_path = "tests/data/TCGA_LUAD/"

Load the multi-omics data: gene expression, microRNA expression, lncRNA expression, somatic mutation, and protein expression

from openomics import MessengerRNA, MicroRNA, LncRNA, SomaticMutation, Protein

# Load each expression dataframe
mRNA = MessengerRNA(data=folder_path+"LUAD__geneExp.txt", transpose=True,
                    usecols="GeneSymbol|TCGA", gene_index="GeneSymbol", gene_level="gene_name")
miRNA = MicroRNA(data=folder_path+"LUAD__miRNAExp__RPM.txt", transpose=True,
                 usecols="GeneSymbol|TCGA", gene_index="GeneSymbol", gene_level="transcript_name")
lncRNA = LncRNA(data=folder_path+"TCGA-rnaexpr.tsv", transpose=True,
                usecols="Gene_ID|TCGA", gene_index="Gene_ID", gene_level="gene_id")
som = SomaticMutation(data=folder_path+"LUAD__somaticMutation_geneLevel.txt",
                      transpose=True, usecols="GeneSymbol|TCGA", gene_index="gene_name")
pro = Protein(data=folder_path+"protein_RPPA.txt", transpose=True,
              usecols="GeneSymbol|TCGA", gene_index="GeneSymbol", gene_level="protein_name")

# Create an integrated MultiOmics dataset
luad_data = MultiOmics(cohort_name="LUAD")
luad_data.add_clinical_data(
    clinical_data=folder_path+"nationwidechildrens.org_clinical_patient_luad.txt")

luad_data.add_omic(mRNA)
luad_data.add_omic(miRNA)
luad_data.add_omic(lncRNA)
luad_data.add_omic(som)
luad_data.add_omic(pro)

luad_data.build_samples()
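
As a rough illustration of what the loader arguments do, the sketch below uses only the Python standard library. It assumes that usecols is treated as a regular expression over column names (which the "GeneSymbol|TCGA" pattern suggests) and that transpose=True turns a genes-by-samples file into a samples-by-genes table. The function load_expression, the toy table, and the EntrezID column are hypothetical and are not part of OpenOmics.

```python
import csv
import io
import re

# Hypothetical toy table: genes in rows, samples in columns.
RAW = (
    "GeneSymbol\tEntrezID\tTCGA-05-4244-01A\tTCGA-05-4249-01A\n"
    "A1BG\t1\t4.7565\t6.9205\n"
    "A2M\t2\t13.2653\t14.6502\n"
)

def load_expression(text, usecols, gene_index):
    """Keep columns whose name matches the `usecols` regex, then transpose
    so that samples become rows and genes become columns."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    header, body = rows[0], rows[1:]
    keep = [i for i, name in enumerate(header) if re.search(usecols, name)]
    gene_col = header.index(gene_index)
    sample_cols = [i for i in keep if i != gene_col]
    genes = [row[gene_col] for row in body]
    # Transpose: one dict per sample, keyed by gene symbol.
    return {
        header[i]: {gene: float(row[i]) for gene, row in zip(genes, body)}
        for i in sample_cols
    }

table = load_expression(RAW, usecols="GeneSymbol|TCGA", gene_index="GeneSymbol")
print(sorted(table))               # sample barcodes are now the row keys
print(table["TCGA-05-4244-01A"])   # the EntrezID column was filtered out
```

The EntrezID column matches neither alternative of the pattern, so it is dropped before the table is transposed.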

Each dataset is stored as a Pandas DataFrame. The datasets imported for TCGA LUAD are listed below. In each pair, the first number is the number of samples and the second is the number of features.

PATIENTS (522, 5)
SAMPLES (1160, 6)
DRUGS (461, 4)
MessengerRNA (576, 20472)
SomaticMutation (587, 21070)
MicroRNA (494, 1870)
LncRNA (546, 12727)
Protein (364, 154)

Annotate LncRNAs with GENCODE genomic annotations

# Import GENCODE database (from URL)
from openomics.database import GENCODE

gencode = GENCODE(path="ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_32/",
                  file_resources={"long_noncoding_RNAs.gtf": "gencode.v32.long_noncoding_RNAs.gtf.gz",
                                  "basic.annotation.gtf": "gencode.v32.basic.annotation.gtf.gz",
                                  "lncRNA_transcripts.fa": "gencode.v32.lncRNA_transcripts.fa.gz",
                                  "transcripts.fa": "gencode.v32.transcripts.fa.gz"},
                  remove_version_num=True,
                  npartitions=5)

# Annotate LncRNAs with GENCODE by gene_id
luad_data.LncRNA.annotate_genomics(gencode, index="gene_id",
                                   columns=['feature', 'start', 'end', 'strand', 'tag', 'havana_gene'])

luad_data.LncRNA.annotations.info()
<class 'pandas.core.frame.DataFrame'>
Index: 13729 entries, ENSG00000082929 to ENSG00000284600
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   feature      13729 non-null  object
 1   start        13729 non-null  object
 2   end          13729 non-null  object
 3   strand       13729 non-null  object
 4   tag          13729 non-null  object
 5   havana_gene  13729 non-null  object
dtypes: object(6)
memory usage: 1.4+ MB
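
The annotation step is essentially a join on gene_id. The sketch below shows the idea in plain Python, assuming that remove_version_num=True strips the trailing version suffix from Ensembl-style IDs so that they match the unversioned index shown above. The helper strip_version, the version suffixes, the strand values, and the unmatched ID are hypothetical.

```python
import re

def strip_version(gene_id):
    """Drop a trailing '.<digits>' version suffix from an Ensembl-style ID."""
    return re.sub(r"\.\d+$", "", gene_id)

# Hypothetical annotation records keyed by versioned gene IDs.
annotation_records = {
    "ENSG00000082929.8": {"feature": "gene", "strand": "+"},
    "ENSG00000284600.1": {"feature": "gene", "strand": "-"},
}

# Re-key the annotations by unversioned ID so they line up with the expression index.
annotations = {strip_version(k): v for k, v in annotation_records.items()}

expression_gene_ids = ["ENSG00000082929", "ENSG00000284600", "ENSG00000000000"]

# Left join: every expression gene keeps an entry; unmatched genes get None.
annotated = {gene_id: annotations.get(gene_id) for gene_id in expression_gene_ids}
print(annotated)
```

Without the version stripping, none of the keys would match and every gene would end up unannotated.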

Each multi-omics or clinical dataset can be accessed through luad_data.data[], for example:

luad_data.data["PATIENTS"]
                     bcr_patient_barcode  gender  race  histologic_subtype                                 pathologic_stage
bcr_patient_barcode
TCGA-05-4244         TCGA-05-4244         MALE    NaN   Lung Adenocarcinoma- Not Otherwise Specified (...  Stage IV
TCGA-05-4245         TCGA-05-4245         MALE    NaN   Lung Adenocarcinoma- Not Otherwise Specified (...  Stage III
TCGA-05-4249         TCGA-05-4249         MALE    NaN   Lung Adenocarcinoma- Not Otherwise Specified (...  Stage I
TCGA-05-4250         TCGA-05-4250         FEMALE  NaN   Lung Adenocarcinoma- Not Otherwise Specified (...  Stage III
TCGA-05-4382         TCGA-05-4382         MALE    NaN   Lung Adenocarcinoma Mixed Subtype                  Stage I

522 rows × 5 columns

luad_data.data["MessengerRNA"]
gene_name A1BG A1BG-AS1 A1CF A2M A2ML1 A4GALT A4GNT AAAS AACS AACSP1 ... ZXDA ZXDB ZXDC ZYG11A ZYG11B ZYX ZZEF1 ZZZ3 psiTPTE22 tAKR
TCGA-05-4244-01A 4.756500 5.239211 0.000000 13.265291 0.431997 7.043317 1.033652 9.348765 9.652057 0.763921 ... 5.350285 8.197321 9.907260 0.763921 10.088859 11.471139 9.768648 9.170597 2.932118 0.000000
TCGA-05-4249-01A 6.920471 7.056843 0.402722 14.650247 1.383939 9.178805 0.717123 9.241537 9.967223 0.000000 ... 5.980428 8.950001 10.204971 4.411650 9.622978 11.199826 10.153700 9.433116 7.499637 0.000000
TCGA-05-4250-01A 5.696542 6.136327 0.000000 14.048541 0.000000 8.481646 0.996244 9.203535 9.560412 0.733962 ... 5.931168 8.517334 9.722642 4.782796 8.895339 12.408981 10.194168 9.060342 2.867956 0.000000
TCGA-05-4382-01A 7.198727 6.809804 0.000000 14.509730 2.532591 9.117559 1.657045 9.251035 10.078124 1.860883 ... 5.373036 8.441914 9.888267 6.041142 9.828389 12.725186 10.192589 9.376841 5.177029 0.000000

576 rows × 20472 columns

To match samples across different multi-omics datasets, use:

luad_data.match_samples(modalities=["MicroRNA", "MessengerRNA"])
Index(['TCGA-05-4384-01A', 'TCGA-05-4390-01A', 'TCGA-05-4396-01A',
       'TCGA-05-4405-01A', 'TCGA-05-4410-01A', 'TCGA-05-4415-01A',
       'TCGA-05-4417-01A', 'TCGA-05-4424-01A', 'TCGA-05-4425-01A',
       'TCGA-05-4427-01A',
       ...
       'TCGA-NJ-A4YG-01A', 'TCGA-NJ-A4YI-01A', 'TCGA-NJ-A4YP-01A',
       'TCGA-NJ-A4YQ-01A', 'TCGA-NJ-A55A-01A', 'TCGA-NJ-A55O-01A',
       'TCGA-NJ-A55R-01A', 'TCGA-NJ-A7XG-01A', 'TCGA-O1-A52J-01A',
       'TCGA-S2-AA1A-01A'],
      dtype='object', length=465)
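
Conceptually, matching samples amounts to intersecting the sample indexes of the requested modalities. The following is a minimal plain-Python sketch of that idea, not the OpenOmics implementation; the standalone match_samples function and the short sample lists are illustrative assumptions.

```python
def match_samples(sample_ids_by_modality, modalities):
    """Return the sample IDs present in every requested modality,
    in the order they appear in the first requested modality."""
    first, *rest = (sample_ids_by_modality[m] for m in modalities)
    shared = set(first).intersection(*map(set, rest))
    return [sample for sample in first if sample in shared]

# Hypothetical subsets of sample barcodes for two modalities.
samples = {
    "MessengerRNA": ["TCGA-05-4244-01A", "TCGA-05-4384-01A", "TCGA-05-4390-01A"],
    "MicroRNA": ["TCGA-05-4390-01A", "TCGA-05-4384-01A", "TCGA-05-4396-01A"],
}

matched = match_samples(samples, modalities=["MicroRNA", "MessengerRNA"])
print(matched)
```

Only barcodes measured in both modalities survive, which is why the matched index above (465 samples) is smaller than either input table.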

To prepare the data for classification

# This function selects only patients with pathologic stages "Stage I" and "Stage II"
X_multiomics, y = luad_data.load_dataframe(modalities=["MessengerRNA", "MicroRNA", "LncRNA"], target=['pathologic_stage'],
                                     pathologic_stages=['Stage I', 'Stage II'])
print(X_multiomics['MessengerRNA'].shape, X_multiomics['MicroRNA'].shape, X_multiomics['LncRNA'].shape, y.shape)
(336, 20472) (336, 1870) (336, 12727) (336, 1)
y
pathologic_stage
TCGA-05-4390-01A Stage I
TCGA-05-4405-01A Stage I
TCGA-05-4410-01A Stage I
TCGA-05-4417-01A Stage I
TCGA-05-4424-01A Stage II
TCGA-05-4427-01A Stage II
TCGA-05-4433-01A Stage I
TCGA-05-5423-01A Stage II
TCGA-05-5425-01A Stage II
TCGA-05-5428-01A Stage II
TCGA-05-5715-01A Stage I
TCGA-38-4631-01A Stage I
TCGA-38-7271-01A Stage I
TCGA-38-A44F-01A Stage I
TCGA-44-2655-11A Stage I

336 rows × 1 columns
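
The selection performed here can be pictured as a lookup from each sample barcode to its patient's clinical record, followed by a filter on the stage. The sketch below is a plain-Python approximation under that assumption. The helpers patient_of and select_by_stage are hypothetical, and the barcode TCGA-05-9999-01A is an invented sample with no clinical record.

```python
def patient_of(sample_barcode):
    """'TCGA-05-4390-01A' -> 'TCGA-05-4390' (first three barcode fields)."""
    return "-".join(sample_barcode.split("-")[:3])

def select_by_stage(sample_barcodes, stage_by_patient, allowed_stages):
    """Keep samples whose patient has an allowed stage; return (samples, labels)."""
    kept, labels = [], []
    for sample in sample_barcodes:
        stage = stage_by_patient.get(patient_of(sample))
        if stage in allowed_stages:
            kept.append(sample)
            labels.append(stage)
    return kept, labels

# Small clinical lookup using stages shown in the tables above.
stage_by_patient = {
    "TCGA-05-4244": "Stage IV",
    "TCGA-05-4390": "Stage I",
    "TCGA-05-4424": "Stage II",
}
samples = ["TCGA-05-4244-01A", "TCGA-05-4390-01A", "TCGA-05-4424-01A", "TCGA-05-9999-01A"]

kept, y_labels = select_by_stage(samples, stage_by_patient, allowed_stages={"Stage I", "Stage II"})
print(kept, y_labels)
```

Samples from other stages, and samples without a clinical record, are dropped, so the feature tables and the label vector stay aligned row for row.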

Log2 transform the mRNA, microRNA, and lncRNA expression values

import numpy as np

def expression_val_transform(x):
    # Adding 1 keeps zero counts finite: log2(0 + 1) = 0
    return np.log2(x+1)
X_multiomics['MessengerRNA'] = X_multiomics['MessengerRNA'].applymap(expression_val_transform)
X_multiomics['MicroRNA'] = X_multiomics['MicroRNA'].applymap(expression_val_transform)
# X_multiomics['LncRNA'] = X_multiomics['LncRNA'].applymap(expression_val_transform)
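
To see what the log2(x + 1) transform does to individual values, here is a small worked example using the standard library's math module in place of NumPy. The input values are arbitrary.

```python
import math

def expression_val_transform(x):
    # The +1 offset maps a raw value of 0 to 0 instead of -infinity.
    return math.log2(x + 1)

transformed = {raw: expression_val_transform(raw) for raw in [0, 1, 3, 1023]}
for raw, value in transformed.items():
    print(raw, "->", value)
# 0 -> 0.0
# 1 -> 1.0
# 3 -> 2.0
# 1023 -> 10.0
```

The transform compresses large expression values while leaving zeros at zero, which keeps highly expressed genes from dominating downstream models.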

Classification of Cancer Stage

from sklearn import preprocessing
from sklearn import metrics
from sklearn.svm import SVC, LinearSVC
import sklearn.linear_model
from sklearn.model_selection import train_test_split
binarizer = preprocessing.LabelEncoder()
binarizer.fit(y)
binarizer.transform(y)
array([0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1,
       0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1,
       0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0,
       0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0,
       0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0,
       1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1,
       1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1,
       1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0,
       0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0,
       1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0])
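
The encoding above maps each distinct label to an integer, with classes taken in sorted order, so "Stage I" becomes 0 and "Stage II" becomes 1. A minimal plain-Python sketch of the same mapping (the helper fit_label_encoder is hypothetical, and only the first seven labels are reproduced):

```python
def fit_label_encoder(labels):
    """Map each distinct label to an integer code, in sorted label order."""
    classes = sorted(set(labels))
    return {label: code for code, label in enumerate(classes)}

# First seven labels of y, as listed earlier.
y_labels = ["Stage I", "Stage I", "Stage I", "Stage I", "Stage II", "Stage II", "Stage I"]

encoding = fit_label_encoder(y_labels)
encoded = [encoding[label] for label in y_labels]
print(encoding)  # {'Stage I': 0, 'Stage II': 1}
print(encoded)   # [0, 0, 0, 0, 1, 1, 0]
```
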
for omic in ["MessengerRNA", "MicroRNA"]:
    print(omic)
    scaler = sklearn.preprocessing.StandardScaler(copy=True, with_mean=True, with_std=False)
    scaler.fit(X_multiomics[omic])

    X_train, X_test, Y_train, Y_test = \
        train_test_split(X_multiomics[omic], y, test_size=0.3, random_state=np.random.randint(0, 10000), stratify=y)
    print(X_train.shape, X_test.shape)


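    # Apply the same centering to both splits so the test data matches the training data
    X_train = scaler.transform(X_train)
    X_test = scaler.transform(X_test)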

    model = LinearSVC(C=1e-2, penalty='l1', class_weight='balanced', dual=False, multi_class="ovr")
#     model = sklearn.linear_model.LogisticRegression(C=1e-0, penalty='l1', fit_intercept=False, class_weight="balanced")
#     model = SVC(C=1e0, kernel="rbf", class_weight="balanced", decision_function_shape="ovo")

    model.fit(X=X_train, y=Y_train)
    print("NONZERO", len(np.nonzero(model.coef_)[0]))
    print("Training accuracy", metrics.accuracy_score(model.predict(X_train), Y_train))
    print(metrics.classification_report(y_pred=model.predict(X_test), y_true=Y_test))
MessengerRNA
(254, 20472) (109, 20472)
NONZERO 0
Training accuracy 0.6929133858267716
             precision    recall  f1-score   support

    Stage I       0.69      1.00      0.82        75
   Stage II       0.00      0.00      0.00        34

avg / total       0.47      0.69      0.56       109

MicroRNA
(254, 1870) (109, 1870)
NONZERO 0
Training accuracy 0.6929133858267716
             precision    recall  f1-score   support

    Stage I       0.69      1.00      0.82        75
   Stage II       0.00      0.00      0.00        34

avg / total       0.47      0.69      0.56       109
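
Both reports are what a classifier that always predicts "Stage I" would produce. "NONZERO 0" means every coefficient is zero, so the decision function is constant, and the recall of 1.00 for Stage I and 0.00 for Stage II confirms that only one label is ever predicted. The sketch below recomputes the reported numbers from the support counts alone. The variable names are illustrative.

```python
support = {"Stage I": 75, "Stage II": 34}   # test-set counts from the report
predicted = "Stage I"                       # the only label the model outputs
total = sum(support.values())

def scores(label):
    """Precision, recall, and F1 for one class under a constant prediction."""
    tp = support[label] if label == predicted else 0
    n_predicted = total if label == predicted else 0
    precision = tp / n_predicted if n_predicted else 0.0
    recall = tp / support[label]
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

per_class = {label: scores(label) for label in support}

# Support-weighted averages, as in the "avg / total" row.
weighted = [
    sum(per_class[label][i] * support[label] for label in support) / total
    for i in range(3)
]
accuracy = support[predicted] / total

print([round(v, 2) for v in per_class["Stage I"]])   # [0.69, 1.0, 0.82]
print([round(v, 2) for v in weighted])               # [0.47, 0.69, 0.56]
print(round(accuracy, 2))                            # 0.69
```

The accuracy of 0.69 is therefore just the share of Stage I samples (75 of 109), not evidence that the model has learned anything. A weaker penalty (a larger C) would be one way to let some coefficients become non-zero.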

Credits

This package was created with Cookiecutter (https://github.com/audreyr/cookiecutter) and the pyOpenSci/cookiecutter-pyopensci (https://github.com/pyOpenSci/cookiecutter-pyopensci) project template, based on audreyr/cookiecutter-pypackage (https://github.com/audreyr/cookiecutter-pypackage).

License: MIT License

