MintaYLu / COT

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Readme

Description

COT is a Python module for machine learning built on top of NumPy and Pandas, and is distributed under the MIT license. Cosine based One-sample Test (COT) is an accurate and efficient method to detect Marker Genes (MG) among many subtypes using subtype-enriched expression profiles. Basically, COT uses the cosine similarity between a molecule’s cross-subtype expression pattern and the exact mathematical definition of MG as the test statistic, and formulates the detection problem as a one-sample test. Under the assumption that a significant majority of genes are associated with the null hypothesis, COT approximates the empirical null distribution for calculating p-values. The project was developed and maintained by Virginia Tech CBIL Group.

Installation

Dependencies

COT requires:

  • python (>= 3.7.4)
  • numpy (>= 1.19.5)
  • scipy (>= 1.3.1)
  • pandas (>= 1.2.0)
  • statsmodels (>= 0.10.1)
  • scikit-learn (>= 0.21.3)
  • seaborn (>= 0.11.1)
  • matplotlib (>= 3.1.1)

Installation

To install from Github, run:

pip install git+https://github.com/MintaYLu/COT.git

To install from a local copy, please go to the main package folder and run:

python setup.py install

Example:

Perform COT calculation step by step:

1. Import the COT package
from COT.COT import COT

2. Create COT class instance and load the raw data
cot = COT(df_raw=df_raw, normalization=False)

input: df_raw

Note that the input of COT should be batch-corrected.

gene S1 S2 S3 S4
0 0.5 0.7 0.7 0.9
1 1.0 1.0 0.0 0.0

output: cot.df_raw

gene S1 S2 S3 S4
0 0.5 0.7 0.7 0.9
1 1.0 1.0 0.0 0.0
3. Generate the subtype mean values
cot.generate_subtype_means(subtype_label=subtype_label)

input: subtype_label = ["A", "A", "B", "B"]

output: cot.subtypes {"A": ["S1", "S2"], "B": ["S3", "S4"]}

cot.df_mean

gene A B
0 0.6 0.8
1 1.0 0.0
4. Generate the cosine values
cot.generate_cos_values()

output: cot.df_cos

gene cos subtype
0 0.8 B
1 1.0 A
5. Estimate the p-values
cot.estimate_p_values()

Attention: too few genes may not work for predicting p-values. Please remove NaN before this step. output: cot.df_cos

gene cos subtype p.value q.value
1 1.0 A ? ?
0 0.8 B ? ?

Cannot calculate p-values due to the limited genes numbers, please see the Example_GSE28490 for p-values computation.

6. Obtain the subtype markers
cot.obtain_subtype_markers()

output: cot.markers = {"A": [1], "B": [0]}

7. Plot the simplex
cot. plot_simplex()

8. Plot the heatmap
cot.plot_heatmap()

Or use the pipeline:

from COT.COT import COT

cot = COT(df_raw=df_raw, normalization=False)
cot.cos_pipeline(subtype_label=subtype_label, top=2)

Then we will obtain the same output with a single step.

License

This project is licensed under the MIT License - see the LICENSE.txt file for details

Citing

If you have used this tool please cite:

Lu, Y., C.-T. Wu, S. J. Parker, L. Chen, G. Saylor, J. E. Van Eyk, D. M. Herrington and Y. Wang (2021). "COT: an efficient Python tool for detecting marker genes among many subtypes." bioRxiv, 2021.01.10.426146

https://doi.org/10.1101/2021.01.10.426146

About

License:MIT License


Languages

Language:Jupyter Notebook 99.9%Language:Python 0.1%