peterk87/TreeOrdination

TreeOrdination

Implementation of a wrapper which creates unsupervised projections using LANDMark and UMAP.

Input Data

To create an instance of the TreeOrdination model you will need a NumPy array of feature names.

For fitting the TreeOrdination you will need the following inputs:
    1) A NumPy array, X, where rows are samples and columns are features.
    2) A NumPy array, y, of target values (class labels for classification)

Future Work

In the future we hope to add the following feature to TreeOrdination:
    1) A simple interface to apply normalization and standardization procedures to the dataset.
    2) A simple interface to apply useful transformations to the dataset.
    3) Expand the ability to create balanced samples in cases where there is considerable class imbalance

Install

The LANDMark package is needed for TreeOrdination to work. It is available at: https://github.com/jrudar/LANDMark

Once downloaded, go to the TreeOrdination directory and type: python setup.py sdist Switch into the dist directory and type pip install TreeOrdination-a.b.c.tar.gz where a, b, and c are the version numbers of the package.

Class Parameters

The current hyper-parameters are available for tuning.

feature_names: list-like, required
    A list of feature names.

resample_data: bool, default = False
    Specifies if data will be re-sampled.
    
resample_class: str, default = None
    Specifies the class which will be down-sampled.
    
n_resamples: int, default = None
    Specifies how many samples (without replacement) will be
    taken.

metric: str, default = "hamming"
    The metric used by UMAP to calculate the dissimilarity between 
    LANDMark embeddings.
    
supervised_clf: default = ExtraTreesClassifier(1024)
    The classification model used to predict the class of each sample
    using the unsupervised projections.
    
n_iter_unsup: int, default = 5
    The number of LANDMark embeddings which will be used to construct
    the final embedding.
    
unsup_n_estim: int, default = 160
    The number of decision trees in each LANDMark classifier.

max_samples_tree: int, default = 100
    Specifies how many samples will be used to train each LANDMark tree.

n_jobs: int, default = 4
    The number of processes used by LANDMark to train each classifier.
    
scale: bool, default = False
    Specifies if each row in X should be divided by its sum.
    
clr_trf: bool, default = False
    Specifies if the data should be center log-ratio transformed.
    
rclr_trf: bool, default = False
    Specifies if the data should be transformed using the robust centered
    log-ratio transformation.
    
exclude_col: list-like, default = [False, [0]]
    Specifies which columns should be excluded for scaling and/or
    transformation. If the first entry in the list is true the columns
    specified by the second entry will be excluded from scaling.
    
n_neighbors: int, default = 8
    The 'n_neighbors' parameter of UMAP. A larger value will capture
    more of the global structure of the data while smaller values will
    focus more on the local structure of the data. Larger datasets will
    likely need a larger value for this parameter.
 
n_components: int, default = 2
    The number of components of the final unsupervised projection.
 
min_dist: float, default = 0.001
    The 'min_dist' parameter of UMAP.

Fit Parameters

    X: NumPy array of shape (m, n) where 'm' is the number of samples and 'n'
    the number of features (features, taxa, OTUs, ASVs, etc).

    y: NumPy array of shape (m,) where 'm' is the number of samples. Each entry
    of 'y' should be a factor.

Example Usage

    from TreeOrdination import TreeOrdination
    from sklearn.datasets import make_classification
    
    #Create the dataset
    X, y = make_classification(n_samples = 200, n_informative = 20)
    
    #Give features a name
    f_names = ["Feature %s" %str(i) for i in range(X.shape[0])]
    
    tree_ord = TreeOrdination(feature_names = f_names).fit(X, y)

    #This is the LANDMark embedding of the dataset. This dataset is used to train the supervised model ('supervised_clf' parameter)
    landmark_embedding = tree_ord.R_final
    
    #This is the UMAP projection of the LANDMark embedding
    umap_projection = tree_ord.tree_emb
    
    #This is the PCA projetion of the UMAP embedding
    pca_projection = tree_ord.R_PCA_emb

References

Rudar, J., Porter, T.M., Wright, M., Golding G.B., Hajibabaei, M. LANDMark: an ensemble 
approach to the supervised selection of biomarkers in high-throughput sequencing data. 
BMC Bioinformatics 23, 110 (2022). https://doi.org/10.1186/s12859-022-04631-z

Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: 
Machine Learning in Python. Journal of Machine Learning Research. 2011;12:2825–30. 

Geurts P, Ernst D, Wehenkel L. Extremely Randomized Trees. Machine Learning. 2006;63(1):3–42.

Rudar, J., Golding, G.B., Kremer, S.C., Hajibabaei, M. (2023). Decision Tree Ensembles Utilizing 
Multivariate Splits Are Effective at Investigating Beta Diversity in Medically Relevant 16S Amplicon 
Sequencing Data. Microbiology Spectrum e02065-22.

peterk87 / TreeOrdination