johnchase / q2-phylofactor

General goals for PhyloFactor plugin?

tanaes opened this issue

Thought it might be useful to organize effort around specific goals for the phylofactor plugin. After playing around with tutorials and test data, here's what I came up with as a starting point:

Functionality:

  • General
    • export table of per-factor ILR abundances
    • take factorization of one dataset and apply to features of another
      dataset (CrossValidateMap)
    • handle custom objective functions (small R scripts?)
  • Visualization
    • Plot phylogeny with factors highlighted
    • Plot factored heatmap vs phylogeny
    • Plot individual factor abundances against model
    • Produce data summaries including diagnostic taxonomic labels per factor
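For the "export table of per-factor ILR abundances" item, here is a minimal Python sketch of what such a table could contain. It assumes each factor is a split of features into two disjoint groups and uses the standard ILR balance, sqrt(r*s/(r+s)) times the difference in mean log abundance between the groups. The function names, the dict-based data layout, and the pseudocount default are illustrative assumptions, not part of the plugin or of phylofactor.

```python
import math


def ilr_balance(counts, group1, group2, pseudocount=1.0):
    """ILR balance between two disjoint feature groups for one sample.

    counts: dict mapping feature id -> abundance (hypothetical layout).
    pseudocount: added before taking logs so zeros are defined (assumption).
    """
    r, s = len(group1), len(group2)
    if r == 0 or s == 0:
        raise ValueError("both groups must be non-empty")
    mean_log1 = sum(math.log(counts[f] + pseudocount) for f in group1) / r
    mean_log2 = sum(math.log(counts[f] + pseudocount) for f in group2) / s
    return math.sqrt(r * s / (r + s)) * (mean_log1 - mean_log2)


def ilr_table(samples, factors, pseudocount=1.0):
    """Build {factor name: {sample id: balance}} for every factor and sample.

    samples: dict mapping sample id -> counts dict.
    factors: dict mapping factor name -> (group1, group2) feature lists.
    """
    return {
        name: {
            sample_id: ilr_balance(counts, g1, g2, pseudocount)
            for sample_id, counts in samples.items()
        }
        for name, (g1, g2) in factors.items()
    }
```

A table like this is one row per factor and one column per sample, which would be straightforward to write out as a TSV on the Python side.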

I think these bullet points are a good summary of the main functions.

The first two are currently implemented, and the third can probably be done with the help of a skeleton of an R script such as that for phylofactor::GAM.

The visualization will be fun to figure out - ultimately I imagine we'll just export a .png or .tiff? The pf.tree object can be constructed by importing the tree and groups. The heatmap likewise just needs the tree, groups, and data for the heatmap.

The data summaries can be done with the function summary(pf,taxonomy,factor). The "signal.table" element of this is key - it contains a taxonomic breakdown and measure of relative importance of taxa in each group. It can take optional input "taxon.trimming='species'" for species-level breakdown, or "taxon.trimming='taxon'" for a particular taxonomic level (for this case, the taxonomy needs to be trimmed to the appropriate level, which is easily done in R).

Jon - by "plot individual factor abundances against model", do you mean a plot of the observed vs. predicted ILR abundances, or the observed/predicted heatmaps? Both are doable.

One possibility for the figures is to swipe what @mortonjt has put together for the Gneiss plugin, e.g. the tree-heatmap. That would be relatively easy to import and plot on the Python side, using the exported data from the PhyloFactor object.

For "individual factor abundances against model" I mean e.g. the inset plots in your tutorial -- something that can really illustrate how particular identified factors are changing relative to the correlated metadata of interest. (I think it might also be useful to have an un-transformed representation of the raw data for these factors too.)

For the data summarization, what do we think is the optimal way forward? Replicating the functionality in Python? Or augmenting the R script called by _phylofactor.py to also produce the data tables necessary for the qzv visualization object?

Let's talk with Jamie about the tree-heatmap. If it's easy to visualize clades, then it's probably easiest to find MRCAs and highlight them with the tree-heatmap functionality already in Python.
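As a rough illustration of the "find MRCAs" step, here is a small pure-Python sketch. It assumes the tree is available as a child-to-parent mapping (a hypothetical representation chosen for this example; a real tree library would have its own lowest-common-ancestor call) and returns the deepest node shared by all tips in a factor's group.

```python
def path_to_root(parent, node):
    """Return the list of nodes from `node` up to the root, inclusive."""
    path = [node]
    while parent.get(node) is not None:
        node = parent[node]
        path.append(node)
    return path


def mrca(parent, tips):
    """Most recent common ancestor of `tips` in a child -> parent mapping.

    A node counts as its own ancestor, so the MRCA of a single tip is that tip.
    """
    tips = list(tips)
    if not tips:
        raise ValueError("need at least one tip")
    # Nodes that lie on every tip's path to the root.
    common = set(path_to_root(parent, tips[0]))
    for tip in tips[1:]:
        common &= set(path_to_root(parent, tip))
    # Walking upward from the first tip, the first shared node is the deepest.
    for node in path_to_root(parent, tips[0]):
        if node in common:
            return node
    raise ValueError("tips do not share a common ancestor")


# Toy tree: root -> (n1, D); n1 -> (n2, C); n2 -> (A, B)
example_parent = {
    "A": "n2", "B": "n2", "n2": "n1",
    "C": "n1", "n1": "root", "D": "root", "root": None,
}
```

With the MRCA of each factor's group in hand, highlighting a clade reduces to coloring that node's subtree in whatever tree plot is used.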

The "factor-abundances" plots, especially with a tree, are super useful and I agree it would be good to automate them. The two cruxes I found with automating this were: (1) deciding the x-axis (for multivariate regression) and (2) putting reasonable restrictions on the number of allowable sub-plots. If we want to make plots correspond to factors visualized on the tree (e.g. like the first figure in the tutorial), then one option is to output a pf.tree figure plus an additional figure for each factor they want to plot. We could allow a customized input vector that either specifies the x-axis (e.g. "pH" for soils) OR finds the x-axis that maximizes some objective function (e.g. most significant explanatory variable).
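To make the x-axis choice concrete, here is a hedged Python sketch. It assumes numeric metadata columns and uses the absolute Pearson correlation with a factor's ILR abundances as a stand-in objective; the thread only says "most significant explanatory variable", so the real criterion may well differ. The function and argument names are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column carries no linear signal
    return sxy / math.sqrt(sxx * syy)


def choose_x_axis(ilr, metadata, requested=None):
    """Pick the metadata column to plot a factor's ILR abundances against.

    ilr: ILR abundances for one factor, one value per sample.
    metadata: dict mapping column name -> numeric values in the same order.
    requested: if given (e.g. "pH"), use it instead of searching.
    """
    if requested is not None:
        if requested not in metadata:
            raise KeyError(requested)
        return requested
    # Stand-in objective (assumption): largest absolute correlation.
    return max(metadata, key=lambda name: abs(pearson_r(metadata[name], ilr)))
```

Letting a user-supplied column take precedence keeps the behavior predictable, with the search used only as a fallback.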

For data summary, there's no substitute for the summary.phylofactor function's measure of "signal" for each taxonomic group. The summary tables are all easily exported as data tables - the big crux will be from my end, adding an additional input that allows someone to parallelize, should they want many summaries. For more "point & clickability", it might be good for users to be able to input a vector of factors they want summarized. If they just want taxa, we can use pf.taxa, but if they want signal contributions from taxa I can parallelize (it gets intensive, especially for big data, since it re-calculates the objective function for each unique taxonomic label in each group).

In other words, for going forward I can make two scripts:

  1. A script that makes a pf.tree object and saves it, and then goes through a vector of factors, chooses the x-axis and makes & saves plots with colors corresponding to the colors in the pf.tree. Fitted values can optionally be added with standard-error ribbons. Alternatively, I can make one pf.tree and export the legend, allowing the Python ILR-factor table to do the rest. A final alternative: we can work with Jamie to use just the groups to make the tree visualization in Python, and then use the ILR tables to also make sub-plots in Python.

  2. A script that parallelizes the computation of summary(pf,taxonomy,factor) over a range of factors and exports the tables for use in Python.

Sure, that sounds good! Let's see what @johnchase has to say as well.

In my wildest dreams, you can mouseover each highlighted clade and get a popup of the ILR abundances for that clade regressed against each of your predictor variables. :D

Regarding the original post, I think this sounds good, and I'm really close to having a working interface. As @reptalex mentioned, the R scripts are in there; I just got sidetracked making a new semantic type for the factor groupings. I would love to get those first bullet points working and then get some users to test out the API. I've just been naming things as I go, and the names may not be the best. Here is a really small example if you all are interested: https://github.com/johnchase/q2-phylofactor/blob/master/example/phylofactor/phylofactor_walkthrough.ipynb

The output of the above function will go into a crossvmap function along with a new tree and data representing a new dataset, in order to apply the groupings. I should have that ready to go in a few days.

In terms of the plotting, I am less certain about the best way forward, and I think that the suggestions put forth here are good. It would be great to get with @mortonjt, as I imagine that there will be significant overlap between the visualizations here and those in gneiss. One question is where the plotting code should live. The visualizations can be just about anything, so we could write new visualizations in R included in phylofactor and then make use of them in the plugin, or we could write them in JavaScript or Python and include them here - the drawback is that they would not be available for phylofactor. But interactivity would be awesome!

In my wildest dreams, you can mouseover each highlighted clade and get a popup of the ILR abundances

I would have thought your wildest dreams were a bit wilder, but yeah that would be fun to put together