RNA-seq: Pathway Analysis - GSVA
cansavvy opened this issue
What are the goals of this new example analysis?
We have a GSVA example for microarray, but should create one for RNA-seq.
What kind of dataset will this need?
We need an RNA-seq dataset that we can normalize before running GSVA
What steps should be included in this analysis?
I think most of the GSVA steps from the microarray example can stay the same, with these exceptions that I can see so far:
- I think we'll need to use DESeq2 and vst() normalized data.
- The "Handling duplicate gene identifiers" strategy may need to be different (currently it picks the max value for each sample).
- It's unclear to me if/which parameters for the gsva() run in "Perform GSVA" should be changed. Something to look into.
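For reference, the current "pick the max value for each sample" strategy can be sketched as below. This is a language-neutral illustration in plain Python, not the notebook's R code; the function name `collapse_max_per_sample`, the gene IDs, and the toy values are all assumptions made up for the example.

```python
from collections import defaultdict


def collapse_max_per_sample(rows):
    """Collapse rows that share a gene ID by taking the per-sample maximum.

    rows: list of (gene_id, [one value per sample]) tuples.
    Returns a dict mapping gene_id -> list of per-sample maxima.
    """
    grouped = defaultdict(list)
    for gene_id, values in rows:
        grouped[gene_id].append(values)
    # zip(*value_rows) walks the samples (columns) across duplicate rows
    return {
        gene_id: [max(column) for column in zip(*value_rows)]
        for gene_id, value_rows in grouped.items()
    }


# Hypothetical toy data: gene "1017" appears twice, gene "7157" once
expression = [
    ("1017", [5.0, 2.0, 7.0]),
    ("1017", [4.0, 6.0, 1.0]),
    ("7157", [3.0, 3.5, 2.5]),
]
collapsed = collapse_max_per_sample(expression)
```

Note that the collapsed row for a duplicated identifier can combine values that come from different source rows, one sample at a time.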
Between data types, the main parameter you might want to change is the kcdf argument, but if using transformed RNA-seq data it should be the same, if I recall correctly.
When this issue is addressed, note that the intro paragraph from #349 will need to be added here, and the table will need to be made to reflect the RNA-seq versions of the analyses.
I'm going to try out SRP140558 for this one and see how it goes.
For handling the duplicate identifiers for GSVA, I'm not sure mixing values from different Ensembl IDs makes as much sense for RNA-seq as it did for microarray.
Should I instead switch to something where we pick one Ensembl ID's values over the other(s) -- take them as a set? (Based on bigger average or bigger variance?)
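A minimal sketch of that alternative, keeping one Ensembl ID's values as a whole set per Entrez ID, might look like the following. It is plain Python for illustration only, not the notebook's R implementation; `select_row_per_gene`, the placeholder IDs `ENSG_A`/`ENSG_B`, and the values are hypothetical.

```python
from statistics import mean, pvariance


def select_row_per_gene(rows, score=mean):
    """Keep exactly one source row per Entrez ID, chosen by a row-level score.

    rows: list of (entrez_id, ensembl_id, [one value per sample]) tuples.
    score: function applied to a row's values, e.g. mean or pvariance.
    Returns a dict mapping entrez_id -> (ensembl_id, values) of the winning row.
    """
    best = {}
    for entrez_id, ensembl_id, values in rows:
        row_score = score(values)
        if entrez_id not in best or row_score > best[entrez_id][0]:
            best[entrez_id] = (row_score, ensembl_id, values)
    return {
        entrez_id: (ensembl_id, values)
        for entrez_id, (_, ensembl_id, values) in best.items()
    }


# Hypothetical toy data: two Ensembl IDs map to the same Entrez ID "1017"
rows = [
    ("1017", "ENSG_A", [5.0, 2.0, 7.0]),
    ("1017", "ENSG_B", [3.0, 9.0, 1.0]),
]
by_mean = select_row_per_gene(rows, score=mean)
by_variance = select_row_per_gene(rows, score=pvariance)
```

In this toy data the two criteria disagree: the larger average keeps `ENSG_A`, while the larger variance keeps `ENSG_B`, so the choice of criterion matters.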
@cansavvy I would take a look at what was originally in #352 and what is in this comment #352 (comment)
My guess, since you referenced that comment, is yes: we still think this makes sense since it's still on a per-sample basis?
A few things we discussed over video chat that I'm going to change in the draft. These things should help shorten the notebook (which is currently ~800 lines).
I've taken the general outline from my draft PR and made the following edits to it:
EDITED:
- Set up data
- Filter out low counts
- DESeq2 normalize and transform (use vst)
- Still Hallmark pathways only
- Do gene ID convert to Entrez IDs
- Resolve multi mapped Entrez IDs: select an Ensembl ID by max average (replaces: same way as before)
- Use normalized data for GSVA
- Differential expression with limma (dropped)
- Move metadata label cleaning to down here!
- Heatmap of pathways, without any DE (replaces: Make sina plot of most DE pathway)
- Save plot, session info
Hallmark pathways only: Use all of them!
I disagree with this - there are only 50 hallmark gene sets so you can put them all in a heatmap.
I'm going to close this issue. Any changes to this example will come about as part of #371.