RNA-seq: Pathway Analysis - GSVA

Question

cansavvy opened this issue 4 years ago

Candace Savonen commented 4 years ago

What are the goals of this new example analysis?

We have a GSVA example for microarray, but should create one for RNA-seq.

What kind of dataset will this need?

We need an RNA-seq dataset that we can normalize before running GSVA.

What steps should be included in this analysis?

I think most of the GSVA steps from the microarray example can stay the same, with these exceptions that I can see so far:

I think we'll need to use DESeq2 and vst() normalized data.
The strategy in "Handling duplicate gene identifiers" may need to be different (currently it picks the max value for each sample).
It's unclear to me whether any parameters for the gsva() run in "Perform GSVA" should be changed, and if so, which ones -- something to look into.
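
For reference, the current "pick the max value for each sample" strategy can be sketched as follows. This is a minimal, language-agnostic illustration in plain Python, not the notebook's actual R code. The function name `max_per_sample` and the toy identifiers and values are assumptions made up for the example.

```python
from collections import defaultdict

def max_per_sample(rows):
    """Collapse duplicate gene IDs by taking the maximum value in each sample.

    rows: list of (gene_id, [one value per sample]) tuples (hypothetical layout).
    Returns a dict mapping gene_id -> list of per-sample maxima.
    """
    grouped = defaultdict(list)
    for gene_id, values in rows:
        grouped[gene_id].append(values)
    # zip(*duplicates) walks the duplicate rows one sample (column) at a time
    return {
        gene_id: [max(column) for column in zip(*duplicates)]
        for gene_id, duplicates in grouped.items()
    }

# Toy data: gene "1017" appears twice, gene "7157" once (made-up values)
rows = [
    ("1017", [5.0, 2.0, 7.0]),
    ("1017", [3.0, 6.0, 1.0]),
    ("7157", [4.0, 4.5, 5.0]),
]
collapsed = max_per_sample(rows)
```

Note that the collapsed row for "1017" mixes values from both original rows: samples 1 and 3 come from the first row and sample 2 from the second.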

Jaclyn Taroni · Answer 1 · Mon Nov 23 2020 23:09:13 GMT+0800 (China Standard Time)

Between data types, the main parameter you might want to change is the kcdf argument, but if you're using transformed RNA-seq data, it should be the same, if I recall correctly.
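
As a rough illustration of the kcdf point: the GSVA documentation recommends a Gaussian kernel for continuous values (log-scale microarray data or transformed RNA-seq data) and a Poisson kernel for integer counts. The sketch below is plain Python rather than R, and `suggest_kcdf` is a hypothetical helper, not part of GSVA. It only checks whether the values look like raw counts.

```python
def suggest_kcdf(values):
    """Suggest a value for the kcdf argument based on what the data look like.

    Hypothetical heuristic: non-negative whole numbers are treated as raw counts
    ("Poisson"); anything else is treated as continuous data ("Gaussian").
    """
    looks_like_counts = all(v >= 0 and float(v).is_integer() for v in values)
    return "Poisson" if looks_like_counts else "Gaussian"

raw_counts = [0, 12, 250, 31]             # made-up integer counts
transformed = [5.31, 7.92, 10.04, 6.58]   # made-up values on a vst-like scale

kcdf_for_counts = suggest_kcdf(raw_counts)
kcdf_for_transformed = suggest_kcdf(transformed)
```

Under this heuristic, vst()-transformed data gets the same "Gaussian" setting as the microarray example, which matches the comment above.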

Candace Savonen · Answer 2 · Mon Nov 23 2020 23:28:48 GMT+0800 (China Standard Time)

When this issue is addressed, note that the intro paragraph from #349 will need to be added here, and the table will need to be made to reflect the RNA-seq versions of the analyses.

Candace Savonen · Answer 3 · Wed Dec 02 2020 21:03:12 GMT+0800 (China Standard Time)

I'm going to try out SRP140558 for this one and see how it goes.

Candace Savonen · Answer 4 · Wed Dec 02 2020 21:14:20 GMT+0800 (China Standard Time)

For handling the duplicate identifiers for GSVA, I'm not sure mixing values from different Ensembl IDs makes as much sense for RNA-seq as it did for microarray.

Should I instead switch to something where we pick one Ensembl ID's values over the other(s) -- take them as a set? (Based on bigger average or bigger variance?)
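
One way to read the "pick one Ensembl ID's values as a set" option is sketched below in plain Python (the notebook itself is in R). The function name `pick_by_mean`, the placeholder Ensembl IDs, and the toy values are all assumptions for illustration. Swapping `mean` for `statistics.variance` would give the variance-based alternative.

```python
from statistics import mean

def pick_by_mean(rows):
    """For each Entrez ID, keep the whole row of the Ensembl ID with the highest mean.

    rows: list of (entrez_id, ensembl_id, [one value per sample]) tuples
    (hypothetical layout). Returns a dict mapping
    entrez_id -> (ensembl_id, values); values are never mixed across Ensembl IDs.
    """
    best = {}
    for entrez_id, ensembl_id, values in rows:
        current = best.get(entrez_id)
        if current is None or mean(values) > mean(current[1]):
            best[entrez_id] = (ensembl_id, values)
    return best

# Toy data: two placeholder Ensembl IDs map to the same Entrez ID "1017"
rows = [
    ("1017", "ENSG_A", [5.0, 2.0, 7.0]),  # mean is about 4.67
    ("1017", "ENSG_B", [3.0, 6.0, 1.0]),  # mean is about 3.33
    ("7157", "ENSG_C", [4.0, 4.5, 5.0]),
]
selected = pick_by_mean(rows)
```

Here "1017" keeps all of ENSG_A's values, including sample 2, where ENSG_B was higher.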

Jaclyn Taroni · Answer 5 · Wed Dec 02 2020 21:17:49 GMT+0800 (China Standard Time)

@cansavvy I would take a look at what was originally in #352 and what is in this comment #352 (comment)

Candace Savonen · Answer 6 · Wed Dec 02 2020 21:23:58 GMT+0800 (China Standard Time)

My guess, since you referenced that comment, is yes: we still agree this makes sense, since it's still on a per-sample basis?

Candace Savonen · Answer 7 · Thu Dec 03 2020 02:58:23 GMT+0800 (China Standard Time)

A few things we discussed over video chat that I'm going to change in the draft. These things should help shorten the notebook (which is currently ~800 lines).

I've taken the general outline from my draft PR and made the following edits to it:

EDITED:

Set up data
Filter out low counts
DESeq2 normalize and transform (use vst)
Still Hallmark pathways only
Convert gene IDs to Entrez IDs
Resolve multi-mapped Entrez IDs ~~same way as before~~ select one Ensembl ID by max average
Use normalized data for GSVA
~~Differential expression with limma~~
Move metadata label cleaning to down here!
~~Make sina plot of most DE pathway~~ Heatmap of pathways (without any DE)
Save plot, session info
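
The "Filter out low counts" step in the outline above can be sketched as follows. This is a plain-Python illustration, not the notebook's R code. The function name `filter_low_counts` and the cutoff of 10 total counts per gene are assumptions; the thread does not state a threshold.

```python
def filter_low_counts(counts, min_total=10):
    """Keep genes whose counts summed across all samples reach min_total.

    counts: dict mapping gene ID -> list of raw counts, one per sample
    (hypothetical layout). min_total=10 is an assumed cutoff.
    """
    return {
        gene_id: row
        for gene_id, row in counts.items()
        if sum(row) >= min_total
    }

# Toy data with placeholder gene IDs and made-up counts
counts = {
    "GENE_A": [0, 1, 2],      # total 3, dropped
    "GENE_B": [4, 3, 3],      # total 10, kept
    "GENE_C": [120, 98, 143], # total 361, kept
}
filtered = filter_low_counts(counts)
```

Filtering on raw counts happens before normalization, so the vst() step only sees genes with enough signal.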

Jaclyn Taroni · Answer 8 · Thu Dec 03 2020 03:01:30 GMT+0800 (China Standard Time)

~~Hallmark pathways only~~ Use all of them!

I disagree with this - there are only 50 hallmark gene sets so you can put them all in a heatmap.

Jaclyn Taroni · Answer 9 · Sun Dec 06 2020 07:05:04 GMT+0800 (China Standard Time)

I'm going to close this issue. Any changes to this example will come about as part of #371.