AlexsLemonade / refinebio-examples

Example workflows for refine.bio data

Home Page: https://www.refine.bio

RNA-seq: Pathway Analysis - GSVA

cansavvy opened this issue

What are the goals of this new example analysis?

We have a GSVA example for microarray, but should create one for RNA-seq.

What kind of dataset will this need?

We need an RNA-seq dataset that we can normalize before running GSVA.

What steps should be included in this analysis?

I think most of the GSVA steps from the microarray example can stay the same, with these exceptions that I can see so far:

  • I think we'll need to use DESeq2 and vst() normalized data.
  • The "Handling duplicate gene identifiers" strategy may need to be different (currently it's to pick the max value for each sample).
  • It's unclear to me if/which parameters for the gsva() run in "Perform GSVA" should be changed -- something to look into.
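As a point of reference for the duplicate-identifier discussion, here is a minimal sketch of the current "pick the max value for each sample" strategy. It is written in plain Python rather than the R used in the notebooks, and the function name, gene IDs, and values are all made up for illustration; it is not the code from the microarray example.

```python
from collections import defaultdict


def collapse_by_sample_max(rows):
    """Collapse duplicated gene IDs by taking, for each sample,
    the largest value among the rows that share an ID.

    rows: iterable of (gene_id, [one value per sample]).
    Returns a dict mapping gene_id -> [max value per sample].
    """
    grouped = defaultdict(list)
    for gene_id, values in rows:
        grouped[gene_id].append(values)
    # zip(*profiles) walks the duplicated rows one sample at a time
    return {
        gene_id: [max(sample_values) for sample_values in zip(*profiles)]
        for gene_id, profiles in grouped.items()
    }


# Hypothetical data: "entrez_1" appears twice after ID conversion
expression = [
    ("entrez_1", [5.0, 2.0, 9.0]),
    ("entrez_1", [3.0, 8.0, 1.0]),
    ("entrez_2", [4.0, 4.5, 5.0]),
]
collapsed = collapse_by_sample_max(expression)
print(collapsed["entrez_1"])  # [5.0, 8.0, 9.0]
```

Note that the collapsed profile for "entrez_1" mixes values from both original rows, so it does not correspond to any single measured identifier.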

Between data types, the main parameter you might want to change is the kcdf argument, but if you are using transformed RNA-seq data, it should be the same, if I recall correctly.

When this issue is addressed, note that the intro paragraph from #349 will need to be added here, and the table will need to be made to reflect the RNA-seq versions of the analyses.

I'm going to try out SRP140558 for this one and see how it goes.

For handling the duplicate identifiers in GSVA, I'm not sure that mixing values from different Ensembl IDs makes as much sense for RNA-seq as it did for microarray.

Should I instead switch to something where we pick one Ensembl ID's values over the other(s) -- take them as a set? (Based on bigger average or bigger variance?)
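To make the alternative concrete, here is a rough sketch of keeping one Ensembl ID's values as a whole set, chosen by the larger average or the larger variance. It is plain Python rather than the R used in the notebooks, and the function name, IDs, and values are hypothetical.

```python
from statistics import mean, variance


def pick_one_ensembl_per_entrez(rows, score=mean):
    """For each Entrez ID, keep the full row of the single Ensembl ID
    whose values score highest (by mean unless another score is given).

    rows: iterable of (entrez_id, ensembl_id, [one value per sample]).
    Returns a dict mapping entrez_id -> (ensembl_id, values).
    Ties keep the row that was seen first.
    """
    best = {}
    for entrez_id, ensembl_id, values in rows:
        current = best.get(entrez_id)
        if current is None or score(values) > score(current[1]):
            best[entrez_id] = (ensembl_id, values)
    return best


# Hypothetical data: two Ensembl IDs map to "entrez_1"
rows = [
    ("entrez_1", "ensembl_a", [5.0, 2.0, 9.0]),
    ("entrez_1", "ensembl_b", [3.0, 8.0, 1.0]),
    ("entrez_2", "ensembl_c", [4.0, 4.5, 5.0]),
]

by_mean = pick_one_ensembl_per_entrez(rows)
by_variance = pick_one_ensembl_per_entrez(rows, score=variance)
print(by_mean["entrez_1"][0])      # ensembl_a
print(by_variance["entrez_1"][0])  # ensembl_b
```

With these made-up values the two criteria disagree, which shows that the choice of average versus variance can change which Ensembl ID is kept.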

@cansavvy I would take a look at what was originally in #352 and at what is in this comment: #352 (comment)

My guess, since you referenced that comment, is yes, we still agree that this makes sense since it's still on a per-sample basis?

A few things we discussed over video chat that I'm going to change in the draft. These things should help shorten the notebook (which is currently ~800 lines).

I've taken the general outline from my draft PR and made the following edits to it:

EDITED:

  • Set up data
  • Filter out low counts
  • DESeq2 normalize and transform (use vst)
  • Still Hallmark pathways only
  • Convert gene IDs to Entrez IDs
  • Resolve multi-mapped Entrez IDs same way as before; select one Ensembl ID by max average
  • Use normalized data for GSVA
  • Differential expression with limma
  • Move metadata label cleaning down to here!
  • Make sina plot of most DE pathway; heatmap of pathways (without any DE)
  • Save plot, session info
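The "Filter out low counts" step in the outline above could look something like the following sketch. It is plain Python rather than the R used in the notebooks, and the cutoff of 10 total counts is an assumption for illustration only; the thread does not say which threshold the notebook uses.

```python
def filter_low_counts(counts, min_total=10):
    """Keep genes whose counts summed across all samples reach min_total.

    counts: dict mapping gene_id -> [raw count per sample].
    min_total: hypothetical cutoff, not taken from the notebook.
    """
    return {
        gene_id: sample_counts
        for gene_id, sample_counts in counts.items()
        if sum(sample_counts) >= min_total
    }


# Hypothetical raw counts for three genes across four samples
counts = {
    "gene_a": [0, 1, 0, 2],      # total 3, dropped
    "gene_b": [25, 40, 31, 18],  # total 114, kept
    "gene_c": [3, 4, 2, 1],      # total 10, kept (meets the cutoff)
}
kept = filter_low_counts(counts)
print(sorted(kept))  # ['gene_b', 'gene_c']
```

In the outline this filtering happens on raw counts, before the DESeq2 normalization and vst transformation.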

Hallmark pathways only -> Use all of them!

I disagree with this - there are only 50 hallmark gene sets so you can put them all in a heatmap.

I'm going to close this issue. Any changes to this example will come about as part of #371.