AlexsLemonade / refinebio-examples

Example workflows for refine.bio data

Home Page: https://www.refine.bio

New Analysis Example: RNA-seq Pathway analysis -- ORA

cansavvy opened this issue · comments

What are the goals of this new example analysis?

We already show how to run over-representation analysis (ORA) with microarray (limma) results: #206

It might be good to show how to do ORA with RNA-seq (DESeq2 output) -- but we should keep in mind that these will likely not differ too much. This will start out as an exploration of how different these examples might be, which will help inform the organizational discussion on #340.

If we find that the RNA-seq and microarray examples are going to be nearly identical, we may want to reconsider having separate technology examples (related to #223 and #175). This is something we can comment about on #340.

Alternatively, if we do want to maintain separate technology examples, we can use this ORA example to illustrate different aspects or strategies than the microarray example does. For instance, we could decide on a gene list using a different method and make different plots.

What kind of dataset will this need?

This will need RNA-seq differential expression results. For RNA-seq we currently have one option already prepared: 03-rnaseq/differential-expression_rnaseq_01.html (at least until #242 is completed).

Specifically, this will use the file SRP078441_differential_expression_results.tsv, which is based on AML patients.
We haven't used this for pathway analysis yet. If it turns out to be an insignificant dud, we may have to look into completing #242 and trying the results from that instead.

What steps should be included in this analysis?

The steps should for the most part follow what is being used for microarray: https://github.com/AlexsLemonade/refinebio-examples/blob/staging/02-microarray/pathway-analysis_microarray_02_ora.Rmd

We'll have to change the steps to use human instead of zebrafish, and also consider which alternative decisions we may want to show users as possibilities.

What packages/methods do you recommend using or looking into for this analysis?

Same as the microarray example, we should still be able to use clusterProfiler.

I think we can get started on this before we have a decision on #340. (@cansavvy might share that opinion, but it's hard to tell based on this issue.) Specifically, we can start to list all the ways this could potentially differ from the microarray example.

One idea that just occurred to me and is further afield would be not to use differential expression analysis results at all. You could instead imagine a situation where you have some grouping of genes (a co-expression module?) and want to know if there's an overlap with pathways/gene sets. If one did that, that would make the upstream steps different by technology. That's not originally what we talked about, but might be a good illustration of a situation where you would probably be better served using ORA – in a lot of other cases, if you're looking at genome-wide differential expression results, GSEA is probably preferred.
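A minimal sketch of the overlap test that ORA performs on a gene group such as a co-expression module, assuming a one-sided hypergeometric test. The function name `ora_p_value` and the gene identifiers are hypothetical and only for illustration. It is written in Python to stay self-contained and is not the clusterProfiler implementation.

```python
from math import comb


def ora_p_value(module_genes, gene_set, background):
    """Return (overlap size, P(overlap >= observed)) under a hypergeometric model.

    Genes outside the background are ignored, so the background defines
    the universe that both the module and the gene set are drawn from.
    """
    background = set(background)
    module = set(module_genes) & background
    pathway = set(gene_set) & background

    n_background = len(background)   # universe size
    n_pathway = len(pathway)         # "successes" in the universe
    n_module = len(module)           # number of draws
    overlap = len(module & pathway)  # observed successes

    # Sum the upper tail: probability of seeing at least `overlap` pathway genes.
    tail = sum(
        comb(n_pathway, i) * comb(n_background - n_pathway, n_module - i)
        for i in range(overlap, min(n_pathway, n_module) + 1)
    )
    return overlap, tail / comb(n_background, n_module)


# Hypothetical example: 20 background genes, a 5-gene pathway, a 4-gene module.
background = [f"gene{i}" for i in range(20)]
pathway = background[:5]
module = ["gene0", "gene1", "gene2", "gene10"]
overlap, p_value = ora_p_value(module, pathway, background)
print(overlap, p_value)
```

Because the test only needs a gene group and a background, the same function applies whether the group comes from a co-expression module or from a thresholded differential expression list. With many gene sets, the resulting p-values would still need multiple-testing correction.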

I wasn't sure what made sense, which is why this issue is quite wishy-washy.

I'm going to try to think up and explore some options and post them here. Tomorrow morning @cbethell and I are going to do some planning for this and #343.

If we want to look into doing this kind of strategy, should we consider showing users how to run WGCNA or something similar?

Yep, that was my thought. I'm not sure if WGCNA is currently thought to be the best way to find co-expression modules these days. I have some vague recollection of recent literature that would suggest not, but I think looking into what to use should be part of this issue.

Related to #346, I've been trying out some things with WGCNA and CoGAPS, and I think for the purposes of ORA, going with a quick WGCNA and running ORA on the biggest gene module seems like a straightforward and not too crazy way to go.

Perhaps at a different time we can look into using CoGAPS for its own example, but it probably needs too much conceptual info as well as computing power for the purposes of running ORA.

That being said, I'll propose an outline for what an ORA of a gene module might look like, and if it seems reasonable, I'll prepare a draft PR where we can discuss it.

I think CoGAPS is a great contender for an "advanced usage" example where we use a larger dataset composed of multiple experiments, if you want to file a new issue to track what you've found.

Rough outline:

  1. Install WGCNA and impute, a package that WGCNA apparently needs but sometimes has trouble installing.
  2. Import data and metadata per usual. So far I ran this with the zebrafish dataset we use elsewhere, SRP040561, and it seemed to give okay results, but we don't have to stick with this dataset if we think a different one would be better.
  3. Transpose the data.
  4. Use the transposed data in WGCNA::blockwiseModules().
  5. Extract the gene ids for the module with the most significance. It's unclear to me at this point from the documentation how to definitively tell which module this is. My assumption is that the module listed first has the strongest correlations, but I have yet to find where the docs confirm that, so it's TBD whether my assumption is right. This decision step should probably involve making some kind of summary visual of what happened in WGCNA, but I don't want to dwell too long on WGCNA results, so we'll try to make this decision somewhat quickly.

From here we use the genes in the "most of interest" module and carry on with the ORA steps we've used in the microarray module:

  6. Convert gene ids to symbols.
  7. Run ORA with all genes as the background set.
  8. Make the two enrichment plots included in the package, but maybe switch one of them out for something different so we show a different visual option. I think they are both useful plots though, so maybe add another plot instead?
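The data-handling parts of this outline (transposing, choosing a module, and defining the background) can be sketched as below. This is an illustration under stated assumptions, not the notebook's code: it is plain Python rather than R, the helper names `transpose` and `largest_module` are hypothetical, the toy values are made up, and it picks the module by size (the "biggest gene module" idea mentioned earlier in the thread) rather than by any significance measure. It also assumes unassigned genes carry the label "grey" or 0, which is the WGCNA convention.

```python
from collections import Counter


def transpose(rows):
    """Turn a genes x samples table into a samples x genes table."""
    return [list(column) for column in zip(*rows)]


def largest_module(module_by_gene, unassigned=("grey", 0)):
    """Return (module label, sorted gene ids) for the module with the most genes.

    Labels listed in `unassigned` are skipped so that genes left outside
    every module are never picked as the "biggest module".
    """
    sizes = Counter(m for m in module_by_gene.values() if m not in unassigned)
    if not sizes:
        raise ValueError("no genes were assigned to a module")
    module, _ = sizes.most_common(1)[0]
    genes = sorted(g for g, m in module_by_gene.items() if m == module)
    return module, genes


# Made-up toy data: 2 genes x 3 samples, then hypothetical module assignments.
expression = [[5.0, 6.0, 7.0], [1.0, 2.0, 3.0]]
samples_by_genes = transpose(expression)

module_by_gene = {
    "geneA": "blue", "geneB": "blue", "geneC": "brown",
    "geneD": "grey", "geneE": "grey", "geneF": "grey",
}
module, module_genes = largest_module(module_by_gene)

# Step 7: the background is every gene that went into the module detection.
background = sorted(module_by_gene)
print(module, module_genes, len(background))
```

Keeping the background equal to all genes that entered module detection matters for the ORA step: genes that were filtered out beforehand could never have landed in the module, so they should not count toward the universe.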

For WGCNA, we should probably normalize with DESeq2 instead of using refine.bio normalized data?

With the DESeq2 normalization steps and WGCNA, this seems like it will be too much for one notebook. Even after trying to trim it down, it's ~640 lines. I think WGCNA (if we are to use it) needs to be its own example.

We should figure out and implement #306 and #349 before tackling this.

Update: WGCNA (whose results we are going to use as ORA input) is going to become an "advanced-topics" module. We can still use the output from there, but if in the future we add a simpler method of reaching a group of genes (say, a k-means clustering of genes example or what have you), then ORA for RNA-seq can be updated at that time to use that output.

On a basic level we want a group of genes that it makes sense to run ORA for, but not differential expression results since it would make more sense to use GSEA for that.

A reminder that the intro paragraph from #349 will need to be added here, and the table will need to be made to reflect the RNA-seq versions of the analyses.

All wrapped up!

I'd consider reopening this and calling it wrapped up when merged to master? This seems like a good place to track that final step.