AlexsLemonade / refinebio-examples

Example workflows for refine.bio data

Home Page: https://www.refine.bio

New Analysis Example: Microarray Pathway Analysis - GSVA

cansavvy opened this issue

What are the goals of this new example analysis?

ORA and GSEA are certainly popular pathway analysis methods, but GSVA requires fewer cutoffs and less decision-making, so having this method as an example would probably be helpful for our users.

Per-sample pathway results also address a different question, one that GSVA can answer but ORA and GSEA mostly can't.

What kind of dataset will this need?

We may want to use the same original dataset we used in either GSEA or ORA so we have a comparison of pathway analyses: GSE71270 (zebrafish CREB study) or GSE37418 (human medulloblastoma subtype).

What steps should be included in this analysis?

We can borrow some inspiration from https://github.com/AlexsLemonade/training-modules/blob/master/pathway-analysis/03-gene_set_variation_analysis.Rmd, keeping in mind that the narrative will need to change somewhat, as with other examples we've adapted from training to refinebio-examples. See #306.

  1. Import library(GSVA) (add this to the Dockerfile)
  2. Set up gene expression data as a matrix.
  3. Import gene lists and decide about Hallmark or not. This decision should be made considering the discussion happening on #339 (comment); we'll want to make sure users understand the implications of multiple testing corrections and how smaller gene sets can help with this.
  4. Use GSVA::gsva() to perform GSVA, probably start out with largely the same parameters used in training but adjust if/when things look wonky.
  5. Display a preview of significant results in one way or another. Somewhat related to this discussion #339 (comment)
  6. Make some sort of visualization of the GSVA scores. Not sure what makes the most sense here? Plotting the top results and maybe a jitter plot by group?
  7. Write results to a TSV.
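
To make step 3 a bit more concrete, here is a rough, language-agnostic sketch of reading gene sets from a GMT file (one gene set per line, tab-separated: name, description, then gene identifiers) and keeping only the Hallmark sets. It is written in Python with the standard library purely for illustration; the example notebook itself would do this in R. The helper names `read_gmt` and `keep_prefix` and the toy gene sets are made up for this sketch, and it assumes the Hallmark sets can be recognized by a `HALLMARK_` name prefix.

```python
import io


def read_gmt(handle):
    """Parse GMT text: each line is name, description, then gene identifiers (tab-separated)."""
    gene_sets = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            # Skip blank or malformed lines that have no genes
            continue
        name, _description, *genes = fields
        gene_sets[name] = [gene for gene in genes if gene]
    return gene_sets


def keep_prefix(gene_sets, prefix="HALLMARK_"):
    """Keep only gene sets whose name starts with the given prefix (assumed naming convention)."""
    return {name: genes for name, genes in gene_sets.items() if name.startswith(prefix)}


# Made-up toy input standing in for a downloaded GMT file
toy_gmt = io.StringIO(
    "HALLMARK_TOY_A\tplaceholder\tGENE1\tGENE2\tGENE3\n"
    "OTHER_TOY_B\tplaceholder\tGENE2\tGENE4\n"
)

all_sets = read_gmt(toy_gmt)
hallmark_sets = keep_prefix(all_sets)
print(sorted(hallmark_sets))
```

Restricting to a smaller collection like this means fewer gene sets are tested downstream, which is the connection to the multiple testing discussion.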

What packages/methods do you recommend using or looking into for this analysis?

Probably GSVA unless there are other package suggestions we should consider.

Based on a discussion with @cansavvy, the plan in the original comment above, and the training modules example for inspiration, the tentative plan for tackling this ticket is as follows:

  1. Import library(GSVA) (add this to the Dockerfile)
  2. Read in gene expression data (Homo sapiens, likely a dataset already on S3)
  3. Import gene list from the Broad Institute URL, using the recommendation from the GSVA vignette to read in the file (and isolate Hallmark gene sets). Include context making sure users understand the implications of multiple testing corrections and how smaller gene sets can help with this (if we were to read in a smaller subset file).
  4. Gene identifier conversion: map to human gene symbols or Entrez IDs, likely symbols.
  5. Remove duplicate identifiers — using the highest variance to select which row to keep perhaps?
  6. Use GSVA::gsva() to perform GSVA, probably start out with largely the same parameters used in training but adjust if/when things look wonky.
  7. Make some sort of visualization of the GSVA scores. Plotting the results using a heatmap and maybe a violin or jitter plot to plot by group? To plot by highest variance? To plot by highest GSVA score?
  8. Write results to a TSV.
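
For step 5, one possible reading of "highest variance" is sketched below: when several rows map to the same gene symbol, keep the row whose values vary most across samples. This is only an illustration in standard-library Python (the example itself would be in R); the function name `keep_highest_variance` and the toy rows are hypothetical, and ties are resolved by keeping the first row seen.

```python
from statistics import pvariance


def keep_highest_variance(rows):
    """rows: list of (gene_symbol, [expression value per sample]).

    Returns a dict mapping each symbol to the single row with the
    largest variance across samples.
    """
    best = {}
    for symbol, values in rows:
        variance = pvariance(values)
        # Replace the stored row only if this one is strictly more variable
        if symbol not in best or variance > best[symbol][0]:
            best[symbol] = (variance, values)
    return {symbol: values for symbol, (_, values) in best.items()}


# Made-up toy data: two rows map to GENE1
toy_rows = [
    ("GENE1", [1.0, 1.0, 1.0]),
    ("GENE1", [0.0, 2.0, 4.0]),
    ("GENE2", [5.0, 6.0, 7.0]),
]

deduplicated = keep_highest_variance(toy_rows)
print(deduplicated)
```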

Feel free to leave any suggestions/modifications you believe should be made before implementing this plan!
cc: @jaclyn-taroni and @jashapiro

Remove duplicate identifiers — using the highest variance to select which row to keep perhaps?

You could also aggregate to the mean value for a gene symbol for each sample.
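
A minimal sketch of that alternative, assuming every row has one value per sample in the same order: rows sharing a gene symbol are averaged sample by sample. As before, this is standard-library Python for illustration only, and `aggregate_to_mean` and the toy rows are made up.

```python
from collections import defaultdict
from statistics import fmean


def aggregate_to_mean(rows):
    """rows: list of (gene_symbol, [expression value per sample]).

    Returns a dict mapping each symbol to the per-sample mean of all
    rows that share that symbol.
    """
    grouped = defaultdict(list)
    for symbol, values in rows:
        grouped[symbol].append(values)
    # zip(*value_rows) walks the rows column by column, i.e. sample by sample
    return {
        symbol: [fmean(column) for column in zip(*value_rows)]
        for symbol, value_rows in grouped.items()
    }


# Made-up toy data: two rows map to GENE1
toy_rows = [
    ("GENE1", [1.0, 1.0, 1.0]),
    ("GENE1", [0.0, 2.0, 4.0]),
    ("GENE2", [5.0, 6.0, 7.0]),
]

averaged = aggregate_to_mean(toy_rows)
print(averaged)
```

Compared with keeping a single row, this uses all of the measurements for a symbol, at the cost of blending rows that may not behave alike.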