AlexsLemonade / refinebio-examples

Example workflows for refine.bio data

Home Page: https://www.refine.bio

New Analysis Example: Simpler example of gene clustering -- k-means for microarray

cansavvy opened this issue

What are the goals of this new example analysis?

Currently, the microarray ORA example uses differential expression results, and the RNA-seq one (not yet developed, #344) will end up using a gene module from WGCNA.

However, we should give users a more basic way to find gene clusters from data. Some users may find WGCNA a bit daunting (it also requires some computing power), and it may be more than what a user needs for their particular question. An example that shows something like k-means would fill that gap.

What kind of dataset will this need?

Something with enough samples that a cluster would make some kind of sense.
I think GSE37382, which is medulloblastoma with subgroups and is already used for the dimension reduction examples, seems like a reasonable dataset to use for this too.

What steps should be included in this analysis?

These are the roughest ideas of steps I have right now that will need to be made more specific and further polished when we dig into this example more.

  1. Import data and metadata
  2. Use k-means function
  3. Do some exploration into how "well" k-means ran -- unclear to me without doing a bit more digging what this looks like. It may be as simple as printing out some kind of summary stats.
  4. May want to run more iterations and see if you get the same-ish results?
  5. Get some kind of annotation for the genes that you can use as a test for seeing if your gene clustering seems sensible. This could be something like GO terms (But maybe not GO terms since they overlap so much).
  6. Probably plot gene-wise PCA and label the k-means clusters as colors and another form of gene annotation as shapes and see if it makes sense.
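
To make the steps above a bit more concrete, here is a rough sketch of steps 2, 3, 4, and 6 in base R. It runs on a small simulated matrix, not on GSE37382, so the object names (expr_mat, km, km_rerun), the three simulated gene groups, and the choice of k = 3 are all placeholders for illustration. Steps 1 and 5 (import and annotation) are left out because they depend on the dataset.

```r
# Simulated stand-in for a gene x sample expression matrix.
# ASSUMPTION: this is made-up data, not refine.bio data.
set.seed(2020)
n_samples <- 20
group_means <- c(-2, 0, 2) # three hypothetical gene groups
expr_mat <- do.call(rbind, lapply(group_means, function(mu) {
  matrix(rnorm(50 * n_samples, mean = mu, sd = 0.5), nrow = 50)
}))
rownames(expr_mat) <- paste0("gene_", seq_len(nrow(expr_mat)))

# Step 2: k-means on the genes (rows). nstart tries several random
# starting points and keeps the best solution.
km <- kmeans(expr_mat, centers = 3, nstart = 25)

# Step 3: a few summary stats on how the clustering turned out
cluster_sizes <- table(km$cluster)
prop_between <- km$betweenss / km$totss # share of variance between clusters
print(cluster_sizes)
print(prop_between)

# Step 4: re-run from a different seed and cross-tabulate the assignments.
# Cluster labels are arbitrary, so look at the table, not at label equality.
set.seed(1)
km_rerun <- kmeans(expr_mat, centers = 3, nstart = 25)
agreement <- table(first = km$cluster, rerun = km_rerun$cluster)
print(agreement)

# Step 6: gene-wise PCA (genes are the rows), colored by k-means cluster
pca <- prcomp(expr_mat)
plot(pca$x[, 1:2], col = km$cluster, pch = 19, xlab = "PC1", ylab = "PC2")
```

On real data the cross-tabulation in step 4 is probably the most informative piece: if the two runs agree, each row of the table should have most of its genes in a single column.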

What packages/methods do you recommend using or looking into for this analysis?

May not need extra packages besides magrittr and the tidyverse ones (which are assumed everywhere). Both kmeans() and prcomp() ship with R in the stats package.

Note: if/when this issue is completed, the ORA example should be updated to use this output (this should be its own issue and PR).

If all goes alright with this example, it can be made into an RNA-seq version as well, which will require additional steps for DESeq2 transformation.

I think this example could just as easily use KNN if we think that would be better for a particular reason.

My gut tells me that this is not going to be simpler than WGCNA from an explanation point of view, to be honest. Particularly the part about picking k...

I agree it's not simply "plug and chug", but at least it's mainly k and not 4-5 other parameters? I think it's more straightforward than WGCNA, but that's because WGCNA has a lot of pieces in comparison.
If we don't like k-means, do you have an even simpler suggestion for finding gene groups?

No, not really. I think whenever you're going to talk about number of clusters or cluster validation it's going to be tricky.
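
For reference on the "picking k" part, one common heuristic is an elbow plot: run k-means over a range of k and look at where the total within-cluster sum of squares stops dropping quickly. Below is a minimal sketch in base R. The data are simulated with three groups (an assumption for illustration, not refine.bio data), and the range 1 to 8 is arbitrary. It does not settle the cluster validation question raised above; it only shows what the simplest version could look like.

```r
# Simulated gene x sample matrix with three hypothetical gene groups.
# ASSUMPTION: made-up data used only to illustrate the heuristic.
set.seed(2020)
expr_mat <- rbind(
  matrix(rnorm(50 * 20, mean = -2, sd = 0.5), nrow = 50),
  matrix(rnorm(50 * 20, mean = 0, sd = 0.5), nrow = 50),
  matrix(rnorm(50 * 20, mean = 2, sd = 0.5), nrow = 50)
)

# Run k-means for each candidate k and keep the total within-cluster
# sum of squares from each fit.
k_values <- 1:8
total_withinss <- vapply(k_values, function(k) {
  kmeans(expr_mat, centers = k, nstart = 25, iter.max = 50)$tot.withinss
}, numeric(1))

# Elbow plot: look for the k after which the curve flattens out
plot(
  k_values, total_withinss,
  type = "b",
  xlab = "Number of clusters (k)",
  ylab = "Total within-cluster sum of squares"
)
```

With real expression data the elbow is often much less obvious than with simulated groups, which is part of why choosing k is hard to explain cleanly.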