AlexsLemonade / refinebio-examples

Example workflows for refine.bio data

Home Page: https://www.refine.bio

[Discussion] Organizational structure of pathway analysis material

cansavvy opened this issue

Background

I'm writing down some thoughts from a discussion with @jaclyn-taroni:

Currently, we have plans for making a Pathway Analysis Introduction notebook (#214) which will be the first kind of module-specific introduction notebook. This strategy was discussed (mostly by me it looks like) in #141 where the idea of what to do with README info was brought up. For most modules it wasn't a concern, but for pathway analysis it is.

For pathway analysis, the intro notebook is meant to give background information that will help users decide what kind of pathway analysis they may want to use (based on their dataset). Each pathway analysis example links back to that introduction notebook.

Problem

As of now, both the RNA-seq and microarray sections would have their own pathway analyses and introduction notebook. It's unlikely that the introduction notebooks for the two technologies would differ much in content, meaning they would probably be pretty redundant.

No matter what, there will probably be some redundancy involved, as has been discussed elsewhere: #223 and #175

What we need to find is the best balance between user navigability and ease of maintenance. In other words, redundant material is harder to maintain, but it's easier for a user to find what they are looking for if it is structured more like the other modules.

Three options we can think of as of now:
Option 1) Continue with the plan as is: keep the intro notebooks, noting that the microarray and RNA-seq pathway intro notebooks will be almost identical.
- Pros: the structure will be consistent with being "by technology" which may be nice if users largely focus on one technology (perhaps more likely one dataset that happens to use a particular technology) over the other.
- Cons: this will mean two introduction to pathway analysis notebooks exist that are largely identical. We deal with this elsewhere in this repo, but do we think it's worth it to deal with redundancy again here?

Option 2) Abbreviate the introduction notebook material into a section that can be in each pathway analysis example and continue with the microarray and RNA-seq examples.
- Pros: this continues with the philosophy of making each example "self-contained" and doesn't disturb the "by technology" organizational structure too much.
- Cons: this is still rather redundant and will require abbreviating the pathway analysis intro material somewhat. These may not be insurmountable barriers, but they still involve the redundancy issue (which appears to be common to this repo).

Option 3) Make a whole new "bigger" cross-tech section of pathway-analysis aka 02-microarray, 03-rna-seq, 04-pathway-analysis where an introduction could be more similar to the rna-seq and microarray introductions and would require less redundancy -- but how would this be user-wise?
- Pros: This is much easier to maintain, next to no redundancy.
- Cons: It's definitely straying from the "by technology" structure, will users find this annoying/confusing/hard to find?

Whatever we decide for pathway analysis's introduction notebook will likely be used for any other "big modules" that also could use introduction material. So this is somewhat about precedent and not just about pathway analysis.

What are the recommended next steps?

It would be great if we could get @dvenprasad's input on this over a video chat. It is unclear what the best decision is, since we are unsure how users will be navigating and using this material.

@dvenprasad, for some additional context that I think is important: the way we are going to run the "most popular" (very scientific analysis by me and my gut) types of pathway analyses, ORA and GSEA, will rely on the differential expression analyses that are currently included under the microarray or RNA-seq sections depending on the type of data they use. They start with a table of differential gene expression results, rather than the expression matrix you download from refine.bio.

By the time you get to ORA and GSEA themselves, the technology-specific differences in the steps you need to take have been resolved for the most part.

However, there is another pathway or gene set analysis method (GSVA) that we've talked about including (and maybe the QuSAGE analyses) that will diverge from this pattern - it will start with an expression matrix from refine.bio and your options, etc. will depend on the technology.
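To make the input difference concrete, here is a minimal Python sketch of ORA starting from a differential expression table rather than an expression matrix. It is only an illustration under stated assumptions, not the code used in the examples: the column names `gene` and `padj`, the gene identifiers, the 0.05 cutoff, and the helper names are all hypothetical, and the test shown is a plain one-sided hypergeometric test.

```python
from math import comb


def hypergeom_sf(k, M, n, N):
    """P(X >= k) for X ~ Hypergeometric(population M, n pathway genes, N draws)."""
    upper = min(n, N)
    total = comb(M, N)
    return sum(comb(n, i) * comb(M - n, N - i) for i in range(k, upper + 1)) / total


def over_representation(de_table, gene_set, padj_cutoff=0.05):
    """Test whether a gene set is over-represented among significant genes.

    de_table: list of dicts with hypothetical keys "gene" and "padj",
              i.e. the output of a differential expression analysis.
    gene_set: iterable of gene identifiers for one pathway.
    """
    # Background = every gene that was tested for differential expression.
    background = {row["gene"] for row in de_table}
    # Genes of interest = those passing the (assumed) adjusted p-value cutoff.
    significant = {row["gene"] for row in de_table if row["padj"] < padj_cutoff}
    # Only pathway genes present in the background can contribute.
    pathway = set(gene_set) & background
    overlap = significant & pathway
    p_value = hypergeom_sf(
        len(overlap), len(background), len(pathway), len(significant)
    )
    return {"overlap": sorted(overlap), "p_value": p_value}


# Toy differential expression results (made-up values for illustration only).
toy_padj = [0.001, 0.01, 0.03, 0.2, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
de_table = [{"gene": f"g{i + 1}", "padj": p} for i, p in enumerate(toy_padj)]

# "gX" is not in the background, so it is ignored.
result = over_representation(de_table, ["g1", "g2", "g9", "gX"])
print(result)
```

A method like GSVA would not fit this shape: it would take the expression matrix itself as input, which is why its steps depend more on the technology.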

Another set of concerns I have, when we talk about maintainability and think about our overall effort, is: what is the purpose of these examples and how are we (uniquely) adding value? (See also: #223 (comment))

We know that:

  1. Biologists are interested in pathway analyses; it is a popular use case for expression data
  2. When we reduce fragmentation – when you can get your data and your information about how to use it under the same umbrella – we reduce friction and that's our why for these.

Or at least I think we know that, so I'd like your insight and opinions! This is a more general concern, but I think it is very pertinent here because of the interest in this type of analysis and because it often requires more steps to get to your result.

If we de-emphasize teaching folks about the different kinds of pathway analyses, we do less writing and eliminate some of the choices above.

If possible, it would be good for us to each have our thoughts jotted down about this by the end of this sprint. This way, if we need to have a video chat about this, we can tentatively schedule that chat for the week of Nov 9th. Does that timeframe work for you, @dvenprasad?

Okay,

tl;dr: Please keep our current technology > analyses hierarchy and keep examples as self-contained as possible. I would recommend option 2.

  1. Options 1 and 2 do not seem to break the current information architecture pattern, and I would recommend those. Option 3 breaks out of the established hierarchy of Technology > analysis, which is how the rest of the analyses are organized. It can be confusing and harder to learn when the hierarchy of info is not consistent.

  2. How do users come to refine.bio?

  • Users are linked to examples after they download a dataset. AlexsLemonade/refinebio-frontend#906 will ensure that we link to examples based on technology. (Most people land on an experiment page directly via search and download the dataset from there, so we will see a majority of datasets with one technology.)

  • Users are linked to examples via docs.

  • Users are linked from the landing page.

All this to say that we can primarily expect people to land on refine.bio examples because there is a specific analysis they have in mind (and likely a dataset too, if they get redirected after downloading a dataset). This could be wrong, and we'll find out. So they might benefit from the current technology > analysis hierarchy.

  3. "ORA and GSEA, will rely on the differential expression analyses that are currently included under the microarray or RNA-seq sections depending on the type of data they use. They start with a table of differential gene expression results, rather than the expression matrix you download from refine.bio."

This also makes me lean towards keeping our technology > analysis hierarchy and linking users between these notebooks.

Now, between Options 1 and 2: I would recommend 2, since it is good to have these be as self-contained as possible given how users are being directed to the analyses, i.e., they are just dropped into what they are interested in.

I think I've addressed all of the questions, please let me know if I have missed anything.

Thanks for this helpful and straightforward input. I think this makes sense.

I think Option 2 is doable so we can aim for that. I think we can make the intro material abbreviated and link out to more info where needed (a strategy we’ve used often in this repo). The briefer and more efficient we can be, the less of a problem maintenance will be, so I think this is possible.

We can try this out on one of the modules and revisit this if we don’t feel that is working.

The action items for how we should implement option 2 are written here: #349. I'm going to close this for now.