snakemake-workflows / rna-seq-star-deseq2

RNA-seq workflow using STAR and DESeq2

Condition in samples.tsv is too specific

jonathandmoore opened this issue

For a generic RNA-Seq pipeline, 'condition' is too specific as a label for grouping samples.

The user may want to compare sample groups from different genotypes or different timepoints, rather than different conditions or treatments. 'Group' would be a more generic label than 'Condition'.

Hi @jonathandmoore,

thanks for the feedback. Let's try to develop a more fitting and generic solution!

Two quick thoughts already:

  1. We use 'group' in a different context, to define an analysis group: different samples from the same individual or family that we want to model jointly for variant calling from DNAseq data. It's in the dna-seq-varlociraptor workflow:

https://github.com/snakemake-workflows/dna-seq-varlociraptor/blob/dc56d67399a71e98fa38f256308ca2871a7f24e9/workflow/schemas/samples.schema.yaml#L13-L16

Thus, this could get confusing for people using both workflows. In one instance comparisons are only within a group, in others between groups. Also, group doesn't really work for me when referring to different timepoints.

  2. I could imagine different genotypes referred to as different conditions. But I also see the point, that different timepoints don't fit in there as smoothly and that there are probably more types of setups that don't fit this kind of label.

I guess the most generic statistical label would be independent_variable, but this label could also be applied to confounders or batch effects, and it probably isn't very intuitive for users with a non-stats background (which will be most). Then, I thought about manipulated_variable, but this also doesn't really fit different timepoints (even though it is the experimenter who chooses timepoints and thus in a sense "manipulates"...).

Maybe explanatory_variable is a good choice? I got the idea from this Wikipedia list of Synonyms:
https://en.wikipedia.org/wiki/Dependent_and_independent_variables#Statistics_synonyms

One of the citations on that Wikipedia page also gives a good explanation of why "explanatory variable" might be preferred over "independent variable" (page 197 of "The Oxford Dictionary of Statistical Terms"):

independent variable: This term is regularly used in contradistinction to ‘dependent variable’ in regression analysis. When a variable Y is expressed as a function of variables X1, X2, ..., plus a stochastic term, the X’s are known as ‘independent variables’. The terminology is rather unfortunate since the concept has no connection with either mathematical or statistical dependence. Modern usage prefers ‘explanatory variable’, ‘covariate’ or ‘regressor’.

And here's a very straightforward explanation of it:
https://online.stat.psu.edu/stat200/lesson/1/1.1/1.1.2

So, what do you think?

Actually, some more thoughts from an internal discussion:

explanatory variable has a connotation of causality, which is problematic. So maybe not the best choice.

Something that speaks for condition is that it is commonly used in the field, e.g. see the docs of DESeq2 and the docs of sleuth. What speaks against it is that, e.g. in the sleuth docs, condition is also used for confounding variables / uninteresting covariates that are simply modeled to exclude their effect from the final differential expression test.

To a certain extent, these docs also speak for the use of group in this context, as they do make use of this word to describe the categories in a variable of interest.

However, this last thought brings me to a candidate that might be more suitable: variable_of_interest
This can also encompass continuous-valued variables, as opposed to only categorical values. E.g. DESeq2 can model such a continuous-valued variable of interest, and it uses the term "variable of interest" in its documentation. So after some more thought, this is my new favourite candidate. But I'm glad to hear more suggestions!

variable_of_interest seems a good fit.

I think the true solution is to just require people to specify a design matrix (via a formula over column names) instead of naming one column in a predefined way. We should modify the workflow accordingly. Our bandwidth is limited at the moment, so that I cannot promise this to happen immediately, and any PR would be welcome.
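To make the "formula over column names" idea concrete, here is a minimal sketch of how a workflow could check that every variable named in a design formula exists as a column in samples.tsv. The function names are hypothetical and not part of the workflow, and the parsing is deliberately simplified: it treats every identifier on the right-hand side as a column name, so function calls such as log(x) inside a formula would need extra handling.

```python
import re


def formula_variables(formula: str) -> list[str]:
    """Return the variable names of an R-style design formula, in order of
    first appearance. Simplified: every identifier on the right-hand side
    is assumed to be a column name."""
    rhs = formula.split("~", 1)[-1]
    names = re.findall(r"[A-Za-z_.][A-Za-z0-9_.]*", rhs)
    seen: list[str] = []
    for name in names:
        if name not in seen:
            seen.append(name)
    return seen


def missing_columns(formula: str, columns: list[str]) -> list[str]:
    """Return formula variables that are not columns of the sample sheet."""
    return [v for v in formula_variables(formula) if v not in columns]


# Hypothetical sample sheet header and design formula.
columns = ["sample", "genotype", "timepoint"]
print(missing_columns("~ batch + genotype * timepoint", columns))
```

A check like this could run before DESeq2 is invoked, so that a typo in the formula fails early with a readable message.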

While formulas are a good general solution, I think there should be an alternative (simple) way to pick one "class label" column with no need to worry about a possible bias term and formula syntax.

So in fact, you may want to name the column arbitrarily, but then declare your variable of interest to be column "weekday" or sth.
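A rough sketch of that "simple way", assuming a config dictionary with a hypothetical `variable_of_interest` key and an optional hypothetical `model` key (neither name is taken from the workflow's actual config): if no explicit formula is given, the design falls back to a single-term formula over the declared column.

```python
def design_formula(config: dict, columns: list[str]) -> str:
    """Return the design formula to use for the differential analysis.

    An explicit formula under the (hypothetical) 'model' key wins. Otherwise
    the formula is built from the single column named under the
    (hypothetical) 'variable_of_interest' key."""
    model = config.get("model")
    if model:
        return model
    column = config["variable_of_interest"]
    if column not in columns:
        raise ValueError(f"column '{column}' not found in sample sheet")
    return f"~ {column}"


# The column can be named arbitrarily; the config just points at it.
columns = ["sample", "weekday"]
print(design_formula({"variable_of_interest": "weekday"}, columns))
```

With this shape, users who only have one class label never need to touch formula syntax, while an explicit formula remains available for more complex designs.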

> I think the true solution is to just require people to specify a design matrix (via a formula over column names) instead of naming one column in a predefined way. We should modify the workflow accordingly. Our bandwidth is limited at the moment, so that I cannot promise this to happen immediately, and any PR would be welcome.

Obviously, it wasn't as simple as that. One does not simply specify a design matrix.

But there's an attempt to side-step this naming issue in the spirit of what @johanneskoester suggested, here. It generalizes the configuration of the DESeq2 differential analysis, hopefully explaining all the options clearly enough right there in the config.yaml file. And as a plus, it does use the wording variable_of_interest as a meta entry key.

@jonathandmoore, it's been a while. But maybe you're still up for a review? If so, please head over to PR #66.