Question about --collapse

Antonia-Chalka opened this issue 3 years ago · comments

I have a very basic question about how the --collapse flag determines grouping. Does it collapse genotypes that have the exact same distribution across all the samples, or is some other type of correlation statistic used to determine that (and if so, what is it and what is the threshold)?

Both the README and the paper note the following:

For each phenotype supplied via columns in the traits file, Scoary does the following: first, correlated genotype variants are collapsed. Plasmid genes, for example, are typically inherited together rather than as individual units and Scoary will collapse these genes into a single unit.

Annita · Answer 1 · Mon Jul 19 2021 18:04:34 GMT+0800 (China Standard Time)

From a quick look at the code in the methods script, it seems the correlation has to be perfect, but there's also a mention of a 'softer' threshold, so I'm not 100% sure 😅

Ola Brynildsrud · Answer 2 · Thu Oct 21 2021 17:54:38 GMT+0800 (China Standard Time)

Thanks for your question, and sorry about the wait.

As you have already figured out, the genotypes need to be 100% correlated to be collapsed. You may also have seen from the code that I thought about using a softer threshold, but I have never gotten around to implementing that.
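To make the "100% correlated" rule concrete, here is a minimal sketch that groups genes sharing an identical presence/absence pattern across all isolates. It is not Scoary's actual code: the function name `collapse_identical`, the dictionary input format, and the `--` separator for joined names are assumptions made for illustration, and it reads "100% correlated" as "exactly the same pattern".

```python
from collections import defaultdict


def collapse_identical(gene_presence):
    """Merge genes whose presence/absence pattern is identical in every isolate.

    gene_presence maps a gene name to a list of 0/1 values, one per isolate
    (hypothetical input format, not Scoary's own data structure).
    """
    groups = defaultdict(list)
    for gene, pattern in gene_presence.items():
        # Genes with exactly the same pattern end up under the same key.
        groups[tuple(pattern)].append(gene)
    # Joined name -> shared pattern; the "--" separator is an assumption.
    return {"--".join(genes): list(pattern) for pattern, genes in groups.items()}


example = {
    "geneA": [1, 0, 1, 1],
    "geneB": [1, 0, 1, 1],  # same pattern as geneA, so the two are collapsed
    "geneC": [0, 1, 1, 0],  # differs in at least one isolate, stays separate
}
collapsed = collapse_identical(example)
print(collapsed)
```

With exact matching there is no ambiguity about the distribution of the collapsed unit, because every member gene has the same pattern.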

I'm also a bit uncertain how the distribution of the collapsed variant should be counted, i.e. should it be present in all isolates with either of the original variants? I'm uncertain how that would impact other assumptions that are made.

Another thing I'm not sure about is whether the collapsed genes should then go through subsequent rounds of correlation -> collapse. That is, when we collapse two genes into one, this will have a new distribution pattern, and there is a chance that this new pattern will fall within the correlation threshold of being collapsed with yet another gene.
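The two open questions above can be illustrated with a small, purely hypothetical sketch. None of this is implemented in Scoary: the similarity measure (fraction of isolates in which two patterns agree), the 0.8 threshold, the union rule for the merged pattern, and the names `similarity` and `soft_collapse` are all assumptions chosen only to show how repeated rounds could chain merges.

```python
def similarity(a, b):
    """Fraction of isolates in which two 0/1 patterns agree (assumed measure)."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def soft_collapse(gene_presence, threshold):
    """Repeatedly merge any two units whose similarity reaches the threshold.

    The merged unit is counted as present wherever either member is present
    (union rule), which is one of the options raised above, not a settled choice.
    """
    units = {name: list(pattern) for name, pattern in gene_presence.items()}
    merged = True
    while merged:  # keep running rounds until no pair qualifies
        merged = False
        names = list(units)
        for i, first in enumerate(names):
            for second in names[i + 1:]:
                if similarity(units[first], units[second]) >= threshold:
                    union = [x | y for x, y in zip(units[first], units[second])]
                    del units[first], units[second]
                    units[first + "--" + second] = union
                    merged = True
                    break
            if merged:
                break
    return units


genes = {
    "geneA": [1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    "geneB": [1, 0, 1, 0, 0, 0, 0, 0, 0, 0],
    "geneC": [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
}
result = soft_collapse(genes, threshold=0.8)
print(result)
```

In this toy input, geneC agrees with geneA and with geneB in only 7 of 10 isolates, so it would not be merged with either one directly. It is absorbed only after geneA and geneB have been merged and their union pattern agrees with geneC in 8 of 10 isolates. The final unit is present in five isolates even though geneA alone was present in two, which is the kind of drift the questions above are concerned with.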