Genomic associations between subspecies
karchern opened this issue · comments
From what I read, scoary is currently not able to work with non-binary traits.
I want to use scoary in order to determine the pangenomic differences between three apparent subspecies of my bacterium of interest. There appears to be a pretty strong signal, as the genomes cluster distinctly in a PCoA based on gene presence / absence data.
Specifically, I would like to find out which genes are differentially prevalent between the three clusters. Can I supply a trait file that has "dummy variables", something like the one below? My approach should work if scoary simply drops the samples that have no information for a given trait. What do you think?
Sample_name Comp_clust_1_2 Comp_clust_1_3 Comp_clust_2_3
member_cluster_1 0 0 NA/empty
member_cluster_1 0 0 NA/empty
...
...
member_cluster_2 1 NA/empty 0
member_cluster_2 1 NA/empty 0
...
...
member_cluster_3 NA/empty 1 1
member_cluster_3 NA/empty 1 1
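For illustration, the dummy-variable layout above could be generated from a cluster assignment with a few lines of Python. This is only a sketch: the genome names and the helper names `pairwise_traits` and `to_csv` are made up, and writing the file as comma-separated with `NA` for samples outside a comparison is an assumption that should be checked against the scoary documentation.

```python
import csv
import io
from itertools import combinations


def pairwise_traits(clusters):
    """Build one 0/1/NA column per pair of clusters.

    clusters: dict mapping sample name -> cluster label.
    Returns (header, rows) as lists of strings.
    """
    labels = sorted(set(clusters.values()))
    pairs = list(combinations(labels, 2))
    header = ["Sample_name"] + [f"Comp_clust_{a}_{b}" for a, b in pairs]
    rows = []
    for sample, label in clusters.items():
        row = [sample]
        for a, b in pairs:
            if label == a:
                row.append("0")   # member of the first cluster in the pair
            elif label == b:
                row.append("1")   # member of the second cluster in the pair
            else:
                row.append("NA")  # sample is not part of this comparison
        rows.append(row)
    return header, rows


def to_csv(header, rows):
    """Serialize the trait table as comma-separated text (assumed format)."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


# Hypothetical genomes, one per cluster.
example = {"genome_A": 1, "genome_B": 2, "genome_C": 3}
print(to_csv(*pairwise_traits(example)))
```

With three clusters this reproduces the pattern in the table: cluster 1 gets `0 0 NA`, cluster 2 gets `1 NA 0`, and cluster 3 gets `NA 1 1`.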
It is indeed possible to use scoary this way.
I would recommend using the --no_pairwise flag if you do this. The pairwise comparisons implemented in scoary do not really make sense when you are looking at enrichment in groups rather than testing variants under a causal hypothesis (the presence of a certain gene CAUSING the phenotype).
By splitting your genomes into a sort of pseudo-phenotype using PCoA, as you have done, you are (to a certain extent) handling spurious findings from population structure. This is similar to what is done in many (most) GWA studies.
A possible problem would arise if, within one of your clusters (principal components), you had some fairly different genomes and then a bunch of almost identical outbreak genomes. Then your results might show enrichment of genes present only in the outbreak genomes, even if these are lacking in the other genomes within the same cluster. I think one way of handling this could be to add more principal components (which in most cases correspond well to lineages).
I hope this made sense, and if not please fire away!
Hi Ola, thank you very much for your detailed answer.
Am I right in assuming that running scoary with the --no_pairwise flag is essentially equivalent to running a Fisher's exact test for each gene WITHOUT taking into account the structure of the (phylogenetic) tree of the samples (as scoary would do if the --no_pairwise flag were not set)?
Cheers,
Nic
That's correct!
This would be an adequate way of measuring between-group differences in gene enrichment unless you have a large number of pseudoreplicates within your groups.
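To make the per-gene test concrete, here is a minimal sketch of a two-sided Fisher's exact test on a 2x2 table of gene presence versus group membership, ignoring the tree entirely. It is not scoary's own code; the function names `fisher_exact_two_sided` and `gene_p_value` and the example data are hypothetical, and samples whose trait value is neither "0" nor "1" are simply skipped.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Rows: gene present / gene absent. Columns: trait 1 / trait 0.
    Sums the hypergeometric probabilities of all tables (with the same
    margins) that are no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Probability that x of the col1 trait-positive samples carry the gene.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    total = sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def gene_p_value(presence, trait):
    """presence: sample -> 0/1 for one gene; trait: sample -> "0", "1" or other."""
    a = b = c = d = 0
    for sample, value in trait.items():
        if value not in ("0", "1"):
            continue  # sample has no information for this trait, so drop it
        has_gene = presence[sample] == 1
        if has_gene and value == "1":
            a += 1
        elif has_gene:
            b += 1
        elif value == "1":
            c += 1
        else:
            d += 1
    return fisher_exact_two_sided(a, b, c, d)


# Hypothetical data: the gene is present in all of group 1, absent from group 0.
presence = {"s1": 1, "s2": 1, "s3": 1, "s4": 0, "s5": 0, "s6": 0, "s7": 1}
trait = {"s1": "1", "s2": "1", "s3": "1", "s4": "0", "s5": "0", "s6": "0", "s7": "NA"}
print(gene_p_value(presence, trait))
```

Because every genome counts as an independent observation here, a cluster dominated by near-identical genomes would inflate the signal, which is the pseudoreplicate caveat mentioned above.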
Thanks a lot, Ola! I'm closing this issue.