luntergroup / octopus

Bayesian haplotype-based mutation calling

Polyclone caller produces VCF records with variable numbers of genotypes across records

gtollefson opened this issue

Describe the bug
Polyclone caller produces VCF records with variable numbers of genotypes across records. If the number of clones is inferred from the data, I don't think the number of clones should change per site; rather, there should be a phased genotype for each clone at every site. This makes calling haplotypes per clone from Octopus polyclone VCFs impossible. I see from the Octopus run-time messages that the polyclone caller is still in experimental development and not recommended for production-level use. Is this a known bug?
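A minimal sketch for quantifying this on a given VCF, assuming a plain-text (uncompressed) file with a `GT` key in the FORMAT column. The function name `genotype_ploidy_counts` and the toy records are made up for illustration and are not taken from the screenshot or from Octopus itself:

```python
from collections import Counter
import re


def genotype_ploidy_counts(vcf_lines, sample_index=0):
    """Tally how many alleles each record's GT field carries for one sample."""
    counts = Counter()
    for line in vcf_lines:
        # Skip blank lines, meta lines (##) and the column header (#CHROM).
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10:
            continue  # no FORMAT/sample columns on this record
        format_keys = fields[8].split(":")
        if "GT" not in format_keys:
            continue
        sample_values = fields[9 + sample_index].split(":")
        gt = sample_values[format_keys.index("GT")]
        # Alleles are separated by "|" (phased) or "/" (unphased).
        counts[len(re.split(r"[|/]", gt))] += 1
    return counts


# Hypothetical toy records, only to show the shape of the output.
toy_records = [
    "##fileformat=VCFv4.3",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample",
    "chr1\t100\t.\tA\tT\t50\tPASS\t.\tGT\t0|1",
    "chr1\t200\t.\tG\tC\t50\tPASS\t.\tGT\t0|1|1",
    "chr1\t300\t.\tC\tA\t50\tPASS\t.\tGT\t1|0",
]
print(dict(genotype_ploidy_counts(toy_records)))  # {2: 2, 3: 1}
```

More than one key in the result means the number of alleles per genotype differs between records. For a real file, pass an open file handle instead of the toy list.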

Version
0.7.4 (current stable version)

Additional context
Screenshot of VCF records showing variable numbers of genotypes per site, produced from a lab control mixed clonal infection sample of haploid organisms.
Screenshot 2023-04-28 at 2 02 11 PM

I don't think the polyclone caller is designed to infer the true number of haplotypes across the entire data set. Inferring the true number of haplotypes across the data set could be implemented by some OTHER piece of software that takes the output of the polyclone caller as input.

It's not possible for octopus to construct a chromosome-length haplotype, as this would require phasing the data across the whole chromosome. Octopus only gets local phase information. To infer full haplotypes, you'd need to do something like MCMC to sample from the posterior distribution of haplotypes.
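To see how local the phase information is in a particular output file, one option is to group records by phase set. The sketch below assumes the VCF carries the `PS` FORMAT key defined in the VCF specification and treats records that share a chromosome and `PS` value as one locally phased block. The function name `phase_blocks` and the toy records are hypothetical:

```python
from collections import defaultdict


def phase_blocks(vcf_lines, sample_index=0):
    """Group record positions by (chromosome, PS value) for one sample."""
    blocks = defaultdict(list)
    for line in vcf_lines:
        # Skip blank lines, meta lines (##) and the column header (#CHROM).
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10:
            continue  # no FORMAT/sample columns on this record
        keys = fields[8].split(":")
        values = fields[9 + sample_index].split(":")
        if "PS" not in keys:
            continue
        idx = keys.index("PS")
        # Trailing FORMAT values may be dropped; "." means missing.
        if idx >= len(values) or values[idx] == ".":
            continue
        blocks[(fields[0], values[idx])].append(int(fields[1]))
    return dict(blocks)


# Hypothetical toy records: two phased blocks on one chromosome.
toy_phased_records = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample",
    "chr1\t100\t.\tA\tT\t50\tPASS\t.\tGT:PS\t0|1:100",
    "chr1\t150\t.\tG\tC\t50\tPASS\t.\tGT:PS\t1|0:100",
    "chr1\t900\t.\tC\tA\t50\tPASS\t.\tGT:PS\t0|1:900",
]
for (chrom, ps), positions in phase_blocks(toy_phased_records).items():
    print(chrom, ps, positions)
```

Allele order is only comparable between records in the same block. Joining blocks into longer haplotypes is the separate inference step described above, and this sketch does not attempt it.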

I see, thank you for your explanation @bredelings! Sorry for my delayed response.