brentp / somalier

fast sample-swap and relatedness checks on BAMs/CRAMs/VCFs/GVCFs... "like damn that is one smart wine guy"

Somalier with pooled parents

tetedange13 opened this issue · comments

Hi,

First, thanks for developing somalier, it is a great tool!

In my team we have exome data with pooled parents, most of the time 4 mums and 4 dads together
=> I run somalier directly on BAMs and have a few questions, if you do not mind:

  • Do you have any experience using somalier for this specific case of pooled parents?
    => From what I tested with relate --infer, pools always have a relatedness around 0.5
    => And for a given parent pool, the relatedness is the same with the child of these parents as with any other child (from a different family)
    => So it cannot be used to verify that a parent of a given child is actually present in the corresponding parental pool

  • I also noticed that with the relate --ped option, somalier does not allow duplicated samples, even if they have a different famID
    => Would it be possible to check for duplicates only within a given family?
    => I noticed the --sample-prefix option, but AFAIK it does not fit my need, as I want to use the same ".somalier" file multiple times

  • Do you have any hints about using somalier to guess ploidy?
    => The goal would be to check that the predicted ploidy is correct (as a quality control, to spot a sample forgotten from a pool)
    => Maybe using the "scaled mean depth on chrX" metric?


Thanks for any help on this!
Best regards,
Felix.

Hi Felix, by "pooled", do you mean that all reads from all samples are mixed, without barcodes, so you don't know which reads came from which sample?
It's possible that somalier can help here, but it's not designed for that. And certainly, --infer will not work well (if at all) for that case.
If your children are sequenced individually, you could look at the rate of IBS0 to the parent pool. That should be very close to 0 if the parent is in the pool, but even that might not be reliable: if only a single parent has the allele, the allele ratio in the pool will be very low and the site might be called as hom-ref.
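
To make the IBS0 idea concrete, here is a minimal sketch of counting IBS0 sites between a child and a pool. It assumes genotypes are already coded as alt-allele counts (0 = hom-ref, 1 = het, 2 = hom-alt, negative = missing); that encoding and the function name `count_ibs0` are illustrative assumptions, not somalier's internal representation or API.

```python
from typing import Sequence


def count_ibs0(child: Sequence[int], pool: Sequence[int]) -> int:
    """Count sites where the two samples share no allele (IBS0).

    Genotypes are alt-allele counts: 0 = hom-ref, 1 = het, 2 = hom-alt.
    Negative values are treated as missing calls and skipped.
    """
    if len(child) != len(pool):
        raise ValueError("genotype vectors must cover the same sites")
    return sum(
        1
        for a, b in zip(child, pool)
        # IBS0 means one sample is hom-ref and the other is hom-alt.
        if a >= 0 and b >= 0 and abs(a - b) == 2
    )


# Toy genotypes (made up for illustration): only the fourth site is IBS0.
child = [0, 2, 1, 2, 0, -1]
pool = [1, 1, 1, 0, 0, 2]
print(count_ibs0(child, pool))
```

The caveat above shows up directly in this coding: a pool site where only one parent carries the alt allele may be called 0 instead of 1, which turns a child hom-alt site into a spurious IBS0.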

Thanks for your quick answer!

Yes, I meant "pooled parents" exactly as you described, and our children are indeed sequenced individually

For relatedness, IBS0 is indeed a good indicator
=> Children always have an IBS0 under 20 with their parental pool (versus an IBS0 above 50 with any unrelated pool)

I also found homozygous concordance to be a good metric
=> "child - pooled_parent" relationships are always above 0.6-0.65 when the parent is in the pool (and lower otherwise)
=> All "pool to pool" relationships exhibit low IBS0, but they never have a high enough "hom_concord" (so an even better metric than IBS0 in my case?)
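
The two thresholds above could be applied to the pairs table along these lines. This is only a sketch: it assumes the table is tab-separated with columns named `#sample_a`, `sample_b`, `ibs0` and `hom_concordance`, the cutoffs (IBS0 < 20, hom_concordance > 0.6) are just the values observed on this dataset, and requiring both conditions at once is a choice made for the example. The helper names and the two example rows are hypothetical.

```python
import csv
import io


def parent_likely_in_pool(ibs0: float, hom_concordance: float,
                          ibs0_max: float = 20,
                          hom_concord_min: float = 0.6) -> bool:
    """Apply the two empirical cutoffs described above (dataset-specific)."""
    return ibs0 < ibs0_max and hom_concordance > hom_concord_min


def flag_pairs(tsv_text: str) -> dict:
    """Map each (sample_a, sample_b) pair to True/False using the cutoffs."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    flags = {}
    for row in reader:
        # The header's first column is assumed to start with '#'; strip it.
        row = {key.lstrip("#"): value for key, value in row.items()}
        pair = (row["sample_a"], row["sample_b"])
        flags[pair] = parent_likely_in_pool(
            float(row["ibs0"]), float(row["hom_concordance"])
        )
    return flags


# Hypothetical rows, made up for illustration only.
example = (
    "#sample_a\tsample_b\tibs0\thom_concordance\n"
    "child_1\tpool_A\t12\t0.68\n"
    "child_1\tpool_B\t57\t0.41\n"
)
print(flag_pairs(example))
```

For real data, the text passed to `flag_pairs` would be the contents of the pairs file written by `somalier relate`; the cutoffs would likely need re-tuning for other capture kits or pool sizes.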

If --ped is the way to go, I would really benefit from being able to have duplicate sampleIDs in the input PED (provided that they have different famIDs)
=> Essentially, this would give a correct "expected_relatedness" in "pairs.tsv"
=> For all possible "child_{1,2,3,4} - pooled_parents_1+2+3+4" relationships of a given pool (I hope I am clear enough here)

Regarding guessing the number of samples pooled together from the data, I also made some progress:
(somalier_relate.html is very handy for all that)

  • The number of 1/1 sites correlates well with the number of pooled samples
    Number of samples in pool    n_hom_ref
    1                            > 5000
    2                            ~ 2500
    3                            ~ 1500
    4                            ~ 1000

=> I prefer to use the "fraction of hom_alt" (= hom_alt / (hom_alt + het + hom_ref))
=> After plotting this fraction against "expected_ploidy", I found a good linear correlation
=> With int(-12.5 * frac_hom_alt + 5.3) giving an integer estimate of the number of samples in the pool

  • In a more "experimental" way, I also use Scaled Y mean depth to get information on the "mixed pools" we sometimes (rarely) have
    => For example: 2 moms + 1 dad
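
The linear fit on the hom_alt fraction can be written as a small helper. This is a sketch under the assumption that the slope (-12.5) and intercept (5.3) only hold for this particular exome data; the function names are made up for the example, and the three genotype counts are whatever per-sample counts are available for the pool.

```python
def frac_hom_alt(n_hom_alt: int, n_het: int, n_hom_ref: int) -> float:
    """Fraction of genotyped sites called hom-alt (1/1)."""
    total = n_hom_alt + n_het + n_hom_ref
    if total == 0:
        raise ValueError("no genotyped sites")
    return n_hom_alt / total


def estimate_pool_size(frac: float, slope: float = -12.5,
                       intercept: float = 5.3) -> int:
    """Integer estimate of the number of pooled samples.

    Mirrors int(-12.5 * frac_hom_alt + 5.3); the coefficients are an
    empirical fit on one dataset and are assumptions anywhere else.
    """
    return int(slope * frac + intercept)


print(estimate_pool_size(0.10))  # fit value 4.05
print(estimate_pool_size(0.30))  # fit value 1.55
```

Note that `int()` truncates rather than rounds, so a fit value of 1.55 is reported as 1; `round()` could be swapped in if nearest-integer behaviour is preferred.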

Thanks again!
Best regards,
Felix.