brentp / somalier

fast sample-swap and relatedness checks on BAMs/CRAMs/VCFs/GVCFs... "like damn that is one smart wine guy"

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

How're `gt_depth_mean` and `depth_mean` calculated?

SHuang-Broad opened this issue · comments

Hi,

As we continue to test out somalier for QC on our long reads data (currently for sex check), we are seeing something a bit strange.

The sample we ran this on has an estimated coverage of 10 by mosdepth (single CCS SMRT cell sample). However, as you can see from the table below, the gt_depth_mean and depth_mean are estimated to be approximately 1.5 of that,

#family_id sample_id paternal_id maternal_id sex phenotype original_pedigree_sex gt_depth_mean gt_depth_sd depth_mean depth_sd ab_mean ab_std n_hom_ref n_het n_hom_alt n_unknown p_middling_ab X_depth_mean X_n X_hom_ref X_het X_hom_alt Y_depth_mean Y_n
family 1557026 paternal maternal 1 -9 male 15.3 6.4 14.6 6.8 0.43 0.56 4601 3045 6052 3686 0.013 10.54 178 83 0 95 11.57 7

whereas the X_depth_mean and Y_depth_mean is roughly in line with the mosdepth estimate.

So, as you can foretell, this makes the sex check look confusing
Screen Shot 2022-09-13 at 11 12 06 AM

I also run samtools depth on the sites VCF locations, and the mean from that is also 10.

So I'm wondering how gt_depth_mean and depth_mean are estimated in somalier?

Thanks,
Steve

Hi, I see what you mean. Can you run somalier relate with --min-depth 5 and let me know if the result matches more what you expect? With a cutoff of 7 for a sample with mean depth of 10, I suspect many lower depths are getting skipped and this shifts the mean depth of (what I call) genotyped sites up. Indeed, this is a problem, but should be easily fixed.

I actually used --min-depth 4 first, and here's the table

#family_id sample_id paternal_id maternal_id sex phenotype original_pedigree_sex gt_depth_mean gt_depth_sd depth_mean depth_sd ab_mean ab_std n_hom_ref n_het n_hom_alt n_unknown p_middling_ab X_depth_mean X_n X_hom_ref X_het X_hom_alt Y_depth_mean Y_n
family 1557026 paternal maternal 1 -9 male 14.7 6.7 14.6 6.8 0.53 0.45 4948 3177 6668 2591 0.013 10.54 178 83 0 95 11.57 7

You can see the two fields are similar to the table I shared above.

Indeed, that rules out my guess. Is there any chance that you can share the extracted .somalier file?

I think it is odd that there are so many n_hom_alt,

  • have you run other QC checks on the sample?
  • are you running somalier on BAM/CRAM or VCF?
  • Are you using the default sites from somalier?
  • can you run mosdepth with --mapq 1 to re check depth? (default is to count mapping quality 0 reads)

Somalier is tuned for higher coverage samples, but I'm not seeing anything else in your output or in the code that would stop it from working on your samples.

I will also check that the cigar stuff is correct. It could be that it's close enough for illumina reads, but for long reads, whatever bug is triggered enough to be a problem.

Thanks for this list of detailed suggestions.
Here are my answers:

  • Regarding QC: we've run fingerprint verification on this sample, which passed. The mean read length is 19kbp. We've also done de novo assembly. The primary assembly indicates GC percentage is 40.7%, with N50 of 490181 and 2375 for L50. The assembly itself, however, is longer than expected (4Gbp). What other checks would you like to see reported?
  • We are running somalier on minimap2 generated BAMs.
  • We are using the sites file from release v0.2.15, and the docker version is the same.
  • Here is our mosdepth command: mosdepth -x -n -Q1 ~{prefix} ~{bam}

We also ran DeepVariant on this sample.
Running the following to count PASSing SNPs, we've found 1.76 million HOM_ALTs and 3.03 million HETs overall, which is not the same het/hom_alt ratio as in the somalier tables.

zcat < 1557026.deepvariant_pepper.vcf.gz  | grep -v "^#" | grep -F 'PASS' | awk 'BEGIN {OFS="\t"} {a=length($4); b=length($5); if (a==1 && b==1) print}' | grep -c "1/1"

Regarding the .somalier file, I'm afraid I cannot share this as the samples are protected.
If you can pin down which part of the file you'd be interested in, I can probably share that part.

Thanks for jumping on this so quickly!
Steve

Thanks for the clear report. It seems likely that there's something with somalier. I expect it's easily fixed, but I'll have to find the problem. I'll check for a few long-read samples to run it on and see if I can replicate.

If you have a multi-sample DeepVariant VCF with this sample, I'd be interested to see if those results are better. If it is single-sample, there are other problems with somalier not having enough info about hom-ref.

Unfortunately, it's not a joint VCF yet.

Can you elaborate on this, I'm not quite getting the point:

there are other problems with somalier not having enough info about hom-ref

Can you elaborate on this, I'm not quite getting the point:

there are other problems with somalier not having enough info about hom-ref

Actually, that's just for relatedness stuff, not for single-sample information, so you could try running on the single-sample VCF file.

Hi Brent,

Sorry for dropping the ball on this.
I followed your advice and ran with the single sample DeepVariant VCF, and here's the table, which give the same inference

#family_id sample_id paternal_id maternal_id sex phenotype original_pedigree_sex gt_depth_mean gt_depth_sd depth_mean depth_sd ab_mean ab_std n_hom_ref n_het n_hom_alt n_unknown p_middling_ab X_depth_mean X_n X_hom_ref X_het X_hom_alt Y_depth_mean Y_n
family 1557026 paternal maternal 1 -9 male 11.2 3.0 5.7 5.8 0.33 0.52 2136 3849 1444 9955 0.002 8.16 37 16 0 21 9.00 2

Changing the depth requirement to 4 didn't change things materially.
Both, unsurprisingly, generate a HET to HOM_ALT ratio closer to the WGS ratio (duh).

#family_id sample_id paternal_id maternal_id sex phenotype original_pedigree_sex gt_depth_mean gt_depth_sd depth_mean depth_sd ab_mean ab_std n_hom_ref n_het n_hom_alt n_unknown p_middling_ab X_depth_mean X_n X_hom_ref X_het X_hom_alt Y_depth_mean Y_n
family 1557026 paternal maternal 1 -9 male 10.5 3.4 5.7 5.8 0.44 0.37 2483 4263 1642 8996 0.002 8.16 37 16 0 21 9.00 2

Thanks,
Steve
Screen Shot 2022-09-15 at 2 52 17 PM

I'm really confused about what's going on. This is showing a much lower depth from DeepVariant calls.
The ratio of homalts is dramatically reduced from DV indicating there could be a problem with the somalier genotyping on long reads.
But, the ratio of autosomal to chrX, chrY depth is relatively unchanged.
So, I think two things are true:

  1. something is odd about the coverage on your sample. I would look at the mosdepth summary output and see if you have some trisomies. you can use covdist or your own plotting to look at the output in more ... depth.
  2. there is a bug in the genotyping in somalier for long reads (this doesn't affect your results related to sex vs autosome depth, but I will look into it).

I need to re-run mosdepth on that sample to get the plots, but here's the summary from the previous run

chrom	length	bases	mean	min	max
chr1	248956422	2760787872	11.09	0	7801
chr2	242193529	2618785137	10.81	0	361
chr3	198295559	2156666022	10.88	0	1947
chr4	190214555	2199934962	11.57	0	5891
chr5	181538259	1954884959	10.77	0	1477
chr6	170805979	1868390025	10.94	0	259
chr7	159345973	1725551069	10.83	0	891
chr8	145138636	1584893778	10.92	0	642
chr9	138394717	1306272197	9.44	0	199
chr10	133797422	1514116404	11.32	0	1229
chr11	135086622	1467878253	10.87	0	90
chr12	133275309	1422786382	10.68	0	627
chr13	114364328	1138077672	9.95	0	953
chr14	107043718	957978895	8.95	0	91
chr15	101991189	894579432	8.77	0	112
chr16	90338345	944218678	10.45	0	4696
chr17	83257441	870271120	10.45	0	1345
chr18	80373285	875065734	10.89	0	121
chr19	58617616	538146550	9.18	0	94
chr20	64444167	718687371	11.15	0	369
chr21	46709983	433343387	9.28	0	374
chr22	50818468	402498377	7.92	0	1216
chrX	156040895	832630526	5.34	0	155
chrY	57227415	225360913	3.94	0	2369

It doesn't look like it's trisomy.

So, mosdepth shows about the expected coverage ratios for a male, though the X is a bit high.

DeepVariant+somalier does not so it's not a problem to specific to how somalier is counting. So, perhaps there is something about the sites in the autosome chosen by somalier that are biased to have lower coverage, or the sites on chrX or chrY are biased to have higher coverage in CCS. The autosome is entirely coding sites, but the sites on chrX and chrY are any common sites, not necessarily coding.

Somalier does use sites near start of chromosomes X and Y, so maybe there is high coverage near the telomeres?

@zeeev, are there cases where CCS has bias to lower coverage in high-GC (in genes)?

quick responses

So, perhaps there is something about the sites in the autosome chosen by somalier that are biased to have lower coverage

I actually checked the coverage (with samtools depth) on these SNP sites (the released VCF), the mean coverage comes back to be 10 as well.

Somalier does use sites near start of chromosomes X and Y, so maybe there is high coverage near the telomeres?

Are PARs excluded in these sites?


another curious discovery, regarding sex "bias"?

HG00514

Based on the table posted in my previous ticket #105
(though the table posted there is malformed, corrected below)

#family_id sample_id paternal_id maternal_id sex phenotype original_pedigree_sex gt_depth_mean gt_depth_sd depth_mean depth_sd ab_mean ab_std n_hom_ref n_het n_hom_alt n_unknown p_middling_ab X_depth_mean X_n X_hom_ref X_het X_hom_alt Y_depth_mean Y_n
chs UnnamedSample 0 0 2 0 female 26.6 10.9 26.5 11.0 0.51 0.43 4964 3826 6122 2472 0.010 24.57 324 115 74 135 0.00 0

The coverage estimation doesn't look off for that sample (~30X), yet it's the same scenario where HOM.ALT calls are way more than HET calls.

Note this is HG00514, a female sample. Is this related to the sex of the sample?

Check on more protected samples

So I checked the results on two more protected samples, one female and one male, both around 10X.
Interestingly, the two male samples are seeing the inflated coverage (ratio 1.5), but not for the female sample.

Check on HG002

This is a male sample, the bam is released by GIAB/NIST
https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/HG002_NA24385_son/PacBio_CCS_15kb_20kb_chemistry2/GRCh38/

and mosdepth gives the following coverage summary

chr1	248956422	13969946903	56.11	0	28904
chr2	242193529	13700025724	56.57	0	2582
chr3	198295559	11329262682	57.13	0	16236
chr4	190214555	11662996790	61.31	0	40877
chr5	181538259	10261322700	56.52	0	9964
chr6	170805979	9861766821	57.74	0	7307
chr7	159345973	8824929365	55.38	0	1078
chr8	145138636	8252258159	56.86	0	2810
chr9	138394717	6700195676	48.41	0	719
chr10	133797422	7677494877	57.38	0	6042
chr11	135086622	7504549211	55.55	0	984
chr12	133275309	7441148334	55.83	0	668
chr13	114364328	5793614946	50.66	0	1978
chr14	107043718	5054143320	47.22	0	1580
chr15	101991189	4655047530	45.64	0	661
chr16	90338345	4784597624	52.96	0	27083
chr17	83257441	4376055735	52.56	0	1802
chr18	80373285	4457796186	55.46	0	647
chr19	58617616	2746694349	46.86	0	734
chr20	64444167	3634695784	56.40	0	3900
chr21	46709983	2339692268	50.09	0	3092
chr22	50818468	2052775521	40.39	0	26160
chrX	156040895	4369507111	28.00	0	305
chrY	57227415	1323232774	23.12	0	18494
...
total	3099844859	164920005749	53.20	0	73360

What I find is somewhat following this "sex-bias" trend (if we can call it that way).
The scaled depth on X and Y are—correctly—both around 1.
But look at the table below

#family_id sample_id paternal_id maternal_id sex phenotype original_pedigree_sex gt_depth_mean gt_depth_sd depth_mean depth_sd ab_mean ab_std n_hom_ref n_het n_hom_alt n_unknown p_middling_ab X_depth_mean X_n X_hom_ref X_het X_hom_alt Y_depth_mean Y_n
family HG002 paternal maternal 1 -9 male 75.7 25.3 75.7 25.3 0.51 0.40 4553 5112 5627 2092 0.004 36.66 353 170 0 183 37.73 15

Here's the somalier file for HG002

HG002.somalier.zip

This is really puzzling now.

🤔

I think there is some bias with higher coverage near start of the chromosomes. I will look into this next week. Thanks very much for following up. Should have a fix soon.
Can you verify which sites file you used for somalier?

It's this on the release page

https://github.com/brentp/somalier/files/3412456/sites.hg38.vcf.gz

I should also mention that we are using the no_alt version of GRCh38

ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz

And thank you for following through on this!
Steve

you could also try: https://github.com/brentp/somalier/files/4566475/sites.hg38.rna.vcf.gz
but I will have a look at this problem.

I had a look and didnt find anything obvious, but I'll look again next week. I rewrote the pileup and got the same answer (and verified it was correct), so I don't think that's the issue.

Thanks, Brent!

I'm a little confused by this

I rewrote the pileup and got the same answer (and verified it was correct)

Could you also try your new code on the GIAB-HG002 and see?
https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/HG002_NA24385_son/PacBio_CCS_15kb_20kb_chemistry2/GRCh38/

I'm observing inflated coverage estimation on that BAM by somalier (~1.5X inflation, see above).

Thanks!
Steve

I mean that it's counting alleles correctly, but I can see the issue you mention. So I'm still confused.