brentp / somalier

fast sample-swap and relatedness checks on BAMs/CRAMs/VCFs/GVCFs... "like damn that is one smart wine guy"


Runtime with large datasets

marcustutert opened this issue · comments

Hi,

If I am running somalier relate (after having previously extracted using somalier extract) on 20k samples and ~250 sites how long should I expect my command to take?

I have been running for about 20min or so and just wanted to get an idea of the runtime.

Best,
Marcus

Has it finished reading the files?
On my laptop, it does about 500 thousand pairs per second.
(20K * 20K)/2 is 200 million pairs. So it should be done soon.
But it might take a long time to write the output. I'm interested to see what it shows for stdout/err when it's done.
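
A quick back-of-the-envelope check of that estimate (a sketch; the ~500K pairs/sec rate is the laptop figure quoted above, not a guaranteed throughput):

```python
n = 20_000
pairs = n * n // 2          # the rough (20K * 20K)/2 figure quoted above
rate = 500_000              # ~pairs per second on a laptop
minutes = pairs / rate / 60
print(pairs, round(minutes, 1))  # 200000000 pairs, ~6.7 minutes
```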

Yes, it did 'finish', although I have gotten some out-of-memory errors I'm currently trying to track down. I suspect I'll need to request more memory from the cluster on a second go-round shortly.

Yeah, to store the matrices, it will need:

200000*200000*2 bytes * 4 (matrices) == 320GB

of memory. But it will probably need at least double that because of the json and other overhead.
So give it as much as you can!
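
The arithmetic above, as a reusable sketch (the 2-byte cell size, four-matrix count, and the 200,000 figure are taken from the quote, not verified against the somalier source):

```python
def matrix_gb(n_samples: int, n_matrices: int = 4, bytes_per_cell: int = 2) -> float:
    """Raw all-vs-all matrix storage, in decimal GB (before JSON/other overhead)."""
    return n_samples * n_samples * bytes_per_cell * n_matrices / 1e9

print(matrix_gb(200_000))  # 320.0, matching the figure above
```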

Thanks Brent

I think requesting that much memory might be a bit tricky to do -- is there a way to filter/lower the memory overhead here? If it helps, I really only care about samples that are highly related (can go to something like IBD of 0.5 or so, approx 1st degree relative pairs).

It has to do all vs all now. So you can do subsets of e.g. 10K which would use much less memory.
It does save memory on the JSON step for unrelated (expected and observed) but it compares each sample to all other samples.

I will think about other solutions in the software, but for now, you'd need to subset or get more memory.
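
One wrinkle with the subsetting workaround: to cover every cross-sample comparison with 10K subsets, each unordered pair of subsets has to be related together. A rough sketch of generating those run inputs (a manual workaround, not a somalier feature; some pairs are computed more than once):

```python
from itertools import combinations_with_replacement

def covering_runs(samples, size=10_000):
    # Split into chunks; one relate run per unordered chunk pair ensures
    # every sample pair is compared at least once.
    chunks = [samples[i:i + size] for i in range(0, len(samples), size)]
    for a, b in combinations_with_replacement(chunks, 2):
        yield a if a is b else a + b

runs = list(covering_runs(list(range(25)), size=10))
print(len(runs))  # 3 chunks -> 6 runs
```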

Thanks Brent -- I managed to secure enough(?) memory that I think that I'm making progress here.

I'm at the point where the command has generated the .html output, and a pairs.tsv file (which is still being populated, currently at 3GB in size) and a .samples.tsv which hasn't been populated yet (0B size).

What size would you guess these files will eventually end up at?

Here is my full output so far (currently still running):
somalier version: 0.2.15
[somalier] starting read of 21465 samples
[somalier] time to read files and get per-sample stats for 21465 samples: 38.52
[somalier] time to get expected relatedness from pedigree graph: 0.05
[somalier] html and text output will have unrelated sample-pairs subset to 0.04% of points

Update:

Seems to have hit a snag towards the end of running:
(base) mt27@hm-02:/lustre/scratch123/hgi/mdt2/projects/ibdgwas_bioresource/mt27/CSI/Scratch$ /lustre/scratch123/hgi/projects/ibdgwas_bioresource/mt27/somalier/somalier relate "/lustre/scratch123/hgi/projects/ibdgwas_bioresource/mt27/CSI/Scratch/WES/somalier_output/*"
somalier version: 0.2.15
[somalier] starting read of 21465 samples
[somalier] time to read files and get per-sample stats for 21465 samples: 38.52
[somalier] time to get expected relatedness from pedigree graph: 0.05
[somalier] html and text output will have unrelated sample-pairs subset to 0.04% of points
[somalier] time to calculate all vs all relatedness for all 230362380 combinations: 1007.99
io.nim(139) raiseEIO
Error: unhandled exception: cannot write string to file [IOError]

Have included command and stderr here

That's writing the html, I suspect. Did you run out of disk space in your output directory?

No, I don't think I hit a disk-space error.
If it makes a difference, my html ended up at 5.9GB and my pairs.tsv at 5.5GB

Hmm. I don't see anything obvious. But I'd double check that there isn't a limit on your scratch size or disk usage or something.
If that doesn't work, I'll get you a binary that more aggressively subsets the pairs that are output (it will still output any interesting pairs, just not those that appear unrelated and are indicated as unrelated by your pedigree file).

I've confirmed with IT that they don't think it's an issue with the quota, but I've just tried to run it again with even more memory (previously 200GB, now 1TB), so I'll see if that makes any difference. In case it doesn't, the binary which more aggressively subsets the pairs would be brilliant; for my purposes I'd be happy with anything above 0.4 relatedness. May I suggest a future feature that allows users to specify the filter themselves? No worries if that's too much work, it was just an idea!

Thanks.

OK. Let me know. I can make you a binary in a few minutes. Might make it a hidden option.
I know you just want your results (and we'll get them), but I am pleased with:

time to calculate all vs all relatedness for all 230362380 combinations: 1007.99

That's 228K pairs per second. :)
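
Both figures in that log line check out (a quick verification: 230,362,380 is exactly 21465 choose 2, and the rate follows from the elapsed time):

```python
import math

n = 21_465
combos = math.comb(n, 2)        # all-vs-all unordered pairs
print(combos)                   # 230362380, matching the log line
print(round(combos / 1007.99))  # ~228536 pairs per second
```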

Brilliant, that would be great if you could! The option to pre-specify that filter would be amazing for my use-cases.

Yes, have to say the runtime for 20k samples is insanely impressive!

Yes, running it a second time with increased memory still led to the same error as above. Trying out that binary would be helpful for diagnosing the issue.

Ok. Give me 24 hours or so.

Hi, can you try this binary (gunzip and chmod +x ).
somalier.gz
run as: SOMALIER_SAMPLE_RATE=0 ./somalier ...

Thanks for pulling this together for me, Brent -- I assume that SOMALIER_SAMPLE_RATE=0 sets the min relatedness (so I could set it to, say, SOMALIER_SAMPLE_RATE=0.4) and run somalier as previously?

Hi Brent -- unfortunately I'm still back where I started with the EIO errors I highlighted above. I tried the new binary you kindly provided, and ran it such that I should only sample 0.1% of the pairs (points). Here were my command and resulting output:

(base) mt27@hm-04:/lustre/scratch123/hgi/projects/ibdgwas_bioresource/mt27/somalier$ SOMALIER_SAMPLE_RATE=0.001 /lustre/scratch123/hgi/projects/ibdgwas_bioresource/mt27/somalier/somalier_updated relate "/lustre/scratch123/hgi/projects/ibdgwas_bioresource/mt27/CSI/Scratch/WES/somalier_output/*"
somalier version: 0.2.15
[somalier] time to read files and get per-sample stats for 21465 samples: 22.18
[somalier] time to get expected relatedness from pedigree graph: 0.05
[somalier] got sampling rate 0.001 from env var
[somalier] html and text output will have unrelated sample-pairs subset to 0.10% of points
[somalier] time to calculate all vs all relatedness for all 230362380 combinations: 1730.05
io.nim(139)              raiseEIO
Error: unhandled exception: cannot write string to file [IOError]

Based on the size of the output files as compared to yesterday and earlier runs of the code, I'm fairly convinced this isn't a disk quota issue on my end as was suggested earlier. Do you have any tips to debug this?

Best,
Marcus

Hi, you want SOMALIER_SAMPLE_RATE=0

Hi Brent, I've just tried that but have run into the same issue as above. Do you have any other suggestions on what could be going wrong?

what does the output show?

and how large in MB and lines is the TSV file?

(base) mt27@hm-04:/lustre/scratch123/hgi/projects/ibdgwas_bioresource/mt27/somalier$ SOMALIER_SAMPLE_RATE=0 /lustre/scratch123/hgi/projects/ibdgwas_bioresource/mt27/somalier/somalier_updated relate "/lustre/scratch123/hgi/projects/ibdgwas_bioresource/mt27/CSI/Scratch/WES/somalier_output/*"
somalier version: 0.2.15
[somalier] time to read files and get per-sample stats for 21465 samples: 22.18
[somalier] time to get expected relatedness from pedigree graph: 0.05
[somalier] got sampling rate 0 from env var
[somalier] html and text output will have unrelated sample-pairs subset to 0.00% of points
[somalier] time to calculate all vs all relatedness for all 230362380 combinations: 1725.2
io.nim(139)              raiseEIO
Error: unhandled exception: cannot write string to file [IOError]

The TSV file .pairs is 5.5GB in size, with 68660483 lines
The .samples file is empty
The .html is 5.9GB in size

ok. so the TSV file is complete. Unfortunately, you still have 68.6 million sample-pairs with a relatedness > 0.05.
That is what's causing the problem during the JSON -> string -> file step.

Can you do:

 awk '$3 > 0.25' $tsv | wc -l

and let me know the result. If that's a reasonable number, we can be sure that allowing to set the relatedness cutoff will resolve this.

That's quite a few!
Running the awk command gave:
1719229

Using a stricter filter of 0.4 (which I will likely end up using for the bulk of my downstream analyses of this data), I get:
29505

OK. Thanks for sticking with it. We are getting close. Can you try this binary?
and use as:

SOMALIER_SAMPLE_RATE=0 SOMALIER_RELATEDNESS_CUTOFF=0.25 somalier_test ...

you can set the cutoff higher, to e.g. 0.3, but 1.7 million points should be ok.

somalier_test.gz

and just to re-iterate, the pairs.tsv output is complete and fine to use.

Hi Brent -- good news: we've finally had a full run-through of the code (again, thanks for all your help here!), but the output I am getting looks a bit off(?) and I'm having trouble parsing some of it.

To reiterate, here was my command and output

SOMALIER_SAMPLE_RATE=0 SOMALIER_RELATEDNESS_CUTOFF=0.4 /lustre/scratch123/hgi/projects/ibdgwas_bioresource/mt27/somalier/somalier relate "/lustre/scratch123/hgi/projects/ibdgwas_bioresource/mt27/CSI/Scratch/WES/somalier_output/*"
somalier version: 0.2.16
[somalier] starting read of 21465 samples
[somalier] time to read files and get per-sample stats for 21465 samples: 36.65
[somalier] time to get expected relatedness from pedigree graph: 0.05
[somalier] got sampling rate 0.0 from env var
[somalier] setting relatedness cutoff to: 0.4 from env var
[somalier] html and text output will have unrelated sample-pairs subset to 0.00% of points
[somalier] time to calculate all vs all relatedness for all 230362380 combinations: 707.58
[somalier] wrote interactive HTML output for 29504 pairs to: somalier.html
[somalier] joining families EGAN00002590822 and EGAN00002586516 because of relatedness > 0.2
[somalier] joining families EGAN00002258766 and EGAN00002590822 because of relatedness > 0.2
[somalier] joining families EGAN00002258766 and EGAN00002228367 because of relatedness > 0.2
[somalier] joining families EGAN00002258766 and EGAN00002467345 because of relatedness > 0.2
[somalier] joining families EGAN00002647309 and EGAN00001730469 because of relatedness > 0.2
[somalier] joining families EGAN00002647309 and EGAN00002258766 because of relatedness > 0.2
[somalier] joining families EGAN00002647309 and EGAN00002228729 because of relatedness > 0.2
[somalier] wrote groups to: somalier.groups.tsv (look at this for cancer samples)
[somalier] wrote samples to: somalier.samples.tsv
[somalier] wrote pair-wise relatedness metrics to: somalier.pairs.tsv

Note that I've cut out some of the joining families lines just to grab you the main details of the output.
Two quick questions:

  1. What does it mean when it is joining those families?
  2. I did a quick look at the distribution of the relatedness (again, remembering I pre-filtered for >0.4) and am seeing an odd distribution. These samples (barring a few duplicates that were just human error in the wet lab) are all from different, randomly sampled and (relatively) unrelated individuals. That being the case, I would expect a higher peak at 0.5 rather than 0.4, since 1st-degree relative pairs would create a peak there. In fact, if I look at the entire dataset (without filtering, examining all 68 million pairs), I end up with a distribution that just skews heavily towards 0, without any of the characteristic peaks at 1st/2nd/3rd-degree relatives that you might expect in a random sample like this.

Am I misinterpreting something or doing something wrong?

Thanks,
Marcus
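
One way to eyeball that distribution directly from the full pairs.tsv is to bin the relatedness column (a sketch; the column position follows the `$3` used in the awk command earlier in this thread, so verify it against your file's header):

```python
import csv
import io
from collections import Counter

def relatedness_histogram(tsv_text: str, col: int = 2, bin_width: float = 0.05) -> Counter:
    """Count pairs per relatedness bin; `col` is 0-based (awk's $3 is col=2)."""
    counts = Counter()
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    next(reader)  # skip header
    for row in reader:
        bin_floor = float(row[col]) // bin_width * bin_width
        counts[round(bin_floor, 2)] += 1
    return counts

# Tiny demo input; a real run would stream the 5.5GB somalier.pairs.tsv instead.
demo = "sample_a\tsample_b\trelatedness\ns1\ts2\t0.49\ns1\ts3\t0.02\ns2\ts3\t0.51\n"
hist = relatedness_histogram(demo)
print(hist[0.45], hist[0.0], hist[0.5])
```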

Hi, glad you got it running:

  1. you can ignore that. somalier has a --infer option to infer pedigrees. I should hide those messages if --infer is not used.
  2. let's assume that all of your sample pairs are unrelated. then you would still see a distribution. and that distribution gets wider with lower quality pairs (or samples). so you may see pairs at 0.5 because one or both of them is low quality.
    I would look at the sample plot (right-hand side) in the HTML and see if there are any/many obvious sample outliers.
    Then you could re-run after removing those (you could do an awk on the samples.tsv file once you use the plot to decide some cutoffs).
    Remember that if you have a single bad sample, then 20,000+ pairs will be affected. If you have 100 bad/low-quality samples ...
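
The awk-on-samples.tsv idea above could equally be scripted. A minimal sketch of filtering on a per-sample QC column (the column names `sample_id` and `gt_depth_mean` are assumptions here; check the actual header of your somalier.samples.tsv and substitute whatever cutoffs the plot suggests):

```python
import csv
import io

def filter_samples(tsv_text: str, column: str, cutoff: float) -> list:
    """Return sample IDs whose `column` value is at or above `cutoff`."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row["sample_id"] for row in reader if float(row[column]) >= cutoff]

# Demo rows only; real input is the somalier.samples.tsv written by relate.
demo = "sample_id\tgt_depth_mean\nEGAN_A\t31.5\nEGAN_B\t4.2\n"
print(filter_samples(demo, "gt_depth_mean", 10.0))  # ['EGAN_A']
```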

If you do have a few related samples, the signal for that in the plot you are describing would be low enough that it doesn't break out above the noise.

If you share some of the sample plots, I can try to help diagnose further.

Thanks for the continued help here Brent. Good to know I can ignore those messages.
That's a helpful tip to keep in mind.
Here is an example of some of the html plots, let me know which ones (seems as if there are quite a few options) would be best to help you diagnose further:
[image: screenshot of the somalier HTML plots]

Well, you clearly have some sample duplicates (relatedness == 1).

How many known sites do you have in general? I see from your other issue that you said 800 sites. Why only 800? You should get ~15K for high quality exomes.
You'll get much better resolution on relatedness if you can use more sites.

Yes, it appears those are duplicate samples that made their way past the pre-sequencing QC.

I have about 20k sites I can use -- the reason for the filtering is that I'll soon be running somalier on a joint subset of sequences (some of which weren't run using WES, and others that were genotyped, etc.), and the ~1000 sites are the intersection of the sites across the different cohorts which passed QC.

Just getting back to what you earlier mentioned,
"let's assume that all of your sample pairs are unrelated. then you would still see a distribution. and that distribution gets wider with lower quality pairs (or samples). so you may see pairs at 0.5 because one or both of them is low quality.
I would look at the sample plot (right-hand side) in the HTML and see if there are any/many obvious sample outliers.
Then you could re-run after removing those (you could do an awk on the samples.tsv file once you use the plot to decide some cutoffs)."

Do these QC plots indicate to you anything about there being obvious sample outliers in the data? Also, I'm not sure what you mean qualitatively by "wider" here; could you explain a bit more, or differently?

Thanks!

it's much easier to see outliers with a larger set of variants.
I'd also click on the links at the bottom of the somalier html output and look at each of those sets of plots.

It will take longer per-sample for CRAMs, and the speed will depend almost entirely on disk (random-access) speed. So if you are running thousands of jobs at once, whatever network drive you have could get hammered by them -- though I assume with 20K samples, you are well equipped for this type of problem.
On a decent machine+disk, a cram should take about 3 cpu-minutes.

Thanks for the update there Brent.

Bit of a different question (same dataset), will somalier automatically strip some IDs based on prefixes/suffixes?
I am using somalier on a directory of files with the same EGAN ID, but half of the IDs have a prefix of CRAM_... in front of them. Yet for some reason, when I look at the TSV that somalier outputs, the CRAM_ identifier has gone missing, which I require for some downstream analysis.

if they have the same ID, it must be because the SM tag is the same in the read-group within the bam/cram (somalier uses the read-group, not the file name).
when you run somalier extract, you can specify, for example, --sample-prefix extra- (or, in your case, --sample-prefix CRAM_) so that the value you see in the final plot would be extra-$sampleid.
(later, you can also specify --sample-prefix $prefix when you run somalier relate so that these added prefixes are removed from the plot and output)

Hi Brent, maybe I'm doing something daft, but I just tried that and got:

(base) mt27@node-11-2-3:/lustre/scratch123/hgi/projects/ibdgwas_bioresource/mt27/CSI/Scratch/WES/problematic_somalier_output$ \
  SOMALIER_SAMPLE_RATE=0 SOMALIER_RELATEDNESS_CUTOFF=0.4 \
  /lustre/scratch123/hgi/projects/ibdgwas_bioresource/mt27/somalier/somalier relate \
   "/lustre/scratch123/hgi/projects/ibdgwas_bioresource/mt27/CSI/Scratch/WES/problematic_somalier/*" \
  --sample-prefix CRAM
somalier version: 0.2.16
[somalier] starting read of 1126 samples
fatal.nim(49)            sysFatal
Error: unhandled exception: relate.nim(423, 14) `sl == formatVersion` expected matching versions got 69, expected 2 [AssertionDefect]

My somalier files look like (so half of them have the CRAM prefix and the other half don't)
[image: listing of the .somalier files, half with the CRAM_ prefix]

(I just edited your comment to adjust formatting. you can use 3 backticks for multiline)

So you ran: somalier extract --sample-prefix CRAM ... for the relevant samples?

Then you don't need --sample-prefix CRAM for somalier relate (though that should work).

No, sorry -- I guess I will need to run the extraction first with the prefix, then run relate?
I didn't realize I needed to rerun the extraction.

yes, that's right. the extract step has the option to adjust the sample name for exactly this reason.
it uses sample-id as a unique identifier.
and, I see I wrote somalier relate above where I meant somalier extract. Sorry about that. I'll edit that comment so I don't cause more confusion.