brentp / somalier

fast sample-swap and relatedness checks on BAMs/CRAMs/VCFs/GVCFs... "like damn that is one smart wine guy"


Runtime with large datasets

marcustutert opened this issue · comments

Hi,

If I am running somalier relate (after having previously extracted using somalier extract) on 20k samples and ~250 sites how long should I expect my command to take?

I have been running for about 20min or so and just wanted to get an idea of the runtime.

Best,
Marcus

Has it finished reading the files?
On my laptop, it does about 500 thousand pairs per second.
(20K * 20K)/2 is 200 million pairs. So it should be done soon.
But it might take a long time to write the output. I'm interested to see what it shows for stdout/err when it's done.
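
A quick back-of-the-envelope check of that estimate (a sketch; the ~500K pairs/sec rate is the laptop figure quoted above, not a guaranteed throughput):

```python
n = 20_000
pairs = n * n // 2          # the rough (20K * 20K)/2 figure quoted above
rate = 500_000              # ~pairs per second on a laptop
minutes = pairs / rate / 60
print(pairs, round(minutes, 1))  # 200000000 pairs, ~6.7 minutes
```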

Yes, it did 'finish', although I have gotten some out-of-memory errors I'm currently trying to track down. I suspect I'll need to request more memory from the cluster on a second go-round shortly.

Yeah, to store the matrices, it will need:

200000*200000*2 bytes * 4 (matrices) == 320GB

of memory. But it will probably need at least double that because of the json and other overhead.
So give it as much as you can!
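
The arithmetic above, as a reusable sketch (the 2-byte cell size, four-matrix count, and the 200,000 figure are taken from the quote, not verified against the somalier source):

```python
def matrix_gb(n_samples: int, n_matrices: int = 4, bytes_per_cell: int = 2) -> float:
    """Raw all-vs-all matrix storage, in decimal GB (before JSON/other overhead)."""
    return n_samples * n_samples * bytes_per_cell * n_matrices / 1e9

print(matrix_gb(200_000))  # 320.0, matching the figure above
```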

Thanks Brent

I think requesting that much memory might be a bit tricky to do -- is there a way to filter/lower the memory overhead here? If it helps, I really only care about samples that are highly related (can go to something like IBD of 0.5 or so, approx 1st degree relative pairs).

It has to do all vs all now. So you can do subsets of e.g. 10K which would use much less memory.
It does save memory on the JSON step for unrelated (expected and observed) but it compares each sample to all other samples.

I will think about other solutions in the software, but for now, you'd need to subset or get more memory.
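
One wrinkle with the subsetting workaround: to cover every cross-sample comparison with 10K subsets, each unordered pair of subsets has to be related together. A rough sketch of generating those run inputs (a manual workaround, not a somalier feature; some pairs are computed more than once):

```python
from itertools import combinations_with_replacement

def covering_runs(samples, size=10_000):
    # Split into chunks; one relate run per unordered chunk pair ensures
    # every sample pair is compared at least once.
    chunks = [samples[i:i + size] for i in range(0, len(samples), size)]
    for a, b in combinations_with_replacement(chunks, 2):
        yield a if a is b else a + b

runs = list(covering_runs(list(range(25)), size=10))
print(len(runs))  # 3 chunks -> 6 runs
```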

Thanks Brent -- I managed to secure enough(?) memory that I think that I'm making progress here.

I'm at the point where the command has generated the .html output, and a pairs.tsv file (which is still being populated, currently at 3GB in size) and a .samples.tsv which hasn't been populated yet (0B size).

What size would you guess these files will eventually end up at?

Here is my full output so far (currently still running):
somalier version: 0.2.15
[somalier] starting read of 21465 samples
[somalier] time to read files and get per-sample stats for 21465 samples: 38.52
[somalier] time to get expected relatedness from pedigree graph: 0.05
[somalier] html and text output will have unrelated sample-pairs subset to 0.04% of points

Update:

Seems to have hit a snag towards the end of running:
(base) mt27@hm-02:/lustre/scratch123/hgi/mdt2/projects/ibdgwas_bioresource/mt27/CSI/Scratch$ /lustre/scratch123/hgi/projects/ibdgwas_bioresource/mt27/somalier/somalier relate "/lustre/scratch123/hgi/projects/ibdgwas_bioresource/mt27/CSI/Scratch/WES/somalier_output/*"
somalier version: 0.2.15
[somalier] starting read of 21465 samples
[somalier] time to read files and get per-sample stats for 21465 samples: 38.52
[somalier] time to get expected relatedness from pedigree graph: 0.05
[somalier] html and text output will have unrelated sample-pairs subset to 0.04% of points
[somalier] time to calculate all vs all relatedness for all 230362380 combinations: 1007.99
io.nim(139) raiseEIO
Error: unhandled exception: cannot write string to file [IOError]

Have included command and stderr here

That's writing the html, I suspect. Did you run out of disk space in your output directory?

No, I don't think I hit a disk-space error.
If it makes a difference, my html ended up at 5.9GB and my pairs.tsv at 5.5GB

Hmm. I don't see anything obvious. But I'd double check that there isn't a limit on your scratch size or disk usage or something.
If that doesn't work, I'll get you a binary that more aggressively subsets the pairs that are output (it will still output any interesting pairs, just not those that appear unrelated and are indicated as unrelated by your pedigree file).

I've confirmed with IT that they don't think it's an issue with the quota, but I've just tried to run it again with even more memory (previously 200GB, now 1TB), so I'll see if that makes any difference. In case it doesn't, the binary which more aggressively subsets the pairs would be brilliant; for my purposes I'd be happy with anything above 0.4 relatedness. May I suggest a future feature that allows users to specify the filter themselves? No worries if that's too much work, it was just an idea!

Thanks.

OK. Let me know. I can make you a binary in a few minutes. Might make it a hidden option.
I know you just want your results (and we'll get them), but I am pleased with:

time to calculate all vs all relatedness for all 230362380 combinations: 1007.99

That's 228K pairs per second. :)
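
Both figures in that log line check out (a quick verification: 230,362,380 is exactly 21465 choose 2, and the rate follows from the elapsed time):

```python
import math

n = 21_465
combos = math.comb(n, 2)        # all-vs-all unordered pairs
print(combos)                   # 230362380, matching the log line
print(round(combos / 1007.99))  # ~228536 pairs per second
```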

Brilliant, that would be great if you could! The option to pre-specify that filter would be amazing for my use-cases.

Yes, have to say the runtime for 20k samples is insanely impressive!

Yes, running it a second time with increased memory still led to the same error as above. Trying out that binary would be helpful for diagnosing the issue.

Ok. Give me 24 hours or so.

Hi, can you try this binary (gunzip and chmod +x ).
somalier.gz
run as: SOMALIER_SAMPLE_RATE=0 ./somalier ...

Thanks for pulling this together for me, Brent -- I assume that SOMALIER_SAMPLE_RATE=0 sets the min relatedness (so I could set it to, say, SOMALIER_SAMPLE_RATE=0.4) and run somalier as previously?

Hi Brent -- unfortunately I'm still back where I started with the EIO errors I highlighted above. I tried the new binary you kindly provided, and ran it such that I should only sample 0.1% of the pairs (points). Here were my command and resulting output:

(base) mt27@hm-04:/lustre/scratch123/hgi/projects/ibdgwas_bioresource/mt27/somalier$ SOMALIER_SAMPLE_RATE=0.001 /lustre/scratch123/hgi/projects/ibdgwas_bioresource/mt27/somalier/somalier_updated relate "/lustre/scratch123/hgi/projects/ibdgwas_bioresource/mt27/CSI/Scratch/WES/somalier_output/*"
somalier version: 0.2.15
[somalier] time to read files and get per-sample stats for 21465 samples: 22.18
[somalier] time to get expected relatedness from pedigree graph: 0.05
[somalier] got sampling rate 0.001 from env var
[somalier] html and text output will have unrelated sample-pairs subset to 0.10% of points
[somalier] time to calculate all vs all relatedness for all 230362380 combinations: 1730.05
io.nim(139)              raiseEIO
Error: unhandled exception: cannot write string to file [IOError]

Based on the size of the output files as compared to yesterday and earlier runs of the code, I'm fairly convinced this isn't a disk quota issue on my end as was suggested earlier. Do you have any tips to debug this?

Best,
Marcus

Hi, you want SOMALIER_SAMPLE_RATE=0

Hi Brent, I've just tried that but have run into the same issue as above. Do you have any other suggestions on what could be going wrong?

what does the output show?

and how large in MB and lines is the TSV file?

(base) mt27@hm-04:/lustre/scratch123/hgi/projects/ibdgwas_bioresource/mt27/somalier$ SOMALIER_SAMPLE_RATE=0 /lustre/scratch123/hgi/projects/ibdgwas_bioresource/mt27/somalier/somalier_updated relate "/lustre/scratch123/hgi/projects/ibdgwas_bioresource/mt27/CSI/Scratch/WES/somalier_output/*"
somalier version: 0.2.15
[somalier] time to read files and get per-sample stats for 21465 samples: 22.18
[somalier] time to get expected relatedness from pedigree graph: 0.05
[somalier] got sampling rate 0 from env var
[somalier] html and text output will have unrelated sample-pairs subset to 0.00% of points
[somalier] time to calculate all vs all relatedness for all 230362380 combinations: 1725.2
io.nim(139)              raiseEIO
Error: unhandled exception: cannot write string to file [IOError]

The TSV file .pairs is 5.5GB in size, with 68660483 lines
The .samples file is empty
The .html is 5.9GB in size

ok. so the TSV file is complete. Unfortunately, you still have 68.6 million sample-pairs with a relatedness > 0.05.
That is what's causing the problem during the JSON -> string -> file step.

Can you do:

 awk '$3 > 0.25' $tsv | wc -l

and let me know the result. If that's a reasonable number, we can be sure that allowing to set the relatedness cutoff will resolve this.

That's quite a few!
Running the awk command gave:
1719229

Using a stricter filter of 0.4 (which I will likely end up using for the bulk of my downstream analyses of this data), I get:
29505

OK. Thanks for sticking with it. We are getting close. Can you try this binary?
and use as:

SOMALIER_SAMPLE_RATE=0 SOMALIER_RELATEDNESS_CUTOFF=0.25 somalier_test ...

you can set the cutoff higher, to e.g. 0.3, but 1.7 million points should be ok.

somalier_test.gz

and just to re-iterate, the pairs.tsv output is complete and fine to use.

Hi Brent -- good news: we've finally had a full run-through of the code (again, thanks for all your help here!), but the output I am getting looks a bit off(?) and I'm having trouble parsing some of it.

To reiterate, here was my command and output

SOMALIER_SAMPLE_RATE=0 SOMALIER_RELATEDNESS_CUTOFF=0.4 /lustre/scratch123/hgi/projects/ibdgwas_bioresource/mt27/somalier/somalier relate "/lustre/scratch123/hgi/projects/ibdgwas_bioresource/mt27/CSI/Scratch/WES/somalier_output/*"
somalier version: 0.2.16
[somalier] starting read of 21465 samples
[somalier] time to read files and get per-sample stats for 21465 samples: 36.65
[somalier] time to get expected relatedness from pedigree graph: 0.05
[somalier] got sampling rate 0.0 from env var
[somalier] setting relatedness cutoff to: 0.4 from env var
[somalier] html and text output will have unrelated sample-pairs subset to 0.00% of points
[somalier] time to calculate all vs all relatedness for all 230362380 combinations: 707.58
[somalier] wrote interactive HTML output for 29504 pairs to: somalier.html
[somalier] joining families EGAN00002590822 and EGAN00002586516 because of relatedness > 0.2
[somalier] joining families EGAN00002258766 and EGAN00002590822 because of relatedness > 0.2
[somalier] joining families EGAN00002258766 and EGAN00002228367 because of relatedness > 0.2
[somalier] joining families EGAN00002258766 and EGAN00002467345 because of relatedness > 0.2
[somalier] joining families EGAN00002647309 and EGAN00001730469 because of relatedness > 0.2
[somalier] joining families EGAN00002647309 and EGAN00002258766 because of relatedness > 0.2
[somalier] joining families EGAN00002647309 and EGAN00002228729 because of relatedness > 0.2
[somalier] wrote groups to: somalier.groups.tsv (look at this for cancer samples)
[somalier] wrote samples to: somalier.samples.tsv
[somalier] wrote pair-wise relatedness metrics to: somalier.pairs.tsv

Note that I've cut out some of the joining families lines just to grab you the main details of the output.
Two quick questions:

  1. What does it mean when it is joining those families?
  2. I did a quick look at the distribution of the relatedness (again, remembering I pre-filtered for >0.4) and am seeing an odd distribution. These samples (barring a few duplicates that were just human error in the wet lab) are all from different, randomly sampled and (relatively) unrelated individuals. That being the case, I would expect a higher peak at 0.5 rather than 0.4, since 1st-degree relative pairs would create a peak there. In fact, if I look at the entire dataset (without filtering, examining all 68 million pairs), I end up with a distribution that just skews heavily towards 0, without any of the characteristic peaks at 1st/2nd/3rd-degree relatives that you might expect in a random sample like this.

Am I misinterpreting something or doing something wrong?

Thanks,
Marcus
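
One way to eyeball that distribution directly from the full pairs.tsv is to bin the relatedness column (a sketch; the column position follows the `$3` used in the awk command earlier in this thread, so verify it against your file's header):

```python
import csv
import io
from collections import Counter

def relatedness_histogram(tsv_text: str, col: int = 2, bin_width: float = 0.05) -> Counter:
    """Count pairs per relatedness bin; `col` is 0-based (awk's $3 is col=2)."""
    counts = Counter()
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    next(reader)  # skip header
    for row in reader:
        bin_floor = float(row[col]) // bin_width * bin_width
        counts[round(bin_floor, 2)] += 1
    return counts

# Tiny demo input; a real run would stream the 5.5GB somalier.pairs.tsv instead.
demo = "sample_a\tsample_b\trelatedness\ns1\ts2\t0.49\ns1\ts3\t0.02\ns2\ts3\t0.51\n"
hist = relatedness_histogram(demo)
print(hist[0.45], hist[0.0], hist[0.5])
```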

Hi, glad you got it running:

  1. you can ignore that. somalier has a --infer option to infer pedigrees. I should hide those messages if --infer is not used.
  2. let's assume that all of your sample pairs are unrelated. then you would still see a distribution. and that distribution gets wider with lower quality pairs (or samples). so you may see pairs at 0.5 because one or both of them is low quality.
    I would look at the sample plot (right-hand side) in the HTML and see if there are any/many obvious sample outliers.
    Then you could re-run after removing those (you could do an awk on the samples.tsv file once you use the plot to decide some cutoffs).
    Remember that if you have a single bad sample, then 20,000+ pairs will be affected. If you have 100 bad/low-quality samples ...
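
The awk-on-samples.tsv idea above could equally be scripted. A minimal sketch of filtering on a per-sample QC column (the column names `sample_id` and `gt_depth_mean` are assumptions here; check the actual header of your somalier.samples.tsv and substitute whatever cutoffs the plot suggests):

```python
import csv
import io

def filter_samples(tsv_text: str, column: str, cutoff: float) -> list:
    """Return sample IDs whose `column` value is at or above `cutoff`."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row["sample_id"] for row in reader if float(row[column]) >= cutoff]

# Demo rows only; real input is the somalier.samples.tsv written by relate.
demo = "sample_id\tgt_depth_mean\nEGAN_A\t31.5\nEGAN_B\t4.2\n"
print(filter_samples(demo, "gt_depth_mean", 10.0))  # ['EGAN_A']
```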

If you do have a few related samples, the signal for that in the plot you are describing would be low enough that it doesn't break out above the noise.

If you share some of the sample plots, I can try to help diagnose further.

Thanks for the continued help here Brent. Good to know I can ignore those messages.
That's a helpful tip to keep in mind.
Here is an example of some of the html plots, let me know which ones (seems as if there are quite a few options) would be best to help you diagnose further:
[image: screenshot of the somalier HTML plots]

Well, you clearly have some sample duplicates (relatedness == 1).

How many known sites do you have in general? I see from your other issue that you said 800 sites. Why only 800? You should get ~15K for high quality exomes.
You'll get much better resolution on relatedness if you can use more sites.

Yes, it appears those are duplicate samples that made their way past the pre-sequencing QC.

I have about 20k sites I can use -- the reason for the filtering is that I'll soon be running somalier on a joint subset of sequences (some of which weren't run using WES, and others that were genotyped, etc.), and the ~1000 sites are the intersection of the sites across the different cohorts which passed QC.

Just getting back to what you earlier mentioned,
"let's assume that all of your sample pairs are unrelated. then you would still see a distribution. and that distribution gets wider with lower quality pairs (or samples). so you may see pairs at 0.5 because one or both of them is low quality.
I would look at the sample plot (right-hand side) in the HTML and see if there are any/many obvious sample outliers.
Then you could re-run after removing those (you could do an awk on the samples.tsv file once you use the plot to decide some cutoffs)."

Do these QC plots indicate to you anything about there being obvious sample outliers in the data? Also, I'm not sure what you mean qualitatively by "wider" here; could you explain a bit more, or differently?

Thanks!

it's much easier to see outliers with a larger set of variants.
I'd also click on the links at the bottom of the somalier html output and look at each of those sets of plots.

It will take longer per-sample for CRAMs, and the speed will depend almost entirely on disk (random-access) speed. So if you are running thousands of jobs at once, whatever network drive you have could get hammered by them -- though I assume with 20K samples, you are well equipped for this type of problem.
On a decent machine+disk, a cram should take about 3 cpu-minutes.

Thanks for the update there Brent.

Bit of a different question (same dataset), will somalier automatically strip some IDs based on prefixes/suffixes?
I am using somalier on a directory of files with the same EGAN ID, but half of the IDs have a prefix of CRAM_... in front of them. Yet for some reason, when I look at the TSV that somalier outputs, the CRAM_ identifier has gone missing, which I require for some downstream analysis.

if they have the same ID, it must be because the SM tag is the same in the read-group within the bam/cram (somalier uses the read-group, not the file name).
when you run somalier extract, you can specify, for example, --sample-prefix extra- (or, in your case, --sample-prefix CRAM_) so that the value you see in the final plot would be extra-$sampleid.
(later, you can also specify --sample-prefix $prefix when you run somalier relate so that these added prefixes are removed from the plot and output)

Hi Brent, maybe I'm doing something daft, but I just tried that and got:

(base) mt27@node-11-2-3:/lustre/scratch123/hgi/projects/ibdgwas_bioresource/mt27/CSI/Scratch/WES/problematic_somalier_output$ \
  SOMALIER_SAMPLE_RATE=0 SOMALIER_RELATEDNESS_CUTOFF=0.4 \
  /lustre/scratch123/hgi/projects/ibdgwas_bioresource/mt27/somalier/somalier relate \
   "/lustre/scratch123/hgi/projects/ibdgwas_bioresource/mt27/CSI/Scratch/WES/problematic_somalier/*" \
  --sample-prefix CRAM
somalier version: 0.2.16
[somalier] starting read of 1126 samples
fatal.nim(49)            sysFatal
Error: unhandled exception: relate.nim(423, 14) `sl == formatVersion` expected matching versions got 69, expected 2 [AssertionDefect]

My somalier files look like (so half of them have the CRAM prefix and the other half don't)
[image: listing of the .somalier files, half with the CRAM_ prefix]

(I just edited your comment to adjust formatting. you can use 3 backticks for multiline)

So you ran: somalier extract --sample-prefix CRAM ... for the relevant samples?

Then you don't need --sample-prefix CRAM for somalier relate (though that should work).

No, sorry -- I guess I will need to run the extraction first with the prefix, then run relate?
I didn't realize I needed to rerun the extraction.

yes, that's right. the extract step has the option to adjust the sample name for exactly this reason.
it uses sample-id as a unique identifier.
and, I see I wrote somalier relate above where I meant somalier extract. Sorry about that. I'll edit that comment so I don't cause more confusion.