brentp / somalier

fast sample-swap and relatedness checks on BAMs/CRAMs/VCFs/GVCFs... "like damn that is one smart wine guy"


Relate Killed?

kevwilhelm95 opened this issue · comments

Hello,

I am trying to run relate on > 400,000 samples. I know this is quite the task, but I believe I have figured out a way to run it by chunking the samples into 6 groups and comparing them head to head. For example, I would compare split 1 to split 2, with ~78,000 samples in each group (~156,000 per relate run). I thought this would fix my memory issues; however, when I run Somalier, it is still killed within seconds. I calculated I would need at least 42 GB of memory for this, and my machine has 250 GB available. Any other ideas for how to work around this?
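For reference, here is a minimal sketch (Python) of how the chunked head-to-head runs could be enumerated so that every pair of samples is covered, including the within-chunk pairs. The extracted-file directory, output prefixes, and chunk count are placeholders, and the relate flags should be checked against your somalier version:

#!/usr/bin/env python3
"""Sketch of the chunked, head-to-head relate strategy described above.

Assumes one extracted per-sample .somalier file per sample (the "extracted/"
directory is hypothetical) and that each run is launched as
`somalier relate <files...>`; verify the output-prefix flag with `somalier relate --help`.
"""
from itertools import combinations_with_replacement
from pathlib import Path
import subprocess

N_CHUNKS = 6
somalier_files = sorted(Path("extracted").glob("*.somalier"))  # hypothetical location

# Split the per-sample files into N_CHUNKS roughly equal groups.
chunks = [somalier_files[i::N_CHUNKS] for i in range(N_CHUNKS)]

# Every pair of samples must be covered, so we need within-chunk runs (i == j)
# as well as between-chunk runs (i < j): with 6 chunks that is 6 + 15 = 21 runs.
for i, j in combinations_with_replacement(range(N_CHUNKS), 2):
    files = chunks[i] if i == j else chunks[i] + chunks[j]
    prefix = f"relate_{i}_{j}"  # hypothetical naming scheme
    cmd = ["somalier", "relate", "--output-prefix", prefix] + [str(f) for f in files]
    subprocess.run(cmd, check=True)

Note that the between-chunk runs will re-report the within-chunk pairs they contain, so duplicate pairs need to be de-duplicated when merging the outputs.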

That's a large cohort!
What is the exit code when it fails within seconds?
I think you might be hitting memory limits. Maybe you can split into smaller groups of, e.g., 20K?

Also, to make sure I understand your splits: you are sending a total of 78K samples per run, is that right?

This is the error output: b'somalier version: 0.2.16\n[somalier] starting read of 78306 samples\nKilled\n'. I have run it both combining two splits (~156,000 samples) and within just one split (78,306 samples). I guess I will just keep splitting until I have enough resources to run it.

@brentp Another question I am curious about: my group has previously used KING to estimate the relatedness of individuals in our cohort, but the cohort has grown so large that we are testing out Somalier. With KING, we defined 2nd-degree relatives as those with kinship > 0.0884. Does this threshold carry over to the relatedness values from Somalier, or is there another threshold we should use?

Thank you for all of your hard work on this

That cutoff should generally be ok. The problem is that if you have 20K samples and a single low-quality sample, that sample may appear to be related at that level to the other 19,999. So you do have to do some filtering.
I use the html plot to decide on reasonable cutoffs; that will be difficult at the size of your cohort, but it should work for 10-20K samples since it sub-samples unrelated samples pretty heavily.
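One rough way to do that filtering on the text output is to count, per sample, how many others it appears related to at the chosen threshold and flag the outliers. A sketch, assuming the pairs TSV from relate has sample_a, sample_b, and relatedness columns (check the header of your version and adjust names if they differ); the cutoffs are illustrative:

#!/usr/bin/env python3
"""Sketch: flag samples that look "related" to implausibly many others."""
import csv
from collections import Counter

RELATEDNESS_CUTOFF = 0.0884    # threshold discussed above; verify it maps onto somalier's relatedness scale for your data
MAX_PLAUSIBLE_RELATIVES = 100  # hypothetical cutoff; tune to your cohort

counts = Counter()
with open("somalier.pairs.tsv") as fh:
    reader = csv.DictReader(fh, delimiter="\t")
    # The first header field may be prefixed with '#', so normalize it.
    reader.fieldnames = [f.lstrip("#") for f in reader.fieldnames]
    for row in reader:
        if float(row["relatedness"]) >= RELATEDNESS_CUTOFF:
            counts[row["sample_a"]] += 1
            counts[row["sample_b"]] += 1

suspect = [s for s, n in counts.items() if n > MAX_PLAUSIBLE_RELATIVES]
print(f"{len(suspect)} samples exceed {MAX_PLAUSIBLE_RELATIVES} putative relatives at the cutoff:")
for s in suspect:
    print(s, counts[s])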

Another thought: it is most likely dying while generating the JSON for the HTML report, as that uses a lot of memory. Even in that case, you may still get the text output.