nQuire create gives "Segmentation fault"

Question

MikeEHMatson opened this issue 7 years ago

Hello,
I'm running

nQuire create -b $sample.bam -o $sample

Yet the only thing returned is the error "Segmentation fault".
My bamfile originates from bwa-mem -> Picard Sort -> MergeSamfiles -> IndelRealign. The bamfiles appear valid, as further GATK steps proceed as expected. The version of nQuire is nQuire/af0a7f0

Thanks,
Mike

Clemens Weiss · Answer 1 · Fri Feb 02 2018 21:26:41 GMT+0800 (China Standard Time)

Hi Mike,

do you use 'MergeSamfiles' to combine multiple samples with different Read Groups into one bam file? If so, I haven't yet implemented a way to handle multi-sample bam files. Still, it shouldn't error out just with a SegFault. If you can confirm that it indeed is a multi-sample bam, I'll investigate this further.
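One way to confirm this is to count the distinct sample names (SM tags) on the @RG lines of the BAM header. The following is a minimal sketch, assuming samtools is installed; `count_rg_samples` is a hypothetical helper name and is not part of nQuire or samtools.

```shell
# Count distinct sample names (SM tags) among the @RG header lines read from stdin.
# count_rg_samples is a hypothetical helper, not part of nQuire or samtools.
count_rg_samples() {
  awk -F'\t' '
    /^@RG/ {
      # Scan every tab-separated field of the read-group line for the SM tag.
      for (i = 2; i <= NF; i++)
        if ($i ~ /^SM:/) seen[substr($i, 4)] = 1
    }
    END {
      n = 0
      for (s in seen) n++
      print n
    }
  '
}

# Usage (assumes samtools is on PATH and $sample is set):
# samtools view -H "$sample.bam" | count_rg_samples
```

A result greater than 1 would indicate that the merged file carries more than one sample name; a result of 1 would suggest the read groups all belong to a single sample.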

Thanks for the report!

MikeEHMatson · Answer 2 · Sat Feb 03 2018 01:21:29 GMT+0800 (China Standard Time)

Hello,

Yes, I did use MergeSamFiles; I guess I forgot to include that in my flow chart. However, I think the resulting file shouldn't technically be a multi-sample file if RGSM is the same for each input file. Then again, GATK wouldn't work anyway if there were different RGSMs in a single merged BAM file, so I suspect the issue you were talking about is what is going on here.

Thanks,
Mike

Clemens Weiss · Answer 3 · Wed Feb 07 2018 19:53:17 GMT+0800 (China Standard Time)

Hi Mike,

unfortunately, I have not been able to reproduce the error with data of my own, trying to replicate the processing pipeline you describe. Could you make a test dataset available, for which you see the error?

Thanks!

MikeEHMatson · Answer 4 · Thu Feb 08 2018 01:39:11 GMT+0800 (China Standard Time)

Hi,

I can try to do so, but before I do, could I have a recommendation on how to use the software? I have been able to run it previously; however, as shown in the paper, the ploidy calls do not become confident until around 30-50x coverage. Has this minimum coverage limit been improved in the most recent version of nQuire? If not, many of my samples are in the 15-25x range, and I won't be able to take advantage of nQuire anyway.

Mike


Clemens Weiss · Answer 5 · Fri Feb 09 2018 21:37:10 GMT+0800 (China Standard Time)

It depends on a couple of things apart from coverage. A big one is repetitiveness of the genome, as misalignments get more frequent in highly complex genomes, and noise increases.
For the parameters, I'd refrain from dropping the minimum coverage below 10, and I'd advise using a mapping quality filter of at least 1 (I'm considering making these the defaults).
I also implemented a way to create the bin file from selected regions only. So if you have something like mappability estimates for parts of the genome, you can try running the model just on the regions you are more confident in, which should give you more reliable base frequencies.
20x is definitely enough to at least play around a little. However, to reliably distinguish tetraploids in particular in a complex genome, more coverage might be needed.
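
To see where a sample falls relative to these coverage figures, a rough mean depth can be computed from `samtools depth` output. This is only a sketch, assuming samtools is installed; `mean_depth` is a hypothetical helper name, and a genome-wide mean says nothing about how coverage is distributed across repetitive regions.

```shell
# Mean depth from `samtools depth` output (columns: chrom, pos, depth) on stdin.
# mean_depth is a hypothetical helper, not part of nQuire or samtools.
mean_depth() {
  awk '
    { sum += $3 }
    END {
      # Guard against empty input to avoid dividing by zero.
      if (NR > 0) printf "%.1f\n", sum / NR
      else print "0.0"
    }
  '
}

# Usage (assumes samtools is on PATH; -a also reports zero-depth positions):
# samtools depth -a "$sample.bam" | mean_depth
```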