thegenemyers / FASTK

A fast K-mer counter for high-fidelity shotgun datasets

Segfault for FastK, sometimes, when k isn't 40

rsharris opened this issue · comments

Synopsis: When running FastK with -k set to 32 or less, I'm seeing segfaults.

Specifically, I've seen this with one particular input file (the blue whale assembly) and k in {20,26,31,32}.

Details:

I was trying to count 31-mers in this VGP blue whale assembly:
https://s3.amazonaws.com/genomeark/species/Balaenoptera_musculus/mBalMus1/assembly_curated/mBalMus1.pri.cur.20200528.fasta.gz

My current directory had mBalMus1.pri.cur.20200528.fasta.gz as a symlink to some other directory. Then I ran
FastK -v -k31 mBalMus1.pri.cur.20200528.fasta.gz
and got this output:

  Gzipped file mBalMus1.pri.cur.20200528.fasta.gz being temporarily uncompressed

Partitioning 1 .fasta file into 4 parts

Determining minimizer scheme & partition for mBalMus1.pri.cur.20200528
  Estimate 2.375G 31-mers
  Dividing data into 2 blocks
  Using 5-minimizers with 1024 core prefixes

Phase 1: Partitioning K-mers into 8 Super-mer Files

  There are 105 reads totalling 2,374,852,541 bps

     Part:         31-mer   super-mers  ave. length
        0:  1,151,752,026   74,926,939         15.4
        1:  1,197,133,971  112,483,639         10.6
      Sum:  2,348,885,997  187,410,578         12.5

      Range 1,151,752,026 - 1,197,133,971 (3.86%)

  Resources for phase:  1:27.610u  5.846s  1:01.816w  151.2%

Phase 2: Sorting & Counting K-mers in 2 blocks

  Processing block 2: Sorting super-mers     **Segmentation fault**

I'm using commit 7cebc7d, from a few hours ago.

Yep, I was just about to try it on some different platforms. The one I was running on is some variant of Linux. The compiler is no doubt gcc, but it might be an old version. I'll provide more specific details after I run on some other machines, and I'll try the things you mentioned.

I have also run into this. Gene, do you have access to the sanger farm? If so, I can show you how to reproduce it there.

directory: /lustre/scratch118/malaria/team222/hh5/datasets/assembly/ilvanatal1
job script: happrof.sh, which just has
FastK -M20 -k31 -p:hets2 -t1 -T8 PacBio/m64016_190918_162737.Q20.fasta.gz
(I tried it on the unaligned bam as well, same error)
bsub command:
bsub -o happrof.out -n8 -R"span[hosts=1] select[mem>28000] rusage[mem=28000]" -M28000 ./happrof.sh

In this I'm trying to get profiles of just the het k-mers. I created a fasta file from the output of Haplex and then made a ktab from that (had to remove 2 exits for short sequences). Now I am running the PacBio data against that het k-mer ktab; I get 0s except at positions where there is a het k-mer, and it is segfaulting on that. K-mer size is 31.

I'm still testing, but it looks like in my case the problem is that /tmp fills up. The solution, of course, is for me to use -P to redirect temp files to my own directory.

I think the k setting is probably a red herring, at least for me.

This explains why I didn't see the problem until I started using a whole genome, and why the failure occurs on some machines and not others. The machine it was failing on has only about 4G allocated to /tmp.
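A quick pre-flight check along these lines can avoid the full-/tmp failure mode entirely. The helper below is only a sketch: pick_tmpdir, the $HOME/fastk_tmp fallback, and the threshold argument are illustrative names, not part of FastK; only the -P option used in this thread comes from FastK itself.

```shell
# pick_tmpdir NEEDED_KB: print /tmp if it has at least NEEDED_KB of free
# space, otherwise create and print a fallback directory under $HOME.
# (Illustrative helper; FastK only sees whatever -P<dir> you pass it.)
pick_tmpdir() {
    needed_kb=$1
    # df -Pk prints POSIX-format output in 1K blocks; field 4 of the
    # second line is the space still available on /tmp's filesystem.
    avail_kb=$(df -Pk /tmp | awk 'NR==2 {print $4}')
    if [ "$avail_kb" -ge "$needed_kb" ]; then
        echo /tmp
    else
        mkdir -p "$HOME/fastk_tmp"
        echo "$HOME/fastk_tmp"
    fi
}
```

For a run like the blue whale assembly above, a threshold on the order of the uncompressed input size seems reasonable, e.g. FastK -v -k31 -P"$(pick_tmpdir 4000000)" genome.fasta.gz.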

No problem at all. It is late there. I will run with a debugger and post here. I tried with a -P temp directory of my own and it still crashes. We will figure it out. Don't feel the need to rush. Thanks for the help and the great tool. This is what you get for writing a good tool lol - users...

Best,
Haynes

I ran a couple tests where I put /tmp on the hairy edge before running FastK. I saw two things.

(I should mention that I was running with a file containing just the first 15 scaffolds of blue whale. Uncompressed this is 1.7G)

(1) Unrelated to this thread: if there's not enough room to unzip the gzipped file, gzip reports "gzip: stdout: No space left on device", but FastK doesn't realize gzip failed. FastK continues, processing a short version of the uncompressed genome. I didn't let this run to completion -- possibly it would end up producing a shortened result, and when running in a pipeline the failure wouldn't be noticed or immediately obvious.
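One defensive pattern for that pitfall (just a sketch of pipeline hygiene, not FastK behavior): decompress explicitly and check gzip's exit status before handing the fasta to the counter, so an out-of-space failure stops the pipeline instead of leaving a silently truncated input. safe_gunzip is a hypothetical helper name.

```shell
# safe_gunzip FILE.gz: decompress FILE.gz next to itself, failing loudly
# if gzip reports any error (e.g. a full disk) instead of leaving a
# truncated output file for downstream tools to read.
safe_gunzip() {
    gz=$1
    out=${gz%.gz}
    # gzip -t verifies the archive's integrity without writing anything.
    gzip -t "$gz" || { echo "corrupt or truncated: $gz" >&2; return 1; }
    # gzip -dc exits nonzero if decompression or the write fails, so a
    # partial output file is detected and removed rather than used.
    if ! gzip -dc "$gz" > "$out"; then
        rm -f "$out"
        echo "decompression of $gz failed" >&2
        return 1
    fi
}
```

Used as safe_gunzip mBalMus1.15scaffolds.fasta.gz && FastK -v -k31 mBalMus1.15scaffolds.fasta, the counter never sees a shortened genome.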

(2) Leaving enough room for a fully unzipped fasta, I repeatedly ran ls -al /tmp while FastK was running. Before the segfault, it had written these files (plus the unzipped fasta):

-rwx------  1 rsharris rsharris   36839424 Apr  6 17:08 mBalMus1.15scaffolds.0.T0
-rwx------  1 rsharris rsharris   38023168 Apr  6 17:08 mBalMus1.15scaffolds.0.T1
-rwx------  1 rsharris rsharris   39821312 Apr  6 17:08 mBalMus1.15scaffolds.0.T2
-rwx------  1 rsharris rsharris   40181760 Apr  6 17:08 mBalMus1.15scaffolds.0.T3
-rwx------  1 rsharris rsharris   54337536 Apr  6 17:08 mBalMus1.15scaffolds.1.T0
-rwx------  1 rsharris rsharris   46800896 Apr  6 17:08 mBalMus1.15scaffolds.1.T1
-rwx------  1 rsharris rsharris   49356800 Apr  6 17:08 mBalMus1.15scaffolds.1.T2
-rwx------  1 rsharris rsharris   53616640 Apr  6 17:08 mBalMus1.15scaffolds.1.T3

It then removed the first four files (the mBalMus1.15scaffolds.0.* ones). Just before segfaulting, the final messages were

Phase 2: Sorting & Counting K-mers in 2 blocks
  Processing block 1: Sorting weighted k-mersSegmentation fault

The four mBalMus1.15scaffolds.1.* files remained at that point.

The backtrace isn't useful: "No such file or directory", then just malloc stuff.
Thread 104 "FastK" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x15550fbb7700 (LWP 35806)]
tcache_thread_shutdown () at malloc.c:2978
2978 malloc.c: No such file or directory.
(gdb) bt
#0 tcache_thread_shutdown () at malloc.c:2978
#1 arena_thread_freeres () at arena.c:950
#2 0x00001555542445e2 in __libc_thread_freeres () at thread-freeres.c:29
#3 0x0000155555112700 in start_thread (arg=0x15550fbb7700) at pthread_create.c:476
#4 0x00001555541c971f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

There does seem to be a lot of thread churning. Like maybe 1 thread per read.

Command is
gdb --args FastK -M20 -k31 -p:hets2 -t1 -T8 -P/lustre/scratch118/malaria/team222/hh5/datasets/assembly/ilvanatal1/tmp PacBio/test.fq

But I had a bit of progress. It seems to be a threading issue for me. I got a small example of just 20 reads that hits the seg fault, but if I give it only 1 thread it completes. Minimal example attached with the fastq and the hets2.ktab file. With 19 reads it doesn't crash, but with 20 reads it does, and there isn't anything weird about the 20th read. To make sure, I put the 20th read first in a new file and then 18 reads after it; that one finishes, but if you put one more read on it fails again.

Also, going back to the full dataset with 1 thread, it succeeds. And that only took 15 min, so maybe single-threaded is fine for this purpose. Since it's this fast, this will not be blocking for me.
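The manual bisection above (vary the thread count, rerun, watch for the crash) can be wrapped in a tiny harness. This is purely an illustrative debugging aid, not part of FastK: scan_threads runs any command once per thread count, passing the count as the command's last argument.

```shell
# scan_threads CMD [ARGS...]: run CMD ARGS... T for several values of T
# and report which thread counts succeed. The command's own output is
# discarded; only the per-count verdict is printed.
scan_threads() {
    for t in 1 2 4 8; do
        if "$@" "$t" >/dev/null 2>&1; then
            echo "ok at $t threads"
        else
            echo "fail at $t threads"
        fi
    done
}
```

With a wrapper such as fastk_at() { FastK -M20 -k31 -p:hets2 -t1 -T"$1" test.fq; }, running scan_threads fastk_at would show at a glance which counts crash.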

segfault_examp.zip

And then just as follow up, it works as intended for finding the locations of het kmers.

[Screenshot attached: Screen Shot 2021-04-06 at 6 41 29 PM]

I included example data for my case. It's not a big deal because it runs fine single threaded, but probably something to fix eventually.

Best,
Haynes

I grabbed the new example and ran "FastK -M20 -k31 -p:hets2 -t1 -T8 test.fq". Alas, it ran just fine on my Mac. It did use only 2 threads because the file was so small. There was a small fix for Bob -- did that by any chance fix it for you? Otherwise ...

Could you run with -v so there is at least some idea what phase it was in when it crashed?