thegenemyers / FASTK

A fast K-mer counter for high-fidelity shotgun datasets

Segfault for FastK, sometimes, when k isn't 40

rsharris opened this issue · comments

Synopsis: When running FastK with -k set to 32 or less, I'm seeing segfaults.

Specifically, I've seen this with one particular input file (the blue whale assembly) and k in {20,26,31,32}.

Details:

I was trying to count 31-mers in this VGP blue whale assembly:
https://s3.amazonaws.com/genomeark/species/Balaenoptera_musculus/mBalMus1/assembly_curated/mBalMus1.pri.cur.20200528.fasta.gz

My current directory had mBalMus1.pri.cur.20200528.fasta.gz as a symlink to some other directory. Then I ran
FastK -v -k31 mBalMus1.pri.cur.20200528.fasta.gz
and got this output:

  Gzipped file mBalMus1.pri.cur.20200528.fasta.gz being temporarily uncompressed

Partitioning 1 .fasta file into 4 parts

Determining minimizer scheme & partition for mBalMus1.pri.cur.20200528
  Estimate 2.375G 31-mers
  Dividing data into 2 blocks
  Using 5-minimizers with 1024 core prefixes

Phase 1: Partitioning K-mers into 8 Super-mer Files

  There are 105 reads totalling 2,374,852,541 bps

     Part:         31-mer   super-mers  ave. length
        0:  1,151,752,026   74,926,939         15.4
        1:  1,197,133,971  112,483,639         10.6
      Sum:  2,348,885,997  187,410,578         12.5

      Range 1,151,752,026 - 1,197,133,971 (3.86%)

  Resources for phase:  1:27.610u  5.846s  1:01.816w  151.2%

Phase 2: Sorting & Counting K-mers in 2 blocks

  Processing block 2: Sorting super-mers     **Segmentation fault**

I'm using commit 7cebc7d, from a few hours ago.

Yep, I was just about to try it on some different platforms. The one I was running on is some variant of Linux. The compiler is no doubt gcc, but it might be an old version. I'll provide more specific details after I run on some other machines, and I'll try the things you mentioned.

I have also run into this. Gene, do you have access to the sanger farm? If so, I can show you how to reproduce it there.

directory: /lustre/scratch118/malaria/team222/hh5/datasets/assembly/ilvanatal1
job script: happrof.sh, which just has
FastK -M20 -k31 -p:hets2 -t1 -T8 PacBio/m64016_190918_162737.Q20.fasta.gz
(I tried it on the unaligned bam as well, same error)
bsub command:
bsub -o happrof.out -n8 -R"span[hosts=1] select[mem>28000] rusage[mem=28000]" -M28000 ./happrof.sh

In this I'm trying to get profiles of just the het k-mers. I created a fasta file from the output of Haplex and then made a ktab from that (had to remove 2 exits for short sequences). Now I am running the PacBio data against that het k-mer ktab; I get 0s except at positions where there is a het k-mer, and it is segfaulting on that. K-mer size is 31.

I'm still testing, but it looks like in my case the problem is that /tmp fills up. The solution, of course, is for me to use -P to redirect temp files to my own directory.

I think the k setting is probably a red herring, at least for me.

This explains why I didn't see the problem until I started using a whole genome, and why the failure occurs on some machines and not others. The machine it was failing on has only about 4G allocated to /tmp.
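A quick pre-flight check along these lines can avoid the full-/tmp failure mode entirely. The helper below is only a sketch: pick_tmpdir, the $HOME/fastk_tmp fallback, and the threshold argument are illustrative names, not part of FastK; only the -P option used in this thread comes from FastK itself.

```shell
# pick_tmpdir NEEDED_KB: print /tmp if it has at least NEEDED_KB of free
# space, otherwise create and print a fallback directory under $HOME.
# (Illustrative helper; FastK only sees whatever -P<dir> you pass it.)
pick_tmpdir() {
    needed_kb=$1
    # df -Pk prints POSIX-format output in 1K blocks; field 4 of the
    # second line is the space still available on /tmp's filesystem.
    avail_kb=$(df -Pk /tmp | awk 'NR==2 {print $4}')
    if [ "$avail_kb" -ge "$needed_kb" ]; then
        echo /tmp
    else
        mkdir -p "$HOME/fastk_tmp"
        echo "$HOME/fastk_tmp"
    fi
}
```

For a run like the blue whale assembly above, a threshold on the order of the uncompressed input size seems reasonable, e.g. FastK -v -k31 -P"$(pick_tmpdir 4000000)" genome.fasta.gz.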

No problem at all. It is late there. I will run with a debugger and post here. I tried with a -P temp directory of my own and it still crashes. We will figure it out. Don't feel the need to rush. Thanks for the help and the great tool. This is what you get for writing a good tool lol - users...

Best,
Haynes

I ran a couple tests where I put /tmp on the hairy edge before running FastK. I saw two things.

(I should mention that I was running with a file containing just the first 15 scaffolds of blue whale. Uncompressed this is 1.7G)

(1) Unrelated to this thread: if there's not enough room to unzip the gzipped file, gzip reports "gzip: stdout: No space left on device", but FastK doesn't realize gzip failed. FastK continues, processing a short version of the uncompressed genome. I didn't let this run to completion -- possibly it would end up producing a shortened result, and when running in a pipeline the failure wouldn't be noticed or immediately obvious.
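One defensive pattern for that pitfall (just a sketch of pipeline hygiene, not FastK behavior): decompress explicitly and check gzip's exit status before handing the fasta to the counter, so an out-of-space failure stops the pipeline instead of leaving a silently truncated input. safe_gunzip is a hypothetical helper name.

```shell
# safe_gunzip FILE.gz: decompress FILE.gz next to itself, failing loudly
# if gzip reports any error (e.g. a full disk) instead of leaving a
# truncated output file for downstream tools to read.
safe_gunzip() {
    gz=$1
    out=${gz%.gz}
    # gzip -t verifies the archive's integrity without writing anything.
    gzip -t "$gz" || { echo "corrupt or truncated: $gz" >&2; return 1; }
    # gzip -dc exits nonzero if decompression or the write fails, so a
    # partial output file is detected and removed rather than used.
    if ! gzip -dc "$gz" > "$out"; then
        rm -f "$out"
        echo "decompression of $gz failed" >&2
        return 1
    fi
}
```

Used as safe_gunzip mBalMus1.15scaffolds.fasta.gz && FastK -v -k31 mBalMus1.15scaffolds.fasta, the counter never sees a shortened genome.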

(2) Leaving enough room for a fully unzipped fasta, I repeatedly ran ls -al /tmp while FastK was running. Before the segfault, it had written these files (plus the unzipped fasta):

-rwx------  1 rsharris rsharris   36839424 Apr  6 17:08 mBalMus1.15scaffolds.0.T0
-rwx------  1 rsharris rsharris   38023168 Apr  6 17:08 mBalMus1.15scaffolds.0.T1
-rwx------  1 rsharris rsharris   39821312 Apr  6 17:08 mBalMus1.15scaffolds.0.T2
-rwx------  1 rsharris rsharris   40181760 Apr  6 17:08 mBalMus1.15scaffolds.0.T3
-rwx------  1 rsharris rsharris   54337536 Apr  6 17:08 mBalMus1.15scaffolds.1.T0
-rwx------  1 rsharris rsharris   46800896 Apr  6 17:08 mBalMus1.15scaffolds.1.T1
-rwx------  1 rsharris rsharris   49356800 Apr  6 17:08 mBalMus1.15scaffolds.1.T2
-rwx------  1 rsharris rsharris   53616640 Apr  6 17:08 mBalMus1.15scaffolds.1.T3

It then removed the first four files (the mBalMus1.15scaffolds.0.* ones). Just before segfaulting, the final messages were

Phase 2: Sorting & Counting K-mers in 2 blocks
  Processing block 1: Sorting weighted k-mersSegmentation fault

The four mBalMus1.15scaffolds.1.* files remained at that point.

The backtrace isn't useful: "No such file or directory", then just malloc stuff.
Thread 104 "FastK" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x15550fbb7700 (LWP 35806)]
tcache_thread_shutdown () at malloc.c:2978
2978 malloc.c: No such file or directory.
(gdb) bt
#0 tcache_thread_shutdown () at malloc.c:2978
#1 arena_thread_freeres () at arena.c:950
#2 0x00001555542445e2 in __libc_thread_freeres () at thread-freeres.c:29
#3 0x0000155555112700 in start_thread (arg=0x15550fbb7700) at pthread_create.c:476
#4 0x00001555541c971f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

There does seem to be a lot of thread churning. Like maybe 1 thread per read.

Command is
gdb --args FastK -M20 -k31 -p:hets2 -t1 -T8 -P/lustre/scratch118/malaria/team222/hh5/datasets/assembly/ilvanatal1/tmp PacBio/test.fq

But I had a bit of progress. It seems to be a threading issue for me. I got a small example of just 20 reads that hits the seg fault, but if I give it only 1 thread it completes. Minimal example attached with the fastq and the hets2.ktab file. With 19 reads it doesn't crash, but with 20 reads it does, and there isn't anything weird about the 20th read. To make sure, I put the 20th read first in a new file and then 18 reads after it; that one finishes, but if you put one more read on it fails again.

Also, going back to the full dataset with 1 thread, it succeeds. And that only took 15 min, so maybe single-threaded is fine for this purpose. Since it's this fast, this will not be blocking for me.
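The manual bisection above (vary the thread count, rerun, watch for the crash) can be wrapped in a tiny harness. This is purely an illustrative debugging aid, not part of FastK: scan_threads runs any command once per thread count, passing the count as the command's last argument.

```shell
# scan_threads CMD [ARGS...]: run CMD ARGS... T for several values of T
# and report which thread counts succeed. The command's own output is
# discarded; only the per-count verdict is printed.
scan_threads() {
    for t in 1 2 4 8; do
        if "$@" "$t" >/dev/null 2>&1; then
            echo "ok at $t threads"
        else
            echo "fail at $t threads"
        fi
    done
}
```

With a wrapper such as fastk_at() { FastK -M20 -k31 -p:hets2 -t1 -T"$1" test.fq; }, running scan_threads fastk_at would show at a glance which counts crash.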

segfault_examp.zip

And then just as follow up, it works as intended for finding the locations of het kmers.

[Screenshot attached: Screen Shot 2021-04-06 at 6 41 29 PM]

I included example data for my case. It's not a big deal because it runs fine single threaded, but probably something to fix eventually.

Best,
Haynes

I grabbed the new example and ran "FastK -M20 -k31 -p:hets2 -t1 -T8 test.fq". Alas, it ran just fine on my Mac. It did use only 2 threads because the file was so small. There was a small fix for Bob -- did that by any chance fix it for you? Otherwise ...

Could you run with -v so there is at least some idea what phase it was in when it crashed?