atifrahman / HAWK

Hitting associations with k-mers

Quick start guide?

MikeEHMatson opened this issue

Hello,
Would it be possible to provide a quick start guide or some kind of basic tutorial that explains how to set up the file and directory structure? It is currently not very clear, at least to me.

For example, is the directory where my Illumina files (.fastq only, or .fastq.gz?) reside required to be named "Reads"? Do my files need to have the prefix "Reads"?

Apologies for the low-level technical question.

Hi,
I've expanded the README file.

The version provided assumes that the Illumina files from each sample are in a separate directory and that the directory names start with the prefix "Reads". The read files can have any names, as long as they are the only fastq (or fastq.gz) files in the directory.
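
The layout described above can be checked before running countKmers. The following is a minimal sketch and not part of HAWK: check_layout is a hypothetical helper, and the default prefix "Reads" is an assumption taken from the `ls -d Reads*` loop in the script posted later in this thread. It prints each matching sample directory together with the number of fastq or fastq.gz files found directly inside it.

```shell
# Hypothetical helper (not part of HAWK): list every sample directory under
# a base directory whose name starts with a prefix, and count the FASTQ
# files (.fastq or .fastq.gz) located directly inside each one.
check_layout() {
    base=$1
    prefix=${2:-Reads}   # assumed default, mirrors "ls -d Reads*"
    for d in "$base"/"$prefix"*/; do
        # Skip the unexpanded pattern when nothing matches.
        [ -d "$d" ] || continue
        n=$(find "$d" -maxdepth 1 -type f \
                \( -name '*.fastq' -o -name '*.fastq.gz' \) | wc -l | tr -d ' ')
        printf '%s\t%s\n' "$(basename "$d")" "$n"
    done
}

# Example call (path is a placeholder):
#   check_layout /path/to/dir/hawkReads
```

A directory reported with a count of 0 would give the counting step nothing to read, so it is worth fixing before starting a long run.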

Please let me know if you have other queries.

Thanks for the README expansion, but I still seem to be having an issue getting it to work.

In the countKmers script, I have my directory defined as:

dir=/path/to/dir/hawkReads
I'm fairly confident the jellyfish, hawk, and sort directories are correct since what I have filled in came straight from our sysadmin's recommendations.

There are two directories in hawkReads:

/path/to/dir/hawkReads/Reads_nonPath
/path/to/dir/hawkReads/Reads_Path

And one file in each directory:

/path/to/dir/hawkReads/Reads_nonPath/AL1.fastq.gz
/path/to/dir/hawkReads/Reads_Path/11-12.fastq.gz

However, I still get an error:

$ ./countKmers     
                                                                                                                                                                                                       
./countKmers: line 13: 78637 Killed                  ${jellyfishDir}/jellyfish count -C -o ${OUTPREFIX}_kmers/tmp -m ${KMERSIZE} -t ${CORES} -s 20G <( zcat *.fastq.gz )
ls: cannot access Reads_Path_kmers/tmp*: No such file or directory
Error: Requires at least 2 arguments.
Usage: jellyfish merge [options] input:string+
Use --help for more information
ls: cannot access Reads_Path_kmers_jellyfish: No such file or directory
./countKmers: line 13: 78719 Killed                  ${jellyfishDir}/jellyfish count -C -o ${OUTPREFIX}_kmers/tmp -m ${KMERSIZE} -t ${CORES} -s 20G <( zcat *.fastq.gz )
ls: cannot access Reads_nonPath_kmers/tmp*: No such file or directory
Error: Requires at least 2 arguments.
Usage: jellyfish merge [options] input:string+
Use --help for more information
ls: cannot access Reads_nonPath_kmers_jellyfish: No such file or directory

I feel like I am missing something simple, though I can't quite isolate it.

Thanks for any help you can provide.

Would it be possible to send me the whole script?

Sure. I will add, since it is not apparent here, that the scripts and their parent directory are not in the same directory as specified in line 6 at "dir=/bigdata/..."

CORES=4 #number of cores to use for blast searches
KMERSIZE=31 # RD:61

#modified from NIKS script

dir=/bigdata/judelsonlab/mmatson/usdaTemp/fastqFiles/mps/untrimmed/hawkReads            #directory for read files 
hawkDir=/opt/linux/centos/7.x/x86_64/pkgs/HAWK/0.8.3/                   #directory where hawk is installed
jellyfishDir=/opt/linux/centos/7.x/x86_64/pkgs/jellyfish-hawk/bin/              #directory where jellyfish is installed
sortDir=/usr/bin                #directory where parallel sort is installed

cd ${dir}

for file in `ls -d Reads*`
do
        OUTPREFIX=$file

        cd ${file}

        mkdir ${OUTPREFIX}_kmers

        ${jellyfishDir}/jellyfish count -C -o ${OUTPREFIX}_kmers/tmp -m ${KMERSIZE} -t ${CORES} -s 20G <( zcat *.fastq.gz ) #change if not gzipped

        COUNT=$(ls ${OUTPREFIX}_kmers/tmp* |wc -l)

        if [ $COUNT -eq 1 ]
        then
                mv ${OUTPREFIX}_kmers/tmp_0 ${OUTPREFIX}_kmers_jellyfish
        else
                ${jellyfishDir}/jellyfish merge -o ${OUTPREFIX}_kmers_jellyfish ${OUTPREFIX}_kmers/tmp*
        fi
        rm -rf ${OUTPREFIX}_kmers

        COUNT=$(ls ${OUTPREFIX}_kmers_jellyfish |wc -l)

        if [ $COUNT -eq 1 ]
        then

                ${jellyfishDir}/jellyfish histo -f -o ${OUTPREFIX}.kmers.hist.csv -t ${CORES} ${OUTPREFIX}_kmers_jellyfish
                awk '{print $2"\t"$1}' ${OUTPREFIX}.kmers.hist.csv > ${OUTPREFIX}_tmp
                mv ${OUTPREFIX}_tmp ${OUTPREFIX}.kmers.hist.csv

                awk -f ${hawkDir}/countTotalKmer.awk ${OUTPREFIX}.kmers.hist.csv >> ${dir}/total_kmer_counts.txt

                CUTOFF=1
                echo $CUTOFF > ${OUTPREFIX}_cutoff.csv


                ${jellyfishDir}/jellyfish dump -c -L `expr $CUTOFF + 1` ${OUTPREFIX}_kmers_jellyfish > ${OUTPREFIX}_kmers.txt
                ${sortDir}/sort --parallel=${CORES} -n -k 1 ${OUTPREFIX}_kmers.txt > ${OUTPREFIX}_kmers_sorted.txt

                rm ${OUTPREFIX}_kmers_jellyfish
                rm ${OUTPREFIX}_kmers.txt

                echo "${dir}/Reads_${OUTPREFIX}/${OUTPREFIX}_kmers_sorted.txt" >> ${dir}/sorted_files.txt

        fi


        cd ..

done

Hello,
Apologies for sounding impatient, but I was wondering if you had found out what the issue was yet, or if this was the script you were referring to?

Thanks,
Mike

Yeah... this is the script I wanted, but I haven't been able to figure out what the issue is yet.

Hi,
I was able to move past the error by giving the script considerably more memory (>200 GB). I think Jellyfish needed the extra memory.

Mike
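
A "Killed" line like the one in the log above means the process received SIGKILL, which on Linux is commonly sent by the kernel's out-of-memory killer. That would be consistent with the fix reported here. The sketch below is an illustration under stated assumptions and not part of HAWK: run_checked is a hypothetical wrapper that runs a command and reports when the exit status indicates termination by a signal (a status above 128 in POSIX shells). It shows one way a failed jellyfish count could be detected before the later ls and jellyfish merge steps run.

```shell
# Hypothetical wrapper (not part of HAWK): run a command, report how it
# failed, and pass its exit status back to the caller.
run_checked() {
    "$@"
    status=$?
    if [ "$status" -gt 128 ]; then
        # Shells report "128 + signal number" for a signal-terminated child,
        # so SIGKILL (9) shows up as status 137.
        echo "'$1' was terminated by signal $((status - 128))" >&2
    elif [ "$status" -ne 0 ]; then
        echo "'$1' exited with status $status" >&2
    fi
    return "$status"
}
```

In countKmers, the jellyfish count line could be prefixed with run_checked and followed by `|| exit 1`, so the script stops at the first failure instead of producing the chain of "No such file or directory" messages.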

Ah great!

I had the same problem working on an AWS EC2 instance. The problem was how I had set up the instance store device: I had used the ext4 file system. I reformatted the device as XFS with sudo mkfs -t xfs /dev/nvme1n1 and reran countKmers_jf2, and the error described above disappeared. So it may have something to do with the file system, but I'm not sure.