zeeev / wham

Structural variant detection and association testing

FATAL: Unable to gather stats on bamfile

LeileiCui opened this issue

Dear WHAM team,
I am trying to run both wham and whamg for SV calling on Arabidopsis BAM files (~20x coverage). The command lines I used are:

$wham -x 2 -f ../../1-TAIR10/tair10_um_new.fa -t Ler_0_ler_phaseI.bam > NEW_wham_ler_0.vcf 2> NEW_wham_ler_0.err
$whamg -x 2 -a ../../1-TAIR10/tair10_um_new.fa -f Ler_0_ler_phaseI.bam -c Chr1,Chr2,Chr3,Chr4,Chr5 > NEW_whamg_ler_0.vcf 2> NEW_whamg_ler_0.err

Both commands ran successfully, and the tails of the *.err files are:

[_@_ 2-TAIR10.BAM]$ tail NEW_wham_ler_0.err
INFO: running region: Chr5:22000500-23000500
INFO: running region: Chr5:23000500-24000500
INFO: running region: Chr5:24000500-25000500
INFO: running region: Chr5:12000500-13000500
INFO: running region: Chr5:25000500-26000500
INFO: running region: Chr5:26000500-27000500
INFO: running region: Chr5:13000500-14000500
INFO: running region: chloroplast:500-1000500
INFO: running region: mitochondria:500-1000500
INFO: WHAM-BAM finished normally.
[_@_ 2-TAIR10.BAM]$ tail NEW_whamg_ler_0.err
INFO: Loading discordant reads into forest.
INFO: Reading: Ler_0_ler_phaseI.bam
INFO: Ler_0: processed 100Mb of the genome.
INFO: Ler_0_ler_phaseI.bam had 1722153 reads that were not processed
INFO: Finished loading reads.
INFO: Gathering graphs from forest.
INFO: Matching breakpoints.
INFO: Printing.
INFO: done processing trees
INFO: WHAM finished normally, goodbye!

However, the VCF files contain only header lines, and no SVs were detected. I then lowered the thresholds for wham and whamg, but still no SVs were detected.

$wham -x 2 -f ../../1-TAIR10/tair10_um_new.fa -t Ler_0_ler_phaseI.bam -m 1 -q 1 -p 1 > test_wham_ler_0.vcf 2> test_wham_ler_0.err
$whamg -x 2 -a ../../1-TAIR10/tair10_um_new.fa -f Ler_0_ler_phaseI.bam -c Chr1,Chr2,Chr3,Chr4,Chr5 -m 1 > test_whamg_ler_0.vcf 2> test_whamg_ler_0.err

Could you please give me any advice about the empty VCF files?
Thanks in advance,
Leilei

What are the data? How were they aligned?

Thanks for the rapid reply, zeeev! These BAM files are whole-genome sequencing data at ~20x coverage, aligned with a program named Stampy (http://www.well.ox.ac.uk/project-stampy). I noticed that you recommend aligning with BWA mem before using wham, but will this have a big effect? Do you think there is something wrong with the command lines I used?

@CllKiller Your commands are correct. You need to use BWA mem.

Sorry for the late reply, Zeeev. In fact, I had also tried to realign these BAM files using BWA mem. As I don't have the original FASTQ files, I converted the original BAM files to FASTQ and then realigned. Here are the commands I used:

$bedtools bamtofastq -i ../2-TAIR10.BAM/Ler_0_ler_phaseI.bam -fq Ler_0.fq
$bwa index Arab.all.fa -p Arab.all
$bwa mem -R '@RG\tID:ler_phaseI\tCN:WTCHG\tLB:ler_phaseI\tPL:ILLUMINA\tSM:Ler_0' -M -t 10 ../1-TAIR10/Arab.all Ler_0.fq > Ler_0.sam
$samtools view -F 4 -S -b -h Ler_0.sam > Ler_0.bam
$samtools sort Ler_0.bam Ler_0.sorted
$samtools rmdup Ler_0.sorted.bam Ler_0.sorted.mkdup.bam
$samtools view -F 1024 Ler_0.sorted.mkdup.bam -h -b -o Ler_0.sorted.rmdup.bam
$samtools index Ler_0.sorted.rmdup.bam
$wham -x 10 -f ../1-TAIR10/Arab.all.fa -t Ler_0.sorted.rmdup.bam >Ler_0.vcf 2>Ler_0.err
$whamg -x 10 -a ../1-TAIR10/Arab.all.fa -f Ler_0.sorted.rmdup.bam -c Chr1,Chr2,Chr3,Chr4,Chr5 > Ler_0_whamg.vcf 2> Ler_0_whamg.err

But some errors occurred:

[_@_ 3-Realign-BWA-Samtools]$ tail *.err
==> Ler_0.err <==
INFO: WHAM-BAM will using the following fasta: ../1-TAIR10/Arab.all.fa
INFO: target bams:
Ler_0.sorted.rmdup.bam
INFO: OpenMP will roughly use 10 threads
INFO: gathering stats for each bam file.
FATAL: was not able to gather stats on bamfile: Ler_0.sorted.rmdup.bam

==> Ler_0_whamg.err <==
INFO: OpenMP will roughly use 10 threads
INFO: fasta file: ../1-TAIR10/Arab.all.fa
INFO: target bams:
Ler_0.sorted.rmdup.bam
INFO: gathering stats (may take some time) for bam: Ler_0.sorted.rmdup.bam
INFO: processed 0 reads for: Ler_0.sorted.rmdup.bam
INFO: processed 0 reads for: Ler_0.sorted.rmdup.bam
INFO: processed 0 reads for: Ler_0.sorted.rmdup.bam
INFO: processed 0 reads for: Ler_0.sorted.rmdup.bam
INFO: processed 0 reads for: Ler_0.sorted.rmdup.bam
INFO: processed 0 reads for: Ler_0.sorted.rmdup.bam
INFO: processed 0 reads for: Ler_0.sorted.rmdup.bam
INFO: processed 0 reads for: Ler_0.sorted.rmdup.bam
INFO: processed 0 reads for: Ler_0.sorted.rmdup.bam
INFO: processed 0 reads for: Ler_0.sorted.rmdup.bam
FATAL: Unable to gather stats on bamfile: Ler_0.sorted.rmdup.bam
INFO: Consider using -z if bamfile was split by region.

And the header lines of the final BAM file used to run wham/whamg are:
@HD VN:1.3 SO:coordinate
@SQ SN:Chr1 LN:30427671
@SQ SN:Chr2 LN:19698289
@SQ SN:Chr3 LN:23459830
@SQ SN:Chr4 LN:18585056
@SQ SN:Chr5 LN:26975502
@SQ SN:mitochondria LN:366924
@SQ SN:chloroplast LN:154478
@RG ID:ler_phaseI CN:WTCHG LB:ler_phaseI PL:ILLUMINA SM:Ler_0
@PG ID:bwa PN:bwa VN:0.7.15-r1140

The number of @SQ tags seems normal. Could you give me some suggestions for solving this error? (P.S. The reference I used to realign is another version that differs only slightly, in the organelles, from the initial reference used to generate the BAM files. I am also trying to realign with the initial reference, and I expect it to produce this error, too.)

Hi, Zeeev. As the Arabidopsis genome is much smaller than the human genome, do you think this could be one factor causing the empty VCF files and the errors above? Arabidopsis has five chromosomes and a total size of approximately 135 megabases. The table below shows the approximate total length and the length of the golden path for each chromosome.

Chromosome    Golden_path_length  Approximate_chromosome_length
Chromosome 1  30,427,671 bp       34,964,571 bp
Chromosome 2  19,698,289 bp       22,037,565 bp
Chromosome 3  23,459,830 bp       25,499,034 bp
Chromosome 4  18,585,056 bp       20,862,711 bp
Chromosome 5  26,975,502 bp       31,270,811 bp
Total         119,146,348 bp      134,634,692 bp

@CllKiller What type of data are you working with? Is it paired end?

@zeeev Yes, the BAM file I tested contains paired-end data. The details of the genome sequencing are:
For most accessions, two paired-end (PE) libraries with different insert sizes and read lengths (~200bp; 32bp PE and ~400bp; 51bp PE) were made from different plants and sequenced with a Genome Analyzer II.
Is there anything wrong with the command lines I used to realign?
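
Whether a BAM file actually carries paired-end information can be checked from the FLAG column of its records. The sketch below is not part of wham or whamg: `paired_read_summary` is a hypothetical helper name, and it assumes SAM text on stdin (for example from `samtools view`). It may be worth running on a realigned file, because without `-fq2`, `bedtools bamtofastq` writes all reads to a single FASTQ, so mate pairing can be lost on realignment.

```shell
# Hypothetical helper (not part of wham): count paired reads in SAM text on stdin.
# Column 2 of a SAM record is the bitwise FLAG; header lines start with '@'.
paired_read_summary() {
    awk -F '\t' '
        /^@/ { next }                              # skip header lines
        {
            total++
            if ($2 % 2 == 1) paired++              # FLAG bit 0x1: read is paired
            if (int($2 / 2) % 2 == 1) proper++     # FLAG bit 0x2: mapped in a proper pair
        }
        END { printf "total=%d paired=%d proper=%d\n", total, paired, proper }'
}
```

A possible invocation, assuming samtools is installed, is `samtools view Ler_0.sorted.rmdup.bam | paired_read_summary`. If `paired` is 0, the file holds only single-end alignments.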

@CllKiller I don't see anything obvious. Can you share the bam file that is causing problems?

No problem, sir! Sorry to take so much of your time. I will share the BAM file and reference file via Google Drive to your email (zev.kronenberg@gmail.com). Many thanks again!! :)

Zev, I have just shared the files with you via Google Drive, and you will receive an email to download them. Please check them. Thanks!

@CllKiller Your reads are too short ( < 50bp) for wham or whamg. However, I've added the -d flag that allows users to adjust the number of matching bases. After running your data with -d and -g I see INDELS, but not SVs (in the graph file, not the VCF).
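
A quick way to see whether reads fall below the 50 bp mark mentioned above is to tally read lengths from the SAM text. This is a minimal sketch, not part of wham or whamg: `read_length_summary` is a hypothetical helper name, it assumes SAM records on stdin, and the default threshold of 50 is taken from the comment above rather than from the tool's source.

```shell
# Hypothetical helper (not part of wham): summarise read lengths from SAM text on stdin.
# Column 10 of a SAM record is the read sequence; header lines start with '@'.
# Optional first argument: minimum length below which a read counts as short (default 50).
read_length_summary() {
    awk -F '\t' -v min_len="${1:-50}" '
        /^@/ { next }                 # skip header lines
        $10 == "*" { next }           # sequence not stored in this record
        {
            n++
            len = length($10)
            if (len < min_len) short++
            if (len > max) max = len
        }
        END { printf "reads=%d short=%d max_len=%d\n", n, short, max }'
}
```

A possible invocation, assuming samtools is installed, is `samtools view Ler_0_ler_phaseI.bam | read_length_summary 50`.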

I'd encourage you to try a tool that doesn't rely on soft-clipped bases. LUMPY or VH are two options.
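
The suggestion above implies that wham and whamg rely on soft-clipped bases, so it can help to see how many alignments carry soft clips at all. The following is a rough sketch under that assumption: `soft_clip_summary` is a hypothetical helper name, it reads SAM text on stdin, and it only counts records whose CIGAR string contains an `S` operation. It says nothing about how wham itself uses them.

```shell
# Hypothetical helper (not part of wham): count soft-clipped alignments in SAM text on stdin.
# Column 6 of a SAM record is the CIGAR string; 'S' marks soft-clipped bases.
soft_clip_summary() {
    awk -F '\t' '
        /^@/ { next }                 # skip header lines
        {
            total++
            if ($6 ~ /S/) clipped++   # at least one soft-clip operation
        }
        END { printf "total=%d soft_clipped=%d\n", total, clipped }'
}
```

A possible invocation, assuming samtools is installed, is `samtools view Ler_0_ler_phaseI.bam | soft_clip_summary`.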

Many thanks for all this great help and your awesome software, Zev! I have already tried LUMPY on my data, but what is VH? I can't find it on Google. Could you please give me more information about this software? :)