gymrek-lab / LongTR

Tandem repeat genotyping with long reads

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

No repeat were detected

HLHsieh opened this issue · comments

Hi there,

After your suggestions, I can run LongTR on my data as follows:

LongTR --bams C9ORF72_1_10R.sorted.bam --fasta $dir/genome.fa --regions $dir/test.bed --tr-vcf $dir/test.vcf.gz --phased-bam --bam-samps C9ORF72_1 --bam-libs C9ORF72_1

Message showed:

Using phased BAM tags to genotype and phase TRs (WARNING: Any arguments provided to --snp-vcf will be ignored)
Detected 1 BAM/CRAM files
User-specified read groups for 1 unique samples
Reading region file /scratch/kinfai_root/kinfai0/hsinlun/C9ORF72_1_9R_NanoSim_2x/test.bed
Region file contains 1 regions

Processing region chr9 27573528 27573546
111 reads overlapped region, of which
	0 were hard clipped
	0 had an 'N' base call
	0 had low MAPQ
	111 had low base quality scores
	0 did not span the STR
	0 did not have a unique mapping
	0 PASSED ALL FILTERS
Phased SNPs add info for 0 out of 0 reads
Skipping locus with too few reads: TOTAL=0, MIN=10



------HipSTR Execution Summary------
Skipped 1 loci with too few reads for stutter model model training or genotyping.
	 If this is a sizeable portion of your loci, see the --min-reads command line option
Genotyping succeeded for 0/0 loci

Approximate timing breakdown
 BAM seek time       = 0.007054 seconds
 Read filtering      = 0.065989 seconds
 SNP info extraction = 2.6e-05 seconds
 Genotyping          = 0 seconds
	 Trimming alignment        = 0 seconds
	 Haplotype generation  = 0 seconds
	 Haplotype alignment   = 0 seconds
LongTR execution finished: Total runtime = 3.21404 sec
-----------------

Therefore, no repeats were detected. This data should contain some repeat regions. I am wondering if there is something wrong with the command or if you have any suggestions on how I can adjust it accordingly.

Many thanks,
Hsin

Hi Hsin,

It seems that your locus is skipped because all overlapping reads had low quality, you can decrease the threshold by specifying a value for --min-mean-qual, the default value is 30.

Best,
Helia

Hi Helia,

Thank you for your quick response. I was trying the following command, but the same error message returned.

LongTR --bams C9ORF72_1_10R.sorted.bam --fasta $dir/genome.fa --regions $dir/test.bed --tr-vcf $dir/test.vcf.gz --phased-bam --bam-samps C9ORF72_1 --bam-libs C9ORF72_1 --min-mean-qual 0 --min-reads 1
Detected 1 BAM/CRAM files
User-specified read groups for 1 unique samples
Reading region file /scratch/kinfai_root/kinfai0/hsinlun/C9ORF72_1_9R_NanoSim_2x/test.bed
Region file contains 1 regions

Processing region chr9 27573528 27573546
111 reads overlapped region, of which
	0 were hard clipped
	0 had an 'N' base call
	0 had low MAPQ
	111 had low base quality scores
	0 did not span the STR
	0 did not have a unique mapping
	0 PASSED ALL FILTERS
Phased SNPs add info for 0 out of 0 reads and 0 out of 0 samples
Skipping locus with too few reads: TOTAL=0, MIN=1



------HipSTR Execution Summary------
Skipped 1 loci with too few reads for stutter model model training or genotyping.
	 If this is a sizeable portion of your loci, see the --min-reads command line option
Genotyping succeeded for 0/0 loci

Approximate timing breakdown
 BAM seek time       = 0.006804 seconds
 Read filtering      = 0.065036 seconds
 SNP info extraction = 4.3e-05 seconds
 Genotyping          = 0 seconds
	 Trimming alignment        = 0 seconds
	 Haplotype generation  = 0 seconds
	 Haplotype alignment   = 0 seconds
LongTR execution finished: Total runtime = 3.1865 sec
-----------------

Any suggestions would be appreciated.

Many thanks,
Hsin

I have had to specify "-1" for that parameter to avoid removing all (nanopore) reads.

Hi,

Thanks. Specified "-1" for that parameter can include all reads, but no repeats were detected in the output file.

Log:

Detected 1 BAM/CRAM files
User-specified read groups for 1 unique samples
Reading region file /scratch/kinfai_root/kinfai0/hsinlun/C9ORF72_1_9R_NanoSim_2x/test.bed
Region file contains 1 regions

Processing region chr9 27573528 27573546
111 reads overlapped region, of which
        0 were hard clipped
        0 had an 'N' base call
        0 had low MAPQ
        0 had low base quality scores
        0 did not span the STR
        0 did not have a unique mapping
        111 PASSED ALL FILTERS
Phased SNPs add info for 0 out of 111 reads and 0 out of 1 samples
Trimming reads
Failed to trim align 6 out of 111 reads
Generating candidate haplotypes
        GCCCCGGGCCCGCCCCCGGGCCCGCCCCGACCACG CCCCGGCCCCGGCCCCGGCCCC                                                                               TAGCGCGCGACTCCTGAGTTCCAGAGCTTGCTACA
                                            GCCCCGGCCCCGGCCCCGGCCCC
                                            CCCCGGCCCCGGCCCCGGCCCCGGCCCCGGCCCCGGCCCCGGCCCCGGCCCCGGCCCCGGCCCCGGCCCCGGCCCCGGCCCCGGCCCCGGCCCCGGCCCC

                                            GCCCCGGCCCCGGCCCCGGCCCCGGCCCCGGCCCCGGCCCCGGCCCCGGC...GCCCCGGCCCCGGCCCCGGCCCCGGCCCCGGCCCCGGCCCCGGCCCC

Added 2 inexact haplotypes generated by POA
Aborting genotyping of the locus as the sequence upstream of the repeat is too repetitive for accurate genotyping
        Flanking sequence = GCCCCGGGCCCGCCCCCGGGCCCGCCCCGACCACG
Locus timing:
 BAM seek time       = 0.0069 seconds
 Read filtering      = 0.085952 seconds
 SNP info extraction = 0.005669 seconds
 Stutter estimation  = 4e-06 seconds
 Genotyping          = 0.067338 seconds
        Trim alignment        = 0.057342 seconds
         Haplotype generation  = 0.009429 seconds
         Haplotype alignment   = 0 seconds




------HipSTR Execution Summary------
Genotyping succeeded for 0/1 loci

Approximate timing breakdown
 BAM seek time       = 0.0069 seconds
 Read filtering      = 0.085952 seconds
 SNP info extraction = 0.005669 seconds
 Genotyping          = 0.067338 seconds
         Trimming alignment        = 0.057342 seconds
         Haplotype generation  = 0.009429 seconds
         Haplotype alignment   = 0 seconds
LongTR execution finished: Total runtime = 3.27426 sec
-----------------

VCF file

##fileformat=VCFv4.1
##command=LongTR-638942f-dirty --bams C9ORF72_1_10R_NanoSim_50x.sorted.bam --fasta /scratch/kinfai_root/kinfai0/hsinlun/C9ORF72_1_9R_NanoSim_2x/genome.fa --regions /scratch/kinfai_root/kinfai0/hsinlun/C9ORF72_1_9R_NanoSim_2x/test.bed --tr-vcf /scratch/kinfai_root/kinfai0/hsinlun/C9ORF72_1_9R_NanoSim_2x/test.vcf.gz --bam-samps C9ORF72_1 --bam-libs C9ORF72_1 --min-mean-qual -1 --min-reads 1
##reference=/scratch/kinfai_root/kinfai0/hsinlun/C9ORF72_1_9R_NanoSim_2x/genome.fa
##contig=<ID=chrM,length=16569>
##contig=<ID=chr1,length=248956422>
##contig=<ID=chr2,length=242193529>
##contig=<ID=chr3,length=198295559>
##contig=<ID=chr4,length=190214555>
##contig=<ID=chr5,length=181538259>
##contig=<ID=chr6,length=170805979>
##contig=<ID=chr7,length=159345973>
##contig=<ID=chr8,length=145138636>
##contig=<ID=chr9,length=138394717>
##contig=<ID=chr10,length=133797422>
##contig=<ID=chr11,length=135086622>
##contig=<ID=chr12,length=133275309>
##contig=<ID=chr13,length=114364328>
##contig=<ID=chr14,length=107043718>
##contig=<ID=chr15,length=101991189>
##contig=<ID=chr16,length=90338345>
##contig=<ID=chr17,length=83257441>
##contig=<ID=chr18,length=80373285>
##contig=<ID=chr19,length=58617616>
##contig=<ID=chr20,length=64444167>
##contig=<ID=chr21,length=46709983>
##contig=<ID=chr22,length=50818468>
##contig=<ID=chrX,length=156040895>
##contig=<ID=chrY,length=57227415>
##INFO=<ID=INEXACT_ALLELE,Number=A,Type=Integer,Description="Boolean showing if each alternate allele is exact or approximated by POA, 0 for exact 1 for approximated.">
##INFO=<ID=BPDIFFS,Number=A,Type=Integer,Description="Base pair difference of each alternate allele from the reference allele">
##INFO=<ID=START,Number=1,Type=Integer,Description="Inclusive start coodinate for the repetitive portion of the reference allele">
##INFO=<ID=END,Number=1,Type=Integer,Description="Inclusive end coordinate for the repetitive portion of the reference allele">
##INFO=<ID=PERIOD,Number=1,Type=Integer,Description="Length of STR motif">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=REFAC,Number=1,Type=Integer,Description="Reference allele count">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Alternate allele counts">
##INFO=<ID=NSKIP,Number=1,Type=Integer,Description="Number of samples not genotyped due to various issues">
##INFO=<ID=NFILT,Number=1,Type=Integer,Description="Number of samples whose genotypes were filtered due to various issues">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total number of valid reads used to genotype all samples">
##INFO=<ID=DSNP,Number=1,Type=Integer,Description="Total number of reads with SNP phasing information">
##INFO=<ID=DFLANKINDEL,Number=1,Type=Integer,Description="Total number of reads with an indel in the regions flanking the STR">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GB,Number=1,Type=String,Description="Base pair differences of genotype from reference">
##FORMAT=<ID=Q,Number=1,Type=Float,Description="Posterior probability of unphased genotype">
##FORMAT=<ID=PQ,Number=1,Type=Float,Description="Posterior probability of phased genotype">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Number of valid reads used for sample's genotype">
##FORMAT=<ID=DSNP,Number=1,Type=Integer,Description="Number of reads with SNP phasing information">
##FORMAT=<ID=PSNP,Number=1,Type=String,Description="Number of reads with SNPs supporting each haploid genotype">
##FORMAT=<ID=PDP,Number=1,Type=String,Description="Fractional reads supporting each haploid genotype">
##FORMAT=<ID=GLDIFF,Number=1,Type=Float,Description="Difference in likelihood between the reported and next best genotypes">
##FORMAT=<ID=DFLANKINDEL,Number=1,Type=Integer,Description="Number of reads with an indel in the regions flanking the STR">
##FORMAT=<ID=AB,Number=1,Type=Float,Description="log10 of the allele bias pvalue, where 0 is no bias and more negative values are increasingly biased. 0 for all homozygous genotypes">
##FORMAT=<ID=FS,Number=1,Type=Float,Description="log10 of the strand bias pvalue from Fisher's exact test, where 0 is no bias and more negative values are increasingly biased. 0 for all homozygous genotypes">
##FORMAT=<ID=DAB,Number=1,Type=Integer,Description="Number of reads used in the AB and FS calculations">
##FORMAT=<ID=ALLREADS,Number=1,Type=String,Description="Base pair difference observed in each read's Needleman-Wunsch alignment">
##FORMAT=<ID=MALLREADS,Number=1,Type=String,Description="Maximum likelihood bp diff in each read based on haplotype alignments for reads that span the repeat region by at least 5 base pairs">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  C9ORF72_1

My test bed file:

chr9    27573529        27573546        6       3       C9orf72

Any suggestions would be appreciated.

Many thanks,
Hsin

Hi Hsin,

You are getting this error because LongTR is trying to assemble the flanking sequence, which is unnecessary with long reads, we recommend adding --skip-assembly to the command when using long reads to avoid this step.

Best,
Helia