FRED-2 / OptiType

Precision HLA typing from next-generation sequencing data

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Does OptiType care about read groups?

Stikus opened this issue · comments

As said here - OptiType cares about read pairs.
But what about read groups?
Here is our situation - we have WGS-data with several lines - e.g. not pair of fastq.gz but 8 files with each 2 paired.
Can you propose solution how to use OptiType on them?
Do we need to align then first (with Razers3 or Yara or even BWA Mem - we want to try different aligners coz Razers3 crashes on big genome files with unpredictable result) and merge BAM's after?
Or we can simply summarize fastq's into 2 big paired fastq files and feed them to OptiType?
What is best practice?

If you have the individual, smaller bam files already, I would samtools cat those of the first ends, then the second ends, and call OptiType on these two concatenated bam files. What you can also do is call Yara with <(zcat lane1_1.fq.gz lane2_1.fq.gz lane3_1.fq.gz lane4_1.fq.gz) as input, instead of creating an actual big file on your disk. I'd probably do this for future samples.

@andras86 Thank you for fast answer.
If I understand you correctly - we should align first and second ends separately with Yara and start OptiType with 2 BAMs?

You propose Yara instead of Razers3 as aligner because we have problems with Razer? Or Yara is better (according to developer page - yes, it is better, but my tests for aligners still in progress)?

Check out our newer, faster read aligner Yara

And here:

we recommend to use the yara read mapper instead.

And if Yara is better - is there any possibility for replacing Razers3 with Yara in OptiType itself?
I find your workaround described here - it is still best way to use Yara for OptiType or something new was proposed?

Yes, exactly, align end1 and end2 separately, and call OptiType on the two resulting bam files.

Yara used to have a few niggly issues with full sensitivity, which have been rectified recently, so they should produce equivalent results under all circumstances now, and there's no reason to stick with RazerS3 anymore. You're right Yara would be more suitable to be the default rather than the workaround read mapper now, I'll keep you posted.

@andras86 I'm trying to implement Yara in my script but I have problem with your executive command
yara_mapper -e 3 -f bam -u -os /path/to/ref path/to/reads_1.fastq.gz | samtools view -h -F 4 -b1
My Yara help doesn't have such keys (-u && -os):

Yara help

yara_mapper - Yara Mapper
=========================

SYNOPSIS
    yara_mapper [OPTIONS] <REFERENCE INDEX PREFIX> <SE-READS FILE>
    yara_mapper [OPTIONS] <REFERENCE INDEX PREFIX> <PE-READS FILE 1> <PE-READS FILE 2>

DESCRIPTION
    Yara - Yet Another Read Aligner.

    See http://www.seqan.de/projects/yara for more information.

    (c) Copyright 2011-2014 by Enrico Siragusa.

    (c) Copyright 2013 by NVIDIA Corporation.

REQUIRED ARGUMENTS
    REFERENCE_INDEX_PREFIX INPUT_PREFIX
          An indexed reference genome.
    READS_FILE List of INPUT_FILE's
          Either one single-end or two paired-end / mate-pair read files. Valid filetypes are: .sam[.*], .raw[.*],
          .gbk[.*], .frn[.*], .fq[.*], .fna[.*], .ffn[.*], .fastq[.*], .fasta[.*], .faa[.*], .fa[.*], .embl[.*], and
          .bam, where * is any of the following extensions: gz, bz2, and bgzf for transparent (de)compression.

OPTIONS
    -h, --help
          Display the help message.
    --version-check BOOL
          Turn this option off to disable version update notifications of the application. One of 1, ON, TRUE, T, YES,
          0, OFF, FALSE, F, and NO. Default: 1.
    --version
          Display version information.
    -v, --verbose
          Displays global statistics.
    -vv, --very-verbose
          Displays extensive statistics for each batch of reads.

  Output Options:
    -o, --output-file OUTPUT_FILE
          Specify an output file. Default: write the file to standard output. Valid filetypes are: .sam[.*] and .bam,
          where * is any of the following extensions: gz, bz2, and bgzf for transparent (de)compression.
    -f, --output-format STRING
          Specify an output format. Note: when specifying the option --output-file, the output format is taken from
          the filename extension. One of bam, sam, sam.bgzf, sam.gz, and sam.bz2. Default: sam.
    -rg, --read-group STRING
          Specify a read group for all records in the SAM/BAM file. Default: none.
    -sa, --secondary-alignments STRING
          Specify whether to output secondary alignments in the XA tag of the primary alignment, as separate secondary
          records, or to omit them. One of tag, record, and omit. Default: tag.
    -ra, --rabema-alignments
          Output alignments compatible with the Read Alignment BEnchMArk (RABEMA).

  Mapping Options:
    -e, --error-rate INTEGER
          Consider alignments within this percentual number of errors. Increase this threshold to increase the number
          of mapped reads. Decrease this threshold to decrease the runtime. In range [0..10]. Default: 5.
    -s, --strata-rate INTEGER
          Consider suboptimal alignments within this percentual number of errors from the optimal alignment. Increase
          this threshold to increase the number of alternative alignments at the expense of runtime. In range [0..10].
          Default: 0.
    -y, --sensitivity STRING
          Sensitivity with respect to edit distance. Full sensitivity guarantees to find all considered alignments but
          increases runtime, low sensitivity decreases runtime by breaking such guarantee. One of low, high, and full.
          Default: high.

  Paired-End Mapping Options:
    -ll, --library-length INTEGER
          Expected library length. Default: autodetected. In range [1..inf].
    -ld, --library-deviation INTEGER
          Deviation from the expected library length. Default: autodetected. In range [1..inf].
    -i, --indel-rate INTEGER
          Rescue unaligned ends within this percentual number of indels. In range [0..50]. Default: 25.
    -ni, --no-indels
          Turn off the rescue of unaligned ends containing indels.

  Performance Options:
    -t, --threads INTEGER
          Specify the number of threads to use. In range [1..2048]. Default: 14.

VERSION
    Last update: 2018-10-18_17:13:22_+0200
    yara_mapper version: 0.9.11 [55b8b1f]
    SeqAn version: 2.4.0

Can you tell me what they mean or replacement for them?

What is your Yara version? yara_mapper --version

Bottom of help:

VERSION
    Last update: 2018-10-18_17:13:22_+0200
    yara_mapper version: 0.9.11 [55b8b1f]
    SeqAn version: 2.4.0

Should I use more recent version? This is last from master.
Here is my installation from Dockerfile:

cd "$SOFT" \
   && git clone https://github.com/seqan/seqan.git \
   && mkdir -p "$SOFT/yara-build" \
   && cd "$SOFT/yara-build" \
   && cmake "$SOFT/seqan" -DSEQAN_BUILD_SYSTEM=APP:yara -DCMAKE_CXX_COMPILER=/usr/bin/g++-4.9 \
   && make -j"$(($(nproc)+1))" all \
   && mkdir -p "$SOFT/yara-0.9.11/bin" \
   && mv -t "$SOFT/yara-0.9.11/bin" "$SOFT/yara-build/bin/yara_indexer" "$SOFT/yara-build/bin/yara_mapper" \
   && cd "$SOFT" \
   && rm -rf "$SOFT/seqan" \
   && rm -rf "$SOFT/yara-build"

In that case, try yara_mapper -e 3 -y full -t 4 /path/to/ref /path/to/reads_1.fq.gz | samtools view -h -F 4 -b1 - instead. I added -t 4 because Yara uses as many threads as you have cores by default, and you may not find that desirable, so it's best to limit the number of threads. Runtime doesn't decrease proportionally with the number of threads anyway.

New question - what does -b1 flag mean for samtools view? Cannot find it in help again :)

Samtools view help

Usage: samtools view [options] <in.bam>|<in.sam>|<in.cram> [region ...]

Options:
  -b       output BAM
  -C       output CRAM (requires -T)
  -1       use fast BAM compression (implies -b)
  -u       uncompressed BAM output (implies -b)
  -h       include header in SAM output
  -H       print SAM header only (no alignments)
  -c       print only the count of matching records
  -o FILE  output file name [stdout]
  -U FILE  output reads not selected by filters to FILE [null]
  -t FILE  FILE listing reference names and lengths (see long help) [null]
  -L FILE  only include reads overlapping this BED FILE [null]
  -r STR   only include reads in read group STR [null]
  -R FILE  only include reads with read group listed in FILE [null]
  -q INT   only include reads with mapping quality >= INT [0]
  -l STR   only include reads in library STR [null]
  -m INT   only include reads with number of CIGAR operations consuming
           query sequence >= INT [0]
  -f INT   only include reads with all  of the FLAGs in INT present [0]
  -F INT   only include reads with none of the FLAGS in INT present [0]
  -G INT   only EXCLUDE reads with all  of the FLAGs in INT present [0]
  -s FLOAT subsample reads (given INT.FRAC option value, 0.FRAC is the
           fraction of templates/read pairs to keep; INT part sets seed)
  -M       use the multi-region iterator (increases the speed, removes
           duplicates and outputs the reads as they are ordered in the file)
  -x STR   read tag to strip (repeatable) [null]
  -B       collapse the backward CIGAR operation
  -?       print long help, including note about region specification
  -S       ignored (input format is auto-detected)
      --input-fmt-option OPT[=VAL]
               Specify a single input file format option in the form
               of OPTION or OPTION=VALUE
  -O, --output-fmt FORMAT[,OPT[=VAL]]...
               Specify output format (SAM, BAM, CRAM)
      --output-fmt-option OPT[=VAL]
               Specify a single output file format option in the form
               of OPTION or OPTION=VALUE
  -T, --reference FILE
               Reference sequence FASTA FILE [null]
  -@, --threads INT
               Number of additional threads to use [0]

samtools 1.9
Using htslib 1.9
Copyright (C) 2018 Genome Research Ltd.
  • Is there any difference between -F 0x4 and yours -F 4?
  • Do we still need -h even if our output is BAM?
  • Should we use -@ flag for samtools for multithreading?

Big thanks for your answers and help :)

Upd: found -b1 description as -b and -1 combined.

Thank you for help - looks like everything working fine.
This is my executive commands:
$YARA_MAPPER -e 3 -y full -t $YARACPUS $YARA_DNA_INDEX <(unpigz -c ${yaraNorm1Input[*]}) | $SAMTOOLS view -@ $SAMVIEWCPUS -F 4 -b1 -o $yaraMappedNorm1BAM

python3 $OPTITYPE -i $yaraMappedNorm1BAM $yaraMappedNorm2BAM --dna -p ${sampleName}_normal -o $optitypeOutDir

Looking forward for your Yara implementation as default mapper for OptiType.