nanoporetech / dorado

Oxford Nanopore's Basecaller

Home Page:https://nanoporetech.com/

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

capture dulpex rate to file

hagen-wende opened this issue · comments

For duplex basecalling of barcoded data I'm looping over the barcode directories to find the reads in the simplex fastqs. Then I duplex basecall these with the following command:

/home/vc/ont/dorado_0.4.2/bin/dorado \ duplex \ /home/vc/ont/rerio/dorado_models/res_dna_r10.4.1_e8.2_400bps_sup@2023-09-22_bacterial-methylation \ --read-ids $INPUT_DIR/$barcode/all_reads.txt \ $POD5_DIR \ -r \ --min-qscore 10 \ -x "cuda:0" \ > ./$DUPLEX_CALLS_DIR/$barcode/duplex_calls.bam

As a QC step I would like to extract the duplex rate from all barcodes and save them to a file. So I'm trying to capture the stdout output by appending
| tee output.txt
to the above scipt and would then process the file further. Unfortunately the output is empty. Is there any other way to get the duplex rate?

Hi @hagen-wende,
The output bam file contains the dx tag which can be used to calculate the duplex rate. README#duplex

In your command, stdout is being redirected to the bam file (this is where dorado outputs the actual reads). The log messages you see are being printed to stderr, so you would need to redirect that to a separate file as well.

Thanks for the input, I managed to get both methods to work, however they yield slightly different results (minor enough for me to ignore it)

based on @HalfPhoton suggestion I implemented this script:

#!/bin/bash

eval "$(conda shell.bash hook)"
conda activate samtools

### dx:1 bam file
duplex_length="$(samtools view ./fastq_pass_duplex_only/barcode21/duplex_reads.bam | awk '{sum += length($10)} END {print sum}')"
### dx:0 bam file
simplex_length="$(samtools view ./fastq_pass_duplex_with_simplex/barcode21/simplex_reads.bam | awk '{sum += length($10)} END {print sum}')"

eval "$(conda shell.bash hook)"
conda activate base

duplex_rate=$(awk "BEGIN {print ($duplex_length*2)/($duplex_length*2+$simplex_length)*100}")

echo "Duplex rate: $duplex_rate %"

and based on @malton-ont comment I used "2>" to redirect stderr and extract the duplex from there. I will stick to the latter method, as it is more straightforward:

    /home/vc/ont/dorado_0.4.2/bin/dorado \
    duplex \
    /home/vc/ont/rerio/dorado_models/res_dna_r10.4.1_e8.2_400bps_sup@2023-09-22_bacterial-methylation \
    --read-ids $INPUT_DIR/$barcode/all_reads.txt \
    $POD5_DIR \
    -r \
    --min-qscore 10 \
    -x "cuda:0" \
    > ./$DUPLEX_CALLS_DIR/$barcode/duplex_calls.bam \
    2> $temp_log
    
    # Extract the duplex rate using awk
    duplex_rate=$(awk -F ': ' '/Duplex rate/ {gsub("%",""); print $2}' $temp_log)

    # Append the formatted duplex rate information to the output file
    echo "Duplex rate for $barcode is: $duplex_rate %" >> $output_file

    # remove the temporary log file
    rm $temp_log

    echo "Duplex rate is: $duplex_rate"

Thanks for your help!