capture dulpex rate to file
hagen-wende opened this issue · comments
For duplex basecalling of barcoded data I'm looping over the barcode directories to find the reads in the simplex fastqs. Then I duplex basecall these with the following command:
/home/vc/ont/dorado_0.4.2/bin/dorado \ duplex \ /home/vc/ont/rerio/dorado_models/res_dna_r10.4.1_e8.2_400bps_sup@2023-09-22_bacterial-methylation \ --read-ids $INPUT_DIR/$barcode/all_reads.txt \ $POD5_DIR \ -r \ --min-qscore 10 \ -x "cuda:0" \ > ./$DUPLEX_CALLS_DIR/$barcode/duplex_calls.bam
As a QC step I would like to extract the duplex rate from all barcodes and save them to a file. So I'm trying to capture the stdout output by appending
| tee output.txt
to the above scipt and would then process the file further. Unfortunately the output is empty. Is there any other way to get the duplex rate?
Hi @hagen-wende,
The output bam file contains the dx
tag which can be used to calculate the duplex rate. README#duplex
In your command, stdout
is being redirected to the bam file (this is where dorado outputs the actual reads). The log messages you see are being printed to stderr
, so you would need to redirect that to a separate file as well.
Thanks for the input, I managed to get both methods to work, however they yield slightly different results (minor enough for me to ignore it)
based on @HalfPhoton suggestion I implemented this script:
#!/bin/bash
eval "$(conda shell.bash hook)"
conda activate samtools
### dx:1 bam file
duplex_length="$(samtools view ./fastq_pass_duplex_only/barcode21/duplex_reads.bam | awk '{sum += length($10)} END {print sum}')"
### dx:0 bam file
simplex_length="$(samtools view ./fastq_pass_duplex_with_simplex/barcode21/simplex_reads.bam | awk '{sum += length($10)} END {print sum}')"
eval "$(conda shell.bash hook)"
conda activate base
duplex_rate=$(awk "BEGIN {print ($duplex_length*2)/($duplex_length*2+$simplex_length)*100}")
echo "Duplex rate: $duplex_rate %"
and based on @malton-ont comment I used "2>" to redirect stderr and extract the duplex from there. I will stick to the latter method, as it is more straightforward:
/home/vc/ont/dorado_0.4.2/bin/dorado \
duplex \
/home/vc/ont/rerio/dorado_models/res_dna_r10.4.1_e8.2_400bps_sup@2023-09-22_bacterial-methylation \
--read-ids $INPUT_DIR/$barcode/all_reads.txt \
$POD5_DIR \
-r \
--min-qscore 10 \
-x "cuda:0" \
> ./$DUPLEX_CALLS_DIR/$barcode/duplex_calls.bam \
2> $temp_log
# Extract the duplex rate using awk
duplex_rate=$(awk -F ': ' '/Duplex rate/ {gsub("%",""); print $2}' $temp_log)
# Append the formatted duplex rate information to the output file
echo "Duplex rate for $barcode is: $duplex_rate %" >> $output_file
# remove the temporary log file
rm $temp_log
echo "Duplex rate is: $duplex_rate"
Thanks for your help!