Tips and tricks (mainly for comp. gen.) in the Bash language
- SRA_download_bam.sh: script to download BAM files from SRA (CAUTION: this assumes that the queried data in SRA contain aligned reads)
- input1: a text file listing, one per row, the SRRXXXXX identifiers of the runs to download
- input2: output folder
- command line example:
./script_dwn_to_bam.sh SraAccList_SRP041470.txt BAM_SRP041470
Example of input list file:
[tdelhomme@fsupeksvr SRA]$ head -n3 SraAccList_SRP041470.txt
SRR1264612
SRR1264613
SRR1264614
This page gives information about how to download dbGaP data using sratoolkit: link
Note: to get the SRA accession list, use the SRA Run Selector tool, available on the NCBI website
- Intersection of 2 files:
grep -Fxf "file1" "file2" > intersection
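As a small illustration of the command above: -F treats each pattern as a fixed string, -x only accepts matches that cover a whole line, and -f reads the patterns from "file1". The output therefore contains the lines of "file2" that also occur in "file1", in the order of "file2". The function name intersect_lines and the sample accession lists below are hypothetical and only serve the demonstration.

```shell
# intersect_lines FILE1 FILE2: print the lines of FILE2 that also occur in FILE1.
# (Hypothetical helper name, wrapping the grep command shown above.)
intersect_lines() {
    grep -Fxf "$1" "$2"
}

# Hypothetical sample lists, written to a temporary directory.
tmpdir=$(mktemp -d)
printf '%s\n' SRR1264612 SRR1264613 SRR1264614 > "$tmpdir/file1"
printf '%s\n' SRR1264614 SRR1264699 SRR1264612 > "$tmpdir/file2"

# Prints SRR1264614 then SRR1264612 (the order of file2 is kept).
intersect_lines "$tmpdir/file1" "$tmpdir/file2"

rm -rf "$tmpdir"
```

Note that grep exits with a non-zero status when the intersection is empty, which matters in scripts run with set -e.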
- Transfer remote files (whole folder) using lftp (e.g. from CNAG server)
lftp -u username,'password' sftp://ftp.cnag.cat
set ftp:ssl-allow false
set ftp:passive-mode off
set ssl:verify-certificate no
mirror --verbose -c --verbose /PATH/TO/REMOTE/FOLDER /PATH/TO/LOCAL/FOLDER
- Compute mean coverage from a particular BAM file
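One way is to use samtools depth, which returns the depth at each sequenced position (positions with zero coverage are omitted by default, so the mean is taken over covered positions only):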
samtools depth file.bam | awk '{sum+=$3} END { print "Average = ",sum/NR}'
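To make the awk part of this command concrete, here is a minimal sketch run on hand-written lines that mimic the three-column layout of samtools depth (chromosome, position, depth). The function name mean_depth and the sample values are assumptions for illustration; the guard against empty input is an addition that the one-liner above does not have.

```shell
# mean_depth: average of column 3 over all lines read from stdin.
# Prints "NA" when there is no input, instead of dividing by zero.
mean_depth() {
    awk '{ sum += $3 } END { if (NR > 0) print sum / NR; else print "NA" }'
}

# Hand-written stand-in for `samtools depth file.bam` output.
# Prints 20, i.e. (10 + 20 + 30) / 3.
printf 'chr1\t100\t10\nchr1\t101\t20\nchr1\t102\t30\n' | mean_depth
```

With a real BAM file, the same function could be used as samtools depth file.bam | mean_depth.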
However, samtools depth is extremely time-consuming for large BAM files (e.g. WGS or high-coverage targeted sequencing), since the run time depends on the total number of reads. A better approach is to use samtools idxstats, which returns, for each reference sequence, its length and the number of mapped reads. Combined with a second command that estimates the mean read length, this gives an approximation of the coverage in a few seconds. Note the approximation: the mean read length is estimated from the first 1,000,000 reads only, so that the whole BAM file does not have to be read; this number should probably be adapted depending on the coverage:
declare -i meanreadlength numberreads lengthsequence
# mean read length, estimated from the first 1,000,000 reads (column 10 = read sequence)
meanreadlength=$(samtools view file.bam | head -n 1000000 | awk -F'\t' '{total += length($10)} END {print int(total/NR)}')
# total number of mapped reads (idxstats column 3)
numberreads=$(samtools idxstats file.bam | awk 'BEGIN {total=0} {total += $3} END {print total}')
# total length of the reference sequences (idxstats column 2)
lengthsequence=$(samtools idxstats file.bam | awk 'BEGIN {total=0} {total += $2} END {print total}')
# integer arithmetic: the result is truncated to a whole number
meancoverage=$((meanreadlength * numberreads / lengthsequence))
echo "$meancoverage"
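The arithmetic behind this estimate is mean coverage ≈ mean read length × mapped reads / total reference length. The sketch below applies it to a hand-written table in the four-column layout of samtools idxstats (name, length, mapped reads, unmapped reads). The function name approx_coverage and all numbers are hypothetical, and the mean read length is passed as an argument rather than estimated from a BAM file.

```shell
# approx_coverage MEAN_READ_LENGTH: read an idxstats-style table on stdin and
# print int(mean_read_length * total_mapped_reads / total_reference_length).
approx_coverage() {
    awk -v readlen="$1" '
        { ref_len += $2; mapped += $3 }
        END { if (ref_len > 0) print int(readlen * mapped / ref_len) }
    '
}

# Hand-written stand-in for `samtools idxstats file.bam` output.
# Total length = 1500, mapped reads = 300, read length = 100:
# 100 * 300 / 1500 = 20, so this prints 20.
printf 'chr1\t1000\t200\t0\nchr2\t500\t100\t0\n*\t0\t0\t10\n' | approx_coverage 100
```

The last line of the table (name *) stands for the unplaced reads; it contributes 0 to both sums, so it does not affect the result.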
- Re-align a BAM file to a new reference
# based on https://www.biostars.org/p/326714/
java -jar picard.jar SamToFastq I=<file_alnMAP.bam> FASTQ=<filemap_1.fq> SECOND_END_FASTQ=<filemap_2.fq>
bwa mem -R <read_group> <ref2.fa> <filemap_1.fq> <filemap_2.fq>
- Compute number of genomic positions in a bed file
awk -F'\t' 'BEGIN{SUM=0}{ SUM+=$3-$2 }END{print SUM}' file.bed
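This works because BED intervals are 0-based and half-open, so end minus start is exactly the number of positions in an interval. Overlapping intervals are counted once per interval, so the total is only a number of distinct positions if the file contains no overlaps. A minimal sketch on a hypothetical three-line BED file (the function name bed_total_length is an assumption):

```shell
# bed_total_length FILE: sum of (end - start) over all intervals of a BED file.
# "sum + 0" makes awk print 0 rather than an empty line for an empty file.
bed_total_length() {
    awk -F'\t' '{ sum += $3 - $2 } END { print sum + 0 }' "$1"
}

# Hypothetical BED file: intervals of 100, 50 and 10 positions.
bed=$(mktemp)
printf 'chr1\t0\t100\nchr1\t150\t200\nchr2\t10\t20\n' > "$bed"

# Prints 160.
bed_total_length "$bed"

rm -f "$bed"
```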
- Extract nucleotides in a FASTA file at given positions
samtools faidx genome.fasta chr:start-end
- Convert a Fastq to a Fasta file
sed -n '1~4s/^@/>/p;2~4p' input.fastq > input.fasta
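The first~step addresses (1~4, 2~4) used above are a GNU sed extension. Where GNU sed is not available, an awk variant along the following lines could be used instead. This is a sketch, not the original method: like the sed command it assumes strict four-line FASTQ records, and it differs in that it replaces the first character of every header line without checking that it is @. The function name fastq_to_fasta and the sample reads are hypothetical.

```shell
# fastq_to_fasta FILE: print a FASTA version of a four-line-per-record FASTQ file.
# Line 1 of each record becomes the ">" header, line 2 is the sequence;
# lines 3 ("+") and 4 (qualities) are dropped.
fastq_to_fasta() {
    awk 'NR % 4 == 1 { print ">" substr($0, 2) } NR % 4 == 2 { print }' "$1"
}

# Hypothetical two-read FASTQ file.
fq=$(mktemp)
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n' > "$fq"

# Prints: >read1, ACGT, >read2, TTGA (one per line).
fastq_to_fasta "$fq"

rm -f "$fq"
```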
- Create a new conda environment
conda create --name environment_name
- Install a package inside an existing conda environment
conda install -n environment_name -c channel_name package_to_install