Just a collection of useful one-liners, often a bit too long to remember.
Print a column by its header name. Pretty useful for big tables (source):
awk -F'\t' -v c='colname' 'NR==1{for (i=1; i<=NF; i++) if ($i==c){p=i; break}; next} {print $p}' myfile.tsv
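This can fail when the column you want is in the last position: it returns the whole line. A likely cause is Windows line endings: the last header field then ends in a carriage return, the exact comparison $i==c never matches, p stays unset, and $p evaluates to $0.
Alternative that matches the header with a regular expression, so partial matches also count (source):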
awk -F'\t' -v colname='colname' '{if(NR==1) for(i=1;i<=NF;i++) { if($i~colname) { colnum=i;break} } else print $colnum}' myfile.tsv
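A sketch that guards against both failure modes: it strips a trailing carriage return (assuming Windows line endings are what breaks the last column) and prints nothing when the column is not found. The demo table and its column names are made up.

```shell
# Hypothetical demo table with CRLF line endings; names are made up.
tmp=$(mktemp)
printf 'id\tname\tscore\r\n1\tfoo\t0.5\r\n2\tbar\t0.7\r\n' > "$tmp"

# Exact header match, but with the carriage return removed first.
last_col=$(awk -F'\t' -v c='score' '
  { sub(/\r$/, "") }    # drop CR if present; awk re-splits the fields
  NR==1 { for (i=1; i<=NF; i++) if ($i==c) { p=i; break }; next }
  p { print $p }        # print only if the column was found
' "$tmp")

echo "$last_col"
rm -f "$tmp"
```

Because the match is exact, a column called score is not confused with one called score_norm.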
awk -F'>' 'NR==FNR{ids[$0]; next} NF>1{f=($2 in ids)} f' ids.txt myseqs.fasta
Where ids.txt is a list of the names of the sequences to extract from myseqs.fasta.
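To see how the flag works, here is the same command on a toy pair of files (sequence names invented for the demo). Note that each ID has to equal the whole header text after the >, so headers with descriptions need the description in ids.txt too.

```shell
# Toy input files; sequence names are invented for the demo.
workdir=$(mktemp -d)
printf '>seqA\nACGT\nGGCC\n>seqB\nTTTT\n>seqC\nCCCC\n' > "$workdir/myseqs.fasta"
printf 'seqA\nseqC\n' > "$workdir/ids.txt"

# First file: store each ID as an array key.
# Second file: on header lines (NF>1 when split on ">") set the flag f to
# whether the name is a stored ID; every line is printed while f is true.
picked=$(awk -F'>' 'NR==FNR{ids[$0]; next} NF>1{f=($2 in ids)} f' \
  "$workdir/ids.txt" "$workdir/myseqs.fasta")

echo "$picked"
rm -rf "$workdir"
```

This prints the seqA record (both sequence lines) and the seqC record, and skips seqB.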
awk 'BEGIN {RS = ">" ; FS = "\n" ; ORS = ""} $2 {print ">"$0}' myseqs.fasta
Removes records that have a header but no sequence. Useful before doing any work with the amino acid gene calls exported by anvi-get-sequences-for-gene-calls, as that export keeps the headers of non-protein-coding genes with no sequence.
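A quick check on a toy file (gene names are made up): with RS set to > each record is one FASTA entry, and with FS set to newline $2 is the first sequence line, so the pattern $2 is only true for entries that have a sequence.

```shell
# Toy FASTA in which gene2 has a header but no sequence.
tmp=$(mktemp)
printf '>gene1\nMKV\n>gene2\n>gene3\nMLL\nAAK\n' > "$tmp"

# Keep only records whose second field (first sequence line) is non-empty.
kept=$(awk 'BEGIN {RS = ">" ; FS = "\n" ; ORS = ""} $2 {print ">"$0}' "$tmp")

echo "$kept"
rm -f "$tmp"
```

gene1 and gene3 come through unchanged, multi-line sequences included; gene2 is dropped.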
This will add up the values of column 2, giving the total sum for each unique value in column 1 (source).
awk -F "\t" -v OFS="\t" '{a[$1] += $2} END {for (i in a) print i, a[i]}' myfile.tsv
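A small worked example with invented gene names. The order in which awk walks the array in the END block is not defined, so the sketch pipes the result through sort to get a stable order.

```shell
# Toy two-column table; keys and values are made up.
tmp=$(mktemp)
printf 'geneA\t3\ngeneB\t10\ngeneA\t4\ngeneB\t0.5\n' > "$tmp"

# Accumulate column 2 per key in column 1, then sort the output lines.
totals=$(awk -F'\t' -v OFS='\t' '{a[$1] += $2} END {for (i in a) print i, a[i]}' "$tmp" | sort)

echo "$totals"
rm -f "$tmp"
```

Here geneA sums to 7 and geneB to 10.5.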
awk 'BEGIN{FS=" "} /^>/{print $1; next} {print}' myseqs.fasta
Deletes everything in each header from the first space onwards and prints sequence lines unchanged. Change the FS=" " to the character marking where you want to delete from.
Convenient for cleaning fasta headers for phylo work.
This is another alternative that should work in most cases:
cut -f1 -d " " myseqs.fasta
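To check the behaviour on a toy file, the sketch below trims headers at the first | and leaves sequence lines alone. This is a header-only variant; the delimiter and the sequence names are assumptions for the demo. Touching only header lines matters when the delimiter could also occur in the sequences, which cut would truncate as well.

```shell
# Toy alignment; headers carry extra fields after "|", one header has none.
tmp=$(mktemp)
printf '>sp1|gene=rpoB|len=8\nACGT-ACG\n>sp2\nAC-TTACG\n' > "$tmp"

# On header lines print only the part before the first "|";
# every other line is printed as is.
cleaned=$(awk -F'|' '/^>/ {print $1; next} {print}' "$tmp")

echo "$cleaned"
rm -f "$tmp"
```

The headers become >sp1 and >sp2, and a header without the delimiter is kept whole.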
sort -k1,1n -k2,2 -k5,5gr myfile.tsv
Sorts numerically on column 1 first, then does standard (alphabetic) sorting on column 2, and finally a "general numeric sort" in reverse order on column 5 (i.e. it recognises scientific notation and sorts high to low).
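A worked example on invented data. Two things here are additions, not part of the one-liner: -t with a literal tab, so that fields containing spaces are not split, and LC_ALL=C, so the alphabetic key does not depend on the locale. The -g option is assumed to be available (it is in GNU sort).

```shell
# Toy five-column table; values are made up.
tmp=$(mktemp)
printf '2\tb\tx\ty\t1e-3\n1\tb\tx\ty\t5\n1\ta\tx\ty\t2e2\n1\ta\tx\ty\t3.5\n10\ta\tx\ty\t1\n' > "$tmp"

tab=$(printf '\t')
# Key 1 numeric, key 2 alphabetic, key 5 general numeric, reversed.
sorted=$(LC_ALL=C sort -t "$tab" -k1,1n -k2,2 -k5,5gr "$tmp")

echo "$sorted"
rm -f "$tmp"
```

Within the rows that share 1 and a, 2e2 is read as 200 and lands before 3.5; a plain -n key would stop reading at the e and treat it as 2.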
cat test.fastq | paste - - - - | cut -f 2 | tr -d '\n' | wc -c
Counts the total number of bases in a FastQ file. paste joins every 4 lines into one row of 4 tab-separated columns. The second line in each FastQ record is the actual sequence (we grab them with cut). tr is used to remove end of line (\n) characters, otherwise wc counts them.
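As a sanity check, the pipeline can be compared with an awk version that adds up the length of every second line in each block of four. This is only a sketch on a toy file with made-up reads, and both approaches assume strict 4-line FastQ records (no wrapped sequences).

```shell
# Toy FastQ with two reads of 4 and 6 bases.
tmp=$(mktemp)
printf '@r1\nACGT\n+\nIIII\n@r2\nACGTAC\n+\nIIIIII\n' > "$tmp"

# Pipeline from above, reading the file on stdin instead of via cat.
n_paste=$(paste - - - - < "$tmp" | cut -f 2 | tr -d '\n' | wc -c)

# awk alternative: line 2 of every 4-line record is the sequence.
n_awk=$(awk 'NR % 4 == 2 {n += length($0)} END {print n + 0}' "$tmp")

echo "$n_paste $n_awk"
rm -f "$tmp"
```

Both counts should agree (10 bases for this toy file); some wc implementations pad the number with spaces.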