sam2psl

Convert SAM format to PSL format.

Usage

   make  
   ./sam2psl -h  ## print header, control-c to exit  
   cat tmp.sam | ./sam2psl -h  | grep -v ^#

Notes

This software is written to be portable and is meant to be used in a pipe. To compile, run g++ -O2 sam2psl.cpp. Bug reports are appreciated.
By default, the original SAM alignments are printed with '#' at the beginning. Use grep -v ^# to get rid of them. Use | cut -f-21 to produce PSL-only output.
This software has only been tested on bwa and bowtie2 output and may or may not work well with other aligners.
tStart is reported as -1 if a read is not aligned.
matches is the total length of the M/I/D blocks in the CIGAR string.
misMatches is the edit distance reported in the NM:i:[0-9]+ field.
repMatches is not calculated and is reported as 0.
blocks: in SAM format, all M/I/D/= blocks are considered matched parts. To conform closely to PSL format, each M block is treated as a separate block. For example, the CIGAR string 10S30M4D30M5S produces 2 blocks, with blockSizes "30,30,", qStarts "10,40,", and tStarts "POS-1,POS-1+34,".
TLEN is the whole template length as reported by bowtie2, and the matched template length (not including the soft-clipped part) as reported by BWA.
AS is the mapping score. Different aligners use different formulas.
MAPS is given by matches - misMatches; this value is calculated so that alignments from different aligners are comparable.
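
The block rule in the note above can be sketched in C++ as follows. This is a minimal illustration, not the code in sam2psl.cpp: the names Blocks, cigarToBlocks and joinPsl are hypothetical, and the handling of =, X, N, H and P operations is an assumption, since the notes only describe M, I, D and S.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical container for the three block lists; not taken from sam2psl.cpp.
struct Blocks {
    std::vector<long> sizes;    // blockSizes
    std::vector<long> qStarts;  // 0-based start of each block in the query
    std::vector<long> tStarts;  // 0-based start of each block in the target
};

// Walk a CIGAR string and open a new block at every M operation.
// Assumption: = and X are treated like M. pos is the 1-based SAM POS field.
Blocks cigarToBlocks(const std::string& cigar, long pos) {
    Blocks b;
    long q = 0;        // current offset in the query
    long t = pos - 1;  // current offset in the target, converted to 0-based
    long len = 0;      // length of the operation being read
    for (char c : cigar) {
        if (std::isdigit(static_cast<unsigned char>(c))) {
            len = len * 10 + (c - '0');
            continue;
        }
        switch (c) {
            case 'M': case '=': case 'X':  // consumes query and target
                b.sizes.push_back(len);
                b.qStarts.push_back(q);
                b.tStarts.push_back(t);
                q += len;
                t += len;
                break;
            case 'I': case 'S':  // consumes query only
                q += len;
                break;
            case 'D': case 'N':  // consumes target only
                t += len;
                break;
            default:  // assumption: H and P advance neither coordinate
                break;
        }
        len = 0;
    }
    return b;
}

// Join values in the PSL list style, with a comma after every entry.
std::string joinPsl(const std::vector<long>& v) {
    std::string out;
    for (long x : v) out += std::to_string(x) + ",";
    return out;
}
```

With POS = 101, calling cigarToBlocks("10S30M4D30M5S", 101) yields two blocks whose lists join to "30,30,", "10,40," and "100,134,", matching the example in the note.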

Status

qBlocks and tBlocks are not printed yet, but the two fields are present in the output as lone commas (,).
The other fields should be accurate.

Output

The columns are:

matches,     ##1.  Number of matching bases that aren't repeats; note, all matches are included. 
misMatches,  ##2.  Number of bases that don't match.
repMatches,  ##3.  Number of matching bases that are part of repeats; always 0, see column 1.  
nCount,      ##4.  Number of 'N' bases.
qNumInsert,  ##5.  Number of inserts in query.
qBaseInsert, ##6.  Number of bases inserted into query.
tNumInsert,  ##7.  Number of inserts in target.
tBaseInsert, ##8.  Number of bases inserted into target; commonly referred to as a deletion.
strand,      ##9.  '+' or '-'; '*' if it cannot be determined.
qName,       ##10. Query sequence name.
qSize,       ##11. Query sequence size.
qStart,      ##12. Alignment start position in query.
qEnd,        ##13. Alignment end position in query.
tName,       ##14. Target sequence name.
tSize,       ##15. Target sequence size.
tStart,      ##16. Alignment start position in target.
tEnd,        ##17. Alignment end position in target.
blockCount,  ##18. Number of blocks in the alignment.
blockSizes,  ##19. Comma-separated list of sizes of each block.
qStarts,     ##20. Comma-separated list of start position of each block in query.
tStarts,     ##21. Comma-separated list of start position of each block in target.

qBlocks,     ##22. Comma-separated list of sequence blocks in query. 
tBlocks,     ##23. Comma-separated list of sequence blocks in target on aligned strand reading from 5' to 3'. 

RI,          ##24. Read index, [1/2].
RNEXT,       ##25. Mate target, [=/chr].
PNEXT,       ##26. Mate position, 0-based.
TLEN,        ##27. Template length, reported by aligner. 
MAPQ,        ##28. Mapping quality.
AS,          ##29. Mapping score, reported by the aligner.
MAPS,        ##30. Mapping score, matched length - edit distance, similar to bwa.
FPAIRED,     ##31. Paired end or single end, [P/S].
FPROPER_PAIR,##32. Pairs properly paired or unpaired, [P/U].
FSECONDARY,  ##33. Primary or secondary mapping, [P/S].    
FQC,         ##34. Quality control pass or fail, [P/F].
FDUP,        ##35. Read primary or duplicated, [P/D].
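
Several of the columns above (strand, RI and the F* columns) can be derived from the SAM FLAG field. The sketch below shows one way to do that using the standard SAM flag bits. It is not the code in sam2psl.cpp: FlagColumns and decodeFlag are hypothetical names, and the choices marked as assumptions (for example, reporting '*' for unmapped reads) may differ from what the tool actually does.

```cpp
// Hypothetical holder for the flag-derived columns; not taken from sam2psl.cpp.
struct FlagColumns {
    char strand;      // column 9:  '+', '-' or '*'
    char ri;          // column 24: '1' or '2'
    char paired;      // column 31: 'P' (paired end) or 'S' (single end)
    char properPair;  // column 32: 'P' (properly paired) or 'U' (unpaired)
    char secondary;   // column 33: 'P' (primary) or 'S' (secondary)
    char qc;          // column 34: 'P' (pass) or 'F' (fail)
    char dup;         // column 35: 'P' (primary) or 'D' (duplicate)
};

// Decode a SAM FLAG value using the bit definitions from the SAM specification.
FlagColumns decodeFlag(unsigned flag) {
    FlagColumns c;
    // Assumption: an unmapped read (0x4) has no determinable strand.
    c.strand = (flag & 0x4) ? '*' : ((flag & 0x10) ? '-' : '+');
    // Assumption: any read not marked as the last segment (0x80) is read 1.
    c.ri = (flag & 0x80) ? '2' : '1';
    c.paired = (flag & 0x1) ? 'P' : 'S';       // 0x1: template has multiple segments
    c.properPair = (flag & 0x2) ? 'P' : 'U';   // 0x2: each segment properly aligned
    c.secondary = (flag & 0x100) ? 'S' : 'P';  // 0x100: secondary alignment
    c.qc = (flag & 0x200) ? 'F' : 'P';         // 0x200: failed quality checks
    c.dup = (flag & 0x400) ? 'D' : 'P';        // 0x400: PCR or optical duplicate
    return c;
}
```

For example, a FLAG of 99 (paired, properly paired, mate on the reverse strand, first in pair) decodes to strand '+', RI 1 and 'P' in every F* column, while a FLAG of 147 decodes to strand '-' and RI 2.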

yhwu / sam2psl
