Non-text characters appearing in SEQ field

csoeder opened this issue a year ago

I had an odd error crop up. I aligned my reads as usual:
$ bwa samse droSim1.fa fragSimulated_BCM10NE.clean.R0.fastq.droSim1.sai fragSimulated_BCM10NE.clean.R0.fastq -r '@RG\tID:foo\tSM:bar' > test.sam
(I have tried this with and without the readgroup flag, with the same outcome)

But when I tried to manipulate the alignment with samtools, I got this:
$ samtools view test.sam > /dev/null
[W::sam_read1_sam] Parse error at line 28
samtools view: error reading file "test.sam"

There was nothing obviously wrong with line 28:

$ head -n 28 test.sam | tail -n 2
READ_7 0 chr2h_random 1166746 37 100M * 0 0 GCAAACCTATTTGAGCCTGCTTCAGACACGACGGTGAGGTATGCACTGTTTCGATGTAAAGAGAGTCGGCGCTCGTCTTGCTCATTTTGCCGCTGAGCGC BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB RG:Z:foo XT:A:U NM:i:0 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:100
READ_8 4 * 0 0 * * 0 0 TCGGTGCACAGAAAGAAAANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGGGGGGTTGAGGCTTAGAAGGGGGCGTGGCCGGGCGGAT BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB RG:Z:foo

I experimented a bit: deleting the offending line just moves the parse error to another line (353), and so on. Some of the affected reads are unmapped; some are mapped. Oddly, although the file appeared to be an uncompressed SAM, it sometimes behaved as though it were binary. For example, grep would not print the matching lines, and file did not recognize it as text:
$ grep READ test.sam
Binary file test.sam matches
$ file test.sam
test.sam: data

Then I noticed that the problematic lines have a length mismatch between the SEQ and QUAL fields, which seemed like it could be the issue, but I have never encountered this before.

Finally, while I was examining the file in less, I noticed this:

READ_8 4 * 0 0 * * 0 0 TCGGTGCACAGAAAGAAAANNNNNNNNNNNNNNNNNNNNNNNNN^@NNNNNNNNNNNNNNNNGGGGGGTTGAGGCTTAGAAGGGGGCGTGGCCGGGCGGAT BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB RG:Z:foo

There seems to be a mystery character, "^@", which has found its way into the SEQ field in place of a proper base. I don't know how samtools interprets it, but when sent to standard output it just disappears, giving a string one character shorter than it is supposed to be. Within less, it appears with the reverse coloration that you see when you open a binary file, which might explain the earlier observations (grep and file treating the SAM as binary, and the apparent SEQ/QUAL length mismatch).
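
For reference, "^@" is the caret notation that less uses for a NUL byte (0x00), and a NUL byte is typically enough for tools such as grep to treat a file as binary. A minimal sketch for locating such bytes is below; it is only an illustration, the function name find_control_bytes is hypothetical, and it assumes the file is small enough to read into memory in one go.

```python
import sys


def find_control_bytes(data):
    """Return (line_number, column, byte_value) for each unexpected control byte.

    Tab, line feed and carriage return are allowed, since they are normal
    in a tab-separated text file. Line numbers and columns are 1-based.
    """
    allowed = {0x09, 0x0A, 0x0D}
    hits = []
    line_number = 1
    column = 0
    for byte in data:  # iterating over bytes yields integers
        if byte == 0x0A:
            # A line feed ends the current line; restart the column count.
            line_number += 1
            column = 0
            continue
        column += 1
        if (byte < 0x20 or byte == 0x7F) and byte not in allowed:
            hits.append((line_number, column, byte))
    return hits


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python find_control_bytes.py test.sam
    with open(sys.argv[1], "rb") as handle:
        for line_number, column, value in find_control_bytes(handle.read()):
            print(f"line {line_number}, column {column}: byte 0x{value:02x}")
```

Running the same check on the input FASTQ and on the reference FASTA might help narrow down whether the stray byte is already present in the inputs or only appears in the aligner's output.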

The only thing I'm doing that's slightly unusual is that I'm aligning pseudoreads synthetically generated by fragmenting a reference genome, but I don't think that's responsible (I've recently used this pipeline without issue).

Any tips on how to make this not happen?

Thanks,
Charlie