Read group tag (RG:Z:) position seems to matter; somalier will not run unless it is at the end
JustinChu opened this issue
When I run somalier on this dataset, I get a read-group error.
somalier extract -d extracted/ --sites sites.hg38.vcf.gz -f hs38.fa HG00733_hic.bam
somalier version: 0.2.15
somalier.nim(28) get_sample_name
Error: unhandled exception: [somalier] no read-group in bam file [ValueError]
samtools view HG00733_hic.bam | head -1
NB551675:7:HHLMHBGX9:1:22205:10514:9656 16 chr1 9998 39 67S84M * 0 0 CCCCCCCCCCCCCCCCCCCCCCCCCCCCACCCCCCACCCCCCACCCTAACCCTATCTCTAATCTTTACGATAACCCTAACCCTAACCCTAACACTAACC
CTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA EEEE6/AEEE///E<</////6///A/////</////////////////////E////A///////////E/////////////A////A////E/////A////E//A//EE/A/EEEE/E/EAEEEAEEEEEEEEEEEEEEEEEA
AAAA NM:i:1 MD:Z:25C58 AS:i:79 XS:i:64 RG:Z:HG00733_hic SA:Z:chr2,32916254,+,109S42M,0,2; XA:Z:chr20,+64287312,50M2D32M69S,4;
However, it seems to work when I move the read-group tag to the end.
somalier extract -d extracted/ --sites sites.hg38.vcf.gz -f hs38.fa HG00733_hic_moveRG.bam
samtools view HG00733_hic_moveRG.bam | head -1
NB551675:7:HHLMHBGX9:1:22205:10514:9656 16 chr1 9998 39 67S84M * 0 0 CCCCCCCCCCCCCCCCCCCCCCCCCCCCACCCCCCACCCCCCACCCTAACCCTATCTCTAATCTTTACGATAACCCTAACCCTAACCCTAACACTAACC
CTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA EEEE6/AEEE///E<</////6///A/////</////////////////////E////A///////////E/////////////A////A////E/////A////E//A//EE/A/EEEE/E/EAEEEAEEEEEEEEEEEEEEEEEA
AAAA NM:i:1 MD:Z:25C58 AS:i:79 XS:i:64 SA:Z:chr2,32916254,+,109S42M,0,2; XA:Z:chr20,+64287312,50M2D32M69S,4; RG:Z:HG00733_hic
As per the SAM/BAM specification, tag position should not matter. It is quite tedious to alter and reindex all of the alignments, so if you have any suggestions for an easy fix, that would be great.
Can you show the SAM header (with grep RG) of both files?
samtools view -H HG00733_hic.bam | grep @RG | grep -v @PG
@RG ID:HG00733_hic
samtools view -H HG00733_hic_RG.bam | grep @RG | grep -v @PG
@RG ID:HG00733_hic SM:HG00733_hic
I'll try swapping the headers to see if that does anything, given the additional SM entry.
Yes, the SM tag is required to get the sample name.
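For anyone hitting the same error, here is a minimal sketch of one way to add the missing SM field by rewriting only the header, so the alignment records are left untouched. The helper name `add_sm_to_rg` and the output file names are hypothetical, and the sketch assumes every @RG line without an SM field should get the same sample name. `samtools view -H` and `samtools reheader` are the standard commands for dumping and replacing a header; re-indexing afterwards is probably the safe choice.

```shell
# Hypothetical helper: read SAM header text on stdin and append
# SM:<sample> to every @RG line that does not already have an SM field.
add_sm_to_rg() {
  sample="$1"
  awk -v sm="$sample" 'BEGIN { FS = OFS = "\t" }
    # Only touch @RG lines that lack a tab-delimited SM: field.
    $1 == "@RG" && $0 !~ /\tSM:/ { $0 = $0 OFS "SM:" sm }
    { print }'
}

# Possible usage (output file names are placeholders, not from the thread):
#   samtools view -H HG00733_hic.bam | add_sm_to_rg HG00733_hic > header.sam
#   samtools reheader header.sam HG00733_hic.bam > HG00733_hic.sm.bam
#   samtools index HG00733_hic.sm.bam
```

Lines that already carry an SM field, and all non-@RG header lines, pass through unchanged, so running the helper on an already-correct header should be harmless.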