alexdobin / STAR

RNA-seq aligner

Incorrect SAM flag set in BAM file when processing 10X single-cell 5prime data?

edg1983 opened this issue

I'm processing some datasets generated with the 10X 5-prime kit, and I see unexpected SAM flags in the BAM file produced by STAR.

For this, I'm using the following command.

STAR --runThreadN 12 \
	--soloType CB_UMI_Simple \
	--soloCellFilter EmptyDrops_CR \
	--soloFeatures GeneFull SJ \
	--genomeDir ${genome_ref} \
	--soloCBwhitelist ${white_list} \
	--outFileNamePrefix ${out_prefix}. \
	--readFilesCommand zcat \
	--outSAMtype BAM SortedByCoordinate \
	--quantMode TranscriptomeSAM \
	--outSAMattributes NH HI AS GX GN CB UB \
	--soloMultiMappers EM \
	--readFilesIn ${R1_fastqs} ${R2_fastqs} \
	--soloBarcodeMate 1 \
	--soloStrand Forward \
	--clip5pNbases 39 0 \
	--soloCBstart 1   --soloCBlen 16   --soloUMIstart 17   --soloUMIlen 10

In the resulting sorted BAM file (5prime.Aligned.sortedByCoord.out.bam), the bitwise FLAG of the reads includes the read-paired and mate-unmapped bits (values like 137, 153, 393, 409).
This causes problems downstream: many tools that process the BAM file see this FLAG and either skip the reads for lacking a proper pair or refuse to process the file because the mates are not found.
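To make the FLAG values above easier to read, here is a minimal shell sketch that splits a FLAG into its set bits. The bit values are those defined in the SAM specification; the function name `decode_flag` and the short bit labels are made up for this illustration.

```shell
# Decode a SAM FLAG into the names of the bits that are set.
# Bit values follow the SAM specification; the labels are informal.
decode_flag() {
    flag=$1
    out=""
    for pair in \
        1:paired 2:proper_pair 4:unmapped 8:mate_unmapped \
        16:reverse 32:mate_reverse 64:first_in_pair 128:second_in_pair \
        256:secondary 512:qc_fail 1024:duplicate 2048:supplementary
    do
        bit=${pair%%:*}     # number before the colon
        name=${pair#*:}     # label after the colon
        if [ $(( flag & bit )) -ne 0 ]; then
            out="${out:+$out,}$name"
        fi
    done
    echo "$out"
}

# FLAG values reported for the 5-prime run, then for the 3-prime run.
for f in 137 153 393 409 16 256 272; do
    printf '%s\t%s\n' "$f" "$(decode_flag "$f")"
done
```

For example, 137 decodes to `paired,mate_unmapped,second_in_pair`, while 16 decodes to just `reverse`. In other words, the 5-prime records are written as the second mate of a pair whose first mate did not map.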

Since this is single-cell data, the reads are not expected to be paired, so in my opinion these FLAGs are set incorrectly.

Indeed, for 3-prime data the generated BAM file contains the expected FLAGs, which do not assume paired reads (values like 16, 256, 272).

For 3-prime dataset processing, I used the following command.

STAR --runThreadN 12 \
	--soloType CB_UMI_Simple \
	--soloUMIlen 12 \
	--soloCellFilter EmptyDrops_CR \
	--soloFeatures GeneFull SJ \
	--genomeDir ${genome_ref} \
	--soloCBwhitelist ${white_list} \
	--outFileNamePrefix ${out_prefix}. \
	--readFilesCommand zcat \
	--outSAMtype BAM SortedByCoordinate \
	--quantMode TranscriptomeSAM \
	--clipAdapterType CellRanger4 \
	--outSAMattributes NH HI AS GX GN CB UB \
	--soloMultiMappers EM \
	--readFilesIn ${R2_fastqs} ${R1_fastqs}

--soloBarcodeMate 1 expects both mates to contain cDNA sequence and to map to the genome.
If your 5' library has no cDNA sequence on the barcode read, you can map it as if it were 3', without --soloBarcodeMate 1, probably with --soloStrand Reverse.
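A rough sketch of what this suggestion could look like, assembled from the flags of the 5-prime command above. It assumes the barcode read (R1) carries no cDNA, so R1 is passed second as in the 3-prime command and only R2 is aligned. Dropping `--clip5pNbases 39 0` is also an assumption, made because that clipping targeted the barcode mate. The wrapper name `star_5p_as_3p` is hypothetical, and the `${...}` variables are the placeholders from the original commands.

```shell
# Sketch only: the 5-prime command rearranged along the lines of the reply.
# Changes vs. the original 5-prime command (partly assumptions, see above):
#   - no --soloBarcodeMate 1
#   - no --clip5pNbases 39 0
#   - --soloStrand Reverse instead of Forward
#   - cDNA read (R2) first, barcode read (R1) second
# Call as "star_5p_as_3p echo" to print the command instead of running it.
star_5p_as_3p() {
    "$@" STAR --runThreadN 12 \
        --soloType CB_UMI_Simple \
        --soloCellFilter EmptyDrops_CR \
        --soloFeatures GeneFull SJ \
        --genomeDir "${genome_ref}" \
        --soloCBwhitelist "${white_list}" \
        --outFileNamePrefix "${out_prefix}." \
        --readFilesCommand zcat \
        --outSAMtype BAM SortedByCoordinate \
        --quantMode TranscriptomeSAM \
        --outSAMattributes NH HI AS GX GN CB UB \
        --soloMultiMappers EM \
        --readFilesIn "${R2_fastqs}" "${R1_fastqs}" \
        --soloStrand Reverse \
        --soloCBstart 1 --soloCBlen 16 --soloUMIstart 17 --soloUMIlen 10
}
```

Whether `Reverse` is the right strand setting depends on the library, so it is worth comparing the gene counts from both `--soloStrand` settings on a small subset before committing to a full run.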