STAR diploid sometimes labels reads as haplotype specific in regions with no variants

Question

STAR diploid sometimes labels reads as haplotype specific in regions with no variants

mparker2 opened this issue 4 months ago · comments

I came across an interesting corner case in my analysis using STAR diploid, which was run with the following command:

        STAR \
          --runThreadN {threads} \
          --genomeDir "$TOPDIR/{input.index}" \
          --readFilesIn "$TOPDIR/{input.read}" "$TOPDIR/{input.mate}" "$TOPDIR/{input.barcode}" \
          --readFilesCommand "zcat" \
          --soloType "CB_samTagOut" \
          --soloCBmatchWLtype "1MM" \
          --soloCBwhitelist "$TOPDIR/{input.barcode_whitelist}" \
          --soloBarcodeReadLength 0 \
          --outFilterMultimapNmax 1 \
          --outFilterMismatchNmax 0 \
          --alignMatesGapMax 500 \
          --alignIntronMax 1 \
          --outSAMtype BAM Unsorted \
          --outSAMattributes NH HI AS nM CB sS sQ ha \
          --genomeTransformOutput SAM

Because of the use of --outFilterMultimapNmax 1, the resulting BAM files only contain reads that can distinguish one haplotype from the other. These should all theoretically overlap the position of at least one of the variants in the VCF file used during indexing, however I found that a small number of haplotype-labelled alignments fall in regions that do not contain any variants at all:

The top track shows the VCF that was used to build the STAR diploid index. The lower track shows the read alignments, grouped by ha tag. Variants from the region on the left (up to ~11,737 kb) have been masked, so no reads should align in this region. However there are some that still do.

Anecdotally this seems to mostly occur at repetitive elements or in the centromere. I was wondering if there is some stochasticity or heuristic in the mapping procedure, that could cause the reads to only map in one haplotype and not the other, even when the sequence is present in both? If this is the case, perhaps a check during the genomic coordinate transformation procedure could be used to prevent incorrect labelling of a unique haplotype?

Best wishes
Matt

Alexander Dobin · Answer 1 · Sat Mar 09 2024 02:51:40 GMT+0800 (China Standard Time)

Hi Matt,

This should not happen, in principle.
If you can extract a few reads like that from SAM, I can take a closer look.

Matthew Parker · Answer 2 · Mon Mar 11 2024 23:39:53 GMT+0800 (China Standard Time)

Sure, attached. do you also need the indices used for mapping?
sample.zip