Interaction between STARsolo and STAR diploid

Question

Interaction between STARsolo and STAR diploid

mparker2 opened this issue 2 months ago · comments

I've noticed that reads that are assigned to haplotype 2 by STAR diploid are not processed in the same way as haplotype 1 reads by STARsolo. For instance, I am fairly confident that only reads that map to haplotype 1 or both haplotypes equally well are used for producing STARsolo gene x cell count matrices. Also the UB tags on reads that align to haplotype 2 do not seem to be deduplicated properly and still match the UR tags (tested using the "1MM_Directional_UMItools). This could lead to biases in downstream analyses using either the count matrices or UMIs for quantification, if users are not aware of the limitations of using STARsolo and STAR diploid at the same time.

Best wishes
Matt

Alexander Dobin · Answer 1 · Tue Apr 23 2024 09:55:02 GMT+0800 (China Standard Time)

Hi Matt,
This could be a bug. Can you extract a few reads that map to hap2 and map them separately to confirm.
I will look into it at some point, but not sure when I will have time.

Matthew Parker · Answer 2 · Tue Apr 23 2024 22:24:51 GMT+0800 (China Standard Time)

Hi Alex,

Thanks for the reply! Heres a bit of evidence for the lack of UMI deduplication on hap2 reads. Across a BAM file of scRNA-seq reads mapped with STAR diploid + STARsolo, using the UMI dedup method 1MM_Directional_UMItools, there is a large difference in the UMI count distributions for ha:2 reads compared to ha:1 and ha:0, with ha:2 reads having many more UMIs that are supported by only a single read. This is probably an indicator of lack of deduplication:

Another thing pointing to this is that if I compare the uncorrected UMI tag UR to the UB tag for each read in the BAM file, then for ha:0 and ha:1 tags I find they differ approximately 1.27% of the time (to be precise, 1.278% for ha:0 and 1.269% for ha:1), indicating the UB has been corrected for a small proportion of the reads. For ha:2 reads only 0.004% of UB and UR tags differ, indicating they are not getting corrected at anywhere near the same rate.

Demonstrating this for the count matrices is slightly more tricky as I can't trace the counts back to which reads they originated from, but I can try to generate some examples as you suggest.

Matthew Parker · Answer 3 · Thu Apr 25 2024 21:18:37 GMT+0800 (China Standard Time)

Hi Alex, I've now done some simulations to test whether hap2 reads from STAR diploid are counted in STARsolo count matrices, & find that they are not currently being counted properly.

Sorry for deleting and reposting... I have removed all the unnecessary code from the post I initially uploaded, to keep it simple. If you need I can share all the code and simulated data for you to retest.

I simulated a 10 kb reference sequence with three 1kb intronless genes, and added ~100 annotated SNPs to each gene as haplotype 2:

fold -w 80 ref.fa | head


    >Chr1
    GGGGCATCCCTCAGTGGGTGTTATCCGGCTTCGAATCCCATCCATCCTACTGGCAAATGGCAAAAATTGTTGTGAAGGGA
    GCTTATAGGGGCTCCGTGTAGGAGGATTGGTCAACATCAGACCGCTCGGCGTCTATGAGGCTGATATATTGTTCGTTATC
    AGATTATTGAATTGGAACAGCAATTGATTTGCCGCCATCATGTACTTAACTCAAAGTGCGGTGAAGTTGTATCCAAGTTA
    GAGCGTGGTGAGTTGCCCCGATTCGAAGGCTTAGGATGCATTAGTAATCAGAGACTTTCTAGGCACGATAAGTAAGGCGA
    GTGTATCGCATGGTGTAGGTTGGTTGCTGGGAATCTCGGAGAGTGGGTCCGCGAGTGCTTGTTGTCCCTGGGTCCATAAG
    ACCTCCCCAGACTTCTGTAGACCCCATGATTTGCTACCCTTTCACACTTCAACGGCCGCCTGGTACTACTGGGAAGAGGG
    GAGAGATCCCCACAGCCTAGCTGTCGTAAAATAAGGCGGACCGTTAAGCTACTCTCTCAATTGGTGGCCCCAGTTTCGGC
    CAGCTTGAACGGGTTTGTGGGTAGAATGCACCTATCCGGCAATTCGATGGCTGTGCCTCTTTCACCCATCTCGCTATGGC
    GATTTGTAGGCGCGTGAAGCGTCAGGTGAACGACGAACTCCCGTATACGCTATGAAGCCCCTGACTTAGACGAAAATCCT


head annot.gtf


    Chr1	sim	transcript	1001	2000	.	+	.	gene_id "gene1"; transcript_id "gene1.1";
    Chr1	sim	exon	1001	2000	.	+	.	gene_id "gene1"; transcript_id "gene1.1";
    Chr1	sim	transcript	3001	4000	.	+	.	gene_id "gene2"; transcript_id "gene2.1";
    Chr1	sim	exon	3001	4000	.	+	.	gene_id "gene2"; transcript_id "gene2.1";
    Chr1	sim	transcript	5001	7000	.	+	.	gene_id "gene3"; transcript_id "gene3.1";
    Chr1	sim	exon	5001	7000	.	+	.	gene_id "gene3"; transcript_id "gene3.1";


head var.vcf


    ##fileformat=VCFv4.3
    ##source=sim
    #CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	HAP2
    Chr1	1003	snp1	A	T	.	PASS	.	GT	0|1
    Chr1	1010	snp2	C	A	.	PASS	.	GT	0|1
    Chr1	1013	snp3	A	G	.	PASS	.	GT	0|1
    Chr1	1014	snp4	G	C	.	PASS	.	GT	0|1
    Chr1	1022	snp5	T	G	.	PASS	.	GT	0|1
    Chr1	1025	snp6	C	A	.	PASS	.	GT	0|1
    Chr1	1046	snp7	C	T	.	PASS	.	GT	0|1

These were used to generate a diploid index using the following command:

STAR \
  --runThreadN 2 \
  --runMode genomeGenerate \
  --genomeDir STAR_index \
  --genomeFastaFiles ref.fa \
  --sjdbGTFfile annot.gtf \
  --genomeTransformVCF var.vcf \
  --genomeTransformType Diploid \
  --genomeSAindexNbases 5 \
  --sjdbOverhang 150


    	STAR --runThreadN 2 --runMode genomeGenerate --genomeDir STAR_index --genomeFastaFiles ref.fa --sjdbGTFfile annot.gtf --genomeTransformVCF var.vcf --genomeTransformType Diploid --genomeSAindexNbases 5 --sjdbOverhang 150
    	STAR version: 2.7.11b   compiled:  :/Users/distiller/project/STARcompile/source
    Apr 25 13:32:47 ..... started STAR run
    Apr 25 13:32:47 ... starting to generate Genome files
    Apr 25 13:32:47 ..... processing annotations GTF
    Apr 25 13:32:47 ... starting to sort Suffix Array. This may take a long time...
    Apr 25 13:32:47 ... sorting Suffix Array chunks and saving them to disk...
    Apr 25 13:32:47 ... loading chunks from disk, packing SA...
    Apr 25 13:32:47 ... finished generating suffix array
    Apr 25 13:32:47 ... generating Suffix Array index
    Apr 25 13:32:47 ... completed Suffix Array index
    Apr 25 13:32:47 ... writing Genome to disk ...
    Apr 25 13:32:47 ... writing Suffix Array to disk ...
    Apr 25 13:32:47 ... writing SAindex to disk
    Apr 25 13:32:47 ..... finished successfully
    Apr 25 13:32:47 ... starting to generate Genome files
    Apr 25 13:32:47 ..... processing annotations GTF
    Apr 25 13:32:47 ... starting to sort Suffix Array. This may take a long time...
    Apr 25 13:32:47 ... sorting Suffix Array chunks and saving them to disk...
    Apr 25 13:32:47 ... loading chunks from disk, packing SA...
    Apr 25 13:32:47 ... finished generating suffix array
    Apr 25 13:32:47 ... generating Suffix Array index
    Apr 25 13:32:47 ... completed Suffix Array index
    Apr 25 13:32:47 ... writing Genome to disk ...
    Apr 25 13:32:47 ... writing Suffix Array to disk ...
    Apr 25 13:32:47 ... writing SAindex to disk
    Apr 25 13:32:47 ..... finished successfully

I then simulated 10 cells, 5 from each haplotype, and for each I simulate 100 reads (with unique UMIs) per gene, to give a total of 1500 hap1 reads, and 1500 hap2 reads.

STAR \
  --runThreadN 2 \
  --genomeDir "STAR_index" \
  --readFilesIn read.2.fastq read.1.fastq \
  --soloType "CB_UMI_Simple" \
  --soloCBwhitelist cb_whitelist.txt \
  --soloUMIlen 12 \
  --soloUMIdedup "Exact" \
  --outSAMtype SAM \
  --outSAMattributes NH HI AS nM ha \
  --genomeTransformOutput SAM

    	STAR --runThreadN 2 --genomeDir STAR_index --readFilesIn read.2.fastq read.1.fastq --soloType CB_UMI_Simple --soloCBwhitelist cb_whitelist.txt --soloUMIlen 12 --soloUMIdedup Exact --outSAMtype SAM --outSAMattributes NH HI AS nM ha --genomeTransformOutput SAM
    	STAR version: 2.7.11b   compiled:  :/Users/distiller/project/STARcompile/source
    Apr 25 13:32:47 ..... started STAR run
    Apr 25 13:32:47 ..... loading genome
    Apr 25 13:32:47 ..... started mapping
    Apr 25 13:32:48 ..... finished mapping
    Apr 25 13:32:48 ..... started Solo counting
    Apr 25 13:32:48 ..... finished Solo counting
    Apr 25 13:32:48 ..... finished successfully

100% of reads align to the genome, nearly 100% of them are able to be assigned to a specific haplotype:

cat Log.final.out

                                     Started job on |	Apr 25 13:32:47
                                 Started mapping on |	Apr 25 13:32:47
                                        Finished on |	Apr 25 13:32:48
           Mapping speed, Million of reads per hour |	10.80
    
                              Number of input reads |	3000
                          Average input read length |	150
                                        UNIQUE READS:
                       Uniquely mapped reads number |	3000
                            Uniquely mapped reads % |	100.00%
                              Average mapped length |	150.00
                           Number of splices: Total |	0
                Number of splices: Annotated (sjdb) |	0
                           Number of splices: GT/AG |	0
                           Number of splices: GC/AG |	0
                           Number of splices: AT/AC |	0
                   Number of splices: Non-canonical |	0
                          Mismatch rate per base, % |	0.00%
                             Deletion rate per base |	0.00%
                            Deletion average length |	0.00
                            Insertion rate per base |	0.00%
                           Insertion average length |	0.00
                                 MULTI-MAPPING READS:
            Number of reads mapped to multiple loci |	0
                 % of reads mapped to multiple loci |	0.00%
            Number of reads mapped to too many loci |	0
                 % of reads mapped to too many loci |	0.00%
                                      UNMAPPED READS:
      Number of reads unmapped: too many mismatches |	0
           % of reads unmapped: too many mismatches |	0.00%
                Number of reads unmapped: too short |	0
                     % of reads unmapped: too short |	0.00%
                    Number of reads unmapped: other |	0
                         % of reads unmapped: other |	0.00%
                                      CHIMERIC READS:
                           Number of chimeric reads |	0
                                % of chimeric reads |	0.00%

grep -o -E "ha:i:\d" Aligned.out.sam | sort | uniq -c

      18 ha:i:0
    1489 ha:i:1
    1493 ha:i:2

If we look at the count matrix for the cells with reads simulated from haplotype 1, they are quantified perfectly:

	TCCTCGTTGCAGTAGG	TGCAGAGGTAGTCACT	ACACATAAGATCGCAC	TAAGCTTCATCAGTTT	GCCTTGTCATTGGAGT
gene1	100.0	100.0	100.0	100.0	100.0
gene2	100.0	100.0	100.0	100.0	100.0
gene3	100.0	100.0	100.0	100.0	100.0

But for the haplotype 2 cells, the reads that were assigned to hap2 are not counted at all. The low counts that we do see probably come from reads that are ambiguous (could be hap1 or hap2)

	TCGCTCTGCCCCTAGA	GGGTTGGCCAGATTGC
gene1	0.0	0.0
gene2	0.0	0.0
gene3	4.0	3.0