Interaction between STARsolo and STAR diploid
mparker2 opened this issue · comments
Dear @alexdobin,
I've noticed that reads that are assigned to haplotype 2 by STAR diploid are not processed in the same way as haplotype 1 reads by STARsolo. For instance, I am fairly confident that only reads that map to haplotype 1 or both haplotypes equally well are used for producing STARsolo gene x cell count matrices. Also the UB tags on reads that align to haplotype 2 do not seem to be deduplicated properly and still match the UR tags (tested using the "1MM_Directional_UMItools). This could lead to biases in downstream analyses using either the count matrices or UMIs for quantification, if users are not aware of the limitations of using STARsolo and STAR diploid at the same time.
Best wishes
Matt
Hi Matt,
This could be a bug. Can you extract a few reads that map to hap2 and map them separately to confirm.
I will look into it at some point, but not sure when I will have time.
Hi Alex,
Thanks for the reply! Heres a bit of evidence for the lack of UMI deduplication on hap2 reads. Across a BAM file of scRNA-seq reads mapped with STAR diploid + STARsolo, using the UMI dedup method 1MM_Directional_UMItools
, there is a large difference in the UMI count distributions for ha:2
reads compared to ha:1
and ha:0
, with ha:2
reads having many more UMIs that are supported by only a single read. This is probably an indicator of lack of deduplication:
Another thing pointing to this is that if I compare the uncorrected UMI tag UR
to the UB
tag for each read in the BAM file, then for ha:0
and ha:1
tags I find they differ approximately 1.27% of the time (to be precise, 1.278% for ha:0
and 1.269% for ha:1
), indicating the UB
has been corrected for a small proportion of the reads. For ha:2
reads only 0.004% of UB
and UR
tags differ, indicating they are not getting corrected at anywhere near the same rate.
Demonstrating this for the count matrices is slightly more tricky as I can't trace the counts back to which reads they originated from, but I can try to generate some examples as you suggest.
Hi Alex, I've now done some simulations to test whether hap2 reads from STAR diploid are counted in STARsolo count matrices, & find that they are not currently being counted properly.
Sorry for deleting and reposting... I have removed all the unnecessary code from the post I initially uploaded, to keep it simple. If you need I can share all the code and simulated data for you to retest.
I simulated a 10 kb reference sequence with three 1kb intronless genes, and added ~100 annotated SNPs to each gene as haplotype 2:
fold -w 80 ref.fa | head
>Chr1
GGGGCATCCCTCAGTGGGTGTTATCCGGCTTCGAATCCCATCCATCCTACTGGCAAATGGCAAAAATTGTTGTGAAGGGA
GCTTATAGGGGCTCCGTGTAGGAGGATTGGTCAACATCAGACCGCTCGGCGTCTATGAGGCTGATATATTGTTCGTTATC
AGATTATTGAATTGGAACAGCAATTGATTTGCCGCCATCATGTACTTAACTCAAAGTGCGGTGAAGTTGTATCCAAGTTA
GAGCGTGGTGAGTTGCCCCGATTCGAAGGCTTAGGATGCATTAGTAATCAGAGACTTTCTAGGCACGATAAGTAAGGCGA
GTGTATCGCATGGTGTAGGTTGGTTGCTGGGAATCTCGGAGAGTGGGTCCGCGAGTGCTTGTTGTCCCTGGGTCCATAAG
ACCTCCCCAGACTTCTGTAGACCCCATGATTTGCTACCCTTTCACACTTCAACGGCCGCCTGGTACTACTGGGAAGAGGG
GAGAGATCCCCACAGCCTAGCTGTCGTAAAATAAGGCGGACCGTTAAGCTACTCTCTCAATTGGTGGCCCCAGTTTCGGC
CAGCTTGAACGGGTTTGTGGGTAGAATGCACCTATCCGGCAATTCGATGGCTGTGCCTCTTTCACCCATCTCGCTATGGC
GATTTGTAGGCGCGTGAAGCGTCAGGTGAACGACGAACTCCCGTATACGCTATGAAGCCCCTGACTTAGACGAAAATCCT
head annot.gtf
Chr1 sim transcript 1001 2000 . + . gene_id "gene1"; transcript_id "gene1.1";
Chr1 sim exon 1001 2000 . + . gene_id "gene1"; transcript_id "gene1.1";
Chr1 sim transcript 3001 4000 . + . gene_id "gene2"; transcript_id "gene2.1";
Chr1 sim exon 3001 4000 . + . gene_id "gene2"; transcript_id "gene2.1";
Chr1 sim transcript 5001 7000 . + . gene_id "gene3"; transcript_id "gene3.1";
Chr1 sim exon 5001 7000 . + . gene_id "gene3"; transcript_id "gene3.1";
head var.vcf
##fileformat=VCFv4.3
##source=sim
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT HAP2
Chr1 1003 snp1 A T . PASS . GT 0|1
Chr1 1010 snp2 C A . PASS . GT 0|1
Chr1 1013 snp3 A G . PASS . GT 0|1
Chr1 1014 snp4 G C . PASS . GT 0|1
Chr1 1022 snp5 T G . PASS . GT 0|1
Chr1 1025 snp6 C A . PASS . GT 0|1
Chr1 1046 snp7 C T . PASS . GT 0|1
These were used to generate a diploid index using the following command:
STAR \
--runThreadN 2 \
--runMode genomeGenerate \
--genomeDir STAR_index \
--genomeFastaFiles ref.fa \
--sjdbGTFfile annot.gtf \
--genomeTransformVCF var.vcf \
--genomeTransformType Diploid \
--genomeSAindexNbases 5 \
--sjdbOverhang 150
STAR --runThreadN 2 --runMode genomeGenerate --genomeDir STAR_index --genomeFastaFiles ref.fa --sjdbGTFfile annot.gtf --genomeTransformVCF var.vcf --genomeTransformType Diploid --genomeSAindexNbases 5 --sjdbOverhang 150
STAR version: 2.7.11b compiled: :/Users/distiller/project/STARcompile/source
Apr 25 13:32:47 ..... started STAR run
Apr 25 13:32:47 ... starting to generate Genome files
Apr 25 13:32:47 ..... processing annotations GTF
Apr 25 13:32:47 ... starting to sort Suffix Array. This may take a long time...
Apr 25 13:32:47 ... sorting Suffix Array chunks and saving them to disk...
Apr 25 13:32:47 ... loading chunks from disk, packing SA...
Apr 25 13:32:47 ... finished generating suffix array
Apr 25 13:32:47 ... generating Suffix Array index
Apr 25 13:32:47 ... completed Suffix Array index
Apr 25 13:32:47 ... writing Genome to disk ...
Apr 25 13:32:47 ... writing Suffix Array to disk ...
Apr 25 13:32:47 ... writing SAindex to disk
Apr 25 13:32:47 ..... finished successfully
Apr 25 13:32:47 ... starting to generate Genome files
Apr 25 13:32:47 ..... processing annotations GTF
Apr 25 13:32:47 ... starting to sort Suffix Array. This may take a long time...
Apr 25 13:32:47 ... sorting Suffix Array chunks and saving them to disk...
Apr 25 13:32:47 ... loading chunks from disk, packing SA...
Apr 25 13:32:47 ... finished generating suffix array
Apr 25 13:32:47 ... generating Suffix Array index
Apr 25 13:32:47 ... completed Suffix Array index
Apr 25 13:32:47 ... writing Genome to disk ...
Apr 25 13:32:47 ... writing Suffix Array to disk ...
Apr 25 13:32:47 ... writing SAindex to disk
Apr 25 13:32:47 ..... finished successfully
I then simulated 10 cells, 5 from each haplotype, and for each I simulate 100 reads (with unique UMIs) per gene, to give a total of 1500 hap1 reads, and 1500 hap2 reads.
STAR \
--runThreadN 2 \
--genomeDir "STAR_index" \
--readFilesIn read.2.fastq read.1.fastq \
--soloType "CB_UMI_Simple" \
--soloCBwhitelist cb_whitelist.txt \
--soloUMIlen 12 \
--soloUMIdedup "Exact" \
--outSAMtype SAM \
--outSAMattributes NH HI AS nM ha \
--genomeTransformOutput SAM
STAR --runThreadN 2 --genomeDir STAR_index --readFilesIn read.2.fastq read.1.fastq --soloType CB_UMI_Simple --soloCBwhitelist cb_whitelist.txt --soloUMIlen 12 --soloUMIdedup Exact --outSAMtype SAM --outSAMattributes NH HI AS nM ha --genomeTransformOutput SAM
STAR version: 2.7.11b compiled: :/Users/distiller/project/STARcompile/source
Apr 25 13:32:47 ..... started STAR run
Apr 25 13:32:47 ..... loading genome
Apr 25 13:32:47 ..... started mapping
Apr 25 13:32:48 ..... finished mapping
Apr 25 13:32:48 ..... started Solo counting
Apr 25 13:32:48 ..... finished Solo counting
Apr 25 13:32:48 ..... finished successfully
100% of reads align to the genome, nearly 100% of them are able to be assigned to a specific haplotype:
cat Log.final.out
Started job on | Apr 25 13:32:47
Started mapping on | Apr 25 13:32:47
Finished on | Apr 25 13:32:48
Mapping speed, Million of reads per hour | 10.80
Number of input reads | 3000
Average input read length | 150
UNIQUE READS:
Uniquely mapped reads number | 3000
Uniquely mapped reads % | 100.00%
Average mapped length | 150.00
Number of splices: Total | 0
Number of splices: Annotated (sjdb) | 0
Number of splices: GT/AG | 0
Number of splices: GC/AG | 0
Number of splices: AT/AC | 0
Number of splices: Non-canonical | 0
Mismatch rate per base, % | 0.00%
Deletion rate per base | 0.00%
Deletion average length | 0.00
Insertion rate per base | 0.00%
Insertion average length | 0.00
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 0
% of reads mapped to multiple loci | 0.00%
Number of reads mapped to too many loci | 0
% of reads mapped to too many loci | 0.00%
UNMAPPED READS:
Number of reads unmapped: too many mismatches | 0
% of reads unmapped: too many mismatches | 0.00%
Number of reads unmapped: too short | 0
% of reads unmapped: too short | 0.00%
Number of reads unmapped: other | 0
% of reads unmapped: other | 0.00%
CHIMERIC READS:
Number of chimeric reads | 0
% of chimeric reads | 0.00%
grep -o -E "ha:i:\d" Aligned.out.sam | sort | uniq -c
18 ha:i:0
1489 ha:i:1
1493 ha:i:2
If we look at the count matrix for the cells with reads simulated from haplotype 1, they are quantified perfectly:
TCCTCGTTGCAGTAGG | TGCAGAGGTAGTCACT | ACACATAAGATCGCAC | TAAGCTTCATCAGTTT | GCCTTGTCATTGGAGT | |
---|---|---|---|---|---|
gene1 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
gene2 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
gene3 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
But for the haplotype 2 cells, the reads that were assigned to hap2 are not counted at all. The low counts that we do see probably come from reads that are ambiguous (could be hap1 or hap2)
TCGCTCTGCCCCTAGA | CCGTGAAATCCCAGGG | AAAGCGTGTAATGGCC | GGGTTGGCCAGATTGC | ACACACGCACAGCTTA | |
---|---|---|---|---|---|
gene1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
gene2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
gene3 | 4.0 | 0.0 | 0.0 | 3.0 | 0.0 |