alexdobin / STAR

RNA-seq aligner

Missing splice junctions in matrix.mtx file

claireleblanc opened this issue · comments

Hi Alex,

I am currently doing a splicing analysis of some SmartSeq single-cell RNA-seq data, aligned using STARsolo. During this, I noticed a difference between the splice junction counts in the matrix.mtx file and the splice junction counts in the "unique" column of the SJ.out.tab file. Specifically, there are many more splice junction counts in the SJ.out.tab "unique" column than in the matrix (1,004,455 in the SJ.out.tab file versus 121,335 in the matrix.mtx file).

I have checked multiple STARsolo runs and datasets, and the difference is present in all of them. Based on a previous GitHub issue (#1138), I thought that the matrix.mtx file should contain all the unique reads in the SJ.out.tab file. Is there some sort of filtering going on between the generation of the SJ.out.tab file and the splice junction count matrix?

Thank you so much!

Best,
Claire

Hi Claire,

The counts in the matrix.mtx are "collapsed", i.e. reads with the same start/end are counted as one. In the SJ.out.tab, all reads are counted. This may explain the difference you are observing.
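To reproduce the comparison described above, the two totals can be computed directly from the files. The following is a minimal sketch, not part of STAR: it assumes SJ.out.tab is tab-separated with the uniquely mapping read count in column 7, and that matrix.mtx is a Matrix Market coordinate file (comment lines starting with `%`, one size line, then `row column value` entries). The function names and the command-line usage are made up for illustration.

```python
import sys


def sum_sj_unique(sj_path):
    """Sum column 7 (uniquely mapping reads per junction) of an SJ.out.tab file."""
    total = 0
    with open(sj_path) as fh:
        for line in fh:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 7:
                continue  # skip blank or malformed lines
            total += int(fields[6])
    return total


def sum_mtx(mtx_path):
    """Sum the value column of a Matrix Market coordinate file."""
    total = 0.0
    seen_size_line = False
    with open(mtx_path) as fh:
        for line in fh:
            if line.startswith("%") or not line.strip():
                continue  # header/comment or empty line
            if not seen_size_line:
                seen_size_line = True  # first data line holds the dimensions
                continue
            total += float(line.split()[2])
    return total


if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical usage: python compare_sj.py SJ.out.tab matrix.mtx
    sj_total = sum_sj_unique(sys.argv[1])
    mtx_total = sum_mtx(sys.argv[2])
    print(f"SJ.out.tab unique: {sj_total}")
    print(f"matrix.mtx total:  {mtx_total}")
    print(f"difference:        {sj_total - mtx_total}")
```

The matrix total is kept as a float because matrix values are not guaranteed to be integers in every configuration.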

Thanks for the response! Is there any way to prevent this behavior, i.e. to keep all reads even if they have the same start/end? And does the collapsing also happen when the reads have different barcodes and UMIs (for UMI data)?

Yes, --soloUMIdedup NoDedup should count all reads without collapsing.
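The difference between collapsed and uncollapsed counting can be illustrated with a toy example. This is only a sketch of the idea described in this thread, not STAR's actual implementation: the read tuples are invented, and treating the collapse as happening per cell on identical read start/end is an assumption made for illustration.

```python
from collections import Counter

# Invented reads: (cell_barcode, junction, read_start, read_end)
reads = [
    ("cellA", "chr1:100-200", 50, 250),
    ("cellA", "chr1:100-200", 50, 250),  # same start/end as the read above
    ("cellA", "chr1:100-200", 60, 260),
    ("cellB", "chr1:100-200", 50, 250),
]


def count_all(reads):
    """Count every read per (cell, junction), with no collapsing."""
    return Counter((cell, sj) for cell, sj, _, _ in reads)


def count_collapsed(reads):
    """Count reads per (cell, junction) after merging identical start/end."""
    return Counter((cell, sj) for cell, sj, _, _ in set(reads))


print(count_all(reads))
print(count_collapsed(reads))
```

In this toy input, cellA has three reads over the junction but only two distinct start/end pairs, so the collapsed count is lower than the full count.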

Ah, I see. I will try that, thanks! For one of the sequencing datasets, however, the SmartSeq alignment was already generated with no deduplication, and the discrepancy was still present. Could something else be causing the difference? Here is the command that was used:

/home/bin/STAR-2.7.8a/bin/Linux_x86_64/STAR \
--readFilesManifest manifest.tsv \
--twopassMode None \
--quantMode GeneCounts \
--soloBarcodeReadLength 0 \
--readFilesCommand zcat \
--outFileNamePrefix star_output_NoDedup/ \
--outSAMtype None \
--runThreadN 20 \
--genomeDir /mm10_spike_in/STARsolo_idx/ \
--soloFeatures Gene SJ \
--soloCellFilter None \
--soloType SmartSeq \
--soloUMIdedup NoDedup

Apologies for all the continued questions, but I just reran the alignment on the SplitSeq data that we have and got different counts once again, although the matrix does have more counts than before (1004455 in the SJ.out.tab file and 835060.0 in the matrix.mtx file). Here is the command that I used:

STAR \
--runThreadN 16 \
--genomeDir  /mnt/lareaulab/claireleblanc/Genomes/STAR_index \
--readFilesIn /mnt/lareaulab/claireleblanc/fastq/SRR6750041_1.fastq /mnt/lareaulab/claireleblanc/fastq/SRR6750041_2_fixed_BC1.fastq \
--soloType CB_UMI_Complex \
--soloCBposition 0_10_0_17 0_48_0_55 0_86_0_93 \
--soloUMIposition 0_0_0_9 \
--soloCBwhitelist ../whitelists/barcode3.txt ../whitelists/barcode2.txt ../whitelists/barcode1.txt \
--soloFeatures Gene SJ \
--soloCBmatchWLtype EditDist_2 \
--soloUMIfiltering MultiGeneUMI \
--soloCellFilter None \
--soloMultiMappers EM \
--outSAMtype BAM SortedByCoordinate \
--outSAMattributes CR CY UR UY GX GN CB UB sM sS sQ \
--outSJtype Standard \
--soloUMIdedup NoDedup

This was with STAR version 2.7.11a. Thank you for your help!

This is much closer. The remaining difference is probably largely due to spliced reads that do not have proper barcodes.
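For scale, the remaining gap can be worked out from the two totals reported above. This small sketch only quantifies the difference; it does not confirm what causes it.

```python
# Totals reported in this thread for the SplitSeq rerun with NoDedup
sj_unique_total = 1004455  # SJ.out.tab "unique" column
matrix_total = 835060      # matrix.mtx

gap = sj_unique_total - matrix_total
gap_fraction = gap / sj_unique_total

print(f"{gap} counts missing from the matrix "
      f"({gap_fraction:.1%} of the SJ.out.tab unique counts)")
```

So roughly one in six uniquely mapped spliced reads in SJ.out.tab is not represented in the matrix for this run.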