Missing splice junctions in matrix.mtx file

claireleblanc opened this issue 6 months ago

Hi Alex,

I am running a splicing analysis on SmartSeq single-cell RNA-seq data aligned with STARsolo. I noticed that the splice junction counts in the matrix.mtx file differ from those in the “unique” column of the SJ.out.tab file. Specifically, the SJ.out.tab “unique” column has far more counts than the matrix (1,004,455 in the SJ.out.tab file versus 121,335 in the matrix.mtx file).

I have checked multiple STARsolo runs and datasets, and the difference is present in all of them. Based on a previous GitHub issue (#1138), I thought that the matrix.mtx file should contain all the unique reads in the SJ.out.tab file. Is some filtering applied between the generation of the SJ.out.tab file and the splice junction count matrix?

Thank you so much!

Best,
Claire
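
For anyone wanting to reproduce the comparison described above, here is a minimal sketch of how the two totals could be computed. It assumes the standard SJ.out.tab layout (9 tab-separated columns, with column 7 holding uniquely mapping reads per junction) and a MatrixMarket coordinate file for matrix.mtx. The function names are made up for illustration, and this is not necessarily how the numbers in this post were obtained.

```python
def sum_sj_unique(sj_lines):
    """Sum column 7 (uniquely mapping reads per junction) over SJ.out.tab lines."""
    total = 0
    for line in sj_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # skip blank or malformed lines
        total += int(fields[6])
    return total


def sum_mtx(mtx_lines):
    """Sum the value column of a MatrixMarket coordinate file."""
    total = 0.0
    size_line_seen = False
    for line in mtx_lines:
        if line.startswith("%") or not line.strip():
            continue  # header and comment lines start with '%'
        if not size_line_seen:
            size_line_seen = True  # first data line is "rows cols nnz"
            continue
        total += float(line.split()[2])
    return total
```

Both functions take any iterable of lines, so they can be called on open file handles, e.g. `sum_sj_unique(open("SJ.out.tab"))`. The actual file locations depend on --outFileNamePrefix.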

Alexander Dobin · Answer 1 · Wed Feb 07 2024 22:22:24 GMT+0800 (China Standard Time)

Hi Claire,

The counts in the matrix.mtx are "collapsed", i.e. reads with the same start/end are counted as one. In the SJ.out.tab, all reads are counted. This may explain the difference you are observing.
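
A toy illustration of the difference between counting all reads and counting "collapsed" reads may help. This is only a sketch with made-up data; it assumes collapsing is done per cell and per junction on identical read start/end, which is one reading of the description above and not STAR's actual implementation.

```python
from collections import Counter

# Made-up reads: (cell, junction, read_start, read_end).
reads = [
    ("cellA", "chr1:100-200", 50, 250),
    ("cellA", "chr1:100-200", 50, 250),  # same start/end as the read above
    ("cellA", "chr1:100-200", 60, 260),
    ("cellB", "chr1:100-200", 50, 250),
]


def count_all(reads):
    """Count every read per (cell, junction), as SJ.out.tab-style totals would."""
    return Counter((cell, sj) for cell, sj, _, _ in reads)


def count_collapsed(reads):
    """Count reads per (cell, junction) after merging identical start/end pairs."""
    return Counter((cell, sj) for cell, sj, _, _ in set(reads))
```

With these four reads, `count_all` gives 3 for cellA and 1 for cellB, while `count_collapsed` gives 2 for cellA and 1 for cellB, so the collapsed total is smaller whenever duplicates are present.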

claireleblanc · Answer 2 · Fri Feb 09 2024 12:33:12 GMT+0800 (China Standard Time)

Thanks for the response! Is there any way to prevent this behavior, i.e. keep all reads even if they have the same start/end? And does this also happen when the reads have different barcodes and UMIs (for UMI data)?

Alexander Dobin · Answer 3 · Sat Feb 10 2024 02:32:07 GMT+0800 (China Standard Time)

Yes, --soloUMIdedup NoDedup should count all reads without collapsing.

claireleblanc · Answer 4 · Sat Feb 10 2024 05:47:09 GMT+0800 (China Standard Time)

Ah, I see, will try that, thanks! However, for one of the sequencing datasets, the SmartSeq alignment was already run with no deduplication, and the discrepancy was still present. Could something else be causing the difference? Here is the command that was used:

/home/bin/STAR-2.7.8a/bin/Linux_x86_64/STAR \
--readFilesManifest manifest.tsv \
--twopassMode None \
--quantMode GeneCounts \
--soloBarcodeReadLength 0 \
--readFilesCommand zcat \
--outFileNamePrefix star_output_NoDedup/ \
--outSAMtype None \
--runThreadN 20 \
--genomeDir /mm10_spike_in/STARsolo_idx/ \
--soloFeatures Gene SJ \
--soloCellFilter None \
--soloType SmartSeq \
--soloUMIdedup NoDedup

claireleblanc · Answer 5 · Sat Feb 10 2024 06:12:19 GMT+0800 (China Standard Time)

Apologies for all the continued questions, but I just reran the alignment on the SplitSeq data that we have and got different counts once again, although the matrix does have more counts than before (1,004,455 in the SJ.out.tab file and 835,060 in the matrix.mtx file). Here is the command that I used:

STAR \
--runThreadN 16 \
--genomeDir  /mnt/lareaulab/claireleblanc/Genomes/STAR_index \
--readFilesIn /mnt/lareaulab/claireleblanc/fastq/SRR6750041_1.fastq /mnt/lareaulab/claireleblanc/fastq/SRR6750041_2_fixed_BC1.fastq \
--soloType CB_UMI_Complex \
--soloCBposition 0_10_0_17 0_48_0_55 0_86_0_93 \
--soloUMIposition 0_0_0_9 \
--soloCBwhitelist ../whitelists/barcode3.txt ../whitelists/barcode2.txt ../whitelists/barcode1.txt \
--soloFeatures Gene SJ \
--soloCBmatchWLtype EditDist_2 \
--soloUMIfiltering MultiGeneUMI \
--soloCellFilter None \
--soloMultiMappers EM \
--outSAMtype BAM SortedByCoordinate \
--outSAMattributes CR CY UR UY GX GN CB UB sM sS sQ \
--outSJtype Standard \
--soloUMIdedup NoDedup

This was with STAR version 2.7.11a. Thank you for your help!

Alexander Dobin · Answer 6 · Fri Feb 23 2024 03:55:21 GMT+0800 (China Standard Time)

This is much closer. The remaining difference is probably largely due to spliced reads that do not have proper barcodes.
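
One rough way to check this could be to count junction crossings in the BAM output, split by whether the read carries a valid cell barcode. The sketch below works on SAM text lines (e.g. piped from `samtools view`). It rests on several assumptions: reads without a valid barcode carry `CB:Z:-`, uniquely mapped reads have MAPQ 255 (the default --outSAMmapqUnique), and each `N` operation in the CIGAR is one junction crossing. The function name is hypothetical, and the totals will not match SJ.out.tab exactly, since STAR applies its own junction filters.

```python
def count_spliced_by_barcode(sam_lines):
    """Count junction crossings from uniquely mapped reads, split by CB validity."""
    counts = {"with_barcode": 0, "without_barcode": 0}
    for line in sam_lines:
        if line.startswith("@"):
            continue  # SAM header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        if fields[4] != "255":
            continue  # assumption: MAPQ 255 marks unique mappers
        n_junctions = fields[5].count("N")  # 'N' in CIGAR = skipped region (intron)
        if n_junctions == 0:
            continue  # unspliced read
        # Optional tags look like "CB:Z:ACGT"; keep tag name and value.
        tags = {f[:2]: f[5:] for f in fields[11:]}
        cb = tags.get("CB", "-")  # assumption: "-" means no valid barcode
        key = "without_barcode" if cb == "-" else "with_barcode"
        counts[key] += n_junctions
    return counts
```

If the `without_barcode` total is close to the gap between the two files, that would support the explanation above.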