alexdobin / STAR

RNA-seq aligner

Mapping to the T2T CHM13 genome leaves most reads unmapped, and parameter tuning didn't help

gnilihzeux opened this issue · comments

Dear author,
A very high fraction of reads is reported as unmapped ('too short' and 'other') when mapping to the T2T CHM13 genome, whereas mapping the same data to hg19 works. BTW, 83% of the reads map to T2T with bowtie2.
Our data is RNA-seq with ribosome fractions.
Our group modified some repeat-related parameters: we raised --winAnchorMultimapNmax and --outFilterMultimapNmax and set --alignIntronMin 1. None of these changes helped.

Which parameters should be set?

Thanks a lot.

The logs are as follows:
T2T

                                 Started job on |       Jan 25 02:22:54
                             Started mapping on |       Jan 25 02:23:47
                                    Finished on |       Jan 25 02:43:58
       Mapping speed, Million of reads per hour |       108.31

                          Number of input reads |       36434753
                      Average input read length |       283
                                    UNIQUE READS:
                   Uniquely mapped reads number |       2537986
                        Uniquely mapped reads % |       6.97%
                          Average mapped length |       278.61
                       Number of splices: Total |       1204395
            Number of splices: Annotated (sjdb) |       1136229
                       Number of splices: GT/AG |       1164313
                       Number of splices: GC/AG |       9637
                       Number of splices: AT/AC |       1101
               Number of splices: Non-canonical |       29344
                      Mismatch rate per base, % |       0.48%
                         Deletion rate per base |       0.09%
                        Deletion average length |       1.85
                        Insertion rate per base |       0.04%
                       Insertion average length |       1.37
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci |       456923
             % of reads mapped to multiple loci |       1.25%
        Number of reads mapped to too many loci |       190199
             % of reads mapped to too many loci |       0.52%
                                  UNMAPPED READS:
       % of reads unmapped: too many mismatches |       0.06%
                 % of reads unmapped: too short |       48.17%
                     % of reads unmapped: other |       43.03%
                                  CHIMERIC READS:
                       Number of chimeric reads |       26935
                            % of chimeric reads |       0.07%

hg19

       Mapping speed, Million of reads per hour |       230.11

                          Number of input reads |       36434753
                      Average input read length |       283
                                    UNIQUE READS:
                   Uniquely mapped reads number |       8063144
                        Uniquely mapped reads % |       22.13%
                          Average mapped length |       285.69
                       Number of splices: Total |       1959408
            Number of splices: Annotated (sjdb) |       1133733
                       Number of splices: GT/AG |       1246660
                       Number of splices: GC/AG |       36075
                       Number of splices: AT/AC |       2267
               Number of splices: Non-canonical |       674406
                      Mismatch rate per base, % |       0.39%
                         Deletion rate per base |       0.12%
                        Deletion average length |       1.18
                        Insertion rate per base |       0.02%
                       Insertion average length |       1.12
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci |       27457561
             % of reads mapped to multiple loci |       75.36%
        Number of reads mapped to too many loci |       13369
             % of reads mapped to too many loci |       0.04%
                                  UNMAPPED READS:
       % of reads unmapped: too many mismatches |       0.07%
                 % of reads unmapped: too short |       2.32%
                     % of reads unmapped: other |       0.09%
                                  CHIMERIC READS:
                       Number of chimeric reads |       614048
                            % of chimeric reads |       1.69%
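
To compare the two runs side by side, the `name | value` lines of the logs above can be parsed with a few lines of Python. This is only a minimal sketch: the function names `parse_star_log` and `unmapped_percent` are made up for illustration, and the excerpt is copied from the T2T log in this thread rather than read from a file.

```python
def parse_star_log(text):
    """Return {field name: value string} for every 'name | value' line."""
    stats = {}
    for line in text.splitlines():
        if "|" not in line:
            continue  # section headers such as 'UNMAPPED READS:' have no '|'
        name, value = line.split("|", 1)
        stats[name.strip()] = value.strip()
    return stats


def unmapped_percent(stats):
    """Sum all '% of reads unmapped: ...' fields as a float."""
    return sum(
        float(value.rstrip("%"))
        for name, value in stats.items()
        if name.startswith("% of reads unmapped")
    )


# Excerpt copied from the T2T log posted above.
t2t_excerpt = """\
                                  UNMAPPED READS:
       % of reads unmapped: too many mismatches |       0.06%
                 % of reads unmapped: too short |       48.17%
                     % of reads unmapped: other |       43.03%
"""

t2t_stats = parse_star_log(t2t_excerpt)
print(round(unmapped_percent(t2t_stats), 2))  # 91.26
```

Applying the same functions to the hg19 log gives the matching figure for that run, which makes the gap between the two references easy to see at a glance.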

Hi @gnilihzeux

I would recommend exploring the reads that were mapped by bowtie2 and not mapped by STAR.
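
One way to collect those reads, sketched here under the assumption that both aligners wrote plain-text SAM output, is to take the read names that bowtie2 mapped and subtract the names that STAR mapped. The helper names and the tiny SAM records below are invented for illustration; they are not data from this issue.

```python
def mapped_read_names(sam_lines):
    """Collect names of reads with at least one mapped record (FLAG bit 0x4 unset)."""
    names = set()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # SAM header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip malformed or empty lines
        if not int(fields[1]) & 0x4:
            names.add(fields[0])
    return names


def bowtie2_only(bowtie2_sam, star_sam):
    """Read names mapped by bowtie2 but not mapped by STAR."""
    return mapped_read_names(bowtie2_sam) - mapped_read_names(star_sam)


# Invented records: only the first two SAM columns matter for this sketch.
bowtie2_lines = ["@HD\tVN:1.0", "readA\t0\tchr1\t100", "readB\t16\tchr2\t200"]
star_lines = ["@HD\tVN:1.4", "readA\t0\tchr1\t100", "readB\t4\t*\t0"]

print(sorted(bowtie2_only(bowtie2_lines, star_lines)))  # ['readB']
```

With real files, each argument can be an open file handle, since the functions only iterate over lines. Reads that STAR leaves out of its SAM output altogether are treated as unmapped by STAR, which is the case of interest here.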

@alexdobin Yes, I seem to have found what is going on with the unmapped reads: most of them are palindromic between Read1 and Read2 (Read2 is the reverse complement of Read1).

However, I have not found a solution to this problem yet.

Some of these sequences are listed below:

>@illumina:8501:1210 mate1
GAGGCATTTGGCTACCTTAAGAGAGTCATAGTTACTCCCGCCGTTTACCCGCGCTTCATTGAATTTCTTCACTTTG
>@illumina:8501:1210 mate2
CAAAGTGAAGAAATTCAATGAAGCGCGGGTAAACGGCGGGAGTAACTATGACTCTCTTAAGGTAGCCAAATGCCTC
>@illumina:36606:2009 mate1
AGCCGTCCCGGAGCCGGTCGCGGCGCACCGCCGCGGTGGAAATGCGCCCGGCGGCGGCCGGTCGCCGGTCGGGGGACGGTCCCCCGCCGACCCCACCCCCGGCCCCGCCCGCCCACCCCCGCACCCGCCGGAGCCCGCCCCCTCCGGGGA
>@illumina:36606:2009 mate2
GGCCGTGTCGGCGGCCCGGCGGATCTTTCCCGCCCCCCGTTCCTCCCGACCCCTCCACCCGCCCTCCCTTCCCCCGCCGCCCCTCCTCCTCCTCCCCGGAGGGGGCGGGCTCCGGCGGGTGCGGGGGTGGGCGGGCGGGGCCGGGGGTGG
>@illumina:36347:2009 mate1
ATCGGCGAGTGCTGCTGCCGGGGGGGCTGTAACACTCGGGGGGGGTTTCGGTCCCGCCGCCGCCGCCGCCGCCGCCACCGCCGCCGCGAGGGGGGGGGAATCA
>@illumina:36347:2009 mate2
TGATTCCCCCCCCCTCGCGGCGGCGGTGGCGGCGGCGGCGGCGGCGGCGGGACCGAAACCCCCCCCGAGTGTTACAGCCCCCCCGGCAGCAGCACTCGCCGAT
>@illumina:49804:3788 mate1
GTAGTTCACCATCTTTCGGGTCCTAACACGTGCGCTCGTGCTCCACCTCCCCGGCGCGGCGGGCGAGACGGGCCGGTGGTGCGCCCTCGGCGGACTGGAGAGGCATCGGGATCCCACCTCGGGAAGCG
>@illumina:49804:3788 mate2
CAAGGAGTCTAACACGTGCGCGAGTCGGGGGCTCGCACGAAAGCCGCCGTGGCGCAATGAAGGTGAAGGCCGGCGCGCTCGCCGGCCGAGGTGGGATCCCGAGGCCTCTCCAGTCCGCCGAGGGCGCACCACCGGCCCGTCTCGCCCGCC
>@illumina:38521:29680 mate1
GTTTCGGTCCCGCCGCCGCCGCCGCCGCCGCCACCGCCGCCGCCGCCGCCGCCCCGACCCGCGCGCCCTCCCGAGGGAGGACGCGGGGCCGGGGGGCGGAGACGGGGGAGGAGGAGGACGGACGGACGGACGGACGGGGCCCCCCGAGCC
>@illumina:38521:29680 mate2
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
>@illumina:46639:29680 mate1
TACTATTCAAAGTTCTTTTCAACTTTCCCTTACGGTACTTGTTGACTCCC
>@illumina:46639:29680 mate2
GGGAGTCAACAAGTACCGTAAGGGAAAGTTGAAAAGAACTTTGAATAGTA
>@illumina:48673:29712 mate1
CCCATTTAAAGTTTGAGAATAGGTTGAGATCGTTTTCGGCCCCAAGACCTCTAATCNTTCGCTTTACCGGATAAAACTGCGTGGCGGGGGTGCGTCGGGTCTGCGAGAGCGCCAGCTATCCTGAGGGAAACTTCGGAGGGAACCAGCTAC
>@illumina:48673:29712 mate2
GAAACTCTGGTGGAGGTCCGTAGCGGTCCTGACGTGCAAATCGGTCGTCCGACCTGGGTATAGGGGCNAAAGACTAATCGAACCATCTAGTAGCTGGTTCCCTCCGAAGTTTCCCTCAGGATAGCTNGCGCTCTCGCAGACCCGACGCAC
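
The pattern in some of the pairs above can be checked programmatically. The sketch below tests whether mate 2 is the exact reverse complement of mate 1, using the illumina:46639:29680 pair from the list. The function names are illustrative, and the check only detects full, exact overlap; partially overlapping pairs or pairs with sequencing errors would need an alignment-based comparison instead.

```python
# Translation table for A/C/G/T plus N (N stays N).
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (upper-cased)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def is_full_overlap(mate1, mate2):
    """True if mate2 is exactly the reverse complement of mate1."""
    return reverse_complement(mate1) == mate2.upper()


# Pair illumina:46639:29680, copied from the list above.
mate1 = "TACTATTCAAAGTTCTTTTCAACTTTCCCTTACGGTACTTGTTGACTCCC"
mate2 = "GGGAGTCAACAAGTACCGTAAGGGAAAGTTGAAAAGAACTTTGAATAGTA"

print(is_full_overlap(mate1, mate2))     # True
print(is_full_overlap(mate1, "G" * 50))  # False
```

Running this over all unmapped pairs would show what share of them consists of fully overlapping mates, as opposed to other cases such as the poly-G mate 2 of illumina:38521:29680.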