alexdobin / STAR

RNA-seq aligner

Selection of GTF file from gencode

SueFletcher opened this issue · comments

Hello, I'm a first-year master's student, and I'm attempting to use STAR to index the mouse genome. I'm using the following Python script to run the command:

import os
import subprocess

class STAR:
    """Thin wrapper around the STAR executable for genome index generation."""

    def __init__(self, genome_dir, genome_fasta_files, sjdb_gtf_file, runThreadN):
        self.exec_path = "/opt/conda/envs/STAR/bin/STAR"
        self.genome_dir = genome_dir
        self.genome_fasta_files = genome_fasta_files
        self.sjdb_gtf_file = sjdb_gtf_file
        self.runThreadN = runThreadN

    def build_genome_index(self):
        # Create the genome_dir directory if it doesn't exist
        os.makedirs(self.genome_dir, exist_ok=True)

        cmd = [
            self.exec_path,
            "--runMode", "genomeGenerate",
            "--runThreadN", str(self.runThreadN),
            "--genomeChrBinNbits", "12",
            "--limitGenomeGenerateRAM", "60000000000",
            "--genomeDir", self.genome_dir,
            "--genomeFastaFiles", self.genome_fasta_files,
            "--sjdbGTFfile", self.sjdb_gtf_file,
            "--genomeSAsparseD", "3"
        ]
        # Raises CalledProcessError if STAR exits with a non-zero status
        subprocess.check_call(cmd)

genome_dir = "/desktop/output/mouse_genome_index/"
genome_fasta_files = "/desktop/mouse_input_data/mouse_gencode_transcripts.fa"
sjdb_gtf_file = "/desktop/mouse_input_data/mouse_gencode_annotation.gtf"
runThreadN = 8

star = STAR(genome_dir, genome_fasta_files, sjdb_gtf_file, runThreadN)
star.build_genome_index()

I downloaded the mouse genome FASTA and GTF files from the GENCODE website: https://www.gencodegenes.org/mouse/
I used the following GTF file:

image

and this fasta file:
image

However, I encountered an error that I'm having trouble understanding:
/opt/conda/envs/STAR/bin/STAR-avx2 --runMode genomeGenerate --runThreadN 8 --genomeChrBinNbits 12 --limitGenomeGenerateRAM 60000000000 --genomeDir /desktop/output/mouse_genome_index/ --genomeFastaFiles /desktop/mouse_input_data/mouse_gencode_transcripts.fa --sjdbGTFfile /desktop/mouse_input_data/mouse_gencode_annotation.gtf --genomeSAsparseD 3
STAR version: 2.7.11b compiled: 2024-01-29T15:15:38+0000 :/opt/conda/conda-bld/star_1706541070242/work/source
Feb 29 15:51:23 ..... started STAR run
Feb 29 15:51:23 ... starting to generate Genome files
Feb 29 15:51:29 ..... processing annotations GTF

Fatal INPUT FILE error, no valid exon lines in the GTF file: /desktop/mouse_input_data/mouse_gencode_annotation.gtf
Solution: check the formatting of the GTF file. One likely cause is the difference in chromosome naming between GTF and FASTA file.

Feb 29 15:51:32 ...... FATAL ERROR, exiting
Traceback (most recent call last):
  File "mouse_star_index.py", line 39, in <module>
    star.build_genome_index()
  File "mouse_star_index.py", line 31, in build_genome_index
    subprocess.check_call(cmd)
  File "/usr/lib/python3.8/subprocess.py", line 364, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/opt/conda/envs/STAR/bin/STAR', '--runMode', 'genomeGenerate', '--runThreadN', '8', '--genomeChrBinNbits', '12', '--limitGenomeGenerateRAM', '60000000000', '--genomeDir', '/desktop/mouse_genome_index/', '--genomeFastaFiles', '/desktop/mouse_input_data/mouse_gencode_transcripts.fa', '--sjdbGTFfile', '/desktop/mouse_input_data/mouse_gencode_annotation.gtf', '--genomeSAsparseD', '3']' returned non-zero exit status 104.

The error seems to be related to the GTF file, but I don't know which GTF file from GENCODE I should download and pass to --sjdbGTFfile in this case.

You need to use the PRI (primary assembly) FASTA file, which contains the genome sequences, not the transcriptome:
https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M34/GRCm39.primary_assembly.genome.fa.gz
You can also use the PRI GTF file, which has more comprehensive annotations than the basic one.

Yes, correct!
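
For anyone hitting the same "no valid exon lines" error, a quick way to confirm a naming mismatch before running genomeGenerate is to compare the sequence names in the FASTA headers with the first column of the GTF exon lines. The sketch below is not part of STAR; the function names are hypothetical, and it assumes the sequence name is the first whitespace-delimited token after ">" in each FASTA header. A transcript FASTA carries transcript IDs as sequence names, so it would be expected to share nothing with a GTF whose first column holds chromosome names.

```python
import gzip


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    path = str(path)
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path)


def fasta_sequence_names(path):
    """Collect sequence names from FASTA headers.

    Assumption: the name is the first whitespace-delimited token after '>'.
    """
    names = set()
    with open_text(path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:
                    names.add(fields[0])
    return names


def gtf_exon_sequence_names(path):
    """Collect column-1 names from the 'exon' feature lines of a GTF file."""
    names = set()
    with open_text(path) as handle:
        for line in handle:
            if line.startswith("#"):
                continue  # skip header/comment lines
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 3 and fields[2] == "exon":
                names.add(fields[0])
    return names


def shared_sequence_names(fasta_path, gtf_path):
    """Return the names present in both the FASTA and the GTF exon lines."""
    return fasta_sequence_names(fasta_path) & gtf_exon_sequence_names(gtf_path)
```

Calling shared_sequence_names() with the FASTA and GTF paths passed to --genomeFastaFiles and --sjdbGTFfile should return a non-empty set when the two files describe the same sequences. An empty set points to the mismatch that the STAR error message mentions.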