gpertea / gffread

GFF/GTF utility providing format conversions, region filtering, FASTA sequence extraction and more

Error parsing strand (?) from GFF line

hermidalc opened this issue · comments

This error stops the entire program, so I can't use gffread to do anything with the file. Here's an example:

https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/006/247/105/GCA_006247105.1_UU_GM_1.1/GCA_006247105.1_UU_GM_1.1_genomic.gff.gz

$ gffread -E GCA_006247105.1_UU_GM_1.1_genomic.gff
Command line was:
gffread -E GCA_006247105.1_UU_GM_1.1_genomic.gff
Error parsing strand (?) from GFF line:
CM016926.1	Genbank	mRNA	1926137	1999447	.	?	.	ID=rna-gnl|WGS:VDLU|GMRT_22684;Parent=gene-GMRT_22684;exception=trans-splicing;gbkey=mRNA;locus_tag=GMRT_22684;orig_protein_id=gnl|WGS:VDLU|GMRT_22684;orig_transcript_id=gnl|WGS:VDLU|GMRT_22684;product=putative RNA-dependent helicase p68

I'm parsing a lot of GFF/GTF files at the same time, so having to pre-filter possible offending lines rather defeats the purpose. I think gffread should be able to ignore these lines without halting?

Hi,
It seems that gffread does not recognize the symbol "?" in the strand column of the .gff file.
Column 7 of a .gff file holds the strand of the feature, and "?" means the strand is unknown.
As a workaround, you can change each "?" to "." with the following Python script:

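input_gff = ""   # path to the original GFF file
output_gff = ""  # path for the corrected copy

with open(input_gff, "r") as input_file, open(output_gff, "w") as output_file:
    for line in input_file:
        # Drop only the line ending so empty trailing columns are preserved.
        line = line.rstrip("\n")
        if line.startswith("#"):
            # Comment and directive lines are copied unchanged.
            output_file.write(line + "\n")
        else:
            columns = line.split("\t")
            # The strand is column 7 (index 6), so at least 7 columns are required.
            if len(columns) >= 7 and columns[6] == "?":
                columns[6] = "."
            output_file.write("\t".join(columns) + "\n")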

Just fill in the paths to your files and run the script; that may solve your problem.
Best,
Xylon
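
When many files are involved, some of them gzip-compressed like the one linked in the issue, the same replacement could be wrapped in a small reusable helper. This is only a sketch of the workaround above, not anything provided by gffread: the names `normalize_strand`, `open_text` and `fix_gff` are made up for illustration, and it assumes that treating an unknown strand ("?") as unstranded (".") is acceptable for your analysis.

```python
import gzip


def normalize_strand(line):
    """Replace an unknown strand ('?') in column 7 of a GFF line with '.'."""
    if line.startswith("#"):
        # Comment and directive lines are returned unchanged.
        return line
    body = line.rstrip("\r\n")
    columns = body.split("\t")
    # The strand is column 7 (index 6); shorter lines are left alone.
    if len(columns) >= 7 and columns[6] == "?":
        columns[6] = "."
    # Re-attach whatever line ending the input had.
    return "\t".join(columns) + line[len(body):]


def open_text(path, mode):
    """Open a plain or gzip-compressed text file, chosen by the .gz extension."""
    opener = gzip.open if str(path).endswith(".gz") else open
    return opener(path, mode + "t")


def fix_gff(in_path, out_path):
    """Copy in_path to out_path with '?' strands rewritten.

    Returns the number of lines that were changed.
    """
    changed = 0
    with open_text(in_path, "r") as src, open_text(out_path, "w") as dst:
        for line in src:
            fixed = normalize_strand(line)
            if fixed != line:
                changed += 1
            dst.write(fixed)
    return changed


# Example call (the output file name is hypothetical):
# fix_gff("GCA_006247105.1_UU_GM_1.1_genomic.gff.gz", "fixed.gff")
```

The returned count makes it easy to log which files actually contained "?" strands before passing the corrected copies to gffread.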