Error parsing strand (?) from GFF line
hermidalc opened this issue · comments
This error stops gffread entirely, so I can't use it to process the file at all. Here's an example:
$ gffread -E GCA_006247105.1_UU_GM_1.1_genomic.gff
Command line was:
gffread -E GCA_006247105.1_UU_GM_1.1_genomic.gff
Error parsing strand (?) from GFF line:
CM016926.1 Genbank mRNA 1926137 1999447 . ? . ID=rna-gnl|WGS:VDLU|GMRT_22684;Parent=gene-GMRT_22684;exception=trans-splicing;gbkey=mRNA;locus_tag=GMRT_22684;orig_protein_id=gnl|WGS:VDLU|GMRT_22684;orig_transcript_id=gnl|WGS:VDLU|GMRT_22684;product=putative RNA-dependent helicase p68
I'm parsing a lot of GFF/GTF files at the same time, so having to pre-filter possible offending lines sort of defeats the purpose. I think gffread should be able to ignore these lines without halting?
Also, the same issue occurs with the ? strand in this NCBI genome: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/002/435/GCA_000002435.2_UU_WB_2.1/GCA_000002435.2_UU_WB_2.1_genomic.gff.gz
Hi,
It seems that gffread doesn't recognize the "?" symbol in the .gff file.
Column 7 of the .gff file holds the strand of the feature, and "?" stands for unknown.
To work around this problem, you can change the "?" to "." with the following Python script:
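# Paths to the input GFF and the corrected output file (fill these in).
input_gff = ""
output_gff = ""

with open(input_gff, "r") as input_file, open(output_gff, "w") as output_file:
    for line in input_file:
        # Strip only the line ending so empty leading/trailing columns survive.
        line = line.rstrip("\r\n")
        if line.startswith("#"):
            # Pass comments and directives through unchanged.
            output_file.write(line + "\n")
        else:
            columns = line.split("\t")
            # The strand is column 7 (index 6), so at least 7 columns are needed.
            if len(columns) >= 7 and columns[6] == "?":
                columns[6] = "."
            output_file.write("\t".join(columns) + "\n")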
Just fill in the paths to your files and run the script; this may solve your problem.
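Since the original question involves many GFF/GTF files, the same replacement could also be written as a reusable function rather than a script with hard-coded paths. The following is only a minimal sketch under that assumption: the names `normalize_strand` and `normalize_lines` and the `ID=...-example` attributes in the sample are hypothetical, and it treats any line not starting with "#" that has at least 7 tab-separated columns as a feature line. Note that it rewrites "?" (unknown strand) as "." so gffread accepts the line, which discards the distinction between the two values.

```python
from typing import Iterable, Iterator


def normalize_strand(line: str, replacement: str = ".") -> str:
    """Return the line with a "?" in column 7 replaced; other lines are unchanged."""
    if line.startswith("#"):
        return line  # comment or directive
    body = line.rstrip("\r\n")
    ending = line[len(body):]  # keep the original line ending
    columns = body.split("\t")
    if len(columns) >= 7 and columns[6] == "?":
        columns[6] = replacement
        return "\t".join(columns) + ending
    return line


def normalize_lines(lines: Iterable[str]) -> Iterator[str]:
    """Lazily apply normalize_strand to any iterable of lines (file, stdin, list)."""
    for line in lines:
        yield normalize_strand(line)


# Small in-memory demonstration; the attribute values are made up.
sample = [
    "##gff-version 3\n",
    "CM016926.1\tGenbank\tmRNA\t1926137\t1999447\t.\t?\t.\tID=rna-example\n",
    "CM016926.1\tGenbank\tgene\t100\t200\t.\t+\t.\tID=gene-example\n",
]
for out in normalize_lines(sample):
    print(out, end="")
```

To process a real file, pass an open file handle to `normalize_lines` and write the results to a new file, then run gffread on that file.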
Best,
Xylon