chapmanb / bcbb

Incubator for useful bioinformatics code, primarily in Python and R

Home Page: http://bcbio.wordpress.com

parse simple returns different start location

NtBaru opened this issue

I need to parse a GFF file and I found a small problem.

My GFF file contains, for example, this line:
BAC1_SV_50C14_semf_p2_contig1 ltr LTR_retrotransposon 38 9461 0 . 0 ID=ele00002;

So the location should be [38, 9461], but parse_simple(...) returns location [37, 9461].
It happens in all the files I tried.

My code looks like this:
from BCBio import GFF

in_file = "your_file.gff"

# The with-block closes the file handle automatically
with open(in_file) as in_handle:
    for record in GFF.parse_simple(in_handle):
        print(record['type'])
        print(record['id'])
        print(record['location'])

And it prints this:
LTR_retrotransposon
ele00002
[37, 9461]

Can someone please explain what is wrong? I don't know whether this is on purpose or a bug.

Thanks for using the GFF module and sorry for any confusion. The library follows the Biopython convention of using 0-based coordinates internally, which matches the slicing coordinates that Python itself uses. GFF uses a 1-based coordinate system, so the difference you're seeing comes from that conversion, which happens on import from and export to GFF.
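To make the conversion concrete, here is a minimal sketch in plain Python. It assumes, based on the output reported above, that only the start is shifted and the end is left as-is, which is what you get when going from 1-based inclusive to 0-based half-open coordinates. The helper names gff_to_python and python_to_gff are made up for illustration and are not part of BCBio.GFF.

```python
# Hypothetical helpers for illustration only; they are not part of BCBio.GFF.

def gff_to_python(start, end):
    """Convert 1-based inclusive GFF coordinates to 0-based half-open ones."""
    return start - 1, end


def python_to_gff(start, end):
    """Convert 0-based half-open coordinates back to 1-based inclusive GFF."""
    return start + 1, end


# The feature from the question: GFF columns 4 and 5 are 38 and 9461.
gff_start, gff_end = 38, 9461
py_start, py_end = gff_to_python(gff_start, gff_end)
print([py_start, py_end])

# The feature length is the same in both systems.
assert py_end - py_start == gff_end - gff_start + 1

# With 0-based half-open coordinates, slicing works directly.
seq = "ACGTACGTAC"
s, e = gff_to_python(3, 6)  # GFF feature covering bases 3 through 6
feature = seq[s:e]
print(feature)
```

If you need the original GFF numbers back for reporting, adding 1 to the start (as python_to_gff does) should be enough under the same assumption.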

This is a nice general discussion about the difference between zero and 1-based coordinate systems:

http://www.biostars.org/p/6373/

Hope this helps.

I didn't know about this difference before; it's my first project in the field of bioinformatics.
Thanks a lot for the explanation, it was really helpful :)