chapmanb / bcbb

Incubator for useful bioinformatics code, primarily in Python and R

Home Page: http://bcbio.wordpress.com

GFF.parse with target_lines returns contigs in interleaved order if the contigs are sorted in a certain order

genestack-solomatin opened this issue · comments

Steps to reproduce:
Create a file whose contigs are sorted in a certain order (III -> I):

III protein_coding  CDS 100063  100183  .   +   1    gene_id "CADAFUAG00000321"; transcript_id "CADAFUAT00000321"; exon_number "4"; gene_name "AFUA_3G00460"; transcript_name "AFUA_3G00460"; protein_id "CADAFUAP00000321";
III protein_coding  CDS 1004211 1004214 .   +   0    gene_id "CADAFUAG00000267"; transcript_id "CADAFUAT00000267"; exon_number "1"; gene_name "AFUA_3G03700"; transcript_name "AFUA_3G03700"; protein_id "CADAFUAP00000267";
III protein_coding  CDS 1004428 1004850 .   +   2    gene_id "CADAFUAG00000267"; transcript_id "CADAFUAT00000267"; exon_number "2"; gene_name "AFUA_3G03700"; transcript_name "AFUA_3G03700"; protein_id "CADAFUAP00000267";
I   tRNA    exon    883674  883709  .   +   .    gene_id "CADAFUAG00009730"; transcript_id "CADAFUAT00009730"; exon_number "2"; gene_name "AFUA_5G03266"; transcript_name "AFUA_5G03266"; seqedit "false";
I   tRNA_pseudogene exon    3717600 3717932 .   -   .    gene_id "CADAFUAG00005891"; transcript_id "CADAFUAT00005891"; exon_number "1"; gene_name "AFUA_5G14275"; transcript_name "AFUA_5G14275"; seqedit "false";
I   tRNA_pseudogene exon    3916324 3920790 .   +   .    gene_id "CADAFUAG00006577"; transcript_id "CADAFUAT00006577"; exon_number "1"; gene_name "AFUA_5G15102"; transcript_name "AFUA_5G15102"; seqedit "false";

Run this Python code:

from BCBio.GFF import parse

# target_lines must not be 1 and must be less than the number of features in a block
for record in parse('test_file.gff', target_lines=2):
    # print() with a single formatted string works on both Python 2 and 3
    print('record id: %10s     number of features %5s' % (record.id, len(record.features)))

Observed result:

record id:        III     number of features     2
record id:          I     number of features     1
record id:        III     number of features     1
record id:          I     number of features     2

Expected result:
Contig sort order should not affect output

record id:        III     number of features     1
record id:        III     number of features     1
record id:          I     number of features     2
record id:          I     number of features     1

Note:
Sorting the file in the opposite order (I -> III) gives:

record id:          I     number of features     2
record id:          I     number of features     1
record id:        III     number of features     1
record id:        III     number of features     1

Thanks for the feedback and example case. Unfortunately, it's not easy to maintain order when you have a small target_lines parameter, since the parser ends up being forced to break inside nested features. The easy fix is either to ignore nesting (target_lines=1) or to not set target_lines at all and let it process the whole file. target_lines is meant as a workaround for memory issues.
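If target_lines has to stay small for memory reasons, one option is to make the downstream code independent of chunk order by aggregating per record id. The sketch below is only an illustration: Chunk is a hypothetical stand-in for the records yielded by parse() (only .id and .features are used), and count_features_by_id is not part of BCBio.GFF. It sums top-level feature counts and does not re-nest features that were split across chunks.

```python
from collections import OrderedDict, namedtuple

# Hypothetical stand-in for the records yielded by parse(); only the
# .id and .features attributes are used here.
Chunk = namedtuple("Chunk", ["id", "features"])


def count_features_by_id(records):
    """Sum top-level feature counts per record id, in first-seen order."""
    totals = OrderedDict()
    for record in records:
        totals[record.id] = totals.get(record.id, 0) + len(record.features)
    return totals


# Chunks in the interleaved order reported in the issue; the feature
# names are placeholders, only their number matters.
observed = [
    Chunk("III", ["f1", "f2"]),
    Chunk("I", ["f3"]),
    Chunk("III", ["f4"]),
    Chunk("I", ["f5", "f6"]),
]

totals = count_features_by_id(observed)
for record_id, n_features in totals.items():
    print("record id: %10s     number of features %5s" % (record_id, n_features))
```

With real input, the list of Chunk objects would be replaced by the iterator returned by parse('test_file.gff', target_lines=2).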

Since GFF does not have a defined order and nested items can be anywhere in the file, this library was not designed with order guarantees under all conditions. If you need that, it might be worth looking at Ryan Dale's gffutils:

https://github.com/daler/gffutils/tree/refactor

This stores the GFF in a database, allowing you to grab out sections based on your ordering criteria. Hope this helps.
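To illustrate the database idea in general terms, here is a minimal sketch using only the standard library's sqlite3 module. It is not gffutils' API or schema; the table layout, column names, and load_features helper are assumptions made for this example, and the attributes column of the sample lines is abbreviated to gene_id.

```python
import sqlite3

# A few lines from the example file, in the original III -> I order,
# with the attributes column shortened to gene_id.
GFF_LINES = [
    'III\tprotein_coding\tCDS\t100063\t100183\t.\t+\t1\tgene_id "CADAFUAG00000321";',
    'III\tprotein_coding\tCDS\t1004211\t1004214\t.\t+\t0\tgene_id "CADAFUAG00000267";',
    'I\ttRNA\texon\t883674\t883709\t.\t+\t.\tgene_id "CADAFUAG00009730";',
    'I\ttRNA_pseudogene\texon\t3717600\t3717932\t.\t-\t.\tgene_id "CADAFUAG00005891";',
]


def load_features(lines):
    """Load tab-separated GFF lines into an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE features (seqid TEXT, source TEXT, type TEXT, "
        "start_pos INTEGER, end_pos INTEGER, strand TEXT, attributes TEXT)"
    )
    rows = []
    for line in lines:
        seqid, source, ftype, start, end, _score, strand, _phase, attrs = line.split("\t")
        rows.append((seqid, source, ftype, int(start), int(end), strand, attrs))
    conn.executemany("INSERT INTO features VALUES (?, ?, ?, ?, ?, ?, ?)", rows)
    return conn


conn = load_features(GFF_LINES)

# The query decides the order, regardless of how the file was sorted.
ordered = conn.execute(
    "SELECT seqid, start_pos, end_pos FROM features ORDER BY seqid, start_pos"
).fetchall()
for row in ordered:
    print(row)
```

Because the ordering is expressed in the query, the result is the same whether the file lists contig III or contig I first.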