GFF.parser target_lines returns contigs in interleaved order when the file's contigs are sorted in a certain order
genestack-solomatin opened this issue · comments
Steps to reproduce:
Create a file whose contigs are sorted in this order (III -> I):
III protein_coding CDS 100063 100183 . + 1 gene_id "CADAFUAG00000321"; transcript_id "CADAFUAT00000321"; exon_number "4"; gene_name "AFUA_3G00460"; transcript_name "AFUA_3G00460"; protein_id "CADAFUAP00000321";
III protein_coding CDS 1004211 1004214 . + 0 gene_id "CADAFUAG00000267"; transcript_id "CADAFUAT00000267"; exon_number "1"; gene_name "AFUA_3G03700"; transcript_name "AFUA_3G03700"; protein_id "CADAFUAP00000267";
III protein_coding CDS 1004428 1004850 . + 2 gene_id "CADAFUAG00000267"; transcript_id "CADAFUAT00000267"; exon_number "2"; gene_name "AFUA_3G03700"; transcript_name "AFUA_3G03700"; protein_id "CADAFUAP00000267";
I tRNA exon 883674 883709 . + . gene_id "CADAFUAG00009730"; transcript_id "CADAFUAT00009730"; exon_number "2"; gene_name "AFUA_5G03266"; transcript_name "AFUA_5G03266"; seqedit "false";
I tRNA_pseudogene exon 3717600 3717932 . - . gene_id "CADAFUAG00005891"; transcript_id "CADAFUAT00005891"; exon_number "1"; gene_name "AFUA_5G14275"; transcript_name "AFUA_5G14275"; seqedit "false";
I tRNA_pseudogene exon 3916324 3920790 . + . gene_id "CADAFUAG00006577"; transcript_id "CADAFUAT00006577"; exon_number "1"; gene_name "AFUA_5G15102"; transcript_name "AFUA_5G15102"; seqedit "false";
Run this Python code:
from BCBio.GFF import parse
# target_lines must not be 1 and must be less than the number of features in a block
for record in parse('test_file.gff', target_lines=2):
    print('record id: %10s number of features %5s' % (record.id, len(record.features)))
Observed result:
record id: III number of features 2
record id: I number of features 1
record id: III number of features 1
record id: I number of features 2
Expected result:
Contig sort order should not affect output
record id: III number of features 1
record id: III number of features 1
record id: I number of features 2
record id: I number of features 1
Note:
Sorting the file in the order (I -> III) gives:
record id: I number of features 2
record id: I number of features 1
record id: III number of features 1
record id: III number of features 1
Thanks for the feedback and example case. Unfortunately it's not easy to maintain order when you have a small target_lines parameter, since it will end up being forced to break inside nested features. The easy fix is either to ignore nesting (target_lines=1) or not set target_lines at all and let it process across the file. target_lines is meant as a workaround to handle memory issues.
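If chunked parsing is still needed, one option is to regroup the chunks by record id after the fact. The sketch below is only an illustration of that idea, not something the library provides: `merge_chunks` is a hypothetical helper, and it works on plain (record id, features) pairs so it can be tried without the parser. The chunk layout mirrors the observed output in the report; the feature names are placeholders.

```python
from collections import OrderedDict


def merge_chunks(chunks):
    """Group (record_id, features) pairs by record id.

    Hypothetical helper: contigs keep the order in which they first appear,
    and features from later chunks of the same contig are appended.
    """
    merged = OrderedDict()
    for record_id, features in chunks:
        merged.setdefault(record_id, []).extend(features)
    return merged


# Chunk layout taken from the observed output (III, I, III, I);
# the feature names are placeholders, not real parser objects.
chunks = [
    ("III", ["f1", "f2"]),
    ("I", ["f3"]),
    ("III", ["f4"]),
    ("I", ["f5", "f6"]),
]

for record_id, features in merge_chunks(chunks).items():
    print("record id: %10s number of features %5s" % (record_id, len(features)))
```

With real parser output the pairs could presumably be built as `(record.id, record.features)` for each record yielded by `parse`. Note that this keeps every feature in memory, so it only helps when the per-chunk processing, rather than the final result, is the memory bottleneck.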
Since GFF does not have a defined order and nested items can be anywhere in the file, this library was not designed with order guarantees under all conditions. If you need that, it might be worth looking at Ryan Dale's gffutils:
https://github.com/daler/gffutils/tree/refactor
This stores the GFF in a database, allowing you to grab out sections based on your ordering criteria. Hope this helps.
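To illustrate the general idea of a database-backed approach, here is a minimal sketch using only the standard-library `sqlite3` module. It is not the gffutils API; the table layout, the column names and the `features_for` helper are all assumptions made up for this example. The four rows are abbreviated from the test file in the report.

```python
import sqlite3

# Four GFF-like rows (seqid, feature type, start, end), abbreviated from
# the test file in the report. The schema below is illustrative only.
rows = [
    ("III", "CDS", 100063, 100183),
    ("III", "CDS", 1004211, 1004214),
    ("I", "exon", 883674, 883709),
    ("I", "exon", 3717600, 3717932),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE features ("
    "seqid TEXT, featuretype TEXT, start_pos INTEGER, end_pos INTEGER)"
)
conn.executemany("INSERT INTO features VALUES (?, ?, ?, ?)", rows)


def features_for(conn, seqid):
    """Return all features on one contig, ordered by start position."""
    cur = conn.execute(
        "SELECT seqid, featuretype, start_pos, end_pos FROM features "
        "WHERE seqid = ? ORDER BY start_pos",
        (seqid,),
    )
    return cur.fetchall()


# The caller picks the contig order; the file's own order no longer matters.
for seqid in ("I", "III"):
    for row in features_for(conn, seqid):
        print(row)
```

Once the features are in a table, the order in which contigs are visited is decided by the query rather than by the layout of the input file.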