daler / gffutils

GFF and GTF file manipulation and interconversion

Home Page:http://daler.github.io/gffutils

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Updating db with intergenic space features

novigit opened this issue · comments

I'm trying to infer intergenic space features with the interfeatures functionality, and then update the db with these features.

By default db.interfeatures() seems to set the ID of the intergenic feature using the ID of the neighboring genes. So for example, ID=gene0001,gene0002, The db.update() doesnt accept this. It only wants a single value for ID. So I tried to merge both IDs into a single values, i.e. 'gene0001-gene0002', but this leaves me with another cryptic error. See below.

My code:


import gffutils

db = gffutils.create_db('test.simple.gff3', ':memory:', merge_strategy='error')
genes = db.features_of_type('gene')
igss = list( db.interfeatures(genes,new_featuretype='intergenic_space') )

def transform(f):
    f['ID'] = [ '-'.join(f.attributes['ID']) ]
    return f

db.update(igss, transform=transform, merge_strategy='error')

Running this however leaves me with


Traceback (most recent call last):
  File "<stdin>", line 37, in <module>
  File "/Users/joran/miniconda3/envs/gffutils/lib/python3.6/site-packages/gffutils/interface.py", line 1060, in update
    db._populate_from_lines(data)
  File "/Users/joran/miniconda3/envs/gffutils/lib/python3.6/site-packages/gffutils/create.py", line 606, in _populate_from_lines
    for i, f in enumerate(lines):
  File "/Users/joran/miniconda3/envs/gffutils/lib/python3.6/site-packages/gffutils/iterators.py", line 101, in __iter__
    if i:
ValueError: __len__() should return >= 0

Can you post the file (or part of the file) that reproduces the error?

Here is the first couple of lines of the file

tig00000012  EVM  gene  248937   250896   .  +  .  ID=ctg012.gene0001;Name=gene0001
tig00000012  EVM  mRNA  248937   250896   .  +  .  ID=ctg012.mRNA0001;Parent=ctg012.gene0001;Name=mRNA0001
tig00000012  EVM  exon  248937   250153   .  +  .  ID=ctg012.mRNA0001.exon01;Parent=ctg012.mRNA0001
tig00000012  EVM  CDS   248937   250153   .  +  0  ID=ctg012.mRNA0001.CDS01;Parent=ctg012.mRNA0001
tig00000012  EVM  exon  250212   250310   .  +  .  ID=ctg012.mRNA0001.exon02;Parent=ctg012.mRNA0001
tig00000012  EVM  CDS   250212   250310   .  +  1  ID=ctg012.mRNA0001.CDS02;Parent=ctg012.mRNA0001
tig00000012  EVM  exon  250452   250896   .  +  .  ID=ctg012.mRNA0001.exon03;Parent=ctg012.mRNA0001
tig00000012  EVM  CDS   250452   250896   .  +  1  ID=ctg012.mRNA0001.CDS03;Parent=ctg012.mRNA0001

tig00000012  EVM  gene  250928   253264   .  +  .  ID=ctg012.gene0002;Name=gene0002
tig00000012  EVM  mRNA  250928   253264   .  +  .  ID=ctg012.mRNA0002;Parent=ctg012.gene0002;Name=mRNA0002
tig00000012  EVM  exon  250928   253264   .  +  .  ID=ctg012.mRNA0002.exon01;Parent=ctg012.mRNA0002
tig00000012  EVM  CDS   250928   253264   .  +  0  ID=ctg012.mRNA0002.CDS01;Parent=ctg012.mRNA0002

tig00000012  EVM  gene  253391   254953   .  +  .  ID=ctg012.gene0003;Name=gene0003
tig00000012  EVM  mRNA  253391   254953   .  +  .  ID=ctg012.mRNA0003;Parent=ctg012.gene0003;Name=mRNA0003
tig00000012  EVM  exon  253391   254953   .  +  .  ID=ctg012.mRNA0003.exon01;Parent=ctg012.mRNA0003
tig00000012  EVM  CDS   253391   254953   .  +  0  ID=ctg012.mRNA0003.CDS01;Parent=ctg012.mRNA0003

tig00000012  EVM  gene  255080   255926   .  +  .  ID=ctg012.gene0004;Name=gene0004
tig00000012  EVM  mRNA  255080   255926   .  +  .  ID=ctg012.mRNA0004;Parent=ctg012.gene0004;Name=mRNA0004
tig00000012  EVM  exon  255080   255843   .  +  .  ID=ctg012.mRNA0004.exon01;Parent=ctg012.mRNA0004
tig00000012  EVM  CDS   255080   255843   .  +  0  ID=ctg012.mRNA0004.CDS01;Parent=ctg012.mRNA0004
tig00000012  EVM  exon  255887   255926   .  +  .  ID=ctg012.mRNA0004.exon02;Parent=ctg012.mRNA0004
tig00000012  EVM  CDS   255887   255926   .  +  1  ID=ctg012.mRNA0004.CDS02;Parent=ctg012.mRNA0004

tig00000012  EVM  gene  256727   257758   .  -  .  ID=ctg012.gene0005;Name=gene0005
tig00000012  EVM  mRNA  256727   257758   .  -  .  ID=ctg012.mRNA0005;Parent=ctg012.gene0005;Name=mRNA0005
tig00000012  EVM  exon  256727   257758   .  -  .  ID=ctg012.mRNA0005.exon01;Parent=ctg012.mRNA0005
tig00000012  EVM  CDS   256727   257758   .  -  0  ID=ctg012.mRNA0005.CDS01;Parent=ctg012.mRNA0005

tig00000012  EVM  gene  257894   259663   .  +  .  ID=ctg012.gene0006;Name=gene0006
tig00000012  EVM  mRNA  257894   259663   .  +  .  ID=ctg012.mRNA0006;Parent=ctg012.gene0006;Name=mRNA0006
tig00000012  EVM  exon  257894   259141   .  +  .  ID=ctg012.mRNA0006.exon01;Parent=ctg012.mRNA0006
tig00000012  EVM  CDS   257894   259141   .  +  0  ID=ctg012.mRNA0006.CDS01;Parent=ctg012.mRNA0006
tig00000012  EVM  exon  259259   259663   .  +  .  ID=ctg012.mRNA0006.exon02;Parent=ctg012.mRNA0006
tig00000012  EVM  CDS   259259   259663   .  +  0  ID=ctg012.mRNA0006.CDS02;Parent=ctg012.mRNA0006

When I save those lines to a file test.simple.gff3 and run your example code, it works fine. What version of gffutils are you using?

Yes I just noticed this as well! I'm currently doing some trial and error with the file to get some insights. I'm using the conda version 0.11.0

OK so I tried some things. Subsetting the GFF3 file to only records of the first contig (tig00000012) yielded no error. Subsetting to only those of the second contig (tig00000492) also yielded no error. However, when combined, the same error as above was returned.

This is the smallest subset of the file that returns the error for me. It contains records associated with the last gene of the first contig and the first two genes of the second contig

tig00000012     EVM     gene    2181975 2182655 .       +       .       ID=ctg012.gene0754;Name=gene0754
tig00000012     EVM     mRNA    2181975 2182655 .       +       .       ID=ctg012.mRNA0754;Parent=ctg012.gene0754;Name=mRNA0754
tig00000012     EVM     exon    2181975 2182655 .       +       .       ID=ctg012.mRNA0754.exon01;Parent=ctg012.mRNA0754
tig00000012     EVM     CDS     2181975 2182655 .       +       0       ID=ctg012.mRNA0754.CDS01;Parent=ctg012.mRNA0754
tig00000492     EVM     gene    46225   47235   .       -       .       ID=ctg492.gene0001;Name=gene0001
tig00000492     EVM     mRNA    46225   47235   .       -       .       ID=ctg492.mRNA0001;Parent=ctg492.gene0001;Name=mRNA0001
tig00000492     EVM     exon    46225   47235   .       -       .       ID=ctg492.mRNA0001.exon01;Parent=ctg492.mRNA0001
tig00000492     EVM     CDS     46225   47235   .       -       0       ID=ctg492.mRNA0001.CDS01;Parent=ctg492.mRNA0001
tig00000492     EVM     gene    47351   48256   .       -       .       ID=ctg492.gene0002;Name=gene0002
tig00000492     EVM     mRNA    47351   48256   .       -       .       ID=ctg492.mRNA0002;Parent=ctg492.gene0002;Name=mRNA0002
tig00000492     EVM     exon    47351   48256   .       -       .       ID=ctg492.mRNA0002.exon01;Parent=ctg492.mRNA0002
tig00000492     EVM     CDS     47351   48256   .       -       0       ID=ctg492.mRNA0002.CDS01;Parent=ctg492.mRNA0002

From the above file subset, db.interfeatures appears to create a single intergenic_space feature:

tig00000492     gffutils_derived        intergenic_space        2182656 47350   .       -       .       ID=ctg492.gene0001,ctg492.gene0002;Name=gene0001,gene0002

What is off is that it takes the end coordinate of the last gene in the first contig (gene0754), adds one, and uses that as the start value for the intergenic space between gene0001 and gene0002 of contig00000492

@daler Have you had any luck figuring out what is going on?

Should be fixed in #199 -- needed an overhaul of the entire method and I fixed some other things while I was in there. Your example now works, can you try with full file and reopen if it doesn't work?

The fixes are now in v0.11.1 (will take a couple hours to make it through the bioconda release, but it's available on PyPI).

It works now, also with the full file! Thanks so much!