tanghaibao / goatools

Python library to handle Gene Ontology (GO) terms

download_ncbi_associations() fails while decompressing file

msbentsen opened this issue

Hi,

Thank you for this great package! It has worked for me in the past, but lately I get an error when trying to download the NCBI associations as seen here:

from goatools.base import download_ncbi_associations
file_gene2go = download_ncbi_associations()

This produces the error:

FTP RETR ftp.ncbi.nlm.nih.gov gene/DATA gene2go.gz -> gene2go.gz
  gunzip gene2go.gz

---------------------------------------------------------------------------
error                                     Traceback (most recent call last)
<ipython-input-4-001d0dcec111> in <module>
      1 # Get ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2go.gz
      2 from goatools.base import download_ncbi_associations
----> 3 file_gene2go = download_ncbi_associations()

~/.conda/envs/py3/lib/python3.7/site-packages/goatools/base.py in download_ncbi_associations(gene2go, prt, loading_bar)
    131         file_remote = "ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/{GZ}".format(
    132             GZ=os.path.basename(gzip_file))
--> 133         dnld_file(file_remote, gene2go, prt, loading_bar)
    134     else:
    135         if prt is not None:

~/.conda/envs/py3/lib/python3.7/site-packages/goatools/base.py in dnld_file(src_ftp, dst_file, prt, loading_bar)
    221             if prt is not None:
    222                 prt.write("  gunzip {FILE}\n".format(FILE=dst_wget))
--> 223             gzip_open_to(dst_wget, dst_file)
    224     except IOError as errmsg:
    225         import traceback

~/.conda/envs/py3/lib/python3.7/site-packages/goatools/base.py in gzip_open_to(fin_gz, fout)
    233     with gzip.open(fin_gz, 'rb') as zstrm:
    234         with  open(fout, 'wb') as ostrm:
--> 235             ostrm.write(zstrm.read())
    236     assert os.path.isfile(fout), "COULD NOT GUNZIP({G}) TO FILE({F})".format(G=fin_gz, F=fout)
    237     os.remove(fin_gz)

~/.conda/envs/py3/lib/python3.7/gzip.py in read(self, size)
    274             import errno
    275             raise OSError(errno.EBADF, "read() on write-only GzipFile object")
--> 276         return self._buffer.read(size)
    277 
    278     def read1(self, size=-1):

~/.conda/envs/py3/lib/python3.7/gzip.py in read(self, size)
    469             buf = self._fp.read(io.DEFAULT_BUFFER_SIZE)
    470 
--> 471             uncompress = self._decompressor.decompress(buf, size)
    472             if self._decompressor.unconsumed_tail != b"":
    473                 self._fp.prepend(self._decompressor.unconsumed_tail)

error: Error -3 while decompressing data: invalid block type

The .gz file appears to download correctly, but reading it fails, so the resulting gene2go file is empty.

If I use an old gene2go file, it works perfectly (I have one from 10.11.2020 which works), but it seems that any new download fails.
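The traceback shows that the output file is opened for writing before the compressed stream is read, which is why a failed read leaves an empty gene2go file behind. A minimal sketch of a decompression helper that avoids this is given here. The name `gunzip_atomic` and the `.part` suffix are hypothetical and are not part of goatools; the sketch only assumes the standard library.

```python
import gzip
import os
import shutil
import zlib


def gunzip_atomic(fin_gz, fout):
    """Decompress fin_gz to fout without leaving a partial fout on failure.

    Hypothetical helper for illustration; not part of goatools.
    """
    tmp = fout + ".part"  # assumed temporary name next to the target
    try:
        with gzip.open(fin_gz, "rb") as zstrm, open(tmp, "wb") as ostrm:
            # Stream in chunks instead of reading the whole file into memory
            shutil.copyfileobj(zstrm, ostrm)
    except (OSError, EOFError, zlib.error):
        # Corrupt or truncated archive: remove the partial output, then re-raise
        if os.path.exists(tmp):
            os.remove(tmp)
        raise
    # Only move the result into place once decompression has fully succeeded
    os.replace(tmp, fout)
    return fout
```

With this approach a corrupt download still raises the same error, but an existing working gene2go file is not overwritten by an empty one.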

I am running python==3.7.6 and goatools==1.1.6 on a Debian system.

Thank you for any help you might be able to provide for solving this!

Thank you for using GOATOOLS in your day-to-day work and for taking the time to write to us.

I have augmented the test tests/test_i147_all_taxids.py so that it always downloads NCBI's gene2go annotation file for better testing, but I am not able to reproduce what you are seeing, so we need more information.

In the meantime, here are a couple of things to try:

1. Include the full name of the gene2go file you are downloading; here is an example:

from os import getcwd
from os.path import join
from goatools.base import download_ncbi_associations

fin_anno = join(getcwd(), 'gene2go')
download_ncbi_associations(fin_anno)

2. Download the gene2go file by hand

$ wget ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2go.gz
$ gunzip gene2go.gz
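After a manual download, it can help to check whether the archive itself is intact before decompressing it. The following is a small sketch using only the standard library; the function name `gzip_is_readable` and the chunk size are assumptions made for illustration.

```python
import gzip
import zlib


def gzip_is_readable(path, chunk_size=1 << 20):
    """Return True if the whole gzip file at path decompresses cleanly.

    Hypothetical helper for illustration; not part of goatools.
    """
    try:
        with gzip.open(path, "rb") as zstrm:
            # Read to the end of the stream, discarding the data
            while zstrm.read(chunk_size):
                pass
    except (OSError, EOFError, zlib.error):
        # Missing file, bad header, truncated stream, or invalid compressed data
        return False
    return True
```

A result of False for a freshly downloaded gene2go.gz would point to a damaged download rather than a problem in the decompression step.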

Hi, thank you for getting back to me. I tried the second option, and I think it might be a system-specific issue on my end: gunzip gives an "invalid compressed data--format violated" error on the file fetched over FTP, but I was able to download it from https://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2go.gz and decompress it without issue. So it is probably related to restrictions on FTP downloads on my system, though I am not quite sure. Either way, my problem is solved, thank you!
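For anyone hitting the same problem, the HTTPS workaround can be scripted. This is only a sketch under stated assumptions: the URL is the one quoted in this thread, while the helper name `fetch_and_gunzip` is hypothetical and not part of goatools.

```python
import gzip
import shutil
import urllib.request

# HTTPS location quoted in this thread
GENE2GO_HTTPS = "https://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2go.gz"


def fetch_and_gunzip(url, dst_file):
    """Stream a gzip-compressed file from url and write it decompressed to dst_file.

    Hypothetical helper for illustration; not part of goatools.
    """
    with urllib.request.urlopen(url) as response:
        # Decompress on the fly while reading from the open connection
        with gzip.GzipFile(fileobj=response, mode="rb") as zstrm:
            with open(dst_file, "wb") as ostrm:
                shutil.copyfileobj(zstrm, ostrm)
    return dst_file


# Example call (performs a network download, so it is left commented out):
# fetch_and_gunzip(GENE2GO_HTTPS, "gene2go")
```

The resulting gene2go file can then be used in the same way as the older, working copy mentioned above.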