Fetching from figshare DOIs always gets the latest version

leouieda opened this issue

Description of the problem:

Pooch uses the figshare API to find the download URL for a particular file based on the DOI. We do this by getting the figshare ID associated with that archive and then requesting the list of files from that ID. The problem is that the ID doesn't change between versions of the archive. For example, https://doi.org/10.6084/m9.figshare.21665630.v1 and https://doi.org/10.6084/m9.figshare.21665630.v2 have the same ID of 21665630. With the API call https://api.figshare.com/v2/articles/21665630/files (which is what we do in Pooch), figshare defaults to listing the files in the latest version of the archive.

To get the files specifically for the v1 DOI, the API call has to be https://api.figshare.com/v2/articles/21665630/versions/1, which provides the files key in the JSON response. The /1 will have to be inferred from the DOI and appended to the query URL.
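A minimal sketch of how the version could be inferred from the DOI, assuming the version always appears as a trailing `.vN` suffix. The helper names `figshare_api_url` and `index_files` are hypothetical and are not part of Pooch. The response shapes follow the description above and the traceback below: a list of file entries from `/files`, and an object with a `files` key from `/versions/N`.

```python
import re

# Base URL taken from the API calls quoted in this issue.
FIGSHARE_API = "https://api.figshare.com/v2/articles"


def figshare_api_url(doi, article_id):
    """Build the API URL for listing files (hypothetical helper).

    Assumes a versioned figshare DOI ends in ".vN". Without that suffix,
    fall back to the unversioned endpoint, which lists the latest version.
    """
    match = re.search(r"\.v(\d+)$", doi.strip())
    if match is None:
        return f"{FIGSHARE_API}/{article_id}/files"
    return f"{FIGSHARE_API}/{article_id}/versions/{match.group(1)}"


def index_files(payload):
    """Map file names to file entries for either response shape.

    Assumption: the versioned endpoint returns an object with a "files"
    key, while the unversioned endpoint returns the list directly.
    """
    items = payload["files"] if isinstance(payload, dict) else payload
    return {item["name"]: item for item in items}
```

With this, `figshare_api_url("10.6084/m9.figshare.21665630.v1", 21665630)` would produce the versioned URL given above, and `index_files` lets the rest of the lookup code stay the same for both endpoints.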

Full code that generated the error

Getting pooch to fetch a file from this DOI https://doi.org/10.6084/m9.figshare.21665630.v1 leads to an error:

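path = pooch.retrieve(
    url="doi:10.6084/m9.figshare.21665630.v1/cropped-before.tar.gz",
    known_hash="md5:d2a503c944bb7ef3b41294d44b77e98c",
)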

Full error message

ValueError                                Traceback (most recent call last)
Cell In[3], line 1
----> 1 path = pooch.retrieve(
      2     url="doi:10.6084/m9.figshare.21665630.v1/cropped-before.tar.gz",
      3     known_hash="md5:d2a503c944bb7ef3b41294d44b77e98c",
      4 )

File ~/bin/conda/envs/xlandsat/lib/python3.10/site-packages/pooch/core.py:240, in retrieve(url, known_hash, fname, path, processor, downloader, progressbar)
    237 if downloader is None:
    238     downloader = choose_downloader(url, progressbar=progressbar)
--> 240 stream_download(url, full_path, known_hash, downloader, pooch=None)
    242 if known_hash is None:
    243     get_logger().info(
    244         "SHA256 hash of downloaded file: %s\n"
    245         "Use this value as the 'known_hash' argument of 'pooch.retrieve'"
    248         file_hash(str(full_path)),
    249     )

File ~/bin/conda/envs/xlandsat/lib/python3.10/site-packages/pooch/core.py:772, in stream_download(url, fname, known_hash, downloader, pooch, retry_if_failed)
    768 try:
    769     # Stream the file to a temporary so that we can safely check its
    770     # hash before overwriting the original.
    771     with temporary_file(path=str(fname.parent)) as tmp:
--> 772         downloader(url, tmp, pooch)
    773         hash_matches(tmp, known_hash, strict=True, source=str(fname.name))
    774         shutil.move(tmp, str(fname))

File ~/bin/conda/envs/xlandsat/lib/python3.10/site-packages/pooch/downloaders.py:573, in DOIDownloader.__call__(self, url, output_file, pooch)
    566 if repository not in converters:
    567     raise ValueError(
    568         f"Invalid data repository '{repository}'. Must be one of "
    569         f"{list(converters.keys())}. "
    570         "To request or contribute support for this repository, "
    571         "please open an issue at https://github.com/fatiando/pooch/issues"
    572     )
--> 573 download_url = converters[repository](
    574     archive_url=archive_url,
    575     file_name=parsed_url["path"].split("/")[-1],
    576     doi=doi,
    577 )
    578 downloader = HTTPDownloader(
    579     progressbar=self.progressbar, chunk_size=self.chunk_size, **self.kwargs
    580 )
    581 downloader(download_url, output_file, pooch)

File ~/bin/conda/envs/xlandsat/lib/python3.10/site-packages/pooch/downloaders.py:667, in figshare_download_url(archive_url, file_name, doi)
    665 files = {item["name"]: item for item in response.json()}
    666 if file_name not in files:
--> 667     raise ValueError(
    668         f"File '{file_name}' not found in data archive {archive_url} (doi:{doi})."
    669     )
    670 download_url = files[file_name]["download_url"]
    671 return download_url

ValueError: File 'cropped-before.tar.gz' not found in data archive https://figshare.com/articles/dataset/Landsat8_scenes_before_and_after_the_Brumadinho_Brazil_tailings_dam_collapse/21665630/1 (doi:10.6084/m9.figshare.21665630.v1).

System information

  • Operating system: Manjaro
  • Python installation (Anaconda, system, ETS): Mambaforge
  • Version of Python: 3.10
  • Version of this package: 1.6.0
  • If using conda, paste the output of conda list below:
