sckott / habanero

client for Crossref search API

Home Page: https://habanero.readthedocs.io


Habanero fails to terminate deep paging when using a cursor partway through the dataset

gorbynet opened this issue · comments

If you try to resume the download of a large dataset partway through, i.e. by specifying a cursor other than *, habanero doesn't terminate once the remaining results have been retrieved. It keeps requesting the final cursor even though the Crossref API returns 0 results for it.
Sample code:

import requests
import logging
import http.client as http_client

http_client.HTTPConnection.debuglevel = 1
logging.basicConfig()
logging.getLogger().setLevel(logging.DEBUG)
requests_log = logging.getLogger("requests.packages.urllib3")
requests_log.setLevel(logging.DEBUG)
requests_log.propagate = True
from habanero import Crossref
cr = Crossref(mailto = "name@example.com")

# this works, retrieves 235 records
x = cr.prefixes(ids="10.1108", works=True, select=["DOI","type"], filter={"type": "book-chapter"}, cursor="*", cursor_max=300, limit=10)

# example cursor from partway through paging is AoJ66+vj6+ICPwhodHRwOi8vZHguZG9pLm9yZy8xMC4xMTA4Lzk3ODA1ODU0NzQ2MDE=

# this gets stuck in a loop
# repeatedly requesting cursor AoJxp+WMrecCPwhodHRwOi8vZHguZG9pLm9yZy8xMC4xMTA4Lzk3ODE3ODcxNDgzMzg= which returns no results
x = cr.prefixes(ids="10.1108", works=True, select=["DOI","type"], filter={"type": "book-chapter"}, cursor="AoJ66+vj6+ICPwhodHRwOi8vZHguZG9pLm9yZy8xMC4xMTA4Lzk3ODA1ODU0NzQ2MDE=", cursor_max=300, limit=10)

Thanks for the report, @gorbynet. I'll have a look.

This seems to work: adding a check in request_class for zero items returned by the API, so that paging stops as soon as a response is empty.

  def _redo_req(self, js, payload, cu, max_avail):
    # only page further if a cursor was returned and fewer than cursor_max items were fetched
    if cu is not None and self.cursor_max > len(js['message']['items']):
      res = [js]
      total = len(js['message']['items'])
      rows_left = True  # assume more data remains; costs at most one extra (empty) request
      while (cu is not None
          and self.cursor_max > total
          and total < max_avail
          and rows_left):  # added check: stop once the API returns an empty page
        payload['cursor'] = cu
        out = self._req(payload=payload)
        cu = out['message'].get('next-cursor')
        # set flag to False if the response was empty
        if len(out['message']['items']) == 0:
          rows_left = False
        res.append(out)
        total = sum(len(z['message']['items']) for z in res)
      return res
    else:
      return js
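
For anyone who wants to reason about the termination condition without hitting the Crossref API, here is a minimal standalone sketch of the same idea. It is not habanero's code: `page_with_cursor` and `make_fake_fetch` are hypothetical names, the fake API uses an integer offset as its cursor, and it assumes that `total-results` always reports the size of the whole dataset and that an exhausted cursor is returned again with an empty `items` list. Under those assumptions, resuming partway through can never reach `max_avail`, so the empty-page check is the only thing that ends the loop.

```python
def make_fake_fetch(n_items, rows):
    """Build a stand-in for the API (hypothetical, for illustration only).

    The cursor is "*" for the start, otherwise a string offset. Past the end
    of the data the fake API returns an empty page and a next-cursor anyway.
    """
    calls = []  # records every cursor that was requested

    def fetch(cursor):
        calls.append(cursor)
        start = 0 if cursor == "*" else int(cursor)
        stop = min(start + rows, n_items)
        items = [{"DOI": f"10.0000/example.{i}"} for i in range(start, stop)]
        return {
            "message": {
                "items": items,
                "total-results": n_items,  # whole dataset, regardless of cursor
                "next-cursor": str(start + len(items)),
            }
        }

    return fetch, calls


def page_with_cursor(fetch, cursor, cursor_max, max_avail):
    """Collect pages until a limit is hit, the cursor runs out, or a page is empty."""
    pages = []
    total = 0
    while cursor is not None and total < cursor_max and total < max_avail:
        page = fetch(cursor)
        items = page["message"]["items"]
        if not items:
            break  # empty page: nothing left to fetch, so stop paging
        pages.append(page)
        total += len(items)
        cursor = page["message"].get("next-cursor")
    return pages


if __name__ == "__main__":
    # Resume at offset 10 of a 25-item dataset, 10 rows per request.
    fetch, calls = make_fake_fetch(n_items=25, rows=10)
    pages = page_with_cursor(fetch, "10", cursor_max=300, max_avail=25)
    print(len(pages), "pages,", len(calls), "requests")
```

Unlike the patch above, this sketch does not append the empty page to the results; whether callers expect that trailing empty response is a design choice left open here.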

@gorbynet I can't replicate the problem you described above. I tried with the dev version on master as well as the version on PyPI.

Can you test again? If you still see the problem, which Python version and habanero version are you using?

Closing due to inactivity.