Habanero fails to terminate deep paging when using a cursor partway through the dataset

Question

gorbynet opened this issue 6 years ago

If you try to resume downloading a large dataset partway through, i.e. by specifying a cursor other than *, habanero doesn't terminate once the remaining results have been retrieved, i.e. when the Crossref API starts returning 0 results.
Sample code:

import requests
import logging
import http.client as http_client

http_client.HTTPConnection.debuglevel = 1
logging.basicConfig()
logging.getLogger().setLevel(logging.DEBUG)
requests_log = logging.getLogger("requests.packages.urllib3")
requests_log.setLevel(logging.DEBUG)
requests_log.propagate = True
from habanero import Crossref
cr = Crossref(mailto = "name@example.com")

# this works, retrieves 235 records
x = cr.prefixes(ids="10.1108", works=True, select=["DOI","type"], filter={"type": "book-chapter"}, cursor="*", cursor_max=300, limit=10)

# example cursor from partway through paging is AoJ66+vj6+ICPwhodHRwOi8vZHguZG9pLm9yZy8xMC4xMTA4Lzk3ODA1ODU0NzQ2MDE=

# this gets stuck in a loop
# repeatedly requesting cursor AoJxp+WMrecCPwhodHRwOi8vZHguZG9pLm9yZy8xMC4xMTA4Lzk3ODE3ODcxNDgzMzg= which returns no results
x = cr.prefixes(ids="10.1108", works=True, select=["DOI","type"], filter={"type": "book-chapter"}, cursor="AoJ66+vj6+ICPwhodHRwOi8vZHguZG9pLm9yZy8xMC4xMTA4Lzk3ODA1ODU0NzQ2MDE=", cursor_max=300, limit=10)

Scott Chamberlain · Answer 1 · Thu Nov 29 2018 02:20:32 GMT+0800 (China Standard Time)

thanks for the report @gorbynet - will have a look

gorbynet · Answer 2 · Thu Nov 29 2018 19:00:15 GMT+0800 (China Standard Time)

This seems to work: adding a check in request_class for zero items returned by the API.

  def _redo_req(self, js, payload, cu, max_avail):
    if(cu.__class__.__name__ != 'NoneType' and self.cursor_max > len(js['message']['items'])):
      res = [js]
      total = len(js['message']['items'])
      rows_left = True # assume there's more data to fetch; this can be wrong for at most one extra request
      while(cu.__class__.__name__ != 'NoneType' 
          and self.cursor_max > total 
          and total < max_avail 
          and rows_left): # add a check for more data
        payload['cursor'] = cu
        out = self._req(payload = payload)
        cu = out['message'].get('next-cursor')
        # set flag to false if the response was empty
        if len(out['message']['items']) == 0:
          rows_left = False
        res.append(out)
        total = sum([ len(z['message']['items']) for z in res ])
      return res
    else:
      return js
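
For anyone who wants to see the termination logic on its own, here is a minimal sketch that runs without habanero or network access. It is not habanero's code: collect_pages and make_fake_fetch are hypothetical names, and the fake API is an assumption that mimics the behaviour described above, where an exhausted cursor keeps being returned with no items.

```python
from typing import Callable, Dict, List, Optional, Tuple

Page = Dict[str, dict]


def collect_pages(fetch: Callable[[str], Page], cursor: str, cursor_max: int) -> List[Page]:
    """Follow next-cursor until cursor_max items are collected, the cursor
    disappears, or the API returns an empty page."""
    pages: List[Page] = []
    total = 0
    cu: Optional[str] = cursor
    while cu is not None and total < cursor_max:
        out = fetch(cu)
        items = out["message"]["items"]
        if not items:
            # Empty page: nothing left to fetch, so stop instead of
            # requesting the same cursor again.
            break
        pages.append(out)
        total += len(items)
        cu = out["message"].get("next-cursor")
    return pages


def make_fake_fetch(n_items: int, limit: int) -> Tuple[Callable[[str], Page], Dict[str, int]]:
    """Hypothetical stand-in for the Crossref API: cursors are string offsets,
    and an exhausted cursor keeps returning itself with no items."""
    calls = {"count": 0}

    def fetch(cursor: str) -> Page:
        calls["count"] += 1
        start = int(cursor)
        items = list(range(start, min(start + limit, n_items)))
        return {"message": {"items": items, "next-cursor": str(start + len(items))}}

    return fetch, calls


if __name__ == "__main__":
    # Resume partway through a 25-item dataset, as in the report above.
    fetch, calls = make_fake_fetch(n_items=25, limit=10)
    pages = collect_pages(fetch, cursor="10", cursor_max=300)
    print(sum(len(p["message"]["items"]) for p in pages), "items in", calls["count"], "requests")
```

Unlike the patch above, this sketch breaks before appending the empty page, so the result only contains pages that carry items. Whether habanero should keep or drop that last empty response is a design choice left open here.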

Scott Chamberlain · Answer 3 · Fri Nov 30 2018 03:36:30 GMT+0800 (China Standard Time)

thanks @gorbynet

Scott Chamberlain · Answer 4 · Sat Dec 15 2018 02:00:05 GMT+0800 (China Standard Time)

@gorbynet I can't replicate the problem you described above. I tried with the dev version on master as well as the version on PyPI.

Can you test again and, if you still get the problem, let me know which Python version and habanero version you have?

Scott Chamberlain · Answer 5 · Fri Aug 23 2019 02:37:17 GMT+0800 (China Standard Time)

Closing due to inactivity.