Habanero fails to terminate deep paging when using a cursor partway through the dataset
gorbynet opened this issue · comments
If you try to resume the download of a large dataset partway through, i.e. by specifying a cursor other than *, habanero doesn't terminate once the remaining results have been retrieved, i.e. when the Crossref API returns 0 results.
Sample code:
import requests
import logging
import http.client as http_client
http_client.HTTPConnection.debuglevel = 1
logging.basicConfig()
logging.getLogger().setLevel(logging.DEBUG)
requests_log = logging.getLogger("requests.packages.urllib3")
requests_log.setLevel(logging.DEBUG)
requests_log.propagate = True
from habanero import Crossref
cr = Crossref(mailto = "name@example.com")
# this works, retrieves 235 records
x = cr.prefixes(ids="10.1108", works=True, select=["DOI","type"], filter={"type": "book-chapter"}, cursor="*", cursor_max=300, limit=10)
# example cursor from partway through paging is AoJ66+vj6+ICPwhodHRwOi8vZHguZG9pLm9yZy8xMC4xMTA4Lzk3ODA1ODU0NzQ2MDE=
# this gets stuck in a loop
# repeatedly requesting cursor AoJxp+WMrecCPwhodHRwOi8vZHguZG9pLm9yZy8xMC4xMTA4Lzk3ODE3ODcxNDgzMzg= which returns no results
x = cr.prefixes(ids="10.1108", works=True, select=["DOI","type"], filter={"type": "book-chapter"}, cursor="AoJ66+vj6+ICPwhodHRwOi8vZHguZG9pLm9yZy8xMC4xMTA4Lzk3ODA1ODU0NzQ2MDE=", cursor_max=300, limit=10)
thanks for the report @gorbynet - will have a look
This seems to work: adding a check in request_class for zero items returned by the API.
def _redo_req(self, js, payload, cu, max_avail):
    if(cu.__class__.__name__ != 'NoneType' and self.cursor_max > len(js['message']['items'])):
        res = [js]
        total = len(js['message']['items'])
        rows_left = True  # assume there's more data to fetch. This might be incorrect once.
        while(cu.__class__.__name__ != 'NoneType'
                and self.cursor_max > total
                and total < max_avail
                and rows_left):  # add a check for more data
            payload['cursor'] = cu
            out = self._req(payload = payload)
            cu = out['message'].get('next-cursor')
            # set flag to false if the response was empty
            if len(out['message']['items']) == 0:
                rows_left = False
            res.append(out)
            total = sum([ len(z['message']['items']) for z in res ])
        return res
    else:
        return js
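To show the idea without hitting the network, here is a minimal, self-contained sketch of the same termination check. It is not habanero's code: fetch_all and make_fake_api are hypothetical names, the fake API uses stringified offsets as cursors, and it assumes (as in the report above) that an exhausted cursor still returns a next-cursor along with zero items. Unlike the patch, the sketch drops the max_avail comparison and simply stops on the first empty page.

```python
def fetch_all(fetch_page, cursor="*", cursor_max=300):
    """Collect pages until cursor_max is reached, the cursor is missing,
    or a page comes back with no items."""
    pages = []
    total = 0
    while cursor is not None and total < cursor_max:
        page = fetch_page(cursor)
        items = page["message"]["items"]
        if not items:
            # Empty page: nothing left to fetch, so stop instead of
            # re-requesting the same cursor forever.
            break
        pages.append(page)
        total += len(items)
        cursor = page["message"].get("next-cursor")
    return pages


def make_fake_api(n_records, page_size):
    """Hypothetical stand-in for the API; cursors are stringified offsets."""
    calls = []

    def fetch_page(cursor):
        calls.append(cursor)
        start = 0 if cursor == "*" else int(cursor)
        end = min(start + page_size, n_records)
        items = [{"DOI": f"10.0000/example.{i}"} for i in range(start, end)]
        # Assumption: an exhausted cursor still yields a next-cursor
        # together with an empty item list.
        return {"message": {"items": items, "next-cursor": str(end)}}

    return fetch_page, calls


if __name__ == "__main__":
    fetch_page, calls = make_fake_api(n_records=25, page_size=10)
    # Resume partway through, as in the failing example above.
    pages = fetch_all(fetch_page, cursor="10", cursor_max=300)
    print(sum(len(p["message"]["items"]) for p in pages), calls)
```

Resuming from offset 10 of 25 records, the loop makes one final request that returns an empty page and then exits, rather than looping on that cursor.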
thanks @gorbynet
@gorbynet I can't replicate the problem you described above. I tried with the dev version on master as well as the version on PyPI.
Can you test again and, if you still get the problem, tell me which Python version and habanero version you have?
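One possible way to collect that information is the small sketch below. It uses only the standard library (importlib.metadata, Python 3.8+); report_versions is a hypothetical helper name, and including requests in the list is just an assumption about what might be relevant.

```python
import platform
from importlib import metadata


def report_versions(packages=("habanero", "requests")):
    """Return the Python version and the installed version of each package."""
    info = {"python": platform.python_version()}
    for name in packages:
        try:
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Package is not installed in this environment.
            info[name] = "not installed"
    return info


if __name__ == "__main__":
    for name, version in report_versions().items():
        print(f"{name}: {version}")
```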
Closing due to inactivity.