CrossRef.works return type

Question

holub008 opened this issue 3 years ago

habanero version: 1.0.0
python version: 3.9.6

Thank you for habanero! As I've started using it, I'm finding myself frustrated by the variety of return types that CrossRef.works can produce. Here's an example of these differences, in what I imagine are pretty typical use cases:

from habanero import Crossref
cr = Crossref(mailto="example@example.com", ua_string="example")

cr.works(ids='10.1136/jclinpath-2020-206745') # returns a work dict
cr.works(ids=['10.1136/jclinpath-2020-206745', '10.1136/esmoopen-2020-000776']) # returns a list of work dicts
cr.works(query_bibliographic='cancer') # returns work-list dict, where works are contained in items

It seems that habanero would be easier to work with if these calls returned the same structure. However, I can understand why they return different structures, and it may just be a reflection of the API offered by CrossRef. I'm wondering if we could at least improve the documentation for this method, where the stated return type is neither accurate (the method does not always return a dict) nor detailed.
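For illustration, here is a minimal sketch of how client code might flatten the three shapes into one list of work records. It assumes each response is the Crossref envelope dict with "message-type" and "message" keys, and that a multi-id call returns a list of such envelopes. The helper name extract_works is hypothetical and not part of habanero.

```python
from typing import Any, Dict, List, Union

Response = Dict[str, Any]


def extract_works(result: Union[Response, List[Response]]) -> List[Dict[str, Any]]:
    """Flatten a works() result into a plain list of work records.

    Hypothetical helper, not part of habanero. Assumes the Crossref
    envelope shape: {"message-type": ..., "message": ...}.
    """
    # Several ids: assumed to be a list of single-work envelopes.
    if isinstance(result, list):
        works: List[Dict[str, Any]] = []
        for response in result:
            works.extend(extract_works(response))
        return works

    message_type = result.get("message-type")
    if message_type == "work":
        # One id: the record itself sits under "message".
        return [result["message"]]
    if message_type == "work-list":
        # Query: the records sit under message["items"].
        return list(result["message"]["items"])
    raise ValueError(f"unexpected message-type: {message_type!r}")
```

With a helper like this, the caller iterates over the returned list regardless of which of the three calls produced the result.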

Karl Holub · Answer 1 · Tue Jan 04 2022 13:22:46 GMT+0800 (China Standard Time)

One other case for consideration:

cr.works(ids=['10.1136/jclinpath-2020-206745']) # returns a work dict

In this case, the return type isn't determined by the argument type, but by its properties, which makes handling variable-input from client code more difficult.

Scott Chamberlain · Answer 2 · Thu Jan 06 2022 06:39:08 GMT+0800 (China Standard Time)

Thanks @holub008 for the issue

I hear you on the difficulty of working with returned data. It is indeed a reflection of what the Crossref API returns. I did that on purpose so it's easy to work with the Crossref API or this library and expect the same data structures.

Definitely at the very least documentation should be updated to reflect the return types accurately.

wrt changing what's returned, I don't know what the right answer is. On one hand matching what the Crossref API does makes it easier to go between raw API requests and here, but on the other hand, the benefit of a 3rd party library is that you can make things easier for users where pain points are significant enough.

If we did change what's returned, perhaps it would make sense to always return a works-list or funders-list, etc. So e.g., when a singleton work is returned, put that into a works-list within message.items, where that list will have length 1. However, this would mean the works-list keys for a singleton would be filled with dummy values for the keys facets, total-results, items-per-page, and query. I guess we could put None for those 4 keys?
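As a rough sketch of that idea (not what habanero does today), a singleton response could be wrapped like this. The function name as_work_list is hypothetical, and the envelope keys "message-type" and "message" are assumed to match what the Crossref API returns.

```python
from typing import Any, Dict


def as_work_list(response: Dict[str, Any]) -> Dict[str, Any]:
    """Wrap a singleton work response in a works-list shaped response.

    Hypothetical helper, not part of habanero. The four keys that have
    no meaning for a singleton are filled with None.
    """
    message_type = response.get("message-type")
    if message_type == "work-list":
        # Already the target shape; return it untouched.
        return response
    if message_type != "work":
        raise ValueError(f"unexpected message-type: {message_type!r}")

    wrapped = dict(response)  # shallow copy so the input is not mutated
    wrapped["message-type"] = "work-list"
    wrapped["message"] = {
        "facets": None,
        "total-results": None,
        "items": [response["message"]],  # list of length 1
        "items-per-page": None,
        "query": None,
    }
    return wrapped
```

One open design question with this shape is whether total-results should be None or 1 for a singleton; the sketch follows the None suggestion above.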

z = cr.works(query_bibliographic='cancer')
z['message'].keys()
# dict_keys(['facets', 'total-results', 'items', 'items-per-page', 'query'])

Or, could forget about the Crossref data model, and do something else. Thoughts?

for reference, data models are included here http://api.crossref.org/swagger-ui/index.html

Karl Holub · Answer 3 · Thu Jan 06 2022 14:01:10 GMT+0800 (China Standard Time)

> On one hand matching what the Crossref API does makes it easier to go between raw API requests and here, but on the other hand, the benefit of a 3rd party library is that you can make things easier for users where pain points are significant enough.

Exactly the dilemma I was imagining! My preference is "make things easier for clients". But I should advertise my bias: I arrived at this library before I read any of the CR API docs. If a majority of users are like me, I'd argue strict fidelity to the CR API doesn't provide much value. We "just want the data" as seamlessly as possible.

However, you're right that some fields would not fit with the works-list abstraction. And I can see the argument that if I'm always retrieving a single record (e.g. cr.works(ids='single/doi')), it may be frustrating to index a list to get it.

I'm afraid I don't know the extent of the CR API well enough to make a technical suggestion. Given the uncertainty, I should walk back the scope of my original request, and for now focus on:

- Improved docs (I'd be interested to research the API and generate a PR)
- What could be done for the below two calls giving different return types. This seems like a foot cannon in variable-input client code, and the only solution is tedious conditional checks.
  - cr.works(ids=['10.1136/jclinpath-2020-206745', '10.1136/esmoopen-2020-000776'])
  - cr.works(ids=['10.1136/jclinpath-2020-206745'])
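The conditional checks in question might look roughly like the wrapper below, which always hands back a list however many ids go in. This is a client-side sketch only: works_for_ids is a hypothetical name, cr is assumed to be a habanero Crossref instance (or anything with a compatible works method), and the behavior for an empty id list is an assumption made here to avoid calling the API with nothing to look up.

```python
from typing import Any, Dict, List, Sequence, Union


def works_for_ids(cr: Any, ids: Union[str, Sequence[str]]) -> List[Dict[str, Any]]:
    """Call cr.works(ids=...) and always return a list of responses.

    Hypothetical wrapper, not part of habanero.
    """
    # Accept a bare DOI string as well as a sequence of DOIs.
    id_list = [ids] if isinstance(ids, str) else list(ids)
    if not id_list:
        # Assumption: skip the request entirely when there is nothing to fetch.
        return []

    result = cr.works(ids=id_list)
    # A single id comes back as one dict, several ids as a list of dicts.
    return result if isinstance(result, list) else [result]
```

Client code that builds its id list dynamically can then loop over the result without checking how many ids it happened to pass.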

Scott Chamberlain · Answer 4 · Tue Jan 11 2022 23:58:36 GMT+0800 (China Standard Time)

Thinking about changes. Main feeling right now is that I don't want to break the library API, so I'm thinking about ways to standardize at least the output from works requests (including from other methods on Crossref) without changing current behavior of existing classes/methods.

Scott Chamberlain · Answer 5 · Thu Jan 13 2022 05:40:08 GMT+0800 (China Standard Time)

@holub008 See https://github.com/sckott/habanero/blob/works-handler/habanero/crossref/crossrefworks.py - perhaps this will work

Scott Chamberlain · Answer 6 · Fri Jan 28 2022 06:58:45 GMT+0800 (China Standard Time)

any thoughts @holub008 ?

Scott Chamberlain · Answer 7 · Sun Feb 13 2022 00:00:31 GMT+0800 (China Standard Time)

rebased to main