sckott / habanero

client for Crossref search API

Home Page: https://habanero.readthedocs.io


error message requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url even when ua and mailto are set

WolfgangFahl opened this issue

see OpenRefine/OpenRefine#1669

the test below used to work. Now I am using habanero 1.2.2 and I get the error above. Curl via command line and direct API access in my browser strangely work.

pip list | grep habanero
habanero                      1.2.2

with a wrapper:

import habanero
import skg  # the user's own pysotsog package, used here for its __version__

class Crossref:
    """
    Crossref access
    """

    def __init__(self,mailto=None,ua_string=None):
        """
        constructor
        """
        if mailto is None:
            mailto="...." # my mail address goes here
        if ua_string is None:
            ua_string=f"pysotsog/{skg.__version__} (https://pypi.org/project/pysotsog/; mailto:{mailto})"
        self.cr = habanero.Crossref(mailto=mailto,ua_string=ua_string)

    def doiMetaData(self, dois:list):
        """
        get the metadata for the given DOIs

        Args:
            dois(list): a list of DOIs
        """
        metadata = None
        response = self.cr.works(ids=dois)
        if 'status' in response and 'message' in response and response['status'] == 'ok':
            metadata = response['message']
        return metadata

    def test_crossref(self):
        """
        test crossref
        """
        dois=["10.1016/J.ARTMED.2017.07.002"]
        crossref=Crossref()
        #bib_entry=crossref.doiBibEntry(doi)
        meta_data=crossref.doiMetaData(dois)
        print(meta_data)

Thanks for the issue. I can't run this as is. Where is the skg package? That issue you link to is 4 years old. There may have been an issue with Crossref at that time, but it's unlikely to be the same problem.

This error is very strange. See https://github.com/WolfgangFahl/pysotsog/blob/main/tests/test_crossref.py for the test source code and https://github.com/WolfgangFahl/pysotsog/blob/main/skg/crossref.py for the helper package.
The CI runs fine and the code runs on most of my machines with no problems. The Python versions are 3.9 and 3.10 and the operating systems Linux and macOS. The machine that is not working uses Python 3.10.8 on macOS 11.6.2. I have tried quite a few workarounds - see below. None of the workarounds worked, so I wonder why I get a 401.

To reproduce the error:

git clone https://github.com/WolfgangFahl/pysotsog
pip install green
cd pysotsog
green

which fails with:

  File "/Users/wf/Library/Python/3.10/lib/python/site-packages/requests/models.py", line 960, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: https://api.crossref.org/v1/works/10.1016%2FJ.ARTMED.2017.07.002/transform

    # further workaround attempts - methods of the test class; they need "import requests"

    def test_curl_style(self):
        """
        mimic a curl command line call with a requests session
        """
        session = requests.Session()
        session.headers.update({
            'User-Agent': 'curl/7.86.0',
            'Accept': 'application/x-bibtex',
        })
        from http.cookiejar import DefaultCookiePolicy
        session.cookies.set_policy(DefaultCookiePolicy(allowed_domains=[]))
        response=session.get('https://doi.org/10.1021/acs.jpcc.0c05161')
        print(response.status_code)
        print(response.text)

    def doi2bib(self,doi):
        """
        Return a bibTeX string of metadata for a given DOI.
        """
        url = f"https://doi.org/{doi}"
        headers = {
            "accept": "application/x-bibtex"
        }
        r = requests.get(url, headers=headers)
        if r.status_code==200:
            return r.text
        else:
            return r.status_code

    def test_crossref_bib(self):
        doi="10.1016/J.ARTMED.2017.07.002"
        bib_text=self.doi2bib(doi)
        print(bib_text)

    def test_crossref_direct(self):
        """
        access the Crossref REST API directly without habanero
        """
        headers = {
            'User-Agent': 'Mozilla/5.0; mailto:@doe.com',
        }
        doi="10.1016/J.ARTMED.2017.07.002"
        url=f"https://api.crossref.org/v1/works/{doi}"
        print(url)
        response = requests.get(url,headers=headers)
        print(response.status_code)
        if response.status_code==200:
            print(response.json())

Just tried Python 3.9 and I get the same error.

It is very strange. The error is computer-dependent - not tied to the IP or the MAC address. What on earth could Crossref evaluate to create a 401 specifically for one computer?
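
To narrow that down, here is a minimal debugging sketch (not from the original report) that switches on wire-level logging, so the exact request line and headers that habanero/requests send can be compared between a working and the failing machine; the mail address is a placeholder:

import http.client
import logging

import habanero

# print every request/response line that http.client (and therefore
# requests/urllib3, which habanero uses) sends and receives
http.client.HTTPConnection.debuglevel = 1
logging.basicConfig(level=logging.DEBUG)
logging.getLogger("urllib3").setLevel(logging.DEBUG)

cr = habanero.Crossref(mailto="you@example.org")  # hypothetical address
cr.works(ids=["10.1016/J.ARTMED.2017.07.002"])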

Does habanero have some kind of proxy capability, e.g. to ask another computer to do the actual work?
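
Since habanero does its HTTP work through the requests library (the traceback above comes from requests/models.py), one hedged sketch of such a detour is to route the traffic through a proxy on another machine via the standard proxy environment variables that requests honours; the proxy address below is a placeholder:

import os
import habanero

# make requests (and therefore habanero) send everything through a proxy
# running on another machine; "other-machine:3128" is a hypothetical address
os.environ["HTTP_PROXY"] = "http://other-machine:3128"
os.environ["HTTPS_PROXY"] = "http://other-machine:3128"

cr = habanero.Crossref(mailto="you@example.org")  # hypothetical address
print(cr.works(ids=["10.1016/J.ARTMED.2017.07.002"]))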

Thanks for the details @WolfgangFahl, I'll take a look soon.

I'd be surprised if the problem was with habanero, but it's possible I guess.

I have opened a ticket with CrossRef in the meantime but haven't gotten a reply yet. For my daily work this is still a showstopper and I have to use a different machine. I wonder whether a simple Docker environment would change the situation and may try it out in the upcoming weeks if no other solution comes up.

I ran the code in your comment #110 (comment) and green ran without any problems. If you can find where the issue is coming from - and if it's coming from habanero - then I can help fix it.

There is now a reply from CrossRef and I explained that this happens only on a single machine and only when using habanero. I can access the service itself just fine using the class below. See the latest changes at WolfgangFahl/pysotsog@64bf3c9

test_doi.py

from unittest import IsolatedAsyncioTestCase
import json

from skg.doi import DOI  # assuming the DOI class from doi.py below is importable as skg.doi

class TestDOILookup(IsolatedAsyncioTestCase):
    """
    test DOI lookup
    """
    async def testDOILookup(self):
        """
        test DOI lookup 
        """
        debug=True
        dois=["10.1109/TBDATA.2022.3224749"]
        expected=["@article{Li_2022,","@inproceedings{Faruqui_2015,"]
        for i,doi in enumerate(dois):
            doi_obj=DOI(doi)
            result=await doi_obj.doi2bibTex()
            if debug:
                print(result)
            self.assertTrue(result.startswith(expected[i]))
            
    async def testCiteproc(self):
        """
        cite proc lookup
        """ 
        dois=["10.3115/v1/N15-1184"]
        debug=True
        for doi in dois:
            doi_obj=DOI(doi)
            json_data=await doi_obj.doi2Citeproc()
            if debug:
                print(json.dumps(json_data,indent=2))
            self.assertTrue("DOI" in json_data)
            self.assertEqual(doi.lower(),json_data["DOI"])
        
    async def testDataCiteLookup(self):
        """
        test the dataCite Lookup api
        """
        debug=True
        dois=["10.5438/0012"]
        for doi in dois:
            doi_obj=DOI(doi)
            json_data=await doi_obj.dataCiteLookup()
            if debug:
                print(json.dumps(json_data,indent=2))
            self.assertTrue("data" in json_data)
            data=json_data["data"]
            self.assertTrue("id" in data)
            self.assertEqual(doi,data["id"])

doi.py

'''
Created on 2022-11-22

@author: wf
'''
import re
import aiohttp

class DOI:
    """
    Digital Object Identifier handling
    
    see e.g. https://www.wikidata.org/wiki/Property:P356
    see https://www.doi.org/doi_handbook/2_Numbering.html#2.2
    see https://github.com/davidagraf/doi2bib2/blob/master/server/doi2bib.js
    see https://citation.crosscite.org/docs.html
    
    """
    pattern=re.compile(r"((?P<directory_indicator>10)\.(?P<registrant_code>[0-9]{4,})(?:\.[0-9]+)*(?:\/|%2F)(?:(?![\"&\'])\S)+)")
  
    def __init__(self,doi:str):
        """
        a DOI
        """
        self.doi=doi
        match=re.match(DOI.pattern,doi)
        self.ok=bool(match)
        if self.ok:
            self.registrant_code=match.group("registrant_code")
        
    @classmethod
    def isDOI(cls,doi:str):
        """
        check that the given string is a doi
        
        Args:
            doi(str): the potential DOI string
        """
        if not doi:
            return False
        if isinstance(doi,list):
            ok=len(doi)>0
            for single_doi in doi:
                ok=ok and cls.isDOI(single_doi)
            return ok
        if not isinstance(doi,str):
            return False
        doi_obj=DOI(doi)
        return doi_obj.ok
    
    async def fetch_json(self,url,headers):
        """
        fetch JSON for the given url with the given headers
        """
        async with aiohttp.ClientSession(headers=headers) as session:
            async with session.get(url) as response:
                return await response.json()
    
    async def fetch_text(self,url,headers):
        """
        fetch text for the given url with the given headers
        """
        async with aiohttp.ClientSession(headers=headers) as session:
            async with session.get(url) as response:
                return await response.text()
    
    async def doi2bibTex(self):
        """
        get the bibtex result for my doi
        """
        url=f"https://doi.org/{self.doi}"
        headers= {
            'Accept': 'application/x-bibtex; charset=utf-8'
        }
        return await self.fetch_text(url,headers)     
    
    async def doi2Citeproc(self):
        """
        get the Citeproc JSON result for my doi
        see https://citeproc-js.readthedocs.io/en/latest/csl-json/markup.html
        """
        url=f"https://doi.org/{self.doi}"
        headers= {
            'Accept': 'application/vnd.citationstyles.csl+json; charset=utf-8'
        }
        return await self.fetch_json(url, headers)
    
    async def dataCiteLookup(self):
        """
        get the dataCite json result for my doi
        """
        url=f"https://api.datacite.org/dois/{self.doi}"
        headers= {
            'Accept': 'application/vnd.api+json; charset=utf-8'
        }
        return await self.fetch_json(url, headers)
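
For completeness, a minimal standalone usage sketch of this workaround (assuming doi.py is importable as skg.doi); it just drives the async helpers with asyncio.run:

import asyncio

from skg.doi import DOI  # assumed import path for the class above

doi_obj = DOI("10.1016/J.ARTMED.2017.07.002")
# fetch BibTeX via doi.org content negotiation instead of habanero
bibtex = asyncio.run(doi_obj.doi2bibTex())
print(bibtex)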

Great, glad it works for you. Sounds like no changes are needed here.

I still can't use habanero - the above is only a workaround.

Okay, sorry it doesn't work! I closed it because it's been a while and I have no idea how to fix this for you.

The 401 Client Error: Unauthorized for url error doesn't make sense because the API does not require authentication. The mailto header is just to get into the "faster lane", where requests should be more reliable/faster.
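
To illustrate that no credentials are involved (a sketch, not from the thread): a plain unauthenticated requests call works, and the mailto can alternatively be sent as a query parameter to reach that faster lane; the address below is a placeholder:

import requests

doi = "10.1016/J.ARTMED.2017.07.002"
# no API key or credentials at all; mailto only selects the "polite" pool
response = requests.get(
    f"https://api.crossref.org/works/{doi}",
    params={"mailto": "you@example.org"},  # hypothetical address
)
print(response.status_code)  # 200 expected when the API is reachable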

The only thing I can think of is that perhaps your IP address got on their block list. Perhaps you were hitting the API pretty hard at some point? I don't know if they do that kind of thing or not.