Corrupt PDFs from EuropePMC
petermr opened this issue · comments
I am running getpapers and every PDF download is corrupt.
localhost:workspace pm286$ node --version
v6.2.1
localhost:workspace pm286$ getpapers --version
0.4.12
localhost:workspace pm286$ getpapers -p -x -q "systematic review" -k 200 -o systematic
None of the PDFs open in Adobe Reader or PDFDebugger (PDFBox). The document structure appears to be correct (pages are identified), but the pages are visually blank and contain virtually no content.
I have uninstalled and then reinstalled getpapers.
Running on Mac OS X 10.9.5.
I have repeated this with arXiv (--api arxiv -p). This gives PDFs of uniformly 2 KB. That is clearly wrong, but I don't know what the correct result should be.
Confirmed on node 8.5.0 using Okular as the PDF reader. However, it seems to work fine for me with getpapers 0.4.14.
The arXiv problem gives a 403 response concerning the user agent, confirmed on 0.4.14 as well.
Edit: Confirmed with ModHeader that the User-Agent header is the problem. See https://arxiv.org/denied.html and https://arxiv.org/help/robots. Using User-Agent: getpapers/TDM worked once, but after changing the header to that value in config.js, it broke too (some sort of blacklist, I guess). The error page changed as well, from
Sadly, your client "getpapers/(TDM Crawler contact@contentmine.org)" violates the automated access guidelines posted at arxiv.org, and is consequently excluded.
to
Sadly, you do not currently appear to have permission to access
https://arxiv.org/pdf/0710.0054v1.pdf
so I might have only blacklisted myself.
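For anyone experimenting with header values locally, here is a minimal sketch of constructing a request with a custom User-Agent so different values can be compared against arXiv's response. This is purely illustrative Python (getpapers itself is Node.js), and the `getpapers/TDM` string is just the value from the comment above, not a known-good one:

```python
import urllib.request

# Candidate User-Agent; arXiv appears to block the default
# "getpapers/(TDM Crawler contact@contentmine.org)" value.
UA = "getpapers/TDM"

# URL taken from the error message quoted above.
req = urllib.request.Request(
    "https://arxiv.org/pdf/0710.0054v1.pdf",
    headers={"User-Agent": UA},
)

# Inspect the header before sending (urllib normalizes the key's case).
print(req.get_header("User-agent"))
```

Sending the request (e.g. with `urllib.request.urlopen(req)`) would then show whether that particular User-Agent gets a PDF or the 403 page.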
I had a similar issue with PDFs from an arXiv search (I'm running the process with getpapers 0.4.17 on a Windows 10.0.19041 machine).
I used the following query:
getpapers --api "arxiv" --query "abs:wikidata" --pdf --outdir wikidataarticles --limit 10
When I opened the downloaded PDF file in Notepad++ I got this:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head><title>403 Forbidden</title></head>
<body>
<h1>Access Denied</h1>
<p>Sadly, your client "<b>getpapers/(TDM Crawler contact@contentmine.org)</b>" violates
the automated access guidelines posted at arxiv.org,
and is consequently excluded.</p>
Hope this information helps to solve the issue :)
arXiv doesn't like crawlers. I think it was OK 5 years ago. They banned the crawler based on its contact email. I am not sure how they would want the content to be mined; they have a raw file download dump, I think.
The files aren't corrupt: they have been replaced by HTML error pages.
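A quick way to confirm this is to check the magic bytes: a real PDF begins with `%PDF-`, while the blocked downloads begin with an HTML doctype. A minimal Python sketch (the file name is hypothetical; here we fake a blocked download for illustration):

```python
def looks_like_pdf(path):
    """Return True if the file starts with the PDF magic bytes."""
    with open(path, "rb") as f:
        return f.read(5) == b"%PDF-"

# Simulate a "PDF" that is actually the arXiv 403 error page.
with open("blocked.pdf", "wb") as f:
    f.write(b"<!DOCTYPE HTML PUBLIC ...403 Forbidden...")

print(looks_like_pdf("blocked.pdf"))  # False: the file is HTML, not a PDF
```

Running this check over an output directory would immediately separate genuine PDFs from the HTML error pages arXiv substitutes for them.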
We (@ayushgarg) are writing pygetpapers and will need to think about this. If there is enough demand we will go back to arXiv and work out an agreed solution.