Need your help
90000988 opened this issue · comments
Could you please let me know how you downloaded the data from PubMed? And why is this URL not working: https://www.ncbi.nlm.nih.gov/pmc/utils/oa/oa.fcgi?tool=roco-fetch&email=johannes.rueckert@fh-dortmund.de&id=.
Thank you.
Hello,
The URL is working; it just needs a PMCID, e.g.
https://www.ncbi.nlm.nih.gov/pmc/utils/oa/oa.fcgi?tool=roco-fetch&email=your.email@example.com&id=PMC4608653
It returns an XML response which includes the FTP link to the article, which we then use to download it.
If you are writing your own script to download articles from PMC, please make sure to use your own e-mail address in the link instead of mine.
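As a rough sketch of that lookup step: the snippet below builds the query URL and pulls the download links out of a response. It assumes the XML contains `<link>` elements carrying `format` and `href` attributes; `build_oa_url`, `parse_links`, the tool name `my-tool`, and `SAMPLE_RESPONSE` (trimmed by hand, not a real response) are illustrative names, not part of our fetch script.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OA_ENDPOINT = "https://www.ncbi.nlm.nih.gov/pmc/utils/oa/oa.fcgi"


def build_oa_url(pmcid, tool, email):
    """Build the OA Web Service query URL for a single PMCID."""
    query = urlencode({"tool": tool, "email": email, "id": pmcid})
    return f"{OA_ENDPOINT}?{query}"


def parse_links(xml_text):
    """Return {format: href} for every <link> element in an OA response.

    Assumption: links are <link format="..." href="..."/> elements.
    """
    root = ET.fromstring(xml_text)
    return {link.get("format"): link.get("href") for link in root.iter("link")}


# Hand-trimmed, illustrative response; a real one carries more fields.
SAMPLE_RESPONSE = """<OA>
  <records>
    <record id="PMC1660580">
      <link format="tgz" href="ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_package/96/5a/PMC1660580.tar.gz"/>
    </record>
  </records>
</OA>"""

if __name__ == "__main__":
    # Use your own tool name and e-mail address here.
    print(build_oa_url("PMC4608653", "my-tool", "your.email@example.com"))
    print(parse_links(SAMPLE_RESPONSE)["tgz"])
```

To run it against the live service you would fetch the built URL (for example with `urllib.request.urlopen`) and pass the decoded body to `parse_links`.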
For more information, please refer to the documentation:
- Download through AWS
- Download through FTP
- OA Web Service API to discover resources (e.g., new articles added since date X)
I'm not sure what exactly you are trying to do. If you just need the dataset, you can check out ROCOv2.
The only things you can get directly from PubMed are the articles, images, and image captions.
- first you download the archives from FTP or AWS
- then you extract the images
- then you extract the captions for the images from the NXML file
- then you need to classify the images to keep only non-compound radiological images
- then you need to clean the captions
- then you need to extract CUIs from the captions using something like MedCAT
- then you need to filter the CUIs, otherwise you get lots of nonsense
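For the caption step, a minimal sketch might look like the following. It assumes each figure in the NXML file is a `<fig>` element with a `<caption>` child and a `<graphic>` whose `xlink:href` names the image; `extract_captions` and `SAMPLE_NXML` are made-up names and data for illustration, not our actual pipeline. Real NXML files may also need extra handling, for example for undefined entities.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes namespaced attributes in {namespace}name form.
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"


def extract_captions(nxml_text):
    """Map each figure's graphic reference to its caption text."""
    root = ET.fromstring(nxml_text)
    captions = {}
    for fig in root.iter("fig"):
        graphic = fig.find(".//graphic")
        caption = fig.find("caption")
        if graphic is None or caption is None:
            continue  # skip figures without an image or a caption
        # itertext() flattens inline markup such as <italic>;
        # split/join collapses runs of whitespace.
        text = " ".join("".join(caption.itertext()).split())
        captions[graphic.get(XLINK_HREF)] = text
    return captions


# Made-up document fragment, for illustration only.
SAMPLE_NXML = """<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <body>
    <fig id="F1">
      <label>Figure 1</label>
      <caption><p>Axial CT of the <italic>head</italic>.</p></caption>
      <graphic xlink:href="example-fig-1"/>
    </fig>
  </body>
</article>"""

if __name__ == "__main__":
    for href, text in extract_captions(SAMPLE_NXML).items():
        print(href, "->", text)
```

The `href` usually lacks a file extension, so you would still have to match it against the image files in the archive.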
Good luck.
I guess you are downloading the wrong archives.
$ wget -r ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_package/96/5a/PMC1660580.tar.gz
[…]
2024-03-27 11:42:12 (1,20 MB/s) - ‘ftp.ncbi.nlm.nih.gov/pub/pmc/oa_package/96/5a/PMC1660580.tar.gz’ saved [701386]
$ tar xvf ftp.ncbi.nlm.nih.gov/pub/pmc/oa_package/96/5a/PMC1660580.tar.gz
PMC1660580/
PMC1660580/1746-160X-2-40-6.jpg
PMC1660580/1746-160X-2-40-2.gif
PMC1660580/1746-160X-2-40-3.jpg
PMC1660580/1746-160X-2-40-6.gif
PMC1660580/1746-160X-2-40-8.jpg
PMC1660580/1746-160X-2-40-7.jpg
PMC1660580/1746-160X-2-40-8.gif
PMC1660580/1746-160X-2-40-4.gif
PMC1660580/1746-160X-2-40-4.jpg
PMC1660580/1746-160X-2-40-1.jpg
PMC1660580/1746-160X-2-40-3.gif
PMC1660580/1746-160X-2-40-5.gif
PMC1660580/1746-160X-2-40-5.jpg
PMC1660580/1746-160X-2-40-2.jpg
PMC1660580/1746-160X-2-40.pdf
PMC1660580/1746-160X-2-40-7.gif
PMC1660580/1746-160X-2-40-1.gif
PMC1660580/1746-160X-2-40.nxml
All images and the NXML file are there.
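If you would rather check an archive from a script than by eye, something like this could work. It is only a sketch: `classify_members`, `list_archive`, and the set of image suffixes are assumptions for illustration, and it uses Python's standard `tarfile` module rather than anything from our code.

```python
import tarfile
from pathlib import PurePosixPath

# Assumed set of image extensions; extend it if your archives use others.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".gif"}


def classify_members(names):
    """Split archive member names into images, NXML files, and the rest."""
    groups = {"images": [], "nxml": [], "other": []}
    for name in names:
        suffix = PurePosixPath(name).suffix.lower()
        if suffix in IMAGE_SUFFIXES:
            groups["images"].append(name)
        elif suffix == ".nxml":
            groups["nxml"].append(name)
        else:
            groups["other"].append(name)
    return groups


def list_archive(path):
    """Return the names of the regular files inside a .tar.gz package."""
    with tarfile.open(path, "r:gz") as archive:
        return [m.name for m in archive.getmembers() if m.isfile()]


if __name__ == "__main__":
    names = list_archive(
        "ftp.ncbi.nlm.nih.gov/pub/pmc/oa_package/96/5a/PMC1660580.tar.gz"
    )
    groups = classify_members(names)
    print(len(groups["images"]), "images,", len(groups["nxml"]), "NXML file(s)")
```

An archive with an empty `images` or `nxml` group is probably not the package you want.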