ContentMine / getpapers

Get metadata, fulltexts or fulltext URLs of papers matching a search query

User-side feedback

dmitriz opened this issue · comments

Opening this issue to document feedback and recommendations from a user's perspective.

It is 2020 and we still talk about papers. 😄

  • Nice and easy installation with npm (one-liner below).
  • Many deprecated-package warnings are reported during install; they might need fixing in the future, but probably not as an immediate priority.
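
For reference, installation is a one-liner, as documented in the README:

$ npm install --global getpapers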

Minimal usage

$ getpapers -q covid
info: Searching using eupmc API
error: No output directory given. You must provide the --outdir argument.
  • Could a default output directory be used, to save some typing for users in a hurry?
  • Even nicer: use the search string as the directory name by default. That would make it really easy to use (hypothetical session sketched below).
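
A hypothetical transcript of that suggestion (this behaviour does not exist today; the derived directory name is my invention):

$ getpapers -q covid
info: Searching using eupmc API
info: No --outdir given, defaulting to ./covid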

Next simplest choice:

$ getpapers -q covid -o covid
info: Searching using eupmc API
info: Found 37494 open access results
warn: This version of getpapers wasn't built with this version of the EuPMC api in mind
warn: getpapers EuPMCVersion: 5.3.2 vs. 6.2 reported by api
Retrieving results [==----------------------------] 8% (eta 232.8s)^C
  • Does it really download all 37,494 results by default? That could be unexpected (a workaround with the limit flag is sketched below).
  • I terminated the download, and the directory was left empty. An alternative could be to keep the results fetched so far.
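
If I read the README correctly, the existing -k/--limit flag caps the number of hits and downloads, which works around the runaway default:

$ getpapers -q covid -o covid -k 100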

Smaller searches work nicely, apart from the warnings that are a bit confusing.

$ getpapers -q "covid tracing" -o tracing
info: Searching using eupmc API
info: Found 1155 open access results
warn: This version of getpapers wasn't built with this version of the EuPMC api in mind
warn: getpapers EuPMCVersion: 5.3.2 vs. 6.2 reported by api
Retrieving results [==============================] 100% (eta 0.0s)
info: Done collecting results
info: Saving result metadata
info: Full EUPMC result metadata written to eupmc_results.json
info: Individual EUPMC result metadata records written
info: Extracting fulltext HTML URL list (may not be available for all articles)
info: Fulltext HTML URL list written to eupmc_fulltext_html_urls.txt

Now refining:

getpapers -q "covid tracing korea" -o tracing
info: Searching using eupmc API
info: Found 254 open access results
warn: This version of getpapers wasn't built with this version of the EuPMC api in mind
warn: getpapers EuPMCVersion: 5.3.2 vs. 6.2 reported by api
Retrieving results [==============================] 100% (eta 0.0s)
info: Done collecting results
info: Saving result metadata
info: Full EUPMC result metadata written to eupmc_results.json
info: Individual EUPMC result metadata records written
info: Extracting fulltext HTML URL list (may not be available for all articles)
info: Fulltext HTML URL list written to eupmc_fulltext_html_urls.txt

And again:

getpapers -q "covid tracing korea taiwan vietnam" -o tracing
info: Searching using eupmc API
info: Found 26 open access results
warn: This version of getpapers wasn't built with this version of the EuPMC api in mind
warn: getpapers EuPMCVersion: 5.3.2 vs. 6.2 reported by api
Retrieving results [==============================] 100% (eta 0.0s)
info: Done collecting results
info: Saving result metadata
info: Full EUPMC result metadata written to eupmc_results.json
info: Individual EUPMC result metadata records written
info: Extracting fulltext HTML URL list (may not be available for all articles)
info: Fulltext HTML URL list written to eupmc_fulltext_html_urls.txt
  • It looks like results from successive runs into the same directory are merged rather than lost, which is nice.
  • But each created subdirectory contains only one JSON file. Would a flat list of files be easier to navigate than directories? (Layout sketched below.)
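
For context, the resulting layout looks roughly like this (the PMC identifiers are placeholders; each per-article directory holds, I believe, a single eupmc_result.json):

$ ls tracing
eupmc_results.json  eupmc_fulltext_html_urls.txt  PMC1234567  PMC2345678  ...
$ ls tracing/PMC1234567
eupmc_result.json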

Thanks. Valuable.

I am NOT the author of getpapers. Rick Smith-Unna is, and we should try to get his views. Here are mine; I think they should be refiled as separate issues.

  1. Default directory.
  • pros: it's simple.
  • cons: some queries are a page long; we would either have to truncate or hash them.

  2. Infinite download.
    Yes, this is a major problem. There needs to be an inbuilt limit.

  3. Cached download.
    The JSON is (I think) ordered by scientific priority. I don't know whether the download order follows this.

  4. Overwriting and merging.
    This is an important issue. It's nice that you can download on top of an existing dir/CProject, but there may be implicit context that is lost. It would probably be useful to have a switch such as --overwrite (sketched below).
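
A sketch of how that proposed switch might be invoked (--overwrite does not exist in getpapers today; it is purely a suggestion):

$ getpapers -q "covid tracing korea" -o tracing --overwrite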

I am having to deal with some of this in ami download (https://github.com/petermr/ami3).

Thanks. Valuable.

Thank you for your appreciation. :)

I am NOT the author of getpapers. Rick Smith-Unna is, and we should try to get his views.

Judging by the lack of responses to previous issues and the last code activity back in 2016, this could be off his radar for quite a while.

Default directory.
pros: it's simple.
cons: some queries are a page long; we would either have to truncate or hash them.

What about using the search string?

Infinite download.
Yes, this is a major problem. There needs to be an inbuilt limit.

A limit of 100 results seems like a common default with many APIs.
Also, an ordering is needed; maybe the 100 most recent ones? (See the sketch below for checking the hit count first.)
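
If I read the README correctly, the -n/--noexecute flag reports how many results match without downloading anything, which helps pick a sensible limit first (I have not checked whether --outdir is still required with it):

$ getpapers -q covid -n
$ getpapers -q covid -o covid -k 100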

Cached download.
The JSON is (I think) ordered by scientific priority. I don't know whether the download order follows this.

By scientific priority, you mean the first mention? I didn't know the APIs could do such things. :)

Overwriting and merging.
This is an important issue. It's nice that you can download on top of an existing dir/CProject, but there may be implicit context that is lost. It would probably be useful to have a switch such as --overwrite.

Agreed. The most user-friendly way is probably to print an overwrite warning with options to select: yes, no, or yes-to-all to skip the remaining warnings. Something like the hypothetical prompt below.
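
A hypothetical mock-up of such a prompt (nothing like this exists in getpapers today; wording and options are my invention):

$ getpapers -q "covid tracing" -o tracing
warn: Output directory tracing already exists
Overwrite existing results? [y]es / [n]o / [a]ll: a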

I am having to deal with some of this in ami download (https://github.com/petermr/ami3).

Do you still need getpapers then?

Yes, we still need it. There are tutorials out there.