bzz / scholar-alert-digest

Aggregate unread emails from Google Scholar alerts

Handle "clusters" on paper extraction

bzz opened this issue

When extracting publications (papers) from emails, a class of papers whose links in the email look like

  • https://scholar.google.com/scholar?cluster=14905208172666766997&hl=en&oi=scholaralrt&hist=KBiQzPUAAAAJ:3103465405719670724:AAGBfm3tO_7Uk2dTXZseJcyJq0Kjaug97Q&html=&folt=rel

is skipped (14 papers out of 2k+), because at the moment we use a regex to extract the PDF URL from such links and it fails to match.
Instead of the usual /scholar_url?url=<url-to-the.pdf> pattern, these links look like /scholar?cluster=14905208172666766997&... and a way to get the URL of an individual PDF (any from the cluster) is not obvious.

One option is to keep those links as-is, so the user will have to choose the PDF from the Scholar page themselves.
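
A rough sketch of that option, in Python for illustration only (it is not the project's actual extraction code). The function name `extract_paper_link` and the "pdf" / "cluster" / "unknown" labels are made up for this example; the only things taken from this issue are the two link shapes, /scholar_url?url=... and /scholar?cluster=.... It parses the query string instead of using a regex, returns the target of the `url` parameter when there is one, and otherwise keeps a cluster link unchanged:

```python
from typing import Tuple
from urllib.parse import parse_qs, urlparse


def extract_paper_link(href: str) -> Tuple[str, str]:
    """Classify a Scholar alert link and return (kind, url).

    Hypothetical helper: the name and the kind labels are assumptions
    made for this sketch, not part of the project.
    """
    parsed = urlparse(href)
    query = parse_qs(parsed.query)  # percent-decodes values, drops empty ones

    # Usual pattern: /scholar_url?url=<url-to-the.pdf>&...
    if parsed.path == "/scholar_url" and "url" in query:
        return "pdf", query["url"][0]

    # Cluster pattern: /scholar?cluster=<id>&...
    # There is no direct PDF URL here, so keep the link as-is.
    if parsed.path == "/scholar" and "cluster" in query:
        return "cluster", href

    # Anything else is passed through untouched for the caller to handle.
    return "unknown", href
```

With something like this, a cluster link would still end up in the digest (pointing at the Scholar page) instead of being dropped silently.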