Handle "clusters" on paper extraction

Question

bzz opened this issue 2 years ago

When extracting publications (papers) from emails, papers whose links in the email look like

https://scholar.google.com/scholar?cluster=14905208172666766997&hl=en&oi=scholaralrt&hist=KBiQzPUAAAAJ:3103465405719670724:AAGBfm3tO_7Uk2dTXZseJcyJq0Kjaug97Q&html=&folt=rel

are skipped (14 papers out of 2k+), because we currently use a regex to extract the PDF URL from such links and it fails to match.
Instead of the usual /scholar_url?url=<url-to-the.pdf> pattern, these links look like /scholar?cluster=14905208172666766997&... and there is no obvious way to get the URL of an individual PDF (any from the cluster).

One option is to keep those links as-is, so that the user has to choose the PDF from the Scholar page themselves.
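
A minimal sketch of that option, assuming Python and the standard `urllib.parse` module in place of the current regex. The function name `classify_scholar_link` and the `example.org` URL are hypothetical and only for illustration; the actual extraction code may be organized differently.

```python
from urllib.parse import urlparse, parse_qs


def classify_scholar_link(link):
    """Classify a Google Scholar alert link (hypothetical helper).

    Returns a (kind, value) tuple:
      ("pdf", <url>)      for /scholar_url?url=<url-to-the.pdf> links
      ("cluster", <link>) for /scholar?cluster=... links, kept as-is
      ("unknown", <link>) for anything else
    """
    parsed = urlparse(link)
    query = parse_qs(parsed.query)

    # Usual pattern: the target URL is carried in the `url` parameter.
    if parsed.path == "/scholar_url" and "url" in query:
        return "pdf", query["url"][0]

    # Cluster pattern: no direct PDF URL is available, so keep the link
    # unchanged and let the user pick a PDF on the Scholar page.
    if parsed.path == "/scholar" and "cluster" in query:
        return "cluster", link

    return "unknown", link
```

With this, a cluster link is no longer skipped: `classify_scholar_link("https://scholar.google.com/scholar?cluster=14905208172666766997&hl=en")` returns the `"cluster"` kind together with the original link.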