Handle "clusters" on paper extraction
bzz opened this issue · comments
When extracting publications (papers) from emails, a class of papers whose links in the email look like
https://scholar.google.com/scholar?cluster=14905208172666766997&hl=en&oi=scholaralrt&hist=KBiQzPUAAAAJ:3103465405719670724:AAGBfm3tO_7Uk2dTXZseJcyJq0Kjaug97Q&html=&folt=rel
are skipped (14 papers out of 2k+), because at the moment we use a regex to extract the PDF URL from such links and it fails to match.
Instead of the usual /scholar_url?url=<url-to-the.pdf>
pattern, these links look like /scholar?cluster=14905208172666766997&...
and there is no obvious way to get the URL of an individual PDF (any one from the cluster).
One option is to keep those links as-is, so the user will have to choose the PDF from the Scholar page themselves.
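
A rough sketch of that option, in case it helps the discussion. It is not the current extraction code: the function name `extract_paper_link` and the `"pdf"` / `"cluster"` labels are made up for illustration, and it parses the query string with `urllib.parse` instead of a regex. It assumes the two link shapes described above are the only ones we care about.

```python
from typing import Optional, Tuple
from urllib.parse import parse_qs, urlparse


def extract_paper_link(link: str) -> Optional[Tuple[str, str]]:
    """Classify a Scholar alert link (hypothetical helper).

    Returns ("pdf", <direct URL>) for /scholar_url?url=... links,
    ("cluster", <original link>) for /scholar?cluster=... links,
    and None for anything else.
    """
    parsed = urlparse(link)
    query = parse_qs(parsed.query)  # values are percent-decoded lists

    # Usual case: the target document URL is carried in the "url" parameter.
    if parsed.path == "/scholar_url" and "url" in query:
        return ("pdf", query["url"][0])

    # Cluster case: no direct PDF URL is available, so keep the link as-is
    # and let the user pick a PDF on the Scholar page.
    if parsed.path == "/scholar" and "cluster" in query:
        return ("cluster", link)

    return None
```

With something like this, cluster links would no longer be dropped silently; the caller could store them with a different label and render them as "choose a PDF on Scholar" entries.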