rdmpage / biorss

Harvest and repurpose RSS feeds

bioRSS

A potential successor to uBioRSS and uBioRSS Nomina Nova. See also my experience with bioGUID a decade or more ago.

Idea

Take RSS feeds from journals and databases, creating them if needed, then index by taxon and geography. Output RSS feeds keyed by taxon and/or geography. Create simple visualisations.

The original goal was to rely on RSS feeds, or to generate my own RSS from various sources. It now seems better to use RSS where it is available, and otherwise to generate schema.org-style JSON and use that directly for other, potentially richer sources.

Approach

  1. RSS feeds from journals regularly polled and added. RSS converted to “internal” format, then augmented by adding DOIs, geography and taxa. Store the status of each feed in feed status.json. Sadly many RSS feeds don’t support conditional GET.

  2. Some sources (e.g., Google Scholar, ZooBank) will be converted directly to “internal” format, then augmented.

  3. Feed item is modelled as a schema.org DataFeedItem with the publication as an item.
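
As a rough illustration of the feed item model, here is a Python sketch (not the repo's PHP code) that wraps a publication in a schema.org DataFeedItem. `dateCreated` and `item` are DataFeedItem properties in schema.org; the function name `make_feed_item`, the choice of ScholarlyArticle, and the way the DOI is stored are assumptions, and the timestamp is a placeholder.

```python
import json


def make_feed_item(title, url, date_created, doi=None):
    """Wrap a publication in a schema.org DataFeedItem (illustrative field choice)."""
    publication = {"@type": "ScholarlyArticle", "name": title, "url": url}
    if doi:
        # Storing the DOI as a PropertyValue is one common convention, assumed here.
        publication["identifier"] = {
            "@type": "PropertyValue",
            "propertyID": "doi",
            "value": doi,
        }
    return {
        "@context": "https://schema.org",
        "@type": "DataFeedItem",
        "dateCreated": date_created,
        "item": publication,
    }


item = make_feed_item(
    title="uBioRSS: Tracking taxonomic literature using RSS",
    url="https://doi.org/10.1093/bioinformatics/btm109",
    date_created="2021-11-15T10:00:00+00:00",  # placeholder harvest timestamp
    doi="10.1093/bioinformatics/btm109",
)
print(json.dumps(item, indent=2))
```

Keeping the publication nested under `item` means the augmentation step can add DOIs, taxa and geography to the publication without touching the feed-level metadata.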

Need to set up the harvesting to be automatic. Would be nice to cache things for reanalysis if needed.

RSS

  1. harvest-feeds.php reads the feed list and caches each feed as an XML file in the folder cache/latest (an alias).
  2. process-rss-feeds.php parses each XML file in cache/latest and for each item adds it to the data store, then augments that, updating the item in the data store.
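
Because many feeds ignore conditional GET, a harvester probably needs a fallback. The Python sketch below (not the repo's PHP code) sends `If-None-Match` / `If-Modified-Since` when a previous ETag or Last-Modified value is known, and otherwise compares a hash of the body with the one stored last time. The `status` dictionary and its keys (`etag`, `last_modified`, `sha256`) are hypothetical stand-ins for whatever the feed status file actually holds.

```python
import hashlib
import urllib.error
import urllib.request


def conditional_headers(status):
    """Build conditional GET headers from what we remembered about this feed."""
    headers = {}
    if status.get("etag"):
        headers["If-None-Match"] = status["etag"]
    if status.get("last_modified"):
        headers["If-Modified-Since"] = status["last_modified"]
    return headers


def body_changed(body, status):
    """Fallback for servers that ignore conditional GET: compare content hashes."""
    digest = hashlib.sha256(body).hexdigest()
    changed = digest != status.get("sha256")
    status["sha256"] = digest
    return changed


def fetch_feed(url, status):
    """Return the feed bytes if they are new, else None. Updates status in place."""
    request = urllib.request.Request(url, headers=conditional_headers(status))
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            body = response.read()
            status["etag"] = response.headers.get("ETag")
            status["last_modified"] = response.headers.get("Last-Modified")
    except urllib.error.HTTPError as error:
        if error.code == 304:  # server honoured the conditional GET
            return None
        raise
    return body if body_changed(body, status) else None
```

The hash check costs a full download each time, but it still avoids reprocessing a feed whose content has not changed.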

Google Scholar

  1. harvest-gs-email.php reads any .eml files in folder cache/latest and converts them to native JSON.
  2. process-internal.php parses each JSON file in cache/latest and for each item adds it to the data store, then augments that, updating the item in the data store.
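
The first step, pulling the HTML body out of a saved .eml file, might look like this in Python (a sketch only, not the repo's PHP code). It uses the standard library `email` package; the function name `html_from_eml` is hypothetical, and the sample message is synthetic rather than a real alert.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def html_from_eml(raw_bytes):
    """Return the HTML body of an email, or None if it has no HTML part."""
    message = message_from_bytes(raw_bytes, policy=policy.default)
    part = message.get_body(preferencelist=("html",))
    return part.get_content() if part is not None else None


# Synthetic multipart message standing in for a saved alert email.
sample = EmailMessage()
sample["Subject"] = "new results"
sample.set_content("plain text fallback")
sample.add_alternative("<html><body><h3>Example</h3></body></html>", subtype="html")

html = html_from_eml(sample.as_bytes())
print(html)
```

In a harvester the bytes would come from each .eml file in cache/latest, for example via `pathlib.Path("cache/latest").glob("*.eml")` and `read_bytes()`.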

DOAJ

  1. harvest-doaj.php fetches journal articles using ISSN as key, then generates native JSON from the BibJSON the DOAJ API returns.
  2. process-internal.php parses each JSON file in cache/latest and for each item adds it to the data store, then augments that, updating the item in the data store.
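
To give a feel for the BibJSON conversion, here is a Python sketch (not the repo's PHP code) that maps one DOAJ-style record to a schema.org ScholarlyArticle. The field names read from the record (`bibjson.title`, `identifier`, `author`, `journal.title`) reflect my understanding of DOAJ article records and should be checked against the DOAJ API documentation; the sample record is hand-written from the reference list below, not actual API output, and the output shape is an assumption about the "native" JSON.

```python
def bibjson_to_article(record):
    """Convert a DOAJ-style article record to a schema.org ScholarlyArticle dict."""
    bib = record.get("bibjson", {})
    article = {"@type": "ScholarlyArticle", "name": bib.get("title")}

    # Keep the DOI if the record carries one among its identifiers.
    for identifier in bib.get("identifier", []):
        if identifier.get("type", "").lower() == "doi":
            article["identifier"] = {
                "@type": "PropertyValue",
                "propertyID": "doi",
                "value": identifier.get("id"),
            }

    authors = [
        {"@type": "Person", "name": author["name"]}
        for author in bib.get("author", [])
        if author.get("name")
    ]
    if authors:
        article["author"] = authors

    journal = bib.get("journal", {})
    if journal.get("title"):
        article["isPartOf"] = {"@type": "Periodical", "name": journal["title"]}
    return article


# Hand-written sample shaped like a DOAJ record (illustrative only).
sample_record = {
    "bibjson": {
        "title": "Aggregating, Tagging and Integrating Biodiversity Research",
        "identifier": [{"type": "doi", "id": "10.1371/journal.pone.0019491"}],
        "author": [{"name": "Mindell, D. P."}, {"name": "Fisher, B. L."}],
        "journal": {"title": "PLoS ONE"},
    }
}
article = bibjson_to_article(sample_record)
print(article)
```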

Feeds

Reading on feeds

Apps that create feeds

RSS.app

Google Scholar Alerts

Google Scholar can send email alerts for a search term, so an obvious approach is to use these alerts as a source. How do we do this? One approach is to use a service such as CloudMailin which can take an email sent to a CloudMailin email address and forward that email as a JSON document to a URL (webhook). We can then parse the contents of the email. For debugging purposes we can use a service such as PostBin to receive these emails, for example https://postb.in/1632815014159-2470838529989. When using PostBin note that you can retrieve the body of the request using a URL like https://postb.in/api/bin/[bin-id]/req/[request-id].

The Google Scholar alert email is in HTML so we need to parse it and extract the information we require. Note that Google Scholar doesn’t include DOIs in the results, so we may have to resolve URLs and go hunting for DOIs. Some links may be PDFs; ideally we can find the corresponding HTML link so that we can parse that.
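
A minimal Python sketch of the HTML parsing step (not the repo's PHP code), using only the standard library. Two things here are assumptions about the alert markup, which Google may change at any time: that result titles are links with the class `gse_alrt_title`, and that each link is a redirect carrying the real address in a `url` query parameter. The sample HTML is synthetic.

```python
from html.parser import HTMLParser
from urllib.parse import parse_qs, urlparse


def unwrap(href):
    """Return the target of a redirect link, or the link itself if not wrapped."""
    query = parse_qs(urlparse(href).query)
    return query["url"][0] if "url" in query else href


class AlertLinkParser(HTMLParser):
    """Collect title and URL for each result link in an alert email."""

    def __init__(self):
        super().__init__()
        self.results = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Assumed class name for result titles in alert emails.
        if tag == "a" and "gse_alrt_title" in (attrs.get("class") or ""):
            self._href = attrs.get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            url = unwrap(self._href)
            self.results.append({
                "name": "".join(self._text).strip(),
                "url": url,
                # Flag PDFs so a later step can look for an HTML landing page.
                "is_pdf": urlparse(url).path.lower().endswith(".pdf"),
            })
            self._href = None


sample_html = (
    '<h3><a class="gse_alrt_title" '
    'href="https://scholar.google.com/scholar_url?url=https://example.org/article.pdf&amp;hl=en">'
    "A new species of <b>Example</b></a></h3>"
)
parser = AlertLinkParser()
parser.feed(sample_html)
print(parser.results)
```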

Pensoft

Lyubo mentions an OAI endpoint; investigate further.

PubMed

PubMed supports the creation of RSS feeds based on user searches, e.g. ("new species") OR ("n. sp.") OR ("sp. nov.") OR ("n. gen.") OR ("gen. nov.") OR ("n. comb.") OR ("comb. nov.")

Wanfang

Scrape using JSON.

ZooBank

ZooBank has RSS but it doesn’t seem to be updated(?). Can also query using year as a search term. JSON data doesn’t have precise time, nor does it have the DOI. The GBIF dataset (https://www.gbif.org/dataset/c8227bb4-4143-443f-8cb2-51f9576aff14, https://doi.org/10.15468/wkr0kn) seems to lag behind ZooBank.

Zootaxa

Zootaxa has RSS feeds, but also has a taxon search feature, e.g., https://www.mapress.com/zt/search/search?query=Coleoptera&authors=&dateFromYear=2021&dateFromMonth=11&dateFromDay=&dateToYear=&dateToMonth=&dateToDay=&subject=&title=&abstract=&indexTerms= which might be used to generate taxon-specific feeds.
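
A taxon-specific search URL of that form could be built like this in Python (a sketch, not the repo's PHP code). The parameter names are copied from the example URL above; the function name `zootaxa_search_url` is hypothetical, and whether the site's results are stable enough to turn into a feed is untested.

```python
from urllib.parse import urlencode


def zootaxa_search_url(taxon, from_year, from_month):
    """Build a Zootaxa search URL for a taxon, starting from a given year and month."""
    params = {
        "query": taxon,
        "authors": "",
        "dateFromYear": from_year,
        "dateFromMonth": from_month,
        "dateFromDay": "",
        "dateToYear": "",
        "dateToMonth": "",
        "dateToDay": "",
        "subject": "",
        "title": "",
        "abstract": "",
        "indexTerms": "",
    }
    return "https://www.mapress.com/zt/search/search?" + urlencode(params)


print(zootaxa_search_url("Coleoptera", 2021, 11))
```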

Gotchas

RSS feeds vary in which tags they include and in how they handle external namespaces. Note also that dates in RSS feeds need not be in English, so we need to translate them before converting to ISO 8601.
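
One way to handle non-English dates, sketched in Python (not the repo's PHP code): drop the weekday, map the month abbreviation to its English form, then let the standard library parse the result. The `MONTHS` table is illustrative and incomplete (a few French, Spanish and German abbreviations), and the sketch assumes the date otherwise follows the usual RSS layout of day, month, year, time and numeric offset.

```python
from email.utils import parsedate_to_datetime

# Illustrative, incomplete mapping of non-English month abbreviations to English.
MONTHS = {
    "janv": "Jan", "févr": "Feb", "mars": "Mar", "avr": "Apr", "mai": "May",
    "juin": "Jun", "juil": "Jul", "août": "Aug", "sept": "Sep", "déc": "Dec",
    "ene": "Jan", "abr": "Apr", "ago": "Aug", "dic": "Dec",
    "mär": "Mar", "okt": "Oct", "dez": "Dec",
}


def to_iso8601(pub_date):
    """Convert an RSS pubDate, possibly with a non-English month, to ISO 8601."""
    # The weekday is redundant, so drop it rather than translate it.
    if "," in pub_date:
        pub_date = pub_date.split(",", 1)[1]
    tokens = pub_date.split()
    if len(tokens) < 2:
        raise ValueError("unrecognised date: " + pub_date)
    # tokens[1] is the month; strip any trailing full stop before the lookup.
    key = tokens[1].rstrip(".").lower()
    tokens[1] = MONTHS.get(key, key)
    return parsedate_to_datetime(" ".join(tokens)).isoformat()


print(to_iso8601("mer., 01 déc. 2021 08:30:00 +0100"))
print(to_iso8601("Mon, 15 Nov 2021 10:00:00 +0100"))
```

A lookup table like this only covers languages you have already met; an unknown month name makes `parsedate_to_datetime` raise an error, which at least surfaces the feed that needs attention.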

Validation

Data: Validation tool
JSON-LD: https://json-ld.org/playground/
OPML: http://validator.opml.org
RSS feed: https://validator.w3.org/feed/
Structured data using schema.org: https://validator.schema.org

Visualisation

Feed is a list in descending time order, taxon facet is a treemap, geography facet is a map.

Search

Experimenting with simple full text search based on Inside Wade (source code on GitHub). It uses a CouchDB view to convert text to a list of terms, then queries that view to return a list of documents sorted by how well they match the query.
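
The idea can be illustrated with an in-memory analogue in Python (this is not the CouchDB view itself, nor the repo's PHP code). Building the index plays the role of the view's map step, emitting one entry per term and document; the query then sums the matches per document. The scoring here is a plain term count, which is an assumption and may differ from what Wade does.

```python
import re
from collections import Counter, defaultdict


def terms(text):
    """Split text into lowercase alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to the documents containing it, with occurrence counts."""
    index = defaultdict(Counter)
    for doc_id, text in docs.items():
        for term in terms(text):
            index[term][doc_id] += 1
    return index


def search(index, query):
    """Return document ids sorted by how many query-term occurrences they contain."""
    scores = Counter()
    for term in set(terms(query)):
        for doc_id, count in index.get(term, {}).items():
            scores[doc_id] += count
    return [doc_id for doc_id, _ in scores.most_common()]


# Made-up documents for illustration.
docs = {
    "a": "New species of Coleoptera from Borneo",
    "b": "Coleoptera checklist",
    "c": "Fern phylogeny",
}
index = build_index(docs)
print(search(index, "new Coleoptera species"))
```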

References

Leary, P. R., Remsen, D. P., Norton, C. N., Patterson, D. J., & Sarkar, I. N. (2007). uBioRSS: Tracking taxonomic literature using RSS. Bioinformatics, 23(11), 1434–1436. doi:10.1093/bioinformatics/btm109

Little, D. P. (2020). Recognition of Latin scientific names using artificial neural networks. Applications in Plant Sciences, 8(7). doi:10.1002/aps3.11378

Mindell, D. P., Fisher, B. L., Roopnarine, P., Eisen, J., Mace, G. M., Page, R. D. M., & Pyle, R. L. (2011). Aggregating, Tagging and Integrating Biodiversity Research. PLoS ONE, 6(8), e19491. doi:10.1371/journal.pone.0019491

Languages

PHP 96.6%, HTML 3.2%, JavaScript 0.2%