ContentMine / canary

Canary is a UI to the contentmine tools getpapers, quickscrape, norma, and ami.

Find a way to get a list of IUCN species

markmacgillivray opened this issue · comments

Where is this list? @blahah probably knows.

If there is an API we can query each day for species, that would be good. If not, we need a way to get a dump of the list and keep it up to date, or a way to scrape it off a web page somewhere. Whichever approach we take, a Python script that canary can call as an exec would be good.

Good question. I've asked in Slack.

I have added a full export CSV and XML to the slack #development channel

From the IUCN search results, there should be 89,586 results. The export data has the following columns.

Species ID
Kingdom
Phylum
Class
Order
Family
Genus
Species
Authority
Infraspecific rank
Infraspecific name
Infraspecific authority
Stock/subpopulation
Synonyms
Common names (Eng)
Common names (Fre)
Common names (Spa)
Red List status
Red List criteria
Red List criteria version
Year assessed
Population trend
Petitioned
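As a quick sanity check on an export with these columns, the rows can be tallied by status. This is a minimal sketch, assuming the CSV header uses the column names exactly as listed above; the function name `count_by_status` and the three sample rows are made up for illustration and are not taken from the real export.

```python
import csv
import io
from collections import Counter


def count_by_status(csv_file):
    """Count rows per value of the 'Red List status' column."""
    reader = csv.DictReader(csv_file)
    return Counter(row["Red List status"] for row in reader)


# Made-up sample using only a subset of the export's columns.
sample = io.StringIO(
    "Species ID,Genus,Species,Red List status\n"
    "1,Loxodonta,africana,VU\n"
    "2,Passer,domesticus,LC\n"
    "3,Acheta,domesticus,LC\n"
)
counts = count_by_status(sample)
print(counts)
```

For a real file, `count_by_status(open("export.csv", newline="", encoding="utf-8"))` would be the equivalent call, where `export.csv` is a placeholder name. Comparing the total against the 89,586 search results would show whether a download was truncated.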

The red list status page lists the number of IUCN species as 6,260

The red list categories are (link to IUCN document)

| Category | Code |
| --- | --- |
| Not Evaluated | NE |
| Data Deficient | DD |
| Least Concern | LC |
| Lower Risk | LR |
| Near Threatened | NT |
| Vulnerable | VU |
| Endangered | EN |
| Critically Endangered | CR |
| Extinct in the Wild | EW |
| Extinct | EX |
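If the indexing script needs to expand the codes, the categories listed here could be held in a small lookup. This is only a sketch: `RED_LIST_CATEGORIES` and `category_name` are hypothetical names, and the lookup assumes the export uses exactly these codes.

```python
# Category codes and names as listed in this comment.
RED_LIST_CATEGORIES = {
    "NE": "Not Evaluated",
    "DD": "Data Deficient",
    "LC": "Least Concern",
    "LR": "Lower Risk",
    "NT": "Near Threatened",
    "VU": "Vulnerable",
    "EN": "Endangered",
    "CR": "Critically Endangered",
    "EW": "Extinct in the Wild",
    "EX": "Extinct",
}


def category_name(code):
    """Return the category name for a code, ignoring case and padding.

    Raises KeyError for any code that is not in the list above.
    """
    return RED_LIST_CATEGORIES[code.strip().upper()]


print(category_name("cr"))
```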

There is an API which should make getting the complete list of species easier - http://rlapiv3-beta.iucnredlist.org/api/v3/docs.

At present, the API seems to be down (502 Bad Gateway errors).

How best should this data be made available to Canary?

If we have a way of retrieving the data programmatically, that would be nice. If it has to be a manual download, that is fine too; in that case just write up how we get the data and how often we would need to download it to have the latest data.

Whichever way we get the data, ideally it would be loaded into an Elasticsearch index: the code should iterate over every record found in the IUCN data and send each one to an ES index address such as http://localhost:9200/contentmine/iucn. Each record should have a UUID if the IUCN data does not provide a unique ID. The method of updating the indexed data will depend on how we can retrieve data from IUCN: either update specific records by their IUCN ID, or blow away the whole index and rebuild it each time.

There is a pretty useful generic mapping.json file that should be used whenever an index type is created; it is at http://static.cottagelabs.com/mapping.json. However, if on looking at the data you find that a custom mapping would be beneficial, then of course just make one and include it with the code.
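The indexing step described here could look roughly like the sketch below. It assumes an Elasticsearch version that still supports index types (as the `contentmine/iucn` address implies) and uses the `_bulk` endpoint under that address; `build_bulk_body` and `send_bulk` are hypothetical helper names, the sample record is made up, and nothing is sent unless `send_bulk` is called explicitly.

```python
import json
import urllib.request
import uuid

# Address taken from the comment above; "/_bulk" is appended for bulk indexing.
ES_TYPE_URL = "http://localhost:9200/contentmine/iucn"


def build_bulk_body(records):
    """Build a newline-delimited bulk body, giving every record a fresh UUID."""
    lines = []
    for record in records:
        doc = dict(record)           # copy so the caller's dict is untouched
        doc_id = uuid.uuid4().hex
        doc["id"] = doc_id           # our own ID; the IUCN ID stays as a plain field
        lines.append(json.dumps({"index": {"_id": doc_id}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"   # the bulk API expects a trailing newline


def send_bulk(body, url=ES_TYPE_URL + "/_bulk"):
    """POST a bulk body to Elasticsearch and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/x-ndjson"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Made-up record, only to show the shape of the payload.
body = build_bulk_body([{"Species ID": "1", "Genus": "Loxodonta", "Species": "africana"}])
print(body)
```

Because every run generates new UUIDs, this sketch fits the "blow away the whole index and rebuild" option; updating records in place would need a stable key instead.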

For now I plan to retrieve the information using their API. I didn't find any way of getting only the updates through the API. I will index the data in ES and send you a link to have a look at once I have something. We could periodically update the index. They do have identifiers for each of the species, but somewhere on their website I read that they are not persistent and we shouldn't use them as such. So we will create our own UUID.

OK yes, create our own UUID for the records and store their species ID as
informational. RSU commented saying he provided a link to a data dump on
slack - did he get that via the API or does he have another method? The API
method should work fine, it is just annoying it is flaky, but still just
write some catches for their downtime and it would be OK.
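"Catches for their downtime" could be as simple as a retry loop with a growing pause. The sketch below is one possible shape, not the script that was eventually written: `fetch_with_retry` is a hypothetical name, and the `opener` and `sleep` parameters exist only so the function can be exercised without touching the network.

```python
import time
import urllib.request


def fetch_with_retry(url, attempts=5, delay=2.0,
                     opener=urllib.request.urlopen, sleep=time.sleep):
    """Fetch a URL, retrying with exponential backoff when the server is down."""
    last_error = None
    for attempt in range(attempts):
        try:
            with opener(url) as response:
                return response.read()
        except OSError as error:
            # urllib's URLError and HTTPError (e.g. a 502) and socket timeouts
            # are all subclasses of OSError.
            last_error = error
            if attempt < attempts - 1:
                sleep(delay * (2 ** attempt))
    raise RuntimeError(f"gave up on {url} after {attempts} attempts") from last_error
```

A caller would wrap each API request in `fetch_with_retry(...)` and treat the final `RuntimeError` as "the API is down, try again tomorrow".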

RSU downloaded the search results (the IUCN search results linked above, http://www.iucnredlist.org/search/link/5627b7b0-218891a4). He thought he had downloaded the full list, but it's just a small result set (480 rows). I think he also mentioned in Slack that he retried and the downloads kept timing out. That has been my experience too. The API doc does state that this is expected behaviour and says to use the API for this instead. If the API continues to be down, downloading the search results is the route I will be forced to take. It's a manual, laborious process and one I will do my best to avoid.

OK. Probably you could automate against the search results and just trick
it into thinking you are human, if that is necessary :)

I was just looking into the possibility of doing just that.

The site is pretty slow, and I eventually gave up trying to get the full data download to work. Scraping might be the easiest way - or perhaps just emailing them about the issue?

The API site is still down. I have written to them about this. In the meantime I finally managed to download all of the data (8 searches by category) and have saved the CSV files. It was a manual process, but next time (when we need to update the data) it shouldn't be too time-consuming. Scraping their site for information doesn't look easy given their UI, and neither does replicating human actions for the searches. For now I have saved these files in the CottageLabs/ContentMine Google Drive.
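Combining the per-category downloads into a single file could be done along these lines. This is a sketch under the assumption that every download shares the same header row; `merge_csv` is a hypothetical helper and the two in-memory "files" below are made-up stand-ins for the real CSVs.

```python
import csv
import io


def merge_csv(sources, destination):
    """Append the rows of several CSV files, writing the header only once.

    Returns the number of data rows written. A source whose columns are not
    in the first file's header makes csv.DictWriter raise ValueError.
    """
    writer = None
    rows = 0
    for source in sources:
        reader = csv.DictReader(source)
        for row in reader:
            if writer is None:
                writer = csv.DictWriter(destination, fieldnames=reader.fieldnames)
                writer.writeheader()
            writer.writerow(row)
            rows += 1
    return rows


# Made-up stand-ins for two of the per-category downloads.
first = io.StringIO("Species ID,Red List status\n1,VU\n")
second = io.StringIO("Species ID,Red List status\n2,LC\n3,LC\n")
merged = io.StringIO()
total = merge_csv([first, second], merged)
print(total)
```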

Okay, heard back from the IUCN people. The link to the API I had found was to a beta version. The URL I should be using is http://apiv3.iucnredlist.org/api/v3/docs

Calls to make

  1. Get number of species

    http://apiv3.iucnredlist.org/api/v3/speciescount?token=YOUR_TOKEN_ID

  2. Get list of species by page
    _Need the species count to calculate number of pages. The response for page 1 has number of rows_

    http://apiv3.iucnredlist.org/api/v3/species/page/1?token=YOUR_TOKEN_ID
    http://apiv3.iucnredlist.org/api/v3/species/page/2?token=YOUR_TOKEN_ID
    This will return the fields - species id (taxonid), scientific_name, infra_rank, infra_name, population and category

  3. Get information for each of the species by ID

    http://apiv3.iucnredlist.org/api/v3/species/id/species_id?token=YOUR_TOKEN_ID
    This will give us additional information regarding the species.
    The fields returned are - taxonid, scientific_name, kingdom, phylum, class, order, family, genus, main_common_name, authority, published_year, category, criteria, marine_system, freshwater_system, terrestrial_system, assessor, reviewer
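The paging arithmetic in step 2 can be sketched as follows. The page size is not stated in this thread, so it is passed in as `rows_per_page` (read it from the number of rows in the first response); page numbering starting at 1 is an assumption based on the example URLs above, and `page_count` and `page_urls` are hypothetical names.

```python
import math

API_BASE = "http://apiv3.iucnredlist.org/api/v3"


def page_count(species_count, rows_per_page):
    """Number of pages needed to cover species_count rows."""
    if rows_per_page <= 0:
        raise ValueError("rows_per_page must be positive")
    return math.ceil(species_count / rows_per_page)


def page_urls(species_count, rows_per_page, token, first_page=1):
    """Build the species-by-page URLs; first_page=1 is an assumption."""
    pages = page_count(species_count, rows_per_page)
    return [
        f"{API_BASE}/species/page/{number}?token={token}"
        for number in range(first_page, first_page + pages)
    ]


# Small made-up numbers, just to show the shape of the result.
urls = page_urls(species_count=25, rows_per_page=10, token="YOUR_TOKEN_ID")
for url in urls:
    print(url)
```

Each `taxonid` returned by those pages would then be substituted into the `species/id/` URL from step 3.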

NOTE from the API page :
The species ID might change and should not be used as persistent identifier. To find the species ID, use the weblink api call with the species name

http://apiv3.iucnredlist.org/api/v3/weblink/loxodonta%20africana
The only issue I find with this is that, looking at the data, the species name does not seem to be unique. The species concolor is listed 5 times in the CSV data, each time with a different species ID.
Oh, understood: the combination of "genus species_name" is what needs to be used to find the current ID. Maybe this combination can be used as a unique key in Elasticsearch.
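Building the weblink call and a candidate key from the "genus species_name" combination might look like this. It is a sketch: `binomial` and `weblink_url` are hypothetical helpers, and lowercasing the name is an assumption based only on the example URL above.

```python
from urllib.parse import quote

API_BASE = "http://apiv3.iucnredlist.org/api/v3"


def binomial(genus, species):
    """Join genus and species into one lowercase 'genus species' string."""
    return f"{genus.strip()} {species.strip()}".lower()


def weblink_url(genus, species):
    """URL of the weblink call for a species, with the space percent-encoded."""
    return f"{API_BASE}/weblink/{quote(binomial(genus, species))}"


print(weblink_url("Loxodonta", "africana"))
```

The string from `binomial(...)` could also serve as the document key in the index, which would only be as unique as the binomial names themselves.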

Thanks,

When you say the "species" is not unique, are you referring to the two-word
binomial Latin name ( https://en.wikipedia.org/wiki/Binomial_nomenclature )
or just to the second word? It's quite common for the second word, which
identifies the species within the genus, to be found in many names: e.g.
"Passer domesticus" is the House Sparrow and "Acheta domesticus" is the
House Cricket. Ross will probably give more examples. However, there are
cases where binomial names are not unique, where one is a plant and
another an animal, for example.

In either case I wouldn't worry - we can sort it out.

Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069

I used the CSV file that Anusha got, but in the end I just created a simple Python script to upload it rather than using the repo she put together, so that I could easily create a dynamic mapping on the index. Species IDs do appear to be unique, as do species by binomial name. The website demo is now also up and running.

Code to index IUCN redlists data - https://github.com/anusharanganathan/redlist-indexer.

Will index the data in the file in Google drive - CottageLabs/ContentMine/IUCN-Redlist-Data/all.csv

Hi! After being stumped on the IUCN website, I stumbled upon this thread in my search for a complete export of IUCN species. I am using the google doc @anusharanganathan posted. Do you all have a more recent, full export from the IUCN database?