gnames / gnfinder

GNfinder finds scientific names in UTF8 texts, PDF files, MS Word/Excel documents, URLs etc.

best way to use gnfinder to find names in tabulated data and get results tabulated as in origin

abubelinha opened this issue

Hello

I am planning to use gnfinder to process a column from a table with about 2500 rows.

  • In the first column I have the museum specimen ID.
  • In the second column I have the whole unprocessed old specimen label, which contains one or several species names, the locality, the collector and maybe some comments.
ID      LABEL
1       Blah blah blah Scientificname_A blah blah Scientificname_B blah blah
2       Scientificname_C bleh blah blah Scientificname_A
...     ...
2500    Blah blah blih blah Scientificname_X bluh blah blah Scientificname_F blah blah

So what I actually need to pass to gnfinder is each cell of the second column, to extract names from it and return matches against some preferred name sources. But of course, I need to keep the returned info associated with each specimen ID (the first column in my table).

  • Is it possible to somehow pass the whole second column to gnfinder in just one call, so that gnfinder returns an array of 2500 responses, one for each row of my original table?
  • Or do I need to make 2500 separate gnfinder calls?

I was planning to use the API but I suppose I could try to use the CLI if it is more suitable to this purpose.

Thanks a lot

EDIT: not sure if this is related to #56, but I am not using R dataframes. I am just processing a CSV file in Python.

Hi @abubelinha, one way you can do it locally is to set up a pipe in Python to talk to the command-line gnfinder on your computer. It would be similar to https://github.com/gnames/gnparser#pipes

2500 separate calls to the API also do not sound too strenuous for the service.
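
The local approach mentioned above could be sketched roughly as follows: one gnfinder process per label, with the text sent on standard input and JSON read back. This is only a sketch under assumptions: it assumes a `gnfinder` executable is on the PATH, that it reads text from stdin, and that `-f compact` selects single-line JSON output. Check `gnfinder -h` for the flags of the installed version. The helper name `run_gnfinder` is made up for illustration.

```python
import json
import subprocess


def run_gnfinder(text, cmd=("gnfinder", "-f", "compact")):
    """Send `text` to a command on stdin and parse its stdout as JSON.

    The default command line is an assumption about the gnfinder CLI;
    pass a different `cmd` if your version uses other flags.
    """
    proc = subprocess.run(
        list(cmd),
        input=text,
        capture_output=True,
        text=True,
        encoding="utf-8",
        check=True,  # raise CalledProcessError if the command fails
    )
    return json.loads(proc.stdout)
```

Calling this once per row means one process start per label, which is the cost discussed in the next comment.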

Thanks @dimus
But I guess that even with pipes, this would imply 2500 local gnfinder pipe calls, wouldn't it? (Which again means 2500 online requests when verification is turned on, correct?)
I would prefer to use one call, just in case I end up using this technique for something much bigger in the future.

Anyway, I had not realized that gnfinder returns the start/end position of each name found in the long text string. That could be very useful for my use case.
Perhaps I could add a couple of calculated columns to my table, label_length and cummulative_labels_length, then concatenate all the label cells and pass them to gnfinder as a single long string. I should then be able to assign each found name to the correct row by comparing its returned start and end values against those two columns.
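
The bookkeeping described above might look like the sketch below. It joins the labels into one string, records the offset at which each label starts (the running total of label lengths plus separators), and uses a binary search to map a returned start position back to a specimen ID. The function names and the `" | "` separator are illustrative choices, not part of gnfinder. It also assumes that gnfinder's start/end values count characters rather than bytes, which is worth verifying on a label with non-ASCII text.

```python
from bisect import bisect_right

SEP = " | "  # assumed separator; it contains whitespace on both sides


def concatenate_labels(rows, sep=SEP):
    """Join labels into one string and record where each label starts.

    rows: iterable of (specimen_id, label) pairs.
    Returns (text, ids, starts), where starts[i] is the character
    offset of label i inside text.
    """
    ids, starts, parts = [], [], []
    pos = 0
    for specimen_id, label in rows:
        ids.append(specimen_id)
        starts.append(pos)
        parts.append(label)
        pos += len(label) + len(sep)
    return sep.join(parts), ids, starts


def id_for_offset(ids, starts, offset):
    """Return the specimen ID of the label that contains `offset`."""
    if offset < 0:
        raise ValueError("offset must be non-negative")
    # Index of the last label whose start is <= offset.
    return ids[bisect_right(starts, offset) - 1]
```

With this in place, each name in the result can be assigned to a row by calling `id_for_offset(ids, starts, name_start)`, where `name_start` is the start position reported for that name.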

If you do not mind using the start/end positions, it should all work in one go. However, take #38 into account: if your file is tab-separated, everything will work; if it is comma-separated, you will probably need to preprocess the file and add a space after each comma.
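
The comma preprocessing suggested here could be as simple as the following sketch. The function name is hypothetical, and the approach is deliberately naive: it treats every comma the same way and does not understand quoted CSV fields. Because it changes the length of the text, it should be applied before any offsets are computed.

```python
import re


def add_space_after_commas(text):
    """Insert a space after every comma that is directly followed by a
    non-whitespace character, leaving already-spaced commas untouched."""
    return re.sub(r",(?=\S)", ", ", text)
```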

Good point!
As I am generating the original CSV I can control its format and make it tab-separated.
Anyway, what I am passing to gnfinder is only one column (see the LABEL column in the table above), with all rows concatenated like this (so column separators are not involved here):

"Blah blah blah Scientificname_A blah blah Scientificname_B blah blah|Scientificname_C bleh blah blah Scientificname_A|Blah blah blih blah Scientificname_X bluh blah blah Scientificname_F blah blah"

I use | symbols here to show you the boundaries between the original column cells (from top to bottom). But if I simply concatenate them, those symbols are not present in the text passed to gnfinder. Or would it be better to include them? Which character would you use (if any) to separate the content of contiguous cells before feeding it to gnfinder?

I am trying to figure out what will happen if a taxon name is right at the end or the beginning of a cell (if no separator is added, the two adjacent names will run together).

Perhaps a space before and after separator would be better? (so 3 characters instead of just one)

Originally gnfinder was made to detect names in BHL, so it treats whitespace of any kind as the separator between words. The | characters should not affect anything, as long as there is a space after them.

Several spaces are OK.

CSV and TSV files should work fine, because they are going to be normalized to a plain text with spaces.
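
The boundary concern raised earlier in the thread can be illustrated with a small sketch. It uses Python's `str.split()` as a rough stand-in for whitespace-based word splitting. This is only an approximation for illustration, not gnfinder's actual tokenizer, and the sample cells are made up.

```python
cells = ["blah Scientificname_A", "Scientificname_C bleh"]


def whitespace_tokens(text):
    """Split text on runs of whitespace (approximation, not gnfinder's code)."""
    return text.split()


# Without a separator, the last word of one cell fuses with the
# first word of the next cell.
fused = whitespace_tokens("".join(cells))

# With " | " between cells, every word stays separate and the bar
# becomes a token of its own.
separated = whitespace_tokens(" | ".join(cells))
```

Here `fused` contains the single token `Scientificname_AScientificname_C`, while `separated` keeps the two names apart.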