AtlasOfLivingAustralia / biocache-store

Occurrence processing, indexing and batch processing

DwC fields not being indexed

nickdos opened this issue

See support ticket https://support.ehelp.edu.au/a/tickets/81984.

User flagged that some DwC fields do not appear in a download file but the fields can be seen on an individual record page.

EDIT: Outstanding tasks moved to #394

See https://biocache-ws.ala.org.au/ws/occurrences/search?q=data_resource_uid%3Adr342&facets=georeferenced_by,georeference_protocol,georeferenced_date,georeference_sources&pageSize=0

Only georeferenced_date shows values and this is also the only column populated for CSV downloads. All the georef* fields are marked as being indexed and stored - https://biocache.ala.org.au/fields?filter=georef*.
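For anyone wanting to repeat this check against another data resource, here is a minimal sketch that rebuilds the facet query above and lists the requested facets that came back without values. The helper names are hypothetical, and the response shape (a `facetResults` list of objects with `fieldName` and `fieldResult`) is an assumption about the biocache-ws JSON, not something confirmed in this thread.

```python
from urllib.parse import urlencode

# Search endpoint taken from the URL quoted in this issue.
BASE = "https://biocache-ws.ala.org.au/ws/occurrences/search"


def facet_query_url(data_resource_uid, fields):
    """Build a facet-only search URL (pageSize=0 returns no records)."""
    params = {
        "q": f"data_resource_uid:{data_resource_uid}",
        "facets": ",".join(fields),
        "pageSize": 0,
    }
    # Keep commas literal so the facet list stays readable.
    return f"{BASE}?{urlencode(params, safe=',')}"


def empty_facets(response, fields):
    """Return the requested facet fields that have no values.

    ASSUMPTION: `response` is the parsed JSON body and looks like
    {"facetResults": [{"fieldName": "...", "fieldResult": [...]}]}.
    """
    populated = {
        facet["fieldName"]
        for facet in response.get("facetResults", [])
        if facet.get("fieldResult")
    }
    return [name for name in fields if name not in populated]
```

Fetching the URL and passing the decoded JSON to `empty_facets` would show at a glance which fields are missing from the index for that resource.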

Investigate why these fields are not being added to the SOLR index.

The raw fields get indexed. https://biocache-ws.ala.org.au/ws/occurrences/search?q=data_resource_uid%3Adr342&facets=raw_georeferenced_by,raw_georeference_protocol,raw_georeferenced_date,raw_georeference_sources&pageSize=0

Looking at the Cassandra table, georeferencedBy_p is not being updated from georeferencedBy, whereas georeferencedDate_p is updated from georeferencedDate.
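A rough way to quantify this across fields, assuming the rows have already been exported from Cassandra as plain dictionaries. The function name is hypothetical; the `_p` suffix for processed columns follows the column names mentioned above.

```python
def unprocessed_fields(rows, fields, suffix="_p"):
    """Count rows where the raw column has a value but the processed one is empty.

    ASSUMPTION: each row is a dict keyed by column name, with the processed
    column named <raw column> + suffix (e.g. georeferencedBy_p).
    """
    missing = {}
    for field in fields:
        count = sum(
            1
            for row in rows
            if (row.get(field) or "").strip()
            and not (row.get(field + suffix) or "").strip()
        )
        if count:
            missing[field] = count
    return missing
```

Fields that appear in the result are candidates for the same processing gap as georeferencedBy.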

@charvolant the user came back and said samplingProtocol is also not showing up. Should I create a new issue or leave it here?

@nickdos wrote: "User flagged that some DwC fields do not appear in a download file but the fields can be seen on an individual record page."

From the 2018 paper (https://doi.org/10.3897/zookeys.751.24791):

"identifiedBy: ...The original identifiedBy_raw data item appears on the ALA webpage as “Identified by” for the record but is missing from the standard (recommended) download."
"locality: ...The original locality_raw data item appears on the ALA webpage as “Locality” for the record but is missing from the standard (recommended) download."

These two were subsequently fixed, but was no automated check put in place to ensure that downloaded fields match the databased fields, or at least that a field with a value in the database is not empty in the download? Was it left to users to spot instead?
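As an illustration of the kind of check meant here, a minimal sketch that compares a CSV download against the stored records and reports cells that are empty in the download although the record has a value. The function name, the `id` key column and the assumption that download columns share names with record fields are all hypothetical; this is not an existing biocache feature.

```python
import csv
import io


def empty_mismatches(csv_text, records, key="id"):
    """Return (record id, column) pairs that are empty in the download
    but non-empty in the stored record.

    ASSUMPTION: `records` maps record id -> {column name: stored value},
    and the CSV uses the same column names plus a `key` column.
    """
    mismatches = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        record = records.get(row[key], {})
        for column, cell in row.items():
            stored = record.get(column)
            if stored and not (cell or "").strip():
                mismatches.append((row[key], column))
    return mismatches
```

Run against a small sample download per data resource, a non-empty result would flag cases like identifiedBy and locality before users have to report them.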

Additional fields to add if applicable:

  • num_identification_agreements, e.g. "2"
  • identification_verification_status, e.g. "research"

These are related to iNaturalist and the community identification of a sighting. Neither of these is currently exported in any download, making it impossible to determine the community's confidence in a record's ID in any downloaded set of iNat data.

Issue raised in helpdesk ticket 84773 as I couldn't advise the user to specifically use those fields in a download to gauge accuracy of records.

AtlasOfLivingAustralia/biocache-service#317 is still an issue even though it was closed at one point due to confusion about the nature of the bug.

The processed sampling protocol field is not consistently populated from the raw values, so downloads are missing values in the "samplingProtocol" column.

Not yet appearing in prod SOLR. Keeping in QA.

  • test on sandbox
  • test on nectar

Facets now have values.