IQSS / dataverse

Open source research data repository software

Home Page: http://dataverse.org

Datasets labeled with "Incomplete metadata" due to invalid geospatial bounding box quietly disappear in search results

kbrueckmann opened this issue

What steps does it take to reproduce the issue?
For us, only previously published datasets are affected. An example can be found here:
Dataset: https://heidata.uni-heidelberg.de/dataset.xhtml?persistentId=doi:10.11588/data/10000
Dataverse (the dataset is missing in the listed child datasets): https://heidata.uni-heidelberg.de/dataverse/iwrgraphics
The dataset is still in the database (and can be reached via the API), but it disappears from the search results and can only be found via its direct link or DOI.

When does this issue occur?
We encountered the problem with datasets that use the bounding box in the geospatial metadata section. This appears to be connected to the changes in #10142. When editing the metadata, the following message is shown: "Geographic Bounding Box has invalid coordinates. East must be greater than West and North must be greater than South. Missing values are NOT allowed." This is easily corrected, and afterwards the dataset reappears in the search results.
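For illustration, the rule stated in that message can be sketched as a small check. This is only a reading of the error text, not the validator Dataverse actually runs; the function name `bounding_box_problems` and its string inputs are hypothetical.

```python
from typing import Optional


def bounding_box_problems(west: Optional[str], east: Optional[str],
                          north: Optional[str], south: Optional[str]) -> list[str]:
    """Return a list of problems; an empty list means the box passes this sketch."""
    raw = {"west": west, "east": east, "north": north, "south": south}
    problems = []
    values = {}
    for name, text in raw.items():
        # "Missing values are NOT allowed": treat None and blank strings as missing.
        if text is None or not str(text).strip():
            problems.append(f"{name} is missing")
            continue
        try:
            values[name] = float(text)
        except ValueError:
            problems.append(f"{name} is not a number: {text!r}")
    # Only compare a pair when both of its coordinates could be parsed.
    if "west" in values and "east" in values and not values["east"] > values["west"]:
        problems.append("east must be greater than west")
    if "north" in values and "south" in values and not values["north"] > values["south"]:
        problems.append("north must be greater than south")
    return problems
```

Under this reading, a box whose east and west values are swapped, or one with any empty coordinate, would be reported as invalid.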

However, we do not know how to find the datasets affected in order to be able to correct them. Is there any feature for this?
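One possible workaround, sketched here under assumptions, is to fetch each dataset's metadata as JSON and scan it for suspicious bounding boxes. The code assumes the native JSON layout (`metadataBlocks` → `geospatial` → `fields`, with a compound `geographicBoundingBox` field); the sub-field names, including both spellings of the north/south keys, are assumptions to verify against your installation, and `has_invalid_bounding_box` is a hypothetical helper, not a Dataverse feature.

```python
# Assumed sub-field names of geographicBoundingBox; both spellings of the
# north/south keys are accepted because the names may differ between versions.
WEST, EAST = "westLongitude", "eastLongitude"
NORTH_KEYS = ("northLatitude", "northLongitude")
SOUTH_KEYS = ("southLatitude", "southLongitude")


def _value(box: dict, *keys: str):
    """Return the first sub-field value found under any of the keys, else None."""
    for key in keys:
        if key in box:
            return box[key].get("value")
    return None


def _as_float(text):
    """Parse a coordinate; None signals a missing or non-numeric value."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return None


def has_invalid_bounding_box(version: dict) -> bool:
    """True if any bounding box in a dataset version (native JSON) looks invalid."""
    block = version.get("metadataBlocks", {}).get("geospatial", {})
    for field in block.get("fields", []):
        if field.get("typeName") != "geographicBoundingBox":
            continue
        for box in field.get("value", []):
            west = _as_float(_value(box, WEST))
            east = _as_float(_value(box, EAST))
            north = _as_float(_value(box, *NORTH_KEYS))
            south = _as_float(_value(box, *SOUTH_KEYS))
            if None in (west, east, north, south):
                return True
            if not (east > west and north > south):
                return True
    return False
```

Running this over the version JSON of every published dataset would give a candidate list to correct by hand; datasets without a bounding box are reported as fine.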

To whom does it occur (all users, curators, superusers)?
All users

What did you expect to happen?
We expected the datasets to remain visible in the search results even if parts of the metadata are now invalid and, ideally, to receive a notice that they are invalid. We are looking for a way to find all affected datasets.

Which version of Dataverse are you using?
6.1

Any related open or closed issues to this bug report?
The question was previously asked by @lmaylein in issue #10116 and opened here at @pdurbin's request.
#10116 #10142

FWIW: I don't know of any way to find specifically which datasets have this geo box issue, but the index status API call (https://guides.dataverse.org/en/latest/admin/solr-search-index.html#index-and-database-consistency) would write to the log a list of any datasets that didn't get indexed. That might help if the datasets are not getting indexed at all, as opposed to just being marked as having incomplete metadata or not having the geo box indexed.
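A minimal sketch of making that status call from Python follows. The path `/api/admin/index/status` is taken from my reading of the linked guide section and should be checked there; the base URL is an assumption (admin endpoints are typically only reachable from the server itself), and the helper names are hypothetical.

```python
import json
import urllib.request


def index_status_url(base_url: str) -> str:
    """Build the URL of the index status call (path assumed from the linked guide)."""
    return base_url.rstrip("/") + "/api/admin/index/status"


def fetch_index_status(base_url: str = "http://localhost:8080") -> dict:
    """Request the index status and return the decoded JSON response."""
    # The consistency check can be slow on large installations, hence the long timeout.
    with urllib.request.urlopen(index_status_url(base_url), timeout=600) as response:
        return json.load(response)
```

Calling `fetch_index_status()` on the Dataverse host would return whatever the endpoint reports; the response layout is deliberately not interpreted here because it is not described in this thread.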

@qqmyers Strangely enough, the affected datasets that we know of were not in the index check list. However, we will now correct the datasets we know and then completely re-index everything.

It's really a bummer that it's not easy to find the affected datasets. If you do a complete reindex, are these problematic datasets logged? (It may be better to try this on a test server first.)

I've long dreamed of an API that will exercise our validation rules on a dataset (we use Bean Validation for this). New rules were added in #10142 which is why data that was treated as perfectly fine in 6.0 is now treated as invalid (for good reasons). Obviously, we need a way to know when old data no longer complies with new rules. 😅

No errors were logged during reindexing. I assume that after correcting the metadata, all datasets will now be displayed again.